Parsing a CSV file where the encapsulator inside the content isn't escaped properly


Hi, I have a CSV file where the encapsulator character is not escaped properly.

Example

[email protected],"uhrege gerjhhg er<span style="background-color: rgb(0,153,0);">eriueiru kernger</span><font color="#009900"><span style="background-color: rgb(255,255,255);"> weiufhuweifbw fhew fibwefbw</span></font><div><font color="#009900"><span style="background-color: rgb(255,255,255);">wekifbwe fewf</span></font></div><div><font color="#009900"><span style="background-color: rgb(255,255,255);">weiuifgewbfjew f</span></font></div>",18-Oct-2016,

Delimiter -> ,

Encapsulator -> "

It breaks when I try to read it with the commons-csv reader, which throws an 'invalid char between encapsulated token and delimiter' exception.

However, Microsoft Excel seems to open the file perfectly. Any ideas on how to proceed? How does one parse CSV files where the encapsulator is not escaped properly?


There are 2 best solutions below

0
On

If you can't fix this at the source (i.e. generate a well-formed CSV) and you want to parse it yourself, you could take the easy route:

Scan field 1 up to the sequence ," then field 2 up to the sequence ", and treat the rest as field 3 (which may end with a trailing comma).

Of course, if a ", occurs inside the HTML field, this breaks. You could solve that by first scanning forward up to the first ," and then scanning backwards from the end of the line to the last ",.

If there are more fields than you show here, you could look for a , next to a " (in either order, possibly also ",") and hope those combinations do not appear in the field data.
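A minimal sketch of the forward/backward scan described above, using only the JDK. The class name LenientCsvLine and the method split are made up for illustration. It assumes exactly one quoted field per line, that the first field never contains ," and that the fields after the quoted one are unquoted and never contain ", themselves. If your data violates any of that, this will split in the wrong place.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of commons-csv or any other library.
class LenientCsvLine {

    // Splits a line shaped like:  first,"middle with "unescaped" quotes",rest1,rest2
    static String[] split(String line) {
        int open = line.indexOf(",\"");       // first ," opens the quoted field
        int close = line.lastIndexOf("\",");  // last ", closes it (backward scan)

        // close must sit at least two characters after open, otherwise the
        // opening and closing quote would be the same character.
        if (open < 0 || close < open + 2) {
            throw new IllegalArgumentException("Unexpected line layout: " + line);
        }

        List<String> fields = new ArrayList<>();
        fields.add(line.substring(0, open));          // field before the quoted one
        fields.add(line.substring(open + 2, close));  // quoted field, inner quotes kept as-is

        // Assumption: everything after the quoted field is plain, unquoted data.
        // The -1 limit keeps a trailing empty field instead of dropping it.
        for (String f : line.substring(close + 2).split(",", -1)) {
            fields.add(f);
        }
        return fields.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String line = "a@example.com,\"x <b class=\"k\">y</b>, \"z\", w\",18-Oct-2016,";
        for (String f : split(line)) {
            System.out.println("[" + f + "]");
        }
    }
}
```

Because the closing quote is found from the end of the line, a ", inside the HTML does not cut the field short. The trailing comma in the sample line shows up as an empty last field.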

0
On

univocity-parsers has a CSV parser that can handle this sort of input properly.

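    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;
    import com.univocity.parsers.csv.UnescapedQuoteHandling;

    //first configure the parser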
    CsvParserSettings settings = new CsvParserSettings();
    settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_CLOSING_QUOTE);

    //then create a parser and parse your input line:
    CsvParser parser = new CsvParser(settings);
    String[] result = parser.parseLine("" +
            "[email protected],\"uhrege gerjhhg er<span style=\"background-color: rgb(0,153,0);\">eriueiru kernger</span><font color=\"#009900\"><span style=\"background-color: rgb(255,255,255);\"> weiufhuweifbw fhew fibwefbw</span></font><div><font color=\"#009900\"><span style=\"background-color: rgb(255,255,255);\">wekifbwe fewf</span></font></div><div><font color=\"#009900\"><span style=\"background-color: rgb(255,255,255);\">weiuifgewbfjew f</span></font></div>\",18-Oct-2016,");

    //here's the result (one value per line)
    for (String v : result) {
        System.out.println(v);
    }

This prints:

[email protected]
uhrege gerjhhg er<span style="background-color: rgb(0,153,0);">eriueiru kernger</span><font color="#009900"><span style="background-color: rgb(255,255,255);"> weiufhuweifbw fhew fibwefbw</span></font><div><font color="#009900"><span style="background-color: rgb(255,255,255);">wekifbwe fewf</span></font></div><div><font color="#009900"><span style="background-color: rgb(255,255,255);">weiuifgewbfjew f</span></font></div>
18-Oct-2016
null

Hope it helps.

Disclaimer: I'm the author of this library. It's open-source and free (Apache v2.0 license)