Why is lumenworks returning quotes around a field for a perfectly ordinary CSV file?


I have used lumenworks CSV reader several times before and have never seen this issue. I have a simple CSV file:

"1","001333","Test Company","","123 Test St","","Eland","NS","58601","USA","","","","","Company 1","","123 Destination St","","Schefield","ND","58601","USA","","","Standard No Options","Label 001","2","1","5","","","","","0","0","0","0","0","0","0","0","0","0","0","05/02/2023","001333","0"
"1","001333","Test Company","","123 Test St","","Eland","NS","58601","USA","","","","","Company 1","","123 Destination St","","Schefield","ND","58601","USA","","","Standard No Options","Label 001","2","2","125","","","","","0","0","0","0","0","0","0","0","0","0","0","05/02/2023","001333","0"

I am reading it by creating a new stream with the header definition specified at the top, and then copying the file stream in so that I am reading a CSV with headers.

This is an extremely simplified excerpt of the read code:

// Build a header line, prepend it to the file stream, then read the result as a CSV with headers.
using StreamReader filestream = new StreamReader(csvfilepath);
using var finalcsvstream = await PrependHeaderToStream(filestream);
using var csv = new CsvReader(finalcsvstream, true); // true: first record is the header row
while (csv.ReadNextRecord())
{
  var fileversionnumber = csv["FileVersionNumber"];
  var field2 = csv["field2"];
  // etc.
}
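For reference, PrependHeaderToStream is a helper of my own. The real implementation is longer, but roughly it does something like this (header abbreviated, names illustrative):

```csharp
using System.IO;
using System.Threading.Tasks;

static class CsvHelpers
{
    // Rough sketch: write a header line, then copy the original file
    // text in after it. (Header row abbreviated for this post.)
    public static async Task<TextReader> PrependHeaderToStream(TextReader source)
    {
        const string header = "FileVersionNumber,field2"; // ...plus the remaining columns
        string body = await source.ReadToEndAsync();
        return new StringReader(header + "\n" + body);
    }
}
```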

FileVersionNumber is the first column and the only one where I am having issues. In the file it is clearly the number 1, but when I read it in, I am getting a string with escaped double quotes:

\"1\"

After tooling around and searching Google for more than an hour, I have tried specifying delimiters and quote characters and played with the various trimming options, all to no avail. I reviewed the source of a fork of the library, and it doesn't look like this should happen. For the moment I need this to work, so I have put in a workaround that trims this specific column.

Any ideas what could be going wrong? Should I make a complete working toy example and see if the problem persists?

I should also mention that I tried the header row with and without double quotes to see if it made any difference.

EDIT: The source of my CSV file is a base64 encoded string. One would think that would be safe from any crazy character shenanigans. When I open the file in notepad, it looks fine, but if I write the base64 string to disk and use the following:

base64 -d < /tmp/b64.txt | hexdump -c

I see the following:

0000000 357 273 277   "   1   "   ,   "   0   0   1   3   3   3   "   ,
0000010   "   T   e   s   t       C   o   m   p   a   n   y   "   ,   "

Any suggestions on how to trim the source before opening it in lumenworks?

BEST ANSWER

The hexdump shows that your data starts with a UTF-8 byte-order mark (BOM): hexdump displays the constituent bytes in octal, and 357 273 277 is 0xEF 0xBB 0xBF, the UTF-8 BOM.

A byte-order mark belongs at the very beginning of the stream, not in front of the first row of data. Your line of headings can't go in front of the BOM; if it does, those three bytes no longer qualify as a byte-order mark and instead become part of the content, where they break the reader's check for whether a field starts with a quote.

You have two options for inserting your headings row:

  • Put it between the BOM and first line of data (if the consuming library even knows what to do with a BOM when correctly positioned as a BOM)
  • Replace the BOM outright (probably safest)

When doing so, you can rely on the fact that there are only a handful of possible BOMs: UTF-8, UTF-16 BE, and UTF-16 LE. The latter two require an explicit Encoding parameter to StreamReader anyway, so for the code shown you really only need to worry about the UTF-8 variant, and you can look for and trim that exact three-byte sequence.
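Since your CSV arrives as a base64 string, the simplest fix is to strip those three bytes from the decoded payload before prepending the header row. A minimal sketch (names are illustrative, not from your original code):

```csharp
using System;

static class BomUtil
{
    // UTF-8 BOM bytes: 0xEF 0xBB 0xBF (octal 357 273 277, matching the hexdump).
    static readonly byte[] Utf8Bom = { 0xEF, 0xBB, 0xBF };

    // Return the payload with a leading UTF-8 BOM removed, if one is present.
    public static byte[] StripUtf8Bom(byte[] data)
    {
        bool hasBom = data.Length >= 3
            && data[0] == Utf8Bom[0]
            && data[1] == Utf8Bom[1]
            && data[2] == Utf8Bom[2];
        if (!hasBom)
            return data;
        var trimmed = new byte[data.Length - 3];
        Array.Copy(data, 3, trimmed, 0, trimmed.Length);
        return trimmed;
    }
}
```

Decode with Convert.FromBase64String, pass the bytes through StripUtf8Bom, wrap the result in a MemoryStream and StreamReader, and then prepend your header line as before.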