BufferedOutputStream not working with Korean characters as expected


I'm trying to write Korean characters to a file, but when I open the resulting CSV the text appears as gibberish. How can I get the Korean text to display correctly, without the workaround of decoding the file back to UTF-8 afterwards?

    File localExport = File.createTempFile("char-test", ".csv");
    try (
            FileOutputStream fos = new FileOutputStream(localExport);
            BufferedOutputStream bos = new BufferedOutputStream(fos);
            OutputStreamWriter outputStreamWriter =
                    new OutputStreamWriter(bos, StandardCharsets.UTF_8)
    ) {
        ArrayList<String> rows = new ArrayList<>();
        rows.add("\"가짜 사용자\",사용자123,saint1_user123");
        rows.add("\"페이크유저루노도스트레스 성도1\",saint1_user1");
        for (String csvUserStr : rows) {
            outputStreamWriter.write(csvUserStr);
            outputStreamWriter.write(System.lineSeparator()); // separate the rows
        }
    }

Instead of the text I wrote, the file shows gibberish (mojibake) when I open it.

There are 3 best solutions below

rzwitserloot

There is absolutely nothing wrong with your Java code. You are writing those characters, including the Korean ones, precisely as intended.

Whatever tool you are using to look at this file? That's the broken one. Tell it that the file is UTF-8 encoded. If you can't, get a better tool, or figure out which encoding it does read and update your Java code to write that encoding instead.

Note that CSV files, text files, etc. do not store the encoding that was used to write the data. Every program that reads or writes the file simply has to know what encoding it is; there is no reliable way to find out other than being told.
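To see the failure mode concretely, here is a small stdlib-only sketch: it encodes a Korean string as UTF-8 and then decodes the same bytes as windows-1252, which is what a viewer configured for a Western charset effectively does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String korean = "가짜 사용자";

        // The bytes are perfectly valid UTF-8...
        byte[] utf8Bytes = korean.getBytes(StandardCharsets.UTF_8);

        // ...but decoding them with the wrong charset produces gibberish:
        String misread = new String(utf8Bytes, Charset.forName("windows-1252"));
        System.out.println(misread); // mojibake, not Korean

        // Decoding with the correct charset recovers the original text:
        String correct = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(correct.equals(korean)); // true
    }
}
```

The bytes on disk never change; only the reader's interpretation of them does.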


UPDATE: From a comment it looks like 'the tool that is reading this' is Excel.

Excel asks for the encoding of the file when you use the 'import CSV' dialog. Pick UTF-8 in the dropdown. The exact label depends on your Excel version and OS, but it is usually called 'File Origin'.

If you prefer that your client need not mess with the default, note that the default is usually something like MacRoman or Windows-1252, and in such an encoding it is in fact impossible to represent Korean characters. They simply aren't in that character set.
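You can verify this claim directly with `CharsetEncoder.canEncode`, which reports whether a charset can represent a given string at all:

```java
import java.nio.charset.Charset;

public class CanEncodeDemo {
    public static void main(String[] args) {
        // Hangul has no code points in single-byte Western charsets:
        System.out.println(
                Charset.forName("windows-1252").newEncoder().canEncode("가짜")); // false
        // UTF-8 can encode any Unicode text:
        System.out.println(
                Charset.forName("UTF-8").newEncoder().canEncode("가짜")); // true
    }
}
```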

If you want the fire-and-forget approach, generate the Excel file yourself, for example using Apache POI.

erickson

CSV files don't have any means to carry encoding information "in-band"—in the file itself. I'm guessing the default character encoding used for Excel CSV imports is the system default, so if that isn't Korean, they will have to specify the encoding when they import the CSV. If your client requires CSV, they have no choice but to accept that behavior.

However, if their requirement is to open your file in Excel (and not that the file has to be CSV format), you could write an Excel spreadsheet instead. The various Excel file formats do include character encoding information, so they would be able to open the file without manually specifying the encoding.

Library recommendations are off-topic, but libraries such as Apache POI make writing simple Excel sheets fairly easy. There are additional benefits as well, such as handling any necessary escaping for you, so that your file doesn't break when unanticipated values end up in the spreadsheet.
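For illustration, a minimal sketch of writing a native .xlsx file with Apache POI. This assumes the `poi-ooxml` dependency is on the classpath; the file name and sheet layout here are arbitrary examples.

```java
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class XlsxDemo {
    public static void main(String[] args) throws Exception {
        try (Workbook wb = new XSSFWorkbook();
             FileOutputStream out = new FileOutputStream("char-test.xlsx")) {
            Sheet sheet = wb.createSheet("users");
            // Each value goes into its own cell; POI handles encoding
            // and escaping internally, so Korean text just works:
            Row row = sheet.createRow(0);
            row.createCell(0).setCellValue("가짜 사용자");
            row.createCell(1).setCellValue("사용자123");
            row.createCell(2).setCellValue("saint1_user123");
            wb.write(out);
        }
    }
}
```

Because .xlsx is a zipped XML format with a declared encoding, Excel opens it with no import dialog at all.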

Joop Eggen

As mentioned, Excel fails to detect that the text is encoded in UTF-8. One solution is to write an invisible BOM (byte order mark) character as the very first character:

    outputStreamWriter.write("\uFEFF"); // BOM, before any other output
    for ... // then the existing write loop

The BOM is a normally superfluous (and arguably ugly) marker indicating which UTF encoding a file uses, but Excel relies on it to recognize a UTF-8 file.
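A quick way to confirm the BOM actually ends up in the output is to inspect the first three bytes: UTF-8 encodes U+FEFF as EF BB BF, which is the signature Excel looks for. A small stdlib-only check:

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class BomDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (OutputStreamWriter w =
                     new OutputStreamWriter(baos, StandardCharsets.UTF_8)) {
            w.write("\uFEFF");        // the BOM, first
            w.write("가짜 사용자");    // then the actual data
        }
        byte[] bytes = baos.toByteArray();
        // Mask with 0xFF because Java bytes are signed:
        System.out.printf("%02X %02X %02X%n",
                bytes[0] & 0xFF, bytes[1] & 0xFF, bytes[2] & 0xFF);
        // prints "EF BB BF"
    }
}
```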

By the way, take a look at the class java.nio.file.Files, which can reduce the code to a single line.
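For example, a sketch of the same output using `Files.write`, which takes care of the stream, buffering, line separators, and charset in one call (the file name mirrors the question's example):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class FilesDemo {
    public static void main(String[] args) throws Exception {
        Path out = Files.createTempFile("char-test", ".csv");
        List<String> rows = List.of(
                "\"가짜 사용자\",사용자123,saint1_user123",
                "\"페이크유저루노도스트레스 성도1\",saint1_user1");

        // One call: opens, buffers, encodes as UTF-8, appends line
        // separators, and closes the file.
        Files.write(out, rows, StandardCharsets.UTF_8);

        // Reading back with the same charset recovers the rows exactly:
        System.out.println(
                Files.readAllLines(out, StandardCharsets.UTF_8).equals(rows)); // true
    }
}
```

(To get the Excel-friendly BOM with this approach, prepend "\uFEFF" to the first row.)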