When I create a file with Java 8 using the Shift-JIS charset, some characters are replaced with '?'


I have a problem when I create a file using the Shift-JIS charset.

This is an example of the text that I want to write to a txt file:

繰戻_日経選挙システム保守2019年1月10日~;[2019年度更新]横浜第1DCコロケ―ション(2ラック)

With the Shift-JIS charset, the file contains two '?' in place of ~ and ―:

繰戻_日経選挙システム保守2019年1月10日?;[2019年度更新]横浜第1DCコロケ?ション(2ラック)

With the UTF-8 charset, the file content is correct:

繰戻_日経選挙システム保守2019年1月10日~;[2019年度更新]横浜第1DCコロケ―ション(2ラック)

This is my code:

package it.grupposervizi.easy.ef.etl.elaboration;

import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import org.apache.commons.io.FileUtils;

public class TestShiftJIS {

  private static final String TEXT = "繰戻_日経選挙システム保守2019年1月10日~;[2019年度更新]横浜第1DCコロケ―ション(2ラック)";
  private static final String DIRECTORY = "C:\\temp\\japan\\";
  private static final String SHIFT_JIS = "Shift-JIS";
  private static final String UTF_8 = StandardCharsets.UTF_8.name();
  private static final String EXTENSION = ".txt";

  public static void main(String[] args) {

    final List<String> charsets = Arrays.asList(SHIFT_JIS, UTF_8);
    charsets.forEach(c -> {
      final String fName = DIRECTORY + c + EXTENSION;
      File file = new File(fName);
      try {
        FileUtils.writeStringToFile(file, TEXT, Charset.forName(c));
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    });

    System.out.println("End Test");
  }
}

Do you have any idea why these two characters are not included in the Shift-JIS charset?


There are 3 best solutions below

0
On

As @Marcono1234 answered, the Shift-JIS mapping in Java does not support ~ (U+FF5E) and ― (U+2015). To encode these code points in a Shift-JIS variant, you have to use Charset.forName("windows-31j"); or Charset.forName("MS932"); rather than Charset.forName("Shift-JIS");.
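
A minimal sketch of that swap, assuming a JDK that ships the extended charsets (Shift-JIS and windows-31j are not among the charsets every Java platform is required to provide). The class name Windows31jDemo and the roundTrip helper are hypothetical, used only for illustration. It encodes the two problem characters with each charset and decodes them back, so a lossy charset shows up as a changed string:

```java
import java.nio.charset.Charset;

public class Windows31jDemo {

    // Encodes text with the named charset and decodes it back.
    // String.getBytes(Charset) silently substitutes unmappable characters,
    // so the result differs from the input when the charset is lossy.
    static String roundTrip(String text, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String problem = "\uff5e\u2015"; // FULLWIDTH TILDE, HORIZONTAL BAR
        for (String name : new String[] {"Shift-JIS", "windows-31j"}) {
            boolean lossless = problem.equals(roundTrip(problem, name));
            System.out.println(name + " lossless: " + lossless);
        }
    }
}
```

In the question's code, the equivalent change would be to set the SHIFT_JIS constant to "windows-31j".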

2
On

@JosefZ has basically already given the answer: Shift-JIS does not support ~ (U+FF5E) and ― (U+2015).

This can be verified using Charset.newEncoder().canEncode(char):

import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class ShiftJisTest {
    public static void main(String[] args) {
        // 繰戻_日経選挙システム保守2019年1月10日~;[2019年度更新]横浜第1DCコロケ―ション(2ラック)
        String s = "\u7e70\u623b\u005f\u65e5\u7d4c\u9078\u6319\u30b7\u30b9\u30c6\u30e0\u4fdd\u5b88\u0032\u0030\u0031\u0039\u5e74\u0031\u6708\u0031\u0030\u65e5\uff5e\u003b\u005b\u0032\u0030\u0031\u0039\u5e74\u5ea6\u66f4\u65b0\u005d\u6a2a\u6d5c\u7b2c\uff11\u0044\u0043\u30b3\u30ed\u30b1\u2015\u30b7\u30e7\u30f3\uff08\uff12\u30e9\u30c3\u30af\uff09";
        Charset charset = Charset.forName("Shift-JIS");
        // A single encoder can be reused for every canEncode(char) check
        CharsetEncoder encoder = charset.newEncoder();
        for (char c : s.toCharArray()) {
            // Print each character that the charset cannot represent
            if (!encoder.canEncode(c)) {
                System.out.printf("%s (U+%04X)%n", c, (int) c);
            }
        }

        try {
            // A fresh encoder reports unmappable characters by throwing
            charset.newEncoder().encode(CharBuffer.wrap(s));
        } catch (CharacterCodingException e) {
            // java.nio.charset.UnmappableCharacterException: Input length = 1
            e.printStackTrace();
        }
    }
}

You are seeing ? because Apache Commons IO's FileUtils.writeStringToFile(File, String, Charset) internally uses String.getBytes(Charset), whose documentation says:

[...] This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array.

And the CharsetEncoder documentation says:

[...] The replacement is initially set to the encoder's default replacement, which often (but not always) has the initial value { (byte)'?' }
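
To fail loudly instead of silently writing ?, one option is to encode through a CharsetEncoder yourself: a new encoder reports unmappable characters by default, so its encode(CharBuffer) method throws rather than substituting. The sketch below illustrates this; the class StrictWriter and its methods are hypothetical names, and it uses java.nio.file.Files instead of Commons IO to stay self-contained:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class StrictWriter {

    // Encodes text strictly: a new encoder's default error action is REPORT,
    // so encode(CharBuffer) throws instead of substituting '?'.
    static byte[] encodeStrict(String text, Charset charset) throws CharacterCodingException {
        ByteBuffer buf = charset.newEncoder().encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes);
        return bytes;
    }

    // Writes the file only if every character could be encoded.
    static void writeStrict(Path file, String text, Charset charset) throws IOException {
        Files.write(file, encodeStrict(text, charset));
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("strict", ".txt");
        try {
            writeStrict(out, "コロケ\u2015ション", Charset.forName("Shift-JIS"));
        } catch (CharacterCodingException e) {
            System.err.println("Refusing to write lossy output: " + e);
        } finally {
            Files.deleteIfExists(out);
        }
    }
}
```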

3
On

///EDIT:

You are trying to save a file in an encoding that differs from the default one. Try changing the character encoding. More about encoding: https://en.wikipedia.org/wiki/Character_encoding

///

Try using: Charset.forName("CP943C")