Is it safe to temporarily store UTF-8 strings as ISO-8859-1 in Java?


I have a UTF-8 encoded properties file called theProperties.properties:

property1=Some Chinese Characters: 会意字會意字
property2=More chinese Char - 假借
property3=<any other valid UTF-8 characters>

I use a resource bundle to pull in the localized strings:

ResourceBundle localizedStrings = ResourceBundle.getBundle(
    "theProperties", // base name, without the .properties extension
    locale
);

ResourceBundle assumes that all strings are in ISO-8859-1, but my resource files are encoded as UTF-8. I need to convert the strings to UTF-8.

Is it safe to wrap the resource bundle and pull strings out of it like this?

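public String getLocalizedString(String key) {
    // The bundle decoded the UTF-8 bytes as ISO-8859-1: one char per raw byte.
    String localizedString_ISO_8859_1 = localizedStrings.getString(key);
    // Recover the raw bytes and decode them as UTF-8. StandardCharsets
    // (java.nio.charset) avoids the checked UnsupportedEncodingException.
    String localizedString_UTF_8 = new String(
        localizedString_ISO_8859_1.getBytes(StandardCharsets.ISO_8859_1),
        StandardCharsets.UTF_8);
    return localizedString_UTF_8;
}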

Are there any times when this code is unsafe? It feels like it may be unsafe, but strings are immutable. Does that mean that the bytes underneath are also immutable?

There are other ways to do this, but this method is shorter, so if it is safe I would prefer to go with it.


This is the alternate way of solving this issue, but it is a bit longer, and from an ease-of-reading perspective I like the above better, since the alternate solution only changes a single line in the Control class.
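The linked alternative is not reproduced here. As a rough sketch of the general idea only (decode the file as UTF-8 when it is loaded, rather than repairing each string afterwards), something like the following could be called from a custom Control. The class name Utf8BundleLoader and its load method are hypothetical; PropertyResourceBundle does accept a Reader.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.PropertyResourceBundle;

// Hypothetical helper, not the code from the linked alternative.
class Utf8BundleLoader {
    // Builds a bundle from a stream of UTF-8 encoded .properties content.
    static PropertyResourceBundle load(InputStream in) throws IOException {
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            // The Reader does the decoding, so no per-string repair is needed.
            return new PropertyResourceBundle(reader);
        }
    }
}
```

With this approach the strings returned by getString are already correct, so the wrapper method above must not be applied on top of it.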


There are 3 best solutions below


That should work, though it is utterly ugly: it bends the normal use of the API and needs a large comment to explain it.

It works as follows:

  • Java reads every byte of a UTF-8 multi-byte sequence as a separate char.
  • Converting that string to ISO-8859-1 bytes turns every char back into the original byte.
  • Then interpreting those bytes as UTF-8 yields the correct text.
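The three steps can be tried in isolation. This is a minimal sketch, not the asker's code: RoundTripDemo and its method names are made up for illustration, and the misreading done by the bundle is simulated by decoding UTF-8 bytes as ISO-8859-1. The sample text is written with Unicode escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for the byte/char round trip.
class RoundTripDemo {
    // Simulates an ISO-8859-1 reader consuming UTF-8 bytes: one char per byte.
    static String misreadAsLatin1(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the misreading: chars back to bytes, bytes decoded as UTF-8.
    static String repair(String misread) {
        byte[] raw = misread.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4f1a\u610f\u5b57"; // three CJK characters
        String misread = misreadAsLatin1(original);
        System.out.println(misread.length());                  // chars after misreading
        System.out.println(repair(misread).equals(original));  // did the repair work?
    }
}
```

Each of the three characters takes three bytes in UTF-8, so the misread string has nine chars before the repair collapses it back to three.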

If you have a build infrastructure like Maven, there are plugins that convert the encoding when copying from the source to the build directory.

There are also .properties editors with WYSIWYG editing.

The cleanest option might be to write your own ListResourceBundle subclass or similar, and simply not (ab)use .properties files. See the JRE for example usage.
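A ListResourceBundle subclass could look roughly like this. The class name TheProperties is an assumption chosen to match the question's file name, and the keys and values are abbreviated from the question. The text is written with Unicode escapes here; with the compiler set to UTF-8 the characters could be typed directly.

```java
import java.util.ListResourceBundle;

// Hypothetical bundle class standing in for theProperties.properties.
class TheProperties extends ListResourceBundle {
    @Override
    protected Object[][] getContents() {
        // Each row is a key/value pair; values are ordinary Java strings,
        // so no file encoding is involved at run time.
        return new Object[][] {
            {"property1", "Some Chinese Characters: \u4f1a\u610f\u5b57"},
            {"property2", "More chinese Char - \u5047\u501f"},
        };
    }
}
```

For ResourceBundle.getBundle to find such a class by base name it would need to be public and have a public no-argument constructor; locale-specific variants follow the usual suffix naming.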


It should work the way you do it. Here is why:

When Java reads and interprets the bytes of the properties file as ISO-8859-1, it simply uses the unsigned byte values as char values. This works because, fortunately, the first 256 Unicode code points coincide with ISO-8859-1, and since Strings are internally handled as UTF-16, each of those values fits in a single char with no surrogate pairs or other complications. Hence, translating to and from bytes while pretending the text is ISO-8859-1 works without loss.
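The claim that every byte value maps to the char with the same numeric value can be checked exhaustively. This is a small sketch with a made-up class name, Latin1Identity:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical check that ISO-8859-1 is an identity mapping for all 256 byte values.
class Latin1Identity {
    static boolean roundTripsAllBytes() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String decoded = new String(all, StandardCharsets.ISO_8859_1);
        // Every char must carry the unsigned value of the byte it came from.
        for (int i = 0; i < 256; i++) {
            if (decoded.charAt(i) != i) {
                return false;
            }
        }
        // Encoding again must give back exactly the same bytes.
        return Arrays.equals(all, decoded.getBytes(StandardCharsets.ISO_8859_1));
    }
}
```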


This is fine, because ISO-8859-1 has a one-to-one mapping between bytes and its character set.

Any time you need a byte[] but are forced to use a String, you should use ISO-8859-1 as the mapping. It is also the fastest, since it is essentially the identity mapping.
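One caveat, shown as a hedged sketch with a hypothetical Latin1Carrier class: the mapping is lossless in the byte[] to String to byte[] direction, but a String that already contains chars above U+00FF cannot be encoded as ISO-8859-1, and getBytes substitutes '?' for them. That matters if the bundle hands back text that is already decoded correctly. As far as I know, Java 9 and later read property resource bundles as UTF-8 by default, and \u escapes in a properties file also produce such chars.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper that carries raw bytes inside a String.
class Latin1Carrier {
    // Packs arbitrary bytes into a String, one char per byte.
    static String pack(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    // Unpacks a String produced by pack(). Chars above U+00FF are not
    // representable in ISO-8859-1 and are replaced by '?' (0x3F).
    static byte[] unpack(String carrier) {
        return carrier.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xE5, (byte) 0x81, (byte) 0x87}; // UTF-8 bytes of U+5047
        // Bytes survive the trip through a String.
        System.out.println(Arrays.equals(data, unpack(pack(data))));
        // An already-decoded CJK char does not survive the reverse trip.
        System.out.println(unpack("\u5047")[0] == (byte) '?');
    }
}
```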