Java incorrectly reading accented characters from System.in

602 Views Asked by At

If you are facing the same problem, and your character set is covered by the ANSI test encoding (codepage 1252 or "ISO 8859-1"), you could use that encoding instead to temporarily circumvent the problem with UTF-8, however UTF-8 is the modern standard that encompasses every script for ultimate localisation.

I'm creating an application that has to read user input containing accented characters from the console. From what I've read online, modern consoles are capable of handling accented character outputs, and correctly encoding inputs, even though they show up as ? before sending the command.

PS C:\> echo ?
ü
Ps C:\>

Note: this behaviour is not reproducible in Command Prompt. Command Prompt, when run in Windows Terminal, seems to display accented characters correctly before sending as well.

However, when running the following test code:

package com.test.outputtest;

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.nio.file.*;

public class OutputTest {

    public static void main(String[] args) {
        // Set I/O to use UTF-8
        System.setOut(new PrintStream(new FileOutputStream(FileDescriptor.out), true, StandardCharsets.UTF_8));

        // Create the response listener
        Scanner input = new Scanner(System.in, StandardCharsets.UTF_8);

        System.out.println(Arrays.toString("èéëê".getBytes(StandardCharsets.UTF_8)));
        String temp = input.nextLine();
        System.out.println(Arrays.toString(temp.getBytes(StandardCharsets.UTF_8)));
    }

}

this is the output (after building the artifact "app.jar"):

PS C:\Users\[name]\Desktop\output_test> chcp 65001
Active code page: 65001
PS C:\Users\[name]\Desktop\output_test> java "-Dfile.encoding=UTF-8" -jar app.jar
[-61, -88, -61, -87, -61, -85, -61, -86]
èéëê
[0, 0, 0, 0]

The first array of bytes comes from the pre-written string, the second array is the bytes of the inputted string. The fact that echo outputs accents correctly leads me to believe that this is a compiler error, but I'm not sure how to fix it. I've tried replacing the Scanner with Console, that gave me the same error.

When running inside of IntelliJ, the ü is read completely normally when inputting it in the terminal. This is also a reason why I suspect a problem during compilation. When running with command prompt instead of PowerShell, the same error occurs.

Note: I'm using Windows Terminal running PowerShell and using IntelliJ Idea Community Edition 2021.3. I have not edited the .xml files besides the artifact building file path and some other project-specific file paths.

  • OS: Windows 10 build 19045.2728
  • Java version: 17.0.6 (Also in IntelliJ)
  • Default codepage: 850 (OEM)
  • Codepage used in which the error occured: 65001 (UTF-8)
1

There are 1 best solutions below

2
skomisa On BEST ANSWER

I can reproduce your problem, but I see nothing wrong with your code and I have no easy solution. Incredibly, it seems that even with the most recent versions of Java (18, 19, 20), reading UTF-8 characters from a Windows console remains problematic.

This is formally documented in JDK bug JDK-8295672 Provide a better alternative to reading System.in which is open and unresolved. It states (with my emphasis added):

Reading System.in is problematic as it is an input stream encoded in the host's encoding. With the JEP 400, there are cases where the default encoding (UTF-8) and host's native encoding differ. To read the bytes correctly, users would have to convert the bytes native-to-default, which seems to be an obstacle for basic usage. Providing a better means to access (w/o considering encoding stuff) would be appropriate.

So setting the default charset to UTF-8 does not resolve the issue because the "host's native encoding" is not UTF-8, and there is nothing you can do about that (at least with respect to cmd.exe and PowerShell on Windows).

Notes:

Standardize on UTF-8 throughout the standard Java APIs, except for console I/O.

  • Presumably console I/O was excluded in JEP400 because of the "host's encoding" issue mentioned above.
  • An obvious question arising is why does your code work when run within Intellij? I suspect that is because JetBrains uses JNA to read the input from their console, but that's just a guess.