Java incorrectly reading accented characters from System.in

Question

602 Views Asked by ShadeOfLight At 04 April 2023 at 10:11

If you are facing the same problem and your character set is covered by a single-byte ANSI text encoding (codepage 1252, or ISO 8859-1), you could use that encoding instead to temporarily work around the problem with UTF-8. UTF-8 remains the modern standard, though, because it covers every script and is the better choice for full localisation.
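A minimal sketch of that workaround, assuming the console has already been switched with `chcp 1252` and that the JDK provides the `windows-1252` charset. The class name `SingleByteInput` and the helper `readLine` are illustrative only and are not part of the original test code:

```java
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Scanner;

public class SingleByteInput {

    // Reads one line from the stream, decoding the bytes with the given charset.
    // Returns an empty string if the stream has no more lines.
    public static String readLine(InputStream in, Charset charset) {
        Scanner scanner = new Scanner(in, charset);
        return scanner.hasNextLine() ? scanner.nextLine() : "";
    }

    public static void main(String[] args) {
        // Assumption: "chcp 1252" was run in this console before starting Java,
        // so the bytes arriving on System.in are windows-1252 encoded.
        Charset consoleCharset = Charset.forName("windows-1252");
        String line = readLine(System.in, consoleCharset);
        System.out.println(line.length() + " characters read");
    }
}
```

The charset is passed as a parameter so the same helper can be tried with `StandardCharsets.ISO_8859_1` or, once console input behaves, `StandardCharsets.UTF_8`.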

I'm creating an application that has to read user input containing accented characters from the console. From what I've read online, modern consoles are capable of handling accented character outputs, and correctly encoding inputs, even though they show up as ? before sending the command.

PS C:\> echo ?
ü
PS C:\>

Note: this behaviour is not reproducible in Command Prompt. Command Prompt, when run in Windows Terminal, seems to display accented characters correctly before sending as well.

However, when running the following test code:

package com.test.outputtest;

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.nio.file.*;

public class OutputTest {

    public static void main(String[] args) {
        // Set I/O to use UTF-8
        System.setOut(new PrintStream(new FileOutputStream(FileDescriptor.out), true, StandardCharsets.UTF_8));

        // Create the response listener
        Scanner input = new Scanner(System.in, StandardCharsets.UTF_8);

        System.out.println(Arrays.toString("èéëê".getBytes(StandardCharsets.UTF_8)));
        String temp = input.nextLine();
        System.out.println(Arrays.toString(temp.getBytes(StandardCharsets.UTF_8)));
    }

}

this is the output (after building the artifact "app.jar"):

PS C:\Users\[name]\Desktop\output_test> chcp 65001
Active code page: 65001
PS C:\Users\[name]\Desktop\output_test> java "-Dfile.encoding=UTF-8" -jar app.jar
[-61, -88, -61, -87, -61, -85, -61, -86]
èéëê
[0, 0, 0, 0]

The first array of bytes comes from the pre-written string; the second array holds the bytes of the string that was typed in. The fact that echo outputs accents correctly leads me to believe that this is a compiler error, but I'm not sure how to fix it. I've also tried replacing the Scanner with Console, but that gave me the same result.
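One way to narrow this down is to look at the raw bytes arriving on System.in before any charset decoding takes place. The following is only a diagnostic sketch; the class name `RawInputDump` and the method `dumpLine` are hypothetical and not part of the test program above:

```java
import java.io.IOException;
import java.io.InputStream;

public class RawInputDump {

    // Reads bytes up to and including the first '\n' (or until end of stream)
    // and renders them as two-digit hex values, with no charset decoding.
    public static String dumpLine(InputStream in) throws IOException {
        StringBuilder hex = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b));
            if (b == '\n') {
                break;
            }
        }
        return hex.toString();
    }

    public static void main(String[] args) throws IOException {
        // Type a line containing accented characters, then press Enter.
        System.out.println(dumpLine(System.in));
    }
}
```

If the dump already shows 00 bytes where the accented characters should be, the data was lost before Java decoded anything. If it shows UTF-8 sequences such as C3 A8, the decoding step is the place to look.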

When running inside of IntelliJ, the ü is read completely normally when inputting it in the terminal. This is also a reason why I suspect a problem during compilation. When running with command prompt instead of PowerShell, the same error occurs.

Note: I'm using Windows Terminal running PowerShell, and IntelliJ IDEA Community Edition 2021.3. I have not edited the .xml files, apart from the artifact build path and some other project-specific file paths.

OS: Windows 10 build 19045.2728
Java version: 17.0.6 (Also in IntelliJ)
Default codepage: 850 (OEM)
Codepage in which the error occurred: 65001 (UTF-8)

**skomisa** · Accepted Answer · 2023-04-04T22:23:17.893000

I can reproduce your problem, but I see nothing wrong with your code and I have no easy solution. Incredibly, it seems that even with the most recent versions of Java (18, 19, 20), reading UTF-8 characters from a Windows console remains problematic.

This is formally documented in JDK bug JDK-8295672, "Provide a better alternative to reading System.in", which is open and unresolved. It states (with my emphasis added):

Reading System.in is problematic as it is an input stream encoded in the host's encoding. With the JEP 400, there are cases where the default encoding (UTF-8) and host's native encoding differ. To read the bytes correctly, users would have to convert the bytes native-to-default, which seems to be an obstacle for basic usage. Providing a better means to access (w/o considering encoding stuff) would be appropriate.

So setting the default charset to UTF-8 does not resolve the issue because the "host's native encoding" is not UTF-8, and there is nothing you can do about that (at least with respect to cmd.exe and PowerShell on Windows).
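To see the mismatch on a given machine, it can help to print the encodings the JVM reports. This is a rough sketch assuming Java 17 or later, since both the `native.encoding` system property and `Console.charset()` were added in that release; the class name `EncodingReport` and the method `collect` are made up for illustration:

```java
import java.io.Console;
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

public class EncodingReport {

    // Collects the encodings the JVM reports for itself, the host, and the console.
    public static Map<String, String> collect() {
        Map<String, String> report = new LinkedHashMap<>();
        // Charset used by default for APIs that take no explicit charset.
        report.put("default charset", Charset.defaultCharset().name());
        // Value of -Dfile.encoding, if one was passed on the command line.
        report.put("file.encoding", String.valueOf(System.getProperty("file.encoding")));
        // Encoding derived from the host environment (Java 17+).
        report.put("native.encoding", String.valueOf(System.getProperty("native.encoding")));
        // Charset of the attached console, if any (Java 17+).
        Console console = System.console();
        report.put("console charset",
                console == null ? "(no console attached)" : console.charset().name());
        return report;
    }

    public static void main(String[] args) {
        collect().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Running it once with and once without `-Dfile.encoding=UTF-8`, and before and after `chcp 65001`, should show which of these values the flag and the code page actually influence.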

Notes:

My understanding is that this is only an issue on Windows. Linux and Mac handle UTF-8 input without problems.
A possible workaround is to use JNA (Java Native Access) methods to read the console input instead of using a Scanner. See "How do I read the contents from an open Windows Console (Command Prompt) using Java Native Access" to help get you started. Also see the Javadoc for JNA's Wincon interface, especially ReadConsoleInput().
Although it won't resolve your problem, you might consider upgrading to a more recent version of Java (18, 19 or 20) because of the implementation of JEP 400: UTF-8 by Default in Java 18. This was one of the goals of JEP 400 (with my emphasis added):

Standardize on UTF-8 throughout the standard Java APIs, except for console I/O.

Presumably console I/O was excluded from JEP 400 because of the "host's encoding" issue mentioned above.
An obvious question is why your code works when run within IntelliJ. I suspect that is because JetBrains uses JNA to read the input from their console, but that's just a guess.
