Handling Turkish/Non-ascii Characters in subprocess.check_output in Python


I am encountering an issue with handling Turkish characters when using the subprocess.check_output method in Python. My objective is to retrieve Wi-Fi profile names on a Windows system, which may contain Turkish characters such as 'İ', 'ü', 'ş'. However, when using subprocess.check_output with encoding set to either 'utf-8', 'ISO-8859-9', or 'windows-1254', the Turkish characters are not correctly represented.

Here is a snippet of my code:

import subprocess

try:
    command_output = subprocess.check_output("netsh wlan show profile", shell=True, encoding='utf-8', errors='ignore')
    # Also tried with 'ISO-8859-9' and 'windows-1254'
except subprocess.CalledProcessError as e:
    print(f"Command error: {e}")
    command_output = ""

The issue arises with Wi-Fi names containing Turkish characters, for instance 'İnternetim' ("my internet") or 'AĞIM' ("my network"). These characters are either omitted or replaced with incorrect characters, such as '˜nternetim'.

I ran my code and got the following errors for Wi-Fi profiles with Turkish names:

Profile error nternetim: Command 'netsh wlan show profile "nternetim" key=clear' returned non-zero exit status 1.
Profil işlenirken hata AIM: Command 'netsh wlan show profile "AIM" key=clear' returned non-zero exit status 1.

As seen in the output, the Turkish letters are missing: 'İnternetim' became 'nternetim' and 'AĞIM' became 'AIM'. (The Turkish prefix 'Profil işlenirken hata' means 'error while processing profile'.)

I also tried with 'ISO-8859-9' and 'windows-1254'.


There is 1 answer below.


Because there is no guarantee as to which character encoding any SSID uses, you just have to guess. Take out the encoding keyword and you will receive bytes, which you can then successively try to decode according to whatever heuristics you can come up with.

import subprocess

try:
    # Without an explicit encoding, check_output returns raw bytes.
    command_output = subprocess.check_output("netsh wlan show profile", shell=True)
except subprocess.CalledProcessError as e:
    print(f"Command error: {e}")
    command_output = b""

# Try candidate encodings in order; keep the first one that decodes cleanly.
for encoding in ('utf-8', 'ISO-8859-9', 'windows-1254'):
    try:
        command_output = command_output.decode(encoding)
        break
    except UnicodeDecodeError:
        pass
else:
    # The loop ran out of candidates without hitting break.
    raise UnicodeError("Could not find a valid encoding for %r" % command_output)

UTF-8 is nicely robust in that it will reject most strings which are not valid UTF-8. The other legacy 8-bit encodings will often gladly accept pretty much any string without errors, but result in bogus data if you guess incorrectly.
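
As a small illustration of that difference, take the 'İnternetim' SSID from the question encoded as windows-1254 bytes (a constructed example, not actual netsh output):

raw = 'İnternetim'.encode('windows-1254')   # b'\xddnternetim'

raw.decode('iso-8859-9')   # 'İnternetim' -- correct guess
raw.decode('latin-1')      # 'Ýnternetim' -- wrong guess, but no error raised
raw.decode('utf-8')        # raises UnicodeDecodeError: 0xdd starts an invalid sequence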

You might want to add more encodings, and/or see if you can perform some sort of frequency analysis on the strings to establish the most likely encoding. (The chardet library provides some facilities for this.)
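
A minimal sketch of that chardet approach, assuming the library is installed (pip install chardet) and operating on the undecoded bytes from check_output; the detection is only a statistical guess and can still be wrong:

import subprocess
import chardet

# Undecoded bytes: no encoding= argument passed to check_output.
raw_bytes = subprocess.check_output("netsh wlan show profile", shell=True)

# chardet.detect returns a dict like {'encoding': <best guess or None>, 'confidence': <0.0-1.0>, ...}
guess = chardet.detect(raw_bytes)
if guess['encoding'] is not None:
    text = raw_bytes.decode(guess['encoding'], errors='replace')
else:
    text = raw_bytes.decode('utf-8', errors='replace')  # fall back if detection fails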