Windows-1252 encoding to UTF-8


I'm working with old software that collects acoustic data from a manufacturing process. The file it generated is in an encoding unknown to both me and every application I've used to open it. I've also tried using Python to convert the file from latin1 to UTF-8, with no luck.

Could anyone suggest an alternative way to convert this file to something sensible? Or, at the very least, help me confirm what encoding I'm dealing with? Much appreciated!

The output should just be numbers. Ideally separated into columns and rows but any advice is appreciated.

[Image: encoded text]

Answer by Rahul Sahoo

First, we need to determine the charset of the file. We can use either chardet (a Python library) or file -bi $file (the Linux file command). In my experience, chardet takes slightly longer than the file command. I'm also running my code in a Linux container, so I don't have to worry about whether the file command is available. For these reasons I call the file command through Python's subprocess module to get the charset.
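For reference, file -bi prints a MIME type followed by a charset parameter, for example "text/plain; charset=us-ascii". When it cannot name a text encoding it reports values such as charset=binary or charset=unknown-8bit, in which case a charset conversion is unlikely to help. Below is a minimal sketch of parsing such a line; parse_charset and NON_TEXT_CHARSETS are hypothetical names of mine, and the sample strings are hard-coded so the snippet runs without the file command.

```python
from __future__ import annotations

# Values `file` reports when it cannot name a text encoding (assumed list, extend as needed).
NON_TEXT_CHARSETS = {"binary", "unknown-8bit"}


def parse_charset(mime_line: str) -> str | None:
    """Return the charset from a `file -bi` style line.

    Returns None when no charset is present or when the reported
    value is not a usable text encoding.
    """
    for part in mime_line.strip().split(";"):
        key, _, value = part.strip().partition("=")
        if key == "charset" and value:
            return None if value in NON_TEXT_CHARSETS else value
    return None


# Hard-coded sample lines in the format `file -bi` produces.
print(parse_charset("text/plain; charset=iso-8859-1"))
print(parse_charset("application/octet-stream; charset=binary"))
```

If this returns None for your file, the data is probably not plain text in a known encoding, and you may need the vendor's format description instead.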

  1. Function to run Linux commands from Python. run_cmd runs the given command and returns a tuple of stdout and stderr as strings.
import subprocess

def run_cmd(cmd: str) -> tuple[str, str]:
    # shell=True hands cmd to the shell as one string, so quote any untrusted arguments
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        shell=True
    )
    # communicate() waits for the command to finish and returns both streams
    std_out, std_err = process.communicate()
    return std_out, std_err
  2. Execute the file command and parse the result from run_cmd. For example, a file encoded in us-ascii yields us-ascii as the charset.
import shlex

def get_charset(file: str) -> str | None:
    # Quote the path so spaces or shell metacharacters in it cannot break the command
    charset_cmd = f"file -bi {shlex.quote(file)}"
    std_out, std_err = run_cmd(charset_cmd)
    # Output looks like "text/plain; charset=us-ascii"
    for result in std_out.strip("\n").split():
        if result.startswith('charset='):
            return result.split('charset=')[1]
    return None
  3. Using the identified charset, you can convert the file to UTF-8 or another encoding.
import codecs
import logging
import os
import shutil

log = logging.getLogger(__name__)

def convert_charset(file: str, from_encoding: str, to_encoding: str, inplace=False):
    # Write next to the input, e.g. data.txt -> data.utf8.txt
    root, ext = os.path.splitext(file)
    output_file = f"{root}.utf8{ext}"
    try:
        with codecs.open(file, 'r', encoding=from_encoding) as f_in:
            with codecs.open(output_file, 'w', encoding=to_encoding) as f_out:
                f_out.write(f_in.read())
        if inplace:
            # Replace the original only after the conversion succeeded
            shutil.move(output_file, file)
    except Exception:
        log.error(f"Error occurred while converting charset of file={file} from={from_encoding} to={to_encoding}")
        raise
  4. Now use the functions above to convert your file.
charset = get_charset(file)  # file is the path to your input file
log.info(f"charset: {charset}")

convert_charset(file, charset, 'utf-8', inplace=True)

This converts the file to UTF-8, provided the detected charset is correct and the file really is text.
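To check the result, you can confirm that the converted file decodes cleanly as UTF-8. This is a minimal sketch rather than part of the steps above; is_valid_utf8 is a hypothetical helper name, and it reads the whole file into memory, which assumes the file is reasonably small.

```python
def is_valid_utf8(path: str) -> bool:
    """Return True if every byte of the file decodes as UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")  # strict decoding raises on any invalid sequence
    except UnicodeDecodeError:
        return False
    return True
```

Note that a True result only shows the bytes are valid UTF-8. It does not show that the source charset was guessed correctly, so it is still worth opening the file and looking at the numbers.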

Note: I have used log to print messages. You can use print instead.
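One caveat when confirming an encoding by trial, as in the question: latin-1 maps every possible byte to a character, so a latin-1 decode never fails and proves nothing about the file. Windows-1252 (cp1252 in Python) is only slightly stricter, since Python's codec rejects just five byte values. The sketch below lists which candidate encodings decode a byte string without error and flags NUL bytes, which usually indicate binary data (or UTF-16/UTF-32 text). The function names and the candidate list are my own assumptions, and this is a rough heuristic, not a detector.

```python
from __future__ import annotations


def strict_decodings(
    data: bytes,
    candidates: tuple[str, ...] = ("utf-8", "cp1252", "latin-1"),
) -> list[str]:
    """Return the candidate encodings under which data decodes without error."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


def has_nul_bytes(data: bytes) -> bool:
    """Heuristic: NUL bytes are rare in 8-bit text files and common in binary data."""
    return b"\x00" in data


# Plain ASCII numbers decode under every candidate.
print(strict_decodings(b"1.5\t2.5\n"))
# 0x81 is invalid in UTF-8 and undefined in cp1252, yet latin-1 still accepts it.
print(strict_decodings(b"\x81"))
```

If only latin-1 survives, or has_nul_bytes returns True on the first few kilobytes of the file, the file is likely a binary format rather than text, and no charset conversion will turn it into columns of numbers.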