Windows-1252 encoding to UTF-8

522 Views Asked by Willa At 08 May 2023 at 21:52

I'm working with old software that collects acoustic data from a manufacturing process. The file it generated is in an encoding unknown to both me and every application I've used to open it. I've also tried using Python to convert the file from latin1 to UTF-8, with no luck.

Could anyone suggest an alternative way to convert this data into something sensible? Or, at the very least, help me confirm what encoding I'm dealing with? Much appreciated!

The output should just be numbers, ideally separated into columns and rows, but any advice is appreciated.

[Image: encoded text]


There is 1 best solution below

Rahul Sahoo On 06 June 2023 at 07:33

First, we need to determine the charset of the file. We can use either chardet (a Python library) or file -bi $file (the Linux file command). In practice, I have observed that chardet takes slightly longer than the file command. Also, I'm running my code in a Linux container, so I don't have to worry about whether the file command is available. For these reasons, I call the file command from Python through the subprocess module.
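If neither chardet nor the file command is at hand, a rough check is possible with the standard library alone. The sketch below is not part of the original answer: the function name first_strict_decoding and the candidate list are assumptions chosen for illustration. It tries each candidate in strict mode and reports the first one that decodes without error. A successful decode does not prove the guess is right (latin-1, for instance, accepts every byte sequence), so treat the result as a hint only.

```python
from typing import Optional, Sequence


def first_strict_decoding(
    data: bytes, candidates: Sequence[str] = ("utf-8", "cp1252")
) -> Optional[str]:
    """Return the first candidate encoding that decodes data without errors."""
    for name in candidates:
        try:
            data.decode(name)  # strict mode raises on bytes the codec rejects
        except UnicodeDecodeError:
            continue
        return name
    return None


sample = "café".encode("cp1252")  # b'caf\xe9', which is not valid UTF-8
print(first_strict_decoding(sample))
```

Order matters here: UTF-8 is tried first because it rejects most non-UTF-8 input, while single-byte codecs accept almost anything.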

A helper to run Linux commands from Python: run_cmd runs the given command and returns a tuple of stdout and stderr as strings.

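import subprocess

def run_cmd(cmd: str) -> tuple[str, str]:
    # shell=True hands cmd to the shell as one string, so quote any untrusted parts.
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,  # decode stdout/stderr to str instead of bytes
        shell=True
    )
    # communicate() waits for the command to exit and collects both streams.
    std_out, std_err = process.communicate()
    return std_out, std_err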
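As an alternative to building a shell string, the same idea can be sketched with subprocess.run and an argument list, which avoids shell quoting issues altogether. The name run_argv is hypothetical and not from the answer above; the demo command simply calls the current Python interpreter so that it does not depend on any particular system tool.

```python
import subprocess
import sys


def run_argv(argv: list[str]) -> tuple[str, str]:
    """Run a command given as an argument list and return (stdout, stderr)."""
    completed = subprocess.run(
        argv,
        capture_output=True,  # collect stdout and stderr instead of inheriting them
        text=True,            # decode both streams to str
        check=False,          # do not raise on a non-zero exit status
    )
    return completed.stdout, completed.stderr


# Demo: ask the running interpreter to print a short message.
out, err = run_argv([sys.executable, "-c", "print('ok')"])
print(out.strip())
```

With this variant the charset lookup would pass ["file", "-bi", path] as the list, so paths with spaces need no escaping.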

Run the file command and parse the result from run_cmd. For example, a file encoded in us-ascii yields us-ascii as the charset.

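import shlex

def get_charset(file: str) -> str | None:
    # Quote the path so spaces or shell metacharacters cannot break the command.
    charset_cmd = f"file -bi {shlex.quote(file)}"
    std_out, std_err = run_cmd(charset_cmd)
    # Output looks like "text/plain; charset=us-ascii".
    for result in std_out.strip("\n").split():
        if result.startswith('charset='):
            return result.split('charset=')[1]
    return None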
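To make the parsing step easier to follow, here is a small sketch that works on the output string alone, without calling the file command. The helper name parse_charset is an assumption for illustration; the sample lines show the "mime/type; charset=..." shape that file -bi prints.

```python
from typing import Optional


def parse_charset(mime_line: str) -> Optional[str]:
    """Extract the charset value from a 'mime/type; charset=...' line."""
    for field in mime_line.strip().split(";"):
        key, sep, value = field.strip().partition("=")
        if sep and key == "charset":
            return value
    return None


print(parse_charset("text/plain; charset=us-ascii"))
print(parse_charset("application/octet-stream; charset=binary"))
```

A charset of binary means file did not recognise the content as text at all, in which case a charset conversion is unlikely to produce readable numbers.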

Using the identified charset you can convert the file to utf-8 or some other encoding.

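import codecs
import logging
import os
import shutil

log = logging.getLogger(__name__)

def convert_charset(file: str, from_encoding: str, to_encoding: str, inplace=False):
    # The original helpers get_file_name/get_file_extension were not shown;
    # os.path.splitext is assumed to be equivalent: "data.txt" -> "data.utf8.txt".
    root, ext = os.path.splitext(file)
    output_file = f"{root}.utf8{ext}"
    try:
        with codecs.open(file, 'r', encoding=from_encoding) as f_in:
            with codecs.open(output_file, 'w', encoding=to_encoding) as f_out:
                f_out.write(f_in.read())
        if inplace:
            shutil.move(output_file, file)
    except Exception:
        log.error(f"Error occurred while converting charset of file={file} from={from_encoding} to={to_encoding}")
        raise  # re-raise with the original traceback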
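The conversion itself is just a decode followed by an encode. The sketch below shows that round trip in memory on a few hand-picked Windows-1252 bytes (the sample bytes are made up for illustration and are not from the asker's file).

```python
# Bytes as a Windows-1252 program would write them:
# 0x93/0x94 are curly quotes, 0xE9 is e-acute, 0x80 is the euro sign.
raw = b"\x93caf\xe9\x94 costs 5\x80"

text = raw.decode("cp1252")  # bytes -> str, interpreting them as Windows-1252
utf8 = text.encode("utf-8")  # str -> bytes, now encoded as UTF-8

print(text)
print(utf8)
```

The UTF-8 result is longer than the input because each of those four non-ASCII characters takes two or three bytes in UTF-8.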

Now use the functions above to convert your file.

charset = get_charset(file)
log.info(f"charset: {charset}")

convert_charset(file, charset, 'utf-8', inplace=True)

This will convert the file into a utf-8 encoded file.

Note: I have used log to print messages. You can use print instead.
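One caveat: the file command can report values that Python cannot use as a codec name, such as binary (not text at all) or unknown-8bit (some 8-bit text it could not identify). The sketch below is an assumption layered on top of the answer, not part of it: usable_encoding is a hypothetical helper, and falling back to cp1252 for unknown-8bit is a guess based on the question title rather than something the tool tells you.

```python
import codecs
from typing import Optional


def usable_encoding(reported: Optional[str], fallback: str = "cp1252") -> Optional[str]:
    """Map a charset reported by `file` to a Python codec name, or None."""
    if reported is None or reported == "binary":
        return None  # no charset found, or the content is not text
    if reported == "unknown-8bit":
        return fallback  # assumption: legacy Windows text; verify the result by eye
    try:
        return codecs.lookup(reported).name  # normalises aliases such as us-ascii
    except LookupError:
        return None  # Python has no codec under that name


print(usable_encoding("us-ascii"))
print(usable_encoding("unknown-8bit"))
print(usable_encoding("binary"))
```

Checking the result for None before converting avoids overwriting the original file with output decoded under the wrong encoding.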
