Decoding cp866-encoded filenames with Cyrillic letters after tar unarchiving


I have several files obtained by unpacking a tar archive with GNU tar on macOS. Because the original names contain Cyrillic letters, the files now have names like %8A%AE%AD%E1⠭⨭ - %84%87 %FCML1.ipynb. It seems that %8A and so on are cp866 byte values, but there are also some Unicode characters present (like ⠭⨭) that appear to be byte sequences that accidentally form valid UTF-8. I want to decode everything to Unicode/UTF-8 so that I can rename my files. How can I do it?

Best answer

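This Python function can help. It turns each %XX escape back into a raw byte, re-encodes every other character as UTF-8 to recover the bytes it came from, and then decodes the whole byte string as cp866: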

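import re

def decode_escaped_cp866(s):
    out = []
    # Match either a %XX escape (group 1) or any other single character (group 2).
    for token in re.finditer(r"%([0-9A-Fa-f]{2})|(.)", s):
        if token.group(1) is not None:
            # %XX escape: restore the raw byte.
            out.append(bytes([int(token.group(1), 16)]))
        else:
            # Ordinary character: its UTF-8 encoding gives back the original bytes.
            out.append(token.group(2).encode('utf-8'))
    # The recovered byte string is the original cp866-encoded name.
    return b"".join(out).decode('cp866')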

print(decode_escaped_cp866("%8A%AE%AD%E1⠭⨭ - %84%87 %FCML1.ipynb"))
# Константин - ДЗ №ML1.ipynb
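
Since the goal is to rename the files, here is a minimal sketch of how the function could be applied to a whole directory. The helper names plan_renames and apply_renames are hypothetical (they are not part of the answer above), and the decoder is repeated with minor tweaks so the block runs on its own. It assumes every non-ASCII name in the directory is a mangled cp866 name: a name that is already correct UTF-8 Cyrillic, or one that contains a literal %XX sequence, would be altered too, so it is safer to point it at a directory holding only the affected files.

```python
import re
from pathlib import Path


def decode_escaped_cp866(s):
    """Undo the %XX / stray-UTF-8 mangling and decode the bytes as cp866."""
    out = []
    for token in re.finditer(r"%([0-9A-Fa-f]{2})|(.)", s, re.DOTALL):
        if token.group(1) is not None:
            out.append(bytes([int(token.group(1), 16)]))
        else:
            out.append(token.group(2).encode("utf-8"))
    return b"".join(out).decode("cp866")


def plan_renames(directory):
    """Return (old_path, new_path) pairs for entries whose decoded name differs."""
    plan = []
    for path in sorted(Path(directory).iterdir()):
        try:
            new_name = decode_escaped_cp866(path.name)
        except UnicodeError:
            # Name cannot be re-encoded as UTF-8; leave it alone.
            continue
        if "/" in new_name or "\x00" in new_name:
            # Decoding produced a character that is not valid in a file name.
            continue
        if new_name != path.name:
            plan.append((path, path.with_name(new_name)))
    return plan


def apply_renames(plan, dry_run=True):
    """Print each planned rename; perform it only when dry_run is False."""
    for old, new in plan:
        if new.exists():
            print(f"skip (target exists): {old.name!r}")
            continue
        print(f"{old.name!r} -> {new.name!r}")
        if not dry_run:
            old.rename(new)
```

Calling apply_renames(plan_renames("some_dir")) only prints what would happen; passing dry_run=False performs the renames. Reviewing the dry-run output first is a cheap way to catch names that should not have been touched.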