Force Git to read ANSI files containing NUL

There are many, many places describing how to "force" Git to read a file as text. Generally, the solution involves adding a rule to .gitattributes that applies the text attribute to the file(s). Examples include:

* text
* text=auto
* text diff merge
* text=auto diff merge

But this solution does not seem to work if the file contains NUL. Here is an example text file with ANSI encoding and trailing NUL bytes:

enter image description here

It's completely readable as a text file, just not by Git. Every example rule above fails, and Git identifies the file as "binary" regardless. I think this is due to its hard-coded check for NUL in the first 8000 bytes (ref).

Of course, as soon as I convert the file to UTF-8 Git happily identifies it as text. Here is that same file after conversion:

enter image description here

Frankly I don't mind not using ANSI encoding. I'm just trying to avoid constantly opening files in Notepad++ just to fix the file encoding. Is there a way to make Git handle the encoding conversion automatically?

There are 1 best solutions below

You have a couple of problems here. The first is that these are definitely not text files, since they contain a NUL byte. No major single-byte encoding permits NUL bytes to represent anything other than a NUL because C terminates its strings with that byte, and using it for another purpose would mean that text in that encoding would not fit into a normal C string. POSIX specifically excludes files containing NUL bytes from being text files for this reason.

The tool you're using to convert your “ANSI” files to UTF-8 is actually stripping out the NUL bytes, which is why they then work. The NUL byte in UTF-8 means exactly the same thing as it does in your single-byte encoding: a NUL. So this works because your tool is stripping them out instead of properly converting them.
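
One way to check whether a converter dropped the NUL bytes is to count them before and after the conversion. The following is a minimal sketch, assuming a POSIX shell with tr and wc available; count_nul is a hypothetical helper name, not a Git or coreutils command.

```shell
# count_nul FILE: print the number of NUL bytes in FILE.
# (count_nul is a hypothetical helper name used only for illustration.)
count_nul() {
    # tr -cd '\0' deletes every byte except NUL; wc -c counts what is left.
    # The arithmetic expansion strips the padding some wc versions add.
    echo $(( $(tr -cd '\0' < "$1" | wc -c) ))
}
```

If the count is non-zero for the original file and zero for the converted one, the tool removed the NUL bytes rather than converting them.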

It also isn't clear what you're asking Git to do in this case. The text attribute asks Git to perform end-of-line normalization. However, if your file contains NUL bytes, then Git is still going to think it's a binary file for the purposes of diffs and merges, because the text attribute doesn't control that. You need the diff and merge attributes as well.
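
As a rough illustration of combining those attributes, a rule like the one below could be appended to .gitattributes from the top of the repository. The '*.txt' pattern is a placeholder chosen for this sketch, not something taken from the question.

```shell
# Append a rule that marks matching files as text for end-of-line
# normalization and asks Git to diff and merge them as text.
# '*.txt' is a hypothetical pattern; replace it with one matching your files.
printf '%s\n' '*.txt text diff merge' >> .gitattributes
```

Running `git check-attr text diff merge -- somefile.txt` inside the repository should then report each of the three attributes as set for a matching path.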

Of course, if you don't really want or need the NUL bytes and these are supposed to be human-readable, then you really are better off just stripping out the NUL bytes and converting to UTF-8. In 2020, there's no longer any good reason to use a single-byte encoding. If that's what you want to do, then you can strip the NUL bytes and convert to UTF-8 by doing the following (assuming you're using Git Bash, WSL, or a Linux system):

$ tr -d '\0' < FILENAME | iconv -f WINDOWS-1252 -t UTF-8 > FILENAME.tmp && \
  mv FILENAME.tmp FILENAME

That also assumes that the “ANSI” encoding you're using is actually Windows-1252. IANA (which maintains the registry of character sets) doesn't know of any encoding called “ANSI”, but Windows-1252 is the character set most commonly referred to that way.
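
If several files need the same treatment, the one-off command can be wrapped in a small shell function. This is only a sketch under the same assumptions (the input really is Windows-1252, and tr and iconv are installed); strip_nul_to_utf8 is a hypothetical name.

```shell
# strip_nul_to_utf8 FILE...: remove NUL bytes from each FILE and re-encode
# it from Windows-1252 to UTF-8, replacing the file in place.
# (strip_nul_to_utf8 is a hypothetical helper; the source encoding is an
# assumption and may need changing for your files.)
strip_nul_to_utf8() {
    for f in "$@"; do
        # Write to a temporary file first so a failed conversion
        # does not clobber the original.
        tr -d '\0' < "$f" | iconv -f WINDOWS-1252 -t UTF-8 > "$f.tmp" &&
            mv "$f.tmp" "$f" || { rm -f "$f.tmp"; return 1; }
    done
}
```

It could be called as `strip_nul_to_utf8 notes.txt log.txt`, where the file names are placeholders. The function stops at the first file that fails to convert and leaves that file untouched.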

Finally, you can specify a working tree encoding with the working-tree-encoding attribute in gitattributes if you absolutely must handle non-UTF-8 files. That isn't going to fix your NUL problem, though, and UTF-8 is a better choice in almost all situations.
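
For reference, a working-tree-encoding rule might look like the sketch below. The '*.txt' pattern is a placeholder, and WINDOWS-1252 is assumed to be the encoding in use, as in the conversion command above.

```shell
# Append a rule telling Git that matching files are Windows-1252 in the
# working tree; Git re-encodes them to UTF-8 when storing them.
# '*.txt' is a hypothetical pattern; the encoding name is an assumption.
printf '%s\n' '*.txt text working-tree-encoding=WINDOWS-1252' >> .gitattributes
```

With such a rule, Git keeps the content as UTF-8 in the repository and converts it back to the named encoding on checkout. Any NUL bytes would still be present after the conversion, so this does not change how those files are classified.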