Git - print files mixed in different encoding


I am working on a project whose files use different encodings. (My OS is CentOS 7.)

For example, $SRC/a.cpp may be encoded in UTF-8, while $SRC/b.cpp is encoded in GB 2312 (Simplified Chinese).

Now if I run git diff, the content is not displayed properly because of the mixed encodings.

I've tried iconv like this:

git diff HEAD~1 | iconv -f gb2312 -t utf8 | less

It works well if all the files involved are encoded in GB 2312. But if any UTF-8 file is mixed in, iconv breaks like this:

some well displayed UTF-8 text
...
iconv: illegal input sequence at position 120

My question is whether it is possible to make commands like git diff work properly without changing the files themselves. I am hoping for a script that passes only the non-UTF-8 files through iconv, or some Git configuration that runs iconv for non-UTF-8 files only.

Edit: The client of this project requires some files to have specific encodings and wants as few changes as possible for stability, so modifying the files' encoding directly is not possible. A workaround that does not modify the project is preferred.

Best answer, by VonC

You might need a custom diff driver, defined in your Git configuration.

That driver script would first identify the encoding of each file and then convert it to UTF-8 if necessary before showing the diff.

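Create a shell script (for instance git-diff-encoding.sh) and make it executable with chmod +x git-diff-encoding.sh. Git calls an external diff driver with seven arguments (path old-file old-hex old-mode new-file new-hex new-mode), so the script takes the two files to compare from $2 and $5, converts each to UTF-8 if necessary, and then shows the diff:

#!/bin/bash

# Git passes: path old-file old-hex old-mode new-file new-hex new-mode
OLD="$2"
NEW="$5"

# Copy $1 to $2 as UTF-8, converting only when `file` reports another charset
to_utf8() {
    local enc
    enc=$(file -bi "$1" | awk -F charset= '{print $2}')
    case "$enc" in
        utf-8|us-ascii|binary) cp "$1" "$2" ;;
        # If iconv rejects the detected charset, fall back to the raw bytes
        *) iconv -f "$enc" -t utf-8 "$1" > "$2" 2>/dev/null || cp "$1" "$2" ;;
    esac
}

# Work on temporary copies so the files themselves are never modified
TMP1=$(mktemp) && TMP2=$(mktemp) || exit 1
trap 'rm -f "$TMP1" "$TMP2"' EXIT
to_utf8 "$OLD" "$TMP1"
to_utf8 "$NEW" "$TMP2"

# Show a unified diff; exit 0 because Git treats a non-zero status as a driver failure
diff -u --label "a/$1" --label "b/$1" "$TMP1" "$TMP2"
exit 0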
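
Detection with the file command is heuristic: for GB 2312 text it may report a different charset (iso-8859-1, for example), in which case the conversion produces garbage. A minimal sketch of a stricter check, assuming that every file which is not valid UTF-8 is GB 2312 (the function name print_as_utf8 and the FALLBACK_ENC variable are illustrative, not part of Git or iconv):

```shell
#!/bin/bash
# print_as_utf8 FILE: write FILE to stdout as UTF-8.
# Assumption: anything that is not valid UTF-8 is GB 2312; set the
# (hypothetical) FALLBACK_ENC variable to use another source encoding.
print_as_utf8() {
    # Converting UTF-8 to UTF-8 succeeds only if the input is valid UTF-8
    # (plain ASCII also passes, since ASCII is a subset of UTF-8).
    if iconv -f utf-8 -t utf-8 "$1" >/dev/null 2>&1; then
        cat "$1"
    else
        iconv -f "${FALLBACK_ENC:-gb2312}" -t utf-8 "$1"
    fi
}
```

Inside the driver script, this check could stand in for the file-based detection. It is still a guess: a short GB 2312 byte sequence can occasionally be valid UTF-8 by coincidence.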

In your .git/config file, add the following lines to define a new diff driver called "encoding":

[diff "encoding"]
    command = /path/to/your/git-diff-encoding.sh
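
The same entry can also be written with git config instead of editing .git/config by hand. A small sketch, where register_encoding_driver is a hypothetical helper name and the script path is the same placeholder as above:

```shell
#!/bin/bash
# register_encoding_driver REPO SCRIPT
# Hypothetical helper: stores diff.encoding.command in REPO's .git/config.
register_encoding_driver() {
    (cd "$1" && git config diff.encoding.command "$2")
}

# Example (placeholder path, adjust to where the script really lives):
#   register_encoding_driver . /path/to/your/git-diff-encoding.sh
#   git config --get diff.encoding.command
```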

Tell Git which files should be handled by this new diff driver. You can do this in your repository's .gitattributes file (create it, if it does not exist, at the root folder of your Git repository). Add lines specifying the files to be handled by your new diff driver, for example:

*.cpp diff=encoding

Now, Git will use your custom diff script when running git diff for files matching the patterns specified in the .gitattributes file. Commands such as git log -p and git show use it only when you pass --ext-diff.
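
To confirm that the attribute is picked up, git check-attr reports which diff driver Git assigns to a path. A small sketch, with uses_encoding_driver as a hypothetical helper name; run it from inside the repository:

```shell
#!/bin/bash
# uses_encoding_driver PATH
# Succeeds if .gitattributes assigns the "encoding" diff driver to PATH.
# The path does not need to exist; attributes are matched by name.
uses_encoding_driver() {
    [ "$(git check-attr diff -- "$1")" = "$1: diff: encoding" ]
}

# Example:
#   uses_encoding_driver src/b.cpp && echo "custom driver applies"
```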