error: unmappable character for encoding UTF8 after GIT merge


After yet another git pull, my project stopped building, with a bunch of messages like:

error: unmappable character for encoding UTF-8

The messages point to the copyright symbol found in some of the file headers. There are many more files with the same symbol, but they seem to compile fine. Viewed in a binary editor, the good one appears as:

C2 A9

while the bad one appears as:

A9

When viewing in vim, both are shown as © (<©> 169, Hex 00a9, Octal 251), but IntelliJ IDEA shows the bad ones as a diamond.
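To find every affected file rather than inspecting them one by one in a binary editor, a strict UTF-8 decode can serve as the test. The following is only a sketch: the class name Utf8Check and the choice to scan *.java files under a directory passed as the first argument are assumptions made for illustration, not part of the original setup.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper: reports .java files whose bytes are not well-formed UTF-8.
final class Utf8Check {

    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // A lone 0xA9 ends up here: it is a continuation byte with no lead byte.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the source root is given as the first argument, else ".".
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile)
                 .filter(p -> p.toString().endsWith(".java"))
                 .forEach(p -> {
                     try {
                         if (!isValidUtf8(Files.readAllBytes(p))) {
                             System.out.println("not valid UTF-8: " + p);
                         }
                     } catch (IOException e) {
                         System.err.println("cannot read " + p + ": " + e.getMessage());
                     }
                 });
        }
    }
}
```

Files that contain C2 A9 pass the check, while files with a bare A9 are listed.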

So I decided that I had messed something up when merging (there were merge conflicts after the pull) and went to look at which files were changed with

git diff-tree --no-commit-id --name-only -r --full-index --binary 91cbe7b753d39905372c1ea41e04e7a3dbd2566e

but it produces no results. No changes are found for the previous commit either. The log looks like this:

commit 91cbe7b753d39905372c1ea41e04e7a3dbd2566e
Merge: d7b4ae9 0dfc198
Author: Me Me <[email protected]>
Date:   Wed Dec 23 17:50:46 2015 +0100

    Merge branch 'development' of ssh://fsstash.cool.com:7999/our/server into my-branch

commit 0dfc19850b2e31d72c1d2923321430e8fc1b53cb
Merge: 724b8a7 d3478f9
Author: Good Guy <[email protected]>
Date:   Wed Dec 23 14:34:33 2015 +0200

    Merge branch 'development' of ssh://fsstash.cool.com:7999/our/server into development

When I do git checkout 0dfc19850b2e31d72c1d2923321430e8fc1b53cb, everything compiles fine.

So the question is: how can I fix it?

By fix I mean understanding what happened and (maybe) reapplying the pull changes, so that I don't have to commit anything related to this fix to the upstream repo.

It seems like the bad one is UTF-16 (0x00A9) while the good one is UTF-8 (0xC2 0xA9). What might have changed it?
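One way to check that guess is to print the bytes each candidate charset produces for ©. The sketch below (the class name CopyrightBytes is made up for illustration) suggests that a single A9 byte matches ISO-8859-1 (Latin-1) rather than UTF-16, which would need two bytes. The "Hex 00a9" that vim reports is the code point U+00A9, not the bytes stored on disk.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: how the copyright sign is stored under different charsets.
final class CopyrightBytes {

    /** Formats bytes as space-separated upper-case hex, e.g. "C2 A9". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes U+00A9 with the given charset and returns the hex dump. */
    static String encode(Charset cs) {
        // The \u escape keeps this source file pure ASCII, so the demo
        // itself does not depend on the compiler's -encoding setting.
        return hex("\u00A9".getBytes(cs));
    }

    public static void main(String[] args) {
        System.out.println("UTF-8:      " + encode(StandardCharsets.UTF_8));      // C2 A9
        System.out.println("ISO-8859-1: " + encode(StandardCharsets.ISO_8859_1)); // A9
        System.out.println("UTF-16BE:   " + encode(StandardCharsets.UTF_16BE));   // 00 A9
        System.out.println("UTF-16LE:   " + encode(StandardCharsets.UTF_16LE));   // A9 00
    }
}
```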

The build system is Maven, but that's not related, as the same error is reported by bare javac on a copied and minified file. The OS is Ubuntu 15.10, and locale says this:

locale
LANG=ru_RU.UTF-8
LANGUAGE=ru:en
LC_CTYPE="ru_RU.UTF-8"
LC_NUMERIC=ru_UA.UTF-8
LC_TIME=ru_UA.UTF-8
LC_COLLATE="ru_RU.UTF-8"
LC_MONETARY=ru_UA.UTF-8
LC_MESSAGES="ru_RU.UTF-8"
LC_PAPER=ru_UA.UTF-8
LC_NAME=ru_UA.UTF-8
LC_ADDRESS=ru_UA.UTF-8
LC_TELEPHONE=ru_UA.UTF-8
LC_MEASUREMENT=ru_UA.UTF-8
LC_IDENTIFICATION=ru_UA.UTF-8
LC_ALL=

java -version: 1.8.0_66.

Any help is highly appreciated!

PS: I tried all --diff-algorithm={patience|minimal|histogram|myers} values; still no changes found by git diff-tree.

PS: git reset --hard HEAD~1 followed by git pull origin development, issued from the command line, didn't help, so it's not related to IDEA.

There are 2 best solutions below


git diff --name-only is indeed more suited for parsing, as shown with Git 2.32 (Q2 2021), which clarifies that pathnames recorded in Git trees are most often (but not necessarily) encoded in UTF-8.

See commit 9364bf4 (20 Apr 2021) by Andrey Bienkowski (hexagonrecursion).
(Merged by Junio C Hamano -- gitster -- in commit 93e0b28, 30 Apr 2021)

doc: clarify the filename encoding in git diff

AFAICT parsing the output of git diff --name-only master...feature is the intended way of programmatically getting the list of files modified by a feature branch.

It is impossible to parse text unless you know what encoding it is in.

diff-options now includes in its man page:

Show only names of changed files. The file names are often encoded in UTF-8.

diff-options now also includes in its man page:

Just like --name-only the file names are often encoded in UTF-8.


git diff-tree turned out to be the wrong diff to use in this case. Running git diff --name-only a35f25470bc8219e3f2a45316963dde660091bcb 0dfc19850b2e31d72c1d2923321430e8fc1b53cb

revealed a lot of changes between the branches, one of them an update of the maven-compiler-plugin configuration that changed the Java version from 7 to 8. And it looks like javac 8 treats encoding problems as errors whereas javac 7 treats them as warnings (although it writes an absolutely identical "error: unmappable character for ..." message to the log).
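If the goal is to make the offending files compile under javac 8, one option is to re-encode them as UTF-8. The sketch below assumes the bad files are ISO-8859-1 (consistent with the lone A9 byte, but not confirmed). The class name Latin1ToUtf8 and the convention of passing file paths as arguments are hypothetical. Files that already decode as UTF-8 are left untouched, so the good C2 A9 files are not double-encoded.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper: rewrites Latin-1 files as UTF-8, skipping valid UTF-8 input.
final class Latin1ToUtf8 {

    /** Returns the input unchanged if it is valid UTF-8, else re-encodes it from ISO-8859-1. */
    static byte[] toUtf8(byte[] bytes) {
        try {
            // A fresh decoder reports malformed input by throwing.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return bytes;
        } catch (CharacterCodingException e) {
            // Assumption: anything that is not UTF-8 is Latin-1.
            return new String(bytes, StandardCharsets.ISO_8859_1)
                    .getBytes(StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            Path path = Paths.get(name);
            byte[] original = Files.readAllBytes(path);
            byte[] converted = toUtf8(original);
            if (!Arrays.equals(original, converted)) {
                Files.write(path, converted);
                System.out.println("re-encoded: " + path);
            }
        }
    }
}
```

The resulting change still has to be committed somewhere, so this addresses the build error rather than the wish to avoid an upstream commit.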