Unicode character ſ is matched as itself and as 's.'


I just tried to clean up an old German text containing the character 'ſ' (U+017F, Latin small letter long s). I wanted to replace it with 's', but when I ran :%s/ſ/s/g, it replaced not only that character but also every occurrence of 's' followed by an arbitrary character, as if I had used the command :%s/s./s/g.

As an example, the text:

Die Gleichheit **) fordert das Nachdenken heraus durch Fragen, die ſich daran knüpfen und nicht ganz leicht zu beantworten ſind.

will be replaced by my command with:

Die Gleichheit **) fordert dasNachdenken herausdurch Fragen, die sich daran knüpfen und nicht ganz leicht zu beantworten sind.

I assume it might have something to do with the fact that 'ſ' is represented in UTF-8 as a sequence of two bytes (0xC5 0xBF). Isn't that a bug? If not, is there a way to just replace 'ſ' and not also 's'?

I am using fileencoding=utf-8 and:

> vim --version
VIM - Vi IMproved 9.1 (2024 Jan 02)
Included patches: 1-151
> echo $LANG
de_DE.UTF-8

[Screenshot with :set hlsearch: search highlighting reveals the matching.]

UPDATE: I got hold of an installation of Vim 8.0 (patches up to 586) on Windows 10. It exhibits the same behavior with both my original command and the \%u version of it.

Answer by romainl:

I am not sure whether the observed behavior should be considered a bug, but I certainly wouldn't expect it.

In general, searching/substituting characters outside of ASCII or perhaps Latin 1 or 2 is best done with the notations described under :help /\%u. In this case, I would use this notation:

:[range]s/\%u017F/s/g

Answer by diffset:

I seem to have at least partially solved the mystery. When I opened the file with vim -u /dev/null frege_sinn_1892.txt, both versions of the command worked as expected. Bisecting my rather large .vimrc file, I found the culprit to be set ignorecase. However, I don't know why sometimes only the original version of the command is broken, as stated by Friedrich in the comments to the (first) answer. Anyway, even with :set ignorecase, a single character shouldn't be matched to two characters, so I consider it a bug.
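
A possible workaround, assuming ignorecase really is the only trigger (I have not verified this on every affected version): force a case-sensitive match for this one substitution instead of giving up the option globally. Both \C in the pattern and the I flag of :s are documented ways to override ignorecase; whether they sidestep this particular behavior is an assumption on my part.

```vim
" Option 1: \C anywhere in the pattern forces case-sensitive matching.
:%s/\Cſ/s/g

" Option 2: the I flag tells :s not to use 'ignorecase' or 'smartcase'.
:%s/ſ/s/gI

" Option 3: the same idea with the code point notation from the first answer.
:%s/\C\%u017f/s/g

" Option 4: switch the option off, substitute, then switch it back on.
:set noignorecase
:%s/ſ/s/g
:set ignorecase
```

Adding the c flag (for example :%s/\Cſ/s/gc) asks for confirmation at every match, which is a cheap way to check that only 'ſ' is being hit before committing to the whole file.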