comm -23 not deleting all common lines


I want to delete the lines of 1.txt that also appear in 2.txt and save the result to 3.txt. I am using this bash command:

comm -23 1.txt 2.txt > 3.txt

When I check 3.txt, I find that some lines common to 1.txt and 2.txt are still there, for example the word "registry". What is the problem?

You can download the two files below:

file 1.txt : https://ufile.io/n7vn6

file 2.txt : https://ufile.io/p4s58


There are 2 best solutions below

9
On BEST ANSWER

I'm not sure how you generated your text files, but the problem is that the line terminators in 1.txt and 2.txt are inconsistent. Some lines carry a CR character (ctrl-M) in addition to the line feed, instead of the lone line feed that Linux expects in text files. For example, one of the files has registry^M, which doesn't match registry: Linux text tools treat the ^M as an ordinary character (or as white space) that belongs to the line, not as part of a line terminator to be ignored. Many text editors don't display the ^M, so registry looks identical in both places, but it isn't.
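One way to confirm the diagnosis is to count the lines that contain a carriage return and to dump the raw bytes. This is only a sketch: a.txt and b.txt are hypothetical sample files created on the spot, not the files from the question.

```shell
# Hypothetical sample data: a.txt has one CR LF line, b.txt uses LF only.
tmp=$(mktemp -d)
printf 'alpha\nregistry\r\n' > "$tmp/a.txt"
printf 'registry\nzeta\n' > "$tmp/b.txt"

# Store a literal carriage return in a variable (works in plain POSIX sh).
cr=$(printf '\r')

# Count the lines that contain a CR; 0 means clean Unix line endings.
# grep -c exits non-zero when the count is 0, hence the "|| true".
cr_a=$(grep -c "$cr" "$tmp/a.txt" || true)
cr_b=$(grep -c "$cr" "$tmp/b.txt" || true)
echo "a.txt: $cr_a line(s) with CR, b.txt: $cr_b line(s) with CR"

# Make the hidden byte visible: od -c prints a CR as \r.
od -c "$tmp/a.txt"
```

Running the same grep and od commands against the real 1.txt and 2.txt should show which of the two files carries the stray characters.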

You could try:

dos2unix 1.txt 2.txt
comm -23 <(sort 1.txt) <(sort 2.txt) > 3.txt

dos2unix converts the files in place and normalizes every line terminator to a single line feed (assuming the stray characters are DOS-style CRs). Removing the CRs can change the sort order slightly, so I also re-sort both files. You can try it without re-sorting; if there is a problem, comm will report that one of the files isn't sorted.
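The effect can be reproduced on a small scale. The sketch below uses made-up sample data and swaps in tr -d '\r' for dos2unix so that it runs without extra packages; it also writes the sorted copies to temporary files instead of using process substitution, and sets LC_ALL=C as an assumption to keep sort and comm on the same collation order.

```shell
# Assumption: byte-wise collation, so sort and comm agree on the order.
export LC_ALL=C

# Hypothetical sample data standing in for 1.txt and 2.txt.
tmp=$(mktemp -d)
printf 'alpha\r\nregistry\r\n' > "$tmp/1.txt"
printf 'registry\nzeta\n' > "$tmp/2.txt"

# Before: "registry<CR>" and "registry" are different lines, so it survives.
sort "$tmp/1.txt" > "$tmp/1.sorted"
sort "$tmp/2.txt" > "$tmp/2.sorted"
# The trailing tr only strips the CRs from the output for display.
before=$(comm -23 "$tmp/1.sorted" "$tmp/2.sorted" | tr -d '\r')

# After: delete every CR, sort again, then compare.
tr -d '\r' < "$tmp/1.txt" | sort > "$tmp/1.clean"
tr -d '\r' < "$tmp/2.txt" | sort > "$tmp/2.clean"
after=$(comm -23 "$tmp/1.clean" "$tmp/2.clean")

printf 'before:\n%s\nafter:\n%s\n' "$before" "$after"
```

With this sample data, registry is still listed in the "before" result and is gone from the "after" result, which leaves only alpha.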

0
On

comm needs the input to be sorted. You can use process substitution for that:

comm -23 <(sort 1.txt) <(sort 2.txt) > 3.txt
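If you are not sure whether a file is already sorted, sort -c checks this without producing any output. A minimal sketch follows, in which is_sorted is a hypothetical helper and the two sample files are made up; LC_ALL=C is an assumption to keep the collation order predictable.

```shell
# Assumption: byte-wise collation, the same order comm would use here.
export LC_ALL=C

# Hypothetical sample files: one out of order, one in order.
tmp=$(mktemp -d)
printf 'zeta\nalpha\n' > "$tmp/unsorted.txt"
printf 'alpha\nzeta\n' > "$tmp/sorted.txt"

# Hypothetical helper: sort -c exits 0 when the file is already sorted
# and non-zero otherwise, without writing the sorted data anywhere.
is_sorted() {
    if sort -c "$1" 2>/dev/null; then
        echo yes
    else
        echo no
    fi
}

echo "unsorted.txt sorted? $(is_sorted "$tmp/unsorted.txt")"
echo "sorted.txt sorted?   $(is_sorted "$tmp/sorted.txt")"
```

The check has to run under the same locale as comm, because the expected order depends on the collation rules in effect.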

Update: if you also have a problem with line endings, you can use sed to strip the carriage returns:

comm -23 <(sed 's/\r$//' 1.txt | sort) <(sed 's/\r$//' 2.txt | sort) > 3.txt