comm -23 not deleting all common lines


I want to delete the lines from file 1.txt that are also in file 2.txt and save the output to 3.txt. I am using this bash command:

comm -23 1.txt 2.txt > 3.txt

When I check the output in 3.txt, I find that some lines common to 1.txt and 2.txt are still there; the word "registry" is one example. What is the problem?

You can download the two files below:

file 1.txt : https://ufile.io/n7vn6

file 2.txt : https://ufile.io/p4s58


There are 2 best solutions below

9
lurker On BEST ANSWER

I'm not sure how you generated your text files, but the problem is that the lines in 1.txt and 2.txt don't have consistent line terminations. Some end with a CR character (Ctrl-M) in addition to the line feed, instead of the lone line feed Linux expects for text files. For example, one of the files has registry^M, which doesn't match registry (Linux programs that examine text treat ^M as just another character or as white space, not as a line termination to be ignored). When you look at the file in some text editors the ^M isn't visible, so registry appears to be the same in both places, but it isn't.
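One way to check whether a file contains CR characters is to count them directly. This is only a sketch: the two sample files are made up to stand in for the real 1.txt and 2.txt, and the temporary directory is an assumption for illustration.

```shell
# Hypothetical sample data standing in for 1.txt and 2.txt.
tmp=$(mktemp -d)
printf 'apple\r\nregistry\r\n' > "$tmp/1.txt"   # CR LF (DOS) endings
printf 'apple\nregistry\n'     > "$tmp/2.txt"   # LF (Unix) endings

# Count the carriage-return bytes in each file; a non-zero count means
# the file contains CR characters.
cr1=$(( $(tr -cd '\r' < "$tmp/1.txt" | wc -c) ))
cr2=$(( $(tr -cd '\r' < "$tmp/2.txt" | wc -c) ))
echo "1.txt: $cr1 CR bytes"
echo "2.txt: $cr2 CR bytes"

# od -c prints the otherwise invisible CR as \r.
od -c "$tmp/1.txt"

rm -rf "$tmp"
```

On the real files, running the same tr and wc pipeline against 1.txt and 2.txt should show which of them carries the stray CRs.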

You could try:

dos2unix 1.txt 2.txt
comm -23 <(sort 1.txt) <(sort 2.txt) > 3.txt

dos2unix will make all of the line terminations consistent (assuming some of them use the DOS CR LF convention). Note that this can affect the sort order a little, so I'm also re-sorting the files. You can try this without re-sorting; if there's an issue, comm will report that one of the files isn't sorted.
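The effect of the stray CRs can be reproduced on a small made-up data set. In this sketch the sample files are hypothetical, tr -d '\r' stands in for dos2unix (so it runs where dos2unix is not installed), and temporary sorted files are used instead of process substitution.

```shell
# Use one collation for both sort and comm so they agree on ordering.
export LC_ALL=C

# Hypothetical sample data: 1.txt has CR LF endings, 2.txt has LF endings.
tmp=$(mktemp -d)
printf 'apple\r\nregistry\r\nzebra\r\n' > "$tmp/1.txt"
printf 'apple\nregistry\n'              > "$tmp/2.txt"

# With the CRs in place, "registry\r" never equals "registry",
# so every line of 1.txt is reported as unique to it.
sort "$tmp/1.txt" > "$tmp/1.sorted"
sort "$tmp/2.txt" > "$tmp/2.sorted"
before=$(( $(comm -23 "$tmp/1.sorted" "$tmp/2.sorted" | wc -l) ))

# Strip the CRs (tr stands in for dos2unix), re-sort, and compare again.
tr -d '\r' < "$tmp/1.txt" | sort > "$tmp/1.clean"
tr -d '\r' < "$tmp/2.txt" | sort > "$tmp/2.clean"
after=$(comm -23 "$tmp/1.clean" "$tmp/2.clean")

echo "lines left before stripping CRs: $before"
echo "lines left after stripping CRs:  $after"

rm -rf "$tmp"
```

Before the cleanup all three lines survive; afterwards only zebra, the one line that is really missing from 2.txt, is left.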

0
hek2mgl On

comm needs the input to be sorted. You can use process substitution for that:

comm -23 <(sort 1.txt) <(sort 2.txt) > 3.txt
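To see why the sorting matters, here is a small sketch with made-up, unsorted sample files. It writes the sorted copies to temporary files rather than using process substitution, so it also runs in shells that lack that feature; the file names and contents are assumptions for illustration.

```shell
# Use one collation for both sort and comm so they agree on ordering.
export LC_ALL=C

# Hypothetical sample data, deliberately not sorted.
tmp=$(mktemp -d)
printf 'pear\napple\nregistry\n' > "$tmp/1.txt"
printf 'registry\napple\n'       > "$tmp/2.txt"

# On unsorted input comm walks the files in step and misses matches,
# so a common line such as "apple" can end up in the output.
# (GNU comm also warns on stderr and exits non-zero, hence "|| true".)
unsorted=$(comm -23 "$tmp/1.txt" "$tmp/2.txt" 2>/dev/null || true)

# Sorting both inputs first gives the expected result.
sort "$tmp/1.txt" > "$tmp/1.sorted"
sort "$tmp/2.txt" > "$tmp/2.sorted"
sorted=$(comm -23 "$tmp/1.sorted" "$tmp/2.sorted")

printf 'unsorted input:\n%s\n' "$unsorted"
printf 'sorted input:\n%s\n' "$sorted"

rm -rf "$tmp"
```

With sorted input only pear remains, because apple and registry appear in both files.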

Update: if you also have a problem with line endings, you can use sed to strip the carriage returns before sorting:

comm -23 <(sed 's/\r//g' 1.txt | sort) <(sed 's/\r//g' 2.txt | sort) > 3.txt