uniq: printing all duplicated lines and repeat counts is NOT meaningless


Is there a way to show the repeat count next to every duplicated line, rather than only the first line of each group?

For example, input:

AAAA XXXX
AAAA YYYY
BBBB ZZZZ

Expected output:

2 AAAA XXXX
2 AAAA YYYY
1 BBBB ZZZZ

The Linux program uniq does not print the second duplicated line, 2 AAAA YYYY.

Linux command used:

printf 'AAAA XXXX\nAAAA YYYY\nBBBB ZZZZ' | uniq --count --check-chars 4
      2 AAAA XXXX
      1 BBBB ZZZZ

The -D option of uniq prints all duplicate lines, but uniq rejects it in combination with --count and calls the combination meaningless:

printf 'AAAA XXXX\nAAAA YYYY\nBBBB ZZZZ' | uniq --count -D --check-chars 4
uniq: printing all duplicated lines and repeat counts is meaningless
Try 'uniq --help' for more information.

In my actual use case, XXXX, YYYY, and ZZZZ are file paths, and AAAA and BBBB are the md5 hashes of the file contents. If the hashes of XXXX and YYYY are identical, I need to check both files. However, with the uniq --count output I cannot get the file path of YYYY.

Barmar On

You can use join to combine the uniq output with the original input.

$ join -1 1 -2 2 <( printf 'AAAA XXXX\nAAAA YYYY\nBBBB ZZZZ') <(printf 'AAAA XXXX\nAAAA YYYY\nBBBB ZZZZ' | uniq --count --check-chars 4) | cut -d' ' -f1-3
AAAA XXXX 2
AAAA YYYY 2
BBBB ZZZZ 1
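
A variation on the same join idea, as a minimal sketch rather than the answer's exact command: it wraps the pipeline in a hypothetical function count_with_join, reads from a file instead of repeating printf, sorts the input itself (join needs both inputs ordered on the join field), and counts on the first column only, so the hash length does not matter. It assumes "hash path" lines separated by a single space, with no spaces in the paths.

```shell
# count_with_join FILE: print "count hash path" for every line of FILE.
# Hypothetical helper; assumes "hash path" lines with no spaces in the path.
count_with_join() {
    sorted=$(mktemp) || return 1
    # join needs its inputs ordered on the join field, so sort once up front.
    LC_ALL=C sort -k1,1 "$1" > "$sorted"
    # uniq -c prints "count hash", so the hash is field 2 of the second input.
    cut -d' ' -f1 "$sorted" | uniq -c |
        LC_ALL=C join -1 1 -2 2 -o 2.1,1.1,1.2 "$sorted" -
    rm -f "$sorted"
}
```

The -o list puts the count first, which matches the layout asked for in the question.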
Fravadona On

With this little awk you might get something usable?

awk '
    { arr[$1] = arr[$1] FS $2 }
    END {
        for (md5 in arr) {
            n = split(arr[md5], paths)
            print md5, n
            for (i = 1; i <= n; i++)
                print "\t" paths[i]
        }
    }
'
BBBB 1
    ZZZZ
AAAA 2
    XXXX
    YYYY
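
If only the real duplicates matter, the grouping idea above can be narrowed to hashes that occur more than once. This is a sketch under a few assumptions: list_duplicates is a hypothetical name, the hash is the first space-separated column, and everything after it is the path, so paths containing spaces survive (paths containing newlines do not).

```shell
# list_duplicates FILE: print each hash that occurs more than once,
# followed by its paths, one per tab-indented line.
list_duplicates() {
    awk '
        {
            hash = $1
            path = $0
            sub(/^[^ ]+ +/, "", path)   # strip the hash column, keep the rest
            count[hash]++
            paths[hash] = paths[hash] "\t" path "\n"
        }
        END {
            # for-in order is unspecified in awk, so groups may print in any order
            for (hash in count)
                if (count[hash] > 1)
                    printf "%s %d\n%s", hash, count[hash], paths[hash]
        }
    ' "$1"
}
```

Called on a file holding the sample input, this should print only the AAAA group, since BBBB occurs once.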
jared_mamrot On

Not sure if there's an easier method, but one potential option using awk:

printf 'AAAA XXXX\nAAAA YYYY\nBBBB ZZZZ' | awk '{a[$1]++; b[NR] = $1; c[NR] = $1 FS $2} END{for (i=1; i<=NR; i++) {print a[b[i]], c[i]}}'
2 AAAA XXXX
2 AAAA YYYY
1 BBBB ZZZZ

Proper formatting:

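printf 'AAAA XXXX\nAAAA YYYY\nBBBB ZZZZ' |\
awk '{
    a[$1]++              # occurrences of each hash
    b[NR] = $1           # hash seen on line NR
    c[NR] = $1 FS $2     # "hash path" as read on line NR
}

END {
    # NR is the number of lines read; length() on an array is not portable to every awk
    for (i = 1; i <= NR; i++) {
        print a[b[i]], c[i]
    }
}'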
2 AAAA XXXX
2 AAAA YYYY
1 BBBB ZZZZ
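
When the data is in a file rather than a pipe, the arrays b and c can be avoided by reading the file twice, a common awk idiom. This is a minimal sketch, not the answer's own method: count_per_line is a hypothetical name, and it assumes the hash is the first whitespace-separated field. It prints the whole original line, so paths with spaces are kept, and the input does not need to be sorted.

```shell
# count_per_line FILE: prefix every line with how often its first field
# occurs in FILE. Reads FILE twice, so FILE must be a regular file, not a pipe.
count_per_line() {
    awk '
        NR == FNR { seen[$1]++; next }   # first pass: count each hash
        { print seen[$1], $0 }           # second pass: count, then the line
    ' "$1" "$1"
}
```

NR == FNR is true only while awk is reading the first file argument, which is what separates the counting pass from the printing pass.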