I have a file with duplicate records (duplicates are identified by the first column). I want to keep only the last occurrence of each duplicated record in one file and move all the other duplicates to another file.

File : input

foo j
bar bn
bar b
bar bn
bar bn
bar bn
kkk hh
fjk ff
foo jj
xxx tt
kkk hh

I have used the following awk statement to keep the last occurrence:

awk '{line=$0; x[$1]=line;} END{ for (key in x) print x[key];}' input > output

File : output

foo jj
xxx tt
fjk ff
kkk hh
bar bn

How can I move the repeating records to another file (leaving the last occurrence)?

For example, moving foo j to another file, say d_output, while keeping foo jj in the output file.
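
For reference, here is that statement spelled out with comments. It works because x[$1] is overwritten on every line with the same key, so only the last occurrence survives:

awk '{
    # Store the whole line, keyed on the first field; a later line
    # with the same key overwrites the earlier one.
    x[$1] = $0
}
END {
    # "for (key in x)" visits keys in an unspecified order,
    # which is why the output above is not in file order.
    for (key in x) print x[key]
}' input > output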

There are 3 answers below.

Accepted answer (Scrutinizer):

Another option you could try, keeping the order by reading the input file twice:

awk 'NR==FNR{A[$1]=NR; next} A[$1]!=FNR{print>f; next}1' f=dups file file
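
The same command spelled out with comments. It relies on the NR==FNR idiom, which holds only while awk is reading the first of the two (identical) file arguments:

awk '
NR==FNR {          # first pass: record the line number of each key;
    A[$1] = NR     # later lines overwrite, so A[$1] ends up holding
    next           # the line number of the LAST occurrence
}
A[$1] != FNR {     # second pass: this line is not the last occurrence
    print > f      # of its key, so send it to the duplicates file
    next
}
1                  # last occurrence: the "1" pattern prints to stdout
' f=dups file file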

output:

bar bn
fjk ff
foo jj
xxx tt
kkk hh

Duplicates:

$ cat dups
foo j
bar bn
bar b
bar bn
bar bn
kkk hh

@Sudo_O @WilliamPursell @user2018441: Sudo_O, thank you for the performance test. I tried to reproduce it on my system, but tac is not available there, so I tested with Kent's version and mine; I could not reproduce those differences on my system.

Update: I tested Sudo_O's version using cat instead of tac, although on a system with tac there was a difference of about 0.2 seconds between tac and cat when outputting to /dev/null (see the bottom of this post).

I got:

Sudo_O
$ time cat <(seq 1 1000000) | awk 'a[$1]++{print $0 > "/dev/null";next}{print $0 > "/dev/null"}'

real    0m1.491s
user    0m1.307s
sys     0m0.415s

kent
$ time awk '$1 in a{print a[$1]>"/dev/null"}{a[$1]=$0}END{for(x in a)print a[x]}' <(seq 1 1000000) > /dev/null

real    0m1.238s
user    0m1.421s
sys     0m0.038s

scrutinizer
$ time awk 'NR==FNR{A[$1]=NR; next} A[$1]!=FNR{print>f; next}1' f=/dev/null <(seq 1 1000000) <(seq 1 1000000) > /dev/null

real    0m1.422s
user    0m1.778s
sys     0m0.078s

--

When using a file instead of the seq process substitution, I got:

Sudo_O
$ time cat <infile | awk 'a[$1]++{print $0 > "/dev/null";next}{print $0 > "/dev/null"}'

real    0m1.519s
user    0m1.148s
sys     0m0.372s


kent
$ time awk '$1 in a{print a[$1]>"/dev/null"}{a[$1]=$0}END{for(x in a)print a[x]}' <infile > /dev/null

real    0m1.267s
user    0m1.227s
sys     0m0.037s

scrutinizer
$ time awk 'NR==FNR{A[$1]=NR; next} A[$1]!=FNR{print>f; next}1' f=/dev/null <infile <infile > /dev/null

real    0m0.737s
user    0m0.707s
sys     0m0.025s

The difference is probably due to caching effects, which would also be present for larger files. Creating the infile took:

$ time seq 1 1000000 > infile

real    0m0.224s
user    0m0.213s
sys     0m0.010s

Tested on a different system:

$ time cat <(seq 1 1000000) > /dev/null

real    0m0.764s
user    0m0.719s
sys     0m0.031s
$ time tac <(seq 1 1000000) > /dev/null

real    0m1.011s
user    0m0.820s
sys     0m0.082s

Another answer (Sudo_O):

A trick is to use tac to reverse the file first (it is easier to grab the first match than the last):

$ tac file | awk 'a[$1]++{print $0 > "dup";next}{print $0 > "output"}'

$ cat output
kkk hh
xxx tt
foo jj
fjk ff
bar bn

$ cat dup
kkk hh
bar bn
bar bn
bar b
bar bn
foo j
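
The same pipeline with comments. Because the input is reversed, the first time a key is seen corresponds to its last occurrence in the original file:

tac file | awk '
a[$1]++ {             # key already seen in the reversed stream, i.e. an
    print > "dup"     # earlier occurrence in the original file: a dup
    next
}
{
    print > "output"  # first sighting in reversed order = the last
}                     # occurrence in the original file
'

Note that both result files come out in reversed order relative to the original file.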

Edit:

Here are the benchmark figures for the three current solutions over one million lines:

sudo_o

real    0m2.156s
user    0m1.004s
sys     0m0.117s

kent

real    0m2.806s
user    0m2.718s
sys     0m0.080s

scrutinizer

real    0m4.033s
user    0m3.939s
sys     0m0.082s

Verify here: http://ideone.com/IBrNeh

On my local machine, using a file generated with seq 1 1000000 > bench:

# sudo_o
$ time tac bench | awk 'a[$1]++{print $0 > "dup";next}{print $0 > "output"}' 

real    0m0.729s
user    0m0.668s
sys     0m0.101s

# scrutinizer
$ time awk 'NR==FNR{A[$1]=NR; next} A[$1]!=FNR{print>f; next}1' f=dups bench bench > output

real    0m1.093s
user    0m1.016s
sys     0m0.070s

# kent 
$ time awk '$1 in a{print a[$1]>"dup.txt"}{a[$1]=$0}END{for(x in a)print a[x]}' bench > output

real    0m1.141s
user    0m1.055s
sys     0m0.080s

Another answer (Kent):

Tools like tac and rev are nice! However, they are not installed by default on all distributions, particularly since you have tagged the question with unix. Also, tac changes the order of output and dup.txt; if the original order should be kept, extra effort is needed to maintain it (one way is sketched below).
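
For instance, a sketch of one way to restore the original order afterwards, reusing the dup and output file names from the tac answer above (the .ordered names are just illustrative):

tac file | awk 'a[$1]++{print > "dup"; next}{print > "output"}'
tac dup > dup.ordered          # reverse again to restore original order
tac output > output.ordered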

Try this line:

awk '$1 in a{print a[$1]>"dup.txt"}{a[$1]=$0}END{for(x in a)print a[x]}' file

with your example:

kent$  awk '$1 in a{print a[$1]>"dup.txt"}{a[$1]=$0}END{for(x in a)print a[x]}' file
foo jj
xxx tt
fjk ff
kkk hh
bar bn

kent$  cat dup.txt 
bar bn
bar b
bar bn
bar bn
foo j
kkk hh
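
Spelled out with comments, the same one-liner:

awk '
$1 in a {                     # this key was seen before, so the line
    print a[$1] > "dup.txt"   # stored earlier is a superseded duplicate
}
{ a[$1] = $0 }                # always remember the latest line per key
END {
    for (x in a)              # unspecified traversal order, hence the
        print a[x]            # shuffled final output of kept lines
}' file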