Subset rows of file 2 based on common values of single column in file 1 UNIX

1.4k Views Asked by user2903950 At 27 July 2025 at 16:51

I'm dealing with large files where I would like to extract only the rows where the values of one column are contained in the values of a column of another file.

For example, in file1 I have 10,000 rows that look something like this:

Chr13998356 T   C
Chr1401532  A   G
Chr14021851 A   T

And I have in file2 (100,000+ rows) where I want only the rows where a values in column 1 in present in column 1 of file 1. So for file 2 I have:

Chr1    401530  G   G   60  0   60  11
Chr1    401531  A   A   60  0   60  11
Chr1    401532  A   G   30  170 60  11

I would like to end up with a third file with:

Chr1    401532  A   G   30  170 60  11

This file could have anywhere between few to 10,000 lines.

In R, i would use something like df3 <- df2[df2[,1] %in% df1[,2],], but the files are too large.

Is there a simple UNIX solution? Something like using comm, but only matching the first column instead of the whole line would be perfect. Or something with grep, but searching for 10K patterns at the same time.

Any ideas?

Original Q&A

There are 1 best solutions below

Ed Morton On 16 May 2014 at 20:16 BEST ANSWER

$ cat file1
Chr13998356 T   C
Chr1401532  A   G
Chr14021851 A   T

$ cat file2
Chr1    401530  G   G   60  0   60  11
Chr1    401531  A   A   60  0   60  11
Chr1    401532  A   G   30  170 60  11

$ awk 'NR==FNR{vals[$1];next} ($1$2) in vals' file1 file2
Chr1    401532  A   G   30  170 60  11

Subset rows of file 2 based on common values of single column in file 1 UNIX

There are 1 best solutions below

Related Questions in UNIX

Related Questions in AWK

Related Questions in GREP

Related Questions in COMM

Trending Questions

Popular # Hahtags

Popular Questions