Subset rows of file 2 based on common values of single column in file 1 UNIX

1.4k Views Asked by At

I'm dealing with large files where I would like to extract only the rows where the values of one column are contained in the values of a column of another file.

For example, in file1 I have 10,000 rows that look something like this:

Chr13998356 T   C
Chr1401532  A   G
Chr14021851 A   T

And I have in file2 (100,000+ rows) where I want only the rows where a values in column 1 in present in column 1 of file 1. So for file 2 I have:

Chr1    401530  G   G   60  0   60  11
Chr1    401531  A   A   60  0   60  11
Chr1    401532  A   G   30  170 60  11

I would like to end up with a third file with:

Chr1    401532  A   G   30  170 60  11

This file could have anywhere between few to 10,000 lines.

In R, i would use something like df3 <- df2[df2[,1] %in% df1[,2],], but the files are too large.

Is there a simple UNIX solution? Something like using comm, but only matching the first column instead of the whole line would be perfect. Or something with grep, but searching for 10K patterns at the same time.

Any ideas?

1

There are 1 best solutions below

0
On BEST ANSWER
$ cat file1
Chr13998356 T   C
Chr1401532  A   G
Chr14021851 A   T

$ cat file2
Chr1    401530  G   G   60  0   60  11
Chr1    401531  A   A   60  0   60  11
Chr1    401532  A   G   30  170 60  11

$ awk 'NR==FNR{vals[$1];next} ($1$2) in vals' file1 file2
Chr1    401532  A   G   30  170 60  11