Using awk to filter a file based on conditions

43 Views Asked by At

I have posted previosuly but I have not resolved the issue. I have two files one which includes 3 columns, called Chromosome and Gene_start Gene_end. There are approx 100 lines in this file. The first column contains a number (1-23) and Gene_start Gene_end contains a numeric value. If I filter for 4 in the first column or chromosome 4, my file looks like this:

>> cat gene_positions.txt
Chromosome Gene_start Gene_end 
4 25121627 25167204
4 170981373 171017850
4 37592422 37692998
4 84382094 84411290

I have another file with 14 columns, I am only interested in the first two columns of this file. CHROM and POS:

>> cat gwas_results.txt
CHROM   POS ID  REF ALT A1  TEST    OBS_CT  BETA    SE  L95 U95 T_STAT  P
1   22004791    1:54712:T:TTTTC T   TTTTC   T   ADD 1597    -0.00256645 0.0313666   -0.0640439  0.058911    -0.0818212  0.934799
1   22105000    1:54712:T:TC    T   TTTTC   T   ADD 1597    -0.00259875 0.0313666   -0.0640439  0.058911    -0.0818216  0.94799
1   825410  rs13303179:825410:G:A   G   A   G   ADD 1597    -0.017462   0.0235123   -0.0635454  0.0286213   -0.742676   0.457788

If the first column from file 1 matches exactly to the first column of the second file as well as the second column of gwas_results.txt called POS falls within the range of the second and third column of file 1 (Gene_start Gene_end) for that particular line, then to print the line in the lookup_output.txt. Essentially, the gene_positions is a lookup file, with three columns and the first is a number that should match the first column of the gwas_results.txt, next if the POS in gwas_results.txt falls within the range of the Gene_start Gene_end from the first file, I want the line from gwas_results.txt to be printed.

I am using the following awk command as was suggested before:

#!/bin/bash
#SBATCH --job-name=genelookup
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16gb
#SBATCH --time=01:00:00
#SBATCH --export=NONE

cd /data/genome/hj/GWAS_results 

awk '
NR == FNR {
   start[$1] = $2
   end[$1] = $3
   next
}
$1 in start && $2 >= start[$1] && $3 <= end[$1]
' gene_positions.txt gwas_results.txt > lookup_output.txt

What I get are some positions outside the range in my lookup_output.txt

4   84529777    4:84529777:ACGTG:A  ACGTG   A   ACGTG   ADD 1635    0.00986994  0.0253137   -0.039744   0.0594839   0.389905    0.696658
4   85092377    4:84392377:G:T  G   T   T   ADD 1635    -0.00336762 0.0415539   -0.0848117  0.0780765   -0.0810422  0.935418

For example 84529777 is outside the range from the gene_positions file (4 84382094 84411290). I need the lookup to be line to line basis. E.g. scans the file for chromosome 4 and positions within 84382094 84411290 range. If found print.

I also get

4   37700998    4:84392377:G:T  G   T   A   ADD 1635    -0.0033 0.047539    -0.0847 0.0780765   -0.08   0.93 

But this is outside the range (4 37592422 37692998) from the gene positions.

I really should be seeing

4   25121630    4:84392377:G:T  G   T   T   ADD 1635    -0.00336762 0.0415539   -0.0848117  0.0780765   -0.0810422  0.935418
4   25131627    4:84392378:A:T  A   T   T   ADD 1635    -0.00336762 0.0415539   -0.0848117  0.0780765   -0.0810422  0.935418
4   170981374   4:84415536:CA:C CA  C   C   ADD 1635    -0.0359081  0.0576552   -0.14891    0.0770941   -0.622807   0.533499
4   171017750   4:84415882:AAACATG:A    AAACATG A   A   ADD 1635    -0.00552875 0.0522986   -0.108032   0.0969745   -0.105715   0.915821
4   37492998    4:84454092:T:TTATATATG  T   TTATATATG   TTATATATG   ADD 1635    0.0148902   0.0442058   -0.0717516  0.101532    0.336838    0.736283
4   37592998    4:84458776:G:GTA    G   GTA GTA ADD 1635    -0.000157512    0.0252463   

etc...

BECAUSE...they second column called POS is all within the range of the positions in the gene_positions.txt file, providing it is chromosome 4. In other words, the first column of 'gwas_results.txt' ($1) matches a value in the first column of 'gene_positions.txt' and if this is true, then the second column of 'gwas_results.txt' ($2) falls within the range defined by the second and third columns of 'gene_positions.txt' (Gene_start and Gene_end) for the particular chromosome.

0

There are 0 best solutions below