how to format a large txt file to bed

766 Views Asked by At

I am trying to format CpG methylation calls from R package "methyKit" to simple bed format. Since it is a large file, i can not do it in Excel. I also tried Seqmonk, but it does not allow me to export the data in the format I want. Linux Awk/sed might be a good option, but I am new to them as well. Basically, I need to trim "chr" column, add "stop" column, convert "F" to "+" /"R" to "-", and freqC with 2 decimal places. Can you please help?

From:

chrBase chr base    strand  coverage  freqC   freqT
chr1.339    chr1    339 F   7      0.00   100.00
chr1.183    chr1    183 R   4      0.00   100.00
chr1.192    chr1    192 R   6      0.00   100.00
chr1.340    chr1    340 R   5      40.00  60.00
chr1.10007  chr1    10007   F   13     53.85  46.15
chr1.10317  chr1    10317   F   8      0.00   100.00
chr1.10346  chr1    10346   F   9      88.89  11.11
chr1.10349  chr1    10349   F   9      88.89  11.11

To:

   chr     start         stop      freqc  Coverage strand
    1   67678   67679   0   3   -
    1   67701   67702   0   3   -
    1   67724   67725   0   3   -
    1   67746   67747   0   3   -
    1   67768   67769   0.333333    3   -
    1   159446  159447  0   3   +
    1   162652  162653  0   3   +
    1   167767  167768  0.666667    3   +
    1   167789  167790  0.666667    3   +
    1   167797  167798  0   3   +
2

There are 2 best solutions below

0
On

This should do what you actually want, producing a BED6 file with the methylation percentage in the score column:

$ cat foo.awk
BEGIN{OFS="\t"}
{if(NR>1) {
    if($4=="F") {
        strand="+"
    } else {
        strand="-"
    }
    chromUse=gsub("chr", "", $2);
    print chromUse,$3-1,$3,$1,$6,strand,$5
}}

That would then be run with:

awk -f foo.awk input.txt > output.bed

The additional column 7 is the coverage, since genome viewers will only display a single score column:

1       338     339     chr1.339        0.00    +       7
1       182     183     chr1.183        0.00    -       4
1       191     192     chr1.192        0.00    -       6
1       339     340     chr1.340        40.00   -       5
1       10006   10007   chr1.10007      53.85   +       13
1       10316   10317   chr1.10317      0.00    +       8
1       10345   10346   chr1.10346      88.89   +       9
1       10348   10349   chr1.10349      88.89   +       9

You can tweak that further as needed.

0
On

It is not entirely clear the exact sequence you want since your "From" data does not correspond to what you show as your "To" results, but if what you are showing is the general format change and in the "From" data, you want to:

  • discard field 1,
  • retrieve the "chr" value from the end of field 2,
  • if the 4th field is "F" make it "+" else if it is "R" make it "-" otherwise leave it unchanged,
  • use the 3rd field as "start" and 3rd + 1 as "stop" (adjust whether to add or subtract 1 as needed to get the desired "start" and "stop" values),
  • print 6th field as "freqc",
  • output 5th field as "Coverage", and finally
  • output modified 4th field as "strand"

If that is your goal, then with your from data in the file named from, you can do something like the following:

awk '
  BEGIN { OFS="\t"; print "chr","start","stop","freqc","Coverage","strand" } 
  FNR > 1 { 
    match($2, /[[:digit:]]+$/, arr)
    if ($4 == "F")
      $4 = "+"
    else if ($4 == "R")
      $4 = "-"
    print arr[0], $3, $3 + 1, $6, $5, $4
  }
' from

Explanation, the BEGIN rule is run before awk starts processing records (lines) from the file. Above it simply sets the Output Field Separator to tab and prints the heading.

The condition (pattern) of FNR > 1 on the second rule processes the from file from the 2nd record (line) on (skipping the heading row). FNR is awk's way of saying File Record Number (even though it looks like the N and R are backwards).

match($2, /[[:digit:]]+$/, arr) splits the trailing digits from the second field into the first element of arr (e.g. arr[0]) and not relevant here sets the RSTART and RLENGTH internal variables telling you which character the first digit starts on and how many digits there are.

The if and else if statement does the "F" to "+" and "R" to "-" change. And, finally, the print statement just prints the modified values and unchanged fields in the order specified above.

Example Output

Running the above on your original "From" data will produce:

chr     start   stop    freqc   Coverage        strand
1       339     340     0.00    7       +
1       183     184     0.00    4       -
1       192     193     0.00    6       -
1       340     341     40.00   5       -
1       10007   10008   53.85   13      +
1       10317   10318   0.00    8       +
1       10346   10347   88.89   9       +
1       10349   10350   88.89   9       +

Let me know if this is close to what you explained in your question, and if not, drop a comment below.

The GNU Awk User's Guide is a great gawk/awk reference.