Error using BWA to map environmental transcriptome against a genomic reference

59 Views Asked by At

I have quality controlled paired end environmental transcriptomic data that I want to map against a reference database of 8 cyanobacterial genomes. I made this reference by joining together the fasta files of each genome.

I performed this mapping with BWA v0.7.3a but noticed that when I tried to count the reads mapped to each gene in the reference with htseq-count, the SAM file output from BWA was corrupted.

Here is my BWA command:

bwa mem -k 10 -aM -t 128 Merged_reference Input_file1 Input_file2 > Out.sam

Here is my HTSeq command:

htseq-count -f sam -r pos -a 0 -m intersection-strict -s no Out.sam Merged_ref2.gtf2 > HTSeq_out

This is the error I see when running HTSeq:

[W::sam_parse1] mapped query cannot have zero coordinate; treated as unmapped
[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1_sam] Parse error at line 8839134
Error occured when processing input (record #8809785 in file Out.sam):
  truncated file
  [Exception type: OSError, raised in libcalignmentfile.pyx:1876]

Here is what the SAM file looks like at lines 8839134 and 8839135:

A00521:380:HFKMLDSX5:4:1163:32081:25332 65      lcl|NC_013771.1_cds_WP_012953431.1_1    0       0       268435455S89S   lcl|NC_013771.1_cds_WP_012954293.1_845  87      0       Sequence   Quality     NM:i:8388607    AS:i:73 XS:i:72

A00521:380:HFKMLDSX5:4:1163:32081:25332 129     lcl|NC_013771.1_cds_WP_012954293.1_845  87      0       2S15M72S        lcl|NC_013771.1_cds_WP_012953431.1_1    0       0      Sequence  Quality  NM:i:0  AS:i:15 XS:i:15

I know that if I change these lines to the following and run htseq on a subset of my data up to the changed lines the error disappears:

A00521:380:HFKMLDSX5:4:1163:32081:25332 65      *    0       0       *   *      0     0 Sequence   Quality     NM:i:8388607    AS:i:73 XS:i:72

A00521:380:HFKMLDSX5:4:1163:32081:25332 129     *    0       0       *   *      0     0      Sequence  Quality  NM:i:0  AS:i:15 XS:i:15

While this change makes sense to me because the weird sequence causing this should be treated as an unmapped read, unfortunately I have noticed that this error occurs approximately every 4 million lines in my SAM file. Since I have ~115 million lines in the SAM file and 60+ SAM files I would like to get to the bottom of this to efficiently solve this problem.

Has anyone else seen this issue? I can't figure out why BWA would be making this mistake!

A few other notes, every time there is this error it is triggered by the lcl|NC_013771.1_cds_WP_012954293.1_845 sequence. This sequence is the first one in my reference file.

0

There are 0 best solutions below