Getting reads from one fastq file that are not found in another fastq file

238 Views Asked by At

I would like to get a fastq file comprised of fastq reads that are in F1.fastq but are not in F2.fastq.

The following code was offered as a solution here.

needed_reads = []
reads_array = []
chosen_array = []

for x in SeqIO.parse("F1.fastq","fastq"):
    reads_array.append(x)

for y in SeqIO.parse("F2.fastq","fastq"):
    chosen_array.append(y)

needed_reads = list(set(reads_array) - set(chosen_array))

output_handle = open("DIFF.fastq","w")
SeqIO.write(needed_reads,output_handle,"fastq")
output_handle.close()

However, it no longer works and produces a "unhashable type: SeqRecord" error when the set() function is run on the output of SeqIO.parse(). SeqIO.parse() still produces a list type object according to the type() function. However, that list appears to be unhashable. How can this be fixed?

3

There are 3 best solutions below

0
On

Briefly, instead of BioPython, use seqtk and UNIX tools, as shown here:

How to remove a list of reads from fastq file?

First, use grep to extract read ids from fastq file. Use comm to extract only the read names to keep:

comm -23 <( cat test2.1.fastq | paste - - - - | grep -Po '^@\K\S+' | sort ) <( cat test1.1.fastq | paste - - - - | grep -Po '^@\K\S+' | sort ) > name.lst

To extract reads from fastq files by IDs, use seqtk subseq:

seqtk subseq test2.1.fastq name.lst > out.fastq

It works with fasta files as well.

Alternatively, use BBMap filterbyname.sh, see docs:

By default, "filterbyname" discards reads with names in your name list, and keeps the rest. To include them and discard the others, do this:
filterbyname.sh in=003.fastq out=filter003.fq names=names003.txt include=t

See also:

To install these tools, use conda, specifically miniconda, for example:

conda create --channel bioconda --name seqtk seqtk
conda activate seqtk
# ... use seqtk here ...
conda deactivate

Here, GNU grep uses the following options:
-P : Use Perl regexes.
-o : Print the matches only (1 match per line), not the entire lines.

\K : Cause the regex engine to "keep" everything it had matched prior to the \K and not include it in the match. Specifically, ignore the preceding part of the regex when printing the match.

See also:

Files used in the above example:

cat test1.1.fastq
@M01873:M01873:000000000-B4TLH:1:1101:10001:20001/1
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTTAAAAACCCCCGGGGGTTTTTAAAAAACCCCCCGGG
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@M01873:M01873:000000000-B4TLH:1:1101:10002:20002/1
GGGTTTTTTAAAAAAACCCCCCCGGGGGGGTTTTTTTAAAAAAAACCCCCCCCGGGGGGGGTTTTTTTTAAAAAA
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
cat test2.1.fastq
@M01873:M01873:000000000-B4TLH:1:1101:10001:20001/1
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTTAAAAACCCCCGGGGGTTTTTAAAAAACCCCCCGGG
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@M01873:M01873:000000000-B4TLH:1:1101:10002:20002/1
GGGTTTTTTAAAAAAACCCCCCCGGGGGGGTTTTTTTAAAAAAAACCCCCCCCGGGGGGGGTTTTTTTTAAAAAA
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@M01873:M01873:000000000-B4TLH:1:1101:10003:20003/1
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTTAAAAACCCCCGGGGGTTTTTAAAAAACCCCCCGGG
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@M01873:M01873:000000000-B4TLH:1:1101:10004:20004/1
GGGTTTTTTAAAAAAACCCCCCCGGGGGGGTTTTTTTAAAAAAAACCCCCCCCGGGGGGGGTTTTTTTTAAAAAA
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
cat name.lst
M01873:M01873:000000000-B4TLH:1:1101:10003:20003/1
M01873:M01873:000000000-B4TLH:1:1101:10004:20004/1
cat out.fastq
@M01873:M01873:000000000-B4TLH:1:1101:10003:20003/1
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTTAAAAACCCCCGGGGGTTTTTAAAAAACCCCCCGGG
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@M01873:M01873:000000000-B4TLH:1:1101:10004:20004/1
GGGTTTTTTAAAAAAACCCCCCCGGGGGGGTTTTTTTAAAAAAAACCCCCCCCGGGGGGGGTTTTTTTTAAAAAA
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
0
On

say two fastq reads are the same if they share the same NAME:

linearize with paste, join on NAME, restore the fastq format with tr

something like (not tested)

 join -t $'\t' -v1 -1 1 -2 1 \
      <(cat F1.fastq | paste - - - - | sort -t $'\t' -k1,1) \
      <(cat F2.fastq | paste - - - - | sort -t $'\t' -k1,1) |\
 tr "\t" "\n"
0
On

Like the BioPython documentation tells you, they changed how SeqRecord objects are compared. (This is what SeqIO returns when you loop over it.) It's really a question of what exactly you want to compare.

Here is a refactoring which checks if the actual sequence is identical, and if so, skips the ones which are already in the smaller file.

from Bio import SeqIO

chosen_seqs = [y.seq for y in SeqIO.parse("F2.fastq", "fastq")]
needed_reads = []

for x in SeqIO.parse("F1.fastq", "fastq"):
    if x.seq not in chosen_seqs:
        needed_reads.append(x)

with open("DIFF.fastq", "w") as output_handle:
    SeqIO.write(needed_reads,output_handle, "fastq")

This is slightly refactored to avoid keeping all the information in memory; you only need to have the smaller set in memory and then iterate over the bigger one one item at a time.

To change the selection logic, change y.seq to whichever feature you want to examine from the smaller file, and correspondingly change x.seq to whichever other feature you extracted from each SeqRecord.


From the entry for Biopython 1.67 (8 June 2016) from https://raw.githubusercontent.com/biopython/biopython/master/NEWS.rst:

Comparison of SeqRecord objects until now has used the default Python object comparison (are they the same instance in memory?). This can be surprising, but comparing all of the attributes would be too complex. As of this release attempting to compare SeqRecord objects should raise an exception instead. If you want the old behaviour, use id(record1) == id(record2) instead.