I would like to get a fastq file comprised of fastq reads that are in F1.fastq
but are not in F2.fastq
.
The following code was offered as a solution here.
needed_reads = []
reads_array = []
chosen_array = []
for x in SeqIO.parse("F1.fastq","fastq"):
reads_array.append(x)
for y in SeqIO.parse("F2.fastq","fastq"):
chosen_array.append(y)
needed_reads = list(set(reads_array) - set(chosen_array))
output_handle = open("DIFF.fastq","w")
SeqIO.write(needed_reads,output_handle,"fastq")
output_handle.close()
However, it no longer works and produces a "unhashable type: SeqRecord"
error when the set()
function is run on the output of SeqIO.parse()
. SeqIO.parse()
still produces a list type object according to the type()
function. However, that list appears to be unhashable. How can this be fixed?
Briefly, instead of
BioPython
, useseqtk
and UNIX tools, as shown here:How to remove a list of reads from fastq file?
First, use
grep
to extract read ids from fastq file. Usecomm
to extract only the read names to keep:To extract reads from fastq files by IDs, use
seqtk subseq
:It works with
fasta
files as well.Alternatively, use
BBMap filterbyname.sh
, see docs:See also:
To install these tools, use
conda
, specificallyminiconda
, for example:Here, GNU
grep
uses the following options:-P
: Use Perl regexes.-o
: Print the matches only (1 match per line), not the entire lines.\K
: Cause the regex engine to "keep" everything it had matched prior to the\K
and not include it in the match. Specifically, ignore the preceding part of the regex when printing the match.See also:
grep
manualseqtk
conda
conda create
Files used in the above example: