I am trying to run the MAGeCK pipeline to analyze CRISPR knockout screen data produced by someone in my lab around 5 years ago. I was given the data as BAM files, and I also have the sequencing statistics from when the samples were run, including the number of total, aligned, and unaligned reads. I converted the BAM files back to FASTQ files using:
samtools view -h -F 2048 filename.bam > tmp.bam
bedtools bamtofastq -i tmp.bam -fq filename.fastq
After this, I ran MAGeCK count. The total read counts in the MAGeCK count summary output did not match the original sequencing statistics for any sample.
More problematically, a few samples had far fewer total reads in the MAGeCK count summary (around 50,000) than in the original sequencing statistics (~44 million). As a result, many of the sgRNAs are not represented at all in those samples (~61,000 of the 77,441 sgRNAs have zero counts), which is affecting my analysis.
Can anyone help me understand why I might be losing sgRNAs using MAGeCK? Is it a problem with my source files? I would appreciate any help! Thanks!
I have tried re-downloading the BAM files from the server and re-converting them to FASTQ, but I get the same result.
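In case it helps narrow down where the reads go missing, here is a minimal sketch of the checks I could run at each step. The function names count_fastq_reads and count_zero_sgrnas are made up for illustration. The sketch assumes the FASTQ is uncompressed with exactly 4 lines per record, and that the MAGeCK count table is tab-delimited with a header row and sample columns starting at column 3; the file names in the usage comments are placeholders.

```shell
# Count reads in an uncompressed FASTQ file (assumes 4 lines per record).
count_fastq_reads() {
    awk 'END { printf "%d\n", NR / 4 }' "$1"
}

# Count sgRNAs with a zero count for one sample in a tab-delimited count table.
# $1 = count table, $2 = 1-based column index of the sample
# (assumed layout: sgRNA, Gene, then one column per sample, with a header row).
count_zero_sgrnas() {
    awk -F '\t' -v col="$2" 'NR > 1 && $col == 0 { n++ } END { printf "%d\n", n }' "$1"
}

# Example usage (placeholder file names):
#   samtools view -c -F 2304 filename.bam   # records in the BAM, excluding secondary/supplementary
#   count_fastq_reads filename.fastq        # reads that survived the conversion
#   count_zero_sgrnas sample.count.txt 3    # zero-count sgRNAs for the first sample
```

If the first two numbers already disagree, the loss happens during the BAM to FASTQ conversion rather than inside MAGeCK.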