Combine files with the same basename and a specific set of different IDs


I have over 800,000 fastq.gz files that I am trying to combine. Below is an example of my data. Each file name has a basename (sample#) and a BC1 identifier (BC1_#).

sample1_BC1_1_R1.fastq.gz
sample1_BC1_49_R1.fastq.gz
sample1_BC1_2_R1.fastq.gz
sample1_BC1_50_R1.fastq.gz

sample2_BC1_1_R1.fastq.gz
sample2_BC1_49_R1.fastq.gz
sample2_BC1_2_R1.fastq.gz
sample2_BC1_50_R1.fastq.gz

I want to combine files that have the same basename and a specific pair of BC1 identifiers. The following BC1 identifiers should be combined:

1 and 49
2 and 50 
3 and 51 
...
48 and 96
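
The pairing listed above looks like "n goes with n + 48" for n from 1 to 48. A minimal sketch of that rule in Python, assuming the elided rows follow the same pattern (the names OFFSET, PAIRS and PARTNER are made up for illustration):

```python
# Assumption: every pair is (n, n + 48) for n = 1..48, as the listed rows suggest.
OFFSET = 48

# All 48 pairs: (1, 49), (2, 50), ..., (48, 96).
PAIRS = [(n, n + OFFSET) for n in range(1, OFFSET + 1)]

# Lookup table from any BC1 identifier to its partner, in both directions.
PARTNER = {}
for low, high in PAIRS:
    PARTNER[low] = high
    PARTNER[high] = low

if __name__ == "__main__":
    for low, high in PAIRS[:3]:
        print(f"{low} and {high}")
```

If the real pairing is irregular somewhere in the elided range, PAIRS could instead be written out by hand or read from a small text file.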

For the example above with 8 files, my output would be 4 files...

sample1_BC1_1-49_R1.fastq.gz
sample1_BC1_2-50_R1.fastq.gz
sample2_BC1_1-49_R1.fastq.gz
sample2_BC1_2-50_R1.fastq.gz
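
The mapping from an input file name to its output file name could be sketched as below. This assumes every name follows the pattern sample_BC1_number_Rnumber.fastq.gz exactly as in the example and that pairs are n and n + 48; the helper output_name is hypothetical, not an existing tool.

```python
import re

# Assumed naming pattern, taken from the example file names in the question.
NAME_RE = re.compile(
    r"^(?P<sample>.+)_BC1_(?P<bc>\d+)_(?P<read>R\d+)\.fastq\.gz$"
)


def output_name(filename, offset=48):
    """Return the combined file name for one input name, or None if it does not fit."""
    m = NAME_RE.match(filename)
    if m is None:
        return None
    bc = int(m["bc"])
    if not 1 <= bc <= 2 * offset:
        return None
    # Map both members of a pair onto the lower identifier.
    low = bc if bc <= offset else bc - offset
    return f"{m['sample']}_BC1_{low}-{low + offset}_{m['read']}.fastq.gz"


if __name__ == "__main__":
    for name in ("sample1_BC1_1_R1.fastq.gz", "sample1_BC1_49_R1.fastq.gz"):
        print(name, "->", output_name(name))
```

Two inputs belong together exactly when output_name gives them the same result, so this can be used for a dry run before any file is written.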

How can I do this in Linux or Python? Thank you in advance! I am not yet very proficient with either, so any help is welcome.

I have tried looping through the files to find those with the same basename, but I am having trouble concatenating only the files that have the right pair of BC1 identifiers.
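
One possible way to do the concatenation step is sketched below, under several assumptions: all inputs sit in one directory, names follow the sample_BC1_number_Rnumber.fastq.gz pattern from the example, and pairs are n and n + 48. It relies on the fact that gzip files joined byte for byte still form a valid gzip file, so nothing has to be decompressed. The function combine_pairs and its parameters are hypothetical names.

```python
import re
import shutil
from collections import defaultdict
from pathlib import Path

# Assumed naming pattern, taken from the example file names in the question.
NAME_RE = re.compile(
    r"^(?P<sample>.+)_BC1_(?P<bc>\d+)_(?P<read>R\d+)\.fastq\.gz$"
)


def combine_pairs(in_dir, out_dir, offset=48):
    """Concatenate paired files from in_dir into out_dir; return the number of outputs."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Pass 1: group input paths by the output name they belong to.
    groups = defaultdict(list)
    for path in in_dir.iterdir():
        m = NAME_RE.match(path.name)
        if m is None:
            continue  # skip anything that does not fit the pattern
        bc = int(m["bc"])
        if not 1 <= bc <= 2 * offset:
            continue
        low = bc if bc <= offset else bc - offset
        out_name = f"{m['sample']}_BC1_{low}-{low + offset}_{m['read']}.fastq.gz"
        groups[out_name].append((bc, path))

    # Pass 2: append the raw bytes of each member, lower identifier first.
    for out_name, members in groups.items():
        with open(out_dir / out_name, "wb") as dst:
            for _, src_path in sorted(members, key=lambda item: item[0]):
                with open(src_path, "rb") as src:
                    shutil.copyfileobj(src, dst)
    return len(groups)


if __name__ == "__main__":
    # Hypothetical directory names; adjust to the real locations.
    print(combine_pairs("fastq_in", "fastq_combined"))
```

A file whose partner is missing is still written out on its own under the paired name, so it may be worth checking that each group has two members before trusting the result. With 800,000 files, trying this on a small copy of the data first seems sensible.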
