I need to find the matching coding sequences (CDS) for a custom-made "hit list" of protein names (just a text file of names) from the GTDB representatives (which can be downloaded from the GTDB webpage => https://data.gtdb.ecogenomic.org/releases/latest/genomic_files_reps).
As a side note, I cannot use the hit list to simply fetch the CDS of the respective proteins from GTDB/NCBI, because the "protein names" in the GTDB representative files are just the NCBI genome accession number with a running number appended for each protein of that genome (e.g. AE017199.1_1, AE017199.1_2, AE017199.1_3, etc.).
So far, I have written a bash script that scans the headers of all FASTA entries in a genome's fna file, checks whether a header contains one of the protein names from the hit list, and then collects those sequence entries. Unfortunately, for the bacterial fna files containing the CDS of all GTDB representatives, the script would take roughly six months to finish.
This is a snippet of the script I have used (the text file of protein names was loaded into the array "words" beforehand):
for gz_file in "$zip_directory"/*.fna.gz; do
    current_file=$((current_file + 1))
    echo "Processing file $current_file of $total_files"
    # Walk through the decompressed file line by line
    gzip -dc "$gz_file" | while IFS= read -r line; do
        if [[ $line == ">"* ]]; then                  # FASTA header
            for word in "${words[@]}"; do             # compare against every hit-list name
                if [[ $line == *"$word "* ]]; then
                    echo "$line - $(basename "$gz_file" .fna.gz)" >> "$temp_dir/$(basename "$gz_file" .fna.gz).temp"
                    # copy the sequence lines until the next header
                    while IFS= read -r next_line && [[ ! $next_line == ">"* ]]; do
                        echo "$next_line" >> "$temp_dir/$(basename "$gz_file" .fna.gz).temp"
                    done
                    break
                fi
            done
        fi
    done
done
Thus, I asked myself whether there is a more efficient way to collect the corresponding GTDB-reps CDS for the hit list, for instance by processing all files in parallel, or a more convenient approach in general.
I would be deeply appreciative of any help and suggestions for this kind of issue.
Thank you in advance for your support!
We can gain some speed by replacing the inner loops with a sed command. But it seems advisable to use a DBMS rather than plain text files for such big data processing.
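For illustration, here is a minimal sketch of the sed idea: one sed range expression per hit-list name extracts the whole FASTA record (header plus sequence lines) from the decompressed stream, so the shell never has to read the files line by line. The variable names (zip_directory, temp_dir, words) are taken from your snippet; everything else is an assumption about your data, e.g. that each hit-list name occurs in at most one header per file and is followed by a space, as in your original test.

    # Hypothetical replacement for the inner loops; a sketch assuming GNU sed.
    for gz_file in "$zip_directory"/*.fna.gz; do
        base=$(basename "$gz_file" .fna.gz)
        for word in "${words[@]}"; do
            # Print from the matching header up to (but not including) the next header.
            # Note: the name is used as a regex, so the dots match any character;
            # escape them if that ever produces false positives.
            gzip -dc "$gz_file" \
                | sed -n "/^>.*${word} /,/^>/{/^>.*${word} /p;/^>/!p}" \
                | sed "1s/\$/ - $base/" \
                >> "$temp_dir/$base.temp"
        done
    done

This still decompresses and scans every file once per hit-list name. A single pass per file (for example with awk, or by loading the headers into a DBMS such as SQLite as suggested above) combined with running the files through GNU parallel instead of the outer for loop should reduce the runtime much further.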