I need to find the matching coding sequences (CDS) for a custom-made "hit list" of protein names (just a text file of names) in the GTDB representatives (which can be downloaded from the GTDB webpage: https://data.gtdb.ecogenomic.org/releases/latest/genomic_files_reps).

As a side note, I cannot simply use the hit list to fetch the CDS for the respective proteins from GTDB/NCBI, because the "protein names" in the GTDB representative files are just the NCBI genome accession number with a running number appended for each protein of that genome (e.g. AE017199.1_1, AE017199.1_2, AE017199.1_3, etc.).

So far, I have written a bash script that scans the headers of all fna entries of a genome, identifies those that contain one of the protein names from the hit list, and then collects these sequence entries. Unfortunately, for the bacterial fna files containing the CDS of all GTDB representatives, the script would take a total of about 6 months to finish.

This is a snippet of the script I have used (the text file of protein names was loaded into the array words beforehand):

for gz_file in "$zip_directory"/*.fna.gz; do
    current_file=$((current_file + 1))
    echo "Processing file $current_file of $total_files"
    gzip -dc "$gz_file" | while IFS= read -r line; do
        # only header lines can contain a protein name from the hit list
        if [[ $line == ">"* ]]; then
            for word in "${words[@]}"; do
                if [[ $line == *"$word "* ]]; then
                    # save the matching header together with the file it came from ...
                    echo "$line - $(basename "$gz_file" .fna.gz)" >> "$temp_dir/$(basename "$gz_file" .fna.gz).temp"
                    # ... plus the sequence lines that follow, up to the next ">" header
                    # (that terminating header is consumed here and not checked again)
                    while IFS= read -r next_line && [[ ! $next_line == ">"* ]]; do
                        echo "$next_line" >> "$temp_dir/$(basename "$gz_file" .fna.gz).temp"
                    done
                    break
                fi
            done
        fi
    done
done

Thus, I asked myself whether there is a more efficient way to collect the corresponding GTDB-reps CDS for the hit list, for instance by processing all files in parallel, or a better approach in general.

I would be deeply appreciative of any help and suggestions for this kind of issue.

Thank you in advance for your support!

1 Answer

Armali:

We can gain some speed by replacing the inner loops with a sed command.

# IFS is set to |, so that ${words[*]} joins the protein names with |,
# which acts as the alternation operator in the extended regex below;
# a literal | inside a name is escaped first.
IFS=\|
regexp="${words[*]//|/\\|}"
for gz_file in "$zip_directory"/*.fna.gz
do  current_file=$((current_file + 1))
    echo "Processing file $current_file of $total_files"
    file=$(basename "$gz_file" .fna.gz)
    # sed appends " - $file" to every line matching one of the names, then
    # prints that line and the following lines up to the next ">" header;
    # that header is fed back to the start of the script via D, so it is not lost.
    gzip -dc "$gz_file" |
    sed -nr "/$regexp/{s/$/ - $file/;:0;p;n;/^>/!b0;s/^/\n/;D}" >"$temp_dir/$file.temp"
done
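
You also asked about running the files in parallel; the same per-file work can be handed to xargs. This is only a sketch: it assumes GNU xargs (for -P) and nproc, expects regexp and temp_dir to be set up as above, and drops the progress counter because the jobs run concurrently.

# Run the same per-file sed pipeline on several files at once.
export regexp temp_dir
find "$zip_directory" -name '*.fna.gz' -print0 |
xargs -0 -P "$(nproc)" -n 1 bash -c '
    gz_file=$1
    file=$(basename "$gz_file" .fna.gz)
    gzip -dc "$gz_file" |
    sed -nr "/$regexp/{s/$/ - $file/;:0;p;n;/^>/!b0;s/^/\n/;D}" >"$temp_dir/$file.temp"
' _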

But it seems advisable to use a DBMS rather than plain text files for data processing at this scale.
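
As a rough illustration of that idea, here is a minimal SQLite sketch; the file names hits.txt, headers.tsv and cds_index.sqlite are only placeholders, and the LIKE condition mimics the *"$word "* substring test from the original script:

# Index every header once, then let the database do the matching.
db="$temp_dir/cds_index.sqlite"
tsv="$temp_dir/headers.tsv"
: > "$tsv"

# Collect every header line together with its source file into one TSV.
for gz_file in "$zip_directory"/*.fna.gz; do
    file=$(basename "$gz_file" .fna.gz)
    gzip -dc "$gz_file" |
    awk -v f="$file" -v OFS='\t' '/^>/ { print substr($0, 2), f }' >> "$tsv"
done

# Load the headers and the hit list into SQLite and join them.
sqlite3 "$db" <<SQL
CREATE TABLE headers (header TEXT, file TEXT);
CREATE TABLE hits (name TEXT);
.mode tabs
.import $tsv headers
.import hits.txt hits
SELECT h.name, d.file, d.header
FROM hits h JOIN headers d ON d.header LIKE '%' || h.name || ' %';
SQL

The matching name/file pairs reported by the query could then be used to extract the actual sequences from only the relevant files.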