Why does my pattern extractor code return false positives?

My goal is to extract the sequences of antimicrobial peptides (AMPs) from the NCBI database, searching with keywords such as "Antimicrobial Peptides" and "AMPs" and restricting the sequence length.

I have written code that extracts the sequences of antimicrobial peptides (AMPs) from the NCBI protein database using the Biopython library.

The Code

from Bio import Entrez
from Bio import SeqIO

def scrape_fasta_sequence(keywords):
    # Provide your email address to NCBI (replace the placeholder with your own)
    Entrez.email = 'your.email@example.com'

    # Create the query string with filters
    query = f'{keywords} AND srcdb_refseq[PROP] AND 7:50[SLEN] '

    # Search for protein IDs that match the query
    handle = Entrez.esearch(db='protein', term=query)
    record = Entrez.read(handle)
    id_list = record['IdList']

    # Use the protein IDs to fetch the sequences
    fasta_sequences = []
    for protein_id in id_list:
        handle = Entrez.efetch(db='protein', id=protein_id, rettype='fasta', retmode='text')
        fasta_sequence = SeqIO.read(handle, 'fasta')
        fasta_sequences.append(fasta_sequence)
        handle.close()

    return fasta_sequences

# Example usage
keywords = 'Antimicrobial Peptides OR AMPs'
fasta_sequences = scrape_fasta_sequence(keywords)
for fasta_sequence in fasta_sequences:
    print(fasta_sequence)

Output

The output was not specific to antimicrobial peptides: it also included other, unrelated proteins.

How can the code be modified so that it returns only antimicrobial peptides?

Answer by Umar

You might want to add a more specific filter to target antimicrobial peptides (AMPs). One way to do this is to specify the source organism; another is to use a more specific keyword related to AMPs. Here's an updated version of your code with an added filter for the source organism (bacteria, for example). The only change is the AND bacteria[ORGN] term at the end of the query:

from Bio import Entrez
from Bio import SeqIO

def scrape_fasta_sequence(keywords):
    # Provide your email address to NCBI (replace the placeholder with your own)
    Entrez.email = 'your.email@example.com'

    # Create the query string with filters
    query = f'{keywords} AND srcdb_refseq[PROP] AND 7:50[SLEN] AND bacteria[ORGN]'

    # Search for protein IDs that match the query
    # (esearch returns at most 20 IDs unless retmax is set)
    handle = Entrez.esearch(db='protein', term=query)
    record = Entrez.read(handle)
    handle.close()
    id_list = record['IdList']

    # Use the protein IDs to fetch the sequences
    fasta_sequences = []
    for protein_id in id_list:
        handle = Entrez.efetch(db='protein', id=protein_id, rettype='fasta', retmode='text')
        fasta_sequence = SeqIO.read(handle, 'fasta')
        fasta_sequences.append(fasta_sequence)
        handle.close()

    return fasta_sequences

# Example usage
keywords = 'Antimicrobial Peptides OR AMPs'
fasta_sequences = scrape_fasta_sequence(keywords)
for fasta_sequence in fasta_sequences:
    print(fasta_sequence)

The output then looks like this:

ID: WP_328102937.1
Name: WP_328102937.1
Description: WP_328102937.1 acinetodin/klebsidin/J25 family lasso peptide, partial [Escherichia marmotae]
Number of features: 0
Seq('IFHLLKEDYINKKSASQLTKGGEVHVPEYFAGIGTPISFCG')
ID: WP_328101564.1
Name: WP_328101564.1
Description: WP_328101564.1 acinetodin/klebsidin/J25 family lasso peptide, partial [Escherichia marmotae]
Number of features: 0
Seq('KKSASQLTKGGEVHVPEYFAGIGTPISFCG')
... and so on
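
The other suggestion above, using a more specific keyword, can be sketched by restricting where the keywords are allowed to match. Untagged terms are searched across all fields, so a record can match because the phrase appears somewhere in its annotation rather than in the protein's own title. The sketch below only builds the query string and has not been run against the live database. The function name build_amp_query and its parameters are hypothetical, and the [Title] field tag is an assumption that should be checked against the field list for the protein database (for example with Entrez.einfo).

```python
def build_amp_query(phrases, min_len=7, max_len=50, organism=None, field="Title"):
    """Build an Entrez protein query that matches each phrase in one field only.

    Hypothetical helper: the name, the parameters and the default "Title"
    field are illustrative assumptions, not part of Biopython.
    """
    if not phrases:
        raise ValueError("at least one search phrase is required")

    # Quote each phrase and tag it with the field, so it is searched as an
    # exact phrase in that field instead of across all fields.
    tagged = [f'"{phrase}"[{field}]' for phrase in phrases]

    # Parenthesise the OR group so the grouping is explicit and does not
    # depend on the order in which Entrez evaluates Boolean operators.
    parts = [
        "(" + " OR ".join(tagged) + ")",
        "srcdb_refseq[PROP]",            # RefSeq records only, as in the question
        f"{min_len}:{max_len}[SLEN]",    # sequence length range, as in the question
    ]
    if organism:
        parts.append(f"{organism}[ORGN]")  # optional organism filter

    return " AND ".join(parts)


query = build_amp_query(
    ["antimicrobial peptide", "antimicrobial peptides"],
    organism="bacteria",
)
print(query)
```

The resulting string can be passed as the term argument of Entrez.esearch in scrape_fasta_sequence, in place of the f-string query.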