Clustering for the Protein sequences (With/without MSA)

Question

Clustering for the Protein sequences (With/without MSA)

77 Views Asked by shivam At 08 February 2024 at 08:00

I have NGS data (Unique clones only) and I want to group them based on the similarity (clustering is preferable) using Python language. Please have a look into the below sample sequences. Also to isolate the most distant clone as encircled in the image.

Note: I have already asked similar to this question before but that was for DNA of same length so MSA wasn't required but in below sample I want to explore if without MSA can we perform clustering and find the more distant clones.

NGS Sample Data is as below

>3971 
AVTLDESGGGLQTPGGALSLVCKASGFTFSDRGMGWVRQAPGKGLEFVACIENDGSWTAYGSAVKGRATISRDNGQSTVRLQLNNLRAEDTATYYCAKSAGGSLLLTVVILTVGSIDAWGHGT 
>3186 
AVTLGESGGGLQTPGGALSLVCKASGFTFSSHGMAWVRQAPGKGLEFVAGIGNTGSNPNYGAAVKGRATISRDNRQSTVRLQLNNLRAEDTGTYFCAKRAYAASWSGSDRIDAWGHGT 
>3066 
AVTLGESGGGLQTPGGGLSLVCKASGFTFSSFNMFWVRQAPGKGLEYVAGIDNTGSYTAYGAAVKGRATISRDNGQSTLRLQLNNLRAEDTATYYCAKSFDGRYYHSIDGVDAIDAWGHGT 
>3719 
AVTLGESGGGLQTPGGTVSLVCKGSGFDFSSYNMQWVRQAPGKGLEFIAQINGAGSGTNYGPAVKGRATISRDNGQSTVRLQLNNLRAEDTAIYYCAKSYDGRYYHSIDGVDAIDAWGHGT
>127 
AVTLGESGGGLQTPGGALSLVCKGSGFTLSSFNMGWVRQAPGKGLEWVGVISSSGRYTEYGAAVKGRAAISRDDGQSTVRLQLNNLRAEDTAIYFCAKGIGTAYCGSCVGEIDTWGHGT 
>144 
AVTLDESGGGLQTPGGGLSLVCKASGFTFSSHGMGWVRQAPGKGLEFVADISGSGSSTNYGPAVKGRATISRDNGQSTVRLQLNDLRAEDTATYYCAKYVGSIGCGSTAGIDAWGHGT 
>291 
AVTLDESGGGLQTPGGALSLVCKASGFTFSDRGMHWVRQAPGKGLEWVAGIGNSGSGTTYGSAVKGRATISRDNGQSTLRLQLNNLRPEDTATYFCARATCIGSGYCGWGTYRIDAWGHGT 
>381 
AVTLDESGGGLQTPGGALSLVCKASGFTFSRFNMFWVRQAPGKGLEWVAAISSGSSTWYGSAVKGRATISRDNGQSTVRLQLNNLRAEDTGTYYCTKAAGNGYRGWTTYIAGWIDAWGHGT

                       **Desired output is as below**

Original Q&A

There are 2 best solutions below

**Umar** · Answer 1 · 2024-02-08T08:20:18.083000

Followinf previous question for DNA asked clustering of a fasta file having DNA sequences to find the most unmatched clone, I have correctd for protein as asked

from Bio import SeqIO
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

# Example protein sequences in a FASTA file
example_fasta = """>144
AVTLGESGGGLQTPGGALSLVCKGSGFTLSSFNMGWVRQAPGKGLEWVGVISSSGRYTEYGAAVKGRAAISRDDGQSTVRLQLNNLRAEDTAIYFCAKGIGTAYCGSCVGEIDTWGHGT
>291
AVTLDESGGGLQTPGGALSLVCKASGFTFSDRGMHWVRQAPGKGLEWVAGIGNSGSGTTYGSAVKGRATISRDNGQSTLRLQLNNLRPEDTATYFCARATCIGSGYCGWGTYRIDAWGHGT
>381
AVTLDESGGGLQTPGGALSLVCKASGFTFSRFNMFWVRQAPGKGLEWVAAISSGSSTWYGSAVKGRATISRDNGQSTVRLQLNNLRAEDTGTYYCTKAAGNGYRGWTTYIAGWIDAWGHGT
>3971
AVTLDESGGGLQTPGGALSLVCKASGFTFSDRGMGWVRQAPGKGLEFVACIENDGSWTAYGSAVKGRATISRDNGQSTVRLQLNNLRAEDTATYYCAKSAGGSLLLTVVILTVGSIDAWGHGT
>3186
AVTLGESGGGLQTPGGALSLVCKASGFTFSSHGMAWVRQAPGKGLEFVAGIGNTGSNPNYGAAVKGRATISRDNRQSTVRLQLNNLRAEDTGTYFCAKRAYAASWSGSDRIDAWGHGT
>3066
AVTLGESGGGLQTPGGGLSLVCKASGFTFSSFNMFWVRQAPGKGLEYVAGIDNTGSYTAYGAAVKGRATISRDNGQSTLRLQLNNLRAEDTATYYCAKSFDGRYYHSIDGVDAIDAWGHGT
>3719
AVTLGESGGGLQTPGGTVSLVCKGSGFDFSSYNMQWVRQAPGKGLEFIAQINGAGSGTNYGPAVKGRATISRDNGQSTVRLQLNNLRAEDTAIYYCAKSYDGRYYHSIDGVDAIDAWGHGT"""

# Save example data to a FASTA file
fasta_file = "example_protein.fasta"
with open(fasta_file, "w") as f:
    f.write(example_fasta)

# Read protein sequences from the FASTA file
sequences = list(SeqIO.parse(fasta_file, "fasta"))
protein_sequences = [str(seq.seq).upper() for seq in sequences]

# Preprocessing and numerical representation of protein sequences
# Replace this with your specific method for encoding protein sequences
def protein_to_num(seq):
    # Example: encoding amino acids as numerical values based on their properties
    amino_acid_properties = {'A': 1, 'R': 2, 'N': 3, 'D': 4, 'C': 5, 'Q': 6, 'E': 7, 'G': 8, 'H': 9, 'I': 10,
                             'L': 11, 'K': 12, 'M': 13, 'F': 14, 'P': 15, 'S': 16, 'T': 17, 'W': 18, 'Y': 19, 'V': 20}
    return np.array([amino_acid_properties.get(aa, 0) for aa in seq])

# Convert protein sequences to numerical representations
protein_data = [protein_to_num(seq) for seq in protein_sequences]

# Find the maximum length of protein sequences
max_len = max(len(seq) for seq in protein_data)

# Pad protein sequences with zeros to make them of equal length
protein_data_padded = [np.pad(seq, (0, max_len - len(seq)), 'constant') for seq in protein_data]

# Convert the list of padded arrays to a 2D numpy array
protein_data_array = np.array(protein_data_padded)

# Handle missing values (NaNs) by replacing them with zeros
protein_data_array[np.isnan(protein_data_array)] = 0

# Perform PCA
pca = PCA(n_components=2)
pca_data = pca.fit_transform(protein_data_array)

# Perform KMeans clustering
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(pca_data)

# Visualize the clusters
colors = ['r', 'g', 'b', 'c', 'm']
for i, label in enumerate(set(kmeans.labels_)):
    cluster_points = np.array([pca_data[j] for j, l in enumerate(kmeans.labels_) if l == label])
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1], color=colors[i], label=f'Cluster {label}')

    # Annotate points with sequence identifiers
    for j, seq_label in enumerate(sequences):
        if kmeans.labels_[j] == label:
            plt.annotate(seq_label.id, (pca_data[j, 0], pca_data[j, 1]), textcoords="offset points", xytext=(5,5), ha='right')

plt.legend()
plt.show()

**Paplepel93** · Answer 2 · 2024-02-08T08:21:46.370000

I recognise the image you have provided as a kmeans clustering example. I think that could be a good alternative if you want to avoid MSA. You can read more about it on here

You could for example cluster your data with kmeans based on the Levenshtein distance

Clustering for the Protein sequences (With/without MSA)

There are 2 best solutions below

Related Questions in PYTHON

Related Questions in PYTHON-3.X

Related Questions in BIOPYTHON

Related Questions in HIERARCHICAL-CLUSTERING

Trending Questions

Popular # Hahtags

Popular Questions