Clustering for the Protein sequences (With/without MSA)

77 Views Asked by At

I have NGS data (Unique clones only) and I want to group them based on the similarity (clustering is preferable) using Python language. Please have a look into the below sample sequences. Also to isolate the most distant clone as encircled in the image.

Note: I have already asked similar to this question before but that was for DNA of same length so MSA wasn't required but in below sample I want to explore if without MSA can we perform clustering and find the more distant clones.

NGS Sample Data is as below

>3971 
AVTLDESGGGLQTPGGALSLVCKASGFTFSDRGMGWVRQAPGKGLEFVACIENDGSWTAYGSAVKGRATISRDNGQSTVRLQLNNLRAEDTATYYCAKSAGGSLLLTVVILTVGSIDAWGHGT 
>3186 
AVTLGESGGGLQTPGGALSLVCKASGFTFSSHGMAWVRQAPGKGLEFVAGIGNTGSNPNYGAAVKGRATISRDNRQSTVRLQLNNLRAEDTGTYFCAKRAYAASWSGSDRIDAWGHGT 
>3066 
AVTLGESGGGLQTPGGGLSLVCKASGFTFSSFNMFWVRQAPGKGLEYVAGIDNTGSYTAYGAAVKGRATISRDNGQSTLRLQLNNLRAEDTATYYCAKSFDGRYYHSIDGVDAIDAWGHGT 
>3719 
AVTLGESGGGLQTPGGTVSLVCKGSGFDFSSYNMQWVRQAPGKGLEFIAQINGAGSGTNYGPAVKGRATISRDNGQSTVRLQLNNLRAEDTAIYYCAKSYDGRYYHSIDGVDAIDAWGHGT
>127 
AVTLGESGGGLQTPGGALSLVCKGSGFTLSSFNMGWVRQAPGKGLEWVGVISSSGRYTEYGAAVKGRAAISRDDGQSTVRLQLNNLRAEDTAIYFCAKGIGTAYCGSCVGEIDTWGHGT 
>144 
AVTLDESGGGLQTPGGGLSLVCKASGFTFSSHGMGWVRQAPGKGLEFVADISGSGSSTNYGPAVKGRATISRDNGQSTVRLQLNDLRAEDTATYYCAKYVGSIGCGSTAGIDAWGHGT 
>291 
AVTLDESGGGLQTPGGALSLVCKASGFTFSDRGMHWVRQAPGKGLEWVAGIGNSGSGTTYGSAVKGRATISRDNGQSTLRLQLNNLRPEDTATYFCARATCIGSGYCGWGTYRIDAWGHGT 
>381 
AVTLDESGGGLQTPGGALSLVCKASGFTFSRFNMFWVRQAPGKGLEWVAAISSGSSTWYGSAVKGRATISRDNGQSTVRLQLNNLRAEDTGTYYCTKAAGNGYRGWTTYIAGWIDAWGHGT

                       **Desired output is as below**

enter image description here

2

There are 2 best solutions below

2
Umar On

Followinf previous question for DNA asked clustering of a fasta file having DNA sequences to find the most unmatched clone, I have correctd for protein as asked

from Bio import SeqIO
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

# Example protein sequences in a FASTA file
example_fasta = """>144
AVTLGESGGGLQTPGGALSLVCKGSGFTLSSFNMGWVRQAPGKGLEWVGVISSSGRYTEYGAAVKGRAAISRDDGQSTVRLQLNNLRAEDTAIYFCAKGIGTAYCGSCVGEIDTWGHGT
>291
AVTLDESGGGLQTPGGALSLVCKASGFTFSDRGMHWVRQAPGKGLEWVAGIGNSGSGTTYGSAVKGRATISRDNGQSTLRLQLNNLRPEDTATYFCARATCIGSGYCGWGTYRIDAWGHGT
>381
AVTLDESGGGLQTPGGALSLVCKASGFTFSRFNMFWVRQAPGKGLEWVAAISSGSSTWYGSAVKGRATISRDNGQSTVRLQLNNLRAEDTGTYYCTKAAGNGYRGWTTYIAGWIDAWGHGT
>3971
AVTLDESGGGLQTPGGALSLVCKASGFTFSDRGMGWVRQAPGKGLEFVACIENDGSWTAYGSAVKGRATISRDNGQSTVRLQLNNLRAEDTATYYCAKSAGGSLLLTVVILTVGSIDAWGHGT
>3186
AVTLGESGGGLQTPGGALSLVCKASGFTFSSHGMAWVRQAPGKGLEFVAGIGNTGSNPNYGAAVKGRATISRDNRQSTVRLQLNNLRAEDTGTYFCAKRAYAASWSGSDRIDAWGHGT
>3066
AVTLGESGGGLQTPGGGLSLVCKASGFTFSSFNMFWVRQAPGKGLEYVAGIDNTGSYTAYGAAVKGRATISRDNGQSTLRLQLNNLRAEDTATYYCAKSFDGRYYHSIDGVDAIDAWGHGT
>3719
AVTLGESGGGLQTPGGTVSLVCKGSGFDFSSYNMQWVRQAPGKGLEFIAQINGAGSGTNYGPAVKGRATISRDNGQSTVRLQLNNLRAEDTAIYYCAKSYDGRYYHSIDGVDAIDAWGHGT"""

# Save example data to a FASTA file
fasta_file = "example_protein.fasta"
with open(fasta_file, "w") as f:
    f.write(example_fasta)

# Read protein sequences from the FASTA file
sequences = list(SeqIO.parse(fasta_file, "fasta"))
protein_sequences = [str(seq.seq).upper() for seq in sequences]

# Preprocessing and numerical representation of protein sequences
# Replace this with your specific method for encoding protein sequences
def protein_to_num(seq):
    # Example: encoding amino acids as numerical values based on their properties
    amino_acid_properties = {'A': 1, 'R': 2, 'N': 3, 'D': 4, 'C': 5, 'Q': 6, 'E': 7, 'G': 8, 'H': 9, 'I': 10,
                             'L': 11, 'K': 12, 'M': 13, 'F': 14, 'P': 15, 'S': 16, 'T': 17, 'W': 18, 'Y': 19, 'V': 20}
    return np.array([amino_acid_properties.get(aa, 0) for aa in seq])

# Convert protein sequences to numerical representations
protein_data = [protein_to_num(seq) for seq in protein_sequences]

# Find the maximum length of protein sequences
max_len = max(len(seq) for seq in protein_data)

# Pad protein sequences with zeros to make them of equal length
protein_data_padded = [np.pad(seq, (0, max_len - len(seq)), 'constant') for seq in protein_data]

# Convert the list of padded arrays to a 2D numpy array
protein_data_array = np.array(protein_data_padded)

# Handle missing values (NaNs) by replacing them with zeros
protein_data_array[np.isnan(protein_data_array)] = 0

# Perform PCA
pca = PCA(n_components=2)
pca_data = pca.fit_transform(protein_data_array)

# Perform KMeans clustering
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(pca_data)

# Visualize the clusters
colors = ['r', 'g', 'b', 'c', 'm']
for i, label in enumerate(set(kmeans.labels_)):
    cluster_points = np.array([pca_data[j] for j, l in enumerate(kmeans.labels_) if l == label])
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1], color=colors[i], label=f'Cluster {label}')

    # Annotate points with sequence identifiers
    for j, seq_label in enumerate(sequences):
        if kmeans.labels_[j] == label:
            plt.annotate(seq_label.id, (pca_data[j, 0], pca_data[j, 1]), textcoords="offset points", xytext=(5,5), ha='right')

plt.legend()
plt.show()
1
Paplepel93 On

I recognise the image you have provided as a kmeans clustering example. I think that could be a good alternative if you want to avoid MSA. You can read more about it on here

You could for example cluster your data with kmeans based on the Levenshtein distance