I have a dataset (synthetic_feature_file) in CSV format with over fifty thousand features, all of which are processed (derived) features rather than original ones. There are 43 samples in this file.
I want to investigate whether I can find two new features from these over fifty thousand features that can separate the samples of Sarcopenia from those without Sarcopenia. I will use these two new features as the x and y axes and visualize them using a scatter plot.
I hope the result can resemble the image below, where the red samples form one cluster and the blue samples form another, with no overlap between them. For example:
(Image source: https://medium.com/ai-academy-taiwan/clustering-%E5%88%86%E7%BE%A4%E6%87%B6%E4%BA%BA%E5%8C%85-9c0bb861a3ba)
Below is the code I have written. How should I modify it?
(I'm not sure how to select two features, so I've been running it repeatedly and examining the results each time. It is very inefficient.)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Read synthesized feature data
syn_data = pd.read_csv(synthetic_feature_file)
# labeled sample(people who suffer from Sarcopenia)
sample_indices = [1, 6, 7, 11, 14, 15, 27]
# Randomly pick two features as x and y axes
x_feature = np.random.choice(syn_data.columns[0:10000])
y_feature = np.random.choice(syn_data.columns[10001:20000])
# Clean feature names and remove illegal characters
x_feature = x_feature.strip().replace('\t', '')
y_feature = y_feature.strip().replace('\t', '')
plt.figure(figsize=(8, 6))
# Other samples(people who did not suffer from Sarcopenia)
other_samples = syn_data.drop(sample_indices)
plt.scatter(other_samples[x_feature], other_samples[y_feature], color='blue', label='Other Samples')
# Red sample
red_samples = syn_data.iloc[sample_indices]
plt.scatter(red_samples[x_feature], red_samples[y_feature], color='red', label='Sample Indices')
plt.xlabel(x_feature)
plt.ylabel(y_feature)
plt.title("Visualization")
plt.legend()
plt.show()
This can be done in this way:
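A minimal sketch of this, assuming both features are drawn in a single call to np.random.choice over all columns. The small syn_data frame and its feat_0 ... feat_5 column names are made-up stand-ins for the real CSV, not part of the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_csv(synthetic_feature_file):
# 43 samples and 6 made-up feature columns.
rng = np.random.default_rng(0)
syn_data = pd.DataFrame(
    rng.normal(size=(43, 6)),
    columns=[f"feat_{i}" for i in range(6)],
)

# Draw two distinct column names in one call.
# Use the names exactly as returned; do not strip or rewrite them.
x_feature, y_feature = np.random.choice(syn_data.columns, size=2, replace=False)

# The unmodified names can be used directly to select the two columns.
pair = syn_data[[x_feature, y_feature]]
```

Because the names are never altered, syn_data[x_feature] and syn_data[y_feature] always resolve to existing columns.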
Here, size=2 tells np.random.choice to draw 2 items, and replace=False means drawing without replacement, so the same column cannot be picked twice.
As I mentioned in the comments, if you want to access columns by name, you should not rename or clean the names, because the cleaned names differ from the original column names and the lookup will fail.
The root of the inefficiency, though, is where you drop rows and create a copy. This can be done using boolean masking instead.
Selecting other samples:
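A possible sketch, assuming the default integer row index that pd.read_csv produces, so the entries of sample_indices are also row labels. The syn_data frame here is again a made-up stand-in for the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 43 samples, 4 made-up features.
rng = np.random.default_rng(0)
syn_data = pd.DataFrame(
    rng.normal(size=(43, 4)),
    columns=[f"feat_{i}" for i in range(4)],
)

# Labeled samples (people who suffer from Sarcopenia), as in the question.
sample_indices = [1, 6, 7, 11, 14, 15, 27]

# Boolean mask: True for rows whose index label is in sample_indices.
is_sarcopenia = syn_data.index.isin(sample_indices)

# Invert the mask to keep every row that is NOT a labeled sample.
other_samples = syn_data[~is_sarcopenia]
```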
Selecting red samples:
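The same mask, used without inversion, picks out the labeled rows. This is a sketch under the same assumptions: a default integer index and a made-up syn_data frame standing in for the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 43 samples, 4 made-up features.
rng = np.random.default_rng(0)
syn_data = pd.DataFrame(
    rng.normal(size=(43, 4)),
    columns=[f"feat_{i}" for i in range(4)],
)

# Labeled samples (people who suffer from Sarcopenia), as in the question.
sample_indices = [1, 6, 7, 11, 14, 15, 27]

# Boolean mask: True for rows whose index label is in sample_indices.
is_sarcopenia = syn_data.index.isin(sample_indices)

# Keep only the labeled rows.
red_samples = syn_data[is_sarcopenia]
```

Building the mask once and reusing it for both groups keeps the two selections consistent with each other.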