How to use Python to select two features from over 50,000 new features and visualize them in a scatter plot that separates the samples into two groups

Question

67 Views Asked by 董珈妤 At 22 June 2025 at 01:49

I have a dataset (synthetic_feature_file) in CSV format with over fifty thousand features, all of which are derived (processed) features rather than original measurements. The file contains 43 samples.

I want to find out whether two of these fifty-thousand-plus features can separate the samples with Sarcopenia from those without it. I will use those two features as the x and y axes of a scatter plot.

I hope the result can resemble the image below, where the red samples form one cluster and the blue samples form another, with no overlap between them. For example:

[Image: example scatter plot with a red cluster and a blue cluster that do not overlap]

(Image source: https://medium.com/ai-academy-taiwan/clustering-%E5%88%86%E7%BE%A4%E6%87%B6%E4%BA%BA%E5%8C%85-9c0bb861a3ba)

Below is the code I have written. How should I modify it?

(I'm not sure how to select the two features, so I have been running the script repeatedly and inspecting the plot each time, which is very inefficient.)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Read synthesized feature data
syn_data = pd.read_csv(synthetic_feature_file)

# labeled sample(people who suffer from Sarcopenia)
sample_indices = [1, 6, 7, 11, 14, 15, 27]

# Randomly pick two features as x and y axes
x_feature = np.random.choice(syn_data.columns[0:10000])
y_feature = np.random.choice(syn_data.columns[10001:20000])

# Clean feature names and remove illegal characters
x_feature = x_feature.strip().replace('\t', '')
y_feature = y_feature.strip().replace('\t', '')

plt.figure(figsize=(8, 6))

# Other samples(people who did not suffer from Sarcopenia)
other_samples = syn_data.drop(sample_indices)
plt.scatter(other_samples[x_feature], other_samples[y_feature], color='blue', label='Other Samples')

# Red sample
red_samples = syn_data.iloc[sample_indices]
plt.scatter(red_samples[x_feature], red_samples[y_feature], color='red', label='Sample Indices')

plt.xlabel(x_feature)
plt.ylabel(y_feature)
plt.title("Visualization")
plt.legend()
plt.show()

Answer

**Mohsen_Fatemi** · Accepted Answer

Both features can be picked in a single call:

x_feature, y_feature = np.random.choice(syn_data.columns, size=2, replace=False)

Here size=2 draws two column names, and replace=False draws them without replacement, so the same column cannot be chosen for both axes.
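A minimal sketch of that call on a small stand-in table, in case it helps to see it in isolation. The column names f0 to f4 and the toy values are assumptions for illustration, not part of the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for syn_data: 4 samples, 5 features
syn_data = pd.DataFrame(
    np.arange(20).reshape(4, 5),
    columns=[f"f{i}" for i in range(5)],
)

# Draw two different column names in one call
x_feature, y_feature = np.random.choice(syn_data.columns, size=2, replace=False)

# Both names index the DataFrame directly, with no cleaning step
x_values = syn_data[x_feature]
y_values = syn_data[y_feature]
```

Because the draw covers all columns at once, there is no need to split the column range into two slices as in the question.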

As I mentioned in the comments, if you want to use the names to access columns, you should not rename or clean them: a cleaned name no longer matches the original column name, so the lookup fails.
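The following sketch illustrates the mismatch. The headers with stray whitespace are hypothetical; whether the real file has such headers is not stated in the question. If the names really do need cleaning, one option is to clean them on the DataFrame itself so that names and columns stay in sync.

```python
import pandas as pd

# Hypothetical data whose headers carry stray whitespace
df = pd.DataFrame({"feat_a\t": [1.0, 2.0], " feat_b ": [3.0, 4.0]})

# Cleaning only the name string breaks the link to the column
name = df.columns[0]
cleaned = name.strip().replace("\t", "")
lookup_failed = cleaned not in df.columns  # df[cleaned] would raise KeyError

# Cleaning the column labels on the DataFrame keeps everything consistent
df.columns = df.columns.str.strip().str.replace("\t", "", regex=False)
values = df["feat_a"]  # this lookup now succeeds
```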

Another inefficiency is the call to drop, which builds the second group as a separate step. Both groups can be selected from a single boolean mask instead.

sample_indices = syn_data.index.isin([1, 6, 7, 11, 14, 15, 27])

Selecting the other samples:

other_samples = syn_data.iloc[~sample_indices]

Selecting the red samples:

red_samples = syn_data.iloc[sample_indices]
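Putting the mask and the plot together might look like the sketch below. The helper name plot_two_features, the random stand-in data, and the legend labels are assumptions for illustration; the row labels are the ones listed in the question.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_two_features(data, positive_rows, x_feature, y_feature):
    """Scatter two columns, drawing the rows in positive_rows in red."""
    mask = data.index.isin(positive_rows)  # True for the labeled samples
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.scatter(data.loc[~mask, x_feature], data.loc[~mask, y_feature],
               color="blue", label="No Sarcopenia")
    ax.scatter(data.loc[mask, x_feature], data.loc[mask, y_feature],
               color="red", label="Sarcopenia")
    ax.set_xlabel(x_feature)
    ax.set_ylabel(y_feature)
    ax.legend()
    return ax


# Hypothetical stand-in for the real 43-sample file
rng = np.random.default_rng(0)
syn_data = pd.DataFrame(rng.normal(size=(43, 6)),
                        columns=[f"f{i}" for i in range(6)])

ax = plot_two_features(syn_data, [1, 6, 7, 11, 14, 15, 27], "f0", "f1")
```

Calling plt.show() afterwards displays the figure. Wrapping the plot in a function makes it easy to call again with a different pair of column names.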
