I am trying to perform PCA+LDA on the Structural Protein Sequences dataset. The problem is that the cumulative explained variance is at 99-100% from the very first principal component, so only one component is needed to keep 95% of the information.
import numpy as np

# Covariance matrix of the (standardized) training data
X_mean = np.mean(X_train, axis=0)
cov_mat = (X_train - X_mean).T.dot(X_train - X_mean) / (X_train.shape[0] - 1)

# Eigendecomposition; explained variance per component, in descending order
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
total = sum(eig_vals)
exp_var = [(i / total) * 100 for i in sorted(eig_vals, reverse=True)]
# Cumulative explained variance
sum_exp_var = np.cumsum(exp_var)
sum_exp_var
Output
array([ 99.99891234,  99.99947164,  99.99991711,  99.99999955,
        99.99999999, 100.        , 100.        , 100.        ,
       100.        , 100.        , 100.        , 100.        ,
       100.        , 100.        ])
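To rule out a bug in my own eigenvalue computation, I believe the same numbers can be cross-checked with scikit-learn's PCA (a minimal sketch; X_train is the scaled training matrix from above):

import numpy as np
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X_train)  # same standardized matrix as above

# Per-component explained variance (%) and its cumulative sum;
# these should line up with the eigenvalue-based numbers above
print(pca.explained_variance_ratio_ * 100)
print(np.cumsum(pca.explained_variance_ratio_) * 100)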
I am trying to reduce the dimensionality from 15 to maybe 10 features.
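In scikit-learn terms, what I want would look something like this (a sketch of the goal, not what I currently run):

from sklearn.decomposition import PCA

# Keep 10 of the 15 components
X_reduced = PCA(n_components=10).fit_transform(X_train)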
The dataset's categorical features are encoded with OrdinalEncoder() and scaled with StandardScaler(), roughly as sketched below.
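(A minimal, self-contained sketch of that preprocessing; the DataFrame, its column names, and its values are illustrative placeholders, not the actual dataset schema.)

import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Tiny illustrative frame standing in for the real dataset
df = pd.DataFrame({
    "classification": ["HYDROLASE", "TRANSFERASE", "HYDROLASE"],  # categorical (hypothetical)
    "residueCount":   [120, 345, 98],                             # numeric (hypothetical)
})

# Ordinal-encode the categorical column(s), then stack with the numeric ones
X_cat = OrdinalEncoder().fit_transform(df[["classification"]])
X_all = np.hstack([X_cat, df[["residueCount"]].to_numpy()])

# Standardize everything before running PCA
X_train = StandardScaler().fit_transform(X_all)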
Any ideas why every component appears so significant? And are there datasets on which KernelPCA cannot yield useful results?