I was trying to reduce the number of columns of the matrix X, which has shape (20000, 8000), but I ended up reducing the rows instead: the new dataset X_5000 has shape (5000, 8000). Kindly let me know where I am making the mistake. Current: X, a matrix of shape (20000, 8000). Required: X_5000, a matrix of shape (20000, 5000). I am using a decision tree model and used feature_importances_ to reduce the number of features.
clf = DecisionTreeClassifier()
clf.fit(X, y)
class_prob_sorted = (-clf.feature_importances_).argsort()
top_5000_index= class_prob_sorted[:5000]
X_5000=X.tocsr()[top_5000_index]
Actual: print(X_5000.shape) gives (5000, 8000)
Expected: print(X_5000.shape) gives (20000, 5000)
Sorry if I misunderstood your question, but I am still confused. You are fitting your model to your initial X, finding the most important features using clf.feature_importances_ (which is a 1D array, hence the error messages), and then trying to reduce X to only those features? If so, index the columns of X with those positions rather than the rows. Then the only remaining question is: why 5000 features? Maybe you should set an importance threshold and keep the features above it.
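To make the column selection concrete, here is a minimal sketch on a small synthetic dataset. The data from make_classification, the sizes (200 x 80, top 50) and the threshold value are all assumptions for illustration, not your real setup. The point is only that the importance indices go in the second position of the index, so all rows are kept.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Small stand-in for the real data (assumption: 200 samples, 80 features).
X, y = make_classification(n_samples=200, n_features=80,
                           n_informative=10, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# feature_importances_ is 1D, with one entry per COLUMN of X.
importances = clf.feature_importances_

# Column indices sorted from most to least important.
order = np.argsort(-importances)
top_k = 50  # illustrative; plays the role of 5000 in the question
top_idx = order[:top_k]

# Index the second axis: keep every row, only the chosen columns.
X_top = X[:, top_idx]
print(X_top.shape)

# Alternative: keep the features whose importance exceeds a threshold.
threshold = 0.0  # hypothetical value; tune for your data
mask = importances > threshold
X_thr = X[:, mask]
print(X_thr.shape)
```

With a single decision tree, many importances are often exactly zero, so a fixed top-k can pull in features the tree never used; that is one reason a threshold may be the more natural choice.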
As to X.tocsr(), it didn't seem to fit into the question, as I got the impression from my very brief reading that it is for sparse matrices. But my apologies again if I misread your question a second time.
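In case X really is a SciPy sparse matrix, the following sketch shows how the two indexing forms differ. The toy matrix and the column indices are made up for illustration; the assumption is only that X supports tocsr(), as in the question.

```python
import numpy as np
from scipy import sparse

# Toy sparse matrix standing in for X (assumption: 6 rows, 4 columns).
X = sparse.random(6, 4, density=0.5, format="coo", random_state=0)

# Hypothetical feature indices, playing the role of top_5000_index.
cols = np.array([2, 0])

X_csr = X.tocsr()

# A single index array selects ROWS, which is what the question's code does.
rows_selected = X_csr[cols]
print(rows_selected.shape)

# Adding the leading ':' selects COLUMNS and keeps every row.
cols_selected = X_csr[:, cols]
print(cols_selected.shape)
```

So X.tocsr()[:, top_5000_index] should give the shape with all rows and only the selected columns. If column slicing turns out to be slow on a large matrix, converting with tocsc() first may help, since that format stores the data by column.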