I am trying to reduce the number of columns of a data set


I am trying to reduce the number of columns of the matrix X with shape (20000, 8000), but my code reduces the rows instead, producing a new dataset X_5000 with shape (5000, 8000). Kindly let me know where I am making the mistake.

Currently I have: X, a matrix of shape (20000, 8000)
Required: X_5000, a matrix of shape (20000, 5000)

I am using a decision tree model and used feature_importances_ to reduce the number of features.

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()

clf.fit(X, y)

class_prob_sorted = (-clf.feature_importances_).argsort()              

top_5000_index= class_prob_sorted[:5000]    


X_5000=X.tocsr()[top_5000_index]

Actual: print(X_5000.shape) gives (5000, 8000)

Expected: print(X_5000.shape) gives (20000, 5000)


There are 1 best solutions below

sin tribu On

Sorry if I misunderstood your question, but I am still confused. You are fitting your model to your initial X, finding the most important features using clf.feature_importances_ (which is a 1D array hence the error messages), and then trying to reduce X to only those features? If so:

clf.fit(X, y)

#map each column index to its importance score
important = clf.feature_importances_
important_dict = dict( enumerate( important ) )

#sort the dict in reverse order to get list of indices of the most important columns
top_5000_index = sorted( important_dict, key=important_dict.get, reverse=True )[0:5000]

#keep every row and select only the most important columns
#y stays unchanged, because no samples (rows) are dropped
reduced_X = X[:, top_5000_index]
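To illustrate why the shape in the question came out as (5000, 8000), here is a minimal sketch on a small dense NumPy array. The array size and the importance scores are made up for illustration; they stand in for the real X and for clf.feature_importances_. A single index selects rows, while a leading `:` keeps all rows and selects columns.

```python
import numpy as np

# Small stand-in for the (20000, 8000) matrix: 6 samples, 4 features.
X = np.arange(24).reshape(6, 4)

# Hypothetical importance scores, one per column (feature).
importances = np.array([0.1, 0.4, 0.0, 0.5])

# Indices of the 2 most important features, highest score first.
top_idx = np.argsort(-importances)[:2]

rows_selected = X[top_idx]     # a single index selects ROWS
cols_selected = X[:, top_idx]  # ':' keeps all rows, the index selects COLUMNS

print(top_idx)              # [3 1]
print(rows_selected.shape)  # (2, 4)
print(cols_selected.shape)  # (6, 2)
```

Because no rows are removed by the column selection, the label vector y can be reused unchanged.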

Then the only remaining question is: why 5000 features? Maybe you should set an importance threshold and keep only the features above it.
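A threshold-based selection could look roughly like the sketch below. The synthetic data from make_classification and the choice of the mean importance as the threshold are assumptions for illustration only; any cutoff that suits your data could be used instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 samples, 20 features, 3 of them informative.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=3, random_state=0
)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
importances = clf.feature_importances_  # 1D array, one score per column

# Assumed cutoff: keep features scoring above the mean importance.
threshold = importances.mean()
mask = importances > threshold  # boolean mask over columns

# Keep all rows, select only the columns where the mask is True.
X_reduced = X[:, mask]

print(X.shape, "->", X_reduced.shape)
```

With this approach the number of retained features follows from the data rather than being fixed at 5000 in advance.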

As for X.tocsr(), it didn't seem to fit into the question: from my very brief reading, it converts a sparse matrix to compressed sparse row (CSR) format rather than reducing it. But my apologies again if I misread your question a second time.
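If X really is a SciPy sparse matrix, the same column selection should work after converting to CSR. The sketch below uses a tiny made-up matrix and an arbitrary list of column indices to show that tocsr() changes only the storage format, not the contents or the shape.

```python
import numpy as np
from scipy import sparse

# Tiny made-up matrix: 3 samples, 4 features.
dense = np.array([[0, 1, 0, 2],
                  [3, 0, 0, 0],
                  [0, 0, 4, 5]])

X = sparse.coo_matrix(dense)  # some sparse format other than CSR
X_csr = X.tocsr()             # same values, CSR storage

cols = [3, 1]                 # arbitrary column indices for illustration
X_sub = X_csr[:, cols]        # all rows, selected columns

print(X_csr.shape)            # (3, 4)
print(X_sub.shape)            # (3, 2)
print(X_sub.toarray())
```

The result stays sparse, so it can be passed to scikit-learn estimators that accept sparse input without converting to a dense array first.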