Active learning with modAL: invalid shapes

I am trying to implement active learning in Python. My classification problem currently takes Word2vec vector representations and feeds them into a Random Forest.

I have a small initial training set, and I would like to use the modAL package to grow it through active learning.

Here is what I've tried so far:

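import numpy as np
from sklearn.ensemble import RandomForestClassifier
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling

learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    query_strategy=uncertainty_sampling,
    X_training=X_train0, y_training=y_train
)

test = test.reset_index()
for i in range(20):
    # Ask the learner which unlabelled example it is least certain about
    query_idx, query_instance = learner.query(X_test0)
    y_new = input('Classify:')
    y_new = np.array([y_new])
    # This is the call that raises the ValueError shown further down
    learner.teach(np.array(X_test0)[query_idx].reshape(-1, 1), y_new)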

Here X_test0 is a pandas DataFrame of shape 1056 x 100 (i.e. 1056 examples with 100 features each, which are the Word2vec representations). I treat it as unlabelled for now so that I can check performance later. Similarly, y_train is a pandas DataFrame containing the binary labels (0 or 1) for the training data.

My issue is that I want modAL to understand that each example has multiple features, so there is exactly one label per 100-dimensional vector. With the code above, the following error appears:

ValueError: Found input variables with inconsistent numbers of samples: [100, 1]

It seems to me that it does not understand that those 100 features correspond to a single label.
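For reference, here is a minimal NumPy-only sketch of the shapes involved. It uses synthetic data in place of X_test0, and the names X_pool, as_column and as_row are placeholders introduced for illustration. It also assumes that the query returns an index array of length 1. It suggests that reshape(-1, 1) turns the single 100-feature row into 100 one-feature rows, which would match the [100, 1] in the error message:

```python
import numpy as np

# Synthetic stand-in for X_test0: 1056 examples, 100 features each.
rng = np.random.default_rng(0)
X_pool = rng.normal(size=(1056, 100))

query_idx = np.array([3])        # assumed: an index array of length 1
y_new = np.array([1])            # one label for the queried example

row = X_pool[query_idx]          # shape (1, 100): one sample, 100 features
as_column = row.reshape(-1, 1)   # shape (100, 1): 100 "samples", 1 feature
as_row = row.reshape(1, -1)      # shape (1, 100): 1 sample, 100 features

print(row.shape, as_column.shape, as_row.shape, y_new.shape)
```

In this sketch only as_row (or row itself) has the same number of samples as y_new.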

Any clue on how to solve it?

EDIT: I thought the problem might be the reshape call. Since teach seems to expect an array as input, I also tried changing the last line as follows:

learner.teach(X_test0.iloc[query_idx].values, np.array(y_new))

which now produces the following error:

TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid

Removing .values, so that a DataFrame is passed instead, also produces an error:

TypeError: <class 'pandas.core.series.Series'> datatype is not supported
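The first TypeError looks like what pandas raises when a DataFrame and a NumPy array are concatenated together. Whether modAL does exactly this internally is an assumption on my part, not something confirmed from its source. The sketch below uses synthetic data and hypothetical names (X_train_df, X_pool_df) to reproduce that mismatch in isolation, and shows that stacking works once both sides are NumPy arrays:

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins (hypothetical names) for the training and pool data.
X_train_df = pd.DataFrame(np.zeros((5, 100)))
X_pool_df = pd.DataFrame(np.ones((10, 100)))
query_idx = np.array([2])

new_row = X_pool_df.iloc[query_idx].values  # ndarray of shape (1, 100)

# Mixing a DataFrame with an ndarray in pd.concat raises a TypeError.
try:
    pd.concat([X_train_df, new_row])
    mixed_ok = True
except TypeError:
    mixed_ok = False

# With both sides converted to NumPy, the rows stack as expected.
X_train_np = X_train_df.to_numpy()
stacked = np.vstack([X_train_np, new_row])

print(mixed_ok, stacked.shape)
```

If that reading is right, converting X_train0, y_train and X_test0 to NumPy arrays before building the learner might avoid mixing container types, though I have not verified this against modAL itself.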