Can I extract y-values (data labels) from inside a cross-validation pipeline in scikit-learn?


My text classification pipeline has these steps:

  1. Chunking, with a custom transformer, with a few parameters (input: an XML text file; output: a bunch of documents and labels for those documents)
  2. Vectorizing, with TfidfVectorizer (input: a list of documents; output: a DxF matrix where D is the number of docs and F is the number of features)
  3. Sparse-to-Dense Matrix Transformer (input: a sparse matrix; output: a dense matrix)
  4. Dimensionality reduction, with PCA or similar technique (input: DxF matrix, output: DxN matrix, where N is a param: the number of desired components)
  5. Prediction with GaussianMixture (input: a DxN matrix, output: cluster assignments, i.e. groupings of documents)
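
For reference, steps 2-5 can be sketched as a single scikit-learn Pipeline. This is a minimal sketch, not the actual code from my project: the toy documents, the step names ("tfidf", "dense", "pca", "gmm"), the `to_dense` helper, and the parameter values are all assumptions made for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    """Step 3: convert the sparse TF-IDF matrix to a dense array."""
    return X.toarray()


# Steps 2-5 only; the chunker (step 1) is left out here.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),                # step 2: docs -> sparse DxF
    ("dense", FunctionTransformer(to_dense)),    # step 3: sparse -> dense
    ("pca", PCA(n_components=2)),                # step 4: DxF -> DxN
    ("gmm", GaussianMixture(n_components=2, random_state=0)),  # step 5
])

# Toy documents (made up for this sketch), standing in for the chunker output.
docs = [
    "the cat sat on the mat",
    "a cat and a kitten played",
    "cats chase mice in the barn",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
    "the market rallied after the stock sale",
]

# fit_predict is forwarded to the final step, so this returns one
# cluster assignment per document.
cluster_ids = pipeline.fit_predict(docs)
```

Note that the pipeline only ever sees the documents (X); nothing in it produces or changes labels.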

There are so many parameters for each of these steps that it's inefficient to try all the possible parameter combinations manually, so I've been trying to do a cross-validation grid search with GridSearchCV. That can use a scorer to compare the output groupings with the original groupings (labels). (The scorer I'm using is metrics.adjusted_rand_score.)
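
To make the scoring part concrete, here is a small sketch of how the adjusted Rand index can be turned into a scorer. In scikit-learn the function is `metrics.adjusted_rand_score`, and `make_scorer` wraps it so that the scorer calls `predict` on the estimator and compares the result with the supplied labels. The synthetic two-blob data below is an assumption for illustration only.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, make_scorer
from sklearn.mixture import GaussianMixture

# The adjusted Rand index compares groupings, not label names, so
# swapping the cluster numbers still counts as a perfect match.
true_labels = [0, 0, 1, 1]
swapped_ids = [1, 1, 0, 0]
ari_swapped = adjusted_rand_score(true_labels, swapped_ids)

# make_scorer turns the metric into scorer(estimator, X, y), which calls
# estimator.predict(X) and scores the prediction against y.
ari_scorer = make_scorer(adjusted_rand_score)

# Two well-separated synthetic blobs (made up for this sketch).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 2)),
    rng.normal(5.0, 0.1, size=(20, 2)),
])
y = np.array([0] * 20 + [1] * 20)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
ari_gmm = ari_scorer(gmm, X, y)
```

The important point for my problem is the last call: the scorer needs `y` to be passed in from outside, which is exactly what I don't have until the chunker has run.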

If I cut out step 1, the chunker, I can feed the data and the labels into a pipeline starting with step 2, and then run a grid search over all parameters of steps 2-4 to find the best params. The problem is that the chunking settings in step 1 are also parameters that need to be tweaked, so I'd like to keep step 1 in. But I can't get the labels until after I finish step 1, and the grid search needs the labels to do its scoring.

So what I'd like to know is: is there a way to have GridSearchCV get the labels it needs from the first step, instead of being supplied labels ahead of time?
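
One possible workaround, sketched here under assumptions rather than offered as a real answer: as far as I know, standard Pipeline steps transform only X and pass y through unchanged, so the chunker could be kept outside the pipeline and its parameters looped over manually, with a separate grid search (scikit-learn's GridSearchCV) run on steps 2-5 for each chunking setting. The `chunk_sections` function, its `chunk_size` parameter, the synthetic sections, and the grid values are all hypothetical stand-ins for my real XML chunker and data.

```python
import random

from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score, make_scorer
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV, ParameterGrid, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    return X.toarray()


def chunk_sections(sections, chunk_size):
    """Hypothetical chunker: split each (label, text) section into
    documents of chunk_size words, each inheriting the section label."""
    docs, labels = [], []
    for label, text in sections:
        words = text.split()
        for start in range(0, len(words), chunk_size):
            docs.append(" ".join(words[start:start + chunk_size]))
            labels.append(label)
    return docs, labels


# Synthetic input (made up): two sections with disjoint vocabularies.
rng = random.Random(0)
animal_words = "cat dog mouse horse cow sheep goat pig hen duck".split()
finance_words = "stock bond market share price trade fund bank loan rate".split()
sections = [
    ("animals", " ".join(rng.choices(animal_words, k=60))),
    ("finance", " ".join(rng.choices(finance_words, k=60))),
]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("dense", FunctionTransformer(to_dense)),
    ("pca", PCA()),
    ("gmm", GaussianMixture(n_components=2, random_state=0)),
])
inner_grid = {
    "pca__n_components": [2, 3],
    "gmm__covariance_type": ["diag", "spherical"],
}
ari_scorer = make_scorer(adjusted_rand_score)

# Outer loop over chunker parameters; inner grid search over steps 2-5.
results = []
for chunk_params in ParameterGrid({"chunk_size": [6, 12]}):
    docs, labels = chunk_sections(sections, **chunk_params)
    cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
    search = GridSearchCV(pipeline, inner_grid, scoring=ari_scorer, cv=cv)
    search.fit(docs, labels)
    results.append((search.best_score_, chunk_params, search.best_params_))

best_score, best_chunk_params, best_inner_params = max(
    results, key=lambda r: r[0]
)
```

One caveat with this approach: each chunking setting produces a different set of documents, so the scores being compared in the outer loop come from different datasets, and whether that comparison is meaningful is itself an assumption.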

Edit: here's a link to a notebook that illustrates the kind of thing I've been trying. (Non-working first step grid search is commented out.)
