I have set up a bagging classifier in PySpark, in which a binary classifier trains on the positive samples plus an equal number of randomly sampled unlabeled samples (labeled 1 for positive and 0 for unlabeled). The model then predicts the out-of-bag samples, the process repeats, and I now plan to take the average prediction per sample.
My question concerns the model output in PySpark: the `probability` column is a vector of per-class probabilities. For example, the output for binary classification looks like:
model.transform(test_data).show()
+-----+--------------------+
|label|         probability|
+-----+--------------------+
|    0|      [0.294, 0.706]|
|    1|        [0.65, 0.35]|
+-----+--------------------+
To perform positive-unlabeled learning from a binary classifier with this output, do I need to drop the probability predicted for the negative class and use only the model's probability that each unlabeled sample is positive?
Yes. For each unlabeled sample, the second element of the probability vector is the model's estimated probability that the point is positive; the first element is redundant, since the two entries sum to 1. Keep only the positive-class probability and average it across iterations.
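A minimal sketch of that aggregation step in plain Python, assuming you have already collected each sample's out-of-bag probability vectors from the bagging iterations (the `oob_probs` structure below is hypothetical). Within PySpark itself, the same idea could be expressed by extracting index 1 of the vector (e.g. via `pyspark.ml.functions.vector_to_array`, available in Spark 3.0+) and averaging with a `groupBy(...).agg(avg(...))`:

```python
# Hypothetical collected results: sample id -> list of probability
# vectors, one per bag in which the sample was out-of-bag. As in
# Spark's binary-classifier output, probability[0] is the
# unlabeled/negative class and probability[1] is the positive class.
oob_probs = {
    "a": [[0.30, 0.70], [0.20, 0.80], [0.25, 0.75]],
    "b": [[0.90, 0.10], [0.85, 0.15]],
}

def average_positive_score(prob_vectors):
    """Keep only the positive-class entry (index 1) and average it."""
    positives = [p[1] for p in prob_vectors]
    return sum(positives) / len(positives)

scores = {sid: average_positive_score(vs) for sid, vs in oob_probs.items()}
print(scores)  # -> {'a': 0.75, 'b': 0.125}
```

The final score per sample is then a single number in [0, 1] that you can threshold or rank to decide which unlabeled samples are likely positives.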