How to perform positive unlabeled learning using a binary classifier?

I have set up a bagging classifier in PySpark, in which a binary classifier trains on the positive samples plus an equal number of randomly sampled unlabeled samples (labeled 1 for positive and 0 for unlabeled). The model then predicts on the out-of-bag samples, and this process repeats, so I plan to take the average prediction per sample. A sketch of the loop is below.
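
For concreteness, here is a minimal sketch of that loop in PySpark; the DataFrame names (positives, unlabeled), the id column, and the choice of RandomForestClassifier are all assumptions for illustration:

from pyspark.sql import functions as F
from pyspark.ml.classification import RandomForestClassifier

# Assumes DataFrames `positives` and `unlabeled`, each with an `id` column
# and a `features` vector column (names and classifier are illustrative).
T = 10  # number of bagging rounds
n_pos = positives.count()

oob_preds = []
for t in range(T):
    # sample roughly as many unlabeled rows as there are positives
    frac = min(1.0, n_pos / unlabeled.count())
    bag = unlabeled.sample(fraction=frac, seed=t)

    train = (positives.withColumn("label", F.lit(1.0))
             .unionByName(bag.withColumn("label", F.lit(0.0))))

    model = RandomForestClassifier(labelCol="label").fit(train)

    # predict only the out-of-bag unlabeled samples
    oob = unlabeled.join(bag.select("id"), on="id", how="left_anti")
    oob_preds.append(model.transform(oob).select("id", "probability"))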

My question concerns the model output: in PySpark the prediction is a probability column holding a vector with one probability per class. For binary classification the output looks like this:

model.transform(test_data).show()
+-----+--------------------+
|label|         probability|
+-----+--------------------+
|    0|[0.294, 0.706]      |
|    1|[0.65, 0.35]        |
+-----+--------------------+

To perform positive unlabeled learning with a binary classifier that outputs this, do I need to drop the probability predicted for the negative class and use only the model's estimate that each unlabeled sample is positive?

1 Answer

Yes. The probability you get for each unlabeled sample is the model's estimate that the point is positive, so keep only the second element of the probability vector. Then take the average of that score across the bagging iterations.
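
A minimal sketch of that extraction and averaging, assuming Spark 3.0+ (for pyspark.ml.functions.vector_to_array), an id column on the unlabeled data, and a list oob_preds holding each round's out-of-bag predictions (all assumptions, not part of the question):

from pyspark.sql import functions as F
from pyspark.ml.functions import vector_to_array

# oob_preds: list of per-round DataFrames with `id` and `probability`
# columns (hypothetical name for whatever holds each bag's predictions)
scored = [df.select("id",
                    vector_to_array(F.col("probability"))[1].alias("p_pos"))
          for df in oob_preds]

all_scores = scored[0]
for s in scored[1:]:
    all_scores = all_scores.unionByName(s)

# average P(positive) per unlabeled sample across bagging rounds
pu_scores = all_scores.groupBy("id").agg(F.avg("p_pos").alias("pu_score"))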