I am really new to PySpark and am trying to translate some python code into pyspark. I start with a panda, convert to a document - term matrix and then apply PCA.
The UDF:
class MultiLabelCounter():
def __init__(self, classes=None):
self.classes_ = classes
def fit(self,y):
self.classes_ =
sorted(set(itertools.chain.from_iterable(y)))
self.mapping = dict(zip(self.classes_,
range(len(self.classes_))))
return self
def transform(self,y):
yt = []
for labels in y:
data = [0]*len(self.classes_)
for label in labels:
data[self.mapping[label]] +=1
yt.append(data)
return yt
def fit_transform(self,y):
return self.fit(y).transform(y)
mlb = MultiLabelCounter()
df_grouped =
df_grouped.withColumnRenamed("collect_list(full)","full")
udf_mlb = udf(lambda x: mlb.fit_transform(x),IntegerType())
mlb_fitted = df_grouped.withColumn('full',udf_mlb(col("full")))
I am of course getting NULL results.
I am using spark 2.4.4 version.
EDIT
Adding sample input and output as per request
Input:
|id|val|
|--|---|
|1|[hello,world]|
|2|[goodbye, world]|
|3|[hello,hello]|
Output:
|id|hello|goodbye|world|
|--|-----|-------|-----|
|1|1|0|1|
|2|0|1|1|
|3|2|0|0|
Based upon input data shared, I tried replicating your output and it works. Please see below -
Input Data
Now, using
explode
to create separate rows out ofvals
list items. Thereafter, usingpivot
andcount
will calculate the frequency. Finally, replacingnull
values with0
usingfillna(0)
. See below -Output