StringIndexer, CountVectorizer in PySpark on a single row


Hi, I'm faced with a problem where each row has two columns, each containing an array of words.

column1, column2
["a", "b" ,"b", "c"], ["a","b", "x", "y"]

Basically, I want to count the occurrences of each word across the two columns, ending up with two arrays:

[1, 2, 1, 0, 0], 
[1, 1, 0, 1, 1]

So "a" appears once in each array, "b" appears twice in column1 and once in column2, "c" only appears in column1, "x" and "y" only in column2. So on and so forth.

I've looked at the CountVectorizer function from the ml library, but I'm not sure whether it works row-wise, and the arrays in each column can be very large. Also, zero values (where a word appears in one column but not the other) don't seem to get carried through.
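(For reference, CountVectorizer builds its vocabulary over the whole corpus rather than per row, and absent words show up as implicit zeros in its sparse output vectors. A minimal sketch of fitting one shared model on both columns, assuming an active SparkSession named spark; note the resulting vectors are vocabulary-length, not the per-row length asked for here:)

from pyspark.ml.feature import CountVectorizer
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(["a", "b", "b", "c"], ["a", "b", "x", "y"])],
    ["column1", "column2"])

# Fit a single model on both columns stacked together, so the
# vocabulary (and therefore the vector positions) is shared.
corpus = df.select(F.col("column1").alias("words")) \
    .union(df.select(F.col("column2").alias("words")))
model = CountVectorizer(inputCol="words", outputCol="vec").fit(corpus)

# Words absent from a row appear as implicit zeros in the sparse vector.
vec1 = model.transform(df.select(F.col("column1").alias("words")))
vec2 = model.transform(df.select(F.col("column2").alias("words")))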

Any help appreciated.

There is 1 answer below.

Best answer:

For Spark 2.4+, you can do that using the DataFrame API and built-in array functions.

First, get the distinct union of words for each row using the array_union function. Then, use the transform function on that words array: for each element, calculate the number of occurrences in each column using the size and array_remove functions:

from pyspark.sql.functions import array_union, expr

df = spark.createDataFrame([(["a", "b", "b", "c"], ["a", "b", "x", "y"])], ["column1", "column2"])

# words = distinct union of both arrays; it fixes the positions of the
# count vectors. count(x in col) = size(col) - size(array_remove(col, x))
df.withColumn("words", array_union("column1", "column2")) \
  .withColumn("occ_column1",
              expr("transform(words, x -> size(column1) - size(array_remove(column1, x)))")) \
  .withColumn("occ_column2",
              expr("transform(words, x -> size(column2) - size(array_remove(column2, x)))")) \
  .drop("words") \
  .show(truncate=False)

Output:

+------------+------------+---------------+---------------+
|column1     |column2     |occ_column1    |occ_column2    |
+------------+------------+---------------+---------------+
|[a, b, b, c]|[a, b, x, y]|[1, 2, 1, 0, 0]|[1, 1, 0, 1, 1]|
+------------+------------+---------------+---------------+
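As a side note, on Spark 3.1+ the same logic can be written with the Python higher-order-function API instead of SQL expression strings. A sketch under that version assumption, reusing the df defined above (occurrences is just an illustrative helper name):

from pyspark.sql import functions as F

# For each word x in the union, count the matching elements of the
# source array using F.filter + F.size (both Spark 3.1+).
def occurrences(words_col, source_col):
    return F.transform(
        words_col,
        lambda x: F.size(F.filter(source_col, lambda w: w == x)))

df.withColumn("words", F.array_union("column1", "column2")) \
  .withColumn("occ_column1", occurrences(F.col("words"), F.col("column1"))) \
  .withColumn("occ_column2", occurrences(F.col("words"), F.col("column2"))) \
  .drop("words") \
  .show(truncate=False)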