In PySpark, how do I get word frequency in a column when a row can contain multiple words?


Assume a two-column PySpark DataFrame with 3 rows:

["Number"]     [ "Keywords"}

1              Mary had a little lamb

2              A little lamb is white

3              Mary is little

Desired output:

little 3

Mary   2

lamb   2

is     2

a      2

had    1

white  1

Tried "explode" and "split", but could not get the syntax right.


2 Answers

Pravash Panigrahi (Best Answer)

You can try the code below:

from pyspark.sql import functions as F
from pyspark.sql.functions import explode, split

# split each sentence on spaces and explode into one word per row
df = df.withColumn("Keyword", explode(split(F.col("Keywords"), " ")))

# lowercase the words so "A" and "a" are counted together, then count each word
keyword_counts = df.withColumn("Keyword", F.lower(F.col("Keyword"))).groupBy("Keyword").count()

# sort by frequency, highest first
keyword_counts = keyword_counts.orderBy(F.col("count").desc())
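
Calling show on the result should reproduce the counts from the question (a sketch of the expected output; the words are lowercased, so "Mary" becomes "mary", and rows with the same count may come back in any order):

keyword_counts.show()

#+-------+-----+
#|Keyword|count|
#+-------+-----+
#| little|    3|
#|   mary|    2|
#|   lamb|    2|
#|     is|    2|
#|      a|    2|
#|    had|    1|
#|  white|    1|
#+-------+-----+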
notNull

Try the explode and groupBy functions.

Example:

from pyspark.sql.functions import *

# single-row example; "spark" is the active SparkSession
df = spark.createDataFrame([(1, 'Mary had a little lamb')],['Number','Keywords'])

# split the sentence into words, explode to one word per row, then count each word
df.selectExpr("explode(split(keywords,' '))").\
  groupBy("col").agg(count("*")).\
  show()

#+------+--------+
#|   col|count(1)|
#+------+--------+
#|  Mary|       1|
#|   had|       1|
#|  lamb|       1|
#|     a|       1|
#|little|       1|
#+------+--------+
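
Applied to all three rows from the question, the same pattern works once the words are lowercased and the result is sorted by count. A sketch, assuming spark is the active SparkSession and the column names from the question:

df = spark.createDataFrame([(1, 'Mary had a little lamb'),
                            (2, 'A little lamb is white'),
                            (3, 'Mary is little')], ['Number', 'Keywords'])

# lowercase, split on spaces, explode to one word per row, then count and sort
df.select(explode(split(lower(col('Keywords')), ' ')).alias('word')).\
  groupBy('word').count().\
  orderBy(desc('count')).\
  show()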