In PySpark, how do I get word frequency in a column when a row can contain multiple words?


Assume a two-column PySpark DataFrame with 3 rows:

["Number"]     [ "Keywords"}

1              Mary had a little lamb

2              A little lamb is white

3              Mary is little

Desired output:

little 3

Mary   2

lamb   2

is     2

a      2

had    1

white  1

Tried "explode" and "split", but could not get the syntax right.


2 Answers

Pravash Panigrahi (Best Answer)

You can try the code below:

from pyspark.sql import functions as F
from pyspark.sql.functions import explode, split

# split each sentence on spaces and explode into one word per row
df = df.withColumn("Keyword", explode(split(F.col("Keywords"), " ")))

# lowercase the words so "A" and "a" are counted together, then count each word
keyword_counts = df.withColumn("Keyword", F.lower(F.col("Keyword"))).groupBy("Keyword").count()

# sort by frequency, highest first
keyword_counts = keyword_counts.orderBy(F.col("count").desc())
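
Calling show on the result should reproduce the counts from the question (a sketch of the expected output; the words are lowercased, so "Mary" becomes "mary", and rows with the same count may come back in any order):

keyword_counts.show()

#+-------+-----+
#|Keyword|count|
#+-------+-----+
#| little|    3|
#|   mary|    2|
#|   lamb|    2|
#|     is|    2|
#|      a|    2|
#|    had|    1|
#|  white|    1|
#+-------+-----+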
notNull

Try the explode and groupBy functions.

Example:

from pyspark.sql.functions import *

# single-row example; "spark" is the active SparkSession
df = spark.createDataFrame([(1, 'Mary had a little lamb')],['Number','Keywords'])

# split the sentence into words, explode to one word per row, then count each word
df.selectExpr("explode(split(keywords,' '))").\
  groupBy("col").agg(count("*")).\
  show()

#+------+--------+
#|   col|count(1)|
#+------+--------+
#|  Mary|       1|
#|   had|       1|
#|  lamb|       1|
#|     a|       1|
#|little|       1|
#+------+--------+
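
Applied to all three rows from the question, the same pattern works once the words are lowercased and the result is sorted by count. A sketch, assuming spark is the active SparkSession and the column names from the question:

df = spark.createDataFrame([(1, 'Mary had a little lamb'),
                            (2, 'A little lamb is white'),
                            (3, 'Mary is little')], ['Number', 'Keywords'])

# lowercase, split on spaces, explode to one word per row, then count and sort
df.select(explode(split(lower(col('Keywords')), ' ')).alias('word')).\
  groupBy('word').count().\
  orderBy(desc('count')).\
  show()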