I am new to PySpark RDDs and regular expressions, and I tried a simple example that reads a text file and then splits the text on every run of non-word characters:
import re

rdd = spark.sparkContext.textFile("essay.txt")
rdd2 = rdd.flatMap(lambda x: re.split(r'\W+', x))
The result is nearly what I want, except that there are '' records in the output. (Edit: I found that the '' records come from the line breaks: re.split returns an empty string when the string being split begins or ends with a delimiter, and textFile() hands each line to the lambda separately.)
For example, I have these lines in the essay.txt file:
data = [(1001, 'Introduction to Programming', 9, 50),
(1002, 'Introduction to Algorithms', 9, 50),
(1003, 'Introduction to Data Structures', 9, 40),
(2002, 'Operating Systems', 12, 40),
(2051, 'Advanced Data Structures', 6, 40),
(3048, 'Networking and Cloud Computing', 12, 60)]
I got these results in rdd2:
['data', '1001', 'Introduction', 'to', 'Programming', '9', '50', '', '', '1002', 'Introduction', 'to', 'Algorithms', '9', '50', '', '', '1003', 'Introduction', 'to', 'Data', 'Structures', '9', '40', '', '', '2002', 'Operating', 'Systems', '12', '40', '', '', '2051', 'Advanced', 'Data', 'Structures', '6', '40', '', '', '3048', 'Networking', 'and', 'Cloud', 'Computing', '12', '60', '']
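A quick check in plain Python (outside Spark) seems to confirm the line-break theory: splitting one of the lines above produces a leading and a trailing '' because the line starts with ( and ends with ,:

import re

# One line from essay.txt, as textFile() would deliver it (no trailing newline):
line = "(1002, 'Introduction to Algorithms', 9, 50),"
print(re.split(r'\W+', line))
# ['', '1002', 'Introduction', 'to', 'Algorithms', '9', '50', '']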
I believe there are two choices: I could remove the '' records from rdd2 after the fact, but I'd prefer changing the regex so the split never produces them in the first place (the + in \W+ already treats consecutive non-word characters as one delimiter, so the problem must be the line boundaries). Is the latter possible?
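For reference, the first choice would presumably look something like this (filtering the empty strings out after the split); I'd still rather solve it in the regex itself:

# Workaround sketch: drop the '' records produced at line boundaries
rdd2 = rdd.flatMap(lambda x: re.split(r'\W+', x)) \
          .filter(lambda w: w != '')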