I am new to PySpark RDDs and regular expressions, and I tried a simple example that reads a text file and then splits the text on every run of non-word characters:
import re

rdd = spark.sparkContext.textFile("essay.txt")
rdd2 = rdd.flatMap(lambda x: re.split(r'\W+', x))
The result is nearly what I want, except that there are '' records in the output. (Edit: I found that the '' records come from the line breaks: re.split returns an empty string when the string being split begins or ends with a delimiter, and textFile() hands each line to the lambda separately.)
For example, I have these lines in the essay.txt file:
data = [(1001, 'Introduction to Programming', 9, 50),
(1002, 'Introduction to Algorithms', 9, 50),
(1003, 'Introduction to Data Structures', 9, 40),
(2002, 'Operating Systems', 12, 40),
(2051, 'Advanced Data Structures', 6, 40),
(3048, 'Networking and Cloud Computing', 12, 60)]
I got these results in rdd2:
['data', '1001', 'Introduction', 'to', 'Programming', '9', '50', '', '', '1002', 'Introduction', 'to', 'Algorithms', '9', '50', '', '', '1003', 'Introduction', 'to', 'Data', 'Structures', '9', '40', '', '', '2002', 'Operating', 'Systems', '12', '40', '', '', '2051', 'Advanced', 'Data', 'Structures', '6', '40', '', '', '3048', 'Networking', 'and', 'Cloud', 'Computing', '12', '60', '']
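A quick check in plain Python (outside Spark) seems to confirm the line-break theory: splitting one of the lines above produces a leading and a trailing '' because the line starts with ( and ends with ,:

import re

# One line from essay.txt, as textFile() would deliver it (no trailing newline):
line = "(1002, 'Introduction to Algorithms', 9, 50),"
print(re.split(r'\W+', line))
# ['', '1002', 'Introduction', 'to', 'Algorithms', '9', '50', '']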
I believe there are two choices: I could remove the '' records from rdd2 after the fact, but I'd prefer changing the regex so the split never produces them in the first place (the + in \W+ already treats consecutive non-word characters as one delimiter, so the problem must be the line boundaries). Is the latter possible?
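For reference, the first choice would presumably look something like this (filtering the empty strings out after the split); I'd still rather solve it in the regex itself:

# Workaround sketch: drop the '' records produced at line boundaries
rdd2 = rdd.flatMap(lambda x: re.split(r'\W+', x)) \
          .filter(lambda w: w != '')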