Spark Streaming - Twitter - Filtering tweet data

I am new to Scala and Spark. I am working on Spark Streaming with Twitter data. I flat-mapped the stream into individual words. Now I need to remove words that start with # or @, and words like RT, from the stream before processing them. I know this should be easy. I wrote a filter for it, but it is not working. Can anyone help with this? My code is:

    val sparkConf = new SparkConf().setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    val stream = TwitterUtils.createStream(ssc, None)
    //val lanFilter = stream.filter(status => status.getLang == "en")
    val RDD1 = stream.flatMap(status => status.getText.split(" "))
    val filterRDD = RDD1.filter(word => (word != word.startsWith("#")))
    filterRDD.print()

Also, the language filter (commented out above) gives an error.

Thank you.

There are 2 best solutions below

Best answer

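Is your lambda expression correct? As written, it compares the word (a String) with the result of startsWith (a Boolean). The two are never equal, so the predicate is always true and nothing is filtered out. I think you want: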

val filterRDD = RDD1.filter(word => !word.startsWith("#"))
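The question also asks about dropping @mentions and the RT marker. The same idea extends to a small predicate. Below is a minimal sketch on plain Scala collections, so it can be tried without Spark or Twitter credentials. The names TweetWordFilter, keepWord and cleanWords are made up for illustration, and splitting on whitespace runs instead of a single space is my own assumption about what is wanted.

```scala
object TweetWordFilter {
  // Hypothetical helper: true for words that should be kept.
  // Drops empty tokens, hashtags, mentions and the retweet marker "RT".
  def keepWord(word: String): Boolean =
    word.nonEmpty &&
      !word.startsWith("#") &&
      !word.startsWith("@") &&
      word != "RT"

  // Hypothetical helper: split on runs of whitespace, then apply the predicate.
  def cleanWords(text: String): Seq[String] =
    text.split("\\s+").toSeq.filter(keepWord)

  def main(args: Array[String]): Unit = {
    val sample = "RT @someone: loving #spark streaming today"
    // prints: loving streaming today
    println(cleanWords(sample).mkString(" "))
  }
}
```

On the stream from the question, this would presumably be used as stream.flatMap(status => TweetWordFilter.cleanWords(status.getText)), since flatMap on a DStream accepts a function that returns a collection.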

Another answer

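You can use the built-in keyword filter support. Note that, as far as I know, these keywords are sent to Twitter as track terms, so they select tweets that contain the words rather than removing words from the text: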

TwitterUtils.createStream(ssc, None, Array("filter", "these", "words")) 

But if you want to fix your code:

RDD1.filter(!_.startsWith("#"))

Regarding language, see this question.