In an ordered dataset, I want to aggregate data until a condition is met, but grouped by a certain key.
To set some context to my question I simplify my problem to the below problem statement:
In spark I need to aggregate strings, grouped by key when a user stops "shouting" (the 2nd char in a string is not uppercase).
Dataset example:
ID, text, timestamps
1, "OMG I like bananas", 123
1, "Bananas are the best", 234
1, "MAN I love banana", 1235
2, "ORLY? I'm more into grapes", 123565
2, "BUT I like apples too", 999
2, "unless you count veggies", 9999
2, "THEN don't forget tomatoes", 999999
The expected result would be:
1, "OMG I like bananas Bananas are the best"
2, "ORLY? I'm more into grapes BUT I like apples too unless you count veggies"
via groupby and agg I can't seem to set a condition to "stop when an uppercase char" is found.
This only works in Spark 2.1 or above
What you want to do is possible, but it may be very expensive.
First, let's create some test data. As general advice, when you ask something on Stackoverflow please provide something similar to this so people have somewhere to start.
In order to get a column with the collected texts in order, we need to add a new column using a window function.
Using the spark shell:
To get the actual text we may need a UDF. Here's mine (I'm far from an expert in Scala, so bear with me)
So, we have a dataframe with the collected texts in the proper order, and a Scala function (wrapped as a UDF). Let's piece it together:
I think this is the result you want.