We are analyzing the streaming twitter data to find users who are posting similar (almost same) tweets over and over. I am using MATCH_RECOGNIZE for this. It is able to find the pattern, but I am not able to get the FIRST() and the LAST() values correctly. Here is sample dataset:
I am using the following Query:
SELECT
USERID
, NUM_OF_TWEETS
, FIRST_TWEET
, LAST_TWEET
, FIRST_TWEET_ID
, LAST_TWEET_ID
FROM SCRATCH.SAQIB_ALI.TWEETS
MATCH_RECOGNIZE(
PARTITION BY USERID
ORDER BY TWEETID ASC
MEASURES
FIRST(TWEET) AS FIRST_TWEET,
LAST(TWEET) AS LAST_TWEET,
FIRST(TWEETID) AS FIRST_TWEET_ID,
LAST(TWEETID) AS LAST_TWEET_ID,
COUNT(*) AS NUM_OF_TWEETS
ONE ROW PER MATCH
PATTERN (SIMILAR+)
DEFINE
SIMILAR AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
);
This correct identify the users that are posting same tweets over an over:

But I am not able to get the first tweet and the last tweet in the matching sequence.

There are multiple things at play.
The first is you only "have one row trigging a match" so first and last are the second row of you data. This can be seen by changing to
ALL ROWS PER MATCHif you change to say a match that catches the first value and the lag values:
you now match both the first and latter rows..
now if we expand our test a little bit more with four rows of data:
we see those values are not double registering..
BUT we also see the first ID is correct for all rows, but the last is within the scope of the current matched row, so not after all matches as you are hoping.
If we flip back to
one row per matchwe do how ever get the results we are expecting.