Per-user stream processing

262 Views Asked by At

I need to process data from a set of streams, applying the same elaboration to each stream independently from the other streams.

I've already seen frameworks like storm, but it appears that it allows the processing of static streams only (i.e. tweets form twitter), while I need to process data from each user separately.

A simple example of what I mean could be a system where each user can track his gps location and see statistics like average velocity, acceleration, burnt calories and so on in real time. Of course, each user would have his own stream(s) and the system should process the stream of each user separately, as if each user had its own dedicated topology processing his data.

Is there a way to achieve this with a framework like storm, spark streaming or samza?

It would be even better if python is supported, since I already have a lot of code I'd like to reuse.

Thank you very much for your help

3

There are 3 best solutions below

2
On

Using Storm, you can group data using fields-grouping connection pattern if you have a user-id in your tuples. This ensures, that data is partitioned by user-id and thus you get logical substreams. Your code only needs to be able to process multiple groups/substreams, because a single bolt instance gets multiple groups for processing. But Storm supports your use case for sure. It also can run Python code.

0
On

You can use WSO2 Stream Processor to achieve this. You can partition the input stream by user-name and process events pertain to each user separately. The processing logic has to be written in Siddhi QL which is a SQL like language.

WSO2 SP also has a python wrapper to, it will allow you do perform administrative tasks such as submitting, editing jobs. But you can't write processing logic using python code.

1
On

In Samza, similar to Storm, one would partition the individual streams on some user ID. This would guarantee that the same processor would see all the events for some particular user (as well as other user IDs that the partition function [a hash, for instance] assigns to that processor). Your description sounds like something that would more likely run on the client's system rather than being a server-side operation, however.

Non-JVM language support has been proposed for Samza, but not yet implemented.