Lambda Architecture Modelling Issue


I am considering implementing a Lambda Architecture in order to process events transmitted by multiple devices. In most cases (averages etc.) it seems to fit my requirements. However, I am stuck trying to model a specific use case. In short:

Each device has a device_id. Every device emits one event per second. Each event has an event_id ranging from 0 to 10.

An event_id of 0 indicates START, and an event_id of 10 indicates END.

All the events between START and END should be grouped into a single group (event_group). This produces tuples of event_groups, e.g. (0,2,2,2,5,10), (0,4,2,7,...,5,10), (0,10). An event_group might be short (say 10 minutes) or very long (say 3 hours).

According to the Lambda Architecture, the events transmitted by every device are my "Master Data Set". Currently, the events are sent to both HDFS and Storm via Kafka (Camus, Kafka Spout).

In the streaming process I group by device_id and use Redis to maintain a set of incoming events in memory, under a key that is generated each time an event_id=0 arrives. The problem lies in HDFS. Say I save a file with all incoming events every hour. Is there a way to distinguish these event_groups?
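For illustration, here is a minimal sketch (Python, redis-py) of how the speed layer could buffer events per device until END arrives. The dict-shaped event, the key names, and the flush_to_serving_layer stub are assumptions for the example, not my actual Storm topology:

```python
import json
import uuid
import redis

r = redis.StrictRedis(host="localhost", port=6379)

def flush_to_serving_layer(device_id, events):
    # Placeholder for the real-time view update (e.g. an HBase write).
    print(device_id, [e["event_id"] for e in events])

def handle_event(event):
    """Buffer events per device in Redis until END (event_id=10) arrives."""
    device_id = event["device_id"]
    current_key = f"current_group:{device_id}"

    if event["event_id"] == 0:
        # START: open a new event_group under a fresh key.
        group_key = f"event_group:{device_id}:{uuid.uuid4()}"
        r.set(current_key, group_key)
    else:
        group_key = r.get(current_key)
        if group_key is None:
            return  # event arrived outside an open group; drop or dead-letter it
        group_key = group_key.decode()

    # Append the event to the open group (in arrival order).
    r.rpush(group_key, json.dumps(event))

    if event["event_id"] == 10:
        # END: the group is complete and can be flushed downstream.
        events = [json.loads(e) for e in r.lrange(group_key, 0, -1)]
        flush_to_serving_layer(device_id, events)
        r.delete(group_key)
        r.delete(current_key)
```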

Using Hive I can group tuples in the same manner. However, each file will also contain "broken" event_groups:

  • (0,2,2,3) previous computation (file)
  • (4,3) previous computation (file)
  • (5,6,7,8,10) current computation (file)

so I need to merge them, based on device_id, into (0,2,2,3,4,3,5,6,7,8,10) across multiple files.
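To make the merge concrete, a rough Python sketch of the reducer-side logic once the batch job has access to all events in its window, regardless of which hourly file they landed in: group by device_id, sort by timestamp, and cut groups at the START/END markers. The field names (device_id, timestamp, event_id) are assumptions about the event schema; the real job would be MapReduce or Hive, this only shows the logic:

```python
from collections import defaultdict

def rebuild_event_groups(events):
    """Rebuild complete event_groups per device from ALL events in the
    batch window (possibly read from several hourly files)."""
    per_device = defaultdict(list)
    for e in events:
        per_device[e["device_id"]].append(e)

    groups = []
    for device_id, dev_events in per_device.items():
        dev_events.sort(key=lambda e: e["timestamp"])  # restore emission order
        current = None
        for e in dev_events:
            if e["event_id"] == 0:            # START opens a new group
                current = [e["event_id"]]
            elif current is not None:
                current.append(e["event_id"])
                if e["event_id"] == 10:       # END closes the group
                    groups.append((device_id, current))
                    current = None
        # Groups still open at the end of the window remain incomplete and
        # would be picked up by a later batch run with a wider window.
    return groups
```

With this logic, (0,2,2,3) from one hourly file and (4,3,5,6,7,8,10) from the next are reassembled into a single group, because the job sees both files at once.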

Is a Lambda Architecture a good fit for this scenario? Or should the streaming process be the only source of truth, i.e. write to HBase and HDFS directly from the streaming layer? And won't that affect the overall latency?


There is 1 best solution below.


As far as I understand your process, I don't see any issue: the principle of the Lambda Architecture is to regularly re-process all your data in batch mode (by the way, not necessarily all your data, but a time frame, usually larger than the speed-layer window).

If you choose a large enough time window for your batch mode (say, your aggregation window + 3 hours, in order to include even the longest event groups), your MapReduce job will be able to compute all your event groups for the desired aggregation window, whichever files the individual events are stored in (Hadoop shuffle magic!).
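To illustrate the window arithmetic, a small Python sketch. The 3-hour bound and the timestamp field are assumptions taken from your question, the end_timestamp attribute on a rebuilt group is hypothetical, and attributing a group to the window containing its END event is just one convention for making sure each group is counted by exactly one batch run:

```python
from datetime import timedelta

# Assumption from the question: no event_group lasts longer than 3 hours.
MAX_GROUP_DURATION = timedelta(hours=3)

def batch_window(aggregation_start, aggregation_end):
    """Widen the read window so a group that started up to 3 hours before
    the aggregation window is still seen in full by this batch run."""
    return aggregation_start - MAX_GROUP_DURATION, aggregation_end

def select_events(all_events, aggregation_start, aggregation_end):
    """Select every raw event the batch job needs, regardless of which
    hourly HDFS file it was written to."""
    start, end = batch_window(aggregation_start, aggregation_end)
    return [e for e in all_events if start <= e["timestamp"] < end]

def assign_groups_to_window(groups, aggregation_start, aggregation_end):
    """Attribute each rebuilt group to the window containing its END event,
    so no group is counted twice across successive batch runs."""
    return [g for g in groups
            if aggregation_start <= g["end_timestamp"] < aggregation_end]
```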

The underlying files are not the problem; the time window used to select the data to process is what matters.