I am implementing a Lambda architecture, using Spark for the batch layer and Spark Streaming for the speed layer. So far, I store both the batch views and the real-time views in HBase, but in different tables.
I am stuck on how to merge the batch views generated by the batch layer with the real-time views generated by the speed layer so that they can be queried together. What is the right way to do it? Should I just dump them into the same HBase table and have the client query HBase directly?
First of all, I think that HBase is not the best option for real-time views, as a heavy load of random reads and random writes is not HBase's strongest side.
Anyway, one way to do it is the following:
[...] DataFrame/DataSet
[...] for instance
[...] DataFrame/DataSet too

A very simplified flow for doing that can be found in my GitHub.
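
To illustrate the merge step itself, here is a minimal sketch in plain Python rather than Spark. It is not the flow from the GitHub repository mentioned above. It assumes that both views map a key to an additive metric (a count) and that the real-time view only covers events that arrived after the last batch run. The names `merge_views`, `batch_view` and `realtime_view`, as well as the sample data, are hypothetical.

```python
from collections import Counter


def merge_views(batch_view, realtime_view):
    """Combine a batch view with a real-time view at query time.

    Assumption: both views map a key to a count, and the real-time view
    only holds events that the batch view has not processed yet, so the
    two counts can simply be added.
    """
    merged = Counter(batch_view)
    merged.update(realtime_view)  # adds the counts key by key
    return dict(merged)


# Hypothetical page-view counts standing in for the two HBase tables.
batch_view = {"page_a": 100, "page_b": 40}
realtime_view = {"page_a": 3, "page_c": 7}

merged = merge_views(batch_view, realtime_view)
print(merged)
```

With Spark, the same idea would probably be a union of the two DataFrames followed by a group-by and sum on the key. For metrics that are not additive (distinct counts, for example) a plain sum would not be correct, and the merge function would have to change accordingly.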