How does Hadoop run in "real-time" against non-stale data?


My abysmally rudimentary understanding of Hadoop and its "data ingest" tools (such as Flume or Sqoop) is that Hadoop must always run its MR jobs against data that is stored in structured files on HDFS, and that these tools (again, Flume, Sqoop, etc.) are responsible for importing data from disparate systems (RDBMS, NoSQL, etc.) into HDFS.

To me, this means that Hadoop will always be running on "stale" (for lack of a better word) data that is minutes or hours old, because importing big data from these disparate systems into HDFS takes time. By the time MR can even run, the data is stale and may no longer be relevant.

Say we have an app with a real-time constraint: it must make a decision within 500 ms of something occurring. Say we have a massive stream of data that is being imported into HDFS, and because the data is so big it takes, say, 3 seconds just to get the data onto HDFS. Then say that the MR job responsible for making the decision takes 200 ms. Because loading the data takes so long, we've already blown our time constraint, even though the MR job processing the data would be able to finish inside the given window.

Is there a solution for this kind of big data problem?


There is 1 best solution below


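Yes. Tools such as the Apache Spark Streaming API and Apache Storm can be used for real-time stream processing. Rather than waiting for the data to land on HDFS and then running a batch MR job over it, they process records as they arrive.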
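
To illustrate the idea only, here is a minimal sketch in plain Python. It is not Spark or Storm code and does not use either API. The event shape `(timestamp, key)`, the function name `decide_per_event`, and the `window_s` and `threshold` parameters are all hypothetical, chosen for the example. The point is that a decision is produced for each record as soon as it arrives, using a small amount of in-memory state, instead of after a bulk load.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Iterator, Tuple

# Hypothetical event shape: (timestamp in seconds, key)
Event = Tuple[float, str]


def decide_per_event(events: Iterable[Event], window_s: float = 1.0,
                     threshold: int = 3) -> Iterator[Tuple[float, str, bool]]:
    """Yield a decision for each event as soon as it arrives.

    The decision is True when the key has been seen at least `threshold`
    times within the last `window_s` seconds (counting this event).
    """
    recent: Dict[str, Deque[float]] = {}
    for ts, key in events:
        times = recent.setdefault(key, deque())
        times.append(ts)
        # Evict timestamps that have fallen out of the sliding window.
        while times and times[0] <= ts - window_s:
            times.popleft()
        yield ts, key, len(times) >= threshold


# A tiny in-memory "stream"; a real system would read from a queue or socket.
sample = [(0.0, "a"), (0.2, "a"), (0.3, "b"), (0.5, "a"), (2.0, "a")]
decisions = list(decide_per_event(sample))
for ts, key, flagged in decisions:
    print(ts, key, flagged)
```

Because each decision depends only on the current record and a bounded window of recent ones, the latency per decision is independent of how long it would take to load the full data set into HDFS. Spark Streaming and Storm apply the same principle, but distribute the work across a cluster.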