How does Hadoop run in "real-time" against non-stale data?

141 Views Asked by smeeb At 28 July 2025 at 21:56

My abysmally rudimentary understanding of Hadoop and its "data ingest" tools (such as Flume or Sqoop) is that Hadoop must always run its MapReduce (MR) jobs against data that is stored in structured files on HDFS, and that these tools (again, Flume, Sqoop, etc.) are responsible for importing data from disparate systems (RDBMS, NoSQL, etc.) into HDFS.

To me, this means that Hadoop will always be running on "stale" (for lack of a better word) data that is minutes or hours old, because importing big data from these disparate systems into HDFS takes time. By the time MR can even run, the data is stale and may no longer be relevant.

Say we have an app with a real-time constraint: it must make a decision within 500ms of something occurring. Say we have a massive stream of data being imported into HDFS, and because the data is so big it takes, say, 3 seconds just to get the data into HDFS. Then say the MR job responsible for making the decision takes 200ms. Because loading the data takes so long, we've already blown our time constraint (3 seconds of loading alone exceeds the 500ms budget), even though the MR job processing the data would be able to finish inside the given window.

Is there a solution for this kind of big data problem?

There is 1 best solution below

Abhijeet Dhumal On 25 June 2015 at 19:23 BEST ANSWER

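You can use a stream-processing tool for this, such as the Apache Spark Streaming API or Apache Storm. Both are built for real-time stream processing: they process records as they arrive, rather than waiting for the data to be loaded into HDFS first.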
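To make the idea concrete, here is a minimal sketch in plain Python of the per-event model that stream processors follow. It is not the Spark Streaming or Storm API; the function name `sliding_window_alerts`, its parameters, and the sample timestamps are all made up for illustration. Each event is handled the moment it arrives, so a decision never waits on a bulk load.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def sliding_window_alerts(
    timestamps_ms: Iterable[int],
    window_ms: int = 500,
    threshold: int = 3,
) -> Iterator[Tuple[int, int]]:
    """Yield (timestamp, count) whenever at least `threshold` events
    fall inside the trailing `window_ms` window.

    Hypothetical helper: it assumes timestamps arrive in ascending order.
    """
    window = deque()  # timestamps still inside the trailing window
    for ts in timestamps_ms:
        window.append(ts)
        # Evict events that have aged out of the window.
        while window and window[0] <= ts - window_ms:
            window.popleft()
        # The decision is made per event, as soon as it arrives.
        if len(window) >= threshold:
            yield ts, len(window)


# Illustrative event times in milliseconds (made-up data).
events = [0, 100, 200, 900, 1000, 1050, 1100]
alerts = list(sliding_window_alerts(events))
print(alerts)
```

In a real deployment the input would be an unbounded source (for example a message queue) instead of a list, and the framework would handle distribution and fault tolerance. The per-event structure of the logic stays the same.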
