What happens when calling Apache Crunch pipeline read twice on two different sources?

Asked by Lulu Li on 24 May 2018 at 06:49

When making the following call:

    PCollection<KeyValue> data1 = pipeline.read(source1);
    PCollection<KeyValue> data2 = pipeline.read(source2);
    PCollection<KeyValue> data3 = data1.union(data2);

According to the Apache Crunch documentation for read, is the same pipeline used to read from both sources, with the data then combined by the union?


dbustosp On 24 May 2018 at 12:36

An Apache Crunch Pipeline can read as many sources as you want. You can then transform the data however you wish: for example, by taking unions of PCollections, or by passing the data through a DoFn or MapFn to compose document objects with MapReduce, among many other options.

One thing to keep in mind is that Apache Crunch, like Apache Spark, uses a lazy execution model: no data transformation is triggered until you execute an action. Here is a short excerpt from the Crunch documentation that describes it.

Crunch uses a lazy execution model. No jobs are run or outputs created until the user explicitly invokes one of the methods on the Pipeline interface that controls job planning and execution. The simplest of these methods is the PipelineResult run() method, which analyzes the current graph of PCollections and Target outputs and comes up with a plan to ensure that each of the outputs is created and then executes it, returning only when the jobs are completed. The PipelineResult returned by the run method contains information about what was run, including the number of jobs that were executed during the pipeline run and the values of the Hadoop Counters for each of those stages via the StageResult component classes.
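To make the lazy model in the excerpt above more concrete, here is a minimal sketch in plain Java. It is a toy model, not the Crunch API: `ToyPipeline`, `LazyCollection`, and the `sourceReads` counter are hypothetical names invented for illustration, and in-memory lists stand in for real sources. It only shows the general idea that read and union build up a plan, and that nothing is scanned until an action is invoked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Toy model only: these classes are hypothetical and are NOT the Crunch API.
public class LazyPipelineSketch {

    /** A deferred collection: holds a recipe for producing data, not the data itself. */
    static final class LazyCollection<T> {
        private final Supplier<List<T>> recipe;

        LazyCollection(Supplier<List<T>> recipe) {
            this.recipe = recipe;
        }

        /** Builds a new deferred collection that concatenates both inputs when evaluated. */
        LazyCollection<T> union(LazyCollection<T> other) {
            return new LazyCollection<>(() -> {
                List<T> out = new ArrayList<>(recipe.get());
                out.addAll(other.recipe.get());
                return out;
            });
        }
    }

    /** A single pipeline that can register any number of sources. */
    static final class ToyPipeline {
        int sourceReads = 0; // counts how many times a source was actually scanned

        /** Registers a source; nothing is read until run() is called. */
        <T> LazyCollection<T> read(List<T> source) {
            return new LazyCollection<>(() -> {
                sourceReads++;
                return source;
            });
        }

        /** The "action": evaluates the plan and returns the materialized result. */
        <T> List<T> run(LazyCollection<T> target) {
            return target.recipe.get();
        }
    }

    public static void main(String[] args) {
        ToyPipeline pipeline = new ToyPipeline();
        LazyCollection<String> data1 = pipeline.read(List.of("a", "b"));
        LazyCollection<String> data2 = pipeline.read(List.of("c"));
        LazyCollection<String> data3 = data1.union(data2);

        System.out.println("reads before run: " + pipeline.sourceReads); // prints 0
        List<String> result = pipeline.run(data3);
        System.out.println("reads after run: " + pipeline.sourceReads);  // prints 2
        System.out.println("result: " + result);                         // prints [a, b, c]
    }
}
```

In this sketch, `read` and `union` play roughly the role of `pipeline.read(...)` and `PCollection.union(...)` from the question, and `run` stands in for an action such as `Pipeline.run()` or `Pipeline.done()`. A real Crunch pipeline plans MapReduce jobs rather than evaluating suppliers, so treat this as an analogy only.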

To answer your question: yes, the same pipeline reads both sources, and union then combines the two PCollections into one.

Side note: You will probably want to have only one pipeline for your data transformation.
