What is the StreamSets architecture?

1.1k Views Asked by Aman Raturi At 28 June 2025 at 03:12

I am still unclear about the architecture even after going through the tutorials. How do we scale StreamSets in a distributed environment? If the input data velocity from the origin increases, how do we ensure that SDC doesn't run into performance issues? How many daemons will be running? Is it a master-worker architecture or a peer-to-peer architecture?

If multiple daemons run on multiple machines (e.g. one SDC alongside one NodeManager in YARN), how does it show a centralized view of the data, i.e. total record count etc.?

Also, please let me know the architecture of Dataflow Performance Manager. Which daemons are part of this product?


There is 1 best solution below

metadaddy On 08 December 2017 at 19:24

StreamSets Data Collector (SDC) scales by partitioning the input data. In some cases, this can be done automatically. For example, Cluster Batch mode runs SDC as a MapReduce job on the Hadoop / MapR cluster to read Hadoop FS / MapR FS data, while Cluster Streaming mode leverages Kafka partitions and executes SDC as a Spark Streaming application, running as many pipeline instances as there are Kafka partitions.

In other cases, StreamSets can scale by multithreading - for example, the HTTP Server and JDBC Multitable Consumer origins run multiple pipeline instances in separate threads.
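To make the "one pipeline instance per partition, each in its own thread" idea concrete, here is a minimal conceptual sketch in Python. It does not use any StreamSets API: `run_pipeline_instance`, `run_partitioned`, and the sample `partitions` data are hypothetical names invented for illustration, and the uppercase call merely stands in for real pipeline stages.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline_instance(partition_id, records):
    """Hypothetical pipeline instance: processes one partition's records
    and returns (partition_id, number of records processed)."""
    processed = 0
    for record in records:
        _ = record.upper()  # stand-in for the real processing stages
        processed += 1
    return partition_id, processed


def run_partitioned(partitions):
    """Run one pipeline instance per partition, each in its own thread."""
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        futures = [
            pool.submit(run_pipeline_instance, pid, recs)
            for pid, recs in partitions.items()
        ]
        # Collect the per-partition record counts once every instance finishes.
        return dict(f.result() for f in futures)


# Assumed sample input: three partitions of unequal size.
partitions = {0: ["a", "b", "c"], 1: ["d", "e"], 2: ["f"]}
counts = run_partitioned(partitions)
print(counts)
```

The point of the sketch is that throughput grows with the number of partitions (or threads), because no instance needs to coordinate with another while processing its own slice of the input.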

In all cases, Dataflow Performance Manager (DPM) can give you a centralized view of the data, including total record count.
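As a rough illustration of what a centralized view means when several instances run independently, the following Python sketch has each instance report its record count to a single aggregator that sums them. This is only a conceptual model, not DPM's actual design or API: `MetricsAggregator`, `report`, `total_records`, and the instance IDs are all hypothetical.

```python
import threading
from collections import defaultdict


class MetricsAggregator:
    """Hypothetical central collector for per-instance record counts."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = defaultdict(int)

    def report(self, instance_id, record_count):
        """Called by an instance to add to its running record count."""
        with self._lock:
            self._counts[instance_id] += record_count

    def per_instance(self):
        """Snapshot of the counts reported by each instance."""
        with self._lock:
            return dict(self._counts)

    def total_records(self):
        """Total record count across all reporting instances."""
        with self._lock:
            return sum(self._counts.values())


# Assumed example: two instances report counts, one of them twice.
aggregator = MetricsAggregator()
aggregator.report("sdc-1", 100)
aggregator.report("sdc-2", 250)
aggregator.report("sdc-1", 50)
print(aggregator.per_instance(), aggregator.total_records())
```

The instances themselves stay independent; only the small metrics messages flow to one place, which is why a central total does not become a bottleneck for the data path.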
