Is there any different between two types of union in spark streaming

1k Views Asked by At

Dstream provide two types of union :

StreamingContext.union(Dstreams)

Dstream.union(anotherDstream)

So I want to know what is the different, especially in parallelism performance.

2

There are 2 best solutions below

0
On BEST ANSWER

Looking at the source code of the two operations, there is no difference other than one taking a single DStream as input and the other a list.

StreamingContext:

def union[T: ClassTag](streams: Seq[DStream[T]]): DStream[T] = withScope {
  new UnionDStream[T](streams.toArray)
}

Dstream:

def union(that: DStream[T]): DStream[T] = ssc.withScope {
  new UnionDStream[T](Array(this, that))
}

Hence, which one you use depends on your preference, there is no performance gains to be had. When you have a list of streams to unite, the method in StreamingConext simplifies the code a bit, hence, it could be preferable in this case.

0
On

Your claim "DStream provide two types of union" is not quite right.

The ref mentions differnt signatures, and more specifically different classes that provide the union operation.

StreamingContext.union(*dstreams)

Create a unified DStream from multiple DStreams of the same type and same slide duration.

DStream.union(other)

Return a new DStream by unifying data of another DStream with this DStream. Parameters: other – Another DStream having the same interval (i.e., slideDuration) as this DStream.

The later is discussed in the Spark User List: "The union function simply returns a DStream with the elements from both. This is the same behavior as when we call union on RDDs".


Source code of StreamingContext:

def union(self, *dstreams):
    ...
    first = dstreams[0]
    jrest = [d._jdstream for d in dstreams[1:]]
    return DStream(self._jssc.union(first._jdstream, jrest), self, first._jrdd_deserializer)

Source code of DStream:

def union(self, other):
    return self.transformWith(lambda a, b: a.union(b), other, True)

You can see that the first uses recursion (as expected), while the other uses transformWith, which is defined in the same class and transforms each RDD.


The thing to remember is Level of Parallelism in Data Receiving, where in cases that the data receiving becomes a bottleneck in the system, then consider parallelizing the data receiving process would be a good idea.

As a result, the process of applying the union() method to multiple DStreams is encouraged`, which resulted in providing a method to do this easily, while keeping your code clean. IMHO, there wouldn't be a difference in performance.