Why not spark-streaming alone


I don't have much experience with Kafka or Spark Streaming, but I have read many articles on how great the combination is for building real-time systems for analysis/dashboards. Can someone explain to me why Spark Streaming can't do it alone? In other words, why does Kafka sit between the data source and Spark Streaming?

Thanks


There are 4 answers below.

Devan M S:

To process data with Spark, we need to provide the data through one of the data sources that Spark supports (or write our own custom data source).

If it is static data, Spark provides:

  sc.textFile("FILE PATH")              // read a text file
  sc.wholeTextFiles("DIRECTORY PATH")   // read all text files in a directory
  sqlContext.read.parquet("FILE PATH")
  sqlContext.read.json("FILE PATH")

Then apply your logic to the resulting RDD or DataFrame.

In the streaming case, Spark supports data from different sources such as Kafka, Flume, Kinesis, Twitter, ZeroMQ, MQTT, etc.

Spark also supports simple socket streaming:

val lines = ssc.socketTextStream("localhost", 9999)

For more details, see the Spark Streaming programming guide.

Kafka is a high-throughput distributed messaging system. Kafka's distributed design, scalability, and fault tolerance give it an advantage over other messaging systems (MQTT, ZeroMQ, etc.).
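For illustration, a minimal sketch of reading from Kafka with the DStream API and the spark-streaming-kafka-0-10 connector, assuming the ssc StreamingContext from the snippet above; the broker address, group id, and topic name below are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",       // placeholder broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "example-group",        // placeholder consumer group
  "auto.offset.reset"  -> "latest"
)

// Subscribe to a placeholder topic and create a direct stream
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Array("events"), kafkaParams)
)

// Apply your logic on the record values
stream.map(record => record.value).print()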

So the question is: among these data sources, which one is yours? You can replace the Kafka data source with your own; we are using MQTT as our default source, as sketched below.
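For comparison, a rough MQTT sketch using the separate spark-streaming-mqtt connector (now maintained in Apache Bahir), again assuming the ssc from above; the broker URL and topic are placeholders:

import org.apache.spark.streaming.mqtt.MQTTUtils

// Placeholder broker URL and topic for your MQTT setup
val brokerUrl = "tcp://localhost:1883"
val topic     = "sensors/readings"

// Each MQTT message arrives as one line in the DStream
val mqttLines = MQTTUtils.createStream(ssc, brokerUrl, topic)
mqttLines.print()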

onrdncl:

Actually, there is a simple explanation for this question.

Spark Streaming and other streaming engines are designed to read data as it arrives; once a record has been read, they have little ability to keep it around (some can, but not efficiently). That is why a message broker like Kafka is needed to keep the data available for a certain period of time. Other tools can then pick the data up from the broker (Kafka) at any time using a consumer. Dividing responsibilities this way gives you better results.
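For illustration, picking data up from the broker at any time could look roughly like this with the plain Kafka client library; the broker address, group id, and topic name are placeholders:

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")     // placeholder broker
props.put("group.id", "dashboard-reader")            // placeholder consumer group
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("events"))  // placeholder topic

// Kafka retains the messages, so this consumer can start whenever it likes
while (true) {
  val records = consumer.poll(Duration.ofMillis(500))
  records.forEach(r => println(s"${r.offset}: ${r.value}"))
}

Because Kafka retains messages for a configured period, a consumer like this can be started, stopped, and restarted independently of the producer.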

Faisal Ahmed Siddiqui:

Can someone explain to me why spark-streaming can't do it alone?

Spark Streaming is meant for live data, and that data needs to be ingested from somewhere, such as Kafka, Flume, Kinesis, or TCP sockets. You can even read data from files, as sketched below.

https://spark.apache.org/docs/latest/streaming-programming-guide.html
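A minimal sketch of the file-based case, assuming new files are dropped into a watched directory (the path is a placeholder):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("FileStreamExample").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))

// Spark Streaming monitors the directory and picks up new files as they appear
val fileLines = ssc.textFileStream("/data/incoming")  // placeholder directory
fileLines.count().print()

ssc.start()
ssc.awaitTermination()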

If your use case is simple enough to just read from files, I would suggest going with Apache NiFi.

https://www.youtube.com/watch?v=gqV_63a0ABo&list=PLzmB162Wvzr05Pe-lobQEZkk0zXuzms56

In other words, why is Kafka in between the data source and spark-streaming?

Depending on the scenario, Kafka is often a suitable option for storing the data so that it can then be consumed by different downstream systems.

Diaboloxx:

I am new to the field too, and I was searching for exactly the same thing; I found this simple explanation through practice. When using Spark in a streaming context, you usually connect your Spark instance to an already established socket, as in the code below.

And here comes the need for a message broker: something has to open this socket and keep it alive even when no data is coming in.

Of course, using a message broker like Kafka has many more advantages than just opening a socket connection, especially in terms of resilience and fault tolerance.

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

spark = SparkSession.builder \
    .appName("SocketConsumer") \
    .getOrCreate()

ssc = StreamingContext(spark.sparkContext, 1)  # 1-second batch interval

# Create a DStream from a socket source
lines = ssc.socketTextStream("localhost", 9999)  # change host and port as needed

def process(record):
    # Implement your logic for each received record here
    print(record)  # example: print the received data (runs on the executors)

def process_data(rdd):
    rdd.foreach(process)

lines.foreachRDD(process_data)

# Start the streaming context and block until termination
ssc.start()
ssc.awaitTermination()