Hello, I'm trying to use PySpark with Kafka. To set up the Spark application I run the code below. My environment:
- Spark version: 3.5.0 (spark-3.5.0-bin-hadoop3)
- Kafka version: kafka_2.13-3.6.0
- PySpark version: 3.5.0
My code is:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /content/spark-streaming-kafka-0-10-assembly_2.13-3.5.0.jar --packages org.apache.spark:spark-streaming-kafka-0-10_2.13:3.5.0,org.apache.spark:spark-sql-kafka-0-10_2.13:3.5.0 pyspark-shell'
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()
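# Read the Kafka topic "c1" on the local broker as a streaming DataFrame;
# includeHeaders exposes the record headers as a column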
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "c1") \
    .option("includeHeaders", "true") \
    .load()
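For completeness, this is how I plan to consume the stream once the source loads (a minimal sketch; the console sink and the string casts are only for debugging, not my real pipeline):

query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .format("console") \
    .start()
query.awaitTermination()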
However, the load() call above already returns the following error:
Failed to find data source: kafka. Please deploy the application as per the deployment section of Structured Streaming + Kafka Integration Guide.
I'm running this in Google Colab. I've already tried downgrading versions and several older solutions from Stack Overflow, with no success.
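One thing I suspect is a Scala version mismatch: the spark-3.5.0-bin-hadoop3 download is built against Scala 2.12, while all my Kafka artifacts use the _2.13 suffix. Is something like the following the right configuration? This is a sketch of what I think the submit args should be, not a confirmed fix; I dropped the assembly jar because, as far as I understand, spark-streaming-kafka-0-10 targets the old DStream API, while Structured Streaming only needs spark-sql-kafka-0-10:

# Sketch: match the package's Scala suffix (_2.12) to the Spark binary's Scala build
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 '
    'pyspark-shell'
)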