Hello, I am trying to use PySpark with Kafka. To set up the Spark application I run the code below.

  • Spark version: 3.5.0 (spark-3.5.0-bin-hadoop3)
  • Kafka version: kafka_2.13-3.6.0
  • PySpark version: 3.5.0

My code is:

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"
# Flags are space-separated; only the --packages coordinates are comma-separated.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /content/spark-streaming-kafka-0-10-assembly_2.13-3.5.0.jar --packages org.apache.spark:spark-streaming-kafka-0-10_2.13:3.5.0,org.apache.spark:spark-sql-kafka-0-10_2.13:3.5.0 pyspark-shell'

import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("subscribe", "c1") \
  .option("includeHeaders", "true") \
  .load()
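
For reference, a minimal sketch of how I plan to consume the stream once load() succeeds (console sink only, just for testing; the selectExpr casts are my assumption about how to print the key/value pairs):

query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .format("console") \
    .option("truncate", "false") \
    .start()

query.awaitTermination()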

This returns the following error:

Failed to find data source: kafka. Please deploy the application as per the deployment section of Structured Streaming + Kafka Integration Guide.

I'm using Google Colab

I've tried downgrading versions and tried multiple old solutions from Stack Overflow, with no luck.
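
One thing I suspect but haven't confirmed: the prebuilt spark-3.5.0-bin-hadoop3 ships with Scala 2.12, so the _2.13 connector artifacts may simply not match the running Spark. A sketch of the variant I understand should line up (for Structured Streaming only the spark-sql-kafka package should be needed):

# Assumed fix, not verified: Scala 2.12 coordinates to match the prebuilt
# Spark binaries; must be set before findspark/SparkSession start the JVM.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 pyspark-shell'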
