How to set up the Spark-Cassandra connector to connect to a Cassandra cluster on Kubernetes


We are getting pretty bad write performance with the Spark-Cassandra connector when Cassandra is on k8s. For clarity: we are trying to write a DataFrame with 1.3 billion unique keys (around 30 GB) using 16 executors, each with 4 cores and 16 GB of memory. We have a Cassandra cluster of 5 nodes (replication factor = 2), where the Cassandra table looks like:

CREATE TABLE <tablename> (hashed_id text PRIMARY KEY, timestamp1 bigint, timestamp2 bigint)

The write took around 8 hours....

Sample code of how we write a DataFrame to Cassandra:

(
    df.write
    .format("org.apache.spark.sql.cassandra")
    .mode("overwrite")
    .option("confirm.truncate", "true")
    .options(table=tablename, keyspace=cassandra_keyspace)
    .save()
)

We've recently started using Cassandra and decided to deploy it on Kubernetes. We are running some ETLs on Spark that need to write directly to Cassandra.

Our setup is:

  • Cassandra (4.0) deployed on k8s using the K8ssandra operator (1.6), behind a Traefik ingress (no TLS)

  • Spark (3.2) deployed on bare metal; ETLs in PySpark, using spark-cassandra-connector_2.12-3.2.0.

I'm looking for any reference on how to configure the Spark connector to use all nodes in such a case. What I'm assuming is happening is that the connector is only able to "see" the ingress address and gets back internal IPs for the other nodes. We would like to follow the examples here, but we are not sure how to configure the Spark connector to use such configurations.
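For context, a minimal sketch of how we currently point the connector at the ingress (the hostname, port, and app name below are placeholders; only the spark.cassandra.connection.* option names come from the connector):

from pyspark.sql import SparkSession

# Sketch of our current setup: the only contact point the connector gets is the
# single Traefik ingress address (hostname and port are placeholders).
spark = (
    SparkSession.builder
    .appName("cassandra-write-etl")
    .config("spark.cassandra.connection.host", "cassandra-ingress.example.com")
    .config("spark.cassandra.connection.port", "9042")
    .getOrCreate()
)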


There is 1 answer below


There are two questions here:

  1. Why are the writes taking so long?
  2. It is not very clear to me what role the SCC plays with the K8s ingress.

To answer question #1,

  • spark.cassandra.connection.resolveContactPoints — when set to true (the default), contact points are resolved at startup; when set to false, they are resolved at reconnection. This is helpful for Kubernetes and other systems with dynamic endpoints that may change while the application is running. Ensure you haven't set it to false.
  • spark.cassandra.connection.host — the hosts given here are used as initial contact points to the C* cluster. After the initial connection is made, the driver discovers the entire topology of the cluster. (See the sketch after this list.)
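As a rough illustration of where these two settings go, assuming PySpark and placeholder hostnames (only the spark.cassandra.connection.* option names are taken from the connector reference):

from pyspark.sql import SparkSession

# Illustrative sketch: hostnames and port are placeholders.
spark = (
    SparkSession.builder
    .config("spark.cassandra.connection.host", "node1.example.com,node2.example.com")  # initial contact points
    .config("spark.cassandra.connection.port", "9042")
    .config("spark.cassandra.connection.resolveContactPoints", "true")  # default; keep it true
    .getOrCreate()
)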

The SCC configuration parameters are available here. You can tune the write tuning parameters, i.e. the ones starting with spark.cassandra.output.*. Also, ensure your C* cluster is properly sized (hardware specs, data model, etc.) to run efficiently.
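For example, a hedged sketch of setting a few of those write tuning parameters on the Spark session (the option names under spark.cassandra.output.* are from the SCC reference; the specific values are illustrative and should be benchmarked against your own cluster):

from pyspark.sql import SparkSession

# Illustrative values only -- benchmark before adopting them.
spark = (
    SparkSession.builder
    .config("spark.cassandra.output.concurrent.writes", "10")          # concurrent async writes per task
    .config("spark.cassandra.output.batch.grouping.key", "partition")  # how rows are grouped into batches
    .config("spark.cassandra.output.batch.size.rows", "auto")          # let the connector size batches
    .getOrCreate()
)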