I have set up a 3-node Spark Kubernetes cluster with the spark-kubernetes-operator Helm chart. The Kubernetes cluster is deployed on AWS t2.2xlarge instances (8 vCPUs and 32 GB memory each).
I have built a RandomForest price prediction Spark pipeline in Scala and run it on this cluster. The training dataset (a CSV file) contains around 100,000 records. Following is the SparkApplication spec I used to run the Spark job.
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
name: spark-job
namespace: spark-operator
spec:
type: Scala
mode: cluster
image: "erangaeb/spark-app:1.18"
imagePullPolicy: Always
mainClass: com.rahasak.sparkapp.Tea3RandomForest
mainApplicationFile: "local:///app/spark-app.jar"
sparkVersion: "3.1.1"
restartPolicy:
type: Never
sparkConf:
"spark.ui.port": "4041"
dynamicAllocation:
enabled: true
driver:
cores: 1
memory: "18g"
labels:
version: 3.1.1
serviceAccount: tea3-spark
volumeMounts:
- name: "data-volume"
mountPath: "/mnt/data"
executor:
cores: 2
memory: "24g"
instances: 4
labels:
version: 3.1.1
volumeMounts:
- name: "data-volume"
mountPath: "/mnt/data"
volumes:
- name: "data-volume"
persistentVolumeClaim:
claimName: rahasak-pvc
sparkConf:
spark.kubernetes.local.dirs.tmpfs: "true"
spark.local.dir: "/mnt/data"
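For context, the pipeline itself is roughly shaped like the sketch below. This is a simplified illustration, not the exact code: the CSV path, the "price" label column, and the assumption that all remaining columns are numeric are placeholders.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.sql.SparkSession

object Tea3RandomForestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("tea3-random-forest").getOrCreate()

    // Read the ~100,000-record training CSV from the shared volume (path is a placeholder)
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/mnt/data/training.csv")

    // Assemble every non-label column into a feature vector
    // (assumes all of them are numeric; the real job does more feature handling)
    val assembler = new VectorAssembler()
      .setInputCols(df.columns.filter(_ != "price"))
      .setOutputCol("features")

    val rf = new RandomForestRegressor()
      .setLabelCol("price")
      .setFeaturesCol("features")

    val pipeline = new Pipeline().setStages(Array(assembler, rf))
    val model = pipeline.fit(df)

    model.write.overwrite().save("/mnt/data/rf-model")
    spark.stop()
  }
}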
It took 12 days to complete this job. Any idea what an average completion time for a Spark job like this would be? I suspect 12 days is far too long. Are there any optimizations I could apply to reduce the job time?
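For example, I am wondering whether explicitly repartitioning and caching the training data before fitting would make a difference. Something like the following untested sketch (reusing spark and pipeline from the sketch above):

// Untested idea: spread the CSV across the executors and keep it in memory,
// since RandomForest training scans the data repeatedly.
val training = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/mnt/data/training.csv")
  .repartition(8)   // roughly executor instances * cores from the spec above
  .cache()

val model = pipeline.fit(training)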