Sync AWS CodeCommit with git-sync in a SparkApplication on k8s


I'm running the spark-operator on k8s and need to sync my AWS CodeCommit repository directly, so that I can import my Python modules without baking them into the images. I have already done this with GitHub by deploying an SSH key to the namespace. Now I am trying to sync using AWS credentials instead, with the YAML below:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: teste-sync-{{ macros.datetime.now().strftime("%Y-%m-%d-%H-%M-%S") }}
  namespace: processing
spec:
  sparkConf:
    extraJavaOptions: -Dcom.amazonaws.services.s3.enableV4=true
    spark.jars.packages: "org.apache.hadoop:hadoop-aws:3.2.0,org.apache.spark:spark-avro_2.12:3.0.1"
    spark.driver.extraJavaOptions: "-Divy.cache.dir=/tmp -Divy.home=/tmp"
    spark.kubernetes.allocation.batch.size: "10"
    spark.sql.debug.maxToStringFields: "2000"
  hadoopConf:
    "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    "fs.s3a.path.style.access": "True"
    "fs.s3a.connection.ssl.enabled": "True"
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: url_spark_image
  imagePullPolicy: Always
  mainApplicationFile: teste-sync.py
  sparkVersion: "3.1.2"
  restartPolicy:
    type: Never
  volumes:
    - name: ivy
      emptyDir: {}
    - name: scripts
      emptyDir: {}
  driver:
    volumeMounts:
      - name: scripts
        mountPath: /git-sync
    initContainers:
      - name: git-sync
        image: "k8s.gcr.io/git-sync/git-sync:v3.6.1"
        imagePullPolicy: IfNotPresent
        volumeMounts:
          - name: scripts
            mountPath: /scripts
        env:
          - name: GIT_SYNC_REPO
            value: "https://git-codecommit.MY_REGION.amazonaws.com/v1/repos/MY_REPO"
          - name: GIT_SYNC_BRANCH
            value: "master"
          - name: GIT_SYNC_ROOT
            value: /dags
          - name: GIT_SYNC_DEST
            value: "main"
          - name: GIT_SYNC_ONE_TIME
            value: "true"
          - name: GIT_SYNC_SSH
            value: "false"
          - name: GIT_SYNC_AUTH
            value: "basic"
          - name: AWS_ACCESS_KEY_ID
            valueFrom:
              secretKeyRef:
                name: aws-credentials
                key: aws_access_key_id
          - name: AWS_SECRET_ACCESS_KEY
            valueFrom:
              secretKeyRef:
                name: aws-credentials
                key: aws_secret_access_key
    env:
      - name: PYTHONPATH
        value: "$PYTHONPATH:/git-sync/main/scripts"
    envSecretKeyRefs:
      AWS_ACCESS_KEY_ID:
        name: aws-credentials
        key: aws_access_key_id
      AWS_SECRET_ACCESS_KEY:
        name: aws-credentials
        key: aws_secret_access_key
    cores: 1
    coreLimit: "1200m"
    memory: "2g"
    labels:
      version: 3.1.2
    serviceAccount: spark
    volumeMounts:
      - name: ivy
        mountPath: /tmp
  executor:
    envSecretKeyRefs:
      AWS_ACCESS_KEY_ID:
        name: aws-credentials
        key: aws_access_key_id
      AWS_SECRET_ACCESS_KEY:
        name: aws-credentials
        key: aws_secret_access_key
    cores: 1
    instances: 2
    memory: "3g"
    labels:
      version: 3.1.2
    volumeMounts:
      - name: ivy
        mountPath: /tmp

In my tests this does not work. Can anyone help? Is there a problem with the YAML, or does this type of authentication simply not work, meaning I will have to deploy SSH after all?
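
For comparison, here is a minimal, untested sketch of the HTTPS alternative to SSH. It assumes that git-sync v3 does not read `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` at all, and that HTTPS access to CodeCommit would instead use the HTTPS Git credentials (a username/password pair) generated for an IAM user, passed through git-sync's `GIT_SYNC_USERNAME` and `GIT_SYNC_PASSWORD` variables. The Secret name `codecommit-git-credentials` and its keys `username` and `password` are hypothetical. The sketch also points `GIT_SYNC_ROOT` at the path where the `scripts` volume is mounted, so that the checkout lands in the shared emptyDir rather than in `/dags`, which is not on any volume.

```yaml
# Hypothetical variant of the git-sync init container (untested sketch).
# Assumes a Secret "codecommit-git-credentials" (hypothetical name) that holds
# IAM-generated HTTPS Git credentials for CodeCommit, not AWS access keys.
initContainers:
  - name: git-sync
    image: "k8s.gcr.io/git-sync/git-sync:v3.6.1"
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - name: scripts
        mountPath: /git-sync      # same path as GIT_SYNC_ROOT and the driver mount
    env:
      - name: GIT_SYNC_REPO
        value: "https://git-codecommit.MY_REGION.amazonaws.com/v1/repos/MY_REPO"
      - name: GIT_SYNC_BRANCH
        value: "master"
      - name: GIT_SYNC_ROOT
        value: /git-sync          # must be inside the mounted volume
      - name: GIT_SYNC_DEST
        value: "main"
      - name: GIT_SYNC_ONE_TIME
        value: "true"
      - name: GIT_SYNC_USERNAME
        valueFrom:
          secretKeyRef:
            name: codecommit-git-credentials   # hypothetical Secret
            key: username
      - name: GIT_SYNC_PASSWORD
        valueFrom:
          secretKeyRef:
            name: codecommit-git-credentials   # hypothetical Secret
            key: password
```

With the driver mounting the same `scripts` volume at `/git-sync`, the checkout would be expected under `/git-sync/main`, which matches the `PYTHONPATH` entry above as long as the repository contains a `scripts` directory.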
