Integrate Apache Aurora with DC/OS


There are only two Mesos frameworks that support GPU resources: Marathon and Aurora. I would like to launch batch jobs on Mesos agents with GPU resources, and of the two only Aurora supports that kind of job. However, Aurora is not officially supported by DC/OS at the moment. I've tried to integrate it, without success: the DC/OS Mesos masters don't register the Aurora framework, although Exhibitor does create records for Aurora. I haven't managed to find any entries about Aurora in the Mesos master logs. Here is my aurora-scheduler config:

 #!/bin/bash

 GLOG_v=0
 LIBPROCESS_PORT=8083
 #LIBPROCESS_IP=127.0.0.1

 JAVA_HOME=/opt/mesosphere/active/java/usr/java

 JAVA_OPTS="-server -Djava.library.path='/opt/mesosphere/lib;/usr/lib;/usr/lib64'"

 PATH=$PATH:/opt/mesosphere/bin

 MESOS_NATIVE_JAVA_LIBRARY=/opt/mesosphere/lib/libmesos.so

 LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mesosphere/lib

 JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/opt/mesosphere/lib

 # Flags control the behavior of the Aurora scheduler.
 # For a full list of available flags, run /usr/lib/aurora/bin/aurora-scheduler -help
 AURORA_FLAGS=(
    # The name of this cluster.
   -cluster_name='My Cluster'

    # The HTTP port upon which Aurora will listen.
   -http_port=8088

    # The ZooKeeper URL of the ZNode where the Mesos master has registered.
    -mesos_master_address=zk://master_ip1:2181,master_ip2:2181,master_ip3:2181/mesos

    # The ZooKeeper quorum to which Aurora will register itself.
    -zk_endpoints=master_ip1:2181,master_ip2:2181,master_ip3:2181

    # The ZooKeeper ZNode within the specified quorum to which Aurora will register its
    # ServerSet, which keeps track of all live Aurora schedulers.
    -serverset_path='/aurora/scheduler'

    # Allows the scheduling of containers of the provided type.
    -allowed_container_types='DOCKER,MESOS'

    -allow_docker_parameters=true
    -allow_gpu_resource=true
    -executor_user=root
    ### Native Log Settings ###

    # The native log serves as a replicated database which stores the state of the
    # scheduler, allowing for multi-master operation.

    # Size of the quorum of Aurora schedulers which possess a native log.  If running in
    # multi-master mode, consult the following document to determine appropriate values:
    #
    # https://aurora.apache.org/documentation/latest/deploying-aurora-scheduler/#replicated-log-configuration
    -native_log_quorum_size=2
    # The ZooKeeper ZNode to which Aurora will register the locations of its replicated log.
    -native_log_zk_group_path='/aurora/replicated-log'
    # The local directory in which an Aurora scheduler can find Aurora's replicated log.
    -native_log_file_path='/var/lib/aurora/scheduler/db'
    # The local directory in which Aurora schedulers will place state backups.
    -backup_dir='/var/lib/aurora/scheduler/backups'

   ### Thermos Settings ###

   # The local path of the Thermos executor binary.
    -thermos_executor_path='/usr/bin/thermos_executor'
   # Flags to pass to the Thermos executor.
    -thermos_executor_flags='--announcer-ensemble 127.0.0.1:2181')

There is 1 solution below


I've managed to start the Aurora framework on DC/OS 1.8. Because Mesos and Java are embedded in DC/OS with a custom configuration (custom paths in particular), I had to isolate Aurora with Docker. You can find Docker images for the Aurora components in my Docker repo: Aurora scheduler, Aurora executor. This also makes it possible for me or someone else to create a Universe package.

Steps for deploying the Aurora Scheduler on DC/OS:

  1. Create the folder /var/lib/aurora on each DC/OS agent.

  2. Start the Aurora executor on all DC/OS agents using the following JSON:

    {
      "id": "/aurora/aurora-executor",
      "env": {
        "MESOS_ROOT": "/var/lib/mesos/slave"
      },
      "instances": 20,
      "cpus": 1,
      "mem": 128,
      "disk": 0,
      "gpus": 0,
      "constraints": [
        [
          "hostname",
          "UNIQUE"
        ]
      ],
      "container": {
        "docker": {
          "image": "krot/aurora-executor",
          "forcePullImage": true,
          "privileged": false,
          "network": "HOST"
        },
        "type": "DOCKER",
        "volumes": [
          {
            "containerPath": "/var/lib/mesos/slave",
            "hostPath": "/var/lib/mesos/slave",
            "mode": "RW"
          },
          {
            "containerPath": "/var/lib/aurora",
            "hostPath": "/var/lib/aurora",
            "mode": "RW"
          }
        ]
      }
    }
    

    Note: set "instances" to the number of agents.

    2a. Alternatively, deploy the Aurora executor manually (this must be done on each DC/OS agent):

     sudo yum install -y python2 wget
     wget -c https://apache.bintray.com/aurora/centos-7/aurora-executor-0.16.0-1.el7.centos.aurora.x86_64.rpm
     sudo rpm -Uhv --nodeps aurora-executor-0.16.0-1.el7.centos.aurora.x86_64.rpm
    

    Then edit /etc/sysconfig/thermos to add the --mesos-root flag, so that it looks something like:

    grep -A5 OBSERVER_ARGS /etc/sysconfig/thermos
    OBSERVER_ARGS=(
       --port=1338
       --mesos-root=/var/lib/mesos/slave
       --log_to_disk=NONE
       --log_to_stderr=google:INFO
    )
    
  3. Start the Aurora scheduler using the following JSON (3 or more instances are recommended for fault tolerance):

    {
      "id": "/aurora/aurora-scheduler",
      "env": {
        "CLUSTER_NAME": "YourCluster",
        "ZK_ENDPOINTS": "master.mesos:2181",
        "MESOS_MASTER": "zk://master.mesos:2181/mesos",
        "QUORUM_SIZE": "2",
        "EXTRA_SCHEDULER_ARGS": "-allow_gpu_resource=true"
      },
      "instances": 3,
      "cpus": 1,
      "mem": 1024,
      "disk": 0,
      "gpus": 0,
      "constraints": [
        [
          "hostname",
          "UNIQUE"
        ]
      ],
      "container": {
        "docker": {
          "image": "krot/aurora-scheduler",
          "forcePullImage": true,
          "privileged": false,
          "network": "HOST"
        },
        "type": "DOCKER",
        "volumes": [
          {
            "containerPath": "/var/lib/aurora",
            "hostPath": "/var/lib/aurora",
            "mode": "RW"
          }
        ]
      }
    }
    

    Note: -allow_gpu_resource=true enables GPU support. The Aurora scheduler can be configured using environment variables; please refer to the documentation for details.
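
If you want to script steps 1 to 3 instead of doing them by hand, something along these lines might work. This is only a rough sketch, not something I ship with the images. It assumes the dcos CLI is installed and logged in to the cluster, that you can SSH to every agent with passwordless sudo, and that the two Marathon definitions above are saved to local files. The file names (agents.txt with one agent hostname per line, aurora-executor.json, aurora-scheduler.json) and the function names are made up for illustration.

```shell
#!/bin/bash
# Sketch only: the helper names and file names are hypothetical.

# Count the agents listed in a file (one hostname per line, blank lines ignored).
count_agents() {
  grep -c '[^[:space:]]' "$1" || true
}

# Print a Marathon app definition with its "instances" value replaced.
with_instances() {
  local app_json=$1 count=$2
  sed -E 's/("instances":[[:space:]]*)[0-9]+/\1'"$count"'/' "$app_json"
}

# Run steps 1-3: create the folder on each agent, then submit both apps.
deploy_aurora() {
  local agents_file=$1 executor_json=$2 scheduler_json=$3
  local count host generated

  count=$(count_agents "$agents_file")
  if [ "$count" -eq 0 ]; then
    echo "no agents listed in $agents_file" >&2
    return 1
  fi

  # Step 1: create /var/lib/aurora on every agent.
  # ssh -n keeps ssh from consuming the rest of the agent list on stdin.
  while read -r host; do
    [ -n "$host" ] || continue
    ssh -n "$host" 'sudo mkdir -p /var/lib/aurora' || return 1
  done < "$agents_file"

  # Step 2: one executor per agent, so "instances" is set to the agent count.
  generated=$(mktemp) || return 1
  with_instances "$executor_json" "$count" > "$generated" || return 1
  dcos marathon app add "$generated" || return 1

  # Step 3: the scheduler definition is submitted unchanged.
  dcos marathon app add "$scheduler_json"
}
```

After sourcing the file, you would run `deploy_aurora agents.txt aurora-executor.json aurora-scheduler.json`. The sed rewrite is deliberately naive and only touches the "instances" line, which is enough for the JSON shown above; for anything more complicated a real JSON tool would be a safer choice.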