Only two Mesos frameworks support GPU resources: Marathon and Aurora. I would like to launch batch jobs on Mesos agents with GPU resources, and of the two only Aurora supports that kind of job. However, Aurora is not officially supported by DC/OS at the moment. I've tried to integrate it, without success: the DC/OS Mesos masters don't register the Aurora framework, although Exhibitor does create records for Aurora. I haven't managed to find any records about Aurora in the Mesos master logs. Here is my aurora-scheduler config:
#!/bin/bash
GLOG_v=0
LIBPROCESS_PORT=8083
#LIBPROCESS_IP=127.0.0.1
JAVA_HOME=/opt/mesosphere/active/java/usr/java
JAVA_OPTS="-server -Djava.library.path='/opt/mesosphere/lib;/usr/lib;/usr/lib64'"
PATH=$PATH:/opt/mesosphere/bin
MESOS_NATIVE_JAVA_LIBRARY=/opt/mesosphere/lib/libmesos.so
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mesosphere/lib
JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/opt/mesosphere/lib
# Flags control the behavior of the Aurora scheduler.
# For a full list of available flags, run /usr/lib/aurora/bin/aurora-scheduler -help
AURORA_FLAGS=(
# The name of this cluster.
-cluster_name='My Cluster'
# The HTTP port upon which Aurora will listen.
-http_port=8088
# The ZooKeeper URL of the ZNode where the Mesos master has registered.
-mesos_master_address=zk://master_ip1:2181,master_ip2:2181,master_ip3:2181/mesos
# The ZooKeeper quorum to which Aurora will register itself.
-zk_endpoints=master_ip1:2181,master_ip1:2181,master_ip1:2181
# The ZooKeeper ZNode within the specified quorum to which Aurora will register its
# ServerSet, which keeps track of all live Aurora schedulers.
-serverset_path='/aurora/scheduler'
# Allows the scheduling of containers of the provided type.
-allowed_container_types='DOCKER,MESOS'
-allow_docker_parameters=true
-allow_gpu_resource=true
-executor_user=root
### Native Log Settings ###
# The native log serves as a replicated database which stores the state of the
# scheduler, allowing for multi-master operation.
# Size of the quorum of Aurora schedulers which possess a native log. If running in
# multi-master mode, consult the following document to determine appropriate values:
#
# https://aurora.apache.org/documentation/latest/deploying-aurora-scheduler/#replicated-log-configuration
-native_log_quorum_size=2
# The ZooKeeper ZNode to which Aurora will register the locations of its replicated log.
-native_log_zk_group_path='/aurora/replicated-log'
# The local directory in which an Aurora scheduler can find Aurora's replicated log.
-native_log_file_path='/var/lib/aurora/scheduler/db'
# The local directory in which Aurora schedulers will place state backups.
-backup_dir='/var/lib/aurora/scheduler/backups'
### Thermos Settings ###
# The local path of the Thermos executor binary.
-thermos_executor_path='/usr/bin/thermos_executor'
# Flags to pass to the Thermos executor.
-thermos_executor_flags='--announcer-ensemble 127.0.0.1:2181')
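One detail in the config above may be worth a quick check: the -zk_endpoints value lists master_ip1:2181 three times, while -mesos_master_address names three distinct hosts. This may simply be a side effect of redacting the real addresses, and it is not necessarily related to the registration problem. As a minimal sketch, a helper like the following (duplicate_endpoints is a hypothetical name, not part of Aurora or DC/OS) prints any entry that is repeated in a comma-separated endpoint list:

```shell
#!/bin/bash
# Hypothetical helper: print each host:port entry that occurs more than once
# in a comma-separated endpoint list. Prints nothing if all entries are unique.
duplicate_endpoints() {
    printf '%s\n' "$1" | tr ',' '\n' | sort | uniq -d
}

# Placeholder value copied from the -zk_endpoints flag in the config above.
duplicate_endpoints "master_ip1:2181,master_ip1:2181,master_ip1:2181"
```

If the function prints anything for the real value, the list effectively points at fewer ZooKeeper nodes than intended.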
I've managed to start the Aurora framework on DC/OS 1.8. Because Mesos and Java are embedded in DC/OS and have a custom configuration, especially the paths, I had to isolate Aurora with Docker. You can find Docker images for the Aurora components in my Docker repo: Aurora scheduler, Aurora executor. This also makes it possible for me or someone else to create a Universe package.
Steps for deploying the Aurora Scheduler on DC/OS:
1. Create the folder /var/lib/aurora on each DC/OS agent.
2. Start the Aurora executor on all DC/OS agents using the following JSON:
Note: set "instances" to the number of agents.
2a. An alternative way to deploy the Aurora executor (should be done on each DC/OS agent):
Make an edit to add the --mesos-root flag, resulting in something like:
3. Start the Aurora scheduler using the following JSON (3 or more instances are recommended for fault tolerance):
Note: -allow_gpu_resource=true enables GPU support. The Aurora scheduler can be configured using environment variables. Please refer to the documentation for details.
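The first step above (creating /var/lib/aurora on each agent) can be sketched as a small shell function. This is only an illustration: ensure_aurora_dir is a hypothetical helper name, and the optional argument exists so the function can be tried against a scratch path first.

```shell
#!/bin/bash
# Hypothetical helper: create the Aurora state directory if it is missing.
# The default path is the one named in the steps above; pass another path
# as the first argument to try the function somewhere harmless.
ensure_aurora_dir() {
    local dir="${1:-/var/lib/aurora}"
    # mkdir -p succeeds if the directory already exists, so reruns are safe.
    mkdir -p "$dir" && [ -d "$dir" ]
}
```

On a real agent it would be called with no argument, as a user allowed to write under /var/lib. How the function reaches every agent (SSH loop, configuration management, and so on) is left open here.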