I have set up a DC/OS cluster and installed the ArangoDB Mesos framework (with the default configuration). I can reach the web interface on port 8529 through the arangodb-proxy, and I can create databases, collections, and documents there.
Now I'm trying to import some data using the Java driver (3.1.4). After 2-3 calls the coordinator goes down. Mesos restarts it, but as long as I keep sending data it immediately dies again after a few requests (I also lose the connection in the web interface for a few seconds):
com.arangodb.ArangoException: org.apache.http.NoHttpResponseException: 172.16.100.99:8529 failed to respond
My insert is basically just a create statement:
arangoDriver.graphCreateVertex(GRAPH_NAME, VERTEX_COLLECTION,
        getId(), this, true);
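As a stopgap I also retry failed inserts with exponential backoff, since the coordinator does come back after Mesos restarts it (this doesn't fix the crash, it just lets the import limp along). The helper below is my own sketch, not part of the ArangoDB driver:

```java
// Minimal exponential-backoff schedule for retrying an insert after the
// coordinator drops the connection. My own helper, not part of the
// ArangoDB Java driver; the actual graphCreateVertex call would be
// retried inside a loop that sleeps delayMs(attempt, ...) between tries.
public final class RetryBackoff {

    /** Delay in milliseconds before retry number `attempt` (0-based), capped at maxMs. */
    public static long delayMs(int attempt, long baseMs, long maxMs) {
        // baseMs * 2^attempt, with the shift clamped to avoid overflow
        long delay = baseMs << Math.min(attempt, 20);
        return Math.min(delay, maxMs);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println(delayMs(attempt, 100, 5000));
        }
    }
}
```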
The ArangoDB proxy also complains:
I0109 11:26:45.046947 113285 exec.cpp:161] Version: 1.0.1
I0109 11:26:45.051712 113291 exec.cpp:236] Executor registered on agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2
I0109 11:26:45.052942 113293 docker.cpp:815] Running docker -H unix:///var/run/docker.sock run --cpu-shares 1024 --memory 134217728 -e MARATHON_APP_VERSION=2017-01-09T10:26:29.819Z -e HOST=172.16.100.99 -e MARATHON_APP_RESOURCE_CPUS=1.0 -e MARATHON_APP_RESOURCE_GPUS=0 -e MARATHON_APP_DOCKER_IMAGE=arangodb/arangodb-mesos-haproxy -e MESOS_TASK_ID=arangodb-proxy.16604c72-d656-11e6-80d4-70b3d5800001 -e PORT=8529 -e MARATHON_APP_RESOURCE_MEM=128.0 -e PORTS=8529 -e MARATHON_APP_RESOURCE_DISK=0.0 -e PORT_80=8529 -e MARATHON_APP_LABELS= -e MARATHON_APP_ID=/arangodb-proxy -e PORT0=8529 -e LIBPROCESS_IP=172.16.100.99 -e MESOS_SANDBOX=/mnt/mesos/sandbox -e MESOS_CONTAINER_NAME=mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.e0db1925-ff85-4454-bd7e-e0f46e502631 -v /var/lib/mesos/slave/slaves/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2/frameworks/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0000/executors/arangodb-proxy.16604c72-d656-11e6-80d4-70b3d5800001/runs/e0db1925-ff85-4454-bd7e-e0f46e502631:/mnt/mesos/sandbox --net bridge -p 8529:80/tcp --entrypoint /bin/sh --name mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.e0db1925-ff85-4454-bd7e-e0f46e502631 arangodb/arangodb-mesos-haproxy -c nodejs /configurator.js arangodb3
{ [Error: connect ECONNREFUSED 172.16.100.98:1891]
code: 'ECONNREFUSED',
errno: 'ECONNREFUSED',
syscall: 'connect',
address: '172.16.100.98',
port: 1891 }
{ [Error: connect ECONNREFUSED 172.16.100.99:10413]
code: 'ECONNREFUSED',
errno: 'ECONNREFUSED',
syscall: 'connect',
address: '172.16.100.99',
port: 10413 }
I can see the failed task in the arangodb service's completed-task list, but its stderr log doesn't seem to say anything:
I0109 16:28:31.792980 126177 exec.cpp:161] Version: 1.0.1
I0109 16:28:31.797145 126182 exec.cpp:236] Executor registered on agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2
I0109 16:28:31.798338 126183 docker.cpp:815] Running docker -H unix:///var/run/docker.sock run --cpu-shares 1024 --memory 4294967296 -e CLUSTER_ROLE=coordinator -e CLUSTER_ID=Coordinator002 -e ADDITIONAL_ARGS= -e AGENCY_ENDPOINTS=tcp://172.16.100.97:1025 tcp://172.16.100.99:1025 tcp://172.16.100.98:1025 -e HOST=172.16.100.99 -e PORT0=1027 -e LIBPROCESS_IP=172.16.100.99 -e MESOS_SANDBOX=/mnt/mesos/sandbox -e MESOS_CONTAINER_NAME=mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.61fb92b9-62e0-48b2-b2a3-3dc0b95f7818 -v /var/lib/mesos/slave/slaves/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2/frameworks/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049/executors/0c81c1f2-0404-4943-ab81-0abc78763140/runs/61fb92b9-62e0-48b2-b2a3-3dc0b95f7818:/mnt/mesos/sandbox --net bridge -p 1027:8529/tcp --name mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.61fb92b9-62e0-48b2-b2a3-3dc0b95f7818 arangodb/arangodb-mesos:3.1
The Mesos log indicates that the task failed:
I0109 16:55:44.821689 13431 master.cpp:5728] Sending 1 offers to framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at [email protected]:2273
I0109 16:55:45.313108 13431 master.cpp:5466] Performing explicit task state reconciliation for 1 tasks of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0003 (marathon-user) at [email protected]:42099
I0109 16:55:45.560523 13428 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O141655 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at [email protected]:2273
I0109 16:55:45.676347 13431 http.cpp:381] HTTP GET for /master/state-summary from 172.16.100.106:42540 with User-Agent='python-requests/2.10.0'
I0109 16:55:45.823482 13425 master.cpp:5728] Sending 1 offers to framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0001 (metronome) at [email protected]:44838
I0109 16:55:45.823698 13425 http.cpp:381] HTTP GET for /master/state-summary from 172.16.100.105:34374 with User-Agent='python-requests/2.10.0'
I0109 16:55:45.824986 13425 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O141656 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0001 (metronome) at [email protected]:44838
I0109 16:55:45.826448 13425 http.cpp:381] HTTP GET for /master/state-summary from 172.16.100.107:41364 with User-Agent='python-requests/2.10.0'
I0109 16:55:46.694202 13425 master.cpp:5140] Status update TASK_FAILED (UUID: 2abcbe87-e1d6-4968-965d-33429573dfd9) for task 32014e7f-7f5b-4fea-b757-cca0faa3deac of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 from agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2 at slave(1)@172.16.100.99:5051 (172.16.100.99)
I0109 16:55:46.694247 13425 master.cpp:5202] Forwarding status update TASK_FAILED (UUID: 2abcbe87-e1d6-4968-965d-33429573dfd9) for task 32014e7f-7f5b-4fea-b757-cca0faa3deac of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049
I0109 16:55:46.694344 13425 master.cpp:6844] Updating the state of task 32014e7f-7f5b-4fea-b757-cca0faa3deac of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (latest state: TASK_FAILED, status update state: TASK_FAILED)
I0109 16:55:46.695953 13425 master.cpp:4265] Processing ACKNOWLEDGE call 2abcbe87-e1d6-4968-965d-33429573dfd9 for task 32014e7f-7f5b-4fea-b757-cca0faa3deac of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at [email protected]:2273 on agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2
I0109 16:55:46.695989 13425 master.cpp:6910] Removing task 32014e7f-7f5b-4fea-b757-cca0faa3deac with resources mem(*):4096; cpus(*):1; disk(*):1024; ports(*):[1027-1027] of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 on agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2 at slave(1)@172.16.100.99:5051 (172.16.100.99)
I0109 16:55:46.824192 13430 master.cpp:5728] Sending 1 offers to framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at [email protected]:2273
I0109 16:55:46.824347 13430 master.cpp:5728] Sending 1 offers to framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0001 (metronome) at [email protected]:44838
I0109 16:55:46.825814 13425 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O141658 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0001 (metronome) at [email protected]:44838
I0109 16:55:47.567651 13426 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O141657 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at [email protected]:2273
I ran the same import tasks against a single (non-clustered) ArangoDB instance without any issues, so I assume the problem is not in my Java code. But I can't find any further logs indicating where the issue might be (then again, I'm pretty new to Mesos).
My cluster:
4 slaves (126 GB RAM, 8 CPUs each)
3 masters (32 GB RAM, 4 CPUs each)
Can anyone tell me what I'm doing wrong, or where I can find more logging information?
Update: stdout only shows the startup messages (judging by the timestamps, both the stderr entries above and the stdout entries below are from when the task started, not from when it failed):
--container="mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.209e5908-e884-4c92-92b2-a27f1761ad6e" --docker="docker" --docker_socket="/var/run/docker.sock" --help="false" --initialize_driver_logging="true" --launcher_dir="/opt/mesosphere/packages/mesos--253f5cb0a96e2e3574293ddfecf5c63358527377/libexec/mesos" --logbufsecs="0" --logging_level="INFO" --mapped_directory="/mnt/mesos/sandbox" --quiet="false" --sandbox_directory="/var/lib/mesos/slave/slaves/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2/frameworks/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049/executors/1d0cf1e4-3267-4150-a465-19ecca21fa65/runs/209e5908-e884-4c92-92b2-a27f1761ad6e" --stop_timeout="20secs"
Registered docker executor on 172.16.100.99
Starting task 1d0cf1e4-3267-4150-a465-19ecca21fa65
Initializing database...Hang on...
Database initialized...Starting System...
2017-01-10T08:04:30Z [1] INFO ArangoDB 3.1.7 [linux] 64bit, using VPack 0.1.30, ICU 54.1, V8 5.0.71.39, OpenSSL 1.0.1t 3 May 2016
2017-01-10T08:04:30Z [1] INFO using SSL options: SSL_OP_CIPHER_SERVER_PREFERENCE, SSL_OP_TLS_ROLLBACK_BUG
2017-01-10T08:04:30Z [1] WARNING {agencycomm} got an agency redirect from 'http+tcp://172.16.100.97:1025' to 'http+tcp://172.16.100.99:1025'
2017-01-10T08:04:31Z [1] WARNING {agencycomm} Retrying agency communication at 'http+tcp://172.16.100.99:1025', tries: 2
2017-01-10T08:04:31Z [1] INFO Waiting for DBservers to show up...
2017-01-10T08:04:31Z [1] INFO Found 3 DBservers.
2017-01-10T08:04:31Z [1] INFO file-descriptors (nofiles) hard limit is 1048576, soft limit is 1048576
2017-01-10T08:04:31Z [1] INFO JavaScript using startup '/usr/share/arangodb3/js', application '/var/lib/arangodb3-apps'
2017-01-10T08:04:32Z [1] INFO Cluster feature is turned on. Agency version: {"server":"arango","version":"3.1.7","license":"community"}, Agency endpoints: http+tcp://172.16.100.99:1025, http+tcp://172.16.100.97:1025, http+tcp://172.16.100.98:1025, server id: 'Coordinator002', internal address: tcp://172.16.100.99:1027, role: COORDINATOR
2017-01-10T08:04:32Z [1] INFO using heartbeat interval value '1000 ms' from agency
2017-01-10T08:04:32Z [1] INFO In database '_system': Database is up-to-date (30107/cluster-local/existing)
2017-01-10T08:04:32Z [1] INFO using endpoint 'http+tcp://0.0.0.0:8529' for non-encrypted requests
2017-01-10T08:04:33Z [1] INFO bootstraped coordinator Coordinator002
2017-01-10T08:04:33Z [1] INFO ArangoDB (version 3.1.7 [linux]) is ready for business. Have fun!
I've noticed some output in the stderr of the arangodb3 task, but I'm not sure whether this is just request logging or part of the problem; it is repeated every few seconds:
I0110 08:50:25.981262 23 HttpServer.cpp:456] handling http request 'GET /v1/endpoints.json'
I0110 08:50:26.000558 22 CaretakerCluster.cpp:470] And here the offer:
{"id":{"value":"b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O173464"},"framework_id":{"value":"b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049"},"slave_id":{"value":"b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S0"},"hostname":"172.16.100.97","url":{"scheme":"http","address":{"hostname":"172.16.100.97","ip":"172.16.100.97","port":5051},"path":"/slave(1)","query":[]},"resources":[{"name":"ports","type":1,"ranges":{"range":[{"begin":1028,"end":2180},{"begin":2182,"end":3887},{"begin":3889,"end":5049},{"begin":5052,"end":8079},{"begin":8082,"end":8180},{"begin":8182,"end":8528},{"begin":8530,"end":32000}]},"role":"*"},{"name":"disk","type":0,"scalar":{"value":291730},"role":"*"},{"name":"cpus","type":0,"scalar":{"value":4.75},"role":"*"},{"name":"mem","type":0,"scalar":{"value":117367},"role":"*"}],"attributes":[],"executor_ids":[]}
Update 2: the live log under /mesos also shows me this; does this mean the cluster isn't started properly?
I0110 09:55:30.857897 13427 http.cpp:381] HTTP GET for /master/state-summary from 172.16.100.107:60624 with User-Agent='python-requests/2.10.0'
I0110 09:55:31.111609 13426 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O173624 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at [email protected]:9546
I0110 09:55:31.111747 13426 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O173623 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at [email protected]:9546