Spark streaming on Mesos - random failures of Marathon tasks

103 Views Asked by At

I'm using Mesos 1.1.0 cluster for both Spark batch and Spark streaming jobs. Batch is working without any problem with same Spark URI-s, Streaming is behaving abnormally from time to time. Streaming job is started using Marathon 1.3.6 by running simple bash submit script. Script is local on all slaves, same on all slaves.

Script:

spark-submit --class "$1" --master  mesos://zk://master1:2181,master2:2181,master3:2181/mesos --conf spark.executor.uri=/spark/spark-2.0.1-bin-hadoop2.6.tgz --conf spark.cores.max=2 --conf spark.eventLog.enabled=false /spark_jobs/"$2"

This all works well most of the time, Marathon deploys the job, job starts and remains active and working. Here is sample of Mesos stderr for job which starts fine. It is just the beginning of the log file because I beleive it is where the issue appears.

#################### START OF Normal stderr
I1224 09:02:00.480056 23118 fetcher.cpp:498] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3\/root","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/spark\/spark-2.0.1-bin-hadoop2.6.tgz"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3\/frameworks\/cd6497c0-d374-4957-b861-286fd0b27587-13920\/executors\/0\/runs\/feb948b2-1b37-4962-a682-77aeb45a3063","user":"root"}
I1224 09:02:00.483033 23118 fetcher.cpp:409] Fetching URI '/spark/spark-2.0.1-bin-hadoop2.6.tgz'
I1224 09:02:00.483050 23118 fetcher.cpp:250] Fetching directly into the sandbox directory
I1224 09:02:00.483072 23118 fetcher.cpp:187] Fetching URI '/spark/spark-2.0.1-bin-hadoop2.6.tgz'
I1224 09:02:00.483088 23118 fetcher.cpp:167] Copying resource with command:cp '/spark/spark-2.0.1-bin-hadoop2.6.tgz' '/var/lib/mesos/slaves/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3/frameworks/cd6497c0-d374-4957-b861-286fd0b27587-13920/executors/0/runs/feb948b2-1b37-4962-a682-77aeb45a3063/spark-2.0.1-bin-hadoop2.6.tgz'
I1224 09:02:00.675047 23118 fetcher.cpp:84] Extracting with command: tar -C '/var/lib/mesos/slaves/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3/frameworks/cd6497c0-d374-4957-b861-286fd0b27587-13920/executors/0/runs/feb948b2-1b37-4962-a682-77aeb45a3063' -xf '/var/lib/mesos/slaves/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3/frameworks/cd6497c0-d374-4957-b861-286fd0b27587-13920/executors/0/runs/feb948b2-1b37-4962-a682-77aeb45a3063/spark-2.0.1-bin-hadoop2.6.tgz'
I1224 09:02:02.797693 23118 fetcher.cpp:92] Extracted '/var/lib/mesos/slaves/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3/frameworks/cd6497c0-d374-4957-b861-286fd0b27587-13920/executors/0/runs/feb948b2-1b37-4962-a682-77aeb45a3063/spark-2.0.1-bin-hadoop2.6.tgz' into '/var/lib/mesos/slaves/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3/frameworks/cd6497c0-d374-4957-b861-286fd0b27587-13920/executors/0/runs/feb948b2-1b37-4962-a682-77aeb45a3063'
I1224 09:02:02.797751 23118 fetcher.cpp:547] Fetched '/spark/spark-2.0.1-bin-hadoop2.6.tgz' to '/var/lib/mesos/slaves/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3/frameworks/cd6497c0-d374-4957-b861-286fd0b27587-13920/executors/0/runs/feb948b2-1b37-4962-a682-77aeb45a3063/spark-2.0.1-bin-hadoop2.6.tgz'
I1224 09:02:02.850260 23144 exec.cpp:162] Version: 1.1.0
I1224 09:02:02.852879 23147 exec.cpp:237] Executor registered on agent bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/12/24 09:02:03 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 23155@mr02
16/12/24 09:02:03 INFO SignalUtils: Registered signal handler for TERM
16/12/24 09:02:03 INFO SignalUtils: Registered signal handler for HUP
16/12/24 09:02:03 INFO SignalUtils: Registered signal handler for INT
16/12/24 09:02:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/24 09:02:04 INFO SecurityManager: Changing view acls to: root
16/12/24 09:02:04 INFO SecurityManager: Changing modify acls to: root
16/12/24 09:02:04 INFO SecurityManager: Changing view acls groups to: 
16/12/24 09:02:04 INFO SecurityManager: Changing modify acls groups to: 
16/12/24 09:02:04 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
16/12/24 09:02:04 INFO TransportClientFactory: Successfully created connection to /10.210.197.151:57731 after 82 ms (0 ms spent in bootstraps)

Sometimes when we restart this job, job gets initialized, it registers as framework in Mesos, it starts Spark web UI, but it does nothing, all tasks are queued and it is doing some weird stuff (calculating stages of WHAT?). Sometimes it takes several restarts to make it work like it should, which is not desired behaviour, since process should be automated.

#### START OF Abnormal stderr
I1224 08:32:11.346411 19741 exec.cpp:162] Version: 1.1.0
I1224 08:32:11.348379 19743 exec.cpp:237] Executor registered on agent bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@730: Client environment:host.name=mr02
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@737: Client environment:os.name=Linux
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@738: Client environment:os.arch=3.13.0-100-generic
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@739: Client environment:os.version=#147-Ubuntu SMP Tue Oct 18 16:48:51 UTC 2016
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@747: Client environment:user.name=(null)
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@755: Client environment:user.home=/root
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@767: Client environment:user.dir=/var/lib/mesos/slaves/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3/frameworks/aee5cbd3-b3e8-431d-bc18-fb5107d54074-0000/executors/jobs_storagetraffic17.76d13635-c9b3-11e6-bf95-bc764e104d04/runs/cd1a62ea-4cda-4599-a6fc-e7c5d1ad3956
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@zookeeper_init@800: Initiating client connection, host=mr01:2181,mr02:2181,mr03:2181 sessionTimeout=10000 watcher=0x7f8872f72200 sessionId=0 sessionPasswd=<null> context=0x7f883c000b10 flags=0
I1224 08:32:13.839673 19843 sched.cpp:226] Version: 1.1.0
2016-12-24 08:32:13,840:19753(0x7f88577fe700):ZOO_INFO@check_events@1728: initiated connection to server [10.210.224.189:2181]
2016-12-24 08:32:13,845:19753(0x7f88577fe700):ZOO_INFO@check_events@1775: session establishment complete on server [10.210.224.189:2181], sessionId=0x3592d18d17a001a, negotiated timeout=10000
I1224 08:32:13.845521 19838 group.cpp:340] Group process (zookeeper-group(1)@10.210.198.20:43514) connected to ZooKeeper
I1224 08:32:13.845562 19838 group.cpp:828] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I1224 08:32:13.845579 19838 group.cpp:418] Trying to create path '/mesos' in ZooKeeper
I1224 08:32:13.846849 19838 detector.cpp:152] Detected a new leader: (id='27')
I1224 08:32:13.846951 19838 group.cpp:697] Trying to get '/mesos/json.info_0000000027' in ZooKeeper
I1224 08:32:13.847759 19841 zookeeper.cpp:259] A new leading master ([email protected]:5050) is detected
I1224 08:32:13.847839 19838 sched.cpp:330] New master detected at [email protected]:5050
I1224 08:32:13.848011 19838 sched.cpp:341] No credentials provided. Attempting to register without authentication
I1224 08:32:13.850549 19841 sched.cpp:743] Framework registered with cd6497c0-d374-4957-b861-286fd0b27587-13914
16/12/24 08:32:14 INFO utils.VerifiableProperties: Verifying properties
16/12/24 08:32:14 INFO utils.VerifiableProperties: Property group.id is overridden to 
16/12/24 08:32:14 INFO utils.VerifiableProperties: Property zookeeper.connect is overridden to 
16/12/24 08:32:20 INFO utils.VerifiableProperties: Verifying properties
16/12/24 08:32:20 INFO utils.VerifiableProperties: Property group.id is overridden to 
16/12/24 08:32:20 INFO utils.VerifiableProperties: Property zookeeper.connect is overridden to 

[Stage 0:>                                                         (0 + 0) / 31]
[Stage 0:>                                                         (0 + 1) / 31]
[Stage 0:>                                                         (0 + 2) / 31]
[Stage 0:=>                                                        (1 + 3) / 31]
[Stage 0:==================================>                      (19 + 0) / 31]


[Stage 1:==================================>                      (19 + 0) / 31]


[Stage 2:==================================>                      (19 + 0) / 31]


[Stage 3:==================================>                      (19 + 0) / 31]


[Stage 4:==================================>                      (19 + 0) / 31]


[Stage 5:==================================>                      (19 + 0) / 31]


[Stage 6:==================================>                      (19 + 0) / 31]


[Stage 7:==================================>                      (19 + 0) / 31]


[Stage 8:==================================>                      (19 + 0) / 31]


[Stage 9:==================================>                      (19 + 0) / 31]


[Stage 10:==================================>                     (19 + 0) / 31]


[Stage 11:==================================>                     (19 + 0) / 31]


[Stage 12:==================================>                     (19 + 0) / 31]


[Stage 13:==================================>                     (19 + 0) / 31]
[Stage 13:==================================>                     (19 + 2) / 31]


[Stage 14:==================================>                     (19 + 0) / 31]
[Stage 14:=====================================>                  (21 + 2) / 31]


[Stage 15:==================================>                     (19 + 0) / 31]
[Stage 15:=====================================>                  (21 + 2) / 31]


[Stage 16:==================================>                     (19 + 0) / 31]
[Stage 16:=========================================>              (23 + 2) / 31]


[Stage 17:==================================>                     (19 + 0) / 31]
[Stage 17:=============================================>          (25 + 2) / 31]


[Stage 18:==================================>                     (19 + 0) / 31]
[Stage 18:=============================================>          (25 + 2) / 31]


[Stage 19:==================================>                     (19 + 0) / 31]
[Stage 19:=============================================>          (25 + 2) / 31]


[Stage 20:==================================>                     (19 + 0) / 31]
[Stage 20:===========================================>            (24 + 2) / 31]


[Stage 21:==================================>                     (19 + 0) / 31]


[Stage 22:==================================>                     (19 + 0) / 31]


[Stage 23:==================================>                     (19 + 0) / 31]


[Stage 24:==================================>                     (19 + 0) / 31]


[Stage 25:==================================>                     (19 + 0) / 31]


[Stage 26:==================================>                     (19 + 0) / 31]


[Stage 27:==================================>                     (19 + 0) / 31]


[Stage 28:==================================>                     (19 + 0) / 31]


[Stage 29:==================================>                     (19 + 0) / 31]


[Stage 30:==================================>                     (19 + 0) / 31]


[Stage 31:==================================>                     (19 + 0) / 31]


[Stage 32:==================================>                     (19 + 0) / 31]


[Stage 33:==================================>                     (19 + 0) / 31]


[Stage 34:==================================>                     (19 + 0) / 31]


[Stage 35:==================================>                     (19 + 0) / 31]


[Stage 36:==================================>                     (19 + 0) / 31]


[Stage 37:==================================>                     (19 + 0) / 31]


[Stage 38:==================================>                     (19 + 0) / 31]


[Stage 39:==================================>                     (19 + 0) / 31]

I suspect thet problem is in first few lines of the log file, for some reason abnormal job does not have fetcher part, maybe it is not copying /spark/spark-2.0.1-bin-hadoop2.6.tgz and does not have binaries to start...but then, how does it starts Spark UI?

Does anyone has any ideas how to solve this problem, where to look for additional clues, what to test to get more data?

Thanks,

Ivan

0

There are 0 best solutions below