Storm Topology does not start with parallelism hint of 1200

420 Views Asked by At
Version Info: 
   "org.apache.storm" % "storm-core" % "1.2.1" 
   "org.apache.storm" % "storm-kafka-client" % "1.2.1" 

I have a storm Topology with 3 bolts(A,B,C), Where the middle bolt takes around 450ms mean time and other two bolts takes less than 1ms.

I am able to run topology with following parallelism hint values:

A: 4 
B: 700
C: 10

But when I increase parallelism hint of B to 1200, the topology does not start.

In the topology logs, I see logs to load the executor: B multiple times, like this:

2018-05-18 18:56:37.462 o.a.s.d.executor main [INFO] Loading executor B:[111 111]
2018-05-18 18:56:37.463 o.a.s.d.executor main [INFO] Loaded executor tasks B:[111 111]
2018-05-18 18:56:37.465 o.a.s.d.executor main [INFO] Finished loading executor B:[111 111]
2018-05-18 18:56:37.528 o.a.s.d.executor main [INFO] Loading executor B:[355 355]
2018-05-18 18:56:37.529 o.a.s.d.executor main [INFO] Loaded executor tasks B:[355 355]
2018-05-18 18:56:37.530 o.a.s.d.executor main [INFO] Finished loading executor B:[355 355]
2018-05-18 18:56:37.666 o.a.s.d.executor main [INFO] Loading executor B:[993 993]
2018-05-18 18:56:37.667 o.a.s.d.executor main [INFO] Loaded executor tasks B:[993 993]
2018-05-18 18:56:37.669 o.a.s.d.executor main [INFO] Finished loading executor B:[993 993]
2018-05-18 18:56:37.713 o.a.s.d.executor main [INFO] Loading executor B:[765 765]
2018-05-18 18:56:37.714 o.a.s.d.executor main [INFO] Loaded executor tasks B:[765 765]

But in between worker process get restarted. I don't see any error in topology logs or storm logs. Following are storm logs, when worker gets restart:

2018-05-18 18:51:46.755 o.a.s.d.s.Container SLOT_6700 [INFO] Killing eaf4d8ce-e758-4912-a15d-6dab8cda96d0:766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.204 o.a.s.d.s.BasicContainer Thread-7 [INFO] Worker Process 766258fe-a604-4385-8eeb-e85cad38b674 exited with code: 143
2018-05-18 18:51:47.766 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE RUNNING msInState: 109081 topo:myTopology-1-1526649581 worker:766258fe-a604-4385-8eeb-e85cad38b674 -> KILL msInState: 0 topo:myTopology-1-1526649581 worker:766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.766 o.a.s.d.s.Container SLOT_6700 [INFO] GET worker-user for 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.774 o.a.s.d.s.Slot SLOT_6700 [WARN] SLOT 6700 all processes are dead...
2018-05-18 18:51:47.775 o.a.s.d.s.Container SLOT_6700 [INFO] Cleaning up eaf4d8ce-e758-4912-a15d-6dab8cda96d0:766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.775 o.a.s.d.s.Container SLOT_6700 [INFO] GET worker-user for 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.775 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/pids/27798
2018-05-18 18:51:47.775 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/heartbeats
2018-05-18 18:51:47.780 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/pids
2018-05-18 18:51:47.780 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/tmp
2018-05-18 18:51:47.781 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.782 o.a.s.d.s.Container SLOT_6700 [INFO] REMOVE worker-user 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.782 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers-users/766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.783 o.a.s.d.s.BasicContainer SLOT_6700 [INFO] Removed Worker ID 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.783 o.a.s.l.AsyncLocalizer SLOT_6700 [INFO] Released blob reference myTopology-1-1526649581 6700 Cleaning up BLOB references...
2018-05-18 18:51:47.784 o.a.s.l.AsyncLocalizer SLOT_6700 [INFO] Released blob reference myTopology-1-1526649581 6700 Cleaning up basic files...
2018-05-18 18:51:47.785 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/supervisor/stormdist/myTopology-1-1526649581
2018-05-18 18:51:47.808 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE KILL msInState: 42 topo:myTopology-1-1526649581 worker:null -> EMPTY msInState: 0

This keeps happening and topology never restarts, which used to start perfectly when parallelism hint for bolt: B was 700, there is no other change.

I see one interesting log here is, not yet sure what this means:

Worker Process 766258fe-a604-4385-8eeb-e85cad38b674 exited with code: 143

Any Suggestions?

Edit:

Config:

topology.worker.childopts: -Xms1g -Xmx16g
topology.worker.logwriter.childopts: -Xmx1024m
topology.worker.max.heap.size.mb: 3072.0
worker.childopts: -Xms1g -Xmx16g -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=1%ID% -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -XX:+UseG1GC -XX:+AggressiveOpts -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/home/saurabh.mimani/apache-storm-1.2.1/logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -Dorg.newsclub.net.unix.library.path=/usr/share/specter/uds-lib/
worker.gc.childopts:
worker.heap.memory.mb: 8192
supervisor.childopts: -Xms1g -Xmx16g

Edit:

Logs for strace -fp PID -e trace=read,write,network,signal,ipc in gist.

not yet able to understand it fully, some relevant looking from it:

[pid 3362] open("/usr/lib/locale/UTF-8/LC_CTYPE", O_RDONLY) = -1 ENOENT (No such file or directory)

[pid 3362] kill(1487, SIGTERM) = 0

[pid 3362] close(1)

2

There are 2 best solutions below

0
On

I was able to get this running by tweaking following configurations, seems like it was timing out due to nimbus.task.launch.sec, which was set to 120 and it was restarting the worker if it was not started within 120 secs.

Updated value of some of these configs:

drpc.request.timeout.secs: 1600
supervisor.worker.start.timeout.secs: 1200
nimbus.supervisor.timeout.secs: 1200
nimbus.task.launch.secs: 1200

About nimbus.task.launch.sec:

A special timeout used when a task is initially launched. During launch, this is the timeout used until the first heartbeat, overriding nimbus.task.timeout.secs. A separate timeout exists for launch because there can be quite a bit of overhead to launching new JVM's and configuring them.

3
On

Quick google suggests 143 is the exit code for when the JVM receives a SIGTERM (e.g. Always app Java end with "Exit 143" Ubuntu). You might be running out of memory, or the OS may be killing the process for some other reason. Remember that setting the parallelism hint to 1200 means that you will get 1200 tasks (copies) for bolt B, where you only had 700 before.