I hope someone has done this before, or can advise whether or not GridGain supports this functionality.
My use case is:
- Start a GridGain node using examples/config/example-compute.xml modified to support work stealing (see below)
- Submit 300 tasks to the cluster. They start executing on the first node, but because they take time to execute, a long queue of outstanding tasks builds up
- Start a new node using the same configuration and watch it join the cluster
- Shouldn't node 2 steal some of the work from the first node? Unfortunately it does not, and we have to wait for all the tasks to finish on node 1 while node 2 does nothing
I think that GridJobStealingCollisionSpi is doing something, because when I turn on debug logging I can see the following message in the log: Thief node does not belong to task topology [...]. Looking through the source, what I think is happening is that GridJobStealingCollisionSpi is checking whether the stealing node is in the topology that the task was submitted for.
Has anyone seen my use case working as I would expect?
I have modified example-compute.xml (you can find the whole file at pastebin.com/gGsfEebG) to support work stealing by adding the following config:
<property name="collisionSpi">
    <bean class="org.gridgain.grid.spi.collision.jobstealing.GridJobStealingCollisionSpi">
        <property name="activeJobsThreshold" value="50" />
        <property name="waitJobsThreshold" value="10" />
        <property name="messageExpireTime" value="1000" />
        <property name="maximumStealingAttempts" value="100" />
        <property name="stealingEnabled" value="true" />
    </bean>
</property>
<property name="failoverSpi">
    <bean class="org.gridgain.grid.spi.failover.jobstealing.GridJobStealingFailoverSpi">
        <property name="maximumFailoverAttempts" value="5" />
    </bean>
</property>
<property name="metricsUpdateFrequency" value="1000"/>
My Java class can be found on pastebin at http://pastebin.com/AS8iKqjj. Here are detailed instructions to run it:
- Run the ComputeSleepExample class, which starts a node and submits 300 jobs to the cluster, each of which sleeps for 5 seconds:
java -DGRIDGAIN_DEBUG_ENABLED=true -DGRIDGAIN_QUIET=false -cp examples/config:examples/target/classes:examples/target/libs/*:target/gridgain-6.1.9.jar:modules/spring/target/gridgain-spring-6.1.9.jar org.gridgain.examples.compute.ComputeSleepExample 300 5000
- Start a new node, and you will see that all jobs are still executed on node 1:
bin/ggstart.sh examples/config/example-compute.xml
The problem is that a task currently defines its node topology during the map step. If you add a new node after the task has already finished its mapping step, the new node will not be in the task topology. That is why it does not participate in job stealing.
If you submit more tasks after the new node has started, the new node should start participating in job stealing immediately.
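To make the timing issue concrete, here is a minimal sketch in plain Java of the behavior described above. It does not use the GridGain API and is not how GridJobStealingCollisionSpi is implemented; the names TaskTopologyDemo, Cluster, Task, map and canBeStolenBy are hypothetical and exist only for this illustration. The only assumption it encodes is the one stated above: a task's topology is a snapshot of the nodes present at map time, and a thief outside that snapshot is rejected.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TaskTopologyDemo {
    /** Hypothetical stand-in for a cluster: just a growing list of node names. */
    static final class Cluster {
        private final List<String> nodes = new ArrayList<>();

        void join(String node) {
            nodes.add(node);
        }

        /** "Maps" a task: its topology is a copy of the nodes present right now. */
        Task map(String taskName) {
            return new Task(taskName, new HashSet<>(nodes));
        }
    }

    /** Hypothetical task whose topology is frozen once mapping is done. */
    static final class Task {
        final String name;
        final Set<String> topology;

        Task(String name, Set<String> topology) {
            this.name = name;
            this.topology = Collections.unmodifiableSet(topology);
        }

        /** A thief may take jobs only if it was part of the topology at map time. */
        boolean canBeStolenBy(String thiefNode) {
            return topology.contains(thiefNode);
        }
    }

    public static void main(String[] args) {
        Cluster cluster = new Cluster();
        cluster.join("node1");

        // Task mapped while only node1 is in the cluster.
        Task early = cluster.map("early");

        // node2 joins after the first task has finished mapping.
        cluster.join("node2");

        // Task mapped after node2 joined.
        Task late = cluster.map("late");

        System.out.println("node2 can steal from early task: " + early.canBeStolenBy("node2"));
        System.out.println("node2 can steal from late task: " + late.canBeStolenBy("node2"));
    }
}
```

Running the sketch prints false for the early task and true for the late one, which mirrors the symptom in the question: the 300 jobs were mapped before node 2 existed, so node 2 is never an eligible thief for them.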
Having said that, it is possible to make the task topology dynamic, so that jobs can be stolen even after the mapping step. GridGain will implement this in a future release.