I am trying the WordCount example provided by Google. It runs successfully when I run it from my local machine, but when I run it on Google Cloud Dataflow it fails with the following exception:
    java.io.IOException: INTERNAL: Finalize rejected (writer id not found) when talking to tcp://localhost:12345
The exception message itself is not very clear either.
I found that this happens when one step of the job is shown as partially running while the step after it never starts.
So when I removed the Sum.PerKey transform, the job ran successfully. A minimal sketch of the pipeline is below.
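To show where that transform sits, here is a sketch of the pipeline shape, assuming the standard word-count structure; the bucket paths, word-splitting logic, and output formatting are placeholders rather than my exact code. The `Sum.integersPerKey()` step is the Combine whose removal makes the job succeed:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.transforms.Sum;
import com.google.cloud.dataflow.sdk.values.KV;

public class WordCountSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());

    p.apply(TextIO.Read.from("gs://my-bucket/input.txt"))    // placeholder path
     // Split each line into words and pair each word with a count of 1.
     .apply(ParDo.of(new DoFn<String, KV<String, Integer>>() {
       @Override
       public void processElement(ProcessContext c) {
         for (String word : c.element().split("[^a-zA-Z']+")) {
           if (!word.isEmpty()) {
             c.output(KV.of(word, 1));
           }
         }
       }
     }))
     // The Combine step that triggers the failure on the service.
     .apply(Sum.<String>integersPerKey())
     // Format each (word, count) pair for text output.
     .apply(ParDo.of(new DoFn<KV<String, Integer>, String>() {
       @Override
       public void processElement(ProcessContext c) {
         c.output(c.element().getKey() + ": " + c.element().getValue());
       }
     }))
     .apply(TextIO.Write.to("gs://my-bucket/output"));       // placeholder path

    p.run();
  }
}
```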
EDIT 1
The log shows the following:
Jun 23, 2015, 5:21:27 PM

    (306b526c890d6a9e): java.io.IOException: INTERNAL: Finalize rejected (writer id not found) when talking to tcp://localhost:12345
        at com.google.cloud.dataflow.sdk.runners.worker.ApplianceShuffleWriter.close(Native Method)
        at com.google.cloud.dataflow.sdk.runners.worker.ChunkingShuffleEntryWriter.close(ChunkingShuffleEntryWriter.java:66)
        at com.google.cloud.dataflow.sdk.runners.worker.ShuffleSink$ShuffleSinkWriter.close(ShuffleSink.java:232)
        at com.google.cloud.dataflow.sdk.util.common.worker.WriteOperation.finish(WriteOperation.java:100)
        at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:74)
        at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:130)
        at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:95)
        at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:139)
        at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:124)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
My Job Id is : 2015-06-23_04_49_22-5338020413017331855
Please help me understand why this is happening.
I fixed this issue. I set the worker machine type in my pipeline options to g1-small; previously I was using a different machine type.
It seems that when we use Combine or GroupBy transforms, we have to use at least a g1-small worker machine.
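For reference, this is roughly how I set it; a sketch against the Dataflow Java SDK options API, assuming the project and staging settings come in on the command line:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

// Inside main(String[] args): build the options from the command-line args
// (project, staging location, etc.), then force the worker machine type
// before creating the pipeline.
DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
    .withValidation()
    .as(DataflowPipelineOptions.class);
options.setWorkerMachineType("g1-small"); // needed for Combine/GroupBy, it seems

Pipeline p = Pipeline.create(options);
```

The same can also be done from the command line with `--workerMachineType=g1-small`.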
However, I couldn't find this information anywhere in the Dataflow documentation.
It would be nice if Google documented how to choose the Compute Engine instance type for Dataflow; that would have saved me a lot of time.