Dataflow is failing with java.io.IOException on f1-micro instances

I am trying the WordCount example provided by Google. It runs successfully on my local machine.

But when I run it on Google Cloud, it fails with the following exception:

java.io.IOException: INTERNAL: Finalize rejected (writer id not found) when talking to tcp://localhost:12345

The exception message is not clear either.

I found that this happens when one step of the job is in the "Part running" state and the step after it is not running.

When I removed the Sum.Perkey transform, the job ran successfully.

EDIT 1

The log says the following

    Jun 23, 2015, 5:21:27 PM
    (306b526c890d6a9e): java.io.IOException: INTERNAL: Finalize rejected (writer id not found) when talking to tcp://localhost:12345
        at com.google.cloud.dataflow.sdk.runners.worker.ApplianceShuffleWriter.close(Native Method)
        at com.google.cloud.dataflow.sdk.runners.worker.ChunkingShuffleEntryWriter.close(ChunkingShuffleEntryWriter.java:66)
        at com.google.cloud.dataflow.sdk.runners.worker.ShuffleSink$ShuffleSinkWriter.close(ShuffleSink.java:232)
        at com.google.cloud.dataflow.sdk.util.common.worker.WriteOperation.finish(WriteOperation.java:100)
        at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:74)
        at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:130)
        at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:95)
        at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:139)
        at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:124)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

My job ID is: 2015-06-23_04_49_22-5338020413017331855

Please help me understand why this is happening.

There is 1 best solution below

I fixed this issue.

I set the worker machine type in my pipeline options to

g1-small

Previously I was using

f1-micro

It seems that when a pipeline uses Combine or GroupByKey transforms, the workers have to run on a g1-small machine rather than f1-micro.

However, I couldn't find this information anywhere in the Dataflow documentation.

It would be nice if Google documented which Compute Engine instance types to use for Dataflow. That would have saved me a lot of time.
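
For anyone who wants to apply the same fix from code, here is a minimal sketch of one way to force the worker machine type through the job arguments. It assumes the pipeline options are parsed from command-line style arguments (for example with `PipelineOptionsFactory.fromArgs`) and that the machine type is given by the `--workerMachineType` flag. The class `WorkerMachineTypeArgs` and its method are hypothetical names made up for this illustration, and the rejection of `f1-micro` only reflects the failure described in this answer, not a documented limit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of the Dataflow SDK): builds job arguments with an explicit worker machine type. */
final class WorkerMachineTypeArgs {

    // Assumption from this answer only: f1-micro workers failed on the shuffle step.
    static final String TOO_SMALL = "f1-micro";

    static final String FLAG = "--workerMachineType=";

    /** Returns a copy of baseArgs in which any existing machine type flag is replaced by the given type. */
    static String[] withWorkerMachineType(String[] baseArgs, String machineType) {
        if (TOO_SMALL.equals(machineType)) {
            throw new IllegalArgumentException(
                    machineType + " failed for this job; try g1-small instead");
        }
        List<String> args = new ArrayList<>();
        for (String arg : baseArgs) {
            // Drop any machine type that was already set so the new value wins.
            if (!arg.startsWith(FLAG)) {
                args.add(arg);
            }
        }
        args.add(FLAG + machineType);
        return args.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] jobArgs = withWorkerMachineType(args, "g1-small");
        // In a real pipeline, jobArgs would be handed to the options parser
        // (e.g. PipelineOptionsFactory.fromArgs(jobArgs)) before creating the Pipeline.
        System.out.println(Arrays.toString(jobArgs));
    }
}
```

Passing `--workerMachineType=g1-small` directly on the command line should have the same effect; the helper only makes the choice explicit and keeps an accidental `f1-micro` from slipping through.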