I am trying to follow this blog post from Google using their new Cloud ML tools. I am running from within their provided Docker image:
docker pull gcr.io/cloud-datalab/datalab:local
docker run -it -p "127.0.0.1:8080:8080" \
--entrypoint=/bin/bash \
gcr.io/cloud-datalab/datalab:local
starting from:
root@9e93221352d8:~/google-cloud-ml/samples/flowers#
To run the first preprocessing step:
# Assign appropriate values.
PROJECT=$(gcloud config list project --format "value(core.project)")
JOB_ID="flowers_${USER}_$(date +%Y%m%d_%H%M%S)"
BUCKET="gs://${PROJECT}-ml"
GCS_PATH="${BUCKET}/${USER}/${JOB_ID}"
DICT_FILE=gs://cloud-ml-data/img/flower_photos/dict.txt
# Preprocess the eval set.
python trainer/preprocess.py \
--input_dict "$DICT_FILE" \
--input_path "gs://cloud-ml-data/img/flower_photos/eval_set.csv" \
--output_path "${GCS_PATH}/preproc/eval" \
--cloud
returns
(27042c30421ec530): Workflow failed. Causes: (70e56dda0121e0fa): One or more access checks for temp location or staged files failed. Please refer to other error messages for details. For more information on security and permissions, please see https://cloud.google.com/dataflow/security-and-permissions.
Heading to the console, the logs read:
(531d956bf99b5f27): Staged package cloudml.latest.tar.gz at location 'gs://api-project-773889352370-ml/flowers__20170106_123249/preproc/staging/flowers-20170106-123312.1483705994.201001/cloudml.latest.tar.gz' is inaccessible.
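For anyone debugging the same error: the log names the exact staged object, so one check I found useful (my own, not from the tutorial) is to see whether that object was ever written and who can access the bucket:

# Was the staged package actually written at the path from the log?
gsutil ls "${GCS_PATH}/preproc/staging/"
# Which accounts can read/write the bucket?
gsutil acl get "${BUCKET}"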
I tried authenticating again with
gcloud beta auth application-default login
and getting the key from the browser. Nothing seems wrong there.
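As an extra sanity check (not part of the tutorial), you can confirm which credentials are active inside the container:

# Shows credentialed accounts; the active one is marked with an asterisk
gcloud auth list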
I have successfully run the MNIST Cloud ML tutorial, so there are no authentication issues communicating with Google Compute Engine.
I can confirm the path to my bucket is correct:
root@9e93221352d8:~/google-cloud-ml/samples/flowers# echo ${GCS_PATH}
gs://api-project-773889352370-ml//flowers__20170106_165608
but no folder flowers__20170106_165608 is ever created (presumably due to permissions).
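One way to confirm that (my own check; it assumes gsutil in the container uses the same credentials as gcloud):

# gsutil reports that the URL matched no objects if nothing was written
gsutil ls "${GCS_PATH}"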
Does Dataflow need separate credentials? I went to the console and made sure the Dataflow API is enabled for my project. Is there anything to check beyond the following?
root@9e93221352d8:~/google-cloud-ml/samples/flowers# gcloud config list
Your active configuration is: [default]
[component_manager]
disable_update_check = True
[compute]
region = us-central1
zone = us-central1-a
[core]
account = #### <- scrubbed for SO; it's correct.
project = api-project-773889352370
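In case it is relevant, the project's IAM policy can be dumped to see which service accounts hold which roles (my own check; command availability may vary by gcloud version):

# Per the Dataflow security-and-permissions page, the accounts to look for are
# 773889352370@cloudservices.gserviceaccount.com and the Compute Engine
# default account 773889352370-compute@developer.gserviceaccount.com
gcloud projects get-iam-policy api-project-773889352370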
Edit: To show the service accounts tab on the console:
Edit: Accepted answer below. I'm accepting this answer because Jeremy Lewi is correct. The problem is not that Dataflow lacks permissions, but that the GCS object was never created. Going into the preprocessing job's logs, you can see why:
The tutorial Google provides is probably not well configured for the free tier; I'm guessing it distributes work across too many instances and exceeds the CPU quota. If I cannot solve it, I will open a correctly framed question.
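If the quota guess is right, two things worth trying (untested on my side): inspect the regional quota, and cap the Dataflow worker count, assuming preprocess.py forwards unrecognized flags to the Dataflow pipeline options (I have not verified that it does):

# Show CPU/instance quotas for the region the job runs in
gcloud compute regions describe us-central1

# Hypothetical rerun with a worker cap, if the flag is passed through
python trainer/preprocess.py \
  --input_dict "$DICT_FILE" \
  --input_path "gs://cloud-ml-data/img/flower_photos/eval_set.csv" \
  --output_path "${GCS_PATH}/preproc/eval" \
  --cloud \
  --max_num_workers 2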
Please see the information about service accounts at the link provided by the error message. I suspect the service account is not authorized correctly to view the staged file.
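If that is the cause, a sketch of one possible fix (substitute your own project number; the account names follow the security-and-permissions page linked in the error) is to grant those service accounts access to the bucket:

# Grant the Dataflow-related service accounts write access to the bucket
gsutil acl ch -u 773889352370-compute@developer.gserviceaccount.com:WRITE "${BUCKET}"
gsutil acl ch -u 773889352370@cloudservices.gserviceaccount.com:WRITE "${BUCKET}"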