Save and load a spacy model to a google cloud storage bucket

I have a spaCy model and I am trying to save it to a GCS bucket like this:

trainer.to_disk('gs://{bucket-name}/model')

But each time I run this I get this error message

FileNotFoundError: [Errno 2] No such file or directory: 'gs:/{bucket-name}/model'

Also, when I create a Kubeflow persistent volume and save the model there, then try to load it with trainer.load('model'), I get this error message:

File "/usr/local/lib/python3.7/site-packages/spacy/__init__.py", line 30, in load
    return util.load_model(name, **overrides)
  File "/usr/local/lib/python3.7/site-packages/spacy/util.py", line 175, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model '/model/'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

I don't understand why I am having these errors, as this works perfectly when I run it on my PC locally and use a local path.

There are 3 answers below.

Answer 1 (score 0):

Since you've tagged this question "kubeflow-pipelines", I'll answer from that perspective.

KFP strives to be platform-agnostic, and most good components are cloud-independent. KFP promotes system-managed artifact passing, where the component code only writes output data to local files and the system picks it up and makes it available to other components.

So, it's best to write your spaCy model trainer that way: have it write the trained model to local files and let the system handle the rest. Check how other components work, for example Train Keras classifier.
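
A minimal sketch of what that could look like with the KFP v1 SDK (the train_spacy_model function, the blank "en" pipeline stub, the base image and the package list are all illustrative placeholders, not your actual training code): the component only writes the trained pipeline to a local output path supplied by KFP, and the system takes care of moving the artifact around.

from kfp.components import create_component_from_func, OutputPath

def train_spacy_model(model_path: OutputPath()):
    """Train a spaCy pipeline and write it to the local path KFP provides."""
    import spacy
    nlp = spacy.blank("en")   # stand-in for your real training code
    # ... add pipes, run training here ...
    nlp.to_disk(model_path)   # plain local write; no GCS-specific code

# Turn the function into a reusable pipeline component.
train_spacy_model_op = create_component_from_func(
    train_spacy_model,
    base_image="python:3.7",
    packages_to_install=["spacy"],
)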

Since you want to upload to GCS, do that explicitly by passing the model output of your trainer to an "Upload to GCS" component:

from kfp import components

upload_to_gcs_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/616542ac0f789914f4eb53438da713dd3004fba4/components/google-cloud/storage/upload_to_explicit_uri/component.yaml')

def my_pipeline():
    # Train the model and pass its output artifact to the upload component.
    model = train_spacy_model(...).outputs['model']

    upload_to_gcs_op(
        data=model,
        gcs_path='gs://.....',
    )
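
If it helps, a minimal way to actually submit that pipeline (assuming the default in-cluster KFP endpoint and credentials) would look something like this:

import kfp

# Submit the pipeline for a one-off run.
kfp.Client().create_run_from_pipeline_func(my_pipeline, arguments={})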

Answer 2 (score 2):

Cloud Storage is not a local disk or a physical storage unit that you can save things to directly.

As you say, it works when you run

this on my pc locally and use a local path

but Cloud Storage is not a local path, for your code or for any other tool running in the cloud; it is remote object storage.

If you are using Python, you will have to create a client with the google-cloud-storage library and then upload your file using upload_blob, e.g.:

from google.cloud import storage


def upload_blob(bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to the bucket."""
    # bucket_name = "your-bucket-name"
    # source_file_name = "local/path/to/file"
    # destination_blob_name = "storage-object-name"

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)

    blob.upload_from_filename(source_file_name)
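
One caveat: spaCy's to_disk() writes a directory tree, not a single file, so you would upload every file under it. A hedged sketch of that (the bucket name, paths and the upload_directory helper below are illustrative, not an official API):

import os
from google.cloud import storage

def upload_directory(bucket_name, source_dir, destination_prefix):
    """Upload every file under source_dir to gs://bucket_name/destination_prefix/."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    for root, _, files in os.walk(source_dir):
        for name in files:
            local_path = os.path.join(root, name)
            rel_path = os.path.relpath(local_path, source_dir)
            bucket.blob(f"{destination_prefix}/{rel_path}").upload_from_filename(local_path)

# Example usage (replace the names with your own):
# trainer.to_disk("/tmp/model")
# upload_directory("your-bucket-name", "/tmp/model", "model")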

Answer 3 (score 0):

The following implementation assumes you have gsutil installed on your computer. The spaCy version used was 3.2.4. In my case, I wanted everything to be part of a single (demo) Python file, spacy_import_export.py. To do so, I had to use the subprocess Python library, plus this comment, as follows:

# spacy_import_export.py

import spacy
import subprocess  # Will be used later

# spaCy models trained by the user are always stored as LOCAL directories, with more subdirectories and files in them.
PATH_TO_MODEL = "/home/jupyter/"  # Use your own path!

# Test-loading your "trainer" (optional step)
trainer = spacy.load(PATH_TO_MODEL+"model")

# Replace 'bucket-name' with the one of your own:
bucket_name = "destination-bucket-name"
GCS_BUCKET = "gs://{}/model".format(bucket_name)

# This does the trick for the UPLOAD to Cloud Storage:
# TIP: Just for security, check Cloud Storage afterwards: "model" should be in GCS_BUCKET
subprocess.run(["gsutil", "-m", "cp", "-r", PATH_TO_MODEL+"model", GCS_BUCKET])

# This does the trick for the DOWNLOAD:
# HINT: Afterwards, in PATH_TO_MODEL, you should have both "model" & "downloaded_model"
subprocess.run(["gsutil", "-m", "cp", "-r", GCS_BUCKET+"/*", PATH_TO_MODEL+"downloaded_model"])

# Test-loading your "GCS downloaded model" (optional step)
nlp_original = spacy.load(PATH_TO_MODEL+"downloaded_model")

I apologize for the excess of comments; I just wanted to make everything clear for spaCy newcomers. I know it is a bit late, but I hope it helps.