I am using the pre-trained model from HuggingFace (dslim/bert-base-NER). Normally when working locally or in a Colab notebook, we can use the code below to load the model:
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
This code connects to the HuggingFace server and downloads the model directly. But on Palantir, we cannot reach external servers unless the security/network egress settings are configured to allow it.
The workaround is to download all the model files and upload them to a dataset. We can then pass a folder containing these files to the HuggingFace function (check this answer), like below:
tokenizer = AutoTokenizer.from_pretrained('./local_model_directory/')
model = AutoModelForTokenClassification.from_pretrained('./local_model_directory/')
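Before uploading, it can help to sanity-check that the downloaded folder contains the files `from_pretrained` expects. The sketch below is a minimal, stdlib-only check; the file list reflects what `save_pretrained` typically writes for a BERT-style checkpoint and may differ for other models:

```python
import os

# Files save_pretrained typically writes for a BERT-style checkpoint;
# adjust this list for your model (e.g. model weights may be stored as
# pytorch_model.bin or model.safetensors).
EXPECTED_FILES = ["config.json", "vocab.txt", "tokenizer_config.json"]


def missing_model_files(directory, expected=EXPECTED_FILES):
    """Return the expected files that are absent from `directory`."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(directory, name))]
```

If `missing_model_files` returns a non-empty list, the upload would produce a dataset that `from_pretrained` cannot load.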
For a model that has only a pickle file, we can easily read that file via a dataset called classifier:
import pickle

from transforms.api import transform, Input

@transform(
    file_input=Input("/Users/model/classifier")
)
def load_model(file_input):
    # Read the pickled classifier from the dataset's filesystem
    with file_input.filesystem().open("classifier.pkl", "rb") as f:
        model = pickle.load(f)
    return model
My question is: How can we open a dataset and pass a whole directory to this function on the Palantir code repository?
Update: palantir_models now publishes palantir_models.transforms.copy_model_to_driver, which can be used in place of copy_to_temp_directory.

Original answer:
You can copy the files into a TemporaryDirectory with the util copy_to_temp_directory below; I've also provided an example usage. If your Hugging Face model is large, you may need to specify a larger Spark profile such as DRIVER_MEMORY_MEDIUM or DRIVER_MEMORY_LARGE to ensure the Spark driver has enough memory for the model files to load correctly. Hopefully this helps!