I am using the pre-trained model from HuggingFace (dslim/bert-base-NER). Normally when working locally or in a Colab notebook, we can use the code below to load the model:
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
This code connects to the HuggingFace server and downloads the model directly. But on Palantir, we cannot reach external servers unless the security/network egress settings are configured to allow it.
The workaround is to download all the model files and upload them to a dataset. We can then pass a folder containing these files to the HuggingFace function (check this answer), like below:
tokenizer = AutoTokenizer.from_pretrained('./local_model_directory/')
model = AutoModelForTokenClassification.from_pretrained('./local_model_directory/')
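Before uploading, it can help to sanity-check that the downloaded folder contains the files `from_pretrained` expects. The sketch below is a minimal, stdlib-only check; the file list reflects what `save_pretrained` typically writes for a BERT-style checkpoint and may differ for other models:

```python
import os

# Files save_pretrained typically writes for a BERT-style checkpoint;
# adjust this list for your model (e.g. model weights may be stored as
# pytorch_model.bin or model.safetensors).
EXPECTED_FILES = ["config.json", "vocab.txt", "tokenizer_config.json"]


def missing_model_files(directory, expected=EXPECTED_FILES):
    """Return the expected files that are absent from `directory`."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(directory, name))]
```

If `missing_model_files` returns a non-empty list, the upload would produce a dataset that `from_pretrained` cannot load.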
For a model that has only a pickle file, we can easily read that file via a dataset called classifier:
import pickle

from transforms.api import transform, Input

@transform(
    file_input=Input("/Users/model/classifier")
)
def load_model(file_input):
    # Read the pickled classifier from the dataset's filesystem
    with file_input.filesystem().open("classifier.pkl", "rb") as f:
        model = pickle.load(f)
    return model
My question is: How can we open a dataset and pass a whole directory to this function on the Palantir code repository?
Update: palantir_models now publishes palantir_models.transforms.copy_model_to_driver, which can be used in place of copy_to_temp_directory.

Original answer:
You can copy the files into a TemporaryDirectory with the util copy_to_temp_directory below; I've also provided an example usage. If your Hugging Face model is large, you may need to specify a larger Spark profile such as DRIVER_MEMORY_MEDIUM or DRIVER_MEMORY_LARGE to ensure the Spark driver has enough memory for the model files to load correctly. Hopefully this helps!