I am using a Snowflake Python worksheet to perform text analysis on some data in a Snowflake table. This includes lemmatizing the text.
I created this function in the Snowflake Python worksheet:
def lemmatize_text(text):
    # Initialize NLTK's WordNet Lemmatizer for lemmatization
    lemmatizer = WordNetLemmatizer()
    words = nltk.word_tokenize(text)
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized_words)
It gives me an error saying that word_tokenize is not a known member of nltk. I suppose it is not supported directly in the Snowflake Anaconda packages.
How can I solve this problem?
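If the tokenizer turns out to be the sticking point, one possible workaround is to skip nltk.word_tokenize and split the text with a regular expression. The sketch below is only an illustration under that assumption: simple_tokenize and its pattern are hypothetical names, not part of NLTK, and the lemmatizer is passed in as a plain callable so the example runs without NLTK installed.

```python
import re
from typing import Callable, List

# Hypothetical pattern (an assumption, not NLTK's tokenizer):
# a word with an optional apostrophe suffix, or a single punctuation mark.
_TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]")


def simple_tokenize(text: str) -> List[str]:
    """Split text into word and punctuation tokens using only the stdlib."""
    return _TOKEN_RE.findall(text)


def lemmatize_text(text: str, lemmatize: Callable[[str], str] = lambda w: w) -> str:
    """Apply a per-word lemmatizer callable and join the tokens with spaces."""
    return " ".join(lemmatize(token) for token in simple_tokenize(text))
```

With NLTK available, WordNetLemmatizer().lemmatize could be passed as the second argument. Note that WordNetLemmatizer still needs the WordNet corpus data to be present, so this only removes the dependency on the tokenizer.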
I am new to Snowflake and Snowpark. In my Jupyter notebook, I tried to create a UDF and put it on a Snowflake stage, but I don't know what to do next.
from snowflake.snowpark import Session
session = Session.builder.configs(connection_parameters).create()

from snowflake.snowpark.functions import udf, sproc, col
from snowflake.snowpark.types import IntegerType, FloatType, StringType, BooleanType, Variant
from snowflake.snowpark import functions as fn
session.sql("CREATE STAGE IF NOT EXISTS nlp_text_analysis").collect()

def lemmatize_text(session: Session, text: str) -> Variant:
    import nltk
    from nltk.stem import PorterStemmer
    from nltk.stem import WordNetLemmatizer
    import re
    nltk.download('punkt')
    nltk.download('wordnet')
    nltk.download('omw-1.4')

    lemmatizer = WordNetLemmatizer()
    words = nltk.word_tokenize(text)
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized_words)

session.sproc.register(func=lemmatize_text, name="lemmatize_text", replace=True)
result: <snowflake.snowpark.stored_procedure.StoredProcedure at 0x1ddd9888850>
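As for the next step after registration, a registered stored procedure can usually be invoked by name through Session.call. The helper below is a minimal sketch, assuming the procedure was registered as lemmatize_text; run_lemmatizer is a hypothetical wrapper name used only for illustration.

```python
def run_lemmatizer(session, text, proc_name="lemmatize_text"):
    """Invoke a registered stored procedure by name and return its result.

    `session` is expected to be a snowflake.snowpark.Session (assumption);
    any object with a compatible `call` method works for this sketch.
    """
    # Only the text is passed explicitly. The handler's Session parameter
    # is supplied by Snowflake when the procedure runs.
    return session.call(proc_name, text)
```

The procedure's dependencies also have to be declared at registration time, for example through the packages argument of session.sproc.register. The nltk.download calls inside the handler are likely to fail, since code running inside Snowflake has no external network access by default.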
Try rolling back the spaCy package version to 3.5.3. That helped me when using the NLP modules.
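A rough sketch of how such a version pin could be requested from Snowpark, assuming the version is actually available in the Snowflake Anaconda channel (not verified here). Session.add_packages accepts requirement strings such as "spacy==3.5.3"; pin_session_packages is a hypothetical helper name.

```python
def pin_session_packages(session, pins):
    """Request exact package versions for code registered through this session.

    `pins` maps package names to version strings, e.g. {"spacy": "3.5.3"}.
    """
    # Build "name==version" requirement strings in a stable order.
    specs = [f"{name}=={version}" for name, version in sorted(pins.items())]
    # Assumes `session` is a snowflake.snowpark.Session.
    session.add_packages(*specs)
    return specs
```

For example, pin_session_packages(session, {"spacy": "3.5.3"}) would request that version for UDFs and stored procedures registered afterwards on the same session.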