I have a Snowflake table with an ARRAY column containing custom embeddings (array size > 1000).
These arrays are sparse, and I would like to reduce their dimensionality with SVD (or one of the Snowpark ml.modeling.decomposition methods).
A toy example of the dataframe would be:
df = session.sql("""
select 'doc1' as doc_id, array_construct(0.1, 0.3, 0.5, 0.7) as doc_vec
union
select 'doc2' as doc_id, array_construct(0.2, 0.4, 0.6, 0.8) as doc_vec
""")
df.show()
# DOC_ID | DOC_VEC
# doc1 | [ 0.1, 0.3, 0.5, 0.7 ]
# doc2 | [ 0.2, 0.4, 0.6, 0.8 ]
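For reference, the column names and datatypes of this toy dataframe can be checked as below; I would expect Snowflake to report the unquoted identifiers upper-cased (DOC_ID, DOC_VEC) and DOC_VEC as an ARRAY type (this check is just a sketch, not part of the error reproduction):
# quick sanity check on the column names and datatypes
print(df.columns)
for f in df.schema.fields:
    print(f.name, f.datatype)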
However, when I try to fit this dataframe:
from snowflake.ml.modeling.decomposition import TruncatedSVD
tsvd = TruncatedSVD(input_cols='doc_vec', output_cols='out_svd')
print(tsvd)
out = tsvd.fit(df)
I get:
  File "snowflake/ml/modeling/_internal/snowpark_trainer.py", line 218, in fit_wrapper_function
    args = {"X": df[input_cols]}
            ~~^^^^^^^^^^^^
  File "pandas/core/frame.py", line 3767, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
<...snip...>
KeyError: "None of [Index(['doc_vec'], dtype='object')] are in the [columns]"
Based on the information in this tutorial, text_embedding_as_snowpark_python_udf,
I suspect the Snowpark array needs to be converted to a np.ndarray before being fed to the underlying sklearn.decomposition.TruncatedSVD.
Can someone point me to an example that uses Snowflake arrays as inputs to the Snowpark models, please?
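For what it's worth, the only workaround I have come up with is to flatten the ARRAY into one numeric column per dimension before fitting. This is an untested sketch; the names df_flat, V0..V3, OUT_SVD_0/OUT_SVD_1 and the FLOAT cast are my own guesses rather than anything taken from the Snowflake docs:
from snowflake.snowpark.functions import col
from snowflake.snowpark.types import FloatType
from snowflake.ml.modeling.decomposition import TruncatedSVD

# Untested sketch: expand DOC_VEC into one FLOAT column per dimension,
# assuming the modeling classes expect plain numeric columns rather than ARRAYs.
dim = 4  # toy example; the real embeddings have > 1000 elements
flat_cols = [col("doc_vec")[i].cast(FloatType()).alias(f"V{i}") for i in range(dim)]
df_flat = df.select(col("doc_id"), *flat_cols)

# One output column per component; I am assuming output_cols must match n_components.
tsvd = TruncatedSVD(n_components=2,
                    input_cols=[f"V{i}" for i in range(dim)],
                    output_cols=["OUT_SVD_0", "OUT_SVD_1"])
out = tsvd.fit(df_flat)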
The problem right now is that Snowflake doesn't support sparse matrices (but it will).
A teammate wrote this sample code using the syntax that will be supported in the future: