Snowflake ARRAY column as input to Snowpark modeling.decomposition


I have a Snowflake table with an ARRAY column containing custom embeddings (each array has more than 1,000 elements). The arrays are sparse, and I would like to reduce their dimensionality with SVD (or another method from snowflake.ml.modeling.decomposition). A toy example of the dataframe:

df = session.sql("""
    select 'doc1' as doc_id, array_construct(0.1, 0.3, 0.5, 0.7) as doc_vec
    union
    select 'doc2' as doc_id, array_construct(0.2, 0.4, 0.6, 0.8) as doc_vec
    """)
df.show()
# DOC_ID  | DOC_VEC
# doc1 | [   0.1,   0.3,   0.5,   0.7 ]
# doc2 | [   0.2,   0.4,   0.6,   0.8 ]

However, when I try to fit TruncatedSVD on this dataframe:

from snowflake.ml.modeling.decomposition import TruncatedSVD
tsvd = TruncatedSVD(input_cols = 'doc_vec', output_cols='out_svd')
print(tsvd)
out = tsvd.fit(df)

I get the following error:

 File "snowflake/ml/modeling/_internal/snowpark_trainer.py", line 218, in fit_wrapper_function
    args = {"X": df[input_cols]}
                 ~~^^^^^^^^^^^^
 File "pandas/core/frame.py", line 3767, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]

<...snip...>

KeyError: "None of [Index(['doc_vec'], dtype='object')] are in the [columns]"
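One detail that may contribute to this KeyError, separately from the ARRAY question: Snowflake stores unquoted identifiers in upper case, so a column created with `as doc_vec` is named DOC_VEC, and a lower-case 'doc_vec' lookup in the pandas frame would miss it. The sketch below only illustrates that naming rule; `resolve_identifier` is a hypothetical helper written for this note, not a Snowpark function, and fixing the case alone is not expected to make ARRAY input work.

```python
def resolve_identifier(name: str) -> str:
    """Approximate the name Snowflake stores for a column identifier."""
    if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
        # Double-quoted identifiers keep their exact case; "" is an escaped quote.
        return name[1:-1].replace('""', '"')
    # Unquoted identifiers are stored in upper case.
    return name.upper()


# Column aliases as written in the question's SQL (unquoted).
stored_columns = [resolve_identifier(c) for c in ("doc_id", "doc_vec")]

print(stored_columns)                # ['DOC_ID', 'DOC_VEC']
print("doc_vec" in stored_columns)   # False
print("DOC_VEC" in stored_columns)   # True
```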

Based on the tutorial text_embedding_as_snowpark_python_udf, I suspect the Snowpark array needs to be converted to an np.ndarray before it is passed to the underlying sklearn.decomposition.TruncatedSVD.

Can someone please point me to an example that uses Snowflake arrays as inputs to the Snowpark models?

Best answer, by Felipe Hoffa

The problem is that Snowflake ML does not currently support sparse matrices as input (but it will).

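A teammate wrote the sample code below. The first part uses an ARRAY column, which will be supported in the future; the second part passes one numeric column per feature instead: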

from snowflake.ml.modeling.decomposition import TruncatedSVD
from snowflake.ml.utils.connection_params import SnowflakeLoginOptions
from snowflake.snowpark import Session, functions as F, types as T

session = Session.builder.configs(SnowflakeLoginOptions()).getOrCreate()

# This does not work yet because Snowflake ML does not accept ARRAY-typed input columns. We'll support it in the future!
t = session.range(5).with_column(
    "doc_vec",
    F.array_construct(
        F.lit(0.1),
        F.lit(0.2),
        F.lit(0.3),
    ),
).with_column("doc_vec", F.col("doc_vec").cast(T.ArrayType(T.FloatType())))
tsvd = TruncatedSVD(input_cols="DOC_VEC", output_cols="DOC_VEC")

# Create an input dataframe with one numeric column per feature instead of an ARRAY column
t = session.create_dataframe([[0.1, 0.2, 0.3] for _ in range(5)], schema=["A", "B", "C"])
tsvd = TruncatedSVD(input_cols=["A", "B", "C"], output_cols=["OUTPUT"])
t.show()

tsvd.fit(t)
# show the results
tsvd.transform(t).show()
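
To apply the column-per-feature pattern above to an existing ARRAY column, one option is to expand the array into one numeric column per element first. This is a minimal sketch, assuming every array has the same known length; `component_names` and `expand_array_column` are hypothetical helpers written for illustration, not part of snowflake-ml. With vectors of more than 1,000 elements this produces a very wide table, so treat it as a stopgap until ARRAY input is supported.

```python
from typing import List, Sequence


def component_names(prefix: str, n_components: int) -> List[str]:
    """Build upper-case column names such as V_0, V_1, ... for each array slot."""
    return [f"{prefix.upper()}_{i}" for i in range(n_components)]


def expand_array_column(df, array_col: str, n_components: int,
                        keep_cols: Sequence[str] = (), prefix: str = "V"):
    """Return (wide_df, feature_names) with one DOUBLE column per array element.

    Assumes every array in `array_col` has at least `n_components` elements;
    shorter arrays would yield NULLs in the missing positions.
    """
    # Imported here so the naming helper stays usable without Snowpark installed.
    from snowflake.snowpark import functions as F, types as T

    names = component_names(prefix, n_components)
    features = [
        F.get(F.col(array_col), F.lit(i)).cast(T.DoubleType()).alias(name)
        for i, name in enumerate(names)
    ]
    # A single select avoids stacking one subquery per added column.
    wide_df = df.select(*[F.col(c) for c in keep_cols], *features)
    return wide_df, names


print(component_names("v", 4))  # ['V_0', 'V_1', 'V_2', 'V_3']
```

With the question's dataframe, this might be called as `wide, feats = expand_array_column(df, "DOC_VEC", 4, keep_cols=["DOC_ID"])`, and `feats` then passed as `input_cols` in the same way as the A, B, C columns above.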