LSHModel on Spark Structured Streaming

151 Views Asked by Galuoises At 28 July 2025 at 01:50

Apparently, the LSHModel in Spark MLlib supports Spark Structured Streaming as of Spark 2.4 (https://issues.apache.org/jira/browse/SPARK-24465).

However, it's not clear to me how. For instance, can an approxSimilarityJoin from the MinHashLSH transformation (https://spark.apache.org/docs/latest/ml-features#lsh-operations) be applied directly to a streaming DataFrame?

I can't find any more information about this online. Could someone help me?

There is 1 best solution below

Michael Heil On 02 March 2021 at 19:05

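You need to

1. Persist the trained model (e.g. modelFitted) somewhere accessible to your streaming job. This is done outside of your streaming job.

modelFitted.write.overwrite().save("/path/to/model/location")

2. Load the persisted model inside your streaming job.

import org.apache.spark.ml._
// PipelineModel applies if modelFitted was a fitted Pipeline;
// a standalone MinHashLSHModel would be loaded with MinHashLSHModel.load instead.
val model = PipelineModel.read.load("/path/to/model/location")

3. Apply the model to your streaming DataFrame (e.g. df).

model.transform(df)

// In your case you may work with two streaming DataFrames to apply `approxSimilarityJoin`.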

You may first need to bring the streaming DataFrame into the format the model expects (for example, the same input column name and vector type that were used when the model was trained).
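The three steps above can be sketched end to end as follows. This is only a minimal sketch under several assumptions, not a tested production setup: the object name `LshStreamingSketch`, the temporary model path, the toy reference data, the `rate` source, the `toFeatures` UDF and the 0.8 threshold are all made up for illustration, and the model is saved as a standalone `MinHashLSHModel` rather than a `PipelineModel`. For the similarity join it does not join two streaming DataFrames as suggested above; instead it swaps in `foreachBatch` (available from Spark 2.4), so that `approxSimilarityJoin` runs on each micro-batch as an ordinary DataFrame against a static reference set. It assumes `spark-sql` and `spark-mllib` are on the classpath.

```scala
import java.nio.file.Files

import org.apache.spark.ml.feature.{MinHashLSH, MinHashLSHModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object LshStreamingSketch {
  // Hypothetical location; in practice use a path your streaming job can reach.
  val modelPath: String =
    Files.createTempDirectory("lsh-model").resolve("model").toString

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lsh-streaming-sketch")
      .master("local[2]")
      .getOrCreate()

    // Step 1 (batch, normally a separate job): fit and persist the model.
    // Toy reference data, invented for this sketch.
    val reference = spark.createDataFrame(Seq(
      (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
      (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))),
      (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0))))
    )).toDF("id", "features")

    val modelFitted = new MinHashLSH()
      .setNumHashTables(3)
      .setInputCol("features")
      .setOutputCol("hashes")
      .fit(reference)
    modelFitted.write.overwrite().save(modelPath)

    // Step 2 (streaming job): load the persisted model.
    val model = MinHashLSHModel.load(modelPath)

    // Bring the stream into the format the model expects: a "features"
    // vector column. The built-in rate source only emits (timestamp, value),
    // so this UDF fabricates a small sparse vector from the counter.
    val toFeatures = udf { (v: Long) =>
      Vectors.sparse(6, Seq(((v % 6).toInt, 1.0), (((v + 1) % 6).toInt, 1.0)))
    }
    val streamDf = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()
      .select(col("value").as("id"), toFeatures(col("value")).as("features"))

    // Step 3: apply the model to the streaming DataFrame.
    val hashed = model.transform(streamDf)

    // Inside foreachBatch each micro-batch is a plain DataFrame, so the
    // similarity join runs exactly as it would in a batch job.
    def joinBatch(batch: DataFrame, batchId: Long): Unit = {
      model.approxSimilarityJoin(batch, reference, 0.8, "JaccardDistance")
        .select(
          col("datasetA.id").as("streamId"),
          col("datasetB.id").as("referenceId"),
          col("JaccardDistance"))
        .show(false)
    }

    val query = hashed.writeStream.foreachBatch(joinBatch _).start()
    query.awaitTermination(10000) // let the sketch run for about 10 seconds
    query.stop()
    spark.stop()
  }
}
```

Passing a named method (`joinBatch _`) rather than an inline lambda sidesteps the overload ambiguity that `foreachBatch` can produce under Scala 2.12. Whether `approxSimilarityJoin` can be called directly on two streaming DataFrames is not verified by this sketch.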