I am trying to build a query to match two columns and I have tried the following:
obj = obj.filter(e => e.colOne.exactMatch(e.colTwo))
I am not able to get this working. Is there any way to filter by comparing the contents of two columns?
It is not possible to compare two columns when writing Functions. A recommended strategy here is to create a new column that captures the equality. For example, in your PySpark pipeline, right before you generate the end objects that get indexed:
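from pyspark.sql import functions as F

# Add a boolean column that is True when colOne equals colTwo, False otherwise
df = df.withColumn("colOneEqualsColTwo", F.when(
    F.col("colOne") == F.col("colTwo"), True
).otherwise(False))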
And then filter on that new column:
obj = obj.filter(e => e.colOneEqualsColTwo.exactMatch(true))
The filter() method can't dynamically grab the value to filter based on each object, but can be used to filter on a static value. You can filter a smaller object set (<100K rows) named myUnfilteredObjects of type ObjectType this way:
Edit: updating with a solution for larger-scale object sets:
You can create a new boolean column in your object's underlying dataset that is true if colOne and colTwo match, and false otherwise. Filtering on this new column via the filter() method will then work as you expect.
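For the smaller object set case, the comparison can be done in memory once the objects are loaded. The following is only a minimal sketch under some assumptions: the ObjectType interface and the filterMatchingColumns helper are hypothetical stand-ins written for illustration, and it assumes the objects can be loaded into a plain array first (for example with something like myUnfilteredObjects.all(), which I am assuming is available on your object set). It uses the ordinary array filter, not the object set filter() discussed above.

```typescript
// Hypothetical stand-in for the generated object type; in a real Function
// this would come from your ontology imports rather than being declared here.
interface ObjectType {
    colOne: string | undefined;
    colTwo: string | undefined;
}

// Keep only the objects whose two properties hold the same defined value.
// This is plain Array.prototype.filter running in memory, so each object's
// own colTwo value is available for the comparison.
function filterMatchingColumns(objects: ObjectType[]): ObjectType[] {
    return objects.filter(o => o.colOne !== undefined && o.colOne === o.colTwo);
}

// Assumption: in a Function, this array would come from loading the object
// set (e.g. myUnfilteredObjects.all()); a small literal array stands in here.
const sample: ObjectType[] = [
    { colOne: "a", colTwo: "a" },
    { colOne: "a", colTwo: "b" },
    { colOne: undefined, colTwo: undefined },
];

const matches = filterMatchingColumns(sample);
console.log(matches.length);
```

Because every object has to be loaded before the comparison runs, this only suits object sets under the size limit mentioned above; for anything larger, the precomputed boolean column is the safer route.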