How to compare the content of two columns in Functions on Objects?


I am trying to build a query to match two columns and I have tried the following:

obj= obj.filter(e => e.colOne.exactMatch(e.colTwo))

I am not able to get this working. Is there any way to filter by comparing the content of two columns?


It is not possible to compare two columns against each other when writing Functions. A recommended strategy is to create a new column that captures the equality. For example, in your PySpark pipeline, right before you generate the end objects that get indexed:

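from pyspark.sql import functions as F

# True where the two columns match; False otherwise (including when either is null).
df = df.withColumn("colOneEqualsColTwo", F.when(
     F.col("colOne") == F.col("colTwo"), True
).otherwise(False))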

And then filter on that new column:

obj = obj.filter(e => e.colOneEqualsColTwo.exactMatch(true))

The filter() method cannot read a value from each object to use as the comparison target; it can only filter against a static value.

You can filter a smaller object set (<100K rows) named myUnfilteredObjects of type ObjectType this way:

let myFilteredObjects = new Set<ObjectType>();

for (const unfilteredObj of myUnfilteredObjects) {
    if (unfilteredObj.colOne === unfilteredObj.colTwo) {
        myFilteredObjects.add(unfilteredObj);
    }
}
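
The same in-memory comparison can be wrapped in a reusable helper. This is only a sketch: the Row interface, the filterMatchingColumns function, and the sample data are hypothetical stand-ins that use plain arrays rather than a real object set. It also assumes you want unset values to count as "no match", which mirrors how a null comparison behaves in the pipeline approach.

```typescript
// Hypothetical stand-in for an object type with two comparable properties.
interface Row {
    colOne: string | undefined;
    colTwo: string | undefined;
}

// Keep only the rows whose two properties are both set and strictly equal.
function filterMatchingColumns<T extends Row>(rows: readonly T[]): T[] {
    const matches: T[] = [];
    for (const row of rows) {
        // Assumption: an unset value never counts as a match.
        if (row.colOne !== undefined && row.colOne === row.colTwo) {
            matches.push(row);
        }
    }
    return matches;
}

// Illustrative data only.
const sample: Row[] = [
    { colOne: "a", colTwo: "a" },
    { colOne: "a", colTwo: "b" },
    { colOne: undefined, colTwo: undefined },
];

const matched = filterMatchingColumns(sample);
```

If you would rather treat two unset values as equal, as the loop above does, drop the undefined check. Either way, this runs over every object in memory, so it only suits the smaller object sets mentioned above.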

Edit: updating with a solution for larger-scale object sets:

You can create a new boolean column in your object's underlying dataset that is true if colOne and colTwo match, and false otherwise. Filtering on this new column via the filter() method will then work as you expect.