Tab1 Columns [F,S,E]
F1 S1 R
F1 S2 R2
F1 S3 R1
F2 S1 R2
F2 S4 R4
F1 S4 R
Tab2 Columns [F,S]
F1 S1
F1 S3
F2 S1
F2 S4
TAKE ROWS FROM TAB1 FOR ONLY IF F->S RELATION IS PRESENT IN Tab2
RESULT Columns [F,S,E]
F1 S1 R
F1 S3 R
F2 S4 R4
I have the query now, but am not able to get results with pyspark.I am able to run on MySql db.
I tried to use corelated subquery in spark 2.4.3, but this returns 0 rows.
Tab1.createOrReplaceTempView("Tab1")
Tab2.createOrReplaceTempView("Tab2")
joined_df = spark.sql(
"""SELECT F, S, E FROM Tab1
WHERE EXISTS (SELECT * FROM Tab2 WHERE Tab1.F=Tab2.F AND Tab1.S=Tab2.S)"""
)
joined_df.show(10)