How can I check if an instance is in a dataframe in PySpark and get the occurrence from the dataframe?


I have an instance extracted from a dataframe with 3 different Attributes: Atr1, Atr2 and Atr3.

On the other hand, I have a dataframe containing 4 attributes: Atr1, Atr2, Atr3, Atr4, where Atr1, Atr2 and Atr3 are the same attributes as in the previously mentioned instance. I have something like this:

Instance:

[Row(Atr1=u'A', Atr2=u'B', Atr3=24)]

Dataframe:

+------+------+------+------+
| Atr1 | Atr2 | Atr3 | Atr4 |
+------+------+------+------+
|  'C' |  'B' |  21  |  'H' |
+------+------+------+------+
|  'D' |  'B' |  21  |  'J' |
+------+------+------+------+
|  'E' |  'B' |  21  |  'K' |
+------+------+------+------+
|  'A' |  'B' |  24  |  'I' |
+------+------+------+------+

So, given the instance above, I want to check whether an instance with those values for Atr1, Atr2 and Atr3 exists in the dataframe and, if it does, get the value of Atr4. In this case, 'I'.

There are 2 best solutions below

df.filter((df['Atr1'] == row.Atr1) &
          (df['Atr2'] == row.Atr2) &
          (df['Atr3'] == row.Atr3)).select('Atr4').collect()

with row the Row and df the DataFrame you mentioned. collect() returns the matching rows as a list, so when a match exists the value 'I' can be read from the first element's Atr4 field; an empty list means no match.

An alternative approach using a join:
from pyspark.sql.types import Row
from pyspark.sql.functions import col

# sample data
row_list = [Row(Atr1=u'A', Atr2=u'B', Atr3=24),
            Row(Atr1=u'E', Atr2=u'F', Atr3=20),]
df = sc.parallelize([('C', 'B', 21, 'H'),
                     ('D', 'B', 21, 'J'),
                     ('E', 'B', 21, 'K'),
                     ('A', 'B', 24, 'I')]).\
    toDF(["Atr1", "Atr2", "Atr3", "Atr4"])

# Right join keeps every searched instance; Atr4 is null when no match exists
search_df = df.join(sqlContext.createDataFrame(row_list), ["Atr1", "Atr2", "Atr3"], "right").\
    withColumn("rowItem_Exist", col('Atr4').isNotNull())
search_df.show()

Output is:

+----+----+----+----+-------------+
|Atr1|Atr2|Atr3|Atr4|rowItem_Exist|
+----+----+----+----+-------------+
|   E|   F|  20|null|        false|
|   A|   B|  24|   I|         true|
+----+----+----+----+-------------+