How can I check if an instance is in a dataframe in PySpark and get the occurrence from the dataframe?


I have an instance extracted from a dataframe with 3 different Attributes: Atr1, Atr2 and Atr3.

On the other hand, I have a dataframe containing 4 attributes: Atr1, Atr2, Atr3, Atr4, where Atr1, Atr2 and Atr3 are the same attributes as in the previously mentioned instance. I have something like this:

Instance:

[Row(Atr1=u'A', Atr2=u'B', Atr3=24)]

Dataframe:

+------+------+------+------+
| Atr1 | Atr2 | Atr3 | Atr4 |
+------+------+------+------+
|  'C' |  'B' |  21  |  'H' |
+------+------+------+------+
|  'D' |  'B' |  21  |  'J' |
+------+------+------+------+
|  'E' |  'B' |  21  |  'K' |
+------+------+------+------+
|  'A' |  'B' |  24  |  'I' |
+------+------+------+------+

So, given the instance above, I want to check whether an instance with those values for Atr1, Atr2 and Atr3 exists in the dataframe and, if it does, get the value of Atr4. In this case, 'I'.

There are 2 best solutions below

df.filter((df['Atr1'] == row.Atr1) &
          (df['Atr2'] == row.Atr2) &
          (df['Atr3'] == row.Atr3)).select('Atr4').collect()

with row the Row and df the DataFrame you mentioned. collect() returns the matching rows as a list, so when a match exists the value 'I' can be read from the first element's Atr4 field; an empty list means no match.

An alternative approach using a join:
from pyspark.sql.types import Row
from pyspark.sql.functions import col

# sample data
row_list = [Row(Atr1=u'A', Atr2=u'B', Atr3=24),
            Row(Atr1=u'E', Atr2=u'F', Atr3=20),]
df = sc.parallelize([('C', 'B', 21, 'H'),
                     ('D', 'B', 21, 'J'),
                     ('E', 'B', 21, 'K'),
                     ('A', 'B', 24, 'I')]).\
    toDF(["Atr1", "Atr2", "Atr3", "Atr4"])

# Right join keeps every searched instance; Atr4 is null when no match exists
search_df = df.join(sqlContext.createDataFrame(row_list), ["Atr1", "Atr2", "Atr3"], "right").\
    withColumn("rowItem_Exist", col('Atr4').isNotNull())
search_df.show()

Output is:

+----+----+----+----+-------------+
|Atr1|Atr2|Atr3|Atr4|rowItem_Exist|
+----+----+----+----+-------------+
|   E|   F|  20|null|        false|
|   A|   B|  24|   I|         true|
+----+----+----+----+-------------+