How to filter a PySpark SQL DataFrame read from Elasticsearch by a metadata field (by _id, for example)?


I am reading a PySpark SQL DataFrame from an Elasticsearch index, with the read option es.read.metadata=True. I want to filter the data by a condition on a metadata field, but I get an empty result, although there should be matching rows. Is it possible to get the actual result?
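
A minimal sketch of how the DataFrame is read, assuming the elasticsearch-hadoop connector (org.elasticsearch.spark.sql); the host, port, and index name my-index are hypothetical placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-read").getOrCreate()

# es.read.metadata=True adds a _metadata map column (_id, _score, _index, ...)
# to every row read from the index.
df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost")      # hypothetical host
      .option("es.port", "9200")            # hypothetical port
      .option("es.read.metadata", "true")
      .load("my-index"))                    # hypothetical index name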

I did get results when I used limit on the DataFrame, even with a very big number, even larger than the DataFrame size.

In addition, I did get results when filtering on a regular field that is not part of _metadata.
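
A sketch of such a non-metadata filter, assuming a hypothetical document field named price:

df.where(df.price > 1.0).select(df._metadata._id).show()  # price is a hypothetical field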

For example, this filter on a metadata field:

df.where(df._metadata._score > 1.0).select(df._metadata._id).show()

The result is empty:

+--------------+
|_metadata[_id]|
+--------------+
+--------------+

But when using limit:

df.limit(1000000).where(df._metadata._score > 1.0).select(df._metadata._id).show()

The result is not empty:

+--------------------+
|      _metadata[_id]|
+--------------------+
|cICqm2gBHl8Vy6RZyu_L|
+--------------------+
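
For reference, this is the kind of direct filter by _id that I ultimately want to work (the id value is taken from the output above):

df.where(df._metadata._id == "cICqm2gBHl8Vy6RZyu_L").show()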