I am running queries with Apache Iceberg as the table format.
DDL of the table (similar for both raw and ods):
CREATE TABLE ods.kafka_trbMetaEventTopic_v1 (
    objectId BIGINT,
    hasSign STRING,
    fileName STRING,
    fileExt STRING,
    created TIMESTAMP,
    tech_timestamp TIMESTAMP,
    tech_raw_timestamp TIMESTAMP,
    tech_date DATE,
    tech_raw_date DATE,
    schema_v_num INT
)
USING iceberg
PARTITIONED BY (tech_date, days(created));
- First I look up the max value; this runs almost instantly:
spark.sql("SELECT coalesce(max(tech_raw_date), '1970-01-01') FROM ods.kafka_trbMetaEventTopic_v1").show()
- Then I substitute that value into the second query by hand; it also runs quickly:
spark.sql("""
    SELECT objectId, objectId_new, hasSign, fileName, fileExt, created,
           tech_timestamp AS tech_raw_timestamp, tech_date AS tech_raw_date
    FROM raw.kafka_trbMetaEventTopic_v1
    WHERE tech_date = '2023-11-13'
""").show()
- But when I combine the two into a single query with a scalar subquery, it hangs on the driver and GC eats all the time (shown in red on the screenshot):
spark.sql("""
    SELECT objectId, objectId_new, hasSign, fileName, fileExt, created,
           tech_timestamp AS tech_raw_timestamp, tech_date AS tech_raw_date
    FROM raw.kafka_trbMetaEventTopic_v1
    WHERE tech_date = (SELECT coalesce(max(tech_raw_date), '1970-01-01')
                       FROM ods.kafka_trbMetaEventTopic_v1)
""").count()
I would expect Spark to evaluate the max first and then push that value into the filter, since the two operations run quickly when executed separately.
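As a workaround, the two steps can be decoupled explicitly: run the max() query eagerly, pull the scalar back to the driver, and inline it into the second query as a literal, so the partition filter is known at planning time. A minimal sketch, assuming the table and column names from the question (the helper function name is mine):

def build_incremental_query(max_date: str) -> str:
    """Builds the filtered SELECT with an already-resolved date literal,
    so the engine can prune partitions at plan time instead of waiting
    on a scalar subquery."""
    return (
        "SELECT objectId, objectId_new, hasSign, fileName, fileExt, created, "
        "tech_timestamp AS tech_raw_timestamp, tech_date AS tech_raw_date "
        "FROM raw.kafka_trbMetaEventTopic_v1 "
        f"WHERE tech_date = DATE '{max_date}'"
    )

# Usage (requires an active SparkSession; not run here):
# max_date = spark.sql(
#     "SELECT coalesce(max(tech_raw_date), DATE '1970-01-01') AS d "
#     "FROM ods.kafka_trbMetaEventTopic_v1"
# ).collect()[0]["d"]
# spark.sql(build_incremental_query(str(max_date))).count()

Whether the scalar-subquery plan itself can be made fast likely depends on the Spark version and how the optimizer handles subquery-based partition filters against Iceberg; the explicit two-step approach sidesteps that question entirely.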