What is the reason for GC overhead in this simple PySpark query?


I execute a query against tables stored in the Apache Iceberg format.

DDL of the table (similar for both the raw and ods tables):

CREATE TABLE ods.kafka_trbMetaEventTopic_v1 (
        objectId BIGINT,
        hasSign STRING,
        fileName STRING,
        fileExt STRING,
        created TIMESTAMP,
        tech_timestamp TIMESTAMP,
        tech_raw_timestamp TIMESTAMP,
        tech_date DATE,
        tech_raw_date DATE,
        schema_v_num INT
    )
    USING iceberg
    PARTITIONED BY (tech_date, days(created));
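Not from the original post: since the behaviour described below comes down to how much planning work the driver does, it can help to first check how many partitions and files each table holds. Iceberg exposes this through its standard metadata tables; a minimal sketch, assuming both tables are reachable with this naming through the configured Iceberg catalog:

# Sketch: partition and file counts via Iceberg's "partitions" metadata table
spark.sql("SELECT * FROM ods.kafka_trbMetaEventTopic_v1.partitions").show(truncate=False)
spark.sql("SELECT * FROM raw.kafka_trbMetaEventTopic_v1.partitions").show(truncate=False)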
  1. I run a query for the max value; it completes instantly:
spark.sql("SELECT coalesce(max(tech_raw_date), '1970-01-01') FROM ods.kafka_trbMetaEventTopic_v1").show()
  2. I substitute this value into the second query; it also runs quickly:
spark.sql("select objectId, objectId_new, hasSign, fileName, fileExt, created, tech_timestamp as tech_raw_timestamp, tech_date as tech_raw_date from raw.kafka_trbMetaEventTopic_v1 where tech_date = '2023-11-13'").show()
  3. I combine the two queries into one; it hangs on the driver and GC eats most of the time (shown in red in the screenshot):
spark.sql("select objectId, objectId_new, hasSign, fileName, fileExt, created, tech_timestamp as tech_raw_timestamp, tech_date as tech_raw_date from raw.kafka_trbMetaEventTopic_v1 where tech_date = (SELECT coalesce(max(tech_raw_date), '1970-01-01') FROM ods.kafka_trbMetaEventTopic_v1)").count()

It seems that Spark should first find the max value and then put that value into the filter of the second query; those two operations execute quickly when run separately.

Screenshots: pyspark terminal, query 3, launch 3 in the Spark master UI.
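Since both pieces are fast on their own, the manual workaround (my sketch, not something from the post) is to do exactly what is described above by hand: collect the max value on the driver first and substitute it into the second query as a literal. This assumes the collected value formats as YYYY-MM-DD when interpolated:

# Sketch: run the two steps manually instead of using a scalar subquery
max_raw_date = spark.sql(
    "SELECT coalesce(max(tech_raw_date), '1970-01-01') AS d "
    "FROM ods.kafka_trbMetaEventTopic_v1"
).first()["d"]

cnt = spark.sql(
    "select objectId, objectId_new, hasSign, fileName, fileExt, created, "
    "tech_timestamp as tech_raw_timestamp, tech_date as tech_raw_date "
    f"from raw.kafka_trbMetaEventTopic_v1 where tech_date = '{max_raw_date}'"
).count()
print(cnt)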
