What is the reason for GC overhead in this simple PySpark query?


I execute a query against tables stored in the Apache Iceberg format.

DDL of the table (similar for both the raw and ods tables):

CREATE TABLE ods.kafka_trbMetaEventTopic_v1 (
        objectId BIGINT,
        hasSign STRING,
        fileName STRING,
        fileExt STRING,
        created TIMESTAMP,
        tech_timestamp TIMESTAMP,
        tech_raw_timestamp TIMESTAMP,
        tech_date DATE,
        tech_raw_date DATE,
        schema_v_num INT
    )
    USING iceberg
    PARTITIONED BY (tech_date, days(created));
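Not from the original post: since the behaviour described below comes down to how much planning work the driver does, it can help to first check how many partitions and files each table holds. Iceberg exposes this through its standard metadata tables; a minimal sketch, assuming both tables are reachable with this naming through the configured Iceberg catalog:

# Sketch: partition and file counts via Iceberg's "partitions" metadata table
spark.sql("SELECT * FROM ods.kafka_trbMetaEventTopic_v1.partitions").show(truncate=False)
spark.sql("SELECT * FROM raw.kafka_trbMetaEventTopic_v1.partitions").show(truncate=False)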
  1. I run a query for the max value; it completes instantly:
spark.sql("SELECT coalesce(max(tech_raw_date), '1970-01-01') FROM ods.kafka_trbMetaEventTopic_v1").show()
  2. I substitute this value into the second query; it also runs quickly:
spark.sql("select objectId, objectId_new, hasSign, fileName, fileExt, created, tech_timestamp as tech_raw_timestamp, tech_date as tech_raw_date from raw.kafka_trbMetaEventTopic_v1 where tech_date = '2023-11-13'").show()
  3. I combine the two queries into one; it hangs on the driver and GC eats most of the time (shown in red in the screenshot):
spark.sql("select objectId, objectId_new, hasSign, fileName, fileExt, created, tech_timestamp as tech_raw_timestamp, tech_date as tech_raw_date from raw.kafka_trbMetaEventTopic_v1 where tech_date = (SELECT coalesce(max(tech_raw_date), '1970-01-01') FROM ods.kafka_trbMetaEventTopic_v1)").count()

It seems that Spark should first find the max value and then put that value into the filter of the second query; those two operations execute quickly when run separately.

Screenshots: pyspark terminal, query 3, launch 3 in the Spark master UI.
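Since both pieces are fast on their own, the manual workaround (my sketch, not something from the post) is to do exactly what is described above by hand: collect the max value on the driver first and substitute it into the second query as a literal. This assumes the collected value formats as YYYY-MM-DD when interpolated:

# Sketch: run the two steps manually instead of using a scalar subquery
max_raw_date = spark.sql(
    "SELECT coalesce(max(tech_raw_date), '1970-01-01') AS d "
    "FROM ods.kafka_trbMetaEventTopic_v1"
).first()["d"]

cnt = spark.sql(
    "select objectId, objectId_new, hasSign, fileName, fileExt, created, "
    "tech_timestamp as tech_raw_timestamp, tech_date as tech_raw_date "
    f"from raw.kafka_trbMetaEventTopic_v1 where tech_date = '{max_raw_date}'"
).count()
print(cnt)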
