Apache Hudi Auto-Sizing During Writes Is Not Working for Flink SQL


The expectation (per the Apache Hudi docs - https://hudi.apache.org/docs/file_sizing#auto-sizing-during-writes) is that with every Flink commit (every minute) a batch of records is accumulated and written into one of the existing Parquet files until the maximum file size threshold is reached (5 MB in the example below).
However, what actually happens is that every commit produces a separate Parquet file of roughly 400 KB; these files keep accumulating and are never merged. Please help.
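
For reference, the docs page linked above describes auto-sizing during writes in terms of two thresholds: a target maximum file size and a small-file limit below which a file remains a candidate for bin-packing new records. A minimal sketch of a sink that spells those thresholds out via the core hoodie.* keys (values in bytes) is shown here; it assumes that hoodie.* properties set in the Flink WITH clause are passed through to the writer, and the table name and path are only illustrative.

CREATE TABLE hudi_tbl_sized
(
    id   INT NOT NULL PRIMARY KEY NOT ENFORCED,
    data STRING
) WITH (
      'connector' = 'hudi',
      'path' = 'file:///opt/hudi_sized',
      'table.type' = 'COPY_ON_WRITE',
      -- core Hudi sizing keys from the file_sizing docs page; values are in bytes
      'hoodie.parquet.max.file.size' = '5242880',   -- 5 MB target file size
      'hoodie.parquet.small.file.limit' = '4194304' -- files under 4 MB stay candidates for bin-packing
);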

Flink SQL file:

SET 'parallelism.default' = '1';
SET 'execution.checkpointing.interval' = '1m';

CREATE TABLE datagen
(
    id   INT NOT NULL PRIMARY KEY NOT ENFORCED,
    data STRING
) WITH (
      'connector' = 'datagen',
      'rows-per-second' = '5'
);

CREATE TABLE hudi_tbl
(
    id   INT NOT NULL PRIMARY KEY NOT ENFORCED,
    data STRING
) WITH (
      'connector' = 'hudi',
      'path' = 'file:///opt/hudi',
      'table.type' = 'COPY_ON_WRITE',
      -- Parquet RowGroup size; the Flink connector interprets this value in MB
      'write.parquet.block.size' = '1',
      'write.operation' = 'insert',
      -- target maximum Parquet file size, in MB (the 5 MB threshold described above)
      'write.parquet.max.file.size' = '5'
);

INSERT INTO hudi_tbl SELECT * FROM datagen;
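
Since the writer uses 'write.operation' = 'insert' on a COPY_ON_WRITE table, a related knob is the insert-mode small-file merge switch. Below is a sketch of the same sink with it enabled, assuming the option 'write.insert.cluster' exists under that name in the Hudi release in use (it defaults to false); the table name and path are illustrative only.

CREATE TABLE hudi_tbl_merged
(
    id   INT NOT NULL PRIMARY KEY NOT ENFORCED,
    data STRING
) WITH (
      'connector' = 'hudi',
      'path' = 'file:///opt/hudi_merged',
      'table.type' = 'COPY_ON_WRITE',
      'write.operation' = 'insert',
      -- merge small files during insert (assumed option name; disabled by default)
      'write.insert.cluster' = 'true',
      'write.parquet.max.file.size' = '5'
);

INSERT INTO hudi_tbl_merged SELECT * FROM datagen;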
