Flink SQL Job Stops with Backpressure After a Week of Execution
I'm running into an issue with a Flink SQL job: it consistently stops after running for approximately a week due to backpressure. The tables are defined in SQL and use the Kafka connector. The job runs without crashing at first, but I suspect unbounded state growth during processing. I've checked the WebUI and the checkpoint status and enabled state time-to-live (TTL), yet the problem persists. I'm looking for debugging tips and insights into similar operational issues with Flink, specifically how to track down and resolve the backpressure. Any guidance or recommendations would be greatly appreciated.
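For context, this is roughly how I enabled the TTL, via the Flink SQL client; the duration value below is illustrative rather than the exact one I use:

-- Expire idle per-key state after one day (value is illustrative).
SET 'table.exec.state.ttl' = '1 d';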
Tags: Flink SQL, backpressure, job termination, infinite state growth, debugging, WebUI, checkpoint, TTL.
Flink SQL Benchmarking Issue: Job Stops with Backpressure at Low Message Rate
- I'm running into an issue while benchmarking Flink SQL queries (Kafka source and sink). At around 300 messages per second on a 12-core setup, the job stops due to backpressure in the sink materializer. Even allowing for variations in the queries and some data skew on the GROUP BY key (see the aggregation settings sketched after the query below), the performance seems unreasonably low. Is this in line with typical Flink performance, and are there known issues or specific considerations for this kind of workload? Notably, one of the tasks uses LISTAGG to concatenate JSON, and the cluster runs in standalone mode.
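As far as I understand, the "sink materializer" reported in the WebUI is the SinkUpsertMaterializer operator that Flink places in front of the upsert-kafka sink. For reference, this is the option that controls it (shown with its default value; I have not confirmed that changing it would be safe for my changelog):

-- Controls the materializer in front of upsert sinks: NONE | AUTO | FORCE.
-- NONE removes the operator but is only correct if the changelog is already
-- ordered per key, so I have not simply turned it off.
SET 'table.exec.sink.upsert-materialize' = 'AUTO';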
SQL QUERY
CREATE TABLE `issue-table` (
  ...
) WITH (
  'connector' = 'upsert-kafka',
  'key.format' = 'json',
  'value.format' = 'json',
  ...
);

INSERT INTO `issue-table`
SELECT
  uuid(),
  MAX(`time`) AS `time`,
  '{' || LISTAGG(CONCAT('"' || id || '"', ':', `data`)) || '}' AS `data`,
  CASE WHEN ...
    THEN ...
    WHEN ...
    ELSE ...
  END
FROM (
  SELECT
    id,
    `data`,
    location,
    `time`,
    `count`,
    ROW_NUMBER() OVER (
      PARTITION BY id ORDER BY `time` DESC
    ) AS `row`
  FROM (
    SELECT
      `count`,
      id,
      `data`,
      location,
      `other-table`.id,
      `other-table`.digital_twin_id,
      `other-table`.rowtime
    FROM `other-table`
    JOIN `another-table` FOR SYSTEM_TIME AS OF `time` AS u
      ON `other-table`.id = u.id
  )
)
WHERE `row` = 1
GROUP BY id, `count`;
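Because of the skew on the GROUP BY key mentioned above, these are the aggregation-related options I have been looking at; the values are illustrative, and I am not sure the two-phase strategy helps with LISTAGG:

-- Buffer rows and aggregate them in mini-batches to reduce state accesses.
SET 'table.exec.mini-batch.enabled' = 'true';
SET 'table.exec.mini-batch.allow-latency' = '2 s';
SET 'table.exec.mini-batch.size' = '5000';
-- Split the aggregation into a local and a global phase to mitigate key skew.
SET 'table.optimizer.agg-phase-strategy' = 'TWO_PHASE';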
CHECK
- I have checked the WebUI and the checkpoint status, and I enabled state TTL to address the suspected state growth, but the problem persists. I expected the job to run without interruption, yet the backpressure that stops it after about a week remains unresolved. I'm actively looking for debugging strategies and any experiences or insights the community may have with similar Flink operational issues.
"When backpressure occurs, this is how it appears in the WebUI for verification." enter image description here