Flink SQL Job Stops with Backpressure After a Week of Execution
I'm running into an issue with a Flink SQL job: it consistently stops after running for approximately a week due to backpressure. The tables are defined in SQL and use the Kafka connector. The job runs without crashing at first, but I suspect unbounded state growth during processing. I've checked the WebUI and the checkpoint status and enabled state time-to-live (TTL), yet the problem persists. I'm looking for debugging tips and insights into similar operational issues with Flink, specifically how to track down and resolve the backpressure. Any guidance or recommendations would be greatly appreciated.
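For context, this is roughly how I enabled the TTL, via the Flink SQL client; the duration value below is illustrative rather than the exact one I use:

-- Expire idle per-key state after one day (value is illustrative).
SET 'table.exec.state.ttl' = '1 d';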
Tags: Flink SQL, backpressure, job termination, infinite state growth, debugging, WebUI, checkpoint, TTL.
Flink SQL Benchmarking Issue: Job Stops with Backpressure at Low Message Rate
- I'm running into an issue while benchmarking Flink SQL queries (Kafka source and sink). At around 300 messages per second on a 12-core setup, the job stops due to backpressure in the sink materializer. Even allowing for variations in the queries and some data skew on the GROUP BY key (see the aggregation settings sketched after the query below), the performance seems unreasonably low. Is this in line with typical Flink performance, and are there known issues or specific considerations for this kind of workload? Notably, one of the tasks uses LISTAGG to concatenate JSON, and the cluster runs in standalone mode.
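As far as I understand, the "sink materializer" reported in the WebUI is the SinkUpsertMaterializer operator that Flink places in front of the upsert-kafka sink. For reference, this is the option that controls it (shown with its default value; I have not confirmed that changing it would be safe for my changelog):

-- Controls the materializer in front of upsert sinks: NONE | AUTO | FORCE.
-- NONE removes the operator but is only correct if the changelog is already
-- ordered per key, so I have not simply turned it off.
SET 'table.exec.sink.upsert-materialize' = 'AUTO';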
SQL QUERY
CREATE TABLE `issue-table` (
  ...
) WITH (
  'connector' = 'upsert-kafka',
  'key.format' = 'json',
  'value.format' = 'json',
  ...
);

INSERT INTO `issue-table`
SELECT
  uuid(),
  MAX(`time`) AS `time`,
  '{' || LISTAGG(CONCAT('"' || id || '"', ':', `data`)) || '}' AS `data`,
  CASE WHEN ...
    THEN ...
    WHEN ...
    ELSE ...
  END
FROM (
  SELECT
    id,
    `data`,
    location,
    `time`,
    `count`,
    ROW_NUMBER() OVER (
      PARTITION BY id ORDER BY `time` DESC
    ) AS `row`
  FROM (
    SELECT
      `count`,
      id,
      `data`,
      location,
      `other-table`.id,
      `other-table`.digital_twin_id,
      `other-table`.rowtime
    FROM `other-table`
    JOIN `another-table` FOR SYSTEM_TIME AS OF `time` AS u
      ON `other-table`.id = u.id
  )
)
WHERE `row` = 1
GROUP BY id, `count`;
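Because of the skew on the GROUP BY key mentioned above, these are the aggregation-related options I have been looking at; the values are illustrative, and I am not sure the two-phase strategy helps with LISTAGG:

-- Buffer rows and aggregate them in mini-batches to reduce state accesses.
SET 'table.exec.mini-batch.enabled' = 'true';
SET 'table.exec.mini-batch.allow-latency' = '2 s';
SET 'table.exec.mini-batch.size' = '5000';
-- Split the aggregation into a local and a global phase to mitigate key skew.
SET 'table.optimizer.agg-phase-strategy' = 'TWO_PHASE';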
CHECK
- I have checked the WebUI and the checkpoint status, and I enabled state TTL to address the suspected state growth, but the problem persists. I expected the job to run without interruption, yet the backpressure that stops it after about a week remains unresolved. I'm actively looking for debugging strategies and any experiences or insights the community may have with similar Flink operational issues.
"When backpressure occurs, this is how it appears in the WebUI for verification." enter image description here