Kinesis Data Analytics Flink: Continually Increasing Checkpoint Size

1.6k Views Asked by At

I am running a Flink application using the AWS Kinesis Data Analytics (KDA) service. My KDA Flink application last checkpoint size appears to be growing steadily over time. The sudden drops in checkpoint size you can see in the attached graph correspond with when I pushed changes out to the app, causing it to take a snapshot, update, and then restore from the snapshot. My concern is that once the application is no longer being actively developed, changes will not be deployed as regularly, and the checkpoint size could grow to eventually be too large.

Does anyone know what would cause the checkpoint size to grow continuously without end? I am using State TTL on all significant state and removing state in application code when it is no longer needed. Does the checkpoint size increasing indicate I have a bug in the code that handles state, or is something else potentially at play here?

Continually Increasing Checkpoint Size

1

There are 1 best solutions below

2
On

Update: See https://stackoverflow.com/a/67435073/2000823 for a better answer.


AWS Kinesis Data Analytics (KDA) is currently based on Flink 1.8, where this documentation regarding state cleanup applies.

Note that

by default if expired state is not read, it won’t be removed, possibly leading to ever growing state

You can also activate cleanup during full snapshots (which seems to be occurring), and background cleanup (which sounds like what you want). Note that for some workloads, even if background cleanup is enabled, the default settings for background cleanup might be insufficient to keep up with the rate at which state should be cleaned up, and some tuning might be necessary.

By the way, background cleanup is enabled by default since Flink 1.10.

If this doesn't answer your question, please clarify precisely how state TTL is configured.