I need to store NetFlow data in PostgreSQL. This is data about network traffic. Each record contains the following:
- Connection start time
- Connection end time
- Total data transferred
- Source/destination IPs/ASNs
- (There is a bunch more, but that is enough for the purpose of this question).
My question is this: How can I store this data so I can efficiently calculate data transfer rates for the past X days/hours? For example, I may want to draw a chart of all traffic to Netflix's ASN over the last 7 days, with hourly resolution.
The difference between the connection start & end times could be milliseconds, or could be over an hour.
My first pass at this would be to store the connection in a TSTZRANGE field with a GiST index. Then, to query the data for hourly traffic over the last 7 days (see the SQL sketch after this list):
- Use a CTE to generate a sequence of hourly time buckets
- Look for any TSTZRANGEs which overlap each bucket
- Calculate the duration of the overlap
- Calculate the data rate for the record in bytes per second
- Do duration * bytes per second to get total data
- Group it all on the bucket, summing the total data values
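In SQL, that plan might look something like the sketch below. The schema is illustrative rather than fixed (`period` holding the TSTZRANGE, `bytes` the total data transferred, `dst_asn` the destination ASN), and the `greatest(..., 0.001)` term guards against division by zero for zero-length connections:

```sql
-- Illustrative schema for the raw flows.
CREATE TABLE traffic_flow (
    period  tstzrange NOT NULL,  -- connection start/end as a range
    bytes   bigint    NOT NULL,  -- total data transferred
    dst_asn integer   NOT NULL   -- destination ASN (other fields omitted)
);
CREATE INDEX ON traffic_flow USING gist (period);

-- Hourly traffic over the last 7 days.
WITH buckets AS (
    SELECT tstzrange(ts, ts + interval '1 hour') AS bucket
    FROM generate_series(
        date_trunc('hour', now() - interval '7 days'),
        date_trunc('hour', now()),
        interval '1 hour') AS ts
)
SELECT
    lower(b.bucket) AS bucket_start,
    sum(
        -- seconds of overlap between the flow and the bucket ...
        extract(epoch FROM (least(upper(f.period), upper(b.bucket))
                          - greatest(lower(f.period), lower(b.bucket))))
        -- ... times the flow's bytes-per-second rate
        * f.bytes
        / greatest(extract(epoch FROM (upper(f.period) - lower(f.period))), 0.001)
    )::bigint AS total_bytes
FROM buckets b
JOIN traffic_flow f ON f.period && b.bucket  -- GiST-accelerated overlap test
-- Optionally: WHERE f.dst_asn = ... to chart a single ASN.
GROUP BY b.bucket
ORDER BY bucket_start;
```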
However, that sounds like a lot of heavy lifting. Can anyone think of a better option?
After looking into this some more, I think the real answer is that there isn't an out-of-the-box way to achieve this in a performant manner, especially as the data volume scales up. Ultimately, aggregating many thousands of rows at query time is just going to be slow, because that is simply a lot of data access.
Instead I have gone a different route. I am using a PostgreSQL trigger on the table which stores the raw flows (`traffic_flow`). Every time a record is inserted into `traffic_flow`, the trigger upserts the new data into separate aggregation tables for daily, hourly, and minutely data.

Here is my experimental implementation in case it is useful to someone. This could be improved to also handle updates and deletes.
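Below is a sketch of the idea for the hourly table only; the daily and minutely rollups follow the same pattern, and the column names (`period`, `bytes`, `dst_asn`) are illustrative stand-ins for the real schema:

```sql
-- Hourly rollup table, keyed by bucket and ASN.
CREATE TABLE traffic_flow_hourly (
    bucket  timestamptz NOT NULL,
    dst_asn integer     NOT NULL,
    bytes   bigint      NOT NULL,
    PRIMARY KEY (bucket, dst_asn)
);

CREATE OR REPLACE FUNCTION traffic_flow_rollup() RETURNS trigger AS $$
BEGIN
    -- Apportion the new flow's bytes across every hourly bucket its
    -- period overlaps, pro rata by overlap duration, then upsert.
    -- (A flow ending exactly on an hour boundary may insert a harmless
    -- zero-byte row; greatest(..., 0.001) guards zero-length flows.)
    INSERT INTO traffic_flow_hourly (bucket, dst_asn, bytes)
    SELECT
        b.bucket,
        NEW.dst_asn,
        round(
            extract(epoch FROM (least(upper(NEW.period), b.bucket + interval '1 hour')
                              - greatest(lower(NEW.period), b.bucket)))
            / greatest(extract(epoch FROM (upper(NEW.period) - lower(NEW.period))), 0.001)
            * NEW.bytes
        )::bigint
    FROM generate_series(
        date_trunc('hour', lower(NEW.period)),
        date_trunc('hour', upper(NEW.period)),
        interval '1 hour') AS b(bucket)
    ON CONFLICT (bucket, dst_asn)
    DO UPDATE SET bytes = traffic_flow_hourly.bytes + EXCLUDED.bytes;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- PostgreSQL 11+; use EXECUTE PROCEDURE on older versions.
CREATE TRIGGER traffic_flow_rollup
    AFTER INSERT ON traffic_flow
    FOR EACH ROW EXECUTE FUNCTION traffic_flow_rollup();
```

With the rollup in place, the 7-day hourly chart becomes a plain range scan over `traffic_flow_hourly` (at most 7 × 24 = 168 rows per ASN) instead of an aggregation over the raw flows.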