Graphite: sumSeries() does not sum


Since this morning at 6 I have been seeing strange behavior from Graphite.

We have two machines that collect data about received calls; I plot a chart for each machine and also a chart of the sum of the two.

While the per-machine charts are fine, the sum is no longer correct.

Screenshots from Graphite and from Grafana show how 4 + 5 = 5 (my math teacher is going to die over this).


This wrong sum also happens for other metrics, and I don't understand why.
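
For reference, the summed chart is built with a sumSeries() target, presumably something like the sketch below (the exact metric path is an assumption based on the aggregation rules further down):

sumSeries(stats_counts.srv_*_*.rest.req)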

storage-schemas.conf

# Schema definitions for whisper files. Entries are scanned in order,
# and first match wins.
#
#  [name]
#  pattern = regex
#  retentions = timePerPoint:timeToStore, timePerPoint:timeToStore, ...

[default_1min_for_1day]
pattern = .*
retentions = 60s:1d,1h:7d,1d:1y,7d:5y

storage-aggregation.conf

# Aggregation definitions for whisper files. Entries are scanned in order,
# and first match wins.
#
#  [name]
#  pattern = regex
#  xFilesFactor = float between 0 and 1
#  aggregationMethod = average|sum|min|max|last

[time_data]
pattern = ^stats\.timers.*
xFilesFactor = 0.5
aggregationMethod = average

[storage_space]
pattern = \.postgresql\..*
xFilesFactor = 0.1
aggregationMethod = average

[default_1min_for_1day]
pattern = .*
xFilesFactor = 0
aggregationMethod = sum

aggregation-rules.conf

This may be the cause, but it was working before 6 AM. In any case, I don't see the stats_counts.all metric.

stats_counts.all.rest.req (60) = sum stats_counts.srv_*_*.rest.req
stats_counts.all.rest.res (60) = sum stats_counts.srv_*_*.rest.res

2 Answers

BEST ANSWER

It seems that the two series were not aligned by timestamp, so the sum could not add the points together. This was visible in Grafana: selecting a point in time highlighted datapoints in two different minutes on the two charts.


I don't know why this happened. I restarted some services (these charts come from statsd for Python and Bucky); maybe one of those was at fault.

NOTE: it works now, but I would still like to know the reason, and how to solve it properly, if someone does.
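
One way to check whether two series are aligned is to fetch the raw datapoints from the render API and compare the timestamps (the host below is a placeholder, and the metric path follows the question's naming):

http://<graphite-host>/render?target=stats_counts.srv_*_*.rest.req&from=-10min&format=json

Each series comes back as a list of [value, timestamp] pairs; with a 60s:1d retention, the timestamps of the two machines should fall on the same 60-second boundaries.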

SECOND ANSWER

One thing you need to ensure is that the services sending metrics to Graphite do so at the same granularity as your smallest retention period, or as the period your graphs are rendered in. If the graph shows a data point every 60 seconds, each service needs to send metrics every 60 seconds. If the graph shows a data point for every hour, you can send your metrics every hour. In your case the smallest period is 60 seconds.
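
If changing the send interval is not an option, one possible render-side workaround (a sketch, not something this answer relies on; the metric path is again taken from the question) is to bucket every series to the 60-second storage step before summing:

sumSeries(summarize(stats_counts.srv_*_*.rest.req, "1min"))

summarize() defaults to summing within each bucket, so a datapoint sent at second 10 on one machine and at second 30 on the other end up in the same 1-minute bucket before sumSeries() adds them.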

I encountered a similar problem in our system: Graphite was configured with a smallest retention period of 10s:6h, but we had 7 instances of the same service generating lots of metrics, and we configured them to send data every 20 seconds to avoid overloading our monitoring. This caused an almost unavoidable misalignment, where the series from the different instances each had a datapoint every 20 seconds, but some had it at 10, 30, 50 and others at 0, 20, 40. Depending on how many instances happened to be aligned, we would get a very jagged graph, looking similar to yours.

What I did to solve this, for time ranges that were returning data in 10-second increments, was to use the keepLastValue function: keepLastValue(1). I used 1 as the parameter because I only wanted to bridge a single None value, since I knew our services caused exactly one gap by sending every 20 seconds rather than every 10. This way the series generated by the different services never had gaps, the sums were much closer to the real numbers, and the graphs lost the jagged look. I guess this introduces a bit of extra lag in the monitoring, but that is acceptable for our use case.
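
Applied to the metrics from the question, that would look something like the following (the metric path is taken from the aggregation rules above, and the limit of 1 matches the choice described here):

sumSeries(keepLastValue(stats_counts.srv_*_*.rest.req, 1))

keepLastValue() repeats the previous value across at most one missing datapoint in each per-machine series, so the sum no longer dips at timestamps where one of the series has a None.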