I am having a problem on some of my AppEngine projects, since a few days I started to I see a lot of errors (which I noticed they might happen when an health check arrives) in my vm.syslog logs from Stackdriver Logging.
In the specific these are:
- write_gcm: Server response (CollectdTimeseriesRequest) contains errors:#012{#012 "payloadErrors": [#012 {#012 "index": 71,#012 "error": {#012 "code": 3,#012 "message": "Expected 4 labels. Found 0. Mismatched labels for payload [values {\n data_source_name: \"value\"\n data_source_type: GAUGE\n value {\n double_value: 694411264\n }\n}\nstart_time {\n seconds: 1513266364\n nanos: 618061284\n}\nend_time {\n seconds: 1513266364\n nanos: 618061284\n}\nplugin: \"processes\"\nplugin_instance: \"all\"\ntype: \"ps_rss\"\n] on resource [type: \"gce_instance\"\nlabels {\n key: \"instance_id\"\n value: \"xxx\"\n}\nlabels {\n key: \"zone\"\n value: \"europe-west2-a\"\n}\n] for project xxx"#012 }#012 }#012 ]#012}
- write_gcm: Unsuccessful HTTP request 400: {#012 "error": {#012 "code": 400,#012 "message": "Field timeSeries[11].metric.labels[1] had an invalid value of \"health_check_type\": Unrecognized metric label.",#012 "status": "INVALID_ARGUMENT"#012 }#012}
- write_gcm: Error talking to the endpoint.
- write_gcm: wg_transmit_unique_segment failed.
- write_gcm: wg_transmit_unique_segments failed. Flushing.
At the same time, I noticed that my Memory Usage in the AppEngine dashboard for the very same projects is increasing with the passing of time at the point where it reaches the max amount available and the instance restarts, throwing a 502 error when visiting the web site that the app is serving.
All this is not happening on a couple of projects that have not been updated since at least 2 weeks (neither the errors above or the memory increase) but it does happen on a newly created instance when deployed with the same codebase of one of the healthy projects. In addition, I don't happen to see any increase in the memory when running my project locally.
Can someone gently tell me if they experienced something similar or if they think that the errors and the memory increase are related? I have haven't changed my yaml file for deployment recently and I haven't specified any custom configuration for the health checks (which run on legacy mode at the default rate).
Thank you for your help, Nicola
Simliar question here App Engine Deferred: Tracking Down Memory Leaks
Going through same thing in compute engine on a single VM. I've tried increasing memory but the problem persists. Seems to be tied to a stackdriver method call. Not sure what to do, causes machines to stop after about 24hrs for me. In my case, I'm getting information every 3 seconds from a set of API's, but the error comes up every minute in the serial port 1 (console), which makes me suspect that it is a some kind of failure outside of my code. More from Google here: https://cloud.google.com/monitoring/api/ref_v3/rest/v3/projects.collectdTimeSeries/create .