Errors in vm.syslog and Memory Usage constantly increasing on NodeJS AppEngine

900 Views Asked by At

I am having a problem on some of my AppEngine projects, since a few days I started to I see a lot of errors (which I noticed they might happen when an health check arrives) in my vm.syslog logs from Stackdriver Logging.

In the specific these are:

  • write_gcm: Server response (CollectdTimeseriesRequest) contains errors:#012{#012 "payloadErrors": [#012 {#012 "index": 71,#012 "error": {#012 "code": 3,#012 "message": "Expected 4 labels. Found 0. Mismatched labels for payload [values {\n data_source_name: \"value\"\n data_source_type: GAUGE\n value {\n double_value: 694411264\n }\n}\nstart_time {\n seconds: 1513266364\n nanos: 618061284\n}\nend_time {\n seconds: 1513266364\n nanos: 618061284\n}\nplugin: \"processes\"\nplugin_instance: \"all\"\ntype: \"ps_rss\"\n] on resource [type: \"gce_instance\"\nlabels {\n key: \"instance_id\"\n value: \"xxx\"\n}\nlabels {\n key: \"zone\"\n value: \"europe-west2-a\"\n}\n] for project xxx"#012 }#012 }#012 ]#012}
  • write_gcm: Unsuccessful HTTP request 400: {#012 "error": {#012 "code": 400,#012 "message": "Field timeSeries[11].metric.labels[1] had an invalid value of \"health_check_type\": Unrecognized metric label.",#012 "status": "INVALID_ARGUMENT"#012 }#012}
  • write_gcm: Error talking to the endpoint.
  • write_gcm: wg_transmit_unique_segment failed.
  • write_gcm: wg_transmit_unique_segments failed. Flushing.

At the same time, I noticed that my Memory Usage in the AppEngine dashboard for the very same projects is increasing with the passing of time at the point where it reaches the max amount available and the instance restarts, throwing a 502 error when visiting the web site that the app is serving.

All this is not happening on a couple of projects that have not been updated since at least 2 weeks (neither the errors above or the memory increase) but it does happen on a newly created instance when deployed with the same codebase of one of the healthy projects. In addition, I don't happen to see any increase in the memory when running my project locally.

Can someone gently tell me if they experienced something similar or if they think that the errors and the memory increase are related? I have haven't changed my yaml file for deployment recently and I haven't specified any custom configuration for the health checks (which run on legacy mode at the default rate).

Thank you for your help, Nicola

2

There are 2 best solutions below

1
On

Simliar question here App Engine Deferred: Tracking Down Memory Leaks

Going through same thing in compute engine on a single VM. I've tried increasing memory but the problem persists. Seems to be tied to a stackdriver method call. Not sure what to do, causes machines to stop after about 24hrs for me. In my case, I'm getting information every 3 seconds from a set of API's, but the error comes up every minute in the serial port 1 (console), which makes me suspect that it is a some kind of failure outside of my code. More from Google here: https://cloud.google.com/monitoring/api/ref_v3/rest/v3/projects.collectdTimeSeries/create .

0
On

I'm not sure about all of the errors, but for the "write_gcm: Server response (CollectdTimeseriesRequest)" I had the same issue and contacted Google Cloud Support. They told me that the Stackdriver service has been updated recently to accept more detailed information on ps_rss metrics, but it has caused metrics from older agents to not be sent at all.

You should be able to fix this issue by upgrading your Stackdriver agent to the latest version. On Compute Engine (that I was running) you have control over this, I'm not sure how you'd do it on AppEngine, maybe trigger a new deploy?