Envoy proxy hangs client connection?


I’m looking for ideas on debugging an issue we’re seeing in production. Sorry for the length of this post, but I want to include all the details that might help.

We are using Envoy v1.25.5 as an egress proxy so that code running in our AWS cluster can reach services outside the cluster.

This has been working just fine 24x7 for thousands of clients. However, there is one specific client, making one specific call, that hangs, and I have yet to figure out why. Here is what it looks like:

The client is calling the AWS JavaScript SDK v2 to upload a file to S3. The S3 bucket is outside the cluster, so the traffic flows through Envoy. This is a decent-sized file, 850 MB; however, much larger files (10 GB) work with no problem. Only this one specific file fails. (As an aside, the file is slightly different on each day's run, since its content changes slightly from day to day.)

If I swap out Envoy for our old Squid proxy, it works fine.

Details on the call: the s3.upload API does a multipart upload under the covers. I can see in the AWS logs that a few hundred parts upload successfully, but the upload as a whole never completes. I think it fails to complete because several parts never log that they've finished. The parts are numbered sequentially, and several are uploaded at once, e.g. “PartNumber 1 completed in 22.3s, PartNumber2 completed in 7.45s”. The parts take between 5 and 35 seconds each, so not overly lengthy. The whole upload process takes about 30 minutes.
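To pin down exactly which parts never report completion, something like the following could be run over the client-side log. This is only a minimal sketch: it assumes the completion lines look like the two quoted above, and the function name `missing_parts`, the `last_part` parameter, and the third entry in the sample string are made up for illustration.

```python
import re

# Matches lines such as "PartNumber 1 completed in 22.3s" and
# "PartNumber2 completed in 7.45s" (the space after "PartNumber" is optional).
PART_DONE = re.compile(r"PartNumber\s*(\d+)\s+completed")


def missing_parts(log_text, last_part=None):
    """Return the part numbers in 1..last_part that never logged completion.

    If last_part is not given, the highest completed part number is used,
    so only gaps below that number are reported.
    """
    done = {int(n) for n in PART_DONE.findall(log_text)}
    upper = last_part if last_part is not None else max(done, default=0)
    return [n for n in range(1, upper + 1) if n not in done]


# Illustrative input only: the first two entries are from the post,
# the third is a hypothetical extra line so that a gap exists.
sample = (
    "PartNumber 1 completed in 22.3s, "
    "PartNumber2 completed in 7.45s, "
    "PartNumber 4 completed in 9.1s"
)
print(missing_parts(sample))
```

Knowing the missing part numbers (and roughly when they were started) would give a time window to search for in the Envoy logs.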

I’ve enabled ‘trace’-level Envoy logs, and in those 30 minutes Envoy produced 2.5 million lines of logs. That’s a lot to look through.
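One way to cut that volume down might be to group the lines by connection and look at the last thing logged for each one, so that connections which simply went quiet stand out. A rough sketch, assuming the connection-level log lines carry a `[C<id>]` tag; the function name, the file name in the comment, and the sample lines are hypothetical and not real Envoy output.

```python
import re
from collections import defaultdict

# Assumption: connection-level log lines contain a tag like "[C123]".
CONN_TAG = re.compile(r"\[C(\d+)\]")


def last_line_per_connection(lines):
    """Return (last_line, line_count), both keyed by connection id.

    Lines without a connection tag are ignored.
    """
    last = {}
    counts = defaultdict(int)
    for line in lines:
        match = CONN_TAG.search(line)
        if match:
            conn_id = int(match.group(1))
            last[conn_id] = line.rstrip("\n")
            counts[conn_id] += 1
    return last, dict(counts)


# Hypothetical sample lines, only to show the shape of the result.
sample_lines = [
    "[C12] example message one\n",
    "[C47] example message two\n",
    "a line with no connection tag\n",
    "[C12] example message three\n",
]
last, counts = last_line_per_connection(sample_lines)
for conn_id in sorted(last):
    print(conn_id, counts[conn_id], last[conn_id])

# On a real log, the file can be streamed instead of loaded into memory:
# with open("envoy-trace.log") as f:   # hypothetical file name
#     last, counts = last_line_per_connection(f)
```

Because the function only keeps one line and one counter per connection, it should cope with a few million lines without needing the whole file in memory.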

In the logs I can see plenty of above/below watermark messages, but I can’t find any 4xx or 5xx status codes in the access logs. The only errors I find seem to be for other upstream hosts: not S3, but, for example, the AWS logs API, which always seems to close the upstream connection without a status. So I don’t think those are related.

I’m looking for ideas on how to debug this further, or on what could be happening. Thank you, Dan
