Getting 5xx error with AWS Application Load Balancer - fluctuating healthy and unhealthy target group

My web application on AWS EC2 + load balancer sometimes shows 500 errors. How do I know if the error is on the server side or the application side?

I am using a Route 53 domain with SSL on my URL. I set the ALB to redirect requests on port 80 to 443 and to forward requests on port 443 to the target group (the EC2 instance). However, the target group sometimes returns a 5xx error code when handling a request. Please see the screenshots for the metrics and configuration of the ALB.

[Screenshots: Target Group Metrics, Target Group Configuration, Load Balancer Metrics, Load Balancer Listeners, EC2 Metrics]

Right now the web application is unstable: sometimes it returns a 502, or a 503 Service Unavailable (it looks like a connection timeout).

I have set the ALB idle timeout to 4000 seconds. [Screenshot: ALB configuration]

The application uses Nuxt.js + PHP 7.0 + MySQL + Apache 2.4.54.

I have set MaxClients for the Apache prefork MPM to 1000, which should be enough to handle the requests to the application.

The EC2 instance is a t2.large, and its CPU and memory look sufficient for the workload.

It seems that if I request the IP address directly instead of the domain, the number of 5xx errors drops significantly (but they still occur).

I also have a WordPress application hosted on this EC2 instance under a subdomain (CNAME). I have never encountered any 5xx errors on that subdomain site, which makes me guess the errors are in my application code rather than on the server side.

Is the 5xx error from my application or from the server?

I also tried adding another EC2 instance to the target group to see whether at least one healthy instance would be available to handle requests. However, the application uses a third-party API with a strict IP whitelist policy, and from my research the Elastic IP I got from AWS cannot be attached to two different EC2 instances.

There is 1 solution below

First of all, if your application is prone to stutters, increase the health check retries and timeouts (the unhealthy threshold and the health check timeout in the target group settings). That addresses your initial question about flapping health.
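To see why a higher retry count calms the flapping, here is a minimal Python sketch of the consecutive-check rule a target group applies. It is a simplified model, not AWS code: `simulate_target_health`, `count_flips` and the sample sequence `STUTTERING_APP` are made up for illustration, and the two threshold arguments stand in for the target group's `HealthyThresholdCount` and `UnhealthyThresholdCount` settings.

```python
def simulate_target_health(check_results, healthy_threshold, unhealthy_threshold):
    """Replay health check outcomes and return the target state after each check.

    check_results: iterable of booleans, True for a passed check.
    The target turns unhealthy only after `unhealthy_threshold` consecutive
    failures, and healthy again only after `healthy_threshold` consecutive passes.
    """
    state = "healthy"
    streak = 0  # consecutive results that contradict the current state
    history = []
    for passed in check_results:
        contradicts = (state == "healthy") != passed
        streak = streak + 1 if contradicts else 0
        needed = unhealthy_threshold if state == "healthy" else healthy_threshold
        if streak >= needed:
            state = "unhealthy" if state == "healthy" else "healthy"
            streak = 0
        history.append(state)
    return history


def count_flips(history, initial="healthy"):
    """Count how many times the state changed, starting from `initial`."""
    states = [initial] + list(history)
    return sum(1 for before, after in zip(states, states[1:]) if before != after)


# Hypothetical check results for an app that occasionally fails two checks in a row.
STUTTERING_APP = [True, False, False, True, True, False, False, True, True, True]

if __name__ == "__main__":
    for unhealthy_threshold in (2, 3):
        history = simulate_target_health(STUTTERING_APP, 2, unhealthy_threshold)
        print(unhealthy_threshold, count_flips(history))
    # prints "2 4" then "3 0"
```

With an unhealthy threshold of 2 the sample target changes state four times. With 3 it never leaves the healthy state, because no failure streak is long enough.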
From what I can see in your screenshots, most of your 5xx errors come from either the server or the application. You are in a better position to tell which one is the culprit, since you have access to their logs.
To answer your question about 5xx errors coming from the LB: this happens right after the LB takes an unhealthy instance out of rotation and there is no healthy instance to replace it. That shouldn't normally be the case, because you are supposed to have an Auto Scaling group (ASG) if you enable evaluation of target health for the LB. With no target left to serve the request, the LB cannot produce a meaningful response and fails with a 5xx.
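One way to separate the two cases is the ALB access log, which records both the status code the load balancer sent (`elb_status_code`) and the one the target returned (`target_status_code`, or `-` when the target never answered). The sketch below assumes access logging is enabled (it is off by default) and that entries follow the standard space-delimited layout. `classify_5xx` is a hypothetical helper, and the sample lines are made-up, shortened entries. The CloudWatch metrics `HTTPCode_Target_5XX_Count` and `HTTPCode_ELB_5XX_Count` give the same split in aggregate.

```python
import shlex
from collections import Counter

# Zero-based field positions in an ALB access log entry
# (space-delimited; a quoted string counts as one field).
ELB_STATUS_FIELD = 8
TARGET_STATUS_FIELD = 9


def classify_5xx(log_line):
    """Return 'target', 'load_balancer', or None for an entry that is not a 5xx."""
    fields = shlex.split(log_line)
    elb_status = fields[ELB_STATUS_FIELD]
    target_status = fields[TARGET_STATUS_FIELD]
    if not elb_status.startswith("5"):
        return None
    if target_status.startswith("5"):
        return "target"  # the application or web server produced the error
    return "load_balancer"  # no 5xx from the target (typically "-")


# Made-up, shortened sample entries for illustration only.
SAMPLE_LINES = [
    'https 2023-01-01T00:00:00.000000Z app/example-alb/0123456789abcdef '
    '203.0.113.10:50000 10.0.0.5:80 0.001 0.120 0.000 500 500 200 512 '
    '"GET https://example.com:443/api HTTP/1.1" "curl/7.79.1"',
    'https 2023-01-01T00:00:01.000000Z app/example-alb/0123456789abcdef '
    '203.0.113.10:50001 - -1 -1 -1 503 - 200 337 '
    '"GET https://example.com:443/api HTTP/1.1" "curl/7.79.1"',
    'https 2023-01-01T00:00:02.000000Z app/example-alb/0123456789abcdef '
    '203.0.113.10:50002 10.0.0.5:80 0.001 0.050 0.000 200 200 200 1024 '
    '"GET https://example.com:443/ HTTP/1.1" "curl/7.79.1"',
]

if __name__ == "__main__":
    counts = Counter(classify_5xx(line) for line in SAMPLE_LINES)
    print(counts["target"], counts["load_balancer"])  # prints "1 1"
```

If most 5xx entries land in the `target` bucket, the place to look is the Apache and PHP error logs on the instance. If they land in `load_balancer`, look at target health and connection handling between the ALB and the instance.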
This should be enough information for you to adjust the settings and investigate the logs.