Rails Resque - AWS ECS task randomly stuck


We are running our Ruby on Rails application on AWS ECS. The application has multiple services per cluster, each running X tasks that are responsible for the jobs in their queue. For queueing we use Resque, which uses Redis as its database. Versions:

  • Rails 5.2.6
  • Ruby 2.5.9p229
  • Resque 2.4.0

Most of the time everything works fine, but sometimes some of the tasks just get stuck. After some investigation I found the following.

Output of htop:

    8 root       20   0  239M  137M 13928 S  0.0  1.8  0:04.73 `- /usr/local/bundle/bin/rake resque:workers QUEUE=import COUNT=1
   11 root       20   0 1091M  377M 32640 S  0.0  4.9  0:21.33 |  `- resque-2.4.0: Forked 27 at 1680393903
   28 root       20   0 1091M  377M 32640 S  0.0  4.9  0:00.11 |  |  `- ruby-timer-thr
   27 root       20   0 1027M  308M 20588 S  0.0  4.0  0:48.67 |  |  `- resque-2.4.0: Processing import since 1680
   32 root       20   0 1027M  308M 20588 S  0.0  4.0  0:00.00 |  |  |  `- connection_poo*
   30 root       20   0 1027M  308M 20588 S  0.0  4.0  0:00.00 |  |  |  `- connection_poo*
   29 root       20   0 1027M  308M 20588 S  0.0  4.0  0:48.46 |  |  |  `- ruby-timer-thr
   26 root       20   0 1091M  377M 32640 S  0.0  4.9  0:08.19 |  |  `- worker.rb:527
   23 root       20   0 1091M  377M 32640 S  0.0  4.9  0:00.03 |  |  `- ruby
   20 root       20   0 1091M  377M 32640 S  0.0  4.9  0:00.00 |  |  `- jemalloc_bg_thd
   19 root       20   0 1091M  377M 32640 S  0.0  4.9  0:00.35 |  |  `- connection_poo*
   18 root       20   0 1091M  377M 32640 S  0.0  4.9  0:00.39 |  |  `- connection_poo*
   17 root       20   0 1091M  377M 32640 S  0.0  4.9  0:00.49 |  |  `- connection_poo*
   16 root       20   0 1091M  377M 32640 S  0.0  4.9  0:00.29 |  |  `- connection_poo*
   15 root       20   0 1091M  377M 32640 S  0.0  4.9  0:00.43 |  |  `- connection_poo*
   13 root       20   0 1091M  377M 32640 S  0.0  4.9  0:00.46 |  |  `- connection_poo*
   10 root       20   0  239M  137M 13928 S  0.0  1.8  0:00.00 |  `- tasks.rb:32
    9 root       20   0  239M  137M 13928 S  0.0  1.8  0:00.00 |  `- ruby-timer-thr

When I run strace on PID 27, it returns this:

futex(0x7f7027dc52d0, FUTEX_WAIT_PRIVATE, 2, NULL

My understanding is that the process is waiting on a resource that is not available: either another process or thread is holding it, or the wait is related to a remote connection (Redis, Postgres, etc.). I'm not sure how to check which of these it is.
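One way to narrow this down from inside Ruby is to have the process print a backtrace for every thread when it receives a signal. The following is only a minimal sketch: the helper name `dump_thread_backtraces` is hypothetical, and it assumes the TTIN signal is not already used for something else in the worker. If the hang is in native code that never lets the Ruby VM run the handler, nothing will be printed, and a native tool such as gdb would be needed instead.

```ruby
# Hypothetical helper (not part of Resque): write the status and Ruby
# backtrace of every live thread in this process to the given IO.
def dump_thread_backtraces(io = $stderr)
  Thread.list.each do |thread|
    io.puts "Thread #{thread.object_id} status=#{thread.status.inspect} name=#{thread.name.inspect}"
    # Thread#backtrace returns nil for a dead thread.
    (thread.backtrace || ["<no backtrace>"]).each { |line| io.puts "    #{line}" }
  end
end

# Assumption: TTIN is otherwise unused in this process. Load this early
# (for example from an initializer) so forked children inherit the handler.
Signal.trap("TTIN") { dump_thread_backtraces }
```

With this loaded, `kill -TTIN 27` from inside the container should print where each Ruby thread of the stuck child is waiting, provided the handler gets a chance to run.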

I also noticed that we have multiple 'ruby-timer-thr' entries. strace on PID 29 returns '-1 EAGAIN' every couple of minutes; I will post the entire message once some of the tasks get stuck again.

strace on PID 28 returns this every second:

restart_syscall(<... resuming interrupted read ...>) = 0
poll([{fd=3, events=POLLIN}], 1, 100)   = 0 (Timeout)
poll([{fd=3, events=POLLIN}], 1, 100)   = 0 (Timeout)
poll([{fd=3, events=POLLIN}], 1, 100)   = 0 (Timeout)
poll([{fd=3, events=POLLIN}], 1, 100)   = 0 (Timeout)
poll([{fd=3, events=POLLIN}], 1, 100)   = 0 (Timeout)

I tried sending SIGKILL to PID 29 (SIGTERM didn't work); the task then created a new Resque fork and continued normally.

Does anyone have any idea what the problem is here, and is there any other way for me to debug this? Maybe there is a deadlock somewhere?
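To catch stuck tasks as they happen rather than by inspecting htop, the records Resque keeps for in-progress jobs could be checked for jobs that have been running too long. This is a rough sketch, not a verified setup: `long_running` is a hypothetical helper, and the commented usage assumes that `Resque::Worker.working` and `Worker#job` return a hash with a `"run_at"` timestamp string, as I understand Resque 2.x to do.

```ruby
require "time"

# Hypothetical helper: from a list of job records (hashes with a "run_at"
# timestamp string), keep those that started more than threshold_seconds ago.
def long_running(job_records, threshold_seconds, now = Time.now)
  job_records.select do |record|
    run_at = record["run_at"]
    next false if run_at.nil? # idle workers have no start time
    now - Time.parse(run_at) > threshold_seconds
  end
end

# Possible usage inside the app (assumed Resque calls, check against your version):
#   records = Resque::Worker.working.map { |w| w.job.merge("worker" => w.to_s) }
#   long_running(records, 3600).each do |r|
#     warn "#{r['worker']} busy on #{r['queue']} since #{r['run_at']}"
#   end
```

Running such a check on a schedule would at least record which queue and job class were active when a worker stopped making progress.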
