Django RQ rqworker freezes indefinitely


This week, my integration tests stopped working. I traced it to a django-rq job that just hangs indefinitely. My output:

$: RQ worker 'rq:worker:47e0aaf280be.13' started, version 0.12.0
$: *** Listening on default...
$: Cleaning registries for queue: default
$: default: myapp.engine.rules.process_event(<myapp.engine.event.Event object at 0x7f34f1ce50f0>) (a1e66a46-1a9d-4f52-be6f-6f4529dd2480)

And that's the point at which it freezes; I have to keyboard-interrupt it.

The code has not changed. To be certain, I checked out the master branch and re-ran the integration tests, and they, too, fail.

How can I start to debug Redis or RQ from the test case in Python to understand what might be happening? Is there a way to view the actual queue record via Python? The Redis queue only exists while the test is running, and since the test is frozen, I can review the queue via redis-cli from the Docker container running the Redis service.
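For inspecting the queue from redis-cli while the test is frozen, something like the following should work against a live Redis instance (key names follow RQ's rq:* convention; the host and job/worker IDs are placeholders to fill in):

```shell
redis-cli -h my_queue KEYS 'rq:*'                    # all RQ-related keys
redis-cli -h my_queue LRANGE rq:queue:default 0 -1   # job IDs still waiting on the queue
redis-cli -h my_queue HGETALL rq:job:<job-id>        # one job's stored fields (status, origin, data, ...)
redis-cli -h my_queue TTL rq:worker:<worker-name>    # worker heartbeat key expiry
```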

The method I'm using to debug so far is:

import logging
import time

from rq import Queue
from redis import Redis
from django_rq import get_worker

logger = logging.getLogger(__name__)
...
  def test_motion_alarm(self):
      motion_sensor_data = {"motion_detected": 1}
      post_alarm(
          self.live_server_url,
          self.location,
          self.sensor_device_id,
          "ALARM_MOTIONDETECTED",
          motion_sensor_data
      )

      # 'my_queue' is the hostname of the Redis service on the Docker network
      redis_conn = Redis('my_queue')
      q = Queue(connection=redis_conn)
      print(len(q))
      queued_job_ids = q.job_ids
      queued_jobs = q.jobs
      logger.debug('RQ info: \njob IDs: {}, \njobs: {}'.format(queued_job_ids, queued_jobs))
      get_worker().work(burst=True)

      time.sleep(1)

      self.assertTrue(db.event_exists_at_location(
          db.get_location_by_motion_detected(self.location_id),
          "ALARM_MOTIONDETECTED"))

Which produces this debug output:

$ DEBUG [myapi.tests.integration.test_rules:436] RQ info: 
job IDs: ['bef879c4-832d-431d-97e7-9eec9f4bf5d7']
jobs: [Job('bef879c4-832d-431d-97e7-9eec9f4bf5d7', enqueued_at=datetime.datetime(2018, 12, 6, 0, 10, 14, 829488))]
$ RQ worker 'rq:worker:54f6054e7aa5.7' started, version 0.12.0
$ *** Listening on default...
$ Cleaning registries for queue: default
$ default: myapi.engine.rules.process_event(<myapi.engine.event.Event object at 0x7fbf204e8c50>) (bef879c4-832d-431d-97e7-9eec9f4bf5d7)

And in the queue container, running a monitor process on the queue, I see every once in a while a new batch of monitor responses:

1544110882.343826 [0 172.19.0.4:38905] "EXPIRE" "rq:worker:ac50518f1c5e.7" "35"
1544110882.344304 [0 172.19.0.4:38905] "HSET" "rq:worker:ac50518f1c5e.7" "last_heartbeat" "2018-12-06T15:41:22.344170Z"
1544110882.968846 [0 172.19.0.4:38910] "EXPIRE" "rq:worker:ac50518f1c5e.12" "35"
1544110882.969651 [0 172.19.0.4:38910] "HSET" "rq:worker:ac50518f1c5e.12" "last_heartbeat" "2018-12-06T15:41:22.969181Z"
1544110884.122917 [0 172.19.0.4:38919] "EXPIRE" "rq:worker:ac50518f1c5e.13" "35"
1544110884.124966 [0 172.19.0.4:38919] "HSET" "rq:worker:ac50518f1c5e.13" "last_heartbeat" "2018-12-06T15:41:24.124809Z"
1544110884.708910 [0 172.19.0.4:38925] "EXPIRE" "rq:worker:ac50518f1c5e.14" "35"
1544110884.710736 [0 172.19.0.4:38925] "HSET" "rq:worker:ac50518f1c5e.14" "last_heartbeat" "2018-12-06T15:41:24.710599Z"
1544110885.415111 [0 172.19.0.4:38930] "EXPIRE" "rq:worker:ac50518f1c5e.15" "35"
1544110885.417279 [0 172.19.0.4:38930] "HSET" "rq:worker:ac50518f1c5e.15" "last_heartbeat" "2018-12-06T15:41:25.417155Z"
1544110886.028965 [0 172.19.0.4:38935] "EXPIRE" "rq:worker:ac50518f1c5e.16" "35"
1544110886.030002 [0 172.19.0.4:38935] "HSET" "rq:worker:ac50518f1c5e.16" "last_heartbeat" "2018-12-06T15:41:26.029817Z"
1544110886.700132 [0 172.19.0.4:38940] "EXPIRE" "rq:worker:ac50518f1c5e.17" "35"
1544110886.701861 [0 172.19.0.4:38940] "HSET" "rq:worker:ac50518f1c5e.17" "last_heartbeat" "2018-12-06T15:41:26.701716Z"
1544110887.359702 [0 172.19.0.4:38945] "EXPIRE" "rq:worker:ac50518f1c5e.18" "35"
1544110887.361642 [0 172.19.0.4:38945] "HSET" "rq:worker:ac50518f1c5e.18" "last_heartbeat" "2018-12-06T15:41:27.361481Z"
1544110887.966641 [0 172.19.0.4:38950] "EXPIRE" "rq:worker:ac50518f1c5e.19" "35"
1544110887.967931 [0 172.19.0.4:38950] "HSET" "rq:worker:ac50518f1c5e.19" "last_heartbeat" "2018-12-06T15:41:27.967760Z"
1544110888.595785 [0 172.19.0.4:38955] "EXPIRE" "rq:worker:ac50518f1c5e.20" "35"
1544110888.596962 [0 172.19.0.4:38955] "HSET" "rq:worker:ac50518f1c5e.20" "last_heartbeat" "2018-12-06T15:41:28.596799Z"
1544110889.199269 [0 172.19.0.4:38960] "EXPIRE" "rq:worker:ac50518f1c5e.21" "35"
1544110889.200416 [0 172.19.0.4:38960] "HSET" "rq:worker:ac50518f1c5e.21" "last_heartbeat" "2018-12-06T15:41:29.200265Z"
1544110889.783128 [0 172.19.0.4:38965] "EXPIRE" "rq:worker:ac50518f1c5e.22" "35"
1544110889.785444 [0 172.19.0.4:38965] "HSET" "rq:worker:ac50518f1c5e.22" "last_heartbeat" "2018-12-06T15:41:29.785158Z"
1544110890.422338 [0 172.19.0.4:38970] "EXPIRE" "rq:worker:ac50518f1c5e.23" "35"
1544110890.423470 [0 172.19.0.4:38970] "HSET" "rq:worker:ac50518f1c5e.23" "last_heartbeat" "2018-12-06T15:41:30.423314Z"

And, curiously or maybe by design, every time one of these batches goes by, it ends on the :30 or :00 second mark.

So I can confirm that, yes, the queue does have this item in it, and a worker has dequeued the job, so why doesn't the job ever actually execute?
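One way I might try to find out where the hung worker is actually stuck is the stdlib faulthandler module, which can dump every thread's current stack. In a real worker you would call faulthandler.register(signal.SIGUSR1) at startup and send `kill -USR1 <worker-pid>` to the frozen process; the sketch below just demonstrates the dump itself (the temp-file handling is illustrative):

```python
# Sketch: dump the current stacks of a Python process with faulthandler.
# dump_traceback() needs a real file descriptor, so write to a temp file.
import faulthandler
import tempfile

with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f)  # stacks of all running threads
    f.seek(0)
    trace = f.read()

print(trace)  # "Current thread 0x... (most recent call first): ..."
```

On the frozen worker, the dumped stack would show exactly which call inside process_event (or inside rq itself) never returns.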

1 Answer

Best Answer

It appears this is a recently reported defect in the rq-scheduler library: https://github.com/rq/rq-scheduler/issues/197

There's a PR in the works for it. However, it came to my attention that we had allowed our redis library to upgrade to 3.0.0 without explicitly pinning a version, and that is ultimately what broke the system.
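For context, redis-py 3.0.0 changed the calling convention of several commands; notably, zadd() now takes a {member: score} mapping instead of the positional score/member pairs the 2.x client accepted, which older rq/rq-scheduler releases still used. A minimal sketch of the incompatibility (FakeRedis3 is illustrative and only mimics the 3.x zadd signature; it is not part of redis-py):

```python
# Sketch of the redis-py 2.x -> 3.0 zadd() incompatibility that can
# break libraries still calling the old positional form.

class FakeRedis3:
    """Mimics only the redis-py 3.x zadd() signature, for demonstration."""
    def __init__(self):
        self.zsets = {}

    def zadd(self, name, mapping):
        # redis-py 3.x expects a {member: score} mapping
        self.zsets.setdefault(name, {}).update(mapping)
        return len(mapping)

conn = FakeRedis3()

# redis-py 2.x positional style no longer works on a 3.x client:
old_style_failed = False
try:
    conn.zadd("rq:scheduler:scheduled_jobs", 1544110882, "job-id")
except TypeError:
    old_style_failed = True  # this kind of failure is what broke the scheduler

# redis-py 3.x mapping style succeeds:
conn.zadd("rq:scheduler:scheduled_jobs", {"job-id": 1544110882})
```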

In the build scripts, I set the Dockerfile to perform RUN pip install redis==2.10.6, which alleviates the issue for the time being.
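An alternative to baking the install into the Dockerfile is pinning the versions in requirements.txt so the constraint is tracked with the code. Only the redis pin comes from this answer; the rq version is taken from the worker log above:

```
redis==2.10.6
rq==0.12.0
```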