Monit - how to identify crashes of a program instead of restarts

I am using Monit to monitor my program. The monitored program can crash in two situations:

  • Program can randomly crash. It just needs to be restarted
  • It gets into a bad state and crashes each time it is started subsequently

To fix the latter situation, I have a script that stops the program, resets it to a good state by cleaning its data files, and restarts it. I tried the config below:

check process program with pidfile program.pid
start program = "programStart" as uid username and gid groupname
stop program = "programStop" as uid username and gid groupname
if 3 restarts within 20 cycles then exec "cleanProgramAndRestart" as uid username and gid groupname
if 6 restarts within 20 cycles then timeout

Say Monit restarts the program 3 times in 3 cycles. After the third restart, the cleanProgramAndRestart script runs. However, because that script restarts the program yet again, the 3-restarts condition is met again in the next cycle, and the result is an infinite loop.

Could anyone suggest any way to fix this?

If any of the following is possible, there may be a way around this:

  • If there is a "crash" keyword, instead of "restarts", I will be able to run the clean script after the program crashes 3 times instead of after it is restarted 3 times
  • If there is a way to reset the "restarts" counter in some way after running the exec script
  • If there is a way to exec something only if output of the condition 3 restarts changed

There is 1 solution below

Monit polls your "tests" once every cycle. The cycle length, in seconds, is usually defined in /etc/monitrc with the statement set daemon cycle_length.
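
As a rough illustration, the cycle length is a single line in the control file. The value 60 below is only an assumed example, not something taken from the question:

# /etc/monitrc (location may differ on your system)
# poll all services once every 60 seconds; 60 is an example value
set daemon 60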

So if your cleanProgramAndRestart script takes less than one cycle to run, the loop shouldn't occur. Since it does occur, I guess cleanProgramAndRestart takes longer than one cycle.

You can:

  • Increase the cycle length in the Monit configuration
  • Check your program only every x cycles (make sure that cycle_length * x is greater than the run time of cleanProgramAndRestart)
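
A minimal sketch of the second option, reusing the placeholders from the question. The number 2 is an assumption for illustration only; pick a value so that 2 * cycle_length is longer than cleanProgramAndRestart needs. The first option simply means raising the number in the set daemon statement.

check process program with pidfile program.pid
  # test this service only on every 2nd cycle (2 is an example value)
  every 2 cycles
  start program = "programStart" as uid username and gid groupname
  stop program = "programStop" as uid username and gid groupname
  if 3 restarts within 20 cycles then exec "cleanProgramAndRestart" as uid username and gid groupname
  if 6 restarts within 20 cycles then timeout

Note that with every 2 cycles the restart tests are also evaluated less often, so a crash may take up to two cycles to be noticed.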

If you can't change these settings, there is a small workaround that uses a temp file:

check process program 
  with pidfile program.pid
  start program = "programStart" 
    as uid username and gid groupname
  stop program = "programStop" 
    as uid username and gid groupname
  if 3 restarts within 20 cycles 
  then exec "touch /tmp/program_crash" 
  if 6 restarts within 20 cycles then timeout

# choose x so that cycle_length * x is greater than the run time of cleanProgramAndRestart
check file program_crash with path /tmp/program_crash
  every x cycles
  if changed timestamp then exec "cleanProgramAndRestart"
    as uid username and gid groupname