How to restart a Linux service automatically to avoid server downtime

If a long-running Linux service quits unexpectedly (a crash, a signal, an accidental kill, etc.), you will want it restarted automatically to avoid or reduce downtime. The systemd service restart policy makes this easy to control.

Last Update: February 02, 2022

Symptom

I had an nginx server that had been running for months when I suddenly got an alarm from the monitoring service indicating that nginx was no longer serving requests. I could still ssh to the server, so the machine itself was online. Checking the service with systemctl status nginx showed that nginx was not running because of a core dump. Yes, even nginx may crash.

$ systemctl status nginx   # or sudo service nginx status
● nginx.service - A high performance web server and a reverse proxy server
     Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: enabled)
     Active: failed (Result: core-dump) since Thu 2022-01-06 18:51:49 PST; 702ms ago
       Docs: man:nginx(8)
    Process: 720751 ExecReload=/usr/sbin/nginx -g daemon on; master_process on; -s reload (code=exited, status=0/SUCCES>
   Main PID: 700787 (code=dumped, signal=SEGV)
      Tasks: 0 (limit: 1110)
     Memory: 14.5M
     CGroup: /system.slice/nginx.service

Jan 06 10:57:38 systemd[1]: Reloading A high performance web server and a reverse proxy server.
Jan 06 10:57:38 systemd[1]: Reloaded A high performance web server and a reverse proxy server.
Jan 06 18:51:46 systemd[1]: Reloading A high performance web server and a reverse proxy server.
Jan 06 18:51:46 systemd[1]: Reloaded A high performance web server and a reverse proxy server.
Jan 06 18:51:49 systemd[1]: nginx.service: Main process exited, code=dumped, status=11/SEGV
Jan 06 18:51:49 systemd[1]: nginx.service: Killing process 718970 (nginx) with signal SIGKILL.
Jan 06 18:51:49 systemd[1]: nginx.service: Killing process 718971 (nginx) with signal SIGKILL.
Jan 06 18:51:49 systemd[1]: nginx.service: Killing process 718970 (nginx) with signal SIGKILL.
Jan 06 18:51:49 systemd[1]: nginx.service: Killing process 718971 (nginx) with signal SIGKILL.
Jan 06 18:51:49 systemd[1]: nginx.service: Failed with result 'core-dump'.

ps -ef|grep nginx also confirms that no nginx process is running (the only match is the grep command itself):

$ ps -ef|grep nginx
ubuntu    720761  718615  0 18:52 pts/0    00:00:00 grep --color=auto nginx
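
The manual check above can be wrapped in a small function for use in scripts or cron jobs. This is a minimal sketch, assuming systemd is the init system; the function name service_state is made up for illustration.

```shell
#!/bin/sh
# service_state UNIT
# Prints what `systemctl is-active` reports for the unit ("active",
# "inactive", "failed", ...), or "unknown" if it prints nothing at all.
service_state() {
    # is-active exits non-zero for anything but "active"; we only want the text.
    state=$(systemctl is-active "$1" 2>/dev/null) || true
    echo "${state:-unknown}"
}
```

For the crashed server in this article, service_state nginx would be expected to print failed, matching the Active: line of the status output.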

I can restart the nginx service manually to recover, but I would rather have the service restart automatically to keep downtime to a minimum.

Solution to automatically restart Linux service to avoid downtime

Luckily, systemd, the Linux system and service manager, already provides this feature in the service configuration: you can specify a Restart policy.

Service `Restart` policy

The following is the full description of the Restart= option from the systemd.service(5) man page:

Configures whether the service shall be restarted when the service process exits, is killed, or a timeout is reached. The service process may be the main service process, but it may also be one of the processes specified with ExecStartPre=, ExecStartPost=, ExecStop=, ExecStopPost=, or ExecReload=. When the death of the process is a result of systemd operation (e.g. service stop or restart), the service will not be restarted. Timeouts include missing the watchdog “keep-alive ping” deadline and a service start, reload, and stop operation timeouts.

The Restart value takes one of no, on-success, on-failure, on-abnormal, on-watchdog, on-abort, or always.

If set to no (the default), the service will not be restarted.

If set to on-success, it will be restarted only when the service process exits cleanly. In this context, a clean exit means any of the following:

exit code of 0;
for types other than Type=oneshot, one of the signals SIGHUP, SIGINT, SIGTERM, or SIGPIPE;
exit statuses and signals specified in SuccessExitStatus=.

If set to on-failure, the service will be restarted when the process exits with a non-zero exit code, is terminated by a signal (including on core dump, but excluding the aforementioned four signals), when an operation (such as service reload) times out, and when the configured watchdog timeout is triggered.

If set to on-abnormal, the service will be restarted when the process is terminated by a signal (including on core dump, excluding the aforementioned four signals), when an operation times out, or when the watchdog timeout is triggered.

If set to on-abort, the service will be restarted only if the service process exits due to an uncaught signal not specified as a clean exit status. If set to on-watchdog, the service will be restarted only if the watchdog timeout for the service expires.

If set to always, the service will be restarted regardless of whether it exited cleanly or not, got terminated abnormally by a signal, or hit a timeout.

Table: Exit causes and the effect of the Restart= settings

┌──────────────┬────┬────────┬────────────┬────────────┬─────────────┬──────────┬─────────────┐
│Restart       │ no │ always │ on-success │ on-failure │ on-abnormal │ on-abort │ on-watchdog │
│settings/Exit │    │        │            │            │             │          │             │
│causes        │    │        │            │            │             │          │             │
├──────────────┼────┼────────┼────────────┼────────────┼─────────────┼──────────┼─────────────┤
│Clean exit    │    │ X      │ X          │            │             │          │             │
│code or       │    │        │            │            │             │          │             │
│signal        │    │        │            │            │             │          │             │
├──────────────┼────┼────────┼────────────┼────────────┼─────────────┼──────────┼─────────────┤
│Unclean exit  │    │ X      │            │ X          │             │          │             │
│code          │    │        │            │            │             │          │             │
├──────────────┼────┼────────┼────────────┼────────────┼─────────────┼──────────┼─────────────┤
│Unclean       │    │ X      │            │ X          │ X           │ X        │             │
│signal        │    │        │            │            │             │          │             │
├──────────────┼────┼────────┼────────────┼────────────┼─────────────┼──────────┼─────────────┤
│Timeout       │    │ X      │            │ X          │ X           │          │             │
├──────────────┼────┼────────┼────────────┼────────────┼─────────────┼──────────┼─────────────┤
│Watchdog      │    │ X      │            │ X          │ X           │          │ X           │
└──────────────┴────┴────────┴────────────┴────────────┴─────────────┴──────────┴─────────────┘
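
The table can also be read as a lookup from (policy, exit cause) to "restart or not". The sketch below transcribes it into a shell function. The function name will_restart and the short cause labels are invented for this example, and it ignores the RestartPreventExitStatus= and RestartForceExitStatus= exceptions.

```shell
#!/bin/sh
# will_restart POLICY CAUSE
# Succeeds (exit status 0) when the table marks POLICY x CAUSE with an X.
# CAUSE is one of: clean, unclean-exit, unclean-signal, timeout, watchdog
# (shorthand labels made up for this sketch, one per table row).
will_restart() {
    case "$1:$2" in
        always:clean|always:unclean-exit|always:unclean-signal) return 0 ;;
        always:timeout|always:watchdog) return 0 ;;
        on-success:clean) return 0 ;;
        on-failure:unclean-exit|on-failure:unclean-signal) return 0 ;;
        on-failure:timeout|on-failure:watchdog) return 0 ;;
        on-abnormal:unclean-signal|on-abnormal:timeout|on-abnormal:watchdog) return 0 ;;
        on-abort:unclean-signal) return 0 ;;
        on-watchdog:watchdog) return 0 ;;
        *) return 1 ;;   # includes every cause under Restart=no
    esac
}

# The crash in the Symptom section was a SIGSEGV, i.e. an unclean signal.
for policy in no on-success on-failure on-abnormal on-abort on-watchdog always; do
    if will_restart "$policy" unclean-signal; then
        echo "$policy: restart"
    else
        echo "$policy: stay down"
    fi
done
```

For that crash, the loop reports a restart for on-failure, on-abnormal, on-abort and always, which is the "Unclean signal" row of the table.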

As exceptions to the setting above, the service will not be restarted if the exit code or signal is specified in RestartPreventExitStatus= or the service is stopped with systemctl stop or an equivalent operation. Also, the services will always be restarted if the exit code or signal is specified in RestartForceExitStatus=.
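
To make these two exceptions concrete, here is a sketch of a drop-in that uses both. The exit code 75 and the choice of SIGABRT are arbitrary values picked for illustration, not nginx conventions. The file is written to a scratch directory so nothing under /etc is touched.

```shell
#!/bin/sh
# Write an example drop-in to a scratch directory for review.
scratch=$(mktemp -d)
cat > "$scratch/override.conf" <<'EOF'
[Service]
# on-abnormal restarts after unclean signals, timeouts and watchdog expiry,
# but not after a plain non-zero exit code.
Restart=on-abnormal
# Hypothetical: exit code 75 means "temporary failure, try again",
# so force a restart although on-abnormal alone would not.
RestartForceExitStatus=75
# Hypothetical: never restart after SIGABRT, although on-abnormal alone would.
RestartPreventExitStatus=SIGABRT
EOF
echo "wrote $scratch/override.conf"
```

After review, the content could be pasted into the editor opened by sudo systemctl edit for the unit in question.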

Note that service restart is subject to unit start rate limiting configured with StartLimitIntervalSec= and StartLimitBurst=, see systemd.unit(5) for details. A restarted service enters the failed state only after the start limits are reached.
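
As a sketch of how rate limiting and restarts fit together, the drop-in below combines the two start limit settings with RestartSec=, the delay systemd waits before each restart. All numbers are example values, not recommendations. On current systemd versions the StartLimit settings belong in the [Unit] section. The file is written to a scratch directory so nothing under /etc is touched.

```shell
#!/bin/sh
# Write an example drop-in to a scratch directory for review.
scratch=$(mktemp -d)
cat > "$scratch/override.conf" <<'EOF'
[Unit]
# Allow at most 5 starts within any 60 second window (example values).
StartLimitIntervalSec=60
StartLimitBurst=5

[Service]
Restart=always
# Wait 5 seconds between an exit and the next start attempt (example value).
RestartSec=5
EOF
echo "wrote $scratch/override.conf"
```

With these values, a service that crashes immediately on every start is retried every 5 seconds, uses up its 5 starts well inside the 60 second window, and then stays in the failed state. sudo systemctl reset-failed nginx clears the counter so the service can be started again.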

Setting this to on-failure is the recommended choice for long-running services, in order to increase reliability by attempting automatic recovery from errors. For services that shall be able to terminate on their own choice (and avoid immediate restarting), on-abnormal is an alternative choice.

Change `Restart` policy

systemctl has an edit command to override the service configuration:

edit UNIT...
   Edit a drop-in snippet or a whole replacement file if --full is specified, to extend or override the specified unit.

To override the existing unit file for nginx, run sudo systemctl edit nginx, then paste the following two lines to set the Restart policy to always, so that the service is restarted regardless of whether it exited cleanly or not, got terminated abnormally by a signal, or hit a timeout:

[Service]
Restart=always

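Save and quit. systemctl edit reloads the systemd configuration when the editor exits, so the new policy takes effect without a separate daemon-reload.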

The sudo systemctl edit nginx command writes the override to /etc/systemd/system/nginx.service.d/override.conf; you can use cat to see its content.

$ cat /etc/systemd/system/nginx.service.d/override.conf
[Service]
Restart=always
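
Where opening an editor is inconvenient (a provisioning script, for example), the same drop-in can be created non-interactively. This is a minimal sketch; the function name write_restart_dropin and its optional third argument are made up for this example.

```shell
#!/bin/sh
# write_restart_dropin UNIT POLICY [UNIT_DIR]
# Creates UNIT_DIR/UNIT.d/override.conf containing the given Restart= policy.
# UNIT must include its suffix (nginx.service, not nginx).
# UNIT_DIR defaults to /etc/systemd/system; writing there needs root.
write_restart_dropin() {
    unit=$1 policy=$2 unit_dir=${3:-/etc/systemd/system}
    mkdir -p "$unit_dir/$unit.d" || return 1
    printf '[Service]\nRestart=%s\n' "$policy" > "$unit_dir/$unit.d/override.conf"
}
```

Run as root, write_restart_dropin nginx.service always followed by systemctl daemon-reload should leave the same file as shown above. Unlike systemctl edit, a hand-written drop-in is only picked up after daemon-reload.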

You can also check the full configuration of the nginx service unit, including drop-ins, with systemctl cat nginx.service:

$ systemctl cat nginx.service
# /lib/systemd/system/nginx.service
# Stop dance for nginx
# =======================
#
# ExecStop sends SIGSTOP (graceful stop) to the nginx process.
# If, after 5s (--retry QUIT/5) nginx is still running, systemd takes control
# and sends SIGTERM (fast shutdown) to the main process.
# After another 5s (TimeoutStopSec=5), and if nginx is alive, systemd sends
# SIGKILL to all the remaining processes in the process group (KillMode=mixed).
#
# nginx signals reference doc:
# http://nginx.org/en/docs/control.html
#
[Unit]
Description=A high performance web server and a reverse proxy server
Documentation=man:nginx(8)
After=network.target

[Service]
Type=forking
PIDFile=/run/nginx.pid
ExecStartPre=/usr/sbin/nginx -t -q -g 'daemon on; master_process on;'
ExecStart=/usr/sbin/nginx -g 'daemon on; master_process on;'
ExecReload=/usr/sbin/nginx -g 'daemon on; master_process on;' -s reload
ExecStop=-/sbin/start-stop-daemon --quiet --stop --retry QUIT/5 --pidfile /run/nginx.pid
TimeoutStopSec=5
KillMode=mixed

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/nginx.service.d/override.conf
[Service]
Restart=always
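
When a unit has several drop-ins, it can be handy to pull the winning Restart= value out of this output. The sketch below assumes that for a plain setting like Restart= the last assignment printed by systemctl cat is the one that applies. The function name effective_restart is made up for this example.

```shell
#!/bin/sh
# effective_restart
# Reads `systemctl cat` output on stdin and prints the value of the last
# Restart= line. Prints nothing when the unit never sets Restart=.
effective_restart() {
    # -F= splits on "=", so RestartSec= and similar keys do not match.
    awk -F= '$1 == "Restart" { value = $2 } END { if (value != "") print value }'
}
```

Usage: systemctl cat nginx.service | effective_restart, which for the output above prints always. To ask systemd directly instead of parsing files, systemctl show --property=Restart nginx reports the value it has loaded.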

Then start the nginx service:

$ sudo systemctl start nginx
$ systemctl status nginx
● nginx.service - A high performance web server and a reverse proxy server
     Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/nginx.service.d
             └─override.conf
     Active: active (running) since Tue 2022-02-01 15:18:18 PST; 4s ago
       Docs: man:nginx(8)
    Process: 2302672 ExecStartPre=/usr/sbin/nginx -t -q -g daemon on; master_process on; (code=exited, status=0/SUCCESS)
    Process: 2302673 ExecStart=/usr/sbin/nginx -g daemon on; master_process on; (code=exited, status=0/SUCCESS)
   Main PID: 2302674 (nginx)
      Tasks: 3 (limit: 1113)
     Memory: 9.2M
     CGroup: /system.slice/nginx.service
             ├─2302674 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
             ├─2302675 nginx: worker process
             └─2302676 nginx: worker process

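To test, kill the nginx processes with sudo pkill nginx, then check with systemctl status nginx. You should see that nginx is active (running) again but with different process IDs, which indicates that the service restarted automatically. Note that pkill sends SIGTERM by default, which systemd counts as a clean exit, so this test triggers a restart with Restart=always but would not with Restart=on-failure. Cheers.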

$ sudo pkill nginx
$ systemctl status nginx
● nginx.service - A high performance web server and a reverse proxy server
     Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/nginx.service.d
             └─override.conf
     Active: active (running) since Tue 2022-02-01 15:18:56 PST; 2s ago
       Docs: man:nginx(8)
    Process: 2302830 ExecStartPre=/usr/sbin/nginx -t -q -g daemon on; master_process on; (code=exited, status=0/SUCCESS)
    Process: 2302831 ExecStart=/usr/sbin/nginx -g daemon on; master_process on; (code=exited, status=0/SUCCESS)
   Main PID: 2302832 (nginx)
      Tasks: 3 (limit: 1113)
     Memory: 9.8M
     CGroup: /system.slice/nginx.service
             ├─2302832 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
             ├─2302833 nginx: worker process
             └─2302834 nginx: worker process
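
Comparing the Main PID by eye works, but the check can also be scripted. This is a minimal sketch that polls systemd for the unit's main PID; the function names main_pid and wait_for_restart and the 10 second default timeout are choices made for this example.

```shell
#!/bin/sh
# main_pid UNIT
# Prints the PID systemd tracks as the unit's main process (0 when not running).
main_pid() {
    systemctl show --property=MainPID --value "$1"
}

# wait_for_restart UNIT OLD_PID [TIMEOUT_SECONDS]
# Polls once per second until the unit reports a main PID that is neither 0
# nor OLD_PID. Succeeds when that happens, fails when the timeout runs out.
wait_for_restart() {
    unit=$1 old_pid=$2 remaining=${3:-10}
    while [ "$remaining" -gt 0 ]; do
        pid=$(main_pid "$unit") || pid=0
        if [ -n "$pid" ] && [ "$pid" != 0 ] && [ "$pid" != "$old_pid" ]; then
            echo "$unit restarted: main PID $old_pid -> $pid"
            return 0
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    echo "$unit did not come back within the timeout" >&2
    return 1
}
```

A test run could then look like old=$(main_pid nginx); sudo pkill nginx; wait_for_restart nginx "$old". The function exits with a non-zero status if nginx stays down, which makes it usable in a deployment check.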