The PostgreSQL service on the standby node of my PostgreSQL cluster starts and stops again and again after I started the Patroni service.
I want to build a PostgreSQL HA cluster on two computers using Patroni 1.6.4 and etcd 3.3. Firstly, I built an etcd cluster, which is healthy.
member 230e646882873b50 is healthy: got healthy result from http://10.19.170.119:2379
member afcefe35d67a646c is healthy: got healthy result from http://10.19.170.155:2379
cluster is healthy
Next, I built a stream replication PostgreSQL cluster on two computers (running on port 5433), which worked fine.
Then I stopped then PostgreSQL cluster and started Patroni on the primary server and the standby server.
The PostgreSQL and Patroni service on the primary server seemed working normally.
2020-04-02 18:17:22,402 INFO: Lock owner: pgsql_119; I am pgsql_119
2020-04-02 18:17:22,430 INFO: no action. i am the leader with the lock
2020-04-02 18:17:26,402 INFO: Lock owner: pgsql_119; I am pgsql_119
2020-04-02 18:17:26,430 INFO: no action. i am the leader with the lock
But a problem occurred on the standby server. Patroni on the standby server printed messages as follows:
2020-04-02 18:45:25,995 INFO: no action. i am a secondary and i am following aleader
2020-04-02 18:45:27,285 INFO: closed patroni connection to the postgresql cluster
2020-04-02 18:45:27,722 INFO: postmaster pid=7448
2020-04-02 18:45:27.832 HKT [7448] LOG: listening on IPv4 address "0.0.0.0", port 5433
2020-04-02 18:45:27.994 HKT [7448] LOG: redirecting log output to logging collector process
2020-04-02 18:45:27.994 HKT [7448] HINT: Future log output will appear in directory "log".
2020-04-02 18:45:30,058 INFO: Lock owner: pgsql_node119; I am pgsql_node155
2020-04-02 18:45:30,058 INFO: does not have lock
2020-04-02 18:45:30,058 INFO: establishing a new patroni connection to the postgres cluster
2020-04-02 18:45:31,162 INFO: no action. i am a secondary and i am following aleader
2020-04-02 18:45:32,460 INFO: closed patroni connection to the postgresql cluster
2020-04-02 18:45:32,875 INFO: postmaster pid=8820
2020-04-02 18:45:32.996 HKT [8820] LOG: listening on IPv4 address "0.0.0.0", port 5433
2020-04-02 18:45:33.161 HKT [8820] LOG: redirecting log output to logging collector process
2020-04-02 18:45:33.161 HKT [8820] HINT: Future log output will appear in directory "log".
2020-04-02 18:45:35,211 INFO: Lock owner: pgsql_node119; I am pgsql_node155
2020-04-02 18:45:35,211 INFO: does not have lock
2020-04-02 18:45:35,211 INFO: establishing a new patroni connection to the postgres cluster
2020-04-02 18:45:37,215 INFO: establishing a new patroni connection to the postgres cluster
The content of postgresql logs repeated as follows:
FATAL: the database system is starting up
LOG: redo starts at 0/3A000060
LOG: consistent recovery state reached at 0/3A000140
LOG: invalid record length at 0/3A000140: wanted 24, got 0
LOG: database system is ready to accept read only connections
LOG: started streaming WAL from primary at 0/3A000000 on timeline 40
LOG: received fast shutdown request
LOG: aborting any active transactions
FATAL: terminating connection due to administrator command
FATAL: terminating walreceiver process due to administrator command
LOG: shutting down
LOG: database system is shut down
LOG: database system was shut down in recovery at 2020-04-02 18:50:20 CST
That meant PostgreSQL on the standby server restarted every 5 seconds!
Here is one of my patroni.yml. Another one is the same except the ip address.
scope: pgsql
namespace: /pgsql/
name: pgsql_node119
restapi:
listen: 10.19.170.119:8008
connect_address: 10.19.170.119:8008
etcd:
host: 10.19.170.119:2379
bootstrap:
# this section will be written into Etcd:/<namespace>/<scope>/config after initializing new cluster
# and all other cluster members will use it as a `global configuration`
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576
master_start_timeout: 300
synchronous_mode: false
# check_timeline: true
postgresql:
use_pg_rewind: true
use_slots: true
postgresql:
listen: 0.0.0.0:5433
connect_address: 10.19.170.119:5433
data_dir: "/opt/postgresql-11/data"
bin_dir: "/opt/postgresql-11/bin"
# config_dir: /etc/postgresql/9.6/main
authentication:
replication:
username: repuser
password: repuserpwd
superuser:
username: postgres
password: postgrespwd
#watchdog:
# mode: automatic # Allowed values: off, automatic, required
# device: /dev/watchdog
# safety_margin: 5
tags:
nofailover: false
noloadbalance: false
clonefrom: false
nosync: false
Do you know why this happened or how to solve this problem? Thank you.
I found the similar problem in this issue of Patroni on github.
The immediate cause of the problem that Patroni on the standby server retart PostgreSQL so frequently is that the password file required by Patroni is modified by another process. There was an important parameter of postgresql,
pgpass
, that was not configured in mypatroni.yml
. The default value is "$HOME/.pgpass
" ($HOME
is the home directory of user "postgres").However, the file "
$HOME/.pgpass
" may be modified by other applications. This will make Patroni service abnormal.The solution is to set
pgpass
to the path of a .pgpass only accessed by Patroni. For example,