Re: BUG? Slave don't reconnect to the master - Mailing list pgsql-general

From Олег Самойлов
Subject Re: BUG? Slave don't reconnect to the master
Msg-id 52BADD0F-E348-43BB-8992-FDE86F3B1A65@ya.ru
In response to Re: BUG? Slave don't reconnect to the master  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
Responses Re: BUG? Slave don't reconnect to the master  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
List pgsql-general

> On 19 Aug 2020, at 16:07, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
>
> On Tue, 18 Aug 2020 13:48:41 +0300
> Олег Самойлов <splarv@ya.ru> wrote:
>
>> Hi all.
>>
>> I found some strange behaviour of postgres, which I recognise as a bug. First
>> of all, let me explain the situation.
>>
>> I created a "test bed" (not sure what the right term is) to test high
>> availability clusters based on Pacemaker and PostgreSQL. The test bed
>> consists of 12 virtual machines (on VirtualBox) running on a MacBook Pro,
>> forming 4 HA clusters with different structures. All 4 HA clusters are
>> constantly tested in a loop: failures of various kinds are simulated,
>> failover is awaited, the cluster is repaired, and so on. For simplicity
>> I'll explain only one HA cluster. It consists of 3 virtual machines, with
>> the master on one node and the sync and async slaves on the others. The
>> PostgreSQL service is provided by float IPs pointing to the working master
>> and slaves. Slaves are connected to the master float IP too. When Pacemaker
>> detects a failure, for instance on the master, it promotes a master on
>> another node (the one with the lowest WAL lag) and switches the float IPs,
>> so the third node remains a sync slave. My company decided to release this
>> project as open source; I am now finishing the formalities.
>
> As the maintainer of PAF[1], I'm looking forward to discovering it :)
> Do not hesitate to ping me off-list as well with regard to Pacemaker and
> resource agents.

I am glad to see you here. Many thanks for the PAF module.

>
>> It almost works fine, but sometimes, rather rarely, I observed that a slave
>> doesn't reconnect to the new master after a failure. The first case is
>> PostgreSQL-STOP, when I `kill` postgres on the master with the STOP signal
>> to simulate a freeze. The slave doesn't reconnect to the new master, with
>> these errors in the log:
>>
>> 18:02:56.236 [3154] FATAL:  terminating walreceiver due to timeout
>> 18:02:56.237 [1421] LOG:  record with incorrect prev-link 0/1600DDE8 at
>> 0/1A00DE10
>
> Do you have more logs from both side of the replication?

There was nothing special. Just errors from previous tests.

> How do you build your standbys?

Okay, I'll show you all the configs: Linux CentOS 7, PostgreSQL 11 from the official PostgreSQL repository.
# rpm -q corosync
corosync-2.4.5-4.el7.x86_64
# rpm -q pacemaker
pacemaker-1.1.21-4.el7.x86_64
# rpm -q sbd
sbd-1.4.0-15.el7.x86_64

The cluster is designed for three datacenters with good connections between them, all in one city, so it must survive
the crash of one datacenter. Therefore STONITH is not used; instead I use quorum and an SBD-based watchdog. Node names:
tuchanka3a, tuchanka3b, tuchanka3c. Float IP names: krogan3 for the master; krogan3s1, krogan3s2 for the slaves.

postgresql common conf (non-default options):

ident_file = '/var/lib/pgsql/pg_ident.conf'
hba_file = '/var/lib/pgsql/pg_hba.conf'
listen_addresses = '*'
log_filename = 'postgresql.%F.log'
wal_keep_segments = 1
restart_after_crash = off
wal_receiver_timeout = 0 # for the PostgreSQL-STOP test, wal_receiver_timeout was the default 60s, of course
shared_buffers = 32MB
max_wal_size=80MB
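
Just to be explicit about what is actually in effect on a running node, the non-default values above can be verified
with plain psql, e.g.:

psql -Atc 'SHOW wal_receiver_timeout;'
psql -Atc 'SHOW wal_keep_segments;'
psql -Atc 'SHOW max_wal_size;'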

postgresql conf specific to each node:
for tuchanka3a:

$ cat krogan3.conf
synchronous_commit = remote_write
synchronous_standby_names = 'ANY 1 (tuchanka3b,tuchanka3c)'

$ cat krogan3.paf
primary_conninfo = 'host=krogan3 user=replicant application_name=tuchanka3a sslmode=disable'
recovery_target_timeline = 'latest'
standby_mode = 'on'

for tuchanka3b:
$ cat krogan3.conf
synchronous_commit = remote_write
synchronous_standby_names = 'ANY 1 (tuchanka3a,tuchanka3c)'

$ cat krogan3.paf
primary_conninfo = 'host=krogan3 user=replicant application_name=tuchanka3b sslmode=disable'
recovery_target_timeline = 'latest'
standby_mode = 'on'

for tuchanka3c:
$ cat krogan3.conf
synchronous_commit = remote_write
synchronous_standby_names = 'ANY 1 (tuchanka3a,tuchanka3b)'

$ cat krogan3.paf
primary_conninfo = 'host=krogan3 user=replicant application_name=tuchanka3c sslmode=disable'
recovery_target_timeline = 'latest'
standby_mode = 'on'
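
With these conninfo settings, a quick way to confirm that replication through the krogan3 float IP is actually up is
the following (standard catalog views, nothing specific to my setup):

-- on the current master (connected via the krogan3 IP):
SELECT application_name, state, sync_state FROM pg_stat_replication;
-- on each slave:
SELECT status, sender_host, conninfo FROM pg_stat_wal_receiver;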

pacemaker config, specific to this cluster:

pcs cluster cib original_cib.xml
cp original_cib.xml configured_cib.xml
pcs -f configured_cib.xml resource create krogan3DB ocf:heartbeat:pgsqlms \
    bindir=/usr/pgsql-11/bin pgdata=/var/lib/pgsql/krogan3 \
    recovery_template=/var/lib/pgsql/krogan3.paf \
    op monitor interval=15 timeout=10 \
    monitor interval=16 role=Master timeout=20 \
    monitor interval=17 role=Slave timeout=10 \
    meta master notify=true resource-stickiness=10
pcs -f configured_cib.xml resource create krogan3IP ocf:heartbeat:IPaddr2 nic=eth0 cidr_netmask=24 ip=192.168.89.35
pcs -f configured_cib.xml resource create krogan3s1IP ocf:heartbeat:IPaddr2 nic=eth0 cidr_netmask=24 ip=192.168.89.36
pcs -f configured_cib.xml resource create krogan3s2IP ocf:heartbeat:IPaddr2 nic=eth0 cidr_netmask=24 ip=192.168.89.37
pcs -f configured_cib.xml constraint colocation add krogan3IP with master krogan3DB-master INFINITY
pcs -f configured_cib.xml constraint order promote krogan3DB-master then start krogan3IP symmetrical=false
pcs -f configured_cib.xml constraint order demote krogan3DB-master then stop krogan3IP symmetrical=false kind=Optional
pcs -f configured_cib.xml constraint location krogan3s1IP rule score=-INFINITY master-krogan3DB lt integer 0
pcs -f configured_cib.xml constraint location krogan3s2IP rule score=-INFINITY master-krogan3DB lt integer 0
pcs -f configured_cib.xml constraint colocation add krogan3s1IP with slave krogan3DB-master INFINITY
pcs -f configured_cib.xml constraint colocation add krogan3s2IP with slave krogan3DB-master INFINITY
pcs -f configured_cib.xml constraint colocation add krogan3s1IP with krogan3s2IP -1000
pcs -f configured_cib.xml constraint order start krogan3DB-master then start krogan3s1IP
pcs -f configured_cib.xml constraint order start krogan3DB-master then start krogan3s2IP
pcs cluster cib-push configured_cib.xml --wait diff-against=original_cib.xml
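
For reference, the resulting cluster state can be inspected at any time with the standard tools, e.g.:

pcs status --full
crm_mon -1 -A -r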

Test PostgreSQL-STOP (wal_receiver_timeout is the default 60s):
killall -STOP postgres
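
(If the node is not fenced and rebooted first, the freeze can be undone with the matching signal:)

killall -CONT postgres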

PostgreSQL logs:
From the old master, tuchanka3a:
2020-08-20 10:03:35.633 MSK [1359] LOG:  received promote request
2020-08-20 10:03:35.633 MSK [1507] FATAL:  terminating walreceiver process due to administrator command
2020-08-20 10:03:35.634 MSK [1359] LOG:  redo is not required
2020-08-20 10:03:35.636 MSK [1359] LOG:  selected new timeline ID: 3
2020-08-20 10:03:35.703 MSK [1359] LOG:  archive recovery complete
2020-08-20 10:03:35.712 MSK [1357] LOG:  database system is ready to accept connections
2020-08-20 10:03:36.704 MSK [1640] LOG:  standby "tuchanka3c" is now a candidate for quorum synchronous standby
2020-08-20 10:03:36.818 MSK [1641] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:05:03.399 MSK [1641] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:05:31.829 MSK [2564] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:07:07.183 MSK [1640] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:08:01.798 MSK [3813] LOG:  standby "tuchanka3c" is now a candidate for quorum synchronous standby
2020-08-20 10:09:36.798 MSK [3813] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:10:29.449 MSK [4975] LOG:  standby "tuchanka3c" is now a candidate for quorum synchronous standby
2020-08-20 10:12:05.088 MSK [2564] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:12:28.589 MSK [5977] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:14:15.063 MSK [5977] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:14:37.883 MSK [7071] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:16:29.821 MSK [7071] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:16:59.105 MSK [8236] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:18:39.647 MSK [4975] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:20:27.047 MSK [9855] LOG:  standby "tuchanka3c" is now a candidate for quorum synchronous standby
2020-08-20 10:22:13.899 MSK [8236] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:22:37.731 MSK [10933] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:24:28.857 MSK [10933] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:24:57.420 MSK [12084] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby

All errors are from the former test, which was applied to the slaves. (The host for tests is chosen randomly.)

New master tuchanka3b; some logs were copied from the old master during database restoration by pg_basebackup after the
previous test. I separated them with a blank line.

2020-08-20 10:03:35.633 MSK [1359] LOG:  received promote request
2020-08-20 10:03:35.633 MSK [1507] FATAL:  terminating walreceiver process due to administrator command
2020-08-20 10:03:35.634 MSK [1359] LOG:  redo is not required
2020-08-20 10:03:35.636 MSK [1359] LOG:  selected new timeline ID: 3
2020-08-20 10:03:35.703 MSK [1359] LOG:  archive recovery complete
2020-08-20 10:03:35.712 MSK [1357] LOG:  database system is ready to accept connections
2020-08-20 10:03:36.704 MSK [1640] LOG:  standby "tuchanka3c" is now a candidate for quorum synchronous standby
2020-08-20 10:03:36.818 MSK [1641] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:05:03.399 MSK [1641] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:05:31.829 MSK [2564] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:07:07.183 MSK [1640] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:08:01.798 MSK [3813] LOG:  standby "tuchanka3c" is now a candidate for quorum synchronous standby
2020-08-20 10:09:36.798 MSK [3813] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:10:29.449 MSK [4975] LOG:  standby "tuchanka3c" is now a candidate for quorum synchronous standby
2020-08-20 10:12:05.088 MSK [2564] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:12:28.589 MSK [5977] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:14:15.063 MSK [5977] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:14:37.883 MSK [7071] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:16:29.821 MSK [7071] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:16:59.105 MSK [8236] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:18:39.647 MSK [4975] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:20:27.047 MSK [9855] LOG:  standby "tuchanka3c" is now a candidate for quorum synchronous standby
2020-08-20 10:22:13.899 MSK [8236] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:22:37.731 MSK [10933] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:24:28.857 MSK [10933] LOG:  terminating walsender process due to replication timeout

2020-08-20 10:24:55.888 MSK [1407] LOG:  database system was interrupted; last known up at 2020-08-20 10:24:49 MSK
2020-08-20 10:24:55.906 MSK [1407] LOG:  entering standby mode
2020-08-20 10:24:55.908 MSK [1407] LOG:  redo starts at 0/15000028
2020-08-20 10:24:55.909 MSK [1407] LOG:  consistent recovery state reached at 0/15002300
2020-08-20 10:24:55.910 MSK [1405] LOG:  database system is ready to accept read only connections
2020-08-20 10:24:55.928 MSK [1411] LOG:  started streaming WAL from primary at 0/16000000 on timeline 3
2020-08-20 10:26:37.308 MSK [1411] FATAL:  terminating walreceiver due to timeout
2020-08-20 10:26:37.308 MSK [1407] LOG:  invalid record length at 0/1600C4D8: wanted 24, got 0
2020-08-20 10:30:55.965 MSK [1407] LOG:  received promote request
2020-08-20 10:30:55.965 MSK [2424] FATAL:  terminating walreceiver process due to administrator command
2020-08-20 10:30:55.966 MSK [1407] LOG:  redo done at 0/1600C4B0
2020-08-20 10:30:55.966 MSK [1407] LOG:  last completed transaction was at log time 2020-08-20 10:25:38.764299+03
2020-08-20 10:30:55.968 MSK [1407] LOG:  selected new timeline ID: 4
2020-08-20 10:30:56.001 MSK [1407] LOG:  archive recovery complete
2020-08-20 10:30:56.005 MSK [1405] LOG:  database system is ready to accept connections

The slave whose replication didn't reconnect, tuchanka3c. Here too I separated the logs copied from the old master with
a blank line:

2020-08-20 10:03:35.633 MSK [1359] LOG:  received promote request
2020-08-20 10:03:35.633 MSK [1507] FATAL:  terminating walreceiver process due to administrator command
2020-08-20 10:03:35.634 MSK [1359] LOG:  redo is not required
2020-08-20 10:03:35.636 MSK [1359] LOG:  selected new timeline ID: 3
2020-08-20 10:03:35.703 MSK [1359] LOG:  archive recovery complete
2020-08-20 10:03:35.712 MSK [1357] LOG:  database system is ready to accept connections
2020-08-20 10:03:36.704 MSK [1640] LOG:  standby "tuchanka3c" is now a candidate for quorum synchronous standby
2020-08-20 10:03:36.818 MSK [1641] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:05:03.399 MSK [1641] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:05:31.829 MSK [2564] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:07:07.183 MSK [1640] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:08:01.798 MSK [3813] LOG:  standby "tuchanka3c" is now a candidate for quorum synchronous standby
2020-08-20 10:09:36.798 MSK [3813] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:10:29.449 MSK [4975] LOG:  standby "tuchanka3c" is now a candidate for quorum synchronous standby
2020-08-20 10:12:05.088 MSK [2564] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:12:28.589 MSK [5977] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:14:15.063 MSK [5977] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:14:37.883 MSK [7071] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:16:29.821 MSK [7071] LOG:  terminating walsender process due to replication timeout
2020-08-20 10:16:59.105 MSK [8236] LOG:  standby "tuchanka3b" is now a candidate for quorum synchronous standby
2020-08-20 10:18:39.647 MSK [4975] LOG:  terminating walsender process due to replication timeout

2020-08-20 10:20:25.168 MSK [1408] LOG:  database system was interrupted; last known up at 2020-08-20 10:20:19 MSK
2020-08-20 10:20:25.180 MSK [1408] LOG:  entering standby mode
2020-08-20 10:20:25.181 MSK [1408] LOG:  redo starts at 0/11000098
2020-08-20 10:20:25.183 MSK [1408] LOG:  consistent recovery state reached at 0/11000A68
2020-08-20 10:20:25.183 MSK [1406] LOG:  database system is ready to accept read only connections
2020-08-20 10:20:25.193 MSK [1412] LOG:  started streaming WAL from primary at 0/12000000 on timeline 3
2020-08-20 10:25:05.370 MSK [1955] LOG:  could not send data to client: Connection reset by peer
2020-08-20 10:25:05.370 MSK [1955] CONTEXT:  PL/pgSQL function beat4tmux(text) line 4 at RAISE
2020-08-20 10:25:05.370 MSK [1955] STATEMENT:  call beat4tmux('3')
2020-08-20 10:25:05.370 MSK [1955] FATAL:  connection to client lost
2020-08-20 10:25:05.370 MSK [1955] CONTEXT:  PL/pgSQL function beat4tmux(text) line 4 at RAISE
2020-08-20 10:25:05.370 MSK [1955] STATEMENT:  call beat4tmux('3')
2020-08-20 10:26:38.655 MSK [1412] FATAL:  terminating walreceiver due to timeout
2020-08-20 10:26:38.655 MSK [1408] LOG:  record with incorrect prev-link 0/1200C4B0 at 0/1600C4D8

The function beat4tmux is used for health monitoring.
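
While tuchanka3c is stuck in this state, its replay position stays frozen; a quick way to see that on the slave itself
(standard functions, nothing specific to my setup):

SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_is_in_recovery();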

The output from mon:

Online: [ tuchanka3b tuchanka3c ]
OFFLINE: [ tuchanka3a ]

Full list of resources:

 Master/Slave Set: krogan3DB-master [krogan3DB]
     Masters: [ tuchanka3b ]
     Slaves: [ tuchanka3c ]
     Stopped: [ tuchanka3a ]
krogan3IP    (ocf::heartbeat:IPaddr2):    Started tuchanka3b
krogan3s1IP     (ocf::heartbeat:IPaddr2):    Stopped
krogan3s2IP     (ocf::heartbeat:IPaddr2):    Stopped

Node Attributes:
* Node tuchanka3b:
    + master-krogan3DB                  : 1001
* Node tuchanka3c:
    + master-krogan3DB                  : -1000

Migration Summary:
* Node tuchanka3b:
* Node tuchanka3c:

Pending Fencing Actions:
* reboot of tuchanka3a pending: client=crmd.1242, origin=tuchanka3a

>> Well, this can be worked around by turning off wal_receiver_timeout.

>> Now PostgreSQL-STOP works fine, but the problem still exists with another
>> test. ForkBomb simulates an out-of-memory situation. In this case a slave
>> sometimes doesn't reconnect to the new master either, with errors in the
>> log:
>>
>> 10:09:43.99 [1417] FATAL:  could not receive data from WAL stream: server
>> closed the connection unexpectedly This probably means the server terminated
>> abnormally before or while processing the request.
>> 10:09:43.992 [1413] LOG:  invalid record length at 0/D8014278: wanted 24, got
>> 0
>
> I suspect the problem is somewhere else. The first message here is probably
> related to your primary being fenced, the second one is normal. After your
> IP moved to the recently promoted primary, your standbys are supposed to
> reconnect with no problem.

I believe that the problem with the ForkBomb has the same nature. The case with PostgreSQL-STOP and the default
wal_receiver_timeout can be easily reproduced; as far as I can see, it always arises after `killall -STOP postgres`
on the master. But the case with wal_receiver_timeout=0 and ForkBomb happens rather rarely, not always.
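
Roughly, the whole PostgreSQL-STOP scenario boils down to this sequence (node names are from my setup above; the psql
check is just an illustration):

ssh tuchanka3a 'killall -STOP postgres'   # freeze the current master
# wait for pacemaker to promote tuchanka3b or tuchanka3c and move the krogan3 IP
ssh tuchanka3c 'psql -Atc "SELECT status FROM pg_stat_wal_receiver;"'   # no row means the walreceiver never reconnected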



