Re: BUG? Slave don't reconnect to the master - Mailing list pgsql-general

From Jehan-Guillaume de Rorthais
Subject Re: BUG? Slave don't reconnect to the master
Msg-id 20200821162642.4245add6@firost
In response to Re: BUG? Slave don't reconnect to the master  (Олег Самойлов <splarv@ya.ru>)
Responses Re: BUG? Slave don't reconnect to the master  (Олег Самойлов <splarv@ya.ru>)
List pgsql-general
On Thu, 20 Aug 2020 15:16:10 +0300
Олег Самойлов <splarv@ya.ru> wrote:
[...]
> >> Almost works fine, but sometimes, rather rarely, I detect that a slave
> >> doesn't reconnect to the new master after a failure. The first case is
> >> PostgreSQL-STOP, when I `kill` postgres on the master with the STOP signal
> >> to simulate a freeze. The slave doesn't reconnect to the new master, with
> >> errors in the log:
> >>
> >> 18:02:56.236 [3154] FATAL:  terminating walreceiver due to timeout
> >> 18:02:56.237 [1421] LOG:  record with incorrect prev-link 0/1600DDE8 at
> >> 0/1A00DE10
> >
> > Do you have more logs from both side of the replication?
>
> There was nothing special. Just errors from previous tests.

OK

> > How do you build your standbys?
> Okay, I'll show you all the configs. Linux CentOS 7, PostgreSQL 11 from the
> official postgres repository.
> # rpm -q corosync
> corosync-2.4.5-4.el7.x86_64
> # rpm -q pacemaker
> pacemaker-1.1.21-4.el7.x86_64
> # rpm -q sbd
> sbd-1.4.0-15.el7.x86_64
>
> The cluster is designed for three datacenters with a good connection between
> them within one city, so it must survive the crash of one datacenter. So
> stonith is not used; instead I use quorum and an sbd-based watchdog. Node
> names: tuchanka3a, tuchanka3b, tuchanka3c. Floating IP names: krogan3 for the
> master; krogan3s1, krogan3s2 for the slaves.
>
> postgresql common conf (not default options):
>
> ident_file = '/var/lib/pgsql/pg_ident.conf'
> hba_file = '/var/lib/pgsql/pg_hba.conf'
> listen_addresses = '*'
> log_filename = 'postgresql.%F.log'
> wal_keep_segments = 1
> restart_after_crash = off
> wal_receiver_timeout=0 # in case of PostgreSQL-STOP, wal_receiver_timeout
>                        # is the default 60s, of course
> shared_buffers = 32MB
> max_wal_size=80MB

Based on your per-node setup, you could probably add
'synchronous_commit = remote_write' to the common conf.
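
For example, something like this in the common conf (just a sketch: it only has
an effect if synchronous_standby_names is set somewhere, e.g. by your cluster
manager, which I don't see in the conf you posted):

  # wait for the standby to have received and written (not flushed) the record
  synchronous_commit = remote_write

That way a commit waits until the record reached at least one standby, without
paying the full flush price of 'on'.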

> [...]
> pacemaker config, specific for this cluster:
> [...]

Why did you add "monitor interval=15"? No harm, but it is redundant with
"monitor interval=16 role=Master" and "monitor interval=17 role=Slave".

By the way, nice catch to exclude RO IP when master score is
negative!

> Test PostgreSQL-STOP (wal_receiver_timeout is default 60s):
> killall -STOP postgres
>
> PostgreSQL logs:
> From the old master, tuchanka3a:
> [...]

OK

> All errors are from the former test, which was applied to the slaves. (The
> host for tests is chosen randomly.)
>
> New master tuchanka3b; some logs were copied from the old master during the
> database restoration by pg_basebackup after the previous test. I separated
> them with a blank line.

You should probably consider putting your logs outside of your PGDATA, or even
better: send your PostgreSQL/Pacemaker logs to a dedicated syslog node outside
the cluster. Adding the hostname to the prefix of each log line would probably
help a lot.
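
For example (a sketch: the central host name "loghost" and the rsyslog file
name are made up):

  # postgresql.conf, on every node
  log_destination = 'syslog'
  syslog_facility = 'local0'
  syslog_ident = 'postgres'

  # /etc/rsyslog.d/pgsql-forward.conf, on every node
  local0.*    @@loghost:514

Since syslog stamps each message with the hostname of the node it comes from,
you get the per-node prefix for free.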

> [...]
> 10:24:55.906 LOG:  entering standby mode
> 10:24:55.908 LOG:  redo starts at 0/15000028
> 10:24:55.909 LOG:  consistent recovery state reached at 0/15002300
> 10:24:55.910 LOG:  database system is ready to accept read only connections
> 10:24:55.928 LOG:  started streaming WAL from primary at 0/16000000 on tl 3
> 10:26:37.308 FATAL:  terminating walreceiver due to timeout

Timeout because of SIGSTOP on primary here.

> 10:26:37.308 LOG:  invalid record length at 0/1600C4D8: wanted 24, got 0
> 10:30:55.965 LOG:  received promote request

Promotion from Pacemaker here.

What happened during more than 4 minutes between the timeout and the promotion?

> 10:30:55.965 FATAL:  terminating walreceiver process due to administrator cmd
> 10:30:55.966 LOG:  redo done at 0/1600C4B0
> 10:30:55.966 LOG:  last completed transaction was at log time 10:25:38.76429
> 10:30:55.968 LOG:  selected new timeline ID: 4
> 10:30:56.001 LOG:  archive recovery complete
> 10:30:56.005 LOG:  database system is ready to accept connections

> The slave whose replication didn't reconnect, tuchanka3c. Again, I separated
> the logs copied from the old master with a blank line:
>
> [...]
>
> 10:20:25.168 LOG:  database system was interrupted; last known up at 10:20:19
> 10:20:25.180 LOG:  entering standby mode
> 10:20:25.181 LOG:  redo starts at 0/11000098
> 10:20:25.183 LOG:  consistent recovery state reached at 0/11000A68
> 10:20:25.183 LOG:  database system is ready to accept read only connections
> 10:20:25.193 LOG:  started streaming WAL from primary at 0/12000000 on tl 3
> 10:25:05.370 LOG:  could not send data to client: Connection reset by peer
> 10:26:38.655 FATAL:  terminating walreceiver due to timeout
> 10:26:38.655 LOG:  record with incorrect prev-link 0/1200C4B0 at 0/1600C4D8

This message appears before the effective promotion of tuchanka3b. Do you have
logs about what happened *after* the promotion?

Reading this error, it seems the record at 0/1600C4D8 references the previous
one in WAL 0/12000000. So the file referenced as 0/16 either has corrupted data
or was 0/12 being recycled but not zeroed correctly, which v11 always does no
matter what (there is no wal_init_zero there).
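
If you still have the pg_wal content of tuchanka3c from this test, it might be
worth dumping the suspicious segment (a diagnostic sketch: the file name is
derived from timeline 3 and LSN 0/16000000, adjust it to what you actually
have):

  cd "$PGDATA/pg_wal"
  /usr/pgsql-11/bin/pg_waldump 000000030000000000000016

If pg_waldump stops with something like "unexpected pageaddr 0/12... in log
segment", that would confirm the recycled-but-not-rewritten theory.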

That's why I'm wondering how you built your standbys: from scratch?

[...]
>
> >> Well, this can be worked around by turning off wal_receiver_timeout.

Just because the walreceiver keeps waiting on its socket and doesn't pay
attention to data on disk, I suppose.
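
You can actually watch this from the standby while the primary is stuck under
SIGSTOP (an illustration, columns as they exist in v11):

  psql -x -c "SELECT status, received_lsn, last_msg_receipt_time
              FROM pg_stat_wal_receiver;"

With wal_receiver_timeout=0, the walreceiver stays in the 'streaming' state
forever while last_msg_receipt_time stops moving.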

> >> Now PostgreSQL-STOP works fine, but the problem still
> >> exists with another test. ForkBomb simulates an out-of-memory situation. In
> >> this case a slave sometimes doesn't reconnect to the new master either, with
> >> errors in the log:
> >>
> >> 10:09:43.99 [1417] FATAL:  could not receive data from WAL stream: server
> >> closed the connection unexpectedly This probably means the server
> >> terminated abnormally before or while processing the request.
> >> 10:09:43.992 [1413] LOG:  invalid record length at 0/D8014278: wanted 24,
> >> got 0
> >
> > I suspect the problem is somewhere else. The first message here is probably
> > related to your primary being fenced, the second one is normal. After your
> > IP moved to the recently promoted primary, your standbys are supposed to
> > reconnect with no problem.
>
> I believe the problem with the ForkBomb has the same nature. The case with
> PostgreSQL-STOP and the default wal_receiver_timeout can be easily repeated;
> it always arises from `killall -STOP postgres` on the master, as far as I can
> see. But the case with wal_receiver_timeout=0 and ForkBomb happens rather
> rarely, not always.

Because of resource pressure, I suppose the OOM killer behavior is not
predictable. Sometimes it will kill your walsender, sometimes not. And
sometimes, the watchdog itself is supposed to kill the node before anyone else.
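
If you want to make that part of the test less random, the usual knobs on Linux
are to disable memory overcommit and to shield the postmaster from the OOM
killer (a sketch; if I remember correctly the PGDG postgresql-11.service unit
already carries the OOM-related lines):

  # /etc/sysctl.d/99-postgresql.conf
  vm.overcommit_memory = 2

  # in the systemd unit
  OOMScoreAdjust=-1000
  Environment=PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj
  Environment=PG_OOM_ADJUST_VALUE=0

Backends stay killable, but the postmaster itself should survive, so the node
either keeps working or gets properly fenced by the watchdog instead of being
left half dead.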

Regards,


