Re: [BUGS] BUG #14702: Streaming replication broken after server closed connection unexpectedly - Mailing list pgsql-bugs

From Michael Paquier
Subject Re: [BUGS] BUG #14702: Streaming replication broken after server closed connection unexpectedly
Date
Msg-id CAB7nPqS_iYTqHDp6XXWhtv1m6vLLQXJNqSQFeUHY-iw1EWuqfA@mail.gmail.com
In response to [BUGS] BUG #14702: Streaming replication broken after server closed connection unexpectedly  (girgen@pingpong.net)
Responses Re: [BUGS] BUG #14702: Streaming replication broken after server closed connection unexpectedly  (Palle Girgensohn <girgen@pingpong.net>)
List pgsql-bugs
On Tue, Jun 13, 2017 at 6:52 AM,  <girgen@pingpong.net> wrote:
> Setup is simple streaming replication: master -> slave. There is a
> replication slot at the master, so xlogs should not be removed until the
> client has received them properly.

Hm. There has been the following discussion as well, which refers to a
legit bug where WAL segments could be removed even if a slot is used:
https://www.postgresql.org/message-id/CACJqAM3xVz0JY1XFDKPP+JoJAjoGx=GNuOAshEDWCext7BFvCQ@mail.gmail.com
The circumstances that trigger the problem are quite particular, though,
as it needs an incomplete WAL record at the end of a segment.

> After this, the slave could not be started again, each time the same error
> about "invalid memory alloc request size 1600487424".

Hm. That smells like data corruption. Be careful going forward.

> Looking more closely, the last xlog file, let's call it 0000EB, is corrupt
> on the slave, having a different checksum from the proper one at the master.

Which checksum are you referring to here? Did you calculate it
yourself from what is on disk? Note that during streaming
replication the content of an unfinished segment may differ from
what is on the primary.
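
If you do want to compare an unfinished segment by hand, one option is
to hash only the part of the file that the standby has actually
received. The following is a minimal sketch, not a supported tool. It
assumes the default 16MB segment size and that you already know the
last received LSN on the standby (for example from
pg_last_xlog_receive_location()); the helper names are hypothetical.

```python
import hashlib

# Assumption: default WAL segment size of 16MB; adjust if the cluster
# was built with a different value.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def lsn_to_segment_offset(lsn: str, segment_size: int = WAL_SEGMENT_SIZE) -> int:
    """Convert an LSN in textual 'X/Y' hex form to a byte offset in its segment."""
    high, low = lsn.split("/")
    position = (int(high, 16) << 32) | int(low, 16)
    return position % segment_size


def prefix_digest(path: str, nbytes: int) -> str:
    """Return the SHA-256 hex digest of the first nbytes bytes of a file."""
    digest = hashlib.sha256()
    remaining = nbytes
    with open(path, "rb") as f:
        while remaining > 0:
            chunk = f.read(min(65536, remaining))
            if not chunk:
                break  # file is shorter than nbytes
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()
```

Running prefix_digest() on the same segment file on both nodes, with
the offset computed from the standby's received LSN, compares only the
bytes that were streamed and ignores whatever follows them on either
side.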

> Now, I don't know exactly what happened when the slave lost track, but the
> bug, I think, is that the streamed WAL was corrupt, and still accepted by
> the slave *and* hence removed from the master. It required too much
> experience to fix that. The slave should not accept a not fully transported
> WAL file. It seems it happened during some connection failure between the
> slave and master, but still it should preferably fail more gracefully. Are
> the mechanisms implemented to support that, and they failed, or is it just
> not implemented?

There is a per-record CRC calculation to check the validity of each
record, and it is verified when each record is read during recovery as
a sanity check. That's one way to prevent applying an incorrect
record. In the event of such an error you would have seen "incorrect
resource manager data checksum in record at" or similar. It seems to
me that you should be careful with the primary as well.
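
To illustrate the idea of a per-record check, here is a rough sketch.
It is not the actual WAL reader code: the record layout (a 4-byte
length, a 4-byte CRC, then the payload) is made up for the example and
is much simpler than the real record header. The CRC-32C polynomial is,
as far as I know, the one used for WAL records since 9.5.

```python
import struct


def _make_crc32c_table():
    """Build the lookup table for CRC-32C (Castagnoli), reflected form."""
    table = []
    for n in range(256):
        crc = n
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    """Compute the CRC-32C of data."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def pack_record(payload: bytes) -> bytes:
    """Hypothetical layout: little-endian length, CRC of payload, payload."""
    return struct.pack("<II", len(payload), crc32c(payload)) + payload


def record_is_valid(record: bytes) -> bool:
    """Recompute the CRC and compare it with the stored one before applying."""
    if len(record) < 8:
        return False
    length, stored_crc = struct.unpack_from("<II", record)
    payload = record[8:8 + length]
    return len(payload) == length and crc32c(payload) == stored_crc
```

A record whose payload was truncated or altered in transit fails
record_is_valid(), which corresponds to the point where recovery would
stop and report a checksum error instead of applying the record.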
-- 
Michael

