[BUGS] BUG #14702: Streaming replication broken after server closed connection unexpectedly - Mailing list pgsql-bugs

From girgen@pingpong.net
Subject [BUGS] BUG #14702: Streaming replication broken after server closed connection unexpectedly
Date
Msg-id 20170612215251.26502.58139@wrigleys.postgresql.org
Responses Re: [BUGS] BUG #14702: Streaming replication broken after server closed connection unexpectedly  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      14702
Logged by:          Palle Girgensohn
Email address:      girgen@pingpong.net
PostgreSQL version: 9.6.2
Operating system:   FreeBSD 11.0
Description:

Hi!

The setup is simple streaming replication: master -> slave. There is a
replication slot on the master, so xlog segments should not be removed until
the standby has received them properly.
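For context, a 9.6-era slot-based setup like the one described would look roughly as follows; the slot name and connection details are illustrative, not taken from this report:

```
-- on the master: create a physical replication slot, so WAL is retained
-- until the standby has consumed it
SELECT pg_create_physical_replication_slot('standby_slot');

# in the standby's recovery.conf:
standby_mode = 'on'
primary_conninfo = 'host=master port=5432 user=replicator'
primary_slot_name = 'standby_slot'
```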

Suddenly, the slave logs this:

[75678-1] db=,user= LOG:  restartpoint complete: wrote 2795 buffers (1.6%);
0 transaction log file(s) added, 0 removed, 1 recycled; write=269.858 s,
sync=3.768 s, total=273.636 s; sync files=242, longest=3.763 s,
average=0.015 s; distance=26953 kB, estimate=58915 kB
[75679-1] db=,user= LOG:  recovery restart point at 226B/E9DF4E30
[75679-2] db=,user= DETAIL:  last completed transaction was at log time
2017-05-29 00:57:19.221918+02
[75680-1] db=,user= LOG:  restartpoint starting: time
[5-1] db=,user= FATAL:  could not send data to WAL stream: server closed the
connection unexpectedly
[5-2]          This probably means the server terminated abnormally
[5-3]          before or while processing the request.
[5-4]
[246-1] db=,user= FATAL:  invalid memory alloc request size 1600487424


The master has no relevant information in its log.

After this, the slave could not be started again; each attempt failed with the
same "invalid memory alloc request size 1600487424" error.

Looking more closely, the last xlog file, let's call it 0000EB, is corrupt on
the slave: its checksum differs from that of the proper copy on the master.
The oldest xlog segment kept on the master is 0000EB; all older ones have been
properly deleted by retention, since they had been delivered to the slave. On
the slave, the newest segment is 0000EA (the file just before 0000EB). Hence
there is no usable xlog file common to both systems, since the 0000EB file is
corrupt on the slave.
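A minimal sketch of the checksum comparison described above. The directories here are local stand-ins for the master's and slave's pg_xlog directories; on real hosts one would run sha256sum on each side and compare the sums:

```shell
# Segment name taken from the report; directories are illustrative.
SEG=000000010000226B000000EB
MASTER=/tmp/demo_master_pg_xlog
SLAVE=/tmp/demo_slave_pg_xlog
mkdir -p "$MASTER" "$SLAVE"
printf 'intact segment'  > "$MASTER/$SEG"
printf 'mangled segment' > "$SLAVE/$SEG"   # simulated corruption
# Compare checksums; a mismatch flags the corrupt copy on the standby.
m=$(sha256sum "$MASTER/$SEG" | cut -d' ' -f1)
s=$(sha256sum "$SLAVE/$SEG"  | cut -d' ' -f1)
if [ "$m" != "$s" ]; then
    echo "segment $SEG differs between master and slave"
fi
```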

Removing the corrupt 0000EB from the slave and starting the slave again
changes the error message from "invalid memory alloc" to

[1-1] db=,user= LOG:  ending log output to stderr
[1-2] db=,user= HINT:  Future log output will go to log destination
"syslog".
[2-1] db=,user= LOG:  database system was interrupted while in recovery at
log time 2017-05-29 00:48:11 CEST
[2-2] db=,user= HINT:  If this has occurred more than once some data might
be corrupted and you might need to choose an earlier recovery target.
[3-1] db=,user= LOG:  entering standby mode
[4-1] db=,user= LOG:  redo starts at 226B/E9DF4E30
[2-1] db=,user= LOG:  started streaming WAL from primary at 226B/EA000000 on
timeline 1
[3-1] db=,user= FATAL:  could not receive data from WAL stream: ERROR:
requested WAL segment 000000010000226B000000EA has already been removed


OK, better... I still have 0000EA on the slave. So I copied that 0000EA back
to the master's pg_xlog directory, and the slave was happy without further
ado:
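A sketch of that manual fix: hand the standby's intact copy of the segment back to the master. The directories are illustrative stand-ins for the real pg_xlog directories; between separate hosts this would be an scp or rsync rather than a local cp:

```shell
# Segment name taken from the report; directories are illustrative.
SEG=000000010000226B000000EA
MASTER=/tmp/fix_master_pg_xlog
SLAVE=/tmp/fix_slave_pg_xlog
mkdir -p "$MASTER" "$SLAVE"
printf 'intact segment' > "$SLAVE/$SEG"   # the copy the slave still has
cp "$SLAVE/$SEG" "$MASTER/$SEG"           # restore it on the master
# Verify both copies are byte-identical before restarting the standby.
cmp -s "$SLAVE/$SEG" "$MASTER/$SEG" && echo "segment $SEG restored on master"
```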

[2-1] db=,user= LOG:  started streaming WAL from primary at 226B/EA000000 on
timeline 1
[5-1] db=,user= LOG:  consistent recovery state reached at 226B/EBE89CB8
[2-1] db=,user= LOG:  database system is ready to accept read only
connections
[2-1] db=,user= LOG:  restartpoint starting: time


And life is peachy again.

Now, I don't know exactly what happened when the slave lost track, but the
bug, I think, is that the streamed WAL was corrupt yet still accepted by the
slave *and* hence removed from the master. It required too much experience to
fix. The slave should not accept a WAL file that has not been fully
transferred. It seems to have happened during some connection failure between
the slave and the master, but it should still preferably fail more
gracefully. Are mechanisms implemented to support that, and they failed, or
is it just not implemented?

Palle



--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
