Thread: Hot-Standby resync problem after connection loss

Hot-Standby resync problem after connection loss

From
"Stefan Kohlhauser"
Date:
Hello everyone!

I have set up a PostgreSQL 9.2.5 in hot-standby with a master and one standby on two virtualized RHEL6.4. I am using them for a high-availability Kamailio SIP-server. Everything is managed by Pacemaker/Corosync with two network connections, one for Pacemaker, one for PostgreSQL.
After setup, replication works fine.

postgresql.conf excerpt (same on both nodes):
wal_level = hot_standby
synchronous_commit = off
wal_buffers = 1MB
archive_mode = on
archive_command = 'test ! -f /var/lib/kamailiodb1/archive/%f && cp %p /var/lib/kamailiodb1/archive/%f'
max_wal_senders = 16
replication_timeout = 10s
hot_standby = on
max_standby_archive_delay = -1
max_standby_streaming_delay = -1
wal_receiver_status_interval = 3s
hot_standby_feedback = on

However, I have a test case that troubles me:
Node 1 runs the master, node 2 the standby. I remove the network connection for PostgreSQL (the Pacemaker instances still see each other). After a few seconds I power off node 1. This causes node 2 to be promoted to master by Pacemaker. I add a new insert on node 2. Then I boot node 1. From the log entries on node 1 they seem to resync:
2014-01-08T16:30:43.698+00:00 kamailionode1 postgres-kamailio[1980] info: [1-1] LOG: database system was shut down in recovery at 2014-01-08 16:29:14 GMT
2014-01-08T16:30:43.843+00:00 kamailionode1 postgres-kamailio[1980] info: [2-1] LOG: entering standby mode
2014-01-08T16:30:44.024+00:00 kamailionode1 postgres-kamailio[1980] info: [3-1] LOG: consistent recovery state reached at 0/6000080
2014-01-08T16:30:44.025+00:00 kamailionode1 postgres-kamailio[1965] info: [1-1] LOG: database system is ready to accept read only connections
2014-01-08T16:30:44.025+00:00 kamailionode1 postgres-kamailio[1980] info: [4-1] LOG: record with zero length at 0/6000080
2014-01-08T16:30:44.272+00:00 kamailionode1 postgres-kamailio[1998] info: [2-1] LOG: streaming replication successfully connected to primary

However, the new insert is not shown on node 1 when I query the DB with psql; on node 2 it is shown.

recovery.conf:
standby_mode = 'on'
primary_conninfo = 'host=pgreplicationha port=5432 user=replicate application_name=kamailionode2 '
restore_command = 'scp -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null -o StrictHostkeyChecking=no pgreplicationha:/var/lib/kamailiodb1/archive/%f %p'
recovery_target_timeline = 'latest'

I'm not sure I understand this and whether it has anything to do with it, but before the connection loss node 1 was using WAL 000000010000000000000005. Now "ps aux" on node 1 tells me:
postgres: startup process waiting for 000000010000000000000006
Doesn't that mean it is waiting for information from 000000010000000000000006, which isn't even used by the master yet, according to pg_controldata?

pg_controldata excerpt on node 1 (now standby):
Database cluster state: in archive recovery
Latest checkpoint location: 0/6000020
Prior checkpoint location: 0/6000020
Latest checkpoint's REDO location: 0/6000020
Minimum recovery ending location: 0/6000020

pg_controldata excerpt on node 2 (now master):
Database cluster state: in production
Latest checkpoint location: 0/519BB68
Prior checkpoint location: 0/5187F38
Latest checkpoint's REDO location: 0/519BB68
Minimum recovery ending location: 0/0

I know this could be a potential split-brain (which internally it is regarding timelines, if I understood correctly; however, the timeline part of the WAL stays 00000001xxxxx(?)).
I need a way to fix this automatically by resyncing to what the current master holds.
The DB will contain only a few entries and DB modifications are probably rare. It is much more important that the overall downtime of the system is as low as possible than maybe losing one or two entries.
Therefore my question: How do I resync the new standby node properly without restarting/rebooting/moving the master?

Best regards and thanks in advance,
Stefan
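[For context: pg_rewind did not exist before PostgreSQL 9.5, so on 9.2 the usual way to resync a diverged standby without touching the master is a fresh base backup from the current primary. A minimal sketch, reusing the hostname and data directory from the configs above; it only prints the commands rather than running them against a live cluster:]

```shell
# Hedged sketch: the standard pre-9.5 resync for a diverged standby is to
# discard its data directory and take a fresh pg_basebackup from the current
# master. Host and paths are taken from the configs quoted above; the
# commands are printed, not executed, so nothing touches a real cluster.
PRIMARY=pgreplicationha
PGDATA=/var/lib/kamailiodb1
CMDS=$(cat <<EOF
pg_ctl -D $PGDATA stop -m fast
rm -rf $PGDATA/*
pg_basebackup -h $PRIMARY -U replicate -D $PGDATA -X stream
# recreate recovery.conf, then: pg_ctl -D $PGDATA start
EOF
)
echo "$CMDS"
```

This never requires a restart of the master; the primary keeps serving while the backup streams.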

Re: Hot-Standby resync problem after connection loss

From
Brian Weaver
Date:
Sounds like it might be related to a known data corruption issue found in 9.2.5. See the following thread from the list archive:

http://www.postgresql.org/message-id/0E76EE0A-1740-41B5-88FF-54AA98794532@thebuild.com


On Thu, Jan 9, 2014 at 5:00 AM, Stefan Kohlhauser <stefan.kohlhauser@gmx.net> wrote:
Hello everyone!
 
I have set up a PostgreSQL 9.2.5 in hot-standby with a master and one standby on two virtualized RHEL6.4. I am using them for a high-availability Kamailio SIP-server. Everything is managed by Pacemaker/Corosync with two network connections, one for Pacemaker, one for PostgreSQL.
After setup replication works fine.
 
postgresql.conf excerpt (same on both nodes):
wal_level = hot_standby
synchronous_commit = off
wal_buffers = 1MB
archive_mode = on
archive_command = 'test ! -f /var/lib/kamailiodb1/archive/%f && cp %p /var/lib/kamailiodb1/archive/%f'
max_wal_senders = 16
replication_timeout = 10s
hot_standby = on
max_standby_archive_delay = -1
max_standby_streaming_delay = -1
wal_receiver_status_interval = 3s
hot_standby_feedback = on
 
However, I have a test case that troubles me:
Node 1 runs the master, node 2 the standby. I remove the network connection for PostgreSQL (the Pacemaker instances still see each other). After a few seconds I power off node 1. This causes node 2 to be promoted to master by Pacemaker. I add a new insert on node 2. Then I boot node 1. From the log entries on node 1 they seem to resync:
2014-01-08T16:30:43.698+00:00 kamailionode1 postgres-kamailio[1980] info: [1-1] LOG: database system was shut down in recovery at 2014-01-08 16:29:14 GMT
2014-01-08T16:30:43.843+00:00 kamailionode1 postgres-kamailio[1980] info: [2-1] LOG: entering standby mode
2014-01-08T16:30:44.024+00:00 kamailionode1 postgres-kamailio[1980] info: [3-1] LOG: consistent recovery state reached at 0/6000080
2014-01-08T16:30:44.025+00:00 kamailionode1 postgres-kamailio[1965] info: [1-1] LOG: database system is ready to accept read only connections
2014-01-08T16:30:44.025+00:00 kamailionode1 postgres-kamailio[1980] info: [4-1] LOG: record with zero length at 0/6000080
2014-01-08T16:30:44.272+00:00 kamailionode1 postgres-kamailio[1998] info: [2-1] LOG: streaming replication successfully connected to primary
 
However, the new insert is not shown on node 1 when I query the DB with psql; on node 2 it is shown.
 
recovery.conf:
standby_mode = 'on'
primary_conninfo = 'host=pgreplicationha port=5432 user=replicate application_name=kamailionode2 '
restore_command = 'scp -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null -o StrictHostkeyChecking=no pgreplicationha:/var/lib/kamailiodb1/archive/%f %p'
recovery_target_timeline = 'latest'
 
I'm not sure I understand this and if it has anything to do with it, but before the connection loss node 1 was using WAL 000000010000000000000005. Now "ps aux" on node 1 tells me:
postgres: startup process waiting for 000000010000000000000006
Doesn't that mean it is waiting for information from 000000010000000000000006, which isn't even used by the master yet, according to pg_controldata?
 
pg_controldata excerpt on node 1 (now standby):
Database cluster state: in archive recovery
Latest checkpoint location: 0/6000020
Prior checkpoint location: 0/6000020
Latest checkpoint's REDO location: 0/6000020
Minimum recovery ending location: 0/6000020
 
pg_controldata excerpt on node 2 (now master):
Database cluster state: in production
Latest checkpoint location: 0/519BB68
Prior checkpoint location: 0/5187F38
Latest checkpoint's REDO location: 0/519BB68
Minimum recovery ending location: 0/0
 
I know this could be a potential split-brain (which internally it is regarding timelines if I understood correctly; however, the timeline part of the WAL stays 00000001xxxxx(?)).
I need a way to fix this automatically by resyncing to what the current master holds.
The DB will contain only a few entries and DB modifications are probably rare. It is much more important that the overall downtime of the system is as low as possible than maybe losing one or two entries.
Therefore my question: How do I resync the new standby node properly without restarting/rebooting/moving the master?
 
Best regards and thanks in advance,
Stefan



--

/* insert witty comment here */

Re: Hot-Standby resync problem after connection loss

From
Simon Riggs
Date:
On 9 January 2014 10:00, Stefan Kohlhauser <stefan.kohlhauser@gmx.net> wrote:

> However, the new insert is not shown on node 1 when i query the DB with
> psql, on node 2 it is shown.
>
> recovery.conf:
> standby_mode = 'on'
> primary_conninfo = 'host=pgreplicationha port=5432 user=replicate
> application_name=kamailionode2 '
> restore_command = 'scp -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null
> -o StrictHostkeyChecking=no pgreplicationha:/var/lib/kamailiodb1/archive/%f
> %p'
> recovery_target_timeline = 'latest'
>
> I'm not sure I understand this and if it has anything to do with it, but
> before the connection loss node 1 was using WAL 000000010000000000000005.
> Now "ps aux" on node 1 tells me:
> postgres: startup process waiting for 000000010000000000000006
> Doesn't that mean he is waiting for information from
> 000000010000000000000006 which isn't even used by the master yet, according
> to pg_controldata?

You've configured file-based replication as well, so when streaming
replication ends, the standby tries to fetch the next WAL segment from
the archive and waits until it exists.

So yes, it is asking for something not created yet, at your request.
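[That polling behaviour can be sketched roughly as follows; ARCHIVE and NEXT_SEG are hypothetical stand-ins for this demo, and the real loop lives inside PostgreSQL's recovery code, not in a shell script:]

```shell
# Rough sketch of what the startup process effectively does once streaming
# stops: keep looking for the *next* WAL segment (via restore_command) until
# it appears. Here we fake the archive with a temp directory so the demo
# terminates on its own.
ARCHIVE=$(mktemp -d)
NEXT_SEG=000000010000000000000006

attempt=0
until [ -f "$ARCHIVE/$NEXT_SEG" ]; do
  attempt=$((attempt + 1))
  echo "waiting for $NEXT_SEG (attempt $attempt)"
  # In reality the master's archive_command would create this file;
  # we create it ourselves on the second pass.
  [ "$attempt" -ge 2 ] && touch "$ARCHIVE/$NEXT_SEG"
done
echo "restored $NEXT_SEG"
```

The segment the standby asks for is always one past what it has replayed, which is why `ps aux` shows it waiting for 000000010000000000000006 even though the master has not finished 000000010000000000000005 yet.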

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services