Hot-Standby resync problem after connection loss - Mailing list pgsql-admin
| From | Stefan Kohlhauser |
|---|---|
| Subject | Hot-Standby resync problem after connection loss |
| Date | |
| Msg-id | trinity-616b67c7-9f21-4ca1-a430-d09537aa8e1f-1389261652097@3capp-gmx-bs36 |
| Responses | Re: Hot-Standby resync problem after connection loss |
| | Re: Hot-Standby resync problem after connection loss |
| List | pgsql-admin |
Hello everyone!

I have set up PostgreSQL 9.2.5 in hot standby with a master and one standby on two virtualized RHEL 6.4 machines. I am using them for a high-availability Kamailio SIP server. Everything is managed by Pacemaker/Corosync with two network connections, one for Pacemaker, one for PostgreSQL.
After setup, replication works fine.

postgresql.conf excerpt (same on both nodes):

    wal_level = hot_standby
    synchronous_commit = off
    wal_buffers = 1MB
    archive_mode = on
    archive_command = 'test ! -f /var/lib/kamailiodb1/archive/%f && cp %p /var/lib/kamailiodb1/archive/%f'
    max_wal_senders = 16
    replication_timeout = 10s
    hot_standby = on
    max_standby_archive_delay = -1
    max_standby_streaming_delay = -1
    wal_receiver_status_interval = 3s
    hot_standby_feedback = on

However, I have a test case that troubles me:
Node 1 runs the master, node 2 the standby. I remove the network connection for PostgreSQL (the Pacemaker instances still see each other). After a few seconds I power off node 1. This causes node 2 to be promoted to master by Pacemaker. I add a new insert on node 2. Then I boot node 1.

From the log entries on node 1 they seem to resync:

    2014-01-08T16:30:43.698+00:00 kamailionode1 postgres-kamailio[1980] info: [1-1] LOG: database system was shut down in recovery at 2014-01-08 16:29:14 GMT
    2014-01-08T16:30:43.843+00:00 kamailionode1 postgres-kamailio[1980] info: [2-1] LOG: entering standby mode
    2014-01-08T16:30:44.024+00:00 kamailionode1 postgres-kamailio[1980] info: [3-1] LOG: consistent recovery state reached at 0/6000080
    2014-01-08T16:30:44.025+00:00 kamailionode1 postgres-kamailio[1965] info: [1-1] LOG: database system is ready to accept read only connections
    2014-01-08T16:30:44.025+00:00 kamailionode1 postgres-kamailio[1980] info: [4-1] LOG: record with zero length at 0/6000080
    2014-01-08T16:30:44.272+00:00 kamailionode1 postgres-kamailio[1998] info: [2-1] LOG: streaming replication successfully connected to primary

However, the new insert is not shown on node 1 when I query the DB with psql; on node 2 it is shown.

recovery.conf:

    standby_mode = 'on'
    primary_conninfo = 'host=pgreplicationha port=5432 user=replicate application_name=kamailionode2 '
    restore_command = 'scp -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null -o StrictHostkeyChecking=no pgreplicationha:/var/lib/kamailiodb1/archive/%f %p'
    recovery_target_timeline = 'latest'

I'm not sure I understand this, or whether it has anything to do with the problem, but before the connection loss node 1 was using WAL segment 000000010000000000000005. Now "ps aux" on node 1 tells me:

    postgres: startup process waiting for 000000010000000000000006

Doesn't that mean it is waiting for information from 000000010000000000000006, which isn't even used by the master yet, according to pg_controldata?

pg_controldata excerpt on node 1 (now standby):

    Database cluster state:               in archive recovery
    Latest checkpoint location:           0/6000020
    Prior checkpoint location:            0/6000020
    Latest checkpoint's REDO location:    0/6000020
    Minimum recovery ending location:     0/6000020

pg_controldata excerpt on node 2 (now master):

    Database cluster state:               in production
    Latest checkpoint location:           0/519BB68
    Prior checkpoint location:            0/5187F38
    Latest checkpoint's REDO location:    0/519BB68
    Minimum recovery ending location:     0/0

I know this could be a potential split-brain (which internally it is regarding timelines, if I understood correctly; however, the timeline part of the WAL file names stays 00000001xxxxx(?)).
I need a way to fix this automatically by resyncing to what the current master holds.
The DB will contain only a few entries and DB modifications are probably rare. It is much more important that the overall downtime of the system is as low as possible than avoiding the loss of one or two entries.
Therefore my question: How do I resync the new standby node properly without restarting/rebooting/moving the master?

Best regards and thanks in advance,
Stefan
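A quick way to see why the startup process is stuck waiting for segment 000000010000000000000006 is to compare the two pg_controldata excerpts numerically. A WAL location such as 0/6000020 is a 64-bit byte position printed as two hexadecimal halves; with the default 16 MB segment size, node 1's last checkpoint already lies in segment ...006 while node 2 (the new master) is still writing segment ...005, i.e. the old master is ahead of the new one. The sketch below is illustrative only; `lsn_to_int` is a hypothetical helper, not a PostgreSQL function:

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # PostgreSQL default, 16 MB


def lsn_to_int(lsn: str) -> int:
    """Convert a WAL location like '0/6000020' (as printed by
    pg_controldata or pg_current_xlog_location()) to a linear byte
    position: high 32 bits before the slash, low 32 bits after it."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


node1_ckpt = lsn_to_int("0/6000020")  # node 1 (old master, now standby)
node2_ckpt = lsn_to_int("0/519BB68")  # node 2 (new master)

# Node 1's checkpoint is *ahead* of node 2's current checkpoint:
print(node1_ckpt > node2_ckpt)                    # True

# Segment numbers explain the "waiting for ...006" line in ps aux:
print(node1_ckpt // WAL_SEGMENT_SIZE)             # 6  (node 1)
print(node2_ckpt // WAL_SEGMENT_SIZE)             # 5  (node 2)
```

Since the standby's minimum recovery ending location is past everything the new master has written, streaming can connect but has nothing valid to replay, which is consistent with the insert never appearing on node 1.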