Re: WAL scenario valid? - Mailing list pgsql-admin

From prakhar jauhari
Subject Re: WAL scenario valid?
Date
Msg-id CAEd0_=8WHcpw=RVfAy2Sd7fnp+vGab+XeK9aDMBSO-YOmxARHQ@mail.gmail.com
In response to Re: WAL scenario valid?  (Sergey Konoplev <gray.ru@gmail.com>)
Responses Re: WAL scenario valid?  (Albe Laurenz <laurenz.albe@wien.gv.at>)
Re: WAL scenario valid?  (Sergey Konoplev <gray.ru@gmail.com>)
List pgsql-admin
Hey..

I am facing a similar kind of problem, but in a two-node setup. Streaming replication (SR) is being used with a high wal_keep_segments, and log shipping is used to bump up the timeline on the standby server in order to set up SR.
DB1 - master
DB2 - standby

When a switchover happens, DB2 becomes the new master, and when DB1 comes back up it acts as the standby to the new master (the .history files from the new master are copied to DB1 to bump up its timeline so that SR can be set up). DB1 is not recreated from scratch. This works fine for a normal switchover, but there seems to be a problem in the following situation, leading to database corruption:

Current state :
DB1 - master
DB2 - standby

Now the failing scenario:

1. DB2's machine goes down.
2. After some time DB1's machine also goes down (DB2 is still down).
3. Now DB2's machine comes back up (it will join the cluster as master, since DB1 is still down).
4. Postgres on DB2 is started as the master.
5. Now DB1 comes back up (it will join the cluster as standby to DB2).
6. Now, when DB1 attempts SR with DB2, DB1 gets corrupted.

I looked into the issue and found that when DB1 went down initially, it had created some WALs that were never shipped to DB2, since DB2 was already down.
So when DB2 started as master, it had not replayed the last few WALs created by DB1 (when DB1 was master); DB2 nevertheless starts as master properly.
When DB1 came back as standby to DB2, it bumped its timeline using the history file from DB2, but once SR was set up with DB2, DB1 got corrupted.

Now the questions are:

1. Is this a theoretically valid approach?
2. If it is a valid approach, how can I detect such a scenario (where SR will corrupt the DB), so that I can fall back to a base backup in that situation? A rough sketch of the check I have in mind follows.
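
For (2), this is roughly the check I have been experimenting with, using only standard tools; the paths are from my setup and I am not sure it is sufficient:

# on DB1 (old master, while postgres is stopped): the last checkpoint it wrote
pg_controldata /var/lib/pgsql/data | grep 'Latest checkpoint location'

# on DB2 (new master): where the new timeline forked off
cat /var/lib/pgsql/data/pg_xlog/*.history

The idea is that if DB1's latest checkpoint location lies beyond the switch point recorded in DB2's history file, then DB1 holds WAL that DB2 never replayed, and rejoining it via SR without a fresh base backup looks unsafe.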

regards,
Prakhar

On Tue, Jun 18, 2013 at 1:27 AM, Sergey Konoplev <gray.ru@gmail.com> wrote:
On Mon, Jun 17, 2013 at 8:14 AM, Keith Ouellette
<Keith.Ouellette@airgas.com> wrote:
> I got the initial WAL replication going between DB1 and DB2 and between the
> virtual IP (currently on DB1) and DB3. So far all is well. However, if I
> simulate a failover to DB2 (promote DB2 and move the virtual IP to it), WAL
> does not resume between the virtual IP and DB3. I tried restarting
> PostgreSQL on DB2 to see if that would kick-start it, but it did not.
> The only way I could get WAL between the virtual IP and DB3 was to do
> a manual sync using rsync.

[skipped]

> Is what I am trying to do possible?

It is.

In your situation, with both replicas following the master, in case of
the master's failure you need to find the most caught-up replica. To
do this, compare the WAL replay locations on the replicas and choose
the one with the highest value.

SELECT pg_last_xlog_replay_location();
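
For example, a quick way to poll them all from one place (the host names are placeholders for your replicas):

for h in db2 db3; do
    echo -n "$h: "
    psql -h "$h" -At -c "SELECT pg_last_xlog_replay_location();"
done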

Note that if you choose one that is not the most caught up, then the
other replicas that have replayed later WAL entries must be rebuilt
from scratch. Otherwise their data might be corrupted, and you will
not get any warnings about it.

In the case of a planned switchover, choose the new master yourself.

Then stop all the slaves except the new master.

Use the command below to guarantee that the master and the remaining
slave are ahead of the other (stopped) slaves, if you are not sure that
they already are. The command creates a minimal WAL entry.

SELECT txid_current();

Touch the failover (trigger) file on the remaining slave to promote it to be the new master.
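
That is, assuming recovery.conf on that slave names a trigger file, for example (the path is just a placeholder):

trigger_file = '/tmp/pg_failover_trigger'

Then promoting it is just:

touch /tmp/pg_failover_trigger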

On the stopped slaves, delete everything from the pg_xlog directory and
copy pg_xlog/*.history there from the new master.
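
Roughly like this, assuming PGDATA points to the data directory on both machines and you have scp access between them (the host name and paths are examples):

# on each stopped slave
rm -f $PGDATA/pg_xlog/*
scp newmaster:$PGDATA/pg_xlog/*.history $PGDATA/pg_xlog/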

Then change the DSN to point to the new master and add the following
instruction to recovery.conf. This will make the replicas follow the
latest created timeline.

recovery_target_timeline = 'latest'
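
So a complete recovery.conf on such a slave might end up looking like this (the connection values are placeholders):

standby_mode = 'on'
primary_conninfo = 'host=newmaster port=5432 user=replication'
recovery_target_timeline = 'latest'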

Start postgres on these slaves and that is it.

Alternatively, if your postgres version is 9.2, you can set up
cascading replication, so db2 will follow db1 and db3 will follow db2.
In case of a db1 failure, all you need to do is promote db2. However,
you need to remember that if db2 fails you will have to change the DSN
in recovery.conf on db3 to point to db1, or redirect your db2's
virtual IP. Yes, I should probably have started with this solution.
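
In that layout the only difference on db3 is where primary_conninfo points, e.g. (the host name is a placeholder):

primary_conninfo = 'host=db2 port=5432 user=replication'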

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

Profile: http://www.linkedin.com/in/grayhemp
Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) 888-1979
Skype: gray-hemp
Jabber: gray.ru@gmail.com


