Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
Date
Msg-id 20130118003522.GD3074@awork2.anarazel.de
In response to Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave  (Michael Paquier <michael.paquier@gmail.com>)
Responses Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
List pgsql-hackers
On 2013-01-18 08:24:31 +0900, Michael Paquier wrote:
> On Fri, Jan 18, 2013 at 3:05 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> 
> >  I encountered the problem that the timeline switch is not performed
> > expectedly.
> > I set up one master, one standby and one cascade standby. All the servers
> > share the archive directory. restore_command is specified in the
> > recovery.conf
> > in those two standbys.
> >
> > I shut down the master, and then promoted the standby. In this case, the
> > cascade standby should switch to new timeline and replication should be
> > successfully restarted. But the timeline was never changed, and the
> > following
> > log messages were kept outputting.
> >
> > sby2 LOG:  restarted WAL streaming at 0/3000000 on timeline 1
> > sby2 LOG:  replication terminated by primary server
> > sby2 DETAIL:  End of WAL reached on timeline 1
> > sby2 LOG:  restarted WAL streaming at 0/3000000 on timeline 1
> > sby2 LOG:  replication terminated by primary server
> > sby2 DETAIL:  End of WAL reached on timeline 1
> > sby2 LOG:  restarted WAL streaming at 0/3000000 on timeline 1
> > sby2 LOG:  replication terminated by primary server
> > sby2 DETAIL:  End of WAL reached on timeline 1
> >
> I am seeing similar issues with master at 88228e6.
> This is easily reproducible by setting up 2 slaves under a master, then
> kill the master. Promote slave 1 and  reconnect slave 2 to slave 1, then
> you will notice that the timeline jump is not done.
> 
> I don't know if Masao tried to put in sync the slave that reconnects to the
> promoted slave, but in this case slave2 stucks in "potential" state". That
> is due to timeline that has not changed on slave2 but better to let you
> know...

Ok, I know what's causing this now. Rather ugly.

Whenever we access a page in a segment we haven't accessed before, we
read the segment's first page to do an extra bit of validation, as the
first page in a segment contains more information.

Suppose timeline 1 ends at 0/6087088. xlog.c notices that WAL ends
there, wants to read the new timeline, and requests record
0/6087088. xlogreader wants to do its validation and goes back to the
first page in the segment, which triggers xlog.c to re-request timeline 1
to be transferred.
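
To make the sequence concrete, here is a toy sketch of the arithmetic
involved. It assumes the default 16 MB WAL segment size, and every helper
name in it is hypothetical; it models only the positions being read, not
the actual xlog.c or xlogreader code.

```python
# Toy model only: illustrates why the validation read lands on timeline 1.
# Assumption: default WAL segment size of 16 MB.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(text):
    """Convert an 'X/Y' position into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(lsn):
    """Convert a 64-bit integer position back into 'X/Y' form."""
    return "%X/%X" % (lsn >> 32, lsn & 0xFFFFFFFF)


def segment_start(lsn):
    """Position of the first page of the segment containing lsn."""
    return lsn - (lsn % WAL_SEGMENT_SIZE)


def timeline_for(lsn, tl1_end):
    """Hypothetical rule: positions before the end of timeline 1 belong
    to timeline 1, positions from there on belong to timeline 2."""
    return 1 if lsn < tl1_end else 2


tl1_end = parse_lsn("0/6087088")   # where timeline 1 ends
wanted = tl1_end                   # record requested on the new timeline
first_page = segment_start(wanted) # page read first for validation

print("wanted record :", format_lsn(wanted), "timeline", timeline_for(wanted, tl1_end))
print("validation read:", format_lsn(first_page), "timeline", timeline_for(first_page, tl1_end))
```

In this model the requested record sits on timeline 2, but the first page
of its segment lies before the switch point, so the validation read asks
for timeline 1 again.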

Heikki, any ideas?

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


