Thread: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Michael Paquier

Date:

17 January 2013, 04:47:45

Hi all,<br /><br />There is a strange bug with the latest master head (commit 7fcbf6a).<br />When the WAL stream with a
masteris cut on a slave, slave returns a FATAL (well normal...), but then enters in recovery process and automatically
promotes.<br/> Here are more details about the logs on slave (I simply killed the master manually):<br />FATAL:  could
notreceive data from WAL stream:<br />cp: cannot stat
‘/home/michael/bin/pgsql/archive/master/000000010000000000000004’:No such file or directory<br /> LOG:  record with
zerolength at 0/401E1B8<br />LOG:  redo done at 0/401E178<br />LOG:  last completed transaction was at log time
2013-01-1720:27:53.180971+09<br />cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/00000002.history’: No such
fileor directory<br /> LOG:  selected new timeline ID: 2<br />cp: cannot stat
‘/home/michael/bin/pgsql/archive/master/00000001.history’:No such file or directory<br />LOG:  archive recovery
complete<br/>DEBUG:  resetting unlogged relations: cleanup 0 init 1<br /> LOG:  database system is ready to accept
connections<br/>LOG:  autovacuum launcher started<br />DEBUG:  archived transaction log file
"000000010000000000000004"<br/>DEBUG:  archived transaction log file "00000002.history"<br /> LOG:  statement: create
tablebn (a int);<br />DEBUG:  autovacuum: processing database "postgres"<br /><br />Slave does not try anymore to
reconnectto master with messages of the type:<br />FATAL:  could not connect to the primary server<br /><br />I also
noticedthat there is some delay until modifications on master are visible on slave.<br />For example run a simple
CREATETABLE and the new table is not <br /><br />[some bisecting later...]<br /><br />I think that bug has been
introducedby commit 7fcbf6a.<br />Before splitting xlog reading as a separate facility things worked correctly.<br
/>Thereare also no delay problems before this commit.<br /><br />Does someone else noticed that?<br />-- <br />Michael
Paquier<br/><a href="http://michael.otacoo.com" target="_blank">http://michael.otacoo.com</a>

Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Andres Freund

Date:

17 January 2013, 10:54:07

On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:
> Hi all,
>
> There is a strange bug with the latest master head (commit 7fcbf6a).
> When the WAL stream with a master is cut on a slave, slave returns a FATAL
> (well normal...), but then enters in recovery process and automatically
> promotes.
> Here are more details about the logs on slave (I simply killed the master
> manually):
> FATAL:  could not receive data from WAL stream:
> cp: cannot stat
> ‘/home/michael/bin/pgsql/archive/master/000000010000000000000004’: No such
> file or directory
> LOG:  record with zero length at 0/401E1B8
> LOG:  redo done at 0/401E178
> LOG:  last completed transaction was at log time 2013-01-17
> 20:27:53.180971+09
> cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/00000002.history’:
> No such file or directory
> LOG:  selected new timeline ID: 2
> cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/00000001.history’:
> No such file or directory
> LOG:  archive recovery complete
> DEBUG:  resetting unlogged relations: cleanup 0 init 1
> LOG:  database system is ready to accept connections
> LOG:  autovacuum launcher started
> DEBUG:  archived transaction log file "000000010000000000000004"
> DEBUG:  archived transaction log file "00000002.history"
> LOG:  statement: create table bn (a int);
> DEBUG:  autovacuum: processing database "postgres"
>
> Slave does not try anymore to reconnect to master with messages of the type:
> FATAL:  could not connect to the primary server
>
> I also noticed that there is some delay until modifications on master are
> visible on slave.
> For example run a simple CREATE TABLE and the new table is not
>
> [some bisecting later...]
>
> I think that bug has been introduced by commit 7fcbf6a.
> Before splitting xlog reading as a separate facility things worked
> correctly.
> There are also no delay problems before this commit.
>
> Does someone else noticed that?

No, at least I haven't, but it got committed only a rather short time
ago ;)
Looking, I have an inkling where the rpoblem could be.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Andres Freund

Date:

17 January 2013, 13:05:21

On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:
> Hi all,
>
> There is a strange bug with the latest master head (commit 7fcbf6a).
> When the WAL stream with a master is cut on a slave, slave returns a FATAL
> (well normal...), but then enters in recovery process and automatically
> promotes.
> Here are more details about the logs on slave (I simply killed the master
> manually):
> FATAL:  could not receive data from WAL stream:
> cp: cannot stat
> ‘/home/michael/bin/pgsql/archive/master/000000010000000000000004’: No such
> file or directory
> LOG:  record with zero length at 0/401E1B8
> LOG:  redo done at 0/401E178
> LOG:  last completed transaction was at log time 2013-01-17
> 20:27:53.180971+09
> cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/00000002.history’:
> No such file or directory
> LOG:  selected new timeline ID: 2
> cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/00000001.history’:
> No such file or directory
> LOG:  archive recovery complete
> DEBUG:  resetting unlogged relations: cleanup 0 init 1
> LOG:  database system is ready to accept connections
> LOG:  autovacuum launcher started
> DEBUG:  archived transaction log file "000000010000000000000004"
> DEBUG:  archived transaction log file "00000002.history"
> LOG:  statement: create table bn (a int);
> DEBUG:  autovacuum: processing database "postgres"
>
> Slave does not try anymore to reconnect to master with messages of the type:
> FATAL:  could not connect to the primary server
>
> I also noticed that there is some delay until modifications on master are
> visible on slave.
> For example run a simple CREATE TABLE and the new table is not
>
> [some bisecting later...]
>
> I think that bug has been introduced by commit 7fcbf6a.
> Before splitting xlog reading as a separate facility things worked
> correctly.
> There are also no delay problems before this commit.

Ok, my inkling proved to be correct, its two related issues:

a ) The error handling in XLogReadRecord is inconsistent, it doesn't
always reset the necessary things.

b) ReadRecord: We cannot not break out of the retry loop in readRecord
just so, just removing the break seems correct.

c) ReadRecord: (minor): We should log an error even if errormsg is not
set, otherwise we wont jump out far enough.

I think at least a) and b) is the result of merges between development
of different people, sorry for that.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Heikki Linnakangas

Date:

17 January 2013, 15:18:07

On 17.01.2013 15:05, Andres Freund wrote:
> On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:
>> I think that bug has been introduced by commit 7fcbf6a.
>> Before splitting xlog reading as a separate facility things worked
>> correctly.
>> There are also no delay problems before this commit.
>
> Ok, my inkling proved to be correct, its two related issues:
>
> a ) The error handling in XLogReadRecord is inconsistent, it doesn't
> always reset the necessary things.
>
> b) ReadRecord: We cannot not break out of the retry loop in readRecord
> just so, just removing the break seems correct.
>
> c) ReadRecord: (minor): We should log an error even if errormsg is not
> set, otherwise we wont jump out far enough.
>
> I think at least a) and b) is the result of merges between development
> of different people, sorry for that.

Got a patch?

- Heikki

Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Andres Freund

Date:

17 January 2013, 15:23:51

On 2013-01-17 17:18:14 +0200, Heikki Linnakangas wrote:
> On 17.01.2013 15:05, Andres Freund wrote:
> >On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:
> >>I think that bug has been introduced by commit 7fcbf6a.
> >>Before splitting xlog reading as a separate facility things worked
> >>correctly.
> >>There are also no delay problems before this commit.
> >
> >Ok, my inkling proved to be correct, its two related issues:
> >
> >a ) The error handling in XLogReadRecord is inconsistent, it doesn't
> >always reset the necessary things.
> >
> >b) ReadRecord: We cannot not break out of the retry loop in readRecord
> >just so, just removing the break seems correct.
> >
> >c) ReadRecord: (minor): We should log an error even if errormsg is not
> >set, otherwise we wont jump out far enough.
> >
> >I think at least a) and b) is the result of merges between development
> >of different people, sorry for that.
> 
> Got a patch?

Yes, I am just testing some scenarios with it, will send it after that.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Andres Freund

Date:

17 January 2013, 15:42:36

On 2013-01-17 16:23:44 +0100, Andres Freund wrote:
> On 2013-01-17 17:18:14 +0200, Heikki Linnakangas wrote:
> > On 17.01.2013 15:05, Andres Freund wrote:
> > >On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:
> > >>I think that bug has been introduced by commit 7fcbf6a.
> > >>Before splitting xlog reading as a separate facility things worked
> > >>correctly.
> > >>There are also no delay problems before this commit.
> > >
> > >Ok, my inkling proved to be correct, its two related issues:
> > >
> > >a ) The error handling in XLogReadRecord is inconsistent, it doesn't
> > >always reset the necessary things.
> > >
> > >b) ReadRecord: We cannot not break out of the retry loop in readRecord
> > >just so, just removing the break seems correct.
> > >
> > >c) ReadRecord: (minor): We should log an error even if errormsg is not
> > >set, otherwise we wont jump out far enough.
> > >
> > >I think at least a) and b) is the result of merges between development
> > >of different people, sorry for that.
> >
> > Got a patch?
>
> Yes, I am just testing some scenarios with it, will send it after that.

Ok, the attached patch seems to fix a) and b). c) above is bogus, as
explained in a comment in the patch.  I also noticed that the TLI check
didn't mark the last source as failed.

Not a real issue and its independent from this patch but I found that
when promoting from streaming rep the code first fails over to archive
recovery and only then to recovering from pg_xlog.  Is that intended?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

fix-error-handling-xlogreader.patch

Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Heikki Linnakangas

Date:

17 January 2013, 16:33:39

On 17.01.2013 17:42, Andres Freund wrote:
> Ok, the attached patch seems to fix a) and b). c) above is bogus, as
> explained in a comment in the patch.  I also noticed that the TLI check
> didn't mark the last source as failed.

This looks fragile:

>             /*
>              * We only end up here without a message when XLogPageRead() failed
>              * - in that case we already logged something.
>              * In StandbyMode that only happens if we have been triggered, so
>              * we shouldn't loop anymore in that case.
>              */
>             if (errormsg == NULL)
>                 break;

I don't like relying on the presence of an error message to control 
logic like that. Should we throw in an explicit CheckForStandbyTrigger() 
check in the condition of that loop?

- Heikki

Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Andres Freund

Date:

17 January 2013, 16:42:29

On 2013-01-17 18:33:42 +0200, Heikki Linnakangas wrote:
> On 17.01.2013 17:42, Andres Freund wrote:
> >Ok, the attached patch seems to fix a) and b). c) above is bogus, as
> >explained in a comment in the patch.  I also noticed that the TLI check
> >didn't mark the last source as failed.
> 
> This looks fragile:
> 
> >            /*
> >             * We only end up here without a message when XLogPageRead() failed
> >             * - in that case we already logged something.
> >             * In StandbyMode that only happens if we have been triggered, so
> >             * we shouldn't loop anymore in that case.
> >             */
> >            if (errormsg == NULL)
> >                break;
> 
> I don't like relying on the presence of an error message to control logic
> like that. Should we throw in an explicit CheckForStandbyTrigger() check in
> the condition of that loop?

I agree, I wasn't too happy about that either. But in some sense its
only propagating state from XLogReadPage which already has dealt with
the error and decided its ok.
Its the solution closest to what happened in the old implementation,
thats why I thought it would be halfway to acceptable.

Adding the CheckForStandbyTrigger() in the condition would mean
promotion would happen before all the available records are processed
and it would increase the amount of stat()s tremendously.
So I don't really like that either.

Don't really have a good idea :(

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Heikki Linnakangas

Date:

17 January 2013, 16:50:45

On 17.01.2013 18:42, Andres Freund wrote:
> On 2013-01-17 18:33:42 +0200, Heikki Linnakangas wrote:
>> On 17.01.2013 17:42, Andres Freund wrote:
>>> Ok, the attached patch seems to fix a) and b). c) above is bogus, as
>>> explained in a comment in the patch.  I also noticed that the TLI check
>>> didn't mark the last source as failed.
>>
>> This looks fragile:
>>
>>>             /*
>>>              * We only end up here without a message when XLogPageRead() failed
>>>              * - in that case we already logged something.
>>>              * In StandbyMode that only happens if we have been triggered, so
>>>              * we shouldn't loop anymore in that case.
>>>              */
>>>             if (errormsg == NULL)
>>>                 break;
>>
>> I don't like relying on the presence of an error message to control logic
>> like that. Should we throw in an explicit CheckForStandbyTrigger() check in
>> the condition of that loop?
>
> I agree, I wasn't too happy about that either. But in some sense its
> only propagating state from XLogReadPage which already has dealt with
> the error and decided its ok.
> Its the solution closest to what happened in the old implementation,
> thats why I thought it would be halfway to acceptable.
>
> Adding the CheckForStandbyTrigger() in the condition would mean
> promotion would happen before all the available records are processed
> and it would increase the amount of stat()s tremendously.
> So I don't really like that either.

I was thinking of the attached. As long as we check for
CheckForStandbyTrigger() after the "record == NULL" check, we won't
perform extra stat() calls on successful reads, only when we're polling
after reaching the end of valid WAL. That seems acceptable. If we want
to avoid even that, we could move the static 'triggered' variable from
CheckForStandbyTrigger() out of that function, and check that in the loop.

- Heikki

Attachment

fix-error-handling-xlogreader-2.patch

Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Andres Freund

Date:

17 January 2013, 16:55:15

On 2013-01-17 18:50:35 +0200, Heikki Linnakangas wrote:
> On 17.01.2013 18:42, Andres Freund wrote:
> >On 2013-01-17 18:33:42 +0200, Heikki Linnakangas wrote:
> >>On 17.01.2013 17:42, Andres Freund wrote:
> >>>Ok, the attached patch seems to fix a) and b). c) above is bogus, as
> >>>explained in a comment in the patch.  I also noticed that the TLI check
> >>>didn't mark the last source as failed.
> >>
> >>This looks fragile:
> >>
> >>>            /*
> >>>             * We only end up here without a message when XLogPageRead() failed
> >>>             * - in that case we already logged something.
> >>>             * In StandbyMode that only happens if we have been triggered, so
> >>>             * we shouldn't loop anymore in that case.
> >>>             */
> >>>            if (errormsg == NULL)
> >>>                break;
> >>
> >>I don't like relying on the presence of an error message to control logic
> >>like that. Should we throw in an explicit CheckForStandbyTrigger() check in
> >>the condition of that loop?
> >
> >I agree, I wasn't too happy about that either. But in some sense its
> >only propagating state from XLogReadPage which already has dealt with
> >the error and decided its ok.
> >Its the solution closest to what happened in the old implementation,
> >thats why I thought it would be halfway to acceptable.
> >
> >Adding the CheckForStandbyTrigger() in the condition would mean
> >promotion would happen before all the available records are processed
> >and it would increase the amount of stat()s tremendously.
> >So I don't really like that either.
> 
> I was thinking of the attached. As long as we check for
> CheckForStandbyTrigger() after the "record == NULL" check, we won't perform
> extra stat() calls on successful reads, only when we're polling after
> reaching the end of valid WAL. That seems acceptable. 

Looks good to me.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Heikki Linnakangas

Date:

17 January 2013, 17:29:44

On 17.01.2013 18:55, Andres Freund wrote:
> On 2013-01-17 18:50:35 +0200, Heikki Linnakangas wrote:
>> I was thinking of the attached. As long as we check for
>> CheckForStandbyTrigger() after the "record == NULL" check, we won't perform
>> extra stat() calls on successful reads, only when we're polling after
>> reaching the end of valid WAL. That seems acceptable.
>
> Looks good to me.

Ok, committed.

- Heikki

Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Andres Freund

Date:

17 January 2013, 17:35:29

On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:
> Slave does not try anymore to reconnect to master with messages of the type:
> FATAL:  could not connect to the primary server
>
> I also noticed that there is some delay until modifications on master are
> visible on slave.

> I think that bug has been introduced by commit 7fcbf6a.
> Before splitting xlog reading as a separate facility things worked
> correctly.
> There are also no delay problems before this commit.

Heikki committed a fix for at least the promotion issue, I didn't notice
any problem with an increased delay, could you check again if you still
see it?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Fujii Masao

Date:

17 January 2013, 18:05:49

On Fri, Jan 18, 2013 at 2:35 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:
>> Slave does not try anymore to reconnect to master with messages of the type:
>> FATAL:  could not connect to the primary server
>>
>> I also noticed that there is some delay until modifications on master are
>> visible on slave.
>
>> I think that bug has been introduced by commit 7fcbf6a.
>> Before splitting xlog reading as a separate facility things worked
>> correctly.
>> There are also no delay problems before this commit.
>
> Heikki committed a fix for at least the promotion issue, I didn't notice
> any problem with an increased delay, could you check again if you still
> see it?

I encountered the problem that the timeline switch is not performed expectedly.
I set up one master, one standby and one cascade standby. All the servers
share the archive directory. restore_command is specified in the recovery.conf
in those two standbys.

I shut down the master, and then promoted the standby. In this case, the
cascade standby should switch to new timeline and replication should be
successfully restarted. But the timeline was never changed, and the following
log messages were kept outputting.

sby2 LOG:  restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG:  replication terminated by primary server
sby2 DETAIL:  End of WAL reached on timeline 1
sby2 LOG:  restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG:  replication terminated by primary server
sby2 DETAIL:  End of WAL reached on timeline 1
sby2 LOG:  restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG:  replication terminated by primary server
sby2 DETAIL:  End of WAL reached on timeline 1
....

Regards,

-- 
Fujii Masao

Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Andres Freund

Date:

17 January 2013, 18:08:34

On 2013-01-18 03:05:47 +0900, Fujii Masao wrote:
> On Fri, Jan 18, 2013 at 2:35 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:
> >> Slave does not try anymore to reconnect to master with messages of the type:
> >> FATAL:  could not connect to the primary server
> >>
> >> I also noticed that there is some delay until modifications on master are
> >> visible on slave.
> >
> >> I think that bug has been introduced by commit 7fcbf6a.
> >> Before splitting xlog reading as a separate facility things worked
> >> correctly.
> >> There are also no delay problems before this commit.
> >
> > Heikki committed a fix for at least the promotion issue, I didn't notice
> > any problem with an increased delay, could you check again if you still
> > see it?
> 
> I encountered the problem that the timeline switch is not performed expectedly.
> I set up one master, one standby and one cascade standby. All the servers
> share the archive directory. restore_command is specified in the recovery.conf
> in those two standbys.
> 
> I shut down the master, and then promoted the standby. In this case, the
> cascade standby should switch to new timeline and replication should be
> successfully restarted. But the timeline was never changed, and the following
> log messages were kept outputting.
> 
> sby2 LOG:  restarted WAL streaming at 0/3000000 on timeline 1
> sby2 LOG:  replication terminated by primary server
> sby2 DETAIL:  End of WAL reached on timeline 1
> sby2 LOG:  restarted WAL streaming at 0/3000000 on timeline 1
> sby2 LOG:  replication terminated by primary server
> sby2 DETAIL:  End of WAL reached on timeline 1
> sby2 LOG:  restarted WAL streaming at 0/3000000 on timeline 1
> sby2 LOG:  replication terminated by primary server
> sby2 DETAIL:  End of WAL reached on timeline 1
> ....

That's after the commit or before? Because in passing I think I
noticed/fixed a bug that could cause exactly that problem...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Heikki Linnakangas

Date:

17 January 2013, 19:57:22

On 17.01.2013 20:08, Andres Freund wrote:
> On 2013-01-18 03:05:47 +0900, Fujii Masao wrote:
>> I encountered the problem that the timeline switch is not performed expectedly.
>> I set up one master, one standby and one cascade standby. All the servers
>> share the archive directory. restore_command is specified in the recovery.conf
>> in those two standbys.
>>
>> I shut down the master, and then promoted the standby. In this case, the
>> cascade standby should switch to new timeline and replication should be
>> successfully restarted. But the timeline was never changed, and the following
>> log messages were kept outputting.
>>
>> sby2 LOG:  restarted WAL streaming at 0/3000000 on timeline 1
>> sby2 LOG:  replication terminated by primary server
>> sby2 DETAIL:  End of WAL reached on timeline 1
>> sby2 LOG:  restarted WAL streaming at 0/3000000 on timeline 1
>> sby2 LOG:  replication terminated by primary server
>> sby2 DETAIL:  End of WAL reached on timeline 1
>> sby2 LOG:  restarted WAL streaming at 0/3000000 on timeline 1
>> sby2 LOG:  replication terminated by primary server
>> sby2 DETAIL:  End of WAL reached on timeline 1
>> ....
>
> That's after the commit or before? Because in passing I think I
> noticed/fixed a bug that could cause exactly that problem...

I think I broke that with the "teach pg_receivexlog to cross timelines" 
patch. Will take a look...

- Heikki

Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Heikki Linnakangas

Date:

17 January 2013, 21:49:26

On 17.01.2013 21:57, Heikki Linnakangas wrote:
> On 17.01.2013 20:08, Andres Freund wrote:
>> On 2013-01-18 03:05:47 +0900, Fujii Masao wrote:
>>> I encountered the problem that the timeline switch is not performed
>>> expectedly.
>>> I set up one master, one standby and one cascade standby. All the
>>> servers
>>> share the archive directory. restore_command is specified in the
>>> recovery.conf
>>> in those two standbys.
>>>
>>> I shut down the master, and then promoted the standby. In this case, the
>>> cascade standby should switch to new timeline and replication should be
>>> successfully restarted. But the timeline was never changed, and the
>>> following
>>> log messages were kept outputting.
>>>
>>> sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
>>> sby2 LOG: replication terminated by primary server
>>> sby2 DETAIL: End of WAL reached on timeline 1
>>> sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
>>> sby2 LOG: replication terminated by primary server
>>> sby2 DETAIL: End of WAL reached on timeline 1
>>> sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
>>> sby2 LOG: replication terminated by primary server
>>> sby2 DETAIL: End of WAL reached on timeline 1
>>> ....
>>
>> That's after the commit or before? Because in passing I think I
>> noticed/fixed a bug that could cause exactly that problem...
>
> I think I broke that with the "teach pg_receivexlog to cross timelines"
> patch. Will take a look...

Ok, there was a couple of issues. First, as I suspected above, I added a 
new result set after copy has ended in START_STREAMING command, but 
forgot to teach walreceiver about it. Fixed that.

Secondly, there's an interesting interaction between the new xlogreader 
code and streaming replication and timeline switches:

The first call to the page_read callback in xlogreader is always made 
with size SizeOfXLogRecord, counting from the beginning of the page. 
That's bogus; there can be no WAL record in the very beginning of page, 
because of the page header, so I think that was supposed to be 
SizeXLogShortPHD. But that won't fix the issue.

The problem is that XLogPageRead callback uses the page address and 
requested length to decide what timeline to stream from. For example, 
imagine that there's a timeline switch at offset 1000 in a page, and we 
have already read all the WAL up to that point. When xlogreader first 
asks to fetch the first 32 bytes from the page (the bogus 
SizeOfXLogRecord), XLogPageRead deduces that that byte range is still on 
the old timeline, and starts streaming from the old timeline. Next, 
xlogreader needs the rest of the page, up to 1000 + SizeOfPageHeader, to 
read the first record it's actually interested in, XLogPageRead will 
return an error because that range is not on the timeline that's 
currently streamed. And we loop back to retry, and run into the same 
problem again.

This interaction is a bit too subtle for my taste, but the 
straightforward fix is to just modify xlogreader so that the first 
read_page call requests all the bytes up to record we're actually 
interested in. That seems like a smart thing to do anyway.

I committed a patch for that second issue too, but please take a look in 
case you come up with a better idea.

- Heikki

Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Andres Freund

Date:

17 January 2013, 22:07:33

On 2013-01-17 23:49:22 +0200, Heikki Linnakangas wrote:
> On 17.01.2013 21:57, Heikki Linnakangas wrote:
> >On 17.01.2013 20:08, Andres Freund wrote:
> >>On 2013-01-18 03:05:47 +0900, Fujii Masao wrote:
> >>>I encountered the problem that the timeline switch is not performed
> >>>expectedly.
> >>>I set up one master, one standby and one cascade standby. All the
> >>>servers
> >>>share the archive directory. restore_command is specified in the
> >>>recovery.conf
> >>>in those two standbys.
> >>>
> >>>I shut down the master, and then promoted the standby. In this case, the
> >>>cascade standby should switch to new timeline and replication should be
> >>>successfully restarted. But the timeline was never changed, and the
> >>>following
> >>>log messages were kept outputting.
> >>>
> >>>sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
> >>>sby2 LOG: replication terminated by primary server
> >>>sby2 DETAIL: End of WAL reached on timeline 1
> >>>sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
> >>>sby2 LOG: replication terminated by primary server
> >>>sby2 DETAIL: End of WAL reached on timeline 1
> >>>sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
> >>>sby2 LOG: replication terminated by primary server
> >>>sby2 DETAIL: End of WAL reached on timeline 1
> >>>....
> >>
> >>That's after the commit or before? Because in passing I think I
> >>noticed/fixed a bug that could cause exactly that problem...
> >
> >I think I broke that with the "teach pg_receivexlog to cross timelines"
> >patch. Will take a look...
>
> Ok, there was a couple of issues. First, as I suspected above, I added a new
> result set after copy has ended in START_STREAMING command, but forgot to
> teach walreceiver about it. Fixed that.
>
> Secondly, there's an interesting interaction between the new xlogreader code
> and streaming replication and timeline switches:
>
> The first call to the page_read callback in xlogreader is always made with
> size SizeOfXLogRecord, counting from the beginning of the page. That's
> bogus; there can be no WAL record in the very beginning of page, because of
> the page header, so I think that was supposed to be SizeXLogShortPHD. But
> that won't fix the issue.

Yea, thats clearly a typo.

> The problem is that XLogPageRead callback uses the page address and
> requested length to decide what timeline to stream from. For example,
> imagine that there's a timeline switch at offset 1000 in a page, and we have
> already read all the WAL up to that point. When xlogreader first asks to
> fetch the first 32 bytes from the page (the bogus SizeOfXLogRecord),
> XLogPageRead deduces that that byte range is still on the old timeline, and
> starts streaming from the old timeline. Next, xlogreader needs the rest of
> the page, up to 1000 + SizeOfPageHeader, to read the first record it's
> actually interested in, XLogPageRead will return an error because that range
> is not on the timeline that's currently streamed. And we loop back to retry,
> and run into the same problem again.

> This interaction is a bit too subtle for my taste, but the straightforward
> fix is to just modify xlogreader so that the first read_page call requests
> all the bytes up to record we're actually interested in. That seems like a
> smart thing to do anyway.

Yuck. Reading from targetRecOff onwards seems like a good idea, yes. But
ISTM that the code isn't really safe now: Assume targetRecOff is 0
(which is valid) then we read up to SizeOfXLogRecord but assume that we
can read both the xlog record as well as the page header.
Then you also need to factor in that the page header can be differently
big.

And yes, its too subtle :(

Do you want to fix that or shall I?

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Michael Paquier

Date:

17 January 2013, 23:24:34

On Fri, Jan 18, 2013 at 3:05 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

I encountered the problem that the timeline switch is not performed expectedly.
I set up one master, one standby and one cascade standby. All the servers
share the archive directory. restore_command is specified in the recovery.conf
in those two standbys.

I shut down the master, and then promoted the standby. In this case, the
cascade standby should switch to new timeline and replication should be
successfully restarted. But the timeline was never changed, and the following
log messages were kept outputting.

sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1

I am seeing similar issues with master at 88228e6.
This is easily reproducible by setting up 2 slaves under a master, then kill the master. Promote slave 1 and reconnect slave 2 to slave 1, then you will notice that the timeline jump is not done.

I don't know if Masao tried to put in sync the slave that reconnects to the promoted slave, but in this case slave2 stucks in "potential" state". That is due to timeline that has not changed on slave2 but better to let you know...

The replication delays are still here.

--
Michael Paquier
http://michael.otacoo.com

Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Andres Freund

Date:

17 January 2013, 23:48:49

On 2013-01-18 08:24:31 +0900, Michael Paquier wrote:
> On Fri, Jan 18, 2013 at 3:05 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> 
> >  I encountered the problem that the timeline switch is not performed
> > expectedly.
> > I set up one master, one standby and one cascade standby. All the servers
> > share the archive directory. restore_command is specified in the
> > recovery.conf
> > in those two standbys.
> >
> > I shut down the master, and then promoted the standby. In this case, the
> > cascade standby should switch to new timeline and replication should be
> > successfully restarted. But the timeline was never changed, and the
> > following
> > log messages were kept outputting.
> >
> > sby2 LOG:  restarted WAL streaming at 0/3000000 on timeline 1
> > sby2 LOG:  replication terminated by primary server
> > sby2 DETAIL:  End of WAL reached on timeline 1
> > sby2 LOG:  restarted WAL streaming at 0/3000000 on timeline 1
> > sby2 LOG:  replication terminated by primary server
> > sby2 DETAIL:  End of WAL reached on timeline 1
> > sby2 LOG:  restarted WAL streaming at 0/3000000 on timeline 1
> > sby2 LOG:  replication terminated by primary server
> > sby2 DETAIL:  End of WAL reached on timeline 1
> >
> I am seeing similar issues with master at 88228e6.
> This is easily reproducible by setting up 2 slaves under a master, then
> kill the master. Promote slave 1 and  reconnect slave 2 to slave 1, then
> you will notice that the timeline jump is not done.

Can you reproduce that one with 7fcbf6a^ (i.e before xlogreader got
split off?).

> The replication delays are still here.

That one is caused by this nice bug, courtesy of yours truly:
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 90ba32e..1174493 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8874,7 +8874,7 @@ retry:   /* See if we need to retrieve more data */   if (readFile < 0 ||       (readSource ==
XLOG_FROM_STREAM&&
 
-        receivedUpto <= targetPagePtr + reqLen))
+        receivedUpto < targetPagePtr + reqLen))   {       if (StandbyMode)       {

I didn't notice because I had a testscript inserting stuff continuously
and it cause at most lagging by one record...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Michael Paquier

Date:

18 January 2013, 00:33:26

On Fri, Jan 18, 2013 at 8:48 AM, Andres Freund <andres@2ndquadrant.com> wrote:

Can you reproduce that one with 7fcbf6a^ (i.e before xlogreader got
split off?).

Yes, it is reproducible before the xlog reader split.
Just an additional report, the master jumps correctly to the new timeline.

> The replication delays are still here.

That one is caused by this nice bug, courtesy of yours truly:
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 90ba32e..1174493 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8874,7 +8874,7 @@ retry:
/* See if we need to retrieve more data */
if (readFile < 0 ||
(readSource == XLOG_FROM_STREAM &&
- receivedUpto <= targetPagePtr + reqLen))
+ receivedUpto < targetPagePtr + reqLen))
{
if (StandbyMode)
{

I didn't notice because I had a testscript inserting stuff continuously
and it cause at most lagging by one record...

This fix is indeed working.

--
Michael Paquier
http://michael.otacoo.com

Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Andres Freund

Date:

18 January 2013, 00:35:27

On 2013-01-18 08:24:31 +0900, Michael Paquier wrote:
> On Fri, Jan 18, 2013 at 3:05 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> 
> >  I encountered the problem that the timeline switch is not performed
> > expectedly.
> > I set up one master, one standby and one cascade standby. All the servers
> > share the archive directory. restore_command is specified in the
> > recovery.conf
> > in those two standbys.
> >
> > I shut down the master, and then promoted the standby. In this case, the
> > cascade standby should switch to new timeline and replication should be
> > successfully restarted. But the timeline was never changed, and the
> > following
> > log messages were kept outputting.
> >
> > sby2 LOG:  restarted WAL streaming at 0/3000000 on timeline 1
> > sby2 LOG:  replication terminated by primary server
> > sby2 DETAIL:  End of WAL reached on timeline 1
> > sby2 LOG:  restarted WAL streaming at 0/3000000 on timeline 1
> > sby2 LOG:  replication terminated by primary server
> > sby2 DETAIL:  End of WAL reached on timeline 1
> > sby2 LOG:  restarted WAL streaming at 0/3000000 on timeline 1
> > sby2 LOG:  replication terminated by primary server
> > sby2 DETAIL:  End of WAL reached on timeline 1
> >
> I am seeing similar issues with master at 88228e6.
> This is easily reproducible by setting up 2 slaves under a master, then
> kill the master. Promote slave 1 and  reconnect slave 2 to slave 1, then
> you will notice that the timeline jump is not done.
> 
> I don't know if Masao tried to put in sync the slave that reconnects to the
> promoted slave, but in this case slave2 stucks in "potential" state". That
> is due to timeline that has not changed on slave2 but better to let you
> know...

Ok, I know whats causing this now. Rather ugly.

Whenever accessing a page in a segment we haven't accessed before we
read the first page to do an extra bit of validation as the first page
in a segment contains more information.

Suppose timeline 1 ends at 0/6087088, xlog.c notices that WAL ends
there, wants to read the new timeline, requests record
0/06087088. xlogreader wants to do its validation and goes back to the
first page in the segment which triggers xlog.c to rerequest timeline1
to be transferred..

Heikki, any ideas?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Heikki Linnakangas

Date:

18 January 2013, 09:20:52

On 18.01.2013 02:35, Andres Freund wrote:
> On 2013-01-18 08:24:31 +0900, Michael Paquier wrote:
>> On Fri, Jan 18, 2013 at 3:05 AM, Fujii Masao<masao.fujii@gmail.com>  wrote:
>>
>>>   I encountered the problem that the timeline switch is not performed
>>> expectedly.
>>> I set up one master, one standby and one cascade standby. All the servers
>>> share the archive directory. restore_command is specified in the
>>> recovery.conf
>>> in those two standbys.
>>>
>>> I shut down the master, and then promoted the standby. In this case, the
>>> cascade standby should switch to new timeline and replication should be
>>> successfully restarted. But the timeline was never changed, and the
>>> following
>>> log messages were kept outputting.
>>>
>>> sby2 LOG:  restarted WAL streaming at 0/3000000 on timeline 1
>>> sby2 LOG:  replication terminated by primary server
>>> sby2 DETAIL:  End of WAL reached on timeline 1
>>> sby2 LOG:  restarted WAL streaming at 0/3000000 on timeline 1
>>> sby2 LOG:  replication terminated by primary server
>>> sby2 DETAIL:  End of WAL reached on timeline 1
>>> sby2 LOG:  restarted WAL streaming at 0/3000000 on timeline 1
>>> sby2 LOG:  replication terminated by primary server
>>> sby2 DETAIL:  End of WAL reached on timeline 1
>>>
>> I am seeing similar issues with master at 88228e6.
>> This is easily reproducible by setting up 2 slaves under a master, then
>> kill the master. Promote slave 1 and  reconnect slave 2 to slave 1, then
>> you will notice that the timeline jump is not done.
>>
>> I don't know if Masao tried to put in sync the slave that reconnects to the
>> promoted slave, but in this case slave2 stucks in "potential" state". That
>> is due to timeline that has not changed on slave2 but better to let you
>> know...
>
> Ok, I know whats causing this now. Rather ugly.
>
> Whenever accessing a page in a segment we haven't accessed before we
> read the first page to do an extra bit of validation as the first page
> in a segment contains more information.
>
> Suppose timeline 1 ends at 0/6087088, xlog.c notices that WAL ends
> there, wants to read the new timeline, requests record
> 0/06087088. xlogreader wants to do its validation and goes back to the
> first page in the segment which triggers xlog.c to rerequest timeline1
> to be transferred..

Hmm, so it's the same issue I thought I fixed yesterday. My patch only
fixed it for the case that the timeline switch is in the first page of
the segment. When it's not, you still get two calls for a WAL record,
first one for the first page in the segment, to verify that, and then
the page that actually contains the record. The first call leads
XLogPageRead to think it needs to read from the old timeline.

We didn't have this problem before the xlogreader refactoring because
XLogPageRead() was always called with the RecPtr of the record, even
when we actually read the segment header from the file first. We'll have
to somehow get that same information, the RecPtr of the record we're
actually interested in, to XLogPageRead(). We could add a new argument
to the callback for that, or we could keep xlogreader.c as it is and
pass it through from ReadRecord to XLogPageRead() in the private struct.

An explicit argument to the callback is probably best. That's
straightforward, and it might be useful for the callback to know the
actual WAL position that xlogreader.c is interested in anyway. See attached.

- Heikki

Attachment

choose-correct-timeline-in-xlogpageread.patch

Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Alvaro Herrera

Date:

18 January 2013, 14:20:52

Andres Freund escribió:
> On 2013-01-18 08:24:31 +0900, Michael Paquier wrote:
>
> > The replication delays are still here.
>
> That one is caused by this nice bug, courtesy of yours truly:
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index 90ba32e..1174493 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -8874,7 +8874,7 @@ retry:
>     /* See if we need to retrieve more data */
>     if (readFile < 0 ||
>         (readSource == XLOG_FROM_STREAM &&
> -        receivedUpto <= targetPagePtr + reqLen))
> +        receivedUpto < targetPagePtr + reqLen))
>     {
>         if (StandbyMode)
>         {

Pushed.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Michael Paquier

Date:

22 January 2013, 00:06:53

On Fri, Jan 18, 2013 at 6:20 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

Hmm, so it's the same issue I thought I fixed yesterday. My patch only fixed it for the case that the timeline switch is in the first page of the segment. When it's not, you still get two calls for a WAL record, first one for the first page in the segment, to verify that, and then the page that actually contains the record. The first call leads XLogPageRead to think it needs to read from the old timeline.

We didn't have this problem before the xlogreader refactoring because XLogPageRead() was always called with the RecPtr of the record, even when we actually read the segment header from the file first. We'll have to somehow get that same information, the RecPtr of the record we're actually interested in, to XLogPageRead(). We could add a new argument to the callback for that, or we could keep xlogreader.c as it is and pass it through from ReadRecord to XLogPageRead() in the private struct.

An explicit argument to the callback is probably best. That's straightforward, and it might be useful for the callback to know the actual WAL position that xlogreader.c is interested in anyway. See attached.

Just to let you know that I am still getting the error even after commit 2ff6555.
With the same scenario:
1) Start a master with 2 slaves
2) Kill/Stop slave
3) Promote slave 1, it switches to timeline 2
Log on slave 1
LOG: selected new timeline ID: 2
4) Reconnect slave 2 to save 1, slave 2 remains stuck in timeline 1 even if recovery_target_timeline is set to latest
Log on slave 1 at this moment:
DEBUG: received replication command: IDENTIFY_SYSTEM
DEBUG: received replication command: TIMELINE_HISTORY 2
DEBUG: received replication command: START_REPLICATION 0/5000000 TIMELINE 1
Slave 1 receives command to start replication with timeline 1, while it is sync with timeline 2.
Log on slave 2 at this moment:
LOG: restarted WAL streaming at 0/5000000 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1 at 0/5014200
DEBUG: walreceiver ended streaming and awaits new instructions

The timeline history file is the same for both nodes:
$ cat 00000002.history
1 0/5014200 no recovery target specified

I might be wrong, but shouldn't there be also an entry for timeline 2 in this file?

Am I missing something?

--
Michael Paquier
http://michael.otacoo.com

Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From

Michael Paquier

Date:

22 January 2013, 00:25:10

On Tue, Jan 22, 2013 at 9:06 AM, Michael Paquier <michael.paquier@gmail.com> wrote:

On Fri, Jan 18, 2013 at 6:20 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
Hmm, so it's the same issue I thought I fixed yesterday. My patch only fixed it for the case that the timeline switch is in the first page of the segment. When it's not, you still get two calls for a WAL record, first one for the first page in the segment, to verify that, and then the page that actually contains the record. The first call leads XLogPageRead to think it needs to read from the old timeline.

We didn't have this problem before the xlogreader refactoring because XLogPageRead() was always called with the RecPtr of the record, even when we actually read the segment header from the file first. We'll have to somehow get that same information, the RecPtr of the record we're actually interested in, to XLogPageRead(). We could add a new argument to the callback for that, or we could keep xlogreader.c as it is and pass it through from ReadRecord to XLogPageRead() in the private struct.

An explicit argument to the callback is probably best. That's straightforward, and it might be useful for the callback to know the actual WAL position that xlogreader.c is interested in anyway. See attached.
Just to let you know that I am still getting the error even after commit 2ff6555.
With the same scenario:
1) Start a master with 2 slaves
2) Kill/Stop slave
3) Promote slave 1, it switches to timeline 2
Log on slave 1

LOG: selected new timeline ID: 2
4) Reconnect slave 2 to save 1, slave 2 remains stuck in timeline 1 even if recovery_target_timeline is set to latest
Log on slave 1 at this moment:
DEBUG: received replication command: IDENTIFY_SYSTEM
DEBUG: received replication command: TIMELINE_HISTORY 2
DEBUG: received replication command: START_REPLICATION 0/5000000 TIMELINE 1
Slave 1 receives command to start replication with timeline 1, while it is sync with timeline 2.
Log on slave 2 at this moment:
LOG: restarted WAL streaming at 0/5000000 on timeline 1

LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1 at 0/5014200
DEBUG: walreceiver ended streaming and awaits new instructions

The timeline history file is the same for both nodes:
$ cat 00000002.history
1 0/5014200 no recovery target specified

I might be wrong, but shouldn't there be also an entry for timeline 2 in this file?

Am I missing something?

Sorry, there are no problems...
I simply forgot to set up recovery_target_timeline to 'latest' in recovery.conf...

--
Michael Paquier
http://michael.otacoo.com