Thread: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Michael Paquier
Date:
Hi all,<br /><br />There is a strange bug with the latest master head (commit 7fcbf6a).<br />When the WAL stream with a masteris cut on a slave, slave returns a FATAL (well normal...), but then enters in recovery process and automatically promotes.<br/> Here are more details about the logs on slave (I simply killed the master manually):<br />FATAL: could notreceive data from WAL stream:<br />cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/000000010000000000000004’:No such file or directory<br /> LOG: record with zerolength at 0/401E1B8<br />LOG: redo done at 0/401E178<br />LOG: last completed transaction was at log time 2013-01-1720:27:53.180971+09<br />cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/00000002.history’: No such fileor directory<br /> LOG: selected new timeline ID: 2<br />cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/00000001.history’:No such file or directory<br />LOG: archive recovery complete<br/>DEBUG: resetting unlogged relations: cleanup 0 init 1<br /> LOG: database system is ready to accept connections<br/>LOG: autovacuum launcher started<br />DEBUG: archived transaction log file "000000010000000000000004"<br/>DEBUG: archived transaction log file "00000002.history"<br /> LOG: statement: create tablebn (a int);<br />DEBUG: autovacuum: processing database "postgres"<br /><br />Slave does not try anymore to reconnectto master with messages of the type:<br />FATAL: could not connect to the primary server<br /><br />I also noticedthat there is some delay until modifications on master are visible on slave.<br />For example run a simple CREATETABLE and the new table is not <br /><br />[some bisecting later...]<br /><br />I think that bug has been introducedby commit 7fcbf6a.<br />Before splitting xlog reading as a separate facility things worked correctly.<br />Thereare also no delay problems before this commit.<br /><br />Does someone else noticed that?<br />-- <br />Michael Paquier<br/><a href="http://michael.otacoo.com" target="_blank">http://michael.otacoo.com</a>
Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Andres Freund
Date:
On 2013-01-17 13:47:41 +0900, Michael Paquier wrote: > Hi all, > > There is a strange bug with the latest master head (commit 7fcbf6a). > When the WAL stream with a master is cut on a slave, slave returns a FATAL > (well normal...), but then enters in recovery process and automatically > promotes. > Here are more details about the logs on slave (I simply killed the master > manually): > FATAL: could not receive data from WAL stream: > cp: cannot stat > ‘/home/michael/bin/pgsql/archive/master/000000010000000000000004’: No such > file or directory > LOG: record with zero length at 0/401E1B8 > LOG: redo done at 0/401E178 > LOG: last completed transaction was at log time 2013-01-17 > 20:27:53.180971+09 > cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/00000002.history’: > No such file or directory > LOG: selected new timeline ID: 2 > cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/00000001.history’: > No such file or directory > LOG: archive recovery complete > DEBUG: resetting unlogged relations: cleanup 0 init 1 > LOG: database system is ready to accept connections > LOG: autovacuum launcher started > DEBUG: archived transaction log file "000000010000000000000004" > DEBUG: archived transaction log file "00000002.history" > LOG: statement: create table bn (a int); > DEBUG: autovacuum: processing database "postgres" > > Slave does not try anymore to reconnect to master with messages of the type: > FATAL: could not connect to the primary server > > I also noticed that there is some delay until modifications on master are > visible on slave. > For example run a simple CREATE TABLE and the new table is not > > [some bisecting later...] > > I think that bug has been introduced by commit 7fcbf6a. > Before splitting xlog reading as a separate facility things worked > correctly. > There are also no delay problems before this commit. > > Does someone else noticed that? No, at least I haven't, but it got committed only a rather short time ago ;) Looking, I have an inkling where the rpoblem could be. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Andres Freund
Date:
On 2013-01-17 13:47:41 +0900, Michael Paquier wrote: > Hi all, > > There is a strange bug with the latest master head (commit 7fcbf6a). > When the WAL stream with a master is cut on a slave, slave returns a FATAL > (well normal...), but then enters in recovery process and automatically > promotes. > Here are more details about the logs on slave (I simply killed the master > manually): > FATAL: could not receive data from WAL stream: > cp: cannot stat > ‘/home/michael/bin/pgsql/archive/master/000000010000000000000004’: No such > file or directory > LOG: record with zero length at 0/401E1B8 > LOG: redo done at 0/401E178 > LOG: last completed transaction was at log time 2013-01-17 > 20:27:53.180971+09 > cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/00000002.history’: > No such file or directory > LOG: selected new timeline ID: 2 > cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/00000001.history’: > No such file or directory > LOG: archive recovery complete > DEBUG: resetting unlogged relations: cleanup 0 init 1 > LOG: database system is ready to accept connections > LOG: autovacuum launcher started > DEBUG: archived transaction log file "000000010000000000000004" > DEBUG: archived transaction log file "00000002.history" > LOG: statement: create table bn (a int); > DEBUG: autovacuum: processing database "postgres" > > Slave does not try anymore to reconnect to master with messages of the type: > FATAL: could not connect to the primary server > > I also noticed that there is some delay until modifications on master are > visible on slave. > For example run a simple CREATE TABLE and the new table is not > > [some bisecting later...] > > I think that bug has been introduced by commit 7fcbf6a. > Before splitting xlog reading as a separate facility things worked > correctly. > There are also no delay problems before this commit. Ok, my inkling proved to be correct, its two related issues: a ) The error handling in XLogReadRecord is inconsistent, it doesn't always reset the necessary things. b) ReadRecord: We cannot not break out of the retry loop in readRecord just so, just removing the break seems correct. c) ReadRecord: (minor): We should log an error even if errormsg is not set, otherwise we wont jump out far enough. I think at least a) and b) is the result of merges between development of different people, sorry for that. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Heikki Linnakangas
Date:
On 17.01.2013 15:05, Andres Freund wrote: > On 2013-01-17 13:47:41 +0900, Michael Paquier wrote: >> I think that bug has been introduced by commit 7fcbf6a. >> Before splitting xlog reading as a separate facility things worked >> correctly. >> There are also no delay problems before this commit. > > Ok, my inkling proved to be correct, its two related issues: > > a ) The error handling in XLogReadRecord is inconsistent, it doesn't > always reset the necessary things. > > b) ReadRecord: We cannot not break out of the retry loop in readRecord > just so, just removing the break seems correct. > > c) ReadRecord: (minor): We should log an error even if errormsg is not > set, otherwise we wont jump out far enough. > > I think at least a) and b) is the result of merges between development > of different people, sorry for that. Got a patch? - Heikki
Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Andres Freund
Date:
On 2013-01-17 17:18:14 +0200, Heikki Linnakangas wrote: > On 17.01.2013 15:05, Andres Freund wrote: > >On 2013-01-17 13:47:41 +0900, Michael Paquier wrote: > >>I think that bug has been introduced by commit 7fcbf6a. > >>Before splitting xlog reading as a separate facility things worked > >>correctly. > >>There are also no delay problems before this commit. > > > >Ok, my inkling proved to be correct, its two related issues: > > > >a ) The error handling in XLogReadRecord is inconsistent, it doesn't > >always reset the necessary things. > > > >b) ReadRecord: We cannot not break out of the retry loop in readRecord > >just so, just removing the break seems correct. > > > >c) ReadRecord: (minor): We should log an error even if errormsg is not > >set, otherwise we wont jump out far enough. > > > >I think at least a) and b) is the result of merges between development > >of different people, sorry for that. > > Got a patch? Yes, I am just testing some scenarios with it, will send it after that. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Andres Freund
Date:
On 2013-01-17 16:23:44 +0100, Andres Freund wrote: > On 2013-01-17 17:18:14 +0200, Heikki Linnakangas wrote: > > On 17.01.2013 15:05, Andres Freund wrote: > > >On 2013-01-17 13:47:41 +0900, Michael Paquier wrote: > > >>I think that bug has been introduced by commit 7fcbf6a. > > >>Before splitting xlog reading as a separate facility things worked > > >>correctly. > > >>There are also no delay problems before this commit. > > > > > >Ok, my inkling proved to be correct, its two related issues: > > > > > >a ) The error handling in XLogReadRecord is inconsistent, it doesn't > > >always reset the necessary things. > > > > > >b) ReadRecord: We cannot not break out of the retry loop in readRecord > > >just so, just removing the break seems correct. > > > > > >c) ReadRecord: (minor): We should log an error even if errormsg is not > > >set, otherwise we wont jump out far enough. > > > > > >I think at least a) and b) is the result of merges between development > > >of different people, sorry for that. > > > > Got a patch? > > Yes, I am just testing some scenarios with it, will send it after that. Ok, the attached patch seems to fix a) and b). c) above is bogus, as explained in a comment in the patch. I also noticed that the TLI check didn't mark the last source as failed. Not a real issue and its independent from this patch but I found that when promoting from streaming rep the code first fails over to archive recovery and only then to recovering from pg_xlog. Is that intended? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Heikki Linnakangas
Date:
On 17.01.2013 17:42, Andres Freund wrote: > Ok, the attached patch seems to fix a) and b). c) above is bogus, as > explained in a comment in the patch. I also noticed that the TLI check > didn't mark the last source as failed. This looks fragile: > /* > * We only end up here without a message when XLogPageRead() failed > * - in that case we already logged something. > * In StandbyMode that only happens if we have been triggered, so > * we shouldn't loop anymore in that case. > */ > if (errormsg == NULL) > break; I don't like relying on the presence of an error message to control logic like that. Should we throw in an explicit CheckForStandbyTrigger() check in the condition of that loop? - Heikki
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Andres Freund
Date:
On 2013-01-17 18:33:42 +0200, Heikki Linnakangas wrote: > On 17.01.2013 17:42, Andres Freund wrote: > >Ok, the attached patch seems to fix a) and b). c) above is bogus, as > >explained in a comment in the patch. I also noticed that the TLI check > >didn't mark the last source as failed. > > This looks fragile: > > > /* > > * We only end up here without a message when XLogPageRead() failed > > * - in that case we already logged something. > > * In StandbyMode that only happens if we have been triggered, so > > * we shouldn't loop anymore in that case. > > */ > > if (errormsg == NULL) > > break; > > I don't like relying on the presence of an error message to control logic > like that. Should we throw in an explicit CheckForStandbyTrigger() check in > the condition of that loop? I agree, I wasn't too happy about that either. But in some sense its only propagating state from XLogReadPage which already has dealt with the error and decided its ok. Its the solution closest to what happened in the old implementation, thats why I thought it would be halfway to acceptable. Adding the CheckForStandbyTrigger() in the condition would mean promotion would happen before all the available records are processed and it would increase the amount of stat()s tremendously. So I don't really like that either. Don't really have a good idea :( Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Heikki Linnakangas
Date:
On 17.01.2013 18:42, Andres Freund wrote: > On 2013-01-17 18:33:42 +0200, Heikki Linnakangas wrote: >> On 17.01.2013 17:42, Andres Freund wrote: >>> Ok, the attached patch seems to fix a) and b). c) above is bogus, as >>> explained in a comment in the patch. I also noticed that the TLI check >>> didn't mark the last source as failed. >> >> This looks fragile: >> >>> /* >>> * We only end up here without a message when XLogPageRead() failed >>> * - in that case we already logged something. >>> * In StandbyMode that only happens if we have been triggered, so >>> * we shouldn't loop anymore in that case. >>> */ >>> if (errormsg == NULL) >>> break; >> >> I don't like relying on the presence of an error message to control logic >> like that. Should we throw in an explicit CheckForStandbyTrigger() check in >> the condition of that loop? > > I agree, I wasn't too happy about that either. But in some sense its > only propagating state from XLogReadPage which already has dealt with > the error and decided its ok. > Its the solution closest to what happened in the old implementation, > thats why I thought it would be halfway to acceptable. > > Adding the CheckForStandbyTrigger() in the condition would mean > promotion would happen before all the available records are processed > and it would increase the amount of stat()s tremendously. > So I don't really like that either. I was thinking of the attached. As long as we check for CheckForStandbyTrigger() after the "record == NULL" check, we won't perform extra stat() calls on successful reads, only when we're polling after reaching the end of valid WAL. That seems acceptable. If we want to avoid even that, we could move the static 'triggered' variable from CheckForStandbyTrigger() out of that function, and check that in the loop. - Heikki
Attachment
Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Andres Freund
Date:
On 2013-01-17 18:50:35 +0200, Heikki Linnakangas wrote: > On 17.01.2013 18:42, Andres Freund wrote: > >On 2013-01-17 18:33:42 +0200, Heikki Linnakangas wrote: > >>On 17.01.2013 17:42, Andres Freund wrote: > >>>Ok, the attached patch seems to fix a) and b). c) above is bogus, as > >>>explained in a comment in the patch. I also noticed that the TLI check > >>>didn't mark the last source as failed. > >> > >>This looks fragile: > >> > >>> /* > >>> * We only end up here without a message when XLogPageRead() failed > >>> * - in that case we already logged something. > >>> * In StandbyMode that only happens if we have been triggered, so > >>> * we shouldn't loop anymore in that case. > >>> */ > >>> if (errormsg == NULL) > >>> break; > >> > >>I don't like relying on the presence of an error message to control logic > >>like that. Should we throw in an explicit CheckForStandbyTrigger() check in > >>the condition of that loop? > > > >I agree, I wasn't too happy about that either. But in some sense its > >only propagating state from XLogReadPage which already has dealt with > >the error and decided its ok. > >Its the solution closest to what happened in the old implementation, > >thats why I thought it would be halfway to acceptable. > > > >Adding the CheckForStandbyTrigger() in the condition would mean > >promotion would happen before all the available records are processed > >and it would increase the amount of stat()s tremendously. > >So I don't really like that either. > > I was thinking of the attached. As long as we check for > CheckForStandbyTrigger() after the "record == NULL" check, we won't perform > extra stat() calls on successful reads, only when we're polling after > reaching the end of valid WAL. That seems acceptable. Looks good to me. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Heikki Linnakangas
Date:
On 17.01.2013 18:55, Andres Freund wrote: > On 2013-01-17 18:50:35 +0200, Heikki Linnakangas wrote: >> I was thinking of the attached. As long as we check for >> CheckForStandbyTrigger() after the "record == NULL" check, we won't perform >> extra stat() calls on successful reads, only when we're polling after >> reaching the end of valid WAL. That seems acceptable. > > Looks good to me. Ok, committed. - Heikki
Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Andres Freund
Date:
On 2013-01-17 13:47:41 +0900, Michael Paquier wrote: > Slave does not try anymore to reconnect to master with messages of the type: > FATAL: could not connect to the primary server > > I also noticed that there is some delay until modifications on master are > visible on slave. > I think that bug has been introduced by commit 7fcbf6a. > Before splitting xlog reading as a separate facility things worked > correctly. > There are also no delay problems before this commit. Heikki committed a fix for at least the promotion issue, I didn't notice any problem with an increased delay, could you check again if you still see it? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Fujii Masao
Date:
On Fri, Jan 18, 2013 at 2:35 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-01-17 13:47:41 +0900, Michael Paquier wrote: >> Slave does not try anymore to reconnect to master with messages of the type: >> FATAL: could not connect to the primary server >> >> I also noticed that there is some delay until modifications on master are >> visible on slave. > >> I think that bug has been introduced by commit 7fcbf6a. >> Before splitting xlog reading as a separate facility things worked >> correctly. >> There are also no delay problems before this commit. > > Heikki committed a fix for at least the promotion issue, I didn't notice > any problem with an increased delay, could you check again if you still > see it? I encountered the problem that the timeline switch is not performed expectedly. I set up one master, one standby and one cascade standby. All the servers share the archive directory. restore_command is specified in the recovery.conf in those two standbys. I shut down the master, and then promoted the standby. In this case, the cascade standby should switch to new timeline and replication should be successfully restarted. But the timeline was never changed, and the following log messages were kept outputting. sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 .... Regards, -- Fujii Masao
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Andres Freund
Date:
On 2013-01-18 03:05:47 +0900, Fujii Masao wrote: > On Fri, Jan 18, 2013 at 2:35 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2013-01-17 13:47:41 +0900, Michael Paquier wrote: > >> Slave does not try anymore to reconnect to master with messages of the type: > >> FATAL: could not connect to the primary server > >> > >> I also noticed that there is some delay until modifications on master are > >> visible on slave. > > > >> I think that bug has been introduced by commit 7fcbf6a. > >> Before splitting xlog reading as a separate facility things worked > >> correctly. > >> There are also no delay problems before this commit. > > > > Heikki committed a fix for at least the promotion issue, I didn't notice > > any problem with an increased delay, could you check again if you still > > see it? > > I encountered the problem that the timeline switch is not performed expectedly. > I set up one master, one standby and one cascade standby. All the servers > share the archive directory. restore_command is specified in the recovery.conf > in those two standbys. > > I shut down the master, and then promoted the standby. In this case, the > cascade standby should switch to new timeline and replication should be > successfully restarted. But the timeline was never changed, and the following > log messages were kept outputting. > > sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 > sby2 LOG: replication terminated by primary server > sby2 DETAIL: End of WAL reached on timeline 1 > sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 > sby2 LOG: replication terminated by primary server > sby2 DETAIL: End of WAL reached on timeline 1 > sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 > sby2 LOG: replication terminated by primary server > sby2 DETAIL: End of WAL reached on timeline 1 > .... That's after the commit or before? Because in passing I think I noticed/fixed a bug that could cause exactly that problem... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Heikki Linnakangas
Date:
On 17.01.2013 20:08, Andres Freund wrote: > On 2013-01-18 03:05:47 +0900, Fujii Masao wrote: >> I encountered the problem that the timeline switch is not performed expectedly. >> I set up one master, one standby and one cascade standby. All the servers >> share the archive directory. restore_command is specified in the recovery.conf >> in those two standbys. >> >> I shut down the master, and then promoted the standby. In this case, the >> cascade standby should switch to new timeline and replication should be >> successfully restarted. But the timeline was never changed, and the following >> log messages were kept outputting. >> >> sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 >> sby2 LOG: replication terminated by primary server >> sby2 DETAIL: End of WAL reached on timeline 1 >> sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 >> sby2 LOG: replication terminated by primary server >> sby2 DETAIL: End of WAL reached on timeline 1 >> sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 >> sby2 LOG: replication terminated by primary server >> sby2 DETAIL: End of WAL reached on timeline 1 >> .... > > That's after the commit or before? Because in passing I think I > noticed/fixed a bug that could cause exactly that problem... I think I broke that with the "teach pg_receivexlog to cross timelines" patch. Will take a look... - Heikki
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Heikki Linnakangas
Date:
On 17.01.2013 21:57, Heikki Linnakangas wrote: > On 17.01.2013 20:08, Andres Freund wrote: >> On 2013-01-18 03:05:47 +0900, Fujii Masao wrote: >>> I encountered the problem that the timeline switch is not performed >>> expectedly. >>> I set up one master, one standby and one cascade standby. All the >>> servers >>> share the archive directory. restore_command is specified in the >>> recovery.conf >>> in those two standbys. >>> >>> I shut down the master, and then promoted the standby. In this case, the >>> cascade standby should switch to new timeline and replication should be >>> successfully restarted. But the timeline was never changed, and the >>> following >>> log messages were kept outputting. >>> >>> sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 >>> sby2 LOG: replication terminated by primary server >>> sby2 DETAIL: End of WAL reached on timeline 1 >>> sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 >>> sby2 LOG: replication terminated by primary server >>> sby2 DETAIL: End of WAL reached on timeline 1 >>> sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 >>> sby2 LOG: replication terminated by primary server >>> sby2 DETAIL: End of WAL reached on timeline 1 >>> .... >> >> That's after the commit or before? Because in passing I think I >> noticed/fixed a bug that could cause exactly that problem... > > I think I broke that with the "teach pg_receivexlog to cross timelines" > patch. Will take a look... Ok, there was a couple of issues. First, as I suspected above, I added a new result set after copy has ended in START_STREAMING command, but forgot to teach walreceiver about it. Fixed that. Secondly, there's an interesting interaction between the new xlogreader code and streaming replication and timeline switches: The first call to the page_read callback in xlogreader is always made with size SizeOfXLogRecord, counting from the beginning of the page. That's bogus; there can be no WAL record in the very beginning of page, because of the page header, so I think that was supposed to be SizeXLogShortPHD. But that won't fix the issue. The problem is that XLogPageRead callback uses the page address and requested length to decide what timeline to stream from. For example, imagine that there's a timeline switch at offset 1000 in a page, and we have already read all the WAL up to that point. When xlogreader first asks to fetch the first 32 bytes from the page (the bogus SizeOfXLogRecord), XLogPageRead deduces that that byte range is still on the old timeline, and starts streaming from the old timeline. Next, xlogreader needs the rest of the page, up to 1000 + SizeOfPageHeader, to read the first record it's actually interested in, XLogPageRead will return an error because that range is not on the timeline that's currently streamed. And we loop back to retry, and run into the same problem again. This interaction is a bit too subtle for my taste, but the straightforward fix is to just modify xlogreader so that the first read_page call requests all the bytes up to record we're actually interested in. That seems like a smart thing to do anyway. I committed a patch for that second issue too, but please take a look in case you come up with a better idea. - Heikki
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Andres Freund
Date:
On 2013-01-17 23:49:22 +0200, Heikki Linnakangas wrote: > On 17.01.2013 21:57, Heikki Linnakangas wrote: > >On 17.01.2013 20:08, Andres Freund wrote: > >>On 2013-01-18 03:05:47 +0900, Fujii Masao wrote: > >>>I encountered the problem that the timeline switch is not performed > >>>expectedly. > >>>I set up one master, one standby and one cascade standby. All the > >>>servers > >>>share the archive directory. restore_command is specified in the > >>>recovery.conf > >>>in those two standbys. > >>> > >>>I shut down the master, and then promoted the standby. In this case, the > >>>cascade standby should switch to new timeline and replication should be > >>>successfully restarted. But the timeline was never changed, and the > >>>following > >>>log messages were kept outputting. > >>> > >>>sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 > >>>sby2 LOG: replication terminated by primary server > >>>sby2 DETAIL: End of WAL reached on timeline 1 > >>>sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 > >>>sby2 LOG: replication terminated by primary server > >>>sby2 DETAIL: End of WAL reached on timeline 1 > >>>sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 > >>>sby2 LOG: replication terminated by primary server > >>>sby2 DETAIL: End of WAL reached on timeline 1 > >>>.... > >> > >>That's after the commit or before? Because in passing I think I > >>noticed/fixed a bug that could cause exactly that problem... > > > >I think I broke that with the "teach pg_receivexlog to cross timelines" > >patch. Will take a look... > > Ok, there was a couple of issues. First, as I suspected above, I added a new > result set after copy has ended in START_STREAMING command, but forgot to > teach walreceiver about it. Fixed that. > > Secondly, there's an interesting interaction between the new xlogreader code > and streaming replication and timeline switches: > > The first call to the page_read callback in xlogreader is always made with > size SizeOfXLogRecord, counting from the beginning of the page. That's > bogus; there can be no WAL record in the very beginning of page, because of > the page header, so I think that was supposed to be SizeXLogShortPHD. But > that won't fix the issue. Yea, thats clearly a typo. > The problem is that XLogPageRead callback uses the page address and > requested length to decide what timeline to stream from. For example, > imagine that there's a timeline switch at offset 1000 in a page, and we have > already read all the WAL up to that point. When xlogreader first asks to > fetch the first 32 bytes from the page (the bogus SizeOfXLogRecord), > XLogPageRead deduces that that byte range is still on the old timeline, and > starts streaming from the old timeline. Next, xlogreader needs the rest of > the page, up to 1000 + SizeOfPageHeader, to read the first record it's > actually interested in, XLogPageRead will return an error because that range > is not on the timeline that's currently streamed. And we loop back to retry, > and run into the same problem again. > This interaction is a bit too subtle for my taste, but the straightforward > fix is to just modify xlogreader so that the first read_page call requests > all the bytes up to record we're actually interested in. That seems like a > smart thing to do anyway. Yuck. Reading from targetRecOff onwards seems like a good idea, yes. But ISTM that the code isn't really safe now: Assume targetRecOff is 0 (which is valid) then we read up to SizeOfXLogRecord but assume that we can read both the xlog record as well as the page header. Then you also need to factor in that the page header can be differently big. And yes, its too subtle :( Do you want to fix that or shall I? Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Michael Paquier
Date:
On Fri, Jan 18, 2013 at 3:05 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
--
Michael Paquier
http://michael.otacoo.com
I encountered the problem that the timeline switch is not performed expectedly.I set up one master, one standby and one cascade standby. All the servers
share the archive directory. restore_command is specified in the recovery.conf
in those two standbys.
I shut down the master, and then promoted the standby. In this case, the
cascade standby should switch to new timeline and replication should be
successfully restarted. But the timeline was never changed, and the following
log messages were kept outputting.
sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
I am seeing similar issues with master at 88228e6.
This is easily reproducible by setting up 2 slaves under a master, then kill the master. Promote slave 1 and reconnect slave 2 to slave 1, then you will notice that the timeline jump is not done.
I don't know if Masao tried to put in sync the slave that reconnects to the promoted slave, but in this case slave2 stucks in "potential" state". That is due to timeline that has not changed on slave2 but better to let you know...
The replication delays are still here.
This is easily reproducible by setting up 2 slaves under a master, then kill the master. Promote slave 1 and reconnect slave 2 to slave 1, then you will notice that the timeline jump is not done.
I don't know if Masao tried to put in sync the slave that reconnects to the promoted slave, but in this case slave2 stucks in "potential" state". That is due to timeline that has not changed on slave2 but better to let you know...
The replication delays are still here.
Michael Paquier
http://michael.otacoo.com
Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Andres Freund
Date:
On 2013-01-18 08:24:31 +0900, Michael Paquier wrote: > On Fri, Jan 18, 2013 at 3:05 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > > > I encountered the problem that the timeline switch is not performed > > expectedly. > > I set up one master, one standby and one cascade standby. All the servers > > share the archive directory. restore_command is specified in the > > recovery.conf > > in those two standbys. > > > > I shut down the master, and then promoted the standby. In this case, the > > cascade standby should switch to new timeline and replication should be > > successfully restarted. But the timeline was never changed, and the > > following > > log messages were kept outputting. > > > > sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 > > sby2 LOG: replication terminated by primary server > > sby2 DETAIL: End of WAL reached on timeline 1 > > sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 > > sby2 LOG: replication terminated by primary server > > sby2 DETAIL: End of WAL reached on timeline 1 > > sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 > > sby2 LOG: replication terminated by primary server > > sby2 DETAIL: End of WAL reached on timeline 1 > > > I am seeing similar issues with master at 88228e6. > This is easily reproducible by setting up 2 slaves under a master, then > kill the master. Promote slave 1 and reconnect slave 2 to slave 1, then > you will notice that the timeline jump is not done. Can you reproduce that one with 7fcbf6a^ (i.e before xlogreader got split off?). > The replication delays are still here. That one is caused by this nice bug, courtesy of yours truly: diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 90ba32e..1174493 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -8874,7 +8874,7 @@ retry: /* See if we need to retrieve more data */ if (readFile < 0 || (readSource == XLOG_FROM_STREAM&& - receivedUpto <= targetPagePtr + reqLen)) + receivedUpto < targetPagePtr + reqLen)) { if (StandbyMode) { I didn't notice because I had a testscript inserting stuff continuously and it cause at most lagging by one record... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Michael Paquier
Date:
On Fri, Jan 18, 2013 at 8:48 AM, Andres Freund <andres@2ndquadrant.com> wrote:
-- split off?).Can you reproduce that one with 7fcbf6a^ (i.e before xlogreader got
Yes, it is reproducible before the xlog reader split.
Just an additional report, the master jumps correctly to the new timeline.
Just an additional report, the master jumps correctly to the new timeline.
> The replication delays are still here.That one is caused by this nice bug, courtesy of yours truly:
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 90ba32e..1174493 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8874,7 +8874,7 @@ retry:
/* See if we need to retrieve more data */
if (readFile < 0 ||
(readSource == XLOG_FROM_STREAM &&
- receivedUpto <= targetPagePtr + reqLen))
+ receivedUpto < targetPagePtr + reqLen))
{
if (StandbyMode)
{
I didn't notice because I had a testscript inserting stuff continuously
and it cause at most lagging by one record...
This fix is indeed working.
Michael Paquier
http://michael.otacoo.com
Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Andres Freund
Date:
On 2013-01-18 08:24:31 +0900, Michael Paquier wrote: > On Fri, Jan 18, 2013 at 3:05 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > > > I encountered the problem that the timeline switch is not performed > > expectedly. > > I set up one master, one standby and one cascade standby. All the servers > > share the archive directory. restore_command is specified in the > > recovery.conf > > in those two standbys. > > > > I shut down the master, and then promoted the standby. In this case, the > > cascade standby should switch to new timeline and replication should be > > successfully restarted. But the timeline was never changed, and the > > following > > log messages were kept outputting. > > > > sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 > > sby2 LOG: replication terminated by primary server > > sby2 DETAIL: End of WAL reached on timeline 1 > > sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 > > sby2 LOG: replication terminated by primary server > > sby2 DETAIL: End of WAL reached on timeline 1 > > sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 > > sby2 LOG: replication terminated by primary server > > sby2 DETAIL: End of WAL reached on timeline 1 > > > I am seeing similar issues with master at 88228e6. > This is easily reproducible by setting up 2 slaves under a master, then > kill the master. Promote slave 1 and reconnect slave 2 to slave 1, then > you will notice that the timeline jump is not done. > > I don't know if Masao tried to put in sync the slave that reconnects to the > promoted slave, but in this case slave2 stucks in "potential" state". That > is due to timeline that has not changed on slave2 but better to let you > know... Ok, I know whats causing this now. Rather ugly. Whenever accessing a page in a segment we haven't accessed before we read the first page to do an extra bit of validation as the first page in a segment contains more information. Suppose timeline 1 ends at 0/6087088, xlog.c notices that WAL ends there, wants to read the new timeline, requests record 0/06087088. xlogreader wants to do its validation and goes back to the first page in the segment which triggers xlog.c to rerequest timeline1 to be transferred.. Heikki, any ideas? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Heikki Linnakangas
Date:
On 18.01.2013 02:35, Andres Freund wrote: > On 2013-01-18 08:24:31 +0900, Michael Paquier wrote: >> On Fri, Jan 18, 2013 at 3:05 AM, Fujii Masao<masao.fujii@gmail.com> wrote: >> >>> I encountered the problem that the timeline switch is not performed >>> expectedly. >>> I set up one master, one standby and one cascade standby. All the servers >>> share the archive directory. restore_command is specified in the >>> recovery.conf >>> in those two standbys. >>> >>> I shut down the master, and then promoted the standby. In this case, the >>> cascade standby should switch to new timeline and replication should be >>> successfully restarted. But the timeline was never changed, and the >>> following >>> log messages were kept outputting. >>> >>> sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 >>> sby2 LOG: replication terminated by primary server >>> sby2 DETAIL: End of WAL reached on timeline 1 >>> sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 >>> sby2 LOG: replication terminated by primary server >>> sby2 DETAIL: End of WAL reached on timeline 1 >>> sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1 >>> sby2 LOG: replication terminated by primary server >>> sby2 DETAIL: End of WAL reached on timeline 1 >>> >> I am seeing similar issues with master at 88228e6. >> This is easily reproducible by setting up 2 slaves under a master, then >> kill the master. Promote slave 1 and reconnect slave 2 to slave 1, then >> you will notice that the timeline jump is not done. >> >> I don't know if Masao tried to put in sync the slave that reconnects to the >> promoted slave, but in this case slave2 stucks in "potential" state". That >> is due to timeline that has not changed on slave2 but better to let you >> know... > > Ok, I know whats causing this now. Rather ugly. > > Whenever accessing a page in a segment we haven't accessed before we > read the first page to do an extra bit of validation as the first page > in a segment contains more information. > > Suppose timeline 1 ends at 0/6087088, xlog.c notices that WAL ends > there, wants to read the new timeline, requests record > 0/06087088. xlogreader wants to do its validation and goes back to the > first page in the segment which triggers xlog.c to rerequest timeline1 > to be transferred.. Hmm, so it's the same issue I thought I fixed yesterday. My patch only fixed it for the case that the timeline switch is in the first page of the segment. When it's not, you still get two calls for a WAL record, first one for the first page in the segment, to verify that, and then the page that actually contains the record. The first call leads XLogPageRead to think it needs to read from the old timeline. We didn't have this problem before the xlogreader refactoring because XLogPageRead() was always called with the RecPtr of the record, even when we actually read the segment header from the file first. We'll have to somehow get that same information, the RecPtr of the record we're actually interested in, to XLogPageRead(). We could add a new argument to the callback for that, or we could keep xlogreader.c as it is and pass it through from ReadRecord to XLogPageRead() in the private struct. An explicit argument to the callback is probably best. That's straightforward, and it might be useful for the callback to know the actual WAL position that xlogreader.c is interested in anyway. See attached. - Heikki
Attachment
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Alvaro Herrera
Date:
Andres Freund escribió: > On 2013-01-18 08:24:31 +0900, Michael Paquier wrote: > > > The replication delays are still here. > > That one is caused by this nice bug, courtesy of yours truly: > diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c > index 90ba32e..1174493 100644 > --- a/src/backend/access/transam/xlog.c > +++ b/src/backend/access/transam/xlog.c > @@ -8874,7 +8874,7 @@ retry: > /* See if we need to retrieve more data */ > if (readFile < 0 || > (readSource == XLOG_FROM_STREAM && > - receivedUpto <= targetPagePtr + reqLen)) > + receivedUpto < targetPagePtr + reqLen)) > { > if (StandbyMode) > { Pushed. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Michael Paquier
Date:
On Fri, Jan 18, 2013 at 6:20 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
-- Hmm, so it's the same issue I thought I fixed yesterday. My patch only fixed it for the case that the timeline switch is in the first page of the segment. When it's not, you still get two calls for a WAL record, first one for the first page in the segment, to verify that, and then the page that actually contains the record. The first call leads XLogPageRead to think it needs to read from the old timeline.
We didn't have this problem before the xlogreader refactoring because XLogPageRead() was always called with the RecPtr of the record, even when we actually read the segment header from the file first. We'll have to somehow get that same information, the RecPtr of the record we're actually interested in, to XLogPageRead(). We could add a new argument to the callback for that, or we could keep xlogreader.c as it is and pass it through from ReadRecord to XLogPageRead() in the private struct.
An explicit argument to the callback is probably best. That's straightforward, and it might be useful for the callback to know the actual WAL position that xlogreader.c is interested in anyway. See attached.
Just to let you know that I am still getting the error even after commit 2ff6555.
With the same scenario:
1) Start a master with 2 slaves
2) Kill/Stop slave
3) Promote slave 1, it switches to timeline 2
Log on slave 1
LOG: selected new timeline ID: 2
4) Reconnect slave 2 to save 1, slave 2 remains stuck in timeline 1 even if recovery_target_timeline is set to latest
Log on slave 1 at this moment:
DEBUG: received replication command: IDENTIFY_SYSTEM
DEBUG: received replication command: TIMELINE_HISTORY 2
DEBUG: received replication command: START_REPLICATION 0/5000000 TIMELINE 1
Slave 1 receives command to start replication with timeline 1, while it is sync with timeline 2.
Log on slave 2 at this moment:
LOG: restarted WAL streaming at 0/5000000 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1 at 0/5014200
DEBUG: walreceiver ended streaming and awaits new instructions
The timeline history file is the same for both nodes:
$ cat 00000002.history
1 0/5014200 no recovery target specified
I might be wrong, but shouldn't there be also an entry for timeline 2 in this file?
Am I missing something?
With the same scenario:
1) Start a master with 2 slaves
2) Kill/Stop slave
3) Promote slave 1, it switches to timeline 2
Log on slave 1
LOG: selected new timeline ID: 2
4) Reconnect slave 2 to save 1, slave 2 remains stuck in timeline 1 even if recovery_target_timeline is set to latest
Log on slave 1 at this moment:
DEBUG: received replication command: IDENTIFY_SYSTEM
DEBUG: received replication command: TIMELINE_HISTORY 2
DEBUG: received replication command: START_REPLICATION 0/5000000 TIMELINE 1
Slave 1 receives command to start replication with timeline 1, while it is sync with timeline 2.
Log on slave 2 at this moment:
LOG: restarted WAL streaming at 0/5000000 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1 at 0/5014200
DEBUG: walreceiver ended streaming and awaits new instructions
The timeline history file is the same for both nodes:
$ cat 00000002.history
1 0/5014200 no recovery target specified
I might be wrong, but shouldn't there be also an entry for timeline 2 in this file?
Am I missing something?
Michael Paquier
http://michael.otacoo.com
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
From
Michael Paquier
Date:
On Tue, Jan 22, 2013 at 9:06 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
-- On Fri, Jan 18, 2013 at 6:20 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:Hmm, so it's the same issue I thought I fixed yesterday. My patch only fixed it for the case that the timeline switch is in the first page of the segment. When it's not, you still get two calls for a WAL record, first one for the first page in the segment, to verify that, and then the page that actually contains the record. The first call leads XLogPageRead to think it needs to read from the old timeline.
We didn't have this problem before the xlogreader refactoring because XLogPageRead() was always called with the RecPtr of the record, even when we actually read the segment header from the file first. We'll have to somehow get that same information, the RecPtr of the record we're actually interested in, to XLogPageRead(). We could add a new argument to the callback for that, or we could keep xlogreader.c as it is and pass it through from ReadRecord to XLogPageRead() in the private struct.
An explicit argument to the callback is probably best. That's straightforward, and it might be useful for the callback to know the actual WAL position that xlogreader.c is interested in anyway. See attached.Just to let you know that I am still getting the error even after commit 2ff6555.
With the same scenario:
1) Start a master with 2 slaves
2) Kill/Stop slave
3) Promote slave 1, it switches to timeline 2
Log on slave 14) Reconnect slave 2 to save 1, slave 2 remains stuck in timeline 1 even if recovery_target_timeline is set to latest
LOG: selected new timeline ID: 2
Log on slave 1 at this moment:
DEBUG: received replication command: IDENTIFY_SYSTEM
DEBUG: received replication command: TIMELINE_HISTORY 2
DEBUG: received replication command: START_REPLICATION 0/5000000 TIMELINE 1
Slave 1 receives command to start replication with timeline 1, while it is sync with timeline 2.
Log on slave 2 at this moment:
LOG: restarted WAL streaming at 0/5000000 on timeline 1DETAIL: End of WAL reached on timeline 1 at 0/5014200
LOG: replication terminated by primary server
DEBUG: walreceiver ended streaming and awaits new instructions
The timeline history file is the same for both nodes:
$ cat 00000002.history
1 0/5014200 no recovery target specified
I might be wrong, but shouldn't there be also an entry for timeline 2 in this file?
Am I missing something?
Sorry, there are no problems...
I simply forgot to set up recovery_target_timeline to 'latest' in recovery.conf...
I simply forgot to set up recovery_target_timeline to 'latest' in recovery.conf...
Michael Paquier
http://michael.otacoo.com