Thread: Teaching pg_receivexlog to follow timeline switches
Now that a standby server can follow timeline switches through streaming replication, we should teach pg_receivexlog to do the same. Patch attached.

I made one change to the way the START_STREAMING command works, to better support this. When a standby server reaches the end of the timeline it's streaming from the master, it stops streaming, fetches any missing timeline history files, and parses the history file of the latest timeline to figure out where to continue. However, I don't want to parse timeline history files in pg_receivexlog. Better to keep it simple. So instead, I modified the server-side code for START_STREAMING to return the next timeline's ID at the end, and used that in pg_receivexlog. I also modified BASE_BACKUP to return not only the start XLogRecPtr, but also the corresponding timeline ID. Otherwise we might try to start streaming from the wrong timeline if you issue a BASE_BACKUP at the same moment the server switches to a new timeline.

When pg_receivexlog switches timeline, what to do with the partial file on the old timeline? When the timeline changes in the middle of a WAL segment, the segment on the old timeline is only half-filled. For example, when the timeline changes from 1 to 2, you'll have this in pg_xlog:

000000010000000000000006
000000010000000000000007
000000010000000000000008
000000020000000000000008
00000002.history

The segment 000000010000000000000008 is only half-filled, as the timeline changed in the middle of that segment. The beginning portion of that file is duplicated in 000000020000000000000008, with the timeline-changing checkpoint record right after the duplicated portion.

When we stream that with pg_receivexlog, and hit the timeline switch, we'll have this situation in the client:

000000010000000000000006
000000010000000000000007
000000010000000000000008.partial

What to do with the partial file? One option is to rename it to 000000010000000000000008. However, if you then kill pg_receivexlog before it has finished streaming a full segment from the new timeline, on restart it will try to begin streaming WAL segment 000000010000000000000009, because it sees that segment 000000010000000000000008 is already completed. That'd be wrong.

The best option seems to be to just leave the .partial file in place, so as streaming progresses, you end up with:

000000010000000000000006
000000010000000000000007
000000010000000000000008.partial
000000020000000000000008
000000020000000000000009
00000002000000000000000A.partial

It feels a bit confusing to have that old partial file there, but that seems like the most correct solution. That file is indeed partial. This also ensures that if the server running on timeline 1 continues to generate new WAL, and it fills 000000010000000000000008, we won't confuse the partial segment with that name with a full one.

- Heikki
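To make the restart problem in the message above concrete, here is a minimal Python sketch of a simplified restart rule: continue after the highest completed segment and ignore .partial files. It is a toy model of the behaviour described in the message, not the code in receivelog.c. The names parse_wal_name and resume_segment are hypothetical, and the mapping from file name to segment number assumes the default 16 MB segment size (256 segments per 8-hex-digit "log" field).

```python
import re

# Assumption: default 16 MB segments, so the last two 8-hex-digit fields
# of a file name combine into one segment number (256 segments per "log").
SEGMENTS_PER_LOG = 256

WAL_NAME = re.compile(r"^([0-9A-F]{8})([0-9A-F]{8})([0-9A-F]{8})(\.partial)?$")


def parse_wal_name(name):
    """Return (timeline, segno, is_partial), or None for non-segment files."""
    m = WAL_NAME.match(name)
    if m is None:
        return None  # e.g. 00000002.history
    tli = int(m.group(1), 16)
    segno = int(m.group(2), 16) * SEGMENTS_PER_LOG + int(m.group(3), 16)
    return tli, segno, m.group(4) is not None


def resume_segment(listing):
    """Simplified restart rule: continue after the highest completed segment.

    .partial files are ignored, so a partial segment is streamed again
    from its beginning.
    """
    completed = [p[1] for p in map(parse_wal_name, listing) if p and not p[2]]
    return max(completed) + 1 if completed else None


# Client directory right after the timeline switch, with the suffix kept.
kept = ["000000010000000000000006", "000000010000000000000007",
        "000000010000000000000008.partial"]
# The same directory if the half-filled segment had been renamed.
renamed = ["000000010000000000000006", "000000010000000000000007",
           "000000010000000000000008"]

print(resume_segment(kept))     # 8: the half-filled segment is fetched again
print(resume_segment(renamed))  # 9: the rest of segment 8 is skipped
```

Under this model, renaming the half-filled file is what moves the restart point past segment 8, which is the failure the message warns about.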
On Tue, Jan 15, 2013 at 11:05 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> What to do with the partial file? One option is to rename it to
> 000000010000000000000008. However, if you then kill pg_receivexlog before it
> has finished streaming a full segment from the new timeline, on restart it
> will try to begin streaming WAL segment 000000010000000000000009, because it
> sees that segment 000000010000000000000008 is already completed. That'd be
> wrong.

Can't we rename the .partial file safely after we receive a full segment of the WAL file with the new timeline and the same logid/segmentid?

Regards,

--
Fujii Masao
On 15.01.2013 20:22, Fujii Masao wrote:
> Can't we rename .partial file safely after we receive a full segment
> of the WAL file
> with new timeline and the same logid/segmentid?

I'd prefer to leave the .partial suffix in place, as the segment really isn't complete. It doesn't make a difference when you recover to the latest timeline, but if you have a more complicated scenario with multiple timelines that are still "alive", i.e. there's a server still actively generating WAL on that timeline, you'll easily get confused.

As an example, imagine that you have a master server, and one standby. You maintain a WAL archive for backup purposes with pg_receivexlog, connected to the standby. Now, for some reason, you get a split-brain situation and the standby server is promoted with new timeline 2, while the real master is still running. The DBA notices the problem, and kills the standby and pg_receivexlog. He deletes the XLOG files belonging to timeline 2 in pg_receivexlog's target directory, and re-points pg_receivexlog to the master while he re-builds the standby server from backup. At that point, pg_receivexlog will start streaming from the end of the zero-padded segment, not knowing that it was partial, and you have a hole in the archived WAL stream. Oops.

The DBA could avoid that by also removing the last WAL segment on timeline 1, the one that was partial. But it's really not obvious that there's anything wrong with that segment. Keeping the .partial suffix makes it clear.

- Heikki
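The split-brain example above turns on one fact: the last segment on timeline 1 sits at the same WAL position as the first segment on timeline 2. A rough Python sketch of a check that surfaces such segments before any cleanup is shown below. The helper name switch_point_segments is hypothetical, it is only a heuristic over file names, and it assumes the archive still holds the first segment streamed on the newer timeline.

```python
import re
from collections import defaultdict

# Timeline (8 hex digits), segment position (16 hex digits), optional suffix.
WAL_NAME = re.compile(r"^([0-9A-F]{8})([0-9A-F]{16})(\.partial)?$")


def switch_point_segments(listing):
    """Find segments whose position is also where a later timeline starts.

    Returns sorted (file name, later timeline) pairs. Such a file may be
    only partly filled with real WAL, whatever its name says.
    """
    by_tli = defaultdict(dict)  # timeline -> {segment position: file name}
    for name in listing:
        m = WAL_NAME.match(name)
        if m:
            by_tli[int(m.group(1), 16)][m.group(2)] = name

    suspects = []
    for tli, segs in by_tli.items():
        # Equal-length uppercase hex strings sort in numeric order.
        first = min(segs)
        for older, older_segs in by_tli.items():
            if older < tli and first in older_segs:
                suspects.append((older_segs[first], tli))
    return sorted(suspects)


# Archive in the scenario, assuming the half-filled file had been renamed.
archive = [
    "000000010000000000000006",
    "000000010000000000000007",
    "000000010000000000000008",
    "000000020000000000000008",
    "000000020000000000000009",
    "00000002.history",
]

print(switch_point_segments(archive))
# [('000000010000000000000008', 2)]
```

With the .partial suffix kept, the same file is flagged under its .partial name, so the directory listing alone already tells the operator what this check has to infer.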
On Thu, Jan 17, 2013 at 1:08 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> The DBA could avoid that by also removing the last WAL segment on timeline
> 1, the one that was partial. But it's really not obvious that there's
> anything wrong with that segment. Keeping the .partial suffix makes it
> clear.

Thanks for elaborating the reason why the .partial suffix should be kept. I agree that keeping the .partial suffix would be safer.

Regards,

--
Fujii Masao
Fujii Masao <masao.fujii@gmail.com> writes:
> Thanks for elaborating the reason why .partial suffix should be kept.
> I agree that keeping the .partial suffix would be safer.

+1 to both points. So +2 I guess :)

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
On Wed, Jan 16, 2013 at 11:08 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> I'd prefer to leave the .partial suffix in place, as the segment really
> isn't complete. It doesn't make a difference when you recover to the latest
> timeline, but if you have a more complicated scenario with multiple
> timelines that are still "alive", i.e. there's a server still actively
> generating WAL on that timeline, you'll easily get confused.
>
> As an example, imagine that you have a master server, and one standby. You
> maintain a WAL archive for backup purposes with pg_receivexlog, connected to
> the standby. Now, for some reason, you get a split-brain situation and the
> standby server is promoted with new timeline 2, while the real master is
> still running. The DBA notices the problem, and kills the standby and
> pg_receivexlog. He deletes the XLOG files belonging to timeline 2 in
> pg_receivexlog's target directory, and re-points pg_receivexlog to the master
> while he re-builds the standby server from backup. At that point,
> pg_receivexlog will start streaming from the end of the zero-padded segment,
> not knowing that it was partial, and you have a hole in the archived WAL
> stream. Oops.
>
> The DBA could avoid that by also removing the last WAL segment on timeline
> 1, the one that was partial. But it's really not obvious that there's
> anything wrong with that segment. Keeping the .partial suffix makes it
> clear.

I shudder at the idea that the DBA is manually involved in any of this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 17.01.2013 16:56, Robert Haas wrote:
> I shudder at the idea that the DBA is manually involved in any of this.

The scenario I described is that you screwed up your failover environment, and ended up with a split-brain situation by accident. The DBA certainly needs to be involved to recover from that.

- Heikki
On Thu, Jan 17, 2013 at 9:59 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> The scenario I described is that you screwed up your failover environment,
> and end up with a split-brain situation by accident. The DBA certainly needs
> to be involved to recover from that.

OK, I agree, but I still think a lot of DBAs would have no idea how to handle that situation. I agree with your proposal, don't get me wrong - I just think there's still an awful lot of room for operator error in these more complex replication scenarios. I don't have a clue how to fix that, and it's certainly not the purpose of this thread to fix that; I'm just venting.

Actually, I'm really glad to see all the work you've done to improve the way that some of these scenarios work and eliminate various bugs and other surprising failure modes over the last couple of months. It's great stuff. Alas, I think we're still some distance from being able to provide an "easy button".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
> Actually, I'm really glad to see all the work you've done to improve
> the way that some of these scenarios work and eliminate various bugs
> and other surprising failure modes over the last couple of months.
> It's great stuff.

+1

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jan 15, 2013 at 9:05 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> Now that a standby server can follow timeline switches through streaming
> replication, we should teach pg_receivexlog to do the same. Patch
> attached.

Is it possible to re-use walreceiver code from the backend?

I was thinking that it would actually be very useful to have the whole replication functionality modularized and in a standalone binary that could act as a replication proxy and WAL archiver that could run without all the overhead of an entire PG instance.
On 18.01.2013 06:38, Phil Sorber wrote:
> On Tue, Jan 15, 2013 at 9:05 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Now that a standby server can follow timeline switches through streaming
>> replication, we should teach pg_receivexlog to do the same. Patch
>> attached.
>
> Is it possible to re-use walreceiver code from the backend?
>
> I was thinking that it would actually be very useful to have the whole
> replication functionality modularized and in a standalone binary that
> could act as a replication proxy and WAL archiver that could run
> without all the overhead of an entire PG instance

There's not much sense in trying to extract that into a stand-alone module. src/bin/pg_basebackup/receivelog.c is about 1000 lines of code at the moment, and it looks quite different from the corresponding code in the backend, because it doesn't have all the backend infrastructure available.

- Heikki
On Fri, Jan 18, 2013 at 7:55 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 18.01.2013 06:38, Phil Sorber wrote:
>> Is it possible to re-use walreceiver code from the backend?
>>
>> I was thinking that it would actually be very useful to have the whole
>> replication functionality modularized and in a standalone binary that
>> could act as a replication proxy and WAL archiver that could run
>> without all the overhead of an entire PG instance
>
> There's not much sense in trying to extract that into a stand-alone module.
> src/bin/pg_basebackup/receivelog.c is about 1000 lines of code at the
> moment, and it looks quite different from the corresponding code in the
> backend, because it doesn't have all the backend infrastructure available.
>
> - Heikki

That's fair.

What do you think about the idea of a full WAL proxy? Probably not for 9.3 at this point though.
This patch was in Needs Review status, but you committed it on 2013-01-17. I have marked it as such in the CF app.
Phil Sorber <phil@omniti.com> writes:
> What do you think about the idea of a full WAL proxy? Probably not for
> 9.3 at this point though.

I was thinking that a WAL proxy nowadays is called a cascading standby with local archiving enabled. I'm not sure why you would want to trust your archiving and WAL relaying to another piece of software…

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
On 22.01.2013 15:02, Dimitri Fontaine wrote:
> Phil Sorber <phil@omniti.com> writes:
>> What do you think about the idea of a full WAL proxy? Probably not for
>> 9.3 at this point though.
>
> I was thinking that a WAL proxy nowadays is called a cascading standby
> with local archiving enabled. I'm not sure why you would want to trust
> your archiving and WAL relaying to another piece of software…

You might not want to keep a copy of the whole data directory around, as you have to in a cascading standby. I can see value in a separate WAL proxy software, especially if it's integrated into a larger backup manager program like barman or wal-e.

- Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> You might not want to keep a copy of the whole data directory around, as you
> have to in a cascading standby. I can see value in a separate WAL proxy
> software, especially if it's integrated into a larger backup manager program
> like barman or wal-e.

+1

I somehow forgot about $PGDATA here. Time for a little break I guess :)

Another idea is to have a daemon mode for pg_receivexlog where it can not only maintain a local archive but also feed it to standbys using the replication protocol, keeping track of their position.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
On Tue, Jan 22, 2013 at 8:33 AM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> You might not want to keep a copy of the whole data directory around, as you
>> have to in a cascading standby. I can see value in a separate WAL proxy
>> software, especially if it's integrated into a larger backup manager program
>> like barman or wal-e.
>
> +1
>
> I somehow forgot about $PGDATA here. Time for a little break I guess :)
>
> Another idea is to have a daemon mode pg_receivexlog where not only it
> can maintain a local archive but also feed it using the replication
> protocol to standbys, keeping track of their position.

I'm not sure if I described it well, but that's essentially what I was asking about. It would have both WAL receiving and WAL sending capability, along with its own local WAL storage, perhaps governed in size by a keep_wal_segments, and also a longer-term archive that you could have compressed but also pull from with an archive and restore command. It would also be able to act as a synchronous replication peer. I think it has already been discussed to have pg_receivexlog do that last one.

So yeah, a cascading standby without $PGDATA or hot_standby or large shared_buffers resources. It seems like maybe we could add through subtraction. Add a parameter that disables WAL replay? I'm sure there'd be more things it would have to disable, but then it's not two separate binaries.
On 01/22/2013 06:43 AM, Noah Misch wrote:
> This patch was in Needs Review status, but you committed it on 2013-01-17. I
> have marked it as such in the CF app.

Thank you. There's a lot to keep up with :S

--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services