Thread: Teaching pg_receivexlog to follow timeline switches

Teaching pg_receivexlog to follow timeline switches

From

Heikki Linnakangas

Date:

15 January 2013, 17:05:57

Now that a standby server can follow timeline switches through streaming
replication, we should teach pg_receivexlog to do the same. Patch
attached.

I made one change to the way START_STREAMING command works, to better
support this. When a standby server reaches the end of the timeline it's
streaming from the master, it stops streaming, fetches any missing timeline
history files, and parses the history file of the latest timeline to
figure out where to continue. However, I don't want to parse timeline
history files in pg_receivexlog. Better to keep it simple. So instead, I
modified the server-side code for START_STREAMING to return the next
timeline's ID at the end, and used that in pg_receivexlog. I also
modified BASE_BACKUP to return not only the start XLogRecPtr, but also
the corresponding timeline ID. Otherwise we might try to start streaming
from wrong timeline if you issue a BASE_BACKUP at the same moment the
server switches to a new timeline.
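
The handshake described above can be sketched as a small simulation. This is only an illustration under stated assumptions: FakeServer, TimelineEnd and follow_timelines are hypothetical names, not anything from the patch or from libpq, and the sketch assumes the client continues on the next timeline from the position where the previous stream ended. (The thread calls the command START_STREAMING; the replication protocol documentation names it START_REPLICATION.)

```python
# Hypothetical model of the handshake; none of these names come from the patch.
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class TimelineEnd:
    end_pos: int               # WAL position where the stream on this timeline stopped
    next_tli: Optional[int]    # timeline to continue on, or None if this one is still live


class FakeServer:
    """Stands in for the server side; maps timeline ID -> how its stream ends."""

    def __init__(self, timelines: Dict[int, TimelineEnd]):
        self.timelines = timelines

    def stream(self, tli: int, start_pos: int) -> TimelineEnd:
        # A real server would send WAL data here. The sketch only reports
        # where the stream ended and which timeline comes next.
        return self.timelines[tli]


def follow_timelines(server: FakeServer, tli: int, pos: int) -> List[Tuple[int, int, int]]:
    """Return (timeline, start, end) for each stretch of WAL received."""
    received = []
    while True:
        result = server.stream(tli, pos)
        received.append((tli, pos, result.end_pos))
        if result.next_tli is None:
            return received    # live timeline; a real client would keep streaming
        # The server named the next timeline, so no history file is parsed here.
        tli, pos = result.next_tli, result.end_pos


server = FakeServer({
    1: TimelineEnd(end_pos=0x08800000, next_tli=2),
    2: TimelineEnd(end_pos=0x0A100000, next_tli=None),
})
stretches = follow_timelines(server, tli=1, pos=0x06000000)
```

The point of the design is visible in the loop: the client never needs to know how timelines relate to each other, it only follows the ID the server hands back.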

When pg_receivexlog switches timeline, what to do with the partial file
on the old timeline? When the timeline changes in the middle of a WAL
segment, the segment on the old timeline is only half-filled. For
example, when timeline changes from 1 to 2, you'll have this in pg_xlog:

000000010000000000000006
000000010000000000000007
000000010000000000000008
000000020000000000000008
00000002.history

The segment 000000010000000000000008 is only half-filled, as the
timeline changed in the middle of that segment. The beginning portion of
that file is duplicated in 000000020000000000000008, with the
timeline-changing checkpoint record right after the duplicated portion.
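
To make the file names in the listing easier to read, here is a minimal sketch that splits a WAL file name into its timeline ID and segment number. It assumes the default 16 MB segment size (256 segments per 8-hex-digit "log" value); parse_wal_name and WalFile are names invented for this illustration.

```python
import re
from typing import NamedTuple, Optional

# Assumption: default 16 MB segments, which gives 256 segments per log ID.
SEGMENTS_PER_LOGID = 0x100000000 // (16 * 1024 * 1024)

# Three groups of 8 hex digits, optionally followed by ".partial".
WAL_NAME = re.compile(r"^([0-9A-F]{8})([0-9A-F]{8})([0-9A-F]{8})(\.partial)?$")


class WalFile(NamedTuple):
    timeline: int
    segno: int
    partial: bool


def parse_wal_name(name: str) -> Optional[WalFile]:
    """Return the parsed name, or None for anything that is not a WAL segment."""
    m = WAL_NAME.match(name)
    if m is None:
        return None
    tli, log, seg = (int(g, 16) for g in m.group(1, 2, 3))
    return WalFile(tli, log * SEGMENTS_PER_LOGID + seg, m.group(4) is not None)


old = parse_wal_name("000000010000000000000008")
new = parse_wal_name("000000020000000000000008")
```

Here old and new carry the same segment number but different timelines, which is exactly the pair of files that overlap at the switch point. The history file 00000002.history is not a segment, so the function returns None for it.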

When we stream that with pg_receivexlog, and hit the timeline switch,
we'll have this situation in the client:

000000010000000000000006
000000010000000000000007
000000010000000000000008.partial

What to do with the partial file? One option is to rename it to
000000010000000000000008. However, if you then kill pg_receivexlog
before it has finished streaming a full segment from the new timeline,
on restart it will try to begin streaming WAL segment
000000010000000000000009, because it sees that segment
000000010000000000000008 is already completed. That'd be wrong.
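
The restart problem can be illustrated with a simplified model of how a client might pick its resume point from the files in its target directory. This is a sketch of the behaviour described above, not the code in pg_receivexlog; restart_point is a hypothetical name and the default 16 MB segment size is assumed.

```python
from typing import Iterable, Optional, Tuple


def restart_point(filenames: Iterable[str]) -> Optional[Tuple[int, int]]:
    """Return (timeline, segment number) to resume from, or None if no
    completed segment exists in the directory."""
    best = None
    for name in filenames:
        # Only completed segments count: exactly 24 hex digits, no suffix.
        if len(name) != 24:
            continue
        try:
            tli, log, seg = (int(name[i:i + 8], 16) for i in (0, 8, 16))
        except ValueError:
            continue
        segno = log * 256 + seg           # assumes default 16 MB segments
        if best is None or (segno, tli) > (best[1], best[0]):
            best = (tli, segno)
    if best is None:
        return None
    return best[0], best[1] + 1           # resume after the last complete segment


# Partial file renamed as if it were complete: resume skips ahead to segment 9.
renamed = restart_point([
    "000000010000000000000006",
    "000000010000000000000007",
    "000000010000000000000008",
])

# Partial file left alone: resume starts again at segment 8.
kept = restart_point([
    "000000010000000000000006",
    "000000010000000000000007",
    "000000010000000000000008.partial",
])
```

With the rename, the model resumes at segment 9 on timeline 1; with the .partial suffix kept, it resumes at segment 8, so the half-received segment is fetched again.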

The best option seems to be to just leave the .partial file in place, so
as streaming progresses, you end up with:

000000010000000000000006
000000010000000000000007
000000010000000000000008.partial
000000020000000000000008
000000020000000000000009
00000002000000000000000A.partial

It feels a bit confusing to have that old partial file there, but that
seems like the most correct solution. That file is indeed partial. This
also ensures that if the server running on timeline 1 continues to
generate new WAL, and it fills 000000010000000000000008, we won't
confuse the partial segment with that name with a full one.

- Heikki

Attachment

teach-receivexlog-to-switch-timelines-1.patch

Re: Teaching pg_receivexlog to follow timeline switches

From

Fujii Masao

Date:

15 January 2013, 21:22:07

On Tue, Jan 15, 2013 at 11:05 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Now that a standby server can follow timeline switches through streaming
> replication, we should do teach pg_receivexlog to do the same. Patch
> attached.
>
> I made one change to the way START_STREAMING command works, to better
> support this. When a standby server reaches the timeline it's streaming from
> the master, it stops streaming, fetches any missing timeline history files,
> and parses the history file of the latest timeline to figure out where to
> continue. However, I don't want to parse timeline history files in
> pg_receivexlog. Better to keep it simple. So instead, I modified the
> server-side code for START_STREAMING to return the next timeline's ID at the
> end, and used that in pg_receivexlog. I also modifed BASE_BACKUP to return
> not only the start XLogRecPtr, but also the corresponding timeline ID.
> Otherwise we might try to start streaming from wrong timeline if you issue a
> BASE_BACKUP at the same moment the server switches to a new timeline.
>
> When pg_receivexlog switches timeline, what to do with the partial file on
> the old timeline? When the timeline changes in the middle of a WAL segment,
> the segment old the old timeline is only half-filled. For example, when
> timeline changes from 1 to 2, you'll have this in pg_xlog:
>
> 000000010000000000000006
> 000000010000000000000007
> 000000010000000000000008
> 000000020000000000000008
> 00000002.history
>
> The segment 000000010000000000000008 is only half-filled, as the timeline
> changed in the middle of that segment. The beginning portion of that file is
> duplicated in 000000020000000000000008, with the timeline-changing
> checkpoint record right after the duplicated portion.
>
> When we stream that with pg_receivexlog, and hit the timeline switch, we'll
> have this situation in the client:
>
> 000000010000000000000006
> 000000010000000000000007
> 000000010000000000000008.partial
>
> What to do with the partial file? One option is to rename it to
> 000000010000000000000008. However, if you then kill pg_receivexlog before it
> has finished streaming a full segment from the new timeline, on restart it
> will try to begin streaming WAL segment 000000010000000000000009, because it
> sees that segment 000000010000000000000008 is already completed. That'd be
> wrong.

Can't we safely rename the .partial file after we receive a full segment
of the WAL file with the new timeline and the same logid/segmentid?

Regards,

-- 
Fujii Masao

Re: Teaching pg_receivexlog to follow timeline switches

From

Heikki Linnakangas

Date:

16 January 2013, 19:08:25

On 15.01.2013 20:22, Fujii Masao wrote:
> On Tue, Jan 15, 2013 at 11:05 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com>  wrote:
>> Now that a standby server can follow timeline switches through streaming
>> replication, we should do teach pg_receivexlog to do the same. Patch
>> attached.
>>
>> I made one change to the way START_STREAMING command works, to better
>> support this. When a standby server reaches the timeline it's streaming from
>> the master, it stops streaming, fetches any missing timeline history files,
>> and parses the history file of the latest timeline to figure out where to
>> continue. However, I don't want to parse timeline history files in
>> pg_receivexlog. Better to keep it simple. So instead, I modified the
>> server-side code for START_STREAMING to return the next timeline's ID at the
>> end, and used that in pg_receivexlog. I also modifed BASE_BACKUP to return
>> not only the start XLogRecPtr, but also the corresponding timeline ID.
>> Otherwise we might try to start streaming from wrong timeline if you issue a
>> BASE_BACKUP at the same moment the server switches to a new timeline.
>>
>> When pg_receivexlog switches timeline, what to do with the partial file on
>> the old timeline? When the timeline changes in the middle of a WAL segment,
>> the segment old the old timeline is only half-filled. For example, when
>> timeline changes from 1 to 2, you'll have this in pg_xlog:
>>
>> 000000010000000000000006
>> 000000010000000000000007
>> 000000010000000000000008
>> 000000020000000000000008
>> 00000002.history
>>
>> The segment 000000010000000000000008 is only half-filled, as the timeline
>> changed in the middle of that segment. The beginning portion of that file is
>> duplicated in 000000020000000000000008, with the timeline-changing
>> checkpoint record right after the duplicated portion.
>>
>> When we stream that with pg_receivexlog, and hit the timeline switch, we'll
>> have this situation in the client:
>>
>> 000000010000000000000006
>> 000000010000000000000007
>> 000000010000000000000008.partial
>>
>> What to do with the partial file? One option is to rename it to
>> 000000010000000000000008. However, if you then kill pg_receivexlog before it
>> has finished streaming a full segment from the new timeline, on restart it
>> will try to begin streaming WAL segment 000000010000000000000009, because it
>> sees that segment 000000010000000000000008 is already completed. That'd be
>> wrong.
>
> Can't we rename .partial file safely after we receive a full segment
> of the WAL file
> with new timeline and the same logid/segmentid?

I'd prefer to leave the .partial suffix in place, as the segment really 
isn't complete. It doesn't make a difference when you recover to the 
latest timeline, but if you have a more complicated scenario with 
multiple timelines that are still "alive", ie. there's a server still 
actively generating WAL on that timeline, you'll easily get confused.

As an example, imagine that you have a master server, and one standby. 
You maintain a WAL archive for backup purposes with pg_receivexlog, 
connected to the standby. Now, for some reason, you get a split-brain 
situation and the standby server is promoted with new timeline 2, while 
the real master is still running. The DBA notices the problem, and kills 
the standby and pg_receivexlog. He deletes the XLOG files belonging to 
timeline 2 in pg_receivexlog's target directory, and re-points 
pg_receivexlog to the master while he re-builds the standby server from
backup. At that point, pg_receivexlog will start streaming from the end 
of the zero-padded segment, not knowing that it was partial, and you 
have a hole in the archived WAL stream. Oops.
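
The size of that hole can be worked out with a little arithmetic. The sketch below assumes the default 16 MB segment size and treats WAL positions as plain byte offsets; archive_hole is a hypothetical helper written for this illustration only.

```python
SEG_SIZE = 16 * 1024 * 1024   # assumption: default WAL segment size


def archive_hole(switch_pos: int, treated_as_complete: bool) -> range:
    """WAL positions on the old timeline that never reach the archive.

    switch_pos is the position at which the archive stopped receiving
    the old timeline (where the standby was promoted).
    """
    seg_start = switch_pos - switch_pos % SEG_SIZE
    if treated_as_complete:
        resume = seg_start + SEG_SIZE   # streaming restarts after the "complete" segment
    else:
        resume = seg_start              # the partial segment is streamed again from its start
    # Everything between the switch point and the resume point is missing.
    return range(switch_pos, max(switch_pos, resume))


# Promotion happened halfway through segment 8.
switch = 8 * SEG_SIZE + SEG_SIZE // 2
hole_if_renamed = archive_hole(switch, treated_as_complete=True)
hole_if_partial = archive_hole(switch, treated_as_complete=False)
```

In this example the renamed segment leaves 8 MB of timeline 1 WAL unarchived, while keeping the .partial suffix leaves no hole because the segment is received again in full.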

The DBA could avoid that by also removing the last WAL segment on 
timeline 1, the one that was partial. But it's really not obvious that 
there's anything wrong with that segment. Keeping the .partial suffix 
makes it clear.

- Heikki

Re: Teaching pg_receivexlog to follow timeline switches

From

Fujii Masao

Date:

16 January 2013, 20:06:54

On Thu, Jan 17, 2013 at 1:08 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 15.01.2013 20:22, Fujii Masao wrote:
>>
>> On Tue, Jan 15, 2013 at 11:05 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com>  wrote:
>>>
>>> Now that a standby server can follow timeline switches through streaming
>>> replication, we should do teach pg_receivexlog to do the same. Patch
>>> attached.
>>>
>>> I made one change to the way START_STREAMING command works, to better
>>> support this. When a standby server reaches the timeline it's streaming
>>> from
>>> the master, it stops streaming, fetches any missing timeline history
>>> files,
>>> and parses the history file of the latest timeline to figure out where to
>>> continue. However, I don't want to parse timeline history files in
>>> pg_receivexlog. Better to keep it simple. So instead, I modified the
>>> server-side code for START_STREAMING to return the next timeline's ID at
>>> the
>>> end, and used that in pg_receivexlog. I also modifed BASE_BACKUP to
>>> return
>>> not only the start XLogRecPtr, but also the corresponding timeline ID.
>>> Otherwise we might try to start streaming from wrong timeline if you
>>> issue a
>>> BASE_BACKUP at the same moment the server switches to a new timeline.
>>>
>>> When pg_receivexlog switches timeline, what to do with the partial file
>>> on
>>> the old timeline? When the timeline changes in the middle of a WAL
>>> segment,
>>> the segment old the old timeline is only half-filled. For example, when
>>> timeline changes from 1 to 2, you'll have this in pg_xlog:
>>>
>>> 000000010000000000000006
>>> 000000010000000000000007
>>> 000000010000000000000008
>>> 000000020000000000000008
>>> 00000002.history
>>>
>>> The segment 000000010000000000000008 is only half-filled, as the timeline
>>> changed in the middle of that segment. The beginning portion of that file
>>> is
>>> duplicated in 000000020000000000000008, with the timeline-changing
>>> checkpoint record right after the duplicated portion.
>>>
>>> When we stream that with pg_receivexlog, and hit the timeline switch,
>>> we'll
>>> have this situation in the client:
>>>
>>> 000000010000000000000006
>>> 000000010000000000000007
>>> 000000010000000000000008.partial
>>>
>>> What to do with the partial file? One option is to rename it to
>>> 000000010000000000000008. However, if you then kill pg_receivexlog before
>>> it
>>> has finished streaming a full segment from the new timeline, on restart
>>> it
>>> will try to begin streaming WAL segment 000000010000000000000009, because
>>> it
>>> sees that segment 000000010000000000000008 is already completed. That'd
>>> be
>>> wrong.
>>
>>
>> Can't we rename .partial file safely after we receive a full segment
>> of the WAL file
>> with new timeline and the same logid/segmentid?
>
>
> I'd prefer to leave the .partial suffix in place, as the segment really
> isn't complete. It doesn't make a difference when you recover to the latest
> timeline, but if you have a more complicated scenario with multiple
> timelines that are still "alive", ie. there's a server still actively
> generating WAL on that timeline, you'll easily get confused.
>
> As an example, imagine that you have a master server, and one standby. You
> maintain a WAL archive for backup purposes with pg_receivexlog, connected to
> the standby. Now, for some reason, you get a split-brain situation and the
> standby server is promoted with new timeline 2, while the real master is
> still running. The DBA notices the problem, and kills the standby and
> pg_receivexlog. He deletes the XLOG files belonging to timeline 2 in
> pg_receivexlog's target directory, and re-points pg_recevexlog to the master
> while he re-builds the standby server from backup. At that point,
> pg_receivexlog will start streaming from the end of the zero-padded segment,
> not knowing that it was partial, and you have a hole in the archived WAL
> stream. Oops.
>
> The DBA could avoid that by also removing the last WAL segment on timeline
> 1, the one that was partial. But it's really not obvious that there's
> anything wrong with that segment. Keeping the .partial suffix makes it
> clear.

Thanks for elaborating the reason why .partial suffix should be kept.
I agree that keeping the .partial suffix would be safer.

Regards,

-- 
Fujii Masao

Re: Teaching pg_receivexlog to follow timeline switches

From

Dimitri Fontaine

Date:

17 January 2013, 01:28:54

Fujii Masao <masao.fujii@gmail.com> writes:
> Thanks for elaborating the reason why .partial suffix should be kept.
> I agree that keeping the .partial suffix would be safer.

+1 to both points.  So +2 I guess :)

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support

Re: Teaching pg_receivexlog to follow timeline switches

From

Robert Haas

Date:

17 January 2013, 17:56:55

On Wed, Jan 16, 2013 at 11:08 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> I'd prefer to leave the .partial suffix in place, as the segment really
> isn't complete. It doesn't make a difference when you recover to the latest
> timeline, but if you have a more complicated scenario with multiple
> timelines that are still "alive", ie. there's a server still actively
> generating WAL on that timeline, you'll easily get confused.
>
> As an example, imagine that you have a master server, and one standby. You
> maintain a WAL archive for backup purposes with pg_receivexlog, connected to
> the standby. Now, for some reason, you get a split-brain situation and the
> standby server is promoted with new timeline 2, while the real master is
> still running. The DBA notices the problem, and kills the standby and
> pg_receivexlog. He deletes the XLOG files belonging to timeline 2 in
> pg_receivexlog's target directory, and re-points pg_recevexlog to the master
> while he re-builds the standby server from backup. At that point,
> pg_receivexlog will start streaming from the end of the zero-padded segment,
> not knowing that it was partial, and you have a hole in the archived WAL
> stream. Oops.
>
> The DBA could avoid that by also removing the last WAL segment on timeline
> 1, the one that was partial. But it's really not obvious that there's
> anything wrong with that segment. Keeping the .partial suffix makes it
> clear.

I shudder at the idea that the DBA is manually involved in any of this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Teaching pg_receivexlog to follow timeline switches

From

Heikki Linnakangas

Date:

17 January 2013, 17:59:16

On 17.01.2013 16:56, Robert Haas wrote:
> On Wed, Jan 16, 2013 at 11:08 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com>  wrote:
>> I'd prefer to leave the .partial suffix in place, as the segment really
>> isn't complete. It doesn't make a difference when you recover to the latest
>> timeline, but if you have a more complicated scenario with multiple
>> timelines that are still "alive", ie. there's a server still actively
>> generating WAL on that timeline, you'll easily get confused.
>>
>> As an example, imagine that you have a master server, and one standby. You
>> maintain a WAL archive for backup purposes with pg_receivexlog, connected to
>> the standby. Now, for some reason, you get a split-brain situation and the
>> standby server is promoted with new timeline 2, while the real master is
>> still running. The DBA notices the problem, and kills the standby and
>> pg_receivexlog. He deletes the XLOG files belonging to timeline 2 in
>> pg_receivexlog's target directory, and re-points pg_recevexlog to the master
>> while he re-builds the standby server from backup. At that point,
>> pg_receivexlog will start streaming from the end of the zero-padded segment,
>> not knowing that it was partial, and you have a hole in the archived WAL
>> stream. Oops.
>>
>> The DBA could avoid that by also removing the last WAL segment on timeline
>> 1, the one that was partial. But it's really not obvious that there's
>> anything wrong with that segment. Keeping the .partial suffix makes it
>> clear.
>
> I shudder at the idea that the DBA is manually involved in any of this.

The scenario I described is that you screwed up your failover 
environment, and end up with a split-brain situation by accident. The 
DBA certainly needs to be involved to recover from that.

- Heikki

Re: Teaching pg_receivexlog to follow timeline switches

From

Robert Haas

Date:

17 January 2013, 18:12:14

On Thu, Jan 17, 2013 at 9:59 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> The scenario I described is that you screwed up your failover environment,
> and end up with a split-brain situation by accident. The DBA certainly needs
> to be involved to recover from that.

OK, I agree, but I still think a lot of DBAs would have no idea how to
handle that situation.  I agree with your proposal, don't get me wrong
- I just think there's still an awful lot of room for operator error
in these more complex replication scenarios.  I don't have a clue how
to fix that, and it's certainly not the purpose of this thread to fix
that; I'm just venting.

Actually, I'm really glad to see all the work you've done to improve
the way that some of these scenarios work and eliminate various bugs
and other surprising failure modes over the last couple of months.
It's great stuff.  Alas, I think we're still some distance from being
able to provide an "easy button".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Teaching pg_receivexlog to follow timeline switches

From

Alvaro Herrera

Date:

17 January 2013, 22:45:16

Robert Haas wrote:

> Actually, I'm really glad to see all the work you've done to improve
> the way that some of these scenarios work and eliminate various bugs
> and other surprising failure modes over the last couple of months.
> It's great stuff.

+1

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Teaching pg_receivexlog to follow timeline switches

From

Phil Sorber

Date:

18 January 2013, 07:38:51

On Tue, Jan 15, 2013 at 9:05 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Now that a standby server can follow timeline switches through streaming
> replication, we should do teach pg_receivexlog to do the same. Patch
> attached.

Is it possible to re-use walreceiver code from the backend?

I was thinking that it would actually be very useful to have the whole
replication functionality modularized and in a standalone binary that
could act as a replication proxy and WAL archiver that could run
without all the overhead of an entire PG instance.

Re: Teaching pg_receivexlog to follow timeline switches

From

Heikki Linnakangas

Date:

18 January 2013, 15:55:16

On 18.01.2013 06:38, Phil Sorber wrote:
> On Tue, Jan 15, 2013 at 9:05 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com>  wrote:
>> Now that a standby server can follow timeline switches through streaming
>> replication, we should do teach pg_receivexlog to do the same. Patch
>> attached.
>
> Is it possible to re-use walreceiver code from the backend?
>
> I was thinking that it would actually be very useful to have the whole
> replication functionality modularized and in a standalone binary that
> could act as a replication proxy and WAL archiver that could run
> without all the overhead of an entire PG instance

There's much sense in trying to extract that into a stand-alone module.
src/bin/pg_basebackup/receivelog.c is about 1000 lines of code at the 
moment, and it looks quite different from the corresponding code in the 
backend, because it doesn't have all the backend infrastructure available.

- Heikki

Re: Teaching pg_receivexlog to follow timeline switches

From

Phil Sorber

Date:

21 January 2013, 18:58:29

On Fri, Jan 18, 2013 at 7:55 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 18.01.2013 06:38, Phil Sorber wrote:
>> Is it possible to re-use walreceiver code from the backend?
>>
>> I was thinking that it would actually be very useful to have the whole
>> replication functionality modularized and in a standalone binary that
>> could act as a replication proxy and WAL archiver that could run
>> without all the overhead of an entire PG instance
>
>
> There's much sense in trying to extract that into a stand-along module.
> src/bin/pg_basebackup/receivelog.c is about 1000 lines of code at the
> moment, and it looks quite different from the corresponding code in the
> backend, because it doesn't have all the backend infrastructure available.
>
> - Heikki

That's fair.

What do you think about the idea of a full WAL proxy? Probably not for
9.3 at this point though.

Re: Teaching pg_receivexlog to follow timeline switches

From

Noah Misch

Date:

22 January 2013, 01:43:22

This patch was in Needs Review status, but you committed it on 2013-01-17.  I
have marked it as such in the CF app.

Re: Teaching pg_receivexlog to follow timeline switches

From

Dimitri Fontaine

Date:

22 January 2013, 16:03:18

Phil Sorber <phil@omniti.com> writes:
> What do you think about the idea of a full WAL proxy? Probably not for
> 9.3 at this point though.

I was thinking that a WAL proxy nowadays is called a cascading standby
with local archiving enabled. I'm not sure why you would want to trust
your archiving and WAL relaying to another piece of software…

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support

Re: Teaching pg_receivexlog to follow timeline switches

From

Heikki Linnakangas

Date:

22 January 2013, 16:10:17

On 22.01.2013 15:02, Dimitri Fontaine wrote:
> Phil Sorber<phil@omniti.com>  writes:
>> What do you think about the idea of a full WAL proxy? Probably not for
>> 9.3 at this point though.
>
> I was thinking that a WAL proxy nowadays is called a cascading standby
> with local archiving enabled. I'm not sure why you would want to trust
> your archiving and WAL relaying to another piece of software…

You might not want to keep a copy of the whole data directory around, as
you have to in a cascading standby. I can see value in a separate WAL
proxy software, especially if it's integrated into a larger backup
manager program like barman or wal-e.

- Heikki

Re: Teaching pg_receivexlog to follow timeline switches

From

Dimitri Fontaine

Date:

22 January 2013, 16:33:39

Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> You might not want to keep a copy of the whole data directory around, as you
> have to in a cascading standby. I can see value in a separate WAL proxy
> software, especially if it's integrated into a larger backup manager program
> like barman or wal-e.

+1

I somehow forgot about $PGDATA here. Time for a little break I guess :)

Another idea is to have a daemon mode for pg_receivexlog, where it can
not only maintain a local archive but also feed it to standbys using
the replication protocol, keeping track of their positions.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support

Re: Teaching pg_receivexlog to follow timeline switches

From

Phil Sorber

Date:

22 January 2013, 17:13:58

On Tue, Jan 22, 2013 at 8:33 AM, Dimitri Fontaine
<dimitri@2ndquadrant.fr> wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> You might not want to keep a copy of the whole data directory around, as you
>> have to in a cascading standby. I can see value in a separate WAL proxy
>> software, especially if it's integrated into a larger backup manager program
>> like barman or wal-e.
>
> +1
>
> I somehow forgot about $PGDATA here. Time for a little break I guess :)
>
> Another idea is to have a daemon mode pg_receivexlog where not only it
> can maintain a local archive but also feed it using the replication
> protocol to standbies, keeping track of their position.

I'm not sure if I described it well, but that's essentially what I was
asking about. It would have both WAL receiving and WAL sending
capability, along with its own local WAL storage, perhaps governed in
size by a keep_wal_segments setting, and also a longer-term archive that
you could keep compressed but also pull from with an archive and restore
command. It could also act as a synchronous replication peer. I
think it has already been discussed to have pg_receivexlog do that
last one.

So yeah, a cascading standby without $PGDATA or hot_standby or large
shared_buffers resources. It seems like maybe we could add through
subtraction. Add a parameter that disables wal replay? I'm sure
there'd be more things it would have to disable, but then it's not two
separate binaries.

>
> Regards,
> --
> Dimitri Fontaine
> http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support

Re: Teaching pg_receivexlog to follow timeline switches

From

Craig Ringer

Date:

24 January 2013, 08:42:49

On 01/22/2013 06:43 AM, Noah Misch wrote:
> This patch was in Needs Review status, but you committed it on 2013-01-17.  I
> have marked it as such in the CF app.
Thank you. There's a lot to keep up with :S

--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services