Thread: Change pg_last_xlog_receive_location not to move backwards
Hi, In the case where there are multiple standbys, when a failover happens we usually determine the most advanced standby by using pg_last_xlog_receive_location or pg_last_xlog_replay_location, and promote it to the new master. The problem is that neither function necessarily returns the true last location of WAL available in the standby, so we cannot use them to check which standby is furthest ahead. Since pg_last_xlog_receive_location moves backwards when the standby attempts to reconnect to the primary, it might fall behind the last location of WAL actually available. OTOH, pg_last_xlog_replay_location might also fall behind because of Hot Standby query conflicts. So I'm thinking to change pg_last_xlog_receive_location not to move backwards. There is no need to move it backwards when the standby reconnects to the primary, so we can do that. BTW, the related discussion was done before: http://archives.postgresql.org/pgsql-hackers/2010-06/msg00576.php Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
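[Editorial illustration] The locations these functions return are text of the form "X/Y", so a failover script has to parse and compare them numerically to pick the most advanced standby. A minimal standalone sketch of that comparison, with made-up values and names (purely illustrative, not part of any patch):
----------------------
#include <stdint.h>
#include <stdio.h>

/* Parse a WAL location of the form "X/Y", as returned by
 * pg_last_xlog_receive_location(), into a single 64-bit number.
 * Error handling is omitted for brevity. */
static uint64_t
parse_location(const char *text)
{
    unsigned int hi = 0, lo = 0;

    sscanf(text, "%X/%X", &hi, &lo);
    return ((uint64_t) hi << 32) | lo;
}

int
main(void)
{
    /* Hypothetical values reported by two standbys. */
    const char *standby_a = "0/3000020";
    const char *standby_b = "0/2FE0058";

    printf("promote standby %s\n",
           parse_location(standby_a) >= parse_location(standby_b) ? "A" : "B");
    return 0;
}
----------------------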
On Thu, Jan 13, 2011 at 4:24 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > So I'm thinking to change pg_last_xlog_receive_location not to > move backwards. The attached patch does that. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, Jan 13, 2011 at 2:40 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Thu, Jan 13, 2011 at 4:24 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> So I'm thinking to change pg_last_xlog_receive_location not to >> move backwards. > > The attached patch does that. It looks to me like this is changing more than just the return value of pg_last_xlog_receive_location. receivedUpto isn't only used for that one function, and there's no explanation in your email or in the patch of why the new behavior is correct and/or better for the other places where it's used. This email from Heikki seems to indicate that there's a reason for the current behavior: http://archives.postgresql.org/pgsql-hackers/2010-06/msg00586.php -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jan 25, 2011 at 4:25 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jan 13, 2011 at 2:40 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Thu, Jan 13, 2011 at 4:24 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> So I'm thinking to change pg_last_xlog_receive_location not to >>> move backwards. >> >> The attached patch does that. > > It looks to me like this is changing more than just the return value > of pg_last_xlog_receive_location. receivedUpto isn't only used for > that one function, and there's no explanation in your email or in the > patch of why the new behavior is correct and/or better for the other > places where it's used. > > This email from Heikki seems to indicate that there's a reason for the > current behavior: > > http://archives.postgresql.org/pgsql-hackers/2010-06/msg00586.php Yes, so I didn't change that behavior, i.e., even with the patch, SR still always starts from the head of the WAL segment, not the middle of that. What I changed is only the return value of pg_last_xlog_receive_location. Am I missing something? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Mon, Jan 24, 2011 at 11:25 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jan 13, 2011 at 2:40 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Thu, Jan 13, 2011 at 4:24 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> So I'm thinking to change pg_last_xlog_receive_location not to >>> move backwards. >> >> The attached patch does that. > > It looks to me like this is changing more than just the return value > of pg_last_xlog_receive_location. receivedUpto isn't only used for > that one function, and there's no explanation in your email or in the > patch of why the new behavior is correct and/or better for the other > places where it's used. I believe the new walrcv->receiveStart was introduced to divide up those behaviors that should go backwards from those that should not. The non-retreating value is used in 3 places (via GetWalRcvWriteRecPtr) in xlog.c. (And some other places I haven't examined yet.) One place is to decide whether to remove/recycle XLog files, and I think the non-retreating value is at least as suitable for this usage. One is to feed the pg_last_xlog_receive_location() function. The third seems more problematic. In XLogPageRead, it checks to see if more records have been received beyond what has been applied. By using the non-retreating value here, it seems like the xlog replay could start replaying records that the wal receiver is in the process of overwriting. Now, I've argued to myself that this is not a problem, because the receiver is overwriting them with identical data to that which is already there. But by that logic, why does any part of it (walrcv->receiveStart in the patch, walrcv->receivedUpto unpatched) need to retreat? From the previous discussion, I understand that the concern is that we don't want to retrieve and write out partial xlog files. What I don't understand is how we could get ourselves into the position in which we are doing that, other than by someone going in and doing impermissible things to the PGDATA directory behind our backs. I've been spinning my wheels on that part for a while. Since I don't understand what we are defending against in the original code, I can't evaluate if the patch maintains that defense. Cheers, Jeff
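[Editorial illustration] To make the split Jeff describes concrete, here is a small self-contained sketch of the idea: a streaming-start position that can rewind to a segment boundary on reconnect, alongside a received position that only ever advances. All names, types, and values below are invented for illustration and are not taken from the patch:
----------------------
#include <stdint.h>
#include <stdio.h>

#define SEG_SIZE ((uint64_t) 16 * 1024 * 1024)  /* default 16 MB WAL segment */

/* Invented stand-ins for the walreceiver's bookkeeping. */
typedef struct
{
    uint64_t receive_start;  /* where streaming (re)starts; may rewind         */
    uint64_t received_upto;  /* high-water mark of received WAL; never rewinds */
} recv_state;

/* On reconnect, restart streaming at the boundary of the current segment so
 * the new file's header can be validated, but leave received_upto alone. */
static void
restart_streaming(recv_state *st)
{
    st->receive_start = st->received_upto - (st->received_upto % SEG_SIZE);
}

/* As WAL is received and flushed, only ever move received_upto forwards. */
static void
advance(recv_state *st, uint64_t flushed_to)
{
    if (flushed_to > st->received_upto)
        st->received_upto = flushed_to;
}

int
main(void)
{
    recv_state st = { 0, 3 * SEG_SIZE + 1234 };

    restart_streaming(&st);
    advance(&st, 3 * SEG_SIZE + 400);   /* re-streamed data: no move backwards */
    printf("restart streaming at %llu, but still report %llu\n",
           (unsigned long long) st.receive_start,
           (unsigned long long) st.received_upto);
    return 0;
}
----------------------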
Thanks for the review! On Wed, Jan 26, 2011 at 3:40 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > I believe the new walrcv->receiveStart was introduced to divide up those > behaviors that should go backwards from those that should not. Yes. > The non-retreating value is used in 3 places (via GetWalRcvWriteRecPtr) > in xlog.c. (And some other places I haven't examined yet.) Yes. > The third seems more problematic. In the XLogPageRead, > it checks to see if more records have been received beyond what > has been applied. By using the non-retreating value here, it seems > like the xlog replay could start replaying records that the wal > receiver is in the process of overwriting. Now, I've argued to myself > that this is not a problem, because the receiver is overwriting them > with identical data to that which is already there. Yes. I don't think that it's a problem, either. > But by that logic, why does any part of it (walrcv->receiveStart in > the patch, walrcv->receivedUpto unpatched) need to retreat? From > the previous discussion, I understand that the concern is that we don't > want to retrieve and write out partial xlog files. What I don't understand > is how we could get our selves into the position in which we are doing > that, other than by someone going in and doing impermissible things to > the PGDATA directory behind our backs. That logic exists because we'd like to check that newly-received WAL data is consistent with previous one by validating the header of new WAL file. So since we need the header of new WAL file, we retreat the replication starting location to the beginning of the WAL file when reconnecting to the primary. The following code (in XLogPageRead) validates the header of the new WAL file.

----------------------
if (switched_segment && targetPageOff != 0)
{
    /*
     * Whenever switching to a new WAL segment, we read the first page of
     * the file and validate its header, even if that's not where the
     * target record is. This is so that we can check the additional
     * identification info that is present in the first page's "long"
     * header.
     */
    readOff = 0;
    if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
    {
        ereport(emode_for_corrupt_record(emode, *RecPtr),
                (errcode_for_file_access(),
                 errmsg("could not read from log file %u, segment %u, offset %u: %m",
                        readId, readSeg, readOff)));
        goto next_record_is_invalid;
    }
    if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode))
        goto next_record_is_invalid;
}
----------------------

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Jan 25, 2011 at 6:38 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> The third seems more problematic. In the XLogPageRead, >> it checks to see if more records have been received beyond what >> has been applied. By using the non-retreating value here, it seems >> like the xlog replay could start replaying records that the wal >> receiver is in the process of overwriting. Now, I've argued to myself >> that this is not a problem, because the receiver is overwriting them >> with identical data to that which is already there. > > Yes. I don't think that it's a problem, too. > >> But by that logic, why does any part of it (walrcv->receiveStart in >> the patch, walrcv->receivedUpto unpatched) need to retreat? From >> the previous discussion, I understand that the concern is that we don't >> want to retrieve and write out partial xlog files. What I don't understand >> is how we could get our selves into the position in which we are doing >> that, other than by someone going in and doing impermissible things to >> the PGDATA directory behind our backs. > > That logic exists because we'd like to check that newly-received WAL > data is consistent with previous one by validating the header of new > WAL file. I do not understand what doing so gets us. Say we previously received 2/3 of a WAL file, and replayed most of it. So now the shared buffers have data that has been synced to that WAL file already, and some of those dirty shared buffers have been written to disk and some have not. At this point, we need the data in the first 2/3 of the WAL file in order to reach a consistent state. But now we lose the connection to the master, and then we restore it. Now we request the entire file from the start rather than from where it left off. Either of two things happens. Most likely, the newly received WAL file matches the file it is overwriting, in which case there was no point in asking for it. Less likely, the master is feeding us gibberish. By requesting the full WAL file, we check the header and detect that the master is feeding us gibberish. Unfortunately, we only detect that fact *after* we have replaced a critical part of our own (previously good) copy of the WAL file with said gibberish. The standby is now in an unrecoverable state. With a bit of malicious engineering, I have created this situation. I don't know how likely it is that something like that could happen accidentally, say with a corrupted file system. I have been unable to engineer a situation where checking the header actually does any good. It has either done nothing, or done harm. > So since we need the header of new WAL file, we retreat the > replication starting location to the beginning of the WAL file when > reconnecting to the primary. > > The following code (in XLogPageRead) validates the header of new > WAL file.
>
> ----------------------
> if (switched_segment && targetPageOff != 0)
> {
>     /*
>      * Whenever switching to a new WAL segment, we read the first page of
>      * the file and validate its header, even if that's not where the
>      * target record is. This is so that we can check the additional
>      * identification info that is present in the first page's "long"
>      * header.
>      */
>     readOff = 0;
>     if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
>     {
>         ereport(emode_for_corrupt_record(emode, *RecPtr),
>                 (errcode_for_file_access(),
>                  errmsg("could not read from log file %u, segment %u, offset %u: %m",
>                         readId, readSeg, readOff)));
>         goto next_record_is_invalid;
>     }
>     if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode))
>         goto next_record_is_invalid;
> }

OK, thanks for the explanation. Is there a race condition here? That is, it seems that with your patch this part of the code could get executed after streaming is restarted, but before the streaming ever actually received and wrote anything. So it would be checking the old header, not the newly about-to-be received header. Cheers, Jeff
On Sun, Jan 30, 2011 at 10:44 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > I do not understand what doing so gets us. > > Say we previously received 2/3 of a WAL file, and replayed most of it. > So now the shared buffers have data that has been synced to that WAL > file already, and some of those dirty shared buffers have been written > to disk and some have not. At this point, we need the data in the first > 2/3 of the WAL file in order to reach a consistent state. But now we > lose the connection to the master, and then we restore it. Now we > request the entire file from the start rather than from where it > left off. > > Either of two things happens. Most likely, the newly received WAL file > matches the file it is overwriting, in which case there was no > point in asking for it. > > Less likely, the master is feeding us gibberish. By requesting the > full WAL file, we check the header and detect that the master is feeding > us gibberish. Unfortunately, we only detect that fact *after* we have > replaced a critical part of our own (previously good) copy of the WAL > file with said gibberish. The standby is now in an unrecoverable state. Right. To avoid this problem completely, IMO, walreceiver should validate the received WAL data before writing it. Or, walreceiver should write the WAL to the transient file, and the startup process should rename it to the correct name after replaying it. We should do something like the above? > With a bit of malicious engineering, I have created this situation. > I don't know how likely it is that something like that could happen > accidentally, say with a corrupted file system. I have been unable > to engineer a situation where checking the header actually does > any good. It has either done nothing, or done harm. OK, I seem to have to consider again why the code which retreats the replication starting location exists. At first, I added the code to prevent a half-baked WAL file. For example, please imagine the case where you start the standby server with no WAL files in pg_xlog. In this case, if replication starts from the middle of WAL file, the received WAL file is obviously broken (i.e., with no WAL data in the first half of file). This broken WAL file might cause trouble when we restart the standby and even when we just promote it (because the last applied WAL file is re-fetched at the end of recovery). OTOH, we can start replication from the middle of WAL file if we can *ensure* that the first half of WAL file already exists. At least, when the standby reconnects to the master, we might be able to ensure that and start from the middle. > OK, thanks for the explanation. Is there a race condition here? That is, > it seems that with your patch this part of the code could get executed > after streaming is restarted, but before the streaming ever actually received > and wrote anything. So it would be checking the old header, not the newly > about-to-be received header. As far as I read the code again, there is no such race condition. When the received WAL is read just after streaming is restarted, that part of the code seems to always be executed. Maybe I'm missing something. Could you explain the problematic scenario? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
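[Editorial illustration] The transient-file idea floated above is essentially the usual write-then-rename pattern (in Fujii's proposal the rename would happen later, after replay, and by the startup process rather than immediately). A generic, self-contained sketch of that pattern, with simplified error handling and made-up file names, not walreceiver code:
----------------------
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write data to "<final_path>.tmp", fsync it, then rename it into place, so
 * the final name only ever refers to a completely written file. */
int
write_then_rename(const char *final_path, const char *data, size_t len)
{
    char tmp_path[1024];
    int  fd;

    snprintf(tmp_path, sizeof(tmp_path), "%s.tmp", final_path);

    fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    /* Atomically replace (or create) the file under its final name. */
    return rename(tmp_path, final_path);
}

int
main(void)
{
    /* Demo with an invented file name. */
    return write_then_rename("demo.wal", "hello", 5);
}
----------------------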
Fujii, all, * Fujii Masao (masao.fujii@gmail.com) wrote: > That logic exists because we'd like to check that newly-received WAL > data is consistent with previous one by validating the header of new > WAL file. So since we need the header of new WAL file, we retreat the > replication starting location to the beginning of the WAL file when > reconnecting to the primary. Thanks for that explanation, but I can't help but wonder why it doesn't make more sense to introduce a new variable to track the value you want rather than reusing an existing one and then adding a variable to represent what the old variable was already doing. In other words, why not invent XLogRecPtr newestReceivedUpto; /* or something */ and update that as necessary and have the function you want changed return that, and leave receivedUpto alone..? It seems like it'd be a lot easier to prove to ourselves that your patch didn't break anything, presuming the function we're talking about changing the return value of isn't called anywhere (seems rather unlikely to me..). Also, you really should reformat the docs properly when you change them, rather than leaving the lines ragged.. Thanks, Stephen
On Fri, Feb 11, 2011 at 11:52 AM, Stephen Frost <sfrost@snowman.net> wrote: > Fujii, all, > > * Fujii Masao (masao.fujii@gmail.com) wrote: >> That logic exists because we'd like to check that newly-received WAL >> data is consistent with previous one by validating the header of new >> WAL file. So since we need the header of new WAL file, we retreat the >> replication starting location to the beginning of the WAL file when >> reconnecting to the primary. > > Thanks for that explanation, but I can't help but wonder why it doesn't > make more sense to introduce a new variable to track the value you want > rather than reusing an existing one and then adding a variable to > represent what the old variable was already doing. +1. It seems like there may be some more significant changes in this area needed; however, for now I think the best fix is the one with the least chance of breaking anything. > Also, you really should reformat the docs properly when you change them, > rather than leaving the lines ragged.. It's OK to leave them a little ragged, I think. It eases back-patching. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Feb 11, 2011 at 12:25 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Feb 11, 2011 at 11:52 AM, Stephen Frost <sfrost@snowman.net> wrote: >> Fujii, all, >> >> * Fujii Masao (masao.fujii@gmail.com) wrote: >>> That logic exists because we'd like to check that newly-received WAL >>> data is consistent with previous one by validating the header of new >>> WAL file. So since we need the header of new WAL file, we retreat the >>> replication starting location to the beginning of the WAL file when >>> reconnecting to the primary. >> >> Thanks for that explanation, but I can't help but wonder why it doesn't >> make more sense to introduce a new variable to track the value you want >> rather than reusing an existing one and then adding a variable to >> represent what the old variable was already doing. > > +1. > > It seems like there may be some more significant changes in this area > needed; however, for now I think the best fix is the one with the > least chance of breaking anything. Actually... wait a minute. Now that I'm thinking about this a little more, I'm not so convinced. If we leave this the way it is, and just paper over the problem using the method Stephen and I both had in mind, then we could be streaming from a position that precedes the so-called "current" position. And worse, the newly introduced replies to the master will still show the position going backward, so the master will reflect a position that is earlier than the position the slave reports for itself, not because of any real difference but because of a reporting difference. That sure doesn't seem good. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
* Robert Haas (robertmhaas@gmail.com) wrote: > Actually... wait a minute. Now that I'm thinking about this a little > more, I'm not so convinced. If we leave this the way is, and just > paper over the problem using the method Stephen and I both had in > mind, then we could be streaming from a position that precedes the > so-called "current" position. And worse, the newly introduced replies > to the master will still show the position going backward, so the > master will reflect a position that is earlier that the position the > slave reports for itself, not because of any real difference but > because of a reporting difference. That sure doesn't seem good. I'm really not sure it's as bad as all that... The slave and the master are only going to be "out of sync" wrt position until the slave sends the request for update to the master, gets back the result, and checks the XLOG header, right? Here's my question- we're talking about if the master dies here, as I understand it. If the slave comes up and can't get to the master, is he going to be reporting the older position, or his current one? The answer to that should be, I believe, the *current* one. He'd only "go backwards" *if* the master is up and he gets his request over to the master to get the old log. In fact, if he comes up and gets his request off to the master and the master never replies, in my view, he should be able to be restarted into 'master' mode and come all the way up to 'current' (which would be later than what he was trying to ask the master for). *That* is the value, I think, that Fujii is looking for- if this slave started up as a master, where would he be? That should always be increasing. The fact that we "back up" a little, as a double-check to make sure we didn't get wildly out of sync, and because we want the master to only be producing full XLOGs, is an implementation detail, really, that probably doesn't really need to be exposed.. That may mean that we want to change what the slave is reporting for itself though, if it's currently allowed to be seen "going backwards", but I don't think that would be terribly difficult to change... I haven't really dug into the SR/HS stuff, so I could be way off too.. Thanks, Stephen
On Fri, Feb 11, 2011 at 12:52 PM, Stephen Frost <sfrost@snowman.net> wrote: > * Robert Haas (robertmhaas@gmail.com) wrote: >> Actually... wait a minute. Now that I'm thinking about this a little >> more, I'm not so convinced. If we leave this the way is, and just >> paper over the problem using the method Stephen and I both had in >> mind, then we could be streaming from a position that precedes the >> so-called "current" position. And worse, the newly introduced replies >> to the master will still show the position going backward, so the >> master will reflect a position that is earlier that the position the >> slave reports for itself, not because of any real difference but >> because of a reporting difference. That sure doesn't seem good. > > I'm really not sure it's as bad as all that... The slave and the master > are only going to be "out of sync" wrt position until the slave sends > the request for update to the master, gets back the result, and checks > the XLOG header, right? It'll restream the whole segment up to wherever it was, but basically, yes. I think a big part of the problem here is that we have wildly inconsistent terminology for no especially good reason. The standby reports three XLOG positions to the master, currently called write, flush, and apply. The walreceiver reports positions called receivedUpto and lastChunkStart. receivedUpto is actually the FLUSH position, and lastChunkStart is the previous flush position, except when the walreceiver first starts, when it's the position at which streaming is to begin. So, what if we did some renaming? I'd be inclined to start by renaming "receivedUpto" to Flush, and add a new position called Stream. When walreceiver is started, we set Stream to the position at which streaming is going to begin (which can rewind) and leave Flush alone (so it never rewinds). We then change the walreceiver feedback mechanism to use the term stream_location rather than write_location; and we could consider replacing pg_last_xlog_receive_location() with pg_last_xlog_stream_location() and pg_last_xlog_flush_location(). That's a backward compatibility break, but maybe it's worth it for the sake of terminological consistency, not to mention accuracy. I'd also be inclined to go to the walreceiver code and rename the apply_location to replay_location, so that it matches pg_last_xlog_replay_location(). The latter is in 9.0, but the former is new to 9.1, so we can still fix it to be consistent without a backward compatibility break. Thoughts, comments? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
* Robert Haas (robertmhaas@gmail.com) wrote: > I think a big part of the problem here is that we have wildly > inconsistent terminology for no especially good reason. Agreed. > Thoughts, comments? I thought about this for a bit and agree w/ your suggestions. So, +1 from me. Thanks, Stephen
On Sat, 2011-02-12 at 09:32 -0500, Robert Haas wrote: > On Fri, Feb 11, 2011 at 12:52 PM, Stephen Frost <sfrost@snowman.net> wrote: > > * Robert Haas (robertmhaas@gmail.com) wrote: > >> Actually... wait a minute. Now that I'm thinking about this a little > >> more, I'm not so convinced. If we leave this the way is, and just > >> paper over the problem using the method Stephen and I both had in > >> mind, then we could be streaming from a position that precedes the > >> so-called "current" position. And worse, the newly introduced replies > >> to the master will still show the position going backward, so the > >> master will reflect a position that is earlier that the position the > >> slave reports for itself, not because of any real difference but > >> because of a reporting difference. That sure doesn't seem good. > > > > I'm really not sure it's as bad as all that... The slave and the master > > are only going to be "out of sync" wrt position until the slave sends > > the request for update to the master, gets back the result, and checks > > the XLOG header, right? > > It'll restream the whole segment up to wherever it was, but basically, yes. > > I think a big part of the problem here is that we have wildly > inconsistent terminology for no especially good reason. The standby > reports three XLOG positions to the master, currently called write, > flush, and apply. The walreceiver reports positions called > recievedUpTo and lastChunkStart. receivedUpTo is actually the FLUSH > position, and lastChunkStart is the previous flush position, except > when the walreceiver first starts, when it's the position at which > streaming is to begin. > > So, what if we did some renaming? I'd be inclined to start by > renaming "receivedUpTo" to Flush, and add a new position called > Stream. When walreciever is started, we set Stream to the position at > which streaming is going to begin (which can rewind) and leave Flush > alone (so it never rewinds). OK > We then change the walreceiver feedback > mechanism to use the term stream_location rather than write_location; OK > and we could consider replacing pg_last_xlog_receive_location() with > pg_last_xlog_stream_location() and pg_last_xlog_flush_location(). > That's a backward compatibility break, but maybe it's worth it for the > sake of terminological consistency, not to mention accuracy. Don't see a need to break anything. OK with two new function names. > I'd also be inclined to go to the walreceiver code and and rename the > apply_location to replay_location, so that it matches > pg_last_xlog_replay_location(). The latter is in 9.0, but the former > is new to 9.1, so we can still fix it to be consistent without a > backward compatibility break. Any renaming of things should be done as cosmetic fixes after important patches are in. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Mon, 2011-01-31 at 16:12 +0900, Fujii Masao wrote: > On Sun, Jan 30, 2011 at 10:44 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > > I do not understand what doing so gets us. > > > > Say we previously received 2/3 of a WAL file, and replayed most of it. > > So now the shared buffers have data that has been synced to that WAL > > file already, and some of those dirty shared buffers have been written > > to disk and some have not. At this point, we need the data in the first > > 2/3 of the WAL file in order to reach a consistent state. But now we > > lose the connection to the master, and then we restore it. Now we > > request the entire file from the start rather than from where it > > left off. > > > > Either of two things happens. Most likely, the newly received WAL file > > matches the file it is overwriting, in which case there was no > > point in asking for it. > > > > Less likely, the master is feeding us gibberish. By requesting the > > full WAL file, we check the header and detect that the master is feeding > > us gibberish. Unfortunately, we only detect that fact *after* we have > > replaced a critical part of our own (previously good) copy of the WAL > > file with said gibberish. The standby is now in an unrecoverable state. > > Right. To avoid this problem completely, IMO, walreceiver should validate > the received WAL data before writing it. Or, walreceiver should write the > WAL to the transient file, and the startup process should rename it to the > correct name after replaying it. > > We should do something like the above? > > > With a bit of malicious engineering, I have created this situation. > > I don't know how likely it is that something like that could happen > > accidentally, say with a corrupted file system. I have been unable > > to engineer a situation where checking the header actually does > > any good. It has either done nothing, or done harm. > > OK, I seem to have to consider again why the code which retreats the > replication starting location exists. > > At first, I added the code to prevent a half-baked WAL file. For example, > please imagine the case where you start the standby server with no WAL > files in pg_xlog. In this case, if replication starts from the middle of WAL > file, the received WAL file is obviously broken (i.e., with no WAL data in > the first half of file). This broken WAL file might cause trouble when we > restart the standby and even when we just promote it (because the last > applied WAL file is re-fetched at the end of recovery). > > OTOH, we can start replication from the middle of WAL file if we can > *ensure* that the first half of WAL file already exists. At least, when the > standby reconnects to the master, we might be able to ensure that and > start from the middle. Some important points here, but it seems complex. AFAICS we need to do two things * check header of WAL file matches when we start streaming * start streaming from last received location So why not do them separately, rather than rewinding the received location and kludging things? Seems easier to send all required info in a separate data packet, then validate the existing WAL file against that. Then start streaming from last received location. That way we don't have any "going backwards" logic at all. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Sun, Jan 30, 2011 at 11:12 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Sun, Jan 30, 2011 at 10:44 AM, Jeff Janes <jeff.janes@gmail.com> wrote: >> I do not understand what doing so gets us. >> >> Say we previously received 2/3 of a WAL file, and replayed most of it. >> So now the shared buffers have data that has been synced to that WAL >> file already, and some of those dirty shared buffers have been written >> to disk and some have not. At this point, we need the data in the first >> 2/3 of the WAL file in order to reach a consistent state. But now we >> lose the connection to the master, and then we restore it. Now we >> request the entire file from the start rather than from where it >> left off. >> >> Either of two things happens. Most likely, the newly received WAL file >> matches the file it is overwriting, in which case there was no >> point in asking for it. >> >> Less likely, the master is feeding us gibberish. By requesting the >> full WAL file, we check the header and detect that the master is feeding >> us gibberish. Unfortunately, we only detect that fact *after* we have >> replaced a critical part of our own (previously good) copy of the WAL >> file with said gibberish. The standby is now in an unrecoverable state. > > Right. To avoid this problem completely, IMO, walreceiver should validate > the received WAL data before writing it. Or, walreceiver should write the > WAL to the transient file, and the startup process should rename it to the > correct name after replaying it. > > We should do something like the above? I don't think we should overwrite local WAL that has already been applied, unless we already know for sure that it is bad (and suspect re-streaming might make it right.) I think the better approach would be to check the existence, and the segment header, of the WAL segment we already (think we) have in pg_xlog. And then request re-streaming of that file from the beginning only if it is either missing or has the wrong header in the local copy. That might make the code a lot more complex, though. At least I don't see a simple way to implement it. > >> With a bit of malicious engineering, I have created this situation. >> I don't know how likely it is that something like that could happen >> accidentally, say with a corrupted file system. I have been unable >> to engineer a situation where checking the header actually does >> any good. It has either done nothing, or done harm. > > OK, I seem to have to consider again why the code which retreats the > replication starting location exists. > > At first, I added the code to prevent a half-baked WAL file. For example, > please imagine the case where you start the standby server with no WAL > files in pg_xlog. In this case, if replication starts from the middle of WAL > file, the received WAL file is obviously broken (i.e., with no WAL data in > the first half of file). This broken WAL file might cause trouble when we > restart the standby and even when we just promote it (because the last > applied WAL file is re-fetched at the end of recovery). > > OTOH, we can start replication from the middle of WAL file if we can > *ensure* that the first half of WAL file already exists. At least, when the > standby reconnects to the master, we might be able to ensure that and > start from the middle. > >> OK, thanks for the explanation. Is there a race condition here? That is, >> it seems that with your patch this part of the code could get executed >> after streaming is restarted, but before the streaming ever actually received >> and wrote anything. So it would be checking the old header, not the newly >> about-to-be received header. > > As far as I read the code again, there is no such a race condition. When > the received WAL is read just after streaming is restarted, that part of the > code seems to always be executed. > > Maybe I'm missing something. Could you explain about the problematic > scenario? OK, I think you are right. I was thinking that if apply is well behind receive when the connection is lost, that a new connection could immediately pass the "havedata" test and so proceed to the header check perhaps before anything was received. But now I see that we never try to re-connect until after apply has caught up with receive, so that race condition cannot happen. Sorry for the false alarm. By the way, the patch no longer applies cleanly to HEAD, as of commit d309acf201ab2c5 (Feb 11th). But it looks like a trivial conflict (Two different patches both correct the same doc typo). Cheers, Jeff
On Sat, Feb 12, 2011 at 11:32 PM, Robert Haas <robertmhaas@gmail.com> wrote: > So, what if we did some renaming? I'd be inclined to start by > renaming "receivedUpTo" to Flush, and add a new position called > Stream. When walreciever is started, we set Stream to the position at > which streaming is going to begin (which can rewind) and leave Flush > alone (so it never rewinds). We then change the walreceiver feedback > mechanism to use the term stream_location rather than write_location; > and we could consider replacing pg_last_xlog_receive_location() with > pg_last_xlog_stream_location() and pg_last_xlog_flush_location(). You suggest that the shared variable Stream tracks the WAL write location, after it's set to the replication starting position? I don't think that the write location needs to be tracked in the shmem because other processes than walreceiver don't use it. What I proposed is:

Walreceiver-local variables
==================
1. LogstreamResult.Write
 - Indicates the location of recently written WAL record
 - Can rewind
 - pg_stat_replication.write_location returns this
2. LogstreamResult.Flush
 - Indicates the location of recently flushed WAL record
 - Can rewind
 - pg_stat_replication.flush_location returns this

Shmem variables
===========
3. WalRcv->receiveStart
 - Indicates the replication starting location
 - Updated only when walreceiver is started
 - Doesn't exist at the moment, so I propose to add this
4. WalRcv->receivedUpto
 - Indicates the latest location of all the flushed WAL records
 - Never rewinds (Can rewind at the moment, so I propose to prevent the rewind)
 - pg_last_xlog_receive_location returns this

You propose to rename LogstreamResult.Write to .Stream, and merge it and receiveStart? > I'd also be inclined to go to the walreceiver code and and rename the > apply_location to replay_location, so that it matches > pg_last_xlog_replay_location(). The latter is in 9.0, but the former > is new to 9.1, so we can still fix it to be consistent without a > backward compatibility break. +1 Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
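[Editorial illustration] Restating the proposal above as a rough sketch may help; the types are simplified stand-ins (plain 64-bit integers instead of XLogRecPtr) and the comments are paraphrases, so this is an illustration rather than code from the patch:
----------------------
#include <stdint.h>

/* Walreceiver-local bookkeeping: both values may rewind when streaming
 * restarts at a segment boundary. */
typedef struct
{
    uint64_t Write;   /* recently written WAL -> pg_stat_replication.write_location */
    uint64_t Flush;   /* recently flushed WAL -> pg_stat_replication.flush_location */
} LogstreamResultSketch;

/* Shared-memory bookkeeping. */
typedef struct
{
    uint64_t receiveStart;   /* where streaming starts; set when walreceiver starts */
    uint64_t receivedUpto;   /* latest flushed location; never rewinds,
                              * reported by pg_last_xlog_receive_location() */
} WalRcvSketch;
----------------------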
On Tue, Feb 15, 2011 at 12:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > You suggest that the shared variable Stream tracks the WAL write location, > after it's set to the replication starting position? I don't think > that the write > location needs to be tracked in the shmem because other processes than > walreceiver don't use it. Well, my proposal was to expose it, on the theory that it's useful. As we stream the WAL, we write it, so I think for all intents and purposes write == stream. But using it to convey the starting position makes more sense if you call it stream than it does if you call it write. > You propose to rename LogstreamResult.Write to .Stream, and > merge it and receiveStart? Yeah, or probably change receiveStart to be called Stream. It's representing the same thing, just in shmem instead of backend-local, so why name it differently? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Feb 15, 2011 at 9:41 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Feb 15, 2011 at 12:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> You suggest that the shared variable Stream tracks the WAL write location, >> after it's set to the replication starting position? I don't think >> that the write >> location needs to be tracked in the shmem because other processes than >> walreceiver don't use it. > > Well, my proposal was to expose it, on the theory that it's useful. > As we stream the WAL, we write it, so I think for all intents and > purposes write == stream. But using it to convey the starting > position makes more sense if you call it stream than it does if you > call it write. Umm.. I could not find any use case to expose the WAL write location besides flush one. So I'm not sure if it's really useful to track the write location in the shmem besides the walreceiver-local memory. What use case do you think of? Personally the term "stream" sounds more ambiguous than "write". I cannot imagine what location the pg_last_xlog_stream_location or stream_location actually returns, from the function name; WAL location that has been received? written? flushed? replayed? Since the "write" sounds cleaner, I like it. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Feb 16, 2011 at 12:59 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Tue, Feb 15, 2011 at 9:41 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Feb 15, 2011 at 12:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> You suggest that the shared variable Stream tracks the WAL write location, >>> after it's set to the replication starting position? I don't think >>> that the write >>> location needs to be tracked in the shmem because other processes than >>> walreceiver don't use it. >> >> Well, my proposal was to expose it, on the theory that it's useful. >> As we stream the WAL, we write it, so I think for all intents and >> purposes write == stream. But using it to convey the starting >> position makes more sense if you call it stream than it does if you >> call it write. > > Umm.. I could not find any use case to expose the WAL write location > besides flush one. So I'm not sure if it's really useful to track the > write location in the shmem besides the walreceiver-local memory. > What use case do you think of? Well, we're currently exposing that on the master via pg_stat_replication. I guess we could rip that out, but I think that if nothing else we're imagining eventually supporting a sync rep mode where the standby acknowledges WAL upon receipt rather than upon write. And the lag between the write and flush positions can be up to 16MB, so it doesn't seem entirely academic. Basically, the write position is the most WAL that could be on disk on standby and the flush is the most WAL that we're SURE is on disk on the standby. > Personally the term "stream" sounds more ambiguous than "write". > I cannot imagine what location the pg_last_xlog_stream_location or > stream_location actually returns, from the function name; WAL > location that has been received? written? flushed? replayed? > Since the "write" sounds cleaner, I like it. Well, the problem with receivedUpto is that it's really being used for two different things, neither of which is how much WAL has been received. One is where streaming is to start (hence, stream) and the other is how much we've flushed to disk (hence, flush). So you might think there were four positions: streaming start, write, flush, apply. But I think the first two are really the same: once you've received at least one byte, the position that you're streaming from and the write position are the same, so I think the name stream can span both concepts. OTOH, stream-start and flush are clearly NOT the same - there is a small but potentially significant delay between stream/write and flush. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
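[Editorial illustration] As a toy illustration of the write/flush distinction described above — "write" advances as soon as bytes are handed to the kernel, "flush" only after an fsync — with invented names and no error handling (this is not walreceiver code):
----------------------
#include <stdint.h>
#include <unistd.h>

/* Toy receive step: write_pos can run ahead of flush_pos by up to the
 * amount of WAL written since the last fsync. */
void
receive_chunk(int fd, const char *buf, size_t len, int sync_now,
              uint64_t *write_pos, uint64_t *flush_pos)
{
    if (write(fd, buf, len) == (ssize_t) len)
        *write_pos += len;            /* WAL that might be on disk */

    if (sync_now && fsync(fd) == 0)
        *flush_pos = *write_pos;      /* WAL that is known to be on disk */
}
----------------------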
On Wed, Feb 16, 2011 at 7:06 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Feb 16, 2011 at 12:59 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Tue, Feb 15, 2011 at 9:41 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Tue, Feb 15, 2011 at 12:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>>> You suggest that the shared variable Stream tracks the WAL write location, >>>> after it's set to the replication starting position? I don't think >>>> that the write >>>> location needs to be tracked in the shmem because other processes than >>>> walreceiver don't use it. >>> >>> Well, my proposal was to expose it, on the theory that it's useful. >>> As we stream the WAL, we write it, so I think for all intents and >>> purposes write == stream. But using it to convey the starting >>> position makes more sense if you call it stream than it does if you >>> call it write. >> >> Umm.. I could not find any use case to expose the WAL write location >> besides flush one. So I'm not sure if it's really useful to track the >> write location in the shmem besides the walreceiver-local memory. >> What use case do you think of? > > Well, we're currently exposing that on the master via > pg_stat_replication. I guess we could rip that out, but I think that > if nothing else we're imagining eventually supporting a sync rep mode > where the standby acknowledges WAL upon receipt rather than upon > write. And the lag between the write and flush positions can be up to > 16MB, so it doesn't seem entirely academic. Basically, the write > position is the most WAL that could be on disk on standby and the > flush is the most WAL that we're SURE is on disk on the standby. > >> Personally the term "stream" sounds more ambiguous than "write". >> I cannot imagine what location the pg_last_xlog_stream_location or >> stream_location actually returns, from the function name; WAL >> location that has been received? written? flushed? replayed? >> Since the "write" sounds cleaner, I like it. > > Well, the problem with receivedUpto is that it's really being used for > two different things, neither of which is how much WAL has been > received. One is where streaming is to start (hence, stream) and the > other is how much we've flushed to disk (hence, flush). So you might > think there were four positions: streaming start, write, flush, apply. > But I think the first two are really the same: once you've received > at least one byte, the position that you're streaming from and the > write position are the same, so I think the name stream can span both > concepts. OTOH, stream-start and flush are clearly NOT the same - > there is a small but potentially significant delay between > stream/write and flush. It sounds like the only thing we have definite agreement about from all this is that apply_location should be renamed to replay_location in pg_stat_replication, a point fairly incidental to what this patch is about. It seems fairly unsatisfying to just change that and punt the rest of this, but I'm not sure what the alternative is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 26.02.2011 16:58, Robert Haas wrote: > It sounds like the only thing we have definite agreement about from > all this is that apply_location should be renamed to replay_location > in pg_stat_replication, a point fairly incidental to the what this > patch is about. It seems fairly unsatisfying to just change that and > punt the rest of this, but I'm not sure what the alternative is. After reading the discussions, I don't see any actual opposition to Fujii-san's patch. And I think it makes sense: the new definition fits the purpose Fujii mentioned in the mail that started this thread, determining which standby is most up-to-date. There have been a lot of good suggestions, like verifying the received WAL before overwriting existing WAL. But we're not going to start doing bigger code changes this late in the release cycle. And even if we did, this patch would still make sense - we still wouldn't want pg_last_xlog_receive_location() to move backwards. So, committed. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com