Thread: Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

sriggs@postgresql.org (Simon Riggs) writes:
> Log Message:
> -----------
> Augment WAL records for btree delete with GetOldestXmin() to reduce
> false positives during Hot Standby conflict processing. Simple
> patch to enhance conflict processing, following previous discussions. 
> Controlled by parameter minimize_standby_conflicts = on | off, with
> default off allows measurement of performance impact to see whether
> it should be set on all the time.

WTF?  Simon, this seems to be getting way way beyond anything the
community has agreed to.  Particularly, inventing GUCs is not something
to be doing without consensus.
        regards, tom lane


Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> sriggs@postgresql.org (Simon Riggs) writes:
>> Log Message:
>> -----------
>> Augment WAL records for btree delete with GetOldestXmin() to reduce
>> false positives during Hot Standby conflict processing. Simple
>> patch to enhance conflict processing, following previous discussions. 
>> Controlled by parameter minimize_standby_conflicts = on | off, with
>> default off allows measurement of performance impact to see whether
>> it should be set on all the time.
> 
> WTF?  Simon, this seems to be getting way way beyond anything the
> community has agreed to.  Particularly, inventing GUCs is not something
> to be doing without consensus.

yeah actually both of the last commits seem rather strange given the 
current discussion on this list I must say...



Stefan


On Fri, 2010-01-29 at 14:04 -0500, Tom Lane wrote:
> sriggs@postgresql.org (Simon Riggs) writes:
> > Log Message:
> > -----------
> > Augment WAL records for btree delete with GetOldestXmin() to reduce
> > false positives during Hot Standby conflict processing. Simple
> > patch to enhance conflict processing, following previous discussions. 
> > Controlled by parameter minimize_standby_conflicts = on | off, with
> > default off allows measurement of performance impact to see whether
> > it should be set on all the time.
> 
> WTF?  Simon, this seems to be getting way way beyond anything the
> community has agreed to.  Particularly, inventing GUCs is not something
> to be doing without consensus.

If you or anybody else would like me to revoke it then I am happy to do
that, with no problem or argument. I agree it hasn't had discussion,
though am happy to have such a discussion.

The commit is a one line change, with parameter to control it, discussed
by Heikki and myself in December 2008. I stand by the accuracy of the
change; the parameter is really to ensure we can test during beta. 

I imagined such a minor addition would pass without comment. This is not
the same patch not-being-discussed on the other thread, which is too
complex for a commit without review. It is a separate change altogether,
although it does relate to conflict processing. In any case I would
never commit a patch while discussion on it continues.

-- Simon Riggs           www.2ndQuadrant.com



Simon Riggs <simon@2ndQuadrant.com> writes:
> On Fri, 2010-01-29 at 14:04 -0500, Tom Lane wrote:
>> WTF?  Simon, this seems to be getting way way beyond anything the
>> community has agreed to.  Particularly, inventing GUCs is not something
>> to be doing without consensus.

> If you or anybody else would like me to revoke it then I am happy to do
> that, with no problem or argument. I agree it hasn't had discussion,
> though am happy to have such a discussion.

> The commit is a one line change, with parameter to control it, discussed
> by Heikki and myself in December 2008. I stand by the accuracy of the
> change; the parameter is really to ensure we can test during beta. 

Well, I was waiting to see if anyone else had an opinion, but: my
opinion is that a GUC is not appropriate here.  Either test it yourself
enough to be sure it's a win, or don't put it in.
        regards, tom lane


On Sun, 2010-01-31 at 14:07 -0500, Tom Lane wrote:

> > The commit is a one line change, with parameter to control it, discussed
> > by Heikki and myself in December 2008. I stand by the accuracy of the
> > change; the parameter is really to ensure we can test during beta. 
> 
> Well, I was waiting to see if anyone else had an opinion, but: my
> opinion is that a GUC is not appropriate here.  Either test it yourself
> enough to be sure it's a win, or don't put it in.

I will remove the parameter then, keeping the augmentation. That OK?

-- Simon Riggs           www.2ndQuadrant.com



Re: Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

From
Stefan Kaltenbrunner
Date:
Simon Riggs wrote:
> On Sun, 2010-01-31 at 14:07 -0500, Tom Lane wrote:
> 
>>> The commit is a one line change, with parameter to control it, discussed
>>> by Heikki and myself in December 2008. I stand by the accuracy of the
>>> change; the parameter is really to ensure we can test during beta. 
>> Well, I was waiting to see if anyone else had an opinion, but: my
>> opinion is that a GUC is not appropriate here.  Either test it yourself
>> enough to be sure it's a win, or don't put it in.
> 
> I will remove the parameter then, keeping the augmentation. That OK?

Well how much is the actual hit with this on the master for different 
workloads do we have realistic numbers on that? Also how much of an 
actual win is it in the other direction - as in under what circumstances 
and workloads does it help in avoiding superflous cancelations on the 
standby?


Stefan


On Sun, 2010-01-31 at 20:48 +0100, Stefan Kaltenbrunner wrote:
> Simon Riggs wrote:
> > On Sun, 2010-01-31 at 14:07 -0500, Tom Lane wrote:
> > 
> >>> The commit is a one line change, with parameter to control it, discussed
> >>> by Heikki and myself in December 2008. I stand by the accuracy of the
> >>> change; the parameter is really to ensure we can test during beta. 
> >> Well, I was waiting to see if anyone else had an opinion, but: my
> >> opinion is that a GUC is not appropriate here.  Either test it yourself
> >> enough to be sure it's a win, or don't put it in.
> > 
> > I will remove the parameter then, keeping the augmentation. That OK?
> 
> Well how much is the actual hit with this on the master for different 
> workloads do we have realistic numbers on that? Also how much of an 
> actual win is it in the other direction - as in under what circumstances 
> and workloads does it help in avoiding superflous cancelations on the 
> standby?

At the moment a btree delete record will cancel every request
1. no matter how long they have been running
2. no matter if they haven't accessed the index being cleaned (they
might later, is the thinking...)

This change improves case (1) in that it will only remove queries that
are older than the oldest snapshot on the primary, should
max_standby_delay be exceeded. Case (2) would have been improved by my
other proposed patch should max_standby_delay be exceeded.

The cost of improving case (1) is that we do the equivalent of taking a
snapshot of the procarray while holding locks on the btree block being
cleaned. That will increase index contention somewhat in applications
that do btree deletes, i.e. non-hot index updates or deletes.

Greg Stark's comments have been that he wants to see max_standby_delay =
-1 put back, which would effecively prevent both (1) and (2) from
occurring. I am working on that now.

-- Simon Riggs           www.2ndQuadrant.com



Simon Riggs <simon@2ndQuadrant.com> writes:
> At the moment a btree delete record will cancel every request
> 1. no matter how long they have been running
> 2. no matter if they haven't accessed the index being cleaned (they
> might later, is the thinking...)

That seems seriously horrid.  What is the rationale for #2 in
particular?  I would hope that at worst this would affect sessions
that are actively competing for the index being cleaned.
        regards, tom lane


On Sun, 2010-01-31 at 15:41 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > At the moment a btree delete record will cancel every request
> > 1. no matter how long they have been running
> > 2. no matter if they haven't accessed the index being cleaned (they
> > might later, is the thinking...)
> 
> That seems seriously horrid.  What is the rationale for #2 in
> particular?  I would hope that at worst this would affect sessions
> that are actively competing for the index being cleaned.

That is exactly the feedback I received from many other people and why I
prioritised the relation-specific conflict patch.

It's worse that that because point 2 effects WAL cleanup records for the
heap also.

The rationale is that a session *might* in the future access a table,
and if it did so it would receive the wrong answer *potentially*. This
is the issue I have been discussing for a long time now, in various
forms, starting on-list in Aug 2008.

-- Simon Riggs           www.2ndQuadrant.com



Re: Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

From
Stefan Kaltenbrunner
Date:
Simon Riggs wrote:
> On Sun, 2010-01-31 at 20:48 +0100, Stefan Kaltenbrunner wrote:
>> Simon Riggs wrote:
>>> On Sun, 2010-01-31 at 14:07 -0500, Tom Lane wrote:
>>>
>>>>> The commit is a one line change, with parameter to control it, discussed
>>>>> by Heikki and myself in December 2008. I stand by the accuracy of the
>>>>> change; the parameter is really to ensure we can test during beta. 
>>>> Well, I was waiting to see if anyone else had an opinion, but: my
>>>> opinion is that a GUC is not appropriate here.  Either test it yourself
>>>> enough to be sure it's a win, or don't put it in.
>>> I will remove the parameter then, keeping the augmentation. That OK?
>> Well how much is the actual hit with this on the master for different 
>> workloads do we have realistic numbers on that? Also how much of an 
>> actual win is it in the other direction - as in under what circumstances 
>> and workloads does it help in avoiding superflous cancelations on the 
>> standby?
> 
> At the moment a btree delete record will cancel every request
> 1. no matter how long they have been running
> 2. no matter if they haven't accessed the index being cleaned (they
> might later, is the thinking...)
> 
> This change improves case (1) in that it will only remove queries that
> are older than the oldest snapshot on the primary, should
> max_standby_delay be exceeded. Case (2) would have been improved by my
> other proposed patch should max_standby_delay be exceeded.
> 
> The cost of improving case (1) is that we do the equivalent of taking a
> snapshot of the procarray while holding locks on the btree block being
> cleaned. That will increase index contention somewhat in applications
> that do btree deletes, i.e. non-hot index updates or deletes.

hmm makes sense from the HS side - but I think to make a case for a GUC 
it would help a lot to see actual numbers(as in benchmarks) showing how 
much of a hit it is to the master.


Stefan


On Sun, 2010-01-31 at 22:05 +0100, Stefan Kaltenbrunner wrote:

> hmm makes sense from the HS side - but I think to make a case for a GUC 
> it would help a lot to see actual numbers(as in benchmarks) showing how 
> much of a hit it is to the master.

Agreed, though my approach to benchmarking was to provide the mechanism
by which it was possible to benchmark. I didn't presume to be able to
cover wide area with benchmarking tests just for this.

-- Simon Riggs           www.2ndQuadrant.com



Re: Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Sun, 2010-01-31 at 22:05 +0100, Stefan Kaltenbrunner wrote:
> 
>> hmm makes sense from the HS side - but I think to make a case for a GUC 
>> it would help a lot to see actual numbers(as in benchmarks) showing how 
>> much of a hit it is to the master.
> 
> Agreed, though my approach to benchmarking was to provide the mechanism
> by which it was possible to benchmark. I didn't presume to be able to
> cover wide area with benchmarking tests just for this.

I don't think this patch helps much though. It's setting lastRemovedXid
to GetOldestXmin(), which is still a very pessimistic estimate. In fact,
if there's only one transaction running in the master, it's not any
better than just setting it to InvalidTransactionId and killing all
active queries in the standby.

What we'd want to set it to is the xmin/xmax of the oldest heap tuple
whose pointer was removed from the index page. That could be much much
older than GetOldestXmin(), allowing many more read-only queries to live
in the standby.

IIRC it was Greg Stark who suggested last time this was discussed that
we could calculate the exact value for latestRemovedXid in the standby.
When replaying the deletion record, the standby could look at all the
heap tuples whose index pointers are being removed, to see which one was
newest. That can be pretty expensive, involving random I/O, but it gives
an exact value, and doesn't stress the master at all. And we could skip
it if there's no potentially conflicting read-only queries running.

That seems like the most user-friendly approach to me. Even though
random I/O is expensive, surely it's better than killing queries.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> IIRC it was Greg Stark who suggested last time this was discussed that
> we could calculate the exact value for latestRemovedXid in the standby.
> When replaying the deletion record, the standby could look at all the
> heap tuples whose index pointers are being removed, to see which one was
> newest.

This assumes that the standby can tell which heap tuples are being
removed, which I'm not sure is true --- consider the case where the
deletion record is carrying a page image instead of a list of deleted
tuples.  But maybe we could rejigger the representation to make sure
that the info is always available.  In general it sounds like a much
nicer answer.
        regards, tom lane


On Sun, 2010-01-31 at 23:43 +0200, Heikki Linnakangas wrote:
> IIRC it was Greg Stark who suggested last time this was discussed that
> we could calculate the exact value for latestRemovedXid in the
> standby. When replaying the deletion record, the standby could look at
> all the heap tuples whose index pointers are being removed, to see
> which one was newest. That can be pretty expensive, involving random
> I/O, but it gives an exact value, and doesn't stress the master at
> all. And we could skip it if there's no potentially conflicting
> read-only queries running.
> 
> That seems like the most user-friendly approach to me. Even though
> random I/O is expensive, surely it's better than killing queries.

Best solution, no time to do it.

Should I revoke this change? 

-- Simon Riggs           www.2ndQuadrant.com



Simon Riggs <simon@2ndQuadrant.com> writes:
> Should I revoke this change? 

Please.  We can always put it back later if nothing better gets
implemented.
        regards, tom lane


On Sun, 2010-01-31 at 17:10 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > Should I revoke this change? 
> 
> Please.

Will do, in morning.

-- Simon Riggs           www.2ndQuadrant.com



On Sun, 2010-01-31 at 16:53 -0500, Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> > IIRC it was Greg Stark who suggested last time this was discussed that
> > we could calculate the exact value for latestRemovedXid in the standby.
> > When replaying the deletion record, the standby could look at all the
> > heap tuples whose index pointers are being removed, to see which one was
> > newest.
> 
> This assumes that the standby can tell which heap tuples are being
> removed, which I'm not sure is true --- consider the case where the
> deletion record is carrying a page image instead of a list of deleted
> tuples.  But maybe we could rejigger the representation to make sure
> that the info is always available.  In general it sounds like a much
> nicer answer.

I'm looking at this now since we definitely need either this or the
previous patch.

XLOG_BTREE_DELETE records would not carry latestRemovedXid, this would
be calculated by inspection of heap tuples. XLOG_BTREE_DELETE would have
a heap relid, to allow direct access to the heap without a Relation. We
would check xmin, xmax and xvac. Code will look a little like
heap_fetch() though disimilar enough not to use directly. We would only
look at the tuple header and info.

-- Simon Riggs           www.2ndQuadrant.com



Simon Riggs wrote:
> XLOG_BTREE_DELETE records would not carry latestRemovedXid, this would
> be calculated by inspection of heap tuples.

You might still want to keep the conservative latestRemovedXid value in
the WAL record to avoid the extra work when latestRemovedXid alone allows.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


On Tue, 2010-03-09 at 11:20 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > XLOG_BTREE_DELETE records would not carry latestRemovedXid, this would
> > be calculated by inspection of heap tuples.
> 
> You might still want to keep the conservative latestRemovedXid value in
> the WAL record to avoid the extra work when latestRemovedXid alone allows.

OK will do.

-- Simon Riggs           www.2ndQuadrant.com



On Sun, 2010-01-31 at 23:43 +0200, Heikki Linnakangas wrote:

> When replaying the deletion record, the standby could look at all the
> heap tuples whose index pointers are being removed, to see which one
> was newest.

Long after coding this, I now realise this really is a dumb-ass idea.

There is no spoon. The index tuples did once point at valid heap tuples.
1. heap tuples are deleted
2. heap tuples become dead
3. index tuples can now be marked killed
4. index tuples removed
Heap tuples can be removed at step 2, index tuples can't be removed
until step 4. so the dead index tuples can't be followed reliably to
read the xids. They might be the correct ones, might not, and no way to
tell.

So that puts this back to exactly the place we were on 31 Jan:

On Sun, 2010-01-31 at 17:10 -0500, Tom Lane wrote:
> We can always put it back later if nothing better gets
> implemented.

So, barring huge injections of insight, I'll be putting back the
generation of OldestXmin() into the btree delete path.

-- Simon Riggs           www.2ndQuadrant.com



Simon Riggs <simon@2ndQuadrant.com> writes:
> On Sun, 2010-01-31 at 23:43 +0200, Heikki Linnakangas wrote:
>> When replaying the deletion record, the standby could look at all the
>> heap tuples whose index pointers are being removed, to see which one
>> was newest.

> Long after coding this, I now realise this really is a dumb-ass idea.

> There is no spoon. The index tuples did once point at valid heap tuples.
> 1. heap tuples are deleted
> 2. heap tuples become dead
> 3. index tuples can now be marked killed
> 4. index tuples removed
> Heap tuples can be removed at step 2, index tuples can't be removed
> until step 4.

Uh, no, heap tuples can't be removed until after all index entries that
are pointing at them have been removed.  Please tell me you have not
broken this.

> So, barring huge injections of insight, I'll be putting back the
> generation of OldestXmin() into the btree delete path.

I stand by my previous comments about where this is going to end up.
        regards, tom lane


On Fri, 2010-03-26 at 16:16 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Sun, 2010-01-31 at 23:43 +0200, Heikki Linnakangas wrote:
> >> When replaying the deletion record, the standby could look at all the
> >> heap tuples whose index pointers are being removed, to see which one
> >> was newest.
>
> > Long after coding this, I now realise this really is a dumb-ass idea.
>
> > There is no spoon. The index tuples did once point at valid heap tuples.
> > 1. heap tuples are deleted
> > 2. heap tuples become dead
> > 3. index tuples can now be marked killed
> > 4. index tuples removed
> > Heap tuples can be removed at step 2, index tuples can't be removed
> > until step 4.
>
> Uh, no, heap tuples can't be removed until after all index entries that
> are pointing at them have been removed.  Please tell me you have not
> broken this.

Nothing broken.

It appears that in practice many of the index items point to heap items
that are LP_DEAD. So for the purposes of accessing a heap tuple's xmin,
then we're both right. To the current purpose the tuple has been
removed, though you are also right: its stub remains.

So how do I calculate xmin and xmax for an LP_DEAD tuple? Clearly
nothing can be done directly. Is there another way?

A conjecture: if the index items point to a heap tuple that is LP_NORMAL
then we can get the xmin/xmax from there. The xmin/xmax of LP_DEAD items
will always be *earlier* than the latest LP_NORMAL tuple that is being
removed. So as long as I have at least 1 LP_NORMAL heap tuple, then I
can use the latestRemovedXid from that and simply discard the LP_DEAD
items (for the purposes of this calculation). The idea is that whatever
marked those heap tuples LP_DEAD would also have marked the others, if
they were the same or earlier than the LP_DEAD ones.

Do you agree with this conjecture? If you do, then attached patch is
complete.

--
 Simon Riggs           www.2ndQuadrant.com

Attachment
On Sat, Mar 27, 2010 at 10:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> It appears that in practice many of the index items point to heap items
> that are LP_DEAD. So for the purposes of accessing a heap tuple's xmin,
> then we're both right. To the current purpose the tuple has been
> removed, though you are also right: its stub remains.

If we're pruning an index entry to a heap tuple that has been HOT
pruned wouldn't the HOT pruning record have already conflicted with
any queries that could see  it?


-- 
greg


On Sat, 2010-03-27 at 19:15 +0000, Greg Stark wrote:
> On Sat, Mar 27, 2010 at 10:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > It appears that in practice many of the index items point to heap items
> > that are LP_DEAD. So for the purposes of accessing a heap tuple's xmin,
> > then we're both right. To the current purpose the tuple has been
> > removed, though you are also right: its stub remains.
> 
> If we're pruning an index entry to a heap tuple that has been HOT
> pruned wouldn't the HOT pruning record have already conflicted with
> any queries that could see  it?

Quite probably, but a query that started after that record arrived might
slip through. We have to treat each WAL record separately.

Do you agree with the conjecture? That LP_DEAD items can be ignored
because their xid would have been earlier than the latest LP_NORMAL
tuple we find? (on any page).

Or is a slightly less strong condition true: we can ignore LP_DEAD items
on a page that is also referenced by an LP_NORMAL item.

-- Simon Riggs           www.2ndQuadrant.com



On Sat, Mar 27, 2010 at 7:36 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Sat, 2010-03-27 at 19:15 +0000, Greg Stark wrote:
> > If we're pruning an index entry to a heap tuple that has been HOT
> > pruned wouldn't the HOT pruning record have already conflicted with
> > any queries that could see  it?
>
> Quite probably, but a query that started after that record arrived might
> slip through. We have to treat each WAL record separately.

Slip through? I'm not following you.


> Do you agree with the conjecture? That LP_DEAD items can be ignored
> because their xid would have been earlier than the latest LP_NORMAL
> tuple we find? (on any page).
>
> Or is a slightly less strong condition true: we can ignore LP_DEAD items
> on a page that is also referenced by an LP_NORMAL item.

I don't like having dependencies on the precise logic in vacuum rather
than only on the guarantees that vacuum provides. We want to improve
the logic in vacuum and hot pruning to cover more cases and that will
be harder if there's code elsewhere depending on its simple-minded xid
<= globalxmin test.


-- 
greg


On Sat, 2010-03-27 at 22:39 +0000, Greg Stark wrote:
> On Sat, Mar 27, 2010 at 7:36 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > On Sat, 2010-03-27 at 19:15 +0000, Greg Stark wrote:
> > > If we're pruning an index entry to a heap tuple that has been HOT
> > > pruned wouldn't the HOT pruning record have already conflicted with
> > > any queries that could see  it?
> >
> > Quite probably, but a query that started after that record arrived might
> > slip through. We have to treat each WAL record separately.
> 
> Slip through? I'm not following you.

No, there is no possibility for it to slip through, you're right. (After
much thinking).

> > Do you agree with the conjecture? That LP_DEAD items can be ignored
> > because their xid would have been earlier than the latest LP_NORMAL
> > tuple we find? (on any page).
> >
> > Or is a slightly less strong condition true: we can ignore LP_DEAD items
> > on a page that is also referenced by an LP_NORMAL item.
> 
> I don't like having dependencies on the precise logic in vacuum rather
> than only on the guarantees that vacuum provides. We want to improve
> the logic in vacuum and hot pruning to cover more cases and that will
> be harder if there's code elsewhere depending on its simple-minded xid
> <= globalxmin test.

Agreed

-- Simon Riggs           www.2ndQuadrant.com