Thread: Hot Standby b-tree delete records review

Hot Standby b-tree delete records review

From
Heikki Linnakangas
Date:
btree_redo:
>     case XLOG_BTREE_DELETE:
> 
>         /*
>          * Btree delete records can conflict with standby queries. You
>          * might think that vacuum records would conflict as well, but
>          * we've handled that already. XLOG_HEAP2_CLEANUP_INFO records
>          * provide the highest xid cleaned by the vacuum of the heap
>          * and so we can resolve any conflicts just once when that
>          * arrives. After that any we know that no conflicts exist
>          * from individual btree vacuum records on that index.
>          */
>         {
>             TransactionId latestRemovedXid = btree_xlog_delete_get_latestRemovedXid(record);
>             xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
> 
>             /*
>              * XXX Currently we put everybody on death row, because
>              * currently _bt_delitems() supplies InvalidTransactionId.
>              * This can be fairly painful, so providing a better value
>              * here is worth some thought and possibly some effort to
>              * improve.
>              */
>             ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->node);
>         }
>         break;

The XXX comment is out-of-date, the latestRemovedXid value is calculated
by btree_xlog_delete_get_latestRemovedXid() nowadays.

If we're re-replaying the WAL record, for example after restarting the
standby server, btree_xlog_delete_get_latestRemovedXid() won't find the
deleted records and will return InvalidTransactionId. That's OK, because
until we reach again the point in WAL where we were before the restart,
we don't accept read-only connections so there's no-one to kill anyway,
but you do get a useless "Invalid latestRemovedXid reported, using
latestCompletedXid instead" message in the log (that shouldn't be
capitalized, BTW).

It would be nice to check if there's any potentially conflicting
read-only queries before calling
btree_xlog_delete_get_latestRemovedXid(), which can be quite expensive.

If the "Invalid latestRemovedXid reported, using latestCompletedXid
instead" message is going to happen commonly, I think it should be
downgraded to DEBUG1. If it's an unexpected scenario, it should be
upgraded to WARNING.

In btree_xlog_delete_get_latestRemovedXid:
>     Assert(num_unused == 0);

Can't that happen as well in a re-replay scenario, if a heap item was
vacuumed away later on?

>     /*
>      * Note that if all heap tuples were LP_DEAD then we will be
>      * returning InvalidTransactionId here. This seems very unlikely
>      * in practice.
>      */

If none of the removed heap tuples were present anymore, we currently
return InvalidTransactionId, which kills/waits out all read-only
queries. But if none of the tuples were present anymore, the read-only
queries wouldn't have seen them anyway, so ISTM that we should treat
InvalidTransactionId return value as "we don't need to kill anyone".

Why does btree_xlog_delete_get_latestRemovedXid() keep the
num_unused/num_dead/num_redirect counts, it doesn't actually do anything
with them.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby b-tree delete records review

From
Simon Riggs
Date:
On Thu, 2010-04-22 at 10:24 +0300, Heikki Linnakangas wrote:
> btree_redo:
> >     case XLOG_BTREE_DELETE:
> > 
> >         /*
> >          * Btree delete records can conflict with standby queries. You
> >          * might think that vacuum records would conflict as well, but
> >          * we've handled that already. XLOG_HEAP2_CLEANUP_INFO records
> >          * provide the highest xid cleaned by the vacuum of the heap
> >          * and so we can resolve any conflicts just once when that
> >          * arrives. After that any we know that no conflicts exist
> >          * from individual btree vacuum records on that index.
> >          */
> >         {
> >             TransactionId latestRemovedXid = btree_xlog_delete_get_latestRemovedXid(record);
> >             xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
> > 
> >             /*
> >              * XXX Currently we put everybody on death row, because
> >              * currently _bt_delitems() supplies InvalidTransactionId.
> >              * This can be fairly painful, so providing a better value
> >              * here is worth some thought and possibly some effort to
> >              * improve.
> >              */
> >             ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->node);
> >         }
> >         break;
> 
> The XXX comment is out-of-date, the latestRemovedXid value is calculated
> by btree_xlog_delete_get_latestRemovedXid() nowadays.

Removed, thanks.

> If we're re-replaying the WAL record, for example after restarting the
> standby server, btree_xlog_delete_get_latestRemovedXid() won't find the
> deleted records and will return InvalidTransactionId. That's OK, because
> until we reach again the point in WAL where we were before the restart,
> we don't accept read-only connections so there's no-one to kill anyway,
> but you do get a useless "Invalid latestRemovedXid reported, using
> latestCompletedXid instead" message in the log (that shouldn't be
> capitalized, BTW).

> It would be nice to check if there's any potentially conflicting
> read-only queries before calling
> btree_xlog_delete_get_latestRemovedXid(), which can be quite expensive.

Good idea. You're welcome to add such tuning yourself, if you like.

> If the "Invalid latestRemovedXid reported, using latestCompletedXid
> instead" message is going to happen commonly, I think it should be
> downgraded to DEBUG1. If it's an unexpected scenario, it should be
> upgraded to WARNING.

Set to DEBUG because the above optimisation makes it return invalid much
more frequently, which we don't want reported.

> In btree_xlog_delete_get_latestRemovedXid:
> >     Assert(num_unused == 0);
> 
> Can't that happen as well in a re-replay scenario, if a heap item was
> vacuumed away later on?

OK, will remove. The re-replay gets me every time.

> >     /*
> >      * Note that if all heap tuples were LP_DEAD then we will be
> >      * returning InvalidTransactionId here. This seems very unlikely
> >      * in practice.
> >      */
> 
> If none of the removed heap tuples were present anymore, we currently
> return InvalidTransactionId, which kills/waits out all read-only
> queries. But if none of the tuples were present anymore, the read-only
> queries wouldn't have seen them anyway, so ISTM that we should treat
> InvalidTransactionId return value as "we don't need to kill anyone".

That's not the point. The tuples were not themselves the sole focus,
they indicated the latestRemovedXid of the backend on the primary that
had performed the deletion. So even if those tuples are no longer
present there may be others with similar xids that would conflict, so we
cannot ignore. Comment updated.

> Why does btree_xlog_delete_get_latestRemovedXid() keep the
> num_unused/num_dead/num_redirect counts, it doesn't actually do anything
> with them.

Probably a debug tool. Removed.

Changes committed.

Thanks for the review.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby b-tree delete records review

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2010-04-22 at 10:24 +0300, Heikki Linnakangas wrote:
>> btree_redo:
>>>     /*
>>>      * Note that if all heap tuples were LP_DEAD then we will be
>>>      * returning InvalidTransactionId here. This seems very unlikely
>>>      * in practice.
>>>      */
>> If none of the removed heap tuples were present anymore, we currently
>> return InvalidTransactionId, which kills/waits out all read-only
>> queries. But if none of the tuples were present anymore, the read-only
>> queries wouldn't have seen them anyway, so ISTM that we should treat
>> InvalidTransactionId return value as "we don't need to kill anyone".
> 
> That's not the point. The tuples were not themselves the sole focus,

Yes, they were. We're replaying a b-tree deletion record, which removes
pointers to some heap tuples, making them unreachable to any read-only
queries. If any of them still need to be visible to read-only queries,
we have a conflict. But if all of the heap tuples are gone already,
removing the index pointers to them can'ẗ change the situation for any
query. If any of them should've been visible to a query, the damage was
done already by whoever pruned the heap tuples leaving just the
tombstone LP_DEAD item pointers (in the heap) behind.

Or do we use the latestRemovedXid value for something else as well?

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby b-tree delete records review

From
Simon Riggs
Date:
On Thu, 2010-04-22 at 11:28 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2010-04-22 at 10:24 +0300, Heikki Linnakangas wrote:
> >> btree_redo:
> >>>     /*
> >>>      * Note that if all heap tuples were LP_DEAD then we will be
> >>>      * returning InvalidTransactionId here. This seems very unlikely
> >>>      * in practice.
> >>>      */
> >> If none of the removed heap tuples were present anymore, we currently
> >> return InvalidTransactionId, which kills/waits out all read-only
> >> queries. But if none of the tuples were present anymore, the read-only
> >> queries wouldn't have seen them anyway, so ISTM that we should treat
> >> InvalidTransactionId return value as "we don't need to kill anyone".
> > 
> > That's not the point. The tuples were not themselves the sole focus,
> 
> Yes, they were. We're replaying a b-tree deletion record, which removes
> pointers to some heap tuples, making them unreachable to any read-only
> queries. If any of them still need to be visible to read-only queries,
> we have a conflict. But if all of the heap tuples are gone already,
> removing the index pointers to them can'ẗ change the situation for any
> query. If any of them should've been visible to a query, the damage was
> done already by whoever pruned the heap tuples leaving just the
> tombstone LP_DEAD item pointers (in the heap) behind.

You're missing my point. Those tuples are indicators of what may lie
elsewhere in the database, completely unreferenced by this WAL record.
Just because these referenced tuples are gone doesn't imply that all
tuple versions written by the as yet-unknown-xids are also gone. We
can't infer anything about the whole database just from one small group
of records.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby b-tree delete records review

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2010-04-22 at 11:28 +0300, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Thu, 2010-04-22 at 10:24 +0300, Heikki Linnakangas wrote:
>>>> btree_redo:
>>>>>     /*
>>>>>      * Note that if all heap tuples were LP_DEAD then we will be
>>>>>      * returning InvalidTransactionId here. This seems very unlikely
>>>>>      * in practice.
>>>>>      */
>>>> If none of the removed heap tuples were present anymore, we currently
>>>> return InvalidTransactionId, which kills/waits out all read-only
>>>> queries. But if none of the tuples were present anymore, the read-only
>>>> queries wouldn't have seen them anyway, so ISTM that we should treat
>>>> InvalidTransactionId return value as "we don't need to kill anyone".
>>> That's not the point. The tuples were not themselves the sole focus,
>> Yes, they were. We're replaying a b-tree deletion record, which removes
>> pointers to some heap tuples, making them unreachable to any read-only
>> queries. If any of them still need to be visible to read-only queries,
>> we have a conflict. But if all of the heap tuples are gone already,
>> removing the index pointers to them can'ẗ change the situation for any
>> query. If any of them should've been visible to a query, the damage was
>> done already by whoever pruned the heap tuples leaving just the
>> tombstone LP_DEAD item pointers (in the heap) behind.
> 
> You're missing my point. Those tuples are indicators of what may lie
> elsewhere in the database, completely unreferenced by this WAL record.
> Just because these referenced tuples are gone doesn't imply that all
> tuple versions written by the as yet-unknown-xids are also gone. We
> can't infer anything about the whole database just from one small group
> of records.

Have you got an example of that?

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby b-tree delete records review

From
Simon Riggs
Date:
On Thu, 2010-04-22 at 11:56 +0300, Heikki Linnakangas wrote:

> >>>> If none of the removed heap tuples were present anymore, we currently
> >>>> return InvalidTransactionId, which kills/waits out all read-only
> >>>> queries. But if none of the tuples were present anymore, the read-only
> >>>> queries wouldn't have seen them anyway, so ISTM that we should treat
> >>>> InvalidTransactionId return value as "we don't need to kill anyone".
> >>> That's not the point. The tuples were not themselves the sole focus,
> >> Yes, they were. We're replaying a b-tree deletion record, which removes
> >> pointers to some heap tuples, making them unreachable to any read-only
> >> queries. If any of them still need to be visible to read-only queries,
> >> we have a conflict. But if all of the heap tuples are gone already,
> >> removing the index pointers to them can'ẗ change the situation for any
> >> query. If any of them should've been visible to a query, the damage was
> >> done already by whoever pruned the heap tuples leaving just the
> >> tombstone LP_DEAD item pointers (in the heap) behind.
> > 
> > You're missing my point. Those tuples are indicators of what may lie
> > elsewhere in the database, completely unreferenced by this WAL record.
> > Just because these referenced tuples are gone doesn't imply that all
> > tuple versions written by the as yet-unknown-xids are also gone. We
> > can't infer anything about the whole database just from one small group
> > of records.
> 
> Have you got an example of that?

I don't need one, I have suggested the safe route. In order to infer
anything, and thereby further optimise things, we would need proof that
no cases can exist, which I don't have. Perhaps we can add "yet", not
sure about that either.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby b-tree delete records review

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2010-04-22 at 11:56 +0300, Heikki Linnakangas wrote:
> 
>>>>>> If none of the removed heap tuples were present anymore, we currently
>>>>>> return InvalidTransactionId, which kills/waits out all read-only
>>>>>> queries. But if none of the tuples were present anymore, the read-only
>>>>>> queries wouldn't have seen them anyway, so ISTM that we should treat
>>>>>> InvalidTransactionId return value as "we don't need to kill anyone".
>>>>> That's not the point. The tuples were not themselves the sole focus,
>>>> Yes, they were. We're replaying a b-tree deletion record, which removes
>>>> pointers to some heap tuples, making them unreachable to any read-only
>>>> queries. If any of them still need to be visible to read-only queries,
>>>> we have a conflict. But if all of the heap tuples are gone already,
>>>> removing the index pointers to them can'ẗ change the situation for any
>>>> query. If any of them should've been visible to a query, the damage was
>>>> done already by whoever pruned the heap tuples leaving just the
>>>> tombstone LP_DEAD item pointers (in the heap) behind.
>>> You're missing my point. Those tuples are indicators of what may lie
>>> elsewhere in the database, completely unreferenced by this WAL record.
>>> Just because these referenced tuples are gone doesn't imply that all
>>> tuple versions written by the as yet-unknown-xids are also gone. We
>>> can't infer anything about the whole database just from one small group
>>> of records.
>> Have you got an example of that?
> 
> I don't need one, I have suggested the safe route. In order to infer
> anything, and thereby further optimise things, we would need proof that
> no cases can exist, which I don't have. Perhaps we can add "yet", not
> sure about that either.

It's good to be safe rather than sorry, but I'd still like to know
because I'm quite surprised by that, and got me worried that I don't
understand how hot standby works as well as I thought I did. I thought
the point of stopping replay/killing queries at a b-tree deletion record
is precisely that it makes some heap tuples invisible to running
read-only queries. If it doesn't make any tuples invisible, why do any
queries need to be killed? And why was it OK for them to be running just
before replaying the b-tree deletion record?

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby b-tree delete records review

From
Simon Riggs
Date:
On Thu, 2010-04-22 at 12:18 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2010-04-22 at 11:56 +0300, Heikki Linnakangas wrote:
> > 
> >>>>>> If none of the removed heap tuples were present anymore, we currently
> >>>>>> return InvalidTransactionId, which kills/waits out all read-only
> >>>>>> queries. But if none of the tuples were present anymore, the read-only
> >>>>>> queries wouldn't have seen them anyway, so ISTM that we should treat
> >>>>>> InvalidTransactionId return value as "we don't need to kill anyone".
> >>>>> That's not the point. The tuples were not themselves the sole focus,
> >>>> Yes, they were. We're replaying a b-tree deletion record, which removes
> >>>> pointers to some heap tuples, making them unreachable to any read-only
> >>>> queries. If any of them still need to be visible to read-only queries,
> >>>> we have a conflict. But if all of the heap tuples are gone already,
> >>>> removing the index pointers to them can'ẗ change the situation for any
> >>>> query. If any of them should've been visible to a query, the damage was
> >>>> done already by whoever pruned the heap tuples leaving just the
> >>>> tombstone LP_DEAD item pointers (in the heap) behind.
> >>> You're missing my point. Those tuples are indicators of what may lie
> >>> elsewhere in the database, completely unreferenced by this WAL record.
> >>> Just because these referenced tuples are gone doesn't imply that all
> >>> tuple versions written by the as yet-unknown-xids are also gone. We
> >>> can't infer anything about the whole database just from one small group
> >>> of records.
> >> Have you got an example of that?
> > 
> > I don't need one, I have suggested the safe route. In order to infer
> > anything, and thereby further optimise things, we would need proof that
> > no cases can exist, which I don't have. Perhaps we can add "yet", not
> > sure about that either.
> 
> It's good to be safe rather than sorry, but I'd still like to know
> because I'm quite surprised by that, and got me worried that I don't
> understand how hot standby works as well as I thought I did. I thought
> the point of stopping replay/killing queries at a b-tree deletion record
> is precisely that it makes some heap tuples invisible to running
> read-only queries. If it doesn't make any tuples invisible, why do any
> queries need to be killed? And why was it OK for them to be running just
> before replaying the b-tree deletion record?

I'm sorry but I'm too busy to talk further on this today. Since we are
discussing a further optimisation rather than a bug, I hope it is OK to
come back to this again later.

-- Simon Riggs           www.2ndQuadrant.com



Re: Hot Standby b-tree delete records review

From
Heikki Linnakangas
Date:
(cleaning up my inbox, and bumped into this..)

On 22.04.2010 12:31, Simon Riggs wrote:
> On Thu, 2010-04-22 at 12:18 +0300, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Thu, 2010-04-22 at 11:56 +0300, Heikki Linnakangas wrote:
>>>
>>>>>>>> If none of the removed heap tuples were present anymore, we currently
>>>>>>>> return InvalidTransactionId, which kills/waits out all read-only
>>>>>>>> queries. But if none of the tuples were present anymore, the read-only
>>>>>>>> queries wouldn't have seen them anyway, so ISTM that we should treat
>>>>>>>> InvalidTransactionId return value as "we don't need to kill anyone".
>>>>>>> That's not the point. The tuples were not themselves the sole focus,
>>>>>> Yes, they were. We're replaying a b-tree deletion record, which removes
>>>>>> pointers to some heap tuples, making them unreachable to any read-only
>>>>>> queries. If any of them still need to be visible to read-only queries,
>>>>>> we have a conflict. But if all of the heap tuples are gone already,
>>>>>> removing the index pointers to them can'ẗ change the situation for any
>>>>>> query. If any of them should've been visible to a query, the damage was
>>>>>> done already by whoever pruned the heap tuples leaving just the
>>>>>> tombstone LP_DEAD item pointers (in the heap) behind.
>>>>> You're missing my point. Those tuples are indicators of what may lie
>>>>> elsewhere in the database, completely unreferenced by this WAL record.
>>>>> Just because these referenced tuples are gone doesn't imply that all
>>>>> tuple versions written by the as yet-unknown-xids are also gone. We
>>>>> can't infer anything about the whole database just from one small group
>>>>> of records.
>>>> Have you got an example of that?
>>>
>>> I don't need one, I have suggested the safe route. In order to infer
>>> anything, and thereby further optimise things, we would need proof that
>>> no cases can exist, which I don't have. Perhaps we can add "yet", not
>>> sure about that either.
>>
>> It's good to be safe rather than sorry, but I'd still like to know
>> because I'm quite surprised by that, and got me worried that I don't
>> understand how hot standby works as well as I thought I did. I thought
>> the point of stopping replay/killing queries at a b-tree deletion record
>> is precisely that it makes some heap tuples invisible to running
>> read-only queries. If it doesn't make any tuples invisible, why do any
>> queries need to be killed? And why was it OK for them to be running just
>> before replaying the b-tree deletion record?
>
> I'm sorry but I'm too busy to talk further on this today. Since we are
> discussing a further optimisation rather than a bug, I hope it is OK to
> come back to this again later.

Would now be a good time to revisit this? I still don't see why a b-tree 
deletion record should conflict with anything, if all the removed index 
tuples point to just LP_DEAD tombstones in the heap.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby b-tree delete records review

From
Simon Riggs
Date:
On Tue, 2010-11-09 at 13:34 +0200, Heikki Linnakangas wrote:
> (cleaning up my inbox, and bumped into this..)
> 
> On 22.04.2010 12:31, Simon Riggs wrote:
> > On Thu, 2010-04-22 at 12:18 +0300, Heikki Linnakangas wrote:
> >> Simon Riggs wrote:
> >>> On Thu, 2010-04-22 at 11:56 +0300, Heikki Linnakangas wrote:
> >>>
> >>>>>>>> If none of the removed heap tuples were present anymore, we currently
> >>>>>>>> return InvalidTransactionId, which kills/waits out all read-only
> >>>>>>>> queries. But if none of the tuples were present anymore, the read-only
> >>>>>>>> queries wouldn't have seen them anyway, so ISTM that we should treat
> >>>>>>>> InvalidTransactionId return value as "we don't need to kill anyone".
> >>>>>>> That's not the point. The tuples were not themselves the sole focus,
> >>>>>> Yes, they were. We're replaying a b-tree deletion record, which removes
> >>>>>> pointers to some heap tuples, making them unreachable to any read-only
> >>>>>> queries. If any of them still need to be visible to read-only queries,
> >>>>>> we have a conflict. But if all of the heap tuples are gone already,
> >>>>>> removing the index pointers to them can'ẗ change the situation for any
> >>>>>> query. If any of them should've been visible to a query, the damage was
> >>>>>> done already by whoever pruned the heap tuples leaving just the
> >>>>>> tombstone LP_DEAD item pointers (in the heap) behind.
> >>>>> You're missing my point. Those tuples are indicators of what may lie
> >>>>> elsewhere in the database, completely unreferenced by this WAL record.
> >>>>> Just because these referenced tuples are gone doesn't imply that all
> >>>>> tuple versions written by the as yet-unknown-xids are also gone. We
> >>>>> can't infer anything about the whole database just from one small group
> >>>>> of records.
> >>>> Have you got an example of that?
> >>>
> >>> I don't need one, I have suggested the safe route. In order to infer
> >>> anything, and thereby further optimise things, we would need proof that
> >>> no cases can exist, which I don't have. Perhaps we can add "yet", not
> >>> sure about that either.
> >>
> >> It's good to be safe rather than sorry, but I'd still like to know
> >> because I'm quite surprised by that, and got me worried that I don't
> >> understand how hot standby works as well as I thought I did. I thought
> >> the point of stopping replay/killing queries at a b-tree deletion record
> >> is precisely that it makes some heap tuples invisible to running
> >> read-only queries. If it doesn't make any tuples invisible, why do any
> >> queries need to be killed? And why was it OK for them to be running just
> >> before replaying the b-tree deletion record?
> >
> > I'm sorry but I'm too busy to talk further on this today. Since we are
> > discussing a further optimisation rather than a bug, I hope it is OK to
> > come back to this again later.
> 
> Would now be a good time to revisit this? I still don't see why a b-tree 
> deletion record should conflict with anything, if all the removed index 
> tuples point to just LP_DEAD tombstones in the heap.

I want what you say to be true. The question is: is it? We just need to
explain why that will never be a problem.

-- Simon Riggs           http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services