Re: POC: Lock updated tuples in tuple_update() and tuple_delete() - Mailing list pgsql-hackers

From: Alexander Korotkov
Subject: Re: POC: Lock updated tuples in tuple_update() and tuple_delete()
Date:
Msg-id: CAPpHfduCrfzg0HgiZCbVog+sHHmiDO1M_sUouCH1cAJrY__OnA@mail.gmail.com
In response to: Re: POC: Lock updated tuples in tuple_update() and tuple_delete() (Andres Freund <andres@anarazel.de>)
Responses: Re: POC: Lock updated tuples in tuple_update() and tuple_delete() (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hi!

On Thu, Mar 23, 2023 at 3:30 AM Andres Freund <andres@anarazel.de> wrote:
> On 2023-03-21 01:25:11 +0300, Alexander Korotkov wrote:
> > I'm going to push patchset v15 if no objections.
>
> Just saw that this went in - didn't catch up with the thread before,
> unfortunately. At the very least I'd like to see some more work on cleaning up
> the lazy tuple slot stuff. It's replete with unnecessary multiple-evaluation
> hazards - I realize that there's some of those already, but I don't think we
> should go further down that route. As far as I can tell there's no need for
> any of this to be macros.

Thank you for taking a look at this, even post-commit.  Regarding the
macros, do you think inline functions would be a good replacement?
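Roughly, I imagine the transformation like this (a minimal sketch; the
struct layout and names are illustrative here, not the committed ones):

    typedef struct LazyTupleTableSlot
    {
        TupleTableSlot *slot;       /* filled in on first use */
        TupleTableSlot *(*getSlot) (void *arg); /* creates the slot */
        void       *arg;            /* argument for the callback */
    } LazyTupleTableSlot;

    /*
     * Macro version: 'lazySlot' is evaluated several times, so an
     * argument with side effects, e.g. LAZY_TTS_EVAL(slots[i++]),
     * misbehaves.
     */
    #define LAZY_TTS_EVAL(lazySlot) \
        ((lazySlot) == NULL ? NULL : \
         ((lazySlot)->slot != NULL ? (lazySlot)->slot : \
          ((lazySlot)->slot = (lazySlot)->getSlot((lazySlot)->arg))))

    /*
     * Inline-function version: the argument is evaluated exactly once,
     * and the compiler still inlines the body.
     */
    static inline TupleTableSlot *
    lazy_tts_eval(LazyTupleTableSlot *lazySlot)
    {
        if (lazySlot == NULL)
            return NULL;
        if (lazySlot->slot == NULL)
            lazySlot->slot = lazySlot->getSlot(lazySlot->arg);
        return lazySlot->slot;
    }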

> > From a8b4e8a7b27815e013ea07b8cc9ac68541a9ac07 Mon Sep 17 00:00:00 2001
> > From: Alexander Korotkov <akorotkov@postgresql.org>
> > Date: Tue, 21 Mar 2023 00:34:15 +0300
> > Subject: [PATCH 1/2] Evade extra table_tuple_fetch_row_version() in
> >  ExecUpdate()/ExecDelete()
> >
> > When we lock a tuple using table_tuple_lock(), we also fetch the locked
> > tuple into the slot.  In that case we can skip the extra
> > table_tuple_fetch_row_version(), since we've already fetched the 'old'
> > tuple and nobody can change it concurrently while it's locked.
> >
> > Discussion: https://postgr.es/m/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com
> > Reviewed-by: Aleksander Alekseev, Pavel Borisov, Vignesh C, Mason Sharp
> > Reviewed-by: Andres Freund, Chris Travers
> > ---
> >  src/backend/executor/nodeModifyTable.c | 48 +++++++++++++++++++-------
> >  1 file changed, 35 insertions(+), 13 deletions(-)
> >
> > diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
> > index 3a673895082..93ebfdbb0d8 100644
> > --- a/src/backend/executor/nodeModifyTable.c
> > +++ b/src/backend/executor/nodeModifyTable.c
> > @@ -1559,6 +1559,22 @@ ldelete:
> >                                       {
> >                                               case TM_Ok:
> >                                                       Assert(context->tmfd.traversed);
> > +
> > +                                                     /*
> > +                                                      * Save locked tuple for further processing of
> > +                                                      * RETURNING clause.
> > +                                                      */
> > +                                                     if (processReturning &&
> > +                                                             resultRelInfo->ri_projectReturning &&
> > +                                                             !resultRelInfo->ri_FdwRoutine)
> > +                                                     {
> > +                                                             TupleTableSlot *returningSlot;
> > +
> > +                                                             returningSlot = ExecGetReturningSlot(estate, resultRelInfo);
> > +                                                             ExecCopySlot(returningSlot, inputslot);
> > +                                                             ExecMaterializeSlot(returningSlot);
> > +                                                     }
> > +
> >                                                       epqslot = EvalPlanQual(context->epqstate,
> >                                                                                                  resultRelationDesc,
> >                                                                                                  resultRelInfo->ri_RangeTableIndex,
>
> This seems a bit byzantine. We use inputslot = EvalPlanQualSlot(...) to make
> EvalPlanQual() a bit cheaper, because that avoids a slot copy inside
> EvalPlanQual().  But now we copy and materialize that slot anyway - and we do
> so even if EPQ fails. And we afaics also do it when epqreturnslot is set, in
> which case we'll afaics never use the copied slot.

Yes, I agree that there is a redundancy we could avoid.
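One possible shape (just a sketch, glossing over the FDW and
cross-partition details) would be to run EvalPlanQual() first and copy
into the returning slot only on success, and only when epqreturnslot
won't give us the tuple anyway:

    epqslot = EvalPlanQual(context->epqstate,
                           resultRelationDesc,
                           resultRelInfo->ri_RangeTableIndex,
                           inputslot);
    if (!TupIsNull(epqslot) &&
        processReturning &&
        resultRelInfo->ri_projectReturning &&
        !resultRelInfo->ri_FdwRoutine &&
        epqreturnslot == NULL)
    {
        TupleTableSlot *returningSlot;

        /* EPQ succeeded; save the locked tuple for RETURNING only now. */
        returningSlot = ExecGetReturningSlot(estate, resultRelInfo);
        ExecCopySlot(returningSlot, inputslot);
        ExecMaterializeSlot(returningSlot);
    }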

> Read the next paragraph below before replying to the above - I don't think
> this is right for other reasons:
>
> > @@ -1673,12 +1689,17 @@ ldelete:
> >               }
> >               else
> >               {
> > +                     /*
> > +                      * The tuple may already have been fetched into the returning
> > +                      * slot if we previously locked it.  Fetch the tuple only if
> > +                      * the slot is empty.
> > +                      */
> >                       slot = ExecGetReturningSlot(estate, resultRelInfo);
> >                       if (oldtuple != NULL)
> >                       {
> >                               ExecForceStoreHeapTuple(oldtuple, slot, false);
> >                       }
> > -                     else
> > +                     else if (TupIsNull(slot))
> >                       {
> >                               if (!table_tuple_fetch_row_version(resultRelationDesc, tupleid,
> >                                                                                                  SnapshotAny, slot))
>
>
> I don't think this is correct as-is - what if ExecDelete() is called with some
> older tuple in the returning slot? If we don't enter the TM_Updated path, it
> won't get updated, and we'll return the wrong tuple.  It certainly looks
> possible to me - consider what happens if a first tuple enters the TM_Updated
> path but then fails EvalPlanQual(). If a second tuple is deleted without
> entering the TM_Updated path, the wrong tuple will be used for RETURNING.
>
> <plays around with isolationtester>
>
> Yes, indeed. The attached isolationtest breaks with 764da7710bf.

Thank you for catching this!  This is definitely a bug.

> I think it's entirely sensible to avoid the tuple fetching in ExecDelete(),
> but it needs a bit less localized work. Instead of using the presence of a
> tuple in the returning slot, ExecDelete() should track whether it already has
> fetched the deleted tuple.
>
> Or alternatively, do the work to avoid refetching the tuple for the much more
> common case of not needing EPQ at all.
>
> I guess this really is part of my issue with this change - it optimizes the
> rare case, while not addressing the same inefficiency in the common case.

I'm going to fix this for ExecDelete().  Avoiding refetching the tuple
in the more common case is something I'm definitely very interested in,
but I would leave that for the future.
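The fix I have in mind is roughly the following (a sketch; the flag
name is mine): make ExecDelete() track explicitly whether it has
already fetched the deleted tuple, instead of inferring that from the
emptiness of the shared returning slot.

    bool        tupleFetched = false;   /* already have the old tuple? */

    /* ... in the TM_Ok branch after locking the updated tuple ... */
    ExecCopySlot(returningSlot, inputslot);
    ExecMaterializeSlot(returningSlot);
    tupleFetched = true;

    /* ... later, when projecting the RETURNING clause ... */
    slot = ExecGetReturningSlot(estate, resultRelInfo);
    if (oldtuple != NULL)
        ExecForceStoreHeapTuple(oldtuple, slot, false);
    else if (!tupleFetched)
    {
        if (!table_tuple_fetch_row_version(resultRelationDesc, tupleid,
                                           SnapshotAny, slot))
            elog(ERROR, "failed to fetch deleted tuple for DELETE RETURNING");
    }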

> > @@ -299,14 +305,46 @@ heapam_tuple_complete_speculative(Relation relation, TupleTableSlot *slot,
> >  static TM_Result
> >  heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
> >                                       Snapshot snapshot, Snapshot crosscheck, bool wait,
> > -                                     TM_FailureData *tmfd, bool changingPart)
> > +                                     TM_FailureData *tmfd, bool changingPart,
> > +                                     LazyTupleTableSlot *lockedSlot)
> >  {
> > +     TM_Result       result;
> > +
> >       /*
> >        * Currently Deleting of index tuples are handled at vacuum, in case if
> >        * the storage itself is cleaning the dead tuples by itself, it is the
> >        * time to call the index tuple deletion also.
> >        */
> > -     return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
> > +     result = heap_delete(relation, tid, cid, crosscheck, wait,
> > +                                              tmfd, changingPart);
> > +
> > +     /*
> > +      * If the tuple has been concurrently updated, then get the lock on it.
> > +      * (Do this if the caller asked for that by providing a 'lockedSlot'.)  With
> > +      * the lock held, a retry of the delete should succeed even if there are more
> > +      * concurrent update attempts.
> > +      */
> > +     if (result == TM_Updated && lockedSlot)
> > +     {
> > +             TupleTableSlot *evalSlot;
> > +
> > +             Assert(wait);
> > +
> > +             evalSlot = LAZY_TTS_EVAL(lockedSlot);
> > +             result = heapam_tuple_lock_internal(relation, tid, snapshot,
> > +                                                                                     evalSlot, cid, LockTupleExclusive,
> > +                                                                                     LockWaitBlock,
> > +                                                                                     TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
> > +                                                                                     tmfd, true);
>
> Oh, huh? As I mentioned before, the unconditional use of LockWaitBlock means
> the wait parameter is ignored.
>
> I'm frankly getting annoyed here.

lockedSlot shouldn't be provided when wait == false.  The assertion
above expresses this intention, but the code lacks a comment stating
it directly.

And sorry for getting you annoyed.  The relevant comment should have
been there from the start.
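Something along these lines (wording is mine):

    if (result == TM_Updated && lockedSlot)
    {
        /*
         * Callers must not provide lockedSlot when wait == false:
         * locking the updated tuple is implemented only for blocking
         * waits, and we pass LockWaitBlock unconditionally below.
         */
        Assert(wait);

        /* ... lock the updated tuple as before ... */
    }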

> > +/*
> > + * This routine does the work for heapam_tuple_lock(), but also supports
> > + * an `updated` argument to reuse the work done by heapam_tuple_update() or
> > + * heapam_tuple_delete() in figuring out that the tuple was concurrently
> > + * updated.
> > + */
> > +static TM_Result
> > +heapam_tuple_lock_internal(Relation relation, ItemPointer tid,
> > +                                                Snapshot snapshot, TupleTableSlot *slot,
> > +                                                CommandId cid, LockTupleMode mode,
> > +                                                LockWaitPolicy wait_policy, uint8 flags,
> > +                                                TM_FailureData *tmfd, bool updated)
>
> Why is the new parameter named 'updated'?

It indicates that we know we're locking a tuple that has just been
concurrently updated.  Probably not descriptive enough, though.

> >  {
> >       BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
> >       TM_Result       result;
> > -     Buffer          buffer;
> > +     Buffer          buffer = InvalidBuffer;
> >       HeapTuple       tuple = &bslot->base.tupdata;
> >       bool            follow_updates;
> >
> > @@ -374,16 +455,26 @@ heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
> >
> >  tuple_lock_retry:
> >       tuple->t_self = *tid;
> > -     result = heap_lock_tuple(relation, tuple, cid, mode, wait_policy,
> > -                                                      follow_updates, &buffer, tmfd);
> > +     if (!updated)
> > +             result = heap_lock_tuple(relation, tuple, cid, mode, wait_policy,
> > +                                                              follow_updates, &buffer, tmfd);
> > +     else
> > +             result = TM_Updated;
> >
> >       if (result == TM_Updated &&
> >               (flags & TUPLE_LOCK_FLAG_FIND_LAST_VERSION))
> >       {
> > -             /* Should not encounter speculative tuple on recheck */
> > -             Assert(!HeapTupleHeaderIsSpeculative(tuple->t_data));
> > +             if (!updated)
> > +             {
> > +                     /* Should not encounter speculative tuple on recheck */
> > +                     Assert(!HeapTupleHeaderIsSpeculative(tuple->t_data));
>
> Hm, why is it ok to encounter speculative tuples in the updated case? Oh, I
> guess you got failures because slot doesn't point anywhere at this point.

Yes, given that the tuple is not accessible at this point, this assert
can't work.  I have a couple of ideas on how to replace it.
1) As I understand it, the primary point of this assertion is to make
sure that tmfd->ctid really points to a correct tuple (which a
speculative token doesn't).  So, for the 'updated' case we could check
tmfd->ctid directly (or even do so for both cases?).  However, I can't
find a macro doing exactly this check.
2) We could check that heap_update() and heap_delete() don't return
TM_Updated when the old tuple is a speculative token.
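For (1), a sketch of what I mean is below.  ItemPointerIsValid() is the
closest existing check I see, though it's weaker than the original
assert, since it only verifies that the pointer is well-formed:

    if (!updated)
    {
        /* Should not encounter speculative tuple on recheck */
        Assert(!HeapTupleHeaderIsSpeculative(tuple->t_data));
        ReleaseBuffer(buffer);
    }
    else
    {
        /* We rely on tmfd->ctid pointing at the latest tuple version. */
        Assert(ItemPointerIsValid(&tmfd->ctid));
        updated = false;
    }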

> > -             ReleaseBuffer(buffer);
> > +                     ReleaseBuffer(buffer);
> > +             }
> > +             else
> > +             {
> > +                     updated = false;
> > +             }
> >
> >               if (!ItemPointerEquals(&tmfd->ctid, &tuple->t_self))
> >               {
>
> Which means this is completely bogus now?
>
>         HeapTuple       tuple = &bslot->base.tupdata;
>
> In the first iteration this just points to the newly created slot. Which
> doesn't have a tuple stored in it. So the above checks some uninitialized
> memory.

No, this is not so.  tuple->t_self is unconditionally assigned a few
lines earlier.  So, the comparison can't read uninitialized memory,
given that *tid is initialized.

I seriously doubt this patch could pass the tests if that comparison
used uninitialized memory.

> Giving up at this point.
>
>
> This doesn't seem ready to have been committed.

Yep, that could be better.  But given that the item pointer comparison
doesn't actually read uninitialized memory, it's probably not as bad as
it seemed at first glance.

I'm going to post a patch addressing the issues you've raised within
the next 24 hours.

------
Regards,
Alexander Korotkov


