Re: POC: Lock updated tuples in tuple_update() and tuple_delete() - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: POC: Lock updated tuples in tuple_update() and tuple_delete()
Date
Msg-id CAPpHfdtwKb5UVXKkDQZYW8nQCODy0fY_S7mV5Z+cg7urL=zDEA@mail.gmail.com
In response to Re: POC: Lock updated tuples in tuple_update() and tuple_delete()  (Andres Freund <andres@anarazel.de>)
Responses Re: POC: Lock updated tuples in tuple_update() and tuple_delete()  (Pavel Borisov <pashkin.elfe@gmail.com>)
List pgsql-hackers
Hi, Andres.

Thank you for your review.  Sorry for the late reply.  It took me some
time to figure out how to revise the patch.

The revised patchset is attached.  I decided to split the patch into two:
1) Avoid re-fetching the "original" row version during update and delete.
2) Save effort by re-using the existing context of
tuple_update()/tuple_delete() for locking the tuple.
These are two separate optimizations, so let's evaluate their
performance separately.

On Tue, Jan 10, 2023 at 4:07 AM Andres Freund <andres@anarazel.de> wrote:
> I'm a bit worried that this is optimizing the rare case while hurting the
> common case. See e.g. my point below about creating additional slots in the
> happy path.

This makes sense.  It's worth allocating the slot only if we're going
to store a tuple there.  I implemented this by passing a callback for
slot allocation instead of the slot itself.

> It's also not clear that change is right directionally. If we want to avoid
> re-fetching the "original" row version, why don't we provide that
> functionality via table_tuple_lock()?

These are two distinct optimizations.  Now, they come as two distinct patches.

> On 2023-01-09 13:29:18 +0300, Alexander Korotkov wrote:
> > @@ -53,6 +53,12 @@ static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
> >                                                                  HeapTuple tuple,
> >                                                                  OffsetNumber tupoffset);
> >
> > +static TM_Result heapam_tuple_lock_internal(Relation relation, ItemPointer tid,
> > +                                            Snapshot snapshot, TupleTableSlot *slot,
> > +                                            CommandId cid, LockTupleMode mode,
> > +                                            LockWaitPolicy wait_policy, uint8 flags,
> > +                                            TM_FailureData *tmfd, bool updated);
> > +
> >  static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
> >
> >  static const TableAmRoutine heapam_methods;
> > @@ -299,14 +305,39 @@ heapam_tuple_complete_speculative(Relation relation, TupleTableSlot *slot,
> >  static TM_Result
> >  heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
> >                                       Snapshot snapshot, Snapshot crosscheck, bool wait,
> > -                                     TM_FailureData *tmfd, bool changingPart)
> > +                                     TM_FailureData *tmfd, bool changingPart,
> > +                                     TupleTableSlot *lockedSlot)
> >  {
> > +     TM_Result       result;
> > +
> >       /*
> >        * Currently Deleting of index tuples are handled at vacuum, in case if
> >        * the storage itself is cleaning the dead tuples by itself, it is the
> >        * time to call the index tuple deletion also.
> >        */
> > -     return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
> > +     result = heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
> > +
> > +     /*
> > +      * If the tuple has been concurrently updated, get lock already so that on
> > +      * retry it will succeed, provided that the caller asked to do this by
> > +      * providing a lockedSlot.
> > +      */
> > +     if (result == TM_Updated && lockedSlot != NULL)
> > +     {
> > +             result = heapam_tuple_lock_internal(relation, tid, snapshot,
> > +                                                 lockedSlot, cid, LockTupleExclusive,
> > +                                                 LockWaitBlock,
> > +                                                 TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
> > +                                                 tmfd, true);
>
> You're ignoring the 'wait' parameter here, no? I think the modification to
> heapam_tuple_update() has the same issue.

Yep.  I didn't catch this because, currently, we always call
tuple_update()/tuple_delete() with wait == true.  Fixed.

> > +             if (result == TM_Ok)
> > +             {
> > +                     tmfd->traversed = true;
> > +                     return TM_Updated;
> > +             }
> > +     }
> > +
> > +     return result;
>
> Doesn't this mean that the caller can't easily distinguish between
> heapam_tuple_delete() and heapam_tuple_lock_internal() returning a failure
> state?

Exactly.  But nodeModifyTable.c currently handles these failure states
in a similar way, and I don't see why it should be different in the
future.

> > @@ -350,213 +402,8 @@ heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
> >                                 LockWaitPolicy wait_policy, uint8 flags,
> >                                 TM_FailureData *tmfd)
> >  {
>
> Moving the entire body of the function around, makes it harder to review
> this change, because the code movement is intermingled with "actual" changes.

OK, fixed.

> > +/*
> > + * This routine does the work for heapam_tuple_lock(), but also support
> > + * `updated` to re-use the work done by heapam_tuple_update() or
> > + * heapam_tuple_delete() on fetching tuple and checking its visibility.
> > + */
> > +static TM_Result
> > +heapam_tuple_lock_internal(Relation relation, ItemPointer tid, Snapshot snapshot,
> > +                                                TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
> > +                                                LockWaitPolicy wait_policy, uint8 flags,
> > +                                                TM_FailureData *tmfd, bool updated)
> > +{
> > +     BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
> > +     TM_Result       result;
> > +     Buffer          buffer = InvalidBuffer;
> > +     HeapTuple       tuple = &bslot->base.tupdata;
> > +     bool            follow_updates;
> > +
> > +     follow_updates = (flags & TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS) != 0;
> > +     tmfd->traversed = false;
> > +
> > +     Assert(TTS_IS_BUFFERTUPLE(slot));
> > +
> > +tuple_lock_retry:
> > +     tuple->t_self = *tid;
> > +     if (!updated)
> > +             result = heap_lock_tuple(relation, tuple, cid, mode, wait_policy,
> > +                                                              follow_updates, &buffer, tmfd);
> > +     else
> > +             result = TM_Updated;
>
> To make sure I understand: You're basically trying to have
> heapam_tuple_lock_internal() work as before, except that you want to omit
> fetching the first row version, assuming that the caller already tried to lock
> it?
>
> I think at the very this needs an assert verifying that the slot actually
> contains a tuple in the "updated" path.

This part was re-written.

> > +     if (result == TM_Updated &&
> > +             (flags & TUPLE_LOCK_FLAG_FIND_LAST_VERSION))
> > +     {
> > +             if (!updated)
> > +             {
> > +                     /* Should not encounter speculative tuple on recheck */
> > +                     Assert(!HeapTupleHeaderIsSpeculative(tuple->t_data));
>
> Why shouldn't this be checked in the updated case as well?
>
>
> > @@ -1490,7 +1492,16 @@ ExecDelete(ModifyTableContext *context,
> >                * transaction-snapshot mode transactions.
> >                */
> >  ldelete:
> > -             result = ExecDeleteAct(context, resultRelInfo, tupleid, changingPart);
> > +
> > +             /*
> > +              * Ask ExecDeleteAct() to immediately place the lock on the updated
> > +              * tuple if we will need EvalPlanQual() in that case to handle it.
> > +              */
> > +             if (!IsolationUsesXactSnapshot())
> > +                     slot = ExecGetReturningSlot(estate, resultRelInfo);
> > +
> > +             result = ExecDeleteAct(context, resultRelInfo, tupleid, changingPart,
> > +                                                        slot);
>
> I don't like that 'slot' is now used for multiple things. I think this could
> best be addressed by simply moving the slot variable inside the blocks using
> it. And here it should be named more accurately.

I didn't do that refactoring.  But the edits introduced by the 1st
patch of the set are now more granular and don't affect the usage of
the 'slot' variable.

> Is there a potential conflict with other uses of the ExecGetReturningSlot()?

Yep.  The current revision avoids this ad-hoc usage of slots.

> Given that we now always create the slot, doesn't this increase the overhead
> for the very common case of not needing EPQ? We'll create unnecessary slots
> all the time, no?

Yes, this is addressed by allocating the EPQ slot via a callback only
once it is needed.  I'm thinking about wrapping this into an
abstraction called 'LazySlot'.

> >                                        */
> >                                       EvalPlanQualBegin(context->epqstate);
> >                                       inputslot = EvalPlanQualSlot(context->epqstate, resultRelationDesc,
> >                                                                    resultRelInfo->ri_RangeTableIndex);
> > +                                     ExecCopySlot(inputslot, slot);
>
> > -                                     result = table_tuple_lock(resultRelationDesc, tupleid,
> > -                                                               estate->es_snapshot,
> > -                                                               inputslot, estate->es_output_cid,
> > -                                                               LockTupleExclusive, LockWaitBlock,
> > -                                                               TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
> > -                                                               &context->tmfd);
> > +                                     Assert(context->tmfd.traversed);
> > +                                     epqslot = EvalPlanQual(context->epqstate,
> > +                                                            resultRelationDesc,
> > +                                                            resultRelInfo->ri_RangeTableIndex,
> > +                                                            inputslot);
>
> The only point of using EvalPlanQualSlot() is to avoid copying the tuple from
> one slot to another. Given that we're not benefiting from that anymore (due to
> your manual ExecCopySlot() call), it seems we could just pass 'slot' to
> EvalPlanQual() and not bother with EvalPlanQualSlot().

This makes sense.  The usage pattern of the slots is now clearer.

> > @@ -1449,6 +1451,8 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
> >   *   tmfd - filled in failure cases (see below)
> >   *   changingPart - true iff the tuple is being moved to another partition
> >   *           table due to an update of the partition key. Otherwise, false.
> > + *   lockedSlot - slot to save the locked tuple if should lock the last row
> > + *           version during the concurrent update. NULL if not needed.
>
> The grammar in the new comments is off ("if should lock").
>
> I think this is also needs to mention that this *significantly* changes the
> behaviour of table_tuple_delete(). That's not at all clear from the comment.

Let's see the performance results for the patchset.  I'll properly
revise the comments if the results are good.

Pavel, could you please re-run your tests over the revised patchset?

------
Regards,
Alexander Korotkov

Attachment
