Re: Buffer locking is special (hints, checksums, AIO writes) - Mailing list pgsql-hackers
| From | Andres Freund |
|---|---|
| Subject | Re: Buffer locking is special (hints, checksums, AIO writes) |
| Date | |
| Msg-id | 3nce7i72ayzkunai6mkz24ckbxk74jodz4ua2chcdrwppxlxcd@w6x5kfkjrkru |
| In response to | Re: Buffer locking is special (hints, checksums, AIO writes) (Melanie Plageman <melanieplageman@gmail.com>) |
| Responses | Re: Buffer locking is special (hints, checksums, AIO writes) |
| List | pgsql-hackers |
Hi,
Thanks a lot for that detailed review! A few questions and comments below,
before I try to address the rest in the next version.
On 2025-11-25 10:44:31 -0500, Melanie Plageman wrote:
> On Wed, Nov 19, 2025 at 9:47 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > 0008: The main change. Implements buffer content locking independently from
> > lwlock.c. There's obviously a lot of similarity between lwlock.c code and
> > this, but I've not found a good way to reduce the duplication without giving
> > up too much. This patch does immediately introduce share-exclusive as a new
> > lock level, mostly because it was too painful to do separately.
> >
> > 0009+0010+0011: Preparatory work for 0012.
> >
> > 0012: One of the main goals of this patchset - use the new share-exclusive
> > lock level to only allow hint bits to be set while no IO is going on.
>
> Below is a review of 0008-0012
> I skipped 0013 and 0014 after seeing "#if 1" in 0013 :)
I left that in there to make the comparison easier. But clearly that's not to
be committed...
> 0008:
> -------
> > [PATCH v6 08/14] bufmgr: Implement buffer content locks independently
> of lwlocks
> ...
> > This commit unfortunately introduces some code that is very similar to the
> > code in lwlock.c, however the code is not equivalent enough to easily merge
> > it. The future wins that this commit makes possible seem worth the cost.
>
> It is a truly unfortunate amount of duplication. I tried some
> refactoring myself just to convince myself it wasn't a good idea.
Without success, I guess?
> However, ISTM there is no reason for the lwlock and buffer locking
> implementations to have to stay in sync. So they can diverge as
> features are added and perhaps the duplication won't be as conspicuous
> in the future.
Right - I'd expect further divergence.
> > As of this commit nothing uses the new share-exclusive lock mode. It will be
> > documented and used in a future commit. It seemed too complicated to introduce
> > the lock-level in a separate commit.
>
> I would have liked this mentioned earlier in the commit message.
Hm, ok. I could split share-exclusive out again, but it was somewhat painful,
because it doesn't just lead to adding code, but to changing code.
> Also, I don't know how I feel about it being "documented" in a future
> commit...perhaps just don't say that.
Hm, why?
> diff --git a/src/include/storage/buf_internals.h
> b/src/include/storage/buf_internals.h
> index 28519ad2813..0a145d95024 100644
> --- a/src/include/storage/buf_internals.h
> +++ b/src/include/storage/buf_internals.h
> @@ -32,22 +33,29 @@
> /*
> * Buffer state is a single 64-bit variable where following data is combined.
> *
> + * State of the buffer itself:
> * - 18 bits refcount
> * - 4 bits usage count
> * - 10 bits of flags
> *
> + * State of the content lock:
> + * - 1 bit has_waiter
> + * - 1 bit release_ok
> + * - 1 bit lock state locked
>
> Somewhere you should clearly explain the scenarios in which you still
> need to take the buffer header lock with LockBufHdr() vs when you can
> just use atomic operations/CAS loops.
There is an explanation of that, or at least my attempt at it ;). See the
documentation for BufferDesc.
> @@ -268,6 +287,15 @@ BufMappingPartitionLockByIndex(uint32 index)
> * wait_backend_pgprocno and setting flag bit BM_PIN_COUNT_WAITER. At present,
> * there can be only one such waiter per buffer.
> *
> + * The content of buffers is protected via the buffer content lock,
> + * implemented as part buffer state. Note that the buffer header lock is *not*
> + * used to control access to the data in the buffer! We used to use an LWLock
> + * to implement the content lock, but having a dedicated implementation of
> + * content locks allows to implement some otherwise hard things (e.g.
> + * race-freely checking if AIO is in progress before locking a buffer
> + * exclusively) and makes otherwise impossible optimizations possible
> + * (e.g. unlocking and unpinning a buffer in one atomic operation).
> + *
>
> I don't really like that this includes a description of how it used to
> work. It makes the description more confusing when trying to
> understand the current state. And comments like that make most sense
> when the current state is confusing because of a long annoying
> history.
I generally agree with that - however, in this case it seemed like folks in
the future might actually wonder why this isn't using lwlocks.
> + */
> +static inline void
> +LockBuffer(Buffer buffer, BufferLockMode mode)
> +{
> + if (mode == BUFFER_LOCK_UNLOCK)
> + UnlockBuffer(buffer);
> + else
> + LockBufferInternal(buffer, mode);
> +}
> +
>
> Is there a reason to stick with the LockBuffer(buf,
> BUFFER_LOCK_UNLOCK) interface here? It seems like a good time to
> start using UnlockBuffer() which I thought most people find more
> intuitive.
There are like 200 uses of BUFFER_LOCK_UNLOCK|GIN_UNLOCK|GIST_UNLOCK. And
probably a lot of external ones. I'm not against using UnlockBuffer()
directly in the future, but changing all that code as part of this change
doesn't quite seem to make sense.
> /*
> - * Acquire or release the content_lock for the buffer.
> + * Acquire the buffer content lock in the specified mode
> + *
> + * If the lock is not available, sleep until it is.
> + *
> + * Side effect: cancel/die interrupts are held off until lock release.
> + *
> + * This uses almost the same locking approach as lwlock.c's
> + * LWLockAcquire(). See documentation atop of lwlock.c for a more detailed
> + * discussion.
> + *
> + * The reason that this, and most of the other BufferLock* functions, get both
> + * the Buffer and BufferDesc* as parameters, is that looking up one from the
> + * other repeatedly shows up noticeably in profiles.
> + */
> +static inline void
> +BufferLockAcquire(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
>
> Either here or over the lock mode enum, I'd spend a few sentences
> describing how the buffer lock modes interact -- what conflicts with
> what. You spend more time saying how this is like LWLock than
> explaining what it actually is.
Historically they're just explained in the README. But yea, it makes sense to
add to the enum. I like adding it to the enum better than to the function, as
there are multiple ways to acquire a lock.
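Something along these lines is what I have in mind for the enum (just a
sketch; the mode names are from the patch, the conflict summary is how I
understand/intend the semantics, and the enum member order here isn't meant
to be authoritative):

/*
 * Modes of the buffer content lock.
 *
 * BUFFER_LOCK_SHARE           compatible with other share and with
 *                             share-exclusive holders, conflicts with
 *                             exclusive
 * BUFFER_LOCK_SHARE_EXCLUSIVE compatible with share holders, conflicts with
 *                             other share-exclusive holders and with
 *                             exclusive (used e.g. to set hint bits and to
 *                             write a buffer out)
 * BUFFER_LOCK_EXCLUSIVE       conflicts with everything
 */
typedef enum BufferLockMode
{
	BUFFER_LOCK_UNLOCK,
	BUFFER_LOCK_SHARE,
	BUFFER_LOCK_SHARE_EXCLUSIVE,
	BUFFER_LOCK_EXCLUSIVE,
} BufferLockMode;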
> +{
> + bool new_release_ok;
> + bool wake_exclusive = unlocked;
> + bool wake_share_exclusive = true;
> + proclist_head wakeup;
> + proclist_mutable_iter iter;
> +
> + proclist_init(&wakeup);
> +
> + new_release_ok = true;
> +
> + /* lock wait list while collecting backends to wake up */
> + LockBufHdr(buf_hdr);
> +
> + proclist_foreach_modify(iter, &buf_hdr->lock_waiters, lwWaitLink)
> + {
> + PGPROC *waiter = GetPGProcByNumber(iter.cur);
> +
> + if (!wake_exclusive && waiter->lwWaitMode == BUFFER_LOCK_EXCLUSIVE)
> + continue;
> +
> + if (!wake_share_exclusive && waiter->lwWaitMode ==
> BUFFER_LOCK_SHARE_EXCLUSIVE)
> + continue;
> +
>
> It seems there is a difference between LWLockWakeup() and
> BufferLockWakeup() if the queue contains a share lock waiter followed
> by an exclusive lock waiter.
>
> In LWLockWakeup(), it will wake up the share waiter, set
> wokeup_somebody to true, and then not wake up the exclusive lock
> waiter because wokeup_somebody is true.
>
> In BufferLockWakeup() when unlocked is true, it will wake up the share
> lock waiter and then wake up the exclusive lock waiter because
> wake_exclusive and wake_share_exclusive are both still true.
>
> This might be the intended behavior, but it is a difference from
> LWLockWakeup(), so I think it is worth documenting why it is okay.
Yea, that doesn't look quite right. I think I just whacked the code around too
much and somewhere lost that branch.
>
> + * Signal that the process isn't on the wait list anymore. This allows
> + * BufferLockDequeueSelf() to remove itself of the waitlist with a
> + * proclist_delete(), rather than having to check if it has been
>
> I know this comment was ported over, but the "remove itself of the
> waitlist" -- the "of" is confusing in the original comment and it is
> confusing here.
As in s/of/from/?
> + * Compute subtraction from buffer state for a release of a held lock in
> + * `mode`.
> + *
> + * This is separated from BufferLockUnlock() as we want to combine the lock
> + * release with other atomic operations when possible, leading to the lock
> + * release being done in multiple places.
> + */
> +static inline uint64
> +BufferLockReleaseSub(BufferLockMode mode)
>
> I don't understand why this is a separate function even with your comment.
Because there are operations where we want to unlock the buffer as well as do
something else. E.g. in 0013's UnlockReleaseBuffer() we want to unlock the
buffer and decrease the refcount in one atomic operation. For that we need to
know what to subtract from the state variable for the lock portion - hence
BufferLockReleaseSub().
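I.e. very roughly something like the following in the unlock-and-unpin path
(just a sketch to show the shape, not the actual 0013 code; the function name
is made up, waiter wakeup and refcount bookkeeping are elided, and I'm writing
BUF_REFCOUNT_ONE from memory):

static inline void
UnlockAndUnpinBuffer(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
{
	/*
	 * Subtract the lock portion and one pin from the buffer state in a
	 * single atomic operation, instead of two separate read-modify-write
	 * cycles on the same cache line.
	 */
	uint64		sub = BufferLockReleaseSub(mode) + BUF_REFCOUNT_ONE;

	pg_atomic_fetch_sub_u64(&buf_hdr->state, sub);

	/* waiter wakeup, private refcount / resowner bookkeeping elided */
}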
> + * BufferLockHeldByMe - test whether my process holds the content lock in any
> + * mode
> + *
> + * This is meant as debug support only.
> + */
> +static bool
> +BufferLockHeldByMe(BufferDesc *buf_hdr)
> +{
> + PrivateRefCountEntry *entry =
> + GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf_hdr), false);
> +
> + if (!entry)
> + return false;
> + else
> + return entry->data.lockmode != BUFFER_LOCK_UNLOCK;
> +}
>
> Previously, if I called LockBuffer(buf, BUFFER_LOCK_SHARE) again after
> calling it once, say in RelationCopyStorageUsingBuffer(), I would trip
> an assert when unpinning the buffer about how I needed to have
> released the lock.
Right, that's important to keep that way.
> Now, though, it doesn't trip the assert because we don't track
> multiple locks. When I call LockBuffer(UNLOCK), it sets lockmode to
> BUFFER_LOCK_UNLOCK and BufferLockHeldByMe() only checks the private
> refcount entry.
That part seems right.
> Whereas LWLockHeldByMe() checked held_lwlocks which had multiple entries and
> would report that I still held a lock.
It'd be much better if we detected that redundant lock acquisition; that's
not legal...
> Now, if I call LockBuffer(SHARE) twice, I'll only find out later when I have
> a hang because someone is trying to get an exclusive lock and the actual
> BufferDesc->state still has a lock set.
>
> Maybe you should add an assert to the lock acquisition path that the
> private ref count entry mode is UNLOCK?
Yes, we should...
It'd be nice if we had a decent way to test things that we expect to
crash. Like repeated buffer lock acquisitions...
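Concretely, I'm thinking of something like this at the start of the
acquisition path (sketch only; it relies on the buffer already being pinned,
which is what creates the private refcount entry):

	PrivateRefCountEntry *ref = GetPrivateRefCountEntry(buffer, false);

	/* content locks may only be taken on pinned buffers ... */
	Assert(ref != NULL);
	/* ... and we must not already hold the content lock in any mode */
	Assert(ref->data.lockmode == BUFFER_LOCK_UNLOCK);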
> void
> -LockBuffer(Buffer buffer, BufferLockMode mode)
> +LockBufferInternal(Buffer buffer, BufferLockMode mode)
> {
> - buf = GetBufferDescriptor(buffer - 1);
> + buf_hdr = GetBufferDescriptor(buffer - 1);
>
> - if (mode == BUFFER_LOCK_UNLOCK)
> - LWLockRelease(BufferDescriptorGetContentLock(buf));
> - else if (mode == BUFFER_LOCK_SHARE)
> - LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_SHARED);
> + if (mode == BUFFER_LOCK_SHARE)
> + BufferLockAcquire(buffer, buf_hdr, BUFFER_LOCK_SHARE);
> + else if (mode == BUFFER_LOCK_SHARE_EXCLUSIVE)
> + BufferLockAcquire(buffer, buf_hdr, BUFFER_LOCK_SHARE_EXCLUSIVE);
> else if (mode == BUFFER_LOCK_EXCLUSIVE)
> - LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
> + BufferLockAcquire(buffer, buf_hdr, BUFFER_LOCK_EXCLUSIVE);
> else
> elog(ERROR, "unrecognized buffer lock mode: %d", mode);
>
> I presume you've stuck with this if statement structure because that
> is what LockBuffer() used.
>
> Even though now the BufferLockMode passed
> through to BufferLockAcquire is the exact thing you are testing.
Ah. Using explicit constants allows the compiler to do constant propagation
(so e.g. BufferLockAttempt() doesn't have runtime branches on mode), which
turns out to generate more efficient code. I'll add a comment to that effect.
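To illustrate the general pattern outside of bufmgr.c (toy standalone example,
nothing to do with the actual buffer state layout):

#include <stdbool.h>
#include <stdint.h>

typedef enum ToyLockMode { TOY_LOCK_SHARE, TOY_LOCK_EXCLUSIVE } ToyLockMode;

static inline bool
toy_lock_attempt(uint64_t *state, ToyLockMode mode)
{
	uint64_t	add;

	/* with a constant 'mode' argument this branch folds away when inlined */
	if (mode == TOY_LOCK_EXCLUSIVE)
		add = UINT64_C(1) << 32;	/* pretend exclusive bit */
	else
		add = 1;					/* pretend shared refcount */

	*state += add;
	return true;
}

bool
toy_lock_share(uint64_t *state)
{
	/* literal constant -> constant propagation, no runtime branch on mode */
	return toy_lock_attempt(state, TOY_LOCK_SHARE);
}

bool
toy_lock_exclusive(uint64_t *state)
{
	return toy_lock_attempt(state, TOY_LOCK_EXCLUSIVE);
}

If 'mode' were passed through as a variable instead, the branch would survive
into the hot path.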
> @@ -6625,7 +7295,25 @@ ResOwnerReleaseBufferPin(Datum res)
> if (BufferIsLocal(buffer))
> UnpinLocalBufferNoOwner(buffer);
> else
> + {
> + PrivateRefCountEntry *ref;
> +
> + ref = GetPrivateRefCountEntry(buffer, false);
> +
> + /*
> + * If the buffer was locked at the time of the resowner release,
> + * release the lock now. This should only happen after errors.
> + */
> + if (ref->data.lockmode != BUFFER_LOCK_UNLOCK)
> + {
> + BufferDesc *buf = GetBufferDescriptor(buffer - 1);
> +
> + HOLD_INTERRUPTS(); /* match the upcoming RESUME_INTERRUPTS */
> + BufferLockUnlock(buffer, buf);
> + }
> +
> UnpinBufferNoOwner(GetBufferDescriptor(buffer - 1));
> + }
> }
>
> Bit confusing that ResOwnerReleaseBufferPin() now releases locks as well.
Do you have a better suggestion? I'll add a comment that makes that explicit,
but other than that I don't have a great idea. Renaming the whole buffer pin
mechanism seems pretty noisy.
> 0009 -- I didn't look closely at 0009
That's hopefully just a boring code move...
> 0010:
> --------
> > [PATCH v6 10/14] heapam: Use exclusive lock on old page in CLUSTER
>
> > To be able to guarantee that we can set the hint bit, acquire an exclusive
> > lock on the old buffer. We need the hint bits to be set as otherwise
> > reform_and_rewrite_tuple() -> rewrite_heap_tuple() -> heap_freeze_tuple() will
> > get confused.
>
> So, this is an active bug?
I don't think so - we have an AEL on the relation at that point, so nobody
else can access the buffer, aside from the checkpointer writing it out. Which
doesn't currently block hint bits from being set.
> And what exactly do you mean heap_freeze_tuple() gets confused? I thought
> somewhere in there we would check the clog.
heap_freeze_tuple()->heap_prepare_freeze_tuple() assumes that:
* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
* (else we should be removing the tuple, not freezing it).
In the new world, if HTSV were not allowed to set the hint bit, e.g. because
the page is in the process of being written out, there wouldn't be a guarantee
that hints were set. Which then leads to confusion, because some of the code
assumes that hint bits were already set. TBH, I don't quite remember the
precise details; it's been a while (this was already part of an earlier
attempt at dealing with the hint bit stuff). I'll try to reconstruct them.
> 0011:
> --------
> > [PATCH v6 11/14] heapam: Add batch mode mvcc check and use it in page mode
>
> > 2) We would like to stop setting hint bits while pages are being written
> > out. The necessary locking becomes visible for page mode scans if done for
> > every tuple. With batching the overhead can be amortized to only happen
> > once per page.
>
> I don't understand the above point. What does this patch have to do
> with not setting hint bits while pages are being written out (that
> happens in the next patch)?
It's just an explanation for why we want to use batch mode. Without the
changed locking model the reason for that is a lot less clear.
> And I presume you mean don't set hint bits on a buffer that is being flushed
> by someone else -- but it sounds like you mean not to set hint bits as part
> of flushing a buffer.
Right, I do mean the former.
> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> index 4b0c49f4bb0..ddabd1a3ec3 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -504,42 +504,93 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
> BlockNumber block, int lines,
> bool all_visible, bool check_serializable)
> {
> + Oid relid = RelationGetRelid(scan->rs_base.rs_rd);
> +#ifdef BATCHMVCC_FEWER_ARGS
> + BatchMVCCState batchmvcc;
> + HeapTupleData *tuples = batchmvcc.tuples;
> + bool *visible = batchmvcc.visible;
> +#else
> + HeapTupleData tuples[MaxHeapTuplesPerPage];
> + bool visible[MaxHeapTuplesPerPage];
> +#endif
> int ntup = 0;
>
> It's pretty confusing when visible is an output vs input parameter and
> who fills it in when. (i.e. it's filled in in page_collect_tuples() if
> check_serializable and all_visible are true, otherwise it's filled in
> in HeapTupleSatisifiesMVCCBatch())
I don't really see a good alternative.
> Personally, I think I'd almost prefer an all-visible and
> not-all-visible version of page_collect_tuples() (or helper functions
> containing the loop) that separate this. (I haven't tried it though)
I have a hard time seeing how that comes out better. Where would you make the
switch between the different functions and how would that lead to easier to
understand code?
> BATCHMVCC_FEWER_ARGS definitely doesn't make it any easier to read --
> which I assume you are removing.
Yea, I only left it in so others can perhaps reproduce the performance effect
of needing BATCHMVCC_FEWER_ARGS.
> + /*
> + * If the page is not all-visible or we need to check serializability,
> + * maintain enough state to be able to refind the tuple efficiently,
> + * without again needing to extract it from the page.
> + */
> + if (!all_visible || check_serializable)
> + {
>
> "enough state" is pretty vague here.
I don't really follow. Wouldn't going into more detail just restate the code
in a comment?
> Maybe mention what it wouldn't be valid to do with the tuples array?
Hm? The tuples array *is* that state?
> + * visibility is set in batchmvcc->visible[]. In addition, ->vistuples_dense
> + * is set to contain the offsets of visible tuples.
> + *
> + * Returns the number of visible tuples.
> + */
> +int
> +HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
> + int ntups,
>
> I don't really get why this (the patch in general) would be
> substantially faster. You still call HeapTupleSatisfiesMVCC() for
> each tuple in a loop. The difference is that you've got pointers into
> the array of tuples instead of doing PageGetItem(), then calling
> HeapTupleSatisfiesMVCC() for each tuple.
There are a few reasons:
1) Most importantly, in the batched world, we only need to check if we can set
hint bits once. That check is far from free. We also only need to mark the
buffer dirty once.
2) HeapTupleSatisfiesMVCCBatch() can inline HeapTupleSatisfiesMVCC() and avoid
some redundant work, e.g. it doesn't have to set up a stack frame each
time.
3) A loop over the tuples in heapam.c needs an external function call to
HeapTupleSatisfiesMVCC(), as that's in heapam_visibility.c. That function
call quickly shows up.
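To make the structure a bit more concrete, the batch function is shaped
roughly like this (heavily simplified sketch; the inlined per-tuple helper
name is made up, serializability handling and the BATCHMVCC_FEWER_ARGS
variant are elided, and the exact way vistuples_dense is passed around
differs in the actual patch):

int
HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
							int ntups, BatchMVCCState *batchmvcc)
{
	SetHintBitsState hint_state = SHB_INITIAL;
	int			nvis = 0;

	for (int i = 0; i < ntups; i++)
	{
		HeapTuple	htup = &batchmvcc->tuples[i];
		bool		visible;

		/*
		 * Inlined copy of the HeapTupleSatisfiesMVCC() logic, sharing
		 * hint_state across all tuples, so BufferBeginSetHintBits() runs at
		 * most once per page and there's no external function call per
		 * tuple.
		 */
		visible = HeapTupleSatisfiesMVCCOne(htup, snapshot, buffer,
											&hint_state);

		batchmvcc->visible[i] = visible;
		if (visible)
			batchmvcc->vistuples_dense[nvis++] =
				ItemPointerGetOffsetNumber(&htup->t_self);
	}

	/* pay the "done setting hint bits" cost once per page, not per tuple */
	if (hint_state == SHB_ENABLED)
		BufferFinishSetHintBits(buffer, true, true);

	return nvis;
}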
>
> 0012:
> -------
>
> > [PATCH v6 12/14] Require share-exclusive lock to set hint bits
> > To address these issue, this commit changes the rules so that modifications to
> > pages are not allowed anymore while holding a share lock. Instead the new
>
> In the commit message, you make it sound like you only change the lock
> level for setting the hint bits. But that wouldn't solve any problems
> if FlushBuffer() could still happen with a share lock. I would try to
> make it clear that you change the lock level both for setting the hint
> bits and flushing the buffer.
Good point.
> @@ -77,6 +73,16 @@ gistkillitems(IndexScanDesc scan)
> */
> for (i = 0; i < so->numKilled; i++)
> {
> + if (!killedsomething)
> + {
> + /*
> + * Use hint bit infrastructure to be allowed to modify the page
> + * without holding an exclusive lock.
> + */
> + if (!BufferBeginSetHintBits(buffer))
> + goto unlock;
> + }
> +
>
> I don't understand why this is in the loop. Clearly you want to call
> BufferBeginSetHintBits() once, but why would you do it in the loop?
Why would we want to continue if we can't set hint bits?
> In the comment, I might also note that the lock level will be upgraded
> as needed or something since we only have a share lock, it is
> confusing at first
I don't think copying the way this works into all the callers of
BufferBeginSetHintBits() is a good idea. That just makes it harder to adjust
down the road.
> - * SetHintBits()
> + * To be allowed to set hint bits, SetHintBits() needs to call
> + * BufferBeginSetHintBits(). However, that's not free, and some callsites call
> + * SetHintBits() on many tuples in a row. For those it makes sense to amortize
> + * the cost of BufferBeginSetHintBits(). Additionally it's desirable to defer
> + * the cost of BufferBeginSetHintBits() until a hint bit needs to actually be
> + * set. This enum serves as the necessary state space passed to
> + * SetHintbitsExt().
> + */
> +typedef enum SetHintBitsState
> +{
> + /* not yet checked if hint bits may be set */
> + SHB_INITIAL,
> + /* failed to get permission to set hint bits, don't check again */
> + SHB_DISABLED,
> + /* allowed to set hint bits */
> + SHB_ENABLED,
> +} SetHintBitsState;
>
> I dislike the SHB prefix. Perhaps something involving the word hint?
Shrug. It's a very locally used enum, I don't think it matters terribly.
> And should the enum name itself (SetHintBitsState) include the word
> "batch"? I know that would make it
> long. At least the comment should explain that these are needed when batch
> setting hint bits.
Isn't that what the comment does, explaining that we want to amortize the cost
of BufferBeginSetHintBits()?
> +SetHintBitsExt(HeapTupleHeader tuple, Buffer buffer,
> + uint16 infomask, TransactionId xid, SetHintBitsState *state)
> {
> if (TransactionIdIsValid(xid))
> {
> - /* NB: xid must be known committed here! */
> - XLogRecPtr commitLSN = TransactionIdGetCommitLSN(xid);
> + if (BufferIsPermanent(buffer))
> + {
>
> I really wish there was a way to better pull apart the batch and
> non-batch cases in a way that could allow the below block to be in a
> helper do_set_hint() (or whatever) which SetHintBitsExt() and
> SetHintBits() called.
I couldn't see a way that didn't lead to substantially more code duplication.
> And then you inlined BufferSetHintBits16().
Hm?
> I presume you didn't do this because HeapTupleSatisifiesMVCC() for
> SNAPSHOT_MVCC calls the non-batch version (and, of course,
> HeapTupleSatisifiesVisibility() is the much more common case).
>
> if (TransactionIdIsValid(xid))
> {
> if (BufferIsPermanent(buffer))
> {
> /* NB: xid must be known committed here! */
> XLogRecPtr commitLSN = TransactionIdGetCommitLSN(xid);
>
> if (XLogNeedsFlush(commitLSN) &&
> BufferGetLSNAtomic(buffer) < commitLSN)
> {
> /* not flushed and no LSN interlock, so don't
> set hint */
> return; false;
> }
> }
> }
>
> Separately, I was thinking, should we assert here about having the
> right lock type?
Not sure I follow - an assert of what lock type, where?
> * It is only safe to set a transaction-committed hint bit if we know the
> * transaction's commit record is guaranteed to be flushed to disk before the
> * buffer, or if the table is temporary or unlogged and will be obliterated by
> * a crash anyway. We cannot change the LSN of the page here, because we may
> * hold only a share lock on the buffer, so we can only use the LSN to
> * interlock this if the buffer's LSN already is newer than the commit LSN;
> * otherwise we have to just refrain from setting the hint bit until some
> * future re-examination of the tuple.
> *
>
> Should this say "we may hold only a share exclusive lock on the
> buffer". Also what is "this" in "only use the LSN to interlock this"?
That's a pre-existing comment, right?
> @@ -1628,6 +1701,9 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot,
> Buffer buffer,
> + if (state == SHB_ENABLED)
> + BufferFinishSetHintBits(buffer, true, true);
> +
> return nvis;
> }
>
> I wondered if it would be more natural for BufferBeginSetHintBits()
> and BufferFinishSetHintBits() to set SHB_INITIAL and SHB_DISABLED
> instead of having callers do it. But, I guess you don't do this
> because of gist and hash indexes using this for doing their own
> modifications.
I thought about it. But yea, the different callers seemed to make that not
really useful. It'd also mean that we'd do an external function call for every
tuple, which I think would be prohibitively expensive.
> --- a/src/backend/storage/freespace/freespace.c
> +++ b/src/backend/storage/freespace/freespace.c
> @@ -904,14 +904,22 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,
> max_avail = fsm_get_max_avail(page);
> /*
> - * Reset the next slot pointer. This encourages the use of low-numbered
> - * pages, increasing the chances that a later vacuum can truncate the
> - * relation. We don't bother with a lock here, nor with marking the page
> - * dirty if it wasn't already, since this is just a hint.
> + * Try to reset the next slot pointer. This encourages the use of
> + * low-numbered pages, increasing the chances that a later vacuum can
> + * truncate the relation. We don't bother with a lock here, nor with
> + * marking the page dirty if it wasn't already, since this is just a hint.
> + *
> + * To be allowed to update the page without an exclusive lock, we have to
> + * use the hint bit infrastructure.
> */
>
> What the heck? This didn't even take a share lock before...
It is insane. I do not understand how anybody thought this was ok. I think I
probably should split this out into a separate commit, since it's so insane.
> diff --git a/src/backend/storage/freespace/fsmpage.c
> b/src/backend/storage/freesp>
> index 66a5c80b5a6..a59696b6484 100644
> --- a/src/backend/storage/freespace/fsmpage.c
> +++ b/src/backend/storage/freespace/fsmpage.c
> @@ -298,9 +298,18 @@ restart:
> * lock and get a garbled next pointer every now and then, than take the
> * concurrency hit of an exclusive lock.
> *
> + * Without an exclusive lock, we need to use the hint bit infrastructure
> + * to be allowed to modify the page.
> + *
>
> Is the sentence above this still correct?
Seems ok enough to me.
> /*
> * Update the next-target pointer. Note that we do this even if we're only
> * holding a shared lock, on the grounds that it's better to use a shared
> * lock and get a garbled next pointer every now and then, than take the
> * concurrency hit of an exclusive lock.
>
> We appear to avoid the garbling now?
I don't think so. Two backends can concurrently do fsm_search_avail(), and
one backend might set the hint to a page that has already been used up by the
other one. At least I think so?
> In general on 0012, I didn't spend much time checking if you caught
> all the places where we mention our hint bit hackery (only taking the
> share lock). But those can always be caught later as we inevitably
> encounter them.
There's definitely some more searching needed as part of polishing that
commit. But I think it's pretty much inevitable that we'll miss some comment
somewhere :(
Greetings,
Andres Freund