Re: refactoring relation extension and BufferAlloc(), faster COPY - Mailing list pgsql-hackers

From Andres Freund
Subject Re: refactoring relation extension and BufferAlloc(), faster COPY
Date
Msg-id 20230329034754.elzpfwacqmapv6pj@awork3.anarazel.de
In response to Re: refactoring relation extension and BufferAlloc(), faster COPY  (Melanie Plageman <melanieplageman@gmail.com>)
Responses Re: refactoring relation extension and BufferAlloc(), faster COPY
List pgsql-hackers
Hi,

On 2023-03-26 17:42:45 -0400, Melanie Plageman wrote:
> Below is my review of a slightly older version than you just posted --
> much of it you may have already addressed.

Far from all of it is addressed already - thanks for the review!


> From 3a6c3f41000e057bae12ab4431e6bb1c5f3ec4b0 Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Mon, 20 Mar 2023 21:57:40 -0700
> Subject: [PATCH v5 01/15] createdb-using-wal-fixes
>
> This could use a more detailed commit message -- I don't really get what
> it is doing

It's a fix for a bug that I encountered while hacking on this, it since has
been committed.

commit 5df319f3d55d09fadb4f7e4b58c5b476a3aeceb4
Author: Andres Freund <andres@anarazel.de>
Date:   2023-03-20 21:57:40 -0700

    Fix memory leak and inefficiency in CREATE DATABASE ... STRATEGY WAL_LOG

    RelationCopyStorageUsingBuffer() did not free the strategies used to access
    the source / target relation. The memory was released at the end of the
    transaction, but when using a template database with a lot of relations, the
    temporary leak can become prohibitively big.

    RelationCopyStorageUsingBuffer() acquired the buffer for the target relation
    with RBM_NORMAL, therefore requiring a read of a block guaranteed to be
    zero. Use RBM_ZERO_AND_LOCK instead.

    Reviewed-by: Robert Haas <robertmhaas@gmail.com>
    Discussion: https://postgr.es/m/20230321070113.o2vqqxogjykwgfrr@awork3.anarazel.de
    Backpatch: 15-, where STRATEGY WAL_LOG was introduced


> From 6faba69c241fd5513022bb042c33af09d91e84a6 Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Wed, 1 Jul 2020 19:06:45 -0700
> Subject: [PATCH v5 02/15] Add some error checking around pinning
>
> ---
>  src/backend/storage/buffer/bufmgr.c | 40 ++++++++++++++++++++---------
>  src/include/storage/bufmgr.h        |  1 +
>  2 files changed, 29 insertions(+), 12 deletions(-)
>

> diff --git a/src/backend/storage/buffer/bufmgr.c
> b/src/backend/storage/buffer/bufmgr.c
> index 95212a3941..fa20fab5a2 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -4283,6 +4287,25 @@ ConditionalLockBuffer(Buffer buffer)
>                                      LW_EXCLUSIVE);
>  }
>
> +void
> +BufferCheckOneLocalPin(Buffer buffer)
> +{
> +    if (BufferIsLocal(buffer))
> +    {
> +        /* There should be exactly one pin */
> +        if (LocalRefCount[-buffer - 1] != 1)
> +            elog(ERROR, "incorrect local pin count: %d",
> +                LocalRefCount[-buffer - 1]);
> +    }
> +    else
> +    {
> +        /* There should be exactly one local pin */
> +        if (GetPrivateRefCount(buffer) != 1)
>
> I'd rather this be an else if (was already like this, but, no reason not
> to change it now)

I don't like that much - it'd break the symmetry between local / non-local.


> +/*
> + * Zero a region of the file.
> + *
> + * Returns 0 on success, -1 otherwise. In the latter case errno is set to the
> + * appropriate error.
> + */
> +int
> +FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info)
> +{
> +    int            returnCode;
> +    ssize_t        written;
> +
> +    Assert(FileIsValid(file));
> +    returnCode = FileAccess(file);
> +    if (returnCode < 0)
> +        return returnCode;
> +
> +    pgstat_report_wait_start(wait_event_info);
> +    written = pg_pwrite_zeros(VfdCache[file].fd, amount, offset);
> +    pgstat_report_wait_end();
> +
> +    if (written < 0)
> +        return -1;
> +    else if (written != amount)
>
> this doesn't need to be an else if

You mean it could be a "bare" if instead? I don't really think that's clearer.


> +    {
> +        /* if errno is unset, assume problem is no disk space */
> +        if (errno == 0)
> +            errno = ENOSPC;
> +        return -1;
> +    }
>
> +int
> +FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
> +{
> +#ifdef HAVE_POSIX_FALLOCATE
> +    int            returnCode;
> +
> +    Assert(FileIsValid(file));
> +    returnCode = FileAccess(file);
> +    if (returnCode < 0)
> +        return returnCode;
> +
> +    pgstat_report_wait_start(wait_event_info);
> +    returnCode = posix_fallocate(VfdCache[file].fd, offset, amount);
> +    pgstat_report_wait_end();
> +
> +    if (returnCode == 0)
> +        return 0;
> +
> +    /* for compatibility with %m printing etc */
> +    errno = returnCode;
> +
> +    /*
> +    * Return in case of a "real" failure. If fallocate is not supported,
> +    * fall through to the FileZero() backed implementation.
> +    */
> +    if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
> +        return returnCode;
>
> I'm pretty sure you can just delete the below if statement
>
> +    if (returnCode == 0 ||
> +        (returnCode != EINVAL && returnCode != EINVAL))
> +        return returnCode;

Hm. I don't see how - wouldn't that lead us to call FileZero(), even if
FileFallocate() succeeded or failed (rather than not being supported)?
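For what it's worth, the intended control flow - one early return for "real" failures, with only EINVAL/EOPNOTSUPP falling through to the zeroing path - can be sketched as a standalone helper (allocate_file_range is hypothetical, not the actual FileFallocate()):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Allocate [offset, offset + amount) of fd; 0 on success, -1 with errno set. */
static int
allocate_file_range(int fd, off_t offset, off_t amount)
{
	int			rc = posix_fallocate(fd, offset, amount);

	if (rc == 0)
		return 0;				/* allocated, nothing more to do */

	/* posix_fallocate() returns the error; it does not set errno */
	errno = rc;

	/* a "real" failure (e.g. ENOSPC): report it, don't fall back */
	if (rc != EINVAL && rc != EOPNOTSUPP)
		return -1;

	/* fallocate not supported on this file system: write zeros instead */
	{
		static const char zerobuf[8192];

		while (amount > 0)
		{
			size_t		chunk = amount > (off_t) sizeof(zerobuf) ?
				sizeof(zerobuf) : (size_t) amount;
			ssize_t		written = pwrite(fd, zerobuf, chunk, offset);

			if (written <= 0)
				return -1;
			offset += written;
			amount -= written;
		}
	}
	return 0;
}
```

After the two early returns, only the "unsupported" error codes can still be in hand, so no further check is needed before the fallback.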

>
> +/*
> + *    mdzeroextend() -- Add ew zeroed out blocks to the specified relation.
>
> not sure what ew is

A hurried new :)


> + *
> + *        Similar to mdrextend(), except the relation can be extended by
>
> mdrextend->mdextend

> + *        multiple blocks at once, and that the added blocks will be
> filled with
>
> I would lose the comma and just say "and the added blocks will be filled..."

Done.


> +void
> +mdzeroextend(SMgrRelation reln, ForkNumber forknum,
> +            BlockNumber blocknum, int nblocks, bool skipFsync)
>
> So, I think there are a few too many local variables in here, and it
> actually makes it more confusing.
> Assuming you would like to keep the input parameters blocknum and
> nblocks unmodified for debugging/other reasons, here is a suggested
> refactor of this function

I'm mostly adopting this.


> Also, I think you can combine the two error cases (I don't know if the
> user cares what you were trying to extend the file with).

Hm. I do find it a somewhat useful distinction for figuring out problems - we
haven't used posix_fallocate for files so far, it seems plausible we'd hit
some portability issues.  We could make it an errdetail(), I guess?



> void
> mdzeroextend(SMgrRelation reln, ForkNumber forknum,
>             BlockNumber blocknum, int nblocks, bool skipFsync)
> {
>     MdfdVec    *v;
>     BlockNumber curblocknum = blocknum;
>     int            remblocks = nblocks;
>     Assert(nblocks > 0);
>
>     /* This assert is too expensive to have on normally ... */
> #ifdef CHECK_WRITE_VS_EXTEND
>     Assert(blocknum >= mdnblocks(reln, forknum));
> #endif
>
>     /*
>     * If a relation manages to grow to 2^32-1 blocks, refuse to extend it any
>     * more --- we mustn't create a block whose number actually is
>     * InvalidBlockNumber or larger.
>     */
>     if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)
>         ereport(ERROR,
>                 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
>                 errmsg("cannot extend file \"%s\" beyond %u blocks",
>                         relpath(reln->smgr_rlocator, forknum),
>                         InvalidBlockNumber)));
>
>     while (remblocks > 0)
>     {
>         int            segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);

Hm - this shouldn't be an int - that's my fault, not yours...
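As an aside, the uint64 casts in the overflow check above are load-bearing: evaluated in 32 bits, blocknum + nblocks can wrap and slip past the comparison. A standalone illustration (using 0xFFFFFFFF for InvalidBlockNumber, as PostgreSQL defines it):

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_BLOCK_NUMBER 0xFFFFFFFFu

/*
 * Returns 1 if extending blocknum by nblocks would create a block whose
 * number is InvalidBlockNumber or larger; the uint64_t casts ensure the
 * sum cannot wrap around.
 */
static int
would_overflow(uint32_t blocknum, int nblocks)
{
	return (uint64_t) blocknum + (uint64_t) nblocks >=
		(uint64_t) INVALID_BLOCK_NUMBER;
}

/* Buggy 32-bit variant for contrast: the addition wraps modulo 2^32. */
static int
would_overflow_32(uint32_t blocknum, int nblocks)
{
	return blocknum + (uint32_t) nblocks >= INVALID_BLOCK_NUMBER;
}
```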




> From ad7cd10a6c340d7f7d0adf26d5e39224dfd8439d Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Wed, 26 Oct 2022 12:05:07 -0700
> Subject: [PATCH v5 05/15] bufmgr: Add Pin/UnpinLocalBuffer()
>
> diff --git a/src/backend/storage/buffer/bufmgr.c
> b/src/backend/storage/buffer/bufmgr.c
> index fa20fab5a2..6f50dbd212 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -4288,18 +4268,16 @@ ConditionalLockBuffer(Buffer buffer)
>  }
>
>  void
> -BufferCheckOneLocalPin(Buffer buffer)
> +BufferCheckWePinOnce(Buffer buffer)
>
> This name is weird. Who is we?

The current backend. I.e. the function checks that the current backend pins the
buffer exactly once, rather than that *any* backend pins it once.

I now see that BufferIsPinned() is, IMO, misleadingly named more generally,
even though it also applies only to pins by the current backend.


> diff --git a/src/backend/storage/buffer/localbuf.c
> b/src/backend/storage/buffer/localbuf.c
> index 5325ddb663..798c5b93a8 100644
> --- a/src/backend/storage/buffer/localbuf.c
> +++ b/src/backend/storage/buffer/localbuf.c
> +bool
> +PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
> +{
> +    uint32        buf_state;
> +    Buffer        buffer = BufferDescriptorGetBuffer(buf_hdr);
> +    int            bufid = -(buffer + 1);
>
> You do
>   int            buffid = -buffer - 1;
> in UnpinLocalBuffer()
> They should be consistent.
>
>   int            bufid = -(buffer + 1);
>
> I think this version is better:
>
>   int            buffid = -buffer - 1;
>
> Since if buffer is INT_MAX, then the -(buffer + 1) version invokes
> undefined behavior while the -buffer - 1 version doesn't.

You are right! Not sure what I was doing there...

Ah - turns out there's pre-existing code in localbuf.c that does it that way :(
See at least MarkLocalBufferDirty().

We really need to wrap this in something, rather than open coding it all over
bufmgr.c/localbuf.c. I really dislike this indexing :(.
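The difference is easy to check in isolation: with buffer == INT_MAX, the addition inside -(buffer + 1) overflows (undefined behavior), while -buffer - 1 is well-defined and yields INT_MIN. A minimal sketch of the safe spelling (for local buffers the Buffer values are negative, so the index comes out non-negative):

```c
#include <assert.h>
#include <limits.h>

/*
 * Recover the LocalRefCount/descriptor array index from a Buffer value.
 * -buffer - 1 never overflows: even for buffer == INT_MAX, -INT_MAX is
 * representable and subtracting 1 lands exactly on INT_MIN.
 */
static int
local_buffer_index(int buffer)
{
	return -buffer - 1;
}
```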


> From a0228218e2ac299aac754eeb5b2be7ddfc56918d Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Fri, 17 Feb 2023 18:26:34 -0800
> Subject: [PATCH v5 07/15] bufmgr: Acquire and clean victim buffer separately
>
> Previously we held buffer locks for two buffer mapping partitions at the same
> time to change the identity of buffers. Particularly for extending relations,
> needing to hold the extension lock while acquiring a victim buffer is
> painful. By separating out the victim buffer acquisition, future commits will
> be able to change relation extensions to scale better.
>
> diff --git a/src/backend/storage/buffer/bufmgr.c
> b/src/backend/storage/buffer/bufmgr.c
> index 3d0683593f..ea423ae484 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -1200,293 +1200,111 @@ BufferAlloc(SMgrRelation smgr, char
> relpersistence, ForkNumber forkNum,
>
>      /*
>      * Buffer contents are currently invalid.  Try to obtain the right to
>      * start I/O.  If StartBufferIO returns false, then someone else managed
>      * to read it before we did, so there's nothing left for BufferAlloc() to
>      * do.
>      */
> -    if (StartBufferIO(buf, true))
> +    if (StartBufferIO(victim_buf_hdr, true))
>          *foundPtr = false;
>      else
>          *foundPtr = true;
>
> I know it was already like this, but since you edited the line already,
> can we just make this this now?
>
>     *foundPtr = !StartBufferIO(victim_buf_hdr, true);

Hm, I do think it's easier to review if largely unchanged code is just moved
around, rather than also rewritten. So I'm hesitant to do that here.


> @@ -1595,6 +1413,237 @@ retry:
>      StrategyFreeBuffer(buf);
>  }
>
> +/*
> + * Helper routine for GetVictimBuffer()
> + *
> + * Needs to be called on a buffer with a valid tag, pinned, but without the
> + * buffer header spinlock held.
> + *
> + * Returns true if the buffer can be reused, in which case the buffer is only
> + * pinned by this backend and marked as invalid, false otherwise.
> + */
> +static bool
> +InvalidateVictimBuffer(BufferDesc *buf_hdr)
> +{
> +    /*
> +    * Clear out the buffer's tag and flags and usagecount.  This is not
> +    * strictly required, as BM_TAG_VALID/BM_VALID needs to be checked before
> +    * doing anything with the buffer. But currently it's beneficial as the
> +    * pre-check for several linear scans of shared buffers just checks the
> +    * tag.
>
> I don't really understand the above comment -- mainly the last sentence.

To start with, it's s/checks/check/

"linear scans" is a reference to functions like DropRelationBuffers(), which
iterate over all buffers, and just check the tag for a match. If we leave the
tag around, it'll still work, as InvalidateBuffer() etc will figure out that
the buffer is invalid. But of course that's slower than just skipping the
buffer "early on".


> +static Buffer
> +GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
> +{
> +    BufferDesc *buf_hdr;
> +    Buffer        buf;
> +    uint32        buf_state;
> +    bool        from_ring;
> +
> +    /*
> +    * Ensure, while the spinlock's not yet held, that there's a free refcount
> +    * entry.
> +    */
> +    ReservePrivateRefCountEntry();
> +    ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> +
> +    /* we return here if a prospective victim buffer gets used concurrently */
> +again:
>
> Why use goto instead of a loop here (again is the goto label)?

I find it way more readable this way. I'd use a loop if it were the common
case to loop, but it's the rare case, and for that I find the goto more
readable.



> @@ -4709,8 +4704,6 @@ TerminateBufferIO(BufferDesc *buf, bool
> clear_dirty, uint32 set_flag_bits)
>  {
>      uint32        buf_state;
>
> I noticed that the comment above TerminateBufferIO() says
>  * TerminateBufferIO: release a buffer we were doing I/O on
>  *    (Assumptions)
>  *    My process is executing IO for the buffer
>
> Can we still say this is an assumption? What about when it is being
> cleaned up after being called from AbortBufferIO()

That hasn't really changed - it was already called by AbortBufferIO().

I think it's still correct, too. We must have marked the IO as being in
progress to get there.


> diff --git a/src/backend/utils/resowner/resowner.c
> b/src/backend/utils/resowner/resowner.c
> index 19b6241e45..fccc59b39d 100644
> --- a/src/backend/utils/resowner/resowner.c
> +++ b/src/backend/utils/resowner/resowner.c
> @@ -121,6 +121,7 @@ typedef struct ResourceOwnerData
>
>      /* We have built-in support for remembering: */
>      ResourceArray bufferarr;    /* owned buffers */
> +    ResourceArray bufferioarr;    /* in-progress buffer IO */
>      ResourceArray catrefarr;    /* catcache references */
>      ResourceArray catlistrefarr;    /* catcache-list pins */
>      ResourceArray relrefarr;    /* relcache references */
> @@ -441,6 +442,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
>
> Maybe worth mentioning in-progress buffer IO in resowner README? I know
> it doesn't claim to be exhaustive, so, up to you.

Hm. Given the few types of resources mentioned in the README, I don't think
it's worth doing so.


> Also, I realize that existing code in this file has the extraneous
> parentheses, but maybe it isn't worth staying consistent with that?
> as in: &(owner->bufferioarr)

I personally don't find it worth being consistent with that, but if you /
others think it is, I'd be ok with adapting to that.


> From f26d1fa7e528d04436402aa8f94dc2442999dde3 Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Wed, 1 Mar 2023 13:24:19 -0800
> Subject: [PATCH v5 09/15] bufmgr: Move relation extension handling into
>  ExtendBufferedRel{By,To,}
>
> diff --git a/src/backend/storage/buffer/bufmgr.c
> b/src/backend/storage/buffer/bufmgr.c
> index 3c95b87bca..4e07a5bc48 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
>
> +/*
> + * Extend relation by multiple blocks.
> + *
> + * Tries to extend the relation by extend_by blocks. Depending on the
> + * availability of resources the relation may end up being extended by a
> + * smaller number of pages (unless an error is thrown, always by at least one
> + * page). *extended_by is updated to the number of pages the relation has been
> + * extended by.
> + *
> + * buffers needs to be an array that is at least extend_by long. Upon
> + * completion, the first extend_by array elements will point to a pinned
> + * buffer.
> + *
> + * If EB_LOCK_FIRST is part of flags, the first returned buffer is
> + * locked. This is useful for callers that want a buffer that is guaranteed to
> + * be empty.
>
> This should document what the returned BlockNumber is.

Ok.


> Also, instead of having extend_by and extended_by, how about just having
> one which is set by the caller to the desired number to extend by and
> then overwritten in this function to the value it successfully extended
> by.

I had it that way at first - but I think it turned out to be more confusing.


> It would be nice if the function returned the number it extended by
> instead of the BlockNumber.

It's not actually free to get the block number from a buffer (it causes more
sharing of the BufferDesc cacheline, which then makes modifications of the
cacheline more expensive). We should work on removing all those
BufferGetBlockNumber(). So I don't want to introduce a new function that
requires using BufferGetBlockNumber().

So I don't think this would be an improvement.


> +    Assert((eb.rel != NULL) ^ (eb.smgr != NULL));
>
> Can we turn these into !=
>
>   Assert((eb.rel != NULL) != (eb.smgr != NULL));
>
> since it is easier to understand.

Done.


> + * Extend the relation so it is at least extend_to blocks large, read buffer
>
> Use of "read buffer" here is confusing. We only read the block if, after
> we try extending the relation, someone else already did so and we have
> to read the block they extended in, right?

That's one case, yes. I think there's also another unfortunate case that
I'd like to get rid of. See my archeology at
https://postgr.es/m/20230223010147.32oir7sb66slqnjk%40awork3.anarazel.de



> +        uint32 num_pages = lengthof(buffers);
> +        BlockNumber first_block;
> +
> +        if ((uint64) current_size + num_pages > extend_to)
> +            num_pages = extend_to - current_size;
> +
> +        first_block = ExtendBufferedRelCommon(eb, fork, strategy, flags,
> +                                             num_pages, extend_to,
> +                                             buffers, &extended_by);
> +
> +        current_size = first_block + extended_by;
> +        Assert(current_size <= extend_to);
> +        Assert(num_pages != 0 || current_size >= extend_to);
> +
> +        for (int i = 0; i < extended_by; i++)
> +        {
> +            if (first_block + i != extend_to - 1)
>
> Is there a way we could avoid pinning these other buffers to begin with
> (e.g. passing a parameter to ExtendBufferedRelCommon())

We can't avoid pinning them. We could make ExtendBufferedRelCommon() release
them though - but I'm not sure that'd be an improvement.  I actually had a
flag for that temporarily, but ...



> +    if (buffer == InvalidBuffer)
> +    {
> +        bool hit;
> +
> +        Assert(extended_by == 0);
> +        buffer = ReadBuffer_common(eb.smgr, eb.relpersistence,
> +                                  fork, extend_to - 1, mode, strategy,
> +                                  &hit);
> +    }
> +
> +    return buffer;
> +}
>
> Do we use compound literals? Here, this could be:
>
>         buffer = ReadBuffer_common(eb.smgr, eb.relpersistence,
>                                   fork, extend_to - 1, mode, strategy,
>                                   &(bool) {0});
>
> To eliminate the extraneous hit variable.

We do use compound literals in a few places. However, I don't think it's a
good idea to pass a pointer to a temporary.  At least I need to look up the
lifetime rules for those every time. And this isn't a huge win, so I wouldn't
go for it here.
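For the record, C99 (6.5.2.5) gives a compound literal at block scope automatic storage with the lifetime of its enclosing block, so the &(bool){0} idiom is well-defined for the duration of the call; the objection here is readability, not correctness. A minimal check (demo/set_flag are illustrative names, not PostgreSQL code):

```c
#include <assert.h>
#include <stdbool.h>

static void
set_flag(bool *hit)
{
	*hit = true;
}

/*
 * The compound literal (bool) {0} has the lifetime of the enclosing
 * block, so the pointer is valid both during the call and afterwards,
 * for the remainder of this function.
 */
static bool
demo(void)
{
	bool	   *p = &(bool) {0};

	set_flag(p);
	return *p;					/* still points at live storage */
}
```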



>  /*
>   * ReadBuffer_common -- common logic for all ReadBuffer variants
> @@ -801,35 +991,36 @@ ReadBuffer_common(SMgrRelation smgr, char
> relpersistence, ForkNumber forkNum,
>      bool        found;
>      IOContext    io_context;
>      IOObject    io_object;
> -    bool        isExtend;
>      bool        isLocalBuf = SmgrIsTemp(smgr);
>
>      *hit = false;
>
> +    /*
> +    * Backward compatibility path, most code should use
> +    * ExtendRelationBuffered() instead, as acquiring the extension lock
> +    * inside ExtendRelationBuffered() scales a lot better.
>
> Think these are old function names in the comment

Indeed.


> +static BlockNumber
> +ExtendBufferedRelShared(ExtendBufferedWhat eb,
> +                        ForkNumber fork,
> +                        BufferAccessStrategy strategy,
> +                        uint32 flags,
> +                        uint32 extend_by,
> +                        BlockNumber extend_upto,
> +                        Buffer *buffers,
> +                        uint32 *extended_by)
> +{
> +    BlockNumber first_block;
> +    IOContext    io_context = IOContextForStrategy(strategy);
> +
> +    LimitAdditionalPins(&extend_by);
> +
> +    /*
> +    * Acquire victim buffers for extension without holding extension lock.
> +    * Writing out victim buffers is the most expensive part of extending the
> +    * relation, particularly when doing so requires WAL flushes. Zeroing out
> +    * the buffers is also quite expensive, so do that before holding the
> +    * extension lock as well.
> +    *
> +    * These pages are pinned by us and not valid. While we hold the pin they
> +    * can't be acquired as victim buffers by another backend.
> +    */
> +    for (uint32 i = 0; i < extend_by; i++)
> +    {
> +        Block        buf_block;
> +
> +        buffers[i] = GetVictimBuffer(strategy, io_context);
> +        buf_block = BufHdrGetBlock(GetBufferDescriptor(buffers[i] - 1));
> +
> +        /* new buffers are zero-filled */
> +        MemSet((char *) buf_block, 0, BLCKSZ);
> +    }
> +
> +    /*
> +    * Lock relation against concurrent extensions, unless requested not to.
> +    *
> +    * We use the same extension lock for all forks. That's unnecessarily
> +    * restrictive, but currently extensions for forks don't happen often
> +    * enough to make it worth locking more granularly.
> +    *
> +    * Note that another backend might have extended the relation by the time
> +    * we get the lock.
> +    */
> +    if (!(flags & EB_SKIP_EXTENSION_LOCK))
> +    {
> +        LockRelationForExtension(eb.rel, ExclusiveLock);
> +        eb.smgr = RelationGetSmgr(eb.rel);
> +    }
> +
> +    /*
> +    * If requested, invalidate size cache, so that smgrnblocks asks the
> +    * kernel.
> +    */
> +    if (flags & EB_CLEAR_SIZE_CACHE)
> +        eb.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber;
>
> I don't see this in master, is it new?

Not really - it's just elsewhere. See vm_extend() and fsm_extend(). I could
move this part back into "Convert a few places to ExtendBufferedRelTo", but I
don't think that'd be better.


> + * Flags influencing the behaviour of ExtendBufferedRel*
> + */
> +typedef enum ExtendBufferedFlags
> +{
> +    /*
> +    * Don't acquire extension lock. This is safe only if the relation isn't
> +    * shared, an access exclusive lock is held or if this is the startup
> +    * process.
> +    */
> +    EB_SKIP_EXTENSION_LOCK = (1 << 0),
> +
> +    /* Is this extension part of recovery? */
> +    EB_PERFORMING_RECOVERY = (1 << 1),
> +
> +    /*
> +    * Should the fork be created if it does not currently exist? This likely
> +    * only ever makes sense for relation forks.
> +    */
> +    EB_CREATE_FORK_IF_NEEDED = (1 << 2),
> +
> +    /* Should the first (possibly only) return buffer be returned locked? */
> +    EB_LOCK_FIRST = (1 << 3),
> +
> +    /* Should the smgr size cache be cleared? */
> +    EB_CLEAR_SIZE_CACHE = (1 << 4),
> +
> +    /* internal flags follow */
>
> I don't understand what this comment means ("internal flags follow")

Hm - just that the flags defined subsequently are for internal use, not for
callers to specify.





> + */
> +static int
> +heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples,
> Size saveFreeSpace)
> +{
> +    size_t        page_avail;
> +    int            npages = 0;
> +
> +    page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
> +    npages++;
>
> can this not just be this:
>
> size_t        page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
> int            npages = 1;

Yes.


>
> From 5d2be27caf8f4ee8f26841b2aa1674c90bd51754 Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Wed, 26 Oct 2022 14:14:11 -0700
> Subject: [PATCH v5 12/15] hio: Use ExtendBufferedRelBy()

> ---
>  src/backend/access/heap/hio.c | 285 +++++++++++++++++-----------------
>  1 file changed, 146 insertions(+), 139 deletions(-)
>
> diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
> index 65886839e7..48cfcff975 100644
> --- a/src/backend/access/heap/hio.c
> +++ b/src/backend/access/heap/hio.c
> @@ -354,6 +270,9 @@ RelationGetBufferForTuple(Relation relation, Size len,
>
> so in RelationGetBufferForTuple() up above where your changes start,
> there is this code
>
>
>     /*
>     * We first try to put the tuple on the same page we last inserted a tuple
>     * on, as cached in the BulkInsertState or relcache entry.  If that
>     * doesn't work, we ask the Free Space Map to locate a suitable page.
>     * Since the FSM's info might be out of date, we have to be prepared to
>     * loop around and retry multiple times. (To insure this isn't an infinite
>     * loop, we must update the FSM with the correct amount of free space on
>     * each page that proves not to be suitable.)  If the FSM has no record of
>     * a page with enough free space, we give up and extend the relation.
>     *
>     * When use_fsm is false, we either put the tuple onto the existing target
>     * page or extend the relation.
>     */
>     if (bistate && bistate->current_buf != InvalidBuffer)
>     {
>         targetBlock = BufferGetBlockNumber(bistate->current_buf);
>     }
>     else
>         targetBlock = RelationGetTargetBlock(relation);
>
>     if (targetBlock == InvalidBlockNumber && use_fsm)
>     {
>         /*
>         * We have no cached target page, so ask the FSM for an initial
>         * target.
>         */
>         targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
>     }
>
> And, I was thinking how, ReadBufferBI() only has one caller now
> (RelationGetBufferForTuple()) and, this caller basically already has
> checked for the case in the inside of ReadBufferBI() (the code I pasted
> above)
>
>     /* If we have the desired block already pinned, re-pin and return it */
>     if (bistate->current_buf != InvalidBuffer)
>     {
>         if (BufferGetBlockNumber(bistate->current_buf) == targetBlock)
>         {
>             /*
>             * Currently the LOCK variants are only used for extending
>             * relation, which should never reach this branch.
>             */
>             Assert(mode != RBM_ZERO_AND_LOCK &&
>                   mode != RBM_ZERO_AND_CLEANUP_LOCK);
>
>             IncrBufferRefCount(bistate->current_buf);
>             return bistate->current_buf;
>         }
>         /* ... else drop the old buffer */
>
> So, I was thinking maybe there is some way to inline the logic for
> ReadBufferBI(), because I think it would feel more streamlined to me.

I don't really see how - I'd welcome suggestions?


> @@ -558,18 +477,46 @@ loop:
>              ReleaseBuffer(buffer);
>          }
>
> Oh, and I forget which commit introduced BulkInsertState->next_free and
> last_free, but I remember thinking that it didn't seem to fit with the
> other parts of that commit.

I'll move it into the one using ExtendBufferedRelBy().


> -        /* Without FSM, always fall out of the loop and extend */
> -        if (!use_fsm)
> -            break;
> +        if (bistate
> +            && bistate->next_free != InvalidBlockNumber
> +            && bistate->next_free <= bistate->last_free)
> +        {
> +            /*
> +            * We bulk extended the relation before, and there are still some
> +            * unused pages from that extension, so we don't need to look in
> +            * the FSM for a new page. But do record the free space from the
> +            * last page, somebody might insert narrower tuples later.
> +            */
>
> Why couldn't we have found out that we bulk-extended before and get the
> block from there up above the while loop?

I'm not quite sure I follow - above the loop there might still have been space
on the prior page? We also need the ability to loop if the space has been used
since.

I guess there's an argument for also checking above the loop, but I don't
think that'd currently ever be reachable.


> +            {
> +                bistate->next_free = InvalidBlockNumber;
> +                bistate->last_free = InvalidBlockNumber;
> +            }
> +            else
> +                bistate->next_free++;
> +        }
> +        else if (!use_fsm)
> +        {
> +            /* Without FSM, always fall out of the loop and extend */
> +            break;
> +        }
>
> It would be nice to have a comment explaining why this is in its own
> else if instead of breaking earlier (i.e. !use_fsm is still a valid case
> in the if branch above it)

I'm not quite following. Breaking where earlier?

Note that that branch is old code, it's just that a new way of getting a page
that potentially has free space is preceding it.



> we can get rid of needLock and waitcount variables like this
>
> +#define MAX_BUFFERS 64
> +        Buffer        victim_buffers[MAX_BUFFERS];
> +        BlockNumber firstBlock = InvalidBlockNumber;
> +        BlockNumber firstBlockFSM = InvalidBlockNumber;
> +        BlockNumber curBlock;
> +        uint32        extend_by_pages;
> +        uint32        no_fsm_pages;
> +        uint32        waitcount;
> +
> +        extend_by_pages = num_pages;
> +
> +        /*
> +        * Multiply the number of pages to extend by the number of waiters. Do
> +        * this even if we're not using the FSM, as it does relieve
> +        * contention. Pages will be found via bistate->next_free.
> +        */
> +        if (needLock)
> +            waitcount = RelationExtensionLockWaiterCount(relation);
> +        else
> +            waitcount = 0;
> +        extend_by_pages += extend_by_pages * waitcount;
>
>         if (!RELATION_IS_LOCAL(relation))
>             extend_by_pages += extend_by_pages *
> RelationExtensionLockWaiterCount(relation);

I guess I find it useful to be able to quickly add logging messages for stuff
like this. I don't think local variables are as bad as you make them out to be
:)

Thanks for the review!

Andres


