Re: refactoring relation extension and BufferAlloc(), faster COPY - Mailing list pgsql-hackers

From Melanie Plageman
Subject Re: refactoring relation extension and BufferAlloc(), faster COPY
Date
Msg-id CAAKRu_Z_oetHcv37trJxaZv2N-vA4Z6Hoaq1PRnRL9M68EXtKw@mail.gmail.com
Whole thread Raw
In response to Re: refactoring relation extension and BufferAlloc(), faster COPY  (Andres Freund <andres@anarazel.de>)
Responses Re: refactoring relation extension and BufferAlloc(), faster COPY
List pgsql-hackers
On Tue, Mar 28, 2023 at 11:47 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2023-03-26 17:42:45 -0400, Melanie Plageman wrote:
> > +    {
> > +        /* if errno is unset, assume problem is no disk space */
> > +        if (errno == 0)
> > +            errno = ENOSPC;
> > +        return -1;
> > +    }
> >
> > +int
> > +FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
> > +{
> > +#ifdef HAVE_POSIX_FALLOCATE
> > +    int            returnCode;
> > +
> > +    Assert(FileIsValid(file));
> > +    returnCode = FileAccess(file);
> > +    if (returnCode < 0)
> > +        return returnCode;
> > +
> > +    pgstat_report_wait_start(wait_event_info);
> > +    returnCode = posix_fallocate(VfdCache[file].fd, offset, amount);
> > +    pgstat_report_wait_end();
> > +
> > +    if (returnCode == 0)
> > +        return 0;
> > +
> > +    /* for compatibility with %m printing etc */
> > +    errno = returnCode;
> > +
> > +    /*
> > +    * Return in cases of a "real" failure, if fallocate is not supported,
> > +    * fall through to the FileZero() backed implementation.
> > +    */
> > +    if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
> > +        return returnCode;
> >
> > I'm pretty sure you can just delete the below if statement
> >
> > +    if (returnCode == 0 ||
> > +        (returnCode != EINVAL && returnCode != EINVAL))
> > +        return returnCode;
>
> Hm. I don't see how - wouldn't that lead us to call FileZero(), even if
> FileFallocate() succeeded or failed (rather than not being supported)?

Uh...I'm confused...maybe my eyes aren't working. If returnCode was 0,
you already would have returned and if returnCode wasn't EINVAL, you
also already would have returned.
Not to mention (returnCode != EINVAL && returnCode != EINVAL) contains
two identical operands.

> > +void
> > +mdzeroextend(SMgrRelation reln, ForkNumber forknum,
> > +            BlockNumber blocknum, int nblocks, bool skipFsync)
> >
> > So, I think there are  a few too many local variables in here, and it
> > actually makes it more confusing.
> > Assuming you would like to keep the input parameters blocknum and
> > nblocks unmodified for debugging/other reasons, here is a suggested
> > refactor of this function
>
> I'm mostly adopting this.
>
>
> > Also, I think you can combine the two error cases (I don't know if the
> > user cares what you were trying to extend the file with).
>
> Hm. I do find it a somewhat useful distinction for figuring out problems - we
> haven't used posix_fallocate for files so far, it seems plausible we'd hit
> some portability issues.  We could make it an errdetail(), I guess?

I think that would be clearer.

> > From ad7cd10a6c340d7f7d0adf26d5e39224dfd8439d Mon Sep 17 00:00:00 2001
> > From: Andres Freund <andres@anarazel.de>
> > Date: Wed, 26 Oct 2022 12:05:07 -0700
> > Subject: [PATCH v5 05/15] bufmgr: Add Pin/UnpinLocalBuffer()
> >
> > diff --git a/src/backend/storage/buffer/bufmgr.c
> > b/src/backend/storage/buffer/bufmgr.c
> > index fa20fab5a2..6f50dbd212 100644
> > --- a/src/backend/storage/buffer/bufmgr.c
> > +++ b/src/backend/storage/buffer/bufmgr.c
> > @@ -4288,18 +4268,16 @@ ConditionalLockBuffer(Buffer buffer)
> >  }
> >
> >  void
> > -BufferCheckOneLocalPin(Buffer buffer)
> > +BufferCheckWePinOnce(Buffer buffer)
> >
> > This name is weird. Who is we?
>
> The current backend. I.e. the function checks the current backend pins the
> buffer exactly once, rather that *any* backend pins it once.
>
> I now see that BufferIsPinned() is named, IMO, misleadingly, more generally,
> even though it also just applies to pins by the current backend.

Maybe there is a way to use "self" instead of a pronoun? But, if you
feel quite strongly about a pronoun, I think "we" implies more than one
backend, so "I" would be better.

> > @@ -1595,6 +1413,237 @@ retry:
> >      StrategyFreeBuffer(buf);
> >  }
> >
> > +/*
> > + * Helper routine for GetVictimBuffer()
> > + *
> > + * Needs to be called on a buffer with a valid tag, pinned, but without the
> > + * buffer header spinlock held.
> > + *
> > + * Returns true if the buffer can be reused, in which case the buffer is only
> > + * pinned by this backend and marked as invalid, false otherwise.
> > + */
> > +static bool
> > +InvalidateVictimBuffer(BufferDesc *buf_hdr)
> > +{
> > +    /*
> > +    * Clear out the buffer's tag and flags and usagecount.  This is not
> > +    * strictly required, as BM_TAG_VALID/BM_VALID needs to be checked before
> > +    * doing anything with the buffer. But currently it's beneficial as the
> > +    * pre-check for several linear scans of shared buffers just checks the
> > +    * tag.
> >
> > I don't really understand the above comment -- mainly the last sentence.
>
> To start with, it's s/checks/check/
>
> "linear scans" is a reference to functions like DropRelationBuffers(), which
> iterate over all buffers, and just check the tag for a match. If we leave the
> tag around, it'll still work, as InvalidateBuffer() etc will figure out that
> the buffer is invalid. But of course that's slower then just skipping the
> buffer "early on".

Ah. I see the updated comment on your branch and find it to be more clear.

> > @@ -4709,8 +4704,6 @@ TerminateBufferIO(BufferDesc *buf, bool
> > clear_dirty, uint32 set_flag_bits)
> >  {
> >      uint32        buf_state;
> >
> > I noticed that the comment above TermianteBufferIO() says
> >  * TerminateBufferIO: release a buffer we were doing I/O on
> >  *    (Assumptions)
> >  *    My process is executing IO for the buffer
> >
> > Can we still say this is an assumption? What about when it is being
> > cleaned up after being called from AbortBufferIO()
>
> That hasn't really changed - it was already called by AbortBufferIO().
>
> I think it's still correct, too. We must have marked the IO as being in
> progress to get there.

Oh, no I meant the "my process is executing IO for the buffer" --
couldn't another backend clear IO_IN_PROGRESS (i.e. not the original one
who set it on all the victim buffers)?

> > Also, I realize that existing code in this file has the extraneous
> > parantheses, but maybe it isn't worth staying consistent with that?
> > as in: &(owner->bufferioarr)
>
> I personally don't find it worth being consistent with that, but if you /
> others think it is, I'd be ok with adapting to that.

Right, so I'm saying you should remove the extraneous parentheses in the
code you added.

> > From f26d1fa7e528d04436402aa8f94dc2442999dde3 Mon Sep 17 00:00:00 2001
> > From: Andres Freund <andres@anarazel.de>
> > Date: Wed, 1 Mar 2023 13:24:19 -0800
> > Subject: [PATCH v5 09/15] bufmgr: Move relation extension handling into
> >  ExtendBufferedRel{By,To,}
> > + * Flags influencing the behaviour of ExtendBufferedRel*
> > + */
> > +typedef enum ExtendBufferedFlags
> > +{
> > +    /*
> > +    * Don't acquire extension lock. This is safe only if the relation isn't
> > +    * shared, an access exclusive lock is held or if this is the startup
> > +    * process.
> > +    */
> > +    EB_SKIP_EXTENSION_LOCK = (1 << 0),
> > +
> > +    /* Is this extension part of recovery? */
> > +    EB_PERFORMING_RECOVERY = (1 << 1),
> > +
> > +    /*
> > +    * Should the fork be created if it does not currently exist? This likely
> > +    * only ever makes sense for relation forks.
> > +    */
> > +    EB_CREATE_FORK_IF_NEEDED = (1 << 2),
> > +
> > +    /* Should the first (possibly only) return buffer be returned locked? */
> > +    EB_LOCK_FIRST = (1 << 3),
> > +
> > +    /* Should the smgr size cache be cleared? */
> > +    EB_CLEAR_SIZE_CACHE = (1 << 4),
> > +
> > +    /* internal flags follow */
> >
> > I don't understand what this comment means ("internal flags follow")
>
> Hm - just that the flags defined subsequently are for internal use, not for
> callers to specify.

If EB_LOCK_TARGET is the only one of these, it might be more clear, for
now, to just say "an internal flag" or "for internal use" above
EB_LOCK_TARGET, since it is the only one.

> > -        /* Without FSM, always fall out of the loop and extend */
> > -        if (!use_fsm)
> > -            break;
> > +        if (bistate
> > +            && bistate->next_free != InvalidBlockNumber
> > +            && bistate->next_free <= bistate->last_free)
> > +        {
> > +            /*
> > +            * We bulk extended the relation before, and there are still some
> > +            * unused pages from that extension, so we don't need to look in
> > +            * the FSM for a new page. But do record the free space from the
> > +            * last page, somebody might insert narrower tuples later.
> > +            */
> >
> > Why couldn't we have found out that we bulk-extended before and get the
> > block from there up above the while loop?
>
> I'm not quite sure I follow - above the loop there might still have been space
> on the prior page? We also need the ability to loop if the space has been used
> since.
>
> I guess there's an argument for also checking above the loop, but I don't
> think that'd currently ever be reachable.

My idea was that directly below this code in RelationGetBufferForTuple():

    if (bistate && bistate->current_buf != InvalidBuffer)
        targetBlock = BufferGetBlockNumber(bistate->current_buf);
    else
        targetBlock = RelationGetTargetBlock(relation);

We could check bistate->next_free instead of checking the freespace map first.

But, perhaps we couldn't hit this because we would have already have set
current_buf if we had a next_free?

- Melanie



pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: Add pg_walinspect function with block info columns
Next
From: Andres Freund
Date:
Subject: Re: refactoring relation extension and BufferAlloc(), faster COPY