Re: refactoring relation extension and BufferAlloc(), faster COPY - Mailing list pgsql-hackers

From Andres Freund
Subject Re: refactoring relation extension and BufferAlloc(), faster COPY
Date
Msg-id 20230227214530.vfe4gq5oipmsa3zu@awork3.anarazel.de
Whole thread Raw
In response to Re: refactoring relation extension and BufferAlloc(), faster COPY  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: refactoring relation extension and BufferAlloc(), faster COPY
List pgsql-hackers
Hi,

On 2023-02-27 18:06:22 +0200, Heikki Linnakangas wrote:
> We also do this in freespace.c and visibilitymap.c:
> 
>     /* Extend as needed. */
>     while (fsm_nblocks_now < fsm_nblocks)
>     {
>         PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now);
> 
>         smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
>                    pg.data, false);
>         fsm_nblocks_now++;
>      }
> 
> We could use the new smgrzeroextend function here. But it would be better to
> go through the buffer cache, because after this, the last block, at
> 'fsm_nblocks', will be read with ReadBuffer() and modified.

I doubt it's a particularly crucial thing to optimize.

But, uh, isn't this code racy? Because this doesn't go through shared buffers,
there's no IO_IN_PROGRESS interlocking against a concurrent reader. We know
that writing pages isn't atomic vs readers. So another connection could
connection could see the new relation size, but a read might return a
partially written state of the page. Which then would cause checksum
failures. And even worse, I think it could lead to loosing a write, if the
concurrent connection writes out a page.


> We could use BulkExtendSharedRelationBuffered() to extend the relation and
> keep the last page locked, but the BulkExtendSharedRelationBuffered()
> signature doesn't allow that. It can return the first N pages locked, but
> there's no way to return the *last* page locked.

We can't rely on bulk extending a, potentially, large number of pages in one
go anyway (since we might not be allowed to pin that many pages). So I don't
think requiring locking the last page is a really viable API.

I think for this case I'd just just use the ExtendRelationTo() API we were
discussing nearby. Compared to the cost of reducing syscalls / filesystem
overhead to extend the relation, the cost of the buffer mapping lookup does't
seem significant. That's different in e.g. the hio.c case, because there we
need a buffer with free space, and concurrent activity could otherwise fill up
the buffer before we can lock it again.


I had started hacking on ExtendRelationTo() that when I saw problems with the
existing code that made me hesitate:
https://postgr.es/m/20230223010147.32oir7sb66slqnjk%40awork3.anarazel.de


> Perhaps we should decompose this function into several function calls.
> Something like:
> 
> /* get N victim buffers, pinned and !BM_VALID */
> buffers = BeginExtendRelation(int npages);
> 
> LockRelationForExtension(rel)
> 
> /* Insert buffers into buffer table */
> first_blk = smgrnblocks()
> for (blk = first_blk; blk < last_blk; blk++)
>     MapNewBuffer(blk, buffers[i])
> 
> /* extend the file on disk */
> smgrzeroextend();
> 
> UnlockRelationForExtension(rel)
> 
> for (blk = first_blk; blk < last_blk; blk++)
> {
>     memset(BufferGetPage(buffers[i]), 0,
>     FinishNewBuffer(buffers[i])
>     /* optionally lock the buffer */
>     LockBuffer(buffers[i]);
> }
> 
> That's a lot more verbose, of course, but gives the callers the flexibility.
> And might even be more readable than one function call with lots of
> arguments.

To me this seems like a quite bad idea. The amount of complexity this would
expose all over the tree is substantial. Which would also make it harder to
further improve relation extension at a later date. It certainly shouldn't be
the default interface. And I'm not sure I see a promisung usecase.


> This would expose the concept of a buffer that's mapped but marked as
> IO-in-progress outside bufmgr.c. On one hand, maybe that's exposing details
> that shouldn't be exposed. On the other hand, it might come handy. Instead
> of RBM_ZERO_AND_LOCK mode, for example, it might be handy to have a function
> that returns an IO-in-progress buffer that you can initialize any way you
> want.

I'd much rather encapsulate that in additional functions, or perhaps a
callback that can make decisions about what to do.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Peter Smith
Date:
Subject: Re: PGDOCS - sgml linkend using single-quotes
Next
From: Juan José Santamaría Flecha
Date:
Subject: Re: WIN32 pg_import_system_collations