Re: refactoring relation extension and BufferAlloc(), faster COPY - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: refactoring relation extension and BufferAlloc(), faster COPY |
Date | |
Msg-id | 20230227214530.vfe4gq5oipmsa3zu@awork3.anarazel.de Whole thread Raw |
In response to | Re: refactoring relation extension and BufferAlloc(), faster COPY (Heikki Linnakangas <hlinnaka@iki.fi>) |
Responses |
Re: refactoring relation extension and BufferAlloc(), faster COPY
|
List | pgsql-hackers |
Hi, On 2023-02-27 18:06:22 +0200, Heikki Linnakangas wrote: > We also do this in freespace.c and visibilitymap.c: > > /* Extend as needed. */ > while (fsm_nblocks_now < fsm_nblocks) > { > PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now); > > smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now, > pg.data, false); > fsm_nblocks_now++; > } > > We could use the new smgrzeroextend function here. But it would be better to > go through the buffer cache, because after this, the last block, at > 'fsm_nblocks', will be read with ReadBuffer() and modified. I doubt it's a particularly crucial thing to optimize. But, uh, isn't this code racy? Because this doesn't go through shared buffers, there's no IO_IN_PROGRESS interlocking against a concurrent reader. We know that writing pages isn't atomic vs readers. So another connection could connection could see the new relation size, but a read might return a partially written state of the page. Which then would cause checksum failures. And even worse, I think it could lead to loosing a write, if the concurrent connection writes out a page. > We could use BulkExtendSharedRelationBuffered() to extend the relation and > keep the last page locked, but the BulkExtendSharedRelationBuffered() > signature doesn't allow that. It can return the first N pages locked, but > there's no way to return the *last* page locked. We can't rely on bulk extending a, potentially, large number of pages in one go anyway (since we might not be allowed to pin that many pages). So I don't think requiring locking the last page is a really viable API. I think for this case I'd just just use the ExtendRelationTo() API we were discussing nearby. Compared to the cost of reducing syscalls / filesystem overhead to extend the relation, the cost of the buffer mapping lookup does't seem significant. That's different in e.g. the hio.c case, because there we need a buffer with free space, and concurrent activity could otherwise fill up the buffer before we can lock it again. I had started hacking on ExtendRelationTo() that when I saw problems with the existing code that made me hesitate: https://postgr.es/m/20230223010147.32oir7sb66slqnjk%40awork3.anarazel.de > Perhaps we should decompose this function into several function calls. > Something like: > > /* get N victim buffers, pinned and !BM_VALID */ > buffers = BeginExtendRelation(int npages); > > LockRelationForExtension(rel) > > /* Insert buffers into buffer table */ > first_blk = smgrnblocks() > for (blk = first_blk; blk < last_blk; blk++) > MapNewBuffer(blk, buffers[i]) > > /* extend the file on disk */ > smgrzeroextend(); > > UnlockRelationForExtension(rel) > > for (blk = first_blk; blk < last_blk; blk++) > { > memset(BufferGetPage(buffers[i]), 0, > FinishNewBuffer(buffers[i]) > /* optionally lock the buffer */ > LockBuffer(buffers[i]); > } > > That's a lot more verbose, of course, but gives the callers the flexibility. > And might even be more readable than one function call with lots of > arguments. To me this seems like a quite bad idea. The amount of complexity this would expose all over the tree is substantial. Which would also make it harder to further improve relation extension at a later date. It certainly shouldn't be the default interface. And I'm not sure I see a promisung usecase. > This would expose the concept of a buffer that's mapped but marked as > IO-in-progress outside bufmgr.c. On one hand, maybe that's exposing details > that shouldn't be exposed. On the other hand, it might come handy. Instead > of RBM_ZERO_AND_LOCK mode, for example, it might be handy to have a function > that returns an IO-in-progress buffer that you can initialize any way you > want. I'd much rather encapsulate that in additional functions, or perhaps a callback that can make decisions about what to do. Greetings, Andres Freund
pgsql-hackers by date: