RE: [Patch] Optimize dropping of relation buffers using dlist - Mailing list pgsql-hackers

From k.jamison@fujitsu.com
Subject RE: [Patch] Optimize dropping of relation buffers using dlist
Date
Msg-id OSBPR01MB2341F8FC0DEED991777D2B4DEFF90@OSBPR01MB2341.jpnprd01.prod.outlook.com
In response to RE: [Patch] Optimize dropping of relation buffers using dlist  ("k.jamison@fujitsu.com" <k.jamison@fujitsu.com>)
List pgsql-hackers
> From: k.jamison@fujitsu.com <k.jamison@fujitsu.com>
> On Thursday, November 19, 2020 4:08 PM, Tsunakawa, Takayuki wrote:
> > From: Andres Freund <andres@anarazel.de>
> > > DropRelFileNodeBuffers() in recovery? The most common path is
> > > DropRelationFiles()->smgrdounlinkall()->DropRelFileNodesAllBuffers(),
> > > which 3/4 doesn't address and 4/4 doesn't mention.
> > >
> > > 4/4 seems to address DropRelationFiles(), but only talks about
> > > TRUNCATE?
> >
> > Yes.  DropRelationFiles() is used in the following two paths:
> >
> > [Replay of TRUNCATE during recovery]
> > xact_redo_commit/abort() -> DropRelationFiles() -> smgrdounlinkall() ->
> > DropRelFileNodesAllBuffers()
> >
> > [COMMIT/ROLLBACK PREPARED]
> > FinishPreparedTransaction() -> DropRelationFiles() -> smgrdounlinkall() ->
> > DropRelFileNodesAllBuffers()
>
> Yes. The concern is that the function descriptions and commit logs did not
> make clear what the optimizations in DropRelFileNodeBuffers() and
> DropRelFileNodesAllBuffers() are for. So I revised only the function
> description of DropRelFileNodeBuffers() and the commit logs of the
> 0003 and 0004 patches. Please check whether the brief descriptions suffice.
>
>
> > > I also don't get why 4/4 would be a good idea on its own. It uses
> > > BUF_DROP_FULL_SCAN_THRESHOLD to guard
> > > FindAndDropRelFileNodeBuffers() on a per relation basis. But since
> > > DropRelFileNodesAllBuffers() can be used for many relations at once,
> > > this could end up doing BUF_DROP_FULL_SCAN_THRESHOLD - 1 lookups
> > > a lot of times, once for each of nnodes relations?
> >
> > So, the threshold value should be compared with the total number of
> > blocks of all target relations, not each relation.  You seem to be right, got it.
>
> Fixed this in the 0004 patch. Now we compare the total number of buffers to
> be invalidated across ALL relations to BUF_DROP_FULL_SCAN_THRESHOLD.
>
> > > Also, how is 4/4 safe - this is outside of recovery too?
> >
> > It seems that DropRelFileNodesAllBuffers() should trigger the new
> > optimization path only when InRecovery == true, because it
> > intentionally doesn't check the "accurate" value returned from
> > smgrnblocks().
>
> Fixed it in the 0004 patch. Now we ensure that we enter the optimization
> path only during recovery.
>
>
> > From: Amit Kapila <amit.kapila16@gmail.com>
> > On Wed, Nov 18, 2020 at 11:43 PM Andres Freund <andres@anarazel.de> wrote:
> > > I'm also worried about the cases where this could cause buffers left
> > > in the buffer pool, without a crosscheck like Thomas' patch would
> > > allow to add. Obviously other processes can dirty buffers in
> > > hot_standby, so any leftover buffer could have bad consequences.
> > >
> >
> > The problem can only arise if other processes extend the relation. The
> > idea was that in recovery the relation is extended by a single process,
> > which helps to maintain the cache. Kirk seems to have done testing to
> > cross-verify it by using his first patch
> > (Prevent-invalidating-blocks-in-smgrextend-during). Which other
> > crosscheck are you referring to here?
> >
> > I agree that we can do a better job by expanding comments to clearly
> > state why it is safe.
>
> Yes, basically what Amit-san also mentioned above: the first patch prevents
> that. Please also check whether the description of DropRelFileNodeBuffers
> in the 0003 patch would suffice.
>
>
> > > Smaller comment:
> > >
> > > +static void
> > > +FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber *forkNum,
> > > +                              int nforks, BlockNumber *nForkBlocks,
> > > +                              BlockNumber *firstDelBlock)
> > > ...
> > > +            /* Check that it is in the buffer pool. If not, do nothing. */
> > > +            LWLockAcquire(bufPartitionLock, LW_SHARED);
> > > +            buf_id = BufTableLookup(&bufTag, bufHash);
> > > ...
> > > +            bufHdr = GetBufferDescriptor(buf_id);
> > > +
> > > +            buf_state = LockBufHdr(bufHdr);
> > > +
> > > +            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
> > > +                bufHdr->tag.forkNum == forkNum[i] &&
> > > +                bufHdr->tag.blockNum >= firstDelBlock[i])
> > > +                InvalidateBuffer(bufHdr);    /* releases spinlock */
> > > +            else
> > > +                UnlockBufHdr(bufHdr, buf_state);
> > >
> > > I'm a bit confused about the check here. We hold a buffer partition
> > > lock, and have done a lookup in the mapping table. Why are we then
> > > rechecking the relfilenode/fork/blocknum? And why are we doing so
> > > holding the buffer header lock, which is essentially a spinlock, so
> > > should only ever be held for very short portions?
> > >
> > > This looks like it's copying logic from DropRelFileNodeBuffers()
> > > etc, but there the situation is different: We haven't done a buffer
> > > mapping lookup, and we don't hold a partition lock!
> >
> > That's because the buffer partition lock is released immediately after
> > the hash table has been looked up.  As an aside, InvalidateBuffer()
> > requires the caller to hold the buffer header spinlock, but the caller
> > does not hold the buffer partition lock.
>
> Yes. Holding the buffer header spinlock is necessary to invalidate the
> buffers. As for the buffer mapping partition lock, as mentioned by
> Tsunakawa-san, it is released immediately after BufTableLookup, similar to
> the lookup done in PrefetchSharedBuffer. Because the partition lock is
> released before the buffer header is locked, the buffer could be evicted
> and reused for another page in the meantime, which is why the tag is
> re-checked under the buffer header spinlock before invalidating. So I
> retained these changes.
>
> I have attached the updated patches. Aside from the descriptions, there are
> no other major changes in the patch set except in 0004. Feedback is welcome.

Hi,

Given that I modified the 0004 patch, I repeated the recovery performance
tests I did in [1], but this time with 1000 relations (1 MB, i.e. 128 blocks
of 8 kB, per relation). With this relation size, the sequential full scan of
the buffer pool is expected to be taken for 128 MB shared_buffers, while the
optimized path is taken for the larger shared_buffers settings. A sketch of
the decision logic that selects between the two paths follows.
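
To make the revised behavior concrete, here is a minimal sketch of the
decision logic as I understand it after the changes; it is not the patch
text itself. The "accurate" out-parameter of smgrnblocks() is the one
introduced by this patch set (it does not exist on master), and the
per-relation fork/block bookkeeping (forkNum[], nforks, nForkBlocks[],
firstDelBlock[]) is elided for brevity:

    void
    DropRelFileNodesAllBuffers(SMgrRelation *smgr_reln, int nnodes)
    {
        BlockNumber nBlocksToInvalidate = 0;
        bool        cached = true;
        int         i;

        /* Sum the to-be-dropped blocks over ALL target relations ... */
        for (i = 0; i < nnodes && cached; i++)
        {
            ForkNumber  fork;

            for (fork = 0; fork <= MAX_FORKNUM; fork++)
            {
                bool        accurate;
                BlockNumber nblocks = smgrnblocks(smgr_reln[i], fork,
                                                  &accurate);

                if (!accurate)
                    cached = false; /* size did not come from the cache */
                else
                    nBlocksToInvalidate += nblocks;
            }
        }

        /* ... so the threshold is applied once per call, not per relation. */
        if (InRecovery && cached &&
            nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
        {
            /*
             * Optimized path: probe the buffer mapping table for each
             * to-be-dropped block instead of scanning all of shared_buffers.
             * forkNum[], nforks, nForkBlocks[], and firstDelBlock[] stand
             * for the per-relation arrays collected in the loop above.
             */
            for (i = 0; i < nnodes; i++)
                FindAndDropRelFileNodeBuffers(smgr_reln[i]->smgr_rnode.node,
                                              forkNum, nforks,
                                              nForkBlocks, firstDelBlock);
            return;
        }

        /* Fallback: the usual sequential scan of the whole buffer pool. */
    }

The key points are that nBlocksToInvalidate is accumulated over all nnodes
relations before the single threshold comparison, and that the InRecovery
check keeps the optimization out of normal operation, where the value
returned by smgrnblocks() is not guaranteed to be accurate.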

Below are the results:

[TRUNCATE]
| s_b    | MASTER (sec) | PATCHED (sec) |
|--------|--------------|---------------|
| 128 MB | 0.506        | 0.506         |
| 1 GB   | 0.906        | 0.506         |
| 20 GB  | 19.33        | 0.506         |
| 100 GB | 74.941       | 0.506         |

[VACUUM]
| s_b    | MASTER (sec) | PATCHED (sec) |
|--------|--------------|---------------|
| 128 MB | 1.207        | 0.737         |
| 1 GB   | 1.707        | 0.806         |
| 20 GB  | 14.325       | 0.806         |
| 100 GB | 64.728       | 1.307         |

Looking at the results for both VACUUM and TRUNCATE, we can see the
performance improvement brought by the optimizations. In addition, there
was no regression for the full scan of the whole buffer pool (as seen with
the 128 MB s_b setting).

Regards,
Kirk Jamison

[1]
https://www.postgresql.org/message-id/OSBPR01MB234176B1829AECFE9FDDFCC2EFE90%40OSBPR01MB2341.jpnprd01.prod.outlook.com


