RE: [Patch] Optimize dropping of relation buffers using dlist - Mailing list pgsql-hackers

From k.jamison@fujitsu.com
Subject RE: [Patch] Optimize dropping of relation buffers using dlist
Date
Msg-id OSBPR01MB2341E08800C892C8B813F67EEFEE0@OSBPR01MB2341.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: [Patch] Optimize dropping of relation buffers using dlist  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: [Patch] Optimize dropping of relation buffers using dlist  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Thursday, November 5, 2020 10:22 AM, Horiguchi-san wrote:
> Hello.
>
> Many of the quetions are on the code following my past suggestions.

Yeah, I was also about to answer with the feedback you have given.
Thank you for replying and taking a look too.

> At Wed, 4 Nov 2020 15:59:17 +0530, Amit Kapila <amit.kapila16@gmail.com>
> wrote in
> > On Wed, Nov 4, 2020 at 8:28 AM k.jamison@fujitsu.com
> > <k.jamison@fujitsu.com> wrote:
> > >
> > > Hi,
> > >
> > > I've updated the patch 0004 (Truncate optimization) with the
> > > previous comments of Tsunakawa-san already addressed in the patch.
> > > (Thank you very much for the review.) The change here compared to
> > > the previous version is that in DropRelFileNodesAllBuffers() we don't
> check for the accurate flag anymore when deciding whether to optimize or
> not.
> > > For relations with blocks that do not exceed the threshold for full
> > > scan, we call DropRelFileNodeBuffers where the flag will be checked
> > > anyway. Otherwise, we proceed to the traditional buffer scan. Thoughts?
> > >
> >
> > Can we do a Truncate optimization once we decide about your other
> > patch as I see a few problems with it? If we can get the first patch
> > (vacuum optimization) committed it might be a bit easier for us to get
> > the truncate optimization. If possible, let's focus on (auto)vacuum
> > optimization first.

Sure. That'd be better.

> > Few comments on patches:
> > ======================
> > v29-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks
> > ----------------------------------------------------------------------
> > -------------
> > 1.
> > -smgrnblocks(SMgrRelation reln, ForkNumber forknum)
> > +smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *accurate)
> >  {
> >   BlockNumber result;
> >
> >   /*
> >   * For now, we only use cached values in recovery due to lack of a
> > shared
> > - * invalidation mechanism for changes in file size.
> > + * invalidation mechanism for changes in file size.  The cached
> > + values
> > + * could be smaller than the actual number of existing buffers of the file.
> > + * This is caused by lseek of buggy Linux kernels that might not have
> > + * accounted for the recent write.
> >   */
> >   if (InRecovery && reln->smgr_cached_nblocks[forknum] !=
> > InvalidBlockNumber)
> > + {
> > + if (accurate != NULL)
> > + *accurate = true;
> > +
> >
> > I don't understand this comment. Few emails back, I think we have
> > discussed that cached value can't be less than the number of buffers
> > during recovery. If that happens to be true then we have some problem.
> > If you want to explain 'accurate' variable then you can do the same
> > atop of function. Would it be better to name this variable as
> > 'cached'?
>
> (I agree that the comment needs to be fixed.)
>
> FWIW I don't think 'cached' suggests the characteristics of the returned value
> on its interface.  It was introduced to reduce fseek() calls, and after that we
> have found that it can be regarded as the authoritative source of the file size.
> The "accurate" means that it is guaranteed that we don't have a buffer for the
> file blocks further than that number.  I don't come up with a more proper
> word than "accurate" but also I don't think "cached" is proper here.

I also couldn't think of a better parameter name. Accurate seems to be better fit
as it describes a measurement close to an accepted value.
How about fixing the comment like below, would this suffice?

/*
 *    smgrnblocks() -- Calculate the number of blocks in the
 *                     supplied relation.
 *
*        accurate flag acts as an authoritative source of the file size and
 *        ensures that no buffers exist for blocks after the file size is known
 *        to the process.
 */
BlockNumber
smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *accurate)
{
    BlockNumber result;

    /*
     * For now, we only use cached values in recovery due to lack of a shared
     * invalidation mechanism for changes in file size.  In recovery, the cached
     * value returned by the first lseek could be smaller than the actual number
     * of existing buffers of the file, which is caused by buggy Linux kernels
     * that might not have accounted for the recent write.  However, we can
     * still rely on the cached value even if we get a bogus value from first
     * lseek since it is impossible to have buffer for blocks after the file size.
     */


> By the way, if there's a case where we extend a file by more than one block the
> cached value becomes invalid.  I'm not sure if it actually happens, but the
> following sequence may lead to a problem. We need a protection for that
> case.
>
> smgrnblocks()    : cached n
> truncate to n-5  : cached n=5
> extend to m + 2  : cached invalid
> (fsync failed)
> smgrnblocks()    : returns and cached n-5

I am not sure if the patch should cover this or should be a separate thread altogether since
a number of functions also rely on the smgrnblocks(). But I'll take it into consideration.


> > v29-0003-Optimize-DropRelFileNodeBuffers-during-recovery
> > ----------------------------------------------------------------------
> > ------------
> > 2.
> > + /* Check that it is in the buffer pool. If not, do nothing. */
> > + LWLockAcquire(bufPartitionLock, LW_SHARED); buf_id =
> > + BufTableLookup(&bufTag, bufHash); LWLockRelease(bufPartitionLock);
> > +
> > + if (buf_id < 0)
> > + continue;
> > +
> > + bufHdr = GetBufferDescriptor(buf_id);
> > +
> > + buf_state = LockBufHdr(bufHdr);
> > +
> > + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> >
> > I think a pre-check for RelFileNode might be better before LockBufHdr
> > for the reasons mentioned in this function few lines down.
>
> The equivalent check is already done by BufTableLookup(). The last line in
> the above is not a precheck but the final check.

Right. So I'll retain that current code.

> > 3.
> > -DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber
> *forkNum,
> > +DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber
> *forkNum,
> >      int nforks, BlockNumber *firstDelBlock)  {
> >   int i;
> >   int j;
> > + RelFileNodeBackend rnode;
> > + bool accurate;
> >
> > It is better to initialize accurate with false. Again, is it better to
> > change this variable name as 'cached'.
>
> *I* agree to initilization.

Understood. I'll include only the initialization in the next updated patch.


> > 4.
> > + /*
> > + * Look up the buffer in the hashtable if the block size is known to
> > + * be accurate and the total blocks to be invalidated is below the
> > + * full scan threshold.  Otherwise, give up the optimization.
> > + */
> > + if (accurate && nBlocksToInvalidate <
> BUF_DROP_FULL_SCAN_THRESHOLD)
> > + { for (j = 0; j < nforks; j++) { BlockNumber curBlock;
> > +
> > + for (curBlock = firstDelBlock[j]; curBlock < nForkBlocks[j];
> > + curBlock++) {
> > + uint32 bufHash; /* hash value for tag */ BufferTag bufTag; /*
> > + identity of requested block */
> > + LWLock     *bufPartitionLock; /* buffer partition lock for it */
> > + int buf_id;
> > +
> > + /* create a tag so we can lookup the buffer */
> > + INIT_BUFFERTAG(bufTag, rnode.node, forkNum[j], curBlock);
> > +
> > + /* determine its hash code and partition lock ID */ bufHash =
> > + BufTableHashCode(&bufTag); bufPartitionLock =
> > + BufMappingPartitionLock(bufHash);
> > +
> > + /* Check that it is in the buffer pool. If not, do nothing. */
> > + LWLockAcquire(bufPartitionLock, LW_SHARED); buf_id =
> > + BufTableLookup(&bufTag, bufHash); LWLockRelease(bufPartitionLock);
> > +
> > + if (buf_id < 0)
> > + continue;
> > +
> > + bufHdr = GetBufferDescriptor(buf_id);
> > +
> > + buf_state = LockBufHdr(bufHdr);
> > +
> > + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> > + bufHdr->tag.forkNum == forkNum[j] && tag.blockNum >=
> > + bufHdr->firstDelBlock[j])
> > + InvalidateBuffer(bufHdr); /* releases spinlock */ else
> > + UnlockBufHdr(bufHdr, buf_state); } } return; }
> >
> > Can we move the code under this 'if' condition to a separate function,
> > say FindAndDropRelFileNodeBuffers or something like that?
>
> Thinking about the TRUNCATE optimization, it sounds reasonable to have a
> separate function, which runs the optmized dropping unconditionally.

Hmm, sure., although only DropRelFileNodeBuffers() would call the new function.
I guess it won't be a problem.


> > v29-0004-TRUNCATE-optimization
> > ------------------------------------------------
> > 5.
> > + for (i = 0; i < n; i++)
> > + {
> > + nforks = 0;
> > + nBlocksToInvalidate = 0;
> > +
> > + for (j = 0; j <= MAX_FORKNUM; j++)
> > + {
> > + if (!smgrexists(rels[i], j))
> > + continue;
> > +
> > + /* Get the number of blocks for a relation's fork */ nblocks =
> > + smgrnblocks(rels[i], j, NULL);
> > +
> > + nBlocksToInvalidate += nblocks;
> > +
> > + forks[nforks++] = j;
> > + }
> > + if (nBlocksToInvalidate >= BUF_DROP_FULL_SCAN_THRESHOLD)
> goto
> > + buffer_full_scan;
> > +
> > + DropRelFileNodeBuffers(rels[i], forks, nforks, firstDelBlocks); }
> > + pfree(nodes); pfree(rels); pfree(rnodes); return;
> >
> > I think this can be slower than the current Truncate. Say there are
> > BUF_DROP_FULL_SCAN_THRESHOLD then you would anyway have to
> scan the
> > entire shared buffers so the work done in optimized path for other two
> > relations will add some over head.
>
> That's true.  The criteria here is the number of blocks of all relations. And
> even if all of the relations is smaller than the threshold, we should go to the
> full-scan dropping if the total size exceeds the threshold.  So we cannot
> reuse DropRelFileNodeBuffers() as is here.
> > Also, as written, I think you need to remove the nodes for which you
> > have invalidated the buffers via optimized path, no.

Right, in the current patch it is indeed slower.
But the decision criteria whether to optimize or not is decided per relation,
not for all relations. So there is a possibility that we have already invalidated
buffers of the first relation, but the next relation buffers exceed the threshold that we
need to do the full scan. So yes that should be fixed. Remove the nodes that we
have already invalidated so that we don't scan them anymore when scanning NBuffers.
I will fix in the next version.

Thank you for the helpful feedback. I'll upload the updated set of patches soon
also when we reach a consensus on the boolean parameter name too.

Regards,
Kirk Jamison



pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: vacuum -vs reltuples on insert only index
Next
From: Kyotaro Horiguchi
Date:
Subject: Re: Use standard SIGHUP and SIGTERM handlers in autoprewarm module