Re: [Patch] Optimize dropping of relation buffers using dlist - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: [Patch] Optimize dropping of relation buffers using dlist
Date
Msg-id CAA4eK1L5vNu9jnpS_54PbYmQcLEb1gaGivk_JVDuchp7XQRnDw@mail.gmail.com
Whole thread Raw
In response to RE: [Patch] Optimize dropping of relation buffers using dlist  ("k.jamison@fujitsu.com" <k.jamison@fujitsu.com>)
Responses Re: [Patch] Optimize dropping of relation buffers using dlist  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
On Thu, Nov 5, 2020 at 8:26 AM k.jamison@fujitsu.com
<k.jamison@fujitsu.com> wrote:
>
> On Thursday, November 5, 2020 10:22 AM, Horiguchi-san wrote:
> > >
> > > Can we do a Truncate optimization once we decide about your other
> > > patch as I see a few problems with it? If we can get the first patch
> > > (vacuum optimization) committed it might be a bit easier for us to get
> > > the truncate optimization. If possible, let's focus on (auto)vacuum
> > > optimization first.
>
> Sure. That'd be better.
>

Okay, thanks.

> > > Few comments on patches:
> > > ======================
> > > v29-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks
> > > ----------------------------------------------------------------------
> > > -------------
> > > 1.
> > > -smgrnblocks(SMgrRelation reln, ForkNumber forknum)
> > > +smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *accurate)
> > >  {
> > >   BlockNumber result;
> > >
> > >   /*
> > >   * For now, we only use cached values in recovery due to lack of a
> > > shared
> > > - * invalidation mechanism for changes in file size.
> > > + * invalidation mechanism for changes in file size.  The cached
> > > + values
> > > + * could be smaller than the actual number of existing buffers of the file.
> > > + * This is caused by lseek of buggy Linux kernels that might not have
> > > + * accounted for the recent write.
> > >   */
> > >   if (InRecovery && reln->smgr_cached_nblocks[forknum] !=
> > > InvalidBlockNumber)
> > > + {
> > > + if (accurate != NULL)
> > > + *accurate = true;
> > > +
> > >
> > > I don't understand this comment. Few emails back, I think we have
> > > discussed that cached value can't be less than the number of buffers
> > > during recovery. If that happens to be true then we have some problem.
> > > If you want to explain 'accurate' variable then you can do the same
> > > atop of function. Would it be better to name this variable as
> > > 'cached'?
> >
> > (I agree that the comment needs to be fixed.)
> >
> > FWIW I don't think 'cached' suggests the characteristics of the returned value
> > on its interface.  It was introduced to reduce fseek() calls, and after that we
> > have found that it can be regarded as the authoritative source of the file size.
> > The "accurate" means that it is guaranteed that we don't have a buffer for the
> > file blocks further than that number.  I don't come up with a more proper
> > word than "accurate" but also I don't think "cached" is proper here.
>

Sure but that is not the guarantee this API gives. It has to be
guaranteed by the logic else-where, so not sure if it is a good idea
to try to reflect the same here. The comments in the caller where we
use this should explain why it is safe to use this value.


> I also couldn't think of a better parameter name. Accurate seems to be better fit
> as it describes a measurement close to an accepted value.
> How about fixing the comment like below, would this suffice?
>
> /*
>  *      smgrnblocks() -- Calculate the number of blocks in the
>  *                                       supplied relation.
>  *
> *               accurate flag acts as an authoritative source of the file size and
>  *              ensures that no buffers exist for blocks after the file size is known
>  *              to the process.
>  */
> BlockNumber
> smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *accurate)
> {
>         BlockNumber result;
>
>         /*
>          * For now, we only use cached values in recovery due to lack of a shared
>          * invalidation mechanism for changes in file size.  In recovery, the cached
>          * value returned by the first lseek could be smaller than the actual number
>          * of existing buffers of the file, which is caused by buggy Linux kernels
>          * that might not have accounted for the recent write.  However, we can
>          * still rely on the cached value even if we get a bogus value from first
>          * lseek since it is impossible to have buffer for blocks after the file size.
>          */
>
>
> > By the way, if there's a case where we extend a file by more than one block the
> > cached value becomes invalid.  I'm not sure if it actually happens, but the
> > following sequence may lead to a problem. We need a protection for that
> > case.
> >
> > smgrnblocks()    : cached n
> > truncate to n-5  : cached n=5
> > extend to m + 2  : cached invalid
> > (fsync failed)
> > smgrnblocks()    : returns and cached n-5
>

I think one possible idea is to actually commit the Assert patch
(v29-0001-Prevent-invalidating-blocks-in-smgrextend-during) to ensure
that it can't happen during recovery. And even if it happens why would
there be any buffer with the block in it left when the fsync failed?
And if there is no buffer with a block which doesn't account due to
lseek lies then there shouldn't be any problem. Do you have any other
ideas on what better can be done here?

> I am not sure if the patch should cover this or should be a separate thread altogether since
> a number of functions also rely on the smgrnblocks(). But I'll take it into consideration.
>
>
> > > v29-0003-Optimize-DropRelFileNodeBuffers-during-recovery
> > > ----------------------------------------------------------------------
> > > ------------
> > > 2.
> > > + /* Check that it is in the buffer pool. If not, do nothing. */
> > > + LWLockAcquire(bufPartitionLock, LW_SHARED); buf_id =
> > > + BufTableLookup(&bufTag, bufHash); LWLockRelease(bufPartitionLock);
> > > +
> > > + if (buf_id < 0)
> > > + continue;
> > > +
> > > + bufHdr = GetBufferDescriptor(buf_id);
> > > +
> > > + buf_state = LockBufHdr(bufHdr);
> > > +
> > > + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> > >
> > > I think a pre-check for RelFileNode might be better before LockBufHdr
> > > for the reasons mentioned in this function few lines down.
> >
> > The equivalent check is already done by BufTableLookup(). The last line in
> > the above is not a precheck but the final check.
>

Which check in that API you are talking about? Are you telling because
we are trying to use a hash value corresponding to rnode.node to find
the block then I don't think it is equivalent because there is a
difference in actual values. But even if we want to rely on that, a
comment is required but I guess we can do the check as well because it
shouldn't be a costly pre-check.

>
> > > 4.
> > > + /*
> > > + * Look up the buffer in the hashtable if the block size is known to
> > > + * be accurate and the total blocks to be invalidated is below the
> > > + * full scan threshold.  Otherwise, give up the optimization.
> > > + */
> > > + if (accurate && nBlocksToInvalidate <
> > BUF_DROP_FULL_SCAN_THRESHOLD)
> > > + { for (j = 0; j < nforks; j++) { BlockNumber curBlock;
> > > +
> > > + for (curBlock = firstDelBlock[j]; curBlock < nForkBlocks[j];
> > > + curBlock++) {
> > > + uint32 bufHash; /* hash value for tag */ BufferTag bufTag; /*
> > > + identity of requested block */
> > > + LWLock     *bufPartitionLock; /* buffer partition lock for it */
> > > + int buf_id;
> > > +
> > > + /* create a tag so we can lookup the buffer */
> > > + INIT_BUFFERTAG(bufTag, rnode.node, forkNum[j], curBlock);
> > > +
> > > + /* determine its hash code and partition lock ID */ bufHash =
> > > + BufTableHashCode(&bufTag); bufPartitionLock =
> > > + BufMappingPartitionLock(bufHash);
> > > +
> > > + /* Check that it is in the buffer pool. If not, do nothing. */
> > > + LWLockAcquire(bufPartitionLock, LW_SHARED); buf_id =
> > > + BufTableLookup(&bufTag, bufHash); LWLockRelease(bufPartitionLock);
> > > +
> > > + if (buf_id < 0)
> > > + continue;
> > > +
> > > + bufHdr = GetBufferDescriptor(buf_id);
> > > +
> > > + buf_state = LockBufHdr(bufHdr);
> > > +
> > > + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> > > + bufHdr->tag.forkNum == forkNum[j] && tag.blockNum >=
> > > + bufHdr->firstDelBlock[j])
> > > + InvalidateBuffer(bufHdr); /* releases spinlock */ else
> > > + UnlockBufHdr(bufHdr, buf_state); } } return; }
> > >
> > > Can we move the code under this 'if' condition to a separate function,
> > > say FindAndDropRelFileNodeBuffers or something like that?
> >
> > Thinking about the TRUNCATE optimization, it sounds reasonable to have a
> > separate function, which runs the optmized dropping unconditionally.
>
> Hmm, sure., although only DropRelFileNodeBuffers() would call the new function.
> I guess it won't be a problem.
>

That shouldn't be a problem, you can make it a static function. It is
more from the code-readability perspective.

>
> > > v29-0004-TRUNCATE-optimization
> > > ------------------------------------------------
> > > 5.
> > > + for (i = 0; i < n; i++)
> > > + {
> > > + nforks = 0;
> > > + nBlocksToInvalidate = 0;
> > > +
> > > + for (j = 0; j <= MAX_FORKNUM; j++)
> > > + {
> > > + if (!smgrexists(rels[i], j))
> > > + continue;
> > > +
> > > + /* Get the number of blocks for a relation's fork */ nblocks =
> > > + smgrnblocks(rels[i], j, NULL);
> > > +
> > > + nBlocksToInvalidate += nblocks;
> > > +
> > > + forks[nforks++] = j;
> > > + }
> > > + if (nBlocksToInvalidate >= BUF_DROP_FULL_SCAN_THRESHOLD)
> > goto
> > > + buffer_full_scan;
> > > +
> > > + DropRelFileNodeBuffers(rels[i], forks, nforks, firstDelBlocks); }
> > > + pfree(nodes); pfree(rels); pfree(rnodes); return;
> > >
> > > I think this can be slower than the current Truncate. Say there are
> > > BUF_DROP_FULL_SCAN_THRESHOLD then you would anyway have to
> > scan the
> > > entire shared buffers so the work done in optimized path for other two
> > > relations will add some over head.
> >
> > That's true.  The criteria here is the number of blocks of all relations. And
> > even if all of the relations is smaller than the threshold, we should go to the
> > full-scan dropping if the total size exceeds the threshold.  So we cannot
> > reuse DropRelFileNodeBuffers() as is here.
> > > Also, as written, I think you need to remove the nodes for which you
> > > have invalidated the buffers via optimized path, no.
>
> Right, in the current patch it is indeed slower.
> But the decision criteria whether to optimize or not is decided per relation,
> not for all relations. So there is a possibility that we have already invalidated
> buffers of the first relation, but the next relation buffers exceed the threshold that we
> need to do the full scan. So yes that should be fixed. Remove the nodes that we
> have already invalidated so that we don't scan them anymore when scanning NBuffers.
> I will fix in the next version.
>
> Thank you for the helpful feedback. I'll upload the updated set of patches soon
> also when we reach a consensus on the boolean parameter name too.
>

Sure, but feel free to leave the truncate optimization patch for now,
we can do that as a follow-up patch once the vacuum-optimization patch
is committed. Horiguchi-San, are you fine with this approach?

-- 
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: Strange behavior with polygon and NaN
Next
From: Amit Kapila
Date:
Subject: Re: Some doubious code in pgstat.c