Re: [Patch] Optimize dropping of relation buffers using dlist - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: [Patch] Optimize dropping of relation buffers using dlist
Date
Msg-id 20201105.172939.256484268058713113.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: [Patch] Optimize dropping of relation buffers using dlist  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: [Patch] Optimize dropping of relation buffers using dlist  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
At Thu, 5 Nov 2020 11:07:21 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in 
> On Thu, Nov 5, 2020 at 8:26 AM k.jamison@fujitsu.com
> <k.jamison@fujitsu.com> wrote:
> > > > Few comments on patches:
> > > > ======================
> > > > v29-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks
> > > > ----------------------------------------------------------------------
> > > > -------------
> > > > 1.
> > > > -smgrnblocks(SMgrRelation reln, ForkNumber forknum)
> > > > +smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *accurate)
> > > >  {
> > > >   BlockNumber result;
> > > >
> > > >   /*
> > > >   * For now, we only use cached values in recovery due to lack of a
> > > > shared
> > > > - * invalidation mechanism for changes in file size.
> > > > + * invalidation mechanism for changes in file size.  The cached
> > > > + values
> > > > + * could be smaller than the actual number of existing buffers of the file.
> > > > + * This is caused by lseek of buggy Linux kernels that might not have
> > > > + * accounted for the recent write.
> > > >   */
> > > >   if (InRecovery && reln->smgr_cached_nblocks[forknum] !=
> > > > InvalidBlockNumber)
> > > > + {
> > > > + if (accurate != NULL)
> > > > + *accurate = true;
> > > > +
> > > >
> > > > I don't understand this comment. Few emails back, I think we have
> > > > discussed that cached value can't be less than the number of buffers
> > > > during recovery. If that happens to be true then we have some problem.
> > > > If you want to explain 'accurate' variable then you can do the same
> > > > atop of function. Would it be better to name this variable as
> > > > 'cached'?
> > >
> > > (I agree that the comment needs to be fixed.)
> > >
> > > FWIW I don't think 'cached' suggests the characteristics of the returned value
> > > on its interface.  It was introduced to reduce fseek() calls, and after that we
> > > have found that it can be regarded as the authoritative source of the file size.
> > > The "accurate" means that it is guaranteed that we don't have a buffer for the
> > > file blocks further than that number.  I don't come up with a more proper
> > > word than "accurate" but also I don't think "cached" is proper here.
> >
> 
> Sure but that is not the guarantee this API gives. It has to be
> guaranteed by the logic else-where, so not sure if it is a good idea
> to try to reflect the same here. The comments in the caller where we
> use this should explain why it is safe to use this value.

Isn't it already guaranteed by the bugmgr code that we don't have
buffers for nonexistent file blocks?  What is needed here is, yeah,
the returned value from smgrblocks is "reliable".  If "reliable" is
still not proper, I give up and agree to "cached".

> > I also couldn't think of a better parameter name. Accurate seems to be better fit
> > as it describes a measurement close to an accepted value.
> > How about fixing the comment like below, would this suffice?
> >
> > /*
> >  *      smgrnblocks() -- Calculate the number of blocks in the
> >  *                                       supplied relation.
> >  *
> > *               accurate flag acts as an authoritative source of the file size and
> >  *              ensures that no buffers exist for blocks after the file size is known
> >  *              to the process.
> >  */
> > BlockNumber
> > smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *accurate)
> > {
> >         BlockNumber result;
> >
> >         /*
> >          * For now, we only use cached values in recovery due to lack of a shared
> >          * invalidation mechanism for changes in file size.  In recovery, the cached
> >          * value returned by the first lseek could be smaller than the actual number
> >          * of existing buffers of the file, which is caused by buggy Linux kernels
> >          * that might not have accounted for the recent write.  However, we can
> >          * still rely on the cached value even if we get a bogus value from first
> >          * lseek since it is impossible to have buffer for blocks after the file size.
> >          */
> >
> >
> > > By the way, if there's a case where we extend a file by more than one block the
> > > cached value becomes invalid.  I'm not sure if it actually happens, but the
> > > following sequence may lead to a problem. We need a protection for that
> > > case.
> > >
> > > smgrnblocks()    : cached n
> > > truncate to n-5  : cached n=5
> > > extend to m + 2  : cached invalid
> > > (fsync failed)
> > > smgrnblocks()    : returns and cached n-5
> >
> 
> I think one possible idea is to actually commit the Assert patch
> (v29-0001-Prevent-invalidating-blocks-in-smgrextend-during) to ensure
> that it can't happen during recovery. And even if it happens why would
> there be any buffer with the block in it left when the fsync failed?
> And if there is no buffer with a block which doesn't account due to
> lseek lies then there shouldn't be any problem. Do you have any other
> ideas on what better can be done here?

Ouch! Sorry for the confusion. I confused that patch touches the
truncation side. Yes the 0001 does that.

> > I am not sure if the patch should cover this or should be a separate thread altogether since
> > a number of functions also rely on the smgrnblocks(). But I'll take it into consideration.
> >
> >
> > > > v29-0003-Optimize-DropRelFileNodeBuffers-during-recovery
> > > > ----------------------------------------------------------------------
> > > > ------------
> > > > 2.
> > > > + /* Check that it is in the buffer pool. If not, do nothing. */
> > > > + LWLockAcquire(bufPartitionLock, LW_SHARED); buf_id =
> > > > + BufTableLookup(&bufTag, bufHash); LWLockRelease(bufPartitionLock);
> > > > +
> > > > + if (buf_id < 0)
> > > > + continue;
> > > > +
> > > > + bufHdr = GetBufferDescriptor(buf_id);
> > > > +
> > > > + buf_state = LockBufHdr(bufHdr);
> > > > +
> > > > + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> > > >
> > > > I think a pre-check for RelFileNode might be better before LockBufHdr
> > > > for the reasons mentioned in this function few lines down.
> > >
> > > The equivalent check is already done by BufTableLookup(). The last line in
> > > the above is not a precheck but the final check.
> >
> 
> Which check in that API you are talking about? Are you telling because
> we are trying to use a hash value corresponding to rnode.node to find
> the block then I don't think it is equivalent because there is a
> difference in actual values. But even if we want to rely on that, a
> comment is required but I guess we can do the check as well because it
> shouldn't be a costly pre-check.

I think the only problematic case is that BufTableLookup wrongly
misses buffers actually to be dropped. (And the case of too-many
false-positives, not critical though.)  If omission is the case, we
cannot adopt this optimization at all.  And if the false-positive is
the case, maybe we need to adopt more restrictive prechecking, but
RelFileNodeEquals is *not* more restrictive than BufTableLookup in the
first place.

What case do you think is problematic when considering
BufTableLookup() as the prechecking?

> > > > 4.
> > > > + /*
> > > > + * Look up the buffer in the hashtable if the block size is known to
> > > > + * be accurate and the total blocks to be invalidated is below the
> > > > + * full scan threshold.  Otherwise, give up the optimization.
> > > > + */
> > > > + if (accurate && nBlocksToInvalidate <
> > > BUF_DROP_FULL_SCAN_THRESHOLD)
> > > > + { for (j = 0; j < nforks; j++) { BlockNumber curBlock;
> > > > +
> > > > + for (curBlock = firstDelBlock[j]; curBlock < nForkBlocks[j];
> > > > + curBlock++) {
> > > > + uint32 bufHash; /* hash value for tag */ BufferTag bufTag; /*
> > > > + identity of requested block */
> > > > + LWLock     *bufPartitionLock; /* buffer partition lock for it */
> > > > + int buf_id;
> > > > +
> > > > + /* create a tag so we can lookup the buffer */
> > > > + INIT_BUFFERTAG(bufTag, rnode.node, forkNum[j], curBlock);
> > > > +
> > > > + /* determine its hash code and partition lock ID */ bufHash =
> > > > + BufTableHashCode(&bufTag); bufPartitionLock =
> > > > + BufMappingPartitionLock(bufHash);
> > > > +
> > > > + /* Check that it is in the buffer pool. If not, do nothing. */
> > > > + LWLockAcquire(bufPartitionLock, LW_SHARED); buf_id =
> > > > + BufTableLookup(&bufTag, bufHash); LWLockRelease(bufPartitionLock);
> > > > +
> > > > + if (buf_id < 0)
> > > > + continue;
> > > > +
> > > > + bufHdr = GetBufferDescriptor(buf_id);
> > > > +
> > > > + buf_state = LockBufHdr(bufHdr);
> > > > +
> > > > + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> > > > + bufHdr->tag.forkNum == forkNum[j] && tag.blockNum >=
> > > > + bufHdr->firstDelBlock[j])
> > > > + InvalidateBuffer(bufHdr); /* releases spinlock */ else
> > > > + UnlockBufHdr(bufHdr, buf_state); } } return; }
> > > >
> > > > Can we move the code under this 'if' condition to a separate function,
> > > > say FindAndDropRelFileNodeBuffers or something like that?
> > >
> > > Thinking about the TRUNCATE optimization, it sounds reasonable to have a
> > > separate function, which runs the optmized dropping unconditionally.
> >
> > Hmm, sure., although only DropRelFileNodeBuffers() would call the new function.
> > I guess it won't be a problem.
> >
> 
> That shouldn't be a problem, you can make it a static function. It is
> more from the code-readability perspective.

> Sure, but feel free to leave the truncate optimization patch for now,
> we can do that as a follow-up patch once the vacuum-optimization patch
> is committed. Horiguchi-San, are you fine with this approach?

Of course.  I don't think we have to commit the two at once at all.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: Move catalog toast table and index declarations
Next
From: Takashi Menjo
Date:
Subject: Re: [PoC] Non-volatile WAL buffer