Re: [Patch] Optimize dropping of relation buffers using dlist - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: [Patch] Optimize dropping of relation buffers using dlist
Date
Msg-id 20201105.102205.1642087700692673022.horikyota.ntt@gmail.com
In response to Re: [Patch] Optimize dropping of relation buffers using dlist  (Amit Kapila <amit.kapila16@gmail.com>)
Responses RE: [Patch] Optimize dropping of relation buffers using dlist  ("k.jamison@fujitsu.com" <k.jamison@fujitsu.com>)
List pgsql-hackers
Hello.

Many of the questions concern code that follows my past suggestions.

At Wed, 4 Nov 2020 15:59:17 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in 
> On Wed, Nov 4, 2020 at 8:28 AM k.jamison@fujitsu.com
> <k.jamison@fujitsu.com> wrote:
> >
> > Hi,
> >
> > I've updated the patch 0004 (Truncate optimization) with the previous comments of
> > Tsunakawa-san already addressed in the patch. (Thank you very much for the review.)
> > The change here compared to the previous version is that in DropRelFileNodesAllBuffers()
> > we don't check for the accurate flag anymore when deciding whether to optimize or not.
> > For relations with blocks that do not exceed the threshold for full scan, we call
> > DropRelFileNodeBuffers where the flag will be checked anyway. Otherwise, we proceed
> > to the traditional buffer scan. Thoughts?
> >
> 
> Can we do a Truncate optimization once we decide about your other
> patch as I see a few problems with it? If we can get the first patch
> (vacuum optimization) committed it might be a bit easier for us to get
> the truncate optimization. If possible, let's focus on (auto)vacuum
> optimization first.
> 
> Few comments on patches:
> ======================
> v29-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks
> -----------------------------------------------------------------------------------
> 1.
> -smgrnblocks(SMgrRelation reln, ForkNumber forknum)
> +smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *accurate)
>  {
>   BlockNumber result;
> 
>   /*
>   * For now, we only use cached values in recovery due to lack of a shared
> - * invalidation mechanism for changes in file size.
> + * invalidation mechanism for changes in file size.  The cached values
> + * could be smaller than the actual number of existing buffers of the file.
> + * This is caused by lseek of buggy Linux kernels that might not have
> + * accounted for the recent write.
>   */
>   if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
> + {
> + if (accurate != NULL)
> + *accurate = true;
> +
> 
> I don't understand this comment. Few emails back, I think we have
> discussed that cached value can't be less than the number of buffers
> during recovery. If that happens to be true then we have some problem.
> If you want to explain 'accurate' variable then you can do the same
> atop of function. Would it be better to name this variable as
> 'cached'?

(I agree that the comment needs to be fixed.)

FWIW, I don't think 'cached' conveys the characteristic of the
returned value at the interface level.  The cache was introduced to
reduce lseek() calls, and after that we found that it can be regarded
as the authoritative source of the file size.  "Accurate" means that
it is guaranteed that we don't have a buffer for any file block beyond
that number.  I can't come up with a better word than "accurate", but
I also don't think "cached" is proper here.

By the way, if there's a case where we extend a file by more than one
block, the cached value becomes invalid.  I'm not sure whether that
actually happens, but the following sequence could lead to a problem,
so we need protection against it.

smgrnblocks()    : cached n
truncate to n-5  : cached n-5
extend to m + 2  : cached invalid
(fsync failed)
smgrnblocks()    : returns and caches n-5



> v29-0003-Optimize-DropRelFileNodeBuffers-during-recovery
> ----------------------------------------------------------------------------------
> 2.
> + /* Check that it is in the buffer pool. If not, do nothing. */
> + LWLockAcquire(bufPartitionLock, LW_SHARED);
> + buf_id = BufTableLookup(&bufTag, bufHash);
> + LWLockRelease(bufPartitionLock);
> +
> + if (buf_id < 0)
> + continue;
> +
> + bufHdr = GetBufferDescriptor(buf_id);
> +
> + buf_state = LockBufHdr(bufHdr);
> +
> + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> 
> I think a pre-check for RelFileNode might be better before LockBufHdr
> for the reasons mentioned in this function few lines down.

The equivalent check is already done by BufTableLookup().  The last
line above is not a pre-check but the final check.

> 3.
> -DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
> +DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
>      int nforks, BlockNumber *firstDelBlock)
>  {
>   int i;
>   int j;
> + RelFileNodeBackend rnode;
> + bool accurate;
> 
> It is better to initialize accurate with false. Again, is it better to
> change this variable name as 'cached'.

*I* agree with the initialization.

> 4.
> + /*
> + * Look up the buffer in the hashtable if the block size is known to
> + * be accurate and the total blocks to be invalidated is below the
> + * full scan threshold.  Otherwise, give up the optimization.
> + */
> + if (accurate && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
> + {
> + for (j = 0; j < nforks; j++)
> + {
> + BlockNumber curBlock;
> +
> + for (curBlock = firstDelBlock[j]; curBlock < nForkBlocks[j]; curBlock++)
> + {
> + uint32 bufHash; /* hash value for tag */
> + BufferTag bufTag; /* identity of requested block */
> + LWLock     *bufPartitionLock; /* buffer partition lock for it */
> + int buf_id;
> +
> + /* create a tag so we can lookup the buffer */
> + INIT_BUFFERTAG(bufTag, rnode.node, forkNum[j], curBlock);
> +
> + /* determine its hash code and partition lock ID */
> + bufHash = BufTableHashCode(&bufTag);
> + bufPartitionLock = BufMappingPartitionLock(bufHash);
> +
> + /* Check that it is in the buffer pool. If not, do nothing. */
> + LWLockAcquire(bufPartitionLock, LW_SHARED);
> + buf_id = BufTableLookup(&bufTag, bufHash);
> + LWLockRelease(bufPartitionLock);
> +
> + if (buf_id < 0)
> + continue;
> +
> + bufHdr = GetBufferDescriptor(buf_id);
> +
> + buf_state = LockBufHdr(bufHdr);
> +
> + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> + bufHdr->tag.forkNum == forkNum[j] &&
> + bufHdr->tag.blockNum >= firstDelBlock[j])
> + InvalidateBuffer(bufHdr); /* releases spinlock */
> + else
> + UnlockBufHdr(bufHdr, buf_state);
> + }
> + }
> + return;
> + }
> 
> Can we move the code under this 'if' condition to a separate function,
> say FindAndDropRelFileNodeBuffers or something like that?

Thinking about the TRUNCATE optimization, it sounds reasonable to have
a separate function that runs the optimized dropping unconditionally.

> v29-0004-TRUNCATE-optimization
> ------------------------------------------------
> 5.
> + for (i = 0; i < n; i++)
> + {
> + nforks = 0;
> + nBlocksToInvalidate = 0;
> +
> + for (j = 0; j <= MAX_FORKNUM; j++)
> + {
> + if (!smgrexists(rels[i], j))
> + continue;
> +
> + /* Get the number of blocks for a relation's fork */
> + nblocks = smgrnblocks(rels[i], j, NULL);
> +
> + nBlocksToInvalidate += nblocks;
> +
> + forks[nforks++] = j;
> + }
> + if (nBlocksToInvalidate >= BUF_DROP_FULL_SCAN_THRESHOLD)
> + goto buffer_full_scan;
> +
> + DropRelFileNodeBuffers(rels[i], forks, nforks, firstDelBlocks);
> + }
> + pfree(nodes);
> + pfree(rels);
> + pfree(rnodes);
> + return;
> 
> I think this can be slower than the current Truncate. Say there are
> BUF_DROP_FULL_SCAN_THRESHOLD then you would anyway have to scan the
> entire shared buffers so the work done in optimized path for other two
> relations will add some over head.

That's true.  The criterion here is the total number of blocks across
all relations.  Even if every individual relation is smaller than the
threshold, we should go to full-scan dropping when the total size
exceeds the threshold.  So we cannot reuse DropRelFileNodeBuffers() as
is here.

> Also, as written, I think you need to remove the nodes for which you
> have invalidated the buffers via optimized path, no.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


