Re: [Patch] Optimize dropping of relation buffers using dlist - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: [Patch] Optimize dropping of relation buffers using dlist
Date
Msg-id 20201126.161855.2277488410017350751.horikyota.ntt@gmail.com
In response to RE: [Patch] Optimize dropping of relation buffers using dlist  ("k.jamison@fujitsu.com" <k.jamison@fujitsu.com>)
Responses Re: [Patch] Optimize dropping of relation buffers using dlist  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
RE: [Patch] Optimize dropping of relation buffers using dlist  ("k.jamison@fujitsu.com" <k.jamison@fujitsu.com>)
RE: [Patch] Optimize dropping of relation buffers using dlist  ("k.jamison@fujitsu.com" <k.jamison@fujitsu.com>)
List pgsql-hackers
Hello, Kirk. Thank you for the new version.

At Thu, 26 Nov 2020 03:04:10 +0000, "k.jamison@fujitsu.com" <k.jamison@fujitsu.com> wrote in 
> On Thursday, November 19, 2020 4:08 PM, Tsunakawa, Takayuki wrote:
> > From: Andres Freund <andres@anarazel.de>
> > > DropRelFileNodeBuffers() in recovery? The most common path is
> > > DropRelationFiles()->smgrdounlinkall()->DropRelFileNodesAllBuffers(),
> > > which 3/4 doesn't address and 4/4 doesn't mention.
> > >
> > > 4/4 seems to address DropRelationFiles(), but only talks about
> > TRUNCATE?
> > 
> > Yes.  DropRelationFiles() is used in the following two paths:
> > 
> > [Replay of TRUNCATE during recovery]
> > xact_redo_commit/abort() -> DropRelationFiles()  -> smgrdounlinkall() ->
> > DropRelFileNodesAllBuffers()
> > 
> > [COMMIT/ROLLBACK PREPARED]
> > FinishPreparedTransaction() -> DropRelationFiles()  -> smgrdounlinkall()
> > -> DropRelFileNodesAllBuffers()
> 
> Yes. The concern is that it was not clear in the function descriptions and commit logs
> what the optimizations for the DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers()
> are for. So I revised only the function description of DropRelFileNodeBuffers() and the
> commit logs of the 0003-0004 patches. Please check if the brief descriptions would suffice.

I read the commit message of 3/4.  (Though it won't be included
verbatim in the final commit.)

> While recovery, when WAL files of XLOG_SMGR_TRUNCATE from vacuum
> or autovacuum are replayed, the buffers are dropped when the sizes
> of all involved forks of a relation are already "cached". We can get

This sentence seems to be missing how the buffers are dropped, that is, "dropped by (or using) what".

> a reliable size of nblocks for supplied relation's fork at that time,
> and it's safe because DropRelFileNodeBuffers() relies on the behavior
> that cached nblocks will not be invalidated by file extension during
> recovery.  Otherwise, or if not in recovery, proceed to sequential
> search of the whole buffer pool.

This sentence seems confusing. It reads as if "we can rely on it
because we're relying on it".  And "the cached value won't be
invalidated" doesn't explain the reason precisely. The reason, I
think, is that the cached value is guaranteed to be the maximum page
number we have in shared buffers, at least during recovery, and that
guarantee holds because we stop asking fseek once we have cached the
value.

> > > I also don't get why 4/4 would be a good idea on its own. It uses
> > > BUF_DROP_FULL_SCAN_THRESHOLD to guard
> > > FindAndDropRelFileNodeBuffers() on a per relation basis. But since
> > > DropRelFileNodesAllBuffers() can be used for many relations at once,
> > > this could end up doing BUF_DROP_FULL_SCAN_THRESHOLD - 1
> > lookups a lot
> > > of times, once for each of nnodes relations?
> > 
> > So, the threshold value should be compared with the total number of blocks
> > of all target relations, not each relation.  You seem to be right, got it.
> 
> Fixed this in 0004 patch. Now we compare the total number of buffers-to-be-invalidated
> for ALL relations to the BUF_DROP_FULL_SCAN_THRESHOLD.

I didn't see the previous version, but the series of additional
palloc/pfree calls in this version makes me uneasy.


     int            i,
+                j,
+                *nforks,
                 n = 0;

I don't think we usually declare variables of different types in a
single statement.  (I'm not sure about declaring multiple variables at
once, though.)


@@ -3110,7 +3125,10 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes)
                 DropRelFileNodeAllLocalBuffers(rnodes[i].node);
         }
         else
+        {
+            rels[n] = smgr_reln[i];
             nodes[n++] = rnodes[i].node;
+        }
     }

We don't need to remember nodes and rnodes here, since rnodes[n] is
rels[n]->smgr_rnode here.  Or we don't even need to store rels, since
we can scan smgr_reln again later.

nodes[] is needed only in the full-scan path, so it is enough to
collect it after we find out that we are doing the full scan.
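
Concretely, something like the following untested sketch (smgr_rnode
is the RelFileNodeBackend stored in SMgrRelationData):

        else
            rels[n++] = smgr_reln[i];    /* keep only the SMgrRelations */
    ...
        /* collect the RelFileNodes (into a freshly palloc'd nodes[])
         * only once we decide to do the full scan */
        for (i = 0; i < n; i++)
            nodes[i] = rels[i]->smgr_rnode.node;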


     /*
@@ -3120,6 +3138,68 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes)
     if (n == 0)
     {
         pfree(nodes);
+        pfree(rels);
+        pfree(rnodes);
+        return;
+    }
+
+    nforks = palloc(sizeof(int) * n);
+    forks = palloc(sizeof(ForkNumber *) * n);
+    blocks = palloc(sizeof(BlockNumber *) * n);
+    firstDelBlocks = palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1));
+    for (i = 0; i < n; i++)
+    {
+        forks[i] = palloc(sizeof(ForkNumber) * (MAX_FORKNUM + 1));
+        blocks[i] = palloc(sizeof(BlockNumber) * (MAX_FORKNUM + 1));
+    }

We can allocate the whole array at once like this.

 BlockNumber (*blocks)[MAX_FORKNUM+1] =
      (BlockNumber (*)[MAX_FORKNUM+1])
      palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1));
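
That way blocks[i][j] can be accessed as an ordinary two-dimensional
array, and the per-relation palloc/pfree calls go away.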

The elements of forks[][] and blocks[][] are left uninitialized
because some of the elements may be skipped due to the absence of the
corresponding fork.

+            if (!smgrexists(rels[i], j))
+                continue;
+
+            /* Get the number of blocks for a relation's fork */
+            blocks[i][numForks] = smgrnblocks(rels[i], j, NULL);

If we see a fork whose size is not cached, we must give up this
optimization for all target relations.
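
Roughly like this (untested; I'm assuming "cached" is the flag
returned through smgrnblocks()'s third argument, and "all_cached" is a
hypothetical flag checked before choosing the lookup path):

            blocks[i][numForks] = smgrnblocks(rels[i], j, &cached);
            if (!cached)
            {
                /* one uncached fork size defeats the optimization
                 * for all target relations */
                all_cached = false;
                break;
            }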

+            nBlocksToInvalidate += blocks[i][numForks];
+
+            forks[i][numForks++] = j;

We can signal the absence of a fork to the later code by setting
InvalidBlockNumber in blocks[][].  Thus forks[], nforks and numForks
can be removed.
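
For example (again an untested sketch, using the fork number itself as
the second index):

        for (j = 0; j <= MAX_FORKNUM; j++)
        {
            if (!smgrexists(rels[i], j))
            {
                blocks[i][j] = InvalidBlockNumber;    /* no such fork */
                continue;
            }
            blocks[i][j] = smgrnblocks(rels[i], j, &cached);
            ...
        }

The later loop can then simply skip the InvalidBlockNumber entries.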

+    /* Zero the array of blocks because these will all be dropped anyway */
+    MemSet(firstDelBlocks, 0, sizeof(BlockNumber) * n * (MAX_FORKNUM + 1));

We don't need to prepare nforks, forks and firstDelBlocks for all
relations before looping over the relations.  In other words, we can
fill in the arrays for each relation within that loop.

+     * We enter the optimization iff we are in recovery and the number of blocks to

This comment sticks out past 80 columns. (I'm not sure whether that
convention is still enforced..)

+    if (InRecovery && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)

We don't need to check InRecovery here. DropRelFileNodeBuffers doesn't
do that.
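
That is, the condition could be just:

    if (nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)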

+        for (j = 0; j < n; j++)
+        {
+            FindAndDropRelFileNodeBuffers(nodes[j], forks[j], nforks[j],

i is not used at this nesting level, so we can use i here instead of j.



> > > Also, how is 4/4 safe - this is outside of recovery too?
> > 
> > It seems that DropRelFileNodesAllBuffers() should trigger the new
> > optimization path only when InRecovery == true, because it intentionally
> > doesn't check the "accurate" value returned from smgrnblocks().
> 
> Fixed it in 0004 patch. Now we ensure that we only enter the optimization path
> if we are in recovery.

If the size of any of the target relations is not cached, we give up
the optimization entirely, even while in recovery.  Or am I missing
something?

> > From: Amit Kapila <amit.kapila16@gmail.com>
> > On Wed, Nov 18, 2020 at 11:43 PM Andres Freund <andres@anarazel.de>
> > > I'm also worried about the cases where this could cause buffers left
> > > in the buffer pool, without a crosscheck like Thomas' patch would
> > > allow to add. Obviously other processes can dirty buffers in
> > > hot_standby, so any leftover buffer could have bad consequences.
> > >
> > 
> > The problem can only arise if other processes extend the relation. The idea
> > was that in recovery it extends relation by one process which helps to
> > maintain the cache. Kirk seems to have done testing to cross-verify it by using
> > his first patch (Prevent-invalidating-blocks-in-smgrextend-during). Which
> > other crosscheck you are referring here?
> > 
> > I agree that we can do a better job by expanding comments to clearly state
> > why it is safe.
> 
> Yes, basically what Amit-san also mentioned above. The first patch prevents that.
> And in the description of DropRelFileNodeBuffers in the 0003 patch, please check
> if that would suffice.

+ *        While in recovery, if the expected maximum number of buffers to be
+ *        dropped is small enough and the sizes of all involved forks are
+ *        already cached, individual buffer is located by BufTableLookup().
+ *        It is safe because cached blocks will not be invalidated by file
+ *        extension during recovery.  See smgrnblocks() and smgrextend() for
+ *        more details. Otherwise, if the conditions for optimization are not
+ *        met, the buffer pool is sequentially scanned so that no buffers are
+ *        left behind.

I'm not confident about it, but it seems somewhat obscure.  How about
something like the following?

We mustn't leave behind any buffer for the relations to be dropped.
We invalidate buffer blocks by locating them with BufTableLookup()
when we can be sure up to what page of every fork we may possibly have
a buffer for.  We can know that from the "cached" flag returned by
smgrnblocks().  It currently becomes true only during recovery.  See
smgrnblocks() and smgrextend().  Otherwise we scan the whole buffer
pool to find buffers for the relations, which is slower when only a
small fraction of the buffers are to be dropped.

> > > Smaller comment:
> > >
> > > +static void
> > > +FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber
> > *forkNum,
> > > int nforks,
> > > +                              BlockNumber
> > > *nForkBlocks, BlockNumber *firstDelBlock) ...
> > > +            /* Check that it is in the buffer pool. If not, do
> > nothing.
> > > */
> > > +            LWLockAcquire(bufPartitionLock, LW_SHARED);
> > > +            buf_id = BufTableLookup(&bufTag, bufHash);
> > > ...
> > > +            bufHdr = GetBufferDescriptor(buf_id);
> > > +
> > > +            buf_state = LockBufHdr(bufHdr);
> > > +
> > > +            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode)
> > &&
> > > +                bufHdr->tag.forkNum == forkNum[i] &&
> > > +                bufHdr->tag.blockNum >= firstDelBlock[i])
> > > +                InvalidateBuffer(bufHdr);    /* releases
> > > spinlock */
> > > +            else
> > > +                UnlockBufHdr(bufHdr, buf_state);
> > >
> > > I'm a bit confused about the check here. We hold a buffer partition
> > > lock, and have done a lookup in the mapping table. Why are we then
> > > rechecking the relfilenode/fork/blocknum? And why are we doing so
> > > holding the buffer header lock, which is essentially a spinlock, so
> > > should only ever be held for very short portions?
> > >
> > > This looks like it's copying logic from DropRelFileNodeBuffers() etc,
> > > but there the situation is different: We haven't done a buffer mapping
> > > lookup, and we don't hold a partition lock!
> > 
> > That's because the buffer partition lock is released immediately after the hash
> > table has been looked up.  As an aside, InvalidateBuffer() requires the caller
> > to hold the buffer header spinlock and doesn't hold the buffer partition lock.
> 
> Yes. Holding the buffer header spinlock is necessary to invalidate the buffers.
> As for buffer mapping partition lock, as mentioned by Tsunakawa-san, it is
> released immediately after BufTableLookup, which is similar to lookup done in
> PrefetchSharedBuffer. So I retained these changes.
> 
> I have attached the updated patches. Aside from descriptions, no other major
> changes in the patch set except 0004. Feedbacks are welcome.

FWIW, as Tsunakawa-san mentioned, the partition lock is released
immediately after the lookup.  The reason we may release the partition
lock immediately is that it is OK if the buffer has already been
evicted by someone else and reused for another relation; we can detect
that case by rechecking the buffer tag after taking the header lock.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


