Re: [Patch] Optimize dropping of relation buffers using dlist - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: [Patch] Optimize dropping of relation buffers using dlist
Date
Msg-id 20200916.173215.1574064307499707091.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: [Patch] Optimize dropping of relation buffers using dlist  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: [Patch] Optimize dropping of relation buffers using dlist  (Amit Kapila <amit.kapila16@gmail.com>)
RE: [Patch] Optimize dropping of relation buffers using dlist  ("k.jamison@fujitsu.com" <k.jamison@fujitsu.com>)
List pgsql-hackers
At Wed, 16 Sep 2020 10:05:32 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in 
> On Wed, Sep 16, 2020 at 9:02 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > At Wed, 16 Sep 2020 08:33:06 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> > > On Wed, Sep 16, 2020 at 7:46 AM Kyotaro Horiguchi
> > > <horikyota.ntt@gmail.com> wrote:
> > By the way I'm not sure that actually happens, but if one smgrextend
> > call exnteded the relation by two or more blocks, the cache is
> > invalidated and succeeding smgrnblocks returns lseek()'s result.
> >
> 
> Can you think of any such case? I think in recovery we use
> XLogReadBufferExtended->ReadBufferWithoutRelcache for reading the page
> which seems to be extending page-by-page but there could be some case
> where that is not true. One idea is to run regressions and add an
> Assert to see if we are extending more than a block during recovery.

I agree with you. Actually XLogReadBufferExtended is the only point to
read a page while recovery and seems calling ReadBufferWithoutRelcache
page by page up to the target page. The only case I found where the
cache is invalidated is ALTER TABLE SET TABLESPACE while
wal_level=minimal and not during recovery. smgrextend is called
without smgrnblocks called at the time.

Considering that the behavior of lseek can be a problem only just after
extending a file, an assertion in smgrextend seems to be
enough. Although, I'm not confident on the diagnosis.

--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -474,7 +474,14 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
     if (reln->smgr_cached_nblocks[forknum] == blocknum)
         reln->smgr_cached_nblocks[forknum] = blocknum + 1;
     else
+    {
+        /*
+         * DropRelFileNodeBuffers relies on the behavior that nblocks cache
+         * won't be invalidated by file extension while recoverying.
+         */
+        Assert(!InRecovery);
         reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+    }
 }

> > Don't
> > we need to guarantee the cache to be valid while recovery?
> >
> 
> One possibility could be that we somehow detect that the value we are
> using is cached one and if so then only do this optimization.

I basically like this direction.  But I'm not sure the additional
parameter for smgrnblocks is acceptable.

But on the contrary, it might be a better design that
DropRelFileNodeBuffers gives up the optimization when
smgrnblocks(,,must_accurate = true) returns InvalidBlockNumber.


@@ -544,9 +551,12 @@ smgrwriteback(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 /*
  *    smgrnblocks() -- Calculate the number of blocks in the
  *                     supplied relation.
+ *
+ *    Returns InvalidBlockNumber if must_accurate is true and smgr_cached_nblocks
+ *    is not available.
  */
 BlockNumber
-smgrnblocks(SMgrRelation reln, ForkNumber forknum)
+smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool must_accurate)
 {
     BlockNumber result;
 
@@ -561,6 +571,17 @@ smgrnblocks(SMgrRelation reln, ForkNumber forknum)
 
     reln->smgr_cached_nblocks[forknum] = result;
 
+    /*
+     * We cannot believe the result from smgr_nblocks is always accurate
+     * because lseek of buggy Linux kernels doesn't account for a recent
+     * write. However, we can rely on the result from lseek while recovering
+     * because the first call to this function is not happen just after a file
+     * extension. Return values on subsequent calls return cached nblocks,
+     * which should be accurate during recovery.
+     */
+    if (!InRecovery && must_accurate)
+        return InvalidBlockNumber;
+
     return result;
 }


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: Yet another fast GiST build
Next
From: Krasiyan Andreev
Date:
Subject: Re: [PATCH] distinct aggregates within a window function WIP