Re: Unexpected page allocation behavior on insert-only tables - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Unexpected page allocation behavior on insert-only tables
Msg-id 19116.1275338859@sss.pgh.pa.us
In response to Re: Unexpected page allocation behavior on insert-only tables  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Unexpected page allocation behavior on insert-only tables
Re: Unexpected page allocation behavior on insert-only tables
List pgsql-hackers
I wrote:
> In particular, now that there's a distinction between smgr flush
> and relcache flush, maybe we could associate targblock reset with
> smgr flush (only) and arrange to not flush the smgr level during
> ANALYZE --- basically, smgr flush would only be needed when truncating
> or reassigning the relfilenode.  I think this might work out nicely but
> haven't chased the details.

I looked into that a bit more and decided that it'd be a ticklish
change: the coupling between relcache and smgr cache is pretty tight,
and there just isn't any provision for having an smgr cache entry live
longer than its owning relcache entry.  Even if we could fix it to
work reliably, this approach does nothing for the case where a backend
actually exits after filling just part of a new page, as noted by
Takahiro-san.

The next most promising fix is to have RelationGetBufferForTuple tell
the FSM about the new page immediately on creation.  I made a draft
patch for that (attached).  It fixes Michael's scenario nicely ---
all pages get filled completely --- and a simple test with pgbench
didn't reveal any obvious change in performance.  However, there is
clear *potential* for performance loss, due to both the extra FSM
access and the potential for increased contention because of multiple
backends piling into the same new page.  So it would be good to do
some real performance testing on insert-heavy scenarios before we
consider applying this.  Any volunteers?
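
To make the page-filling effect concrete, here is a minimal toy model in C.
It is only a sketch under stated assumptions, not PostgreSQL code: the
ToyRel struct, the toy_* functions, and the fixed tuples-per-page capacity
are all hypothetical, the "FSM" is a plain array, and there is no VACUUM.
Each simulated session inserts a few tuples and then loses its cached target
block, as happens on a relcache flush or backend exit.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_PAGES     1024
#define TOY_PAGE_CAPACITY 100      /* tuples per page (hypothetical) */
#define TOY_NO_TARGET     (-1)

/* Hypothetical stand-in for a heap relation plus its free space map. */
typedef struct ToyRel
{
    int npages;
    int used[TOY_MAX_PAGES];       /* tuples actually stored on each page */
    int fsm_free[TOY_MAX_PAGES];   /* free slots the toy FSM believes exist */
} ToyRel;

/* Return the first page the toy FSM thinks has room, or TOY_NO_TARGET. */
static int
toy_fsm_search(const ToyRel *rel)
{
    for (int i = 0; i < rel->npages; i++)
        if (rel->fsm_free[i] > 0)
            return i;
    return TOY_NO_TARGET;
}

/*
 * Insert one tuple.  *target plays the role of the backend-local target
 * block.  If record_new_pages is true, a freshly added page is entered
 * into the toy FSM at once with the space left after this insertion.
 * Returns the block used, or -1 if the toy relation is out of pages.
 */
static int
toy_insert(ToyRel *rel, int *target, bool record_new_pages)
{
    int blk = *target;

    if (blk == TOY_NO_TARGET)
        blk = toy_fsm_search(rel);

    /* The toy FSM may be stale: correct full pages and search again. */
    while (blk != TOY_NO_TARGET && rel->used[blk] >= TOY_PAGE_CAPACITY)
    {
        rel->fsm_free[blk] = 0;
        blk = toy_fsm_search(rel);
    }

    if (blk == TOY_NO_TARGET)
    {
        /* Nothing usable is known: extend the relation by one page. */
        if (rel->npages >= TOY_MAX_PAGES)
            return -1;
        blk = rel->npages++;
        rel->used[blk] = 0;
        rel->fsm_free[blk] = record_new_pages ? TOY_PAGE_CAPACITY - 1 : 0;
    }

    rel->used[blk]++;
    *target = blk;
    return blk;
}

/*
 * Run `sessions` sessions of `tuples_per_session` inserts each; every
 * session starts with no cached target.  Returns the page count, or -1
 * if the toy relation ran out of pages.
 */
static int
toy_run(int sessions, int tuples_per_session, bool record_new_pages)
{
    ToyRel rel = {0};

    assert(sessions >= 0 && tuples_per_session >= 0);

    for (int s = 0; s < sessions; s++)
    {
        int target = TOY_NO_TARGET;

        for (int t = 0; t < tuples_per_session; t++)
            if (toy_insert(&rel, &target, record_new_pages) < 0)
                return -1;
    }
    return rel.npages;
}
```

In this model, toy_run(10, 10, false) ends with 10 pages that are each
one-tenth full, while toy_run(10, 10, true) packs the same 100 tuples into
a single page.  The model says nothing about the cost side discussed above
(extra FSM access and contention on the shared page); that still needs
measurement on a real server.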

Note: the patch is against HEAD, but it should work in 8.4 if you reverse
out the use of the rd_targblock access macros.

            regards, tom lane

Index: src/backend/access/heap/hio.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/heap/hio.c,v
retrieving revision 1.78
diff -c -r1.78 hio.c
*** src/backend/access/heap/hio.c    9 Feb 2010 21:43:29 -0000    1.78
--- src/backend/access/heap/hio.c    31 May 2010 20:44:29 -0000
***************
*** 354,384 ****
       * is empty (this should never happen, but if it does we don't want to
       * risk wiping out valid data).
       */
      page = BufferGetPage(buffer);

      if (!PageIsNew(page))
          elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
!              BufferGetBlockNumber(buffer),
!              RelationGetRelationName(relation));

      PageInit(page, BufferGetPageSize(buffer), 0);

!     if (len > PageGetHeapFreeSpace(page))
      {
          /* We should not get here given the test at the top */
          elog(PANIC, "tuple is too big: size %lu", (unsigned long) len);
      }

      /*
       * Remember the new page as our target for future insertions.
-      *
-      * XXX should we enter the new page into the free space map immediately,
-      * or just keep it for this backend's exclusive use in the short run
-      * (until VACUUM sees it)?    Seems to depend on whether you expect the
-      * current backend to make more insertions or not, which is probably a
-      * good bet most of the time.  So for now, don't add it to FSM yet.
       */
!     RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer));

      return buffer;
  }
--- 354,388 ----
       * is empty (this should never happen, but if it does we don't want to
       * risk wiping out valid data).
       */
+     targetBlock = BufferGetBlockNumber(buffer);
      page = BufferGetPage(buffer);

      if (!PageIsNew(page))
          elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
!              targetBlock, RelationGetRelationName(relation));

      PageInit(page, BufferGetPageSize(buffer), 0);

!     pageFreeSpace = PageGetHeapFreeSpace(page);
!     if (len > pageFreeSpace)
      {
          /* We should not get here given the test at the top */
          elog(PANIC, "tuple is too big: size %lu", (unsigned long) len);
      }

      /*
+      * If using FSM, mark the page in FSM as having whatever amount of
+      * free space will be left after our insertion.  This is needed so that
+      * the free space won't be forgotten about if this backend doesn't use
+      * it up before exiting or flushing the rel's relcache entry.
+      */
+     if (use_fsm)
+         RecordPageWithFreeSpace(relation, targetBlock, pageFreeSpace - len);
+
+     /*
       * Remember the new page as our target for future insertions.
       */
!     RelationSetTargetBlock(relation, targetBlock);

      return buffer;
  }
