Seq scans status update - Mailing list pgsql-patches

From Heikki Linnakangas
Subject Seq scans status update
Msg-id 464C8385.7030409@enterprisedb.com
List pgsql-patches
Attached is a new version of Simon's "scan-resistant buffer manager"
patch. It's not ready for committing yet because of a small issue I
found this morning (* see bottom), but here's a status update.

To recap, the basic idea is to use a small ring of buffers for large
scans like VACUUM, COPY and seq-scans. Changes to the original patch:

- a different sized ring is used for VACUUM and seq scans than for COPY.
VACUUM and seq scans use a ring of 32 buffers, and COPY uses a ring of
4096 buffers in the default configuration. See the README changes in the
patch for the rationale.

- for queries with large seq scans, the buffer ring is only used for
reads issued by the seq scan itself, not for any other reads in the
query. A typical scenario where this matters is a large seq scan with a
nested loop join to a smaller table; you don't want to use the buffer
ring for the index lookups inside the nested loop.

- for seq scans, buffers that would need a WAL flush to reuse are
dropped from the ring. That makes bulk updates behave roughly like they
do without the patch, instead of having to do a WAL flush every 32
pages.

I've spent a lot of time thinking of solutions to the last point. The
obvious solution would be to not use the buffer ring for updating scans.
The difficulty with that is that we don't know if a scan is read-only in
heapam.c, where the hint to use the buffer ring is set.
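
To illustrate the policy outside the patch, here's a small self-contained
toy model. This is not the patch code; the pool and ring sizes, the
made-up workload and all the names in it are just for illustration. It
mimics the behaviour described above: the scan keeps recycling a small
ring of buffers, and a slot whose buffer would need a WAL flush to reuse
is dropped from the ring and replaced with a victim from the shared pool.

#include <stdbool.h>
#include <stdio.h>

#define POOL_SIZE 1024          /* stand-in for shared_buffers */
#define RING_SIZE 32            /* stand-in for the 256 KB / 8 KB ring */
#define PAGES     100000        /* pages "read" by the bulk scan */

static bool needs_flush[POOL_SIZE]; /* would evicting buffer i force a WAL flush? */
static int  ring[RING_SIZE];        /* buffer ids owned by the scan, -1 = empty */
static int  ring_cur = 0;
static int  clock_hand = 0;         /* trivial clock sweep over the pool */
static long ring_reuses = 0;        /* reads that recycled a ring buffer */
static long pool_claims = 0;        /* reads that had to claim a pool buffer */

/* Pick the buffer for the next page: prefer the ring, fall back to the
 * pool when the slot is empty or reusing it would force a WAL flush
 * (the seq scan rule described above). */
static int
next_buffer(void)
{
    int     id;

    ring_cur = (ring_cur + 1) % RING_SIZE;
    id = ring[ring_cur];

    if (id >= 0 && !needs_flush[id])
    {
        ring_reuses++;
        return id;
    }

    /* Drop the slot and take the next clock-sweep victim instead. */
    id = clock_hand;
    clock_hand = (clock_hand + 1) % POOL_SIZE;
    ring[ring_cur] = id;
    pool_claims++;
    return id;
}

int
main(void)
{
    int     i;

    for (i = 0; i < RING_SIZE; i++)
        ring[i] = -1;

    for (i = 0; i < PAGES; i++)
    {
        int     id = next_buffer();

        /* Pretend every 10th page gets dirtied badly enough that its
         * buffer would need a WAL flush before it could be reused. */
        needs_flush[id] = (i % 10 == 0);
    }

    printf("pages read: %d, ring reuses: %ld, pool claims: %ld\n",
           PAGES, ring_reuses, pool_claims);
    return 0;
}

With that made-up workload roughly nine out of ten reads recycle a ring
buffer, which is the whole point: the scan churns through a few dozen
buffers instead of sweeping the entire cache.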

I've completed a set of performance tests on a test server. The server
has 4 GB of RAM, of which 1 GB is used for shared_buffers.

Results for a 10 GB table:

  head-copy-bigtable               | 00:10:09.07016
  head-copy-bigtable               | 00:10:20.507357
  head-copy-bigtable               | 00:10:21.857677
  head-copy_nowal-bigtable         | 00:05:18.232956
  head-copy_nowal-bigtable         | 00:03:24.109047
  head-copy_nowal-bigtable         | 00:05:31.019643
  head-select-bigtable             | 00:03:47.102731
  head-select-bigtable             | 00:01:08.314719
  head-select-bigtable             | 00:01:08.238509
  head-select-bigtable             | 00:01:08.208563
  head-select-bigtable             | 00:01:08.28347
  head-select-bigtable             | 00:01:08.308671
  head-vacuum_clean-bigtable       | 00:01:04.227832
  head-vacuum_clean-bigtable       | 00:01:04.232258
  head-vacuum_clean-bigtable       | 00:01:04.294621
  head-vacuum_clean-bigtable       | 00:01:04.280677
  head-vacuum_hintbits-bigtable    | 00:04:01.123924
  head-vacuum_hintbits-bigtable    | 00:03:58.253175
  head-vacuum_hintbits-bigtable    | 00:04:26.318159
  head-vacuum_hintbits-bigtable    | 00:04:37.512965
  patched-copy-bigtable            | 00:09:52.776754
  patched-copy-bigtable            | 00:10:18.185826
  patched-copy-bigtable            | 00:10:16.975482
  patched-copy_nowal-bigtable      | 00:03:14.882366
  patched-copy_nowal-bigtable      | 00:04:01.04648
  patched-copy_nowal-bigtable      | 00:03:56.062272
  patched-select-bigtable          | 00:03:47.704154
  patched-select-bigtable          | 00:01:08.460326
  patched-select-bigtable          | 00:01:10.441544
  patched-select-bigtable          | 00:01:11.916221
  patched-select-bigtable          | 00:01:13.848038
  patched-select-bigtable          | 00:01:10.956133
  patched-vacuum_clean-bigtable    | 00:01:10.315439
  patched-vacuum_clean-bigtable    | 00:01:12.210537
  patched-vacuum_clean-bigtable    | 00:01:15.202114
  patched-vacuum_clean-bigtable    | 00:01:10.712235
  patched-vacuum_hintbits-bigtable | 00:03:42.279201
  patched-vacuum_hintbits-bigtable | 00:04:02.057778
  patched-vacuum_hintbits-bigtable | 00:04:26.805822
  patched-vacuum_hintbits-bigtable | 00:04:28.911184

In other words, the patch has no significant effect on the 10 GB table,
as expected. The select times did go up by a couple of seconds, though,
which I didn't expect. One theory is that unused shared_buffers get
swapped out during the tests and the bgwriter pulls them back in. I'll
set swappiness to 0 and try again at some point.

Results for a 2 GB table:

  copy-medsize-unpatched            | 00:02:18.23246
  copy-medsize-unpatched            | 00:02:22.347194
  copy-medsize-unpatched            | 00:02:23.875874
  copy_nowal-medsize-unpatched      | 00:01:27.606334
  copy_nowal-medsize-unpatched      | 00:01:17.491243
  copy_nowal-medsize-unpatched      | 00:01:31.902719
  select-medsize-unpatched          | 00:00:03.786031
  select-medsize-unpatched          | 00:00:02.678069
  select-medsize-unpatched          | 00:00:02.666103
  select-medsize-unpatched          | 00:00:02.673494
  select-medsize-unpatched          | 00:00:02.669645
  select-medsize-unpatched          | 00:00:02.666278
  vacuum_clean-medsize-unpatched    | 00:00:01.091356
  vacuum_clean-medsize-unpatched    | 00:00:01.923138
  vacuum_clean-medsize-unpatched    | 00:00:01.917213
  vacuum_clean-medsize-unpatched    | 00:00:01.917333
  vacuum_hintbits-medsize-unpatched | 00:00:01.683718
  vacuum_hintbits-medsize-unpatched | 00:00:01.864003
  vacuum_hintbits-medsize-unpatched | 00:00:03.186596
  vacuum_hintbits-medsize-unpatched | 00:00:02.16494
  copy-medsize-patched              | 00:02:35.113501
  copy-medsize-patched              | 00:02:25.269866
  copy-medsize-patched              | 00:02:31.881089
  copy_nowal-medsize-patched        | 00:01:00.254633
  copy_nowal-medsize-patched        | 00:01:04.630687
  copy_nowal-medsize-patched        | 00:01:03.729128
  select-medsize-patched            | 00:00:03.201837
  select-medsize-patched            | 00:00:01.332975
  select-medsize-patched            | 00:00:01.33014
  select-medsize-patched            | 00:00:01.332392
  select-medsize-patched            | 00:00:01.333498
  select-medsize-patched            | 00:00:01.332692
  vacuum_clean-medsize-patched      | 00:00:01.140189
  vacuum_clean-medsize-patched      | 00:00:01.062762
  vacuum_clean-medsize-patched      | 00:00:01.062402
  vacuum_clean-medsize-patched      | 00:00:01.07113
  vacuum_hintbits-medsize-patched   | 00:00:17.865446
  vacuum_hintbits-medsize-patched   | 00:00:15.162064
  vacuum_hintbits-medsize-patched   | 00:00:01.704651
  vacuum_hintbits-medsize-patched   | 00:00:02.671651

This looks good to me, except for a glitch in the last vacuum_hintbits
tests. Selects and vacuums benefit significantly, as does non-WAL-logged
copy.

Not shown here, but I ran tests earlier with vacuum on a table that
actually had dead tuples to remove. In that test the patched version
really shone, reducing the runtime to roughly 1/6th. That was the
original motivation for this patch: not having to do a WAL flush on
every page in the 2nd phase of vacuum.

Test script attached. To use it:

1. Edit testscript.sh and change BIGTABLESIZE.
2. Start the postmaster.
3. Run the script, giving a test label as the argument. For example:
"./testscript.sh bigtable-patched"

Attached is also the patch I used for the tests.

I would appreciate it if people would download the patch and the script
and repeat the tests on different hardware. I'm particularly interested
in testing on a box with good I/O hardware where selects on unpatched
PostgreSQL are bottlenecked by CPU.

Barring any surprises I'm going to fix the remaining issue and submit a
final patch, probably over the weekend.

(*) The issue with this patch is that if the buffer cache is completely
filled with dirty buffers that need a WAL flush to evict, the buffer
ring code gets into an infinite loop trying to find a buffer that
doesn't need one. It should be simple to fix.
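
One simple way to do that would be to remember whether the victim
actually came from the ring, and never reject a buffer the clock sweep
has just handed out; that guarantees the allocation terminates even when
every buffer in the cache would need a flush. A rough sketch against the
attached freelist.c (the came_from_ring flag is invented here for
illustration, it's not in the patch; GetBufferFromRing would set it just
before returning a recycled buffer, and StrategyGetBuffer would clear it
whenever the victim comes from the freelist or the clock sweep):

static bool came_from_ring = false;

bool
StrategyRejectBuffer(volatile BufferDesc *buf)
{
    Assert(RingSize > 0);

    /* Only a buffer recycled from the ring may be rejected. Flushing
     * WAL for a freshly chosen clock-sweep victim is better than
     * looping forever when the whole cache needs flushing.
     */
    if (!came_from_ring || BufferRing[RingCurSlot] != buf->buf_id)
        return false;

    BufferRing[RingCurSlot] = BUF_ID_NOT_SET;
    return true;
}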

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/access/heap/heapam.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v
retrieving revision 1.232
diff -c -r1.232 heapam.c
*** src/backend/access/heap/heapam.c    8 Apr 2007 01:26:27 -0000    1.232
--- src/backend/access/heap/heapam.c    16 May 2007 11:35:14 -0000
***************
*** 83,88 ****
--- 83,96 ----
       */
      scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);

+     /* A scan on a table smaller than shared_buffers is treated like random
+      * access, but bigger scans should use the bulk read replacement policy.
+      */
+     if (scan->rs_nblocks > NBuffers)
+         scan->rs_accesspattern = AP_BULKREAD;
+     else
+         scan->rs_accesspattern = AP_NORMAL;
+
      scan->rs_inited = false;
      scan->rs_ctup.t_data = NULL;
      ItemPointerSetInvalid(&scan->rs_ctup.t_self);
***************
*** 123,133 ****
--- 131,146 ----

      Assert(page < scan->rs_nblocks);

+     /* Read the page with the right strategy */
+     SetAccessPattern(scan->rs_accesspattern);
+
      scan->rs_cbuf = ReleaseAndReadBuffer(scan->rs_cbuf,
                                           scan->rs_rd,
                                           page);
      scan->rs_cblock = page;

+     SetAccessPattern(AP_NORMAL);
+
      if (!scan->rs_pageatatime)
          return;

Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.268
diff -c -r1.268 xlog.c
*** src/backend/access/transam/xlog.c    30 Apr 2007 21:01:52 -0000    1.268
--- src/backend/access/transam/xlog.c    15 May 2007 16:23:30 -0000
***************
*** 1668,1673 ****
--- 1668,1700 ----
  }

  /*
+  * Returns true if 'record' hasn't been flushed to disk yet.
+  */
+ bool
+ XLogNeedsFlush(XLogRecPtr record)
+ {
+     /* Quick exit if already known flushed */
+     if (XLByteLE(record, LogwrtResult.Flush))
+         return false;
+
+     /* read LogwrtResult and update local state */
+     {
+         /* use volatile pointer to prevent code rearrangement */
+         volatile XLogCtlData *xlogctl = XLogCtl;
+
+         SpinLockAcquire(&xlogctl->info_lck);
+         LogwrtResult = xlogctl->LogwrtResult;
+         SpinLockRelease(&xlogctl->info_lck);
+     }
+
+     /* check again */
+     if (XLByteLE(record, LogwrtResult.Flush))
+         return false;
+
+     return true;
+ }
+
+ /*
   * Ensure that all XLOG data through the given position is flushed to disk.
   *
   * NOTE: this differs from XLogWrite mainly in that the WALWriteLock is not
Index: src/backend/commands/copy.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.283
diff -c -r1.283 copy.c
*** src/backend/commands/copy.c    27 Apr 2007 22:05:46 -0000    1.283
--- src/backend/commands/copy.c    15 May 2007 17:05:29 -0000
***************
*** 1876,1881 ****
--- 1876,1888 ----
      nfields = file_has_oids ? (attr_count + 1) : attr_count;
      field_strings = (char **) palloc(nfields * sizeof(char *));

+     /* Use the special COPY buffer replacement strategy if WAL-logging
+      * is enabled. If it's not, the pages we're writing are dirty but
+      * don't need a WAL flush to write out, so the BULKREAD strategy
+      * is more suitable.
+      */
+     SetAccessPattern(use_wal ? AP_COPY : AP_BULKREAD);
+
      /* Initialize state variables */
      cstate->fe_eof = false;
      cstate->eol_type = EOL_UNKNOWN;
***************
*** 2161,2166 ****
--- 2168,2176 ----
                              cstate->filename)));
      }

+     /* Reset buffer replacement strategy */
+     SetAccessPattern(AP_NORMAL);
+
      /*
       * If we skipped writing WAL, then we need to sync the heap (but not
       * indexes since those use WAL anyway)
Index: src/backend/commands/vacuum.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/vacuum.c,v
retrieving revision 1.350
diff -c -r1.350 vacuum.c
*** src/backend/commands/vacuum.c    16 Apr 2007 18:29:50 -0000    1.350
--- src/backend/commands/vacuum.c    15 May 2007 17:06:18 -0000
***************
*** 421,431 ****
                   * Tell the buffer replacement strategy that vacuum is causing
                   * the IO
                   */
!                 StrategyHintVacuum(true);

                  analyze_rel(relid, vacstmt);

!                 StrategyHintVacuum(false);

                  if (use_own_xacts)
                      CommitTransactionCommand();
--- 421,431 ----
                   * Tell the buffer replacement strategy that vacuum is causing
                   * the IO
                   */
!                 SetAccessPattern(AP_VACUUM);

                  analyze_rel(relid, vacstmt);

!                 SetAccessPattern(AP_NORMAL);

                  if (use_own_xacts)
                      CommitTransactionCommand();
***************
*** 442,448 ****
          /* Make sure cost accounting is turned off after error */
          VacuumCostActive = false;
          /* And reset buffer replacement strategy, too */
!         StrategyHintVacuum(false);
          PG_RE_THROW();
      }
      PG_END_TRY();
--- 442,448 ----
          /* Make sure cost accounting is turned off after error */
          VacuumCostActive = false;
          /* And reset buffer replacement strategy, too */
!         SetAccessPattern(AP_NORMAL);
          PG_RE_THROW();
      }
      PG_END_TRY();
***************
*** 1088,1094 ****
       * Tell the cache replacement strategy that vacuum is causing all
       * following IO
       */
!     StrategyHintVacuum(true);

      /*
       * Do the actual work --- either FULL or "lazy" vacuum
--- 1088,1094 ----
       * Tell the cache replacement strategy that vacuum is causing all
       * following IO
       */
!     SetAccessPattern(AP_VACUUM);

      /*
       * Do the actual work --- either FULL or "lazy" vacuum
***************
*** 1098,1104 ****
      else
          lazy_vacuum_rel(onerel, vacstmt);

!     StrategyHintVacuum(false);

      /* all done with this class, but hold lock until commit */
      relation_close(onerel, NoLock);
--- 1098,1104 ----
      else
          lazy_vacuum_rel(onerel, vacstmt);

!     SetAccessPattern(AP_NORMAL);

      /* all done with this class, but hold lock until commit */
      relation_close(onerel, NoLock);
Index: src/backend/storage/buffer/README
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/README,v
retrieving revision 1.11
diff -c -r1.11 README
*** src/backend/storage/buffer/README    23 Jul 2006 03:07:58 -0000    1.11
--- src/backend/storage/buffer/README    16 May 2007 11:43:11 -0000
***************
*** 152,159 ****
  a field to show which backend is doing its I/O).


! Buffer replacement strategy
! ---------------------------

  There is a "free list" of buffers that are prime candidates for replacement.
  In particular, buffers that are completely free (contain no valid page) are
--- 152,159 ----
  a field to show which backend is doing its I/O).


! Normal buffer replacement strategy
! ----------------------------------

  There is a "free list" of buffers that are prime candidates for replacement.
  In particular, buffers that are completely free (contain no valid page) are
***************
*** 199,221 ****
  have to give up and try another buffer.  This however is not a concern
  of the basic select-a-victim-buffer algorithm.)

- A special provision is that while running VACUUM, a backend does not
- increment the usage count on buffers it accesses.  In fact, if ReleaseBuffer
- sees that it is dropping the pin count to zero and the usage count is zero,
- then it appends the buffer to the tail of the free list.  (This implies that
- VACUUM, but only VACUUM, must take the BufFreelistLock during ReleaseBuffer;
- this shouldn't create much of a contention problem.)  This provision
- encourages VACUUM to work in a relatively small number of buffers rather
- than blowing out the entire buffer cache.  It is reasonable since a page
- that has been touched only by VACUUM is unlikely to be needed again soon.
-
- Since VACUUM usually requests many pages very fast, the effect of this is that
- it will get back the very buffers it filled and possibly modified on the next
- call and will therefore do its work in a few shared memory buffers, while
- being able to use whatever it finds in the cache already.  This also implies
- that most of the write traffic caused by a VACUUM will be done by the VACUUM
- itself and not pushed off onto other processes.


  Background writer's processing
  ------------------------------
--- 199,243 ----
  have to give up and try another buffer.  This however is not a concern
  of the basic select-a-victim-buffer algorithm.)


+ Buffer ring replacement strategy
+ ---------------------------------
+
+ When running a query that needs to access a large number of pages, like VACUUM,
+ COPY, or a large sequential scan, a different strategy is used.  A page that
+ has been touched only by such a scan is unlikely to be needed again soon, so
+ instead of running the normal clock sweep algorithm and blowing out the entire
+ buffer cache, a small ring of buffers is allocated using the normal clock sweep
+ algorithm and those buffers are reused for the whole scan.  This also implies
+ that most of the write traffic caused by such a statement will be done by the
+ backend itself and not pushed off onto other processes.
+
+ The size of the ring used depends on the kind of scan:
+
+ For sequential scans, a small 256 KB ring is used. That's small enough to fit
+ in L2 cache, which makes transferring pages from OS cache to shared buffer
+ cache efficient. Even less would often be enough, but the ring must be big
+ enough to accommodate all pages in the scan that are pinned concurrently.
+ 256 KB should also be enough to leave a small cache trail for other backends to
+ join in a synchronized seq scan. If a buffer is dirtied and LSN set, the buffer
+ is removed from the ring and a replacement buffer is chosen using the normal
+ replacement strategy. In a scan that modifies every page in the scan, like a
+ bulk UPDATE or DELETE, the buffers in the ring will always be dirtied and the
+ ring strategy effectively degrades to the normal strategy.
+
+ VACUUM uses a 256 KB ring like sequential scans, but dirty pages are not
+ removed from the ring. WAL is flushed instead to allow reuse of the buffers.
+ Before introducing the buffer ring strategy in 8.3, buffers were put to the
+ freelist, which was effectively a buffer ring of 1 buffer.
+
+ COPY behaves like VACUUM, but a much larger ring is used. The ring size is
+ chosen to be twice the WAL segment size. This avoids polluting the buffer cache
+ like the clock sweep would do, and using a ring larger than WAL segment size
+ avoids having to do any extra WAL flushes, since a WAL segment will always be
+ filled, forcing a WAL flush, before looping through the buffer ring and bumping
+ into a buffer that would force a WAL flush. However, for non-WAL-logged COPY
+ operations the smaller 256 KB ring is used because WAL flushes are not needed
+ to write the buffers.

  Background writer's processing
  ------------------------------
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.218
diff -c -r1.218 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c    2 May 2007 23:34:48 -0000    1.218
--- src/backend/storage/buffer/bufmgr.c    16 May 2007 12:34:10 -0000
***************
*** 419,431 ****
      /* Loop here in case we have to try another victim buffer */
      for (;;)
      {
          /*
           * Select a victim buffer.    The buffer is returned with its header
           * spinlock still held!  Also the BufFreelistLock is still held, since
           * it would be bad to hold the spinlock while possibly waking up other
           * processes.
           */
!         buf = StrategyGetBuffer();

          Assert(buf->refcount == 0);

--- 419,433 ----
      /* Loop here in case we have to try another victim buffer */
      for (;;)
      {
+         bool lock_held;
+
          /*
           * Select a victim buffer.    The buffer is returned with its header
           * spinlock still held!  Also the BufFreelistLock is still held, since
           * it would be bad to hold the spinlock while possibly waking up other
           * processes.
           */
!         buf = StrategyGetBuffer(&lock_held);

          Assert(buf->refcount == 0);

***************
*** 436,442 ****
          PinBuffer_Locked(buf);

          /* Now it's safe to release the freelist lock */
!         LWLockRelease(BufFreelistLock);

          /*
           * If the buffer was dirty, try to write it out.  There is a race
--- 438,445 ----
          PinBuffer_Locked(buf);

          /* Now it's safe to release the freelist lock */
!         if (lock_held)
!             LWLockRelease(BufFreelistLock);

          /*
           * If the buffer was dirty, try to write it out.  There is a race
***************
*** 464,469 ****
--- 467,489 ----
               */
              if (LWLockConditionalAcquire(buf->content_lock, LW_SHARED))
              {
+                 /* In BULKREAD-mode, check if a WAL flush would be needed to
+                  * evict this buffer. If so, ask the replacement strategy if
+                  * we should go ahead and do it or choose another victim.
+                  */
+                 if (active_access_pattern == AP_BULKREAD)
+                 {
+                     if (XLogNeedsFlush(BufferGetLSN(buf)))
+                     {
+                         if (StrategyRejectBuffer(buf))
+                         {
+                             LWLockRelease(buf->content_lock);
+                             UnpinBuffer(buf, true, false);
+                             continue;
+                         }
+                     }
+                 }
+
                  FlushBuffer(buf, NULL);
                  LWLockRelease(buf->content_lock);
              }
***************
*** 925,932 ****
      PrivateRefCount[b]--;
      if (PrivateRefCount[b] == 0)
      {
-         bool        immed_free_buffer = false;
-
          /* I'd better not still hold any locks on the buffer */
          Assert(!LWLockHeldByMe(buf->content_lock));
          Assert(!LWLockHeldByMe(buf->io_in_progress_lock));
--- 945,950 ----
***************
*** 940,956 ****
          /* Update buffer usage info, unless this is an internal access */
          if (normalAccess)
          {
!             if (!strategy_hint_vacuum)
              {
!                 if (buf->usage_count < BM_MAX_USAGE_COUNT)
!                     buf->usage_count++;
              }
              else
!             {
!                 /* VACUUM accesses don't bump usage count, instead... */
!                 if (buf->refcount == 0 && buf->usage_count == 0)
!                     immed_free_buffer = true;
!             }
          }

          if ((buf->flags & BM_PIN_COUNT_WAITER) &&
--- 958,975 ----
          /* Update buffer usage info, unless this is an internal access */
          if (normalAccess)
          {
!             if (active_access_pattern != AP_NORMAL)
              {
!                 /* We don't want large one-off scans like vacuum to inflate
!                  * the usage_count. We do want to set it to 1, though, to keep
!                  * other backends from hijacking it from the buffer ring.
!                  */
!                 if (buf->usage_count == 0)
!                     buf->usage_count = 1;
              }
              else
!             if (buf->usage_count < BM_MAX_USAGE_COUNT)
!                 buf->usage_count++;
          }

          if ((buf->flags & BM_PIN_COUNT_WAITER) &&
***************
*** 965,978 ****
          }
          else
              UnlockBufHdr(buf);
-
-         /*
-          * If VACUUM is releasing an otherwise-unused buffer, send it to the
-          * freelist for near-term reuse.  We put it at the tail so that it
-          * won't be used before any invalid buffers that may exist.
-          */
-         if (immed_free_buffer)
-             StrategyFreeBuffer(buf, false);
      }
  }

--- 984,989 ----
Index: src/backend/storage/buffer/freelist.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/freelist.c,v
retrieving revision 1.58
diff -c -r1.58 freelist.c
*** src/backend/storage/buffer/freelist.c    5 Jan 2007 22:19:37 -0000    1.58
--- src/backend/storage/buffer/freelist.c    17 May 2007 16:12:56 -0000
***************
*** 18,23 ****
--- 18,25 ----
  #include "storage/buf_internals.h"
  #include "storage/bufmgr.h"

+ #include "utils/memutils.h"
+

  /*
   * The shared freelist control information.
***************
*** 39,47 ****
  /* Pointers to shared state */
  static BufferStrategyControl *StrategyControl = NULL;

! /* Backend-local state about whether currently vacuuming */
! bool        strategy_hint_vacuum = false;


  /*
   * StrategyGetBuffer
--- 41,53 ----
  /* Pointers to shared state */
  static BufferStrategyControl *StrategyControl = NULL;

! /* Currently active access pattern hint. */
! AccessPattern active_access_pattern = AP_NORMAL;

+ /* prototypes for internal functions */
+ static volatile BufferDesc *GetBufferFromRing(void);
+ static void PutBufferToRing(volatile BufferDesc *buf);
+ static void InitRing(void);

  /*
   * StrategyGetBuffer
***************
*** 51,67 ****
   *    the selected buffer must not currently be pinned by anyone.
   *
   *    To ensure that no one else can pin the buffer before we do, we must
!  *    return the buffer with the buffer header spinlock still held.  That
!  *    means that we return with the BufFreelistLock still held, as well;
!  *    the caller must release that lock once the spinlock is dropped.
   */
  volatile BufferDesc *
! StrategyGetBuffer(void)
  {
      volatile BufferDesc *buf;
      int            trycounter;

      LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);

      /*
       * Try to get a buffer from the freelist.  Note that the freeNext fields
--- 57,89 ----
   *    the selected buffer must not currently be pinned by anyone.
   *
   *    To ensure that no one else can pin the buffer before we do, we must
!  *    return the buffer with the buffer header spinlock still held.  If
!  *    *lock_held is set at return, we return with the BufFreelistLock still
!  *    held, as well;    the caller must release that lock once the spinlock is
!  *    dropped.
   */
  volatile BufferDesc *
! StrategyGetBuffer(bool *lock_held)
  {
      volatile BufferDesc *buf;
      int            trycounter;

+     /* Get a buffer from the ring if we're doing a bulk scan */
+     if (active_access_pattern != AP_NORMAL)
+     {
+         buf = GetBufferFromRing();
+         if (buf != NULL)
+         {
+             *lock_held = false;
+             return buf;
+         }
+     }
+
+     /*
+      * If our selected buffer wasn't available, pick another...
+      */
      LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+     *lock_held = true;

      /*
       * Try to get a buffer from the freelist.  Note that the freeNext fields
***************
*** 86,96 ****
           */
          LockBufHdr(buf);
          if (buf->refcount == 0 && buf->usage_count == 0)
              return buf;
          UnlockBufHdr(buf);
      }

!     /* Nothing on the freelist, so run the "clock sweep" algorithm */
      trycounter = NBuffers;
      for (;;)
      {
--- 108,122 ----
           */
          LockBufHdr(buf);
          if (buf->refcount == 0 && buf->usage_count == 0)
+         {
+             if (active_access_pattern != AP_NORMAL)
+                 PutBufferToRing(buf);
              return buf;
+         }
          UnlockBufHdr(buf);
      }

!     /* Nothing on the freelist, so run the shared "clock sweep" algorithm */
      trycounter = NBuffers;
      for (;;)
      {
***************
*** 105,111 ****
--- 131,141 ----
           */
          LockBufHdr(buf);
          if (buf->refcount == 0 && buf->usage_count == 0)
+         {
+             if (active_access_pattern != AP_NORMAL)
+                 PutBufferToRing(buf);
              return buf;
+         }
          if (buf->usage_count > 0)
          {
              buf->usage_count--;
***************
*** 191,204 ****
  }

  /*
!  * StrategyHintVacuum -- tell us whether VACUUM is active
   */
  void
! StrategyHintVacuum(bool vacuum_active)
  {
!     strategy_hint_vacuum = vacuum_active;
! }


  /*
   * StrategyShmemSize
--- 221,245 ----
  }

  /*
!  * SetAccessPattern -- Sets the active access pattern hint
!  *
!  * Caller is responsible for resetting the hint to AP_NORMAL after the bulk
!  * operation is done. It's ok to switch repeatedly between AP_NORMAL and one of
!  * the other strategies, for example in a query with one large sequential scan
!  * nested loop joined to an index scan. Index tuples should be fetched with the
!  * normal strategy and the pages from the seq scan should be read in with the
!  * AP_BULKREAD strategy. The ring won't be affected by such switching, however
!  * switching to an access pattern with different ring size will invalidate the
!  * old ring.
   */
  void
! SetAccessPattern(AccessPattern new_pattern)
  {
!     active_access_pattern = new_pattern;

+     if (active_access_pattern != AP_NORMAL)
+         InitRing();
+ }

  /*
   * StrategyShmemSize
***************
*** 274,276 ****
--- 315,498 ----
      else
          Assert(!init);
  }
+
+ /* ----------------------------------------------------------------
+  *                Backend-private buffer ring management
+  * ----------------------------------------------------------------
+  */
+
+ /*
+  * Ring sizes for different access patterns. See README for the rationale
+  * of these.
+  */
+ #define BULKREAD_RING_SIZE    256 * 1024 / BLCKSZ
+ #define VACUUM_RING_SIZE    256 * 1024 / BLCKSZ
+ #define COPY_RING_SIZE        Min(NBuffers / 8, (XLOG_SEG_SIZE / BLCKSZ) * 2)
+
+ /*
+  * BufferRing is an array of buffer ids, and RingSize it's size in number of
+  * elements. It's allocated in TopMemoryContext the first time it's needed.
+  */
+ static int *BufferRing = NULL;
+ static int RingSize = 0;
+
+ /* Index of the "current" slot in the ring. It's advanced every time a buffer
+  * is handed out from the ring with GetBufferFromRing and it points to the
+  * last buffer returned from the ring. RingCurSlot + 1 is the next victim
+  * GetBufferRing will hand out.
+  */
+ static int RingCurSlot = 0;
+
+ /* magic value to mark empty slots in the ring */
+ #define BUF_ID_NOT_SET -1
+
+
+ /*
+  * GetBufferFromRing -- returns a buffer from the ring, or NULL if the
+  *        ring is empty.
+  *
+  * The bufhdr spin lock is held on the returned buffer.
+  */
+ static volatile BufferDesc *
+ GetBufferFromRing(void)
+ {
+     volatile BufferDesc *buf;
+
+     /* ring should be initialized by now */
+     Assert(RingSize > 0 && BufferRing != NULL);
+
+     /* Run private "clock cycle" */
+     if (++RingCurSlot >= RingSize)
+         RingCurSlot = 0;
+
+     /*
+      * If that slot hasn't been filled yet, tell the caller to allocate
+      * a new buffer with the normal allocation strategy. He will then
+      * fill this slot by calling PutBufferToRing with the new buffer.
+      */
+     if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET)
+         return NULL;
+
+     buf = &BufferDescriptors[BufferRing[RingCurSlot]];
+
+     /*
+      * If the buffer is pinned we cannot use it under any circumstances.
+      * If usage_count == 0 then the buffer is fair game.
+      *
+      * We also choose this buffer if usage_count == 1. Strictly, this
+      * might sometimes be the wrong thing to do, but we rely on the high
+      * probability that it was this process that last touched the buffer.
+      * If it wasn't, we'll choose a suboptimal victim, but  it shouldn't
+      * make any difference in the big scheme of things.
+      *
+      */
+     LockBufHdr(buf);
+     if (buf->refcount == 0 && buf->usage_count <= 1)
+         return buf;
+     UnlockBufHdr(buf);
+
+     return NULL;
+ }
+
+ /*
+  * PutBufferToRing -- adds a buffer to the buffer ring
+  *
+  * Caller must hold the buffer header spinlock on the buffer.
+  */
+ static void
+ PutBufferToRing(volatile BufferDesc *buf)
+ {
+     /* ring should be initialized by now */
+     Assert(RingSize > 0 && BufferRing != NULL);
+
+     if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET)
+         BufferRing[RingCurSlot] = buf->buf_id;
+ }
+
+ /*
+  * Initializes a ring buffer with correct size for the currently
+  * active strategy. Does nothing if the ring already has the right size.
+  */
+ static void
+ InitRing(void)
+ {
+     int new_size;
+     int old_size = RingSize;
+     int i;
+     MemoryContext oldcxt;
+
+     /* Determine new size */
+
+     switch(active_access_pattern)
+     {
+         case AP_BULKREAD:
+             new_size = BULKREAD_RING_SIZE;
+             break;
+         case AP_COPY:
+             new_size = COPY_RING_SIZE;
+             break;
+         case AP_VACUUM:
+             new_size = VACUUM_RING_SIZE;
+             break;
+         default:
+             elog(ERROR, "unexpected buffer cache strategy %d",
+                  active_access_pattern);
+             return; /* keep compile happy */
+     }
+
+     /*
+      * Seq scans set and reset the strategy on every page, so we better exit
+      * quickly if no change in size is needed.
+      */
+     if (new_size == old_size)
+         return;
+
+     /* Allocate array */
+
+     oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+     if (old_size == 0)
+     {
+         Assert(BufferRing == NULL);
+         BufferRing = palloc(new_size * sizeof(int));
+     }
+     else
+         BufferRing = repalloc(BufferRing, new_size * sizeof(int));
+
+     MemoryContextSwitchTo(oldcxt);
+
+     for(i = 0; i < new_size; i++)
+         BufferRing[i] = BUF_ID_NOT_SET;
+
+     RingCurSlot = 0;
+     RingSize = new_size;
+ }
+
+ /*
+  * Buffer manager calls this function in AP_BULKREAD mode when the
+  * buffer handed to it turns out to need a WAL flush to write out. This
+  * gives the strategy a second chance to choose another victim.
+  *
+  * Returns true if buffer manager should ask for a new victim, and false
+  * if WAL should be flushed and this buffer used.
+  */
+ bool
+ StrategyRejectBuffer(volatile BufferDesc *buf)
+ {
+     Assert(RingSize > 0);
+
+     if (BufferRing[RingCurSlot] == buf->buf_id)
+     {
+         BufferRing[RingCurSlot] = BUF_ID_NOT_SET;
+         return true;
+     }
+     else
+     {
+         /* Apparently the buffer didn't come from the ring. We don't want to
+          * mess with how the clock sweep works; in worst case there's no
+          * buffers in the buffer cache that can be reused without a WAL flush,
+          * and we'd get into an endless loop trying.
+          */
+         return false;
+     }
+ }
Index: src/include/access/relscan.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/relscan.h,v
retrieving revision 1.52
diff -c -r1.52 relscan.h
*** src/include/access/relscan.h    20 Jan 2007 18:43:35 -0000    1.52
--- src/include/access/relscan.h    15 May 2007 17:01:31 -0000
***************
*** 28,33 ****
--- 28,34 ----
      ScanKey        rs_key;            /* array of scan key descriptors */
      BlockNumber rs_nblocks;        /* number of blocks to scan */
      bool        rs_pageatatime; /* verify visibility page-at-a-time? */
+     AccessPattern rs_accesspattern; /* access pattern to use for reads */

      /* scan current state */
      bool        rs_inited;        /* false = scan not init'd yet */
Index: src/include/access/xlog.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v
retrieving revision 1.76
diff -c -r1.76 xlog.h
*** src/include/access/xlog.h    5 Jan 2007 22:19:51 -0000    1.76
--- src/include/access/xlog.h    14 May 2007 21:22:40 -0000
***************
*** 151,156 ****
--- 151,157 ----

  extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
  extern void XLogFlush(XLogRecPtr RecPtr);
+ extern bool XLogNeedsFlush(XLogRecPtr RecPtr);

  extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);
Index: src/include/storage/buf_internals.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/buf_internals.h,v
retrieving revision 1.89
diff -c -r1.89 buf_internals.h
*** src/include/storage/buf_internals.h    5 Jan 2007 22:19:57 -0000    1.89
--- src/include/storage/buf_internals.h    15 May 2007 17:07:59 -0000
***************
*** 16,21 ****
--- 16,22 ----
  #define BUFMGR_INTERNALS_H

  #include "storage/buf.h"
+ #include "storage/bufmgr.h"
  #include "storage/lwlock.h"
  #include "storage/shmem.h"
  #include "storage/spin.h"
***************
*** 168,174 ****
  extern BufferDesc *LocalBufferDescriptors;

  /* in freelist.c */
! extern bool strategy_hint_vacuum;

  /* event counters in buf_init.c */
  extern long int ReadBufferCount;
--- 169,175 ----
  extern BufferDesc *LocalBufferDescriptors;

  /* in freelist.c */
! extern AccessPattern active_access_pattern;

  /* event counters in buf_init.c */
  extern long int ReadBufferCount;
***************
*** 184,195 ****
   */

  /* freelist.c */
! extern volatile BufferDesc *StrategyGetBuffer(void);
  extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head);
  extern int    StrategySyncStart(void);
  extern Size StrategyShmemSize(void);
  extern void StrategyInitialize(bool init);

  /* buf_table.c */
  extern Size BufTableShmemSize(int size);
  extern void InitBufTable(int size);
--- 185,198 ----
   */

  /* freelist.c */
! extern volatile BufferDesc *StrategyGetBuffer(bool *lock_held);
  extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head);
  extern int    StrategySyncStart(void);
  extern Size StrategyShmemSize(void);
  extern void StrategyInitialize(bool init);

+ extern bool StrategyRejectBuffer(volatile BufferDesc *buf);
+
  /* buf_table.c */
  extern Size BufTableShmemSize(int size);
  extern void InitBufTable(int size);
Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.103
diff -c -r1.103 bufmgr.h
*** src/include/storage/bufmgr.h    2 May 2007 23:18:03 -0000    1.103
--- src/include/storage/bufmgr.h    15 May 2007 17:07:02 -0000
***************
*** 48,53 ****
--- 48,61 ----
  #define BUFFER_LOCK_SHARE        1
  #define BUFFER_LOCK_EXCLUSIVE    2

+ typedef enum AccessPattern
+ {
+     AP_NORMAL,        /* Normal random access */
+     AP_BULKREAD,    /* Large read-only scan (hint bit updates are ok) */
+     AP_COPY,        /* Large updating scan, like COPY with WAL enabled */
+     AP_VACUUM,        /* VACUUM */
+ } AccessPattern;
+
  /*
   * These routines are beaten on quite heavily, hence the macroization.
   */
***************
*** 157,162 ****
  extern void AtProcExit_LocalBuffers(void);

  /* in freelist.c */
! extern void StrategyHintVacuum(bool vacuum_active);

  #endif
--- 165,170 ----
  extern void AtProcExit_LocalBuffers(void);

  /* in freelist.c */
! extern void SetAccessPattern(AccessPattern new_pattern);

  #endif
