Seq scans status update

From
Heikki Linnakangas
Date:
Attached is a new version of Simon's "scan-resistant buffer manager"
patch. It's not ready for committing yet because of a small issue I
found this morning (* see bottom), but here's a status update.

To recap, the basic idea is to use a small ring of buffers for large
scans like VACUUM, COPY and seq-scans. Changes to the original patch:

- differently sized rings are used for VACUUM, seq scans, and COPY.
VACUUM and seq scans use a ring of 32 buffers (256 kB at the default 8 kB
block size), and COPY uses a ring of 4096 buffers (twice the 16 MB WAL
segment size) in the default configuration. See the README changes in the
patch for the rationale.

- for queries with large seqscans, the buffer ring is only used for
reads issued by the seq scan itself, not for any other reads in the query.
A typical scenario where this matters is a large seq scan with a nested
loop join to a smaller table: you don't want to use the buffer ring for
the index lookups inside the nested loop.

- for seqscans, drop buffers from the ring that would need a WAL flush
to reuse. That makes bulk updates behave roughly like they do without
the patch, instead of having to do a WAL flush every 32 pages.

I've spent a lot of time thinking of solutions to the last point. The
obvious solution would be to not use the buffer ring for updating scans.
The difficulty with that is that we don't know if a scan is read-only in
heapam.c, where the hint to use the buffer ring is set.
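
For readers who haven't looked at the patch yet, here's a minimal,
standalone sketch of the ring policy as it appears in the freelist.c
changes below (GetBufferFromRing / PutBufferToRing). This is not
PostgreSQL source: the buffer descriptors, the clock-sweep stand-in and
constants like RING_SIZE are simplified for illustration only.

/*
 * Standalone, simplified model of the backend-private buffer ring from
 * the attached patch (freelist.c). Buffer descriptors are reduced to a
 * refcount and a usage_count so the reuse rule can be shown in isolation.
 */
#include <stdio.h>

#define RING_SIZE 32            /* 256 kB / 8 kB blocks, as in the patch */
#define BUF_ID_NOT_SET (-1)

typedef struct { int refcount; int usage_count; } FakeBufferDesc;

static FakeBufferDesc buffers[1024];    /* stand-in for shared buffers */
static int ring[RING_SIZE];
static int ring_cur = 0;

/* Return a reusable buffer id from the ring, or -1 to fall back to the
 * normal strategy.  Mirrors the selection logic of GetBufferFromRing()
 * in the patch; all locking is omitted. */
static int
get_buffer_from_ring(void)
{
    int     buf_id;

    if (++ring_cur >= RING_SIZE)
        ring_cur = 0;

    buf_id = ring[ring_cur];
    if (buf_id == BUF_ID_NOT_SET)
        return -1;              /* slot not filled yet: use normal strategy */

    /* Reuse only if nobody has it pinned and usage_count is low;
     * usage_count == 1 is accepted because this backend most likely
     * set it on the previous lap. */
    if (buffers[buf_id].refcount == 0 && buffers[buf_id].usage_count <= 1)
        return buf_id;

    return -1;
}

/* Remember a buffer obtained from the normal strategy in the current
 * slot, so the next lap of the ring can reuse it.  Mirrors
 * PutBufferToRing() in the patch. */
static void
put_buffer_to_ring(int buf_id)
{
    if (ring[ring_cur] == BUF_ID_NOT_SET)
        ring[ring_cur] = buf_id;
}

int
main(void)
{
    int     i;

    for (i = 0; i < RING_SIZE; i++)
        ring[i] = BUF_ID_NOT_SET;

    /* First lap: every slot is empty, so the normal strategy fills the
     * ring (here we just pretend it returned buffer id i). */
    for (i = 0; i < RING_SIZE; i++)
    {
        if (get_buffer_from_ring() < 0)
            put_buffer_to_ring(i);
    }

    /* Second lap: the same buffers come back instead of new ones. */
    for (i = 0; i < 3; i++)
        printf("reused buffer id %d\n", get_buffer_from_ring());

    return 0;
}

The point is that after the first lap fills the ring, the scan keeps
getting back the same 32 buffers instead of sweeping through the whole
buffer cache.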

I've completed a set of performance tests on a test server. The server
has 4 GB of RAM, of which 1 GB is used for shared_buffers.

Results for a 10 GB table:

  head-copy-bigtable               | 00:10:09.07016
  head-copy-bigtable               | 00:10:20.507357
  head-copy-bigtable               | 00:10:21.857677
  head-copy_nowal-bigtable         | 00:05:18.232956
  head-copy_nowal-bigtable         | 00:03:24.109047
  head-copy_nowal-bigtable         | 00:05:31.019643
  head-select-bigtable             | 00:03:47.102731
  head-select-bigtable             | 00:01:08.314719
  head-select-bigtable             | 00:01:08.238509
  head-select-bigtable             | 00:01:08.208563
  head-select-bigtable             | 00:01:08.28347
  head-select-bigtable             | 00:01:08.308671
  head-vacuum_clean-bigtable       | 00:01:04.227832
  head-vacuum_clean-bigtable       | 00:01:04.232258
  head-vacuum_clean-bigtable       | 00:01:04.294621
  head-vacuum_clean-bigtable       | 00:01:04.280677
  head-vacuum_hintbits-bigtable    | 00:04:01.123924
  head-vacuum_hintbits-bigtable    | 00:03:58.253175
  head-vacuum_hintbits-bigtable    | 00:04:26.318159
  head-vacuum_hintbits-bigtable    | 00:04:37.512965
  patched-copy-bigtable            | 00:09:52.776754
  patched-copy-bigtable            | 00:10:18.185826
  patched-copy-bigtable            | 00:10:16.975482
  patched-copy_nowal-bigtable      | 00:03:14.882366
  patched-copy_nowal-bigtable      | 00:04:01.04648
  patched-copy_nowal-bigtable      | 00:03:56.062272
  patched-select-bigtable          | 00:03:47.704154
  patched-select-bigtable          | 00:01:08.460326
  patched-select-bigtable          | 00:01:10.441544
  patched-select-bigtable          | 00:01:11.916221
  patched-select-bigtable          | 00:01:13.848038
  patched-select-bigtable          | 00:01:10.956133
  patched-vacuum_clean-bigtable    | 00:01:10.315439
  patched-vacuum_clean-bigtable    | 00:01:12.210537
  patched-vacuum_clean-bigtable    | 00:01:15.202114
  patched-vacuum_clean-bigtable    | 00:01:10.712235
  patched-vacuum_hintbits-bigtable | 00:03:42.279201
  patched-vacuum_hintbits-bigtable | 00:04:02.057778
  patched-vacuum_hintbits-bigtable | 00:04:26.805822
  patched-vacuum_hintbits-bigtable | 00:04:28.911184

In other words, the patch has no significant effect, as expected. The
select times did go up by a couple of seconds, which I didn't expect,
though. One theory is that unused shared_buffers are swapped out during
the tests, and bgwriter pulls them back in. I'll set swappiness to 0 and
try again at some point.

Results for a 2 GB table:

  copy-medsize-unpatched            | 00:02:18.23246
  copy-medsize-unpatched            | 00:02:22.347194
  copy-medsize-unpatched            | 00:02:23.875874
  copy_nowal-medsize-unpatched      | 00:01:27.606334
  copy_nowal-medsize-unpatched      | 00:01:17.491243
  copy_nowal-medsize-unpatched      | 00:01:31.902719
  select-medsize-unpatched          | 00:00:03.786031
  select-medsize-unpatched          | 00:00:02.678069
  select-medsize-unpatched          | 00:00:02.666103
  select-medsize-unpatched          | 00:00:02.673494
  select-medsize-unpatched          | 00:00:02.669645
  select-medsize-unpatched          | 00:00:02.666278
  vacuum_clean-medsize-unpatched    | 00:00:01.091356
  vacuum_clean-medsize-unpatched    | 00:00:01.923138
  vacuum_clean-medsize-unpatched    | 00:00:01.917213
  vacuum_clean-medsize-unpatched    | 00:00:01.917333
  vacuum_hintbits-medsize-unpatched | 00:00:01.683718
  vacuum_hintbits-medsize-unpatched | 00:00:01.864003
  vacuum_hintbits-medsize-unpatched | 00:00:03.186596
  vacuum_hintbits-medsize-unpatched | 00:00:02.16494
  copy-medsize-patched              | 00:02:35.113501
  copy-medsize-patched              | 00:02:25.269866
  copy-medsize-patched              | 00:02:31.881089
  copy_nowal-medsize-patched        | 00:01:00.254633
  copy_nowal-medsize-patched        | 00:01:04.630687
  copy_nowal-medsize-patched        | 00:01:03.729128
  select-medsize-patched            | 00:00:03.201837
  select-medsize-patched            | 00:00:01.332975
  select-medsize-patched            | 00:00:01.33014
  select-medsize-patched            | 00:00:01.332392
  select-medsize-patched            | 00:00:01.333498
  select-medsize-patched            | 00:00:01.332692
  vacuum_clean-medsize-patched      | 00:00:01.140189
  vacuum_clean-medsize-patched      | 00:00:01.062762
  vacuum_clean-medsize-patched      | 00:00:01.062402
  vacuum_clean-medsize-patched      | 00:00:01.07113
  vacuum_hintbits-medsize-patched   | 00:00:17.865446
  vacuum_hintbits-medsize-patched   | 00:00:15.162064
  vacuum_hintbits-medsize-patched   | 00:00:01.704651
  vacuum_hintbits-medsize-patched   | 00:00:02.671651

This looks good to me, except for some glitch in the last
vacuum_hintbits tests. Selects and vacuums benefit significantly, as
does non-WAL-logged copy.

Not shown here, but I ran tests earlier with vacuum on a table that
actually had dead tuples to be removed. In that test the patched
version really shone, reducing the runtime to roughly 1/6th. That was the
original motivation for this patch: not having to do a WAL flush on every
page in the 2nd phase of vacuum.

Test script attached. To use it:

1. Edit testscript.sh. Change BIGTABLESIZE.
2. Start postmaster
3. Run the script, giving a test label as an argument. For example:
"./testscript.sh bigtable-patched"

Attached is also the patch I used for the tests.

I would appreciate it if people would download the patch and the script
and repeat the tests on different hardware. I'm particularly interested
in testing on a box with good I/O hardware where selects on unpatched
PostgreSQL are bottlenecked by CPU.

Barring any surprises, I'm going to fix the remaining issue and submit a
final patch, probably over the weekend.

(*) The issue with this patch is that if the buffer cache is completely
filled with dirty buffers that need a WAL flush to evict, the buffer
ring code will get into an infinite loop trying to find one that doesn't
need a WAL flush. Should be simple to fix.
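
To illustrate the failure mode, here's a standalone sketch of the
victim-selection loop in the patched BufferAlloc() /
StrategyRejectBuffer() path (see the bufmgr.c and freelist.c hunks
below). This is not PostgreSQL source: needs_wal_flush(), pick_victim()
and the MAX_RING_REJECTS cap are illustrative stand-ins, and the cap
only marks one possible shape of the fix, not the actual one.

/*
 * Sketch of how the BULKREAD eviction check can loop forever when every
 * buffer in the cache is dirty with an unflushed LSN, and where a guard
 * could break the loop.  Illustration only.
 */
#include <stdbool.h>
#include <stdio.h>

#define NBUFFERS 8
#define MAX_RING_REJECTS NBUFFERS   /* hypothetical guard, not in the patch */

static bool needs_flush[NBUFFERS];

/* Pretend replacement strategy: hand out candidate buffers round-robin. */
static int
pick_victim(void)
{
    static int next = 0;
    int     victim = next;

    next = (next + 1) % NBUFFERS;
    return victim;
}

static bool
needs_wal_flush(int buf)
{
    return needs_flush[buf];
}

int
main(void)
{
    int     rejects = 0;
    int     victim;
    int     i;

    /* Worst case described above: every buffer needs a WAL flush. */
    for (i = 0; i < NBUFFERS; i++)
        needs_flush[i] = true;

    for (;;)
    {
        victim = pick_victim();

        if (needs_wal_flush(victim))
        {
            /* In BULKREAD mode the strategy rejects the buffer and asks
             * for another victim.  Without some bound, this branch is
             * taken forever in the worst case. */
            if (++rejects < MAX_RING_REJECTS)
                continue;

            /* Guard tripped: stop rejecting, flush WAL and reuse anyway. */
            printf("all %d candidates needed a flush; flushing WAL\n",
                   rejects);
        }

        printf("evicting buffer %d\n", victim);
        break;
    }
    return 0;
}

Bounding the number of rejections and falling back to a WAL flush is one
way to guarantee termination; the actual fix may well look different.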

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/access/heap/heapam.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v
retrieving revision 1.232
diff -c -r1.232 heapam.c
*** src/backend/access/heap/heapam.c    8 Apr 2007 01:26:27 -0000    1.232
--- src/backend/access/heap/heapam.c    16 May 2007 11:35:14 -0000
***************
*** 83,88 ****
--- 83,96 ----
       */
      scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);

+     /* A scan on a table smaller than shared_buffers is treated like random
+      * access, but bigger scans should use the bulk read replacement policy.
+      */
+     if (scan->rs_nblocks > NBuffers)
+         scan->rs_accesspattern = AP_BULKREAD;
+     else
+         scan->rs_accesspattern = AP_NORMAL;
+
      scan->rs_inited = false;
      scan->rs_ctup.t_data = NULL;
      ItemPointerSetInvalid(&scan->rs_ctup.t_self);
***************
*** 123,133 ****
--- 131,146 ----

      Assert(page < scan->rs_nblocks);

+     /* Read the page with the right strategy */
+     SetAccessPattern(scan->rs_accesspattern);
+
      scan->rs_cbuf = ReleaseAndReadBuffer(scan->rs_cbuf,
                                           scan->rs_rd,
                                           page);
      scan->rs_cblock = page;

+     SetAccessPattern(AP_NORMAL);
+
      if (!scan->rs_pageatatime)
          return;

Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.268
diff -c -r1.268 xlog.c
*** src/backend/access/transam/xlog.c    30 Apr 2007 21:01:52 -0000    1.268
--- src/backend/access/transam/xlog.c    15 May 2007 16:23:30 -0000
***************
*** 1668,1673 ****
--- 1668,1700 ----
  }

  /*
+  * Returns true if 'record' hasn't been flushed to disk yet.
+  */
+ bool
+ XLogNeedsFlush(XLogRecPtr record)
+ {
+     /* Quick exit if already known flushed */
+     if (XLByteLE(record, LogwrtResult.Flush))
+         return false;
+
+     /* read LogwrtResult and update local state */
+     {
+         /* use volatile pointer to prevent code rearrangement */
+         volatile XLogCtlData *xlogctl = XLogCtl;
+
+         SpinLockAcquire(&xlogctl->info_lck);
+         LogwrtResult = xlogctl->LogwrtResult;
+         SpinLockRelease(&xlogctl->info_lck);
+     }
+
+     /* check again */
+     if (XLByteLE(record, LogwrtResult.Flush))
+         return false;
+
+     return true;
+ }
+
+ /*
   * Ensure that all XLOG data through the given position is flushed to disk.
   *
   * NOTE: this differs from XLogWrite mainly in that the WALWriteLock is not
Index: src/backend/commands/copy.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.283
diff -c -r1.283 copy.c
*** src/backend/commands/copy.c    27 Apr 2007 22:05:46 -0000    1.283
--- src/backend/commands/copy.c    15 May 2007 17:05:29 -0000
***************
*** 1876,1881 ****
--- 1876,1888 ----
      nfields = file_has_oids ? (attr_count + 1) : attr_count;
      field_strings = (char **) palloc(nfields * sizeof(char *));

+     /* Use the special COPY buffer replacement strategy if WAL-logging
+      * is enabled. If it's not, the pages we're writing are dirty but
+      * don't need a WAL flush to write out, so the BULKREAD strategy
+      * is more suitable.
+      */
+     SetAccessPattern(use_wal ? AP_COPY : AP_BULKREAD);
+
      /* Initialize state variables */
      cstate->fe_eof = false;
      cstate->eol_type = EOL_UNKNOWN;
***************
*** 2161,2166 ****
--- 2168,2176 ----
                              cstate->filename)));
      }

+     /* Reset buffer replacement strategy */
+     SetAccessPattern(AP_NORMAL);
+
      /*
       * If we skipped writing WAL, then we need to sync the heap (but not
       * indexes since those use WAL anyway)
Index: src/backend/commands/vacuum.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/vacuum.c,v
retrieving revision 1.350
diff -c -r1.350 vacuum.c
*** src/backend/commands/vacuum.c    16 Apr 2007 18:29:50 -0000    1.350
--- src/backend/commands/vacuum.c    15 May 2007 17:06:18 -0000
***************
*** 421,431 ****
                   * Tell the buffer replacement strategy that vacuum is causing
                   * the IO
                   */
!                 StrategyHintVacuum(true);

                  analyze_rel(relid, vacstmt);

!                 StrategyHintVacuum(false);

                  if (use_own_xacts)
                      CommitTransactionCommand();
--- 421,431 ----
                   * Tell the buffer replacement strategy that vacuum is causing
                   * the IO
                   */
!                 SetAccessPattern(AP_VACUUM);

                  analyze_rel(relid, vacstmt);

!                 SetAccessPattern(AP_NORMAL);

                  if (use_own_xacts)
                      CommitTransactionCommand();
***************
*** 442,448 ****
          /* Make sure cost accounting is turned off after error */
          VacuumCostActive = false;
          /* And reset buffer replacement strategy, too */
!         StrategyHintVacuum(false);
          PG_RE_THROW();
      }
      PG_END_TRY();
--- 442,448 ----
          /* Make sure cost accounting is turned off after error */
          VacuumCostActive = false;
          /* And reset buffer replacement strategy, too */
!         SetAccessPattern(AP_NORMAL);
          PG_RE_THROW();
      }
      PG_END_TRY();
***************
*** 1088,1094 ****
       * Tell the cache replacement strategy that vacuum is causing all
       * following IO
       */
!     StrategyHintVacuum(true);

      /*
       * Do the actual work --- either FULL or "lazy" vacuum
--- 1088,1094 ----
       * Tell the cache replacement strategy that vacuum is causing all
       * following IO
       */
!     SetAccessPattern(AP_VACUUM);

      /*
       * Do the actual work --- either FULL or "lazy" vacuum
***************
*** 1098,1104 ****
      else
          lazy_vacuum_rel(onerel, vacstmt);

!     StrategyHintVacuum(false);

      /* all done with this class, but hold lock until commit */
      relation_close(onerel, NoLock);
--- 1098,1104 ----
      else
          lazy_vacuum_rel(onerel, vacstmt);

!     SetAccessPattern(AP_NORMAL);

      /* all done with this class, but hold lock until commit */
      relation_close(onerel, NoLock);
Index: src/backend/storage/buffer/README
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/README,v
retrieving revision 1.11
diff -c -r1.11 README
*** src/backend/storage/buffer/README    23 Jul 2006 03:07:58 -0000    1.11
--- src/backend/storage/buffer/README    16 May 2007 11:43:11 -0000
***************
*** 152,159 ****
  a field to show which backend is doing its I/O).


! Buffer replacement strategy
! ---------------------------

  There is a "free list" of buffers that are prime candidates for replacement.
  In particular, buffers that are completely free (contain no valid page) are
--- 152,159 ----
  a field to show which backend is doing its I/O).


! Normal buffer replacement strategy
! ----------------------------------

  There is a "free list" of buffers that are prime candidates for replacement.
  In particular, buffers that are completely free (contain no valid page) are
***************
*** 199,221 ****
  have to give up and try another buffer.  This however is not a concern
  of the basic select-a-victim-buffer algorithm.)

- A special provision is that while running VACUUM, a backend does not
- increment the usage count on buffers it accesses.  In fact, if ReleaseBuffer
- sees that it is dropping the pin count to zero and the usage count is zero,
- then it appends the buffer to the tail of the free list.  (This implies that
- VACUUM, but only VACUUM, must take the BufFreelistLock during ReleaseBuffer;
- this shouldn't create much of a contention problem.)  This provision
- encourages VACUUM to work in a relatively small number of buffers rather
- than blowing out the entire buffer cache.  It is reasonable since a page
- that has been touched only by VACUUM is unlikely to be needed again soon.
-
- Since VACUUM usually requests many pages very fast, the effect of this is that
- it will get back the very buffers it filled and possibly modified on the next
- call and will therefore do its work in a few shared memory buffers, while
- being able to use whatever it finds in the cache already.  This also implies
- that most of the write traffic caused by a VACUUM will be done by the VACUUM
- itself and not pushed off onto other processes.


  Background writer's processing
  ------------------------------
--- 199,243 ----
  have to give up and try another buffer.  This however is not a concern
  of the basic select-a-victim-buffer algorithm.)


+ Buffer ring replacement strategy
+ ---------------------------------
+
+ When running a query that needs to access a large number of pages, like VACUUM,
+ COPY, or a large sequential scan, a different strategy is used.  A page that
+ has been touched only by such a scan is unlikely to be needed again soon, so
+ instead of running the normal clock sweep algorithm and blowing out the entire
+ buffer cache, a small ring of buffers is allocated using the normal clock sweep
+ algorithm and those buffers are reused for the whole scan.  This also implies
+ that most of the write traffic caused by such a statement will be done by the
+ backend itself and not pushed off onto other processes.
+
+ The size of the ring used depends on the kind of scan:
+
+ For sequential scans, a small 256 KB ring is used. That's small enough to fit
+ in L2 cache, which makes transferring pages from OS cache to shared buffer
+ cache efficient. Even less would often be enough, but the ring must be big
+ enough to accommodate all pages in the scan that are pinned concurrently.
+ 256 KB should also be enough to leave a small cache trail for other backends to
+ join in a synchronized seq scan. If a buffer is dirtied and LSN set, the buffer
+ is removed from the ring and a replacement buffer is chosen using the normal
+ replacement strategy. In a scan that modifies every page in the scan, like a
+ bulk UPDATE or DELETE, the buffers in the ring will always be dirtied and the
+ ring strategy effectively degrades to the normal strategy.
+
+ VACUUM uses a 256 KB ring like sequential scans, but dirty pages are not
+ removed from the ring. WAL is flushed instead to allow reuse of the buffers.
+ Before introducing the buffer ring strategy in 8.3, buffers were put to the
+ freelist, which was effectively a buffer ring of 1 buffer.
+
+ COPY behaves like VACUUM, but a much larger ring is used. The ring size is
+ chosen to be twice the WAL segment size. This avoids polluting the buffer cache
+ like the clock sweep would do, and using a ring larger than WAL segment size
+ avoids having to do any extra WAL flushes, since a WAL segment will always be
+ filled, forcing a WAL flush, before looping through the buffer ring and bumping
+ into a buffer that would force a WAL flush. However, for non-WAL-logged COPY
+ operations the smaller 256 KB ring is used because WAL flushes are not needed
+ to write the buffers.

  Background writer's processing
  ------------------------------
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.218
diff -c -r1.218 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c    2 May 2007 23:34:48 -0000    1.218
--- src/backend/storage/buffer/bufmgr.c    16 May 2007 12:34:10 -0000
***************
*** 419,431 ****
      /* Loop here in case we have to try another victim buffer */
      for (;;)
      {
          /*
           * Select a victim buffer.    The buffer is returned with its header
           * spinlock still held!  Also the BufFreelistLock is still held, since
           * it would be bad to hold the spinlock while possibly waking up other
           * processes.
           */
!         buf = StrategyGetBuffer();

          Assert(buf->refcount == 0);

--- 419,433 ----
      /* Loop here in case we have to try another victim buffer */
      for (;;)
      {
+         bool lock_held;
+
          /*
           * Select a victim buffer.    The buffer is returned with its header
           * spinlock still held!  Also the BufFreelistLock is still held, since
           * it would be bad to hold the spinlock while possibly waking up other
           * processes.
           */
!         buf = StrategyGetBuffer(&lock_held);

          Assert(buf->refcount == 0);

***************
*** 436,442 ****
          PinBuffer_Locked(buf);

          /* Now it's safe to release the freelist lock */
!         LWLockRelease(BufFreelistLock);

          /*
           * If the buffer was dirty, try to write it out.  There is a race
--- 438,445 ----
          PinBuffer_Locked(buf);

          /* Now it's safe to release the freelist lock */
!         if (lock_held)
!             LWLockRelease(BufFreelistLock);

          /*
           * If the buffer was dirty, try to write it out.  There is a race
***************
*** 464,469 ****
--- 467,489 ----
               */
              if (LWLockConditionalAcquire(buf->content_lock, LW_SHARED))
              {
+                 /* In BULKREAD-mode, check if a WAL flush would be needed to
+                  * evict this buffer. If so, ask the replacement strategy if
+                  * we should go ahead and do it or choose another victim.
+                  */
+                 if (active_access_pattern == AP_BULKREAD)
+                 {
+                     if (XLogNeedsFlush(BufferGetLSN(buf)))
+                     {
+                         if (StrategyRejectBuffer(buf))
+                         {
+                             LWLockRelease(buf->content_lock);
+                             UnpinBuffer(buf, true, false);
+                             continue;
+                         }
+                     }
+                 }
+
                  FlushBuffer(buf, NULL);
                  LWLockRelease(buf->content_lock);
              }
***************
*** 925,932 ****
      PrivateRefCount[b]--;
      if (PrivateRefCount[b] == 0)
      {
-         bool        immed_free_buffer = false;
-
          /* I'd better not still hold any locks on the buffer */
          Assert(!LWLockHeldByMe(buf->content_lock));
          Assert(!LWLockHeldByMe(buf->io_in_progress_lock));
--- 945,950 ----
***************
*** 940,956 ****
          /* Update buffer usage info, unless this is an internal access */
          if (normalAccess)
          {
!             if (!strategy_hint_vacuum)
              {
!                 if (buf->usage_count < BM_MAX_USAGE_COUNT)
!                     buf->usage_count++;
              }
              else
!             {
!                 /* VACUUM accesses don't bump usage count, instead... */
!                 if (buf->refcount == 0 && buf->usage_count == 0)
!                     immed_free_buffer = true;
!             }
          }

          if ((buf->flags & BM_PIN_COUNT_WAITER) &&
--- 958,975 ----
          /* Update buffer usage info, unless this is an internal access */
          if (normalAccess)
          {
!             if (active_access_pattern != AP_NORMAL)
              {
!                 /* We don't want large one-off scans like vacuum to inflate
!                  * the usage_count. We do want to set it to 1, though, to keep
!                  * other backends from hijacking it from the buffer ring.
!                  */
!                 if (buf->usage_count == 0)
!                     buf->usage_count = 1;
              }
              else
!             if (buf->usage_count < BM_MAX_USAGE_COUNT)
!                 buf->usage_count++;
          }

          if ((buf->flags & BM_PIN_COUNT_WAITER) &&
***************
*** 965,978 ****
          }
          else
              UnlockBufHdr(buf);
-
-         /*
-          * If VACUUM is releasing an otherwise-unused buffer, send it to the
-          * freelist for near-term reuse.  We put it at the tail so that it
-          * won't be used before any invalid buffers that may exist.
-          */
-         if (immed_free_buffer)
-             StrategyFreeBuffer(buf, false);
      }
  }

--- 984,989 ----
Index: src/backend/storage/buffer/freelist.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/freelist.c,v
retrieving revision 1.58
diff -c -r1.58 freelist.c
*** src/backend/storage/buffer/freelist.c    5 Jan 2007 22:19:37 -0000    1.58
--- src/backend/storage/buffer/freelist.c    17 May 2007 16:12:56 -0000
***************
*** 18,23 ****
--- 18,25 ----
  #include "storage/buf_internals.h"
  #include "storage/bufmgr.h"

+ #include "utils/memutils.h"
+

  /*
   * The shared freelist control information.
***************
*** 39,47 ****
  /* Pointers to shared state */
  static BufferStrategyControl *StrategyControl = NULL;

! /* Backend-local state about whether currently vacuuming */
! bool        strategy_hint_vacuum = false;


  /*
   * StrategyGetBuffer
--- 41,53 ----
  /* Pointers to shared state */
  static BufferStrategyControl *StrategyControl = NULL;

! /* Currently active access pattern hint. */
! AccessPattern active_access_pattern = AP_NORMAL;

+ /* prototypes for internal functions */
+ static volatile BufferDesc *GetBufferFromRing(void);
+ static void PutBufferToRing(volatile BufferDesc *buf);
+ static void InitRing(void);

  /*
   * StrategyGetBuffer
***************
*** 51,67 ****
   *    the selected buffer must not currently be pinned by anyone.
   *
   *    To ensure that no one else can pin the buffer before we do, we must
!  *    return the buffer with the buffer header spinlock still held.  That
!  *    means that we return with the BufFreelistLock still held, as well;
!  *    the caller must release that lock once the spinlock is dropped.
   */
  volatile BufferDesc *
! StrategyGetBuffer(void)
  {
      volatile BufferDesc *buf;
      int            trycounter;

      LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);

      /*
       * Try to get a buffer from the freelist.  Note that the freeNext fields
--- 57,89 ----
   *    the selected buffer must not currently be pinned by anyone.
   *
   *    To ensure that no one else can pin the buffer before we do, we must
!  *    return the buffer with the buffer header spinlock still held.  If
!  *    *lock_held is set at return, we return with the BufFreelistLock still
!  *    held, as well;    the caller must release that lock once the spinlock is
!  *    dropped.
   */
  volatile BufferDesc *
! StrategyGetBuffer(bool *lock_held)
  {
      volatile BufferDesc *buf;
      int            trycounter;

+     /* Get a buffer from the ring if we're doing a bulk scan */
+     if (active_access_pattern != AP_NORMAL)
+     {
+         buf = GetBufferFromRing();
+         if (buf != NULL)
+         {
+             *lock_held = false;
+             return buf;
+         }
+     }
+
+     /*
+      * If our selected buffer wasn't available, pick another...
+      */
      LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+     *lock_held = true;

      /*
       * Try to get a buffer from the freelist.  Note that the freeNext fields
***************
*** 86,96 ****
           */
          LockBufHdr(buf);
          if (buf->refcount == 0 && buf->usage_count == 0)
              return buf;
          UnlockBufHdr(buf);
      }

!     /* Nothing on the freelist, so run the "clock sweep" algorithm */
      trycounter = NBuffers;
      for (;;)
      {
--- 108,122 ----
           */
          LockBufHdr(buf);
          if (buf->refcount == 0 && buf->usage_count == 0)
+         {
+             if (active_access_pattern != AP_NORMAL)
+                 PutBufferToRing(buf);
              return buf;
+         }
          UnlockBufHdr(buf);
      }

!     /* Nothing on the freelist, so run the shared "clock sweep" algorithm */
      trycounter = NBuffers;
      for (;;)
      {
***************
*** 105,111 ****
--- 131,141 ----
           */
          LockBufHdr(buf);
          if (buf->refcount == 0 && buf->usage_count == 0)
+         {
+             if (active_access_pattern != AP_NORMAL)
+                 PutBufferToRing(buf);
              return buf;
+         }
          if (buf->usage_count > 0)
          {
              buf->usage_count--;
***************
*** 191,204 ****
  }

  /*
!  * StrategyHintVacuum -- tell us whether VACUUM is active
   */
  void
! StrategyHintVacuum(bool vacuum_active)
  {
!     strategy_hint_vacuum = vacuum_active;
! }


  /*
   * StrategyShmemSize
--- 221,245 ----
  }

  /*
!  * SetAccessPattern -- Sets the active access pattern hint
!  *
!  * Caller is responsible for resetting the hint to AP_NORMAL after the bulk
!  * operation is done. It's ok to switch repeatedly between AP_NORMAL and one of
!  * the other strategies, for example in a query with one large sequential scan
!  * nested loop joined to an index scan. Index tuples should be fetched with the
!  * normal strategy and the pages from the seq scan should be read in with the
!  * AP_BULKREAD strategy. The ring won't be affected by such switching, however
!  * switching to an access pattern with different ring size will invalidate the
!  * old ring.
   */
  void
! SetAccessPattern(AccessPattern new_pattern)
  {
!     active_access_pattern = new_pattern;

+     if (active_access_pattern != AP_NORMAL)
+         InitRing();
+ }

  /*
   * StrategyShmemSize
***************
*** 274,276 ****
--- 315,498 ----
      else
          Assert(!init);
  }
+
+ /* ----------------------------------------------------------------
+  *                Backend-private buffer ring management
+  * ----------------------------------------------------------------
+  */
+
+ /*
+  * Ring sizes for different access patterns. See README for the rationale
+  * of these.
+  */
+ #define BULKREAD_RING_SIZE    256 * 1024 / BLCKSZ
+ #define VACUUM_RING_SIZE    256 * 1024 / BLCKSZ
+ #define COPY_RING_SIZE        Min(NBuffers / 8, (XLOG_SEG_SIZE / BLCKSZ) * 2)
+
+ /*
+  * BufferRing is an array of buffer ids, and RingSize it's size in number of
+  * elements. It's allocated in TopMemoryContext the first time it's needed.
+  */
+ static int *BufferRing = NULL;
+ static int RingSize = 0;
+
+ /* Index of the "current" slot in the ring. It's advanced every time a buffer
+  * is handed out from the ring with GetBufferFromRing and it points to the
+  * last buffer returned from the ring. RingCurSlot + 1 is the next victim
+  * GetBufferRing will hand out.
+  */
+ static int RingCurSlot = 0;
+
+ /* magic value to mark empty slots in the ring */
+ #define BUF_ID_NOT_SET -1
+
+
+ /*
+  * GetBufferFromRing -- returns a buffer from the ring, or NULL if the
+  *        ring is empty.
+  *
+  * The bufhdr spin lock is held on the returned buffer.
+  */
+ static volatile BufferDesc *
+ GetBufferFromRing(void)
+ {
+     volatile BufferDesc *buf;
+
+     /* ring should be initialized by now */
+     Assert(RingSize > 0 && BufferRing != NULL);
+
+     /* Run private "clock cycle" */
+     if (++RingCurSlot >= RingSize)
+         RingCurSlot = 0;
+
+     /*
+      * If that slot hasn't been filled yet, tell the caller to allocate
+      * a new buffer with the normal allocation strategy. He will then
+      * fill this slot by calling PutBufferToRing with the new buffer.
+      */
+     if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET)
+         return NULL;
+
+     buf = &BufferDescriptors[BufferRing[RingCurSlot]];
+
+     /*
+      * If the buffer is pinned we cannot use it under any circumstances.
+      * If usage_count == 0 then the buffer is fair game.
+      *
+      * We also choose this buffer if usage_count == 1. Strictly, this
+      * might sometimes be the wrong thing to do, but we rely on the high
+      * probability that it was this process that last touched the buffer.
+      * If it wasn't, we'll choose a suboptimal victim, but  it shouldn't
+      * make any difference in the big scheme of things.
+      *
+      */
+     LockBufHdr(buf);
+     if (buf->refcount == 0 && buf->usage_count <= 1)
+         return buf;
+     UnlockBufHdr(buf);
+
+     return NULL;
+ }
+
+ /*
+  * PutBufferToRing -- adds a buffer to the buffer ring
+  *
+  * Caller must hold the buffer header spinlock on the buffer.
+  */
+ static void
+ PutBufferToRing(volatile BufferDesc *buf)
+ {
+     /* ring should be initialized by now */
+     Assert(RingSize > 0 && BufferRing != NULL);
+
+     if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET)
+         BufferRing[RingCurSlot] = buf->buf_id;
+ }
+
+ /*
+  * Initializes a ring buffer with correct size for the currently
+  * active strategy. Does nothing if the ring already has the right size.
+  */
+ static void
+ InitRing(void)
+ {
+     int new_size;
+     int old_size = RingSize;
+     int i;
+     MemoryContext oldcxt;
+
+     /* Determine new size */
+
+     switch(active_access_pattern)
+     {
+         case AP_BULKREAD:
+             new_size = BULKREAD_RING_SIZE;
+             break;
+         case AP_COPY:
+             new_size = COPY_RING_SIZE;
+             break;
+         case AP_VACUUM:
+             new_size = VACUUM_RING_SIZE;
+             break;
+         default:
+             elog(ERROR, "unexpected buffer cache strategy %d",
+                  active_access_pattern);
+             return; /* keep compile happy */
+     }
+
+     /*
+      * Seq scans set and reset the strategy on every page, so we better exit
+      * quickly if no change in size is needed.
+      */
+     if (new_size == old_size)
+         return;
+
+     /* Allocate array */
+
+     oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+     if (old_size == 0)
+     {
+         Assert(BufferRing == NULL);
+         BufferRing = palloc(new_size * sizeof(int));
+     }
+     else
+         BufferRing = repalloc(BufferRing, new_size * sizeof(int));
+
+     MemoryContextSwitchTo(oldcxt);
+
+     for(i = 0; i < new_size; i++)
+         BufferRing[i] = BUF_ID_NOT_SET;
+
+     RingCurSlot = 0;
+     RingSize = new_size;
+ }
+
+ /*
+  * Buffer manager calls this function in AP_BULKREAD mode when the
+  * buffer handed to it turns out to need a WAL flush to write out. This
+  * gives the strategy a second chance to choose another victim.
+  *
+  * Returns true if buffer manager should ask for a new victim, and false
+  * if WAL should be flushed and this buffer used.
+  */
+ bool
+ StrategyRejectBuffer(volatile BufferDesc *buf)
+ {
+     Assert(RingSize > 0);
+
+     if (BufferRing[RingCurSlot] == buf->buf_id)
+     {
+         BufferRing[RingCurSlot] = BUF_ID_NOT_SET;
+         return true;
+     }
+     else
+     {
+         /* Apparently the buffer didn't come from the ring. We don't want to
+          * mess with how the clock sweep works; in worst case there's no
+          * buffers in the buffer cache that can be reused without a WAL flush,
+          * and we'd get into an endless loop trying.
+          */
+         return false;
+     }
+ }
Index: src/include/access/relscan.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/relscan.h,v
retrieving revision 1.52
diff -c -r1.52 relscan.h
*** src/include/access/relscan.h    20 Jan 2007 18:43:35 -0000    1.52
--- src/include/access/relscan.h    15 May 2007 17:01:31 -0000
***************
*** 28,33 ****
--- 28,34 ----
      ScanKey        rs_key;            /* array of scan key descriptors */
      BlockNumber rs_nblocks;        /* number of blocks to scan */
      bool        rs_pageatatime; /* verify visibility page-at-a-time? */
+     AccessPattern rs_accesspattern; /* access pattern to use for reads */

      /* scan current state */
      bool        rs_inited;        /* false = scan not init'd yet */
Index: src/include/access/xlog.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v
retrieving revision 1.76
diff -c -r1.76 xlog.h
*** src/include/access/xlog.h    5 Jan 2007 22:19:51 -0000    1.76
--- src/include/access/xlog.h    14 May 2007 21:22:40 -0000
***************
*** 151,156 ****
--- 151,157 ----

  extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
  extern void XLogFlush(XLogRecPtr RecPtr);
+ extern bool XLogNeedsFlush(XLogRecPtr RecPtr);

  extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);
Index: src/include/storage/buf_internals.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/buf_internals.h,v
retrieving revision 1.89
diff -c -r1.89 buf_internals.h
*** src/include/storage/buf_internals.h    5 Jan 2007 22:19:57 -0000    1.89
--- src/include/storage/buf_internals.h    15 May 2007 17:07:59 -0000
***************
*** 16,21 ****
--- 16,22 ----
  #define BUFMGR_INTERNALS_H

  #include "storage/buf.h"
+ #include "storage/bufmgr.h"
  #include "storage/lwlock.h"
  #include "storage/shmem.h"
  #include "storage/spin.h"
***************
*** 168,174 ****
  extern BufferDesc *LocalBufferDescriptors;

  /* in freelist.c */
! extern bool strategy_hint_vacuum;

  /* event counters in buf_init.c */
  extern long int ReadBufferCount;
--- 169,175 ----
  extern BufferDesc *LocalBufferDescriptors;

  /* in freelist.c */
! extern AccessPattern active_access_pattern;

  /* event counters in buf_init.c */
  extern long int ReadBufferCount;
***************
*** 184,195 ****
   */

  /* freelist.c */
! extern volatile BufferDesc *StrategyGetBuffer(void);
  extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head);
  extern int    StrategySyncStart(void);
  extern Size StrategyShmemSize(void);
  extern void StrategyInitialize(bool init);

  /* buf_table.c */
  extern Size BufTableShmemSize(int size);
  extern void InitBufTable(int size);
--- 185,198 ----
   */

  /* freelist.c */
! extern volatile BufferDesc *StrategyGetBuffer(bool *lock_held);
  extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head);
  extern int    StrategySyncStart(void);
  extern Size StrategyShmemSize(void);
  extern void StrategyInitialize(bool init);

+ extern bool StrategyRejectBuffer(volatile BufferDesc *buf);
+
  /* buf_table.c */
  extern Size BufTableShmemSize(int size);
  extern void InitBufTable(int size);
Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.103
diff -c -r1.103 bufmgr.h
*** src/include/storage/bufmgr.h    2 May 2007 23:18:03 -0000    1.103
--- src/include/storage/bufmgr.h    15 May 2007 17:07:02 -0000
***************
*** 48,53 ****
--- 48,61 ----
  #define BUFFER_LOCK_SHARE        1
  #define BUFFER_LOCK_EXCLUSIVE    2

+ typedef enum AccessPattern
+ {
+     AP_NORMAL,        /* Normal random access */
+     AP_BULKREAD,    /* Large read-only scan (hint bit updates are ok) */
+     AP_COPY,        /* Large updating scan, like COPY with WAL enabled */
+     AP_VACUUM,        /* VACUUM */
+ } AccessPattern;
+
  /*
   * These routines are beaten on quite heavily, hence the macroization.
   */
***************
*** 157,162 ****
  extern void AtProcExit_LocalBuffers(void);

  /* in freelist.c */
! extern void StrategyHintVacuum(bool vacuum_active);

  #endif
--- 165,170 ----
  extern void AtProcExit_LocalBuffers(void);

  /* in freelist.c */
! extern void SetAccessPattern(AccessPattern new_pattern);

  #endif


Re: Seq scans status update

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> I've completed a set of performance tests on a test server. The server
> has 4 GB of RAM, of which 1 GB is used for shared_buffers.

Perhaps I'm misreading it, but these tests seem to show no improvement
worth spending any effort on --- some of the tests improve a bit but
many get worse, and that's for tests allegedly designed to highlight the
improvement; there's been no attempt to measure the distributed cost of
the additional logic in scenarios where it's not helpful.  To say
nothing of the likelihood that it will be flat-out counterproductive
in yet other scenarios.

regression=# select id,sum(rt),count(*) from times group by id;
     id     |       sum       | count
------------+-----------------+-------
 10G        | 01:15:53.497114 |    20
 10Gpatched | 01:12:51.749906 |    20
 2G         | 00:11:54.343741 |    20
 2Gpatched  | 00:11:32.482733 |    20
(4 rows)

Should we not just reject the patch and move on to something more
useful?

            regards, tom lane

Re: Seq scans status update

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> I've completed a set of performance tests on a test server. The server
>> has 4 GB of RAM, of which 1 GB is used for shared_buffers.
>
> Perhaps I'm misreading it, but these tests seem to show no improvement
> worth spending any effort on --- some of the tests improve a bit but
> many get worse, and that's for tests allegedly designed to highlight the
> improvement; there's been no attempt to measure the distributed cost of
> the additional logic in scenarios where it's not helpful.  To say
> nothing of the likelihood that it will be flat-out counterproductive
> in yet other scenarios.

Yeah, possibly. I've been thinking hard about scenarios where this would
be counterproductive; bulk delete is one that the original patch hurt, but
I think I have all the interesting scenarios covered now.

> Should we not just reject the patch and move on to something more
> useful?

If Luke is right, the L2 cache effect that's visible in these in-memory
tests:

  select-medsize-patched            | 00:00:01.332975
  select-medsize-patched            | 00:00:01.33014
  select-medsize-patched            | 00:00:01.332392
  select-medsize-patched            | 00:00:01.333498
  select-medsize-patched            | 00:00:01.332692
  select-medsize-unpatched          | 00:00:02.678069
  select-medsize-unpatched          | 00:00:02.666103
  select-medsize-unpatched          | 00:00:02.673494
  select-medsize-unpatched          | 00:00:02.669645
  select-medsize-unpatched          | 00:00:02.666278

is also visible on larger scans that don't fit in cache with bigger I/O
hardware, and this patch would increase the maximum I/O throughput that we
can handle on such hardware. I don't have such hardware available; I hope
someone else will try that.

In addition to that, this should reduce the cache-spoiling effect of big
scans and COPY. That's harder to quantify; I've been planning to run a
TPC-C test with a large SELECT COUNT(*) running in the background to
measure that, but the server and I have been busy with those other tests.

In any case, I do want this for VACUUMs to fix the "WAL flush for every
dirty page" problem. Maybe we should indeed drop the other aspects of
the patch and move on, I'm getting tired of this as well.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Seq scans status update

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> In any case, I do want this for VACUUMs to fix the "WAL flush for every
> dirty page" problem. Maybe we should indeed drop the other aspects of
> the patch and move on, I'm getting tired of this as well.

Can we devise a small patch that fixes that issue without creating
uncertain impacts on other behavior?

The thing that has made me uncomfortable with this set of patches right
along is the presumption that it's a good idea to tweak cache behavior
to improve a small set of cases.  We are using cache behavior (now
clock-sweep, formerly LRU) that is well tested and proven across a large
body of experience; my gut feeling is that deviating from that behavior
is likely to be a net loss when the big picture is considered.  I
certainly have not seen any test results that try to prove that there's
no such generalized disadvantage to these patches.

One could argue that the real bug here is that VACUUM deviates from the
standard behavior by forcing immediate recycling of pages it releases,
and that getting rid of that wart is the correct answer rather than
piling more warts atop it --- especially warts that will change the
behavior for anything besides VACUUM.

            regards, tom lane

Re: Seq scans status update

From
"Luke Lonergan"
Date:
Hi Heikki,

On 5/17/07 10:28 AM, "Heikki Linnakangas" <heikki@enterprisedb.com> wrote:

> is also visible on larger scans that don't fit in cache with bigger I/O
> hardware, and this patch would increase the max. I/O throughput that we
> can handle on such hardware. I don't have such hardware available, I
> hope someone else will try that.

Yes, this is absolutely the case, in addition to the benefits of not
polluting the bufcache with seq scans (as discussed in detail previously).
We've adopted this (see CK's patch) with excellent benefits.

We can try your version on a machine with fast I/O and get back to you with
a comparison of this and CK's version.

- Luke



Re: Seq scans status update

From
Heikki Linnakangas
Date:
CK Tan wrote:
> If it is convenient for you, could you run my patch against the same
> hardware and data to get some numbers on select for comparison? Although
> we don't address updates, copy, or inserts,  we are definitely getting
> at least 20% improvement in scans here without poisoning the bufpool for
> large tables.

Sure, here you are:

  copy-cktan-big                    | 00:10:08.843126
  copy-cktan-big                    | 00:10:17.767606
  copy-cktan-big                    | 00:09:37.797059
  copy_nowal-cktan-big              | 00:05:12.305025
  copy_nowal-cktan-big              | 00:05:10.518417
  copy_nowal-cktan-big              | 00:05:03.472939
  select-cktan-big                  | 00:03:27.655982
  select-cktan-big                  | 00:01:55.496408
  select-cktan-big                  | 00:01:31.693856
  select-cktan-big                  | 00:01:12.705268
  select-cktan-big                  | 00:01:12.478247
  select-cktan-big                  | 00:01:10.866484
  vacuum_clean-cktan-big            | 00:03:05.340875
  vacuum_clean-cktan-big            | 00:01:12.428317
  vacuum_clean-cktan-big            | 00:01:13.179957
  vacuum_clean-cktan-big            | 00:01:10.438888
  vacuum_hintbits-cktan-big         | 00:03:58.78208
  vacuum_hintbits-cktan-big         | 00:04:02.515778
  vacuum_hintbits-cktan-big         | 00:04:19.083402
  vacuum_hintbits-cktan-big         | 00:04:11.170831
  copy-cktan-med                    | 00:02:19.413484
  copy-cktan-med                    | 00:02:22.270497
  copy-cktan-med                    | 00:02:22.297946
  copy_nowal-cktan-med              | 00:01:31.192601
  copy_nowal-cktan-med              | 00:01:17.736356
  copy_nowal-cktan-med              | 00:01:32.272778
  select-cktan-med                  | 00:00:03.774974
  select-cktan-med                  | 00:00:01.279276
  select-cktan-med                  | 00:00:01.297703
  select-cktan-med                  | 00:00:01.304129
  select-cktan-med                  | 00:00:01.297048
  select-cktan-med                  | 00:00:01.306073
  vacuum_clean-cktan-med            | 00:00:01.820733
  vacuum_clean-cktan-med            | 00:00:01.755684
  vacuum_clean-cktan-med            | 00:00:01.755659
  vacuum_clean-cktan-med            | 00:00:01.750972
  vacuum_hintbits-cktan-med         | 00:00:01.58744
  vacuum_hintbits-cktan-med         | 00:00:06.216038
  vacuum_hintbits-cktan-med         | 00:00:02.201789
  vacuum_hintbits-cktan-med         | 00:00:36.494427

Select performance looks the same as with Simon's/my patch.

That 20% gain must be hardware-dependent. My interpretation is that the
CPU savings from more efficient cache usage only matters when the I/O
bandwidth is high enough that CPU becomes the bottleneck.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Seq scans status update

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> In any case, I do want this for VACUUMs to fix the "WAL flush for every
>> dirty page" problem. Maybe we should indeed drop the other aspects of
>> the patch and move on, I'm getting tired of this as well.
>
> Can we devise a small patch that fixes that issue without creating
> uncertain impacts on other behavior?

Yeah certainly, that's my backup plan.

> The thing that has made me uncomfortable with this set of patches right
> along is the presumption that it's a good idea to tweak cache behavior
> to improve a small set of cases.  We are using cache behavior (now
> clock-sweep, formerly LRU) that is well tested and proven across a large
> body of experience; my gut feeling is that deviating from that behavior
> is likely to be a net loss when the big picture is considered.  I
> certainly have not seen any test results that try to prove that there's
> no such generalized disadvantage to these patches.

That was my initial reaction as well. However, people claimed that by
using a small number of buffers you can achieve higher raw throughput.
Avoiding the cache spoiling sounds plausible as well: if you're going to
do a seq scan of a table larger than shared_buffers, you know those
pages are not going to be needed in the near future; you're going to
replace them yourself with pages from the same scan.

> One could argue that the real bug here is that VACUUM deviates from the
> standard behavior by forcing immediate recycling of pages it releases,
> and that getting rid of that wart is the correct answer rather than
> piling more warts atop it --- especially warts that will change the
> behavior for anything besides VACUUM.

Maybe a better approach would be switching to an algorithm that's more
resistant to sequential scans by nature. That's definitely not going to
happen for 8.3, though.

It's also possible that whatever algorithm we choose doesn't make much
difference in practice, because in typical configurations the OS
cache is bigger and more significant anyway. It's also possible that
there are counterproductive interactions between our cache management
and the OS's.

In any case, I'd like to see more test results before we make a
decision. I'm running tests with DBT-2 and a seq scan running in the
background to see if the cache-spoiling effect shows up. I'm also trying
to get hold of some bigger hardware to run on. Running these tests takes
some calendar time, but the hard work has already been done. I'm going
to start reviewing Jeff's synchronized scans patch now.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Seq scans status update

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> In any case, I'd like to see more test results before we make a
> decision. I'm running tests with DBT-2 and a seq scan running in the
> background to see if the cache-spoiling effect shows up. I'm also trying
> to get hold of some bigger hardware to run on. Running these tests takes
> some calendar time, but the hard work has already been done. I'm going
> to start reviewing Jeff's synchronized scans patch now.

Here are the results of the DBT-2 tests:

http://community.enterprisedb.com/seqscan/imola/

In each of these tests, at the end of rampup a script is started that
issues a "SELECT COUNT(*) FROM stock" in a loop, with a 2 minute delay
between the end of the previous query and the start of the next one.
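
For reference, the background load is easy to reproduce; the sketch
below is a libpq loop doing the same thing. It is an illustration only,
not the actual script used in these tests, and the connection string
and error handling are placeholders.

/*
 * Illustrative libpq driver: issue SELECT COUNT(*) FROM stock in a loop,
 * sleeping 2 minutes between the end of one query and the start of the
 * next.  Build with something like:
 *   cc bgscan.c -I$(pg_config --includedir) -L$(pg_config --libdir) -lpq
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=dbt2");  /* placeholder conninfo */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return EXIT_FAILURE;
    }

    for (;;)
    {
        PGresult   *res = PQexec(conn, "SELECT COUNT(*) FROM stock");

        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            printf("count = %s\n", PQgetvalue(res, 0, 0));
        else
            fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        PQclear(res);

        sleep(120);             /* 2 minute delay before the next query */
    }

    PQfinish(conn);             /* not reached */
    return EXIT_SUCCESS;
}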

The patch makes the seq scans go significantly faster. In the 1 hour
test period, the patched tests perform roughly 30-100% more selects
than the unpatched tests.

With 100 and 105 warehouses, it also significantly reduces the impact of
the seq scan on other queries; response times are lower with the patch.
With 120 warehouses the reduction of impact is not as clear, but when
you plot the response times it's still there (the plots on the "response
times charts"-page are useless because they're overwhelmed by the
checkpoint spike).

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Seq scans status update

From
Heikki Linnakangas
Date:
I forgot to attach the program used to generate test data. Here it is.

Heikki Linnakangas wrote:
> Attached is a new version of Simon's "scan-resistant buffer manager"
> patch. It's not ready for committing yet because of a small issue I
> found this morning (* see bottom), but here's a status update.
>
> To recap, the basic idea is to use a small ring of buffers for large
> scans like VACUUM, COPY and seq-scans. Changes to the original patch:
>
> - a different sized ring is used for VACUUMs and seq-scans, and COPY.
> VACUUM and COPY use a ring of 32 buffers, and COPY uses a ring of 4096
> buffers in default configuration. See README changes in the patch for
> rationale.
>
> - for queries with large seqscans, the buffer ring is only used for
> reads issued by the seq scan, not for any other reads in the query.
> Typical scenario where this matters is doing a large seq scan with a
> nested loop join to a smaller table. You don't want to use the buffer
> ring for index lookups inside the nested loop.
>
> - for seqscans, drop buffers from the ring that would need a WAL flush
> to reuse. That makes bulk updates to behave roughly like they do without
> the patch, instead of having to do a WAL flush every 32 pages.
>
> I've spent a lot of time thinking of solutions to the last point. The
> obvious solution would be to not use the buffer ring for updating scans.
> The difficulty with that is that we don't know if a scan is read-only in
> heapam.c, where the hint to use the buffer ring is set.
>
> I've completed a set of performance tests on a test server. The server
> has 4 GB of RAM, of which 1 GB is used for shared_buffers.
>
> Results for a 10 GB table:
>
>  head-copy-bigtable               | 00:10:09.07016
>  head-copy-bigtable               | 00:10:20.507357
>  head-copy-bigtable               | 00:10:21.857677
>  head-copy_nowal-bigtable         | 00:05:18.232956
>  head-copy_nowal-bigtable         | 00:03:24.109047
>  head-copy_nowal-bigtable         | 00:05:31.019643
>  head-select-bigtable             | 00:03:47.102731
>  head-select-bigtable             | 00:01:08.314719
>  head-select-bigtable             | 00:01:08.238509
>  head-select-bigtable             | 00:01:08.208563
>  head-select-bigtable             | 00:01:08.28347
>  head-select-bigtable             | 00:01:08.308671
>  head-vacuum_clean-bigtable       | 00:01:04.227832
>  head-vacuum_clean-bigtable       | 00:01:04.232258
>  head-vacuum_clean-bigtable       | 00:01:04.294621
>  head-vacuum_clean-bigtable       | 00:01:04.280677
>  head-vacuum_hintbits-bigtable    | 00:04:01.123924
>  head-vacuum_hintbits-bigtable    | 00:03:58.253175
>  head-vacuum_hintbits-bigtable    | 00:04:26.318159
>  head-vacuum_hintbits-bigtable    | 00:04:37.512965
>  patched-copy-bigtable            | 00:09:52.776754
>  patched-copy-bigtable            | 00:10:18.185826
>  patched-copy-bigtable            | 00:10:16.975482
>  patched-copy_nowal-bigtable      | 00:03:14.882366
>  patched-copy_nowal-bigtable      | 00:04:01.04648
>  patched-copy_nowal-bigtable      | 00:03:56.062272
>  patched-select-bigtable          | 00:03:47.704154
>  patched-select-bigtable          | 00:01:08.460326
>  patched-select-bigtable          | 00:01:10.441544
>  patched-select-bigtable          | 00:01:11.916221
>  patched-select-bigtable          | 00:01:13.848038
>  patched-select-bigtable          | 00:01:10.956133
>  patched-vacuum_clean-bigtable    | 00:01:10.315439
>  patched-vacuum_clean-bigtable    | 00:01:12.210537
>  patched-vacuum_clean-bigtable    | 00:01:15.202114
>  patched-vacuum_clean-bigtable    | 00:01:10.712235
>  patched-vacuum_hintbits-bigtable | 00:03:42.279201
>  patched-vacuum_hintbits-bigtable | 00:04:02.057778
>  patched-vacuum_hintbits-bigtable | 00:04:26.805822
>  patched-vacuum_hintbits-bigtable | 00:04:28.911184
>
> In other words, the patch has no significant effect, as expected. The
> select times did go up by a couple of seconds, which I didn't expect,
> though. One theory is that unused shared_buffers are swapped out during
> the tests, and bgwriter pulls them back in. I'll set swappiness to 0 and
> try again at some point.
>
> Results for a 2 GB table:
>
>  copy-medsize-unpatched            | 00:02:18.23246
>  copy-medsize-unpatched            | 00:02:22.347194
>  copy-medsize-unpatched            | 00:02:23.875874
>  copy_nowal-medsize-unpatched      | 00:01:27.606334
>  copy_nowal-medsize-unpatched      | 00:01:17.491243
>  copy_nowal-medsize-unpatched      | 00:01:31.902719
>  select-medsize-unpatched          | 00:00:03.786031
>  select-medsize-unpatched          | 00:00:02.678069
>  select-medsize-unpatched          | 00:00:02.666103
>  select-medsize-unpatched          | 00:00:02.673494
>  select-medsize-unpatched          | 00:00:02.669645
>  select-medsize-unpatched          | 00:00:02.666278
>  vacuum_clean-medsize-unpatched    | 00:00:01.091356
>  vacuum_clean-medsize-unpatched    | 00:00:01.923138
>  vacuum_clean-medsize-unpatched    | 00:00:01.917213
>  vacuum_clean-medsize-unpatched    | 00:00:01.917333
>  vacuum_hintbits-medsize-unpatched | 00:00:01.683718
>  vacuum_hintbits-medsize-unpatched | 00:00:01.864003
>  vacuum_hintbits-medsize-unpatched | 00:00:03.186596
>  vacuum_hintbits-medsize-unpatched | 00:00:02.16494
>  copy-medsize-patched              | 00:02:35.113501
>  copy-medsize-patched              | 00:02:25.269866
>  copy-medsize-patched              | 00:02:31.881089
>  copy_nowal-medsize-patched        | 00:01:00.254633
>  copy_nowal-medsize-patched        | 00:01:04.630687
>  copy_nowal-medsize-patched        | 00:01:03.729128
>  select-medsize-patched            | 00:00:03.201837
>  select-medsize-patched            | 00:00:01.332975
>  select-medsize-patched            | 00:00:01.33014
>  select-medsize-patched            | 00:00:01.332392
>  select-medsize-patched            | 00:00:01.333498
>  select-medsize-patched            | 00:00:01.332692
>  vacuum_clean-medsize-patched      | 00:00:01.140189
>  vacuum_clean-medsize-patched      | 00:00:01.062762
>  vacuum_clean-medsize-patched      | 00:00:01.062402
>  vacuum_clean-medsize-patched      | 00:00:01.07113
>  vacuum_hintbits-medsize-patched   | 00:00:17.865446
>  vacuum_hintbits-medsize-patched   | 00:00:15.162064
>  vacuum_hintbits-medsize-patched   | 00:00:01.704651
>  vacuum_hintbits-medsize-patched   | 00:00:02.671651
>
> This looks good to me, except for some glitch in the last
> vacuum_hintbits tests. Selects and vacuums benefit significantly, as
> does non-WAL-logged copy.
>
> Not shown here, but I ran tests earlier with vacuum on a table that
> actually had dead tuples to be removed. In that test the patched
> version really shone, reducing the runtime to ~1/6th. That was the
> original motivation of this patch: not having to do a WAL flush on every
> page in the 2nd phase of vacuum.
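
(For scale, with the default 8 kB block size the 256 kB ring is 32 buffers,
so a WAL-logged vacuum has to flush WAL at most once per 32 dirtied pages,
whereas the old freelist behaviour, effectively a ring of one buffer, could
force a flush for every page it dirtied.)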
>
> Test script attached. To use it:
>
> 1. Edit testscript.sh. Change BIGTABLESIZE.
> 2. Start postmaster
> 3. Run script, giving test-label as argument. For example:
> "./testscript.sh bigtable-patched"
>
> Attached is also the patch I used for the tests.
>
> I would appreciate it if people would download the patch and the script
> and repeat the tests on different hardware. I'm particularly interested
> in testing on a box with good I/O hardware where selects on unpatched
> PostgreSQL are bottlenecked by CPU.
>
> Barring any surprises I'm going to fix the remaining issue and submit a
> final patch, probably over the weekend.
>
> (*) The issue with this patch is that if the buffer cache is completely
> filled with dirty buffers that need a WAL flush to evict, the buffer
> ring code will get into an infinite loop trying to find one that doesn't
> need a WAL flush. Should be simple to fix.
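
To make the failure mode concrete, here is a stand-alone toy model of the
victim-selection loop (this is not the bufmgr code, and the rejection cap is
only a hypothetical illustration of a possible guard, not necessarily the fix
that went into the final patch):

/*
 * Toy model: every buffer in a tiny "cache" is dirty and would need a WAL
 * flush to evict. An unbounded reject-and-retry policy would spin forever;
 * capping the number of rejections per allocation makes the loop terminate
 * by falling back to "flush WAL and reuse the buffer anyway".
 */
#include <stdbool.h>
#include <stdio.h>

#define NBUFFERS    8
#define MAX_REJECTS NBUFFERS    /* hypothetical per-allocation bound */

static bool needs_wal_flush[NBUFFERS];

int main(void)
{
    int clock_hand = 0;
    int rejections = 0;

    /* worst case: every buffer would force a WAL flush if evicted */
    for (int i = 0; i < NBUFFERS; i++)
        needs_wal_flush[i] = true;

    for (;;)
    {
        int victim = clock_hand;

        clock_hand = (clock_hand + 1) % NBUFFERS;

        /* reject this victim, as StrategyRejectBuffer() would,
         * but only a bounded number of times */
        if (needs_wal_flush[victim] && rejections < MAX_REJECTS)
        {
            rejections++;
            continue;
        }

        /* either the buffer was clean, or we give up and flush WAL */
        printf("evicting buffer %d after %d rejections\n",
               victim, rejections);
        break;
    }
    return 0;
}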
>
>
> ------------------------------------------------------------------------
>
> Index: src/backend/access/heap/heapam.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v
> retrieving revision 1.232
> diff -c -r1.232 heapam.c
> *** src/backend/access/heap/heapam.c    8 Apr 2007 01:26:27 -0000    1.232
> --- src/backend/access/heap/heapam.c    16 May 2007 11:35:14 -0000
> ***************
> *** 83,88 ****
> --- 83,96 ----
>        */
>       scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
>
> +     /* A scan on a table smaller than shared_buffers is treated like random
> +      * access, but bigger scans should use the bulk read replacement policy.
> +      */
> +     if (scan->rs_nblocks > NBuffers)
> +         scan->rs_accesspattern = AP_BULKREAD;
> +     else
> +         scan->rs_accesspattern = AP_NORMAL;
> +
>       scan->rs_inited = false;
>       scan->rs_ctup.t_data = NULL;
>       ItemPointerSetInvalid(&scan->rs_ctup.t_self);
> ***************
> *** 123,133 ****
> --- 131,146 ----
>
>       Assert(page < scan->rs_nblocks);
>
> +     /* Read the page with the right strategy */
> +     SetAccessPattern(scan->rs_accesspattern);
> +
>       scan->rs_cbuf = ReleaseAndReadBuffer(scan->rs_cbuf,
>                                            scan->rs_rd,
>                                            page);
>       scan->rs_cblock = page;
>
> +     SetAccessPattern(AP_NORMAL);
> +
>       if (!scan->rs_pageatatime)
>           return;
>
> Index: src/backend/access/transam/xlog.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
> retrieving revision 1.268
> diff -c -r1.268 xlog.c
> *** src/backend/access/transam/xlog.c    30 Apr 2007 21:01:52 -0000    1.268
> --- src/backend/access/transam/xlog.c    15 May 2007 16:23:30 -0000
> ***************
> *** 1668,1673 ****
> --- 1668,1700 ----
>   }
>
>   /*
> +  * Returns true if 'record' hasn't been flushed to disk yet.
> +  */
> + bool
> + XLogNeedsFlush(XLogRecPtr record)
> + {
> +     /* Quick exit if already known flushed */
> +     if (XLByteLE(record, LogwrtResult.Flush))
> +         return false;
> +
> +     /* read LogwrtResult and update local state */
> +     {
> +         /* use volatile pointer to prevent code rearrangement */
> +         volatile XLogCtlData *xlogctl = XLogCtl;
> +
> +         SpinLockAcquire(&xlogctl->info_lck);
> +         LogwrtResult = xlogctl->LogwrtResult;
> +         SpinLockRelease(&xlogctl->info_lck);
> +     }
> +
> +     /* check again */
> +     if (XLByteLE(record, LogwrtResult.Flush))
> +         return false;
> +
> +     return true;
> + }
> +
> + /*
>    * Ensure that all XLOG data through the given position is flushed to disk.
>    *
>    * NOTE: this differs from XLogWrite mainly in that the WALWriteLock is not
> Index: src/backend/commands/copy.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v
> retrieving revision 1.283
> diff -c -r1.283 copy.c
> *** src/backend/commands/copy.c    27 Apr 2007 22:05:46 -0000    1.283
> --- src/backend/commands/copy.c    15 May 2007 17:05:29 -0000
> ***************
> *** 1876,1881 ****
> --- 1876,1888 ----
>       nfields = file_has_oids ? (attr_count + 1) : attr_count;
>       field_strings = (char **) palloc(nfields * sizeof(char *));
>
> +     /* Use the special COPY buffer replacement strategy if WAL-logging
> +      * is enabled. If it's not, the pages we're writing are dirty but
> +      * don't need a WAL flush to write out, so the BULKREAD strategy
> +      * is more suitable.
> +      */
> +     SetAccessPattern(use_wal ? AP_COPY : AP_BULKREAD);
> +
>       /* Initialize state variables */
>       cstate->fe_eof = false;
>       cstate->eol_type = EOL_UNKNOWN;
> ***************
> *** 2161,2166 ****
> --- 2168,2176 ----
>                               cstate->filename)));
>       }
>
> +     /* Reset buffer replacement strategy */
> +     SetAccessPattern(AP_NORMAL);
> +
>       /*
>        * If we skipped writing WAL, then we need to sync the heap (but not
>        * indexes since those use WAL anyway)
> Index: src/backend/commands/vacuum.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/vacuum.c,v
> retrieving revision 1.350
> diff -c -r1.350 vacuum.c
> *** src/backend/commands/vacuum.c    16 Apr 2007 18:29:50 -0000    1.350
> --- src/backend/commands/vacuum.c    15 May 2007 17:06:18 -0000
> ***************
> *** 421,431 ****
>                    * Tell the buffer replacement strategy that vacuum is causing
>                    * the IO
>                    */
> !                 StrategyHintVacuum(true);
>
>                   analyze_rel(relid, vacstmt);
>
> !                 StrategyHintVacuum(false);
>
>                   if (use_own_xacts)
>                       CommitTransactionCommand();
> --- 421,431 ----
>                    * Tell the buffer replacement strategy that vacuum is causing
>                    * the IO
>                    */
> !                 SetAccessPattern(AP_VACUUM);
>
>                   analyze_rel(relid, vacstmt);
>
> !                 SetAccessPattern(AP_NORMAL);
>
>                   if (use_own_xacts)
>                       CommitTransactionCommand();
> ***************
> *** 442,448 ****
>           /* Make sure cost accounting is turned off after error */
>           VacuumCostActive = false;
>           /* And reset buffer replacement strategy, too */
> !         StrategyHintVacuum(false);
>           PG_RE_THROW();
>       }
>       PG_END_TRY();
> --- 442,448 ----
>           /* Make sure cost accounting is turned off after error */
>           VacuumCostActive = false;
>           /* And reset buffer replacement strategy, too */
> !         SetAccessPattern(AP_NORMAL);
>           PG_RE_THROW();
>       }
>       PG_END_TRY();
> ***************
> *** 1088,1094 ****
>        * Tell the cache replacement strategy that vacuum is causing all
>        * following IO
>        */
> !     StrategyHintVacuum(true);
>
>       /*
>        * Do the actual work --- either FULL or "lazy" vacuum
> --- 1088,1094 ----
>        * Tell the cache replacement strategy that vacuum is causing all
>        * following IO
>        */
> !     SetAccessPattern(AP_VACUUM);
>
>       /*
>        * Do the actual work --- either FULL or "lazy" vacuum
> ***************
> *** 1098,1104 ****
>       else
>           lazy_vacuum_rel(onerel, vacstmt);
>
> !     StrategyHintVacuum(false);
>
>       /* all done with this class, but hold lock until commit */
>       relation_close(onerel, NoLock);
> --- 1098,1104 ----
>       else
>           lazy_vacuum_rel(onerel, vacstmt);
>
> !     SetAccessPattern(AP_NORMAL);
>
>       /* all done with this class, but hold lock until commit */
>       relation_close(onerel, NoLock);
> Index: src/backend/storage/buffer/README
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/README,v
> retrieving revision 1.11
> diff -c -r1.11 README
> *** src/backend/storage/buffer/README    23 Jul 2006 03:07:58 -0000    1.11
> --- src/backend/storage/buffer/README    16 May 2007 11:43:11 -0000
> ***************
> *** 152,159 ****
>   a field to show which backend is doing its I/O).
>
>
> ! Buffer replacement strategy
> ! ---------------------------
>
>   There is a "free list" of buffers that are prime candidates for replacement.
>   In particular, buffers that are completely free (contain no valid page) are
> --- 152,159 ----
>   a field to show which backend is doing its I/O).
>
>
> ! Normal buffer replacement strategy
> ! ----------------------------------
>
>   There is a "free list" of buffers that are prime candidates for replacement.
>   In particular, buffers that are completely free (contain no valid page) are
> ***************
> *** 199,221 ****
>   have to give up and try another buffer.  This however is not a concern
>   of the basic select-a-victim-buffer algorithm.)
>
> - A special provision is that while running VACUUM, a backend does not
> - increment the usage count on buffers it accesses.  In fact, if ReleaseBuffer
> - sees that it is dropping the pin count to zero and the usage count is zero,
> - then it appends the buffer to the tail of the free list.  (This implies that
> - VACUUM, but only VACUUM, must take the BufFreelistLock during ReleaseBuffer;
> - this shouldn't create much of a contention problem.)  This provision
> - encourages VACUUM to work in a relatively small number of buffers rather
> - than blowing out the entire buffer cache.  It is reasonable since a page
> - that has been touched only by VACUUM is unlikely to be needed again soon.
> -
> - Since VACUUM usually requests many pages very fast, the effect of this is that
> - it will get back the very buffers it filled and possibly modified on the next
> - call and will therefore do its work in a few shared memory buffers, while
> - being able to use whatever it finds in the cache already.  This also implies
> - that most of the write traffic caused by a VACUUM will be done by the VACUUM
> - itself and not pushed off onto other processes.
>
>
>   Background writer's processing
>   ------------------------------
> --- 199,243 ----
>   have to give up and try another buffer.  This however is not a concern
>   of the basic select-a-victim-buffer algorithm.)
>
>
> + Buffer ring replacement strategy
> + ---------------------------------
> +
> + When running a query that needs to access a large number of pages, like VACUUM,
> + COPY, or a large sequential scan, a different strategy is used.  A page that
> + has been touched only by such a scan is unlikely to be needed again soon, so
> + instead of running the normal clock sweep algorithm and blowing out the entire
> + buffer cache, a small ring of buffers is allocated using the normal clock sweep
> + algorithm and those buffers are reused for the whole scan.  This also implies
> + that most of the write traffic caused by such a statement will be done by the
> + backend itself and not pushed off onto other processes.
> +
> + The size of the ring used depends on the kind of scan:
> +
> + For sequential scans, a small 256 KB ring is used. That's small enough to fit
> + in L2 cache, which makes transferring pages from OS cache to shared buffer
> + cache efficient. Even less would often be enough, but the ring must be big
> + enough to accommodate all pages in the scan that are pinned concurrently.
> + 256 KB should also be enough to leave a small cache trail for other backends to
> + join in a synchronized seq scan. If a buffer is dirtied and LSN set, the buffer
> + is removed from the ring and a replacement buffer is chosen using the normal
> + replacement strategy. In a scan that modifies every page in the scan, like a
> + bulk UPDATE or DELETE, the buffers in the ring will always be dirtied and the
> + ring strategy effectively degrades to the normal strategy.
> +
> + VACUUM uses a 256 KB ring like sequential scans, but dirty pages are not
> + removed from the ring. WAL is flushed instead to allow reuse of the buffers.
> + Before introducing the buffer ring strategy in 8.3, buffers were put to the
> + freelist, which was effectively a buffer ring of 1 buffer.
> +
> + COPY behaves like VACUUM, but a much larger ring is used. The ring size is
> + chosen to be twice the WAL segment size. This avoids polluting the buffer cache
> + like the clock sweep would do, and using a ring larger than WAL segment size
> + avoids having to do any extra WAL flushes, since a WAL segment will always be
> + filled, forcing a WAL flush, before looping through the buffer ring and bumping
> + into a buffer that would force a WAL flush. However, for non-WAL-logged COPY
> + operations the smaller 256 KB ring is used because WAL flushes are not needed
> + to write the buffers.
>
>   Background writer's processing
>   ------------------------------
> Index: src/backend/storage/buffer/bufmgr.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
> retrieving revision 1.218
> diff -c -r1.218 bufmgr.c
> *** src/backend/storage/buffer/bufmgr.c    2 May 2007 23:34:48 -0000    1.218
> --- src/backend/storage/buffer/bufmgr.c    16 May 2007 12:34:10 -0000
> ***************
> *** 419,431 ****
>       /* Loop here in case we have to try another victim buffer */
>       for (;;)
>       {
>           /*
>            * Select a victim buffer.    The buffer is returned with its header
>            * spinlock still held!  Also the BufFreelistLock is still held, since
>            * it would be bad to hold the spinlock while possibly waking up other
>            * processes.
>            */
> !         buf = StrategyGetBuffer();
>
>           Assert(buf->refcount == 0);
>
> --- 419,433 ----
>       /* Loop here in case we have to try another victim buffer */
>       for (;;)
>       {
> +         bool lock_held;
> +
>           /*
>            * Select a victim buffer.    The buffer is returned with its header
>            * spinlock still held!  Also the BufFreelistLock is still held, since
>            * it would be bad to hold the spinlock while possibly waking up other
>            * processes.
>            */
> !         buf = StrategyGetBuffer(&lock_held);
>
>           Assert(buf->refcount == 0);
>
> ***************
> *** 436,442 ****
>           PinBuffer_Locked(buf);
>
>           /* Now it's safe to release the freelist lock */
> !         LWLockRelease(BufFreelistLock);
>
>           /*
>            * If the buffer was dirty, try to write it out.  There is a race
> --- 438,445 ----
>           PinBuffer_Locked(buf);
>
>           /* Now it's safe to release the freelist lock */
> !         if (lock_held)
> !             LWLockRelease(BufFreelistLock);
>
>           /*
>            * If the buffer was dirty, try to write it out.  There is a race
> ***************
> *** 464,469 ****
> --- 467,489 ----
>                */
>               if (LWLockConditionalAcquire(buf->content_lock, LW_SHARED))
>               {
> +                 /* In BULKREAD-mode, check if a WAL flush would be needed to
> +                  * evict this buffer. If so, ask the replacement strategy if
> +                  * we should go ahead and do it or choose another victim.
> +                  */
> +                 if (active_access_pattern == AP_BULKREAD)
> +                 {
> +                     if (XLogNeedsFlush(BufferGetLSN(buf)))
> +                     {
> +                         if (StrategyRejectBuffer(buf))
> +                         {
> +                             LWLockRelease(buf->content_lock);
> +                             UnpinBuffer(buf, true, false);
> +                             continue;
> +                         }
> +                     }
> +                 }
> +
>                   FlushBuffer(buf, NULL);
>                   LWLockRelease(buf->content_lock);
>               }
> ***************
> *** 925,932 ****
>       PrivateRefCount[b]--;
>       if (PrivateRefCount[b] == 0)
>       {
> -         bool        immed_free_buffer = false;
> -
>           /* I'd better not still hold any locks on the buffer */
>           Assert(!LWLockHeldByMe(buf->content_lock));
>           Assert(!LWLockHeldByMe(buf->io_in_progress_lock));
> --- 945,950 ----
> ***************
> *** 940,956 ****
>           /* Update buffer usage info, unless this is an internal access */
>           if (normalAccess)
>           {
> !             if (!strategy_hint_vacuum)
>               {
> !                 if (buf->usage_count < BM_MAX_USAGE_COUNT)
> !                     buf->usage_count++;
>               }
>               else
> !             {
> !                 /* VACUUM accesses don't bump usage count, instead... */
> !                 if (buf->refcount == 0 && buf->usage_count == 0)
> !                     immed_free_buffer = true;
> !             }
>           }
>
>           if ((buf->flags & BM_PIN_COUNT_WAITER) &&
> --- 958,975 ----
>           /* Update buffer usage info, unless this is an internal access */
>           if (normalAccess)
>           {
> !             if (active_access_pattern != AP_NORMAL)
>               {
> !                 /* We don't want large one-off scans like vacuum to inflate
> !                  * the usage_count. We do want to set it to 1, though, to keep
> !                  * other backends from hijacking it from the buffer ring.
> !                  */
> !                 if (buf->usage_count == 0)
> !                     buf->usage_count = 1;
>               }
>               else
> !             if (buf->usage_count < BM_MAX_USAGE_COUNT)
> !                 buf->usage_count++;
>           }
>
>           if ((buf->flags & BM_PIN_COUNT_WAITER) &&
> ***************
> *** 965,978 ****
>           }
>           else
>               UnlockBufHdr(buf);
> -
> -         /*
> -          * If VACUUM is releasing an otherwise-unused buffer, send it to the
> -          * freelist for near-term reuse.  We put it at the tail so that it
> -          * won't be used before any invalid buffers that may exist.
> -          */
> -         if (immed_free_buffer)
> -             StrategyFreeBuffer(buf, false);
>       }
>   }
>
> --- 984,989 ----
> Index: src/backend/storage/buffer/freelist.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/freelist.c,v
> retrieving revision 1.58
> diff -c -r1.58 freelist.c
> *** src/backend/storage/buffer/freelist.c    5 Jan 2007 22:19:37 -0000    1.58
> --- src/backend/storage/buffer/freelist.c    17 May 2007 16:12:56 -0000
> ***************
> *** 18,23 ****
> --- 18,25 ----
>   #include "storage/buf_internals.h"
>   #include "storage/bufmgr.h"
>
> + #include "utils/memutils.h"
> +
>
>   /*
>    * The shared freelist control information.
> ***************
> *** 39,47 ****
>   /* Pointers to shared state */
>   static BufferStrategyControl *StrategyControl = NULL;
>
> ! /* Backend-local state about whether currently vacuuming */
> ! bool        strategy_hint_vacuum = false;
>
>
>   /*
>    * StrategyGetBuffer
> --- 41,53 ----
>   /* Pointers to shared state */
>   static BufferStrategyControl *StrategyControl = NULL;
>
> ! /* Currently active access pattern hint. */
> ! AccessPattern active_access_pattern = AP_NORMAL;
>
> + /* prototypes for internal functions */
> + static volatile BufferDesc *GetBufferFromRing(void);
> + static void PutBufferToRing(volatile BufferDesc *buf);
> + static void InitRing(void);
>
>   /*
>    * StrategyGetBuffer
> ***************
> *** 51,67 ****
>    *    the selected buffer must not currently be pinned by anyone.
>    *
>    *    To ensure that no one else can pin the buffer before we do, we must
> !  *    return the buffer with the buffer header spinlock still held.  That
> !  *    means that we return with the BufFreelistLock still held, as well;
> !  *    the caller must release that lock once the spinlock is dropped.
>    */
>   volatile BufferDesc *
> ! StrategyGetBuffer(void)
>   {
>       volatile BufferDesc *buf;
>       int            trycounter;
>
>       LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
>
>       /*
>        * Try to get a buffer from the freelist.  Note that the freeNext fields
> --- 57,89 ----
>    *    the selected buffer must not currently be pinned by anyone.
>    *
>    *    To ensure that no one else can pin the buffer before we do, we must
> !  *    return the buffer with the buffer header spinlock still held.  If
> !  *    *lock_held is set at return, we return with the BufFreelistLock still
> !  *    held, as well;    the caller must release that lock once the spinlock is
> !  *    dropped.
>    */
>   volatile BufferDesc *
> ! StrategyGetBuffer(bool *lock_held)
>   {
>       volatile BufferDesc *buf;
>       int            trycounter;
>
> +     /* Get a buffer from the ring if we're doing a bulk scan */
> +     if (active_access_pattern != AP_NORMAL)
> +     {
> +         buf = GetBufferFromRing();
> +         if (buf != NULL)
> +         {
> +             *lock_held = false;
> +             return buf;
> +         }
> +     }
> +
> +     /*
> +      * If our selected buffer wasn't available, pick another...
> +      */
>       LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
> +     *lock_held = true;
>
>       /*
>        * Try to get a buffer from the freelist.  Note that the freeNext fields
> ***************
> *** 86,96 ****
>            */
>           LockBufHdr(buf);
>           if (buf->refcount == 0 && buf->usage_count == 0)
>               return buf;
>           UnlockBufHdr(buf);
>       }
>
> !     /* Nothing on the freelist, so run the "clock sweep" algorithm */
>       trycounter = NBuffers;
>       for (;;)
>       {
> --- 108,122 ----
>            */
>           LockBufHdr(buf);
>           if (buf->refcount == 0 && buf->usage_count == 0)
> +         {
> +             if (active_access_pattern != AP_NORMAL)
> +                 PutBufferToRing(buf);
>               return buf;
> +         }
>           UnlockBufHdr(buf);
>       }
>
> !     /* Nothing on the freelist, so run the shared "clock sweep" algorithm */
>       trycounter = NBuffers;
>       for (;;)
>       {
> ***************
> *** 105,111 ****
> --- 131,141 ----
>            */
>           LockBufHdr(buf);
>           if (buf->refcount == 0 && buf->usage_count == 0)
> +         {
> +             if (active_access_pattern != AP_NORMAL)
> +                 PutBufferToRing(buf);
>               return buf;
> +         }
>           if (buf->usage_count > 0)
>           {
>               buf->usage_count--;
> ***************
> *** 191,204 ****
>   }
>
>   /*
> !  * StrategyHintVacuum -- tell us whether VACUUM is active
>    */
>   void
> ! StrategyHintVacuum(bool vacuum_active)
>   {
> !     strategy_hint_vacuum = vacuum_active;
> ! }
>
>
>   /*
>    * StrategyShmemSize
> --- 221,245 ----
>   }
>
>   /*
> !  * SetAccessPattern -- Sets the active access pattern hint
> !  *
> !  * Caller is responsible for resetting the hint to AP_NORMAL after the bulk
> !  * operation is done. It's ok to switch repeatedly between AP_NORMAL and one of
> !  * the other strategies, for example in a query with one large sequential scan
> !  * nested loop joined to an index scan. Index tuples should be fetched with the
> !  * normal strategy and the pages from the seq scan should be read in with the
> !  * AP_BULKREAD strategy. The ring won't be affected by such switching, however
> !  * switching to an access pattern with different ring size will invalidate the
> !  * old ring.
>    */
>   void
> ! SetAccessPattern(AccessPattern new_pattern)
>   {
> !     active_access_pattern = new_pattern;
>
> +     if (active_access_pattern != AP_NORMAL)
> +         InitRing();
> + }
>
>   /*
>    * StrategyShmemSize
> ***************
> *** 274,276 ****
> --- 315,498 ----
>       else
>           Assert(!init);
>   }
> +
> + /* ----------------------------------------------------------------
> +  *                Backend-private buffer ring management
> +  * ----------------------------------------------------------------
> +  */
> +
> + /*
> +  * Ring sizes for different access patterns. See README for the rationale
> +  * of these.
> +  */
> + #define BULKREAD_RING_SIZE    256 * 1024 / BLCKSZ
> + #define VACUUM_RING_SIZE    256 * 1024 / BLCKSZ
> + #define COPY_RING_SIZE        Min(NBuffers / 8, (XLOG_SEG_SIZE / BLCKSZ) * 2)
> +
> + /*
> +  * BufferRing is an array of buffer ids, and RingSize it's size in number of
> +  * elements. It's allocated in TopMemoryContext the first time it's needed.
> +  */
> + static int *BufferRing = NULL;
> + static int RingSize = 0;
> +
> + /* Index of the "current" slot in the ring. It's advanced every time a buffer
> +  * is handed out from the ring with GetBufferFromRing and it points to the
> +  * last buffer returned from the ring. RingCurSlot + 1 is the next victim
> +  * GetBufferRing will hand out.
> +  */
> + static int RingCurSlot = 0;
> +
> + /* magic value to mark empty slots in the ring */
> + #define BUF_ID_NOT_SET -1
> +
> +
> + /*
> +  * GetBufferFromRing -- returns a buffer from the ring, or NULL if the
> +  *        ring is empty.
> +  *
> +  * The bufhdr spin lock is held on the returned buffer.
> +  */
> + static volatile BufferDesc *
> + GetBufferFromRing(void)
> + {
> +     volatile BufferDesc *buf;
> +
> +     /* ring should be initialized by now */
> +     Assert(RingSize > 0 && BufferRing != NULL);
> +
> +     /* Run private "clock cycle" */
> +     if (++RingCurSlot >= RingSize)
> +         RingCurSlot = 0;
> +
> +     /*
> +      * If that slot hasn't been filled yet, tell the caller to allocate
> +      * a new buffer with the normal allocation strategy. He will then
> +      * fill this slot by calling PutBufferToRing with the new buffer.
> +      */
> +     if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET)
> +         return NULL;
> +
> +     buf = &BufferDescriptors[BufferRing[RingCurSlot]];
> +
> +     /*
> +      * If the buffer is pinned we cannot use it under any circumstances.
> +      * If usage_count == 0 then the buffer is fair game.
> +      *
> +      * We also choose this buffer if usage_count == 1. Strictly, this
> +      * might sometimes be the wrong thing to do, but we rely on the high
> +      * probability that it was this process that last touched the buffer.
> +      * If it wasn't, we'll choose a suboptimal victim, but  it shouldn't
> +      * make any difference in the big scheme of things.
> +      *
> +      */
> +     LockBufHdr(buf);
> +     if (buf->refcount == 0 && buf->usage_count <= 1)
> +         return buf;
> +     UnlockBufHdr(buf);
> +
> +     return NULL;
> + }
> +
> + /*
> +  * PutBufferToRing -- adds a buffer to the buffer ring
> +  *
> +  * Caller must hold the buffer header spinlock on the buffer.
> +  */
> + static void
> + PutBufferToRing(volatile BufferDesc *buf)
> + {
> +     /* ring should be initialized by now */
> +     Assert(RingSize > 0 && BufferRing != NULL);
> +
> +     if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET)
> +         BufferRing[RingCurSlot] = buf->buf_id;
> + }
> +
> + /*
> +  * Initializes a ring buffer with correct size for the currently
> +  * active strategy. Does nothing if the ring already has the right size.
> +  */
> + static void
> + InitRing(void)
> + {
> +     int new_size;
> +     int old_size = RingSize;
> +     int i;
> +     MemoryContext oldcxt;
> +
> +     /* Determine new size */
> +
> +     switch(active_access_pattern)
> +     {
> +         case AP_BULKREAD:
> +             new_size = BULKREAD_RING_SIZE;
> +             break;
> +         case AP_COPY:
> +             new_size = COPY_RING_SIZE;
> +             break;
> +         case AP_VACUUM:
> +             new_size = VACUUM_RING_SIZE;
> +             break;
> +         default:
> +             elog(ERROR, "unexpected buffer cache strategy %d",
> +                  active_access_pattern);
> +             return; /* keep compile happy */
> +     }
> +
> +     /*
> +      * Seq scans set and reset the strategy on every page, so we better exit
> +      * quickly if no change in size is needed.
> +      */
> +     if (new_size == old_size)
> +         return;
> +
> +     /* Allocate array */
> +
> +     oldcxt = MemoryContextSwitchTo(TopMemoryContext);
> +
> +     if (old_size == 0)
> +     {
> +         Assert(BufferRing == NULL);
> +         BufferRing = palloc(new_size * sizeof(int));
> +     }
> +     else
> +         BufferRing = repalloc(BufferRing, new_size * sizeof(int));
> +
> +     MemoryContextSwitchTo(oldcxt);
> +
> +     for(i = 0; i < new_size; i++)
> +         BufferRing[i] = BUF_ID_NOT_SET;
> +
> +     RingCurSlot = 0;
> +     RingSize = new_size;
> + }
> +
> + /*
> +  * Buffer manager calls this function in AP_BULKREAD mode when the
> +  * buffer handed to it turns out to need a WAL flush to write out. This
> +  * gives the strategy a second chance to choose another victim.
> +  *
> +  * Returns true if buffer manager should ask for a new victim, and false
> +  * if WAL should be flushed and this buffer used.
> +  */
> + bool
> + StrategyRejectBuffer(volatile BufferDesc *buf)
> + {
> +     Assert(RingSize > 0);
> +
> +     if (BufferRing[RingCurSlot] == buf->buf_id)
> +     {
> +         BufferRing[RingCurSlot] = BUF_ID_NOT_SET;
> +         return true;
> +     }
> +     else
> +     {
> +         /* Apparently the buffer didn't come from the ring. We don't want to
> +          * mess with how the clock sweep works; in worst case there's no
> +          * buffers in the buffer cache that can be reused without a WAL flush,
> +          * and we'd get into an endless loop trying.
> +          */
> +         return false;
> +     }
> + }
> Index: src/include/access/relscan.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/relscan.h,v
> retrieving revision 1.52
> diff -c -r1.52 relscan.h
> *** src/include/access/relscan.h    20 Jan 2007 18:43:35 -0000    1.52
> --- src/include/access/relscan.h    15 May 2007 17:01:31 -0000
> ***************
> *** 28,33 ****
> --- 28,34 ----
>       ScanKey        rs_key;            /* array of scan key descriptors */
>       BlockNumber rs_nblocks;        /* number of blocks to scan */
>       bool        rs_pageatatime; /* verify visibility page-at-a-time? */
> +     AccessPattern rs_accesspattern; /* access pattern to use for reads */
>
>       /* scan current state */
>       bool        rs_inited;        /* false = scan not init'd yet */
> Index: src/include/access/xlog.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v
> retrieving revision 1.76
> diff -c -r1.76 xlog.h
> *** src/include/access/xlog.h    5 Jan 2007 22:19:51 -0000    1.76
> --- src/include/access/xlog.h    14 May 2007 21:22:40 -0000
> ***************
> *** 151,156 ****
> --- 151,157 ----
>
>   extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
>   extern void XLogFlush(XLogRecPtr RecPtr);
> + extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
>
>   extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
>   extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);
> Index: src/include/storage/buf_internals.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/buf_internals.h,v
> retrieving revision 1.89
> diff -c -r1.89 buf_internals.h
> *** src/include/storage/buf_internals.h    5 Jan 2007 22:19:57 -0000    1.89
> --- src/include/storage/buf_internals.h    15 May 2007 17:07:59 -0000
> ***************
> *** 16,21 ****
> --- 16,22 ----
>   #define BUFMGR_INTERNALS_H
>
>   #include "storage/buf.h"
> + #include "storage/bufmgr.h"
>   #include "storage/lwlock.h"
>   #include "storage/shmem.h"
>   #include "storage/spin.h"
> ***************
> *** 168,174 ****
>   extern BufferDesc *LocalBufferDescriptors;
>
>   /* in freelist.c */
> ! extern bool strategy_hint_vacuum;
>
>   /* event counters in buf_init.c */
>   extern long int ReadBufferCount;
> --- 169,175 ----
>   extern BufferDesc *LocalBufferDescriptors;
>
>   /* in freelist.c */
> ! extern AccessPattern active_access_pattern;
>
>   /* event counters in buf_init.c */
>   extern long int ReadBufferCount;
> ***************
> *** 184,195 ****
>    */
>
>   /* freelist.c */
> ! extern volatile BufferDesc *StrategyGetBuffer(void);
>   extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head);
>   extern int    StrategySyncStart(void);
>   extern Size StrategyShmemSize(void);
>   extern void StrategyInitialize(bool init);
>
>   /* buf_table.c */
>   extern Size BufTableShmemSize(int size);
>   extern void InitBufTable(int size);
> --- 185,198 ----
>    */
>
>   /* freelist.c */
> ! extern volatile BufferDesc *StrategyGetBuffer(bool *lock_held);
>   extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head);
>   extern int    StrategySyncStart(void);
>   extern Size StrategyShmemSize(void);
>   extern void StrategyInitialize(bool init);
>
> + extern bool StrategyRejectBuffer(volatile BufferDesc *buf);
> +
>   /* buf_table.c */
>   extern Size BufTableShmemSize(int size);
>   extern void InitBufTable(int size);
> Index: src/include/storage/bufmgr.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v
> retrieving revision 1.103
> diff -c -r1.103 bufmgr.h
> *** src/include/storage/bufmgr.h    2 May 2007 23:18:03 -0000    1.103
> --- src/include/storage/bufmgr.h    15 May 2007 17:07:02 -0000
> ***************
> *** 48,53 ****
> --- 48,61 ----
>   #define BUFFER_LOCK_SHARE        1
>   #define BUFFER_LOCK_EXCLUSIVE    2
>
> + typedef enum AccessPattern
> + {
> +     AP_NORMAL,        /* Normal random access */
> +     AP_BULKREAD,    /* Large read-only scan (hint bit updates are ok) */
> +     AP_COPY,        /* Large updating scan, like COPY with WAL enabled */
> +     AP_VACUUM,        /* VACUUM */
> + } AccessPattern;
> +
>   /*
>    * These routines are beaten on quite heavily, hence the macroization.
>    */
> ***************
> *** 157,162 ****
>   extern void AtProcExit_LocalBuffers(void);
>
>   /* in freelist.c */
> ! extern void StrategyHintVacuum(bool vacuum_active);
>
>   #endif
> --- 165,170 ----
>   extern void AtProcExit_LocalBuffers(void);
>
>   /* in freelist.c */
> ! extern void SetAccessPattern(AccessPattern new_pattern);
>
>   #endif


--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
/*
 * Test data generator: writes tab-separated rows to stdout, suitable for
 * loading with COPY ... FROM STDIN.  Each row is a sequence number plus a
 * 500-character filler column, sized so that roughly TUPLES_PER_PAGE rows
 * fit on one 8 kB heap page.
 *
 * Usage: pass the desired table size in megabytes as the only argument.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TUPLES_PER_PAGE 15
#define BLOCK_SIZE 8192

int main(int argc, char **argv)
{
    int tablesize;
    int lines;
    char buf[1000];
    int i;

    if (argc != 2)
    {
        fprintf(stderr, "usage: %s <table size in MB>\n", argv[0]);
        exit(1);
    }

    memset(buf, 'a', 500);
    buf[500] = '\0';

    tablesize = atoi(argv[1]);

    /*
     * Pages per megabyte times tuples per page.  The parentheses keep the
     * intermediate result small, so a multi-gigabyte table size doesn't
     * overflow a 32-bit int.
     */
    lines = tablesize * (1024 * 1024 / BLOCK_SIZE) * TUPLES_PER_PAGE;

    for (i = 1; i <= lines; i++)
        printf("%d\t%s\n", i, buf);

    return 0;
}

Re: Seq scans status update

From
Bruce Momjian
Date:
Great.  Based on this, do you have a patch that is ready to apply?

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
> > In any case, I'd like to see more test results before we make a
> > decision. I'm running tests with DBT-2 and a seq scan running in the
> > background to see if the cache-spoiling effect shows up. I'm also trying
> > to get hold of some bigger hardware to run on. Running these tests takes
> > some calendar time, but the hard work has already been done. I'm going
> > to start reviewing Jeff's synchronized scans patch now.
>
> Here are the results of the DBT-2 tests:
>
> http://community.enterprisedb.com/seqscan/imola/
>
> In each of these tests, at the end of rampup a script is started that
> issues a "SELECT COUNT(*) FROM stock" in a loop, with a 2 minute delay
> between the end of the previous query and the start of the next one.
>
> The patch makes the seq scans go significantly faster. In the 1 hour
> test period, the patched tests perform roughly 30-100% more selects
> than the unpatched tests.
>
> With 100 and 105 warehouses, it also significantly reduces the impact of
> the seq scan on other queries; response times are lower with the patch.
> With 120 warehouses the reduction of impact is not as clear, but when
> you plot the response times it's still there (the plots on the "response
> times charts"-page are useless because they're overwhelmed by the
> checkpoint spike).
>
> --
>    Heikki Linnakangas
>    EnterpriseDB   http://www.enterprisedb.com
>

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Seq scans status update

From
Heikki Linnakangas
Date:
Not yet; there's still one issue that needs fixing.

Bruce Momjian wrote:
> Great.  Based on this, do you have a patch that is ready to apply?
>
> ---------------------------------------------------------------------------
>
> Heikki Linnakangas wrote:
>> Heikki Linnakangas wrote:
>>> In any case, I'd like to see more test results before we make a
>>> decision. I'm running tests with DBT-2 and a seq scan running in the
>>> background to see if the cache-spoiling effect shows up. I'm also trying
>>> to get hold of some bigger hardware to run on. Running these tests takes
>>> some calendar time, but the hard work has already been done. I'm going
>>> to start reviewing Jeff's synchronized scans patch now.
>> Here are the results of the DBT-2 tests:
>>
>> http://community.enterprisedb.com/seqscan/imola/
>>
>> In each of these tests, at the end of rampup a script is started that
>> issues a "SELECT COUNT(*) FROM stock" in a loop, with a 2 minute delay
>> between the end of the previous query and the start of the next one.
>>
>> The patch makes the seq scans go significantly faster. In the 1 hour
>> test period, the patched tests perform roughly 30-100% more selects
>> than the unpatched tests.
>>
>> With 100 and 105 warehouses, it also significantly reduces the impact of
>> the seq scan on other queries; response times are lower with the patch.
>> With 120 warehouses the reduction of impact is not as clear, but when
>> you plot the response times it's still there (the plots on the "response
>> times charts"-page are useless because they're overwhelmed by the
>> checkpoint spike).
>>
>> --
>>    Heikki Linnakangas
>>    EnterpriseDB   http://www.enterprisedb.com
>>
>


--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Seq scans status update

From
Heikki Linnakangas
Date:
Here's a new version; all known issues are now fixed. I'm now happy with
this patch.

Next, I'll start looking at the latest version of Jeff's synchronized
scans patch.

Bruce Momjian wrote:
> Great.  Based on this, do you have a patch that is ready to apply?
>
> ---------------------------------------------------------------------------
>
> Heikki Linnakangas wrote:
>> Heikki Linnakangas wrote:
>>> In any case, I'd like to see more test results before we make a
>>> decision. I'm running tests with DBT-2 and a seq scan running in the
>>> background to see if the cache-spoiling effect shows up. I'm also trying
>>> to get hold of some bigger hardware to run on. Running these tests takes
>>> some calendar time, but the hard work has already been done. I'm going
>>> to start reviewing Jeff's synchronized scans patch now.
>> Here are the results of the DBT-2 tests:
>>
>> http://community.enterprisedb.com/seqscan/imola/
>>
>> In each of these tests, at the end of rampup a script is started that
>> issues a "SELECT COUNT(*) FROM stock" in a loop, with a 2 minute delay
>> between the end of the previous query and the start of the next one.
>>
>> The patch makes the seq scans go significantly faster. In the 1 hour
>> test period, the patched tests perform roughly 30-100% more selects
>> than the unpatched tests.
>>
>> With 100 and 105 warehouses, it also significantly reduces the impact of
>> the seq scan on other queries; response times are lower with the patch.
>> With 120 warehouses the reduction of impact is not as clear, but when
>> you plot the response times it's still there (the plots on the "response
>> times charts"-page are useless because they're overwhelmed by the
>> checkpoint spike).
>>
>> --
>>    Heikki Linnakangas
>>    EnterpriseDB   http://www.enterprisedb.com
>>
>


--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/access/heap/heapam.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v
retrieving revision 1.232
diff -c -r1.232 heapam.c
*** src/backend/access/heap/heapam.c    8 Apr 2007 01:26:27 -0000    1.232
--- src/backend/access/heap/heapam.c    25 May 2007 19:22:33 -0000
***************
*** 83,88 ****
--- 83,96 ----
       */
      scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);

+     /* A scan on a table smaller than shared_buffers is treated like random
+      * access, but bigger scans use the bulk read page replacement policy.
+      */
+     if (scan->rs_nblocks > NBuffers)
+         scan->rs_accesspattern = AP_BULKREAD;
+     else
+         scan->rs_accesspattern = AP_NORMAL;
+
      scan->rs_inited = false;
      scan->rs_ctup.t_data = NULL;
      ItemPointerSetInvalid(&scan->rs_ctup.t_self);
***************
*** 123,133 ****
--- 131,146 ----

      Assert(page < scan->rs_nblocks);

+     /* Read the page with the right strategy */
+     SetAccessPattern(scan->rs_accesspattern);
+
      scan->rs_cbuf = ReleaseAndReadBuffer(scan->rs_cbuf,
                                           scan->rs_rd,
                                           page);
      scan->rs_cblock = page;

+     SetAccessPattern(AP_NORMAL);
+
      if (!scan->rs_pageatatime)
          return;

Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.268
diff -c -r1.268 xlog.c
*** src/backend/access/transam/xlog.c    30 Apr 2007 21:01:52 -0000    1.268
--- src/backend/access/transam/xlog.c    15 May 2007 16:23:30 -0000
***************
*** 1668,1673 ****
--- 1668,1700 ----
  }

  /*
+  * Returns true if 'record' hasn't been flushed to disk yet.
+  */
+ bool
+ XLogNeedsFlush(XLogRecPtr record)
+ {
+     /* Quick exit if already known flushed */
+     if (XLByteLE(record, LogwrtResult.Flush))
+         return false;
+
+     /* read LogwrtResult and update local state */
+     {
+         /* use volatile pointer to prevent code rearrangement */
+         volatile XLogCtlData *xlogctl = XLogCtl;
+
+         SpinLockAcquire(&xlogctl->info_lck);
+         LogwrtResult = xlogctl->LogwrtResult;
+         SpinLockRelease(&xlogctl->info_lck);
+     }
+
+     /* check again */
+     if (XLByteLE(record, LogwrtResult.Flush))
+         return false;
+
+     return true;
+ }
+
+ /*
   * Ensure that all XLOG data through the given position is flushed to disk.
   *
   * NOTE: this differs from XLogWrite mainly in that the WALWriteLock is not
Index: src/backend/commands/copy.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.283
diff -c -r1.283 copy.c
*** src/backend/commands/copy.c    27 Apr 2007 22:05:46 -0000    1.283
--- src/backend/commands/copy.c    15 May 2007 17:05:29 -0000
***************
*** 1876,1881 ****
--- 1876,1888 ----
      nfields = file_has_oids ? (attr_count + 1) : attr_count;
      field_strings = (char **) palloc(nfields * sizeof(char *));

+     /* Use the special COPY buffer replacement strategy if WAL-logging
+      * is enabled. If it's not, the pages we're writing are dirty but
+      * don't need a WAL flush to write out, so the BULKREAD strategy
+      * is more suitable.
+      */
+     SetAccessPattern(use_wal ? AP_COPY : AP_BULKREAD);
+
      /* Initialize state variables */
      cstate->fe_eof = false;
      cstate->eol_type = EOL_UNKNOWN;
***************
*** 2161,2166 ****
--- 2168,2176 ----
                              cstate->filename)));
      }

+     /* Reset buffer replacement strategy */
+     SetAccessPattern(AP_NORMAL);
+
      /*
       * If we skipped writing WAL, then we need to sync the heap (but not
       * indexes since those use WAL anyway)
Index: src/backend/commands/vacuum.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/vacuum.c,v
retrieving revision 1.350
diff -c -r1.350 vacuum.c
*** src/backend/commands/vacuum.c    16 Apr 2007 18:29:50 -0000    1.350
--- src/backend/commands/vacuum.c    15 May 2007 17:06:18 -0000
***************
*** 421,431 ****
                   * Tell the buffer replacement strategy that vacuum is causing
                   * the IO
                   */
!                 StrategyHintVacuum(true);

                  analyze_rel(relid, vacstmt);

!                 StrategyHintVacuum(false);

                  if (use_own_xacts)
                      CommitTransactionCommand();
--- 421,431 ----
                   * Tell the buffer replacement strategy that vacuum is causing
                   * the IO
                   */
!                 SetAccessPattern(AP_VACUUM);

                  analyze_rel(relid, vacstmt);

!                 SetAccessPattern(AP_NORMAL);

                  if (use_own_xacts)
                      CommitTransactionCommand();
***************
*** 442,448 ****
          /* Make sure cost accounting is turned off after error */
          VacuumCostActive = false;
          /* And reset buffer replacement strategy, too */
!         StrategyHintVacuum(false);
          PG_RE_THROW();
      }
      PG_END_TRY();
--- 442,448 ----
          /* Make sure cost accounting is turned off after error */
          VacuumCostActive = false;
          /* And reset buffer replacement strategy, too */
!         SetAccessPattern(AP_NORMAL);
          PG_RE_THROW();
      }
      PG_END_TRY();
***************
*** 1088,1094 ****
       * Tell the cache replacement strategy that vacuum is causing all
       * following IO
       */
!     StrategyHintVacuum(true);

      /*
       * Do the actual work --- either FULL or "lazy" vacuum
--- 1088,1094 ----
       * Tell the cache replacement strategy that vacuum is causing all
       * following IO
       */
!     SetAccessPattern(AP_VACUUM);

      /*
       * Do the actual work --- either FULL or "lazy" vacuum
***************
*** 1098,1104 ****
      else
          lazy_vacuum_rel(onerel, vacstmt);

!     StrategyHintVacuum(false);

      /* all done with this class, but hold lock until commit */
      relation_close(onerel, NoLock);
--- 1098,1104 ----
      else
          lazy_vacuum_rel(onerel, vacstmt);

!     SetAccessPattern(AP_NORMAL);

      /* all done with this class, but hold lock until commit */
      relation_close(onerel, NoLock);
Index: src/backend/storage/buffer/README
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/README,v
retrieving revision 1.11
diff -c -r1.11 README
*** src/backend/storage/buffer/README    23 Jul 2006 03:07:58 -0000    1.11
--- src/backend/storage/buffer/README    16 May 2007 11:43:11 -0000
***************
*** 152,159 ****
  a field to show which backend is doing its I/O).


! Buffer replacement strategy
! ---------------------------

  There is a "free list" of buffers that are prime candidates for replacement.
  In particular, buffers that are completely free (contain no valid page) are
--- 152,159 ----
  a field to show which backend is doing its I/O).


! Normal buffer replacement strategy
! ----------------------------------

  There is a "free list" of buffers that are prime candidates for replacement.
  In particular, buffers that are completely free (contain no valid page) are
***************
*** 199,221 ****
  have to give up and try another buffer.  This however is not a concern
  of the basic select-a-victim-buffer algorithm.)

- A special provision is that while running VACUUM, a backend does not
- increment the usage count on buffers it accesses.  In fact, if ReleaseBuffer
- sees that it is dropping the pin count to zero and the usage count is zero,
- then it appends the buffer to the tail of the free list.  (This implies that
- VACUUM, but only VACUUM, must take the BufFreelistLock during ReleaseBuffer;
- this shouldn't create much of a contention problem.)  This provision
- encourages VACUUM to work in a relatively small number of buffers rather
- than blowing out the entire buffer cache.  It is reasonable since a page
- that has been touched only by VACUUM is unlikely to be needed again soon.
-
- Since VACUUM usually requests many pages very fast, the effect of this is that
- it will get back the very buffers it filled and possibly modified on the next
- call and will therefore do its work in a few shared memory buffers, while
- being able to use whatever it finds in the cache already.  This also implies
- that most of the write traffic caused by a VACUUM will be done by the VACUUM
- itself and not pushed off onto other processes.


  Background writer's processing
  ------------------------------
--- 199,243 ----
  have to give up and try another buffer.  This however is not a concern
  of the basic select-a-victim-buffer algorithm.)


+ Buffer ring replacement strategy
+ ---------------------------------
+
+ When running a query that needs to access a large number of pages, like VACUUM,
+ COPY, or a large sequential scan, a different strategy is used.  A page that
+ has been touched only by such a scan is unlikely to be needed again soon, so
+ instead of running the normal clock sweep algorithm and blowing out the entire
+ buffer cache, a small ring of buffers is allocated using the normal clock sweep
+ algorithm and those buffers are reused for the whole scan.  This also implies
+ that most of the write traffic caused by such a statement will be done by the
+ backend itself and not pushed off onto other processes.
+
+ The size of the ring used depends on the kind of scan:
+
+ For sequential scans, a small 256 KB ring is used. That's small enough to fit
+ in L2 cache, which makes transferring pages from OS cache to shared buffer
+ cache efficient. Even less would often be enough, but the ring must be big
+ enough to accommodate all pages in the scan that are pinned concurrently.
+ 256 KB should also be enough to leave a small cache trail for other backends to
+ join in a synchronized seq scan. If a buffer is dirtied and LSN set, the buffer
+ is removed from the ring and a replacement buffer is chosen using the normal
+ replacement strategy. In a scan that modifies every page in the scan, like a
+ bulk UPDATE or DELETE, the buffers in the ring will always be dirtied and the
+ ring strategy effectively degrades to the normal strategy.
+
+ VACUUM uses a 256 KB ring like sequential scans, but dirty pages are not
+ removed from the ring. WAL is flushed instead to allow reuse of the buffers.
+ Before introducing the buffer ring strategy in 8.3, buffers were put to the
+ freelist, which was effectively a buffer ring of 1 buffer.
+
+ COPY behaves like VACUUM, but a much larger ring is used. The ring size is
+ chosen to be twice the WAL segment size. This avoids polluting the buffer cache
+ like the clock sweep would do, and using a ring larger than WAL segment size
+ avoids having to do any extra WAL flushes, since a WAL segment will always be
+ filled, forcing a WAL flush, before looping through the buffer ring and bumping
+ into a buffer that would force a WAL flush. However, for non-WAL-logged COPY
+ operations the smaller 256 KB ring is used because WAL flushes are not needed
+ to write the buffers.

  Background writer's processing
  ------------------------------
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.218
diff -c -r1.218 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c    2 May 2007 23:34:48 -0000    1.218
--- src/backend/storage/buffer/bufmgr.c    16 May 2007 12:34:10 -0000
***************
*** 419,431 ****
      /* Loop here in case we have to try another victim buffer */
      for (;;)
      {
          /*
           * Select a victim buffer.    The buffer is returned with its header
           * spinlock still held!  Also the BufFreelistLock is still held, since
           * it would be bad to hold the spinlock while possibly waking up other
           * processes.
           */
!         buf = StrategyGetBuffer();

          Assert(buf->refcount == 0);

--- 419,433 ----
      /* Loop here in case we have to try another victim buffer */
      for (;;)
      {
+         bool lock_held;
+
          /*
           * Select a victim buffer.    The buffer is returned with its header
           * spinlock still held!  Also the BufFreelistLock is still held, since
           * it would be bad to hold the spinlock while possibly waking up other
           * processes.
           */
!         buf = StrategyGetBuffer(&lock_held);

          Assert(buf->refcount == 0);

***************
*** 436,442 ****
          PinBuffer_Locked(buf);

          /* Now it's safe to release the freelist lock */
!         LWLockRelease(BufFreelistLock);

          /*
           * If the buffer was dirty, try to write it out.  There is a race
--- 438,445 ----
          PinBuffer_Locked(buf);

          /* Now it's safe to release the freelist lock */
!         if (lock_held)
!             LWLockRelease(BufFreelistLock);

          /*
           * If the buffer was dirty, try to write it out.  There is a race
***************
*** 464,469 ****
--- 467,489 ----
               */
              if (LWLockConditionalAcquire(buf->content_lock, LW_SHARED))
              {
+                 /* In BULKREAD-mode, check if a WAL flush would be needed to
+                  * evict this buffer. If so, ask the replacement strategy if
+                  * we should go ahead and do it or choose another victim.
+                  */
+                 if (active_access_pattern == AP_BULKREAD)
+                 {
+                     if (XLogNeedsFlush(BufferGetLSN(buf)))
+                     {
+                         if (StrategyRejectBuffer(buf))
+                         {
+                             LWLockRelease(buf->content_lock);
+                             UnpinBuffer(buf, true, false);
+                             continue;
+                         }
+                     }
+                 }
+
                  FlushBuffer(buf, NULL);
                  LWLockRelease(buf->content_lock);
              }
***************
*** 925,932 ****
      PrivateRefCount[b]--;
      if (PrivateRefCount[b] == 0)
      {
-         bool        immed_free_buffer = false;
-
          /* I'd better not still hold any locks on the buffer */
          Assert(!LWLockHeldByMe(buf->content_lock));
          Assert(!LWLockHeldByMe(buf->io_in_progress_lock));
--- 945,950 ----
***************
*** 940,956 ****
          /* Update buffer usage info, unless this is an internal access */
          if (normalAccess)
          {
!             if (!strategy_hint_vacuum)
              {
!                 if (buf->usage_count < BM_MAX_USAGE_COUNT)
!                     buf->usage_count++;
              }
              else
!             {
!                 /* VACUUM accesses don't bump usage count, instead... */
!                 if (buf->refcount == 0 && buf->usage_count == 0)
!                     immed_free_buffer = true;
!             }
          }

          if ((buf->flags & BM_PIN_COUNT_WAITER) &&
--- 958,975 ----
          /* Update buffer usage info, unless this is an internal access */
          if (normalAccess)
          {
!             if (active_access_pattern != AP_NORMAL)
              {
!                 /* We don't want large one-off scans like vacuum to inflate
!                  * the usage_count. We do want to set it to 1, though, to keep
!                  * other backends from hijacking it from the buffer ring.
!                  */
!                 if (buf->usage_count == 0)
!                     buf->usage_count = 1;
              }
              else
!             if (buf->usage_count < BM_MAX_USAGE_COUNT)
!                 buf->usage_count++;
          }

          if ((buf->flags & BM_PIN_COUNT_WAITER) &&
***************
*** 965,978 ****
          }
          else
              UnlockBufHdr(buf);
-
-         /*
-          * If VACUUM is releasing an otherwise-unused buffer, send it to the
-          * freelist for near-term reuse.  We put it at the tail so that it
-          * won't be used before any invalid buffers that may exist.
-          */
-         if (immed_free_buffer)
-             StrategyFreeBuffer(buf, false);
      }
  }

--- 984,989 ----
Index: src/backend/storage/buffer/freelist.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/freelist.c,v
retrieving revision 1.58
diff -c -r1.58 freelist.c
*** src/backend/storage/buffer/freelist.c    5 Jan 2007 22:19:37 -0000    1.58
--- src/backend/storage/buffer/freelist.c    25 May 2007 19:25:16 -0000
***************
*** 18,23 ****
--- 18,25 ----
  #include "storage/buf_internals.h"
  #include "storage/bufmgr.h"

+ #include "utils/memutils.h"
+

  /*
   * The shared freelist control information.
***************
*** 39,47 ****
  /* Pointers to shared state */
  static BufferStrategyControl *StrategyControl = NULL;

! /* Backend-local state about whether currently vacuuming */
! bool        strategy_hint_vacuum = false;


  /*
   * StrategyGetBuffer
--- 41,59 ----
  /* Pointers to shared state */
  static BufferStrategyControl *StrategyControl = NULL;

! /* Currently active access pattern hint. */
! AccessPattern active_access_pattern = AP_NORMAL;

+ /* prototypes for internal functions */
+ static volatile BufferDesc *GetBufferFromRing(void);
+ static void PutBufferToRing(volatile BufferDesc *buf);
+ static void InitRing(void);
+
+ /* Did the last buffer returned by StrategyGetBuffer come from the buffer
+  * ring or from freelist/clock sweep? StrategyRejectBuffer needs to know
+  * that, see comments there.
+  */
+ static bool lastBufferCameFromRing = false;

  /*
   * StrategyGetBuffer
***************
*** 51,67 ****
   *    the selected buffer must not currently be pinned by anyone.
   *
   *    To ensure that no one else can pin the buffer before we do, we must
!  *    return the buffer with the buffer header spinlock still held.  That
!  *    means that we return with the BufFreelistLock still held, as well;
!  *    the caller must release that lock once the spinlock is dropped.
   */
  volatile BufferDesc *
! StrategyGetBuffer(void)
  {
      volatile BufferDesc *buf;
      int            trycounter;

      LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);

      /*
       * Try to get a buffer from the freelist.  Note that the freeNext fields
--- 63,98 ----
   *    the selected buffer must not currently be pinned by anyone.
   *
   *    To ensure that no one else can pin the buffer before we do, we must
!  *    return the buffer with the buffer header spinlock still held.  If
!  *    *lock_held is set at return, we return with the BufFreelistLock still
!  *    held, as well;    the caller must release that lock once the spinlock is
!  *    dropped.
   */
  volatile BufferDesc *
! StrategyGetBuffer(bool *lock_held)
  {
      volatile BufferDesc *buf;
      int            trycounter;

+     /* Get a buffer from the ring if we're doing a bulk scan */
+     if (active_access_pattern != AP_NORMAL)
+     {
+         buf = GetBufferFromRing();
+         if (buf != NULL)
+         {
+             *lock_held = false;
+             lastBufferCameFromRing = true;
+             return buf;
+         }
+     }
+
+     lastBufferCameFromRing = false;
+
+     /*
+      * If our selected buffer wasn't available, pick another...
+      */
      LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+     *lock_held = true;

      /*
       * Try to get a buffer from the freelist.  Note that the freeNext fields
***************
*** 86,96 ****
           */
          LockBufHdr(buf);
          if (buf->refcount == 0 && buf->usage_count == 0)
              return buf;
          UnlockBufHdr(buf);
      }

!     /* Nothing on the freelist, so run the "clock sweep" algorithm */
      trycounter = NBuffers;
      for (;;)
      {
--- 117,131 ----
           */
          LockBufHdr(buf);
          if (buf->refcount == 0 && buf->usage_count == 0)
+         {
+             if (active_access_pattern != AP_NORMAL)
+                 PutBufferToRing(buf);
              return buf;
+         }
          UnlockBufHdr(buf);
      }

!     /* Nothing on the freelist, so run the shared "clock sweep" algorithm */
      trycounter = NBuffers;
      for (;;)
      {
***************
*** 105,111 ****
--- 140,150 ----
           */
          LockBufHdr(buf);
          if (buf->refcount == 0 && buf->usage_count == 0)
+         {
+             if (active_access_pattern != AP_NORMAL)
+                 PutBufferToRing(buf);
              return buf;
+         }
          if (buf->usage_count > 0)
          {
              buf->usage_count--;
***************
*** 191,204 ****
  }

  /*
!  * StrategyHintVacuum -- tell us whether VACUUM is active
   */
  void
! StrategyHintVacuum(bool vacuum_active)
  {
!     strategy_hint_vacuum = vacuum_active;
! }


  /*
   * StrategyShmemSize
--- 230,254 ----
  }

  /*
!  * SetAccessPattern -- Sets the active access pattern hint
!  *
!  * Caller is responsible for resetting the hint to AP_NORMAL after the bulk
!  * operation is done. It's ok to switch repeatedly between AP_NORMAL and one of
!  * the other strategies, for example in a query with one large sequential scan
!  * nested loop joined to an index scan. Index tuples should be fetched with the
!  * normal strategy and the pages from the seq scan should be read in with the
!  * AP_BULKREAD strategy. The ring won't be affected by such switching, however
!  * switching to an access pattern with different ring size will invalidate the
!  * old ring.
   */
  void
! SetAccessPattern(AccessPattern new_pattern)
  {
!     active_access_pattern = new_pattern;

+     if (active_access_pattern != AP_NORMAL)
+         InitRing();
+ }

  /*
   * StrategyShmemSize
***************
*** 274,276 ****
--- 324,504 ----
      else
          Assert(!init);
  }
+
+ /* ----------------------------------------------------------------
+  *                Backend-private buffer ring management
+  * ----------------------------------------------------------------
+  */
+
+ /*
+  * Ring sizes for different access patterns. See README for the rationale
+  * of these.
+  */
+ #define BULKREAD_RING_SIZE    256 * 1024 / BLCKSZ
+ #define VACUUM_RING_SIZE    256 * 1024 / BLCKSZ
+ #define COPY_RING_SIZE        Min(NBuffers / 8, (XLOG_SEG_SIZE / BLCKSZ) * 2)
+
+ /*
+  * BufferRing is an array of buffer ids, and RingSize is its size in number of
+  * elements. It's allocated in TopMemoryContext the first time it's needed.
+  */
+ static int *BufferRing = NULL;
+ static int RingSize = 0;
+
+ /* Index of the "current" slot in the ring. It's advanced every time a buffer
+  * is handed out from the ring with GetBufferFromRing and it points to the
+  * last buffer returned from the ring. RingCurSlot + 1 is the next victim
+  * GetBufferFromRing will hand out.
+  */
+ static int RingCurSlot = 0;
+
+ /* magic value to mark empty slots in the ring */
+ #define BUF_ID_NOT_SET -1
+
+
+ /*
+  * GetBufferFromRing -- returns a buffer from the ring, or NULL if the
+  *        ring is empty.
+  *
+  * The bufhdr spin lock is held on the returned buffer.
+  */
+ static volatile BufferDesc *
+ GetBufferFromRing(void)
+ {
+     volatile BufferDesc *buf;
+
+     /* ring should be initialized by now */
+     Assert(RingSize > 0 && BufferRing != NULL);
+
+     /* Run private "clock cycle" */
+     if (++RingCurSlot >= RingSize)
+         RingCurSlot = 0;
+
+     /*
+      * If that slot hasn't been filled yet, tell the caller to allocate
+      * a new buffer with the normal allocation strategy. He will then
+      * fill this slot by calling PutBufferToRing with the new buffer.
+      */
+     if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET)
+         return NULL;
+
+     buf = &BufferDescriptors[BufferRing[RingCurSlot]];
+
+     /*
+      * If the buffer is pinned we cannot use it under any circumstances.
+      * If usage_count == 0 then the buffer is fair game.
+      *
+      * We also choose this buffer if usage_count == 1. Strictly, this
+      * might sometimes be the wrong thing to do, but we rely on the high
+      * probability that it was this process that last touched the buffer.
+      * If it wasn't, we'll choose a suboptimal victim, but  it shouldn't
+      * make any difference in the big scheme of things.
+      *
+      */
+     LockBufHdr(buf);
+     if (buf->refcount == 0 && buf->usage_count <= 1)
+         return buf;
+     UnlockBufHdr(buf);
+
+     return NULL;
+ }
+
+ /*
+  * PutBufferToRing -- adds a buffer to the buffer ring
+  *
+  * Caller must hold the buffer header spinlock on the buffer.
+  */
+ static void
+ PutBufferToRing(volatile BufferDesc *buf)
+ {
+     /* ring should be initialized by now */
+     Assert(RingSize > 0 && BufferRing != NULL);
+
+     if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET)
+         BufferRing[RingCurSlot] = buf->buf_id;
+
+ }
+
+ /*
+  * Initializes a ring buffer with correct size for the currently
+  * active strategy. Does nothing if the ring already has the right size.
+  */
+ static void
+ InitRing(void)
+ {
+     int new_size;
+     int old_size = RingSize;
+     int i;
+     MemoryContext oldcxt;
+
+     /* Determine new size */
+
+     switch(active_access_pattern)
+     {
+         case AP_BULKREAD:
+             new_size = BULKREAD_RING_SIZE;
+             break;
+         case AP_COPY:
+             new_size = COPY_RING_SIZE;
+             break;
+         case AP_VACUUM:
+             new_size = VACUUM_RING_SIZE;
+             break;
+         default:
+             elog(ERROR, "unexpected buffer cache strategy %d",
+                  active_access_pattern);
+             return; /* keep compile happy */
+     }
+
+     /*
+      * Seq scans set and reset the strategy on every page, so we better exit
+      * quickly if no change in size is needed.
+      */
+     if (new_size == old_size)
+         return;
+
+     /* Allocate array */
+
+     oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+     if (old_size == 0)
+     {
+         Assert(BufferRing == NULL);
+         BufferRing = palloc(new_size * sizeof(int));
+     }
+     else
+         BufferRing = repalloc(BufferRing, new_size * sizeof(int));
+
+     MemoryContextSwitchTo(oldcxt);
+
+     for(i = 0; i < new_size; i++)
+         BufferRing[i] = BUF_ID_NOT_SET;
+
+     RingCurSlot = 0;
+     RingSize = new_size;
+ }
+
+ /*
+  * In AP_BULKREAD mode, the buffer manager calls this function when it turns
+  * out that the buffer handed to it by StrategyGetBuffer needs a WAL flush to
+  * write out. That gives us a second chance to choose another victim.
+  *
+  * Returns true if buffer manager should ask for a new victim, and false
+  * if WAL should be flushed and this buffer used.
+  */
+ bool
+ StrategyRejectBuffer(volatile BufferDesc *buf)
+ {
+     Assert(RingSize > 0);
+
+     /* If the buffer didn't come from the ring, we tell the caller to do the
+      * WAL flush and use the buffer. We don't want to mess with how the clock
+      * sweep works; in the worst case there are no buffers in the buffer cache
+      * that can be reused without a WAL flush, and we'd get into an endless
+      * loop trying.
+      */
+     if (lastBufferCameFromRing)
+         BufferRing[RingCurSlot] = BUF_ID_NOT_SET;
+
+     return lastBufferCameFromRing;
+ }
Index: src/include/access/relscan.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/relscan.h,v
retrieving revision 1.52
diff -c -r1.52 relscan.h
*** src/include/access/relscan.h    20 Jan 2007 18:43:35 -0000    1.52
--- src/include/access/relscan.h    15 May 2007 17:01:31 -0000
***************
*** 28,33 ****
--- 28,34 ----
      ScanKey        rs_key;            /* array of scan key descriptors */
      BlockNumber rs_nblocks;        /* number of blocks to scan */
      bool        rs_pageatatime; /* verify visibility page-at-a-time? */
+     AccessPattern rs_accesspattern; /* access pattern to use for reads */

      /* scan current state */
      bool        rs_inited;        /* false = scan not init'd yet */
Index: src/include/access/xlog.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v
retrieving revision 1.76
diff -c -r1.76 xlog.h
*** src/include/access/xlog.h    5 Jan 2007 22:19:51 -0000    1.76
--- src/include/access/xlog.h    14 May 2007 21:22:40 -0000
***************
*** 151,156 ****
--- 151,157 ----

  extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
  extern void XLogFlush(XLogRecPtr RecPtr);
+ extern bool XLogNeedsFlush(XLogRecPtr RecPtr);

  extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);
Index: src/include/storage/buf_internals.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/buf_internals.h,v
retrieving revision 1.89
diff -c -r1.89 buf_internals.h
*** src/include/storage/buf_internals.h    5 Jan 2007 22:19:57 -0000    1.89
--- src/include/storage/buf_internals.h    15 May 2007 17:07:59 -0000
***************
*** 16,21 ****
--- 16,22 ----
  #define BUFMGR_INTERNALS_H

  #include "storage/buf.h"
+ #include "storage/bufmgr.h"
  #include "storage/lwlock.h"
  #include "storage/shmem.h"
  #include "storage/spin.h"
***************
*** 168,174 ****
  extern BufferDesc *LocalBufferDescriptors;

  /* in freelist.c */
! extern bool strategy_hint_vacuum;

  /* event counters in buf_init.c */
  extern long int ReadBufferCount;
--- 169,175 ----
  extern BufferDesc *LocalBufferDescriptors;

  /* in freelist.c */
! extern AccessPattern active_access_pattern;

  /* event counters in buf_init.c */
  extern long int ReadBufferCount;
***************
*** 184,195 ****
   */

  /* freelist.c */
! extern volatile BufferDesc *StrategyGetBuffer(void);
  extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head);
  extern int    StrategySyncStart(void);
  extern Size StrategyShmemSize(void);
  extern void StrategyInitialize(bool init);

  /* buf_table.c */
  extern Size BufTableShmemSize(int size);
  extern void InitBufTable(int size);
--- 185,198 ----
   */

  /* freelist.c */
! extern volatile BufferDesc *StrategyGetBuffer(bool *lock_held);
  extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head);
  extern int    StrategySyncStart(void);
  extern Size StrategyShmemSize(void);
  extern void StrategyInitialize(bool init);

+ extern bool StrategyRejectBuffer(volatile BufferDesc *buf);
+
  /* buf_table.c */
  extern Size BufTableShmemSize(int size);
  extern void InitBufTable(int size);
Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.103
diff -c -r1.103 bufmgr.h
*** src/include/storage/bufmgr.h    2 May 2007 23:18:03 -0000    1.103
--- src/include/storage/bufmgr.h    25 May 2007 19:23:37 -0000
***************
*** 49,54 ****
--- 49,67 ----
  #define BUFFER_LOCK_EXCLUSIVE    2

  /*
+  * For better cache efficiency, we use different buffer replacement strategies
+  * for different kinds of access patterns. Use SetAccessPattern to hint the
+  * buffer manager which kind of access we're doing at the moment.
+  */
+ typedef enum AccessPattern
+ {
+     AP_NORMAL,        /* Normal random access */
+     AP_BULKREAD,    /* Large read-only scan (hint bit updates are ok) */
+     AP_COPY,        /* Large updating scan, like COPY with WAL enabled */
+     AP_VACUUM,        /* VACUUM */
+ } AccessPattern;
+
+ /*
   * These routines are beaten on quite heavily, hence the macroization.
   */

***************
*** 157,162 ****
  extern void AtProcExit_LocalBuffers(void);

  /* in freelist.c */
! extern void StrategyHintVacuum(bool vacuum_active);

  #endif
--- 170,175 ----
  extern void AtProcExit_LocalBuffers(void);

  /* in freelist.c */
! extern void SetAccessPattern(AccessPattern new_pattern);

  #endif
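
To make the intended usage concrete, here's a minimal caller-side sketch of how
a bulk scan would hint the buffer manager through the new interface. This is
for illustration only and is not part of the patch; the real heapam.c changes
differ, and the helper name below is made up:

#include "postgres.h"

#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Illustrative helper (not in the patch): read one page of a large scan,
 * routing the read through the backend-private buffer ring.
 */
static Buffer
read_scan_page(Relation rel, BlockNumber blkno, bool bulkread)
{
    Buffer      buf;

    if (bulkread)
        SetAccessPattern(AP_BULKREAD);  /* reads go through the small ring */

    buf = ReadBuffer(rel, blkno);

    if (bulkread)
        SetAccessPattern(AP_NORMAL);    /* don't affect unrelated reads, e.g.
                                         * index fetches in the same query */

    return buf;
}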

Re: Seq scans status update

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Here's a new version, all known issues are now fixed. I'm now happy with
> this patch.
> Next, I'll start looking at the latest version of Jeff's synchronized
> scans patch.

I'm a bit confused --- weren't you intending to review these in parallel
because of the possible interactions?  Do you think this should be
applied now, or does it need to wait to see what happens with Jeff's
patch?

            regards, tom lane

Re: Seq scans status update

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Here's a new version, all known issues are now fixed. I'm now happy with
>> this patch.
>> Next, I'll start looking at the latest version of Jeff's synchronized
>> scans patch.
>
> I'm a bit confused --- weren't you intending to review these in parallel
> because of the possible interactions?  Do you think this should be
> applied now, or does it need to wait to see what happens with Jeff's
> patch?

I think it should be applied now. I've briefly looked at Jeff's patch
and I don't see any problems looming there.

Jeff performed tests with Simon's original patch and his patch, and I
think the results from those tests are still valid since the basic
behavior hasn't been changed. I'll repeat those tests myself, and run
some more to see if the added CPU overhead shows up in tests, but I'm
pretty confident the patches will work together as advertised.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Seq scans status update

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Here's a new version, all known issues are now fixed. I'm now happy with
> this patch.

I'm looking this over and finding it fairly ugly from a
system-structural point of view.  In particular, this has pushed the
single-global-variable StrategyHintVacuum() concept well past the
breaking point.  That API was sort of tolerable for vacuum, since vacuum
isn't re-entrant and can't coexist with any other behavior within the
same backend; though I never liked the fact that it affected vacuum's
system-catalog accesses along with the intended table fetches.  But now
you've got bits of code hacking the access pattern in contexts that are
far less predictable than vacuum ... and not guaranteeing to reset the
access pattern on failure, either.

I think we've got to get rid of the global variable and make the access
pattern be a parameter to StrategyGetBuffer, instead.  Which in turn
suggests a variant ReadBuffer, maybe ReadBufferWithStrategy(), at the
next level up.  This would serve directly for the heapscan case, and
for the main heap accesses in vacuum/analyze, but it might be a bit
of a pain to make index vacuuming pay attention to the strategy.
The other case that the patch addresses is COPY TO REL, which we could
handle if we were willing to pass a strategy parameter down through
heap_insert and RelationGetBufferForTuple ... is it worth the trouble?
I don't recall anyone having presented measurements to show COPY TO REL
being helped by this patch.

I don't especially care for the ring data structure being global inside
freelist.c, either.  What I'm inclined to do is have the "access
strategy parameter" actually be a pointer to some struct type, with NULL
representing the default strategy.  At the beginning of a bulk operation
we'd create an access strategy object, which would be internally the
current Ring data structure, but it'd be opaque to code outside
freelist.c.  This'd mean that a query using two large seqscans would use
two rings not one, but that's not necessarily bad; it would in any case
make it a lot easier to generalize the strategy concept in future to
support more than one kind of non-default policy being active in
different parts of a query.
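
To illustrate, the API could end up looking something along these lines. All of
the names here (BufferAccessStrategy, GetAccessStrategy, FreeAccessStrategy,
ReadBufferWithStrategy) are placeholders sketched from the description above,
not anything that exists in the patch:

/* Opaque outside freelist.c; internally it would hold the ring. */
typedef struct BufferAccessStrategyData *BufferAccessStrategy;

/* A NULL strategy means the default shared clock-sweep policy. */
extern BufferAccessStrategy GetAccessStrategy(AccessPattern pattern);
extern void FreeAccessStrategy(BufferAccessStrategy strategy);
extern Buffer ReadBufferWithStrategy(Relation reln, BlockNumber blockNum,
                                     BufferAccessStrategy strategy);

/* A bulk operation would then look roughly like this: */
void
bulk_scan_example(Relation rel)
{
    BufferAccessStrategy strategy = GetAccessStrategy(AP_BULKREAD);
    BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
    BlockNumber blkno;

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        Buffer      buf = ReadBufferWithStrategy(rel, blkno, strategy);

        /* ... process the page ... */
        ReleaseBuffer(buf);
    }

    FreeAccessStrategy(strategy);
}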

A point I have not figured out how to deal with is that in the patch as
given, UnpinBuffer needs to know the strategy; and getting it that info
would make the changes far more invasive.  But the patch's behavior here
is pretty risky anyway, since the strategy global at the time of unpin
might have nothing to do with how it was set when the buffer was
acquired.  What I'm tempted to do is remove the special case there and
adjust buffer acquisition instead (maybe make it decrement the
usage_count when re-using a buffer from the ring).

Comments?  I'm still playing with the ideas at this point...

            regards, tom lane

Re: Seq scans status update

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> ... and not guaranteeing to reset the access pattern on failure, either.

Good catch, I thought I had that covered but apparently not.

> I think we've got to get rid of the global variable and make the access
> pattern be a parameter to StrategyGetBuffer, instead.  Which in turn
> suggests a variant ReadBuffer, maybe ReadBufferWithStrategy(), at the
> next level up.  This would serve directly for the heapscan case, and
> for the main heap accesses in vacuum/analyze, but it might be a bit
> of a pain to make index vacuuming pay attention to the strategy.

I thought of that as well, but settled on the global variable for that
same reason, the UnpinBuffer problem, and the fact that we'd end up with
quite a few different ReadBuffer variants: ReadBuffer, ReadOrZeroBuffer,
ReadBufferWithStrategy, ReleaseAndReadBuffer,
ReleaseAndReadBufferWithStrategy (if we care about that).

The index vacuum code for all the indexams could be changed to use
ReadBufferWithStrategy if we took that approach, though there are quite a
lot of those calls in total. We could use a hybrid approach, where you
could use ReadBufferWithStrategy to give the strategy on a page-by-page
basis, or use SetStrategyHintVacuum/SetAccessPattern to set it for all
subsequent ReadBuffer calls.

> The other case that the patch addresses is COPY TO REL, which we could
> handle if we were willing to pass a strategy parameter down through
> heap_insert and RelationGetBufferForTuple ... is it worth the trouble?
> I don't recall anyone having presented measurements to show COPY TO REL
> being helped by this patch.

The theory was that it'll reduce the cache-spoiling of COPY, just like
it does for selects. I can run tests on that.

> I don't especially care for the ring data structure being global inside
> freelist.c, either.  What I'm inclined to do is have the "access
> strategy parameter" actually be a pointer to some struct type, with NULL
> representing the default strategy.  At the beginning of a bulk operation
> we'd create an access strategy object, which would be internally the
> current Ring data structure, but it'd be opaque to code outside
> freelist.c.  This'd mean that a query using two large seqscans would use
> two rings not one, but that's not necessarily bad; it would in any case
> make it a lot easier to generalize the strategy concept in future to
> support more than one kind of non-default policy being active in
> different parts of a query.

I thought about the two large scans scenario but came to the conclusion
that it's not possible to have two large scans active at the same time
in a single backend. You could have a query like "SELECT * FROM
bigtable_1, bigtable_2" with a Cartesian product, or something with a
function that scans a large table, but even then one of the scans would
progress very slowly compared to the other scan, and using more than
one ring buffer would make no difference.

I agree on the generalization argument, though.

> A point I have not figured out how to deal with is that in the patch as
> given, UnpinBuffer needs to know the strategy; and getting it that info
> would make the changes far more invasive.  But the patch's behavior here
> is pretty risky anyway, since the strategy global at the time of unpin
> might have nothing to do with how it was set when the buffer was
> acquired.  What I'm tempted to do is remove the special case there and
> adjust buffer acquisition instead (maybe make it decrement the
> usage_count when re-using a buffer from the ring).

It's assumed that the same strategy is used when unpinning, which is
true for the current usage (and apparently needs to be documented).

Normally buffers that are in the ring are recycled quickly, in which
case the usage_count will always be increased from 0->1 in UnpinBuffer,
regardless of the check. The check is there to avoid inflating the
usage_count on buffers that happened to be already in shared_buffers. It
might not be that important for the selects, but that's what keeps COPY
from bumping the usage_count all the way up to 5.

If we figure out another way to deal with the COPY usage_count, maybe we
could remove the check altogether. I've been thinking of changing COPY
anyway so that it always keeps the last page it inserted into pinned, to
avoid the traffic to the buffer manager on each tuple.

> Comments?  I'm still playing with the ideas at this point...

Let me know if you want me to make changes and submit a new version.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Seq scans status update

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> A point I have not figured out how to deal with is that in the patch as
>> given, UnpinBuffer needs to know the strategy; and getting it that info
>> would make the changes far more invasive.  But the patch's behavior here
>> is pretty risky anyway, since the strategy global at the time of unpin
>> might have nothing to do with how it was set when the buffer was
>> acquired.

> It's assumed that the same strategy is used when unpinning, which is
> true for the current usage (and apparently needs to be documented).

I don't believe that for a moment.  Even in the trivial heapscan case,
the last pin is typically dropped during a tupleslot clear operation
sometime after the scan itself has moved on to the next page.  In a more
general context (such as parallel heapscan and indexscan on a rel) it's
certainly too risky to assume it.

> Normally buffers that are in the ring are recycled quickly, in which
> case the usage_count will always be increased from 0->1 in UnpinBuffer,
> regardless of the check. The check is there to avoid inflating the
> usage_count on buffers that happened to be already in shared_buffers.

A heapscan would pin the buffer only once and hence bump its count at
most once, so I don't see a big problem here.  Also, I'd argue that
buffers that had a positive usage_count shouldn't get sucked into the
ring to begin with.

> If we figure out another way to deal with the COPY usage_count, maybe we
> could remove the check altogether. I've been thinking of changing COPY
> anyway so that it always keeps the last page it inserted into pinned, to
> avoid the traffic to the buffer manager on each tuple.

This is getting fairly invasive for a part of the patch you've not
justified at all yet...

> Let me know if you want me to make changes and submit a new version.

I can work on this stuff; please do the tests to show whether it's worth
handling COPY TO REL, and keep on with Jeff's patch.

            regards, tom lane

Re: Seq scans status update

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> It's assumed that the same strategy is used when unpinning, which is
>> true for the current usage (and apparently needs to be documented).
>
> I don't believe that for a moment.  Even in the trivial heapscan case,
> the last pin is typically dropped during a tupleslot clear operation
> sometime after the scan itself has moved on to the next page.  In a more
> general context (such as parallel heapscan and indexscan on a rel) it's
> certainly too risky to assume it.

Hmm, I guess you're right. :(

One idea is to keep track of which pins are taken using the bulk strategy.
It's a bit tricky when a buffer is pinned multiple times, since we don't
know which ReleaseBuffer corresponds to which ReadBuffer, but perhaps we
could get away with just a flag per pinned buffer. Set the flag when a
buffer is pinned with the bulk strategy and it wasn't pinned by us before,
and clear it when it's pinned with another strategy. I'm thinking we
steal one bit from PrivateRefCount for this.
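
Roughly like this, as a sketch; the bit position and macro names are just
placeholders, and every existing reference to PrivateRefCount would have to
go through these accessors:

/* Sketch only: steal one high-order bit of PrivateRefCount[b]. */
#define BULK_PIN_FLAG               ((int32) 0x40000000)

#define GetPrivateRefCount(b)       (PrivateRefCount[(b)] & ~BULK_PIN_FLAG)
#define PinnedWithBulkStrategy(b)   ((PrivateRefCount[(b)] & BULK_PIN_FLAG) != 0)
#define MarkBulkPin(b)              (PrivateRefCount[(b)] |= BULK_PIN_FLAG)
#define ClearBulkPin(b)             (PrivateRefCount[(b)] &= ~BULK_PIN_FLAG)

UnpinBuffer would then consult PinnedWithBulkStrategy() instead of the global
access pattern hint when deciding whether to bump usage_count.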

>> Normally buffers that are in the ring are recycled quickly, in which
>> case the usage_count will always be increased from 0->1 in UnpinBuffer,
>> regardless of the check. The check is there to avoid inflating the
>> usage_count on buffers that happened to be already in shared_buffers.
>
> A heapscan would pin the buffer only once and hence bump its count at
> most once, so I don't see a big problem here.  Also, I'd argue that
> buffers that had a positive usage_count shouldn't get sucked into the
> ring to begin with.

True, except that with the synchronized scans patch two synchronized
scans will pin the buffer twice.

>> If we figure out another way to deal with the COPY usage_count, maybe we
>> could remove the check altogether. I've been thinking of changing COPY
>> anyway so that it always keeps the last page it inserted into pinned, to
>> avoid the traffic to the buffer manager on each tuple.
>
> This is getting fairly invasive for a part of the patch you've not
> justified at all yet...
>
>> Let me know if you want me to make changes and submit a new version.
>
> I can work on this stuff; please do the tests to show whether it's worth
> handling COPY TO REL, and keep on with Jeff's patch.

Ok.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Seq scans status update

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> One idea is to keep track of which pins are taken using the bulk strategy.
> It's a bit tricky when a buffer is pinned multiple times, since we don't
> know which ReleaseBuffer corresponds to which ReadBuffer, but perhaps we
> could get away with just a flag per pinned buffer. Set the flag when a
> buffer is pinned with the bulk strategy and it wasn't pinned by us before,
> and clear it when it's pinned with another strategy. I'm thinking we
> steal one bit from PrivateRefCount for this.

Seems like a mess.  Why don't we just fix it so there's no need for
different behavior at Unpin time?  The facts on the ground are that
the current patch's change in UnpinBuffer is a no-op anyway, because
of the tupletableslot interference.

The behavior I'm imagining is just that when we try to take a buffer
from the ring, if its usage count exceeds 1 then drop it from the ring
and get another buffer.  1 would be the expected case if no one had
touched it since we last used it.
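
In terms of the ring code in the patch, that would look roughly like this
(untested sketch of GetBufferFromRing; the names are the ones the patch
already uses):

static volatile BufferDesc *
GetBufferFromRing(void)
{
    volatile BufferDesc *buf;

    /* Advance the private "clock hand" */
    if (++RingCurSlot >= RingSize)
        RingCurSlot = 0;

    /* Empty slot: the caller fills it via the normal strategy */
    if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET)
        return NULL;

    buf = &BufferDescriptors[BufferRing[RingCurSlot]];

    LockBufHdr(buf);
    if (buf->refcount == 0)
    {
        if (buf->usage_count <= 1)
            return buf;         /* expected case: untouched since we used it */

        /*
         * Someone else has been using this page: drop it from the ring and
         * have the caller pick a victim with the normal strategy instead.
         */
        BufferRing[RingCurSlot] = BUF_ID_NOT_SET;
    }
    UnlockBufHdr(buf);

    return NULL;
}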

>> A heapscan would pin the buffer only once and hence bump its count at
>> most once, so I don't see a big problem here.  Also, I'd argue that
>> buffers that had a positive usage_count shouldn't get sucked into the
>> ring to begin with.

> True, except that with the synchronized scans patch two synchronized
> scans will pin the buffer twice.

Hmm.  But we probably don't want the same buffer in two different
backends' rings, either.  You *sure* the sync-scan patch has no
interaction with this one?

One other question: I see the patch sets the threshold for switching
from normal to ring-buffer heapscans at table size = NBuffers.  Why
so high?  I'd have expected maybe at most NBuffers/4 or NBuffers/10.
If you don't want a seqscan blowing out your buffer cache, you surely
don't want it blowing out 90% of the cache either.

            regards, tom lane

Re: Seq scans status update

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> A point I have not figured out how to deal with is that in the patch as
> given, UnpinBuffer needs to know the strategy; and getting it that info
> would make the changes far more invasive.  But the patch's behavior here
> is pretty risky anyway, since the strategy global at the time of unpin
> might have nothing to do with how it was set when the buffer was
> acquired.  What I'm tempted to do is remove the special case there and
> adjust buffer acquisition instead (maybe make it decrement the
> usage_count when re-using a buffer from the ring).

Is there a reason UnpinBuffer has to be the one to increment the usage count
anyways? Why can't ReadBuffer handle incrementing the count and just trust
that it won't be decremented until the buffer is unpinned anyways?

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Seq scans status update

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> Is there a reason UnpinBuffer has to be the one to increment the usage count
> anyways? Why can't ReadBuffer handle incrementing the count and just trust
> that it won't be decremented until the buffer is unpinned anyways?

That's a good question.  I think the idea was that if we hold a buffer
pinned for awhile (long enough that the bgwriter's clock sweep passes
over it one or more times), we want the usage count decrementing to
start when we release the pin, not when we acquire it.  But maybe that
could be fixed if the clock sweep doesn't touch the usage_count of a
pinned buffer.  Which in fact it may not do already --- didn't look.

            regards, tom lane

Re: Seq scans status update

From
Greg Smith
Date:
On Mon, 28 May 2007, Tom Lane wrote:

> But maybe that could be fixed if the clock sweep doesn't touch the
> usage_count of a pinned buffer.  Which in fact it may not do already ---
> didn't look.

StrategyGetBuffer doesn't care whether the buffer is pinned or not; it
decrements the usage_count regardless.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Seq scans status update

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Gregory Stark <stark@enterprisedb.com> writes:
> > Is there a reason UnpinBuffer has to be the one to increment the usage count
> > anyways? Why can't ReadBuffer handle incrementing the count and just trust
> > that it won't be decremented until the buffer is unpinned anyways?
>
> That's a good question.  I think the idea was that if we hold a buffer
> pinned for awhile (long enough that the bgwriter's clock sweep passes
> over it one or more times), we want the usage count decrementing to
> start when we release the pin, not when we acquire it.  But maybe that
> could be fixed if the clock sweep doesn't touch the usage_count of a
> pinned buffer.  Which in fact it may not do already --- didn't look.

It does -- in BgBufferSync the "all" scan calls SyncOneBuffer with
skip_pinned=false.  The "lru" scan does skip pinned buffers.

--
Alvaro Herrera                          Developer, http://www.PostgreSQL.org/
"World domination is proceeding according to plan"        (Andrew Morton)

Re: Seq scans status update

From
Heikki Linnakangas
Date:
Alvaro Herrera wrote:
> Tom Lane wrote:
>> Gregory Stark <stark@enterprisedb.com> writes:
>>> Is there a reason UnpinBuffer has to be the one to increment the usage count
>>> anyways? Why can't ReadBuffer handle incrementing the count and just trust
>>> that it won't be decremented until the buffer is unpinned anyways?
>> That's a good question.  I think the idea was that if we hold a buffer
>> pinned for awhile (long enough that the bgwriter's clock sweep passes
>> over it one or more times), we want the usage count decrementing to
>> start when we release the pin, not when we acquire it.  But maybe that
>> could be fixed if the clock sweep doesn't touch the usage_count of a
>> pinned buffer.  Which in fact it may not do already --- didn't look.
>
> It does -- in BgBufferSync the "all" scan calls SyncOneBuffer with
> skip_pinned=false.  The "lru" scan does skip pinned buffers.

You're looking in the wrong place. StrategyGetBuffer drives the clock
sweep, and it always decreases the usage_count, IOW it doesn't skip
pinned buffers. SyncOneBuffer and BgBufferSync don't decrease the
usage_count in any case.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Seq scans status update

From
Jeff Davis
Date:
On Mon, 2007-05-28 at 17:36 -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
> > One idea is to keep track of which pins are taken using the bulk strategy.
> > It's a bit tricky when a buffer is pinned multiple times, since we don't
> > know which ReleaseBuffer corresponds to which ReadBuffer, but perhaps we
> > could get away with just a flag per pinned buffer. Set the flag when a
> > buffer is pinned with the bulk strategy and it wasn't pinned by us before,
> > and clear it when it's pinned with another strategy. I'm thinking we
> > steal one bit from PrivateRefCount for this.
>
> Seems like a mess.  Why don't we just fix it so there's no need for
> different behavior at Unpin time?  The facts on the ground are that
> the current patch's change in UnpinBuffer is a no-op anyway, because
> of the tupletableslot interference.
>
> The behavior I'm imagining is just that when we try to take a buffer
> from the ring, if its usage count exceeds 1 then drop it from the ring
> and get another buffer.  1 would be the expected case if no one had
> touched it since we last used it.
>
> >> A heapscan would pin the buffer only once and hence bump its count at
> >> most once, so I don't see a big problem here.  Also, I'd argue that
> >> buffers that had a positive usage_count shouldn't get sucked into the
> >> ring to begin with.
>
> > True, except that with the synchronized scans patch two synchronized
> > scans will pin the buffer twice.
>
> Hmm.  But we probably don't want the same buffer in two different
> backends' rings, either.  You *sure* the sync-scan patch has no
> interaction with this one?
>

I will run some tests again tonight, I think the interaction needs more
testing than I did originally. Also, I'm not sure that the hardware I
have is sufficient to test those cases.

It looks like the case to worry about is when there are a large number
of scans on the same table and the I/O system is fast enough that it
causes lock contention on the buffers in the rings. Is this the case
you're worried about?

Also, keep in mind that I have added a SyncScanLock after I ran those
tests. That could have an effect.

Regards,
    Jeff Davis



Re: Seq scans status update

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>>> A heapscan would pin the buffer only once and hence bump its count at
>>> most once, so I don't see a big problem here.  Also, I'd argue that
>>> buffers that had a positive usage_count shouldn't get sucked into the
>>> ring to begin with.
>
>> True, except that with the synchronized scans patch two synchronized
>> scans will pin the buffer twice.
>
> Hmm.  But we probably don't want the same buffer in two different
> backends' rings, either.  You *sure* the sync-scan patch has no
> interaction with this one?

A buffer is only put into the ring when it's requested with
StrategyGetBuffer. If a page is requested with ReadBuffer, and it's
already in cache, it won't be put in the ring. When multiple scanners are
scanning synchronously, they will all use their own, distinct rings when
reading buffers into the cache, and "peek" at the buffers in other
backends' rings for pages that others have read into the buffer cache.

I'm going to test the interactions in more detail...

> One other question: I see the patch sets the threshold for switching
> from normal to ring-buffer heapscans at table size = NBuffers.  Why
> so high?  I'd have expected maybe at most NBuffers/4 or NBuffers/10.
> If you don't want a seqscan blowing out your buffer cache, you surely
> don't want it blowing out 90% of the cache either.

NBuffers is the maximum value that makes sense; if you're scanning more
than NBuffers, the scan is definitely not going to fit in
shared_buffers. Anything less than that and we might be causing harm to
some use cases, so I chose that for the time being.
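
(In the patch that boils down to a test like the following when the scan is
initialized; the field names are from the patch, but the exact form is
paraphrased and the helper name is made up:)

static void
choose_scan_access_pattern(HeapScanDesc scan)
{
    /* Use the buffer ring only if the table can't fit in shared_buffers */
    scan->rs_accesspattern =
        (scan->rs_nblocks > (BlockNumber) NBuffers) ? AP_BULKREAD : AP_NORMAL;
}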

Simon argued for a GUC variable, and Jeff's patch as it stands
introduces one. I'm not sure we want it but if we do, we should use the
same variable to control both the sync scan and cache replacement
policy. It's essentially "how large a scan do you expect to fit in
shared_buffers?"

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Seq scans status update

From
Jeff Davis
Date:
On Tue, 2007-05-29 at 17:43 -0700, Jeff Davis wrote:
> > Hmm.  But we probably don't want the same buffer in two different
> > backends' rings, either.  You *sure* the sync-scan patch has no
> > interaction with this one?
> >
>
> I will run some tests again tonight, I think the interaction needs more
> testing than I did originally. Also, I'm not sure that the hardware I
> have is sufficient to test those cases.
>

I ran some sanity tests last night with cvs head, plus my syncscan20-
cvshead.patch, plus scan_recycle_buffers.v3.patch.

It passed the sanity tests at least.

I did see that there was more interference with sync_seqscan_threshold=0
(always on) and scan_recycle_buffers=0 (off) than I had previously seen
with 8.2.4, so I will test again against 8.2.4 to see why that might be.
The interference that I saw was still quite small: a scan moving
concurrently with 9 other scans was about 10% slower than a scan running
alone -- which is still very good compared with plain cvs head and no
sync scan -- it's just not ideal.

However, turning scan_recycle_buffers between 0 (off), 16, 32, and 128
didn't have much effect. At 32 it appeared to be about 1% worse during
10 scans, but that may have been noise. The other values I tried didn't
have any difference that I could see.

This was really just a quick sanity test, I think more hard data would
be useful.

Regards,
    Jeff Davis


Re: Seq scans status update

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> One other question: I see the patch sets the threshold for switching
>> from normal to ring-buffer heapscans at table size = NBuffers.  Why
>> so high?  I'd have expected maybe at most NBuffers/4 or NBuffers/10.
>> If you don't want a seqscan blowing out your buffer cache, you surely
>> don't want it blowing out 90% of the cache either.

> NBuffers is the maximum value that makes sense; if you're scanning more
> than NBuffers, the scan is definitely not going to fit in
> shared_buffers. Anything less than that and we might be causing harm to
> some use cases, so I chose that for the time being.

But the flip side of that is you're failing to provide the benefit of
the patch in quite a lot of use-cases where it's clearly beneficial.
I just don't believe that there are very many cases where people will
want a heapscan to eat 90% of their cache.

> Simon argued for a GUC variable, and Jeff's patch as it stands
> introduces one. I'm not sure we want it but if we do, we should use the
> same variable to control both the sync scan and cache replacement
> policy. It's essentially "how large a scan do you expect to fit in
> shared_buffers?"

Well, let's do some experiments and see if there's really any point in
varying the cutover.

            regards, tom lane

Re: Seq scans status update

From
Heikki Linnakangas
Date:
Jeff Davis wrote:
> On Tue, 2007-05-29 at 17:43 -0700, Jeff Davis wrote:
>>> Hmm.  But we probably don't want the same buffer in two different
>>> backends' rings, either.  You *sure* the sync-scan patch has no
>>> interaction with this one?
>>>
>> I will run some tests again tonight, I think the interaction needs more
>> testing than I did originally. Also, I'm not sure that the hardware I
>> have is sufficient to test those cases.
>>
>
> I ran some sanity tests last night with cvs head, plus my syncscan20-
> cvshead.patch, plus scan_recycle_buffers.v3.patch.
>
> It passed the sanity tests at least.
>
> I did see that there was more interference with sync_seqscan_threshold=0
> (always on) and scan_recycle_buffers=0 (off) than I had previously seen
> with 8.2.4, so I will test again against 8.2.4 to see why that might be.
> The interference that I saw was still quite small: a scan moving
> concurrently with 9 other scans was about 10% slower than a scan running
> alone -- which is still very good compared with plain cvs head and no
> sync scan -- it's just not ideal.
>
> However, turning scan_recycle_buffers between 0 (off), 16, 32, and 128
> didn't have much effect. At 32 it appeared to be about 1% worse during
> 10 scans, but that may have been noise. The other values I tried didn't
> have any difference that I could see.
>
> This was really just a quick sanity test, I think more hard data would
> be useful.

The interesting question is whether the small buffer ring is big enough
to let all synchronized scans process a page before it's recycled.
Keep an eye on pg_buffercache to see if it gets filled with
pages from the table you're querying.

I just ran a quick test with 4 concurrent scans on a dual-core system,
and it looks like we do "leak" buffers from the rings because they're
pinned at the time they would be recycled. A full scan of a 30GB table
took just under 7 minutes, and starting after a postmaster restart it
took ~4-5 minutes until all of the 320MB of shared_buffers were used.
That means we're leaking a buffer from the ring very roughly on every
20-40th ReadBuffer call, but I'll put in some proper instrumentation and
test with different configurations to get a better picture.

Synchronized scans give such a big benefit when they're applicable
that I think some cache-spoiling is acceptable, and in fact
unavoidable in some scenarios. It's much better than the 8.2 behavior anyway.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Seq scans status update

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> I just ran a quick test with 4 concurrent scans on a dual-core system,
> and it looks like we do "leak" buffers from the rings because they're
> pinned at the time they would be recycled.

Yeah, I noticed the same in some tests here.  I think there's not a lot
we can do about that; we don't have enough visibility into why someone
else has the buffer pinned.

Using a larger ring would help, by making it less probable that any
other sync-scanning backend is so far behind as to still have the oldest
element of our ring pinned.  But if we do that we have the L2-cache-size
effect to worry about.  Is there any actual data backing up that it's
useful to keep the ring fitting in L2, or is that just guesswork?  In
the sync-scan case the idea seems pretty bogus anyway, because the
actual working set will be N backends' rings not just one.

            regards, tom lane

Re: Seq scans status update

From
Jeff Davis
Date:
On Wed, 2007-05-30 at 15:56 -0400, Tom Lane wrote:
> In
> the sync-scan case the idea seems pretty bogus anyway, because the
> actual working set will be N backends' rings not just one.

I don't follow. Ideally, in the sync-scan case, the sets of buffers in
the ring of different scans on the same relation will overlap
completely, right?

We might not be at the ideal, but the sets of buffers in the rings
shouldn't be disjoint, should they?

Regards,
    Jeff Davis


Re: Seq scans status update

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Wed, 2007-05-30 at 15:56 -0400, Tom Lane wrote:
>> In the sync-scan case the idea seems pretty bogus anyway, because the
>> actual working set will be N backends' rings not just one.

> I don't follow. Ideally, in the sync-scan case, the sets of buffers in
> the ring of different scans on the same relation will overlap
> completely, right?

> We might not be at the ideal, but the sets of buffers in the rings
> shouldn't be disjoint, should they?

According to Heikki's explanation here
http://archives.postgresql.org/pgsql-patches/2007-05/msg00498.php
each backend doing a heapscan would collect its own ring of buffers.
You might have a few backends that are always followers, never leaders,
and so never actually fetch any pages --- but for each backend that
actually did any I/O there would be a separate ring.  In practice I'd
expect the lead would "change hands" pretty often and so you'd see all
the backends accumulating their own rings.

            regards, tom lane

Re: Seq scans status update

From
Jeff Davis
Date:
On Wed, 2007-05-30 at 17:45 -0400, Tom Lane wrote:
> According to Heikki's explanation here
> http://archives.postgresql.org/pgsql-patches/2007-05/msg00498.php
> each backend doing a heapscan would collect its own ring of buffers.
> You might have a few backends that are always followers, never leaders,
> and so never actually fetch any pages --- but for each backend that
> actually did any I/O there would be a separate ring.  In practice I'd
> expect the lead would "change hands" pretty often and so you'd see all
> the backends accumulating their own rings.
>

Oh, I see what you mean. The rings will actually become sparsely
populated with many concurrent scans, and therefore, extend outside of
L2 cache.

Regards,
    Jeff Davis


Re: Seq scans status update

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> I just ran a quick test with 4 concurrent scans on a dual-core system,
>> and it looks like we do "leak" buffers from the rings because they're
>> pinned at the time they would be recycled.
>
> Yeah, I noticed the same in some tests here.  I think there's not a lot
> we can do about that; we don't have enough visibility into why someone
> else has the buffer pinned.

We could stash pinned buffers to some other list etc. and try them again
later. But that gets a lot more complex.

> Using a larger ring would help, by making it less probable that any
> other sync-scanning backend is so far behind as to still have the oldest
> element of our ring pinned.  But if we do that we have the L2-cache-size
> effect to worry about.  Is there any actual data backing up that it's
> useful to keep the ring fitting in L2, or is that just guesswork?  In
> the sync-scan case the idea seems pretty bogus anyway, because the
> actual working set will be N backends' rings not just one.

Yes, I tested different ring sizes here:
http://archives.postgresql.org/pgsql-hackers/2007-05/msg00469.php

The tests above showed the effect when reading a table from OS cache. I
haven't seen direct evidence supporting Luke's claim that the ring makes
scans of tables bigger than RAM go faster with bigger I/O hardware,
because I don't have such hardware at hand. We did repeat the tests on
different hardware however, and monitored the CPU usage with vmstat at
the same time. The CPU usage was significantly lower with the patch, so
I believe that with better I/O hardware the test would become limited by
CPU and the patch would therefore make it go faster.

BTW, we've been talking about the "L2 cache effect" but we don't really
know for sure if the effect has anything to do with the L2 cache. But
whatever it is, it's real.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Seq scans status update

From
Luke Lonergan
Date:
Hi All,

On 5/31/07 12:40 AM, "Heikki Linnakangas" <heikki@enterprisedb.com> wrote:

> BTW, we've been talking about the "L2 cache effect" but we don't really
> know for sure if the effect has anything to do with the L2 cache. But
> whatever it is, it's real.

The mailing list archives contain ample evidence that:
- it's definitely an L2 cache effect
- on fast I/O hardware tests show large benefits of keeping the ring in L2

I see no reason to re-open the discussion about these; can we accept these
as fact and continue?

- Luke


Re: Seq scans status update

From
Tom Lane
Date:
Luke Lonergan <llonergan@greenplum.com> writes:
> The mailing list archives contain ample evidence that:
> - it's definitely an L2 cache effect
> - on fast I/O hardware tests show large benefits of keeping the ring in L2

Please provide some specific pointers, because I don't remember that.

            regards, tom lane