Re: Seq scans status update - Mailing list pgsql-patches
From | Heikki Linnakangas |
---|---|
Subject | Re: Seq scans status update |
Date | |
Msg-id | 4651452E.4060805@enterprisedb.com |
In response to | Seq scans status update (Heikki Linnakangas <heikki@enterprisedb.com>) |
List | pgsql-patches |
I forgot to attach the program used to generate the test data. Here it is.

Heikki Linnakangas wrote:
> Attached is a new version of Simon's "scan-resistant buffer manager"
> patch. It's not ready for committing yet because of a small issue I
> found this morning (* see bottom), but here's a status update.
>
> To recap, the basic idea is to use a small ring of buffers for large
> scans like VACUUM, COPY and seq scans. Changes to the original patch:
>
> - a different sized ring is used for VACUUM and seq scans than for COPY.
> VACUUM and seq scans use a ring of 32 buffers, and COPY uses a ring of
> 4096 buffers in the default configuration. See the README changes in
> the patch for the rationale.
>
> - for queries with large seq scans, the buffer ring is only used for
> reads issued by the seq scan, not for any other reads in the query. A
> typical scenario where this matters is a large seq scan with a nested
> loop join to a smaller table: you don't want to use the buffer ring
> for the index lookups inside the nested loop.
>
> - for seq scans, buffers that would need a WAL flush to reuse are
> dropped from the ring. That makes bulk updates behave roughly like
> they do without the patch, instead of having to do a WAL flush every
> 32 pages.
>
> I've spent a lot of time thinking of solutions to the last point. The
> obvious solution would be to not use the buffer ring for updating
> scans. The difficulty with that is that we don't know whether a scan
> is read-only in heapam.c, where the hint to use the buffer ring is set.
>
> I've completed a set of performance tests on a test server. The server
> has 4 GB of RAM, of which 1 GB is used for shared_buffers.
>
> Results for a 10 GB table:
>
>  head-copy-bigtable | 00:10:09.07016
>  head-copy-bigtable | 00:10:20.507357
>  head-copy-bigtable | 00:10:21.857677
>  head-copy_nowal-bigtable | 00:05:18.232956
>  head-copy_nowal-bigtable | 00:03:24.109047
>  head-copy_nowal-bigtable | 00:05:31.019643
>  head-select-bigtable | 00:03:47.102731
>  head-select-bigtable | 00:01:08.314719
>  head-select-bigtable | 00:01:08.238509
>  head-select-bigtable | 00:01:08.208563
>  head-select-bigtable | 00:01:08.28347
>  head-select-bigtable | 00:01:08.308671
>  head-vacuum_clean-bigtable | 00:01:04.227832
>  head-vacuum_clean-bigtable | 00:01:04.232258
>  head-vacuum_clean-bigtable | 00:01:04.294621
>  head-vacuum_clean-bigtable | 00:01:04.280677
>  head-vacuum_hintbits-bigtable | 00:04:01.123924
>  head-vacuum_hintbits-bigtable | 00:03:58.253175
>  head-vacuum_hintbits-bigtable | 00:04:26.318159
>  head-vacuum_hintbits-bigtable | 00:04:37.512965
>  patched-copy-bigtable | 00:09:52.776754
>  patched-copy-bigtable | 00:10:18.185826
>  patched-copy-bigtable | 00:10:16.975482
>  patched-copy_nowal-bigtable | 00:03:14.882366
>  patched-copy_nowal-bigtable | 00:04:01.04648
>  patched-copy_nowal-bigtable | 00:03:56.062272
>  patched-select-bigtable | 00:03:47.704154
>  patched-select-bigtable | 00:01:08.460326
>  patched-select-bigtable | 00:01:10.441544
>  patched-select-bigtable | 00:01:11.916221
>  patched-select-bigtable | 00:01:13.848038
>  patched-select-bigtable | 00:01:10.956133
>  patched-vacuum_clean-bigtable | 00:01:10.315439
>  patched-vacuum_clean-bigtable | 00:01:12.210537
>  patched-vacuum_clean-bigtable | 00:01:15.202114
>  patched-vacuum_clean-bigtable | 00:01:10.712235
>  patched-vacuum_hintbits-bigtable | 00:03:42.279201
>  patched-vacuum_hintbits-bigtable | 00:04:02.057778
>  patched-vacuum_hintbits-bigtable | 00:04:26.805822
>  patched-vacuum_hintbits-bigtable | 00:04:28.911184
>
> In other words, the patch has no significant effect, as expected. The
> select times did go up by a couple of seconds, though, which I didn't
> expect. One theory is that unused shared_buffers are swapped out during
> the tests and bgwriter pulls them back in. I'll set swappiness to 0 and
> try again at some point.
>
> Results for a 2 GB table:
>
>  copy-medsize-unpatched | 00:02:18.23246
>  copy-medsize-unpatched | 00:02:22.347194
>  copy-medsize-unpatched | 00:02:23.875874
>  copy_nowal-medsize-unpatched | 00:01:27.606334
>  copy_nowal-medsize-unpatched | 00:01:17.491243
>  copy_nowal-medsize-unpatched | 00:01:31.902719
>  select-medsize-unpatched | 00:00:03.786031
>  select-medsize-unpatched | 00:00:02.678069
>  select-medsize-unpatched | 00:00:02.666103
>  select-medsize-unpatched | 00:00:02.673494
>  select-medsize-unpatched | 00:00:02.669645
>  select-medsize-unpatched | 00:00:02.666278
>  vacuum_clean-medsize-unpatched | 00:00:01.091356
>  vacuum_clean-medsize-unpatched | 00:00:01.923138
>  vacuum_clean-medsize-unpatched | 00:00:01.917213
>  vacuum_clean-medsize-unpatched | 00:00:01.917333
>  vacuum_hintbits-medsize-unpatched | 00:00:01.683718
>  vacuum_hintbits-medsize-unpatched | 00:00:01.864003
>  vacuum_hintbits-medsize-unpatched | 00:00:03.186596
>  vacuum_hintbits-medsize-unpatched | 00:00:02.16494
>  copy-medsize-patched | 00:02:35.113501
>  copy-medsize-patched | 00:02:25.269866
>  copy-medsize-patched | 00:02:31.881089
>  copy_nowal-medsize-patched | 00:01:00.254633
>  copy_nowal-medsize-patched | 00:01:04.630687
>  copy_nowal-medsize-patched | 00:01:03.729128
>  select-medsize-patched | 00:00:03.201837
>  select-medsize-patched | 00:00:01.332975
>  select-medsize-patched | 00:00:01.33014
>  select-medsize-patched | 00:00:01.332392
>  select-medsize-patched | 00:00:01.333498
>  select-medsize-patched | 00:00:01.332692
>  vacuum_clean-medsize-patched | 00:00:01.140189
>  vacuum_clean-medsize-patched | 00:00:01.062762
>  vacuum_clean-medsize-patched | 00:00:01.062402
>  vacuum_clean-medsize-patched | 00:00:01.07113
>  vacuum_hintbits-medsize-patched | 00:00:17.865446
>  vacuum_hintbits-medsize-patched | 00:00:15.162064
>  vacuum_hintbits-medsize-patched | 00:00:01.704651
>  vacuum_hintbits-medsize-patched | 00:00:02.671651
>
> This looks good to me, apart from a glitch in the last vacuum_hintbits
> tests. Selects and vacuums benefit significantly, as does non-WAL-logged
> copy.
>
> Not shown here, but I ran tests earlier with vacuum on a table that
> actually had dead tuples to be removed. In that test the patched version
> really shined, reducing the runtime to roughly 1/6th. That was the
> original motivation of this patch: not having to do a WAL flush on every
> page in the second phase of vacuum.
>
> The test script is attached. To use it:
>
> 1. Edit testscript.sh and change BIGTABLESIZE.
> 2. Start the postmaster.
> 3. Run the script, giving a test label as the argument, for example:
>    "./testscript.sh bigtable-patched"
>
> Also attached is the patch I used for the tests.
>
> I would appreciate it if people would download the patch and the script
> and repeat the tests on different hardware. I'm particularly interested
> in testing on a box with good I/O hardware, where selects on unpatched
> PostgreSQL are bottlenecked by CPU.
>
> Barring any surprises, I'm going to fix the remaining issue and submit a
> final patch, probably over the weekend.
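The ring behaviour described above is easier to see in isolation than in the full patch. What follows is a minimal, standalone C sketch of the seq-scan case only; it is not code from the patch, the identifiers (ring_pick_victim, needs_wal_flush, and so on) are invented for illustration, and the WAL-flush test is a stub. It shows the three rules spelled out above: a 32-buffer (256 kB) ring is reused for the scan, an unfilled slot sends the caller back to the normal clock sweep and the buffer obtained there joins the ring, and a buffer that would require a WAL flush is dropped from the ring instead of being reused.

/*
 * Standalone sketch of the seq-scan buffer ring policy. Illustrative
 * only: the names and the WAL check are made up, not the patch's.
 */
#include <stdbool.h>
#include <stdio.h>

#define RING_SLOTS 32            /* 256 kB worth of 8 kB pages */
#define SLOT_EMPTY (-1)

static int ring[RING_SLOTS];
static int ring_cur = 0;

/* Stand-in for asking the WAL code whether evicting this buffer would
 * force a WAL flush (the patch checks the page LSN via XLogNeedsFlush). */
static bool
needs_wal_flush(int buf_id)
{
	return buf_id % 7 == 0;      /* arbitrary stub */
}

static void
ring_init(void)
{
	int i;

	for (i = 0; i < RING_SLOTS; i++)
		ring[i] = SLOT_EMPTY;
	ring_cur = 0;
}

/*
 * Return the next victim buffer for the scan, or -1 to tell the caller
 * to fall back to the shared clock sweep: either the current slot hasn't
 * been filled yet, or reusing its buffer would force a WAL flush, in
 * which case the buffer is dropped from the ring.
 */
static int
ring_pick_victim(void)
{
	if (++ring_cur >= RING_SLOTS)
		ring_cur = 0;

	if (ring[ring_cur] == SLOT_EMPTY)
		return -1;

	if (needs_wal_flush(ring[ring_cur]))
	{
		ring[ring_cur] = SLOT_EMPTY;
		return -1;
	}
	return ring[ring_cur];
}

/* A buffer allocated by the fallback path joins the ring in the current
 * slot, so the scan keeps reusing it on later passes. */
static void
ring_record(int buf_id)
{
	if (ring[ring_cur] == SLOT_EMPTY)
		ring[ring_cur] = buf_id;
}

int
main(void)
{
	int page;

	ring_init();
	for (page = 0; page < 100; page++)
	{
		int victim = ring_pick_victim();

		if (victim < 0)
		{
			victim = 1000 + page;   /* pretend this came from the clock sweep */
			ring_record(victim);
		}
		printf("page %d -> buffer %d\n", page, victim);
	}
	return 0;
}

As the patch's README changes explain, keeping the ring this small lets the scan's working set stay in the CPU's L2 cache, which is where the roughly 2x speedup of the patched selects in the 2 GB results above comes from; COPY instead gets a ring of twice the WAL segment size (4096 buffers by default) so that a segment fills up, and is flushed, before the ring wraps around.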
> > (*) The issue with this patch is that if the buffer cache is completely > filled with dirty buffers that need a WAL flush to evict, the buffer > ring code will get into an infinite loop trying to find one that doesn't > need a WAL flush. Should be simple to fix. > > > ------------------------------------------------------------------------ > > Index: src/backend/access/heap/heapam.c > =================================================================== > RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v > retrieving revision 1.232 > diff -c -r1.232 heapam.c > *** src/backend/access/heap/heapam.c 8 Apr 2007 01:26:27 -0000 1.232 > --- src/backend/access/heap/heapam.c 16 May 2007 11:35:14 -0000 > *************** > *** 83,88 **** > --- 83,96 ---- > */ > scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd); > > + /* A scan on a table smaller than shared_buffers is treated like random > + * access, but bigger scans should use the bulk read replacement policy. > + */ > + if (scan->rs_nblocks > NBuffers) > + scan->rs_accesspattern = AP_BULKREAD; > + else > + scan->rs_accesspattern = AP_NORMAL; > + > scan->rs_inited = false; > scan->rs_ctup.t_data = NULL; > ItemPointerSetInvalid(&scan->rs_ctup.t_self); > *************** > *** 123,133 **** > --- 131,146 ---- > > Assert(page < scan->rs_nblocks); > > + /* Read the page with the right strategy */ > + SetAccessPattern(scan->rs_accesspattern); > + > scan->rs_cbuf = ReleaseAndReadBuffer(scan->rs_cbuf, > scan->rs_rd, > page); > scan->rs_cblock = page; > > + SetAccessPattern(AP_NORMAL); > + > if (!scan->rs_pageatatime) > return; > > Index: src/backend/access/transam/xlog.c > =================================================================== > RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v > retrieving revision 1.268 > diff -c -r1.268 xlog.c > *** src/backend/access/transam/xlog.c 30 Apr 2007 21:01:52 -0000 1.268 > --- src/backend/access/transam/xlog.c 15 May 2007 16:23:30 -0000 > *************** > *** 1668,1673 **** > --- 1668,1700 ---- > } > > /* > + * Returns true if 'record' hasn't been flushed to disk yet. > + */ > + bool > + XLogNeedsFlush(XLogRecPtr record) > + { > + /* Quick exit if already known flushed */ > + if (XLByteLE(record, LogwrtResult.Flush)) > + return false; > + > + /* read LogwrtResult and update local state */ > + { > + /* use volatile pointer to prevent code rearrangement */ > + volatile XLogCtlData *xlogctl = XLogCtl; > + > + SpinLockAcquire(&xlogctl->info_lck); > + LogwrtResult = xlogctl->LogwrtResult; > + SpinLockRelease(&xlogctl->info_lck); > + } > + > + /* check again */ > + if (XLByteLE(record, LogwrtResult.Flush)) > + return false; > + > + return true; > + } > + > + /* > * Ensure that all XLOG data through the given position is flushed to disk. > * > * NOTE: this differs from XLogWrite mainly in that the WALWriteLock is not > Index: src/backend/commands/copy.c > =================================================================== > RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v > retrieving revision 1.283 > diff -c -r1.283 copy.c > *** src/backend/commands/copy.c 27 Apr 2007 22:05:46 -0000 1.283 > --- src/backend/commands/copy.c 15 May 2007 17:05:29 -0000 > *************** > *** 1876,1881 **** > --- 1876,1888 ---- > nfields = file_has_oids ? (attr_count + 1) : attr_count; > field_strings = (char **) palloc(nfields * sizeof(char *)); > > + /* Use the special COPY buffer replacement strategy if WAL-logging > + * is enabled. 
If it's not, the pages we're writing are dirty but > + * don't need a WAL flush to write out, so the BULKREAD strategy > + * is more suitable. > + */ > + SetAccessPattern(use_wal ? AP_COPY : AP_BULKREAD); > + > /* Initialize state variables */ > cstate->fe_eof = false; > cstate->eol_type = EOL_UNKNOWN; > *************** > *** 2161,2166 **** > --- 2168,2176 ---- > cstate->filename))); > } > > + /* Reset buffer replacement strategy */ > + SetAccessPattern(AP_NORMAL); > + > /* > * If we skipped writing WAL, then we need to sync the heap (but not > * indexes since those use WAL anyway) > Index: src/backend/commands/vacuum.c > =================================================================== > RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/vacuum.c,v > retrieving revision 1.350 > diff -c -r1.350 vacuum.c > *** src/backend/commands/vacuum.c 16 Apr 2007 18:29:50 -0000 1.350 > --- src/backend/commands/vacuum.c 15 May 2007 17:06:18 -0000 > *************** > *** 421,431 **** > * Tell the buffer replacement strategy that vacuum is causing > * the IO > */ > ! StrategyHintVacuum(true); > > analyze_rel(relid, vacstmt); > > ! StrategyHintVacuum(false); > > if (use_own_xacts) > CommitTransactionCommand(); > --- 421,431 ---- > * Tell the buffer replacement strategy that vacuum is causing > * the IO > */ > ! SetAccessPattern(AP_VACUUM); > > analyze_rel(relid, vacstmt); > > ! SetAccessPattern(AP_NORMAL); > > if (use_own_xacts) > CommitTransactionCommand(); > *************** > *** 442,448 **** > /* Make sure cost accounting is turned off after error */ > VacuumCostActive = false; > /* And reset buffer replacement strategy, too */ > ! StrategyHintVacuum(false); > PG_RE_THROW(); > } > PG_END_TRY(); > --- 442,448 ---- > /* Make sure cost accounting is turned off after error */ > VacuumCostActive = false; > /* And reset buffer replacement strategy, too */ > ! SetAccessPattern(AP_NORMAL); > PG_RE_THROW(); > } > PG_END_TRY(); > *************** > *** 1088,1094 **** > * Tell the cache replacement strategy that vacuum is causing all > * following IO > */ > ! StrategyHintVacuum(true); > > /* > * Do the actual work --- either FULL or "lazy" vacuum > --- 1088,1094 ---- > * Tell the cache replacement strategy that vacuum is causing all > * following IO > */ > ! SetAccessPattern(AP_VACUUM); > > /* > * Do the actual work --- either FULL or "lazy" vacuum > *************** > *** 1098,1104 **** > else > lazy_vacuum_rel(onerel, vacstmt); > > ! StrategyHintVacuum(false); > > /* all done with this class, but hold lock until commit */ > relation_close(onerel, NoLock); > --- 1098,1104 ---- > else > lazy_vacuum_rel(onerel, vacstmt); > > ! SetAccessPattern(AP_NORMAL); > > /* all done with this class, but hold lock until commit */ > relation_close(onerel, NoLock); > Index: src/backend/storage/buffer/README > =================================================================== > RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/README,v > retrieving revision 1.11 > diff -c -r1.11 README > *** src/backend/storage/buffer/README 23 Jul 2006 03:07:58 -0000 1.11 > --- src/backend/storage/buffer/README 16 May 2007 11:43:11 -0000 > *************** > *** 152,159 **** > a field to show which backend is doing its I/O). > > > ! Buffer replacement strategy > ! --------------------------- > > There is a "free list" of buffers that are prime candidates for replacement. 
> In particular, buffers that are completely free (contain no valid page) are > --- 152,159 ---- > a field to show which backend is doing its I/O). > > > ! Normal buffer replacement strategy > ! ---------------------------------- > > There is a "free list" of buffers that are prime candidates for replacement. > In particular, buffers that are completely free (contain no valid page) are > *************** > *** 199,221 **** > have to give up and try another buffer. This however is not a concern > of the basic select-a-victim-buffer algorithm.) > > - A special provision is that while running VACUUM, a backend does not > - increment the usage count on buffers it accesses. In fact, if ReleaseBuffer > - sees that it is dropping the pin count to zero and the usage count is zero, > - then it appends the buffer to the tail of the free list. (This implies that > - VACUUM, but only VACUUM, must take the BufFreelistLock during ReleaseBuffer; > - this shouldn't create much of a contention problem.) This provision > - encourages VACUUM to work in a relatively small number of buffers rather > - than blowing out the entire buffer cache. It is reasonable since a page > - that has been touched only by VACUUM is unlikely to be needed again soon. > - > - Since VACUUM usually requests many pages very fast, the effect of this is that > - it will get back the very buffers it filled and possibly modified on the next > - call and will therefore do its work in a few shared memory buffers, while > - being able to use whatever it finds in the cache already. This also implies > - that most of the write traffic caused by a VACUUM will be done by the VACUUM > - itself and not pushed off onto other processes. > > > Background writer's processing > ------------------------------ > --- 199,243 ---- > have to give up and try another buffer. This however is not a concern > of the basic select-a-victim-buffer algorithm.) > > > + Buffer ring replacement strategy > + --------------------------------- > + > + When running a query that needs to access a large number of pages, like VACUUM, > + COPY, or a large sequential scan, a different strategy is used. A page that > + has been touched only by such a scan is unlikely to be needed again soon, so > + instead of running the normal clock sweep algorithm and blowing out the entire > + buffer cache, a small ring of buffers is allocated using the normal clock sweep > + algorithm and those buffers are reused for the whole scan. This also implies > + that most of the write traffic caused by such a statement will be done by the > + backend itself and not pushed off onto other processes. > + > + The size of the ring used depends on the kind of scan: > + > + For sequential scans, a small 256 KB ring is used. That's small enough to fit > + in L2 cache, which makes transferring pages from OS cache to shared buffer > + cache efficient. Even less would often be enough, but the ring must be big > + enough to accommodate all pages in the scan that are pinned concurrently. > + 256 KB should also be enough to leave a small cache trail for other backends to > + join in a synchronized seq scan. If a buffer is dirtied and LSN set, the buffer > + is removed from the ring and a replacement buffer is chosen using the normal > + replacement strategy. In a scan that modifies every page in the scan, like a > + bulk UPDATE or DELETE, the buffers in the ring will always be dirtied and the > + ring strategy effectively degrades to the normal strategy. 
> + > + VACUUM uses a 256 KB ring like sequential scans, but dirty pages are not > + removed from the ring. WAL is flushed instead to allow reuse of the buffers. > + Before introducing the buffer ring strategy in 8.3, buffers were put to the > + freelist, which was effectively a buffer ring of 1 buffer. > + > + COPY behaves like VACUUM, but a much larger ring is used. The ring size is > + chosen to be twice the WAL segment size. This avoids polluting the buffer cache > + like the clock sweep would do, and using a ring larger than WAL segment size > + avoids having to do any extra WAL flushes, since a WAL segment will always be > + filled, forcing a WAL flush, before looping through the buffer ring and bumping > + into a buffer that would force a WAL flush. However, for non-WAL-logged COPY > + operations the smaller 256 KB ring is used because WAL flushes are not needed > + to write the buffers. > > Background writer's processing > ------------------------------ > Index: src/backend/storage/buffer/bufmgr.c > =================================================================== > RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v > retrieving revision 1.218 > diff -c -r1.218 bufmgr.c > *** src/backend/storage/buffer/bufmgr.c 2 May 2007 23:34:48 -0000 1.218 > --- src/backend/storage/buffer/bufmgr.c 16 May 2007 12:34:10 -0000 > *************** > *** 419,431 **** > /* Loop here in case we have to try another victim buffer */ > for (;;) > { > /* > * Select a victim buffer. The buffer is returned with its header > * spinlock still held! Also the BufFreelistLock is still held, since > * it would be bad to hold the spinlock while possibly waking up other > * processes. > */ > ! buf = StrategyGetBuffer(); > > Assert(buf->refcount == 0); > > --- 419,433 ---- > /* Loop here in case we have to try another victim buffer */ > for (;;) > { > + bool lock_held; > + > /* > * Select a victim buffer. The buffer is returned with its header > * spinlock still held! Also the BufFreelistLock is still held, since > * it would be bad to hold the spinlock while possibly waking up other > * processes. > */ > ! buf = StrategyGetBuffer(&lock_held); > > Assert(buf->refcount == 0); > > *************** > *** 436,442 **** > PinBuffer_Locked(buf); > > /* Now it's safe to release the freelist lock */ > ! LWLockRelease(BufFreelistLock); > > /* > * If the buffer was dirty, try to write it out. There is a race > --- 438,445 ---- > PinBuffer_Locked(buf); > > /* Now it's safe to release the freelist lock */ > ! if (lock_held) > ! LWLockRelease(BufFreelistLock); > > /* > * If the buffer was dirty, try to write it out. There is a race > *************** > *** 464,469 **** > --- 467,489 ---- > */ > if (LWLockConditionalAcquire(buf->content_lock, LW_SHARED)) > { > + /* In BULKREAD-mode, check if a WAL flush would be needed to > + * evict this buffer. If so, ask the replacement strategy if > + * we should go ahead and do it or choose another victim. 
> + */ > + if (active_access_pattern == AP_BULKREAD) > + { > + if (XLogNeedsFlush(BufferGetLSN(buf))) > + { > + if (StrategyRejectBuffer(buf)) > + { > + LWLockRelease(buf->content_lock); > + UnpinBuffer(buf, true, false); > + continue; > + } > + } > + } > + > FlushBuffer(buf, NULL); > LWLockRelease(buf->content_lock); > } > *************** > *** 925,932 **** > PrivateRefCount[b]--; > if (PrivateRefCount[b] == 0) > { > - bool immed_free_buffer = false; > - > /* I'd better not still hold any locks on the buffer */ > Assert(!LWLockHeldByMe(buf->content_lock)); > Assert(!LWLockHeldByMe(buf->io_in_progress_lock)); > --- 945,950 ---- > *************** > *** 940,956 **** > /* Update buffer usage info, unless this is an internal access */ > if (normalAccess) > { > ! if (!strategy_hint_vacuum) > { > ! if (buf->usage_count < BM_MAX_USAGE_COUNT) > ! buf->usage_count++; > } > else > ! { > ! /* VACUUM accesses don't bump usage count, instead... */ > ! if (buf->refcount == 0 && buf->usage_count == 0) > ! immed_free_buffer = true; > ! } > } > > if ((buf->flags & BM_PIN_COUNT_WAITER) && > --- 958,975 ---- > /* Update buffer usage info, unless this is an internal access */ > if (normalAccess) > { > ! if (active_access_pattern != AP_NORMAL) > { > ! /* We don't want large one-off scans like vacuum to inflate > ! * the usage_count. We do want to set it to 1, though, to keep > ! * other backends from hijacking it from the buffer ring. > ! */ > ! if (buf->usage_count == 0) > ! buf->usage_count = 1; > } > else > ! if (buf->usage_count < BM_MAX_USAGE_COUNT) > ! buf->usage_count++; > } > > if ((buf->flags & BM_PIN_COUNT_WAITER) && > *************** > *** 965,978 **** > } > else > UnlockBufHdr(buf); > - > - /* > - * If VACUUM is releasing an otherwise-unused buffer, send it to the > - * freelist for near-term reuse. We put it at the tail so that it > - * won't be used before any invalid buffers that may exist. > - */ > - if (immed_free_buffer) > - StrategyFreeBuffer(buf, false); > } > } > > --- 984,989 ---- > Index: src/backend/storage/buffer/freelist.c > =================================================================== > RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/freelist.c,v > retrieving revision 1.58 > diff -c -r1.58 freelist.c > *** src/backend/storage/buffer/freelist.c 5 Jan 2007 22:19:37 -0000 1.58 > --- src/backend/storage/buffer/freelist.c 17 May 2007 16:12:56 -0000 > *************** > *** 18,23 **** > --- 18,25 ---- > #include "storage/buf_internals.h" > #include "storage/bufmgr.h" > > + #include "utils/memutils.h" > + > > /* > * The shared freelist control information. > *************** > *** 39,47 **** > /* Pointers to shared state */ > static BufferStrategyControl *StrategyControl = NULL; > > ! /* Backend-local state about whether currently vacuuming */ > ! bool strategy_hint_vacuum = false; > > > /* > * StrategyGetBuffer > --- 41,53 ---- > /* Pointers to shared state */ > static BufferStrategyControl *StrategyControl = NULL; > > ! /* Currently active access pattern hint. */ > ! AccessPattern active_access_pattern = AP_NORMAL; > > + /* prototypes for internal functions */ > + static volatile BufferDesc *GetBufferFromRing(void); > + static void PutBufferToRing(volatile BufferDesc *buf); > + static void InitRing(void); > > /* > * StrategyGetBuffer > *************** > *** 51,67 **** > * the selected buffer must not currently be pinned by anyone. > * > * To ensure that no one else can pin the buffer before we do, we must > ! 
* return the buffer with the buffer header spinlock still held. That > ! * means that we return with the BufFreelistLock still held, as well; > ! * the caller must release that lock once the spinlock is dropped. > */ > volatile BufferDesc * > ! StrategyGetBuffer(void) > { > volatile BufferDesc *buf; > int trycounter; > > LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE); > > /* > * Try to get a buffer from the freelist. Note that the freeNext fields > --- 57,89 ---- > * the selected buffer must not currently be pinned by anyone. > * > * To ensure that no one else can pin the buffer before we do, we must > ! * return the buffer with the buffer header spinlock still held. If > ! * *lock_held is set at return, we return with the BufFreelistLock still > ! * held, as well; the caller must release that lock once the spinlock is > ! * dropped. > */ > volatile BufferDesc * > ! StrategyGetBuffer(bool *lock_held) > { > volatile BufferDesc *buf; > int trycounter; > > + /* Get a buffer from the ring if we're doing a bulk scan */ > + if (active_access_pattern != AP_NORMAL) > + { > + buf = GetBufferFromRing(); > + if (buf != NULL) > + { > + *lock_held = false; > + return buf; > + } > + } > + > + /* > + * If our selected buffer wasn't available, pick another... > + */ > LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE); > + *lock_held = true; > > /* > * Try to get a buffer from the freelist. Note that the freeNext fields > *************** > *** 86,96 **** > */ > LockBufHdr(buf); > if (buf->refcount == 0 && buf->usage_count == 0) > return buf; > UnlockBufHdr(buf); > } > > ! /* Nothing on the freelist, so run the "clock sweep" algorithm */ > trycounter = NBuffers; > for (;;) > { > --- 108,122 ---- > */ > LockBufHdr(buf); > if (buf->refcount == 0 && buf->usage_count == 0) > + { > + if (active_access_pattern != AP_NORMAL) > + PutBufferToRing(buf); > return buf; > + } > UnlockBufHdr(buf); > } > > ! /* Nothing on the freelist, so run the shared "clock sweep" algorithm */ > trycounter = NBuffers; > for (;;) > { > *************** > *** 105,111 **** > --- 131,141 ---- > */ > LockBufHdr(buf); > if (buf->refcount == 0 && buf->usage_count == 0) > + { > + if (active_access_pattern != AP_NORMAL) > + PutBufferToRing(buf); > return buf; > + } > if (buf->usage_count > 0) > { > buf->usage_count--; > *************** > *** 191,204 **** > } > > /* > ! * StrategyHintVacuum -- tell us whether VACUUM is active > */ > void > ! StrategyHintVacuum(bool vacuum_active) > { > ! strategy_hint_vacuum = vacuum_active; > ! } > > > /* > * StrategyShmemSize > --- 221,245 ---- > } > > /* > ! * SetAccessPattern -- Sets the active access pattern hint > ! * > ! * Caller is responsible for resetting the hint to AP_NORMAL after the bulk > ! * operation is done. It's ok to switch repeatedly between AP_NORMAL and one of > ! * the other strategies, for example in a query with one large sequential scan > ! * nested loop joined to an index scan. Index tuples should be fetched with the > ! * normal strategy and the pages from the seq scan should be read in with the > ! * AP_BULKREAD strategy. The ring won't be affected by such switching, however > ! * switching to an access pattern with different ring size will invalidate the > ! * old ring. > */ > void > ! SetAccessPattern(AccessPattern new_pattern) > { > ! 
active_access_pattern = new_pattern; > > + if (active_access_pattern != AP_NORMAL) > + InitRing(); > + } > > /* > * StrategyShmemSize > *************** > *** 274,276 **** > --- 315,498 ---- > else > Assert(!init); > } > + > + /* ---------------------------------------------------------------- > + * Backend-private buffer ring management > + * ---------------------------------------------------------------- > + */ > + > + /* > + * Ring sizes for different access patterns. See README for the rationale > + * of these. > + */ > + #define BULKREAD_RING_SIZE 256 * 1024 / BLCKSZ > + #define VACUUM_RING_SIZE 256 * 1024 / BLCKSZ > + #define COPY_RING_SIZE Min(NBuffers / 8, (XLOG_SEG_SIZE / BLCKSZ) * 2) > + > + /* > + * BufferRing is an array of buffer ids, and RingSize it's size in number of > + * elements. It's allocated in TopMemoryContext the first time it's needed. > + */ > + static int *BufferRing = NULL; > + static int RingSize = 0; > + > + /* Index of the "current" slot in the ring. It's advanced every time a buffer > + * is handed out from the ring with GetBufferFromRing and it points to the > + * last buffer returned from the ring. RingCurSlot + 1 is the next victim > + * GetBufferRing will hand out. > + */ > + static int RingCurSlot = 0; > + > + /* magic value to mark empty slots in the ring */ > + #define BUF_ID_NOT_SET -1 > + > + > + /* > + * GetBufferFromRing -- returns a buffer from the ring, or NULL if the > + * ring is empty. > + * > + * The bufhdr spin lock is held on the returned buffer. > + */ > + static volatile BufferDesc * > + GetBufferFromRing(void) > + { > + volatile BufferDesc *buf; > + > + /* ring should be initialized by now */ > + Assert(RingSize > 0 && BufferRing != NULL); > + > + /* Run private "clock cycle" */ > + if (++RingCurSlot >= RingSize) > + RingCurSlot = 0; > + > + /* > + * If that slot hasn't been filled yet, tell the caller to allocate > + * a new buffer with the normal allocation strategy. He will then > + * fill this slot by calling PutBufferToRing with the new buffer. > + */ > + if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET) > + return NULL; > + > + buf = &BufferDescriptors[BufferRing[RingCurSlot]]; > + > + /* > + * If the buffer is pinned we cannot use it under any circumstances. > + * If usage_count == 0 then the buffer is fair game. > + * > + * We also choose this buffer if usage_count == 1. Strictly, this > + * might sometimes be the wrong thing to do, but we rely on the high > + * probability that it was this process that last touched the buffer. > + * If it wasn't, we'll choose a suboptimal victim, but it shouldn't > + * make any difference in the big scheme of things. > + * > + */ > + LockBufHdr(buf); > + if (buf->refcount == 0 && buf->usage_count <= 1) > + return buf; > + UnlockBufHdr(buf); > + > + return NULL; > + } > + > + /* > + * PutBufferToRing -- adds a buffer to the buffer ring > + * > + * Caller must hold the buffer header spinlock on the buffer. > + */ > + static void > + PutBufferToRing(volatile BufferDesc *buf) > + { > + /* ring should be initialized by now */ > + Assert(RingSize > 0 && BufferRing != NULL); > + > + if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET) > + BufferRing[RingCurSlot] = buf->buf_id; > + } > + > + /* > + * Initializes a ring buffer with correct size for the currently > + * active strategy. Does nothing if the ring already has the right size. 
> + */ > + static void > + InitRing(void) > + { > + int new_size; > + int old_size = RingSize; > + int i; > + MemoryContext oldcxt; > + > + /* Determine new size */ > + > + switch(active_access_pattern) > + { > + case AP_BULKREAD: > + new_size = BULKREAD_RING_SIZE; > + break; > + case AP_COPY: > + new_size = COPY_RING_SIZE; > + break; > + case AP_VACUUM: > + new_size = VACUUM_RING_SIZE; > + break; > + default: > + elog(ERROR, "unexpected buffer cache strategy %d", > + active_access_pattern); > + return; /* keep compile happy */ > + } > + > + /* > + * Seq scans set and reset the strategy on every page, so we better exit > + * quickly if no change in size is needed. > + */ > + if (new_size == old_size) > + return; > + > + /* Allocate array */ > + > + oldcxt = MemoryContextSwitchTo(TopMemoryContext); > + > + if (old_size == 0) > + { > + Assert(BufferRing == NULL); > + BufferRing = palloc(new_size * sizeof(int)); > + } > + else > + BufferRing = repalloc(BufferRing, new_size * sizeof(int)); > + > + MemoryContextSwitchTo(oldcxt); > + > + for(i = 0; i < new_size; i++) > + BufferRing[i] = BUF_ID_NOT_SET; > + > + RingCurSlot = 0; > + RingSize = new_size; > + } > + > + /* > + * Buffer manager calls this function in AP_BULKREAD mode when the > + * buffer handed to it turns out to need a WAL flush to write out. This > + * gives the strategy a second chance to choose another victim. > + * > + * Returns true if buffer manager should ask for a new victim, and false > + * if WAL should be flushed and this buffer used. > + */ > + bool > + StrategyRejectBuffer(volatile BufferDesc *buf) > + { > + Assert(RingSize > 0); > + > + if (BufferRing[RingCurSlot] == buf->buf_id) > + { > + BufferRing[RingCurSlot] = BUF_ID_NOT_SET; > + return true; > + } > + else > + { > + /* Apparently the buffer didn't come from the ring. We don't want to > + * mess with how the clock sweep works; in worst case there's no > + * buffers in the buffer cache that can be reused without a WAL flush, > + * and we'd get into an endless loop trying. > + */ > + return false; > + } > + } > Index: src/include/access/relscan.h > =================================================================== > RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/relscan.h,v > retrieving revision 1.52 > diff -c -r1.52 relscan.h > *** src/include/access/relscan.h 20 Jan 2007 18:43:35 -0000 1.52 > --- src/include/access/relscan.h 15 May 2007 17:01:31 -0000 > *************** > *** 28,33 **** > --- 28,34 ---- > ScanKey rs_key; /* array of scan key descriptors */ > BlockNumber rs_nblocks; /* number of blocks to scan */ > bool rs_pageatatime; /* verify visibility page-at-a-time? 
*/ > + AccessPattern rs_accesspattern; /* access pattern to use for reads */ > > /* scan current state */ > bool rs_inited; /* false = scan not init'd yet */ > Index: src/include/access/xlog.h > =================================================================== > RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v > retrieving revision 1.76 > diff -c -r1.76 xlog.h > *** src/include/access/xlog.h 5 Jan 2007 22:19:51 -0000 1.76 > --- src/include/access/xlog.h 14 May 2007 21:22:40 -0000 > *************** > *** 151,156 **** > --- 151,157 ---- > > extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata); > extern void XLogFlush(XLogRecPtr RecPtr); > + extern bool XLogNeedsFlush(XLogRecPtr RecPtr); > > extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record); > extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec); > Index: src/include/storage/buf_internals.h > =================================================================== > RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/buf_internals.h,v > retrieving revision 1.89 > diff -c -r1.89 buf_internals.h > *** src/include/storage/buf_internals.h 5 Jan 2007 22:19:57 -0000 1.89 > --- src/include/storage/buf_internals.h 15 May 2007 17:07:59 -0000 > *************** > *** 16,21 **** > --- 16,22 ---- > #define BUFMGR_INTERNALS_H > > #include "storage/buf.h" > + #include "storage/bufmgr.h" > #include "storage/lwlock.h" > #include "storage/shmem.h" > #include "storage/spin.h" > *************** > *** 168,174 **** > extern BufferDesc *LocalBufferDescriptors; > > /* in freelist.c */ > ! extern bool strategy_hint_vacuum; > > /* event counters in buf_init.c */ > extern long int ReadBufferCount; > --- 169,175 ---- > extern BufferDesc *LocalBufferDescriptors; > > /* in freelist.c */ > ! extern AccessPattern active_access_pattern; > > /* event counters in buf_init.c */ > extern long int ReadBufferCount; > *************** > *** 184,195 **** > */ > > /* freelist.c */ > ! extern volatile BufferDesc *StrategyGetBuffer(void); > extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head); > extern int StrategySyncStart(void); > extern Size StrategyShmemSize(void); > extern void StrategyInitialize(bool init); > > /* buf_table.c */ > extern Size BufTableShmemSize(int size); > extern void InitBufTable(int size); > --- 185,198 ---- > */ > > /* freelist.c */ > ! 
extern volatile BufferDesc *StrategyGetBuffer(bool *lock_held);
> extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head);
> extern int StrategySyncStart(void);
> extern Size StrategyShmemSize(void);
> extern void StrategyInitialize(bool init);
>
> + extern bool StrategyRejectBuffer(volatile BufferDesc *buf);
> +
> /* buf_table.c */
> extern Size BufTableShmemSize(int size);
> extern void InitBufTable(int size);
> Index: src/include/storage/bufmgr.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v
> retrieving revision 1.103
> diff -c -r1.103 bufmgr.h
> *** src/include/storage/bufmgr.h 2 May 2007 23:18:03 -0000 1.103
> --- src/include/storage/bufmgr.h 15 May 2007 17:07:02 -0000
> ***************
> *** 48,53 ****
> --- 48,61 ----
> #define BUFFER_LOCK_SHARE 1
> #define BUFFER_LOCK_EXCLUSIVE 2
>
> + typedef enum AccessPattern
> + {
> + AP_NORMAL, /* Normal random access */
> + AP_BULKREAD, /* Large read-only scan (hint bit updates are ok) */
> + AP_COPY, /* Large updating scan, like COPY with WAL enabled */
> + AP_VACUUM, /* VACUUM */
> + } AccessPattern;
> +
> /*
> * These routines are beaten on quite heavily, hence the macroization.
> */
> ***************
> *** 157,162 ****
> extern void AtProcExit_LocalBuffers(void);
>
> /* in freelist.c */
> ! extern void StrategyHintVacuum(bool vacuum_active);
>
> #endif
> --- 165,170 ----
> extern void AtProcExit_LocalBuffers(void);
>
> /* in freelist.c */
> ! extern void SetAccessPattern(AccessPattern new_pattern);
>
> #endif

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

/*
 * Test data generator (the attachment mentioned at the top). Writes
 * tab-separated rows to stdout, sized so that roughly TUPLES_PER_PAGE
 * rows fit on one 8 kB heap page. The argument is the target table
 * size in megabytes.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TUPLES_PER_PAGE 15

int
main(int argc, char **argv)
{
	int tablesize;
	int lines;
	char buf[1000];
	int i;

	if (argc != 2)
	{
		exit(1);
	}

	/* 500-character filler column */
	memset(buf, 'a', 500);
	buf[500] = '\0';

	/*
	 * table size (MB) -> 8192-byte pages -> rows; divide first so the
	 * intermediate value doesn't overflow int for multi-GB sizes.
	 */
	tablesize = atoi(argv[1]);
	lines = tablesize * (1024 * 1024 / 8192) * TUPLES_PER_PAGE;

	for (i = 1; i <= lines; i++)
		printf("%d\t%s\n", i, buf);

	return 0;
}
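For reference on the generator's output: with the default 8 kB block size and TUPLES_PER_PAGE set to 15, an argument of 10240 (the 10 GB test table) produces 10240 * 128 * 15 = 19,660,800 rows, each consisting of a sequence number, a tab, and a 500-character filler string. That is the layout PostgreSQL's default COPY text format expects, so the output can be piped straight into a COPY ... FROM STDIN on a matching two-column table; the corresponding CREATE TABLE is not shown in this message, so any table or column names used to load it are up to the tester.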