Re: Synchronized Scan WIP patch - Mailing list pgsql-patches
From | Heikki Linnakangas |
---|---|
Subject | Re: Synchronized Scan WIP patch |
Date | |
Msg-id | 465E829B.3060707@enterprisedb.com Whole thread Raw |
In response to | Synchronized Scan WIP patch (Jeff Davis <pgsql@j-davis.com>) |
Responses |
Re: Synchronized Scan WIP patch
|
List | pgsql-patches |
Here's a work-in-progress update of this patch. I haven't done any major changes, but a lot of little refactoring and commenting, including: * moved the sync scan stuff to a new file access/heapam/syncscan.c. heapam.c is long enough already, and in theory the same mechanism could be used for large bitmap heap scans in the future. * initialization of the shared memory structures is done at postmaster startup, instead of lazily the first time it's needed * removed the freelist from the shared mem structure. Instead all entries are initialized with invalid values, so they will be consumed from the end of the LRU without any special handling. * changed many variable and struct names to suit my personal taste. TODO: * Only report the new scan location when the page was not in buffer cache. This should reduce the congestion when all synchronized scanners try to update the location at the same time. Should probably test to see if it's worth optimizing first. * decide what to do with the threshold. Should use the same threshold as the buffer ring stuff. * decide the size of the LRU list. Testing: * Multiple scans on different tables, causing movement in the LRU list * Measure the CPU overhead for a single scan * Measure the lock contention with multiple scanners Any thoughts? Any additional test cases you'd like to see? Jeff Davis wrote: > This is my latest revision of the Sync Scan patch, and it implements the > observability as discussed with Simon. > > Changes: > * ss_report_loc() called once per hundred pages rather than once per > page > * DEBUG messages are a little cleaner and easier to parse, for the sake > of analysis after the fact. > * DEBUG2 reports a sync scan starting, the relation size in pages, and > the location at which the scan starts. > * DEBUG2 reports the location of a scan every 50k pages, DEBUG3 every > 5k pages (before it was 100k/10k at DEBUG3/DEBUG4, respectively). > Numbers are aligned along 5k boundaries to make analysis easier. > * GUCs: > * sync_seqscan_threshold: fraction of NBuffers for the threshold > * sync_seqscan_offset: fraction of NBuffers for the offset > * trace_sync_seqscan: will be used in final version of patch to > control DEBUG output > > Sync_scan_offset may be eliminated completely if it's not shown to be > useful enough in conjunction with Simon's patch. Sync Scans are still a > big win without sync_seqscan_offset. > > Sync_scan_threshold=<real> may be turned into sync_seqscan=<boolean> > with a fixed activation threshold (NBuffers/2 per Simon's suggestion). > The reason is that synchronized scans should activate at the same > threshold as Simon's scan_recycle_buffers feature. Should we make a > "#define BIG_SCAN_THRESHOLD NBuffers/2" to use for both sync_seqscan and > for scan_recycle_buffers? > > Regards, > Jeff Davis > > > ------------------------------------------------------------------------ > > diff -cr postgresql-8.2.3/src/backend/access/heap/heapam.c postgresql-8.2.3-syncscan/src/backend/access/heap/heapam.c > *** postgresql-8.2.3/src/backend/access/heap/heapam.c 2007-02-04 12:00:49.000000000 -0800 > --- postgresql-8.2.3-syncscan/src/backend/access/heap/heapam.c 2007-03-13 23:21:27.000000000 -0700 > *************** > *** 65,70 **** > --- 65,279 ---- > * ---------------------------------------------------------------- > */ > > + static BlockNumber ss_init(HeapScanDesc); > + static int ss_store_hint(HeapScanDesc,BlockNumber); > + static int ss_hash(HeapScanDesc); > + bool Trace_sync_seqscan = false; > + double sync_seqscan_threshold = DEFAULT_SYNC_SCAN_THRESHOLD; > + double sync_seqscan_offset = DEFAULT_SYNC_SCAN_OFFSET; > + > + /* > + * ss_init: > + * > + * This function reads the Sync Scan Hint Table > + * (creating it if it doesn't already exist) to > + * find a possible location for an already running > + * sequential scan on this relation. > + * > + * By starting a sequential scan near the location > + * of an already running scan, we improve the chance > + * of finding pages in cache. > + * > + * Also, depending on SYNC_SCAN_START_OFFSET, this > + * function will subtract from the hint before > + * starting the scan, in order to pick up pages that > + * are likely to already be in cache. > + * > + * This function assumes that scan->rs_nblocks is > + * already properly set, and sets scan->rs_start_page > + * to a value based on the hint found. Also, it sets > + * scan->rs_hint to point to the location of the hint > + * in the hint table. > + */ > + static BlockNumber ss_init(HeapScanDesc scan) > + { > + ss_hint_t *hint_table; > + int table_offset; > + bool found; > + int threshold = sync_seqscan_threshold * NBuffers; > + int offset = sync_seqscan_offset * NBuffers; > + > + /* > + * If the table is not large compared to effective_cache_size, > + * don't Sync Scan. > + */ > + if(scan->rs_nblocks < threshold) > + { > + elog(DEBUG2,"SYNC_SCAN: Table too small to sync scan"); > + scan->rs_start_page = 0; > + return 0; > + } > + > + table_offset = ss_hash(scan); > + hint_table = (ss_hint_t*)ShmemInitStruct("Sync Scan Hint Table", > + SYNC_SCAN_TABLE_SIZE*sizeof(ss_hint_t),&found); > + > + scan->rs_hint = &hint_table[table_offset]; > + > + /* > + * If we just created the hint table for the first time, > + * initialize the table to zero and start the scan at page 0. > + */ > + if(!found) { > + elog(DEBUG2,"SYNC_SCAN: Created Hint Table"); > + memset(hint_table,0,sizeof(ss_hint_t)*SYNC_SCAN_TABLE_SIZE); > + scan->rs_start_page = 0; > + return 0; > + } > + > + /* > + * If the hint's relid is 0, that means > + * we have not previously created a hint > + * at this location in the table. > + */ > + if(scan->rs_hint->relid == 0) { > + elog(DEBUG2, "SYNC_SCAN: Hint empty"); > + scan->rs_start_page = 0; > + return 0; > + } > + > + /* > + * If the relid doesn't match the one in the hint, > + * we have a hash collision. > + */ > + if(RelationGetRelid(scan->rs_rd) != scan->rs_hint->relid) > + { > + elog(DEBUG1,"SYNC_SCAN: Hash collision"); > + scan->rs_start_page = 0; > + return 0; > + } > + > + /* > + * If the hint is not a valid block number > + * for this relation, start at 0. > + * > + * This can happen if, for instance, someone > + * TRUNCATEd the table between when the hint > + * was set and now. > + */ > + if(scan->rs_hint->location < 0 || > + scan->rs_hint->location >= scan->rs_nblocks) > + { > + elog(DEBUG2,"SYNC_SCAN: Hint %d out of range." \ > + " Relation has %d pages.", > + scan->rs_hint->location,scan->rs_nblocks); > + scan->rs_start_page = 0; > + return 0; > + } > + > + scan->rs_start_page = scan->rs_hint->location; > + > + /* > + * By starting at offset earlier than the hint, > + * it's likely that all of the blocks will already be > + * cached, and the scan will quickly catch up to the head. > + * > + * offset is a positive value that will be > + * subtracted from the hint. > + */ > + if(offset > scan->rs_nblocks) > + { > + elog(DEBUG2,"SYNC_SCAN: Relation smaller than start offset: %d", > + offset); > + return 0; > + } > + > + /* > + * If subtracting the offset would bring the value > + * to less than 0, we circle backwards to the end of the > + * file. > + */ > + if(offset > scan->rs_start_page) > + scan->rs_start_page += scan->rs_nblocks; > + > + scan->rs_start_page -= offset; > + > + elog(DEBUG2,"SYNC_SCAN: START: OID = %d; Location = %d; Size: %d", > + RelationGetRelid(scan->rs_rd), > + scan->rs_start_page,scan->rs_nblocks); > + > + return 0; > + } > + > + /* > + * ss_store_hint: > + * > + * Writes an entry in the Sync Scan Hint Table > + * of the form (relid,blocknumber). This will > + * overwrite any existing entry that may collide > + * with this entry in the table. > + * > + * No locking is performed here. When this data is > + * later read by ss_init(), sanity checking is > + * performed to ensure we don't use an invalid > + * relation block number. > + */ > + static int ss_store_hint(HeapScanDesc scan, BlockNumber location) > + { > + ss_hint_t hint; > + int threshold = sync_seqscan_threshold * NBuffers; > + int offset = sync_seqscan_offset * NBuffers; > + > + /* > + * Print every 100k pages to DEBUG3 > + * and every 10k pages to DEBUG4. > + */ > + if (!(location%50000)) > + elog(DEBUG2,"page: %d",location); > + else if (!(location%5000)) > + elog(DEBUG3,"page: %d",location); > + > + /* > + * If the table is too small, don't bother > + * with Sync Scan. > + */ > + if(scan->rs_nblocks < threshold) > + return 0; > + > + /* > + * If this scan has been progressing for less > + * than offset pages, don't store the hint. > + */ > + if(location >= scan->rs_start_page) > + { > + if((location - scan->rs_start_page) < offset) > + return 0; > + } > + else > + { > + if((location + scan->rs_nblocks - scan->rs_start_page) > + < offset) > + return 0; > + } > + > + hint.relid = RelationGetRelid(scan->rs_rd); > + hint.location = location; > + > + *scan->rs_hint = hint; > + > + return 0; > + } > + > + /* > + * This is a simplistic function to hash > + * the Oid of the relation for placement in > + * the Sync Scan Hint Table > + */ > + static int ss_hash(HeapScanDesc scan) > + { > + return RelationGetRelid(scan->rs_rd) % SYNC_SCAN_TABLE_SIZE; > + } > + > /* ---------------- > * initscan - scan code common to heap_beginscan and heap_rescan > * ---------------- > *************** > *** 81,86 **** > --- 290,300 ---- > */ > scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd); > > + /* > + * Choose an good place to start the relation scan. > + */ > + ss_init(scan); > + > scan->rs_inited = false; > scan->rs_ctup.t_data = NULL; > ItemPointerSetInvalid(&scan->rs_ctup.t_self); > *************** > *** 223,229 **** > tuple->t_data = NULL; > return; > } > ! page = 0; /* first page */ > heapgetpage(scan, page); > lineoff = FirstOffsetNumber; /* first offnum */ > scan->rs_inited = true; > --- 437,447 ---- > tuple->t_data = NULL; > return; > } > ! /* > ! * start the scan at the location that we chose > ! * in ss_init() > ! */ > ! page = scan->rs_start_page; > heapgetpage(scan, page); > lineoff = FirstOffsetNumber; /* first offnum */ > scan->rs_inited = true; > *************** > *** 364,378 **** > } > > /* > ! * if we get here, it means we've exhausted the items on this page and > * it's time to move to the next. > */ > LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK); > > /* > ! * return NULL if we've exhausted all the pages > */ > ! if (backward ? (page == 0) : (page + 1 >= scan->rs_nblocks)) > { > if (BufferIsValid(scan->rs_cbuf)) > ReleaseBuffer(scan->rs_cbuf); > --- 582,611 ---- > } > > /* > ! * If we get here, it means we've exhausted the items on this page and > * it's time to move to the next. > + * > + * For the forward scan, we need to wrap around to the beginning > + * of the relation file if we reach the end. > */ > LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK); > > + if(backward) > + page--; > + else > + page = (page + 1) % (scan->rs_nblocks); > + > + if(! (page % SYNC_SCAN_REPORT_INTERVAL) ) > + ss_store_hint(scan,page); > + > /* > ! * Return NULL if we've exhausted all the pages. > ! * For reverse scans, that means we've reached 0. For > ! * forward scans, that means we've reached the page on > ! * which we started. > */ > ! if ((backward && (page == 0)) || > ! ((page%(scan->rs_nblocks)) == scan->rs_start_page)) > { > if (BufferIsValid(scan->rs_cbuf)) > ReleaseBuffer(scan->rs_cbuf); > *************** > *** 383,390 **** > return; > } > > - page = backward ? (page - 1) : (page + 1); > - > heapgetpage(scan, page); > > LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE); > --- 616,621 ---- > *************** > *** 450,456 **** > tuple->t_data = NULL; > return; > } > ! page = 0; /* first page */ > heapgetpage(scan, page); > lineindex = 0; > scan->rs_inited = true; > --- 681,691 ---- > tuple->t_data = NULL; > return; > } > ! /* > ! * start the scan at the location that we chose > ! * in ss_init() > ! */ > ! page = scan->rs_start_page; > heapgetpage(scan, page); > lineindex = 0; > scan->rs_inited = true; > *************** > *** 585,598 **** > } > > /* > ! * if we get here, it means we've exhausted the items on this page and > * it's time to move to the next. > */ > > /* > ! * return NULL if we've exhausted all the pages > */ > ! if (backward ? (page == 0) : (page + 1 >= scan->rs_nblocks)) > { > if (BufferIsValid(scan->rs_cbuf)) > ReleaseBuffer(scan->rs_cbuf); > --- 820,847 ---- > } > > /* > ! * If we get here, it means we've exhausted the items on this page and > * it's time to move to the next. > + * > + * For the forward scan, we need to wrap around to the beginning > + * of the relation file if we reach the end. > */ > + if(backward) > + page--; > + else > + page = (page + 1) % (scan->rs_nblocks); > + > + if(! (page % SYNC_SCAN_REPORT_INTERVAL) ) > + ss_store_hint(scan,page); > > /* > ! * Return NULL if we've exhausted all the pages. > ! * For reverse scans, that means we've reached 0. For > ! * forward scans, that means we've reached the page on > ! * which we started. > */ > ! if ((backward && (page == 0)) || > ! ((page%(scan->rs_nblocks)) == scan->rs_start_page)) > { > if (BufferIsValid(scan->rs_cbuf)) > ReleaseBuffer(scan->rs_cbuf); > *************** > *** 603,609 **** > return; > } > > - page = backward ? (page - 1) : (page + 1); > heapgetpage(scan, page); > > dp = (Page) BufferGetPage(scan->rs_cbuf); > --- 852,857 ---- > *************** > *** 616,621 **** > --- 864,880 ---- > } > } > > + /* > + * SyncScanShmemSize: > + * > + * Called by CreateSharedMemoryAndSemaphores() > + * to find out how much room the Sync Scan Hint > + * Table will need to occupy. > + */ > + Size SyncScanShmemSize(void) > + { > + return SYNC_SCAN_TABLE_SIZE*sizeof(ss_hint_t); > + } > > #if defined(DISABLE_COMPLEX_MACRO) > /* > diff -cr postgresql-8.2.3/src/backend/storage/ipc/ipci.c postgresql-8.2.3-syncscan/src/backend/storage/ipc/ipci.c > *** postgresql-8.2.3/src/backend/storage/ipc/ipci.c 2006-10-15 15:04:07.000000000 -0700 > --- postgresql-8.2.3-syncscan/src/backend/storage/ipc/ipci.c 2007-03-13 21:58:56.000000000 -0700 > *************** > *** 19,24 **** > --- 19,25 ---- > #include "access/nbtree.h" > #include "access/subtrans.h" > #include "access/twophase.h" > + #include "access/heapam.h" > #include "miscadmin.h" > #include "pgstat.h" > #include "postmaster/bgwriter.h" > *************** > *** 110,115 **** > --- 111,117 ---- > size = add_size(size, FreeSpaceShmemSize()); > size = add_size(size, BgWriterShmemSize()); > size = add_size(size, BTreeShmemSize()); > + size = add_size(size, SyncScanShmemSize()); > #ifdef EXEC_BACKEND > size = add_size(size, ShmemBackendArraySize()); > #endif > diff -cr postgresql-8.2.3/src/backend/utils/misc/guc.c postgresql-8.2.3-syncscan/src/backend/utils/misc/guc.c > *** postgresql-8.2.3/src/backend/utils/misc/guc.c 2006-11-29 06:50:07.000000000 -0800 > --- postgresql-8.2.3-syncscan/src/backend/utils/misc/guc.c 2007-03-13 23:23:31.000000000 -0700 > *************** > *** 25,31 **** > #include <syslog.h> > #endif > > ! > #include "access/gin.h" > #include "access/twophase.h" > #include "access/xact.h" > --- 25,31 ---- > #include <syslog.h> > #endif > > ! #include "access/heapam.h" > #include "access/gin.h" > #include "access/twophase.h" > #include "access/xact.h" > *************** > *** 758,763 **** > --- 758,773 ---- > false, NULL, NULL > }, > > + { > + {"trace_sync_seqscan", PGC_USERSET, DEVELOPER_OPTIONS, > + gettext_noop("Generates debugging output for Synchronized Scans."), > + NULL, > + GUC_NOT_IN_SAMPLE > + }, > + &Trace_sync_seqscan, > + false, NULL, NULL > + }, > + > #ifdef LOCK_DEBUG > { > {"trace_locks", PGC_SUSET, DEVELOPER_OPTIONS, > *************** > *** 1723,1728 **** > --- 1733,1754 ---- > DEFAULT_GEQO_SELECTION_BIAS, MIN_GEQO_SELECTION_BIAS, > MAX_GEQO_SELECTION_BIAS, NULL, NULL > }, > + { > + {"sync_seqscan_threshold", PGC_USERSET, QUERY_TUNING_SYNC_SEQSCAN, > + gettext_noop("Minimum size of table before synchronized scanning takes effect, as a fraction of shared_buffers."), > + NULL > + }, > + &sync_seqscan_threshold, > + DEFAULT_SYNC_SCAN_THRESHOLD, 0.0, 100.0, NULL, NULL > + }, > + { > + {"sync_seqscan_offset", PGC_USERSET, QUERY_TUNING_SYNC_SEQSCAN, > + gettext_noop("Start synchronized scans at this offset (as a fraction of shared_buffers) before other scans."), > + NULL > + }, > + &sync_seqscan_offset, > + DEFAULT_SYNC_SCAN_OFFSET, 0.0, 100.0, NULL, NULL > + }, > > { > {"bgwriter_lru_percent", PGC_SIGHUP, RESOURCES, > diff -cr postgresql-8.2.3/src/include/access/heapam.h postgresql-8.2.3-syncscan/src/include/access/heapam.h > *** postgresql-8.2.3/src/include/access/heapam.h 2006-11-05 14:42:10.000000000 -0800 > --- postgresql-8.2.3-syncscan/src/include/access/heapam.h 2007-03-13 23:22:11.000000000 -0700 > *************** > *** 25,30 **** > --- 25,49 ---- > #include "utils/rel.h" > #include "utils/tqual.h" > > + /* > + * Size of the Sync Scan Hint Table. > + */ > + #define SYNC_SCAN_TABLE_SIZE 1000 > + > + /* > + * Interval between reports of the location > + * of the current scan, in pages. > + */ > + #define SYNC_SCAN_REPORT_INTERVAL 100 > + > + #define DEFAULT_SYNC_SCAN_THRESHOLD 1.0 > + #define DEFAULT_SYNC_SCAN_OFFSET 0.0 > + > + extern DLLIMPORT bool Trace_sync_seqscan; > + extern DLLIMPORT double sync_seqscan_threshold; > + extern DLLIMPORT double sync_seqscan_offset; > + extern Size SyncScanShmemSize(void); > + > /* ---------------- > * fastgetattr > * > diff -cr postgresql-8.2.3/src/include/access/relscan.h postgresql-8.2.3-syncscan/src/include/access/relscan.h > *** postgresql-8.2.3/src/include/access/relscan.h 2006-10-03 17:30:07.000000000 -0700 > --- postgresql-8.2.3-syncscan/src/include/access/relscan.h 2007-03-13 21:58:56.000000000 -0700 > *************** > *** 19,24 **** > --- 19,33 ---- > #include "utils/tqual.h" > > > + /* > + * Structure of an entry in the > + * Sync Scan Hint Table. > + */ > + typedef struct { > + Oid relid; /* The relid that tags this hint entry */ > + BlockNumber location; /* The location in the relation */ > + } ss_hint_t; > + > typedef struct HeapScanDescData > { > /* scan parameters */ > *************** > *** 33,38 **** > --- 42,49 ---- > bool rs_inited; /* false = scan not init'd yet */ > HeapTupleData rs_ctup; /* current tuple in scan, if any */ > BlockNumber rs_cblock; /* current block # in scan, if any */ > + BlockNumber rs_start_page; /* page where this scan began */ > + ss_hint_t *rs_hint; /* pointer to scan hint */ > Buffer rs_cbuf; /* current buffer in scan, if any */ > /* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */ > ItemPointerData rs_mctid; /* marked scan position, if any */ > diff -cr postgresql-8.2.3/src/include/utils/guc_tables.h postgresql-8.2.3-syncscan/src/include/utils/guc_tables.h > *** postgresql-8.2.3/src/include/utils/guc_tables.h 2006-10-03 14:11:55.000000000 -0700 > --- postgresql-8.2.3-syncscan/src/include/utils/guc_tables.h 2007-03-13 22:41:15.000000000 -0700 > *************** > *** 56,61 **** > --- 56,62 ---- > QUERY_TUNING_METHOD, > QUERY_TUNING_COST, > QUERY_TUNING_GEQO, > + QUERY_TUNING_SYNC_SEQSCAN, > QUERY_TUNING_OTHER, > LOGGING, > LOGGING_WHERE, > > > ------------------------------------------------------------------------ > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com Index: src/backend/access/heap/Makefile =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/Makefile,v retrieving revision 1.15 diff -c -r1.15 Makefile *** src/backend/access/heap/Makefile 8 Apr 2007 01:26:27 -0000 1.15 --- src/backend/access/heap/Makefile 25 May 2007 21:13:17 -0000 *************** *** 12,18 **** top_builddir = ../../../.. include $(top_builddir)/src/Makefile.global ! OBJS = heapam.o hio.o rewriteheap.o tuptoaster.o all: SUBSYS.o --- 12,18 ---- top_builddir = ../../../.. include $(top_builddir)/src/Makefile.global ! OBJS = heapam.o hio.o rewriteheap.o tuptoaster.o syncscan.o all: SUBSYS.o Index: src/backend/access/heap/heapam.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v retrieving revision 1.232 diff -c -r1.232 heapam.c *** src/backend/access/heap/heapam.c 8 Apr 2007 01:26:27 -0000 1.232 --- src/backend/access/heap/heapam.c 27 May 2007 11:54:52 -0000 *************** *** 83,88 **** --- 83,93 ---- */ scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd); + /* + * Choose a good place to start the relation scan. + */ + scan->rs_start_page = ss_get_location(scan); + scan->rs_inited = false; scan->rs_ctup.t_data = NULL; ItemPointerSetInvalid(&scan->rs_ctup.t_self); *************** *** 203,208 **** --- 208,214 ---- Snapshot snapshot = scan->rs_snapshot; bool backward = ScanDirectionIsBackward(dir); BlockNumber page; + BlockNumber prevpage; Page dp; int lines; OffsetNumber lineoff; *************** *** 225,231 **** tuple->t_data = NULL; return; } ! page = 0; /* first page */ heapgetpage(scan, page); lineoff = FirstOffsetNumber; /* first offnum */ scan->rs_inited = true; --- 231,241 ---- tuple->t_data = NULL; return; } ! /* ! * start the scan at the location that we chose ! * in ss_get_location() ! */ ! page = scan->rs_start_page; heapgetpage(scan, page); lineoff = FirstOffsetNumber; /* first offnum */ scan->rs_inited = true; *************** *** 259,265 **** tuple->t_data = NULL; return; } ! page = scan->rs_nblocks - 1; /* final page */ heapgetpage(scan, page); } else --- 269,275 ---- tuple->t_data = NULL; return; } ! page = scan->rs_start_page; heapgetpage(scan, page); } else *************** *** 366,380 **** } /* ! * if we get here, it means we've exhausted the items on this page and * it's time to move to the next. */ LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK); /* ! * return NULL if we've exhausted all the pages */ ! if (backward ? (page == 0) : (page + 1 >= scan->rs_nblocks)) { if (BufferIsValid(scan->rs_cbuf)) ReleaseBuffer(scan->rs_cbuf); --- 376,417 ---- } /* ! * If we get here, it means we've exhausted the items on this page and * it's time to move to the next. + * + * For the forward scan, we need to wrap around to the beginning + * of the relation file if we reach the end. */ LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK); + prevpage = page; + if(backward) { + if(page == 0) + page = scan->rs_nblocks; + page--; + } + else + page = (page + 1) % (scan->rs_nblocks); + + if(Trace_sync_seqscan) + { + if (!(page%50000)) + elog(DEBUG2,"page: %d",page); + else if (!(page%5000)) + elog(DEBUG3,"page: %d",page); + } + + /* XXX: only report if physical I/O happened */ + ss_report_location(scan, page); + /* ! * Return NULL if we've exhausted all the pages. ! * For reverse scans, that means we've reached 0. For ! * forward scans, that means we've reached the page on ! * which we started. */ ! if((ScanDirectionIsForward(dir) && page == scan->rs_start_page) || ! (ScanDirectionIsBackward(dir) && prevpage == scan->rs_start_page)) { if (BufferIsValid(scan->rs_cbuf)) ReleaseBuffer(scan->rs_cbuf); *************** *** 385,392 **** return; } - page = backward ? (page - 1) : (page + 1); - heapgetpage(scan, page); LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE); --- 422,427 ---- *************** *** 429,434 **** --- 464,470 ---- HeapTuple tuple = &(scan->rs_ctup); bool backward = ScanDirectionIsBackward(dir); BlockNumber page; + BlockNumber prevpage; Page dp; int lines; int lineindex; *************** *** 452,458 **** tuple->t_data = NULL; return; } ! page = 0; /* first page */ heapgetpage(scan, page); lineindex = 0; scan->rs_inited = true; --- 488,498 ---- tuple->t_data = NULL; return; } ! /* ! * start the scan at the location that we chose ! * in ss_get_location() ! */ ! page = scan->rs_start_page; heapgetpage(scan, page); lineindex = 0; scan->rs_inited = true; *************** *** 483,489 **** tuple->t_data = NULL; return; } ! page = scan->rs_nblocks - 1; /* final page */ heapgetpage(scan, page); } else --- 523,529 ---- tuple->t_data = NULL; return; } ! page = scan->rs_start_page; heapgetpage(scan, page); } else *************** *** 587,600 **** } /* ! * if we get here, it means we've exhausted the items on this page and * it's time to move to the next. */ /* ! * return NULL if we've exhausted all the pages */ ! if (backward ? (page == 0) : (page + 1 >= scan->rs_nblocks)) { if (BufferIsValid(scan->rs_cbuf)) ReleaseBuffer(scan->rs_cbuf); --- 627,665 ---- } /* ! * If we get here, it means we've exhausted the items on this page and * it's time to move to the next. + * + * For the forward scan, we need to wrap around to the beginning + * of the relation file if we reach the end. */ + prevpage = page; + if(backward) { + if(page == 0) + page = scan->rs_nblocks; + page--; + } + else + page = (page + 1) % (scan->rs_nblocks); + + if(Trace_sync_seqscan) + { + if (!(page%50000)) + elog(DEBUG2,"page: %d",page); + else if (!(page%5000)) + elog(DEBUG3,"page: %d",page); + } + + ss_report_location(scan,page); /* ! * Return NULL if we've exhausted all the pages. ! * For reverse scans, that means we've reached 0. For ! * forward scans, that means we've reached the page on ! * which we started. */ ! if((ScanDirectionIsForward(dir) && page == scan->rs_start_page) || ! (ScanDirectionIsBackward(dir) && prevpage == scan->rs_start_page)) { if (BufferIsValid(scan->rs_cbuf)) ReleaseBuffer(scan->rs_cbuf); *************** *** 605,611 **** return; } - page = backward ? (page - 1) : (page + 1); heapgetpage(scan, page); dp = (Page) BufferGetPage(scan->rs_cbuf); --- 670,675 ---- *************** *** 618,624 **** } } - #if defined(DISABLE_COMPLEX_MACRO) /* * This is formatted so oddly so that the correspondence to the macro --- 682,687 ---- Index: src/backend/access/heap/syncscan.c =================================================================== RCS file: src/backend/access/heap/syncscan.c diff -N src/backend/access/heap/syncscan.c *** /dev/null 1 Jan 1970 00:00:00 -0000 --- src/backend/access/heap/syncscan.c 28 May 2007 07:10:39 -0000 *************** *** 0 **** --- 1,309 ---- + /*------------------------------------------------------------------------- + * + * syncscan.c + * heap scan synchronization support + * + * When multiple backends run a sequential scan on the same table, we try + * to keep them synchronized to reduce the overall I/O needed. The goal is + * to read in each page only once to shared buffer cache, and let all backends + * that take part in the scan to process the page before it falls out of the + * cache. + * + * To accomplish that, we keep track of the scan position of each table, and + * start new scans close to where the previous scan(s) are. Assuming that I/O + * is slow compared to the processing of each page, that's enough to keep the + * scans synchronized. (XXX explain why) + * + * There can realistically only be a few sequential scans on different tables + * in progress at any time. Therefore we just keep the scan positions in a + * small LRU list which we scan every time we need to look up or update a scan + * position. + * + * + * INTERFACE ROUTINES + * ss_get_location - return current scan location of a relation + * ss_report_location - update current scan location + * + * NOTES + * XXX This file contains the heap_ routines which implement + * the POSTGRES heap access method used for all POSTGRES + * relations. + * + * Portions Copyright (c) 1996-2007, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * $PostgreSQL$ + * + *------------------------------------------------------------------------- + */ + #include "postgres.h" + + #include "access/heapam.h" + #include "access/hio.h" + #include "access/multixact.h" + #include "access/transam.h" + #include "access/tuptoaster.h" + #include "access/valid.h" + #include "access/xact.h" + #include "catalog/catalog.h" + #include "catalog/namespace.h" + #include "miscadmin.h" + #include "pgstat.h" + #include "storage/procarray.h" + #include "storage/smgr.h" + #include "utils/inval.h" + #include "utils/lsyscache.h" + #include "utils/relcache.h" + #include "utils/syscache.h" + + /* + * Size of the LRU list. + * + * Note: the code assumes that SYNC_SCAN_NELEM > 1. + */ + #define SYNC_SCAN_NELEM 100 /* XXX: What's a good value? */ + + /* + * Interval between reports of the location + * of the current scan, in pages. + * + * XXX: note relationship to buffer ring size + */ + #define SYNC_SCAN_REPORT_INTERVAL 16 + + + /* GUC variables */ + double sync_seqscan_threshold = DEFAULT_SYNC_SCAN_THRESHOLD; + bool Trace_sync_seqscan = false; + + /* + * The scan locations structure is essentially a doubly-linked LRU with head + * and tail pointer, but designed to hold a fixed maximum number of elements in + * fixed-size shared memory. + */ + + typedef struct ss_scan_location_t { + RelFileNode relfilenode; /* The relfilenode that tags this entry */ + BlockNumber location; /* The location in the relation */ + } ss_scan_location_t; + + typedef struct ss_lru_item_t { + struct ss_lru_item_t *prev; + struct ss_lru_item_t *next; + ss_scan_location_t location; + } ss_lru_item_t; + + typedef struct ss_scan_locations_t { + ss_lru_item_t *head; + ss_lru_item_t *tail; + ss_lru_item_t items[]; /* SYNC_SCAN_NELEM items */ + } ss_scan_locations_t; + + #define SizeOfScanLocations offsetof(ss_scan_locations_t, items[SYNC_SCAN_NELEM]) + + /* Pointer to struct in shared memory */ + static ss_scan_locations_t *scan_locations; + + + /* prototypes for internal functions */ + static BlockNumber ss_search(RelFileNode,BlockNumber,bool); + + + /* + * SyncScanShmemSize --- report amount of shared memory space needed + */ + Size SyncScanShmemSize(void) + { + return SizeOfScanLocations; + } + + /* + * SyncScanShmemInit --- initialize this module's shared memory + */ + void SyncScanShmemInit() + { + int i; + bool found; + + scan_locations = (ss_scan_locations_t *) + ShmemInitStruct("Sync Scan Locations List", SizeOfScanLocations, &found); + + if(!IsUnderPostmaster) + { + /* Initialize shared memory area */ + Assert(!found); + + scan_locations->head = &scan_locations->items[0]; + scan_locations->tail = &scan_locations->items[SYNC_SCAN_NELEM - 1]; + + for(i = 0; i < SYNC_SCAN_NELEM; i++) + { + ss_lru_item_t *item = &scan_locations->items[i]; + + /* Initialize all slots with invalid values. As scans are started, + * these invalid entries will fall off the LRU list and replaced + * with real entries. + */ + item->location.relfilenode.spcNode = InvalidOid; + item->location.relfilenode.dbNode = InvalidOid; + item->location.relfilenode.relNode = InvalidOid; + item->location.location = InvalidBlockNumber; + + item->prev = (i > 0) ? + (&scan_locations->items[i - 1]) : NULL; + item->next = (i < SYNC_SCAN_NELEM - 1) ? + (&scan_locations->items[i + 1]) : NULL; + } + } + else + Assert(found); + + } + + /* + * ss_search --- searches the scan_locations structure for an entry with the + * given relfilenode. + * + * If "set" is true, the location is updated to the given location. If no + * entry for the given relfilenode is found, it will be created at the head + * of the list with the given location, even if set == false. + * + * In any case, the location after possible update is returned. + */ + static BlockNumber ss_search(RelFileNode relfilenode, + BlockNumber location, bool set) + { + ss_lru_item_t *item; + + item = scan_locations->head; + + for (;;) + { + bool match; + + match = RelFileNodeEquals(item->location.relfilenode, relfilenode); + + if (match || item->next == NULL) + { + + /* + * If we reached the end of list and no match was found, + * take over the last entry + */ + if (!match) + { + item->location.relfilenode = relfilenode; + item->location.location = location; + } + else + if (set) + item->location.location = location; + + /* Move the entry to the front of the LRU list */ + if (item != scan_locations->head) + { + /* unlink */ + if(item == scan_locations->tail) + scan_locations->tail = item->prev; + item->prev->next = item->next; + if(item->next) + item->next->prev = item->prev; + + /* link */ + item->prev = NULL; + item->next = scan_locations->head; + scan_locations->head->prev = item; + scan_locations->head = item; + } + + return item->location.location; + } + item = item->next; + } + /* never reached */ + } + + /* + * ss_get_location --- gets the optimal starting location for scan + * + * Returns the location of an already running sequential scan on the + * relation, or 0 if no valid location is found. + * + * This function assumes that scan->rs_nblocks is already properly set. + */ + BlockNumber ss_get_location(HeapScanDesc scan) + { + int threshold = sync_seqscan_threshold * NBuffers; + BlockNumber startloc; + + /* + * If the table is not large enough, or sync_scan_threshold + * is disabled (negative), don't Sync Scan. + */ + if(threshold < 0 || scan->rs_nblocks < threshold) + return 0; + + LWLockAcquire(SyncScanLock,LW_EXCLUSIVE); + startloc = ss_search(scan->rs_rd->rd_node, 0, false); + LWLockRelease(SyncScanLock); + + /* + * If the location is not a valid block number for this scan, start at 0. + * + * This can happen if for instance a VACUUM truncated the table + * since the location was set. + */ + if(startloc < 0 || startloc >= scan->rs_nblocks) + { + if(Trace_sync_seqscan) + elog(DEBUG2,"SYNC_SCAN: Location %d out of range." \ + " Relation has %d pages.", + startloc, scan->rs_nblocks); + return 0; + } + + if(Trace_sync_seqscan) + elog(DEBUG2,"SYNC_SCAN: START: OID = %d; Location = %d; Size: %d", + RelationGetRelid(scan->rs_rd), + startloc,scan->rs_nblocks); + + return startloc; + } + + /* + * ss_report_location --- updates the current scan location + * + * Writes an entry in the Sync Scan structure of the form + * (relfilenode,blocknumber), overwriting any existing entry for the same + * relfilenode. + */ + void ss_report_location(HeapScanDesc scan, BlockNumber location) + { + int threshold = sync_seqscan_threshold * NBuffers; + + /* + * If the table is not large enough, or sync_scan_threshold + * is disabled (negative), don't Sync Scan. + */ + if(threshold < 0 || scan->rs_nblocks < threshold) + return; + + /* + * To avoid lock congestion, only report scan progress every n pages. + * For the same reason, don't block if the lock isn't immediately + * available. Missing a few updates isn't critical, it just means that a + * scan that wants to join the pack will start a little bit behind the + * head of the scan. That's ok because the pages should still be in OS + * cache and the scan will quickly catch up. + */ + if (location % SYNC_SCAN_REPORT_INTERVAL == 0) + { + if (LWLockConditionalAcquire(SyncScanLock, LW_EXCLUSIVE)) + { + ss_search(scan->rs_rd->rd_node,location,true); + LWLockRelease(SyncScanLock); + } + } + } + Index: src/backend/storage/ipc/ipci.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/ipc/ipci.c,v retrieving revision 1.91 diff -c -r1.91 ipci.c *** src/backend/storage/ipc/ipci.c 15 Feb 2007 23:23:23 -0000 1.91 --- src/backend/storage/ipc/ipci.c 25 May 2007 21:28:18 -0000 *************** *** 19,24 **** --- 19,25 ---- #include "access/nbtree.h" #include "access/subtrans.h" #include "access/twophase.h" + #include "access/heapam.h" #include "miscadmin.h" #include "pgstat.h" #include "postmaster/autovacuum.h" *************** *** 112,117 **** --- 113,119 ---- size = add_size(size, BgWriterShmemSize()); size = add_size(size, AutoVacuumShmemSize()); size = add_size(size, BTreeShmemSize()); + size = add_size(size, SyncScanShmemSize()); #ifdef EXEC_BACKEND size = add_size(size, ShmemBackendArraySize()); #endif *************** *** 216,221 **** --- 218,224 ---- * Set up other modules that need some shared memory space */ BTreeShmemInit(); + SyncScanShmemInit(); #ifdef EXEC_BACKEND Index: src/backend/utils/misc/guc.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/guc.c,v retrieving revision 1.391 diff -c -r1.391 guc.c *** src/backend/utils/misc/guc.c 8 May 2007 16:33:51 -0000 1.391 --- src/backend/utils/misc/guc.c 25 May 2007 21:07:53 -0000 *************** *** 25,31 **** #include <syslog.h> #endif ! #include "access/gin.h" #include "access/transam.h" #include "access/twophase.h" --- 25,31 ---- #include <syslog.h> #endif ! #include "access/heapam.h" #include "access/gin.h" #include "access/transam.h" #include "access/twophase.h" *************** *** 783,788 **** --- 783,798 ---- false, NULL, NULL }, + { + {"trace_sync_seqscan", PGC_USERSET, DEVELOPER_OPTIONS, + gettext_noop("Generates debugging output for Synchronized Scans."), + NULL, + GUC_NOT_IN_SAMPLE + }, + &Trace_sync_seqscan, + false, NULL, NULL + }, + #ifdef LOCK_DEBUG { {"trace_locks", PGC_SUSET, DEVELOPER_OPTIONS, *************** *** 1803,1809 **** DEFAULT_GEQO_SELECTION_BIAS, MIN_GEQO_SELECTION_BIAS, MAX_GEQO_SELECTION_BIAS, NULL, NULL }, ! { {"bgwriter_lru_percent", PGC_SIGHUP, RESOURCES, gettext_noop("Background writer percentage of LRU buffers to flush per round."), --- 1813,1826 ---- DEFAULT_GEQO_SELECTION_BIAS, MIN_GEQO_SELECTION_BIAS, MAX_GEQO_SELECTION_BIAS, NULL, NULL }, ! { ! {"sync_seqscan_threshold", PGC_USERSET, QUERY_TUNING_SYNC_SEQSCAN, ! gettext_noop("Minimum size of table before synchronized scanning takes effect, as a fraction of shared_buffers."), ! NULL ! }, ! &sync_seqscan_threshold, ! DEFAULT_SYNC_SCAN_THRESHOLD, -1.0, 100.0, NULL, NULL ! }, { {"bgwriter_lru_percent", PGC_SIGHUP, RESOURCES, gettext_noop("Background writer percentage of LRU buffers to flush per round."), Index: src/include/access/heapam.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/heapam.h,v retrieving revision 1.123 diff -c -r1.123 heapam.h *** src/include/access/heapam.h 8 Apr 2007 01:26:33 -0000 1.123 --- src/include/access/heapam.h 26 May 2007 18:42:46 -0000 *************** *** 25,30 **** --- 25,35 ---- #include "utils/rel.h" #include "utils/tqual.h" + #define DEFAULT_SYNC_SCAN_THRESHOLD 1.0 + + extern bool Trace_sync_seqscan; + extern double sync_seqscan_threshold; + /* ---------------- * fastgetattr * *************** *** 240,243 **** --- 245,254 ---- extern void heap_sync(Relation relation); + /* in heapam/syncscan.c */ + extern void ss_report_location(HeapScanDesc scan, BlockNumber location); + extern BlockNumber ss_get_location(HeapScanDesc scan); + extern void SyncScanShmemInit(void); + extern Size SyncScanShmemSize(void); + #endif /* HEAPAM_H */ Index: src/include/access/relscan.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/relscan.h,v retrieving revision 1.52 diff -c -r1.52 relscan.h *** src/include/access/relscan.h 20 Jan 2007 18:43:35 -0000 1.52 --- src/include/access/relscan.h 25 May 2007 21:35:36 -0000 *************** *** 16,22 **** --- 16,24 ---- #include "access/skey.h" #include "storage/bufpage.h" + #include "storage/lwlock.h" #include "utils/tqual.h" + #include "utils/dynahash.h" typedef struct HeapScanDescData *************** *** 33,38 **** --- 35,42 ---- bool rs_inited; /* false = scan not init'd yet */ HeapTupleData rs_ctup; /* current tuple in scan, if any */ BlockNumber rs_cblock; /* current block # in scan, if any */ + BlockNumber rs_start_page; /* page where this scan began */ + // ss_hints_t *rs_hints; /* pointer to scan hint */ Buffer rs_cbuf; /* current buffer in scan, if any */ /* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */ ItemPointerData rs_mctid; /* marked scan position, if any */ Index: src/include/storage/lwlock.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/lwlock.h,v retrieving revision 1.36 diff -c -r1.36 lwlock.h *** src/include/storage/lwlock.h 16 Apr 2007 18:30:04 -0000 1.36 --- src/include/storage/lwlock.h 25 May 2007 21:08:27 -0000 *************** *** 62,67 **** --- 62,68 ---- AddinShmemInitLock, AutovacuumLock, AutovacuumScheduleLock, + SyncScanLock, /* Individual lock IDs end here */ FirstBufMappingLock, FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS, Index: src/include/utils/guc_tables.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/utils/guc_tables.h,v retrieving revision 1.33 diff -c -r1.33 guc_tables.h *** src/include/utils/guc_tables.h 21 Apr 2007 20:02:41 -0000 1.33 --- src/include/utils/guc_tables.h 25 May 2007 21:07:53 -0000 *************** *** 56,61 **** --- 56,62 ---- QUERY_TUNING_METHOD, QUERY_TUNING_COST, QUERY_TUNING_GEQO, + QUERY_TUNING_SYNC_SEQSCAN, QUERY_TUNING_OTHER, LOGGING, LOGGING_WHERE,
pgsql-patches by date: