Thread: Synchronized scans
I'm now done with this patch and testing it.

I fixed a little off-by-one in the "backward scan, not inited" branch, but I was unable to test it. It seems that code is actually never used, because that case is optimized to a rewind in the executor. I marked those seemingly unreachable places in the code with a comment.

I didn't touch the large scan threshold of NBuffers / 4 that Tom committed as part of the buffer ring patch; IOW I removed the GUC variable from the patch. I think the jury is still out on this one.

I included a basic regression test as well. It creates a ~10MB table, which with the default 32MB shared_buffers setting is large enough that synchronized scans are used. It then runs a query with a LIMIT so that it scans ~1/2 of the table, and then runs a new seqscan and checks that it returns rows from the second half of the table. This is a bit flaky, as the needed table size depends on the large scan threshold, and we can't test the actual concurrent behavior, but it's better than nothing. 10 MB works for "make check", but isn't enough if one runs "installcheck" against an existing installation with a larger shared_buffers. I therefore only added the test case to parallel_schedule, though it still breaks "installcheck-parallel". If we later add the GUC variable, the test case should use it.

For the record, this patch has a small negative impact on scans like "SELECT * FROM foo LIMIT 1000". If such a scan is run repeatedly, in CVS HEAD the first 1000 rows will stay in buffer cache, but with the patch each scan will start from roughly where the previous one stopped, requiring more pages to be read from disk each time. I don't think it's something to worry about in practice, but I thought I'd mention it.
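The circular-scan behavior described above (start at the reported location, wrap past the end of the relation, stop once back at the start page) can be sketched as a standalone model. This is a simplification of the page-advance logic in the heapam.c hunks below, not the actual PostgreSQL code; the function name and the standalone `BlockNumber` typedef are mine:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned BlockNumber;

/*
 * Advance a forward synchronized scan by one page, wrapping around at
 * the end of the relation.  Returns false once the scan has come full
 * circle back to the page it started from, i.e. every page was visited.
 * (The backward case in the patch mirrors this, wrapping at page 0.)
 */
static bool
advance_forward(BlockNumber *page, BlockNumber nblocks,
                BlockNumber startpage)
{
    (*page)++;
    if (*page == nblocks)
        *page = 0;              /* wrap around to the beginning */
    return *page != startpage;  /* done once we're back at the start */
}
```

For a 5-page relation with a scan joining at page 2, this visits pages 2, 3, 4, 0, 1 and then stops, so every page is read exactly once.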
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com Index: src/backend/access/heap/Makefile =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/Makefile,v retrieving revision 1.15 diff -c -r1.15 Makefile *** src/backend/access/heap/Makefile 8 Apr 2007 01:26:27 -0000 1.15 --- src/backend/access/heap/Makefile 31 May 2007 10:21:28 -0000 *************** *** 12,18 **** top_builddir = ../../../.. include $(top_builddir)/src/Makefile.global ! OBJS = heapam.o hio.o rewriteheap.o tuptoaster.o all: SUBSYS.o --- 12,18 ---- top_builddir = ../../../.. include $(top_builddir)/src/Makefile.global ! OBJS = heapam.o hio.o rewriteheap.o tuptoaster.o syncscan.o all: SUBSYS.o Index: src/backend/access/heap/heapam.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v retrieving revision 1.234 diff -c -r1.234 heapam.c *** src/backend/access/heap/heapam.c 30 May 2007 20:11:53 -0000 1.234 --- src/backend/access/heap/heapam.c 4 Jun 2007 08:41:56 -0000 *************** *** 85,104 **** /* * If the table is large relative to NBuffers, use a bulk-read access ! * strategy, else use the default random-access strategy. During a ! * rescan, don't make a new strategy object if we don't have to. */ if (scan->rs_nblocks > NBuffers / 4 && !scan->rs_rd->rd_istemp) { if (scan->rs_strategy == NULL) scan->rs_strategy = GetAccessStrategy(BAS_BULKREAD); } else { if (scan->rs_strategy != NULL) FreeAccessStrategy(scan->rs_strategy); scan->rs_strategy = NULL; } scan->rs_inited = false; --- 85,108 ---- /* * If the table is large relative to NBuffers, use a bulk-read access ! * strategy and enable synchronized scanning (see syncscan.c). ! * During a rescan, don't make a new strategy object if we don't have to. 
*/ if (scan->rs_nblocks > NBuffers / 4 && !scan->rs_rd->rd_istemp) { if (scan->rs_strategy == NULL) scan->rs_strategy = GetAccessStrategy(BAS_BULKREAD); + + scan->rs_startpage = ss_get_location(scan); } else { if (scan->rs_strategy != NULL) FreeAccessStrategy(scan->rs_strategy); scan->rs_strategy = NULL; + + scan->rs_startpage = 0; } scan->rs_inited = false; *************** *** 229,234 **** --- 233,239 ---- Snapshot snapshot = scan->rs_snapshot; bool backward = ScanDirectionIsBackward(dir); BlockNumber page; + BlockNumber prevpage; Page dp; int lines; OffsetNumber lineoff; *************** *** 251,257 **** tuple->t_data = NULL; return; } ! page = 0; /* first page */ heapgetpage(scan, page); lineoff = FirstOffsetNumber; /* first offnum */ scan->rs_inited = true; --- 256,262 ---- tuple->t_data = NULL; return; } ! page = scan->rs_startpage; /* first page */ heapgetpage(scan, page); lineoff = FirstOffsetNumber; /* first offnum */ scan->rs_inited = true; *************** *** 276,281 **** --- 281,290 ---- { if (!scan->rs_inited) { + /* Note: This is not normally reached, because this case is optimized + * away in the executor. + */ + /* * return null immediately if relation is empty */ *************** *** 285,291 **** tuple->t_data = NULL; return; } ! page = scan->rs_nblocks - 1; /* final page */ heapgetpage(scan, page); } else --- 294,305 ---- tuple->t_data = NULL; return; } ! /* start from last page of the scan */ ! if (scan->rs_startpage == 0) ! page = scan->rs_nblocks - 1; ! else ! page = scan->rs_startpage - 1; ! heapgetpage(scan, page); } else *************** *** 398,406 **** LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK); /* ! * return NULL if we've exhausted all the pages */ ! if (backward ? (page == 0) : (page + 1 >= scan->rs_nblocks)) { if (BufferIsValid(scan->rs_cbuf)) ReleaseBuffer(scan->rs_cbuf); --- 412,439 ---- LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK); /* ! * Handle wrap-around correctly, we might have started from somewhere ! 
* in the middle of the relation. ! */ ! prevpage = page; ! if (backward) ! { ! if (page == 0) ! page = scan->rs_nblocks; ! page--; ! } ! else ! { ! page++; ! if (page == scan->rs_nblocks) ! page = 0; ! } ! ! /* ! * return NULL if we've exhausted all the pages. */ ! if((ScanDirectionIsForward(dir) && page == scan->rs_startpage) || ! (ScanDirectionIsBackward(dir) && prevpage == scan->rs_startpage)) { if (BufferIsValid(scan->rs_cbuf)) ReleaseBuffer(scan->rs_cbuf); *************** *** 411,420 **** return; } - page = backward ? (page - 1) : (page + 1); - heapgetpage(scan, page); LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE); dp = (Page) BufferGetPage(scan->rs_cbuf); --- 444,455 ---- return; } heapgetpage(scan, page); + /* Report our new scan position for synchronization purposes */ + if (scan->rs_strategy != NULL) + ss_report_location(scan, page); + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE); dp = (Page) BufferGetPage(scan->rs_cbuf); *************** *** 455,460 **** --- 490,496 ---- HeapTuple tuple = &(scan->rs_ctup); bool backward = ScanDirectionIsBackward(dir); BlockNumber page; + BlockNumber prevpage; Page dp; int lines; int lineindex; *************** *** 478,484 **** tuple->t_data = NULL; return; } ! page = 0; /* first page */ heapgetpage(scan, page); lineindex = 0; scan->rs_inited = true; --- 514,520 ---- tuple->t_data = NULL; return; } ! page = scan->rs_startpage; heapgetpage(scan, page); lineindex = 0; scan->rs_inited = true; *************** *** 500,505 **** --- 536,545 ---- { if (!scan->rs_inited) { + /* Note: This is not normally reached, because this case is optimized + * away in the executor. + */ + /* * return null immediately if relation is empty */ *************** *** 509,515 **** tuple->t_data = NULL; return; } ! page = scan->rs_nblocks - 1; /* final page */ heapgetpage(scan, page); } else --- 549,560 ---- tuple->t_data = NULL; return; } ! /* start from last page of the scan */ ! if (scan->rs_startpage == 0) ! page = scan->rs_nblocks - 1; ! else ! 
page = scan->rs_startpage - 1; ! heapgetpage(scan, page); } else *************** *** 613,626 **** } /* ! * if we get here, it means we've exhausted the items on this page and ! * it's time to move to the next. */ /* ! * return NULL if we've exhausted all the pages */ ! if (backward ? (page == 0) : (page + 1 >= scan->rs_nblocks)) { if (BufferIsValid(scan->rs_cbuf)) ReleaseBuffer(scan->rs_cbuf); --- 658,686 ---- } /* ! * If we get here, it means we've exhausted the items on this page and ! * it's time to move to the next. Handle wrap around correctly; ! * we might have started from somewhere in the middle of the relation. */ + prevpage = page; + if (backward) + { + if (page == 0) + page = scan->rs_nblocks; + page--; + } + else + { + page++; + if (page == scan->rs_nblocks) + page = 0; + } /* ! * return NULL if we've exhausted all the pages. */ ! if((ScanDirectionIsForward(dir) && page == scan->rs_startpage) || ! (ScanDirectionIsBackward(dir) && prevpage == scan->rs_startpage)) { if (BufferIsValid(scan->rs_cbuf)) ReleaseBuffer(scan->rs_cbuf); *************** *** 631,639 **** return; } - page = backward ? 
(page - 1) : (page + 1); heapgetpage(scan, page); dp = (Page) BufferGetPage(scan->rs_cbuf); lines = scan->rs_ntuples; linesleft = lines; --- 691,701 ---- return; } heapgetpage(scan, page); + if (scan->rs_strategy != NULL) + ss_report_location(scan,page); + dp = (Page) BufferGetPage(scan->rs_cbuf); lines = scan->rs_ntuples; linesleft = lines; Index: src/backend/access/heap/syncscan.c =================================================================== RCS file: src/backend/access/heap/syncscan.c diff -N src/backend/access/heap/syncscan.c *** /dev/null 1 Jan 1970 00:00:00 -0000 --- src/backend/access/heap/syncscan.c 4 Jun 2007 08:57:16 -0000 *************** *** 0 **** --- 1,287 ---- + /*------------------------------------------------------------------------- + * + * syncscan.c + * heap scan synchronization support + * + * When multiple backends run a sequential scan on the same table, we try + * to keep them synchronized to reduce the overall I/O needed. The goal is + * to read in each page only once to shared buffer cache, and let all backends + * that take part in the scan to process the page before it falls out of the + * cache. + * + * To accomplish that, we keep track of the scan position of each table, and + * start new scans close to where the previous scan(s) are. Assuming that I/O + * is slow compared to the processing of each page, that's enough to keep the + * scans synchronized. We don't try do any further synchronization to keep the + * scans together; some scans might progress much more slowly than others if + * for example the results need to be transferred to the client over a slow + * network, and we don't want such queries to slow down others. + * + * There can realistically only be a few sequential scans on different tables + * in progress at any time. Therefore we just keep the scan positions in a + * small LRU list which we scan every time we need to look up or update a scan + * position. 
If that ever becomes a scalability issue, we could put the entries + * to a hash table. + * + * INTERFACE ROUTINES + * ss_get_location - return current scan location of a relation + * ss_report_location - update current scan location + * + * Portions Copyright (c) 1996-2007, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * $PostgreSQL$ + * + *------------------------------------------------------------------------- + */ + #include "postgres.h" + + #include "access/heapam.h" + #include "miscadmin.h" + + /* + * Size of the LRU list. + * + * Note: the code assumes that SYNC_SCAN_NELEM > 1. + * + * XXX: What's a good value? It should be large enough to hold the + * maximum number large tables scanned simultaneously. But a larger value + * means more traversing of the LRU list when starting a new scan. + */ + #define SYNC_SCAN_NELEM 20 + + /* + * Interval between reports of the location of the current scan, in pages. + * + * Note: This should be smaller than the ring size (see freelist.c) we use + * for bulk reads. Otherwise a scan joining other scans might start from a + * page that's no longer in the buffer cache, and needs to read it back + * again. This is a bit fuzzy; there's no guarantee that the new scan + * will read the page until it leaves the buffer cache anyway, and on the + * other hand the page is most likely still in the OS cache. + */ + #define SYNC_SCAN_REPORT_INTERVAL (128 * 1024 / BLCKSZ) + + + /* GUC variables */ + bool Trace_sync_seqscan = false; + + /* + * The scan locations structure is essentially a doubly-linked LRU with head + * and tail pointer, but designed to hold a fixed maximum number of elements in + * fixed-size shared memory. 
+ */ + typedef struct ss_scan_location_t { + RelFileNode relfilenode; /* The relfilenode that tags this entry */ + BlockNumber location; /* The location in the relation */ + } ss_scan_location_t; + + typedef struct ss_lru_item_t { + struct ss_lru_item_t *prev; + struct ss_lru_item_t *next; + ss_scan_location_t location; + } ss_lru_item_t; + + typedef struct ss_scan_locations_t { + ss_lru_item_t *head; + ss_lru_item_t *tail; + ss_lru_item_t items[1]; /* SYNC_SCAN_NELEM items */ + } ss_scan_locations_t; + + #define SizeOfScanLocations offsetof(ss_scan_locations_t, items[SYNC_SCAN_NELEM]) + + /* Pointer to struct in shared memory */ + static ss_scan_locations_t *scan_locations; + + /* prototypes for internal functions */ + static BlockNumber ss_search(RelFileNode,BlockNumber,bool); + + + /* + * SyncScanShmemSize --- report amount of shared memory space needed + */ + Size SyncScanShmemSize(void) + { + return SizeOfScanLocations; + } + + /* + * SyncScanShmemInit --- initialize this module's shared memory + */ + void SyncScanShmemInit() + { + int i; + bool found; + + scan_locations = (ss_scan_locations_t *) + ShmemInitStruct("Sync Scan Locations List", + SizeOfScanLocations, &found); + + if(!IsUnderPostmaster) + { + /* Initialize shared memory area */ + Assert(!found); + + scan_locations->head = &scan_locations->items[0]; + scan_locations->tail = &scan_locations->items[SYNC_SCAN_NELEM - 1]; + + for(i = 0; i < SYNC_SCAN_NELEM; i++) + { + ss_lru_item_t *item = &scan_locations->items[i]; + + /* Initialize all slots with invalid values. As scans are started, + * these invalid entries will fall off the LRU list and are + * replaced with real entries. + */ + item->location.relfilenode.spcNode = InvalidOid; + item->location.relfilenode.dbNode = InvalidOid; + item->location.relfilenode.relNode = InvalidOid; + item->location.location = InvalidBlockNumber; + + item->prev = (i > 0) ? + (&scan_locations->items[i - 1]) : NULL; + item->next = (i < SYNC_SCAN_NELEM - 1) ? 
+ (&scan_locations->items[i + 1]) : NULL; + } + } + else + Assert(found); + + } + + /* + * ss_search --- searches the scan_locations structure for an entry with the + * given relfilenode. + * + * If "set" is true, the location is updated to the given location. If no + * entry for the given relfilenode is found, it will be created at the head + * of the list with the given location, even if set == false. + * + * In any case, the location after possible update is returned. + */ + static BlockNumber ss_search(RelFileNode relfilenode, + BlockNumber location, bool set) + { + ss_lru_item_t *item; + + item = scan_locations->head; + + for (;;) + { + bool match; + + match = RelFileNodeEquals(item->location.relfilenode, relfilenode); + + if (match || item->next == NULL) + { + + /* + * If we reached the end of list and no match was found, + * take over the last entry + */ + if (!match) + { + item->location.relfilenode = relfilenode; + item->location.location = location; + } + else + if (set) + item->location.location = location; + + /* Move the entry to the front of the LRU list */ + if (item != scan_locations->head) + { + /* unlink */ + if(item == scan_locations->tail) + scan_locations->tail = item->prev; + item->prev->next = item->next; + if(item->next) + item->next->prev = item->prev; + + /* link */ + item->prev = NULL; + item->next = scan_locations->head; + scan_locations->head->prev = item; + scan_locations->head = item; + } + + return item->location.location; + } + item = item->next; + } + /* never reached */ + } + + /* + * ss_get_location --- gets the optimal starting location for scan + * + * Returns the location of an already running sequential scan on the + * relation, or 0 if no valid location is found. + * + * This function assumes that scan->rs_nblocks is already properly set. 
+ */ + BlockNumber ss_get_location(HeapScanDesc scan) + { + BlockNumber startloc; + + LWLockAcquire(SyncScanLock,LW_EXCLUSIVE); + startloc = ss_search(scan->rs_rd->rd_node, 0, false); + LWLockRelease(SyncScanLock); + + /* + * If the location is not a valid block number for this scan, start at 0. + * + * This can happen if for instance a VACUUM truncated the table + * since the location was set. + */ + if(startloc < 0 || startloc >= scan->rs_nblocks) + { + if(Trace_sync_seqscan) + elog(DEBUG1,"SYNC_SCAN: Location %d out of range. Relation has %d pages.", + startloc, scan->rs_nblocks); + return 0; + } + + if(Trace_sync_seqscan) + elog(DEBUG1,"SYNC_SCAN: START: OID = %d; Location = %d; Size: %d", + RelationGetRelid(scan->rs_rd), startloc, scan->rs_nblocks); + + return startloc; + } + + /* + * ss_report_location --- updates the current scan location + * + * Writes an entry in the Sync Scan structure of the form + * (relfilenode,blocknumber), overwriting any existing entry for the same + * relfilenode. + */ + void ss_report_location(HeapScanDesc scan, BlockNumber location) + { + if(Trace_sync_seqscan) + { + if ((location % 1000) == 0) + elog(DEBUG2, "SYNC_SCAN: page %d", location); + } + + /* + * To reduce lock contention, only report scan progress every n pages. + * For the same reason, don't block if the lock isn't immediately + * available. Missing a few updates isn't critical, it just means that a + * new scan that wants to join the pack will start a little bit behind the + * head of the scan. Hopefully the pages are still in OS cache and the + * scan catches up quickly. 
+ */ + if ((location % SYNC_SCAN_REPORT_INTERVAL) == 0) + { + if (LWLockConditionalAcquire(SyncScanLock, LW_EXCLUSIVE)) + { + ss_search(scan->rs_rd->rd_node,location,true); + LWLockRelease(SyncScanLock); + } + else if (Trace_sync_seqscan) + elog(DEBUG1, "SYNC_SCAN: missed update %d", location); + } + } + Index: src/backend/storage/ipc/ipci.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/ipc/ipci.c,v retrieving revision 1.91 diff -c -r1.91 ipci.c *** src/backend/storage/ipc/ipci.c 15 Feb 2007 23:23:23 -0000 1.91 --- src/backend/storage/ipc/ipci.c 31 May 2007 10:21:28 -0000 *************** *** 19,24 **** --- 19,25 ---- #include "access/nbtree.h" #include "access/subtrans.h" #include "access/twophase.h" + #include "access/heapam.h" #include "miscadmin.h" #include "pgstat.h" #include "postmaster/autovacuum.h" *************** *** 112,117 **** --- 113,119 ---- size = add_size(size, BgWriterShmemSize()); size = add_size(size, AutoVacuumShmemSize()); size = add_size(size, BTreeShmemSize()); + size = add_size(size, SyncScanShmemSize()); #ifdef EXEC_BACKEND size = add_size(size, ShmemBackendArraySize()); #endif *************** *** 216,221 **** --- 218,224 ---- * Set up other modules that need some shared memory space */ BTreeShmemInit(); + SyncScanShmemInit(); #ifdef EXEC_BACKEND Index: src/backend/utils/misc/guc.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/guc.c,v retrieving revision 1.391 diff -c -r1.391 guc.c *** src/backend/utils/misc/guc.c 8 May 2007 16:33:51 -0000 1.391 --- src/backend/utils/misc/guc.c 2 Jun 2007 07:50:23 -0000 *************** *** 26,31 **** --- 26,32 ---- #endif + #include "access/heapam.h" #include "access/gin.h" #include "access/transam.h" #include "access/twophase.h" *************** *** 783,788 **** --- 784,799 ---- false, NULL, NULL }, + { + {"trace_sync_seqscan", 
PGC_USERSET, DEVELOPER_OPTIONS, + gettext_noop("Generates debugging output for Synchronized Scans."), + NULL, + GUC_NOT_IN_SAMPLE + }, + &Trace_sync_seqscan, + false, NULL, NULL + }, + #ifdef LOCK_DEBUG { {"trace_locks", PGC_SUSET, DEVELOPER_OPTIONS, Index: src/include/access/heapam.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/heapam.h,v retrieving revision 1.124 diff -c -r1.124 heapam.h *** src/include/access/heapam.h 27 May 2007 03:50:39 -0000 1.124 --- src/include/access/heapam.h 2 Jun 2007 07:50:30 -0000 *************** *** 25,30 **** --- 25,33 ---- #include "utils/rel.h" #include "utils/tqual.h" + /* GUC variable, in syncscan.c */ + extern bool Trace_sync_seqscan; + /* ---------------- * fastgetattr * *************** *** 240,243 **** --- 243,252 ---- extern void heap_sync(Relation relation); + /* in heapam/syncscan.c */ + extern void ss_report_location(HeapScanDesc scan, BlockNumber location); + extern BlockNumber ss_get_location(HeapScanDesc scan); + extern void SyncScanShmemInit(void); + extern Size SyncScanShmemSize(void); + #endif /* HEAPAM_H */ Index: src/include/access/relscan.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/relscan.h,v retrieving revision 1.54 diff -c -r1.54 relscan.h *** src/include/access/relscan.h 30 May 2007 20:12:02 -0000 1.54 --- src/include/access/relscan.h 2 Jun 2007 07:51:10 -0000 *************** *** 34,39 **** --- 34,40 ---- bool rs_inited; /* false = scan not init'd yet */ HeapTupleData rs_ctup; /* current tuple in scan, if any */ BlockNumber rs_cblock; /* current block # in scan, if any */ + BlockNumber rs_startpage; /* page where this scan began */ Buffer rs_cbuf; /* current buffer in scan, if any */ /* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */ ItemPointerData rs_mctid; /* marked scan position, if any */ Index: 
src/include/storage/lwlock.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/lwlock.h,v retrieving revision 1.36 diff -c -r1.36 lwlock.h *** src/include/storage/lwlock.h 16 Apr 2007 18:30:04 -0000 1.36 --- src/include/storage/lwlock.h 31 May 2007 10:21:28 -0000 *************** *** 62,67 **** --- 62,68 ---- AddinShmemInitLock, AutovacuumLock, AutovacuumScheduleLock, + SyncScanLock, /* Individual lock IDs end here */ FirstBufMappingLock, FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS, Index: src/test/regress/parallel_schedule =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/test/regress/parallel_schedule,v retrieving revision 1.42 diff -c -r1.42 parallel_schedule *** src/test/regress/parallel_schedule 2 Apr 2007 03:49:42 -0000 1.42 --- src/test/regress/parallel_schedule 4 Jun 2007 09:07:27 -0000 *************** *** 61,67 **** # ---------- # The fourth group of parallel test # ---------- ! test: select_into select_distinct select_distinct_on select_implicit select_having subselect union case join aggregatestransactions random portals arrays btree_index hash_index update namespace prepared_xacts delete test: privileges test: misc --- 61,67 ---- # ---------- # The fourth group of parallel test # ---------- ! 
test: select_into select_distinct select_distinct_on select_implicit select_having subselect union case join aggregatestransactions random portals arrays btree_index hash_index update namespace prepared_xacts delete syncscan test: privileges test: misc Index: src/test/regress/expected/syncscan.out =================================================================== RCS file: src/test/regress/expected/syncscan.out diff -N src/test/regress/expected/syncscan.out *** /dev/null 1 Jan 1970 00:00:00 -0000 --- src/test/regress/expected/syncscan.out 4 Jun 2007 09:34:00 -0000 *************** *** 0 **** --- 1,24 ---- + -- + -- Sequential scan synchronization + -- + CREATE TABLE ss_test (id VARCHAR(10), data VARCHAR(500)); + INSERT INTO ss_test SELECT '1st half', repeat('a',500) FROM generate_series(1,10000); + INSERT INTO ss_test SELECT '2nd half', repeat('a',500) FROM generate_series(1,10000); + -- Scan little over half of the table with a sequential scan. OFFSET 1000 because we don't + -- track the scan location exactly, so the next scan will start a little bit behind from + -- where this scan stops. + SELECT id FROM ss_test WHERE id = '2nd half' OFFSET 1000 LIMIT 1; + id + ---------- + 2nd half + (1 row) + + -- If synchronized scans are used, this should begin roughly from where the above scan stopped, + -- and should return '2nd half' + SELECT id FROM ss_test LIMIT 1; + id + ---------- + 2nd half + (1 row) + + DROP TABLE ss_test; Index: src/test/regress/sql/syncscan.sql =================================================================== RCS file: src/test/regress/sql/syncscan.sql diff -N src/test/regress/sql/syncscan.sql *** /dev/null 1 Jan 1970 00:00:00 -0000 --- src/test/regress/sql/syncscan.sql 4 Jun 2007 09:37:56 -0000 *************** *** 0 **** --- 1,22 ---- + -- + -- Sequential scan synchronization + -- + -- NOTE: This test depends on ss_test being large enough that synchronized + -- scans are used. 
Increase the number of rows inserted to it, or decrease + -- shared_buffers if that's not happening. + + CREATE TABLE ss_test (id VARCHAR(10), data VARCHAR(500)); + + INSERT INTO ss_test SELECT '1st half', repeat('a',500) FROM generate_series(1,10000); + INSERT INTO ss_test SELECT '2nd half', repeat('a',500) FROM generate_series(1,10000); + + -- Scan little over half of the table with a sequential scan. OFFSET 1000 + -- because we don't track the scan location exactly, so the next scan will + -- start a little bit behind from where this scan stops. + SELECT id FROM ss_test WHERE id = '2nd half' OFFSET 1000 LIMIT 1; + + -- If synchronized scans are used, this should begin roughly from where the + -- above scan stopped, and should return '2nd half' + SELECT id FROM ss_test LIMIT 1; + + DROP TABLE ss_test;
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> For the record, this patch has a small negative impact on scans like
> "SELECT * FROM foo LIMIT 1000". If such a scan is run repeatedly, in CVS
> HEAD the first 1000 rows will stay in buffer cache, but with the patch
> each scan will start from roughly where previous one stopped, requiring
> more pages to be read from disk each time. I don't think it's something
> to worry about in practice, but I thought I'd mention it.

Urgh. The answers change depending on (more or less) the phase of the
moon? I've got a serious problem with that. You might look back to
1997 when GEQO very nearly got tossed out entirely because it destroyed
reproducibility of query results.

			regards, tom lane
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> For the record, this patch has a small negative impact on scans like
>> "SELECT * FROM foo LIMIT 1000". If such a scan is run repeatedly, in CVS
>> HEAD the first 1000 rows will stay in buffer cache, but with the patch
>> each scan will start from roughly where previous one stopped, requiring
>> more pages to be read from disk each time. I don't think it's something
>> to worry about in practice, but I thought I'd mention it.
>
> Urgh. The answers change depending on (more or less) the phase of the
> moon? I've got a serious problem with that. You might look back to
> 1997 when GEQO very nearly got tossed out entirely because it destroyed
> reproducibility of query results.

That's a very fundamental result of this patch, unfortunately. It only
happens on scans on tables larger than the threshold. And because we
only report the current scan location every 128KB, if you repeat the
same SELECT .. LIMIT X query with no other scanners on that table,
you'll get the same results as long as X is smaller than 128KB.

I thought we've been through this issue already...

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
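The 128KB figure above comes from SYNC_SCAN_REPORT_INTERVAL in the patch; with the default 8KB block size that works out to reporting every 16 pages, which is why a repeated LIMIT stays stable as long as it reads less than 128KB. A minimal restatement of that arithmetic (`should_report` is my name for the modulo check that ss_report_location does inline):

```c
#include <assert.h>
#include <stdbool.h>

#define BLCKSZ 8192     /* PostgreSQL's default block size */

/* From the patch: report the scan position every 128KB worth of pages */
#define SYNC_SCAN_REPORT_INTERVAL (128 * 1024 / BLCKSZ)

/* A scan reports its position only on pages at multiples of the interval */
static bool
should_report(unsigned page)
{
    return (page % SYNC_SCAN_REPORT_INTERVAL) == 0;
}
```

So a concurrent scan joining the pack can lag the reporting scan by up to 15 pages (~120KB), but never more.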
Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Heikki Linnakangas <heikki@enterprisedb.com> writes:
> >> For the record, this patch has a small negative impact on scans like
> >> "SELECT * FROM foo LIMIT 1000". If such a scan is run repeatedly, in CVS
> >> HEAD the first 1000 rows will stay in buffer cache, but with the patch
> >> each scan will start from roughly where previous one stopped, requiring
> >> more pages to be read from disk each time. I don't think it's something
> >> to worry about in practice, but I thought I'd mention it.
> >
> > Urgh. The answers change depending on (more or less) the phase of the
> > moon? I've got a serious problem with that. You might look back to
> > 1997 when GEQO very nearly got tossed out entirely because it destroyed
> > reproducibility of query results.
>
> That's a very fundamental result of this patch, unfortunately. It only
> happens on scans on tables larger than the threshold. And because we
> only report the current scan location every 128KB, if you repeat the
> same SELECT .. LIMIT X query with no other scanners on that table,
> you'll get the same results as long as X is smaller than 128KB.
>
> I thought we've been through this issue already...

Agreed. I thought we always said that a LIMIT without an ORDER BY was
meaningless, particularly because an intervening UPDATE could have moved
rows to another place in the table. In fact, at one time we considered
preventing LIMIT without ORDER BY because it was meaningless, but
decided if people want unstable results, they should be able to get
them.

An argument could be made that a LIMIT without ORDER BY on a table
locked read-only should be stable.

As I understand it, the problem is that while currently LIMIT without
ORDER BY always starts at the beginning of the table, it will not with
this patch. I consider that acceptable.

-- 
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
> As I understand it, the problem is that while currently LIMIT without
> ORDER BY always starts at the beginning of the table, it will not with
> this patch. I consider that acceptable.

It's definitely going to require stronger warnings than we have now
about using LIMIT without ORDER BY.

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> As I understand it, the problem is that while currently LIMIT without
>> ORDER BY always starts at the beginning of the table, it will not with
>> this patch. I consider that acceptable.
>
> It's definitely going to require stronger warnings than we have now
> about using LIMIT without ORDER BY.

Along the lines of

NOTICE: LIMIT without ORDER BY returns an arbitrary set of matching rows

perhaps? I wonder how easy it is to detect that in the planner. Or just
a remark in the manual?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
On Mon, 2007-06-04 at 10:53 +0100, Heikki Linnakangas wrote:
> I'm now done with this patch and testing it.

Great!

> For the record, this patch has a small negative impact on scans like
> "SELECT * FROM foo LIMIT 1000". If such a scan is run repeatedly, in CVS
> HEAD the first 1000 rows will stay in buffer cache, but with the patch
> each scan will start from roughly where previous one stopped, requiring
> more pages to be read from disk each time. I don't think it's something
> to worry about in practice, but I thought I'd mention it.

No surprise here, as you and Bruce have already pointed out.

If we wanted to reduce the occurrence of this phenomenon, we could
perhaps "time out" the hints so that it's impossible to pick up a hint
from a scan that finished 5 minutes ago.

It doesn't seem helpful to further obscure the non-determinism of the
results, however.

Regards,
	Jeff Davis
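The timeout idea could be implemented by stamping each LRU entry with its last update time and ignoring stale hints. The following is purely a hypothetical sketch of that suggestion; nothing like it exists in the patch, and the struct, names, and 5-minute constant are mine:

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical: how long a reported scan position stays trustworthy */
#define HINT_TIMEOUT_SECS (5 * 60)

/* Hypothetical extension of the patch's per-relation LRU entry */
typedef struct
{
    unsigned    location;       /* last reported block number */
    time_t      last_update;    /* when it was last reported */
} scan_hint;

/* Trust the hint only if it was refreshed within the timeout window */
static bool
hint_is_fresh(const scan_hint *hint, time_t now)
{
    return (now - hint->last_update) <= HINT_TIMEOUT_SECS;
}
```

ss_get_location would then fall back to block 0 when hint_is_fresh() returns false, so a lone repeated scan behaves like CVS HEAD; the cost is a gettimeofday-style call on each report.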
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> For the record, this patch has a small negative impact on scans like
>> "SELECT * FROM foo LIMIT 1000". If such a scan is run repeatedly, in CVS
>> HEAD the first 1000 rows will stay in buffer cache, but with the patch
>> each scan will start from roughly where previous one stopped, requiring
>> more pages to be read from disk each time. I don't think it's something
>> to worry about in practice, but I thought I'd mention it.
>
> Urgh. The answers change depending on (more or less) the phase of the
> moon? I've got a serious problem with that. You might look back to
> 1997 when GEQO very nearly got tossed out entirely because it destroyed
> reproducibility of query results.

What about the simple idea of just disabling the use of a sync scan when
the query has LIMIT and no ORDER BY, and start always at block 0 in that
case?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote: > Tom Lane wrote: >> Heikki Linnakangas <heikki@enterprisedb.com> writes: >>> For the record, this patch has a small negative impact on scans like >>> "SELECT * FROM foo LIMIT 1000". If such a scan is run repeatedly, in CVS >>> HEAD the first 1000 rows will stay in buffer cache, but with the patch >>> each scan will start from roughly where previous one stopped, requiring >>> more pages to be read from disk each time. I don't think it's something >>> to worry about in practice, but I thought I'd mention it. >> Urgh. The answers change depending on (more or less) the phase of the >> moon? I've got a serious problem with that. You might look back to >> 1997 when GEQO very nearly got tossed out entirely because it destroyed >> reproducibility of query results. > > What about the simple idea of just disabling the use of a sync scan when > the query has LIMIT and no ORDER BY, and start always at block 0 in that > case? That handles the LIMIT case, but you would still observe the different ordering. And some people implement LIMIT-like behavior on the client side, by opening a cursor and only fetching the first n rows. I don't think anyone can reasonably expect to get the same ordering when the same query is issued twice in general, but within the same transaction it wouldn't be that unreasonable. If we care about that, we could keep track of starting locations per transaction, only do the synchronization on the first scan in a transaction, and start subsequent scans from the same page as the first one. That way if you issue the same query twice in a transaction, or do something like: BEGIN; SELECT * FROM queue FOR UPDATE LIMIT 10 do stuff.. DELETE FROM queue LIMIT 10 COMMIT; you'd get the expected result. I think the warning on LIMIT without ORDER BY is a good idea, regardless of the synchronized scans patch. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Jeff Davis wrote: > No surprise here, as you and Bruce have already pointed out. > > If we wanted to reduce the occurrence of this phenomenon, we could > perhaps "time out" the hints so that it's impossible to pick up a hint > from a scan that finished 5 minutes ago. > > It doesn't seem helpful to further obscure the non-determinism of the > results, however. I agree it's probably not a good idea to try masking this. It'll just make it harder to hit the issue in testing, so that you run into it in production. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > I don't think anyone can reasonably expect to get the same ordering when > the same query issued twice in general, but within the same transaction > it wouldn't be that unreasonable. If we care about that, we could keep > track of starting locations per transaction, only do the synchronization > on the first scan in a transaction, and start subsequent scans from the > same page as the first one. I think the real problem here is that the first scan is leaving state behind that changes the behavior of the next scan. Which can have no positive benefit, since obviously the first scan is not still proceeding; the best you can hope for is that it's a no-op and the worst case is that it actively pessimizes things. Why doesn't the patch remove the shmem entry at scan termination? > I think the warning on LIMIT without ORDER BY is a good idea, regardless > of the synchronized scans patch. I seriously doubt that can be done in any way that doesn't both warn about perfectly-safe cases and fail to warn about other unsafe ones. Furthermore, it's not uncommon for people to do "SELECT * ... LIMIT 1" just to remind themselves of column names or whatever. Do we really want the thing to be so nannyish? I was envisioning simply a stronger warning in the SELECT reference page ... regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> I don't think anyone can reasonably expect to get the same ordering when >> the same query issued twice in general, but within the same transaction >> it wouldn't be that unreasonable. If we care about that, we could keep >> track of starting locations per transaction, only do the synchronization >> on the first scan in a transaction, and start subsequent scans from the >> same page as the first one. > > I think the real problem here is that the first scan is leaving state > behind that changes the behavior of the next scan. Which can have no > positive benefit, since obviously the first scan is not still > proceeding; the best you can hope for is that it's a no-op and the worst > case is that it actively pessimizes things. Why doesn't the patch > remove the shmem entry at scan termination? Because there's no reason why it should, and it would require a lot more bookkeeping. There can be many scanners on the same table, so we'd need to implement some kind of reference counting, which means having to reliably decrement the counter when a scan terminates. In any case if there actually is a concurrent scan, you'd still see different ordering. Removing the entry when a scan is over would just make it harder to trigger. >> I think the warning on LIMIT without ORDER BY is a good idea, regardless >> of the synchronized scans patch. > > I seriously doubt that can be done in any way that doesn't both warn > about perfectly-safe cases and fail to warn about other unsafe ones. > Furthermore, it's not uncommon for people to do "SELECT * ... LIMIT 1" > just to remind themselves of column names or whatever. Do we really > want the thing to be so nannyish? It really depends on how many false negatives and positives it gives. If too many, it's just annoying, but if it's reasonably accurate I think it would be OK to remind people running queries like that. 
> I was envisioning simply a stronger warning in the SELECT reference page ... I doubt the people that would be bitten by this read the SELECT reference page. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Jun 4, 2007, at 15:24 , Heikki Linnakangas wrote: > I don't think anyone can reasonably expect to get the same ordering > when the same query is issued twice in general, but within the same > transaction it wouldn't be that unreasonable. The order in which rows are returned without an ORDER BY clause *is* implementation-dependent, and is not guaranteed, at least by the spec. Granted, LIMIT without ORDER BY (and DISTINCT for that matter) brings this into sharp relief. > I think the warning on LIMIT without ORDER BY is a good idea, > regardless of the synchronized scans patch. I'm not saying this isn't a good idea, but are there other places where there might be gotchas for the unwary, such as DISTINCT without ORDER BY or (for an unrelated example) UNION versus UNION ALL? How many of these types of messages would be useful? Michael Glaesemann grzm seespotcode net
On Mon, 2007-06-04 at 16:42 -0400, Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: > > I don't think anyone can reasonably expect to get the same ordering when > > the same query issued twice in general, but within the same transaction > > it wouldn't be that unreasonable. If we care about that, we could keep > > track of starting locations per transaction, only do the synchronization > > on the first scan in a transaction, and start subsequent scans from the > > same page as the first one. > > I think the real problem here is that the first scan is leaving state > behind that changes the behavior of the next scan. Which can have no > positive benefit, since obviously the first scan is not still > proceeding; the best you can hope for is that it's a no-op and the worst > case is that it actively pessimizes things. Why doesn't the patch > remove the shmem entry at scan termination? > Sounds like a reasonable idea to me. We could add the PID to the data structure so that it would only remove the hint if it's the one that set the hint. I think we'd just need to call a function to do that from heap_endscan(), correct? However, we couldn't 100% guarantee that the state would be cleared. A backend could be killed in the middle of a scan. The case you're worried about seems very narrow to me, and I think "actively pessimizes" is too strong. Unless I misunderstand, the "best you can hope for" no-op happens in all cases except a most bizarre one: one in which you're executing repeated identical LIMIT queries with no ORDER BY; the tuples returned occupy more than 128K (16 pages is the reporting period) but few enough to be cached effectively; and the table in question is larger than the large-table threshold. I'm just trying to add some perspective on what we're fixing here. But it's fair to say that a scan should clear any state when it's done. Regards, Jeff Davis
Michael Glaesemann wrote: >> I think the warning on LIMIT without ORDER BY is a good idea, >> regardless of the synchronized scans patch. > > I'm not saying this isn't a good idea, but are there other places where > there might be gotchas for the unwary, such as DISTINCT without ORDER BY > or (for an unrelated example) UNION versus UNION ALL? How many of these > types of messages would be useful? LIMIT without ORDER BY is worse because it not only returns tuples in different order, but it can return different tuples altogether when you run it multiple times. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, 2007-06-04 at 22:09 +0100, Heikki Linnakangas wrote: > > I think the real problem here is that the first scan is leaving state > > behind that changes the behavior of the next scan. Which can have no > > positive benefit, since obviously the first scan is not still > > proceeding; the best you can hope for is that it's a no-op and the worst > > case is that it actively pessimizes things. Why doesn't the patch > > remove the shmem entry at scan termination? > > Because there's no reason why it should, and it would require a lot more > bookkeeping. There can be many scanners on the same table, so we'd need > to implement some kind of reference counting, which means having to > reliably decrement the counter when a scan terminates. > That's what I thought at first, and why I didn't do it. Right now I'm thinking we could just add the PID to the hint, so that it would only remove its own hint. Would that work? It's still vulnerable to a backend being killed and the hint hanging around. However, the next scan would get it back to normal. Reference counting would cause weirdness if a backend died or something, because it would never decrement to 0. > In any case if there actually is a concurrent scan, you'd still see > different ordering. Removing the entry when a scan is over would just > make it harder to trigger. > Agreed. I don't know for sure whether that's good or bad, but it would make the nondeterminism less immediately visible. Regards, Jeff Davis
Jeff Davis wrote: > On Mon, 2007-06-04 at 22:09 +0100, Heikki Linnakangas wrote: >>> I think the real problem here is that the first scan is leaving state >>> behind that changes the behavior of the next scan. Which can have no >>> positive benefit, since obviously the first scan is not still >>> proceeding; the best you can hope for is that it's a no-op and the worst >>> case is that it actively pessimizes things. Why doesn't the patch >>> remove the shmem entry at scan termination? >> Because there's no reason why it should, and it would require a lot more >> bookkeeping. There can be many scanners on the same table, so we'd need >> to implement some kind of reference counting, which means having to >> reliably decrement the counter when a scan terminates. >> > > That's what I thought at first, and why I didn't do it. Right now I'm > thinking we could just add the PID to the hint, so that it would only > remove its own hint. Would that work? Were you thinking of storing the PID of the backend that originally created the hint, or updating the PID every time the hint is updated? In any case, we still wouldn't know if there's other scanners still running. We could just always remove the hint when a scan ends, and rely on the fact that if there's other scans still running they will put the hint back very quickly. There would then be a small window where there's no hint but a scan is active, and a new scan starting during that window would fail to synchronize with the other scanners. > It's still vulnerable to a backend being killed and the hint hanging > around. However, the next scan would get it back to normal. Oh, did you mean that the PID would be updated whenever a new scan starts? So that the PID stored would always be the PID of the latest scanner. That might work pretty well, though a small scan with a LIMIT, or any other situation where some scans run faster than others, might clear the hint prematurely while other scans are still running.
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, 2007-06-04 at 22:57 +0100, Heikki Linnakangas wrote: > > That's what I thought at first, and why I didn't do it. Right now I'm > > thinking we could just add the PID to the hint, so that it would only > > remove its own hint. Would that work? > > Were you thinking of storing the PID of the backend that originally > created the hint, or updating the PID every time the hint is updated? In > any case, we still wouldn't know if there's other scanners still running. > My thought was that every time the location was reported by a backend, it would store 3 pieces of information, not 2: * relfilenode * the PID of the backend that created or updated this particular hint last * the location Then, on heap_endscan() (if that's the right place), we find the hint, and if the PID matches, we remove it. If not, it does nothing. This would only matter when there weren't other scans. When concurrent scans were happening, chances are the PID wouldn't match anyway, and thus not be removed. Regards, Jeff Davis
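The three-field hint Jeff describes, with PID-checked removal at heap_endscan() time, can be modeled in miniature like this (a single-process sketch; the struct layout, function names, and single-slot table are all illustrative, not taken from the patch, whose hint actually lives in shared memory):

```c
#include <assert.h>

/* Toy model of the proposed hint entry: relation, last reporter's PID,
 * and the current scan location. One static slot stands in for the
 * shared-memory table, which would hold one entry per relation. */
typedef struct
{
    unsigned int relfilenode;   /* which relation the hint is for */
    int          pid;           /* backend that last reported a location */
    unsigned int location;      /* latest reported block number */
    int          valid;         /* is the slot in use? */
} ScanHint;

static ScanHint hint;

/* Reporting a location also takes over ownership of the hint. */
static void
report_location(unsigned int rel, int pid, unsigned int loc)
{
    hint.relfilenode = rel;
    hint.pid = pid;
    hint.location = loc;
    hint.valid = 1;
}

/* At scan end, remove the hint only if we were the last to update it;
 * a concurrent scan that reported more recently keeps it alive. */
static void
end_scan(unsigned int rel, int pid)
{
    if (hint.valid && hint.relfilenode == rel && hint.pid == pid)
        hint.valid = 0;
}
```

This also shows the failure mode discussed above: if the last reporter dies without reaching end_scan(), the hint simply lingers until the next scan on that table overwrites and eventually removes it.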
On Jun 4, 2007, at 16:34 , Heikki Linnakangas wrote: > LIMIT without ORDER BY is worse because it not only returns tuples > in different order, but it can return different tuples altogether > when you run it multiple times. Wouldn't DISTINCT ON suffer from the same issue without ORDER BY? Michael Glaesemann grzm seespotcode net
Jeff Davis <pgsql@j-davis.com> writes: > My thought was that every time the location was reported by a backend, > it would store 3 pieces of information, not 2: > * relfilenode > * the PID of the backend that created or updated this particular hint > last > * the location > Then, on heap_endscan() (if that's the right place), we find the hint, > and if the PID matches, we remove it. If not, it does nothing. > This would only matter when there weren't other scans. When concurrent > scans were happening, chances are the PID wouldn't match anyway, and > thus not be removed. But note that barring backend crash, once all the scans are done it is guaranteed that the hint will be removed --- somebody will be last to update the hint, and therefore will remove it when they do heap_endscan, even if others are not quite done. This is good in the sense that later-starting backends won't be fooled into starting at what is guaranteed to be the most pessimal spot, but it's got a downside too, which is that there will be windows where seqscans are in process but a newly started scan won't see them. Maybe that's a killer objection. When exactly is the hint updated? I gathered from something Heikki said that it's set after processing X amount of data, but I think it might be better to set it *before* processing X amount of data. That is, the hint means "I'm going to be scanning at least <threshold> blocks starting here", not "I have scanned <threshold> blocks ending here", which seems like the interpretation that's being used at the moment. What that would mean is that successive "LIMIT 1000" calls would in fact all start at the same place, barring interference from other backends. regards, tom lane
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > Were you thinking of storing the PID of the backend that originally created the > hint, or updating the PID every time the hint is updated? In any case, we still > wouldn't know if there's other scanners still running. My reaction was if you always put the pid in when updating the hint it might work out pretty well. When you finish you remove the hint iff the pid is your own. It has exactly the same properties you describe of leaving a small window with no hint when you finish a scan but only if you were the last to update the scan position which i would expect would be pretty rare except for the last scanner. If a backend died it would leave a scan position behind but the next scanner on that table would overwrite the pid and then remove it when it's finished. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > LIMIT without ORDER BY is worse because it not only returns tuples in different > order, but it can return different tuples altogether when you run it multiple > times. I don't think printing a notice is feasible. For regular DML a notice or info message is basically equivalent to an error. It makes it entirely impractical to use such a query in a production system where it will spam your logs with thousands of messages. Now it might be worth considering making this an error. We already make it an error to have DISTINCT ON without an matching ORDER BY or a full outer join without a merge-joinable operator. Both of those are really implementation limitations but they're also arguably conceptual limitations in that it's hard (though not impossible) to imagine it working any differently. However... there are circumstances where having non-deterministic queries is perfectly reasonable. Perhaps you're implementing a work queue and want to process a batch of records but don't care which. Or perhaps you want to cap the number of records a search result will display to a reasonable value that won't cause problems but you expect under normal conditions not to reach that limit. There are also cases where people use OFFSET and LIMIT with degenerate values (either OFFSET 0 or LIMIT <bignum>) to induce the planner to plan queries differently knowing that it won't actually change the results. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Mon, 2007-06-04 at 18:25 -0400, Tom Lane wrote: > But note that barring backend crash, once all the scans are done it is > guaranteed that the hint will be removed --- somebody will be last to > update the hint, and therefore will remove it when they do heap_endscan, > even if others are not quite done. This is good in the sense that > later-starting backends won't be fooled into starting at what is > guaranteed to be the most pessimal spot, but it's got a downside too, > which is that there will be windows where seqscans are in process but > a newly started scan won't see them. Maybe that's a killer objection. I don't think it would be a major objection. If there aren't other sequential scans in progress, the point is moot, and if there are: (a) the hint has a lower probability of being removed, since it may contain the PID of one of those other scans. (b) the hint is likely to be replaced quite quickly The problem is, I think people would be more frustrated by 1 in 1000 queries starting the scan in the wrong place because a hint was deleted, because that could cause a major difference in performance. I expect the current patch would have more consistent performance for that reason. To me, it seems to be a small benefit and a small cost. It's hard for me to feel very strongly either way. > When exactly is the hint updated? I gathered from something Heikki said > that it's set after processing X amount of data, but I think it might be > better to set it *before* processing X amount of data. That is, the > hint means "I'm going to be scanning at least <threshold> blocks > starting here", not "I have scanned <threshold> blocks ending here", > which seems like the interpretation that's being used at the moment. > What that would mean is that successive "LIMIT 1000" calls would in fact > all start at the same place, barring interference from other backends. > If I understand correctly, this is a one-page difference in the report location, right? 
We can either report that we've just finished scanning block 1023 (ending an X block chunk of reading) and another backend can start scanning at 1023 (current behavior); or we could report that we're about to scan an X block chunk of data starting with block 1024, and the new scan can start at 1024. We don't want the new scan to jump in ahead of the existing scan, because then we're introducing uncached blocks between the two scans -- risking divergence. If the data occupies less than X data pages, the LIMIT queries will be deterministic for single-scans anyway, because no reports will happen (other than the starting location, which won't matter in this case). If the data is more than that, then at least one report would have happened. At this point, you're talking about rewinding the scan (how far?), which I originally coded for with sync_seqscan_offset. That feature didn't prove very useful (yet), so I removed it. Regards, Jeff Davis
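The wrap-around traversal being described, where a scan starts at the hinted block, runs to the end of the relation, wraps to block 0, and terminates on returning to its starting block, can be sketched as follows (the function and its interface are invented for illustration; it mirrors the forward-scan arithmetic discussed in the thread, not the patch's actual code):

```c
#include <assert.h>

/* Visit every block of an nblocks-page relation exactly once, starting
 * at startpage and wrapping around at the end. Writes the visit order
 * into 'order' (which must hold nblocks entries) and returns the count. */
static int
scan_pages(unsigned int nblocks, unsigned int startpage, unsigned int *order)
{
    unsigned int page = startpage;
    int          n = 0;

    if (nblocks == 0)
        return 0;               /* empty relation: nothing to visit */
    for (;;)
    {
        order[n++] = page;      /* "read" this block */
        page++;                 /* advance with wrap-around */
        if (page == nblocks)
            page = 0;
        if (page == startpage)  /* back at the start: scan exhausted */
            break;
    }
    return n;
}
```

A scan of an 8-page relation hinted to start at block 5 thus reads blocks 5, 6, 7, then wraps to 0 through 4, which is why a new scan joining just behind the hint stays within already-cached pages.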
Jeff Davis <pgsql@j-davis.com> writes: > The problem is, I think people would be more frustrated by 1 in 1000 > queries starting the scan in the wrong place because a hint was deleted, Yeah --- various people have been complaining recently about how we have good average performance and bad worst case, and this behavior would definitely be more of the same. So I'm kind of backing away from the idea of deleting the hint. But if we could change the hint behavior to say "start reading here", successive short LIMITed reads would all start reading from the same point, which fixes both my reproducibility concern and Heikki's original point about being able to re-use cached data. You'd only get into the irreproducible behavior if the LIMIT was larger than the amount of data scanned between hint updates. regards, tom lane
Tom Lane wrote: > But note that barring backend crash, once all the scans are done it is > guaranteed that the hint will be removed --- somebody will be last to > update the hint, and therefore will remove it when they do heap_endscan, > even if others are not quite done. This is good in the sense that > later-starting backends won't be fooled into starting at what is > guaranteed to be the most pessimal spot, but it's got a downside too, > which is that there will be windows where seqscans are in process but > a newly started scan won't see them. Maybe that's a killer objection. I think the way the patch is now is better than trying to remove the hints, but I don't feel strongly either way. However, I don't think we should try hard to mask the issue. It just means people are more likely to miss it in testing, and run into it in production. It's better to find out sooner than later. It might be a good idea to preserve the order within a transaction, though that means more code. > When exactly is the hint updated? I gathered from something Heikki said > that it's set after processing X amount of data, but I think it might be > better to set it *before* processing X amount of data. That is, the > hint means "I'm going to be scanning at least <threshold> blocks > starting here", not "I have scanned <threshold> blocks ending here", > which seems like the interpretation that's being used at the moment. > What that would mean is that successive "LIMIT 1000" calls would in fact > all start at the same place, barring interference from other backends. I don't see how it makes any difference whether you update the hint before or after processing. Running a LIMIT 1000 query repeatedly will start from the same place in any case, assuming 1000 tuples fit in the "report interval", which is 128KB currently. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
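The throttling Heikki refers to, publishing the position only once per report interval so that a short LIMITed scan never moves the hint, can be sketched like so (constants and names are illustrative; 8KB is PostgreSQL's default block size, and the 16-page interval corresponds to the 128KB figure mentioned above):

```c
#include <assert.h>

#define BLOCK_SIZE      8192                    /* default PostgreSQL page size */
#define REPORT_INTERVAL (131072 / BLOCK_SIZE)   /* 128KB = 16 pages */

static unsigned int hint_location;  /* stand-in for the shared hint */
static unsigned int pages_scanned;  /* per-scan page counter */

/* Called once per page read; publishes the position only every
 * REPORT_INTERVAL pages, so short scans leave the hint untouched. */
static void
maybe_report(unsigned int page)
{
    pages_scanned++;
    if (pages_scanned % REPORT_INTERVAL == 0)
        hint_location = page;
}
```

Under this scheme a scan that reads fewer than 16 pages, such as a small LIMIT query, never updates the hint, so repeating it keeps returning the same rows.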
On Mon, 2007-06-04 at 21:39 -0400, Tom Lane wrote: > idea of deleting the hint. But if we could change the hint behavior to > say "start reading here", successive short LIMITed reads would all start > reading from the same point, which fixes both my reproducibility concern > and Heikki's original point about being able to re-use cached data. > You'd only get into the irreproducible behavior if the LIMIT was larger > than the amount of data scanned between hint updates. That's how it works now. Small limit queries don't change the location in the hint, so if you repeat them, the queries keep starting from the same place, and fetching the same tuples. Large LIMIT queries (larger than the number of pages between updates) do change the location in the hint, and so that's really the case you're worried about. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > That's how it works now. Small limit queries don't change the location > in the hint, so if you repeat them, the queries keep starting from the > same place, and fetching the same tuples. OK, maybe the problem's not as severe as I thought then. regards, tom lane
On Mon, 2007-06-04 at 10:53 +0100, Heikki Linnakangas wrote: > I'm now done with this patch and testing it. > > One difference between our patches is that, in my patch, the end-of-scan check happens after the hint has been set back to the starting position. That means, in my patch, if you do: SELECT * FROM bigtable; SELECT * FROM bigtable; with no concurrency at all, the returned order will be the same. In your patch, each full table scan leaves the hint at 16 pages before the position it started in, leading to different orders on the full table scans. Regards, Jeff Davis
Heikki Linnakangas <heikki@enterprisedb.com> writes: > I fixed a little off-by-one in "backward scan, not inited" branch, but I > was unable to test it. It seems that code is actually never used because > that case is optimized to a rewind in the executor. I marked those > seemingly unreachable places in the code with a comment. Actually it's not equivalent to a rewind, it's more like the startup condition for an Index Scan Backward: start at the far end of the relation and go backwards. I suspect that the whole thing may be unreachable code because the planner knows that seqscans are unordered and backward-scan is only of interest for an ordered scan. But be that as it may: do we even want a backwards-running scan to participate in a syncscan group? Unless *all* the backends are doing the same thing, it will not help and instead will bollix the syncscan for everyone else. I'm inclined to disable use of syncscan.c altogether when the scan is started backwards. It also seems prudent to suppress ss_report_location calls when stepping backward in a generally-forwards scan. Thoughts? regards, tom lane
On Thu, 2007-06-07 at 22:52 -0400, Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: > > I fixed a little off-by-one in "backward scan, not inited" branch, but I > > was unable to test it. It seems that code is actually never used because > > that case is optimized to a rewind in the executor. I marked those > > seemingly unreachable places in the code with a comment. > > Actually it's not equivalent to a rewind, it's more like the startup > condition for an Index Scan Backward: start at the far end of the > relation and go backwards. I suspect that the whole thing may be > unreachable code because the planner knows that seqscans are unordered > and backward-scan is only of interest for an ordered scan. But be that > as it may: do we even want a backwards-running scan to participate in a > syncscan group? Unless *all* the backends are doing the same thing, > it will not help and instead will bollix the syncscan for everyone else. > I'm inclined to disable use of syncscan.c altogether when the scan is > started backwards. Just to be sure: a backwards-started scan is currently unreachable code, correct? But as long as the code is there (reachable or not) it sounds good to disable sync scan in that case. > It also seems prudent to suppress ss_report_location > calls when stepping backward in a generally-forwards scan. Thoughts? I agree that we should disable ss_report_location if the scan is moving backwards. I might go so far as to suggest that if the scan *ever* moves backwards, we taint the scan such that it never reports. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > Just to be sure: a backwards-started scan is currently unreachable code, > correct? [ yawn... ] I think so, but I wouldn't swear to it right at the moment. In any case it doesn't seem like a path that we need to optimize. regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> I fixed a little off-by-one in "backward scan, not inited" branch, but I >> was unable to test it. It seems that code is actually never used because >> that case is optimized to a rewind in the executor. I marked those >> seemingly unreachable places in the code with a comment. > > Actually it's not equivalent to a rewind, it's more like the startup > condition for an Index Scan Backward: start at the far end of the > relation and go backwards. I suspect that the whole thing may be > unreachable code because the planner knows that seqscans are unordered > and backward-scan is only of interest for an ordered scan. Right. I was looking at the case where you open a cursor and start immediately scanning backwards. That's optimized to a rewind. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: >> Just to be sure: a backwards-started scan is currently unreachable code, >> correct? > > [ yawn... ] I think so, but I wouldn't swear to it right at the moment. > In any case it doesn't seem like a path that we need to optimize. Agreed, let's just disable the reporting when moving backwards. Jeff's idea of tainting the whole scan if you ever move backwards is not bad either, but it's so uncommon that we don't really need to care either way. BTW: Should we do the synchronization in the non-page-at-a-time mode? It's not many lines of code to do so, but IIRC that codepath is only used for catalog access. System tables really shouldn't grow that big, and if they do we shouldn't be doing seq scans for them anyway. Does the unstable ordering confuse any catalog accesses? Here's an update of the patch. I reverted the behavior at end of scan back to the way it was in Jeff's original patch, and disabled reporting the position when moving backwards. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com Index: src/backend/access/heap/Makefile =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/Makefile,v retrieving revision 1.15 diff -c -r1.15 Makefile *** src/backend/access/heap/Makefile 8 Apr 2007 01:26:27 -0000 1.15 --- src/backend/access/heap/Makefile 31 May 2007 10:21:28 -0000 *************** *** 12,18 **** top_builddir = ../../../.. include $(top_builddir)/src/Makefile.global ! OBJS = heapam.o hio.o rewriteheap.o tuptoaster.o all: SUBSYS.o --- 12,18 ---- top_builddir = ../../../.. include $(top_builddir)/src/Makefile.global ! 
OBJS = heapam.o hio.o rewriteheap.o tuptoaster.o syncscan.o all: SUBSYS.o Index: src/backend/access/heap/heapam.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v retrieving revision 1.234 diff -c -r1.234 heapam.c *** src/backend/access/heap/heapam.c 30 May 2007 20:11:53 -0000 1.234 --- src/backend/access/heap/heapam.c 8 Jun 2007 09:53:12 -0000 *************** *** 85,104 **** /* * If the table is large relative to NBuffers, use a bulk-read access ! * strategy, else use the default random-access strategy. During a ! * rescan, don't make a new strategy object if we don't have to. */ if (scan->rs_nblocks > NBuffers / 4 && !scan->rs_rd->rd_istemp) { if (scan->rs_strategy == NULL) scan->rs_strategy = GetAccessStrategy(BAS_BULKREAD); } else { if (scan->rs_strategy != NULL) FreeAccessStrategy(scan->rs_strategy); scan->rs_strategy = NULL; } scan->rs_inited = false; --- 85,108 ---- /* * If the table is large relative to NBuffers, use a bulk-read access ! * strategy and enable synchronized scanning (see syncscan.c). ! * During a rescan, don't make a new strategy object if we don't have to. */ if (scan->rs_nblocks > NBuffers / 4 && !scan->rs_rd->rd_istemp) { if (scan->rs_strategy == NULL) scan->rs_strategy = GetAccessStrategy(BAS_BULKREAD); + + scan->rs_startpage = ss_get_location(scan); } else { if (scan->rs_strategy != NULL) FreeAccessStrategy(scan->rs_strategy); scan->rs_strategy = NULL; + + scan->rs_startpage = 0; } scan->rs_inited = false; *************** *** 229,234 **** --- 233,239 ---- Snapshot snapshot = scan->rs_snapshot; bool backward = ScanDirectionIsBackward(dir); BlockNumber page; + BlockNumber prevpage; Page dp; int lines; OffsetNumber lineoff; *************** *** 251,257 **** tuple->t_data = NULL; return; } ! 
page = 0; /* first page */ heapgetpage(scan, page); lineoff = FirstOffsetNumber; /* first offnum */ scan->rs_inited = true; --- 256,262 ---- tuple->t_data = NULL; return; } ! page = scan->rs_startpage; /* first page */ heapgetpage(scan, page); lineoff = FirstOffsetNumber; /* first offnum */ scan->rs_inited = true; *************** *** 274,279 **** --- 279,288 ---- } else if (backward) { + /* Note: This is not normally reached, because the planner knows + * that seq scans are unordered by nature, and doesn't generate + * plans with backward scans. + */ if (!scan->rs_inited) { /* *************** *** 285,291 **** tuple->t_data = NULL; return; } ! page = scan->rs_nblocks - 1; /* final page */ heapgetpage(scan, page); } else --- 294,305 ---- tuple->t_data = NULL; return; } ! /* Start from the final page. We don't want to participate ! * in synchronized scanning if we're scanning backwards; that ! * would just step on the toes of any forward-scanners. ! */ ! page = scan->rs_startpage = scan->rs_nblocks - 1; ! heapgetpage(scan, page); } else *************** *** 398,406 **** LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK); /* ! * return NULL if we've exhausted all the pages */ ! if (backward ? (page == 0) : (page + 1 >= scan->rs_nblocks)) { if (BufferIsValid(scan->rs_cbuf)) ReleaseBuffer(scan->rs_cbuf); --- 412,439 ---- LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK); /* ! * Handle wrap-around correctly; we might have started from somewhere ! * in the middle of the relation. */ ! prevpage = page; ! if (backward) ! { ! if (page == 0) ! page = scan->rs_nblocks; ! page--; ! } ! else ! { ! page++; ! if (page == scan->rs_nblocks) ! page = 0; ! } ! ! /* ! * return NULL if we've exhausted all the pages. ! */ ! if ((ScanDirectionIsForward(dir) && page == scan->rs_startpage) || ! (ScanDirectionIsBackward(dir) && prevpage == scan->rs_startpage)) { if (BufferIsValid(scan->rs_cbuf)) ReleaseBuffer(scan->rs_cbuf); *************** *** 411,420 **** return; } - page = backward ?
(page - 1) : (page + 1); - heapgetpage(scan, page); LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE); dp = (Page) BufferGetPage(scan->rs_cbuf); --- 444,467 ---- return; } heapgetpage(scan, page); + /* Report our new scan position for synchronization purposes. + * We don't do that when moving backwards, however. That would + * just mess up any other forward-moving scanners. + * + * Note: we do this before checking for end of scan to reset + * the position hint back to the starting page when we do reach + * end of scan. That's not strictly necessary, but otherwise + * when you run the same query multiple times the starting + * position would shift a little bit backwards on every invocation, + * which is confusing. We don't guarantee any specific ordering + * in general, though. If there are any concurrent scans, the ordering + * will be different. + */ + if (scan->rs_strategy != NULL && !backward) + ss_report_location(scan, page); + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE); dp = (Page) BufferGetPage(scan->rs_cbuf); *************** *** 455,460 **** --- 502,508 ---- HeapTuple tuple = &(scan->rs_ctup); bool backward = ScanDirectionIsBackward(dir); BlockNumber page; + BlockNumber prevpage; Page dp; int lines; int lineindex; *************** *** 478,484 **** tuple->t_data = NULL; return; } ! page = 0; /* first page */ heapgetpage(scan, page); lineindex = 0; scan->rs_inited = true; --- 526,532 ---- tuple->t_data = NULL; return; } ! page = scan->rs_startpage; heapgetpage(scan, page); lineindex = 0; scan->rs_inited = true; *************** *** 500,505 **** --- 548,557 ---- { if (!scan->rs_inited) { + /* Note: This is not normally reached, because the planner knows + * that seq scans are unordered by nature, and doesn't generate + * plans with backward scans. + */ /* * return null immediately if relation is empty */ *************** *** 509,515 **** tuple->t_data = NULL; return; } !
page = scan->rs_nblocks - 1; /* final page */ heapgetpage(scan, page); } else --- 561,572 ---- tuple->t_data = NULL; return; } ! /* Start from the final page. We don't want to participate ! * in synchronized scanning if we're scanning backwards; that ! * would just step on the toes of any forward-scanners. ! */ ! page = scan->rs_startpage = scan->rs_nblocks - 1; ! heapgetpage(scan, page); } else *************** *** 613,626 **** } /* ! * if we get here, it means we've exhausted the items on this page and ! * it's time to move to the next. */ /* ! * return NULL if we've exhausted all the pages */ ! if (backward ? (page == 0) : (page + 1 >= scan->rs_nblocks)) { if (BufferIsValid(scan->rs_cbuf)) ReleaseBuffer(scan->rs_cbuf); --- 670,715 ---- } /* ! * If we get here, it means we've exhausted the items on this page and ! * it's time to move to the next. Handle wrap-around correctly; ! * we might have started from somewhere in the middle of the relation. ! */ ! prevpage = page; ! if (backward) ! { ! if (page == 0) ! page = scan->rs_nblocks; ! page--; ! } ! else ! { ! page++; ! if (page == scan->rs_nblocks) ! page = 0; ! } ! ! ! /* Report our new scan position for synchronization purposes. ! * We don't do that when moving backwards, however. That would ! * just mess up any other forward-moving scanners. ! * ! * Note: we do this before checking for end of scan to reset ! * the position hint back to the starting page when we do reach ! * end of scan. That's not strictly necessary, but otherwise ! * when you run the same query multiple times the starting ! * position would shift a little bit backwards on every invocation, ! * which is confusing. We don't guarantee any specific ordering ! * in general, though. If there are any concurrent scans, the ordering ! * will be different. */ + if (scan->rs_strategy != NULL && !backward) + ss_report_location(scan, page); /* ! * return NULL if we've exhausted all the pages. */ !
if ((ScanDirectionIsForward(dir) && page == scan->rs_startpage) || ! (ScanDirectionIsBackward(dir) && prevpage == scan->rs_startpage)) { if (BufferIsValid(scan->rs_cbuf)) ReleaseBuffer(scan->rs_cbuf); *************** *** 631,637 **** return; } - page = backward ? (page - 1) : (page + 1); heapgetpage(scan, page); dp = (Page) BufferGetPage(scan->rs_cbuf); --- 720,725 ---- Index: src/backend/access/heap/syncscan.c =================================================================== RCS file: src/backend/access/heap/syncscan.c diff -N src/backend/access/heap/syncscan.c *** /dev/null 1 Jan 1970 00:00:00 -0000 --- src/backend/access/heap/syncscan.c 8 Jun 2007 09:48:01 -0000 *************** *** 0 **** --- 1,287 ---- + /*------------------------------------------------------------------------- + * + * syncscan.c + * heap scan synchronization support + * + * When multiple backends run a sequential scan on the same table, we try + * to keep them synchronized to reduce the overall I/O needed. The goal is + * to read each page into the shared buffer cache only once, and let all backends + * that take part in the scan process the page before it falls out of the + * cache. + * + * To accomplish that, we keep track of the scan position of each table, and + * start new scans close to where the previous scan(s) are. Assuming that I/O + * is slow compared to the processing of each page, that's enough to keep the + * scans synchronized. We don't try to do any further synchronization to keep the + * scans together; some scans might progress much more slowly than others if + * for example the results need to be transferred to the client over a slow + * network, and we don't want such queries to slow down others. + * + * There can realistically only be a few sequential scans on different tables + * in progress at any time. Therefore we just keep the scan positions in a + * small LRU list which we scan every time we need to look up or update a scan + * position.
If that ever becomes a scalability issue, we could put the entries + * into a hash table. + * + * INTERFACE ROUTINES + * ss_get_location - return current scan location of a relation + * ss_report_location - update current scan location + * + * Portions Copyright (c) 1996-2007, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * $PostgreSQL$ + * + *------------------------------------------------------------------------- + */ + #include "postgres.h" + + #include "access/heapam.h" + #include "miscadmin.h" + + /* + * Size of the LRU list. + * + * Note: the code assumes that SYNC_SCAN_NELEM > 1. + * + * XXX: What's a good value? It should be large enough to hold the + * maximum number of large tables scanned simultaneously. But a larger value + * means more traversing of the LRU list when starting a new scan. + */ + #define SYNC_SCAN_NELEM 20 + + /* + * Interval between reports of the location of the current scan, in pages. + * + * Note: This should be smaller than the ring size (see freelist.c) we use + * for bulk reads. Otherwise a scan joining other scans might start from a + * page that's no longer in the buffer cache, and have to read it back + * again. This is a bit fuzzy; there's no guarantee that the new scan + * will read the page before it leaves the buffer cache anyway, and on the + * other hand the page is most likely still in the OS cache. + */ + #define SYNC_SCAN_REPORT_INTERVAL (128 * 1024 / BLCKSZ) + + + /* GUC variables */ + bool Trace_sync_seqscan = false; + + /* + * The scan locations structure is essentially a doubly-linked LRU with head + * and tail pointer, but designed to hold a fixed maximum number of elements in + * fixed-size shared memory.
+ */ + typedef struct ss_scan_location_t { + RelFileNode relfilenode; /* The relfilenode that tags this entry */ + BlockNumber location; /* The location in the relation */ + } ss_scan_location_t; + + typedef struct ss_lru_item_t { + struct ss_lru_item_t *prev; + struct ss_lru_item_t *next; + ss_scan_location_t location; + } ss_lru_item_t; + + typedef struct ss_scan_locations_t { + ss_lru_item_t *head; + ss_lru_item_t *tail; + ss_lru_item_t items[1]; /* SYNC_SCAN_NELEM items */ + } ss_scan_locations_t; + + #define SizeOfScanLocations offsetof(ss_scan_locations_t, items[SYNC_SCAN_NELEM]) + + /* Pointer to struct in shared memory */ + static ss_scan_locations_t *scan_locations; + + /* prototypes for internal functions */ + static BlockNumber ss_search(RelFileNode,BlockNumber,bool); + + + /* + * SyncScanShmemSize --- report amount of shared memory space needed + */ + Size SyncScanShmemSize(void) + { + return SizeOfScanLocations; + } + + /* + * SyncScanShmemInit --- initialize this module's shared memory + */ + void SyncScanShmemInit() + { + int i; + bool found; + + scan_locations = (ss_scan_locations_t *) + ShmemInitStruct("Sync Scan Locations List", + SizeOfScanLocations, &found); + + if (!IsUnderPostmaster) + { + /* Initialize shared memory area */ + Assert(!found); + + scan_locations->head = &scan_locations->items[0]; + scan_locations->tail = &scan_locations->items[SYNC_SCAN_NELEM - 1]; + + for(i = 0; i < SYNC_SCAN_NELEM; i++) + { + ss_lru_item_t *item = &scan_locations->items[i]; + + /* Initialize all slots with invalid values. As scans are started, + * these invalid entries will fall off the LRU list and are + * replaced with real entries. + */ + item->location.relfilenode.spcNode = InvalidOid; + item->location.relfilenode.dbNode = InvalidOid; + item->location.relfilenode.relNode = InvalidOid; + item->location.location = InvalidBlockNumber; + + item->prev = (i > 0) ? + (&scan_locations->items[i - 1]) : NULL; + item->next = (i < SYNC_SCAN_NELEM - 1) ? 
+ (&scan_locations->items[i + 1]) : NULL; + } + } + else + Assert(found); + + } + + /* + * ss_search --- searches the scan_locations structure for an entry with the + * given relfilenode. + * + * If "set" is true, the location is updated to the given location. If no + * entry for the given relfilenode is found, it will be created at the head + * of the list with the given location, even if set == false. + * + * In any case, the location after possible update is returned. + */ + static BlockNumber ss_search(RelFileNode relfilenode, + BlockNumber location, bool set) + { + ss_lru_item_t *item; + + item = scan_locations->head; + + for (;;) + { + bool match; + + match = RelFileNodeEquals(item->location.relfilenode, relfilenode); + + if (match || item->next == NULL) + { + + /* + * If we reached the end of list and no match was found, + * take over the last entry + */ + if (!match) + { + item->location.relfilenode = relfilenode; + item->location.location = location; + } + else + if (set) + item->location.location = location; + + /* Move the entry to the front of the LRU list */ + if (item != scan_locations->head) + { + /* unlink */ + if (item == scan_locations->tail) + scan_locations->tail = item->prev; + item->prev->next = item->next; + if (item->next) + item->next->prev = item->prev; + + /* link */ + item->prev = NULL; + item->next = scan_locations->head; + scan_locations->head->prev = item; + scan_locations->head = item; + } + + return item->location.location; + } + item = item->next; + } + /* never reached */ + } + + /* + * ss_get_location --- gets the optimal starting location for scan + * + * Returns the location of an already running sequential scan on the + * relation, or 0 if no valid location is found. + * + * This function assumes that scan->rs_nblocks is already properly set. 
+ */ + BlockNumber ss_get_location(HeapScanDesc scan) + { + BlockNumber startloc; + + LWLockAcquire(SyncScanLock,LW_EXCLUSIVE); + startloc = ss_search(scan->rs_rd->rd_node, 0, false); + LWLockRelease(SyncScanLock); + + /* + * If the location is not a valid block number for this scan, start at 0. + * + * This can happen if for instance a VACUUM truncated the table + * since the location was set. + */ + if (startloc < 0 || startloc >= scan->rs_nblocks) + { + if (Trace_sync_seqscan) + elog(DEBUG1,"SYNC_SCAN: Location %d out of range. Relation has %d pages.", + startloc, scan->rs_nblocks); + return 0; + } + + if (Trace_sync_seqscan) + elog(DEBUG1,"SYNC_SCAN: START: OID = %d; Location = %d; Size: %d", + RelationGetRelid(scan->rs_rd), startloc, scan->rs_nblocks); + + return startloc; + } + + /* + * ss_report_location --- updates the current scan location + * + * Writes an entry in the Sync Scan structure of the form + * (relfilenode,blocknumber), overwriting any existing entry for the same + * relfilenode. + */ + void ss_report_location(HeapScanDesc scan, BlockNumber location) + { + if (Trace_sync_seqscan) + { + if ((location % 1000) == 0) + elog(DEBUG2, "SYNC_SCAN: page %d", location); + } + + /* + * To reduce lock contention, only report scan progress every n pages. + * For the same reason, don't block if the lock isn't immediately + * available. Missing a few updates isn't critical, it just means that a + * new scan that wants to join the pack will start a little bit behind the + * head of the scan. Hopefully the pages are still in OS cache and the + * scan catches up quickly. 
+ */ + if ((location % SYNC_SCAN_REPORT_INTERVAL) == 0) + { + if (LWLockConditionalAcquire(SyncScanLock, LW_EXCLUSIVE)) + { + ss_search(scan->rs_rd->rd_node,location,true); + LWLockRelease(SyncScanLock); + } + else if (Trace_sync_seqscan) + elog(DEBUG1, "SYNC_SCAN: missed update %d", location); + } + } + Index: src/backend/storage/ipc/ipci.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/ipc/ipci.c,v retrieving revision 1.91 diff -c -r1.91 ipci.c *** src/backend/storage/ipc/ipci.c 15 Feb 2007 23:23:23 -0000 1.91 --- src/backend/storage/ipc/ipci.c 31 May 2007 10:21:28 -0000 *************** *** 19,24 **** --- 19,25 ---- #include "access/nbtree.h" #include "access/subtrans.h" #include "access/twophase.h" + #include "access/heapam.h" #include "miscadmin.h" #include "pgstat.h" #include "postmaster/autovacuum.h" *************** *** 112,117 **** --- 113,119 ---- size = add_size(size, BgWriterShmemSize()); size = add_size(size, AutoVacuumShmemSize()); size = add_size(size, BTreeShmemSize()); + size = add_size(size, SyncScanShmemSize()); #ifdef EXEC_BACKEND size = add_size(size, ShmemBackendArraySize()); #endif *************** *** 216,221 **** --- 218,224 ---- * Set up other modules that need some shared memory space */ BTreeShmemInit(); + SyncScanShmemInit(); #ifdef EXEC_BACKEND Index: src/backend/utils/misc/guc.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/guc.c,v retrieving revision 1.391 diff -c -r1.391 guc.c *** src/backend/utils/misc/guc.c 8 May 2007 16:33:51 -0000 1.391 --- src/backend/utils/misc/guc.c 2 Jun 2007 07:50:23 -0000 *************** *** 26,31 **** --- 26,32 ---- #endif + #include "access/heapam.h" #include "access/gin.h" #include "access/transam.h" #include "access/twophase.h" *************** *** 783,788 **** --- 784,799 ---- false, NULL, NULL }, + { + {"trace_sync_seqscan", 
PGC_USERSET, DEVELOPER_OPTIONS, + gettext_noop("Generates debugging output for Synchronized Scans."), + NULL, + GUC_NOT_IN_SAMPLE + }, + &Trace_sync_seqscan, + false, NULL, NULL + }, + #ifdef LOCK_DEBUG { {"trace_locks", PGC_SUSET, DEVELOPER_OPTIONS, Index: src/include/access/heapam.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/heapam.h,v retrieving revision 1.124 diff -c -r1.124 heapam.h *** src/include/access/heapam.h 27 May 2007 03:50:39 -0000 1.124 --- src/include/access/heapam.h 2 Jun 2007 07:50:30 -0000 *************** *** 25,30 **** --- 25,33 ---- #include "utils/rel.h" #include "utils/tqual.h" + /* GUC variable, in syncscan.c */ + extern bool Trace_sync_seqscan; + /* ---------------- * fastgetattr * *************** *** 240,243 **** --- 243,252 ---- extern void heap_sync(Relation relation); + /* in heapam/syncscan.c */ + extern void ss_report_location(HeapScanDesc scan, BlockNumber location); + extern BlockNumber ss_get_location(HeapScanDesc scan); + extern void SyncScanShmemInit(void); + extern Size SyncScanShmemSize(void); + #endif /* HEAPAM_H */ Index: src/include/access/relscan.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/relscan.h,v retrieving revision 1.54 diff -c -r1.54 relscan.h *** src/include/access/relscan.h 30 May 2007 20:12:02 -0000 1.54 --- src/include/access/relscan.h 2 Jun 2007 07:51:10 -0000 *************** *** 34,39 **** --- 34,40 ---- bool rs_inited; /* false = scan not init'd yet */ HeapTupleData rs_ctup; /* current tuple in scan, if any */ BlockNumber rs_cblock; /* current block # in scan, if any */ + BlockNumber rs_startpage; /* page where this scan began */ Buffer rs_cbuf; /* current buffer in scan, if any */ /* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */ ItemPointerData rs_mctid; /* marked scan position, if any */ Index: 
src/include/storage/lwlock.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/lwlock.h,v retrieving revision 1.36 diff -c -r1.36 lwlock.h *** src/include/storage/lwlock.h 16 Apr 2007 18:30:04 -0000 1.36 --- src/include/storage/lwlock.h 31 May 2007 10:21:28 -0000 *************** *** 62,67 **** --- 62,68 ---- AddinShmemInitLock, AutovacuumLock, AutovacuumScheduleLock, + SyncScanLock, /* Individual lock IDs end here */ FirstBufMappingLock, FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS, Index: src/test/regress/parallel_schedule =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/test/regress/parallel_schedule,v retrieving revision 1.42 diff -c -r1.42 parallel_schedule *** src/test/regress/parallel_schedule 2 Apr 2007 03:49:42 -0000 1.42 --- src/test/regress/parallel_schedule 4 Jun 2007 09:07:27 -0000 *************** *** 61,67 **** # ---------- # The fourth group of parallel test # ---------- ! test: select_into select_distinct select_distinct_on select_implicit select_having subselect union case join aggregates transactions random portals arrays btree_index hash_index update namespace prepared_xacts delete test: privileges test: misc --- 61,67 ---- # ---------- # The fourth group of parallel test # ---------- !
test: select_into select_distinct select_distinct_on select_implicit select_having subselect union case join aggregates transactions random portals arrays btree_index hash_index update namespace prepared_xacts delete syncscan test: privileges test: misc Index: src/test/regress/expected/syncscan.out =================================================================== RCS file: src/test/regress/expected/syncscan.out diff -N src/test/regress/expected/syncscan.out *** /dev/null 1 Jan 1970 00:00:00 -0000 --- src/test/regress/expected/syncscan.out 8 Jun 2007 10:02:11 -0000 *************** *** 0 **** --- 1,27 ---- + -- + -- Sequential scan synchronization + -- + -- NOTE: This test depends on ss_test being large enough that synchronized + -- scans are used. Increase the number of rows inserted to it, or decrease + -- shared_buffers if that's not happening. + CREATE TABLE ss_test (id VARCHAR(10), data VARCHAR(500)); + INSERT INTO ss_test SELECT '1st half', repeat('a',500) FROM generate_series(1,10000); + INSERT INTO ss_test SELECT '2nd half', repeat('a',500) FROM generate_series(1,10000); + -- Scan a little over half of the table with a sequential scan. OFFSET 1000 + -- because we don't track the scan location exactly, so the next scan will + -- start a little bit behind where this scan stops.
+ SELECT id FROM ss_test WHERE id = '2nd half' OFFSET 1000 LIMIT 1; + id + ---------- + 2nd half + (1 row) + + -- If synchronized scans are used, this should begin roughly from where the + -- above scan stopped, and should return '2nd half' + SELECT id FROM ss_test LIMIT 1; + id + ---------- + 2nd half + (1 row) + + DROP TABLE ss_test; Index: src/test/regress/sql/syncscan.sql =================================================================== RCS file: src/test/regress/sql/syncscan.sql diff -N src/test/regress/sql/syncscan.sql *** /dev/null 1 Jan 1970 00:00:00 -0000 --- src/test/regress/sql/syncscan.sql 4 Jun 2007 09:37:56 -0000 *************** *** 0 **** --- 1,22 ---- + -- + -- Sequential scan synchronization + -- + -- NOTE: This test depends on ss_test being large enough that synchronized + -- scans are used. Increase the number of rows inserted to it, or decrease + -- shared_buffers if that's not happening. + + CREATE TABLE ss_test (id VARCHAR(10), data VARCHAR(500)); + + INSERT INTO ss_test SELECT '1st half', repeat('a',500) FROM generate_series(1,10000); + INSERT INTO ss_test SELECT '2nd half', repeat('a',500) FROM generate_series(1,10000); + + -- Scan a little over half of the table with a sequential scan. OFFSET 1000 + -- because we don't track the scan location exactly, so the next scan will + -- start a little bit behind where this scan stops. + SELECT id FROM ss_test WHERE id = '2nd half' OFFSET 1000 LIMIT 1; + + -- If synchronized scans are used, this should begin roughly from where the + -- above scan stopped, and should return '2nd half' + SELECT id FROM ss_test LIMIT 1; + + DROP TABLE ss_test;
Heikki Linnakangas <heikki@enterprisedb.com> writes: > BTW: Should we do the synchronization in the non-page-at-a-time mode? > It's not many lines of code to do so, but IIRC that codepath is only > used for catalog access. System tables really shouldn't grow that big, > and if they do we shouldn't be doing seq scans for them anyway. It occurs to me that there's an actual bug here for catalog access. The code assumes that it can measure rs_nblocks only once and not worry about tuples added beyond that endpoint. But this is only true when using an MVCC-safe snapshot. In SnapshotNow mode, in particular, this could easily lead to missing a valid tuple. I'm not sure such a bug has ever been seen in the wild, because we don't do that many heapscans on system catalogs (and most of the ones we do do have some amount of higher-level locking preventing a concurrent update). But we ought to fix it while we're touching the code now. That will take re-measuring rs_nblocks each time we consider stopping, and I think we definitely will NOT want the syncscan code getting involved. I'll take care of incorporating this point into your mod-4 patch. regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> BTW: Should we do the synchronization in the non-page-at-a-time mode? >> It's not many lines of code to do so, but IIRC that codepath is only >> used for catalog access. System tables really shouldn't grow that big, >> and if they do we shouldn't be doing seq scans for them anyway. > > It occurs to me that there's an actual bug here for catalog access. > The code assumes that it can measure rs_nblocks only once and not worry > about tuples added beyond that endpoint. But this is only true when > using an MVCC-safe snapshot. In SnapshotNow mode, in particular, this > could easily lead to missing a valid tuple. I'm not sure such a bug > has ever been seen in the wild, because we don't do that many heapscans > on system catalogs (and most of the ones we do do have some amount of > higher-level locking preventing a concurrent update). But we ought to > fix it while we're touching the code now. That will take re-measuring > rs_nblocks each time we consider stopping, and I think we definitely > will NOT want the syncscan code getting involved. You would only miss tuples inserted after you began the seqscan. After you've begun the scan, you're going to miss any tuples that are inserted before the current position anyway, so stopping the scan early shouldn't do any real harm. It would only be a problem if you do something like: heap_beginscan(...) Lock while(heap_getnext) ... Unlock And expect to see all tuples inserted before acquiring the lock. Good catch, though. Fixing it is probably a good idea even if it's not a problem for any existing code. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
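Heikki's visibility argument can be checked with a toy model. Everything below is purely illustrative: integer counters stand in for heap pages and tuples, and toy_scan_count() plays the role of a seqscan that measures the relation length once up front, like rs_nblocks; none of it is PostgreSQL code.

```c
/* Toy model: does measuring the relation length once weaken visibility? */
#include <assert.h>

#define TOY_MAXPAGES 8

static int toy_table[TOY_MAXPAGES]; /* tuples per page */
static int toy_nblocks = 4;         /* current relation length in pages */

/*
 * Run one sequential scan over the toy table, measuring the length once
 * at startup (like rs_nblocks).  While the scan is positioned on page
 * insert_at_page, a concurrent insert adds one tuple on page insert_page,
 * extending the relation if it targets the page past the end.  Returns
 * the number of tuples the scan saw.
 */
static int
toy_scan_count(int insert_at_page, int insert_page)
{
    int snapshot_nblocks = toy_nblocks; /* measured once, like rs_nblocks */
    int seen = 0;
    int page;

    for (page = 0; page < snapshot_nblocks; page++)
    {
        if (page == insert_at_page)
        {
            if (insert_page == toy_nblocks)
                toy_nblocks++;          /* the insert extends the relation */
            toy_table[insert_page]++;   /* concurrent INSERT */
        }
        seen += toy_table[page];
    }
    return seen;
}
```

A tuple inserted behind the scan's current position and a tuple appended beyond the initially measured length are equally invisible to the scan, so the once-only measurement doesn't weaken any guarantee the scan could otherwise make; only a tuple landing ahead of the scan, within the measured range, is seen.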
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> It occurs to me that there's an actual bug here for catalog access. >> The code assumes that it can measure rs_nblocks only once and not worry >> about tuples added beyond that endpoint. But this is only true when >> using an MVCC-safe snapshot. > You would only miss tuples inserted after you began the seqscan. After > you've began the scan, you're going to miss any tuples that are inserted > before the current position anyway, so stopping the scan early shouldn't > do any real harm. Good point. > It would only be a problem if you do something like: > heap_beginscan(...) > Lock > while(heap_getnext) ... > Unlock > And expect to see all tuples inserted before acquiring the lock. But that could be fixed by taking the lock before the heap_beginscan. Indeed it's hard to conceive of a situation where you'd want/need to take the lock afterward; in most cases the beginscan and the actual scan are right together. So I withdraw this complaint; it's complexity we don't need. I'll add a comment about the point though. regards, tom lane
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> Jeff Davis <pgsql@j-davis.com> writes: >>> Just to be sure: a backwards-started scan is currently unreachable code, >>> correct? >> >> [ yawn... ] I think so, but I wouldn't swear to it right at the moment. >> In any case it doesn't seem like a path that we need to optimize. > Agreed, let's just disable the reporting when moving backwards. Now that I'm awake, it is reachable code, per this comment: * Note: when we fall off the end of the scan in either direction, we * reset rs_inited. This means that a further request with the same * scan direction will restart the scan, which is a bit odd, but a * request with the opposite scan direction will start a fresh scan * in the proper direction. The latter is required behavior for cursors, * while the former case is generally undefined behavior in Postgres * so we don't care too much. That is, if you run a cursor to the end and then back up, you'll go through the init-in-the-backwards-direction code. But we're agreed that we don't want to report when moving backwards, so this is just an interesting note... regards, tom lane
On Fri, 2007-06-08 at 11:05 +0100, Heikki Linnakangas wrote: > BTW: Should we do the synchronization in the non-page-at-a-time mode? > It's not many lines of code to do so, but IIRC that codepath is only > used for catalog access. System tables really shouldn't grow that big, > and if they do we shouldn't be doing seq scans for them anyway. Does the > unstable ordering confuse any catalog accesses? http://archives.postgresql.org/pgsql-hackers/2006-09/msg01199.php There is a very minor assumption there that scans on pg_class will return in the same order. I'm not sure if that's even a problem. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > On Fri, 2007-06-08 at 11:05 +0100, Heikki Linnakangas wrote: >> BTW: Should we do the synchronization in the non-page-at-a-time mode? > http://archives.postgresql.org/pgsql-hackers/2006-09/msg01199.php > There is a very minor assumption there that scans on pg_class will > return in the same order. I'm not sure if that's even a problem. I'm inclined not to worry about that --- the odds of trouble from a concurrent schema update are at least as large as the odds of trouble from syncscan-induced variance. The correct fix would be to make ANALYZE sort the rels by OID, anyhow, not to dumb down the scan. regards, tom lane
On Fri, 2007-06-08 at 12:22 -0400, Tom Lane wrote: > Now that I'm awake, it is reachable code, per this comment: > > * Note: when we fall off the end of the scan in either direction, we > * reset rs_inited. This means that a further request with the same > * scan direction will restart the scan, which is a bit odd, but a I'm confused about this part of the comment. Regards, Jeff Davis
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Here's an update of the patch. I reverted the behavior at end of scan > back to the way it was in Jeff's original patch, and disabled reporting > the position when moving backwards. Applied with minor editorializations --- notably, I got rid of the HeapScanDesc dependency in syncscan.c's API, so that it could be used in other contexts (VACUUM, anyone?). There were a few glitches in the heapam.c code too. > I didn't touch the large scan threshold of NBuffers / 4 that Tom > committed as part of the buffer ring patch. IOW I removed the GUC > variable from the patch. I think the jury is still out there on this one. Yeah, this could do with more testing. I concur with the idea that the threshold should be the same for both bits of code, though. Otherwise we have four behaviors to try to tune, instead of two. > I included a basic regression test as well. I did not commit this, as it seemed a bit useless --- it's looking for a minor side-effect and not doing much of anything to prove that the code does what's intended. Possibly we could put in a real test after Greg's concurrent-psql thing is in. Jeff wrote: > I might go so far as to suggest if the scan *ever* moves backwards, we > taint the scan such that it never reports. This would be a trivial addition to the code-as-committed (clear rs_syncscan upon backing up by a page) but I didn't add it. Any strong feelings one way or the other? AFAIK the only case where it'd happen is if someone reads forwards in a large-seqscan cursor for awhile and then reads backwards. You could argue that doing that is a good reason to drop them out of the syncscan pack ... regards, tom lane
Jeff Davis <pgsql@j-davis.com> writes: > On Fri, 2007-06-08 at 12:22 -0400, Tom Lane wrote: >> Now that I'm awake, it is reachable code, per this comment: >> >> * Note: when we fall off the end of the scan in either direction, we >> * reset rs_inited. This means that a further request with the same >> * scan direction will restart the scan, which is a bit odd, but a > I'm confused about this part of the comment. If you tell the SeqScan plan node to fetch forward over and over, it eventually will return NULL indicating end-of-data. If you then demand *another* forward fetch, it'll start the whole scan over from scratch, because of the fact that it sees rs_inited clear. You might've expected another NULL but that's not what you'll get. In general, however, the behavior of a plan node in this scenario is not defined and not relied on by anything --- some of the join node types will crash outright if you try to fetch past EOF, IIRC. The case that *is* required is to be able to scan backwards after returning NULL. regards, tom lane
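The restart-at-EOF behavior Tom describes, combined with the patch's circular page advance, can be modeled as a small state machine. All names here are hypothetical; this is a sketch of the control flow, not the real executor or heapam structures.

```c
/* Toy model of the patched forward-scan control flow (illustrative only). */
#include <assert.h>

typedef struct
{
    int inited;     /* plays the role of rs_inited */
    int nblocks;    /* rs_nblocks */
    int startpage;  /* rs_startpage, chosen by ss_get_location() */
    int page;       /* current page */
} toy_scan;

/* Returns the next page number of a forward scan, or -1 at end of scan. */
static int
toy_getnext(toy_scan *scan)
{
    if (!scan->inited)
    {
        /* join the pack at the syncscan-chosen starting page */
        scan->page = scan->startpage;
        scan->inited = 1;
        return scan->page;
    }

    /* advance with wrap-around, as in the patch */
    scan->page++;
    if (scan->page == scan->nblocks)
        scan->page = 0;

    /* back at the starting page: the scan is exhausted */
    if (scan->page == scan->startpage)
    {
        scan->inited = 0;   /* a further forward fetch restarts the scan */
        return -1;
    }
    return scan->page;
}
```

Fetching forward past end-of-data clears the inited flag, so the next forward request starts the whole scan over from the start page: the "a bit odd" behavior the comment describes.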
On Fri, 2007-06-08 at 14:36 -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
> > Here's an update of the patch. I reverted the behavior at end of scan
> > back to the way it was in Jeff's original patch, and disabled reporting
> > the position when moving backwards.
>
> Applied with minor editorializations --- notably, I got rid of the
> HeapScanDesc dependency in syncscan.c's API, so that it could be used
> in other contexts (VACUUM, anyone?). There were a few glitches in the
> heapam.c code too.

I think VACUUM would be an ideal place for it. I assume we don't want
to make a change like that before 8.3 though, right?

> This would be a trivial addition to the code-as-committed (clear
> rs_syncscan upon backing up by a page) but I didn't add it. Any
> strong feelings one way or the other? AFAIK the only case where
> it'd happen is if someone reads forwards in a large-seqscan cursor
> for awhile and then reads backwards. You could argue that doing
> that is a good reason to drop them out of the syncscan pack ...

I don't feel strongly about it at all, mostly because it seems so rare
that it would matter.

Regards,
	Jeff Davis
On Fri, 2007-06-08 at 11:57 -0700, Jeff Davis wrote:
> On Fri, 2007-06-08 at 14:36 -0400, Tom Lane wrote:
> > Heikki Linnakangas <heikki@enterprisedb.com> writes:
> > > Here's an update of the patch. I reverted the behavior at end of scan
> > > back to the way it was in Jeff's original patch, and disabled reporting
> > > the position when moving backwards.
> >
> > Applied with minor editorializations --- notably, I got rid of the
> > HeapScanDesc dependency in syncscan.c's API, so that it could be used
> > in other contexts (VACUUM, anyone?). There were a few glitches in the
> > heapam.c code too.
>
> I think VACUUM would be an ideal place for it. I assume we don't want to

I have a few thoughts:

* For a large table, do lazy_scan_heap, scan_heap, and a sequential
  scan usually progress at approximately the same rate?

* Are there any other parts of the vacuum process that may benefit?

* Just adding in the syncscan to scan_heap and lazy_scan_heap seems
  very easy at first thought. Are there any complications that I'm
  missing?

Regards,
	Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> * For a large table, do lazy_scan_heap, scan_heap, and a sequential
>   scan usually progress at approximately the same rate?

scan_heap would probably be faster than a regular seqscan, since it
isn't doing any where-clause checking or data output. Except if you've
got the vacuum cost limit enabled, which I think is likely to be true
by default in the future. Another problem is that lazy_scan_heap stops
every so often to make a pass over the table's indexes, which'd
certainly cause it to fall out of sync with more typical seqscans.

The vacuum-cost-limit issue may be sufficient reason to kill this idea;
not sure.

> * Just adding in the syncscan to scan_heap and lazy_scan_heap seems
>   very easy at first thought. Are there any complications that I'm
>   missing?

I believe there are assumptions buried in both full and lazy vacuum
that blocks are scanned in increasing order. Not sure how hard that
would be to fix or work around. The only one I can specifically recall
in lazy vacuum is that we assume the list of deletable TIDs is sorted
a priori. Possibly you could deal with that by forcing an index-vacuum
pass at the instant where the scan would wrap around, so that the list
could be cleared before putting any lower-numbered blocks into it.

			regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > The vacuum-cost-limit issue may be sufficient reason to kill this idea; > not sure. We already have a much higher cost for blocks that cause i/o than blocks which don't. I think if we had zero cost for blocks which don't cause i/o it would basically work unless the sleep time was so large that the other scans managed to cycle through the entire ring in that time. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> The vacuum-cost-limit issue may be sufficient reason to kill this idea;
>> not sure.

> We already have a much higher cost for blocks that cause i/o than
> blocks which don't. I think if we had zero cost for blocks which don't
> cause i/o it would basically work unless the sleep time was so large
> that the other scans managed to cycle through the entire ring in that
> time.

I'm unconvinced. That could perhaps work as long as the vacuum process
never did any I/O, ie was always a follower and never the leader of the
syncscan pack. But if lazy_scan_heap is faster than a regular seqscan,
as I suspect, then it'd frequently become the leader and have to do the
I/O. A few iterations of that will cause it to hit the vacuum cost
sleep, and that will make it fall behind the pack (and probably out of
sync entirely, unless the sleep is really short).

			regards, tom lane
On Sat, 2007-06-09 at 09:58 -0400, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > * For a large table, do lazy_scan_heap, scan_heap, and a sequential
> > scan usually progress at approximately the same rate?
>
> scan_heap would probably be faster than a regular seqscan, since it
> isn't doing any where-clause-checking or data output. Except if you've
> got vacuum-cost-limit enabled, which I think is likely to be true by
> default in future. Another problem is that lazy_scan_heap stops every
> so often to make a pass over the table's indexes, which'd certainly
> cause it to fall out of sync with more typical seqscans.

I think that these problems are significant enough that I'm not sure
sync-scanning a VACUUM is the right way to approach the problem. Maybe
a better solution would be to try to get a sequential scan to do some
of the work required by a VACUUM.

I don't think we can stop in the middle of a sequential scan to vacuum
the indexes, but perhaps we could come up with some kind of scheme. It
would be cheaper (perhaps) to spill the list of deletable TIDs to disk
than to rescan a big (mostly live) table later. And if that was costly,
we wouldn't need to do the scan part of a VACUUM on every sequential
scan.

I'm sure this has been brought up before; does someone have a pointer
to a discussion about doing VACUUM-like work in a sequential scan?

Regards,
	Jeff Davis
Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
>> * Just adding in the syncscan to scan_heap and lazy_scan_heap seems
>> very easy at first thought. Are there any complications that I'm
>> missing?
>
> I believe there are assumptions buried in both full and lazy vacuum
> that blocks are scanned in increasing order. Not sure how hard that
> would be to fix or work around. The only one I can specifically recall
> in lazy vacuum is that we assume the list of deletable TIDs is sorted
> a priori. Possibly you could deal with that by forcing an index-vacuum
> pass at the instant where the scan would wrap around, so that the list
> could be cleared before putting any lower-numbered blocks into it.

In this case, we're still scanning the table in increasing order; the
zero point is just shifted. We can still do a binary search if we do it
in a wacky modular-arithmetic fashion. I believe the TID list ordering
is the only reason why we need to scan in order.

I don't think sync-scanning vacuum is worth pursuing, though, because
of the other issues: index scans, vacuum cost accounting, and the fact
that the second pass would be harder to synchronize. There are a lot of
other interesting ideas for vacuum that are more generally applicable.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
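The "binary search in a wacky modular-arithmetic fashion" Heikki
mentions could look roughly like this. It is a hedged, standalone
sketch (the function name and signature are invented, not PostgreSQL's
TID-list code): the block list is sorted by logical position relative
to the block where the scan started, so each comparison maps a block
number through `(b + nblocks - origin) % nblocks` before comparing.

```c
/*
 * Binary search in an array of block numbers sorted in "wrapped" order:
 * the scan started at block `origin` of an nblocks-block table, so the
 * array is increasing modulo nblocks.  Returns the array index of
 * `target`, or -1 if not present.  Hypothetical sketch only.
 */
static int
wrapped_bsearch(const unsigned *blocks, int n,
				unsigned origin, unsigned nblocks, unsigned target)
{
	/* map a block number to its logical position in the wrapped order */
#define LOGICAL(b)	(((b) + (nblocks) - (origin)) % (nblocks))

	unsigned	key = LOGICAL(target);
	int			lo = 0,
				hi = n - 1;

	while (lo <= hi)
	{
		int			mid = lo + (hi - lo) / 2;
		unsigned	midkey = LOGICAL(blocks[mid]);

		if (midkey == key)
			return mid;
		else if (midkey < key)
			lo = mid + 1;
		else
			hi = mid - 1;
	}
	return -1;
#undef LOGICAL
}
```

For example, a scan of an 8-block table that started at block 5 would
accumulate blocks in the order 5, 6, 7, 0, 1, 2, 3, 4, and the search
still works because those map to logical positions 0 through 7.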
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > I don't think sync-scanning vacuum is worth pursuing, though, because of the > other issues: index scans, vacuum cost accounting, and the fact that the 2nd > pass would be harder to synchronize. There's a lot of other interesting ideas > for vacuum that are more generally applicable. I think we could probably arrange for vacuum to synchronize. If there's one sequential scan running we have to imagine there are others coming along soon too. so if we become desynchronized we'll just coerce the next one to start where we want and follow it for a while. However I have a another worry. Even if we did manage to get vacuum synchronizing well what would it do to the sequential scan performance. Instead of zipping along reading clean blocks into its small ring buffer and discarding them when it's done it'll suddenly find many of its blocks dirty when it goes to reuse them. Effectively we'll have just reinvented the problem we had with vacuum previously albeit in a way which only hits sequential scans particularly hard. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Jeff Davis <pgsql@j-davis.com> writes:
> I'm sure this has been brought up before, does someone have a pointer
> to a discussion about doing VACUUM-like work in a sequential scan?

Yeah, it's been discussed before; try looking for "incremental vacuum"
and such phrases.

The main stumbling block is cleaning out index entries for the
known-dead heap tuple. The current VACUUM design amortizes that cost
across as many dead heap tuples as it can manage; doing it retail seems
inevitably to be a lot more expensive.

			regards, tom lane
Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > I'm sure this has been brought up before, does someone have a pointer
> > to a discussion about doing VACUUM-like work in a sequential scan?
>
> Yeah, it's been discussed before; try looking for "incremental vacuum"
> and such phrases.
>
> The main stumbling block is cleaning out index entries for the
> known-dead heap tuple. The current VACUUM design amortizes that cost
> across as many dead heap tuples as it can manage; doing it retail
> seems inevitably to be a lot more expensive.

Maybe what we could do is have a seqscan save known-dead tuple IDs in a
file, and then in a different operation (initiated by autovacuum) we
would remove those TIDs from the indexes, before the regular heap scan.

--
Alvaro Herrera                         http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.