Synchronized Scan WIP patch

From: Jeff Davis
This is my latest revision of the Sync Scan patch, and it implements the
observability as discussed with Simon.

Changes:
 * ss_report_loc() called once per hundred pages rather than once per
page
 * DEBUG messages are a little cleaner and easier to parse, for the sake
of analysis after the fact.
 * DEBUG2 reports a sync scan starting, the relation size in pages, and
the location at which the scan starts.
 * DEBUG2 reports the location of a scan every 50k pages, DEBUG3 every
5k pages (before it was 100k/10k at DEBUG3/DEBUG4, respectively).
Numbers are aligned along 5k boundaries to make analysis easier.
 * GUCs:
   * sync_seqscan_threshold: fraction of NBuffers for the threshold
   * sync_seqscan_offset: fraction of NBuffers for the offset
   * trace_sync_seqscan: will be used in final version of patch to
control DEBUG output

sync_seqscan_offset may be eliminated completely if it's not shown to be
useful enough in conjunction with Simon's patch. Sync Scans are still a
big win without sync_seqscan_offset.

sync_seqscan_threshold=<real> may be turned into sync_seqscan=<boolean>
with a fixed activation threshold (NBuffers/2 per Simon's suggestion).
The reason is that synchronized scans should activate at the same
threshold as Simon's scan_recycle_buffers feature. Should we make a
"#define BIG_SCAN_THRESHOLD NBuffers/2" to use for both sync_seqscan and
for scan_recycle_buffers?
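
As a rough sketch of that shared define (the is_big_scan() wrapper is
hypothetical and parameterized just for illustration; only the
BIG_SCAN_THRESHOLD idea comes from the discussion above):

    #include <stdbool.h>

    /* NBuffers is PostgreSQL's shared_buffers setting, in pages. */
    #define BIG_SCAN_THRESHOLD(nbuffers)  ((nbuffers) / 2)

    /* A scan counts as "big" -- and would trigger both sync_seqscan
     * and scan_recycle_buffers -- when the relation is at least half
     * the size of shared_buffers. */
    static bool
    is_big_scan(int rel_nblocks, int nbuffers)
    {
        return rel_nblocks >= BIG_SCAN_THRESHOLD(nbuffers);
    }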

Regards,
    Jeff Davis

[Attachment]

Re: Synchronized Scan WIP patch

From: "Simon Riggs"
On Wed, 2007-03-14 at 01:42 -0700, Jeff Davis wrote:
> SYNC_SCAN_REPORT_INTERVAL 100

Jeff,

This will stop SeqScans from working with buffer recycling, unless we
put the recycle limit to more than 100. That was why I requested you set
this to 16, so we can use a recycle buffer of 32.
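
To spell out the arithmetic (illustrative numbers: SYNC_SCAN_REPORT_INTERVAL
is the patch's constant, RECYCLE_RING_SIZE is a hypothetical stand-in for
the scan_recycle_buffers setting): a scan joining via the hint starts at
the last reported position, which can lag the scan head by up to one
report interval, and that lag has to fit inside the recycle ring.

    #define SYNC_SCAN_REPORT_INTERVAL 16   /* pages between reports     */
    #define RECYCLE_RING_SIZE         32   /* recycled buffers per scan */

    /* Worst-case lag = SYNC_SCAN_REPORT_INTERVAL - 1 = 15 pages, which
     * fits inside the 32-buffer ring, so a joining scan still finds the
     * pages in shared buffers.  With an interval of 100, the lag can
     * reach 99 pages and overshoot the ring entirely. */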

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: Synchronized Scan WIP patch

From: Jeff Davis
On Thu, 2007-03-15 at 08:27 +0000, Simon Riggs wrote:
> On Wed, 2007-03-14 at 01:42 -0700, Jeff Davis wrote:
> > SYNC_SCAN_REPORT_INTERVAL 100
>
> Jeff,
>
> This will stop SeqScans from working with buffer recycling, unless we
> put the recycle limit to more than 100. That was why I requested you set
> this to 16, so we can use a recycle buffer of 32.
>

Yes, you're correct. If it's set at 100 it should still work by using
the OS buffer cache to catch up (and then be within a few buffers), but
I think you're right that 16 is the correct value.

I'll change my patch to be 16. I don't think I need to send along a
diff ;)

Thanks,
    Jeff Davis


Re: Synchronized Scan WIP patch

From: Bruce Momjian
Will use '16' rather than '100'.

Your patch has been added to the PostgreSQL unapplied patches list at:

    http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


Jeff Davis wrote:
> This is my latest revision of the Sync Scan patch, and it implements the
> observability as discussed with Simon.
>
> Changes:
>  * ss_report_loc() called once per hundred pages rather than once per
> page
>  * DEBUG messages are a little cleaner and easier to parse, for the sake
> of analysis after the fact.
>  * DEBUG2 reports a sync scan starting, the relation size in pages, and
> the location at which the scan starts.
>  * DEBUG2 reports the location of a scan every 50k pages, DEBUG3 every
> 5k pages (before it was 100k/10k at DEBUG3/DEBUG4, respectively).
> Numbers are aligned along 5k boundaries to make analysis easier.
>  * GUCs:
>    * sync_seqscan_threshold: fraction of NBuffers for the threshold
>    * sync_seqscan_offset: fraction of NBuffers for the offset
>    * trace_sync_seqscan: will be used in final version of patch to
> control DEBUG output
>
> Sync_scan_offset may be eliminated completely if it's not shown to be
> useful enough in conjunction with Simon's patch. Sync Scans are still a
> big win without sync_seqscan_offset.
>
> Sync_scan_threshold=<real> may be turned into sync_seqscan=<boolean>
> with a fixed activation threshold (NBuffers/2 per Simon's suggestion).
> The reason is that synchronized scans should activate at the same
> threshold as Simon's scan_recycle_buffers feature. Should we make a
> "#define BIG_SCAN_THRESHOLD NBuffers/2" to use for both sync_seqscan and
> for scan_recycle_buffers?
>
> Regards,
>     Jeff Davis

[ Attachment, skipping... ]


--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Synchronized Scan WIP patch

From: Jeff Davis
On Thu, 2007-03-22 at 16:43 -0400, Bruce Momjian wrote:
> Will use '16' rather than '100'.
>
> Your patch has been added to the PostgreSQL unapplied patches list at:
>
>     http://momjian.postgresql.org/cgi-bin/pgpatches
>
> It will be applied as soon as one of the PostgreSQL committers reviews
> and approves it.
>

Here is the latest version, which includes the change to report every 16
pages.

This patch has the following improvements:

 * reporting interval to 16 pages
 * rearranges the scan location tracing output to work regardless of the
reporting interval. Previously it did not trace the output correctly if
the logging interval was not an even multiple of the reporting interval
 * GUC trace_sync_seqscan=<bool> now controls whether the DEBUG output
is generated or not. If this is true, a lot of logging output will be
generated at DEBUG3.
 * You can set sync_seqscan_threshold=<-1.0 ... 100.0>. Positive values
are treated as a fraction of NBuffers. Negative values disable sync
scans.
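
For reference, the activation check this implies is roughly the
following (condensed from the patch's threshold tests; the standalone
function is just for illustration):

    /* Positive thresholds scale with shared_buffers (NBuffers);
     * negative values disable sync scans entirely. */
    static int
    sync_scan_start(double threshold_frac, int nbuffers,
                    int rel_nblocks, int hinted_location)
    {
        int threshold = threshold_frac * nbuffers;

        if (threshold < 0 || rel_nblocks < threshold)
            return 0;            /* disabled, or table too small */
        return hinted_location;  /* join near the already-running scan */
    }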

Still TODO:

 * Publish my test results (I've collected much of the raw data already
on this version of the patch)
 * SGML documentation (after we stabilize the GUC names and meanings)
 * Possibly remove sync_seqscan_threshold=<real> and instead use a
simple enable/disable boolean that sets the threshold at a constant
fraction of NBuffers (most likely the same fraction as Simon's recycle
buffers patch)

Regards,
    Jeff Davis

[Attachment]

Re: Synchronized Scan WIP patch

From: Bruce Momjian
Your patch has been added to the PostgreSQL unapplied patches list at:

    http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


Jeff Davis wrote:
> On Thu, 2007-03-22 at 16:43 -0400, Bruce Momjian wrote:
> > Will use '16' rather than '100'.
> >
> > Your patch has been added to the PostgreSQL unapplied patches list at:
> >
> >     http://momjian.postgresql.org/cgi-bin/pgpatches
> >
> > It will be applied as soon as one of the PostgreSQL committers reviews
> > and approves it.
> >
>
> Here is the latest version, which includes the change to report every 16
> pages.
>
> This patch has the following improvements:
>
>  * reporting interval to 16 pages
>  * rearranges the scan location tracing output to work regardless of the
> reporting interval. Previously it did not trace the output correctly if
> the logging interval was not an even multiple of the reporting interval
>  * GUC trace_sync_seqscan=<bool> now controls whether the DEBUG output
> is generated or not. If this is true, a lot of logging output will be
> generated at DEBUG3.
>  * You can set sync_seqscan_threshold=<-1.0 ... 100.0>. Positive values
> are treated as a fraction of NBuffers. Negative values disable sync
> scans.
>
> Still TODO:
>
>  * Publish my test results (I've collected much of the raw data already
> on this version of the patch)
>  * SGML documentation (after we stabilize the GUC names and meanings)
>  * Possibly remove sync_seqscan_threshold=<real> and instead use a
> simple enable/disable boolean that sets the threshold at a constant
> fraction of NBuffers (most likely the same fraction as Simon's recycle
> buffers patch)
>
> Regards,
>     Jeff Davis

[ Attachment, skipping... ]


--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Synchronized Scan WIP patch

From: Heikki Linnakangas
Here's a work-in-progress update of this patch.

I haven't made any major changes, but I've done a lot of little
refactoring and commenting, including:

* moved the sync scan stuff to a new file access/heapam/syncscan.c.
heapam.c is long enough already, and in theory the same mechanism could
be used for large bitmap heap scans in the future.
* initialization of the shared memory structures is done at postmaster
startup, instead of lazily the first time it's needed
* removed the freelist from the shared mem structure. Instead all
entries are initialized with invalid values, so they will be consumed
from the end of the LRU without any special handling.
* changed many variable and struct names to suit my personal taste.

TODO:
* Only report the new scan location when the page was not already in
buffer cache. This should reduce congestion when all synchronized
scanners try to update the location at the same time. Should probably
test first to see if it's worth optimizing (a rough sketch follows this
list).
* decide what to do with the threshold. Should use the same threshold as
the buffer ring stuff.
* decide the size of the LRU list.
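
A rough sketch of the first TODO item (hypothetical code, not part of
the patch: ReadBufferHit() is an invented variant of ReadBuffer() that
reports whether the page was already cached):

    bool    was_cached;

    /* Only the backend that actually performed physical I/O reports,
     * so trailing scanners don't all fight over the update lock. */
    scan->rs_cbuf = ReadBufferHit(scan->rs_rd, page, &was_cached);
    if (!was_cached)
        ss_report_location(scan, page);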

Testing:
* Multiple scans on different tables, causing movement in the LRU list
* Measure the CPU overhead for a single scan
* Measure the lock contention with multiple scanners

Any thoughts? Any additional test cases you'd like to see?

Jeff Davis wrote:
> This is my latest revision of the Sync Scan patch, and it implements the
> observability as discussed with Simon.
>
> Changes:
>  * ss_report_loc() called once per hundred pages rather than once per
> page
>  * DEBUG messages are a little cleaner and easier to parse, for the sake
> of analysis after the fact.
>  * DEBUG2 reports a sync scan starting, the relation size in pages, and
> the location at which the scan starts.
>  * DEBUG2 reports the location of a scan every 50k pages, DEBUG3 every
> 5k pages (before it was 100k/10k at DEBUG3/DEBUG4, respectively).
> Numbers are aligned along 5k boundaries to make analysis easier.
>  * GUCs:
>    * sync_seqscan_threshold: fraction of NBuffers for the threshold
>    * sync_seqscan_offset: fraction of NBuffers for the offset
>    * trace_sync_seqscan: will be used in final version of patch to
> control DEBUG output
>
> Sync_scan_offset may be eliminated completely if it's not shown to be
> useful enough in conjunction with Simon's patch. Sync Scans are still a
> big win without sync_seqscan_offset.
>
> Sync_scan_threshold=<real> may be turned into sync_seqscan=<boolean>
> with a fixed activation threshold (NBuffers/2 per Simon's suggestion).
> The reason is that synchronized scans should activate at the same
> threshold as Simon's scan_recycle_buffers feature. Should we make a
> "#define BIG_SCAN_THRESHOLD NBuffers/2" to use for both sync_seqscan and
> for scan_recycle_buffers?
>
> Regards,
>     Jeff Davis
>
>
> ------------------------------------------------------------------------
>
> diff -cr postgresql-8.2.3/src/backend/access/heap/heapam.c postgresql-8.2.3-syncscan/src/backend/access/heap/heapam.c
> *** postgresql-8.2.3/src/backend/access/heap/heapam.c    2007-02-04 12:00:49.000000000 -0800
> --- postgresql-8.2.3-syncscan/src/backend/access/heap/heapam.c    2007-03-13 23:21:27.000000000 -0700
> ***************
> *** 65,70 ****
> --- 65,279 ----
>    * ----------------------------------------------------------------
>    */
>
> + static BlockNumber ss_init(HeapScanDesc);
> + static int         ss_store_hint(HeapScanDesc,BlockNumber);
> + static int         ss_hash(HeapScanDesc);
> + bool Trace_sync_seqscan = false;
> + double sync_seqscan_threshold = DEFAULT_SYNC_SCAN_THRESHOLD;
> + double sync_seqscan_offset = DEFAULT_SYNC_SCAN_OFFSET;
> +
> + /*
> +  * ss_init:
> +  *
> +  * This function reads the Sync Scan Hint Table
> +  * (creating it if it doesn't already exist) to
> +  * find a possible location for an already running
> +  * sequential scan on this relation.
> +  *
> +  * By starting a sequential scan near the location
> +  * of an already running scan, we improve the chance
> +  * of finding pages in cache.
> +  *
> +  * Also, depending on SYNC_SCAN_START_OFFSET, this
> +  * function will subtract from the hint before
> +  * starting the scan, in order to pick up pages that
> +  * are likely to already be in cache.
> +  *
> +  * This function assumes that scan->rs_nblocks is
> +  * already properly set, and sets scan->rs_start_page
> +  * to a value based on the hint found. Also, it sets
> +  * scan->rs_hint to point to the location of the hint
> +  * in the hint table.
> +  */
> + static BlockNumber ss_init(HeapScanDesc scan)
> + {
> +     ss_hint_t *hint_table;
> +     int table_offset;
> +     bool found;
> +     int threshold = sync_seqscan_threshold * NBuffers;
> +     int offset = sync_seqscan_offset * NBuffers;
> +
> +     /*
> +      * If the table is not large compared to effective_cache_size,
> +      * don't Sync Scan.
> +      */
> +     if(scan->rs_nblocks < threshold)
> +     {
> +         elog(DEBUG2,"SYNC_SCAN: Table too small to sync scan");
> +         scan->rs_start_page = 0;
> +         return 0;
> +     }
> +
> +     table_offset = ss_hash(scan);
> +     hint_table = (ss_hint_t*)ShmemInitStruct("Sync Scan Hint Table",
> +         SYNC_SCAN_TABLE_SIZE*sizeof(ss_hint_t),&found);
> +
> +     scan->rs_hint = &hint_table[table_offset];
> +
> +     /*
> +      * If we just created the hint table for the first time,
> +      * initialize the table to zero and start the scan at page 0.
> +      */
> +     if(!found) {
> +         elog(DEBUG2,"SYNC_SCAN: Created Hint Table");
> +         memset(hint_table,0,sizeof(ss_hint_t)*SYNC_SCAN_TABLE_SIZE);
> +         scan->rs_start_page = 0;
> +         return 0;
> +     }
> +
> +     /*
> +      * If the hint's relid is 0, that means
> +      * we have not previously created a hint
> +      * at this location in the table.
> +      */
> +     if(scan->rs_hint->relid == 0) {
> +         elog(DEBUG2, "SYNC_SCAN: Hint empty");
> +         scan->rs_start_page = 0;
> +         return 0;
> +     }
> +
> +     /*
> +      * If the relid doesn't match the one in the hint,
> +      * we have a hash collision.
> +      */
> +     if(RelationGetRelid(scan->rs_rd) != scan->rs_hint->relid)
> +     {
> +         elog(DEBUG1,"SYNC_SCAN: Hash collision");
> +         scan->rs_start_page = 0;
> +         return 0;
> +     }
> +
> +     /*
> +      * If the hint is not a valid block number
> +      * for this relation, start at 0.
> +      *
> +      * This can happen if, for instance, someone
> +      * TRUNCATEd the table between when the hint
> +      * was set and now.
> +      */
> +     if(scan->rs_hint->location < 0 ||
> +         scan->rs_hint->location >= scan->rs_nblocks)
> +     {
> +         elog(DEBUG2,"SYNC_SCAN: Hint %d out of range." \
> +                 " Relation has %d pages.",
> +             scan->rs_hint->location,scan->rs_nblocks);
> +         scan->rs_start_page = 0;
> +         return 0;
> +     }
> +
> +     scan->rs_start_page = scan->rs_hint->location;
> +
> +     /*
> +      * By starting at offset earlier than the hint,
> +      * it's likely that all of the blocks will already be
> +      * cached, and the scan will quickly catch up to the head.
> +      *
> +      * offset is a positive value that will be
> +      * subtracted from the hint.
> +      */
> +     if(offset > scan->rs_nblocks)
> +     {
> +         elog(DEBUG2,"SYNC_SCAN: Relation smaller than start offset: %d",
> +             offset);
> +         return 0;
> +     }
> +
> +     /*
> +      * If subtracting the offset would bring the value
> +      * to less than 0, we circle backwards to the end of the
> +      * file.
> +      */
> +     if(offset > scan->rs_start_page)
> +         scan->rs_start_page += scan->rs_nblocks;
> +
> +     scan->rs_start_page -= offset;
> +
> +     elog(DEBUG2,"SYNC_SCAN: START: OID = %d; Location = %d; Size: %d",
> +         RelationGetRelid(scan->rs_rd),
> +         scan->rs_start_page,scan->rs_nblocks);
> +
> +     return 0;
> + }
> +
> + /*
> +  * ss_store_hint:
> +  *
> +  * Writes an entry in the Sync Scan Hint Table
> +  * of the form (relid,blocknumber). This will
> +  * overwrite any existing entry that may collide
> +  * with this entry in the table.
> +  *
> +  * No locking is performed here. When this data is
> +  * later read by ss_init(), sanity checking is
> +  * performed to ensure we don't use an invalid
> +  * relation block number.
> +  */
> + static int ss_store_hint(HeapScanDesc scan, BlockNumber location)
> + {
> +     ss_hint_t hint;
> +     int threshold = sync_seqscan_threshold * NBuffers;
> +     int offset = sync_seqscan_offset * NBuffers;
> +
> +     /*
> +      * Print every 50k pages at DEBUG2
> +      * and every 5k pages at DEBUG3.
> +      */
> +     if (!(location%50000))
> +         elog(DEBUG2,"page: %d",location);
> +     else if (!(location%5000))
> +         elog(DEBUG3,"page: %d",location);
> +
> +     /*
> +      * If the table is too small, don't bother
> +      * with Sync Scan.
> +      */
> +     if(scan->rs_nblocks < threshold)
> +         return 0;
> +
> +     /*
> +      * If this scan has been progressing for less
> +      * than offset pages, don't store the hint.
> +      */
> +     if(location >= scan->rs_start_page)
> +     {
> +         if((location - scan->rs_start_page) < offset)
> +             return 0;
> +     }
> +     else
> +     {
> +         if((location + scan->rs_nblocks - scan->rs_start_page)
> +             < offset)
> +             return 0;
> +     }
> +
> +     hint.relid = RelationGetRelid(scan->rs_rd);
> +     hint.location = location;
> +
> +     *scan->rs_hint = hint;
> +
> +     return 0;
> + }
> +
> + /*
> +  * This is a simplistic function to hash
> +  * the Oid of the relation for placement in
> +  * the Sync Scan Hint Table
> +  */
> + static int ss_hash(HeapScanDesc scan)
> + {
> +     return RelationGetRelid(scan->rs_rd) % SYNC_SCAN_TABLE_SIZE;
> + }
> +
>   /* ----------------
>    *        initscan - scan code common to heap_beginscan and heap_rescan
>    * ----------------
> ***************
> *** 81,86 ****
> --- 290,300 ----
>        */
>       scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
>
> +     /*
> +      * Choose a good place to start the relation scan.
> +      */
> +     ss_init(scan);
> +
>       scan->rs_inited = false;
>       scan->rs_ctup.t_data = NULL;
>       ItemPointerSetInvalid(&scan->rs_ctup.t_self);
> ***************
> *** 223,229 ****
>                   tuple->t_data = NULL;
>                   return;
>               }
> !             page = 0;            /* first page */
>               heapgetpage(scan, page);
>               lineoff = FirstOffsetNumber;        /* first offnum */
>               scan->rs_inited = true;
> --- 437,447 ----
>                   tuple->t_data = NULL;
>                   return;
>               }
> !             /*
> !              * start the scan at the location that we chose
> !              * in ss_init()
> !              */
> !             page = scan->rs_start_page;
>               heapgetpage(scan, page);
>               lineoff = FirstOffsetNumber;        /* first offnum */
>               scan->rs_inited = true;
> ***************
> *** 364,378 ****
>           }
>
>           /*
> !          * if we get here, it means we've exhausted the items on this page and
>            * it's time to move to the next.
>            */
>           LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
>
>           /*
> !          * return NULL if we've exhausted all the pages
>            */
> !         if (backward ? (page == 0) : (page + 1 >= scan->rs_nblocks))
>           {
>               if (BufferIsValid(scan->rs_cbuf))
>                   ReleaseBuffer(scan->rs_cbuf);
> --- 582,611 ----
>           }
>
>           /*
> !          * If we get here, it means we've exhausted the items on this page and
>            * it's time to move to the next.
> +          *
> +          * For the forward scan, we need to wrap around to the beginning
> +          * of the relation file if we reach the end.
>            */
>           LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
>
> +         if(backward)
> +             page--;
> +         else
> +             page = (page + 1) % (scan->rs_nblocks);
> +
> +         if(! (page % SYNC_SCAN_REPORT_INTERVAL) )
> +             ss_store_hint(scan,page);
> +
>           /*
> !          * Return NULL if we've exhausted all the pages.
> !          * For reverse scans, that means we've reached 0. For
> !          * forward scans, that means we've reached the page on
> !          * which we started.
>            */
> !         if ((backward && (page == 0)) ||
> !             ((page%(scan->rs_nblocks)) == scan->rs_start_page))
>           {
>               if (BufferIsValid(scan->rs_cbuf))
>                   ReleaseBuffer(scan->rs_cbuf);
> ***************
> *** 383,390 ****
>               return;
>           }
>
> -         page = backward ? (page - 1) : (page + 1);
> -
>           heapgetpage(scan, page);
>
>           LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
> --- 616,621 ----
> ***************
> *** 450,456 ****
>                   tuple->t_data = NULL;
>                   return;
>               }
> !             page = 0;            /* first page */
>               heapgetpage(scan, page);
>               lineindex = 0;
>               scan->rs_inited = true;
> --- 681,691 ----
>                   tuple->t_data = NULL;
>                   return;
>               }
> !             /*
> !              * start the scan at the location that we chose
> !              * in ss_init()
> !              */
> !             page = scan->rs_start_page;
>               heapgetpage(scan, page);
>               lineindex = 0;
>               scan->rs_inited = true;
> ***************
> *** 585,598 ****
>           }
>
>           /*
> !          * if we get here, it means we've exhausted the items on this page and
>            * it's time to move to the next.
>            */
>
>           /*
> !          * return NULL if we've exhausted all the pages
>            */
> !         if (backward ? (page == 0) : (page + 1 >= scan->rs_nblocks))
>           {
>               if (BufferIsValid(scan->rs_cbuf))
>                   ReleaseBuffer(scan->rs_cbuf);
> --- 820,847 ----
>           }
>
>           /*
> !          * If we get here, it means we've exhausted the items on this page and
>            * it's time to move to the next.
> +          *
> +          * For the forward scan, we need to wrap around to the beginning
> +          * of the relation file if we reach the end.
>            */
> +         if(backward)
> +             page--;
> +         else
> +             page = (page + 1) % (scan->rs_nblocks);
> +
> +         if(! (page % SYNC_SCAN_REPORT_INTERVAL) )
> +             ss_store_hint(scan,page);
>
>           /*
> !          * Return NULL if we've exhausted all the pages.
> !          * For reverse scans, that means we've reached 0. For
> !          * forward scans, that means we've reached the page on
> !          * which we started.
>            */
> !         if ((backward && (page == 0)) ||
> !             ((page%(scan->rs_nblocks)) == scan->rs_start_page))
>           {
>               if (BufferIsValid(scan->rs_cbuf))
>                   ReleaseBuffer(scan->rs_cbuf);
> ***************
> *** 603,609 ****
>               return;
>           }
>
> -         page = backward ? (page - 1) : (page + 1);
>           heapgetpage(scan, page);
>
>           dp = (Page) BufferGetPage(scan->rs_cbuf);
> --- 852,857 ----
> ***************
> *** 616,621 ****
> --- 864,880 ----
>       }
>   }
>
> + /*
> +  * SyncScanShmemSize:
> +  *
> +  * Called by CreateSharedMemoryAndSemaphores()
> +  * to find out how much room the Sync Scan Hint
> +  * Table will need to occupy.
> +  */
> + Size SyncScanShmemSize(void)
> + {
> +     return SYNC_SCAN_TABLE_SIZE*sizeof(ss_hint_t);
> + }
>
>   #if defined(DISABLE_COMPLEX_MACRO)
>   /*
> diff -cr postgresql-8.2.3/src/backend/storage/ipc/ipci.c postgresql-8.2.3-syncscan/src/backend/storage/ipc/ipci.c
> *** postgresql-8.2.3/src/backend/storage/ipc/ipci.c    2006-10-15 15:04:07.000000000 -0700
> --- postgresql-8.2.3-syncscan/src/backend/storage/ipc/ipci.c    2007-03-13 21:58:56.000000000 -0700
> ***************
> *** 19,24 ****
> --- 19,25 ----
>   #include "access/nbtree.h"
>   #include "access/subtrans.h"
>   #include "access/twophase.h"
> + #include "access/heapam.h"
>   #include "miscadmin.h"
>   #include "pgstat.h"
>   #include "postmaster/bgwriter.h"
> ***************
> *** 110,115 ****
> --- 111,117 ----
>           size = add_size(size, FreeSpaceShmemSize());
>           size = add_size(size, BgWriterShmemSize());
>           size = add_size(size, BTreeShmemSize());
> +         size = add_size(size, SyncScanShmemSize());
>   #ifdef EXEC_BACKEND
>           size = add_size(size, ShmemBackendArraySize());
>   #endif
> diff -cr postgresql-8.2.3/src/backend/utils/misc/guc.c postgresql-8.2.3-syncscan/src/backend/utils/misc/guc.c
> *** postgresql-8.2.3/src/backend/utils/misc/guc.c    2006-11-29 06:50:07.000000000 -0800
> --- postgresql-8.2.3-syncscan/src/backend/utils/misc/guc.c    2007-03-13 23:23:31.000000000 -0700
> ***************
> *** 25,31 ****
>   #include <syslog.h>
>   #endif
>
> !
>   #include "access/gin.h"
>   #include "access/twophase.h"
>   #include "access/xact.h"
> --- 25,31 ----
>   #include <syslog.h>
>   #endif
>
> ! #include "access/heapam.h"
>   #include "access/gin.h"
>   #include "access/twophase.h"
>   #include "access/xact.h"
> ***************
> *** 758,763 ****
> --- 758,773 ----
>           false, NULL, NULL
>       },
>
> +     {
> +         {"trace_sync_seqscan", PGC_USERSET, DEVELOPER_OPTIONS,
> +             gettext_noop("Generates debugging output for Synchronized Scans."),
> +             NULL,
> +             GUC_NOT_IN_SAMPLE
> +         },
> +         &Trace_sync_seqscan,
> +         false, NULL, NULL
> +     },
> +
>   #ifdef LOCK_DEBUG
>       {
>           {"trace_locks", PGC_SUSET, DEVELOPER_OPTIONS,
> ***************
> *** 1723,1728 ****
> --- 1733,1754 ----
>           DEFAULT_GEQO_SELECTION_BIAS, MIN_GEQO_SELECTION_BIAS,
>           MAX_GEQO_SELECTION_BIAS, NULL, NULL
>       },
> +     {
> +         {"sync_seqscan_threshold", PGC_USERSET, QUERY_TUNING_SYNC_SEQSCAN,
> +             gettext_noop("Minimum size of table before synchronized scanning takes effect, as a fraction of
shared_buffers."),
> +             NULL
> +         },
> +         &sync_seqscan_threshold,
> +         DEFAULT_SYNC_SCAN_THRESHOLD, 0.0, 100.0, NULL, NULL
> +     },
> +     {
> +         {"sync_seqscan_offset", PGC_USERSET, QUERY_TUNING_SYNC_SEQSCAN,
> +             gettext_noop("Start synchronized scans at this offset (as a fraction of shared_buffers) before other
scans."),
> +             NULL
> +         },
> +         &sync_seqscan_offset,
> +         DEFAULT_SYNC_SCAN_OFFSET, 0.0, 100.0, NULL, NULL
> +     },
>
>       {
>           {"bgwriter_lru_percent", PGC_SIGHUP, RESOURCES,
> diff -cr postgresql-8.2.3/src/include/access/heapam.h postgresql-8.2.3-syncscan/src/include/access/heapam.h
> *** postgresql-8.2.3/src/include/access/heapam.h    2006-11-05 14:42:10.000000000 -0800
> --- postgresql-8.2.3-syncscan/src/include/access/heapam.h    2007-03-13 23:22:11.000000000 -0700
> ***************
> *** 25,30 ****
> --- 25,49 ----
>   #include "utils/rel.h"
>   #include "utils/tqual.h"
>
> + /*
> +  * Size of the Sync Scan Hint Table.
> +  */
> + #define SYNC_SCAN_TABLE_SIZE   1000
> +
> + /*
> +  * Interval between reports of the location
> +  * of the current scan, in pages.
> +  */
> + #define SYNC_SCAN_REPORT_INTERVAL 100
> +
> + #define DEFAULT_SYNC_SCAN_THRESHOLD 1.0
> + #define DEFAULT_SYNC_SCAN_OFFSET 0.0
> +
> + extern DLLIMPORT bool Trace_sync_seqscan;
> + extern DLLIMPORT double sync_seqscan_threshold;
> + extern DLLIMPORT double sync_seqscan_offset;
> + extern Size SyncScanShmemSize(void);
> +
>   /* ----------------
>    *        fastgetattr
>    *
> diff -cr postgresql-8.2.3/src/include/access/relscan.h postgresql-8.2.3-syncscan/src/include/access/relscan.h
> *** postgresql-8.2.3/src/include/access/relscan.h    2006-10-03 17:30:07.000000000 -0700
> --- postgresql-8.2.3-syncscan/src/include/access/relscan.h    2007-03-13 21:58:56.000000000 -0700
> ***************
> *** 19,24 ****
> --- 19,33 ----
>   #include "utils/tqual.h"
>
>
> + /*
> +  * Structure of an entry in the
> +  * Sync Scan Hint Table.
> +  */
> + typedef struct {
> +     Oid         relid;    /* The relid that tags this hint entry */
> +     BlockNumber location; /* The location in the relation */
> + } ss_hint_t;
> +
>   typedef struct HeapScanDescData
>   {
>       /* scan parameters */
> ***************
> *** 33,38 ****
> --- 42,49 ----
>       bool        rs_inited;        /* false = scan not init'd yet */
>       HeapTupleData rs_ctup;        /* current tuple in scan, if any */
>       BlockNumber rs_cblock;        /* current block # in scan, if any */
> +     BlockNumber rs_start_page;  /* page where this scan began */
> +     ss_hint_t    *rs_hint;        /* pointer to scan hint */
>       Buffer        rs_cbuf;        /* current buffer in scan, if any */
>       /* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
>       ItemPointerData rs_mctid;    /* marked scan position, if any */
> diff -cr postgresql-8.2.3/src/include/utils/guc_tables.h postgresql-8.2.3-syncscan/src/include/utils/guc_tables.h
> *** postgresql-8.2.3/src/include/utils/guc_tables.h    2006-10-03 14:11:55.000000000 -0700
> --- postgresql-8.2.3-syncscan/src/include/utils/guc_tables.h    2007-03-13 22:41:15.000000000 -0700
> ***************
> *** 56,61 ****
> --- 56,62 ----
>       QUERY_TUNING_METHOD,
>       QUERY_TUNING_COST,
>       QUERY_TUNING_GEQO,
> +     QUERY_TUNING_SYNC_SEQSCAN,
>       QUERY_TUNING_OTHER,
>       LOGGING,
>       LOGGING_WHERE,
>
>
> ------------------------------------------------------------------------


--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/access/heap/Makefile
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/Makefile,v
retrieving revision 1.15
diff -c -r1.15 Makefile
*** src/backend/access/heap/Makefile    8 Apr 2007 01:26:27 -0000    1.15
--- src/backend/access/heap/Makefile    25 May 2007 21:13:17 -0000
***************
*** 12,18 ****
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global

! OBJS = heapam.o hio.o rewriteheap.o tuptoaster.o

  all: SUBSYS.o

--- 12,18 ----
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global

! OBJS = heapam.o hio.o rewriteheap.o tuptoaster.o syncscan.o

  all: SUBSYS.o

Index: src/backend/access/heap/heapam.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v
retrieving revision 1.232
diff -c -r1.232 heapam.c
*** src/backend/access/heap/heapam.c    8 Apr 2007 01:26:27 -0000    1.232
--- src/backend/access/heap/heapam.c    27 May 2007 11:54:52 -0000
***************
*** 83,88 ****
--- 83,93 ----
       */
      scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);

+     /*
+      * Choose a good place to start the relation scan.
+      */
+     scan->rs_start_page = ss_get_location(scan);
+
      scan->rs_inited = false;
      scan->rs_ctup.t_data = NULL;
      ItemPointerSetInvalid(&scan->rs_ctup.t_self);
***************
*** 203,208 ****
--- 208,214 ----
      Snapshot    snapshot = scan->rs_snapshot;
      bool        backward = ScanDirectionIsBackward(dir);
      BlockNumber page;
+     BlockNumber prevpage;
      Page        dp;
      int            lines;
      OffsetNumber lineoff;
***************
*** 225,231 ****
                  tuple->t_data = NULL;
                  return;
              }
!             page = 0;            /* first page */
              heapgetpage(scan, page);
              lineoff = FirstOffsetNumber;        /* first offnum */
              scan->rs_inited = true;
--- 231,241 ----
                  tuple->t_data = NULL;
                  return;
              }
!             /*
!              * start the scan at the location that we chose
!              * in ss_get_location()
!              */
!             page = scan->rs_start_page;
              heapgetpage(scan, page);
              lineoff = FirstOffsetNumber;        /* first offnum */
              scan->rs_inited = true;
***************
*** 259,265 ****
                  tuple->t_data = NULL;
                  return;
              }
!             page = scan->rs_nblocks - 1;        /* final page */
              heapgetpage(scan, page);
          }
          else
--- 269,275 ----
                  tuple->t_data = NULL;
                  return;
              }
!             page = scan->rs_start_page;
              heapgetpage(scan, page);
          }
          else
***************
*** 366,380 ****
          }

          /*
!          * if we get here, it means we've exhausted the items on this page and
           * it's time to move to the next.
           */
          LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);

          /*
!          * return NULL if we've exhausted all the pages
           */
!         if (backward ? (page == 0) : (page + 1 >= scan->rs_nblocks))
          {
              if (BufferIsValid(scan->rs_cbuf))
                  ReleaseBuffer(scan->rs_cbuf);
--- 376,417 ----
          }

          /*
!          * If we get here, it means we've exhausted the items on this page and
           * it's time to move to the next.
+          *
+          * For the forward scan, we need to wrap around to the beginning
+          * of the relation file if we reach the end.
           */
          LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);

+         prevpage = page;
+         if(backward) {
+             if(page == 0)
+                 page = scan->rs_nblocks;
+             page--;
+         }
+         else
+             page = (page + 1) % (scan->rs_nblocks);
+
+         if(Trace_sync_seqscan)
+         {
+             if (!(page%50000))
+                 elog(DEBUG2,"page: %d",page);
+             else if (!(page%5000))
+                 elog(DEBUG3,"page: %d",page);
+         }
+
+         /* XXX: only report if physical I/O happened */
+         ss_report_location(scan, page);
+
          /*
!          * Return NULL if we've exhausted all the pages.
!          * For reverse scans, that means we've reached 0. For
!          * forward scans, that means we've reached the page on
!          * which we started.
           */
!         if((ScanDirectionIsForward(dir)  && page == scan->rs_start_page) ||
!            (ScanDirectionIsBackward(dir) && prevpage == scan->rs_start_page))
          {
              if (BufferIsValid(scan->rs_cbuf))
                  ReleaseBuffer(scan->rs_cbuf);
***************
*** 385,392 ****
              return;
          }

-         page = backward ? (page - 1) : (page + 1);
-
          heapgetpage(scan, page);

          LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
--- 422,427 ----
***************
*** 429,434 ****
--- 464,470 ----
      HeapTuple    tuple = &(scan->rs_ctup);
      bool        backward = ScanDirectionIsBackward(dir);
      BlockNumber page;
+     BlockNumber prevpage;
      Page        dp;
      int            lines;
      int            lineindex;
***************
*** 452,458 ****
                  tuple->t_data = NULL;
                  return;
              }
!             page = 0;            /* first page */
              heapgetpage(scan, page);
              lineindex = 0;
              scan->rs_inited = true;
--- 488,498 ----
                  tuple->t_data = NULL;
                  return;
              }
!             /*
!              * start the scan at the location that we chose
!              * in ss_get_location()
!              */
!             page = scan->rs_start_page;
              heapgetpage(scan, page);
              lineindex = 0;
              scan->rs_inited = true;
***************
*** 483,489 ****
                  tuple->t_data = NULL;
                  return;
              }
!             page = scan->rs_nblocks - 1;        /* final page */
              heapgetpage(scan, page);
          }
          else
--- 523,529 ----
                  tuple->t_data = NULL;
                  return;
              }
!             page = scan->rs_start_page;
              heapgetpage(scan, page);
          }
          else
***************
*** 587,600 ****
          }

          /*
!          * if we get here, it means we've exhausted the items on this page and
           * it's time to move to the next.
           */

          /*
!          * return NULL if we've exhausted all the pages
           */
!         if (backward ? (page == 0) : (page + 1 >= scan->rs_nblocks))
          {
              if (BufferIsValid(scan->rs_cbuf))
                  ReleaseBuffer(scan->rs_cbuf);
--- 627,665 ----
          }

          /*
!          * If we get here, it means we've exhausted the items on this page and
           * it's time to move to the next.
+          *
+          * For the forward scan, we need to wrap around to the beginning
+          * of the relation file if we reach the end.
           */
+         prevpage = page;
+         if(backward) {
+             if(page == 0)
+                 page = scan->rs_nblocks;
+             page--;
+         }
+         else
+             page = (page + 1) % (scan->rs_nblocks);
+
+         if(Trace_sync_seqscan)
+         {
+             if (!(page%50000))
+                 elog(DEBUG2,"page: %d",page);
+             else if (!(page%5000))
+                 elog(DEBUG3,"page: %d",page);
+         }
+
+         ss_report_location(scan,page);

          /*
!          * Return NULL if we've exhausted all the pages.
!          * For reverse scans, that means we've reached 0. For
!          * forward scans, that means we've reached the page on
!          * which we started.
           */
!         if((ScanDirectionIsForward(dir)  && page == scan->rs_start_page) ||
!            (ScanDirectionIsBackward(dir) && prevpage == scan->rs_start_page))
          {
              if (BufferIsValid(scan->rs_cbuf))
                  ReleaseBuffer(scan->rs_cbuf);
***************
*** 605,611 ****
              return;
          }

-         page = backward ? (page - 1) : (page + 1);
          heapgetpage(scan, page);

          dp = (Page) BufferGetPage(scan->rs_cbuf);
--- 670,675 ----
***************
*** 618,624 ****
      }
  }

-
  #if defined(DISABLE_COMPLEX_MACRO)
  /*
   * This is formatted so oddly so that the correspondence to the macro
--- 682,687 ----
Index: src/backend/access/heap/syncscan.c
===================================================================
RCS file: src/backend/access/heap/syncscan.c
diff -N src/backend/access/heap/syncscan.c
*** /dev/null    1 Jan 1970 00:00:00 -0000
--- src/backend/access/heap/syncscan.c    28 May 2007 07:10:39 -0000
***************
*** 0 ****
--- 1,309 ----
+ /*-------------------------------------------------------------------------
+  *
+  * syncscan.c
+  *      heap scan synchronization support
+  *
+  * When multiple backends run a sequential scan on the same table, we try
+  * to keep them synchronized to reduce the overall I/O needed. The goal is
+  * to read in each page only once to shared buffer cache, and let all backends
+  * that take part in the scan process the page before it falls out of the
+  * cache.
+  *
+  * To accomplish that, we keep track of the scan position of each table, and
+  * start new scans close to where the previous scan(s) are. Assuming that I/O
+  * is slow compared to the processing of each page, that's enough to keep the
+  * scans synchronized. (XXX explain why)
+  *
+  * There can realistically only be a few sequential scans on different tables
+  * in progress at any time. Therefore we just keep the scan positions in a
+  * small LRU list which we scan every time we need to look up or update a scan
+  * position.
+  *
+  *
+  * INTERFACE ROUTINES
+  *        ss_get_location        - return current scan location of a relation
+  *        ss_report_location    - update current scan location
+  *
+  * NOTES
+  *      XXX This file contains the heap_ routines which implement
+  *      the POSTGRES heap access method used for all POSTGRES
+  *      relations.
+  *
+  * Portions Copyright (c) 1996-2007, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * IDENTIFICATION
+  *      $PostgreSQL$
+  *
+  *-------------------------------------------------------------------------
+  */
+ #include "postgres.h"
+
+ #include "access/heapam.h"
+ #include "access/hio.h"
+ #include "access/multixact.h"
+ #include "access/transam.h"
+ #include "access/tuptoaster.h"
+ #include "access/valid.h"
+ #include "access/xact.h"
+ #include "catalog/catalog.h"
+ #include "catalog/namespace.h"
+ #include "miscadmin.h"
+ #include "pgstat.h"
+ #include "storage/procarray.h"
+ #include "storage/smgr.h"
+ #include "utils/inval.h"
+ #include "utils/lsyscache.h"
+ #include "utils/relcache.h"
+ #include "utils/syscache.h"
+
+ /*
+  * Size of the LRU list.
+  *
+  * Note: the code assumes that SYNC_SCAN_NELEM > 1.
+  */
+ #define SYNC_SCAN_NELEM 100        /* XXX: What's a good value? */
+
+ /*
+  * Interval between reports of the location
+  * of the current scan, in pages.
+  *
+  * XXX: note relationship to buffer ring size
+  */
+ #define SYNC_SCAN_REPORT_INTERVAL 16
+
+
+ /* GUC variables */
+ double sync_seqscan_threshold = DEFAULT_SYNC_SCAN_THRESHOLD;
+ bool Trace_sync_seqscan = false;
+
+ /*
+  * The scan locations structure is essentially a doubly-linked LRU with head
+  * and tail pointer, but designed to hold a fixed maximum number of elements in
+  * fixed-size shared memory.
+  */
+
+ typedef struct ss_scan_location_t {
+     RelFileNode relfilenode;    /* The relfilenode that tags this entry */
+     BlockNumber location;        /* The location in the relation */
+ } ss_scan_location_t;
+
+ typedef struct ss_lru_item_t {
+     struct ss_lru_item_t    *prev;
+     struct ss_lru_item_t    *next;
+     ss_scan_location_t        location;
+ } ss_lru_item_t;
+
+ typedef struct ss_scan_locations_t {
+     ss_lru_item_t        *head;
+     ss_lru_item_t        *tail;
+     ss_lru_item_t        items[]; /* SYNC_SCAN_NELEM items */
+ } ss_scan_locations_t;
+
+ #define SizeOfScanLocations offsetof(ss_scan_locations_t, items[SYNC_SCAN_NELEM])
+
+ /* Pointer to struct in shared memory */
+ static ss_scan_locations_t *scan_locations;
+
+
+ /* prototypes for internal functions */
+ static BlockNumber ss_search(RelFileNode,BlockNumber,bool);
+
+
+ /*
+  * SyncScanShmemSize --- report amount of shared memory space needed
+  */
+ Size SyncScanShmemSize(void)
+ {
+     return SizeOfScanLocations;
+ }
+
+ /*
+  * SyncScanShmemInit --- initialize this module's shared memory
+  */
+ void SyncScanShmemInit()
+ {
+     int i;
+     bool found;
+
+     scan_locations = (ss_scan_locations_t *)
+         ShmemInitStruct("Sync Scan Locations List", SizeOfScanLocations, &found);
+
+     if(!IsUnderPostmaster)
+     {
+         /* Initialize shared memory area */
+         Assert(!found);
+
+         scan_locations->head = &scan_locations->items[0];
+         scan_locations->tail = &scan_locations->items[SYNC_SCAN_NELEM - 1];
+
+         for(i = 0; i < SYNC_SCAN_NELEM; i++)
+         {
+             ss_lru_item_t *item = &scan_locations->items[i];
+
+             /* Initialize all slots with invalid values. As scans are started,
+              * these invalid entries will fall off the LRU list and be replaced
+              * with real entries.
+              */
+             item->location.relfilenode.spcNode = InvalidOid;
+             item->location.relfilenode.dbNode = InvalidOid;
+             item->location.relfilenode.relNode = InvalidOid;
+             item->location.location = InvalidBlockNumber;
+
+             item->prev = (i > 0) ?
+                 (&scan_locations->items[i - 1]) : NULL;
+             item->next = (i < SYNC_SCAN_NELEM - 1) ?
+                 (&scan_locations->items[i + 1]) : NULL;
+         }
+     }
+     else
+         Assert(found);
+
+ }
+
+ /*
+  * ss_search --- searches the scan_locations structure for an entry with the
+  *        given relfilenode.
+  *
+  * If "set" is true, the location is updated to the given location.  If no
+  * entry for the given relfilenode is found, it will be created at the head
+  * of the list with the given location, even if set == false.
+  *
+  * In any case, the location after possible update is returned.
+  */
+ static BlockNumber ss_search(RelFileNode relfilenode,
+     BlockNumber location, bool set)
+ {
+     ss_lru_item_t    *item;
+
+     item = scan_locations->head;
+
+     for (;;)
+     {
+         bool match;
+
+         match = RelFileNodeEquals(item->location.relfilenode, relfilenode);
+
+         if (match || item->next == NULL)
+         {
+
+             /*
+              * If we reached the end of list and no match was found,
+              * take over the last entry
+              */
+             if (!match)
+             {
+                 item->location.relfilenode = relfilenode;
+                 item->location.location = location;
+             }
+             else if (set)
+                 item->location.location = location;
+
+             /* Move the entry to the front of the LRU list */
+             if (item != scan_locations->head)
+             {
+                 /* unlink */
+                 if(item == scan_locations->tail)
+                     scan_locations->tail = item->prev;
+                 item->prev->next = item->next;
+                 if(item->next)
+                     item->next->prev = item->prev;
+
+                 /* link */
+                 item->prev = NULL;
+                 item->next = scan_locations->head;
+                 scan_locations->head->prev = item;
+                 scan_locations->head = item;
+             }
+
+             return item->location.location;
+         }
+         item = item->next;
+     }
+     /* never reached */
+ }
+
+ /*
+  * ss_get_location --- gets the optimal starting location for scan
+  *
+  * Returns the location of an already running sequential scan on the
+  * relation, or 0 if no valid location is found.
+  *
+  * This function assumes that scan->rs_nblocks is already properly set.
+  */
+ BlockNumber ss_get_location(HeapScanDesc scan)
+ {
+     int threshold = sync_seqscan_threshold * NBuffers;
+     BlockNumber startloc;
+
+     /*
+      * If the table is not large enough, or sync_scan_threshold
+      * is disabled (negative), don't Sync Scan.
+      */
+     if(threshold < 0 || scan->rs_nblocks < threshold)
+         return 0;
+
+     LWLockAcquire(SyncScanLock,LW_EXCLUSIVE);
+     startloc = ss_search(scan->rs_rd->rd_node, 0, false);
+     LWLockRelease(SyncScanLock);
+
+     /*
+      * If the location is not a valid block number for this scan, start at 0.
+      *
+      * This can happen if for instance a VACUUM truncated the table
+      * since the location was set.
+      */
+     if(startloc < 0 || startloc >= scan->rs_nblocks)
+     {
+         if(Trace_sync_seqscan)
+             elog(DEBUG2,"SYNC_SCAN: Location %d out of range." \
+                 " Relation has %d pages.",
+                 startloc, scan->rs_nblocks);
+         return 0;
+     }
+
+     if(Trace_sync_seqscan)
+         elog(DEBUG2,"SYNC_SCAN: START: OID = %d; Location = %d; Size: %d",
+             RelationGetRelid(scan->rs_rd),
+             startloc,scan->rs_nblocks);
+
+     return startloc;
+ }
+
+ /*
+  * ss_report_location --- updates the current scan location
+  *
+  * Writes an entry in the Sync Scan structure of the form
+  * (relfilenode,blocknumber), overwriting any existing entry for the same
+  * relfilenode.
+  */
+ void ss_report_location(HeapScanDesc scan, BlockNumber location)
+ {
+     int threshold = sync_seqscan_threshold * NBuffers;
+
+     /*
+      * If the table is not large enough, or sync_scan_threshold
+      * is disabled (negative), don't Sync Scan.
+      */
+     if(threshold < 0 || scan->rs_nblocks < threshold)
+         return;
+
+     /*
+      * To avoid lock congestion, only report scan progress every n pages.
+      * For the same reason, don't block if the lock isn't immediately
+      * available. Missing a few updates isn't critical, it just means that a
+      * scan that wants to join the pack will start a little bit behind the
+      * head of the scan. That's ok because the pages should still be in OS
+      * cache and the scan will quickly catch up.
+      */
+     if (location % SYNC_SCAN_REPORT_INTERVAL == 0)
+     {
+         if (LWLockConditionalAcquire(SyncScanLock, LW_EXCLUSIVE))
+         {
+             ss_search(scan->rs_rd->rd_node,location,true);
+             LWLockRelease(SyncScanLock);
+         }
+     }
+ }
+
Index: src/backend/storage/ipc/ipci.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/ipc/ipci.c,v
retrieving revision 1.91
diff -c -r1.91 ipci.c
*** src/backend/storage/ipc/ipci.c    15 Feb 2007 23:23:23 -0000    1.91
--- src/backend/storage/ipc/ipci.c    25 May 2007 21:28:18 -0000
***************
*** 19,24 ****
--- 19,25 ----
  #include "access/nbtree.h"
  #include "access/subtrans.h"
  #include "access/twophase.h"
+ #include "access/heapam.h"
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "postmaster/autovacuum.h"
***************
*** 112,117 ****
--- 113,119 ----
          size = add_size(size, BgWriterShmemSize());
          size = add_size(size, AutoVacuumShmemSize());
          size = add_size(size, BTreeShmemSize());
+         size = add_size(size, SyncScanShmemSize());
  #ifdef EXEC_BACKEND
          size = add_size(size, ShmemBackendArraySize());
  #endif
***************
*** 216,221 ****
--- 218,224 ----
       * Set up other modules that need some shared memory space
       */
      BTreeShmemInit();
+     SyncScanShmemInit();

  #ifdef EXEC_BACKEND

Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.391
diff -c -r1.391 guc.c
*** src/backend/utils/misc/guc.c    8 May 2007 16:33:51 -0000    1.391
--- src/backend/utils/misc/guc.c    25 May 2007 21:07:53 -0000
***************
*** 25,31 ****
  #include <syslog.h>
  #endif

!
  #include "access/gin.h"
  #include "access/transam.h"
  #include "access/twophase.h"
--- 25,31 ----
  #include <syslog.h>
  #endif

! #include "access/heapam.h"
  #include "access/gin.h"
  #include "access/transam.h"
  #include "access/twophase.h"
***************
*** 783,788 ****
--- 783,798 ----
          false, NULL, NULL
      },

+     {
+         {"trace_sync_seqscan", PGC_USERSET, DEVELOPER_OPTIONS,
+             gettext_noop("Generates debugging output for Synchronized Scans."),
+             NULL,
+             GUC_NOT_IN_SAMPLE
+         },
+         &Trace_sync_seqscan,
+         false, NULL, NULL
+     },
+
  #ifdef LOCK_DEBUG
      {
          {"trace_locks", PGC_SUSET, DEVELOPER_OPTIONS,
***************
*** 1803,1809 ****
          DEFAULT_GEQO_SELECTION_BIAS, MIN_GEQO_SELECTION_BIAS,
          MAX_GEQO_SELECTION_BIAS, NULL, NULL
      },
!
      {
          {"bgwriter_lru_percent", PGC_SIGHUP, RESOURCES,
              gettext_noop("Background writer percentage of LRU buffers to flush per round."),
--- 1813,1826 ----
          DEFAULT_GEQO_SELECTION_BIAS, MIN_GEQO_SELECTION_BIAS,
          MAX_GEQO_SELECTION_BIAS, NULL, NULL
      },
!     {
!         {"sync_seqscan_threshold", PGC_USERSET, QUERY_TUNING_SYNC_SEQSCAN,
!             gettext_noop("Minimum size of table before synchronized scanning takes effect, as a fraction of shared_buffers."),
!             NULL
!         },
!         &sync_seqscan_threshold,
!         DEFAULT_SYNC_SCAN_THRESHOLD, -1.0, 100.0, NULL, NULL
!     },
      {
          {"bgwriter_lru_percent", PGC_SIGHUP, RESOURCES,
              gettext_noop("Background writer percentage of LRU buffers to flush per round."),
Index: src/include/access/heapam.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/heapam.h,v
retrieving revision 1.123
diff -c -r1.123 heapam.h
*** src/include/access/heapam.h    8 Apr 2007 01:26:33 -0000    1.123
--- src/include/access/heapam.h    26 May 2007 18:42:46 -0000
***************
*** 25,30 ****
--- 25,35 ----
  #include "utils/rel.h"
  #include "utils/tqual.h"

+ #define DEFAULT_SYNC_SCAN_THRESHOLD 1.0
+
+ extern bool Trace_sync_seqscan;
+ extern double sync_seqscan_threshold;
+
  /* ----------------
   *        fastgetattr
   *
***************
*** 240,243 ****
--- 245,254 ----

  extern void heap_sync(Relation relation);

+ /* in heapam/syncscan.c */
+ extern void ss_report_location(HeapScanDesc scan, BlockNumber location);
+ extern BlockNumber ss_get_location(HeapScanDesc scan);
+ extern void SyncScanShmemInit(void);
+ extern Size SyncScanShmemSize(void);
+
  #endif   /* HEAPAM_H */
Index: src/include/access/relscan.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/relscan.h,v
retrieving revision 1.52
diff -c -r1.52 relscan.h
*** src/include/access/relscan.h    20 Jan 2007 18:43:35 -0000    1.52
--- src/include/access/relscan.h    25 May 2007 21:35:36 -0000
***************
*** 16,22 ****
--- 16,24 ----

  #include "access/skey.h"
  #include "storage/bufpage.h"
+ #include "storage/lwlock.h"
  #include "utils/tqual.h"
+ #include "utils/dynahash.h"


  typedef struct HeapScanDescData
***************
*** 33,38 ****
--- 35,42 ----
      bool        rs_inited;        /* false = scan not init'd yet */
      HeapTupleData rs_ctup;        /* current tuple in scan, if any */
      BlockNumber rs_cblock;        /* current block # in scan, if any */
+     BlockNumber rs_start_page;  /* page where this scan began */
+ //    ss_hints_t    *rs_hints;        /* pointer to scan hint */
      Buffer        rs_cbuf;        /* current buffer in scan, if any */
      /* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
      ItemPointerData rs_mctid;    /* marked scan position, if any */
Index: src/include/storage/lwlock.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/lwlock.h,v
retrieving revision 1.36
diff -c -r1.36 lwlock.h
*** src/include/storage/lwlock.h    16 Apr 2007 18:30:04 -0000    1.36
--- src/include/storage/lwlock.h    25 May 2007 21:08:27 -0000
***************
*** 62,67 ****
--- 62,68 ----
      AddinShmemInitLock,
      AutovacuumLock,
      AutovacuumScheduleLock,
+     SyncScanLock,
      /* Individual lock IDs end here */
      FirstBufMappingLock,
      FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
Index: src/include/utils/guc_tables.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/utils/guc_tables.h,v
retrieving revision 1.33
diff -c -r1.33 guc_tables.h
*** src/include/utils/guc_tables.h    21 Apr 2007 20:02:41 -0000    1.33
--- src/include/utils/guc_tables.h    25 May 2007 21:07:53 -0000
***************
*** 56,61 ****
--- 56,62 ----
      QUERY_TUNING_METHOD,
      QUERY_TUNING_COST,
      QUERY_TUNING_GEQO,
+     QUERY_TUNING_SYNC_SEQSCAN,
      QUERY_TUNING_OTHER,
      LOGGING,
      LOGGING_WHERE,

Re: Synchronized Scan WIP patch

From: Jeff Davis
On Thu, 2007-05-31 at 09:08 +0100, Heikki Linnakangas wrote:
> Here's a work-in-progress update of this patch.
>
> I haven't done any major changes, but a lot of little refactoring and
> commenting, including:
>
> * moved the sync scan stuff to a new file access/heapam/syncscan.c.
> heapam.c is long enough already, and in theory the same mechanism could
> be used for large bitmap heap scans in the future.

Good idea, I hadn't thought of that. It seems like the bitmaps in two
bitmap scans would have to match very closely, but that sounds
plausible.

This is similar to another idea I had considered (I forget who thought
of it) to try to have a bitmap of "tuples still needed" and then try to
optimize based on that information somehow (read the ones in cache
first, etc). Seems substantially more complex though, more like a
prefetch system at that point.

I expected the general refactoring. Hopefully my next patch is a little
closer to the code expectations and places less burden on the reviewers.

> Testing:
> * Multiple scans on different tables, causing movement in the LRU list
> * Measure the CPU overhead for a single scan
> * Measure the lock contention with multiple scanners
>

Is there any way to measure the necessity of the hash table? I would
think the conditions for that would be a large number of tables being
actively scanned causing a lot of LRU activity such that the locks are
held too long.

I also think the optimization of only reporting when the block is not
found in cache would be useful to test if the lock contention is a problem.

Regards,
    Jeff Davis



Re: Synchronized Scan WIP patch

From: Heikki Linnakangas
Jeff Davis wrote:
> On Thu, 2007-05-31 at 09:08 +0100, Heikki Linnakangas wrote:
>> * moved the sync scan stuff to a new file access/heapam/syncscan.c.
>> heapam.c is long enough already, and in theory the same mechanism could
>> be used for large bitmap heap scans in the future.
>
> Good idea, I hadn't thought of that. It seems like the bitmaps in two
> bitmap scans would have to match very closely, but that sounds
> plausible.

Yeah, it's a pretty narrow use case, but plausible in theory.

> This is similar to another idea I had considered (I forget who thought
> of it) to try to have a bitmap of "tuples still needed" and then try to
> optimize based on that information somehow (read the ones in cache
> first, etc). Seems substantially more complex though, more like a
> prefetch system at that point.
>
> I expected the general refactoring. Hopefully my next patch is a little
> closer to the code expectations and places less burden on the reviewers.

No worries, that's the easy part.

>> Testing:
>> * Multiple scans on different tables, causing movement in the LRU list
>> * Measure the CPU overhead for a single scan
>> * Measure the lock contention with multiple scanners
>
> Is there any way to measure the necessity of the hash table? I would
> think the conditions for that would be a large number of tables being
> actively scanned causing a lot of LRU activity such that the locks are
> held too long.

Yep, and it's hard to imagine a system like that.

> I also think the optimization of only reporting when the block is not
> found in cache would be useful to test if the lock contention is a problem.

I tried to demonstrate lock contention by running 10 backends all
repeatedly scanning different tables that are bigger than the sync scan
threshold but small enough to all fit in OS cache. The total runtime of
the tests was the same, ~45 s, with and without the patch. That's pretty
much the worst case scenario I could think of, so it seems that
contention of the SyncScanLock is not an issue. There was some "missed
updates" of the scan location, due to the LWLockConditionalAcquire that
I put there to reduce lock contention, but not too much to worry about.

I'll post an updated patch shortly.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com