Re: Synchronized Scan update - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Synchronized Scan update
Date
Msg-id 1173705670.3641.600.camel@silverbirch.site
Whole thread Raw
In response to Re: Synchronized Scan update  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Synchronized Scan update  ("Luke Lonergan" <llonergan@greenplum.com>)
Re: Synchronized Scan update  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On Fri, 2007-03-02 at 15:03 -0800, Jeff Davis wrote:
> Is there any consensus about whether to include these two parameters as
> GUCs or constants if my patch is to be accepted?
> 
> (1) sync_scan_threshold: Use synchronized scanning for tables greater
> than this many pages; smaller tables will not be affected.
> (2) sync_scan_offset: Start a new scan this many pages before a
> currently running scan to take advantage of the pages
>  that are likely already in cache.
> 
> Right now they are just constants defined in a header, but a GUC might
> make sense. I'd like to know which version is more acceptable when I
> submit my final patch.

I'm looking at ways of helping Synch Scan and scan_recycle_buffers to
work well together. I've had a look through the synch scan patch in more
detail to help this along.

ISTM they can play nicely together.

Results from tests show that with a single scan we get better
performance with scan_recycle_buffers = ~32. That gives a +25%
performance bonus. Synch scan has the capability to reduce I/O by 50%
across two concurrent scans, so that would be likely to outweigh the
gain from L2 cache reduction, unless we can get both together.

It's possible that multiple scans would use the same CPU, or at least
use multiple cores on the same chip and hence same L2 cache. That's more
likely if the synch scan works well, since the second scan can process
the buffer while the lead process performs I/O. So its likely that on
dual and quad core CPUs that we'll be able to get both benefits in most
cases.

So based on those thoughts, sync_scan_offset should be fixed at 16,
rather than being variable. In addition, ss_report_loc() should only
report its position every 16 blocks, rather than do this every time,
which will reduce overhead of this call.

To match that, scan_recycle_buffers should be fixed at 32. So GUCs for
sync_scan_offset and scan_recycle_buffers would not be required at all.

IMHO we can also remove sync_scan_threshold and just use NBuffers
instead. That way we get the benefit of both patches or neither, making
it easier to understand what's going on.

On my patch, I'll make buffer recycling only work for SeqScans and
VACUUMs.

If need be, the value of scan_recycle_buffers can be varied upwards
should the scans drift apart, as a way of bringing them back together.
We aren't tracking whether they are together or apart, so I would like
to see some debug output from synch scans to allow us to assess how far
behind the second scan is as it progresses. e.g.
LOG:  synch scan currently on block N, trailing pathfinder by M blocks
issued every 128 blocks as we go through the scans. 

Thoughts?

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: Grouped Index Tuples / Clustered Indexes
Next
From: Tom Lane
Date:
Subject: Re: Bug: Buffer cache is not scan resistant