Thread: Synchronized Scan update

Synchronized Scan update

From: Jeff Davis
I have found some interesting results from my tests with the
Synchronized Scan patch I'm working on.

The two benefits that I hope to achieve with the patch are:
(1) Better caching behavior with multiple sequential scans running in
parallel
(2) Faster sequential reads from disk and less seeking

I have consistently found #1 to be true. There is still more testing to
be done (hopefully soon), but I haven't found a problem yet. And the
benefits I've seen are very substantial, which isn't hard, since in the
typical case, a large sequential scan will have 0% cache hit rate. These
numbers were retrieved using log_executor_stats=on.

#2, however, is a little trickier. IIRC, Tom was the first to point out
that the I/O system might not recognize reads coming from different
processes as one sequential read.

At first I never saw the problem actually happen, and I assumed that the
OS was being smart enough. However, recently I noticed this problem on
my home machine, which experienced great caching behavior but poor I/O
throughput (as measured by iostat). That machine was using the Linux CFQ
I/O scheduler; when I swapped it for the anticipatory scheduler (AS),
throughput was fine. When I sent Josh my patch
(per his request) I mentioned the problem I experienced.

Then I started investigating, and found some mixed results. My test was
basically to start N processes of "dd if=bigfile of=/dev/null"
simultaneously and to measure disk throughput with iostat (or zpool
iostat). I consider the test "passed" if the additional
processes did not interfere (i.e. each process finished as though it
were the only one running). Of course, all tests were I/O bound.

My home machine (core 2 duo, single SATA disk, intel controller):
Linux/ext3/AS: passed
Linux/ext3/CFQ: failed
Linux/ext3/noop: passed
Linux/ext3/deadline: passed

Machine 2 (old thinkpad, IDE disk):
Solaris/UFS: failed
Solaris/ZFS: passed

Machine 3 (dell 2950, LSI PERC/5i controller, 6 SAS disks, RAID-10,
adaptive read ahead):
FreeBSD/UFS: failed

(I suspect the last test would be fine with read ahead always on, and it
may just be a problem with the adaptive read ahead feature)

There are a lot of factors involved, because several components of the
I/O system have the ability to reorder requests or read ahead, such as
the block layer and the controller. 

The block request ordering isn't the only factor because Solaris/UFS
only orders the requests by cylinder and moves only in one direction
(i.e. looks like a simple elevator algorithm that isn't affected by
process id). At least, that's how I understand it.

Readahead can't be the only factor either because replacing the io
scheduler in Linux solved the problem, even when that replacement was
the noop scheduler.

Anyway, back to the patch, it looks like there are some complications if
you try to use it with the wrong combination of fs, io scheduler, and
controller. 

The patch is designed for certain query patterns anyway, so I don't
think that this is a show-stopper. Given the better cache behavior, it
seems like it's really the job of the I/O system to get a single,
sequential stream of blocks efficiently.

The alternative would be to have a single block-reader process, which I
don't think we want to do. However, I/O systems don't really seem to
know how to handle multiple processes reading from the same file very
well.

Comments?

Regards,
    Jeff Davis



Re: Synchronized Scan update

From: Jeff Davis
Is there any consensus about whether to include these two parameters as
GUCs or constants if my patch is to be accepted?

(1) sync_scan_threshold: Use synchronized scanning for tables greater
than this many pages; smaller tables will not be affected.
(2) sync_scan_offset: Start a new scan this many pages before a
currently running scan to take advantage of the pages that are likely
already in cache.

Right now they are just constants defined in a header, but a GUC might
make sense. I'd like to know which version is more acceptable when I
submit my final patch.
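
To make the behavior concrete, here's a rough sketch of how scan startup
might use them (illustrative only: ss_get_loc() and the field names here
are my approximations, not necessarily what's in the patch):

    /* Choose a starting block for a new sequential scan. */
    BlockNumber start = 0;

    if (scan->rs_nblocks > sync_scan_threshold)
    {
        /* last block reported by a scan of this relation, or 0 if none */
        BlockNumber hint = ss_get_loc(scan->rs_rd, scan->rs_nblocks);

        if (hint != 0)
        {
            /*
             * Begin sync_scan_offset pages behind the running scan,
             * wrapping past block 0 if necessary (sync_scan_offset is
             * assumed to be smaller than the table).
             */
            if (hint >= sync_scan_offset)
                start = hint - sync_scan_offset;
            else
                start = hint + scan->rs_nblocks - sync_scan_offset;
        }
    }
    scan->rs_startblock = start;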

Regards,
    Jeff Davis



Re: Synchronized Scan update

From: Josh Berkus
Jeff,

> Right now they are just constants defined in a header, but a GUC might
> make sense. I'd like to know which version is more acceptable when I
> submit my final patch.

As much as I hate the thought of more GUCs, until we have a solid 
performance profile for synch scan we probably need them.  You should 
include the option to turn synch_scan off, such as by setting 
synch_scan_threshold to -1.

Oh, and remember that these now need to be able to take kB/MB/GB.

These options should probably go in postgresql.conf under QUERY TUNING, 
with their own sub-head.
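
For example, something like this in the sample file (names and defaults
here are just illustrative):

    # - Synchronized Scanning -

    #sync_scan_threshold = 32MB    # min table size for synchronized scans;
                                   # -1 disables them entirely
    #sync_scan_offset = 1MB        # start this far behind a scan that is
                                   # already in progress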

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: Synchronized Scan update

From: Jeff Davis
On Fri, 2007-03-02 at 15:49 -0800, Josh Berkus wrote:
> Jeff,
> 
> > Right now they are just constants defined in a header, but a GUC might
> > make sense. I'd like to know which version is more acceptable when I
> > submit my final patch.
> 
> As much as I hate the thought of more GUCs, until we have a solid 
> performance profile for synch scan we probably need them.  You should 

I will include them in the final patch then. 

> include the option to turn synch_scan off, such as by setting 
> synch_scan_threshold to -1.

Naturally.

> Oh, and remember that these now need to be able to take kB/MB/GB.

Will do.

> These options should probably go in postgresql.conf under QUERY TUNING, 
> with their own sub-head.

That makes sense to me.

Regards,
    Jeff Davis

PS: Did you happen to get my patch for testing (sent off-list)? If
testing will take a while, that's OK, I'd just like to know whether to
expect the results before feature freeze.



Re: Synchronized Scan update

From: Josh Berkus
Jeff,

> PS: Did you happen to get my patch for testing (sent off-list)? If
> testing will take a while, that's OK, I'd just like to know whether to
> expect the results before feature freeze.

I'm not sure.  We have a bunch of patches in our queue to test, and the
benchmark guys don't really expect synch scan to affect OLTP benchmarks 
much.  You might want to pester Greenplum about testing on TPCH.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: Synchronized Scan update

From: "Simon Riggs"
On Fri, 2007-03-02 at 15:03 -0800, Jeff Davis wrote:
> Is there any consensus about whether to include these two parameters as
> GUCs or constants if my patch is to be accepted?
> 
> (1) sync_scan_threshold: Use synchronized scanning for tables greater
> than this many pages; smaller tables will not be affected.

That sounds OK.

> (2) sync_scan_offset: Start a new scan this many pages before a
> currently running scan to take advantage of the pages
>  that are likely already in cache.

I'm somewhat dubious about this parameter, I have to say, even though I
am eager for this feature. It seems like a "magic" parameter that works
only when we have the right knowledge to set it correctly.

How will we know what to default it to and how will we know whether to
set it higher or lower for better performance? Does that value vary
according to the workload on the system? How?

I'm worried that we get a feature that works well on simple tests and
not at all in real world circumstances. I don't want to cast doubt on
what could be a great patch or be negative: I just see that the feature
relies on the dynamic behaviour of the system. I'd like to see some
further studies on how this works to make sure that we realistically
know how to set this knob, that it's the correct knob, and that it's the
only one we need.

Further thoughts: It sounds like sync_scan_offset is related to
effective_cache_size. Can you comment on whether that might be
something we can use as well/instead? (i.e. set the scan offset to say K
* effective_cache_size, 0.1 <= K <= 0.5)???

Might we do roughly the same thing with sync_scan_threshold as well, and
just have enable_sync_scan instead? i.e. sync_scan_threshold =
effective_cache_size? When would those two parameters not be connected
directly to each other?

--
 Simon Riggs
 EnterpriseDB   http://www.enterprisedb.com




Re: Synchronized Scan update

From: "Joshua D. Drake"
>> (2) sync_scan_offset: Start a new scan this many pages before a
>> currently running scan to take advantage of the pages that are likely
>> already in cache.
>
> I'm somewhat dubious about this parameter, I have to say, even though I
> am eager for this feature. It seems like a "magic" parameter that works
> only when we have the right knowledge to set it correctly.

Hello,

Don't get me wrong, I want things to be easily understandable as well,
but the reason you cite above would pretty much require us to remove
most of postgresql.conf, including all the bgwriter, vacuum cost delay,
and autovac settings. Not to mention commit delay and others ;).

Sincerely,

Joshua D. Drake








Re: Synchronized Scan update

From: Jeff Davis
On Sun, 2007-03-04 at 11:54 +0000, Simon Riggs wrote:
> > (2) sync_scan_offset: Start a new scan this many pages before a
> > currently running scan to take advantage of the pages
> >  that are likely already in cache.
> 
> I'm somewhat dubious about this parameter, I have to say, even though I
> am eager for this feature. It seems like a "magic" parameter that works
> only when we have the right knowledge to set it correctly.
> 

That was my concern about this parameter also.

> How will we know what to default it to and how will we know whether to
> set it higher or lower for better performance? Does that value vary
> according to the workload on the system? How?
> 

Perhaps people would only set this parameter when they know it will
help, and for more complex (or varied) usage patterns they'd set
sync_scan_offset to 0 to be safe.

My thinking on the subject (and this is only backed up by very basic
tests) is that there are basically two situations where setting this
parameter too high can hurt:
(1) The offset is too close to the limit of your physical memory, and
you end up diverging the scans when they could have been kept together.
(2) You're using a lot of CPU and the backends aren't processing the
buffers as fast as your I/O system is delivering them, which prevents
the scans from converging.

If your CPUs are well below capacity and you choose a size significantly
less than your effective cache size, I don't think it will hurt.

> I'm worried that we get a feature that works well on simple tests and
> not at all in real world circumstances. I don't want to cast doubt on
> what could be a great patch or be negative: I just see that the feature
> relies on the dynamic behaviour of the system. I'd like to see some
> further studies on how this works to make sure that we realistically
> know how to set this knob, that it's the correct knob, and that it's the
> only one we need.

I will do some better tests on some better hardware this week and next
week. I hope that sheds some light.

> Further thoughts: It sounds like sync_scan_offset is related to
> effective_cache_size. Can you comment on whether that might be
> something we can use as well/instead? (i.e. set the scan offset to say K
> * effective_cache_size, 0.1 <= K <= 0.5)???
> 
> Might we do roughly the same thing with sync_scan_threshold as well, and
> just have enable_sync_scan instead? i.e. sync_scan_threshold =
> effective_cache_size? When would those two parameters not be connected
> directly to each other?
> 

Originally, these parameters were in terms of the effective_cache_size.
Somebody else convinced me that it was too confusing to have the
variables dependent on each other, so I made them independent. I don't
have a strong opinion either way.

Regards,
    Jeff Davis






Re: Synchronized Scan update

From: Josh Berkus
JD,

> Don't get me wrong, I want things to be easily understandable as well,
> but the reason you cite above would pretty much require us to remove
> most of postgresql.conf, including all the bgwriter, vacuum cost delay,
> and autovac settings. Not to mention commit delay and others ;).

Wouldn't that be nice!

The explosion of GUC settings is primarily a result of not enough information.  
The reason there are 7 bgwriter settings, for example, is that we have no 
idea what those settings should be and are hoping that people will tinker 
with them and tell us.  Someday when I can fully profile bgwriter, we'll just 
have one setting: bgwriter_aggressive, set to a number between 0 and 9.  

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: Synchronized Scan update

From: Jim Nasby
On Mar 6, 2007, at 9:43 AM, Josh Berkus wrote:
>> Don't get me wrong, I want things to be easily understandable as well,
>> but the reason you cite above would pretty much require us to remove
>> most of postgresql.conf, including all the bgwriter, vacuum cost delay,
>> and autovac settings. Not to mention commit delay and others ;).
>
> Wouldn't that be nice!
>
> The explosion of GUC settings is primarily a result of not enough
> information.  The reason there are 7 bgwriter settings, for example, is
> that we have no idea what those settings should be and are hoping that
> people will tinker with them and tell us.  Someday when I can fully
> profile bgwriter, we'll just have one setting: bgwriter_aggressive, set
> to a number between 0 and 9.

In the meantime, it would be great for these multiple-settings cases to
be listed somewhere, indicating that they're something we could use help
with. I think that with some explanation of what we're looking for, any
number of people could do this kind of profiling.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)




Re: Synchronized Scan update

From: "Simon Riggs"
On Fri, 2007-03-02 at 15:03 -0800, Jeff Davis wrote:
> Is there any consensus about whether to include these two parameters as
> GUCs or constants if my patch is to be accepted?
> 
> (1) sync_scan_threshold: Use synchronized scanning for tables greater
> than this many pages; smaller tables will not be affected.
> (2) sync_scan_offset: Start a new scan this many pages before a
> currently running scan to take advantage of the pages
>  that are likely already in cache.
> 
> Right now they are just constants defined in a header, but a GUC might
> make sense. I'd like to know which version is more acceptable when I
> submit my final patch.

I'm looking at ways of helping Synch Scan and scan_recycle_buffers to
work well together. I've had a look through the synch scan patch in more
detail to help this along.

ISTM they can play nicely together.

Results from tests show that with a single scan we get better
performance with scan_recycle_buffers = ~32. That gives a +25%
performance bonus. Synch scan has the capability to reduce I/O by 50%
across two concurrent scans, so that would be likely to outweigh the
gain from L2 cache reduction, unless we can get both together.

It's possible that multiple scans would use the same CPU, or at least
use multiple cores on the same chip and hence the same L2 cache. That's
more likely if the synch scan works well, since the second scan can
process the buffer while the lead process performs I/O. So it's likely
that on dual and quad core CPUs we'll be able to get both benefits in
most cases.

So based on those thoughts, sync_scan_offset should be fixed at 16,
rather than being variable. In addition, ss_report_loc() should only
report its position every 16 blocks, rather than do this every time,
which will reduce overhead of this call.

To match that, scan_recycle_buffers should be fixed at 32. So GUCs for
sync_scan_offset and scan_recycle_buffers would not be required at all.

IMHO we can also remove sync_scan_threshold and just use NBuffers
instead. That way we get the benefit of both patches or neither, making
it easier to understand what's going on.

On my patch, I'll make buffer recycling only work for SeqScans and
VACUUMs.

If need be, the value of scan_recycle_buffers can be varied upwards
should the scans drift apart, as a way of bringing them back together.
We aren't tracking whether they are together or apart, so I would like
to see some debug output from synch scans to allow us to assess how far
behind the second scan is as it progresses. e.g.
LOG:  synch scan currently on block N, trailing pathfinder by M blocks
issued every 128 blocks as we go through the scans. 

Thoughts?

--
 Simon Riggs
 EnterpriseDB   http://www.enterprisedb.com




Re: Synchronized Scan update

From: "Luke Lonergan"
Simon,

On 3/12/07 6:21 AM, "Simon Riggs" <simon@2ndquadrant.com> wrote:

> So based on those thoughts, sync_scan_offset should be fixed at 16,
> rather than being variable. In addition, ss_report_loc() should only
> report its position every 16 blocks, rather than do this every time,
> which will reduce overhead of this call.

And for N concurrent scans?

I think there is actually no need to synchronize the shared buffers at all
for synchronized scans.  The OS I/O cache will do that for us and we're just
going to interfere and pessimize by trying to divide up a local buffer.

I suggest that this be proven or disproved by running this test: measure the
performance of syncscan with the non-polluting buffer change, then measure
with Jeff's patch and non-polluting with multiple scans, then measure with
your suggested changes to synchronize the buffers.

Somewhere in that progression we'll learn more about how the multi-level
buffering really works.  I think we'll get all the shared I/O cache we need.

- Luke




Re: Synchronized Scan update

From: "Simon Riggs"
On Mon, 2007-03-12 at 08:42 -0700, Luke Lonergan wrote:

> On 3/12/07 6:21 AM, "Simon Riggs" <simon@2ndquadrant.com> wrote:
> 
> > So based on those thoughts, sync_scan_offset should be fixed at 16,
> > rather than being variable. In addition, ss_report_loc() should only
> > report its position every 16 blocks, rather than do this every time,
> > which will reduce overhead of this call.
> 
> And for N concurrent scans?
> 
> I think there is actually no need to synchronize the shared buffers at all
> for synchronized scans.  The OS I/O cache will do that for us and we're just
> going to interfere and pessimize by trying to divide up a local buffer.

I think you've misunderstood my comment slightly.

In Jeff's patch, ss_report_loc() is called after every block is read,
AFAICS. I was suggesting that we don't do it that frequently, to reduce
the overhead of reporting the location. 

That has nothing to do with re-synchronising the two scans mid-way.

--
 Simon Riggs
 EnterpriseDB   http://www.enterprisedb.com




Re: Synchronized Scan update

From: Jeff Davis
On Mon, 2007-03-12 at 13:21 +0000, Simon Riggs wrote:
> So based on those thoughts, sync_scan_offset should be fixed at 16,
> rather than being variable. In addition, ss_report_loc() should only
> report its position every 16 blocks, rather than do this every time,
> which will reduce overhead of this call.

If we fix sync_scan_offset at 16, we might as well just get rid of it.
Sync scans are only useful on large tables, and getting a free 16 pages
over a scan isn't worth the trouble. However, even without
sync_scan_offset, sync scans are still a valuable feature.

I agree that ss_report_loc() doesn't need to report on every call. If
there's any significant overhead I agree that it should report less
often. Do you think that the overhead is significant on such a simple
function?

> 
> To match that, scan_recycle_buffers should be fixed at 32. So GUCs for
> sync_scan_offset and scan_recycle_buffers would not be required at all.
> 
> IMHO we can also remove sync_scan_threshold and just use NBuffers
> instead. That way we get the benefit of both patches or neither, making
> it easier to understand what's going on.

I like the idea of reducing tuning parameters, but we should, at a
minimum, still allow an on/off button for sync scans. My tests revealed
that the wrong combination of OS/FS/IO-Scheduler/Controller could result
in bad I/O behavior.

> If need be, the value of scan_recycle_buffers can be varied upwards
> should the scans drift apart, as a way of bringing them back together.

If the scans aren't being brought together, that means that one of the
scans is CPU bound or outside the combined cache trail (shared_buffers
+ OS buffer cache). 

> We aren't tracking whether they are together or apart, so I would like
> to see some debug output from synch scans to allow us to assess how far
> behind the second scan is as it progresses. e.g.
> LOG:  synch scan currently on block N, trailing pathfinder by M blocks
> issued every 128 blocks as we go through the scans. 
> 
> Thoughts?
> 

It's hard to track where all the scans are currently. One of the
advantages of my patch is its simplicity: the scans don't need to know
about other specific scans, and there is no concept in the code of a
"head" scan or a "pack".

There is no easy way to tell which scan is ahead and which is behind.
There was a discussion when I submitted this proposal at the beginning
of 8.3, but I didn't see enough benefit to justify all of the costs and
risks associated with scans communicating with each other. I
certainly can't implement that kind of thing before feature freeze, and
I think there's a risk of lock contention for the communication
required. I'm also concerned that -- if the scans are too
interdependent -- it would make postgres less robust against the
disappearance of a single backend (i.e. what if the backend that is
leading a scan dies?).

Regards,
    Jeff Davis





Re: Synchronized Scan update

From: Tom Lane
Jeff Davis <pgsql@j-davis.com> writes:
> I agree that ss_report_loc() doesn't need to report on every call. If
> there's any significant overhead I agree that it should report less
> often. Do you think that the overhead is significant on such a simple
> function?

One extra LWLock cycle per page processed definitely *is* a significant
overhead ... can you say "context swap storm"?  I'd think about doing it
once every 100 or so pages.
        regards, tom lane


Re: Synchronized Scan update

From: Jeff Davis
On Tue, 2007-03-13 at 12:53 -0400, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > I agree that ss_report_loc() doesn't need to report on every call. If
> > there's any significant overhead I agree that it should report less
> > often. Do you think that the overhead is significant on such a simple
> > function?
> 
> One extra LWLock cycle per page processed definitely *is* a significant
> overhead ... can you say "context swap storm"?  I'd think about doing it
> once every 100 or so pages.
> 

No lock is needed to store the hint. If somehow the hint (which is
stored in a static table, no pointers) gets invalid data due to a race
condition, the new scan will simply consider the hint invalid and start
at 0.

I did this precisely to avoid causing a performance regression for usage
patterns that don't benefit from sync scans.
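
To make it concrete, the hint logic is roughly the following (a
simplified sketch: SS_HASH() and the struct and function names are
placeholders of mine, not the exact code):

    /*
     * One hint per hash slot in a fixed-size shared-memory array.
     * Writes take no lock; a torn or stale entry is harmless because
     * readers validate it before use.
     */
    typedef struct ss_hint
    {
        RelFileNode rnode;      /* which relation the hint is for */
        BlockNumber location;   /* last block reported by some scan */
    } ss_hint;

    static ss_hint *scan_hints; /* lives in shared memory */

    void
    ss_report_loc(Relation rel, BlockNumber blocknum)
    {
        ss_hint *hint = &scan_hints[SS_HASH(rel)];

        hint->rnode = rel->rd_node;     /* plain stores, no LWLock */
        hint->location = blocknum;
    }

    BlockNumber
    ss_get_loc(Relation rel, BlockNumber nblocks)
    {
        ss_hint *hint = &scan_hints[SS_HASH(rel)];
        BlockNumber loc = hint->location;

        /* a hint that lost a race simply fails validation */
        if (!RelFileNodeEquals(hint->rnode, rel->rd_node) || loc >= nblocks)
            return 0;
        return loc;
    }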

Regards,
    Jeff Davis



Re: Synchronized Scan update

From: "Simon Riggs"
On Mon, 2007-03-12 at 17:46 -0700, Jeff Davis wrote:
> On Mon, 2007-03-12 at 13:21 +0000, Simon Riggs wrote:
> > So based on those thoughts, sync_scan_offset should be fixed at 16,
> > rather than being variable. In addition, ss_report_loc() should only
> > report its position every 16 blocks, rather than do this every time,
> > which will reduce overhead of this call.
> 
> If we fix sync_scan_offset at 16, we might as well just get rid of it.
> Sync scans are only useful on large tables, and getting a free 16 pages
> over a scan isn't worth the trouble. However, even without
> sync_scan_offset, 

Not sure what you mean by "a free 16 pages". Please explain?

> sync scans are still a valuable feature.

I have always thought synch scans are a valuable feature too. 

> I agree that ss_report_loc() doesn't need to report on every call. If
> there's any significant overhead I agree that it should report less
> often. Do you think that the overhead is significant on such a simple
> function?

Let's try without it and see. There's no need to access shared memory so
often.

> I like the idea of reducing tuning parameters, but we should, at a
> minimum, still allow an on/off button for sync scans. My tests revealed
> that the wrong combination of OS/FS/IO-Scheduler/Controller could result
> in bad I/O behavior.

Agreed

> > If need be, the value of scan_recycle_buffers can be varied upwards
> > should the scans drift apart, as a way of bringing them back together.
> 
> If the scans aren't being brought together, that means that one of the
> scans is CPU bound or outside the combined cache trail (shared_buffers
> + OS buffer cache). 
> 
> > We aren't tracking whether they are together or apart, so I would like
> > to see some debug output from synch scans to allow us to assess how far
> > behind the second scan is as it progresses. e.g.
> > LOG:  synch scan currently on block N, trailing pathfinder by M blocks
> > issued every 128 blocks as we go through the scans. 
> > 
> > Thoughts?
> > 
> 
> It's hard to track where all the scans are currently. One of the
> advantages of my patch is its simplicity: the scans don't need to know
> about other specific scans, and there is no concept in the code of a
> "head" scan or a "pack".

I'd still like to be able to trace each scan to see how far ahead/behind
it is from the other scans on the same table, however we do that.

Any backend can read the position of other backends' scans, so it should
be easy enough to put in a regular LOG entry that shows how far
ahead/behind they are from other scans. We can trace just one backend
and have it report on where it is with respect to other backends, or you
could have them all calculate their position and have just the lead scan
report the position of all other scans.

> There is no easy way to tell which scan is ahead and which is behind.
> There was a discussion when I submitted this proposal at the beginning
> of 8.3, but I didn't see enough benefit to justify all of the costs and
> risks associated with scans communicating with each other. I
> certainly can't implement that kind of thing before feature freeze, and
> I think there's a risk of lock contention for the communication
> required. I'm also concerned that -- if the scans are too
> interdependent -- it would make postgres less robust against the
> disappearance of a single backend (i.e. what if the backend that is
> leading a scan dies?).

I've not mentioned that again.

I'd like to see the trace option to allow us to tell whether it's working
as well as we'd like it to, pre-release and in production. Also I want to
see whether various settings of scan_recycle_buffers help/hinder the
effectiveness of synch scans, as others have worried it might.

--
 Simon Riggs
 EnterpriseDB   http://www.enterprisedb.com




Re: Synchronized Scan update

From: Jeff Davis
On Tue, 2007-03-13 at 10:08 -0700, Jeff Davis wrote:
> > One extra LWLock cycle per page processed definitely *is* a significant
> > overhead ... can you say "context swap storm"?  I'd think about doing it
> > once every 100 or so pages.
> > 
> 
> No lock is needed to store the hint. If somehow the hint (which is
> stored in a static table, no pointers) gets invalid data due to a race
> condition, the new scan will simply consider the hint invalid and start
> at 0.
> 
> I did this precisely to avoid causing a performance regression for usage
> patterns that don't benefit from sync scans.
> 

I'd also like to add that, if a lock were required, a constant reporting
interval would still seem prone to a context swap storm; it would just
happen 1/100th as often. We'd need to do something to spread the locks
over time.

That being said, I'll adjust it to report once per hundred pages anyway,
because there's really no drawback.
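
Something along these lines, i.e. keyed to the block number so that
concurrent scans report at the same points (the constant name is mine):

    #define SYNC_SCAN_REPORT_INTERVAL 100

    /* in the scan loop: report position only once per 100 pages */
    if ((blocknum % SYNC_SCAN_REPORT_INTERVAL) == 0)
        ss_report_loc(scan->rs_rd, blocknum);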

Regards,
    Jeff Davis



Re: Synchronized Scan update

From: "Simon Riggs"
On Tue, 2007-03-13 at 10:08 -0700, Jeff Davis wrote:
> On Tue, 2007-03-13 at 12:53 -0400, Tom Lane wrote:
> > Jeff Davis <pgsql@j-davis.com> writes:
> > > I agree that ss_report_loc() doesn't need to report on every call. If
> > > there's any significant overhead I agree that it should report less
> > > often. Do you think that the overhead is significant on such a simple
> > > function?
> > 
> > One extra LWLock cycle per page processed definitely *is* a significant
> > overhead ... can you say "context swap storm"?  I'd think about doing it
> > once every 100 or so pages.
> > 
> 
> No lock is needed to store the hint. If somehow the hint (which is
> stored in a static table, no pointers) gets invalid data due to a race
> condition, the new scan will simply consider the hint invalid and start
> at 0.
> 
> I did this precisely to avoid causing a performance regression for usage
> patterns that don't benefit from sync scans.

Shared memory access is still a performance/scalability concern because
so many people want access to it at the same time. 

There really is no need to do this after each block. 8 CPUs ought to be
able to do 8 scans without tripping over each other. Especially if they
are on separate tables.

--
 Simon Riggs
 EnterpriseDB   http://www.enterprisedb.com




Re: Synchronized Scan update

From: Jeff Davis
On Tue, 2007-03-13 at 17:17 +0000, Simon Riggs wrote:
> On Tue, 2007-03-13 at 10:08 -0700, Jeff Davis wrote:
> > On Tue, 2007-03-13 at 12:53 -0400, Tom Lane wrote:
> > > Jeff Davis <pgsql@j-davis.com> writes:
> > > > I agree that ss_report_loc() doesn't need to report on every call. If
> > > > there's any significant overhead I agree that it should report less
> > > > often. Do you think that the overhead is significant on such a simple
> > > > function?
> > > 
> > > One extra LWLock cycle per page processed definitely *is* a significant
> > > overhead ... can you say "context swap storm"?  I'd think about doing it
> > > once every 100 or so pages.
> > > 
> > 
> > No lock is needed to store the hint. If somehow the hint (which is
> > stored in a static table, no pointers) gets invalid data due to a race
> > condition, the new scan will simply consider the hint invalid and start
> > at 0.
> > 
> > I did this precisely to avoid causing a performance regression for usage
> > patterns that don't benefit from sync scans.
> 
> Shared memory access is still a performance/scalability concern because
> so many people want access to it at the same time. 
> 
> There really is no need to do this after each block. 8 CPUs ought to be
> able to do 8 scans without tripping over each other. Especially if they
> are on separate tables.
> 

Ok, I'll do it every 100 pages.

Regards,
    Jeff Davis





Re: Synchronized Scan update

From: Hannu Krosing
On Tue, 2007-03-13 at 12:53 -0400, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > I agree that ss_report_loc() doesn't need to report on every call. If
> > there's any significant overhead I agree that it should report less
> > often. Do you think that the overhead is significant on such a simple
> > function?
> 
> One extra LWLock cycle per page processed definitely *is* a significant
> overhead ... can you say "context swap storm"?  I'd think about doing it
> once every 100 or so pages.

Can't we do it in some lock-free way? Writing page numbers (4-byte
ints) into a predetermined location in shared mem should be atomic on
all platforms we support (it may still cause some cache ping-pong on
multiprocessor systems, but that should be much cheaper), and even an
occasional error in establishing the "scan head" should not be
catastrophic.

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




Re: Synchronized Scan update

From: Jeff Davis
On Tue, 2007-03-13 at 17:11 +0000, Simon Riggs wrote:
> On Mon, 2007-03-12 at 17:46 -0700, Jeff Davis wrote:
> > On Mon, 2007-03-12 at 13:21 +0000, Simon Riggs wrote:
> > > So based on those thoughts, sync_scan_offset should be fixed at 16,
> > > rather than being variable. In addition, ss_report_loc() should only
> > > report its position every 16 blocks, rather than do this every time,
> > > which will reduce overhead of this call.
> > 
> > If we fix sync_scan_offset at 16, we might as well just get rid of it.
> > Sync scans are only useful on large tables, and getting a free 16 pages
> > over a scan isn't worth the trouble. However, even without
> > sync_scan_offset, 
> 
> Not sure what you mean by "a free 16 pages". Please explain?
> 

By "free" I mean already in cache, and therefore don't have to do I/O to
get it. I used the term loosely above, so let me re-explain:

My only point was that 16 is essentially 0 when it comes to
sync_scan_offset, because it's a small number of blocks over the course
of the scan of a large table.

If sync_scan_offset is 0, my patch will cause scans on a big table to
start where other scans are, and those scans should tend to stay
together and use newly-cached pages efficiently (and achieve the primary
goal of the patch).

The advantage of sync_scan_offset is that, in some situations, a second
scan can actually finish faster than if it were the only query
executing, because a previous scan has already caused some blocks to be
cached. However, 16 is a small number because that benefit would only be
once per scan, and sync scans are only helpful on large tables.

> > I like the idea of reducing tuning parameters, but we should, at a
> > minimum, still allow an on/off button for sync scans. My tests revealed
> > that the wrong combination of OS/FS/IO-Scheduler/Controller could result
> > in bad I/O behavior.
> 
> Agreed
> 

Do you have an opinion about sync_scan_threshold versus a simple
sync_scan_enable?

> I'd still like to be able to trace each scan to see how far ahead/behind
> it is from the other scans on the same table, however we do that.
> 
> Any backend can read the position of other backends' scans, so it should

Where is that information stored? Right now my patch will overwrite the
hints of other backends, because I'm using a static data structure
(rather than one that grows). I do this to avoid the need for locking.

> be easy enough to put in a regular LOG entry that shows how far
> ahead/behind they are from other scans. We can trace just one backend
> and have it report on where it is with respect to other backends, or you
> could have them all calculate their position and have just the lead scan
> report the position of all other scans.
> 

I already have each backend log its progression through the tablescan
every 100k blocks to DEBUG (higher DEBUG gives every 10k blocks). I
currently use this information to see whether scans are staying together
or not. I think this gives us the information we need without backends
needing to communicate the information during execution.
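
The trace itself is just a conditional elog() in the read loop, roughly
like this (a sketch; the exact interval and wording differ):

    if ((blocknum % 100000) == 0)
        elog(DEBUG1, "scan of relation %u at block %u of %u",
             RelationGetRelid(scan->rs_rd), blocknum, scan->rs_nblocks);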

I think I will increase the resolution of the scan progress so that we
can track every 5k or even 1k blocks read per pid per scan. That might
tell us more about the shared memory usage versus OS cache.

Is there any other information you need reported?

> I'd like to see the trace option to allow us to tell whether it's working
> as well as we'd like it to, pre-release and in production. Also I want to
> see whether various settings of scan_recycle_buffers help/hinder the
> effectiveness of synch scans, as others have worried it might.
> 

Can you tell me what you mean by trace option, if you mean something
different than tracking the relative positions of the scans?

I will update my patch and send it along so that we can see how they
work together. 

Regards,
    Jeff Davis



Re: Synchronized Scan update

From: "Simon Riggs"
On Tue, 2007-03-13 at 11:28 -0700, Jeff Davis wrote:
> On Tue, 2007-03-13 at 17:11 +0000, Simon Riggs wrote:
> > On Mon, 2007-03-12 at 17:46 -0700, Jeff Davis wrote:
> > > On Mon, 2007-03-12 at 13:21 +0000, Simon Riggs wrote:
> > > > So based on those thoughts, sync_scan_offset should be fixed at 16,
> > > > rather than being variable. In addition, ss_report_loc() should only
> > > > report its position every 16 blocks, rather than do this every time,
> > > > which will reduce overhead of this call.
> > > 
> > > If we fix sync_scan_offset at 16, we might as well just get rid of it.
> > > Sync scans are only useful on large tables, and getting a free 16 pages
> > > over a scan isn't worth the trouble. However, even without
> > > sync_scan_offset, 
> > 
> > Not sure what you mean by "a free 16 pages". Please explain?
> > 
> 
> By "free" I mean already in cache, and therefore don't have to do I/O to
> get it. I used the term loosely above, so let me re-explain:
> 
> My only point was that 16 is essentially 0 when it comes to
> sync_scan_offset, because it's a small number of blocks over the course
> of the scan of a large table.
> 
> If sync_scan_offset is 0, my patch will cause scans on a big table to
> start where other scans are, and those scans should tend to stay
> together and use newly-cached pages efficiently (and achieve the primary
> goal of the patch).

OK

> The advantage of sync_scan_offset is that, in some situations, a second
> scan can actually finish faster than if it were the only query
> executing, because a previous scan has already caused some blocks to be
> cached. However, 16 is a small number because that benefit would only be
> once per scan, and sync scans are only helpful on large tables.

Alright, understood. That last part is actually something I now want to
avoid because it's using the current cache-spoiling behaviour of
seqscans to advantage. I'd like to remove that behaviour, but it sounds
like we can have both
- SeqScans that don't spoil cache
- Synch scans
by setting "sync_scan_offset" to zero.

> > > I like the idea of reducing tuning parameters, but we should, at a
> > > minimum, still allow an on/off button for sync scans. My tests revealed
> > > that the wrong combination of OS/FS/IO-Scheduler/Controller could result
> > > in bad I/O behavior.
> > 
> > Agreed
> > 
> 
> Do you have an opinion about sync_scan_threshold versus a simple
> sync_scan_enable?

enable_sync_scan?

> > I'd still like to be able to trace each scan to see how far ahead/behind
> > it is from the other scans on the same table, however we do that.
> > 
> > Any backend can read the position of other backends' scans, so it should
> 
> Where is that information stored? Right now my patch will overwrite the
> hints of other backends, because I'm using a static data structure
> (rather than one that grows). I do this to avoid the need for locking.

OK, well, we can still read it before we overwrite it to calc the
difference. That will at least allow us to get a difference between
points as we go along. That seems like its worth having, even if it
isn't accurate for 3+ concurrent scans.

> > be easy enough to put in a regular LOG entry that shows how far
> > ahead/behind they are from other scans. We can trace just one backend
> > and have it report on where it is with respect to other backends, or you
> > could have them all calculate their position and have just the lead scan
> > report the position of all other scans.
> > 
> 
> I already have each backend log its progression through the tablescan
> every 100k blocks to DEBUG (higher DEBUG gives every 10k blocks). I
> currently use this information to see whether scans are staying together
> or not. I think this gives us the information we need without backends
> needing to communicate the information during execution.

Well, that is good, thank you for adding that after initial discussions.

Does it have the time at which a particular numbered block is reached?
(i.e. Block #117 is not the same thing as the 117th block scanned). We
can use that to compare the time difference of each scan.

> I think I will increase the resolution of the scan progress so that we
> can track every 5k or even 1k blocks read per pid per scan. That might
> tell us more about the shared memory usage versus OS cache.
> 
> Is there any other information you need reported?

Not sure yet! I just want to look one level deeper, to see if everything
is working like we think it should.

> > I'd like to see the trace option to allow us to tell whether it's working
> > as well as we'd like it to, pre-release and in production. Also I want to
> > see whether various settings of scan_recycle_buffers help/hinder the
> > effectiveness of synch scans, as others have worried it might.
> > 
> 
> Can you tell me what you mean by trace option, if you mean something
> different than tracking the relative positions of the scans?
> 
> I will update my patch and send it along so that we can see how they
> work together. 

Great

--
 Simon Riggs
 EnterpriseDB   http://www.enterprisedb.com




Re: Synchronized Scan update

From: Jeff Davis
On Tue, 2007-03-13 at 19:24 +0000, Simon Riggs wrote:
> > The advantage of sync_scan_offset is that, in some situations, a second
> > scan can actually finish faster than if it were the only query
> > executing, because a previous scan has already caused some blocks to be
> > cached. However, 16 is a small number because that benefit would only be
> > once per scan, and sync scans are only helpful on large tables.
> 
> Alright, understood. That last part is actually something I now want to
> avoid because it's using the current cache-spoiling behaviour of
> seqscans to advantage. I'd like to remove that behaviour, but it sounds
> like we can have both
> - SeqScans that don't spoil cache
> - Synch scans
> by setting "sync_scan_offset" to zero.
> 

Precisely.

If there is a cache-spoiling effect of the OS buffer cache that we want
to take advantage of, we could still set it to a non-zero value. But the
utility of sync_scan_offset does decrease with your patch, so removing
it altogether is a possibility (hopefully the numbers will tell us what
to do).


> > Do you have an opinion about sync_scan_threshold versus a simple
> > sync_scan_enable?
> 
> enable_sync_scan?
> 

After looking at other GUC names, I suggest that it's either
"sync_scan" (for on/off) or "sync_scan_threshold" (if we do want to
allow a numerical threshold). All the GUCs beginning with "enable_" are
planner settings.

If we only allow on/off, we could probably just sync scan every table
because of your recycle_buffers patch.

> > > I'd still like to be able to trace each scan to see how far ahead/behind
> > > it is from the other scans on the same table, however we do that.
> > > 
> > > Any backend can read the position of other backends' scans, so it should
> > 
> > Where is that information stored? Right now my patch will overwrite the
> > hints of other backends, because I'm using a static data structure
> > (rather than one that grows). I do this to avoid the need for locking.
> 
> OK, well, we can still read it before we overwrite it to calc the
> difference. That will at least allow us to get a difference between
> points as we go along. That seems like its worth having, even if it
> isn't accurate for 3+ concurrent scans.

Let me know if the things I list below don't cover the information
you're looking for here. It would be easy for me to emit a log message
at the time it's overwriting the hint, but that would be a lot of noise:
every time ss_report_loc() is called, which we discussed would be once
per 100 pages read per scan. 

> > > be easy enough to put in a regular LOG entry that shows how far
> > > ahead/behind they are from other scans. We can trace just one backend
> > > and have it report on where it is with respect to other backends, or you
> > > could have them all calculate their position and have just the lead scan
> > > report the position of all other scans.
> > > 
> > 
> > I already have each backend log its progression through the tablescan
> > every 100k blocks to DEBUG (higher DEBUG gives every 10k blocks). I
> > currently use this information to see whether scans are staying together
> > or not. I think this gives us the information we need without backends
> > needing to communicate the information during execution.
> 
> Well, that is good, thank you for adding that after initial discussions.
> 
> Does it have the time at which a particular numbered block is reached?
> (i.e. Block #117 is not the same thing as the 117th block scanned). We
> can use that to compare the time difference of each scan.

Right now it logs when a scan starts, what block number of the table it
starts on, and also prints out the current block it's scanning every N
blocks (100k or 10k depending on debug level). The time and the pid are,
of course, available from log_line_prefix.

I'll add the table OID to each log message in case we test, for example,
a single backend scanning multiple tables at once.

I'll also clean it up a bit, so that the information is a little easier
to grep out of the postgres logfiles and easier to analyze/graph.

Regards,
    Jeff Davis



Re: Synchronized Scan update

From: "Simon Riggs"
On Tue, 2007-03-13 at 13:39 -0700, Jeff Davis wrote:

> > > Do you have an opinion about sync_scan_threshold versus a simple
> > > sync_scan_enable?
> > 
> > enable_sync_scan?
> > 
> 
> After looking at other GUC names, I suggest that it's either
> "sync_scan" (for on/off) or "sync_scan_threshold" (if we do want to
> allow a numerical threshold). All the GUCs beginning with "enable_" are
> planner settings.

How about: sync_seqscans

so the phrase matches the equivalent enable_ parameter.

> If we only allow on/off, we could probably just sync scan every table
> because of your recycle_buffers patch.

The buffer recycling only makes sense for large scans, so there's an
exact match for when both techniques need to kick in.

I think I'd just lose this parameter and have it kick in at either
NBuffers or NBuffers/2. We don't need another parameter...

I'm not planning to have scan_recycle_buffers continue into the
production version.

> > > > I'd still like to be able to trace each scan to see how far ahead/behind
> > > > it is from the other scans on the same table, however we do that.
> > > > 
> > > > Any backend can read the position of other backends' scans, so it should
> > > 
> > > Where is that information stored? Right now my patch will overwrite the
> > > hints of other backends, because I'm using a static data structure
> > > (rather than one that grows). I do this to avoid the need for locking.
> > 
> > OK, well, we can still read it before we overwrite it to calc the
> > difference. That will at least allow us to get a difference between
> > points as we go along. That seems like its worth having, even if it
> > isn't accurate for 3+ concurrent scans.
> 
> Let me know if the things I list below don't cover what the information
> you're looking for here. It would be easy for me to emit a log message
> at the time it's overwriting the hint, but that would be a lot of noise:
> every time ss_report_loc() is called, which we discussed would be once
> per 100 pages read per scan. 
> 
> > > > be easy enough to put in a regular LOG entry that shows how far
> > > > ahead/behind they are from other scans. We can trace just one backend
> > > > and have it report on where it is with respect to other backends, or you
> > > > could have them all calculate their position and have just the lead scan
> > > > report the position of all other scans.
> > > > 
> > > 
> > > I already have each backend log its progression through the tablescan
> > > every 100k blocks to DEBUG (higher DEBUG gives every 10k blocks). I
> > > currently use this information to see whether scans are staying together
> > > or not. I think this gives us the information we need without backends
> > > needing to communicate the information during execution.
> > 
> > Well, that is good, thank you for adding that after initial discussions.
> > 
> > Does it have the time at which a particular numbered block is reached?
> > (i.e. Block #117 is not the same thing as the 117th block scanned). We
> > can use that to compare the time difference of each scan.
> 
> Right now it logs when a scan starts, what block number of the table it
> starts on, and also prints out the current block it's scanning every N
> blocks (100k or 10k depending on debug level). The time and the pid are,
> of course, available from log_prefix.

Can you make it log every block whose id is divisible by 100k or 10k?

Otherwise one scan will log blocks 100,000... 200,000 ... etc and the
next scan will log 17357.... 117357 ... etc which will be much harder to
work out. That will give us "lap times" for every 100,000 blocks.

I'm particularly interested in the turning point where the scan starts
again at the beginning of the file. It would be good to know what
blockid it turned at and when that was. We may get out of step at that
point. Maybe. We'll find out.
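
As I understand the patch, the turn is just the wrap in the scan loop,
roughly (illustrative variable names):

    /* advance to the next block, wrapping at the end of the file */
    if (++blocknum >= nblocks)
        blocknum = 0;           /* the turning point */
    if (blocknum == startblock)
        done = true;            /* one full circuit: scan complete */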

--
 Simon Riggs
 EnterpriseDB   http://www.enterprisedb.com




Re: Synchronized Scan update

From: "Zeugswetter Andreas ADI SD"
> > The advantage of sync_scan_offset is that, in some situations, a
> > second scan can actually finish faster than if it were the only query
> > executing, because a previous scan has already caused some blocks to
> > be cached. However, 16 is a small number because that benefit would
> > only be once per scan, and sync scans are only helpful on large tables.

Agreed.

> Alright, understood. That last part is actually something I now want to
> avoid because it's using the current cache-spoiling behaviour of
> seqscans to advantage. I'd like to remove that behaviour, but it sounds
> like we can have both
> - SeqScans that don't spoil cache
> - Synch scans
> by setting "sync_scan_offset" to zero.
>
> > > > I like the idea of reducing tuning parameters, but we should, at a
> > > > minimum, still allow an on/off button for sync scans. My tests
> > > > revealed that the wrong combination of
> > > > OS/FS/IO-Scheduler/Controller could result in bad I/O behavior.
> > >
> > > Agreed
> >
> > Do you have an opinion about sync_scan_threshold versus a simple
> > sync_scan_enable?
>
> enable_sync_scan?

The suggested GUCs seem very related. IIRC the agreed suggestion was to
use NBuffers (or a percentage thereof?) to decide whether to spoil the
buffer cache for a seq scan. It seems this same metric should be used to
decide whether to sync a scan when sync scan is enabled. So when the
table size is below NBuffers (or a percentage thereof) you neither
recycle buffers nor sync the seq scans.

Andreas