Thread: idea for concurrent seqscans

idea for concurrent seqscans

From
Jeff Davis
Date:
I had an idea that might improve parallel seqscans on the same relation.

If you have lots of concurrent seqscans going on a large relation, the
cache hit ratio is very low. But, if the seqscans are concurrent on the
same relation, there may be something to gain by starting a seqscan near
the page being accessed by an already-in-progress seqscan, and wrapping
back around to that start location. That would make some use of the
shared buffers, which would otherwise just be cache pollution.

I made a proof-of-concept implementation, which is entirely in heapam.c,
except for one addition to the HeapScanDesc struct in relscan.h. It is
not at all up to production quality; there are things I know that need
to be addressed. Basically, I just modified heapam.c to be able to start
at any page in the relation. Then, every time it reads a new page, I
have it mark the relation's oid and the page number in a shared mem
segment. Every time a new scan is started, it reads the shared mem
segment, and if the relation's oid matches, it starts the scan at the
page number it found in the shared memory. Otherwise, it starts the scan
at 0.
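
In outline, the proof of concept looks something like this (a sketch
with invented names, not the code as attached):

#include "postgres.h"
#include "storage/block.h"

typedef struct ScanHint
{
    Oid         relid;          /* relation being scanned */
    BlockNumber page;           /* page the scan most recently read */
} ScanHint;

static ScanHint *scan_hint;     /* one slot, in a small shared mem segment */

/* called from initscan() to pick a starting page */
static BlockNumber
ssf_start_page(Oid relid)
{
    if (scan_hint->relid == relid)
        return scan_hint->page; /* start near the in-progress scan */
    return 0;                   /* no match: start at the beginning */
}

/* called each time the scan reads a new page */
static void
ssf_report_page(Oid relid, BlockNumber page)
{
    scan_hint->relid = relid;
    scan_hint->page = page;
}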

There are a couple of obvious issues. One is that my whole implementation
doesn't account for reverse scans at all (since initscan doesn't know
what direction the scan will move in), but that shouldn't be a major
problem, since at worst we get the current behavior (aside: can someone
tell me how to force reverse scans so I can test that better?). Another
is that there's a race condition with the shared mem; that's out of pure
laziness on my part.

This method is really only effective if there is a significant amount of
disk I/O. If the data is coming out of OS buffers, the various scans will
diverge too much and won't be using each other's shared buffers.

I tested with shared_buffers=500 and all stats on. I used 60 threads
performing 30 seqscans each in my script ssf.rb (I refer to my
modification as "sequential scan follower" or ssf).

Here are some results with my modifications:
$ time ./ssf.rb # my script

real    4m22.476s
user    0m0.389s
sys     0m0.186s

test=# select relpages from pg_class where relname='test_ssf';
 relpages
----------
     1667
(1 row)

test=# select count(*) from test_ssf;
 count
--------
 200000
(1 row)

test=# select pg_stat_get_blocks_hit(17232) as hit,
pg_stat_get_blocks_fetched(17232) as total;
  hit   |  total
--------+---------
 971503 | 3353963
(1 row)

Or, approx. 29% cache hit.

Here are the results without my modifications:

test=# select relpages from pg_class where relname='test_ssf';
 relpages
----------
     1667
(1 row)

test=# select count(*) from test_ssf;
 count
--------
 200000
(1 row)

test=# select pg_stat_get_blocks_hit(17231) as hit,
pg_stat_get_blocks_fetched(17231) as total;
  hit   |  total
--------+---------
 199999 | 3353963
(1 row)

Or, approx. 6% cache hit. Note: the oid is different, because I have two
separately initdb'd data directories, one for the modified version and
one for the unmodified 8.0.0.

This is the first time I've really modified the PG source code to do
anything that looked promising, so this is more of a question than
anything else. Is it promising? Is this a potentially good approach? I'm
happy to post more test data and more documentation, and I'd also be
happy to bring the code to production quality. However, before I spend
too much more time on that, I'd like to get a general response from a
3rd party to let me know if I'm off base.

Regards,
    Jeff Davis


Attachment

Re: idea for concurrent seqscans

From
Simon Riggs
Date:
On Fri, 2005-02-25 at 00:34 -0800, Jeff Davis wrote:
> I had an idea that might improve parallel seqscans on the same relation.
> 
> If you have lots of concurrent seqscans going on a large relation, the
> cache hit ratio is very low. But, if the seqscans are concurrent on the
> same relation, there may be something to gain by starting a seqscan near
> the page being accessed by an already-in-progress seqscan, and wrapping
> back around to that start location. That would make some use of the
> shared buffers, which would otherwise just be cache pollution.

This is cool and was on my list of would-like-to-implement features. 

It's usually known as Synchronised Scanning. AFAIK it is free of any
patent restriction: it has already been implemented by both Teradata and
RedBrick.

> This is the first time I've really modified the PG source code to do
> anything that looked promising, so this is more of a question than
> anything else. Is it promising? Is this a potentially good approach? I'm
> happy to post more test data and more documentation, and I'd also be
> happy to bring the code to production quality. 

I'll be happy to help you do this, at least for design and code review.

I'll come back later with more detailed comments on your thoughts so
far.

> However, before I spend
> too much more time on that, I'd like to get a general response from a
> 3rd party to let me know if I'm off base.

Third party?

Best Regards, Simon Riggs



Re: idea for concurrent seqscans

From
Jeff Davis
Date:
On Fri, 2005-02-25 at 13:38 +0000, Simon Riggs wrote:
> On Fri, 2005-02-25 at 00:34 -0800, Jeff Davis wrote:
> > I had an idea that might improve parallel seqscans on the same relation.
> > 
> > If you have lots of concurrent seqscans going on a large relation, the
> > cache hit ratio is very low. But, if the seqscans are concurrent on the
> > same relation, there may be something to gain by starting a seqscan near
> > the page being accessed by an already-in-progress seqscan, and wrapping
> > back around to that start location. That would make some use of the
> > shared buffers, which would otherwise just be cache pollution.
> 
> This is cool and was on my list of would-like-to-implement features. 
> 
> It's usually known as Synchronised Scanning. AFAIK it is free of any
> patent restriction: it has already been implemented by both Teradata and
> RedBrick.
> 
> > This is the first time I've really modified the PG source code to do
> > anything that looked promising, so this is more of a question than
> > anything else. Is it promising? Is this a potentially good approach? I'm
> > happy to post more test data and more documentation, and I'd also be
> > happy to bring the code to production quality. 
> 
> I'll be happy to help you do this, at least for design and code review.
> 
> I'll come back later with more detailed comments on your thoughts so
> far.
> 

Good to hear. I'll clean up the code and document some more tests. Two
questions come to mind right now:
(1) Do we care about reverse scans being done with synchronized
scanning? If so, is there a good way to know in advance whether it is
going to be a forward or reverse scan (i.e. before heap_getnext())?
(2) Where is the appropriate place to put the page location of an
in-progress scan? Are there other pieces of shared memory that aren't
disk buffers that I should be making use of?


> > However, before I spend
> > too much more time on that, I'd like to get a general response from a
> > 3rd party to let me know if I'm off base.
> 
> Third party?
> 

A 2nd party? Anyone else? That was a typo :)

Regards,
    Jeff Davis



Re: idea for concurrent seqscans

From
Tom Lane
Date:
Jeff Davis <jdavis-pgsql@empires.org> writes:
> (1) Do we care about reverse scans being done with synchronized
> scanning? If so, is there a good way to know in advance whether it is
> going to be a forward or reverse scan (i.e. before heap_getnext())?

There are no reverse heapscans --- the only case where you'll see
direction = backwards is while backing up a cursor with FETCH BACKWARD.
I don't think you need to optimize that case.

What I'm more concerned about is your use of shared memory.  I didn't
have time to look at the patch, but how are you determining an upper
bound on the amount of memory you need?  What sort of locking and
contention issues will there be?

Another point is that this will render the results from heapscans
unstable, since different executions of the same query might start
at different points.  This would for example probably break many
of the regression tests.  We can deal with that if we have to, but
it raises the bar of how much benefit I'd want to see from the patch.

One detail that might or might not be significant: different scans are
very likely to have slightly different ideas about where the end of the
table is, since they determine this with an lseek(SEEK_END) at the
instant they start the scan.  I don't think this invalidates your idea
but you need to watch out for corner-case bugs in the coding.
        regards, tom lane


Re: idea for concurrent seqscans

From
Jeff Davis
Date:
On Fri, 2005-02-25 at 12:54 -0500, Tom Lane wrote:
> Jeff Davis <jdavis-pgsql@empires.org> writes:
> > (1) Do we care about reverse scans being done with synchronized
> > scanning? If so, is there a good way to know in advance whether it is
> > going to be a forward or reverse scan (i.e. before heap_getnext())?
> 
> There are no reverse heapscans --- the only case where you'll see
> direction = backwards is while backing up a cursor with FETCH BACKWARD.
> I don't think you need to optimize that case.
> 

Ok, I was wondering about that.

> What I'm more concerned about is your use of shared memory.  I didn't
> have time to look at the patch, but how are you determining an upper
> bound on the amount of memory you need?  What sort of locking and
> contention issues will there be?

Right now a scanning backend puts the page it's scanning into shared
memory when it gets a new page (so it's not every tuple). I haven't
determined whether this will be a major point of locking contention.
However, one possible implementation seems to solve both problems at
once:
Let's say we just had a static hash table of, say, 100 entries, each
holding a relation's oid and the page number it's currently scanning
(so 100 * (sizeof(oid) + sizeof(blocknumber)) bytes). The relid would
predetermine the placement in the table. If there's a collision,
overwrite. I don't think much is lost in that case, unless, for example,
two tables in an important join have oids that hash to the same slot. In
that case the effectiveness of synchronized scanning is lost, but it's
still no worse than the current behavior.
Let's say we didn't use any locks at all. Are there any real dangers
there? If there's a race, and one backend gets some garbage data, it can
just say "this is out of bounds, start the scan at 0". Since it's a
static hash table, we don't have to worry about following a bad pointer,
etc. If that looks like it will be a problem, I can test with locking
also to see what kind of contention there is.
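
Concretely, I'm picturing something like this (a sketch, not the actual
patch; the ssf_* names are invented):

#define SCAN_HINT_SLOTS 100

typedef struct ScanHintSlot
{
    Oid         relid;
    BlockNumber page;
} ScanHintSlot;

static ScanHintSlot *scan_hints;    /* SCAN_HINT_SLOTS entries in shared mem */

static void
ssf_report_page(Oid relid, BlockNumber page)
{
    ScanHintSlot *slot = &scan_hints[relid % SCAN_HINT_SLOTS];

    /* the relid picks the slot; a collision simply overwrites */
    slot->relid = relid;
    slot->page = page;
}

static BlockNumber
ssf_start_page(Oid relid, BlockNumber nblocks)
{
    ScanHintSlot hint = scan_hints[relid % SCAN_HINT_SLOTS];    /* no lock */

    /*
     * Without locks the copy may be torn or stale.  Anything that fails
     * these sanity checks is treated as "no hint", which is just the
     * current behavior: start at page 0.
     */
    if (hint.relid == relid && hint.page < nblocks)
        return hint.page;
    return 0;
}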

The current patch I sent was very much a proof of concept, but all it
did was have a shared mem segment of size 8 bytes (only holds info for
one relid at a time). That would probably be somewhat effective in many
cases, but of course we want it larger than that (800? 8KB?).

In short, I tried to overcome these problems with simplicity. Where
simplicity doesn't work I default to starting the scan at 0. Hopefully
those non-simple cases (like hash collisions and shared memory races)
are rare enough that we don't lose all that we gain.

> Another point is that this will render the results from heapscans
> unstable, since different executions of the same query might start
> at different points.  This would for example probably break many
> of the regression tests.  We can deal with that if we have to, but
> it raises the bar of how much benefit I'd want to see from the patch.
> 

I didn't consider that. Is there a reason the regression tests assume
the results will be returned in a certain order (or a consistent order)?

> One detail that might or might not be significant: different scans are
> very likely to have slightly different ideas about where the end of the
> table is, since they determine this with an lseek(SEEK_END) at the
> instant they start the scan.  I don't think this invalidates your idea
> but you need to watch out for corner-case bugs in the coding.
> 

I only see that as an issue in initscan(), where it sets the start page.
A simple bounds check would cure that, no? If the value is out of bounds,
set the start page to zero, and we haven't lost much. I need a bounds
check there anyway, since the data we get from shared memory needs to be
validated. That bounds check would compare against the current backend's
scan->rs_nblocks, which should be the correct number for that backend.
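
In initscan() that amounts to something like this (a sketch;
rs_startpage is an invented field name, rs_nblocks is the real one):

/* scan->rs_nblocks is this backend's own idea of where the table ends,
 * so this check also covers the lseek(SEEK_END) discrepancy above */
BlockNumber start = ssf_start_page(RelationGetRelid(scan->rs_rd),
                                   scan->rs_nblocks);

if (start >= scan->rs_nblocks)      /* validate anything from shared mem */
    start = 0;
scan->rs_startpage = start;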

Regards,
Jeff Davis




Re: idea for concurrent seqscans

From
Tom Lane
Date:
Jeff Davis <jdavis-pgsql@empires.org> writes:
> I didn't consider that. Is there a reason the regression tests assume
> the results will be returned in a certain order (or a consistent order)?

We use diff as the checking tool.
        regards, tom lane


Re: idea for concurrent seqscans

From
"Jim C. Nasby"
Date:
On Fri, Feb 25, 2005 at 01:30:57PM -0500, Tom Lane wrote:
> Jeff Davis <jdavis-pgsql@empires.org> writes:
> > I didn't consider that. Is there a reason the regression tests assume
> > the results will be returned in a certain order (or a consistent order)?
> 
> We use diff as the checking tool.

Doesn't the SQL spec specifically state that the only time you'll get
results in a deterministic order is if you use ORDER BY? Assuming
otherwise seems a bad idea (though at least in the case of testing it
makes the test more strenuous rather than less...)
-- 
Jim C. Nasby, Database Consultant               decibel@decibel.org 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


Re: idea for concurrent seqscans

From
Jeff Davis
Date:
On Fri, 2005-02-25 at 13:30 -0500, Tom Lane wrote:
> Jeff Davis <jdavis-pgsql@empires.org> writes:
> > I didn't consider that. Is there a reason the regression tests assume
> > the results will be returned in a certain order (or a consistent order)?
> 
> We use diff as the checking tool.
> 

Well, that does make testing more difficult, or it at least requires
extra work to make the regression tests understand the results better.

I'll submit a better patch, and then if everyone decides it's worth the
hassle with the regression tests, we can use it in 8.1. Some more testing
is required to see if the results are really as good as we hope.

Regards,
    Jeff Davis



Re: idea for concurrent seqscans

From
Jeff Davis
Date:
On Fri, 2005-02-25 at 18:03 -0600, Jim C. Nasby wrote:
> On Fri, Feb 25, 2005 at 01:30:57PM -0500, Tom Lane wrote:
> > Jeff Davis <jdavis-pgsql@empires.org> writes:
> > > I didn't consider that. Is there a reason the regression tests assume
> > > the results will be returned in a certain order (or a consistent order)?
> > 
> > We use diff as the checking tool.
> 
> Doesn't the SQL spec specifically state that the only time you'll get
> results in a deterministic order is if you use ORDER BY? Assuming
> otherwise seems a bad idea (though at least in the case of testing it
> makes the test more strenuous rather than less...)

True, that was my reasoning when I proposed synchronized scanning.

Keep in mind that this is a criticism of only the regression tests, not
the RDBMS itself.

I don't know much about the regression tests, so maybe it's impractical
to not assume consistent order. I'm sure the developers will vote one
way or the other. I hate to throw away a potential performance boost,
but I also hate to burden the developers with rewriting a lot of
regression tests when their time could be better spent elsewhere.

Regards,
    Jeff Davis








Re: idea for concurrent seqscans

From
Bruce Momjian
Date:
Jim C. Nasby wrote:
> On Fri, Feb 25, 2005 at 01:30:57PM -0500, Tom Lane wrote:
> > Jeff Davis <jdavis-pgsql@empires.org> writes:
> > > I didn't consider that. Is there a reason the regression tests assume
> > > the results will be returned in a certain order (or a consistent order)?
> > 
> > We use diff as the checking tool.
> 
> Doesn't the SQL spec specifically state that the only time you'll get
> results in a deterministic order is if you use ORDER BY? Assuming
> otherwise seems a bad idea (though at least in the case of testing it
> makes the test more strenuous rather than less...)

The only trick I can think of is to use SELECT ... INTO TEMPORARY tab
... ORDER BY and then use COPY to dump the table.  It will then dump in
the order of the ORDER BY.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: idea for concurrent seqscans

From
Bruce Momjian
Date:
Sorry, please disregard my ramblings.  I thought it was a different
question.

---------------------------------------------------------------------------

pgman wrote:
> Jim C. Nasby wrote:
> > On Fri, Feb 25, 2005 at 01:30:57PM -0500, Tom Lane wrote:
> > > Jeff Davis <jdavis-pgsql@empires.org> writes:
> > > > I didn't consider that. Is there a reason the regression tests assume
> > > > the results will be returned in a certain order (or a consistent order)?
> > > 
> > > We use diff as the checking tool.
> > 
> > Doesn't the SQL spec specifically state that the only time you'll get
> > results in a deterministic order is if you use ORDER BY? Assuming
> > otherwise seems a bad idea (though at least in the case of testing it
> > makes the test more strenuous rather than less...)
> 
> The only trick I can think of is to use SELECT ... INTO TEMPORARY tab
> ... ORDER BY and then use COPY to dump the table.  It will then dump in
> the order of the ORDER BY.
> 
> -- 
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 359-1001
>   +  If your life is a hard drive,     |  13 Roberts Road
>   +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: idea for concurrent seqscans

From
"Jim C. Nasby"
Date:
On Fri, Feb 25, 2005 at 04:30:17PM -0800, Jeff Davis wrote:
> On Fri, 2005-02-25 at 18:03 -0600, Jim C. Nasby wrote:
> > On Fri, Feb 25, 2005 at 01:30:57PM -0500, Tom Lane wrote:
> > > Jeff Davis <jdavis-pgsql@empires.org> writes:
> > > > I didn't consider that. Is there a reason the regression tests assume
> > > > the results will be returned in a certain order (or a consistent order)?
> > > 
> > > We use diff as the checking tool.
> > 
> > Doesn't the SQL spec specifically state that the only time you'll get
> > results in a deterministic order is if you use ORDER BY? Assuming
> > otherwise seems a bad idea (though at least in the case of testing it
> > makes the test more strenuous rather than less...)
> 
> True, that was my reasoning when I proposed synchronized scanning.
> 
> Keep in mind that this is a criticism of only the regression tests, not
> the RDBMS itself.
> 
> I don't know much about the regression tests, so maybe it's impractical
> to not assume consistent order. I'm sure the developers will vote one
> way or the other. I hate to throw away a potential performance boost,
> but I also hate to burden the developers with rewriting a lot of
> regression tests when their time could be better spent elsewhere.

Certainly, but I suspect it's just a matter of adding ORDER BY to
everything, which just about anyone (even myself!) should be able to do.
-- 
Jim C. Nasby, Database Consultant               decibel@decibel.org 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


Re: idea for concurrent seqscans

From
Tom Lane
Date:
"Jim C. Nasby" <decibel@decibel.org> writes:
>> but I also hate to burden the developers with rewriting a lot of
>> regression tests when their time could be better spent elsewhere.

> Certainly, but I suspect it's just a matter of adding ORDER BY to
> everything, which just about anyone (even myself!) should be able to do.

Performance is not the issue; test coverage, however, is an issue.
See the comment at the end of
http://developer.postgresql.org/docs/postgres/regress-evaluation.html#AEN22383
        regards, tom lane


Re: idea for concurrent seqscans

From
"Jim C. Nasby"
Date:
On Fri, Feb 25, 2005 at 11:51:40PM -0500, Tom Lane wrote:
> "Jim C. Nasby" <decibel@decibel.org> writes:
> >> but I also hate to burden the developers with rewriting a lot of
> >> regression tests when their time could be better spent elsewhere.
> 
> > Certainly, but I suspect it's just a matter of adding ORDER BY to
> > everything, which just about anyone (even myself!) should be able to do.
> 
> Performance is not the issue; test coverage, however, is an issue.
> See the comment at the end of
> http://developer.postgresql.org/docs/postgres/regress-evaluation.html#AEN22383

Assuming you're talking about "You might wonder why we don't order all
the regression test queries explicitly to get rid of this issue once and
for all. The reason is that that would make the regression tests less
useful, not more, since they'd tend to exercise query plan types that
produce ordered results to the exclusion of those that don't.", good
point. I can think of 2 ways around this:

1) Select into a temptable, then select out of it with an order by

2) Run the output through sort before doing the diff

Is there any reason one of these wouldn't work?
-- 
Jim C. Nasby, Database Consultant               decibel@decibel.org 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


Re: idea for concurrent seqscans

From
Jeff Davis
Date:
I have a newer version of my Synchronized Scanning patch which hopefully
makes it closer to a real patch; the first one was more of a proof of
concept.

DONE:
* I added a proper bounds check for the result it gets from shared
memory.
* I expanded the shared memory to be a static hash table (currently set
to a size of 8KB, but that is a one line change if that's too big). Now
it can keep track of many scans on many tables at once.

TODO:
* More testing. I plan to get some more tests up on Monday, and when I
start to see the best uses for this patch, I will also try to benchmark
against MySQL or something else. I'm seeing some good preliminary
results in cache hit rate (which is much higher in some cases), but I
need to demonstrate an actual decrease in runtime.
* Right now the shared mem isn't destroyed when the postmaster shuts
down.
* This patch still makes no use of locks at all. If someone thinks locks
are required, please let me know. Currently, if there is inconsistent
data in the shared memory area, the worst that can happen is the current
behavior.

Regards,
    Jeff Davis

Attachment

Re: idea for concurrent seqscans

From
William Volkman
Date:
On Fri, 2005-02-25 at 22:49, Jim C. Nasby wrote:
> On Fri, Feb 25, 2005 at 11:51:40PM -0500, Tom Lane wrote:
> > "Jim C. Nasby" <decibel@decibel.org> writes:
> > >> but I also hate to burden the developers with rewriting a lot of
> > >> regression tests when their time could be better spent elsewhere.

The patch is for *concurrent* seqscans, would the regression tests
even be affected by this (IIRC they are single user, single process)?




Re: idea for concurrent seqscans

From
Neil Conway
Date:
William Volkman wrote:
> The patch is for *concurrent* seqscans, would the regression tests
> even be affected by this (IIRC they are single user, single process)?

No; `make installcheck' is serial, but `make check' executes tests in 
parallel in multiple backends concurrently.

-Neil


Re: idea for concurrent seqscans

From
"Andrew Dunstan"
Date:
Tom Lane said:
> "Jim C. Nasby" <decibel@decibel.org> writes:
>>> but I also hate to burden the developers with rewriting a lot of
>>> regression tests when their time could be better spent elsewhere.
>
>> Certainly, but I suspect it's just a matter of adding ORDER BY to
>> everything, which just about anyone (even myself!) should be able to
>> do.
>
> Performance is not the issue; test coverage, however, is an issue. See
> the comment at the end of
> http://developer.postgresql.org/docs/postgres/regress-evaluation.html#AEN22383


Is it not possible to wrap the original query in an outer query that
applies the ordering, leaving the original query to run without ordering?
Would that lessen test coverage?

cheers

andrew





Re: idea for concurrent seqscans

From
Neil Conway
Date:
Jeff Davis wrote:
> I have a newer version of my Synchronized Scanning patch which hopefully
> makes it closer to a real patch, the first one was more of a proof of
> concept.

A few minor comments:

- context diffs (diff -c) are the preferred format. Also, folks usually 
send patches as a single diff; an easy way to generate that is via `cvs 
diff', or `diff -r old_dir new_dir'.

- needlessly reindenting code makes it difficult to understand what 
you've changed. You should probably follow the PG coding conventions WRT 
indentation, brace placement, and so forth, although this will be fixed 
by a script later in any case. See Chapter 43 of the docs.

- you don't need to (and should not) declare `static' functions in 
header files. If your additions to heapam.c aren't used outside of 
heapam.c, they needn't be declared in the header file.

- PG has an abstraction layer for using shared memory that you should 
take advantage of. You should do something like: (1) create a function 
that returns the amount of shared memory space you require (2) invoke 
the function from CreateSharedMemoryAndSemaphores (3) create/attach to 
and initialize the shared memory during startup, via ShmemInitStruct(). 
See how InitProcGlobal() works, for example.

- it makes me quite nervous to be reading and writing to shared data 
without using locks. If there is too much of a performance hit to 
acquire and release a lock for each page traversed, what about only 
updating the shared memory stats every K pages?
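
Roughly this shape, with made-up function names (ShmemInitStruct() is the
real interface), reusing the SCAN_HINT_SLOTS table sketched earlier in
the thread:

#include "storage/shmem.h"

Size
SyncScanShmemSize(void)
{
    return SCAN_HINT_SLOTS * sizeof(ScanHintSlot);
}

void
SyncScanShmemInit(void)
{
    bool        found;

    scan_hints = (ScanHintSlot *)
        ShmemInitStruct("Synchronized Scan Hints",
                        SyncScanShmemSize(), &found);
    if (!found)
        MemSet(scan_hints, 0, SyncScanShmemSize());
}

CreateSharedMemoryAndSemaphores would add SyncScanShmemSize() to its
total, and call SyncScanShmemInit() during startup.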

-Neil


Re: idea for concurrent seqscans

From
Jeff Davis
Date:
Thanks for the information.

On Sat, 2005-02-26 at 23:39 +1100, Neil Conway wrote:
> Jeff Davis wrote:
> > I have a newer version of my Synchronized Scanning patch which hopefully
> > makes it closer to a real patch, the first one was more of a proof of
> > concept.
> 
> A few minor comments:
> 
> - context diffs (diff -c) are the preferred format. Also, folks usually 
> send patches as a single diff; an easy way to generate that is via `cvs 
> diff', or `diff -r old_dir new_dir'.

Ok.

> - needlessly reindenting code makes it difficult to understand what 
> you've changed. You should probably follow the PG coding conventions WRT 
> indentation, brace placement, and so forth, although this will be fixed 
> by a script later in any case. See Chapter 43 of the docs.
> 

The only reason I did that was because the original source was difficult
to read and work on. Is there a reason that code and comments wrap
around to the next line throughout the file? 

> - PG has an abstraction layer for using shared memory that you should 
> take advantage of. You should do something like: (1) create a function 
> that returns the amount of shared memory space you require (2) invoke 
> the function from CreateSharedMemoryAndSemaphores (3) create/attach to 
> and initialize the shared memory during startup, via ShmemInitStruct(). 
> See how InitProcGlobal() works, for example.

Will do.

> - it makes me quite nervous to be reading and writing to shared data 
> without using locks. If there is too much of a performance hit to 
> acquire and release a lock for each page traversed, what about only 
> updating the shared memory stats every K pages?
> 

I'll try that and test it; hopefully any performance problems will appear
sooner rather than later. If something shows up, updating every K pages
sounds like a plan to me.
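
Something like this, I imagine (a sketch; the counter field and
SyncScanLock are invented, and K would need tuning):

#define SSF_REPORT_INTERVAL 16      /* K: report every 16th page */

/* in the scan's new-page path */
if (++scan->rs_pages_since_report >= SSF_REPORT_INTERVAL)
{
    /* SyncScanLock would be a new entry in the LWLockId enum */
    LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);
    ssf_report_page(RelationGetRelid(scan->rs_rd), page);
    LWLockRelease(SyncScanLock);
    scan->rs_pages_since_report = 0;
}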

I will get a new patch ready soon with your suggestions.

Regards,
    Jeff Davis



Re: idea for concurrent seqscans

From
"Jim C. Nasby"
Date:
On Sat, Feb 26, 2005 at 12:22:45AM -0800, Jeff Davis wrote:
> * I expanded the shared memory to be a static hash table (currently set
> to a size of 8KB, but that is a one line change if that's too big). Now
> it can keep track of many scans on many tables at once.

I know you're still early in this, but should this value be a GUC?

> * This patch still makes no use of locks at all. If someone thinks locks
> are required, please let me know. Currently, if there is inconsistent
> data in the shared memory area the worst that can happen is no worse
> than the current behavior.

It would be useful to have the backend put something in the logfile any
time it has to revert to the default behavior. Without that, we have no
way to know how often it's happening, and if it's worth trying to make
it happen less.
-- 
Jim C. Nasby, Database Consultant               decibel@decibel.org 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


Re: idea for concurrent seqscans

From
Tom Lane
Date:
"Jim C. Nasby" <decibel@decibel.org> writes:
> Assuming you're talking about "You might wonder why we don't order all
> the regression test queries explicitly to get rid of this issue once and
> for all. The reason is that that would make the regression tests less
> useful, not more, since they'd tend to exercise query plan types that
> produce ordered results to the exclusion of those that don't.", good
> point. I can think of 2 ways around this:

> 1) Select into a temptable, then select out of it with an order by

> 2) Run the output through sort before doing the diff

> Is there any reason one of these wouldn't work?

Like I said originally, we could certainly devise a solution if we
needed to.  I was just pointing out that this is a nontrivial
consideration, and I don't want to buy into it if the patch proves
to offer only marginal performance improvements.
        regards, tom lane


Re: idea for concurrent seqscans

From
Jeff Davis
Date:
On Sat, 2005-02-26 at 10:16 -0600, Jim C. Nasby wrote:
> On Sat, Feb 26, 2005 at 12:22:45AM -0800, Jeff Davis wrote:
> > * I expanded the shared memory to be a static hash table (currently set
> > to a size of 8KB, but that is a one line change if that's too big). Now
> > it can keep track of many scans on many tables at once.
> 
> I know you're still early in this, but should this value be a GUC?
> 

I don't know if we want to clutter the GUC variables with a bunch of
minor things like this. Most users won't understand what they're
actually getting by allocating more memory here. The only improvement
would be if there are two tables that are very often scanned
simultaneously that happen to hash to the same value, in which case
decreasing the shared mem has almost as much chance of solving the
problem as increasing it :)

However, if people think it's worthwhile, I'll be happy to do it.

> > * This patch still makes no use of locks at all. If someone thinks locks
> > are required, please let me know. Currently, if there is inconsistent
> > data in the shared memory area the worst that can happen is no worse
> > than the current behavior.
> 
> It would be useful to have the backend put something in the logfile any
> time it has to revert to the default behavior. Without that, we have no
> way to know how often it's happening, and if it's worth trying to make
> it happen less.

Good idea. However, the first scan on a table after a server start will
always be default, so I'll need to avoid excessive logging.
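
Probably the thing to do is log only when a hint was present but failed
validation, so tables that simply haven't been scanned yet stay quiet.
A sketch:

/* in ssf_start_page(), after the unlocked read of the hint */
if (hint.relid == relid && hint.page >= nblocks)
    elog(DEBUG1, "synchronized scan: discarding out-of-range hint "
         "for relation %u", relid);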

Regards,
    Jeff Davis



Re: idea for concurrent seqscans

From
Neil Conway
Date:
Jeff Davis wrote:
> The only reason I did that was because the original source was difficult
> to read and work on. Is there a reason that code and comments wrap
> around to the next line throughout the file?

I'm not sure what you mean. Assuming your editor expands tabs to 4 spaces 
(which is the convention in the PostgreSQL source), lines should be 
about 80 characters at most. Expressions that would take more characters 
horizontally if left on one line are divided into multiple lines, and 
indented appropriately. At any rate, the source is perfectly readable to 
me :) Perhaps you need to tweak your editor's configuration?

-Neil


Re: idea for concurrent seqscans

From
Gaetano Mendola
Date:
Tom Lane wrote:
> "Jim C. Nasby" <decibel@decibel.org> writes:
> 
>>Assuming you're talking about "You might wonder why we don't order all
>>the regression test queries explicitly to get rid of this issue once and
>>for all. The reason is that that would make the regression tests less
>>useful, not more, since they'd tend to exercise query plan types that
>>produce ordered results to the exclusion of those that don't.", good
>>point. I can think of 2 ways around this:
> 
> 
>>1) Select into a temptable, then select out of it with an order by
> 
> 
>>2) Run the output through sort before doing the diff
> 
> 
>>Is there any reason one of these wouldn't work?
> 
> 
> Like I said originally, we could certainly devise a solution if we
> needed to.  I was just pointing out that this is a nontrivial
> consideration, and I don't want to buy into it if the patch proves
> to offer only marginal performance improvements.
> 

I'll bet it will not offer only marginal performance improvements. I
sometimes see my 4-CPU server with three CPUs on holiday and the other
CPU working on a long sequential scan. I hope that this patch, if it
works correctly, will be used in a future PostgreSQL version.

Regards
Gaetano Mendola




Re: idea for concurrent seqscans

From
Simon Riggs
Date:
On Sat, 2005-02-26 at 10:47 -0500, Tom Lane wrote:
> "Jim C. Nasby" <decibel@decibel.org> writes:
> > Assuming you're talking about "You might wonder why we don't order all
> > the regression test queries explicitly to get rid of this issue once and
> > for all. The reason is that that would make the regression tests less
> > useful, not more, since they'd tend to exercise query plan types that
> > produce ordered results to the exclusion of those that don't.", good
> > point. I can think of 2 ways around this:
> 
> > 1) Select into a temptable, then select out of it with an order by
> 
> > 2) Run the output through sort before doing the diff
> 
> > Is there any reason one of these wouldn't work?
> 
> Like I said originally, we could certainly devise a solution if we
> needed to.  I was just pointing out that this is a nontrivial
> consideration, and I don't want to buy into it if the patch proves
> to offer only marginal performance improvements.

Yes, the starting place is performance prototyping. Jeff has sensibly
started with that thought in his initial post.

I would suggest that we use a new GUC,
enable_synch_scans = off by default,
to control whether this new behaviour is utilised.

That way, all of the current tests would stand as-is. My feeling is that
in general we should only add tests, not alter existing ones. It would be
straightforward, even if laborious, to add some additional tests that
prove that this type of system feature returns correct SQL results.
However, the key seems to be that the results of SQL runs without an
ORDER BY would be timing-dependent, so this would actually be a wholly
new type of test - we would need to run one scan on its own, then compare
with two run together to check that the same answer was still returned,
possibly with a post-execution external sort.

Best Regards, Simon Riggs




Re: idea for concurrent seqscans

From
Bruce Momjian
Date:
Added to TODO list:

* Allow sequential scans to take advantage of other concurrent sequential
  scans, also called "Synchronised Scanning"


---------------------------------------------------------------------------

Jeff Davis wrote:
> I had an idea that might improve parallel seqscans on the same relation.
> 
> If you have lots of concurrent seqscans going on a large relation, the
> cache hit ratio is very low. But, if the seqscans are concurrent on the
> same relation, there may be something to gain by starting a seqscan near
> the page being accessed by an already-in-progress seqscan, and wrapping
> back around to that start location. That would make some use of the
> shared buffers, which would otherwise just be cache pollution.
> [...]

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: idea for concurrent seqscans

From
Bruce Momjian
Date:
TODO description added:

* Allow sequential scans to take advantage of other concurrent
  sequential scans, also called "Synchronised Scanning"

  One possible implementation is to start sequential scans from the
  lowest numbered buffer in the shared cache, and when reaching the end
  wrap around to the beginning, rather than always starting sequential
  scans at the start of the table.

---------------------------------------------------------------------------

Jeff Davis wrote:
> I had an idea that might improve parallel seqscans on the same relation.
> 
> If you have lots of concurrent seqscans going on a large relation, the
> cache hit ratio is very low. But, if the seqscans are concurrent on the
> same relation, there may be something to gain by starting a seqscan near
> the page being accessed by an already-in-progress seqscan, and wrapping
> back around to that start location. That would make some use of the
> shared buffers, which would otherwise just be cache pollution.
> [...]

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073