Thread: idea for concurrent seqscans
I had an idea that might improve parallel seqscans on the same relation.

If you have lots of concurrent seqscans going on a large relation, the cache hit ratio is very low. But if the seqscans are concurrent on the same relation, there may be something to gain by starting a seqscan near the page being accessed by an already-in-progress seqscan, and wrapping back around to that start location. That would make some use of the shared buffers, which would otherwise just be cache pollution.

I made a proof-of-concept implementation, which is entirely in heapam.c, except for one addition to the HeapScanDesc struct in relscan.h. It is not at all up to production quality; there are things I know need to be addressed. Basically, I just modified heapam.c to be able to start at any page in the relation. Then, every time it reads a new page, I have it mark the relation's oid and the page number in a shared mem segment. Every time a new scan is started, it reads the shared mem segment, and if the relation's oid matches, it starts the scan at the page number it found in the shared memory. Otherwise, it starts the scan at 0.

There are a couple of obvious issues. One is that my whole implementation doesn't account for reverse scans at all (since initscan doesn't know what direction the scan will move in), but that shouldn't be a major problem since at worst it will be the current behavior (aside: can someone tell me how to force reverse scans so I can test that better?). Another is that there's a race condition with the shared mem, and that's out of pure laziness on my part.

This method is really only effective if there is a significant amount of disk I/O. If it's pulling the data from O/S buffers, the various scans will diverge too much and not be using each other's shared buffers.

I tested with shared_buffers=500 and all stats on. I used 60 threads performing 30 seqscans each in my script ssf.rb (I refer to my modification as "sequential scan follower" or ssf).
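The report-on-read / check-on-start logic described above can be sketched in a few lines of standalone C. This is only an illustration of the idea, not the patch itself: the type and function names (ss_report_location, ss_get_startblock) are made up for this sketch, and a plain static struct stands in for the real shared memory segment.

```c
#include <stdint.h>

/* Illustrative stand-ins for PG's Oid and BlockNumber types. */
typedef uint32_t Oid;
typedef uint32_t BlockNumber;

typedef struct ScanHint
{
    Oid         relid;          /* relation being scanned, 0 = unused */
    BlockNumber page;           /* page most recently read by that scan */
} ScanHint;

static ScanHint shared_hint;    /* stands in for the shared mem segment */

/* Called each time an in-progress scan reads a new page. */
static void
ss_report_location(Oid relid, BlockNumber page)
{
    shared_hint.relid = relid;
    shared_hint.page = page;
}

/*
 * Called when a new scan starts: pick up an in-progress scan's position
 * if the relation matches and the hint is in bounds, else start at 0.
 */
static BlockNumber
ss_get_startblock(Oid relid, BlockNumber nblocks)
{
    if (shared_hint.relid == relid && shared_hint.page < nblocks)
        return shared_hint.page;
    return 0;
}
```

The new scan then proceeds from the returned page and wraps around at the end of the relation back to its start location, so it still visits every page exactly once.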
Here are some results with my modifications:

$ time ./ssf.rb   # my script
real    4m22.476s
user    0m0.389s
sys     0m0.186s

test=# select relpages from pg_class where relname='test_ssf';
 relpages
----------
     1667
(1 row)

test=# select count(*) from test_ssf;
 count
--------
 200000
(1 row)

test=# select pg_stat_get_blocks_hit(17232) as hit, pg_stat_get_blocks_fetched(17232) as total;
  hit   |  total
--------+---------
 971503 | 3353963
(1 row)

Or, approx. 29% cache hit. Here are the results without my modifications:

test=# select relpages from pg_class where relname='test_ssf';
 relpages
----------
     1667
(1 row)

test=# select count(*) from test_ssf;
 count
--------
 200000
(1 row)

test=# select pg_stat_get_blocks_hit(17231) as hit, pg_stat_get_blocks_fetched(17231) as total;
  hit   |  total
--------+---------
 199999 | 3353963
(1 row)

Or, approx. 6% cache hit. Note: the oid is different because I have two separately initdb'd data directories, one for the modified version and one for the unmodified 8.0.0.

This is the first time I've really modified the PG source code to do anything that looked promising, so this is more of a question than anything else. Is it promising? Is this a potentially good approach? I'm happy to post more test data and more documentation, and I'd also be happy to bring the code to production quality. However, before I spend too much more time on that, I'd like to get a general response from a 3rd party to let me know if I'm off base.

Regards,
Jeff Davis
On Fri, 2005-02-25 at 00:34 -0800, Jeff Davis wrote:
> I had an idea that might improve parallel seqscans on the same relation.
>
> If you have lots of concurrent seqscans going on a large relation, the
> cache hit ratio is very low. But, if the seqscans are concurrent on the
> same relation, there may be something to gain by starting a seqscan near
> the page being accessed by an already-in-progress seqscan, and wrapping
> back around to that start location. That would make some use of the
> shared buffers, which would otherwise just be cache pollution.

This is cool and was on my list of would-like-to-implement features.

It's usually known as Synchronised Scanning. AFAIK it is free of any patent restriction: it has already been implemented by both Teradata and RedBrick.

> This is the first time I've really modified the PG source code to do
> anything that looked promising, so this is more of a question than
> anything else. Is it promising? Is this a potentially good approach? I'm
> happy to post more test data and more documentation, and I'd also be
> happy to bring the code to production quality.

I'll be happy to help you do this, at least for design and code review.

I'll come back later with more detailed comments on your thoughts so far.

> However, before I spend
> too much more time on that, I'd like to get a general response from a
> 3rd party to let me know if I'm off base.

Third party?

Best Regards, Simon Riggs
On Fri, 2005-02-25 at 13:38 +0000, Simon Riggs wrote:
> This is cool and was on my list of would-like-to-implement features.
>
> It's usually known as Synchronised Scanning. AFAIK it is free of any
> patent restriction: it has already been implemented by both Teradata and
> RedBrick.
>
> I'll be happy to help you do this, at least for design and code review.
>
> I'll come back later with more detailed comments on your thoughts so
> far.

Good to hear. I'll clean up the code and document some more tests.

Three questions come to mind right now:

(1) Do we care about reverse scans being done with synchronized scanning? If so, is there a good way to know in advance whether it is going to be a forward or reverse scan (i.e. before heap_getnext())?

(2) Where is the appropriate place to put the page location of an in-progress scan? Are there other pieces of shared memory that aren't disk buffers that I should be making use of?
> > However, before I spend
> > too much more time on that, I'd like to get a general response from a
> > 3rd party to let me know if I'm off base.
>
> Third party?

A 2nd party? Anyone else? That was a typo :)

Regards,
Jeff Davis
Jeff Davis <jdavis-pgsql@empires.org> writes:
> (1) Do we care about reverse scans being done with synchronized
> scanning? If so, is there a good way to know in advance whether it is
> going to be a forward or reverse scan (i.e. before heap_getnext())?

There are no reverse heapscans --- the only case where you'll see direction = backwards is while backing up a cursor with FETCH BACKWARD. I don't think you need to optimize that case.

What I'm more concerned about is your use of shared memory. I didn't have time to look at the patch, but how are you determining an upper bound on the amount of memory you need? What sort of locking and contention issues will there be?

Another point is that this will render the results from heapscans unstable, since different executions of the same query might start at different points. This would for example probably break many of the regression tests. We can deal with that if we have to, but it raises the bar of how much benefit I'd want to see from the patch.

One detail that might or might not be significant: different scans are very likely to have slightly different ideas about where the end of the table is, since they determine this with an lseek(SEEK_END) at the instant they start the scan. I don't think this invalidates your idea but you need to watch out for corner-case bugs in the coding.

regards, tom lane
On Fri, 2005-02-25 at 12:54 -0500, Tom Lane wrote:
> There are no reverse heapscans --- the only case where you'll see
> direction = backwards is while backing up a cursor with FETCH BACKWARD.
> I don't think you need to optimize that case.

Ok, I was wondering about that.

> What I'm more concerned about is your use of shared memory. I didn't
> have time to look at the patch, but how are you determining an upper
> bound on the amount of memory you need? What sort of locking and
> contention issues will there be?

Right now a scanning backend puts the page it's scanning into shared memory when it gets a new page (so it's not every tuple). I haven't determined whether this will be a major point of locking contention. However, one possible implementation seems to solve both problems at once:

Let's say we just had a static hash table of size 100*(sizeof(Oid)+sizeof(BlockNumber)) (to hold the relation's oid and the page number it's currently scanning). The relid would predetermine the placement in the table. If there's a collision, overwrite. I don't think much is lost in that case, unless, for example, two tables in an important join have oids that hash to the same value. In that case the effectiveness of synchronized scanning will be lost, but it's no worse than the current behavior.

Let's say we didn't use any locks at all. Are there any real dangers there? If there's a race, and one backend gets some garbage data, it can just say "this is out of bounds, start the scan at 0". Since it's a static hash table, we don't have to worry about following a bad pointer, etc. If that looks like it will be a problem, I can test with locking also to see what kind of contention there is.

The current patch I sent was very much a proof of concept; all it did was have a shared mem segment of size 8 bytes (only enough to hold info for one relid at a time). That would probably be somewhat effective in many cases, but of course we want it larger than that (800 bytes? 8KB?).

In short, I tried to overcome these problems with simplicity. Where simplicity doesn't work, I default to starting the scan at 0. Hopefully those non-simple cases (like hash collisions and shared memory races) are rare enough that we don't lose all that we gain.

> Another point is that this will render the results from heapscans
> unstable, since different executions of the same query might start
> at different points. This would for example probably break many
> of the regression tests. We can deal with that if we have to, but
> it raises the bar of how much benefit I'd want to see from the patch.

I didn't consider that. Is there a reason the regression tests assume the results will be returned in a certain order (or a consistent order)?

> One detail that might or might not be significant: different scans are
> very likely to have slightly different ideas about where the end of the
> table is, since they determine this with an lseek(SEEK_END) at the
> instant they start the scan. I don't think this invalidates your idea
> but you need to watch out for corner-case bugs in the coding.

I only see that as an issue in initscan(), where it sets the start page. A simple bounds check would cure that, no? If it was out of bounds, set the start page to zero, and we didn't lose much. I need a bounds check there anyway, since the data we get from shared memory needs to be validated. That bounds check would be comparing against the current backend's scan->rs_nblocks, which should be the correct number for that backend.

Regards,
Jeff Davis
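The static hash table proposed above can be sketched standalone. This is a toy model of the design, not the patch: the names (ss_table, ss_report, ss_start) are invented here, an ordinary static array stands in for the shared segment, and relid modulo the table size plays the role of the hash. The point it demonstrates is the overwrite-on-collision policy and how the same bounds check that validates a possibly-racy read also degrades a collision gracefully to "start at 0".

```c
#include <stdint.h>

typedef uint32_t Oid;
typedef uint32_t BlockNumber;

#define SS_TABLE_SIZE 100       /* fixed number of slots, as proposed */

typedef struct SsSlot
{
    Oid         relid;
    BlockNumber page;
} SsSlot;

static SsSlot ss_table[SS_TABLE_SIZE];  /* stands in for shared memory */

/* The relid predetermines the slot; a collision simply overwrites. */
static SsSlot *
ss_slot_for(Oid relid)
{
    return &ss_table[relid % SS_TABLE_SIZE];
}

static void
ss_report(Oid relid, BlockNumber page)
{
    SsSlot     *slot = ss_slot_for(relid);

    slot->relid = relid;
    slot->page = page;
}

/*
 * Garbage from an unlocked race, or a slot stolen by a colliding
 * relation, fails the same validation and degrades to page 0 ---
 * i.e. to the current (unsynchronized) behavior.
 */
static BlockNumber
ss_start(Oid relid, BlockNumber nblocks)
{
    SsSlot     *slot = ss_slot_for(relid);

    if (slot->relid == relid && slot->page < nblocks)
        return slot->page;
    return 0;
}
```

Because a read never follows a pointer out of the fixed-size array, the worst a torn or stale entry can do is send one scan back to page 0.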
Jeff Davis <jdavis-pgsql@empires.org> writes: > I didn't consider that. Is there a reason the regression tests assume > the results will be returned in a certain order (or a consistent order)? We use diff as the checking tool. regards, tom lane
On Fri, Feb 25, 2005 at 01:30:57PM -0500, Tom Lane wrote:
> Jeff Davis <jdavis-pgsql@empires.org> writes:
> > I didn't consider that. Is there a reason the regression tests assume
> > the results will be returned in a certain order (or a consistent order)?
>
> We use diff as the checking tool.

Doesn't the SQL spec specifically state that the only time you'll get results in a deterministic order is if you use ORDER BY? Assuming otherwise seems a bad idea (though at least in the case of testing it makes the test more strenuous rather than less...)

-- 
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
On Fri, 2005-02-25 at 13:30 -0500, Tom Lane wrote:
> Jeff Davis <jdavis-pgsql@empires.org> writes:
> > I didn't consider that. Is there a reason the regression tests assume
> > the results will be returned in a certain order (or a consistent order)?
>
> We use diff as the checking tool.

Well, that does make testing more difficult, or at least requires extra work to make the regression tests understand the results better. I'll submit a better patch, and then if everyone decides it's worth the hassle with the regression tests, we can use it in 8.1. Some more testing is required to see if the results are really as good as we hope.

Regards,
Jeff Davis
On Fri, 2005-02-25 at 18:03 -0600, Jim C. Nasby wrote:
> Doesn't the SQL spec specifically state that the only time you'll get
> results in a deterministic order is if you use ORDER BY? Assuming
> otherwise seems a bad idea (though at least in the case of testing it
> makes the test more strenuous rather than less...)

True, that was my reasoning when I proposed synchronized scanning.

Keep in mind that this is a criticism of only the regression tests, not the RDBMS itself.

I don't know much about the regression tests, so maybe it's impractical not to assume consistent order. I'm sure the developers will vote one way or the other. I hate to throw away a potential performance boost, but I also hate to burden the developers with rewriting a lot of regression tests when their time could be better spent elsewhere.

Regards,
Jeff Davis
Jim C. Nasby wrote:
> Doesn't the SQL spec specifically state that the only time you'll get
> results in a deterministic order is if you use ORDER BY? Assuming
> otherwise seems a bad idea (though at least in the case of testing it
> makes the test more strenuous rather than less...)

The only trick I can think of is to use SELECT ... INTO TEMPORARY tab ... ORDER BY and then use COPY to dump the table. It will then dump in the order of the ORDER BY.

-- 
Bruce Momjian                   | http://candle.pha.pa.us
pgman@candle.pha.pa.us          | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup.    | Newtown Square, Pennsylvania 19073
Sorry, please disregard my ramblings. I thought it was a different question.

-- 
Bruce Momjian                   | http://candle.pha.pa.us
pgman@candle.pha.pa.us          | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup.    | Newtown Square, Pennsylvania 19073
On Fri, Feb 25, 2005 at 04:30:17PM -0800, Jeff Davis wrote:
> Keep in mind that this is a criticism of only the regression tests, not
> the RDBMS itself.
>
> I don't know much about the regression tests, so maybe it's impractical
> to not assume consistent order. I'm sure the developers will vote one
> way or the other. I hate to throw away a potential performance boost,
> but I also hate to burden the developers with rewriting a lot of
> regression tests when their time could be better spent elsewhere.

Certainly, but I suspect it's just a matter of adding ORDER BY to everything, which just about anyone (even myself!) should be able to do.

-- 
Jim C. Nasby, Database Consultant               decibel@decibel.org
"Jim C. Nasby" <decibel@decibel.org> writes: >> but I also hate to burden the developers with rewriting a lot of >> regression tests when their time could be better spent elsewhere. > Certainly, but I suspect it's just a matter of adding ORDER BY to > everything, which just about anyone (even myself!) should be able to do. Performance is not the issue; test coverage, however, is an issue. See the comment at the end of http://developer.postgresql.org/docs/postgres/regress-evaluation.html#AEN22383 regards, tom lane
On Fri, Feb 25, 2005 at 11:51:40PM -0500, Tom Lane wrote:
> Performance is not the issue; test coverage, however, is an issue.
> See the comment at the end of
> http://developer.postgresql.org/docs/postgres/regress-evaluation.html#AEN22383

Assuming you're talking about "You might wonder why we don't order all the regression test queries explicitly to get rid of this issue once and for all. The reason is that that would make the regression tests less useful, not more, since they'd tend to exercise query plan types that produce ordered results to the exclusion of those that don't.", good point. I can think of 2 ways around this:

1) Select into a temp table, then select out of it with an order by

2) Run the output through sort before doing the diff

Is there any reason one of these wouldn't work?

-- 
Jim C. Nasby, Database Consultant               decibel@decibel.org
I have a newer version of my Synchronized Scanning patch which hopefully makes it closer to a real patch; the first one was more of a proof of concept.

DONE:

* I added a proper bounds check for the result it gets from shared memory.

* I expanded the shared memory to be a static hash table (currently set to a size of 8KB, but that is a one-line change if that's too big). Now it can keep track of many scans on many tables at once.

TODO:

* More testing. I plan to get some more tests up on Monday, and when I start to see the best uses for this patch, I will also try to benchmark against MySQL or something else. I'm seeing some good preliminary results in cache hit rate (which is much higher in some cases), but I need to demonstrate an actual decrease in runtime.

* Right now the shared mem isn't destroyed when the postmaster shuts down.

* This patch still makes no use of locks at all. If someone thinks locks are required, please let me know. Currently, if there is inconsistent data in the shared memory area, the worst that can happen is no worse than the current behavior.

Regards,
Jeff Davis
On Fri, 2005-02-25 at 22:49, Jim C. Nasby wrote:
> On Fri, Feb 25, 2005 at 11:51:40PM -0500, Tom Lane wrote:
> > "Jim C. Nasby" <decibel@decibel.org> writes:
> > >> but I also hate to burden the developers with rewriting a lot of
> > >> regression tests when their time could be better spent elsewhere.

The patch is for *concurrent* seqscans; would the regression tests even be affected by this (IIRC they are single user, single process)?
William Volkman wrote: > The patch is for *concurrent* seqscans, would the regression tests > even be affected by this (IIRC they are single user, single process)? No; `make installcheck' is serial, but `make check' executes tests in parallel in multiple backends concurrently. -Neil
Tom Lane said:
> "Jim C. Nasby" <decibel@decibel.org> writes:
> >> but I also hate to burden the developers with rewriting a lot of
> >> regression tests when their time could be better spent elsewhere.
>
> > Certainly, but I suspect it's just a matter of adding ORDER BY to
> > everything, which just about anyone (even myself!) should be able to
> > do.
>
> Performance is not the issue; test coverage, however, is an issue. See
> the comment at the end of
> http://developer.postgresql.org/docs/postgres/regress-evaluation.html#AEN22383

Is it not possible to wrap the original query in an outer query that applies the ordering, leaving the original query to run without ordering? Would that cause a lessening of test coverage?

cheers

andrew
Jeff Davis wrote:
> I have a newer version of my Synchronized Scanning patch which hopefully
> makes it closer to a real patch, the first one was more of a proof of
> concept.

A few minor comments:

- context diffs (diff -c) are the preferred format. Also, folks usually send patches as a single diff; an easy way to generate that is via `cvs diff', or `diff -r old_dir new_dir'.

- needlessly reindenting code makes it difficult to understand what you've changed. You should probably follow the PG coding conventions WRT indentation, brace placement, and so forth, although this will be fixed by a script later in any case. See Chapter 43 of the docs.

- you don't need to (and should not) declare `static' functions in header files. If your additions to heapam.c aren't used outside of heapam.c, they needn't be declared in the header file.

- PG has an abstraction layer for using shared memory that you should take advantage of. You should do something like: (1) create a function that returns the amount of shared memory space you require; (2) invoke the function from CreateSharedMemoryAndSemaphores; (3) create/attach to and initialize the shared memory during startup, via ShmemInitStruct(). See how InitProcGlobal() works, for example.

- it makes me quite nervous to be reading and writing to shared data without using locks. If there is too much of a performance hit to acquire and release a lock for each page traversed, what about only updating the shared memory stats every K pages?

-Neil
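Neil's three steps can be sketched in standalone C by mocking out the PG internals. This is only a shape-of-the-code illustration: ShmemInitStruct_mock is a toy stand-in for PG's real ShmemInitStruct (which hands every backend the same named chunk and sets *found for all but the first caller), and the SyncScan* names are invented for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Toy stand-in for PG's ShmemInitStruct: the first caller allocates and
 * sees *found == false; later callers attach to the same chunk and see
 * *found == true.  (The real one looks the name up in the shmem index.)
 */
static void *mock_segment = NULL;

static void *
ShmemInitStruct_mock(const char *name, size_t size, bool *found)
{
    (void) name;                /* the real API keys segments by name */
    if (mock_segment == NULL)
    {
        mock_segment = calloc(1, size);
        *found = false;
    }
    else
        *found = true;
    return mock_segment;
}

typedef struct SyncScanShmem
{
    int         nslots;         /* plus the hash table itself, in the real thing */
} SyncScanShmem;

/* Step 1: a function that reports how much shared memory we need. */
static size_t
SyncScanShmemSize(void)
{
    return sizeof(SyncScanShmem);
}

/*
 * Step 3: create/attach and initialize during startup.  Step 2 would be
 * calling this (and SyncScanShmemSize) from CreateSharedMemoryAndSemaphores.
 */
static SyncScanShmem *
SyncScanShmemInit(void)
{
    bool        found;
    SyncScanShmem *seg = ShmemInitStruct_mock("Sync Scan Info",
                                              SyncScanShmemSize(), &found);

    if (!found)
        seg->nslots = 100;      /* only the first backend initializes */
    return seg;
}
```

The `found` flag is what makes this pattern safe across backends: only the creator initializes the struct, and every later attach sees the already-initialized contents.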
Thanks for the information.

On Sat, 2005-02-26 at 23:39 +1100, Neil Conway wrote:
> - context diffs (diff -c) are the preferred format. Also, folks usually
> send patches as a single diff; an easy way to generate that is via `cvs
> diff', or `diff -r old_dir new_dir'.

Ok.

> - needlessly reindenting code makes it difficult to understand what
> you've changed. You should probably follow the PG coding conventions WRT
> indentation, brace placement, and so forth, although this will be fixed
> by a script later in any case. See Chapter 43 of the docs.

The only reason I did that was because the original source was difficult to read and work on. Is there a reason that code and comments wrap around to the next line throughout the file?

> - PG has an abstraction layer for using shared memory that you should
> take advantage of. You should do something like: (1) create a function
> that returns the amount of shared memory space you require (2) invoke
> the function from CreateSharedMemoryAndSemaphores (3) create/attach to
> and initialize the shared memory during startup, via ShmemInitStruct().
> See how InitProcGlobal() works, for example.

Will do.

> - it makes me quite nervous to be reading and writing to shared data
> without using locks. If there is too much of a performance hit to
> acquire and release a lock for each page traversed, what about only
> updating the shared memory stats every K pages?

I'll try that and test it, and hopefully any performance problems appear sooner rather than later. If something appears, every K pages sounds like a plan to me.

I will get a new patch ready soon with your suggestions.

Regards,
Jeff Davis
On Sat, Feb 26, 2005 at 12:22:45AM -0800, Jeff Davis wrote:
> * I expanded the shared memory to be a static hash table (currently set
> to a size of 8KB, but that is a one line change if that's too big). Now
> it can keep track of many scans on many tables at once.

I know you're still early in this, but should this value be a GUC?

> * This patch still makes no use of locks at all. If someone thinks locks
> are required, please let me know. Currently, if there is inconsistent
> data in the shared memory area the worst that can happen is no worse
> than the current behavior.

It would be useful to have the backend put something in the logfile any time it has to revert to the default behavior. Without that, we have no way to know how often it's happening, and if it's worth trying to make it happen less.

-- 
Jim C. Nasby, Database Consultant               decibel@decibel.org
"Jim C. Nasby" <decibel@decibel.org> writes:
> Assuming you're talking about "You might wonder why we don't order all
> the regression test queries explicitly to get rid of this issue once and
> for all. The reason is that that would make the regression tests less
> useful, not more, since they'd tend to exercise query plan types that
> produce ordered results to the exclusion of those that don't.", good
> point. I can think of 2 ways around this:
>
> 1) Select into a temp table, then select out of it with an order by
>
> 2) Run the output through sort before doing the diff
>
> Is there any reason one of these wouldn't work?

Like I said originally, we could certainly devise a solution if we needed to. I was just pointing out that this is a nontrivial consideration, and I don't want to buy into it if the patch proves to offer only marginal performance improvements.

regards, tom lane
On Sat, 2005-02-26 at 10:16 -0600, Jim C. Nasby wrote:
> I know you're still early in this, but should this value be a GUC?

I don't know if we want to clutter the GUC variables with a bunch of minor things like this. Most users won't understand what they're actually getting by allocating more memory here. The only improvement would be if two tables that are very often scanned simultaneously happen to hash to the same value, in which case decreasing the shared mem has almost as much chance of solving the problem as increasing it :)

However, if people think it's worthwhile, I'll be happy to do it.

> It would be useful to have the backend put something in the logfile any
> time it has to revert to the default behavior. Without that, we have no
> way to know how often it's happening, and if it's worth trying to make
> it happen less.

Good idea. However, the first scan on a table after a server start will always be default, so I'll need to avoid excessive logging.

Regards,
Jeff Davis
Jeff Davis wrote:
> The only reason I did that was because the original source was difficult
> to read and work on. Is there a reason that code and comments wrap
> around to the next line throughout the file?

I'm not sure what you mean. Assuming your editor expands tabs to 4 spaces (which is the convention in the PostgreSQL source), lines should be about 80 characters at most. Expressions that would take more characters horizontally if left on one line are divided into multiple lines, and indented appropriately.

At any rate, the source is perfectly readable to me :) Perhaps you need to tweak your editor's configuration?

-Neil
Tom Lane wrote:
> Like I said originally, we could certainly devise a solution if we
> needed to. I was just pointing out that this is a nontrivial
> consideration, and I don't want to buy into it if the patch proves
> to offer only marginal performance improvements.

I'll bet it will not offer only marginal performance improvements. I sometimes see my 4-CPU server with three CPUs idle while the other CPU works on a long sequential scan. I hope that this patch, if it works correctly, will be used in a future PostgreSQL version.

Regards
Gaetano Mendola
On Sat, 2005-02-26 at 10:47 -0500, Tom Lane wrote:
> "Jim C. Nasby" <decibel@decibel.org> writes:
> > Assuming you're talking about "You might wonder why we don't order all
> > the regression test queries explicitly to get rid of this issue once and
> > for all. The reason is that that would make the regression tests less
> > useful, not more, since they'd tend to exercise query plan types that
> > produce ordered results to the exclusion of those that don't.", good
> > point. I can think of 2 ways around this:
> >
> > 1) Select into a temp table, then select out of it with an order by
> >
> > 2) Run the output through sort before doing the diff
> >
> > Is there any reason one of these wouldn't work?
>
> Like I said originally, we could certainly devise a solution if we
> needed to. I was just pointing out that this is a nontrivial
> consideration, and I don't want to buy into it if the patch proves
> to offer only marginal performance improvements.

Yes, the starting place is performance prototyping. Jeff has sensibly started with that thought in his initial post.

I would suggest that we use a new GUC, enable_synch_scans = off by default, to control whether this new behaviour is utilised. That way, all of the current tests would stand as-is. My feeling is that in general, we should only add tests, not alter existing ones.

It would be straightforward, even if laborious, to add some additional tests that prove that this type of system feature returns correct SQL results. However, the key seems to be that the results of SQL runs without an ORDER BY would be timing dependent, so this would actually be a wholly new type of test: we would need to run one scan on its own, then compare with two run together to check that the same answer was still returned, possibly with a post-execution external sort.

Best Regards, Simon Riggs
Added to TODO list:

* Allow sequential scans to take advantage of other concurrent
  sequential scans, also called "Synchronised Scanning"

---------------------------------------------------------------------------

Jeff Davis wrote:
> I had an idea that might improve parallel seqscans on the same relation.
>
> If you have lots of concurrent seqscans going on a large relation, the
> cache hit ratio is very low. But, if the seqscans are concurrent on the
> same relation, there may be something to gain by starting a seqscan near
> the page being accessed by an already-in-progress seqscan, and wrapping
> back around to that start location. That would make some use of the
> shared buffers, which would otherwise just be cache pollution.
>
> I made a proof-of-concept implementation, which is entirely in heapam.c,
> except for one addition to the HeapScanDesc struct in relscan.h. It is
> not at all up to production quality; there are things I know that need
> to be addressed. Basically, I just modified heapam.c to be able to start
> at any page in the relation. Then, every time it reads a new page, I
> have it mark the relation's oid and the page number in a shared mem
> segment. Every time a new scan is started, it reads the shared mem
> segment, and if the relation's oid matches, it starts the scan at the
> page number it found in the shared memory. Otherwise, it starts the scan
> at 0.
>
> There are a couple of obvious issues. One is that my whole implementation
> doesn't account for reverse scans at all (since initscan doesn't know
> what direction the scan will move in), but that shouldn't be a major
> problem since at worst it will be the current behavior (aside: can
> someone tell me how to force reverse scans so I can test that better?).
> Another is that there's a race condition with the shared mem, and that's
> out of pure laziness on my part.
>
> This method is really only effective at all if there is a significant
> amount of disk I/O.
> If it's pulling the data from O/S buffers the
> various scans will diverge too much and not be using each other's shared
> buffers.
>
> I tested with shared_buffers=500 and all stats on. I used 60 threads
> performing 30 seqscans each in my script ssf.rb (I refer to my
> modification as "sequential scan follower" or ssf).
>
> Here are some results with my modifications:
> $ time ./ssf.rb # my script
>
> real    4m22.476s
> user    0m0.389s
> sys     0m0.186s
>
> test=# select relpages from pg_class where relname='test_ssf';
>  relpages
> ----------
>      1667
> (1 row)
>
> test=# select count(*) from test_ssf;
>  count
> --------
>  200000
> (1 row)
>
> test=# select pg_stat_get_blocks_hit(17232) as hit,
> pg_stat_get_blocks_fetched(17232) as total;
>   hit   |  total
> --------+---------
>  971503 | 3353963
> (1 row)
>
> Or, approx. 29% cache hit.
>
> Here are the results without my modifications:
>
> test=# select relpages from pg_class where relname='test_ssf';
>  relpages
> ----------
>      1667
> (1 row)
>
> test=# select count(*) from test_ssf;
>  count
> --------
>  200000
> (1 row)
>
> test=# select pg_stat_get_blocks_hit(17231) as hit,
> pg_stat_get_blocks_fetched(17231) as total;
>   hit   |  total
> --------+---------
>  199999 | 3353963
> (1 row)
>
> Or, approx. 6% cache hit. Note: the oid is different, because I have two
> separately initdb'd data directories, one for the modified version, one
> for the unmodified 8.0.0.
>
> This is the first time I've really modified the PG source code to do
> anything that looked promising, so this is more of a question than
> anything else. Is it promising? Is this a potentially good approach? I'm
> happy to post more test data and more documentation, and I'd also be
> happy to bring the code to production quality. However, before I spend
> too much more time on that, I'd like to get a general response from a
> 3rd party to let me know if I'm off base.
>
> Regards,
> Jeff Davis

[ Attachment, skipping... ]
[ Attachment, skipping... ]

> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  + If your life is a hard drive,      |  13 Roberts Road
  + Christ can be your backup.         |  Newtown Square, Pennsylvania 19073
TODO description added:

* Allow sequential scans to take advantage of other concurrent
  sequential scans, also called "Synchronised Scanning"

  One possible implementation is to start sequential scans from the
  lowest numbered buffer in the shared cache, and when reaching the end
  wrap around to the beginning, rather than always starting sequential
  scans at the start of the table.

---------------------------------------------------------------------------

Jeff Davis wrote:
> I had an idea that might improve parallel seqscans on the same relation.
> [...]