Thread: Synchronized Scan update
I have found some interesting results from my tests with the Synchronized Scan patch I'm working on.

The two benefits that I hope to achieve with the patch are:
(1) Better caching behavior with multiple sequential scans running in parallel
(2) Faster sequential reads from disk and less seeking

I have consistently seen #1 to be true. There is still more testing to be done (hopefully soon), but I haven't found a problem yet. And the benefits I've seen are very substantial, which isn't hard, since in the typical case a large sequential scan will have a 0% cache hit rate. These numbers were retrieved using log_executor_stats=on.

#2, however, is a little trickier. IIRC, Tom was the first to point out that the I/O system might not recognize that reads coming from different processes are indeed one sequential read. At first I never saw the problem actually happen, and I assumed that the OS was being smart enough. However, recently I noticed this problem on my home machine, which showed great caching behavior but poor I/O throughput (as measured by iostat). My home machine was using the Linux CFQ I/O scheduler, and when I swapped the CFQ I/O scheduler for the anticipatory scheduler (AS), it worked great. When I sent Josh my patch (per his request) I mentioned the problem I experienced.

Then I started investigating, and found some mixed results. My test was basically to use iostat (or zpool iostat) to measure disk throughput while N processes of "dd if=bigfile of=/dev/null" (started simultaneously) read the same file. I consider the test "passed" if the additional processes did not interfere (i.e. each process finished as though it were the only one running). Of course, all tests were I/O bound.

My home machine (core 2 duo, single SATA disk, intel controller):
  Linux/ext3/AS: passed
  Linux/ext3/CFQ: failed
  Linux/ext3/noop: passed
  Linux/ext3/deadline: passed

Machine 2 (old thinkpad, IDE disk):
  Solaris/UFS: failed
  Solaris/ZFS: passed

Machine 3 (dell 2950, LSI PERC/5i controller, 6 SAS disks, RAID-10, adaptive read ahead):
  FreeBSD/UFS: failed

(I suspect the last test would be fine with read ahead always on, and it may just be a problem with the adaptive read ahead feature.)

There are a lot of factors involved, because several components of the I/O system have the ability to reorder requests or read ahead, such as the block layer and the controller. The block request ordering isn't the only factor, because Solaris/UFS only orders the requests by cylinder and moves only in one direction (i.e. it looks like a simple elevator algorithm that isn't affected by process id). At least, that's how I understand it. Readahead can't be the only factor either, because replacing the I/O scheduler in Linux solved the problem, even when that replacement was the noop scheduler.

Anyway, back to the patch: it looks like there are some complications if you try to use it with the wrong combination of filesystem, I/O scheduler, and controller. The patch is designed for certain query patterns anyway, so I don't think that this is a show-stopper. Given the better cache behavior, it seems like it's really the job of the I/O system to deliver a single, sequential stream of blocks efficiently. The alternative would be to have a single block-reader process, which I don't think we want to do. However, I/O systems don't really seem to know how to handle multiple processes reading from the same file very well.

Comments?

Regards,
Jeff Davis
Is there any consensus about whether to include these two parameters as GUCs or constants if my patch is to be accepted?

(1) sync_scan_threshold: Use synchronized scanning for tables greater than this many pages; smaller tables will not be affected.
(2) sync_scan_offset: Start a new scan this many pages before a currently running scan to take advantage of the pages that are likely already in cache.

Right now they are just constants defined in a header, but a GUC might make sense. I'd like to know which version is more acceptable when I submit my final patch.

Regards,
Jeff Davis
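As a rough illustration of the constants-in-a-header option being discussed (not the actual patch code: the two parameter names come from the mail above, but the values, units, and header placement here are assumptions), the settings might look something like this in C:

/* Use synchronized scanning only for tables larger than this many pages. */
#define SYNC_SCAN_THRESHOLD     100000      /* hypothetical value, in pages */

/* Start a new scan this many pages behind an already-running scan. */
#define SYNC_SCAN_OFFSET        16          /* hypothetical value, in pages */

Turning these into GUCs would mean replacing the #defines with variables registered in guc.c, which is the trade-off the mail is asking about.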
Jeff,

> Right now they are just constants defined in a header, but a GUC might
> make sense. I'd like to know which version is more acceptable when I
> submit my final patch.

As much as I hate the thought of more GUCs, until we have a solid performance profile for synch scan we probably need them. You should include the option to turn synch_scan off, such as by setting synch_scan_threshold to -1.

Oh, and remember that these now need to be able to take K/MB/GB.

These options should probably go in postgresql.conf under QUERY TUNING, with their own sub-head.

--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco
On Fri, 2007-03-02 at 15:49 -0800, Josh Berkus wrote:
> Jeff,
>
> > Right now they are just constants defined in a header, but a GUC might
> > make sense. I'd like to know which version is more acceptable when I
> > submit my final patch.
>
> As much as I hate the thought of more GUCs, until we have a solid
> performance profile for synch scan we probably need them. You should

I will include them in the final patch then.

> include the option to turn synch_scan off, such as by setting
> synch_scan_threshold to -1.

Naturally.

> Oh, and remember that these now need to be able to take K/MB/GB.

Will do.

> These options should probably go in postgresql.conf under QUERY TUNING,
> with their own sub-head.

That makes sense to me.

Regards,
Jeff Davis

PS: Did you happen to get my patch for testing (sent off-list)? If testing will take a while, that's OK, I'd just like to know whether to expect the results before feature freeze.
Jeff,

> PS: Did you happen to get my patch for testing (sent off-list)? If
> testing will take a while, that's OK, I'd just like to know whether to
> expect the results before feature freeze.

I'm not sure. We have a bunch of patches in our queue to test, and the benchmark guys don't really expect synch scan to affect OLTP benchmarks much. You might want to pester Greenplum about testing on TPCH.

--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco
On Fri, 2007-03-02 at 15:03 -0800, Jeff Davis wrote:
> Is there any consensus about whether to include these two parameters as
> GUCs or constants if my patch is to be accepted?
>
> (1) sync_scan_threshold: Use synchronized scanning for tables greater
> than this many pages; smaller tables will not be affected.

That sounds OK.

> (2) sync_scan_offset: Start a new scan this many pages before a
> currently running scan to take advantage of the pages
> that are likely already in cache.

I'm somewhat dubious about this parameter, I have to say, even though I am eager for this feature. It seems like a "magic" parameter that works only when we have the right knowledge to set it correctly.

How will we know what to default it to, and how will we know whether to set it higher or lower for better performance? Does that value vary according to the workload on the system? How?

I'm worried that we get a feature that works well in simple tests and not at all in real-world circumstances. I don't want to cast doubt on what could be a great patch or be negative: I just see that the feature relies on the dynamic behaviour of the system. I'd like to see some further studies on how this works to make sure that we realistically know how to set this knob, that it's the correct knob, and that it is the only one we need.

Further thoughts: It sounds like sync_scan_offset is related to effective_cache_size. Can you comment on whether that might be something we can use as well/instead? (i.e. set the scan offset to, say, K * effective_cache_size, 0.1 <= K <= 0.5)

Might we do roughly the same thing with sync_scan_threshold as well, and just have enable_sync_scan instead? i.e. sync_scan_threshold = effective_cache_size? When would those two parameters not be connected directly to each other?

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
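A small sketch of the derivation Simon suggests above, purely illustrative and not from any patch: the function name and the factor K are assumptions, and effective_cache_size is assumed to be the existing planner setting, available as a C global measured in pages, with the usual PostgreSQL headers included.

#define SYNC_SCAN_OFFSET_FACTOR 0.25    /* hypothetical K, somewhere in 0.1 .. 0.5 */

static BlockNumber
sync_scan_offset_pages(void)
{
    /* Derive the offset from the cache-size estimate instead of a separate knob. */
    return (BlockNumber) (SYNC_SCAN_OFFSET_FACTOR * effective_cache_size);
}

The question in the mail above is whether such a derivation is reliable enough to replace an explicit sync_scan_offset setting.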
> > (2) sync_scan_offset: Start a new scan this many pages before a
> > currently running scan to take advantage of the pages
> > that are likely already in cache.
>
> I'm somewhat dubious about this parameter, I have to say, even though I
> am eager for this feature. It seems like a "magic" parameter that works
> only when we have the right knowledge to set it correctly.

Hello,

Don't get me wrong, I want things to be easily understandable as well, but the reason you cite above pretty much makes us need to remove most of the postgresql.conf, including all bgwriter, vacuum cost delay, and autovac settings. Not to mention commit delay and others ;).

Sincerely,

Joshua D. Drake
On Sun, 2007-03-04 at 11:54 +0000, Simon Riggs wrote:
> > (2) sync_scan_offset: Start a new scan this many pages before a
> > currently running scan to take advantage of the pages
> > that are likely already in cache.
>
> I'm somewhat dubious about this parameter, I have to say, even though I
> am eager for this feature. It seems like a "magic" parameter that works
> only when we have the right knowledge to set it correctly.

That was my concern about this parameter also.

> How will we know what to default it to and how will we know whether to
> set it higher or lower for better performance? Does that value vary
> according to the workload on the system? How?

Perhaps people would only set this parameter when they know it will help, and for more complex (or varied) usage patterns they'd set sync_scan_offset to 0 to be safe.

My thinking on the subject (and this is only backed up by very basic tests) is that there are basically two situations where setting this parameter too high can hurt:

(1) It's too close to the limits of your physical memory, and you end up diverging the scans when they could be kept together.
(2) You're using a lot of CPU and the backends aren't processing the buffers as fast as your I/O system is delivering them. This will prevent the scans from converging.

If your CPUs are well below capacity and you choose a size significantly less than your effective cache size, I don't think it will hurt.

> I'm worried that we get a feature that works well in simple tests and
> not at all in real-world circumstances. I don't want to cast doubt on
> what could be a great patch or be negative: I just see that the feature
> relies on the dynamic behaviour of the system. I'd like to see some
> further studies on how this works to make sure that we realistically
> know how to set this knob, that it's the correct knob, and that it is
> the only one we need.

I will do some better tests on some better hardware this week and next week. I hope that sheds some light.

> Further thoughts: It sounds like sync_scan_offset is related to
> effective_cache_size. Can you comment on whether that might be
> something we can use as well/instead? (i.e. set the scan offset to,
> say, K * effective_cache_size, 0.1 <= K <= 0.5)
>
> Might we do roughly the same thing with sync_scan_threshold as well, and
> just have enable_sync_scan instead? i.e. sync_scan_threshold =
> effective_cache_size? When would those two parameters not be connected
> directly to each other?

Originally, these parameters were in terms of the effective_cache_size. Somebody else convinced me that it was too confusing to have the variables dependent on each other, so I made them independent. I don't have a strong opinion either way.

Regards,
Jeff Davis
JD,

> Don't get me wrong, I want things to be easily understandable as well,
> but the reason you cite above pretty much makes us need to remove most
> of the postgresql.conf, including all bgwriter, vacuum cost delay, and
> autovac settings. Not to mention commit delay and others ;).

Wouldn't that be nice!

The explosion of GUC settings is primarily a result of not enough information. The reason there are 7 bgwriter settings, for example, is that we have no idea what those settings should be and are hoping that people will tinker with them and tell us. Someday when I can fully profile bgwriter, we'll just have one setting: bgwriter_aggressive, set to a number between 0 and 9.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco
On Mar 6, 2007, at 9:43 AM, Josh Berkus wrote:
>> Don't get me wrong, I want things to be easily understandable as well,
>> but the reason you cite above pretty much makes us need to remove most
>> of the postgresql.conf, including all bgwriter, vacuum cost delay, and
>> autovac settings. Not to mention commit delay and others ;).
>
> Wouldn't that be nice!
>
> The explosion of GUC settings is primarily a result of not enough
> information. The reason there are 7 bgwriter settings, for example, is
> that we have no idea what those settings should be and are hoping that
> people will tinker with them and tell us. Someday when I can fully
> profile bgwriter, we'll just have one setting: bgwriter_aggressive,
> set to a number between 0 and 9.

In the mean time, it would be great for these multiple-settings cases to be listed somewhere, indicating that it's something we could use help with. I think that with some explanation of what we're looking for there's any number of people who could do this kind of profiling.

--
Jim Nasby   jim@nasby.net
EnterpriseDB   http://enterprisedb.com   512.569.9461 (cell)
On Fri, 2007-03-02 at 15:03 -0800, Jeff Davis wrote:
> Is there any consensus about whether to include these two parameters as
> GUCs or constants if my patch is to be accepted?
>
> (1) sync_scan_threshold: Use synchronized scanning for tables greater
> than this many pages; smaller tables will not be affected.
> (2) sync_scan_offset: Start a new scan this many pages before a
> currently running scan to take advantage of the pages
> that are likely already in cache.
>
> Right now they are just constants defined in a header, but a GUC might
> make sense. I'd like to know which version is more acceptable when I
> submit my final patch.

I'm looking at ways of helping Synch Scan and scan_recycle_buffers to work well together. I've had a look through the synch scan patch in more detail to help this along. ISTM they can play nicely together.

Results from tests show that with a single scan we get better performance with scan_recycle_buffers = ~32. That gives a +25% performance bonus. Synch scan has the capability to reduce I/O by 50% across two concurrent scans, so that would be likely to outweigh the gain from L2 cache reduction, unless we can get both together.

It's possible that multiple scans would use the same CPU, or at least use multiple cores on the same chip and hence the same L2 cache. That's more likely if the synch scan works well, since the second scan can process the buffer while the lead process performs I/O. So it's likely that on dual and quad core CPUs we'll be able to get both benefits in most cases.

So based on those thoughts, sync_scan_offset should be fixed at 16, rather than being variable. In addition, ss_report_loc() should only report its position every 16 blocks, rather than do this every time, which will reduce overhead of this call.

To match that, scan_recycle_buffers should be fixed at 32. So GUCs for sync_scan_offset and scan_recycle_buffers would not be required at all.

IMHO we can also remove sync_scan_threshold and just use NBuffers instead. That way we get the benefit of both patches or neither, making it easier to understand what's going on. On my patch, I'll make buffer recycling only work for SeqScans and VACUUMs.

If need be, the value of scan_recycle_buffers can be varied upwards should the scans drift apart, as a way of bringing them back together. We aren't tracking whether they are together or apart, so I would like to see some debug output from synch scans to allow us to assess how far behind the second scan is as it progresses, e.g.

  LOG: synch scan currently on block N, trailing pathfinder by M blocks

issued every 128 blocks as we go through the scans.

Thoughts?

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Simon,

On 3/12/07 6:21 AM, "Simon Riggs" <simon@2ndquadrant.com> wrote:

> So based on those thoughts, sync_scan_offset should be fixed at 16,
> rather than being variable. In addition, ss_report_loc() should only
> report its position every 16 blocks, rather than do this every time,
> which will reduce overhead of this call.

And for N concurrent scans?

I think there is actually no need to synchronize the shared buffers at all for synchronized scans. The OS I/O cache will do that for us and we're just going to interfere and pessimize by trying to divide up a local buffer.

I suggest that this be proven or disproved by running this test: measure the performance of syncscan with the non-polluting buffer change, then measure with Jeff's patch and non-polluting with multiple scans, then measure with your suggested changes to synchronize the buffers. Somewhere in that progression we'll learn more about how the multi-level buffering really works. I think we'll get all the shared I/O cache we need.

- Luke
On Mon, 2007-03-12 at 08:42 -0700, Luke Lonergan wrote:
> On 3/12/07 6:21 AM, "Simon Riggs" <simon@2ndquadrant.com> wrote:
>
> > So based on those thoughts, sync_scan_offset should be fixed at 16,
> > rather than being variable. In addition, ss_report_loc() should only
> > report its position every 16 blocks, rather than do this every time,
> > which will reduce overhead of this call.
>
> And for N concurrent scans?
>
> I think there is actually no need to synchronize the shared buffers at all
> for synchronized scans. The OS I/O cache will do that for us and we're just
> going to interfere and pessimize by trying to divide up a local buffer.

I think you've misunderstood my comment slightly. In Jeff's patch, ss_report_loc() is called after every block is read, AFAICS. I was suggesting that we don't do it that frequently, to reduce the overhead of reporting the location. That has nothing to do with re-synchronising the two scans mid-way.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
On Mon, 2007-03-12 at 13:21 +0000, Simon Riggs wrote:
> So based on those thoughts, sync_scan_offset should be fixed at 16,
> rather than being variable. In addition, ss_report_loc() should only
> report its position every 16 blocks, rather than do this every time,
> which will reduce overhead of this call.

If we fix sync_scan_offset at 16, we might as well just get rid of it. Sync scans are only useful on large tables, and getting a free 16 pages over a scan isn't worth the trouble. However, even without sync_scan_offset, sync scans are still a valuable feature.

I agree that ss_report_loc() doesn't need to report on every call. If there's any significant overhead I agree that it should report less often. Do you think that the overhead is significant on such a simple function?

> To match that, scan_recycle_buffers should be fixed at 32. So GUCs for
> sync_scan_offset and scan_recycle_buffers would not be required at all.
>
> IMHO we can also remove sync_scan_threshold and just use NBuffers
> instead. That way we get the benefit of both patches or neither, making
> it easier to understand what's going on.

I like the idea of reducing tuning parameters, but we should, at a minimum, still allow an on/off button for sync scans. My tests revealed that the wrong combination of OS/FS/IO-Scheduler/Controller could result in bad I/O behavior.

> If need be, the value of scan_recycle_buffers can be varied upwards
> should the scans drift apart, as a way of bringing them back together.

If the scans aren't being brought together, that means that one of the scans is CPU bound or outside the combined cache trail (shared_buffers + OS buffer cache).

> We aren't tracking whether they are together or apart, so I would like
> to see some debug output from synch scans to allow us to assess how far
> behind the second scan is as it progresses. e.g.
> LOG: synch scan currently on block N, trailing pathfinder by M blocks
> issued every 128 blocks as we go through the scans.
>
> Thoughts?

It's hard to track where all the scans are currently. One of the advantages of my patch is its simplicity: the scans don't need to know about other specific scans, and there is no concept in the code of a "head" scan or a "pack". There is no easy way to tell which scan is ahead and which is behind.

There was a discussion when I submitted this proposal at the beginning of 8.3, but I didn't see enough benefit to justify all of the costs and risks associated with scans communicating with each other. I certainly can't implement that kind of thing before feature freeze, and I think there's a risk of lock contention for the communication required. I'm also concerned that -- if the scans are too interdependent -- it would make postgres less robust against the disappearance of a single backend (i.e. what if the backend that is leading a scan dies?).

Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> I agree that ss_report_loc() doesn't need to report on every call. If
> there's any significant overhead I agree that it should report less
> often. Do you think that the overhead is significant on such a simple
> function?

One extra LWLock cycle per page processed definitely *is* a significant overhead ... can you say "context swap storm"? I'd think about doing it once every 100 or so pages.

regards, tom lane
On Tue, 2007-03-13 at 12:53 -0400, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > I agree that ss_report_loc() doesn't need to report on every call. If
> > there's any significant overhead I agree that it should report less
> > often. Do you think that the overhead is significant on such a simple
> > function?
>
> One extra LWLock cycle per page processed definitely *is* a significant
> overhead ... can you say "context swap storm"? I'd think about doing it
> once every 100 or so pages.

No lock is needed to store the hint. If somehow the hint (which is stored in a static table, no pointers) gets invalid data due to a race condition, the new scan will simply consider the hint invalid and start at 0.

I did this precisely to avoid causing a performance regression for usage patterns that don't benefit from sync scans.

Regards,
Jeff Davis
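To make the lock-free hint idea above concrete, here is a minimal sketch (not the patch itself; the struct, table size, and hashing scheme are assumptions, and only the name ss_report_loc() comes from the thread). Each slot is plain data with no pointers, so a racy or stale entry at worst makes a new scan treat the hint as invalid and start at block 0:

/* Hypothetical fixed-size hint table kept in shared memory. */
#define SYNC_SCAN_NELEM 64

typedef struct ss_scan_loc
{
    Oid         relid;      /* table being scanned */
    BlockNumber blockno;    /* last block reported for that table */
} ss_scan_loc;

static ss_scan_loc *sync_scan_locs;     /* points into shared memory */

/* Store a hint; no lock taken, a race simply produces an invalid hint. */
static void
ss_report_loc(Oid relid, BlockNumber blockno)
{
    ss_scan_loc *slot = &sync_scan_locs[relid % SYNC_SCAN_NELEM];

    slot->relid = relid;
    slot->blockno = blockno;
}

/* Read a hint; anything suspicious just means "start at block 0". */
static BlockNumber
ss_get_start_loc(Oid relid, BlockNumber nblocks)
{
    ss_scan_loc *slot = &sync_scan_locs[relid % SYNC_SCAN_NELEM];

    if (slot->relid != relid || slot->blockno >= nblocks)
        return 0;
    return slot->blockno;
}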
On Mon, 2007-03-12 at 17:46 -0700, Jeff Davis wrote:
> On Mon, 2007-03-12 at 13:21 +0000, Simon Riggs wrote:
> > So based on those thoughts, sync_scan_offset should be fixed at 16,
> > rather than being variable. In addition, ss_report_loc() should only
> > report its position every 16 blocks, rather than do this every time,
> > which will reduce overhead of this call.
>
> If we fix sync_scan_offset at 16, we might as well just get rid of it.
> Sync scans are only useful on large tables, and getting a free 16 pages
> over a scan isn't worth the trouble. However, even without
> sync_scan_offset,

Not sure what you mean by "a free 16 pages". Please explain?

> sync scans are still a valuable feature.

I have always thought synch scans are a valuable feature too.

> I agree that ss_report_loc() doesn't need to report on every call. If
> there's any significant overhead I agree that it should report less
> often. Do you think that the overhead is significant on such a simple
> function?

Let's try without it and see. There's no need to access shared memory so often.

> I like the idea of reducing tuning parameters, but we should, at a
> minimum, still allow an on/off button for sync scans. My tests revealed
> that the wrong combination of OS/FS/IO-Scheduler/Controller could result
> in bad I/O behavior.

Agreed

> > If need be, the value of scan_recycle_buffers can be varied upwards
> > should the scans drift apart, as a way of bringing them back together.
>
> If the scans aren't being brought together, that means that one of the
> scans is CPU bound or outside the combined cache trail (shared_buffers
> + OS buffer cache).
>
> > We aren't tracking whether they are together or apart, so I would like
> > to see some debug output from synch scans to allow us to assess how far
> > behind the second scan is as it progresses. e.g.
> > LOG: synch scan currently on block N, trailing pathfinder by M blocks
> > issued every 128 blocks as we go through the scans.
> >
> > Thoughts?
>
> It's hard to track where all the scans are currently. One of the
> advantages of my patch is its simplicity: the scans don't need to know
> about other specific scans, and there is no concept in the code of a
> "head" scan or a "pack".

I'd still like to be able to trace each scan to see how far ahead/behind it is from the other scans on the same table, however we do that.

Any backend can read the position of other backends' scans, so it should be easy enough to put in a regular LOG entry that shows how far ahead/behind they are from other scans. We can trace just one backend and have it report on where it is with respect to other backends, or you could have them all calculate their position and have just the lead scan report the position of all other scans.

> There is no easy way to tell which scan is ahead and which is behind.
> There was a discussion when I submitted this proposal at the beginning
> of 8.3, but I didn't see enough benefit to justify all of the costs and
> risks associated with scans communicating with each other. I
> certainly can't implement that kind of thing before feature freeze, and
> I think there's a risk of lock contention for the communication
> required. I'm also concerned that -- if the scans are too
> interdependent -- it would make postgres less robust against the
> disappearance of a single backend (i.e. what if the backend that is
> leading a scan dies?).

I've not mentioned that again.

I'd like to see the trace option to allow us to tell whether it's working as well as we'd like it to pre-release and in production. Also I want to see whether various settings of scan_recycle_buffers help/hinder the effectiveness of synch scans, as others have worried it might.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
On Tue, 2007-03-13 at 10:08 -0700, Jeff Davis wrote:
> > One extra LWLock cycle per page processed definitely *is* a significant
> > overhead ... can you say "context swap storm"? I'd think about doing it
> > once every 100 or so pages.
>
> No lock is needed to store the hint. If somehow the hint (which is
> stored in a static table, no pointers) gets invalid data due to a race
> condition, the new scan will simply consider the hint invalid and start
> at 0.
>
> I did this precisely to avoid causing a performance regression for usage
> patterns that don't benefit from sync scans.

I'd also like to add that, if a lock were required, a constant offset would also seem prone to a context swap storm; it would just happen 1/100th as often. We'd need to do something to spread the locks over time.

That being said, I'll adjust it to report once per hundred pages anyway, because there's really no drawback.

Regards,
Jeff Davis
On Tue, 2007-03-13 at 10:08 -0700, Jeff Davis wrote:
> On Tue, 2007-03-13 at 12:53 -0400, Tom Lane wrote:
> > Jeff Davis <pgsql@j-davis.com> writes:
> > > I agree that ss_report_loc() doesn't need to report on every call. If
> > > there's any significant overhead I agree that it should report less
> > > often. Do you think that the overhead is significant on such a simple
> > > function?
> >
> > One extra LWLock cycle per page processed definitely *is* a significant
> > overhead ... can you say "context swap storm"? I'd think about doing it
> > once every 100 or so pages.
>
> No lock is needed to store the hint. If somehow the hint (which is
> stored in a static table, no pointers) gets invalid data due to a race
> condition, the new scan will simply consider the hint invalid and start
> at 0.
>
> I did this precisely to avoid causing a performance regression for usage
> patterns that don't benefit from sync scans.

Shared memory access is still a performance/scalability concern because so many people want access to it at the same time.

There really is no need to do this after each block. 8 CPUs ought to be able to do 8 scans without tripping over each other. Especially if they are on separate tables.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
On Tue, 2007-03-13 at 17:17 +0000, Simon Riggs wrote:
> On Tue, 2007-03-13 at 10:08 -0700, Jeff Davis wrote:
> > On Tue, 2007-03-13 at 12:53 -0400, Tom Lane wrote:
> > > Jeff Davis <pgsql@j-davis.com> writes:
> > > > I agree that ss_report_loc() doesn't need to report on every call. If
> > > > there's any significant overhead I agree that it should report less
> > > > often. Do you think that the overhead is significant on such a simple
> > > > function?
> > >
> > > One extra LWLock cycle per page processed definitely *is* a significant
> > > overhead ... can you say "context swap storm"? I'd think about doing it
> > > once every 100 or so pages.
> >
> > No lock is needed to store the hint. If somehow the hint (which is
> > stored in a static table, no pointers) gets invalid data due to a race
> > condition, the new scan will simply consider the hint invalid and start
> > at 0.
> >
> > I did this precisely to avoid causing a performance regression for usage
> > patterns that don't benefit from sync scans.
>
> Shared memory access is still a performance/scalability concern because
> so many people want access to it at the same time.
>
> There really is no need to do this after each block. 8 CPUs ought to be
> able to do 8 scans without tripping over each other. Especially if they
> are on separate tables.

Ok, I'll do it every 100 pages.

Regards,
Jeff Davis
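A sketch of the throttle agreed on here, assuming the usual PostgreSQL headers (the interval constant and the wrapper function are illustrative; ss_report_loc() is the reporting function named earlier in the thread, with the hypothetical signature from the sketch above):

#define SYNC_SCAN_REPORT_INTERVAL 100   /* pages between hint updates */

static void
ss_maybe_report_loc(Oid relid, BlockNumber blockno)
{
    /* Touch shared memory only once every N pages to limit contention. */
    if ((blockno % SYNC_SCAN_REPORT_INTERVAL) == 0)
        ss_report_loc(relid, blockno);
}

Calling something like this from the heapscan loop, instead of calling ss_report_loc() on every page, keeps the shared-memory traffic to roughly one store per 100 blocks read.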
On Tue, 2007-03-13 at 12:53 -0400, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > I agree that ss_report_loc() doesn't need to report on every call. If
> > there's any significant overhead I agree that it should report less
> > often. Do you think that the overhead is significant on such a simple
> > function?
>
> One extra LWLock cycle per page processed definitely *is* a significant
> overhead ... can you say "context swap storm"? I'd think about doing it
> once every 100 or so pages.

Can't we do it in some lock-free way? Writing page numbers (4-byte ints) into a predetermined location in shared mem should be atomic on all platforms we support (it still may cause some cache ping-pong in multiprocessor systems, but this should be much cheaper), and even an occasional error in establishing the "scan head" should not be catastrophic.

--
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
On Tue, 2007-03-13 at 17:11 +0000, Simon Riggs wrote:
> On Mon, 2007-03-12 at 17:46 -0700, Jeff Davis wrote:
> > On Mon, 2007-03-12 at 13:21 +0000, Simon Riggs wrote:
> > > So based on those thoughts, sync_scan_offset should be fixed at 16,
> > > rather than being variable. In addition, ss_report_loc() should only
> > > report its position every 16 blocks, rather than do this every time,
> > > which will reduce overhead of this call.
> >
> > If we fix sync_scan_offset at 16, we might as well just get rid of it.
> > Sync scans are only useful on large tables, and getting a free 16 pages
> > over a scan isn't worth the trouble. However, even without
> > sync_scan_offset,
>
> Not sure what you mean by "a free 16 pages". Please explain?

By "free" I mean already in cache, and therefore we don't have to do I/O to get it. I used the term loosely above, so let me re-explain:

My only point was that 16 is essentially 0 when it comes to sync_scan_offset, because it's a small number of blocks over the course of the scan of a large table.

If sync_scan_offset is 0, my patch will cause scans on a big table to start where other scans are, and those scans should tend to stay together and use newly-cached pages efficiently (and achieve the primary goal of the patch).

The advantage of sync_scan_offset is that, in some situations, a second scan can actually finish faster than if it were the only query executing, because a previous scan has already caused some blocks to be cached. However, 16 is a small number because that benefit would only come once per scan, and sync scans are only helpful on large tables.

> > I like the idea of reducing tuning parameters, but we should, at a
> > minimum, still allow an on/off button for sync scans. My tests revealed
> > that the wrong combination of OS/FS/IO-Scheduler/Controller could result
> > in bad I/O behavior.
>
> Agreed

Do you have an opinion about sync_scan_threshold versus a simple sync_scan_enable?

> I'd still like to be able to trace each scan to see how far ahead/behind
> it is from the other scans on the same table, however we do that.
>
> Any backend can read the position of other backends' scans, so it should

Where is that information stored? Right now my patch will overwrite the hints of other backends, because I'm using a static data structure (rather than one that grows). I do this to avoid the need for locking.

> be easy enough to put in a regular LOG entry that shows how far
> ahead/behind they are from other scans. We can trace just one backend
> and have it report on where it is with respect to other backends, or you
> could have them all calculate their position and have just the lead scan
> report the position of all other scans.

I already have each backend log its progression through the tablescan every 100k blocks to DEBUG (higher DEBUG gives every 10k blocks). I currently use this information to see whether scans are staying together or not. I think this gives us the information we need without backends needing to communicate the information during execution.

I think I will increase the resolution of the scan progress so that we can track every 5k or even 1k blocks read per pid per scan. That might tell us more about the shared memory usage versus OS cache.

Is there any other information you need reported?

> I'd like to see the trace option to allow us to tell whether it's working
> as well as we'd like it to pre-release and in production. Also I want to
> see whether various settings of scan_recycle_buffers help/hinder the
> effectiveness of synch scans, as others have worried it might.

Can you tell me what you mean by trace option, if you mean something different than tracking the relative positions of the scans?

I will update my patch and send it along so that we can see how they work together.

Regards,
Jeff Davis
On Tue, 2007-03-13 at 11:28 -0700, Jeff Davis wrote:
> On Tue, 2007-03-13 at 17:11 +0000, Simon Riggs wrote:
> > On Mon, 2007-03-12 at 17:46 -0700, Jeff Davis wrote:
> > > On Mon, 2007-03-12 at 13:21 +0000, Simon Riggs wrote:
> > > > So based on those thoughts, sync_scan_offset should be fixed at 16,
> > > > rather than being variable. In addition, ss_report_loc() should only
> > > > report its position every 16 blocks, rather than do this every time,
> > > > which will reduce overhead of this call.
> > >
> > > If we fix sync_scan_offset at 16, we might as well just get rid of it.
> > > Sync scans are only useful on large tables, and getting a free 16 pages
> > > over a scan isn't worth the trouble. However, even without
> > > sync_scan_offset,
> >
> > Not sure what you mean by "a free 16 pages". Please explain?
>
> By "free" I mean already in cache, and therefore we don't have to do I/O
> to get it. I used the term loosely above, so let me re-explain:
>
> My only point was that 16 is essentially 0 when it comes to
> sync_scan_offset, because it's a small number of blocks over the course
> of the scan of a large table.
>
> If sync_scan_offset is 0, my patch will cause scans on a big table to
> start where other scans are, and those scans should tend to stay
> together and use newly-cached pages efficiently (and achieve the primary
> goal of the patch).

OK

> The advantage of sync_scan_offset is that, in some situations, a second
> scan can actually finish faster than if it were the only query
> executing, because a previous scan has already caused some blocks to be
> cached. However, 16 is a small number because that benefit would only
> come once per scan, and sync scans are only helpful on large tables.

Alright, understood. That last part is actually something I now want to avoid because it's using the current cache-spoiling behaviour of seqscans to advantage. I'd like to remove that behaviour, but it sounds like we can have both
- SeqScans that don't spoil cache
- Synch scans
by setting "sync_scan_offset" to zero.

> > > I like the idea of reducing tuning parameters, but we should, at a
> > > minimum, still allow an on/off button for sync scans. My tests revealed
> > > that the wrong combination of OS/FS/IO-Scheduler/Controller could result
> > > in bad I/O behavior.
> >
> > Agreed
>
> Do you have an opinion about sync_scan_threshold versus a simple
> sync_scan_enable?

enable_sync_scan?

> > I'd still like to be able to trace each scan to see how far ahead/behind
> > it is from the other scans on the same table, however we do that.
> >
> > Any backend can read the position of other backends' scans, so it should
>
> Where is that information stored? Right now my patch will overwrite the
> hints of other backends, because I'm using a static data structure
> (rather than one that grows). I do this to avoid the need for locking.

OK, well, we can still read it before we overwrite it to calc the difference. That will at least allow us to get a difference between points as we go along. That seems like it's worth having, even if it isn't accurate for 3+ concurrent scans.

> > be easy enough to put in a regular LOG entry that shows how far
> > ahead/behind they are from other scans. We can trace just one backend
> > and have it report on where it is with respect to other backends, or you
> > could have them all calculate their position and have just the lead scan
> > report the position of all other scans.
>
> I already have each backend log its progression through the tablescan
> every 100k blocks to DEBUG (higher DEBUG gives every 10k blocks). I
> currently use this information to see whether scans are staying together
> or not. I think this gives us the information we need without backends
> needing to communicate the information during execution.

Well, that is good, thank you for adding that after initial discussions.

Does it have the time at which a particular numbered block is reached? (i.e. Block #117 is not the same thing as the 117th block scanned). We can use that to compare the time difference of each scan.

> I think I will increase the resolution of the scan progress so that we
> can track every 5k or even 1k blocks read per pid per scan. That might
> tell us more about the shared memory usage versus OS cache.
>
> Is there any other information you need reported?

Not sure yet! I just want to look one level deeper, to see if everything is working like we think it should.

> > I'd like to see the trace option to allow us to tell whether it's working
> > as well as we'd like it to pre-release and in production. Also I want to
> > see whether various settings of scan_recycle_buffers help/hinder the
> > effectiveness of synch scans, as others have worried it might.
>
> Can you tell me what you mean by trace option, if you mean something
> different than tracking the relative positions of the scans?
>
> I will update my patch and send it along so that we can see how they
> work together.

Great

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
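A tiny sketch of the "read it before we overwrite it" idea in the mail above (illustrative only; the delta logging and the function name are assumptions layered on the hypothetical hint slot sketched earlier in this thread):

static void
ss_report_loc_with_delta(Oid relid, BlockNumber blockno)
{
    ss_scan_loc *slot = &sync_scan_locs[relid % SYNC_SCAN_NELEM];
    BlockNumber prev = slot->blockno;

    /* Log how far this scan is from the last reported position before overwriting it. */
    if (slot->relid == relid)
        elog(DEBUG2, "synch scan: rel %u at block %u, last reported position was block %u",
             relid, blockno, prev);

    slot->relid = relid;
    slot->blockno = blockno;
}

As Simon notes, this only captures the difference between successive reports, so it is approximate with three or more concurrent scans.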
On Tue, 2007-03-13 at 19:24 +0000, Simon Riggs wrote:
> > The advantage of sync_scan_offset is that, in some situations, a second
> > scan can actually finish faster than if it were the only query
> > executing, because a previous scan has already caused some blocks to be
> > cached. However, 16 is a small number because that benefit would only
> > come once per scan, and sync scans are only helpful on large tables.
>
> Alright, understood. That last part is actually something I now want to
> avoid because it's using the current cache-spoiling behaviour of
> seqscans to advantage. I'd like to remove that behaviour, but it sounds
> like we can have both
> - SeqScans that don't spoil cache
> - Synch scans
> by setting "sync_scan_offset" to zero.

Precisely. If there is a cache-spoiling effect of the OS buffer cache that we want to take advantage of, we could still set it to a non-zero value. But the utility of sync_scan_offset does decrease with your patch, so removing it altogether is a possibility (hopefully the numbers will tell us what to do).

> > Do you have an opinion about sync_scan_threshold versus a simple
> > sync_scan_enable?
>
> enable_sync_scan?

After looking at other GUC names, I suggest that it's either "sync_scan" (for on/off) or "sync_scan_threshold" (if we do want to allow a numerical threshold). All the GUCs beginning with "enable_" are planner settings.

If we only allow on/off, we could probably just sync scan every table because of your recycle_buffers patch.

> > > I'd still like to be able to trace each scan to see how far ahead/behind
> > > it is from the other scans on the same table, however we do that.
> > >
> > > Any backend can read the position of other backends' scans, so it should
> >
> > Where is that information stored? Right now my patch will overwrite the
> > hints of other backends, because I'm using a static data structure
> > (rather than one that grows). I do this to avoid the need for locking.
>
> OK, well, we can still read it before we overwrite it to calc the
> difference. That will at least allow us to get a difference between
> points as we go along. That seems like it's worth having, even if it
> isn't accurate for 3+ concurrent scans.

Let me know if the things I list below don't cover the information you're looking for here. It would be easy for me to emit a log message at the time it's overwriting the hint, but that would be a lot of noise: every time ss_report_loc() is called, which we discussed would be once per 100 pages read per scan.

> > > be easy enough to put in a regular LOG entry that shows how far
> > > ahead/behind they are from other scans. We can trace just one backend
> > > and have it report on where it is with respect to other backends, or you
> > > could have them all calculate their position and have just the lead scan
> > > report the position of all other scans.
> >
> > I already have each backend log its progression through the tablescan
> > every 100k blocks to DEBUG (higher DEBUG gives every 10k blocks). I
> > currently use this information to see whether scans are staying together
> > or not. I think this gives us the information we need without backends
> > needing to communicate the information during execution.
>
> Well, that is good, thank you for adding that after initial discussions.
>
> Does it have the time at which a particular numbered block is reached?
> (i.e. Block #117 is not the same thing as the 117th block scanned). We
> can use that to compare the time difference of each scan.

Right now it logs when a scan starts, what block number of the table it starts on, and also prints out the current block it's scanning every N blocks (100k or 10k depending on debug level). The time and the pid are, of course, available from log_prefix.

I'll add the table OID to each log message in case we test, for example, a single backend scanning multiple tables at once. I'll also clean it up a bit, so that the information is a little easier to grep out of the postgres logfiles and easier to analyze/graph.

Regards,
Jeff Davis
On Tue, 2007-03-13 at 13:39 -0700, Jeff Davis wrote:
> > > Do you have an opinion about sync_scan_threshold versus a simple
> > > sync_scan_enable?
> >
> > enable_sync_scan?
>
> After looking at other GUC names, I suggest that it's either
> "sync_scan" (for on/off) or "sync_scan_threshold" (if we do want to
> allow a numerical threshold). All the GUCs beginning with "enable_" are
> planner settings.

How about sync_seqscans, so the phrase matches the equivalent enable_ parameter?

> If we only allow on/off, we could probably just sync scan every table
> because of your recycle_buffers patch.

The buffer recycling only makes sense for large scans, so there's an exact match for when both techniques need to kick in. I think I'd just lose this parameter and have it kick in at either NBuffers or NBuffers/2. We don't need another parameter... I'm not planning to have scan_recycle_buffers continue into the production version.

> > > > I'd still like to be able to trace each scan to see how far ahead/behind
> > > > it is from the other scans on the same table, however we do that.
> > > >
> > > > Any backend can read the position of other backends' scans, so it should
> > >
> > > Where is that information stored? Right now my patch will overwrite the
> > > hints of other backends, because I'm using a static data structure
> > > (rather than one that grows). I do this to avoid the need for locking.
> >
> > OK, well, we can still read it before we overwrite it to calc the
> > difference. That will at least allow us to get a difference between
> > points as we go along. That seems like it's worth having, even if it
> > isn't accurate for 3+ concurrent scans.
>
> Let me know if the things I list below don't cover the information
> you're looking for here. It would be easy for me to emit a log message
> at the time it's overwriting the hint, but that would be a lot of noise:
> every time ss_report_loc() is called, which we discussed would be once
> per 100 pages read per scan.
>
> > > > be easy enough to put in a regular LOG entry that shows how far
> > > > ahead/behind they are from other scans. We can trace just one backend
> > > > and have it report on where it is with respect to other backends, or you
> > > > could have them all calculate their position and have just the lead scan
> > > > report the position of all other scans.
> > >
> > > I already have each backend log its progression through the tablescan
> > > every 100k blocks to DEBUG (higher DEBUG gives every 10k blocks). I
> > > currently use this information to see whether scans are staying together
> > > or not. I think this gives us the information we need without backends
> > > needing to communicate the information during execution.
> >
> > Well, that is good, thank you for adding that after initial discussions.
> >
> > Does it have the time at which a particular numbered block is reached?
> > (i.e. Block #117 is not the same thing as the 117th block scanned). We
> > can use that to compare the time difference of each scan.
>
> Right now it logs when a scan starts, what block number of the table it
> starts on, and also prints out the current block it's scanning every N
> blocks (100k or 10k depending on debug level). The time and the pid are,
> of course, available from log_prefix.

Can you make it log every block whose id is divisible by 100k or 10k? Otherwise one scan will log blocks 100,000 ... 200,000 ... etc. and the next scan will log 17,357 ... 117,357 ... etc., which will be much harder to work out.

That will give us "lap times" for every 100,000 blocks.

I'm particularly interested in the turning point where the scan starts again at the beginning of the file. It would be good to know what blockid it turned at and when that was. We may get out of step at that point. Maybe. We'll find out.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
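A sketch of the kind of trace output being discussed (illustrative only; the stride constant, message wording, and helper function are assumptions, not the patch): log whenever the current block number is divisible by a fixed stride, so concurrent scans report comparable block ids, and log the wrap point separately since that is where the scans may fall out of step.

#define SYNC_SCAN_TRACE_STRIDE 100000   /* blocks between trace lines */

static void
ss_trace_progress(Oid relid, BlockNumber blockno, BlockNumber start_block)
{
    if ((blockno % SYNC_SCAN_TRACE_STRIDE) == 0)
        elog(DEBUG2, "synch scan: rel %u at block %u", relid, blockno);

    /* The turning point, where the scan wraps back to the start of the file. */
    if (blockno == 0 && start_block != 0)
        elog(DEBUG2, "synch scan: rel %u wrapped to block 0 (scan started at block %u)",
             relid, start_block);
}

Combined with a timestamp in log_line_prefix, the divisible-by-stride lines give the "lap times" Simon asks for above.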
> > The advantage of sync_scan_offset is that, in some situations, a
> > second scan can actually finish faster than if it were the only query
> > executing, because a previous scan has already caused some blocks to
> > be cached. However, 16 is a small number because that benefit would
> > only come once per scan, and sync scans are only helpful on large tables.

Agreed.

> Alright, understood. That last part is actually something I
> now want to avoid because it's using the current
> cache-spoiling behaviour of seqscans to advantage. I'd like
> to remove that behaviour, but it sounds like we can have both
> - SeqScans that don't spoil cache
> - Synch scans
> by setting "sync_scan_offset" to zero.
>
> > > > I like the idea of reducing tuning parameters, but we should, at a
> > > > minimum, still allow an on/off button for sync scans. My tests
> > > > revealed that the wrong combination of
> > > > OS/FS/IO-Scheduler/Controller could result in bad I/O behavior.
> > >
> > > Agreed
> >
> > Do you have an opinion about sync_scan_threshold versus a simple
> > sync_scan_enable?
>
> enable_sync_scan?

Seems the suggested GUCs are very related. IIRC the agreed suggestion was to use NBuffers (or a percentage thereof?) to decide whether to spoil the buffer cache for a seq scan. It seems this same metric should be used to decide whether to sync a scan when sync scan is enabled.

So when the table size is below NBuffers (or a percentage thereof), you neither recycle buffers nor sync the seq scans.

Andreas