Thread: Initial prefetch performance testing
The complicated patch I've been working with for a while now is labeled "sequential scan posix fadvise" in the CommitFest queue. There are a lot of parts to that, going back to last December, and I've added the most relevant links to the September CommitFest page. The first message there on this topic is http://archives.postgresql.org/message-id/87ve7egxow.fsf@oxford.xeocode.com which is a program from Greg Stark that measures how much prefetching advisory information improves the overall transfer speed on a synthetic random read benchmark.

The idea is that you advise the OS about up to n requests at a time, where n goes from 1 (no prefetch at all) to 8192. As n goes up, the total net bandwidth usually goes up as well. You can basically divide the bandwidth at any prefetch level by the baseline (1 = no prefetch) to get a speedup multiplier. The program allows you to submit both unsorted and sorted requests, and the speedup is pretty large and similarly distributed (but of different magnitude) in both cases. While not a useful PostgreSQL patch on its own, this program lets one figure out whether the basic idea here (advising about blocks ahead of time to speed up the whole read) works on a particular system, without having to cope with a larger test.

What I have to report here are some results from many systems running both Linux and Solaris with various numbers of disk spindles. The Linux systems use the posix_fadvise call, while the Solaris ones use the aio library.

Using the maximum prefetch working set tested, 8192, here's the speedup multiplier on this benchmark for both sorted and unsorted requests using an 8GB file:

   OS               Spindles   Unsorted X   Sorted X
1: Linux            1          2.3          2.1
2: Linux            1          1.5          1.0
3: Solaris          1          2.6          3.0
4: Linux            3          6.3          2.8
5: Linux (Stark)    3          5.3          3.6
6: Linux            10         5.4          4.9
7: Solaris*         48         16.9         9.2

Systems (1)-(3) are standard single-disk workstations with various speed and size disks. (4) is a 3-disk software RAID0 (on an Areca card in JBOD mode). (5) is the system Greg Stark originally reported his results on, which is also a 3-disk array of some sort. (6) uses a Sun 2640 disk array with a 10-disk RAID0+1 setup, while (7) is a Sun Fire X4500 with 48 disks in a giant RAID-Z array.

The Linux systems drop the OS cache after each run; they're all running kernel 2.6.18 or higher, which has that feature. Solaris system (3) is using the UFS filesystem with the default tuning, which doesn't cache enough information for that to be necessary[1]; the results look very similar to the Linux case even without explicitly dropping the cache.

* For (7) the initial results showed obvious caching (>150MB/s), as I expected from Solaris's ZFS, which does cache aggressively by default. In order to get useful results with the server's 16GB of RAM, I increased the test file to 64GB, at which point the results looked reasonable.

Comparing with a prefetch working set of 256, which I eyeballed on the results spreadsheet I made as the best return on prefetch effort before improvements leveled off, the speedups looked like this:

   OS               Spindles   Unsorted X   Sorted X
1: Linux            1          2.3          2.0
2: Linux            1          1.5          0.9
3: Solaris          1          2.5          3.3
4: Linux            3          5.8          2.6
5: Linux (Stark)    3          5.6          3.7
6: Linux            10         5.7          5.1
7: Solaris          48         10.0         7.8

Observations:

- For the most part, using the fadvise/aio technique was a significant win even on single-disk systems. The worst result, on system (2) with sorted blocks, was basically break-even within the measurement tolerance here: 94% of the no-prefetch rate was the worst result I saw, but all of these numbers bounced around by about +/- 5%, so I wouldn't read too much into that. In every other case, there was at least a 50% speed increase even with a single disk.

- As Greg Stark suggested, the larger the spindle count the larger the speedup, and the larger the prefetch size that might make sense. His suggestion to model the user GUC as "effective_spindle_count" looks like a good one. The sequential scan fadvise implementation patch submitted uses the earlier preread_pages name for that parameter, which I agree seems less friendly.

- The Solaris aio implementation seems to perform a bit better relative to no prefetch than the Linux fadvise one does. I'm left wondering whether that's just a Solaris vs. Linux thing, in particular whether it's just some lucky caching on Solaris where the cache isn't completely cleared, or whether Linux's aio library might work better than its fadvise call does.

The attached archive file includes a couple of useful bits for anyone who wants to try this test on their hardware. I think I filed off all the rough edges here and it should be really easy for someone else to run this test now. It includes:

- prefetch.c is a slightly modified version of the original test program. I fixed a couple of minor bugs in the parameter input/output code that only showed up under some platform combinations; the actual prefetch implementation is untouched.

- prefetchtest is a shell script that compiles the program and runs it against a full range of prefetch sizes. Just run it and tell it where you want the test data file to go (with an optional size that defaults to 8GB), and it produces an output file named prefetch-results.csv with all the results in it.

- I included all of the raw data for the various systems I tested so other testers have baselines to compare against. An OpenOffice spreadsheet that compares all the results and computes the ratios shown above is also included.

Conclusion: on all the systems I tested, this approach gave excellent results, which makes me feel confident that I should see a corresponding speedup on database-level tests that use this same basic technique. I'm not sure whether it might make sense to bundle this test program up somehow so others can use it for similar compatibility tests (I'm thinking of something similar to contrib/test_fsync); I'll revisit that after the rest of the review.

Next step: I've got two data sets (one generated, one real-world sample) that should demonstrate a useful heap scan prefetch speedup, and one test program I think will demonstrate whether the sequential scan prefetch code works right. Now that I've vetted all the hardware/OS combinations, I hope I can squeeze that in this week; I don't need to test all of them now that I know which are the interesting systems.

As far as other platforms go, I should get a Mac OS system to test on in the near future as well (once I have the database tests working; not worth scheduling yet), but as it will only have a single disk, that will basically just be a compatibility test rather than a serious performance one. It would be nice to get a report from someone running FreeBSD to see what's needed to make the test script run on that OS.

[1] http://blogs.sun.com/jkshah/entry/postgresql_east_2008_talk_best : Page 8 of the presentation covers just how limited the default UFS cache tuning is.
-- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
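For reference, the core advise-then-read loop the benchmark above exercises looks roughly like the sketch below. This is an illustration only, not the actual prefetch.c: the file name, block size, and window size are arbitrary assumptions, and all timing/bandwidth measurement is omitted.

    /* Sketch of the advise-then-read idea: announce a window of upcoming
     * random block reads with posix_fadvise(POSIX_FADV_WILLNEED), then
     * read them back with pread().  Assumes "testfile" exists and is
     * non-empty.  Illustration only, not the benchmark program itself. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLOCK_SIZE 8192
    #define WINDOW     256          /* "prefetch working set" size */

    int main(void)
    {
        int fd = open("testfile", O_RDONLY);
        if (fd < 0)
            return 1;

        off_t nblocks = lseek(fd, 0, SEEK_END) / BLOCK_SIZE;
        off_t offsets[WINDOW];
        char  buf[BLOCK_SIZE];

        /* pick a window of random block offsets */
        for (int i = 0; i < WINDOW; i++)
            offsets[i] = (random() % nblocks) * BLOCK_SIZE;

    #ifdef POSIX_FADV_WILLNEED
        /* tell the kernel about all of them before reading any */
        for (int i = 0; i < WINDOW; i++)
            posix_fadvise(fd, offsets[i], BLOCK_SIZE, POSIX_FADV_WILLNEED);
    #endif

        /* now actually read them; with the advice issued, the kernel can
         * schedule the seeks across the whole window */
        for (int i = 0; i < WINDOW; i++)
            pread(fd, buf, BLOCK_SIZE, offsets[i]);

        close(fd);
        return 0;
    }

The speedup numbers in the tables come from running this kind of loop with different WINDOW sizes and comparing the resulting bandwidth against the WINDOW = 1 (no prefetch) case.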
On Mon, 2008-09-22 at 04:57 -0400, Greg Smith wrote:

> -As Greg Stark suggested, the larger the spindle count the larger the speedup, and the larger the prefetch size that might make sense. His suggestion to model the user GUC as "effective_spindle_count" looks like a good one. The sequential scan fadvise implementation patch submitted uses the earlier preread_pages name for that parameter, which I agree seems less friendly.

Good news about the testing.

I'd prefer to set this as a tablespace-level storage parameter, since that is where it would need to live when we have multiple tablespaces. Specifically as a storage parameter, so we have the same syntax for table-level and tablespace-level storage parameters. That would also allow us to have tablespace-level defaults for table-level settings.

prefetch_... is a much better name since it's an existing industry term. I'm not in favour of introducing the concept of spindles, since I can almost hear the questions about ramdisks and memory-based storage. Plus I don't ever want to discover that the best setting for effective_spindles is 7 (or 5) when I have 6 disks, because of some technology shift or Postgres behaviour change in the future.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Sep 22, 2008, at 12:02 PM, Simon Riggs wrote:

> On Mon, 2008-09-22 at 04:57 -0400, Greg Smith wrote:
>
>> -As Greg Stark suggested, the larger the spindle count the larger the speedup, and the larger the prefetch size that might make sense. His suggestion to model the user GUC as "effective_spindle_count" looks like a good one. The sequential scan fadvise implementation patch submitted uses the earlier preread_pages name for that parameter, which I agree seems less friendly.
>
> Good news about the testing.

Absolutely; we ran tests and got similar figures. Also, I/O is much more stable and steady with the patch.

> I'd prefer to set this as a tablespace-level storage parameter, since that is where it would need to live when we have multiple tablespaces. Specifically as a storage parameter, so we have the same syntax for table-level and tablespace-level storage parameters. That would also allow us to have tablespace-level defaults for table-level settings.

+1

> prefetch_... is a much better name since it's an existing industry term. I'm not in favour of introducing the concept of spindles, since I can almost hear the questions about ramdisks and memory-based storage. Plus I don't ever want to discover that the best setting for effective_spindles is 7 (or 5) when I have 6 disks, because of some technology shift or Postgres behaviour change in the future.

I would definitely avoid the use of "spindles"; I totally agree with Simon here. Once mature SSD storage or some in-memory stuff becomes available for the masses, this won't be suitable anymore. The best thing would be to simply use the parameter as it was in the original patch.

Maybe we should simply make the parameter adjustable per table and per index. This would automatically cover 95% of all cases, such as clustered tables and so on.

Many thanks and best regards,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: www.postgresql-support.de
Simon Riggs <simon@2ndQuadrant.com> writes:

> On Mon, 2008-09-22 at 04:57 -0400, Greg Smith wrote:
>
>> -As Greg Stark suggested, the larger the spindle count the larger the speedup, and the larger the prefetch size that might make sense. His suggestion to model the user GUC as "effective_spindle_count" looks like a good one. The sequential scan fadvise implementation patch submitted uses the earlier preread_pages name for that parameter, which I agree seems less friendly.
>
> Good news about the testing.
>
> I'd prefer to set this as a tablespace-level storage parameter.

Sounds like a good idea, except... what's a tablespace-level storage parameter?

> prefetch_... is a much better name since it's an existing industry term. I'm not in favour of introducing the concept of spindles, since I can almost hear the questions about ramdisks and memory-based storage. Plus I don't ever want to discover that the best setting for effective_spindles is 7 (or 5) when I have 6 disks, because of some technology shift or Postgres behaviour change in the future.

In principle I quite strongly disagree with this. Someone might very well want to set spindle_count to 6 when he actually has 7, but at least he can have an intuitive feel for what he's doing: he's setting it to slightly less than what Postgres thinks is optimal. Number of blocks to prefetch is an internal implementation detail for which the DBA has absolutely no way to know the correct value. That's how we get the cargo-cult configuration tweaks we've seen in the past, where people follow recommendations with no idea what the consequences are or whether they apply.

In an ideal world we would have a half-dozen parameters to tell Postgres how much memory is available, how many disks are available, etc., and Postgres would know how best to use the resources. I think if we expose internal knobs like you propose then we end up with hundreds of parameters, and to adjust them you'll have to be an expert in Postgres internals.

That said, there is a place for these internal knobs when we don't really know how to best make use of resources. At this point we only have results from a few systems, and the results don't seem to jibe with the theory.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!
On Mon, 2008-09-22 at 16:46 +0100, Gregory Stark wrote:

> Simon Riggs <simon@2ndQuadrant.com> writes:
>
>> On Mon, 2008-09-22 at 04:57 -0400, Greg Smith wrote:
>>
>>> -As Greg Stark suggested, the larger the spindle count the larger the speedup, and the larger the prefetch size that might make sense. His suggestion to model the user GUC as "effective_spindle_count" looks like a good one. The sequential scan fadvise implementation patch submitted uses the earlier preread_pages name for that parameter, which I agree seems less friendly.
>>
>> Good news about the testing.
>>
>> I'd prefer to set this as a tablespace-level storage parameter.
>
> Sounds like a good idea, except... what's a tablespace-level storage parameter?

A storage parameter, just at the tablespace level:

    WITH (storage_parameter = value)

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Mon, 22 Sep 2008, Simon Riggs wrote:

> I'd prefer to set this as a tablespace-level storage parameter.

That seems reasonable, but I'm not working at that level yet. There's still a larger open question about how the buffer manager interaction will work here, and I'd like to have a better view of that first before getting into the exact syntax used to set the parameter. For now, a GUC works well enough, but you're right that something finer-grained may make sense before this actually hits the codebase.

> prefetch_... is a much better name since it's an existing industry term. I'm not in favour of introducing the concept of spindles, since I can almost hear the questions about ramdisks and memory-based storage.

It's possible to make a case for exposing the internal number that's getting varied here, naming the parameter something like prefetch_depth, and letting people set that to whatever they want. Based on the current data I might suggest a default of 256, using 0 to turn the feature off altogether, and a maximum of at least 8192 and possibly more.

In practice I expect there to only be a couple of popular values, and the idea of fine-tuning is a bit questionable. I think that's what Greg Stark was driving at with how the value was re-spun. Instead of using effective_spindle_count, you could just as easily make a case for an enum like [off,low,medium,high] mapping to [0,16,256,8192]. From what I've seen so far, that would reduce tweaking time in the field considerably while not really changing the range of available behavior very much.

I will be running a set of tests on a fast SSD device before I'm done; that's another one I'll try once I've got the database-level tests ready to run, too. What I expect is that it will favor 0: presumably you might as well just read the blocks rather than advise about them when the seek overhead is close to zero. I should be able to do a RAM disk run as well.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
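For illustration, the coarse setting sketched above could map onto an internal prefetch depth along these lines. The names and numbers come from the message itself; the code is hypothetical and not part of any submitted patch.

    /* Hypothetical mapping from a coarse user-facing setting to an
     * internal prefetch depth, using the [off,low,medium,high] ->
     * [0,16,256,8192] values suggested above.  Illustration only. */
    typedef enum
    {
        PREFETCH_OFF,
        PREFETCH_LOW,
        PREFETCH_MEDIUM,
        PREFETCH_HIGH
    } PrefetchSetting;

    static int
    prefetch_depth(PrefetchSetting setting)
    {
        switch (setting)
        {
            case PREFETCH_OFF:    return 0;
            case PREFETCH_LOW:    return 16;
            case PREFETCH_MEDIUM: return 256;   /* suggested default above */
            case PREFETCH_HIGH:   return 8192;  /* suggested maximum above */
        }
        return 256;
    }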
On Mon, 2008-09-22 at 13:06 -0400, Greg Smith wrote:

>> prefetch_... is a much better name since it's an existing industry term. I'm not in favour of introducing the concept of spindles, since I can almost hear the questions about ramdisks and memory-based storage.
>
> It's possible to make a case for exposing the internal number that's getting varied here, naming the parameter something like prefetch_depth, and letting people set that to whatever they want. Based on the current data I might suggest a default of 256, using 0 to turn the feature off altogether, and a maximum of at least 8192 and possibly more.
>
> In practice I expect there to only be a couple of popular values, and the idea of fine-tuning is a bit questionable. I think that's what Greg Stark was driving at with how the value was re-spun. Instead of using effective_spindle_count, you could just as easily make a case for an enum like [off,low,medium,high] mapping to [0,16,256,8192]. From what I've seen so far, that would reduce tweaking time in the field considerably while not really changing the range of available behavior very much.

Tuning Postgres I/O already involves quite a few parameters called buffersize, segment width, stripe size, etc. I've never heard a disk manufacturer say this is wrong and we should just have "spindle equivalents".

I don't think we should dress this up too much, that's all. We aren't going to make anybody's life any easier, but we will probably generate lots of annoying phone calls to disk manufacturers asking "so how many spindles is your subsystem worth in Postgres terms?", to which they will shrug and say "no idea".

Is the behaviour of this sufficiently linear to be able to say that 3 spindles = 3 effective_spindles, 6 = 6, etc.? I would guess it won't be, and then you're left with a name more misleading than useful.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Gregory Stark wrote:

> Simon Riggs <simon@2ndQuadrant.com> writes:
>> I'm not in favour of introducing the concept of spindles....
>
> In principle I quite strongly disagree with this.... Number of blocks to prefetch is an internal implementation detail for which the DBA has absolutely no way to know the correct value.

Even more often on the systems I see these days, "spindles" is an implementation detail for which the DBA has no way to know the correct value.

For example, on our sites hosted with Amazon's compute cloud (a great place to host web sites), I know nothing about spindles, but I do know about Amazon Elastic Block Store[2] and Instance Store[1] volumes. I have some specs and am able to run benchmarks on them, but I couldn't guess how many spindles my X% of the N-disk device corresponds to. For another example, some of our sales guys with SSD drives have 0 spindles on their demo machines.

I'd rather have a parameter that expressed things more in terms of measurable quantities -- perhaps seeks/second? Perhaps random-access/sequential-access times?

[1] http://www.amazon.com/gp/browse.html?node=201590011
[2] http://www.amazon.com/b/ref=sc_fe_c_0_201590011_1?ie=UTF8&node=689343011&no=201590011&me=A36L942TSJ2AJA
On Mon, 22 Sep 2008, Gregory Stark wrote:

> Hm, I'm disappointed with the 48-drive array here. I wonder why it maxed out at only 10x the bandwidth of one drive. I would expect more like 24x or more.

The ZFS RAID-Z implementation doesn't really scale that linearly. It's rather hard to get the full bandwidth out of an X4500 with any single process, and I haven't done any filesystem tuning to improve things; everything is at the defaults.

> I'm quite surprised Solaris doesn't support posix_fadvise -- perhaps it's in some other version of Solaris?

Both the systems I used were standard Solaris 10 boxes, and I'm not aware of any changes in this area in the later OpenSolaris releases (which is where I'd expect something like this to change first). The test program I tried failed its #ifdef POSIX_FADV_WILLNEED check, and the message I saw from you at http://archives.postgresql.org/message-id/877imua265.fsf@oxford.xeocode.com suggested you didn't find any fadvise either, so I didn't look much further. The above is a cue for someone from Sun to chime in on this subject.

> I have an updated patch I'll be sending along shortly. You might want to test with that?

Obviously I've got everything set up to test right now; I'm currently analyzing your earlier patch and the sequential scan fork that derived from it. If you've got a later version of the bitmap heap scan one as well, I'll replace the one I had been planning to test (your bitmap-preread-v9) with that one when it's available.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:

> For example, on our sites hosted with Amazon's compute cloud (a great place to host web sites), I know nothing about spindles, but I do know about Amazon Elastic Block Store[2] and Instance Store[1] volumes. I have some specs and am able to run benchmarks on them, but I couldn't guess how many spindles my X% of the N-disk device corresponds to.

Well, I don't see how you're going to guess how much prefetching is optimal for those environments either...

> For another example, some of our sales guys with SSD drives have 0 spindles on their demo machines.

Sounds to me like you're finding it pretty intuitive. Actually you would want "1", because it can handle one request at a time. And if you have a multipath array, I imagine you would want to think of each interface as a spindle, because that's the bottleneck and you'll want to keep all the interfaces busy.

> I'd rather have a parameter that expressed things more in terms of measurable quantities -- perhaps seeks/second? Perhaps random-access/sequential-access times?

Well, that's precisely what I'm saying. Simon et al want a parameter to control how much prefetching to do. That's *not* a measurable quantity. I'm suggesting effective_spindle_count, which *is* a measurable quantity, even if it might be a bit harder to measure in some environments than others.

The two other quantities you describe are both currently represented by our random_page_cost (or random_page_cost/sequential_page_cost). What we're dealing with now is an entirely orthogonal property of your system: how many concurrent requests the system can handle.

If you have ten spindles then you really want to send enough requests to ensure there are ten concurrent requests being processed on ten different drives (assuming you want each scan to make maximum use of the resources, which is primarily true in DSS but might not be true in OLTP). That's a lot more than ten requests, though, because if you sent ten requests many of them would end up on the same devices. In theory my logic led me to think that for ten drives it would be about 30. Experiments seem to show it's more like 300-400. That discrepancy might be a reason to put this debate aside for now anyways and expose the internal implementation until we understand better what's going on there.

Ironically, I'm pretty happy to lose this argument because EDB is interested in rolling this into its dynamic tuning module. If there's a consensus -- by my count three people have spoken up already, which is more than usual -- then I'll gladly concede. Anyone object to going back to preread_pages? Or should it be prefetch_pages? prefetch_blocks? Something else?

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
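One plausible way to arrive at the "about 30" figure above, assuming a simple balls-in-bins model (this is my own reading of the argument, not necessarily the author's actual reasoning): if k outstanding random requests land independently on d drives, the expected number of busy drives is d*(1 - ((d-1)/d)^k).

    /* Expected number of busy drives when k outstanding random requests
     * are spread over d drives: d * (1 - ((d-1)/d)^k).  Rough model only. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double d = 10.0;

        for (int k = 10; k <= 40; k += 10)
            printf("%2d outstanding requests -> %.1f of %.0f drives busy\n",
                   k, d * (1.0 - pow((d - 1.0) / d, (double) k)), d);
        /* 10 -> 6.5, 20 -> 8.8, 30 -> 9.6, 40 -> 9.9: roughly 30 requests
         * are needed before all ten drives are almost always busy */
        return 0;
    }

This simple model does not explain the 300-400 figure seen in the experiments, which is the discrepancy mentioned above.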
Gregory Stark wrote:

> Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
>> I'd rather have a parameter that expressed things more in terms of measurable quantities [...]
>
> ...What we're dealing with now is an entirely orthogonal property of your system: how many concurrent requests the system can handle.

Really? I'd have thought you'd want to give the OS enough guesses about the future that its elevator algorithms for the drive heads don't keep seeking back and forth, but rather do as much per sweep across a device as they can.

> Ironically, I'm pretty happy to lose this argument because EDB is interested in rolling this into its dynamic tuning module. If there's a consensus -- by my count three people have spoken up already, which is more than usual -- then I'll gladly concede. Anyone object to going back to preread_pages? Or should it be prefetch_pages? prefetch_blocks? Something else?

Well, as you pointed out, I'm not on their side of the debate either. I'm not sure what a relevant measurable parameter would be, so I'm not being too helpful in the conversation either.
[resending due to the attachment being too large for the -hackers list -- weren't we going to raise it when we killed -patches?]

Greg Smith <gsmith@gregsmith.com> writes:

> Using the maximum prefetch working set tested, 8192, here's the speedup multiplier on this benchmark for both sorted and unsorted requests using an 8GB file:
>
>    OS               Spindles   Unsorted X   Sorted X
> 1: Linux            1          2.3          2.1
> 2: Linux            1          1.5          1.0
> 3: Solaris          1          2.6          3.0
> 4: Linux            3          6.3          2.8
> 5: Linux (Stark)    3          5.3          3.6
> 6: Linux            10         5.4          4.9
> 7: Solaris*         48         16.9         9.2

Incidentally, I've been looking primarily at the sorted numbers because they parallel bitmap heap scans. (Note that the heap scan is only about half the I/O of a bitmap index scan + heap scan, so even if it's infinitely faster it'll only halve the time spent in the two nodes.)

Hm, I'm disappointed with the 48-drive array here. I wonder why it maxed out at only 10x the bandwidth of one drive. I would expect more like 24x or more. I wonder if Solaris's aio has an internal limit on how many pending I/O requests it can handle. Perhaps it's a tunable?

Unfortunately I don't see a convenient, low-invasive way to integrate aio into Postgres. With posix_fadvise we can just issue the advice and then forget about it. But with aio we would pretty much have to pick a target buffer, pin it, issue the aio, and then remember the pin later when we need to read the buffer. That would require restructuring the code significantly.

I'm quite surprised Solaris doesn't support posix_fadvise -- perhaps it's in some other version of Solaris?

Here's a graph of results from this program for various sized arrays on a single machine:

http://wiki.postgresql.org/images/a/a3/Results.svg

Each colour corresponds to an array of a different number of spindles, ranging from 1 to 15 drives. The X axis is how much prefetching was done and the Y axis is the bandwidth obtained. There is a distinct maximum and then a dropoff, and it would be great to get some data points for larger arrays to understand where that maximum goes as the array gets larger.

> Conclusion: on all the systems I tested, this approach gave excellent results, which makes me feel confident that I should see a corresponding speedup on database-level tests that use this same basic technique. I'm not sure whether it might make sense to bundle this test program up somehow so others can use it for similar compatibility tests (I'm thinking of something similar to contrib/test_fsync); I'll revisit that after the rest of the review.
>
> Next step: I've got two data sets (one generated, one real-world sample) that should demonstrate a useful heap scan prefetch speedup, and one test program I think will demonstrate whether the sequential scan prefetch code works right. Now that I've vetted all the hardware/OS combinations, I hope I can squeeze that in this week; I don't need to test all of them now that I know which are the interesting systems.

I have an updated patch I'll be sending along shortly. You might want to test with that?

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
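To make the fadvise-vs-aio contrast above concrete, here is a minimal sketch of what the two interfaces look like from the caller's side. This is an illustration only: buffer management, error handling, and any resemblance to how a patch would actually plumb this through the buffer manager are all omitted, and BLCKSZ is just assumed to be the usual 8192.

    /* Sketch of the two prefetch styles discussed above.  Illustration
     * only.  Link with -lrt on Linux for the aio functions. */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>

    #define BLCKSZ 8192

    /* posix_fadvise: fire and forget -- no buffer to hold onto */
    static void
    hint_block(int fd, off_t blkno)
    {
        posix_fadvise(fd, blkno * BLCKSZ, BLCKSZ, POSIX_FADV_WILLNEED);
    }

    /* aio: the caller must supply a buffer that stays valid (i.e. a
     * pinned shared buffer) and must keep the control block around
     * until the read completes */
    static void
    start_read(struct aiocb *cb, int fd, off_t blkno, void *pinned_buffer)
    {
        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = fd;
        cb->aio_offset = blkno * BLCKSZ;
        cb->aio_buf    = pinned_buffer;
        cb->aio_nbytes = BLCKSZ;
        aio_read(cb);
    }

    /* ... later, when the block is actually needed ... */
    static void
    finish_read(struct aiocb *cb)
    {
        const struct aiocb *list[1] = {cb};

        while (aio_error(cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);     /* wait for completion */
        (void) aio_return(cb);              /* collect the result */
    }

The point of the sketch is the bookkeeping difference: hint_block() needs nothing remembered afterwards, while the aio path has to carry both the pinned buffer and the aiocb from the point where the prefetch is issued to the point where the block is read.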
Greg Smith wrote:

> On Mon, 22 Sep 2008, Gregory Stark wrote:
>
>> I'm quite surprised Solaris doesn't support posix_fadvise -- perhaps it's in some other version of Solaris?

Solaris has only a fake variant of posix_fadvise. See http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/gen/posix_fadvise.c

UFS has its own optimizations. For example, if it detects a sequential scan, the file cache is limited to 80kB. It also has prefetching. ZFS has intelligent read-pattern-recognition algorithms and other improvements.

Zdenek
Greg Smith <gsmith@gregsmith.com> writes:

> On Mon, 22 Sep 2008, Gregory Stark wrote:
>
>> Hm, I'm disappointed with the 48-drive array here. I wonder why it maxed out at only 10x the bandwidth of one drive. I would expect more like 24x or more.
>
> The ZFS RAID-Z implementation doesn't really scale that linearly. It's rather hard to get the full bandwidth out of an X4500 with any single process, and I haven't done any filesystem tuning to improve things; everything is at the defaults.

Well, random access I/O will fall pretty far short of the full bandwidth. Actually this is a major issue: our sequential_page_cost vs. random_page_cost dichotomy doesn't really work when we're prefetching pages. In my experiments, an array capable of supplying about 1.4GB/s in sequential I/O could only muster about 40MB/s of random I/O with prefetching, and only about 5MB/s without. For this machine we would have quite a dilemma setting random_page_cost -- do we set it to 280 (the sequential-to-random throughput ratio without prefetching) or 35 (the ratio with prefetching)?

Perhaps access paths which expect to be able to prefetch most of their accesses should use random_page_cost / effective_spindle_count for their I/O costs? But then if people don't set random_page_cost high enough they could easily find themselves with random fetches being costed as less expensive than sequential fetches. And I have a feeling it'll be a hard sell to get people to set random_page_cost in the double digits, let alone triple digits.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
Gregory Stark <stark@enterprisedb.com> writes:

> Perhaps access paths which expect to be able to prefetch most of their accesses should use random_page_cost / effective_spindle_count for their I/O costs?
>
> But then if people don't set random_page_cost high enough they could easily find themselves with random fetches being costed as less expensive than sequential fetches. And I have a feeling it'll be a hard sell to get people to set random_page_cost in the double digits, let alone triple digits.

Well, we could use something like

    Max(random_page_cost / effective_spindle_count, seq_page_cost)

to ensure the result remains somewhat sane.

            regards, tom lane
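As a worked illustration of that clamp, with made-up parameter values (my own example numbers, not anything from the planner's actual costing code): the clamp only matters when the spindle count is large enough to push the divided random cost below the sequential cost.

    /* Illustration of the Max() clamp suggested above, with hypothetical
     * parameter values; not the planner's actual costing code. */
    #include <stdio.h>

    static double
    prefetch_random_cost(double random_page_cost, double seq_page_cost,
                         int effective_spindle_count)
    {
        double cost = random_page_cost / effective_spindle_count;

        /* never let the prefetch-adjusted random cost drop below the
         * sequential cost */
        return (cost > seq_page_cost) ? cost : seq_page_cost;
    }

    int main(void)
    {
        /* 4.0 / 48 = 0.083, clamped up to seq_page_cost = 1.0 */
        printf("%.3f\n", prefetch_random_cost(4.0, 1.0, 48));
        /* 4.0 / 3 = 1.333, above seq_page_cost so used as-is */
        printf("%.3f\n", prefetch_random_cost(4.0, 1.0, 3));
        return 0;
    }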
Greg Smith <gsmith@gregsmith.com> writes:

>> I have an updated patch I'll be sending along shortly. You might want to test with that?
>
> Obviously I've got everything set up to test right now; I'm currently analyzing your earlier patch and the sequential scan fork that derived from it. If you've got a later version of the bitmap heap scan one as well, I'll replace the one I had been planning to test (your bitmap-preread-v9) with that one when it's available.

Well, here you go. It includes:

 . Bitmap heap scans (as before)
 . Index scans
 . Setting POSIX_FADV_SEQUENTIAL for bulk sequential scans
 . Improved (ie, debugged) autoconf tests for posix_fadvise (and posix_fallocate, though I don't have any code using it yet)

The bitmap heap scans are still prefetching the number of buffers I estimate based on effective_spindle_count according to my magic formula. We've been discussing throwing that out; I just haven't modified this to do that yet.

Index scans prefetch all pages for matching index tuples on the leaf page when we do page-at-a-time scans. I haven't bothered doing the gradual ramp-up or keeping a ring of the optimal size prefetched.

The sequential scan stuff is based on Zoltan's posts but done in a different way. It passes an I/O access strategy to smgr and fd.c, which keeps track of what the previous strategy was and calls posix_fadvise if it's changed. This will correctly handle queries which reference the same table twice, even if one reference is a sequential scan and the other is an index lookup.

I have *not* been able to observe any significant effect from POSIX_FADV_SEQUENTIAL, but I'm not sure under what circumstances it was a problem. It sounds like it's a peculiar situation which is not easy to reliably reproduce.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
Attachment
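A minimal sketch of the "only re-advise when the access strategy changes" idea described in the message above. This is my own simplification, with a single cached advice value instead of the patch's smgr/fd.c plumbing and per-file tracking, so it is only meant to show the shape of the technique.

    /* Sketch of switching kernel readahead advice only when the access
     * pattern for a file changes.  Simplified illustration, not the patch. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>

    static int current_advice = -1;     /* last advice issued */

    static void
    set_access_pattern(int fd, int advice)
    {
        if (advice == current_advice)
            return;                     /* unchanged, no syscall needed */

        /* offset 0, len 0 means "apply to the whole file" */
        posix_fadvise(fd, 0, 0, advice);
        current_advice = advice;
    }

    /*
     * Usage, e.g.:
     *   set_access_pattern(fd, POSIX_FADV_SEQUENTIAL);  before a bulk seq scan
     *   set_access_pattern(fd, POSIX_FADV_RANDOM);      before index lookups
     * so a query touching the same table both ways gets the right hint
     * for each kind of access.
     */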
On Tue, 23 Sep 2008, Gregory Stark wrote:

> I have *not* been able to observe any significant effect from POSIX_FADV_SEQUENTIAL, but I'm not sure under what circumstances it was a problem. It sounds like it's a peculiar situation which is not easy to reliably reproduce.

Zoltan, Hans-Juergen: would it be possible for you to try the latest bitmap-preread-v18.diff.gz patch Greg Stark just sent over to the list? It's at http://archives.postgresql.org/message-id/87ljxjudu7.fsf@oxford.xeocode.com as well.

That's a refinement of the original strategy you used, and I'd be curious to hear whether it still works usefully on the troublesome workload you submitted your original patch against. Since none of the rest of us have so far been successful at replicating the large speed-up on multiple concurrent sequential scans you reported, I think you're the best candidate to see whether there was any regression because of how the patch was refactored.

I'm excited to see index scans in the new patch as well, since I've got 1TB of test data, navigated that way, that I can test with.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Ron Mayer wrote:

> Even more often on the systems I see these days, "spindles" is an implementation detail for which the DBA has no way to know the correct value.
>
> For example, on our sites hosted with Amazon's compute cloud (a great place to host web sites), I know nothing about spindles, but I do know about Amazon Elastic Block Store[2] and Instance Store[1] volumes. I have some specs and am able to run benchmarks on them, but I couldn't guess how many spindles my X% of the N-disk device corresponds to. For another example, some of our sales guys with SSD drives have 0 spindles on their demo machines.
>
> I'd rather have a parameter that expressed things more in terms of measurable quantities -- perhaps seeks/second? Perhaps random-access/sequential-access times?

I assume SAN users might not know the number of spindles either.

--
Bruce Momjian <bruce@momjian.us>        http://momjian.us
EnterpriseDB                            http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Gregory Stark wrote:

> Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
>
>> For example, on our sites hosted with Amazon's compute cloud (a great place to host web sites), I know nothing about spindles, but I do know about Amazon Elastic Block Store[2] and Instance Store[1] volumes. I have some specs and am able to run benchmarks on them, but I couldn't guess how many spindles my X% of the N-disk device corresponds to.
>
> Well, I don't see how you're going to guess how much prefetching is optimal for those environments either...
>
>> For another example, some of our sales guys with SSD drives have 0 spindles on their demo machines.
>
> Sounds to me like you're finding it pretty intuitive. Actually you would want "1", because it can handle one request at a time. And if you have a multipath array, I imagine you would want to think of each interface as a spindle, because that's the bottleneck and you'll want to keep all the interfaces busy.

I assume everyone would want at least one, because you would want the next page to be prefetched while you are processing the existing page. I think a larger problem is that the page prefetch will likely take longer than processing the existing page, which might mean that you need to prefetch a few pages ahead to allow read reordering for better performance.

--
Bruce Momjian <bruce@momjian.us>        http://momjian.us
EnterpriseDB                            http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote:

> Ron Mayer wrote:
>> Even more often on the systems I see these days, "spindles" is an implementation detail for which the DBA has no way to know the correct value.
>>
>> For example, on our sites hosted with Amazon's compute cloud (a great place to host web sites), I know nothing about spindles, but I do know about Amazon Elastic Block Store[2] and Instance Store[1] volumes. I have some specs and am able to run benchmarks on them, but I couldn't guess how many spindles my X% of the N-disk device corresponds to. For another example, some of our sales guys with SSD drives have 0 spindles on their demo machines.
>>
>> I'd rather have a parameter that expressed things more in terms of measurable quantities -- perhaps seeks/second? Perhaps random-access/sequential-access times?
>
> I assume SAN users might not know the number of spindles either.

Yeah. Nevertheless, I like the way effective_spindle_count works, as opposed to an unintuitive "number of blocks to prefetch" (assuming the formula we use to turn the former into the latter works). Perhaps we should keep the meaning the same, but call it "effective_io_concurrency"? Something that conveys the idea of "how many simultaneous I/O requests the I/O subsystem can handle", without referring to any specific technology. That concept applies to SANs and RAM drives as well.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Wed, 2008-09-24 at 17:42 +0300, Heikki Linnakangas wrote:

> Yeah. Nevertheless, I like the way effective_spindle_count works, as opposed to an unintuitive "number of blocks to prefetch" (assuming the formula we use to turn the former into the latter works). Perhaps we should keep the meaning the same, but call it "effective_io_concurrency"? Something that conveys the idea of "how many simultaneous I/O requests the I/O subsystem can handle", without referring to any specific technology. That concept applies to SANs and RAM drives as well.

You've spoiled all the fun now with a good suggestion. I was looking forward to the Jules Verne-like nostalgia of the other suggestion over the years to come.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:

> On Mon, 2008-09-22 at 16:46 +0100, Gregory Stark wrote:
>
>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>
>>> I'd prefer to set this as a tablespace-level storage parameter.
>>
>> Sounds like a good idea, except... what's a tablespace-level storage parameter?
>
> A storage parameter, just at the tablespace level:
>
>     WITH (storage_parameter = value)

I still think this is a good idea, but there are unanswered questions about it. Surely whatever we do with this parameter also holds for random_page_cost and sequential_page_cost as well? Should they remain GUC parameters at all? If so, how would setting any of them locally interact with the tablespace parameter for tables used in the query?

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!