Thread: RAID arrays and performance
I have a question about how Postgres makes use of RAID arrays for performance, because we are considering buying a 12-disc array for performance reasons. I'm interested in how the performance scales with the number of discs in the array. Now, I know that for an OLTP workload (in other words, lots of small parallel random accesses), the performance will scale almost linearly with the number of discs. However, I'm more interested in the performance of individual queries, particularly those where Postgres has to do an index scan, which will result in a single query performing lots of random accesses to the disc system. Theoretically, this *can* scale with the number of discs too - my question is does it? Does Postgres issue requests to each random access in turn, waiting for each one to complete before issuing the next request (in which case the performance will not exceed that of a single disc), or does it use some clever asynchronous access method to send a queue of random access requests to the OS that can be distributed among the available discs? Any knowledgeable answers or benchmark proof would be appreciated, Matthew -- "To err is human; to really louse things up requires root privileges." -- Alexander Pope, slightly paraphrased
"Matthew" <matthew@flymine.org> writes: > Does Postgres issue requests to each random access in turn, waiting for > each one to complete before issuing the next request (in which case the > performance will not exceed that of a single disc), or does it use some > clever asynchronous access method to send a queue of random access > requests to the OS that can be distributed among the available discs? Sorry, it does the former, at least currently. That said, this doesn't really come up nearly as often as you might think. Normally queries fit mostly in either the large batch query domain or the small quick oltp query domain. For the former Postgres tries quite hard to do sequential i/o which the OS will do readahead for and you'll get good performance. For the latter you're normally running many simultaneous such queries and the raid array helps quite a bit. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning
On Tue, 4 Dec 2007, Gregory Stark wrote: > "Matthew" <matthew@flymine.org> writes: > > > Does Postgres issue requests to each random access in turn, waiting for > > each one to complete before issuing the next request (in which case the > > performance will not exceed that of a single disc), or does it use some > > clever asynchronous access method to send a queue of random access > > requests to the OS that can be distributed among the available discs? > > Sorry, it does the former, at least currently. > > That said, this doesn't really come up nearly as often as you might think. Shame. It comes up a *lot* in my project. A while ago we converted a task that processes a queue of objects to processing groups of a thousand objects, which sped up the process considerably. So we run an awful lot of queries with IN lists with a thousand values. They hit the indexes, then fetch the rows by random access. A full table sequential scan would take much longer. It'd be awfully nice to have those queries go twelve times faster. > Normally queries fit mostly in either the large batch query domain or the > small quick oltp query domain. For the former Postgres tries quite hard to do > sequential i/o which the OS will do readahead for and you'll get good > performance. For the latter you're normally running many simultaneous such > queries and the raid array helps quite a bit. Having twelve discs will certainly improve the sequential IO throughput! However, if this was implemented (and I have *no* idea whatsoever how hard it would be), then large index scans would scale with the number of discs in the system, which would be quite a win, I would imagine. Large index scans can't be that rare! Matthew -- Software suppliers are trying to make their software packages more 'user-friendly'.... Their best approach, so far, has been to take all the old brochures, and stamp the words, 'user-friendly' on the cover. -- Bill Gates
Matthew wrote: > On Tue, 4 Dec 2007, Gregory Stark wrote: > >> "Matthew" <matthew@flymine.org> writes >>> Does Postgres issue requests to each random access in turn, waiting for >>> each one to complete before issuing the next request (in which case the >>> performance will not exceed that of a single disc), or does it use some >>> clever asynchronous access method to send a queue of random access >>> requests to the OS that can be distributed among the available discs? >>> >> Sorry, it does the former, at least currently. >> That said, this doesn't really come up nearly as often as you might think. >> > Shame. It comes up a *lot* in my project. A while ago we converted a task > that processes a queue of objects to processing groups of a thousand > objects, which sped up the process considerably. So we run an awful lot of > queries with IN lists with a thousand values. They hit the indexes, then > fetch the rows by random access. A full table sequential scan would take > much longer. It'd be awfully nice to have those queries go twelve times > faster. > The bitmap scan method does ordered reads of the table, which can partially take advantage of sequential reads. Not sure whether bitmap scan is optimal for your situation or whether your situation would allow this to be taken advantage of. >> Normally queries fit mostly in either the large batch query domain or the >> small quick oltp query domain. For the former Postgres tries quite hard to do >> sequential i/o which the OS will do readahead for and you'll get good >> performance. For the latter you're normally running many simultaneous such >> queries and the raid array helps quite a bit. >> > Having twelve discs will certainly improve the sequential IO throughput! > > However, if this was implemented (and I have *no* idea whatsoever how hard > it would be), then large index scans would scale with the number of discs > in the system, which would be quite a win, I would imagine. 
> Large index scans can't be that rare! Do you know that there is a problem, or are you speculating about one? I think your case would be far more compelling if you could show a problem. :-) I would think that at a minimum, having 12 disks with RAID 0 or RAID 1+0 would allow your insane queries to run concurrent with up to 12 other queries. Unless your insane query is the only query in use on the system, I think you may be speculating about a nearly non-existent problem. Just a suggestion... I recall talk of more intelligent table scanning algorithms, and the use of asynchronous I/O to benefit from RAID arrays, but the numbers prepared to convince people that the change would have an effect have been less than impressive. Cheers, mark -- Mark Mielke <mark@mielke.cc>
On Tue, 4 Dec 2007, Mark Mielke wrote: > The bitmap scan method does ordered reads of the table, which can > partially take advantage of sequential reads. Not sure whether bitmap > scan is optimal for your situation or whether your situation would allow > this to be taken advantage of. Bitmap scan may help where several randomly-accessed pages are next to each other. However, I would expect the operating system's readahead and buffer cache to take care of that one. > Do you know that there is a problem, or are you speculating about one? I > think your case would be far more compelling if you could show a > problem. :-) > > I would think that at a minimum, having 12 disks with RAID 0 or RAID 1+0 > would allow your insane queries to run concurrent with up to 12 other > queries. Yeah, I don't really care about concurrency. It's pretty obvious that running x concurrent queries on an x-disc RAID system will allow the utilisation of all the discs at once, therefore allowing the performance to scale by x. What I'm talking about is a single query running on an x-disc RAID array. > Unless your insane query is the only query in use on the system... That's exactly the case. > I recall talk of more intelligent table scanning algorithms, and the use > of asynchronous I/O to benefit from RAID arrays, but the numbers > prepared to convince people that the change would have effect have been > less than impressive. I think a twelve-times speed increase is impressive. Granted, given greatly-concurrent access, the benefits go away, but I think it'd be worth it for when there are few queries running on the system. I don't think you would have to create a more intelligent table scanning algorithm. What you would need to do is take the results of the index, convert that to a list of page fetches, then pass that list to the OS as an asynchronous "please fetch all these into the buffer cache" request, then do the normal algorithm as is currently done. 
The requests would then come out of the cache instead of from the disc. Of course, this is from a simple Java programmer who doesn't know the OS interfaces for this sort of thing. Matthew -- Here we go - the Fairy Godmother redundancy proof. -- Computer Science Lecturer
Matthew wrote: > On Tue, 4 Dec 2007, Mark Mielke wrote: > >> The bitmap scan method does ordered reads of the table, which can >> partially take advantage of sequential reads. Not sure whether bitmap >> scan is optimal for your situation or whether your situation would allow >> this to be taken advantage of. >> > Bitmap scan may help where several randomly-accessed pages are next to > each other. However, I would expect the operating system's readahead and > buffer cache to take care of that one. > The disk head has less theoretical distance to travel if always moving in a single direction instead of randomly seeking back and forth. >> Do you know that there is a problem, or are you speculating about one? I >> think your case would be far more compelling if you could show a >> problem. :-) >> >> I would think that at a minimum, having 12 disks with RAID 0 or RAID 1+0 >> would allow your insane queries to run concurrent with up to 12 other >> queries. >> > Yeah, I don't really care about concurrency. It's pretty obvious that > running x concurrent queries on an x-disc RAID system will allow the > utilisation of all the discs at once, therefore allowing the performance > to scale by x. What I'm talking about is a single query running on an > x-disc RAID array. > The time to seek to a particular sector does not reduce 12X with 12 disks. It is still approximately the same, only it can handle 12X the concurrency. This makes RAID best for concurrent loads. In your scenario, you are trying to make a single query take advantage of this concurrency by taking advantage of the system cache. I don't think full 12X concurrency of a single query is possible for most loads, probably including yours. See later for details. >> I recall talk of more intelligent table scanning algorithms, and the use >> of asynchronous I/O to benefit from RAID arrays, but the numbers >> prepared to convince people that the change would have effect have been >> less than impressive. 
>> > I think a twelve-times speed increase is impressive. Granted, given > greatly-concurrent access, the benefits go away, but I think it'd be worth > it for when there are few queries running on the system. > > I don't think you would have to create a more intelligent table scanning > algorithm. What you would need to do is take the results of the index, > convert that to a list of page fetches, then pass that list to the OS as > an asynchronous "please fetch all these into the buffer cache" request, > then do the normal algorithm as is currently done. The requests would then > come out of the cache instead of from the disc. Of course, this is from a > simple Java programmer who doesn't know the OS interfaces for this sort of > thing. > That's about how the talk went. :-) The problem is that a 12X speedup for 12 disks seems unlikely except under very specific loads (such as a sequential scan of a single table). Each of the indexes may need to be scanned or searched in turn, then each of the tables would need to be scanned or searched in turn, depending on the query plan. There is no guarantee that the index rows or the table rows are equally spread across the 12 disks. CPU processing becomes involved and is currently limited to a single processor thread. I suspect no database would achieve a 12X speedup for 12 disks unless a simple sequential scan of a single table was required, in which case the reads could be fully parallelized with RAID 0 using standard sequential reads, and this is available today using built-in OS or disk read-ahead. Cheers, mark -- Mark Mielke <mark@mielke.cc>
"Mark Mielke" <mark@mark.mielke.cc> writes: > Matthew wrote: > >> I don't think you would have to create a more intelligent table scanning >> algorithm. What you would need to do is take the results of the index, >> convert that to a list of page fetches, then pass that list to the OS as >> an asynchronous "please fetch all these into the buffer cache" request, >> then do the normal algorithm as is currently done. The requests would then >> come out of the cache instead of from the disc. Of course, this is from a >> simple Java programmer who doesn't know the OS interfaces for this sort of >> thing. > > That's about how the talk went. :-) > > The problem is that a 12X speedup for 12 disks seems unlikely except under very > specific loads (such as a sequential scan of a single table). Each of the > indexes may need to be scanned or searched in turn, then each of the tables > would need to be scanned or searched in turn, depending on the query plan. > There is no guarantee that the index rows or the table rows are equally spread > across the 12 disks. CPU processing becomes involved and is currently limited > to a single processor thread. I suspect no database would achieve a 12X speedup > for 12 disks unless a simple sequential scan of a single table was required, in > which case the reads could be fully parallelized with RAID 0 using standard > sequential reads, and this is available today using built-in OS or disk > read-ahead. I'm sure you would get something between 1x and 12x though... I'm rerunning my synthetic readahead tests now. That doesn't show the effect of the other cpu and i/o work being done in the meantime but surely if they're being evicted from cache too soon that just means your machine is starved for cache and you should add more RAM? Also, it's true, you need to preread more than 12 blocks to handle a 12-disk raid. 
My offhand combinatorics analysis seems to indicate you would expect to need to read n(n-1)/2 blocks on average before you've hit all the blocks. There's little penalty to prereading unless you use up too many kernel resources or you do unnecessary i/o which you never use, so I would expect doing n^2 capped at some reasonable number like 1,000 pages (enough to handle a 32-disk raid) would be reasonable. The real trick is avoiding doing prefetches that are never needed. The user may never actually read all the tuples being requested. I think that means we shouldn't prefetch until the second tuple is read and then gradually increase the prefetch distance as you read more and more of the results. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
On Tue, 4 Dec 2007, Mark Mielke wrote: > The disk head has less theoretical distance to travel if always moving > in a single direction instead of randomly seeking back and forth. True... and false. The head can move pretty quickly, and it also has rotational latency and settling time to deal with. This means that there are cases where it is actually faster to move across the disc and read a block, then move back and read a block than to read them in order. So, if you hand requests one by one to the disc, it will almost always be faster to order them. On the other hand, if you hand a huge long list of requests to a decent SCSI or SATA-NCQ disc in one go, it will reorder the reads itself, and it will do it much better than you. > The time to seek to a particular sector does not reduce 12X with 12 > disks. It is still approximately the same, only it can handle 12X the > concurrency. This makes RAID best for concurrent loads. In your > scenario, you are trying to make a single query take advantage of this > concurrency by taking advantage of the system cache. Kind of. The system cache is just a method to make it simpler to explain - I don't know the operating system interfaces, but it's likely that the actual call is something more like "transfer these blocks to these memory locations and tell me when they're all finished." I'm trying to make a single query concurrent by using the knowledge of a *list* of accesses to be made, and getting the operating system to do all of them concurrently. > The problem is that a 12X speed for 12 disks seems unlikely except under > very specific loads (such as a sequential scan of a single table). I'll grant you that 12X may not actually be reached, but it'll be somewhere between 1X and 12X. I'm optimistic. > Each of the indexes may need to be scanned or searched in turn, then > each of the tables would need to be scanned or searched in turn, > depending on the query plan. 
Yes, the indexes would also have to be accessed concurrently, and that will undoubtedly be harder to code than accessing the tables concurrently. > There is no guarantee that the index rows or the table rows are equally > spread across the 12 disks. Indeed. However, you will get an advantage if they are spread out at all. Statistically, the larger the number of pages that need to be retrieved in the set, the more equally spread between the discs they will be. It's the times when there are a large number of pages to retrieve that this will be most useful. > CPU processing becomes involved and is currently limited to a single > processor thread. On the contrary, this is a problem at the moment for sequential table scans, but would not be a problem for random accesses. If you have twelve discs all throwing 80MB/s at the CPU, it's understandable that the CPU won't keep up. However, when you're making random accesses, with, say, a 15,000 rpm disc, and retrieving a single 8k page on every access, each disc will be producing a maximum of 2MB per second, which can be handled quite easily by modern CPUs. Index scans are limited by the disc, not the CPU. Matthew -- A. Top Posters > Q. What's the most annoying thing in the world?
On Tue, 4 Dec 2007, Gregory Stark wrote: > Also, it's true, you need to preread more than 12 blocks to handle a 12-disk > raid. My offhand combinatorics analysis seems to indicate you would expect to > need to n(n-1)/2 blocks on average before you've hit all the blocks. There's > little penalty to prereading unless you use up too much kernel resources or > you do unnecessary i/o which you never use, so I would expect doing n^2 capped > at some reasonable number like 1,000 pages (enough to handle a 32-disk raid) > would be reasonable. It's better than that, actually. Let's assume a RAID 0 striped set of twelve discs. If you spread out twelve requests to twelve discs, then the expected number of requests to each disc is one. The probability that any single disc receives more than, say, three requests is rather small. As you increase the number of requests, the longest reasonably-expected queue length for a particular disc gets closer to the number of requests divided by the number of discs, as the requests get spread more and more evenly among the discs. The larger the set of requests, the closer the performance will scale to the number of discs. Matthew -- All of this sounds mildly turgid and messy and confusing... but what the heck. That's what programming's all about, really -- Computer Science Lecturer
Fwiw, what made you bring up this topic now? You're the second person in about two days to bring up precisely this issue and it was an issue I had been planning to bring up on -hackers as it was. "Matthew" <matthew@flymine.org> writes: > Kind of. The system cache is just a method to make it simpler to explain - > I don't know the operating system interfaces, but it's likely that the > actual call is something more like "transfer these blocks to these memory > locations and tell me when they're all finished." I'm trying to make a > single query concurrent by using the knowledge of a *list* of accesses to > be made, and getting the operating system to do all of them concurrently. There are two interfaces of which I'm aware. posix_fadvise() which just tells the kernel you'll be needing the blocks soon. Linux at least initiates I/O on them into cache. libaio which lets you specify the location to read the blocks and how to notify you when they're ready. On Linux posix_fadvise works great but libaio doesn't seem to gain any speed bonus, at least on 2.6.22 with glibc 2.6.1. I was under the impression there was a kernel implementation somewhere but apparently it's not helping. On Solaris apparently it doesn't have posix_fadvise but libaio works great. We could use libaio as a kind of backdoor fadvise where we just initiate i/o on the block but throw away the results assuming they'll stay in cache for the real read or we could add an asynchronous interface to the buffer manager. The latter is attractive but would be a much more invasive patch. I'm inclined to start with the former. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!
Matthew wrote:
> On Tue, 4 Dec 2007, Gregory Stark wrote:
>> Also, it's true, you need to preread more than 12 blocks to handle a
>> 12-disk raid. My offhand combinatorics analysis seems to indicate you
>> would expect to need to read n(n-1)/2 blocks on average before you've hit
>> all the blocks. There's little penalty to prereading unless you use up
>> too many kernel resources or you do unnecessary i/o which you never use,
>> so I would expect doing n^2 capped at some reasonable number like 1,000
>> pages (enough to handle a 32-disk raid) would be reasonable.
>
> It's better than that, actually. Let's assume a RAID 0 striped set of
> twelve discs. If you spread out twelve requests to twelve discs, then the
> expected number of requests to each disc is one. The probability that any
> single disc receives more than, say, three requests is rather small. As
> you increase the number of requests, the longest reasonably-expected queue
> length for a particular disc gets closer to the number of requests divided
> by the number of discs, as the requests get spread more and more evenly
> among the discs. The larger the set of requests, the closer the
> performance will scale to the number of discs.

This assumes that you can know which pages to fetch ahead of time - which you do not except for sequential read of a single table.

I think it would be possible that your particular case could be up to 6X faster, but that most other people will see little or no speed up. If it does read the wrong pages, it is wasting its time.

I am not trying to discourage you - only trying to ensure that you have reasonable expectations. 12X is far too optimistic.

Please show one of your query plans and how you as a person would design which pages to request reads for.

Cheers,
mark

-- Mark Mielke <mark@mielke.cc>
matthew@flymine.org wrote: > So, if you hand requests one by one to the disc, it will almost always be > faster to order them. On the other hand, if you hand a huge long list of > requests to a decent SCSI or SATA-NCQ disc in one go, it will reorder the > reads itself, and it will do it much better than you. > > Sure - but this doesn't suggest threading so much as pushing all reads into AIO as soon as they can be identified - and hoping that your os has a decent AIO subsystem, which is sadly a tall order for many *nix systems. I do think some thought should be given to threading but I would expect the effect to be more noticeable for updates where you update tables that have multiple indices. In the case of your scan then you need threading on CPU (rather than concurrent IO through AIO) if the disks can feed you data faster than you can scan it. Which might be the case for some scans using user functions, but I wouldn't have thought it would be the case for a simple index scan. At some point, hopefully, the engines will be: a) thread safe (if not thread hot) so it can exist with threaded user functions and embedded languages b) able to incorporate C++ add-in functionality There may not be a pressing reason to support either of these, but having a capability to experiment would surely be helpful and allow incremental enhancement - so baby steps could be made to (for example) move stats and logging to a background thread, move push of results to clients out of the query evaluator thread, and so on. Parallel threading queries is a whole different ball game which needs thought in the optimiser. James
Mark Mielke wrote: > This assumes that you can know which pages to fetch ahead of time - > which you do not except for sequential read of a single table. > Why doesn't it help to issue IO ahead-of-time requests when you are scanning an index? You can read-ahead in index pages, and submit requests for data pages as soon as it is clear you'll want them. Doing so can allow the disks and OS to relax the order in which you receive them, which may allow you to process them while IO continues, and it may also optimise away some seeking and settle time. Maybe.
On Tue, 4 Dec 2007, Mark Mielke wrote: > > The larger the set of requests, the closer the performance will scale to > > the number of discs > > This assumes that you can know which pages to fetch ahead of time - > which you do not except for sequential read of a single table. There are circumstances where it may be hard to locate all the pages ahead of time - that's probably when you're doing a nested loop join. However, if you're looking up in an index and get a thousand row hits in the index, then there you go. Page locations to load. > Please show one of your query plans and how you as a person would design > which pages to request reads for. How about the query that "cluster <skrald@amossen.dk>" was trying to get to run faster a few days ago? Tom Lane wrote about it: | Wouldn't help, because the accesses to "questions" are not the problem. | The query's spending nearly all its time in the scan of "posts", and | I'm wondering why --- doesn't seem like it should take 6400msec to fetch | 646 rows, unless perhaps the data is just horribly misordered relative | to the index. Which is exactly what's going on. The disc is having to seek 646 times fetching a single row each time, and that takes 6400ms. He obviously has a standard 5,400 or 7,200 rpm drive with a seek time around 10ms. Or on a similar vein, fill a table with completely random values, say ten million rows with a column containing integer values ranging from zero to ten thousand. Create an index on that column, analyse it. Then pick a number between zero and ten thousand, and "SELECT * FROM table WHERE that_column = the_number_you_picked" Matthew -- Experience is what allows you to recognise a mistake the second time you make it.
James Mansion wrote: > Mark Mielke wrote: >> This assumes that you can know which pages to fetch ahead of time - >> which you do not except for sequential read of a single table. > Why doesn't it help to issue IO ahead-of-time requests when you are > scanning an index? You can read-ahead > in index pages, and submit requests for data pages as soon as it is > clear you'll want them. Doing so can allow > the disks and OS to relax the order in which you receive them, which > may allow you to process them while IO > continues, and it may also optimise away some seeking and settle > time. Maybe. Sorry to be unclear. To achieve a massive speedup (12X for 12 disks with RAID 0) requires that you know what reads to perform in advance. The moment you do not, you only have a starting point, your operations begin to serialize again. For example, you must scan the first index, to be able to know what table rows to read. At a minimum, this breaks your query into: 1) Preload all the index pages you will need, 2) Scan the index pages you needed, 3) Preload all the table page you will need, 4) Scan the table pages you needed. But do you really need the whole index? What if you only need parts of the index, and this plan now reads the whole index using async I/O "just in case" it is useful? Index is a B-Tree for a reason. In Matthew's case where he has an IN clause with thousands of possibles (I think?), perhaps a complete index scan is always the best case - but that's only one use case, and in my opinion, an obscure one. As soon as additional table joins become involved, the chance that whole index scans are required would probably normally reduce, which turns the index scan into a regular B-Tree scan, which is difficult to perform async I/O for, as you don't necessarily know which pages to read next. It seems like a valuable goal - but throwing imaginary numbers around does not appeal to me. I am more interested in Gregory's simulations. 
I would like to understand his simulation better, and see his results. Speculation about amazing potential is barely worth the words used to express it. The real work is in design and implementation. :-) Cheers, mark -- Mark Mielke <mark@mielke.cc>
Matthew wrote:
> On Tue, 4 Dec 2007, Mark Mielke wrote:
>>> The larger the set of requests, the closer the performance will scale
>>> to the number of discs
>> This assumes that you can know which pages to fetch ahead of time -
>> which you do not except for sequential read of a single table.
> Sure. There are circumstances where it may be hard to locate all the
> pages ahead of time - that's probably when you're doing a nested loop
> join. However, if you're looking up in an index and get a thousand row
> hits in the index, then there you go. Page locations to load.
>
>> Please show one of your query plans and how you as a person would
>> design which pages to request reads for.
> How about the query that "cluster <skrald@amossen.dk>" was trying to get
> to run faster a few days ago? Tom Lane wrote about it:
> | Wouldn't help, because the accesses to "questions" are not the problem.
> | The query's spending nearly all its time in the scan of "posts", and
> | I'm wondering why --- doesn't seem like it should take 6400msec to fetch
> | 646 rows, unless perhaps the data is just horribly misordered relative
> | to the index.
> Which is exactly what's going on. The disc is having to seek 646 times
> fetching a single row each time, and that takes 6400ms. He obviously has
> a standard 5,400 or 7,200 rpm drive with a seek time around 10ms.

It seems reasonable - but still a guess. Your proposal would not necessarily improve his case unless he also purchased additional disks, at which point his execution time may be different. More speculation. :-)

> Or on a similar vein, fill a table with completely random values, say
> ten million rows with a column containing integer values ranging from
> zero to ten thousand. Create an index on that column, analyse it. Then
> pick a number between zero and ten thousand, and "SELECT * FROM table
> WHERE that_column = the_number_you_picked".

This isn't a real use case. Optimizing for the worst case scenario is not always valuable.

Cheers,
mark

-- Mark Mielke <mark@mielke.cc>
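Matthew's back-of-the-envelope seek arithmetic in the exchange above can be checked directly (a quick sketch; the 10 ms seek time and the 12-disc ideal are both assumptions from the thread, not measurements):

```python
# Quick check of the seek arithmetic quoted above (all figures are the
# thread's assumptions, not measurements).
ROWS = 646        # rows fetched, one random seek each
SEEK_MS = 10.0    # typical 5,400/7,200 rpm drive: ~10 ms per random access

single_disc_ms = ROWS * SEEK_MS
print(single_disc_ms)        # 6460.0 -- matches the observed ~6400 ms

# If those seeks could somehow be kept in flight across a 12-disc array,
# the ideal (perfectly balanced) elapsed time would be roughly:
DISCS = 12
ideal_ms = single_disc_ms / DISCS
print(round(ideal_ms, 1))    # 538.3
```

The gap between 6.4 s and roughly half a second is exactly the potential Matthew is asking about - realisable only if more than one request can be outstanding at a time.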
On Tue, 4 Dec 2007, Gregory Stark wrote:
> Fwiw, what made you bring up this topic now? You're the second person in
> about two days to bring up precisely this issue and it was an issue I had
> been planning to bring up on -hackers as it was.

I only just joined the performance mailing list to talk about R-trees. I would probably have brought it up earlier if I had been here earlier. However, we're thinking of buying this large machine, and that reminded me.

I have been biting at the bit for my bosses to allow me time to write an indexing system for transient data - a lookup table backed by disc, looking up from an integer to get an object, native in Java. Our system often needs to fetch a list of a thousand different objects by a key like that, and Postgres just doesn't do that exact thing fast. I was going to implement it with full asynchronous IO, to do that particular job very fast, so I have done a reasonable amount of research into the topic. In Java, that is. It would add a little bit more performance for our system. That wouldn't cover us - we still need to do complex queries with the same problem, and that'll have to stay in Postgres.

Matthew

-- The early bird gets the worm. If you want something else for breakfast, get up later.
So much excitement and zeal - refreshing to see. And yet, no numbers! :-)

Matthew wrote:
> On Tue, 4 Dec 2007, Gregory Stark wrote:
>> Fwiw, what made you bring up this topic now? You're the second person in
>> about two days to bring up precisely this issue and it was an issue I had
>> been planning to bring up on -hackers as it was.
> I only just joined the performance mailing list to talk about R-trees. I
> would probably have brought it up earlier if I had been here earlier.
> However, we're thinking of buying this large machine, and that reminded
> me. I have been biting at the bit for my bosses to allow me time to write
> an indexing system for transient data - a lookup table backed by disc,
> looking up from an integer to get an object, native in Java. Our system
> often needs to fetch a list of a thousand different objects by a key like
> that, and Postgres just doesn't do that exact thing fast. I was going to
> implement it with full asynchronous IO, to do that particular job very
> fast, so I have done a reasonable amount of research into the topic. In
> Java, that is. It would add a little bit more performance for our system.
> That wouldn't cover us - we still need to do complex queries with the
> same problem, and that'll have to stay in Postgres.

You describe a new asynchronous I/O system to map integers to Java objects above. Why would you write this? Have you tried BerkeleyDB or BerkeleyDB JE? BerkeleyDB with BDB as a backend or their full Java backend gives you a Java persistence API that will allow you to map Java objects (including integers) to other Java objects. They use generated Java run time instructions instead of reflection to store and lock your Java objects. If it came to a bet, I would bet that their research and tuning over several years, and many people, would beat your initial implementation, asynchronous I/O or not.

Asynchronous I/O is no more a magic bullet than threading. It requires a lot of work to get it right, and if one gets it wrong, it can be slower than the regular I/O or single threaded scenarios. Both look sexy on paper, neither may be the solution to your problem. Or they may be. We wouldn't know without numbers.

Cheers,
mark

-- Mark Mielke <mark@mielke.cc>
On Tue, 4 Dec 2007, Mark Mielke wrote:
> So much excitement and zeal - refreshing to see. And yet, no numbers! :-)

What sort of numbers did you want to see?

> You describe a new asynchronous I/O system to map integers to Java
> objects above. Why would you write this? Have you tried BerkeleyDB or
> BerkeleyDB JE? BerkeleyDB with BDB as a backend or their full Java
> backend gives you a Java persistence API that will allow you to map Java
> objects (including integers) to other Java objects.

Looked at all those. Didn't like their performance characteristics, or their interfaces. It's simply the fact that they're not designed for the workload that we want to put on them.

> If it came to a bet, I would bet that their research and tuning over
> several years, and many people, would beat your initial implementation,
> asynchronous I/O or not.

Quite possibly. However, there's the possibility that it wouldn't. And I can improve it - initial implementations often suck.

> Asynchronous I/O is no more a magic bullet than threading. It requires a
> lot of work to get it right, and if one gets it wrong, it can be slower
> than the regular I/O or single threaded scenarios. Both look sexy on
> paper, neither may be the solution to your problem. Or they may be. We
> wouldn't know without numbers.

Okay, numbers. About eight years ago I wrote the shell of a filesystem implementation, concentrating on performance and integrity. It absolutely whooped ext2 for both read and write speed - especially metadata write speed. Anything up to 60 times as fast. Writing a load of metadata took 599.52 seconds on ext2 and 10.54 seconds on my system. Listing it back (presumably from cache) took 1.92 seconds on ext2 and 0.22 seconds on my system. No silly caching tricks that sacrifice integrity.

It's a pity I got nowhere near finishing that system - just enough to prove my point and get a degree, but looking back on it there are massive ways it's rubbish and should be improved. It was an initial implementation. I didn't have reiserfs, jfs, or xfs available at that time, but it would have been really interesting to compare. This is the system I would have based my indexing thing on.

Matthew

-- Anyone who goes to a psychiatrist ought to have his head examined.
"Matthew" <matthew@flymine.org> writes: > On Tue, 4 Dec 2007, Mark Mielke wrote: >> So much excitement and zeal - refreshing to see. And yet, no numbers! :-) > > What sort of numbers did you want to see? FWIW I posted some numbers from a synthetic case to pgsql-hackers http://archives.postgresql.org/pgsql-hackers/2007-12/msg00088.php Very promising -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning
Mark Mielke wrote:
> At a minimum, this breaks your query into: 1) Preload all the index
> pages you will need
Isn't this fairly predictable - the planner has chosen the index so it will be operating on a bounded subset.
> , 2) Scan the index pages you needed
Yes, and AIO helps when you can scan them in arbitrary order, as they are returned.
> , 3) Preload all the table pages you will need
No - just request that they load. You can do work as soon as any page is returned.
> , 4) Scan the table pages you needed.
In the order that is most naturally returned by the disks.
> But do you really need the whole index?
I don't think I suggested that.
> What if you only need parts of the index, and this plan now reads the
> whole index using async I/O "just in case" it is useful?
Where did you get that from?
> index scan into a regular B-Tree scan, which is difficult to perform
> async I/O for, as you don't necessarily know which pages to read next.
Most B-trees are not so deep. It would generally be a win to retain interior nodes of indices in shared memory, even if leaf pages are not present. In such a case, it is quite quick to establish which leaf pages must be demand loaded. I'm not suggesting that Postgres indices are structured in a way that would support this sort of thing now.

James
James Mansion wrote:
> Mark Mielke wrote:
>> At a minimum, this breaks your query into: 1) Preload all the index
>> pages you will need
> Isn't this fairly predictable - the planner has chosen the index so it
> will be operating on a bounded subset.

What is the bounded subset? It is bounded by the value. What value? You need to access the first page before you know what the second page is. PostgreSQL or the kernel should already have the hottest pages in memory, so the value of doing async I/O is very likely the cooler pages that are unique to the query. We don't know what the cooler pages are until we follow the tree down.

>> , 2) Scan the index pages you needed
> Yes, and AIO helps when you can scan them in arbitrary order, as they
> are returned.

I don't think you are talking about searching a B-Tree, as the order is important when searching, and search performance would be reduced if one reads and scans more pages than necessary to map from the value to the row. I presume you are talking about scanning the entire index. Where "needed" means "all". Again, this only benefits a subset of the queries.

>> , 3) Preload all the table pages you will need
> No - just request that they load. You can do work as soon as any page
> is returned.

The difference between preload and handling async I/O in terms of performance is debatable. Greg reports that async I/O on Linux sucks, but posix_fadvise*() has substantial benefits. posix_fadvise*() is preload, not async I/O (he also reported that async I/O on Solaris seems to work well). Being able to do work as the first page is available is a micro-optimization as far as I am concerned at this point (one that may not yet work on Linux), as the real benefit comes from utilizing all 12 disks in Matthew's case, not from guaranteeing that data is processed as soon as possible.

>> , 4) Scan the table pages you needed.
> In the order that is most naturally returned by the disks.

Micro-optimization.

>> But do you really need the whole index?
> I don't think I suggested that.
>> What if you only need parts of the index, and this plan now reads the
>> whole index using async I/O "just in case" it is useful?
> Where did you get that from?

I get it from your presumption that you can know which pages of the index to load in advance. The only way you can know which pages must be loaded is to know that you need to query them all. Unless you have some way of speculating with some degree of accuracy which colder pages in the index you will need, without looking at the index?

>> index scan into a regular B-Tree scan, which is difficult to perform
>> async I/O for, as you don't necessarily know which pages to read next.
> Most B-trees are not so deep. It would generally be a win to retain
> interior nodes of indices in shared memory, even if leaf pages are not
> present. In such a case, it is quite quick to establish which leaf pages
> must be demand loaded.

This is bogus. The less deep the B-Tree is, the less there should be any requirement for async I/O. Hot index pages will be in cache.

> I'm not suggesting that Postgres indices are structured in a way that
> would support this sort of thing now.

In your hand waving, you have assumed that PostgreSQL B-Tree index might need to be replaced? :-)

Cheers,
mark

-- Mark Mielke <mark@mielke.cc>
Mark Mielke wrote:
> PostgreSQL or the kernel should already have the hottest pages in
> memory, so the value of doing async I/O is very likely the cooler
> pages that are unique to the query. We don't know what the cooler
> pages are until we follow the tree down.

I'm assuming that at the time we start to search the index, we have some idea of the value or values that we are looking for. Or, as you say, we are applying a function to 'all of it'.

Think of a 'between' query. The subset of the index that can be a match can be bounded by the leaf pages that contain the end point(s). Similarly if we have a merge with a sorted intermediate set from a prior step then we also have bounds on the values.

I'm not convinced that your assertion that the index leaf pages must necessarily be processed in-order is true - it depends what sort of operation is under way. I am assuming that we try hard to keep interior index nodes and data in memory and that, having identified the subset of these that we want, we can immediately infer the set of leaves that are potentially of interest.

> The difference between preload and handling async I/O in terms of
> performance is debatable. Greg reports that async I/O on Linux sucks,
> but posix_fadvise*() has substantial benefits. posix_fadvise*() is
> preload not async I/O (he also reported that async I/O on Solaris
> seems to work well). Being able to do work as the first page is
> available is a micro-optimization as far as I am concerned at this
> point (one that may not yet work on Linux), as the real benefit comes
> from utilizing all 12 disks in Matthew's case, not from guaranteeing
> that data is processed as soon as possible.

I see it as part of the same problem. You can partition the data across all the disks and run queries in parallel against the partitions, or you can lay out the data in the RAID array, in which case the optimiser has very little idea how the data will map to physical data layout - so its best bet is to let the systems that DO know decide the access strategy. And those systems can only do that if you give them a lot of requests that CAN be reordered, so they can choose a good ordering.

> Micro-optimization.

Well, you like to assert this - but why? If a concern is the latency (and my experience suggests that latency is the biggest issue in practice, not throughput per se) then overlapping processing while waiting for 'distant' data is important - and we don't have any information about the physical layout of the data that allows us to assert that forward access pre-read of data from a file is the right strategy for accessing it as fast as possible - we have to allow the OS (and to an increasing extent the disks) to manage the elevator IO to best effect.

It's clear that the speed of streaming read and write of modern disks is really high compared to that of random access, so anything we can do to help the disks run in that mode is pretty worthwhile even if the physical streaming doesn't match any obvious logical ordering of the OS files or logical data pages within them. If you have a function to apply to a set of data elements and the elements are independent, then requiring that the function is applied in an order rather than conceptually in parallel is going to put a lot of constraint on how the hardware can optimise it.

Clearly a hint to preload is better than nothing. But it seems to me that the worst case is that we wait for the slowest page to load and then start processing hoping that the rest of the data stays in the buffer cache and is 'instant', while AIO and evaluate-when-ready means that the process is still bound by the slowest data to arrive, but at that point there is little processing still to do, and the already-processed buffers can be reused earlier. In the case where there is significant pressure on the buffer cache, this can be significant.

Of course, a couple of decades bullying Sybase systems on Sun Enterprise boxes may have left me somewhat jaundiced - but Sybase can at least parallelise things. Sometimes. When it does, it's quite a big win.

> In your hand waving, you have assumed that PostgreSQL B-Tree index
> might need to be replaced? :-)

Sure, if the intent is to make the system thread-hot or AIO-hot, then the change is potentially very invasive. The strategy to evaluate queries based on parallel execution and async IO is not necessarily very like a strategy where you delegate to the OS buffer cache.

I'm not too bothered for the purpose of this discussion whether the way that postgres currently navigates indexes is amenable to this. This is bikeshed land, right?

I think it is foolish to disregard strategies that will allow overlapping IO and processing - and we want to keep disks reading and writing rather than seeking. To me that suggests AIO and disk-native queuing are quite a big deal. And parallel evaluation will be too as the number of cores goes up and there is an expectation that this should reduce latency of individual queries, not just allow throughput with lots of concurrent demand.

James
James Mansion wrote:
> Mark Mielke wrote:
>> PostgreSQL or the kernel should already have the hottest pages in
>> memory, so the value of doing async I/O is very likely the cooler
>> pages that are unique to the query. We don't know what the cooler
>> pages are until we follow the tree down.
> I'm assuming that at the time we start to search the index, we have
> some idea of the value or values that we are looking for. Or, as you
> say, we are applying a function to 'all of it'.
>
> Think of a 'between' query. The subset of the index that can be a match
> can be bounded by the leaf pages that contain the end point(s).
> Similarly if we have a merge with a sorted intermediate set from a
> prior step then we also have bounds on the values.

How do you find the bounding points for these pages? Does this not require descending through the tree in a more traditional way?

> I'm not convinced that your assertion that the index leaf pages must
> necessarily be processed in-order is true - it depends what sort of
> operation is under way. I am assuming that we try hard to keep interior
> index nodes and data in memory and that having identified the subset of
> these that we want, we can immediately infer the set of leaves that are
> potentially of interest.

It is because you are missing my point. In order to find a reduced set of pages to load, one must descend through the B-Tree. Identifying the subset requires sequential loading of pages. There is some theoretical potential for async I/O, but I doubt that it is significant in any meaningful way. I ask you again - how do you know which lower level pages to load before you load the higher level pages?

The only time the B-Tree can be processed out of order in this regard is if you are doing a bitmap index scan or some similar operation that will scan the entire tree, and you do not care what order the pages arrive in. If you are looking for a specific key, one parent page leads to one child page, and the operation is sequential.

>> The difference between preload and handling async I/O in terms of
>> performance is debatable. Greg reports that async I/O on Linux sucks,
>> but posix_fadvise*() has substantial benefits. posix_fadvise*() is
>> preload not async I/O (he also reported that async I/O on Solaris
>> seems to work well). Being able to do work as the first page is
>> available is a micro-optimization as far as I am concerned at this
>> point (one that may not yet work on Linux), as the real benefit comes
>> from utilizing all 12 disks in Matthew's case, not from guaranteeing
>> that data is processed as soon as possible.
> I see it as part of the same problem. You can partition the data across
> all the disks and run queries in parallel against the partitions, or
> you can lay out the data in the RAID array, in which case the optimiser
> has very little idea how the data will map to physical data layout - so
> its best bet is to let the systems that DO know decide the access
> strategy. And those systems can only do that if you give them a lot of
> requests that CAN be reordered, so they can choose a good ordering.

You can see it however you like - what remains is that the 12X speed up is going to come from using 12 disks, and loading the 12 disks in parallel. More theoretical improvements with regard to the ability for a particular hard disk to schedule reads and return results out of order have not, in my reading, been shown to reliably improve performance to a significant degree. Unfortunately, Linux doesn't support my SATA NCQ yet, so I haven't been able to experiment myself. Gregory provided numbers showing a 2X - 3X performance improvement when using three disks. This has the potential for significant improvement with real hardware, modest cost, and perhaps trivial changes to PostgreSQL.

What you describe is a re-design of the I/O strategy that will be designed for asynchronous I/O, with some sort of state machine that will be able to deal efficiently with either index pages or table pages out of order. Do you have numbers to show that such a significant change would result in a reasonable return on the investment?

>> Micro-optimization.
> Well, you like to assert this - but why?

I'll quote from: http://en.wikipedia.org/wiki/Native_Command_Queuing

Most reading I have done shows NCQ to have limited gains, with most benchmarks being suspect. Also, latency is only for the first page. One presumes that async I/O will be mostly valuable in the case where many pages can be scheduled for reading at the same time. In the case that many pages are scheduled for reading, the first page will eventually be served, and the overall bandwidth is still the same. In your theoretical model, you are presuming the CPU is a bottleneck, either for processing, or for scheduling the next batch of reads. I think you are hand waving, and given that Linux doesn't yet support async I/O well, Gregory's model will serve more PostgreSQL users than yours, with less complexity.

> Clearly a hint to preload is better than nothing. But it seems to me
> that the worst case is that we wait for the slowest page to load and
> then start processing hoping that the rest of the data stays in the
> buffer cache and is 'instant', while AIO and evaluate-when-ready means
> that the process is still bound by the slowest data to arrive, but at
> that point there is little processing still to do, and the
> already-processed buffers can be reused earlier. In the case where
> there is significant pressure on the buffer cache, this can be
> significant.

It seems to me that you have yet to prove that there will be substantial gains in overall performance for preload. Leaping on to presuming that PostgreSQL should switch to a fully async I/O model seems a greater stretch. By the sounds of it, Gregory could have the first implemented very soon. When will you have yours? :-)

>> In your hand waving, you have assumed that PostgreSQL B-Tree index
>> might need to be replaced? :-)
> Sure, if the intent is to make the system thread-hot or AIO-hot, then
> the change is potentially very invasive. The strategy to evaluate
> queries based on parallel execution and async IO is not necessarily
> very like a strategy where you delegate to the OS buffer cache.
>
> I'm not too bothered for the purpose of this discussion whether the way
> that postgres currently navigates indexes is amenable to this. This is
> bikeshed land, right?

I am only interested by juicy projects that have a hope of success. This subject does interest me - I am hoping my devil's advocate participation encourages people to seek a practical implementation that will benefit me.

> I think it is foolish to disregard strategies that will allow
> overlapping IO and processing - and we want to keep disks reading and
> writing rather than seeking. To me that suggests AIO and disk-native
> queuing are quite a big deal. And parallel evaluation will be too as
> the number of cores goes up and there is an expectation that this
> should reduce latency of individual queries, not just allow throughput
> with lots of concurrent demand.

I am more amenable to multi-threaded index processing for the same query than async I/O to take advantage of NCQ. Guess we each come from a different background. :-)

Cheers,
mark

-- Mark Mielke <mark@mielke.cc>
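Mark's point that a single-key lookup is inherently serial can be sketched with a toy B-Tree (hypothetical page layout, Python used purely for illustration): the id of the next page to read is only known after the current page has been read, so a point lookup offers nothing to hand to an asynchronous queue.

```python
# Toy B-Tree point lookup (hypothetical page layout, for illustration).
# The point: the next page id is only known *after* the current page has
# been read, so a single-key descent offers nothing to prefetch.

pages = {
    # interior page: separator keys and child page ids
    "root":  {"keys": [100, 200], "children": ["leaf0", "leaf1", "leaf2"]},
    "leaf0": {"rows": {17: "a", 42: "b"}},
    "leaf1": {"rows": {150: "c"}},
    "leaf2": {"rows": {250: "d"}},
}

def lookup(key):
    reads = []                    # demand-read order, one page at a time
    page_id = "root"
    while True:
        page = pages[page_id]     # stand-in for a disk read
        reads.append(page_id)
        if "rows" in page:        # reached a leaf
            return page["rows"].get(key), reads
        # choosing the child requires the parent's contents
        i = sum(k <= key for k in page["keys"])
        page_id = page["children"][i]

value, reads = lookup(42)
print(value, reads)               # b ['root', 'leaf0']
```

A bitmap index scan, by contrast, finishes with a whole list of heap page numbers in hand at once - which is why it is the low-hanging fruit for prefetching elsewhere in the thread.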
On Tue, 4 Dec 2007, Mark Mielke wrote:
>> This is bikeshed land, right?
> I am only interested by juicy projects that have a hope of success. This
> subject does interest me - I am hoping my devil's advocate participation
> encourages people to seek a practical implementation that will benefit me.

Nah, you're both in bikeshed land. Ultimately this feature is going to get built out in a way that prioritizes as much portability as is possible while minimizing the impact on existing code. The stated PostgreSQL bias is that async I/O is not worth the trouble until proven otherwise: http://www.postgresql.org/docs/faqs.FAQ_DEV.html#item1.14 That's why Greg Stark is busy with low-level tests of I/O options while you're arguing much higher level details. Until you've got some benchmarks in this specific area to go by, you can talk theory all day and not impact what implementation will actually get built one bit.

As an example of an area you've already brought up where theory and benchmarks clash violently, the actual situation with NCQ on SATA drives has been heavily blurred because of how shoddy some manufacturers' NCQ implementations are. Take a look at the interesting thread on this topic at http://lkml.org/lkml/2007/4/3/159 to see what I'm talking about. Even if you have an NCQ drive, and a version of Linux that supports it, and you've set everything up right, you can still end up with unexpected performance regressions under some circumstances. It's still the wild west with that technology.

-- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Mark Mielke wrote: > Asynchronous I/O is no more a magic bullet than threading. It requires a > lot of work to get it right, and if one gets it wrong, it can be slower > than the regular I/O or single threaded scenarios. Both look sexy on > paper, neither may be the solution to your problem. Or they may be. We > wouldn't know without numbers. Agreed. We currently don't use multiple CPUs or multiple disks efficiently for single-query loads. There is certainly more we could do in these areas, and it is on the TODO list. The good news is that most work loads are multi-user and we use resources more evenly in those cases. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://postgres.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
"Matthew" <matthew@flymine.org> writes: > On Tue, 4 Dec 2007, Gregory Stark wrote: >> FWIW I posted some numbers from a synthetic case to pgsql-hackers >> >> http://archives.postgresql.org/pgsql-hackers/2007-12/msg00088.php >... > This was with 8192 random requests of size 8192 bytes from an 80GB test file. > Unsorted requests ranged from 1.8 MB/s with no prefetching to 28MB/s with lots > of prefetching. Sorted requests went from 2.4MB/s to 38MB/s. That's almost > exactly 16x improvement for both, and this is top of the line hardware. Neat. The curves look very similar to mine. I also like that with your hardware the benefit maxes out at pretty much exactly where I had mathematically predicted they would ((stripe size)^2 / 2). > So, this is FYI, and also an added encouragement to implement fadvise > prefetching in some form or another. How's that going by the way? I have a patch which implements it for the low hanging fruit of bitmap index scans. it does it using an extra trip through the buffer manager which is the least invasive approach but not necessarily the best. Heikki was pushing me to look at changing the buffer manager to support doing asynchronous buffer reads. In that scheme you would issue a ReadBufferAsync which would give you back a pinned buffer which the scan would have to hold onto and call a new buffer manager call to complete the i/o when it actually needed the buffer. The ReadBufferAsync call would put the buffer in some form of i/o in progress. On systems with only posix_fadvise ReadBufferAsync would issue posix_fadvise and ReadBufferFinish would issue a regular read. On systems with an effective libaio the ReadBufferAsync could issue the aio operation and ReadBufferFinish could then do a aio_return. The pros of such an approach would be less locking and cpu overhead in the buffer manager since the second trip would have the buffer handle handy and just have to issue the read. 
The con of such an approach would be the extra shared buffers occupied by buffers which aren't really needed yet. Also the additional complexity in the buffer manager with the new i/o in progress state. (I'm not sure but I don't think we get to reuse the existing i/o in progress state.) -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
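The posix_fadvise trick Gregory describes - hint every upcoming random read to the kernel first, then issue the ordinary reads - can be sketched in miniature. This is only the shape of the idea, not the actual patch; `os.posix_fadvise` requires a POSIX system (hence the guard), and the function and file names here are made up for illustration:

```python
import os

BLOCK = 8192

def prefetch_then_read(path, block_numbers):
    """Hint all upcoming random reads to the kernel first (the
    posix_fadvise(POSIX_FADV_WILLNEED) trick), then issue the ordinary
    reads. On a striped array the kernel can service the hinted blocks
    on several spindles in parallel; the later pread() calls then
    mostly hit cache."""
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):   # POSIX/Linux only
            for b in block_numbers:
                os.posix_fadvise(fd, b * BLOCK, BLOCK, os.POSIX_FADV_WILLNEED)
        return [os.pread(fd, BLOCK, b * BLOCK) for b in block_numbers]
    finally:
        os.close(fd)
```

Called with the list of heap blocks a bitmap scan produced, this gives the kernel the chance to schedule the whole batch across the stripe before the first blocking read is issued - which is exactly where the 16x figures above come from.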
On Tue, 29 Jan 2008, Gregory Stark wrote:
>> This was with 8192 random requests of size 8192 bytes from an 80GB test
>> file. Unsorted requests ranged from 1.8 MB/s with no prefetching to
>> 28MB/s with lots of prefetching. Sorted requests went from 2.4MB/s to
>> 38MB/s. That's almost exactly 16x improvement for both, and this is top
>> of the line hardware.
>
> Neat. The curves look very similar to mine. I also like that with your
> hardware the benefit maxes out at pretty much exactly where I had
> mathematically predicted they would ((stripe size)^2 / 2).

Why would that be the case? Does that mean that we can select a stripe size of 100GB and get massive performance improvements? Doesn't seem logical to me. To me, it maxes out at 16x speed because there are 16 discs.

Amusingly, there appears to be a spam filter preventing my message (with its image) from getting through to the performance mailing list.

Matthew
"Matthew" <matthew@flymine.org> writes: > On Tue, 29 Jan 2008, Gregory Stark wrote: >>> This was with 8192 random requests of size 8192 bytes from an 80GB test file. >>> Unsorted requests ranged from 1.8 MB/s with no prefetching to 28MB/s with lots >>> of prefetching. Sorted requests went from 2.4MB/s to 38MB/s. That's almost >>> exactly 16x improvement for both, and this is top of the line hardware. >> >> Neat. The curves look very similar to mine. I also like that with your >> hardware the benefit maxes out at pretty much exactly where I had >> mathematically predicted they would ((stripe size)^2 / 2). > > Why would that be the case? Does that mean that we can select a stripe size of > 100GB and get massive performance improvements? Doesn't seem logical to me. To > me, it maxes out at 16x speed because there are 16 discs. Sorry, I meant "number of drives in the array" not number of bytes. So with 16 drives you would need approximately 128 random pending i/o operations to expect all drives to be busy at all times. I got this from a back-of-the-envelope calculation which now that I'm trying to reproduce it seems to be wrong. Previously I thought it was n(n+1)/2 or about n^2/2. So at 16 I would have expected about 128 pending i/o requests before all the drives could be expected to be busy. Now that I'm working it out more carefully I'm getting that the expected number of pending i/o requests before all drives are busy is n + n/2 + n/3 + ... + n/n which is actually n * H(n) which is approximated closely by n * log(n). That would predict that 16 drives would actually max out at 44.4 pending i/o requests. It would predict that my three-drive array would max out well below that at 7.7 pending i/o requests. Empirically neither result seems to match reality. Other factors must be dominating. > Amusingly, there appears to be a spam filter preventing my message (with its > image) getting through to the performance mailing list. This has been plaguing us for a while. 
When we figure out whose misconfigured system is doing it I expect they'll be banned from the internet for life!

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
>>> On Tue, Jan 29, 2008 at 9:52 AM, in message
<877ihsvdcb.fsf@oxford.xeocode.com>, Gregory Stark <stark@enterprisedb.com>
wrote:

> I got this from a back-of-the-envelope calculation which now that I'm
> trying to reproduce it seems to be wrong. Previously I thought it was
> n(n+1)/2 or about n^2/2. So at 16 I would have expected about 128 pending
> i/o requests before all the drives could be expected to be busy.

That seems right to me, based on the probabilities of any new request
hitting an already-busy drive.

> Now that I'm working it out more carefully I'm getting that the expected
> number of pending i/o requests before all drives are busy is
> n + n/2 + n/3 + ... + n/n

What's the basis for that?

-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > >>> On Tue, Jan 29, 2008 at 9:52 AM, in message > <877ihsvdcb.fsf@oxford.xeocode.com>, Gregory Stark <stark@enterprisedb.com> > wrote: > > > I got this from a back-of-the-envelope calculation which now that I'm trying > > to reproduce it seems to be wrong. Previously I thought it was n(n+1)/2 or > > about n^2/2. So at 16 I would have expected about 128 pending i/o requests > > before all the drives could be expected to be busy. > > That seems right to me, based on the probabilities of any new > request hitting an already-busy drive. > > > Now that I'm working it out more carefully I'm getting that the expected > > number of pending i/o requests before all drives are busy is > > n + n/2 + n/3 + ... + n/n > > What's the basis for that? Well consider when you've reached n-1 drives; the expected number of requests before you hit the 1 idle drive remaining out of n would be n requests. When you're at n-2 the expected number of requests before you hit either of the two idle drives would be n/2. And so on. The last term of n/n would be the first i/o when all the drives are idle and you obviously only need one i/o to hit an idle drive. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
>>> On Tue, Jan 29, 2008 at 10:45 AM, in message
<873asginrz.fsf@stark.xeocode.com>, Gregory Stark <gsstark@mit.edu> wrote:

> Well consider when you've reached n-1 drives; the expected number of
> requests before you hit the 1 idle drive remaining out of n would be n
> requests. When you're at n-2 the expected number of requests before you
> hit either of the two idle drives would be n/2. And so on. The last term
> of n/n would be the first i/o when all the drives are idle and you
> obviously only need one i/o to hit an idle drive.

You're right.

Perhaps the reason more requests continue to improve performance is that a
smart controller will move across the tracks and satisfy the pending
requests in the most efficient order?

-Kevin
On Tue, 29 Jan 2008, Gregory Stark wrote:
>> So, this is FYI, and also an added encouragement to implement fadvise
>> prefetching in some form or another. How's that going by the way?
>
> I have a patch which implements it for the low hanging fruit of bitmap
> index scans. it does it using an extra trip through the buffer manager
> which is the least invasive approach but not necessarily the best.

Gregory - what's the status of that patch at the moment? Will it be making
it into a new version of Postgres, or are we waiting for it to be
implemented fully?

It's just that our system is doing a lot of bitmap index scans at the
moment, and it'd help to be able to spread them across the 16 discs in the
RAID array. It's the bottleneck in our system at the moment.

Matthew

-- 
The email of the species is more deadly than the mail.
On Thu, 18 Sep 2008, Matthew Wakeling wrote:
> On Tue, 29 Jan 2008, Gregory Stark wrote:
>> I have a patch which implements it for the low hanging fruit of bitmap
>> index scans. it does it using an extra trip through the buffer manager
>> which is the least invasive approach but not necessarily the best.
>
> Gregory - what's the status of that patch at the moment? Will it be
> making it into a new version of Postgres, or are we waiting for it to be
> implemented fully?

It and a related fadvise patch have been floating around the project queue
for a while now. I just sorted through all the various patches and messages
related to this area and updated the list at
http://wiki.postgresql.org/wiki/CommitFestInProgress#Pending_patches
recently. I'm trying to kick back a broader reviewed version of this
concept right now.

> It's just that our system is doing a lot of bitmap index scans at the
> moment, and it'd help to be able to spread them across the 16 discs in
> the RAID array. It's the bottleneck in our system at the moment.

If you have some specific bitmap index scan test case suggestions you can
pass along (either publicly or in private to me, I can probably help
anonymize them), that's one of the things that has been holding this up.
Alternately, if you'd like to join in on testing this all out more help
would certainly be welcome.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Thu, Sep 18, 2008 at 1:30 PM, Greg Smith <gsmith@gregsmith.com> wrote:
> If you have some specific bitmap index scan test case suggestions you can
> pass along (either publicly or in private to me, I can probably help
> anonymize them), that's one of the things that has been holding this up.
> Alternately, if you'd like to join in on testing this all out more help
> would certainly be welcome.

I posted in pgsql-perform about a problem where a query using bitmap heap
scans is really slow compared to just using nested-loops. Don't know if
that is relevant or not.
On Thu, 18 Sep 2008, Greg Smith wrote:
>> It's just that our system is doing a lot of bitmap index scans at the
>> moment, and it'd help to be able to spread them across the 16 discs in
>> the RAID array. It's the bottleneck in our system at the moment.
>
> If you have some specific bitmap index scan test case suggestions you can
> pass along (either publicly or in private to me, I can probably help
> anonymize them), that's one of the things that has been holding this up.

Okay, here's a description of what we're doing.

We are importing data from a large file (so we don't have a choice on the
order that the data comes in). For each entry in the source, we have to
look up the corresponding row in the database, and issue the correct
"UPDATE" command according to what's in the database and what's in the data
source. Long story - it isn't feasible to change the overall process.

In order to improve the performance, I made the system look ahead in the
source, in groups of a thousand entries, so instead of running:

SELECT * FROM table WHERE field = 'something';

a thousand times, we now run:

SELECT * FROM table WHERE field IN ('something', 'something else'...);

with a thousand things in the IN. Very simple query. It does run faster
than the individual queries, but it still takes quite a while.
Here is an example query:

SELECT a1_.id AS a1_id, a1_.primaryIdentifier AS a2_
FROM Gene AS a1_
WHERE a1_.primaryIdentifier IN (
    'SPAC11D3.15', 'SPAC11D3.16c', 'SPAC11D3.17', 'SPAC11D3.18c',
    'SPAC15F9.01c', 'SPAC15F9.02', 'SPAC16.01', 'SPAC18G6.01c',
    'SPAC18G6.02c', 'SPAC18G6.04c', 'SPAC18G6.05c', 'SPAC18G6.06',
    'SPAC18G6.07c', 'SPAC18G6.09c', 'SPAC18G6.10', 'SPAC18G6.11c',
    'SPAC18G6.12c', 'SPAC18G6.13', 'SPAC18G6.14c', 'SPAC18G6.15',
    'SPAC1B9.01c', 'SPAC1D4.02c', 'SPAC1D4.03c', 'SPAC1D4.04',
    'SPAC1D4.05c', 'SPAC1D4.07c', 'SPAC1D4.08', 'SPAC1D4.09c',
    'SPAC1D4.10', 'SPAC1D4.11c', 'SPAC1F3.11', 'SPAC23A1.10',
    'SPAC23E2.01', 'SPAC23E2.02', 'SPAC23E2.03c', 'SPAC26A3.02',
    'SPAC26A3.03c', 'SPAC26A3.05', 'SPAC26A3.06', 'SPAC26A3.07c',
    'SPAC26A3.08', 'SPAC26A3.09c', 'SPAC26A3.10', 'SPAC26A3.11',
    'SPAC26A3.14c', 'SPAC26A3.15c', 'SPAC26A3.16', 'SPAC27F1.01c',
    'SPAC27F1.03c', 'SPAC27F1.04c', 'SPAC27F1.05c', 'SPAC27F1.06c',
    'SPAC3H8.02', 'SPAC3H8.03', 'SPAC3H8.04', 'SPAC3H8.05c',
    'SPAC3H8.06', 'SPAC3H8.07c', 'SPAC3H8.08c', 'SPAC3H8.09c',
    'SPAC3H8.10', 'SPAC3H8.11', 'SPAC8E11.11', 'SPBC106.15',
    'SPBC17G9.10', 'SPBC24E9.15c', 'WBGene00000969', 'WBGene00003035',
    'WBGene00004095', 'WBGene00016011', 'WBGene00018672', 'WBGene00018674',
    'WBGene00018675', 'WBGene00018676', 'WBGene00018959', 'WBGene00018960',
    'WBGene00018961', 'WBGene00023407')
ORDER BY a1_.id LIMIT 2000;

And the corresponding EXPLAIN ANALYSE:

 Limit  (cost=331.28..331.47 rows=77 width=17) (actual time=121.973..122.501 rows=78 loops=1)
   ->  Sort  (cost=331.28..331.47 rows=77 width=17) (actual time=121.968..122.152 rows=78 loops=1)
         Sort Key: id
         Sort Method:  quicksort  Memory: 29kB
         ->  Bitmap Heap Scan on gene a1_  (cost=174.24..328.87 rows=77 width=17) (actual time=114.311..121.705 rows=78 loops=1)
               Recheck Cond: (primaryidentifier = ANY ('{SPAC11D3.15...
               ->  Bitmap Index Scan on gene__key_primaryidentifier  (cost=0.00..174.22 rows=77 width=0) (actual time=44.434..44.434 rows=150 loops=1)
                     Index Cond: (primaryidentifier = ANY ('{SPAC11D3.15,SPAC11D3.16c...
 Total runtime: 122.733 ms
(9 rows)

Although it's probably in the cache, as it took 1073 ms the first time. The
table has half a million rows, but tables all over the database are being
accessed, so the cache is shared between several hundred million rows.

Postgres executes this query in two stages. First it does a trawl of the
index (on field), and builds a bitmap. Then it fetches the pages according
to the bitmap. I can see the second stage being quite easy to adapt for
fadvise, but the first stage would be a little more tricky. Both stages are
equally important, as they take a comparable amount of time.

We are running this database on a 16-spindle RAID array, so the benefits to
our process of fully utilising them would be quite large. I'm considering
if I can parallelise things a little though.

> Alternately, if you'd like to join in on testing this all out more help
> would certainly be welcome.

How would you like me to help?

Matthew

-- 
What goes up must come down. Ask any system administrator.
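[Editor's note: the look-ahead batching described above can be sketched in a
few lines. This is an illustration only, using Python's built-in sqlite3 as
a stand-in for Postgres; the table and column names are borrowed from the
example query, and the data is made up.]

```python
import sqlite3

def batched_lookup(conn, keys, batch=1000):
    """One IN-list query per batch of keys instead of one query per key.
    Returns {primaryIdentifier: id} for every key that was found."""
    found = {}
    for i in range(0, len(keys), batch):
        chunk = keys[i:i + batch]
        placeholders = ",".join("?" for _ in chunk)
        rows = conn.execute(
            "SELECT primaryIdentifier, id FROM Gene "
            "WHERE primaryIdentifier IN (%s)" % placeholders,
            chunk,
        )
        for ident, row_id in rows:
            found[ident] = row_id
    return found

# Demo data: a small stand-in for the half-million-row Gene table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Gene (id INTEGER PRIMARY KEY, primaryIdentifier TEXT)")
conn.execute("CREATE INDEX gene__key_primaryidentifier ON Gene (primaryIdentifier)")
conn.executemany("INSERT INTO Gene VALUES (?, ?)",
                 [(i, "SPAC%d" % i) for i in range(5000)])

wanted = ["SPAC%d" % i for i in range(0, 5000, 7)] + ["missing"]
hits = batched_lookup(conn, wanted)
print(len(hits))  # 715: every seventh identifier exists, "missing" does not
```

Using bound parameters rather than splicing literals into the SQL string
also sidesteps quoting problems, though as discussed below it does not by
itself avoid re-planning each distinct IN-list length.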
Matthew Wakeling <matthew@flymine.org> writes:
> In order to improve the performance, I made the system look ahead in the
> source, in groups of a thousand entries, so instead of running:
> SELECT * FROM table WHERE field = 'something';
> a thousand times, we now run:
> SELECT * FROM table WHERE field IN ('something', 'something else'...);
> with a thousand things in the IN. Very simple query. It does run faster
> than the individual queries, but it still takes quite a while. Here is an
> example query:

Your example shows the IN-list as being sorted, but I wonder whether you
actually are sorting the items in practice? If not, you might try that to
improve locality of access to the index.

Also, parsing/planning time could be part of your problem here with 1000
things to look at. Can you adjust your client code to use a prepared query?
I'd try

SELECT * FROM table WHERE field = ANY($1::text[])

(or whatever the field datatype actually is) and then push the list over as
a single parameter value using array syntax. You might find that it scales
to much larger IN-lists that way.

			regards, tom lane
On Fri, 19 Sep 2008, Tom Lane wrote:
> Your example shows the IN-list as being sorted, but I wonder whether you
> actually are sorting the items in practice? If not, you might try that
> to improve locality of access to the index.

Well, like I said, we generally don't have the luxury of dictating the
order of entries in the data source. However, the IN list itself is sorted
- more to do with making the logs readable and the tests reproducible than
for performance.

However, I have been looking at changing the order of the input data. This
particular data source is a 29GB xml file, and I wrote a quick program
which sorts that by one key in 40 minutes, which will hopefully allow later
data sources (which are easier to sort) to take advantage of spatial
locality in the table. However, that key is not the same one as the one
used in the query above, hence why I say we can't really dictate the order
of the entries. There's another complication which I won't go into.

> Also, parsing/planning time could be part of your problem here with 1000
> things to look at. Can you adjust your client code to use a prepared
> query? I'd try
> SELECT * FROM table WHERE field = ANY($1::text[])
> (or whatever the field datatype actually is) and then push the list
> over as a single parameter value using array syntax. You might find
> that it scales to much larger IN-lists that way.

Yes, that is a useful suggestion. However, I am fairly clear that the
system is disk seek-bound at the moment, so it probably wouldn't make a
massive improvement. It would also unfortunately require changing a lot of
our code. Worth doing at some point.

Matthew

-- 
"Interwoven alignment preambles are not allowed."
If you have been so devious as to get this message, you will understand
it, and you deserve no sympathy.  -- Knuth, in the TeXbook
Matthew Wakeling <matthew@flymine.org> writes:
> In order to improve the performance, I made the system look ahead in the
> source, in groups of a thousand entries, so instead of running:
>
> SELECT * FROM table WHERE field = 'something';
>
> a thousand times, we now run:
>
> SELECT * FROM table WHERE field IN ('something', 'something else'...);
>
> with a thousand things in the IN. Very simple query. It does run faster
> than the individual queries, but it still takes quite a while. Here is an
> example query:
Have you considered temporary tables? Use COPY to put everything you want to query into a temporary table, then SELECT to join the results, and pull all of the results, doing additional processing (UPDATE) as you pull?
Cheers,
mark
-- Mark Mielke <mark@mielke.cc>
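[Editor's note: the temporary-table approach suggested here can be sketched
as follows. Again sqlite3 stands in for Postgres purely for illustration,
executemany stands in for the much faster COPY, and the names are made up.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Gene (id INTEGER PRIMARY KEY, primaryIdentifier TEXT)")
conn.executemany("INSERT INTO Gene VALUES (?, ?)",
                 [(i, "SPAC%d" % i) for i in range(5000)])

# Bulk-load the whole set of lookup keys into a temporary table in one go.
# In Postgres this load would be COPY, which is far cheaper than issuing
# thousands of single-row statements.
conn.execute("CREATE TEMP TABLE lookup_keys (key TEXT)")
wanted = [("SPAC%d" % i,) for i in range(0, 5000, 7)] + [("missing",)]
conn.executemany("INSERT INTO lookup_keys VALUES (?)", wanted)

# A single join then replaces thousands of individual lookups, and the
# planner gets to pick a strategy for the whole batch at once.
rows = conn.execute(
    "SELECT g.id, g.primaryIdentifier FROM Gene g "
    "JOIN lookup_keys k ON g.primaryIdentifier = k.key "
    "ORDER BY g.id"
).fetchall()
print(len(rows))  # keys present in Gene; 'missing' simply joins to nothing
```

One design consequence: the keys become ordinary table data, so the server
can sort or hash them itself, which fits the earlier point about access
locality on the index.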