Thread: Commitfest patches
Bruce, you seem to have removed one of my three patches from the queue. I would actually prefer you remove the other two and put back that one. It's the one I most urgently need feedback on to continue. The patch I'm so interested in receiving feedback on is the patch to preread pages in bitmap index scans using posix_fadvise. This is basically complete modulo documentation and autoconf tests. However there is a larger patch struggling to get out here. The patch I submitted just adds a new low-level routine to preread blocks and doesn't change the buffer manager routines at all. That means we have to make two trips through the buffer manager: first to check whether the block is present, and then subsequently to assign a buffer and read it in (from cache) synchronously. A more invasive form of this patch would be to assign and pin a buffer when the preread is done. That would mean subsequently we would have a pinned buffer ready to go and not need to go back to the buffer manager a second time. We would instead just "complete" the i/o by issuing a normal read call. The up-side would be fewer trips through the buffer manager and the attendant locking. The down-side would be stealing buffers from shared buffers which could be holding other pages. I don't feel like writing the more invasive patch only to have it thrown out in favour of the one I've already written, but I don't know what other people's feelings are between the two. I'm leaning towards the more invasive buffer manager changes myself. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
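For readers who haven't looked at the patch itself, the non-invasive low-level preread described above amounts, at the syscall level, to something like the following sketch. The function name, the hard-coded 8 kB block size, and the fallback are illustrative assumptions, not the patch's actual API.

    /* Minimal sketch of a low-level preread: hint the kernel that a block will
     * be needed soon, without touching the buffer manager at all.  Names and
     * constants here are illustrative, not the patch's actual API. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>

    #define BLCKSZ 8192

    static int
    preread_block(int fd, unsigned int blocknum)
    {
    #ifdef POSIX_FADV_WILLNEED
        /* Ask the kernel to start reading this block in the background. */
        return posix_fadvise(fd, (off_t) blocknum * BLCKSZ,
                             BLCKSZ, POSIX_FADV_WILLNEED);
    #else
        (void) fd;
        (void) blocknum;
        return 0;               /* no-op where posix_fadvise is unavailable */
    #endif
    }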
On Fri, 2008-03-28 at 09:08 +0000, Gregory Stark wrote: > A more invasive form of this patch would be to assign and pin a buffer when > the preread is done. That would mean subsequently we would have a pinned buffer > ready to go and not need to go back to the buffer manager a second time. We > would instead just "complete" the i/o by issuing a normal read call. So if posix_fadvise did nothing or there was a longer than optimal delay, this would be a net loss. You'd need reasonable evidence that the posix_fadvise facility was a win on all platforms and recent release levels before we could agree with that. I think we need a more thorough examination of this area before we commit anything. Maybe you've done this, but I haven't seen the analysis. Can you say more, please? Or at least say what you don't know, so other experts listening can fill in the blanks. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Fri, 2008-03-28 at 09:08 +0000, Gregory Stark wrote: > >> A more invasive form of this patch would be to assign and pin a buffer when >> the preread is done. That would men subsequently we would have a pinned buffer >> ready to go and not need to go back to the buffer manager a second time. We >> would instead just "complete" the i/o by issuing a normal read call. > > So if posix_fadvise did nothing or there was a longer than optimal > delay, this would be a net loss. > > You'd need reasonable evidence that the posix_fadvise facility was a win > on all platforms and recent release levels before we could agree with > that. I posted a test program and asked for others to test it on various platforms and various sized raid arrays. matthew@flymine.org was the only person who bothered to run it and he only found a 16x speedup on his hardware. http://archives.postgresql.org/pgsql-performance/2008-01/msg00285.php (unfortunately our mail archives seem to have failed to archive his message, this is my followup to it) > I think we need a more thorough examination of this area before we > commit anything. Maybe you've done this, but I haven't seen the > analysis. Can you say more, please? Or at least say what you don't know, > so other experts listening can fill in the blanks. For heaven's sake. I've been posting about this for months. Search the archives for "RAID arrays and performance", "There's random access and then there's random access", and "Bitmap index scan preread using posix_fadvise". I described which interfaces worked on Linux and Solaris based on empirical tests. I posted source code for synthetic benchmarks so we could test it on a wide range of hardware. I posted graphs based on empirical results. I posted mathematical formulas analysing just how much preread would be expected to exercise a raid array fully. I'm not sure what else I can do to effect a more thorough examination. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!
On Fri, 2008-03-28 at 11:26 +0000, Gregory Stark wrote: > "Simon Riggs" <simon@2ndquadrant.com> writes: > For heaven's sake. I've been posting about this for months. Any chance of getting all that together on a single Wiki page, so we can review everything? We'll need those docs after it's committed anyway. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk
Gregory Stark wrote: > I described which interfaces worked on Linux and Solaris based on empirical > tests. I posted source code for synthetic benchmarks so we could test it on a > wide range of hardware. I posted graphs based on empirical results. I posted > mathematical formulas analysing just how much preread would be expected to > exercise a raid array fully. I'm not sure what else I can do to effect a more > thorough examination. I'm sure posix_fadvise is a win in the case where it's supposed to help: a scan that does a lot of random reads, on a RAID array. And you've posted results demonstrating that. What we need to make sure is that there's no significant loss when it's not helping. It seems that the worst case for this patch is a scan on a table that doesn't fit in shared_buffers, but is fully cached in the OS cache. In that case, the posix_fadvise calls would be a certain waste of time. That could be alleviated by deciding at plan time whether to preread or not, based on effective_cache_size. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Prereading using posix_fadvise (was Re: Commitfest patches)
From: "Zeugswetter Andreas OSB SD"
Heikki wrote: > It seems that the worst case for this patch is a scan on a table that > doesn't fit in shared_buffers, but is fully cached in the OS cache. In > that case, the posix_fadvise calls would be a certain waste of time. I think this is a misunderstanding: the fadvise is not issued to read the whole table and is not issued for table scans at all (and if it were, it would only advise for the next N pages). So it has nothing to do with table size. The fadvise calls need to be (and are) limited by what can be used in the near future, and not for the whole statement, e.g. the N next-level index pages that are relevant, or the N relevant heap pages that one index leaf page points at. Maybe in the index case N does not need to be limited, since we have a natural limit on how many pointers fit on one page. In general I think separate reader processes (or threads :-) that read directly into the buffer pool would be a more portable and efficient implementation. E.g. they could use scatter/gather I/O. So I think we should make sure that the fadvise solution does not obstruct that path, but I don't think it does. Gregory wrote: >> A more invasive form of this patch would be to assign and pin a buffer when >> the preread is done. That would mean subsequently we would have a pinned buffer >> ready to go and not need to go back to the buffer manager a second time. We >> would instead just "complete" the i/o by issuing a normal read call. I guess you would rather need to mark the buffer as reserved for this page, but let whichever backend needs it first pin it and issue the read. I think the fadviser should not pin it in advance, since it cannot guarantee to actually read the page [soon]. Rather, remember the buffer and later check and pin it for the read; otherwise you might be blocking the buffer. But I think doing something like this might be good since it avoids issuing duplicate fadvises. Andreas
Zeugswetter Andreas OSB SD wrote: > Heikki wrote: >> It seems that the worst case for this patch is a scan on a table that >> doesn't fit in shared_buffers, but is fully cached in the OS cache. In >> that case, the posix_fadvise calls would be a certain waste of time. > I think this is a misunderstanding, the fadvise is not issued to read the > whole table and is not issued for table scans at all (and if it were it > would only advise for the next N pages). > So it has nothing to do with table size. The fadvise calls need to be (and are) > limited by what can be used in the near future, and not for the whole statement. Right, I was sloppy. Instead of table size, what matters is the amount of data the scan needs to access. The point remains that if the data is already in OS cache, the posix_fadvise calls are a waste of time, regardless of how many pages ahead you advise. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > > So it has nothing to do with table size. The fadvise calls need to be > > (and are) > > limited by what can be used in the near future, and not for the whole > > statement. > > Right, I was sloppy. Instead of table size, what matters is the amount > of data the scan needs to access. The point remains that if the data is > already in OS cache, the posix_fadvise calls are a waste of time, > regardless of how many pages ahead you advise. I now understand what posix_fadvise() is allowing us to do. posix_fadvise(POSIX_FADV_WILLNEED) allows us to tell the kernel we will need a certain block in the future --- this seems much cheaper than a background reader. We know we will need the blocks, and telling the kernel can't hurt, except that there is overhead in telling the kernel. Has anyone measured how much overhead? I would be interested in a test program that read the same page over and over again from the kernel, with and without a posix_fadvise() call. Should we consider only telling the kernel X pages ahead, meaning when we are on page 10 we tell it about page 16? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
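A rough sketch of the kind of microbenchmark Bruce is asking for: re-read one already-cached 8 kB block many times, with and without a preceding posix_fadvise(POSIX_FADV_WILLNEED), and compare the elapsed times. The file name, iteration count, and overall structure are arbitrary choices for illustration, not a measurement anyone has run in this thread.

    /* Compare the cost of re-reading a cached block with and without a
     * preceding posix_fadvise(POSIX_FADV_WILLNEED) call.  Sketch only. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define BLCKSZ 8192
    #define ITERATIONS 100000

    static double
    run(int fd, int with_fadvise)
    {
        char            buf[BLCKSZ];
        struct timeval  start, end;
        int             i;

        gettimeofday(&start, NULL);
        for (i = 0; i < ITERATIONS; i++)
        {
            if (with_fadvise)
                posix_fadvise(fd, 0, BLCKSZ, POSIX_FADV_WILLNEED);
            if (pread(fd, buf, BLCKSZ, 0) != BLCKSZ)
            {
                perror("pread");
                exit(1);
            }
        }
        gettimeofday(&end, NULL);
        return (end.tv_sec - start.tv_sec) +
               (end.tv_usec - start.tv_usec) / 1e6;
    }

    int
    main(int argc, char **argv)
    {
        int fd;

        if (argc != 2)
        {
            fprintf(stderr, "usage: %s datafile\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0)
        {
            perror("open");
            return 1;
        }
        printf("plain reads:     %.3f s\n", run(fd, 0));
        printf("fadvise + reads: %.3f s\n", run(fd, 1));
        close(fd);
        return 0;
    }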
Gregory Stark wrote: > > Bruce, you seem to have removed one of my three patches from the queue. I > would actually prefer you remove the other two and put back that one. It's the > one I most urgently need feedback on to continue. I talked to Greg on IM. The complaint was that his posix_fadvise() patch was added to the TODO list as a URL, rather than getting him feedback. He is getting feedback now in another thread. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Fri, Mar 28, 2008 at 11:41:58AM -0400, Bruce Momjian wrote: > Should we consider only telling the kernel X pages ahead, meaning when > we are on page 10 we tell it about page 16? It's not so interesting for sequential reads; the kernel can work that out for itself. Disk reads are usually in blocks of at least 128K anyway, so there's a real good chance you have block 16 already. The interesting case is an index scan, where you do a posix_fadvise() on the next block *before* returning the items in the current block. Then by the time you've processed those tuples, the next block will hopefully have been read in and we can proceed without delay. Or fadvising all the tuples referred to from an index page at once, so the kernel can determine the optimal order to fetch them. The read()s will still be issued in tuple order, but the delay will (hopefully) be less. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
Bruce Momjian wrote: > Heikki Linnakangas wrote: >>> So it has nothing to do with table size. The fadvise calls need to be >>> (and are) >>> limited by what can be used in the near future, and not for the whole >>> statement. >> Right, I was sloppy. Instead of table size, what matters is the amount >> of data the scan needs to access. The point remains that if the data is >> already in OS cache, the posix_fadvise calls are a waste of time, >> regardless of how many pages ahead you advise. > > I now understand what posix_fadvise() is allowing us to do. > posix_fadvise(POSIX_FADV_WILLNEED) allows us to tell the kernel we will > need a certain block in the future --- this seems much cheaper than a > background reader. Yep. > We know we will need the blocks, and telling the kernel can't hurt, > except that there is overhead in telling the kernel. Has anyone > measured how much overhead? I would be interested in a test program > that read the same page over and over again from the kernel, with and > without a posix_fadvise() call. Agreed, that needs to be benchmarked next. There's also some overhead in doing the buffer manager hash table lookup to check whether the page is in shared_buffers. We could reduce that by the more complex approach Greg mentioned of allocating a buffer in shared_buffers when we do posix_fadvise. > Should we consider only telling the kernel X pages ahead, meaning when > we are on page 10 we tell it about page 16? Yes. You don't want to fire off thousands of posix_fadvise calls upfront. That'll just flood the kernel, and it will most likely ignore any advice after the first few hundred or so. I'm not sure what the appropriate amount of read ahead would be, though. Probably depends a lot on the OS and hardware, and needs to be adjustable. In some cases we can't easily read ahead more than a certain number of pages. For example, in a regular index scan, we can easily fire off posix_fadvise calls for all the heap pages referenced by a single index page, but reading ahead more than that becomes much more complex. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > Right, I was sloppy. Instead of table size, what matters is the amount of data > the scan needs to access. The point remains that if the data is already in OS > cache, the posix_fadvise calls are a waste of time, regardless of how many > pages ahead you advise. I'm not sure that's really fair to the scan. The typical bitmap index scan will probably be accessing far fewer pages than what fits in cache. And prereading them will usually help unless you're always reading the same pages. The usual case will be bitmap scans for different pages each time. I'm picturing something like our mail archive search. Each user who uses it retrieves just 10 hits or so. But there's no reason to think that because those 10 hits fit in effective_cache_size that they'll actually be found there. It might make more sense to compare the table size. If the table size fits in cache then any random bunch of pages is likely to be in cache somewhere. Except we have no idea how much of the database this table represents. If the user has a schema like ones we've seen posted to the lists many times with hundreds of tables then deductions based on the size of a single table will be questionable. I'm also leery about scaling back the prereading for systems with large effective_cache_size since those are precisely the systems which are likely to have raid arrays and be helped the most by this on queries where it's helpful. I feel like we're probably stuck optimizing for the case where i/o is happening and hoping that the filesystem cache is efficient. If the entire database fits in cache then the user has to adjust random_page_cost and should probably set preread_pages to 0 (or effective_spindle_count to 1 depending on which version of the patch you're reading) as well. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
Heikki Linnakangas wrote: > > Should we consider only telling the kernel X pages ahead, meaning when > > we are on page 10 we tell it about page 16? > > Yes. You don't want to fire off thousands of posix_fadvise calls > upfront. That'll just flood the kernel, and it will most likely ignore > any advice after the first few hundred or so. I'm not sure what the > appropriate amount of read ahead would be, though. Probably depends a > lot on the OS and hardware, and needs to be adjustable. > > In some cases we can't easily read ahead more than a certain number of > pages. For example, in a regular index scan, we can easily fire off > posix_fadvise calls for all the heap pages referenced by a single index > page, but reading ahead more than that becomes much more complex. And if you read ahead too far the pages might get pushed out of the kernel cache before you ask to read them. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Someone wrote: >>> >>> Should we consider only telling the kernel X pages ahead, meaning when >>> we are on page 10 we tell it about page 16? The patch I posted specifically handles bitmap heap scans. It does in fact prefetch only a limited number of pages from the bitmap stream based on a guc, but it tries to be a bit clever about ramping up gradually. The real danger here, imho, is doing read-ahead for blocks the client never ends up reading. By ramping up the read-ahead gradually as the client reads records we protect against that. > Heikki Linnakangas wrote: >> >> Yes. You don't want to fire off thousands of posix_fadvise calls >> upfront. That'll just flood the kernel, and it will most likely ignore >> any advice after the first few hundred or so. I'm not sure what the >> appropriate amount of read ahead would be, though. Probably depends a >> lot on the OS and hardware, and needs to be adjustable. "Bruce Momjian" <bruce@momjian.us> writes: > > And if you read ahead too far the pages might get pushed out of the > kernel cache before you ask to read them. While these concerns aren't entirely baseless, the actual experiments seem to show the point of diminishing returns is pretty far out there. Look at the graphs below, keeping in mind that the X axis is the number of blocks prefetched. http://archives.postgresql.org/pgsql-hackers/2007-12/msg00088.php The pink case is analogous to a bitmap index scan where the blocks are read in order. In that case the point of diminishing returns is reached around 64 pages. But performance doesn't actually dip until around 512 pages. And even prefetching 8,192 blocks, the negative impact on performance is still much less severe than using a smaller-than-optimal prefetch size. This is on a piddly little 3-way raid. On a larger raid you would want even larger prefetch sizes. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
"Bruce Momjian" <bruce@momjian.us> writes: > Gregory Stark wrote: >> >> Bruce, you seem to have removed one of my three patches from the queue. I >> would actually prefer you remove the other two and put back that one. It's the >> one I most urgently need feedback on to continue. > > I talked to Greg on IM. The complaint was that his posix_fadvise() > patch was added to the TODO list as a URL, rather than getting him > feedback. He is getting feedback now in another thread. Sort of. The feedback I'm getting is from people who want to discuss the "easy" parameters of the patch like how much prefetching to do and when without actually reading to see what's already been done. I'm happy to discuss them as I find these things interesting too. But what I really need is someone to read the patch and say "looks good" or point out things they don't like. In particular, what I really, really want is some guidance on the singular key question I asked. I want to know if we're interested in the more invasive patch restructuring the buffer manager. My feeling is that we probably are eventually. But I wonder if people wouldn't feel more comfortable taking baby steps at first which will have less impact in cases where it's not being heavily used. It's a lot more work to do the invasive patch and there's not much point in doing it if we're not interested in taking it when it's ready. Here's an updated patch. It's mainly just a CVS update but it also includes some window dressing: an autoconf test, some documentation, and some fancy arithmetic to auto-tune the amount of prefetching based on a more real-world parameter "effective_spindle_count". It also includes printing out how much the prefetching target got ramped up to in the explain output -- though I'm not sure we really want that in the tree, but it's nice for performance testing. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
Gregory Stark wrote: > I want to know if we're interested in the more invasive patch restructuring > the buffer manager. My feeling is that we probably are eventually. But I > wonder if people wouldn't feel more comfortable taking baby steps at first > which will have less impact in cases where it's not being heavily used. > > It's a lot more work to do the invasive patch and there's not much point in > doing it if we're not interested in taking it when it's ready. I like the simplicity of this patch. My biggest worry is the impact on performance when the posix_fadvise calls are not helping. I'd like to see some testing of that. How expensive are the extra bufmgr hash table lookups, compared to all the other stuff that's involved in a bitmap heap scan? Another thing I wonder is whether this approach can easily be extended to other types of scans than bitmap heap scans. Normal index scans seem most interesting; they can generate a lot of random I/O. I don't see any big problems there, except again the worst-case performance: If you issue AdviseBuffer calls for all the heap tuples in an index page, in the naive way, you can issue hundreds of posix_fadvise calls. But what if they're all actually on the same page? Surely you only want to go through the buffer manager once (like we do in the scan, thanks to ReleaseAndReadBuffer()) in that case, and only call posix_fadvise once. But again, maybe you can convince me that it's a non-issue by some benchmarking. > Here's an updated patch. It's mainly just a CVS update but it also includes > some window dressing: an autoconf test, some documentation, and some fancy > arithmetic to auto-tune the amount of prefetching based on a more real-world > parameter "effective_spindle_count". I like the "effective_spindle_count". That's very intuitive. The formula to turn that into preread_pages looks complex, though. I can see the reasoning behind it, but I doubt things behave that beautifully in the real world. (That's not important right now; we have plenty of time to tweak that after more testing.) > It also includes printing out how much > the prefetching target got ramped up to in the explain output -- though I'm > not sure we really want that in the tree, but it's nice for performance > testing. I don't understand the ramp-up logic. Can you explain what that's for and how it works? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
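To make the same-page concern above concrete, here is a rough sketch of collapsing duplicate heap block numbers before advising. It only skips consecutive repeats (sorting the TIDs first, or keeping a small set, would catch the rest), and the function name and flat block-number array are illustrative rather than anything taken from the patch.

    /* Sketch: advise the kernel once per distinct heap page referenced from an
     * index page, rather than once per TID.  Only consecutive duplicates are
     * collapsed here; names and the flat block array are illustrative. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>

    typedef unsigned int BlockNumber;
    #define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)
    #define BLCKSZ 8192

    static void
    advise_heap_pages(int heap_fd, const BlockNumber *blocks, int nblocks)
    {
        BlockNumber last = InvalidBlockNumber;
        int         i;

        for (i = 0; i < nblocks; i++)
        {
            if (blocks[i] == last)
                continue;           /* same heap page as the previous TID */
            last = blocks[i];
            (void) posix_fadvise(heap_fd, (off_t) last * BLCKSZ,
                                 BLCKSZ, POSIX_FADV_WILLNEED);
        }
    }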
On Fri, Mar 28, 2008 at 05:34:30PM +0000, Gregory Stark wrote: > But what I really need is someone to read the patch and say "looks good" or > point out things they don't like. In particular, what I really, really want is > some guidance on the singular key question I asked. I was going to write all sorts of stuff, till I noticed Heikki said basically everything I was going to say: - I think normal index scans could benefit from this (it was measurable when I was playing with AIO in postgres a few years back). - The integration with the bitmap scan is good, neat even. - I think preread_count is far too small, given you get a benefit even if you only have one spindle. - I don't understand the ramp-up method either. People spend a lot of time worrying about hundreds of posix_fadvise() calls, but you don't need anywhere near that many to be effective. With AIO I limited the number of outstanding requests to a dozen and it was still useful. You lose nothing by capping the number of requests at any point. > I want to know if we're interested in the more invasive patch restructuring > the buffer manager. My feeling is that we probably are eventually. But I > wonder if people wouldn't feel more comfortable taking baby steps at first > which will have less impact in cases where it's not being heavily used. I think the way it is now is neat and simple and enough for now. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
On Fri, 28 Mar 2008, Gregory Stark wrote: > I described which interfaces worked on Linux and Solaris based on empirical > tests. I posted source code for synthetic benchmarks so we could test it on a > wide range of hardware. I posted graphs based on empirical results. Is it possible to post whatever script generates the graph (gnuplot?), so people can compare the results they get to yours? It's when I realized I didn't have that and would have to recreate one myself that my intention to just run some quick results and ship them to you lost momentum and I never circled back to it. It would be nice if you made it easier for people to generate fancy results here as immediate gratification for them (while of course just asking them to send the raw data). I can run some tests on smaller Linux/Solaris systems to check that they don't show a regression; that was my main concern about this experiment. Some of the discussion that followed your original request for tests was kind of confusing as far as how to interpret the results as well; I think I know what to look for but certainly wouldn't mind some more guidance there, too. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
"Martijn van Oosterhout" <kleptog@svana.org> writes: > - I think normal index scans could benefit from this (it was measurable > when I was playing with AIO in postgres a few years back). I don't want to torture any existing code paths to get prefetching to work. Heikki suggested I take advantage of the page-at-a-time index scanning though which does sound like it could be convenient. > - I think the number of preread_count is far too small, given you get a > benefit even if you only have one spindle. > - I don't understand the ramp-up method either. So the idea is that I was deathly afraid of being accused of doing unnecessary additional I/O. So I was worried about the user doing something like SELECT ... LIMIT 1. Or worse, just starting a select and discarding it after only selecting a handful of records. I also didn't want to start the bitmap scan by doing hundreds of syscalls before we even return the first record. So I figured it's safer to read only the first block at first. Only if the user goes on to the next record do we try prefetching the next block. Each record the user reads we bump up the prefetch amount by 1 until we hit the goal value for the size raid array we're using. That also nicely spreads out the syscalls so we get one prefetch between each record returned. We also don't know how densely packed the records are on the pages. If they're densely packed then we'll have prefetched a whole bunch of records in time for the second page. But if there's only one record on a page and we're already on the second page I figured that indicated we would be reading many pages with sparsely distributed records. So we ramp up the prefetch exponentially by multiplying by two each time we move to the next page. The net result is that if you do a bitmap scan which is processing lots of pointers it'll quickly reach the point where it's prefetching the number of blocks based on effective_spindle_count. If you're just processing the first few tuples it'll only read a small number of extra pages. And if you're only processing one record it won't prefetch at all. > People spend a lot of time worrying about hundreds of posix_fadvise() > calls but you don't need anywhere near that much to be effective. With > AIO I limited the number of outstanding requests to a dozen and it was > still useful. You lose nothing by capping the number of requests at any > point. Well you leave money on the table. But yes, I'm trying to be conservative about how much to prefetch when we don't know that it's in our favour. >> I want to know if we're interested in the more invasive patch restructuring >> the buffer manager. My feeling is that we probably are eventually. But I >> wonder if people wouldn't feel more comfortable taking baby steps at first >> which will have less impact in cases where it's not being heavily used. > > I think the way it is now is neat and simple and enough for now. Thanks. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB'sPostgreSQL training!
"Greg Smith" <gsmith@gregsmith.com> writes: > On Fri, 28 Mar 2008, Gregory Stark wrote: > >> I described which interfaces worked on Linux and Solaris based on empirical >> tests. I posted source code for synthetic benchmarks so we could test it on a >> wide range of hardware. I posted graphs based on empirical results. > > Is it possible to post whatever script that generates the graph (gnuplot?) so > people can compare the results they get to yours? Unfortunately I couldn't figure out how to get gnuplot to do this. I ended up doing it in gnumeric. I'll look into fixing it the script up to be more automatic. Maybe look at gnuplot again. > I can run some tests on smaller Linux/Solaris systems to see if they don't show > a regression, that was my main concern about this experiment. Heikki suggested some things to test for regressions. I'll look at that. What I'm more curious about and hoped I could get data from others is whether posix_fadvise(WILL_NEED) does anything useful on various versions of FreeBSD, OSX, etc. Afaict the syscall doesn't exist at all on Solaris though, which surprises me :( -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!