Thread: hint bit cache v5
Attached is an updated version of the 'hint bit cache'. What's changed:

*) 'bucket' was changed to 'page' everywhere
*) the rollup array now gets added to during 'set', not 'get' (pretty dumb the way it was before -- it wasn't really dealing with non-commit bits yet)
*) more source comments, including a description of the cache in the intro
*) now caching 'invalid' bits. I went back and forth several times over whether to store invalid bits in the same cache, a separate cache, or not at all. I finally settled on storing them in the same cache, which has some pros and cons. It makes it more or less exactly like the clog cache (so I could copy/paste some code out from there), but adds some overhead because 2-bit addressing is more expensive than 1-bit addressing -- this is showing up in profiling... I'm upping the estimate of cpu-bound scan overhead from 1% to 2%. Still fairly cheap, but I'm running into the edge of where I can claim the cache is 'free' for most workloads -- any claim is worthless without real-world testing though. Of course, if tuple hint bits are set or PD_ALL_VISIBLE is set, you don't have to pay that price.

What's not:

*) Haven't touched any satisfies routine besides HeapTupleSatisfiesMVCC (should they be?)
*) Haven't pushed the cache data into CacheMemoryContext. I figure this is the way to go, but it requires an extra 'if' on every cache 'get'.
*) Didn't abstract the clog bit addressing macros. I'm leaning towards not doing this, but maybe they should be abstracted. My reasoning is that there is no requirement that hint bit cache pages be whatever the block size is, and I'd like to reserve the ability to adjust the cache page size independently.

I'd like to know if this is a strategy that merits further work... if anybody has time/interest, that is. It's getting close to the point where I can just post it to the commit fest for review. In particular, I'm concerned if Tom's earlier objections can be satisfied. If not, it's back to the drawing board...

merlin
Attachment
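The 2-bit vs. 1-bit addressing cost mentioned above is essentially the clog addressing scheme applied to a backend-local cache page. The following is a rough illustration only, not code from the patch: the macro and constant names (HBCACHE_PAGE_SIZE and friends) are invented for the sketch, assuming a clog-like layout of two status bits per xid versus a plain one-bit-per-xid commit bitmap.

#include <stdint.h>

#define HBCACHE_PAGE_SIZE   8192        /* hypothetical; need not equal BLCKSZ */

/* two bits per xid, clog style: 4 xids per byte */
#define BITS_PER_XACT       2
#define XACTS_PER_BYTE      4
#define XACTS_PER_PAGE      (HBCACHE_PAGE_SIZE * XACTS_PER_BYTE)

#define XidToByte(xid)      (((xid) % XACTS_PER_PAGE) / XACTS_PER_BYTE)
#define XidToBIndex(xid)    ((xid) % XACTS_PER_BYTE)

/* 2-bit lookup: a shift plus a mask on every probe */
static inline int
hbcache_get_status(const uint8_t *page, uint32_t xid)
{
    return (page[XidToByte(xid)] >> (XidToBIndex(xid) * BITS_PER_XACT)) & 0x03;
}

/* 1-bit (commit-only) lookup: the cheaper addressing, and twice the
 * xids covered per page */
static inline int
hbcache_get_commit_bit(const uint8_t *page, uint32_t xid)
{
    uint32_t i = xid % (HBCACHE_PAGE_SIZE * 8);

    return (page[i / 8] >> (i % 8)) & 0x01;
}

The extra shift and wider status values are small per probe, but they sit on the hot path of every unhinted tuple fetch, which is where the 1% vs. 2% estimates above come from.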
On Mon, May 9, 2011 at 5:12 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> I'd like to know if this is a strategy that merits further work...If
> anybody has time/interest that is. It's getting close to the point
> where I can just post it to the commit fest for review. In
> particular, I'm concerned if Tom's earlier objections can be
> satisfied. If not, it's back to the drawing board...

I'm interested in what you're doing here.

From here, there's quite a lot of tuning possibilities. It would be very useful to be able to define some metrics we are interested in reducing and working out how to measure them.

That way we can compare all the different variants of this to see which way of doing things works best.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, May 10, 2011 at 11:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Mon, May 9, 2011 at 5:12 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>
>> I'd like to know if this is a strategy that merits further work...If
>> anybody has time/interest that is. It's getting close to the point
>> where I can just post it to the commit fest for review. In
>> particular, I'm concerned if Tom's earlier objections can be
>> satisfied. If not, it's back to the drawing board...
>
> I'm interested in what you're doing here.
>
> From here, there's quite a lot of tuning possibilities. It would be
> very useful to be able to define some metrics we are interested in
> reducing and working out how to measure them.
>
> That way we can compare all the different variants of this to see
> which way of doing things works best.

thanks for that! I settled on this approach because the downside cases should hopefully be pretty limited. The upside is a matter of debate, although fairly trivial to demonstrate synthetically. I'm looking for some way of benchmarking the benefits in non-simulated fashion. We desperately need something like a performance farm (as many, many others have mentioned).

merlin
On Tue, May 10, 2011 at 11:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Mon, May 9, 2011 at 5:12 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>
>> I'd like to know if this is a strategy that merits further work...If
>> anybody has time/interest that is. It's getting close to the point
>> where I can just post it to the commit fest for review. In
>> particular, I'm concerned if Tom's earlier objections can be
>> satisfied. If not, it's back to the drawing board...
>
> I'm interested in what you're doing here.
>
> From here, there's quite a lot of tuning possibilities. It would be
> very useful to be able to define some metrics we are interested in
> reducing and working out how to measure them.

Following are results that are fairly typical of the benefits you might see when the optimization kicks in. The attached benchmark just creates a bunch of records in a random table and scans it. This is more or less the scenario that causes people to gripe about hint bit i/o, especially in systems that are already under moderate to heavy i/o stress. I'm gonna call it 20%, although it could be less if you have an i/o system that spanks the test (try cranking -c and the number of records created in bench.sql in that case). Anecdotal reports of extreme duress caused by hint bit i/o suggest problematic or mixed-use (OLTP + OLAP) workloads might see even more benefit. One thing I need to test is how much benefit you'll see with wider records.

I think I'm gonna revert the change to cache invalid bits. I just don't see hint bits as a major contributor to dead tuples following epic rollbacks (really, the solution for that case is simply to try and not get into that scenario if you can). This will put the code back into the cheaper and simpler bit-per-transaction addressing. What I do plan to do, though, is to check and set xmax commit bits in the cache... that way deleted tuples will see cache benefits.

[hbcache]
merlin@mmoncure-ubuntu:~$ time pgbench -c 4 -n -T 200 -f bench.sql
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 1
duration: 200 s
number of transactions actually processed: 8
tps = 0.037167 (including connections establishing)
tps = 0.037171 (excluding connections establishing)

real    3m35.549s
user    0m0.008s
sys     0m0.004s

[HEAD]
merlin@mmoncure-ubuntu:~$ time pgbench -c 4 -n -T 200 -f bench.sql
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 1
duration: 200 s
number of transactions actually processed: 8
tps = 0.030313 (including connections establishing)
tps = 0.030317 (excluding connections establishing)

real    4m24.216s
user    0m0.000s
sys     0m0.012s
Attachment
On Wed, May 11, 2011 at 10:38 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> One thing I need to test is how much benefit you'll see with wider records.

The results are a bit better, around 25%, using a similar methodology on ~1k wide records.

> I think I'm gonna revert the change to cache invalid bits. I just
> don't see hint bits as a major contributor to dead tuples following
> epic rollbacks

What I meant to say here was: I don't see hint bit i/o following rollbacks as a major issue. Point being, I don't see much use in optimizing management of INVALID tuple bits beyond what is already done.

Anyways, demonstrating a 'good' case is obviously not the whole story. But what are the downsides? There are basically two:

1) a tiny cpu penalty on every heap fetch
2) extremely widely dispersed (in terms of transaction id) unhinted tuples can force the cache to refresh every 100 tuples in the absolute worst case. A cache refresh is a 100 int sort and a loop.

For '1', in the absolute worst case I can come up with -- cpu-bound scans of extremely narrow records -- the overall cpu usage goes up around 1%. '2' seems nearly impossible to see in the real world -- and if it does happen, you are also paying for lots of clog lookups all the way through the slru, and you are having i/o and other issues on top of it. Even if all the stars align and it does happen, all the tuples get hinted and dirtied anyway, so it will only happen at most once on that particular set of data.

merlin
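To make '2' concrete, the worst-case refresh described above amounts to something like the following. This is only a sketch of the idea, not the patch's code: ROLLUP_SIZE, the array names, and pick_page_to_load() are invented for illustration, assuming the cache tracks roughly the last 100 missed xids and reloads the page that covers the most of them.

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define ROLLUP_SIZE     100                 /* assumed miss-tracking window */
#define XACTS_PER_PAGE  (8192 * 4)          /* hypothetical cache page geometry */
#define XidToPage(xid)  ((xid) / XACTS_PER_PAGE)

static uint32_t miss_rollup[ROLLUP_SIZE];
static int      miss_count = 0;

/* note each cache miss; returns true when it is time to refresh */
static bool
record_miss(uint32_t xid)
{
    if (miss_count < ROLLUP_SIZE)
        miss_rollup[miss_count++] = xid;
    return miss_count == ROLLUP_SIZE;
}

static int
xid_cmp(const void *a, const void *b)
{
    uint32_t xa = *(const uint32_t *) a;
    uint32_t xb = *(const uint32_t *) b;

    return (xa > xb) - (xa < xb);
}

/* the "100 int sort and a loop": sort the missed xids, then pick the
 * cache page that accounts for the largest run of them */
static uint32_t
pick_page_to_load(void)
{
    uint32_t    best_page = 0;
    int         best_hits = 0;
    int         i = 0;

    qsort(miss_rollup, ROLLUP_SIZE, sizeof(uint32_t), xid_cmp);

    while (i < ROLLUP_SIZE)
    {
        uint32_t    page = XidToPage(miss_rollup[i]);
        int         hits = 0;

        while (i < ROLLUP_SIZE && XidToPage(miss_rollup[i]) == page)
        {
            hits++;
            i++;
        }
        if (hits > best_hits)
        {
            best_hits = hits;
            best_page = page;
        }
    }

    miss_count = 0;
    return best_page;
}

Sorting 100 ints and walking them once is cheap next to the clog/slru traffic the same worst case would generate anyway, which is the point being made above.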
On Wed, May 11, 2011 at 4:38 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Following are results that are fairly typical of the benefits you
> might see when the optimization kicks in. The attached benchmark just
>
> [hbcache]
> real 3m35.549s
>
> [HEAD]
> real 4m24.216s

These numbers look very good. Thanks for responding to my request.

What people have said historically at this point is "ah, but you've just deferred the pain from clog lookups".

The best way to show this does what we hope is to run a normal-ish OLTP access to the table that would normally thrash the clog and show no ill effects there either. Run that immediately after the above tests so that the cache and hint bits are both primed.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, May 11, 2011 at 12:40 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Wed, May 11, 2011 at 4:38 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>
>> Following are results that are fairly typical of the benefits you
>> might see when the optimization kicks in. The attached benchmark just
>
>> [hbcache]
>> real 3m35.549s
>
>> [HEAD]
>> real 4m24.216s
>
> These numbers look very good. Thanks for responding to my request.
>
> What people have said historically at this point is "ah, but you've
> just deferred the pain from clog lookups".

Deferred, or eliminated. If any tuple on the page gets updated, deleted, etc., or the table itself is dropped, then you've 'won' -- the page with the hint-bit-only change was never booted out to the heap before another substantive change happened. This is exactly what happens in certain common workloads -- you insert a bunch of records, scan them with some batch process, then delete them. Let's say a million records were inserted under a single transaction and you are getting bounced out of the transam.c cache; you just made a million calls to TransactionIdIsCurrentTransactionId and (especially) TransactionIdIsInProgress for the *exact same* transaction id, over and over. That stuff adds up even before looking at the i/o incurred.

Put another way, the tuple hint bits have a lot of usefulness when the tuples on the page are coming from all kinds of differently aged transactions. When all the tuples have the same or similar xid, the information value is quite low, and the i/o price isn't worth it. The cache neatly haircuts the downside case: if the cache isn't helping (any tuple fetch on the page faults through it), the page is dirtied and the next time it's fetched all the bits will be set.

> The best way to show this does what we hope is to run a normal-ish
> OLTP access to the table that would normally thrash the clog and show
> no ill effects there either. Run that immediately after the above
> tests so that the cache and hint bits are both primed.

yeah. The only easy way I know of to do this is extremely long pgbench runs, and getting good results is harder than it sounds... if the tuple hint bits make it to disk (via vacuum or a cache fault), they stay there and that tuple is no longer interesting from the cache point of view. If you make the scale really large, the test will just take forever to get the tables primed (like a week). Keep in mind, autovacuum can roll around at any time and set the bits under you (you can of course disable it, but who really does that on OLTP?). Small-scale oltp tests are not real-world realistic because anybody sane would just let autovacuum loose on the table. Clog-thrashing systems are typically mature, high-load oltp databases... not fun to test on your single 7200 rpm drive. I'm going to boldly predict that with all the i/o flying around in cases like that, the paltry cpu cycles spent dealing with the cache are the least of your problems. Not discounting the need to verify that, though.

merlin
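The flow described above -- on-page hint bit first, then the backend-local cache, then the expensive clog/transam path -- can be sketched roughly as follows. This is a conceptual outline only, not the patch's code; HintbitCacheLookup, TupleHintCommitted, and XidDidCommit are invented stand-ins for the real infomask test, cache probe, and clog lookup.

#include <stdbool.h>
#include <stdint.h>

typedef enum
{
    HINTBIT_CACHE_MISS,
    HINTBIT_CACHED_COMMITTED,
    HINTBIT_CACHED_ABORTED
} HintbitCacheResult;

/* stand-ins so the sketch is self-contained; the real routines would be
 * the tuple infomask test, the cache probe, and the clog/transam lookup */
static bool TupleHintCommitted(const void *tuple) { (void) tuple; return false; }
static HintbitCacheResult HintbitCacheLookup(uint32_t xid) { (void) xid; return HINTBIT_CACHE_MISS; }
static bool XidDidCommit(uint32_t xid) { (void) xid; return true; }

static bool
xmin_committed(const void *tuple, uint32_t xmin)
{
    /* 1. on-page hint bit: free when already set, no cache or clog work */
    if (TupleHintCommitted(tuple))
        return true;

    /* 2. backend-local cache: a hit answers the same question for every
     * tuple sharing this xid without dirtying the page */
    switch (HintbitCacheLookup(xmin))
    {
        case HINTBIT_CACHED_COMMITTED:
            return true;
        case HINTBIT_CACHED_ABORTED:
            return false;
        case HINTBIT_CACHE_MISS:
            break;
    }

    /* 3. only on a miss: fall through to the clog lookup; the real code
     * would also set the page hint bit and record the answer in the cache */
    return XidDidCommit(xmin);
}

For the million-rows-from-one-transaction scan described above, step 2 turns a million repeated clog-path questions for the same xid into one miss plus a page load, which is the 'eliminated, not deferred' case.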