Thread: hint bit cache v5

hint bit cache v5

From: Merlin Moncure

Attached is an updated version of the 'hint bit cache'.

What's changed:
*) 'bucket' was changed to 'page' everywhere
*) the rollup array now gets added to during 'set', not 'get' (pretty
dumb the way it was before -- it wasn't really dealing with non-commit
bits yet)
*) more source comments, including a description of the cache in the intro
*) now caching 'invalid' bits.

I went back and forth several times on whether to store invalid bits
in the same cache, a separate cache, or not at all.  I finally settled
on storing them in the same cache, which has some pros and cons.  It
makes the cache more or less exactly like the clog cache (so I could
copy/paste some code out from there), but adds some overhead because
2-bit addressing is more expensive than 1-bit addressing -- this is
showing up in profiling, so I'm upping the estimate of cpu-bound scan
overhead from 1% to 2%.  Still fairly cheap, but I'm running into the
edge of where I can claim the cache is 'free' for most workloads --
and any claim is worthless without real-world testing anyway.  Of
course, if the tuple hint bits are set or PD_ALL_VISIBLE is set, you
don't have to pay that price.
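
For the curious, the extra cost is just the difference between the two
lookups below (illustration only, modeled on the CLOG_* macros in
clog.c; the names here are made up, not taken from the patch):

#include <stdint.h>

/* 1 bit per xid: 8 statuses per byte -- one shift, one mask */
static inline int
xid_get_status_1bit(const uint8_t *page, uint32_t slot)
{
    return (page[slot / 8] >> (slot % 8)) & 0x01;
}

/* 2 bits per xid (clog-style commit/abort): 4 statuses per byte --
 * the shift picks up a multiply, and a page covers half as many xids
 * for the same memory */
static inline int
xid_get_status_2bit(const uint8_t *page, uint32_t slot)
{
    return (page[slot / 4] >> ((slot % 4) * 2)) & 0x03;
}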

What's not:
*) Haven't touched any satisfies routine besides
HeapTupleSatisfiesMVCC (should the others be?)
*) Haven't pushed the cache data into CacheMemoryContext.  I figure
this is the way to go, but it requires an extra 'if' on every cache
'get' (see the sketch after this list).
*) Didn't abstract the clog bit addressing macros.  I'm leaning toward
not doing this, but maybe they should be.  My reasoning is that there
is no requirement that hint bit cache pages be whatever the block size
is, and I'd like to reserve the ability to adjust the cache page size
independently.
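
The CacheMemoryContext change would look something like the following
(just a sketch with made-up names -- hbcache_data and friends are not
what the patch actually calls them):

#include "postgres.h"
#include "utils/memutils.h"

#define HBCACHE_PAGES       4       /* assumed number of cache pages */
#define HBCACHE_PAGE_SIZE   8192    /* assumed cache page size */

static char *hbcache_data = NULL;   /* allocated in CacheMemoryContext */

/*
 * The extra 'if' mentioned above: every cache 'get' has to check
 * whether the backing storage has been allocated yet.
 */
static char *
hbcache_get_data(void)
{
    if (hbcache_data == NULL)
        hbcache_data = MemoryContextAllocZero(CacheMemoryContext,
                                              HBCACHE_PAGES * HBCACHE_PAGE_SIZE);
    return hbcache_data;
}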

I'd like to know if this is a strategy that merits further work...If
anybody has time/interest that is.  It's getting close to the point
where I can just post it to the commit fest for review.  In
particular, I'm concerned if Tom's earlier objections can be
satisfied. If not, it's back to the drawing board...

merlin


Re: hint bit cache v5

From: Simon Riggs

On Mon, May 9, 2011 at 5:12 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

> I'd like to know if this is a strategy that merits further work...If
> anybody has time/interest that is.  It's getting close to the point
> where I can just post it to the commit fest for review.  In
> particular, I'm concerned if Tom's earlier objections can be
> satisfied. If not, it's back to the drawing board...

I'm interested in what you're doing here.

From here, there's quite a lot of tuning possibilities. It would be
very useful to be able to define some metrics we are interested in
reducing and working out how to measure them.

That way we can compare all the different variants of this to see
which way of doing things works best.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: hint bit cache v5

From: Merlin Moncure

On Tue, May 10, 2011 at 11:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Mon, May 9, 2011 at 5:12 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>
>> I'd like to know if this is a strategy that merits further work...If
>> anybody has time/interest that is.  It's getting close to the point
>> where I can just post it to the commit fest for review.  In
>> particular, I'm concerned if Tom's earlier objections can be
>> satisfied. If not, it's back to the drawing board...
>
> I'm interested in what you're doing here.
>
> From here, there's quite a lot of tuning possibilities. It would be
> very useful to be able to define some metrics we are interested in
> reducing and working out how to measure them.
>
> That way we can compare all the different variants of this to see
> which way of doing things works best.

Thanks for that!  I settled on this approach because the downside
cases should hopefully be pretty limited.  The upside is a matter of
debate, although it is fairly trivial to demonstrate synthetically.

I'm looking for some way of benchmarking the benefits in a
non-simulated fashion.  We desperately need something like a
performance farm (as many, many others have mentioned).

merlin


Re: hint bit cache v5

From: Merlin Moncure

On Tue, May 10, 2011 at 11:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Mon, May 9, 2011 at 5:12 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>
>> I'd like to know if this is a strategy that merits further work...If
>> anybody has time/interest that is.  It's getting close to the point
>> where I can just post it to the commit fest for review.  In
>> particular, I'm concerned if Tom's earlier objections can be
>> satisfied. If not, it's back to the drawing board...
>
> I'm interested in what you're doing here.
>
> From here, there's quite a lot of tuning possibilities. It would be
> very useful to be able to define some metrics we are interested in
> reducing and working out how to measure them.

Following are results that are fairly typical of the benefits you
might see when the optimization kicks in.  The attached benchmark just
creates a bunch of records in a random table and scans it.  This is
more or less the scenario that causes people to gripe about hint bit
i/o, especially in systems that are already under moderate to heavy
i/o stress.  I'm gonna call it 20%, although it could be less if you
have an i/o system that spanks the test (try cranking -c and the
number of records created in bench.sql in that case).  Anecdotal
reports of extreme duress caused by hint bit i/o suggest problematic
or mixed-use (OLTP + OLAP) workloads might see even more benefit.  One
thing I need to test is how much benefit you'll see with wider
records.

I think I'm gonna revert the change to cache invalid bits. I just
don't see hint bits as a major contributor to dead tuples following
epic rollbacks (really, the solution for that case is simply to try
not to get into that scenario if you can).  This will put the code
back to the cheaper and simpler bit-per-transaction addressing.  What
I do plan to do, though, is check and set xmax commit bits in the
cache...that way deleted tuples will see cache benefits.

[hbcache]
merlin@mmoncure-ubuntu:~$ time pgbench -c 4 -n -T 200 -f bench.sql
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 1
duration: 200 s
number of transactions actually processed: 8
tps = 0.037167 (including connections establishing)
tps = 0.037171 (excluding connections establishing)

real    3m35.549s
user    0m0.008s
sys     0m0.004s

[HEAD]
merlin@mmoncure-ubuntu:~$ time pgbench -c 4 -n -T 200 -f bench.sql
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 1
duration: 200 s
number of transactions actually processed: 8
tps = 0.030313 (including connections establishing)
tps = 0.030317 (excluding connections establishing)

real    4m24.216s
user    0m0.000s
sys     0m0.012s


Re: hint bit cache v5

From: Merlin Moncure

On Wed, May 11, 2011 at 10:38 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> One thing I need to test is how much benefit you'll see with wider records.

The results are a bit better, around 25% using a similar methodology
on ~ 1k wide records.

> I think I'm gonna revert the change to cache invalid bits. I just
> don't see hint bits as a major contributor to dead tuples following
> epic rollbacks

what I meant to say here was, I don't see hint bit i/o following
rollbacks as a major issue.  Point being, I don't see much use in
optimizing management of INVALID tuple bits beyond what is already
done.

Anyways, demonstrating a 'good' case is obviously not the whole story.
But what are the downsides?  There are basically two:

1) tiny cpu penalty on every heap fetch
2) extremely widely dispersed (in terms of transaction id) unhinted
tuples can force the cache to refresh every 100 tuples in the absolute
worst case.  A cache refresh is a sort of 100 ints plus a loop (rough
sketch further down).

For '1', in the absolute worst case I can come up with -- cpu-bound
scans of extremely narrow records -- the overall cpu usage goes up
around 1%.  '2' seems just about impossible to see in the real world,
and if it does happen, you are also paying for lots of clog lookups
all the way through the slru, and you are having i/o and other issues
on top of it.  Even if all the stars align and it does happen, all the
tuples get hinted and dirtied anyway, so it will only happen at most
once on that particular set of data.
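
For reference, a refresh boils down to something like this (rough
sketch of the idea only -- the hbcache_* names and constants are
illustrative, not lifted from the patch):

#include <stdlib.h>
#include <stdint.h>

#define HBCACHE_ROLLUP_SIZE   100          /* xids gathered between refreshes */
#define HBCACHE_XIDS_PER_PAGE (8192 * 8)   /* 1 bit per xid, 8k cache page */

static int
xid_cmp(const void *a, const void *b)
{
    uint32_t xa = *(const uint32_t *) a;
    uint32_t xb = *(const uint32_t *) b;

    return (xa < xb) ? -1 : (xa > xb);
}

/*
 * Pick which cache page to (re)load: sort the rollup so xids mapping
 * to the same page are adjacent, then one pass finds the longest run.
 */
static uint32_t
hbcache_pick_page(uint32_t *rollup)
{
    int      i, run = 1, best_run = 0;
    uint32_t best_page = 0;

    qsort(rollup, HBCACHE_ROLLUP_SIZE, sizeof(uint32_t), xid_cmp);

    for (i = 1; i <= HBCACHE_ROLLUP_SIZE; i++)
    {
        uint32_t prev_page = rollup[i - 1] / HBCACHE_XIDS_PER_PAGE;

        if (i < HBCACHE_ROLLUP_SIZE &&
            rollup[i] / HBCACHE_XIDS_PER_PAGE == prev_page)
            run++;
        else
        {
            if (run > best_run)
            {
                best_run = run;
                best_page = prev_page;
            }
            run = 1;
        }
    }

    return best_page;
}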

merlin


Re: hint bit cache v5

From: Simon Riggs

On Wed, May 11, 2011 at 4:38 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

> Following are results that are fairly typical of the benefits you
> might see when the optimization kicks in.  The attached benchmark just

> [hbcache]
> real    3m35.549s

> [HEAD]
> real    4m24.216s

These numbers look very good. Thanks for responding to my request.

What people have said historically at this point is "ah, but you've
just deferred the pain from clog lookups".

The best way to show this does what we hope is to run a normal-ish
OLTP access to the table that would normally thrash the clog and show
no ill effects there either. Run that immediately after the above
tests so that the cache and hint bits are both primed.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: hint bit cache v5

From: Merlin Moncure

On Wed, May 11, 2011 at 12:40 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Wed, May 11, 2011 at 4:38 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>
>> Following are results that are fairly typical of the benefits you
>> might see when the optimization kicks in.  The attached benchmark just
>
>> [hbcache]
>> real    3m35.549s
>
>> [HEAD]
>> real    4m24.216s
>
> These numbers look very good. Thanks for responding to my request.
>
> What people have said historically at this point is "ah, but you've
> just deferred the pain from clog lookups".

Deferred, or eliminated.  If any tuple on the page gets updated,
deleted, etc., or the table itself is dropped, then you've 'won'...the
page with the hint-bit-only change was never written out to the heap
before another substantive change happened.  This is exactly what
happens in certain common workloads -- you insert a bunch of records,
scan them with some batch process, then delete them.  Let's say a
million records were inserted under a single transaction and you are
getting bounced out of the transam.c cache: you just made a million
calls to TransactionIdIsCurrentTransactionId and (especially)
TransactionIdIsInProgress for the *exact same* transaction id, over
and over.  That stuff adds up even before looking at the i/o incurred.
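
For anyone not staring at tqual.c regularly, the path in question
looks roughly like this (a heavily simplified sketch of the xmin side
of HeapTupleSatisfiesMVCC; most cases and the cmin handling are
omitted):

/*
 * Paraphrased from tqual.c: when HEAP_XMIN_COMMITTED isn't set, every
 * visit to the tuple has to consult the transaction machinery, and a
 * commit gets recorded back into the tuple as a hint bit.
 */
static bool
xmin_visible_sketch(HeapTupleHeader tuple, Buffer buffer)
{
    if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED))
    {
        TransactionId xmin = HeapTupleHeaderGetXmin(tuple);

        if (TransactionIdIsCurrentTransactionId(xmin))
            return true;            /* our own insert; cmin checks omitted */
        else if (TransactionIdIsInProgress(xmin))
            return false;           /* inserter still running */
        else if (TransactionIdDidCommit(xmin))
            SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED, xmin);
        else
        {
            /* inserter aborted or crashed */
            SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
                        InvalidTransactionId);
            return false;
        }
    }
    return true;                    /* xmin known committed */
}

Do that a million times for the same xid and the overhead is easy to
see.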

Put another way, the tuple hint bits have a lot of usefulness when the
tuples on the page are coming from all kinds of differently aged
transactions.  When all the tuples have the same or similar xid, the
information value is quite low, and the i/o price isn't worth it.  The
cache neatly haircuts the downside case.  If the cache isn't helping
(any tuple fetch on the page faults through it), the page is dirtied
and the next time it's fetched all the bits will be set.

> The best way to show this does what we hope is to run a normal-ish
> OLTP access to the table that would normally thrash the clog and show
> no ill effects there either. Run that immediately after the above
> tests so that the cache and hint bits are both primed.

Yeah.  The only easy way I know of to do this is extremely long
pgbench runs, and getting good results is harder than it sounds...if
the tuple hint bits make it to disk (via vacuum or a cache fault),
they stay there and that tuple is no longer interesting from the cache
point of view.

If you make the scale really large, the test will just take forever
just to get the tables primed (like a week).  Keep in mind, autovacuum
can roll around at any time and set the bits under you (you can of
course disable it, but who really does that on OLTP?).  Small-scale
OLTP tests are not real-world realistic because anybody sane would
just let autovacuum loose on the table.  Clog-thrashing systems are
typically mature, high-load OLTP databases...not fun to test on your
single 7200 rpm drive.

I'm going to boldly predict that with all the i/o flying around in
cases like that, the paltry cpu cycles spent dealing with the cache
are the least of your problems.  Not discounting the need to verify
that though.

merlin