Re: WIP: dynahash replacement for buffer table - Mailing list pgsql-hackers

From Robert Haas
Subject Re: WIP: dynahash replacement for buffer table
Date
Msg-id CA+Tgmoak-hK4k-sjCBXQvBR3BZ-Tib3zx_0xB1nb6DgL0fiH=w@mail.gmail.com
Whole thread Raw
In response to Re: WIP: dynahash replacement for buffer table  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: WIP: dynahash replacement for buffer table  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
On Wed, Nov 5, 2014 at 6:19 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> * In benchmarks it becomes apparent that the dynamic element width makes
>   some macros like CHashTableGetNode() and
>   CHashTableGetGarbageByBucket() quite expensive. At least gcc on x86
>   can't figure how to build an efficient expression for the
>   target. There's two ways to address this:
>   a) Actually morph chash into something that will be included into the
>      user sites. Since I think there'll only be a limited number of those
>      that's probably acceptable.

Have you benchmarked this?  How much does it help?

>   b) Simplify the access rules. I quite seriously doubt that the
>     interleaving of garbage and freelists is actually necessary/helpful.

That wasn't added for nothing.  I did a bunch of performance testing
on this vs. dynahash (I think the test code is included in the patch)
and the region of memory containing the freelists practically caught
fire.  Spreading them out like this greatly improved the performance
under heavy concurrency, but even with those changes the freelists
were extremely hot.  Now, since this is the buffer mapping table
rather than a tight loop around hash table manipulation, the
difference may not be enough to measure.  But on a microbenchmark it
*definitely* was.

> * There's several cases where it's noticeable that chash creates more
>   cacheline misses - that's imo a good part of why the single threaded
>   performance regresses. There's good reasons for the current design
>   without a inline bucket, namely that it makes the concurrency handling
>   easier. But I think that can be countered by still storing a pointer -
>   just to an element inside the bucket. Afaics that'd allow this to be
>   introduced quite easily?

It's not obvious to me how that would work.  If you just throw those
elements on the garbage lists with everything else, it will soon be
the case that one bucket header ends up using the cell stored in some
other bucket, which isn't going to be awesome.  So you need some kind
of more complex scheme to preserve locality.

>   I'm afraid that I can't see us going forward unless we address the
>   noticeable single threaded penalties. Do you see that differently?

I'm not sure.  We're talking about something on the order of half a
percent on my tests, and lots of changes cause things to bounce around
by that much.  I'm not sure it makes sense to get too worked up about
that if the alternative is to add a big pile of new complexity.

> * There's some whitespace damage. (space followed by tab, followed by
>   actual contents)

That's the least of our problems at this point.

> * I still think it's a good idea to move the full memory barriers into
>   the cleanup/writing process by doing write memory barriers when
>   acquiring a hazard pointer and RMW atomic operations (e.g. atomic add)
>   in the process testing for the hazard pointer.

Can you cite any existing precedent in other system software?  Does
Linux do anything like that, for example?

> * Shouldn't we relax the CPU in a couple places like CHashAllocate(),
>   CHashBucketScan()'s loops?

You mean like, have it sleep the way SpinLockAcquire() does?  Yeah,
possibly, but that adds non-trivial code complexity which may not be
worth it if - as is hoped for - there's no real contention there.

> * I don't understand right now (but then I'm in a Hotel bar) why it's
>   safe for CHashAddToGarbage() to clobber the hash value?
>   CHashBucketScan() relies the chain being ordered by hashcode. No?
>   Another backend might just have been past the IsMarked() bit and look
>   at the hashcode?

I think the worst case scenario is that we falsely conclude that
there's a hash match when there really isn't.  The ensuing memcmp()
will clarify matters.

> * We really should find a way to sensibly print the stats.

Yep.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Ross Reedstrom
Date:
Subject: row_to_json bug with index only scans: empty keys!
Next
From: Robert Haas
Date:
Subject: Re: recovery_target_time and standby_mode