On 2016-07-27 10:04:52 -0400, Robert Haas wrote:
> On Tue, Jul 26, 2016 at 8:43 PM, Andres Freund <andres@anarazel.de> wrote:
> > As previously mentioned, dynahash isn't particularly fast. In many cases
> > that doesn't particularly matter. But e.g. hash aggregation and bitmap
> > index scans are quite sensitive to hash performance.
> >
> > The biggest problems with dynahash (also discussed at [1]) are
> >
> > a) two level structure doubling the amount of indirect lookups
> > b) indirect function calls
> > c) using separate chaining based conflict resolution
> > d) being too general.
>
> I am ... skeptical about open addressing. It's less unappealing for
> this application than in general because we don't actually need to
> delete anything, but that wouldn't be true for the catcache.

I don't think deletions really change the picture for open addressing -
you don't need tombstones; you can instead shift elements back closer
to their origin slot on deletion. I just hadn't implemented that yet,
because it didn't seem like a crucial point.
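
To illustrate, here's a minimal sketch of such a backward-shift
deletion for a linear-probing table. The entry layout and names are
made up for illustration; this isn't the patch's actual code:

#include <stdbool.h>
#include <stdint.h>

typedef struct entry
{
    uint32_t hash;              /* cached hash of the key */
    bool     used;              /* is this slot occupied? */
    uint64_t key;               /* assumed key type, for illustration */
    /* ... value ... */
} entry;

/*
 * Delete the entry at 'pos' without leaving a tombstone: walk the
 * probe chain and shift entries back into the hole, except when an
 * entry's home slot lies cyclically in (i, j], where moving it would
 * make it unreachable.  'mask' is size - 1; the size is a power of
 * two and the table is never completely full.
 */
static void
delete_at(entry *table, uint32_t mask, uint32_t pos)
{
    uint32_t i = pos;           /* current position of the hole */
    uint32_t j = pos;

    for (;;)
    {
        uint32_t home;

        j = (j + 1) & mask;
        if (!table[j].used)
            break;              /* reached the end of the probe chain */

        home = table[j].hash & mask;    /* slot this entry hashes to */

        /* move table[j] into the hole only if that keeps it reachable */
        if ((i <= j) ? (home <= i || home > j) : (home <= i && home > j))
        {
            table[i] = table[j];
            i = j;              /* the hole moves forward */
        }
    }
    table[i].used = false;      /* clear the final hole */
}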

Unfortunately open addressing, particularly linear probing, has such
vastly superior cache access patterns that it's hard to come close
with a chained hash table. I've previously tried a chained hash table,
and the performance gains were a lot smaller. For that reason, many
hash table implementations used in performance-critical paths appear
to use open addressing.
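
The difference shows up already in the lookup path: with linear
probing a whole probe sequence usually sits in one or two cache lines,
while chaining dereferences a pointer per conflict. Roughly (again
just a sketch, reusing the made-up entry layout from above):

#include <stddef.h>             /* for NULL */

/*
 * Linear-probing lookup: conflicts are resolved by scanning the
 * immediately following slots, so memory is accessed sequentially
 * instead of chasing per-node pointers as in separate chaining.
 */
static entry *
lookup(entry *table, uint32_t mask, uint32_t hash, uint64_t key)
{
    uint32_t pos = hash & mask;

    while (table[pos].used)
    {
        if (table[pos].hash == hash && table[pos].key == key)
            return &table[pos];
        pos = (pos + 1) & mask; /* next slot, often the same cache line */
    }
    return NULL;                /* an empty slot terminates the chain */
}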

Don't get me wrong, I certainly do think there are good cases for
using separate chaining in places where performance is less of a
concern, and possibly when you want lock-free concurrency.

> All the same, I feel that we need to assess the risk that we're
> improving average-case performance while creating large regressions in
> other cases (i.e. where there is clustering).

The actual algorithmic worst-case complexity isn't different. It's
obviously important to resize appropriately. It's not that hard to
beat dynahash's memory efficiency, so fill factors don't have to be
kept super high to avoid regressing on memory usage.
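
I.e. the growth check can be as simple as the following. The 80%
threshold is purely illustrative, not necessarily what the patch uses:

typedef struct hashtable
{
    entry   *data;              /* slot array, power-of-two size */
    uint32_t size;              /* number of slots */
    uint32_t members;           /* number of occupied slots */
} hashtable;

static void resize(hashtable *ht, uint32_t newsize);    /* rehash into newsize slots */

/*
 * Grow before an insertion would push the table past an illustrative
 * 80% fill factor.  Since slots carry no chain pointers, even such a
 * moderate fill factor keeps per-entry memory below a chained table's.
 */
static inline void
maybe_grow(hashtable *ht)
{
    if ((uint64_t) (ht->members + 1) * 5 > (uint64_t) ht->size * 4)
        resize(ht, ht->size * 2);
}
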
> Do likely() and unlikely() actually buy us anything here?

It's only a few percent here, but overall, especially when used around
error checks, it's above 10%. The reason I used it here is primarily
that it made the assembly easier to read for me... ;)
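
For reference, the usual way to define these on top of gcc's
__builtin_expect, plus a typical use around an error check (a sketch,
not necessarily our exact definitions):

#include <stdint.h>
#include <stdlib.h>

/* tell the compiler which way a branch is expected to go */
#ifdef __GNUC__
#define likely(x)   __builtin_expect((x) != 0, 1)
#define unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define likely(x)   ((x) != 0)
#define unlikely(x) ((x) != 0)
#endif

/*
 * Marking the error path as unlikely lets the compiler lay out the
 * hot path as straight-line code, without taken branches.
 */
static uint32_t *
alloc_slots(size_t nslots)
{
    uint32_t *slots = malloc(nslots * sizeof(uint32_t));

    if (unlikely(slots == NULL))
        abort();                /* error checks are the cold path */
    return slots;
}
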
Greetings,
Andres Freund