Re: WIP: dynahash replacement for buffer table - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: WIP: dynahash replacement for buffer table
Date:
Msg-id: CA+TgmobA=sBeX9P4T8sGJSsUsEqZWcqEJeN13y71J-2ewag11g@mail.gmail.com
In response to: Re: WIP: dynahash replacement for buffer table (Andres Freund <andres@2ndquadrant.com>)
Responses: Re: WIP: dynahash replacement for buffer table (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Sun, Nov 9, 2014 at 3:58 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > * There's several cases where it's noticeable that chash creates more
>> >   cacheline misses - that's imo a good part of why the single threaded
>> >   performance regresses. There's good reasons for the current design
>> >   without an inline bucket, namely that it makes the concurrency handling
>> >   easier. But I think that can be countered by still storing a pointer -
>> >   just to an element inside the bucket. Afaics that'd allow this to be
>> >   introduced quite easily?
>>
>> It's not obvious to me how that would work.  If you just throw those
>> elements on the garbage lists with everything else, it will soon be
>> the case that one bucket header ends up using the cell stored in some
>> other bucket, which isn't going to be awesome.  So you need some kind
>> of more complex scheme to preserve locality.
>
> Treating the element inside the bucket as a kind of one element freelist
> seems quite workable to me. Test whether it's marked deleted, scan the
> hazard pointer list to see whether it can be used. If yes, grand. If
> not, go to the current code. The fact that the element is cacheline
> local will make the test for its deletedness almost free. Yes, the
> hazard pointer scan surely ain't free - but at least for cases like
> bufmgr where reads are never less frequent than writes and very often
> *much* more frequent I'm pretty sure it'd be a noticeable win.

Maybe. I'd expect that to radically increase the time spent doing
hazard pointer scans.  The performance of this system depends heavily
on garbage collection not happening too often, and ISTR that the
performance changes significantly if you vary the amount of
overallocation.

>> I'm not sure.  We're talking about something on the order of half a
>> percent on my tests, and lots of changes cause things to bounce around
>> by that much.  I'm not sure it makes sense to get too worked up about
>> that if the alternative is to add a big pile of new complexity.
>
> I saw things in the range of 3-4% on my laptop.

Mrmpf.  Well, that sucks.

>> > * I don't understand right now (but then I'm in a Hotel bar) why it's
>> >   safe for CHashAddToGarbage() to clobber the hash value?
>> >   CHashBucketScan() relies on the chain being ordered by hashcode. No?
>> >   Another backend might just have been past the IsMarked() bit and look
>> >   at the hashcode?
>>
>> I think the worst case scenario is that we falsely conclude that
>> there's a hash match when there really isn't.  The ensuing memcmp()
>> will clarify matters.
>
> Hm. Let me think about it for a while.
>
> I was thinking that a spurious cmp < 0 could cause us to miss a
> match - but that could only happen if the match just has been removed
> from the list. Which is fine. Perhaps that case deserves to be mentioned
> in the comment?

Maybe.  I mean, the general principle here is that there may be some
difference between the apparent order of execution on one CPU as
opposed to another, and the code that uses the hash table has to be OK
with that - at least, unless it has independent methods of assuring
that things happen in the right order, but in that case that other
thing may well become the bottleneck.  This is just one example of
that.  You might also fail to see an insert that's just happened on
some other node but is not committed to main memory yet, which is not
really an issue we need to fix; it's just how things are.  A general
discussion of this somewhere might be worthwhile, but it's pretty much
the same as any other highly-concurrent hashtable implementation you'll
find out there.

(It's also not really different from surrounding the hash table
operation, and only the hash table operation, with a big lock.  Then
things can't change while you are scanning the bucket list, but they
can change by the time you can do anything with the returned value,
which is the same problem we have to cope with here.)

> * Another thing I'm wondering about here is whether it's possible that
> somebody is currently walking an "older version" of the bucket list
> (i.e. is currently at an already unlinked element) and then visits a
> newly inserted element with an 'apparently' out of order hash
> value. That seems possible because the inserter might not have seen the
> unlinked element. Afaics that's not a problem for searches - but I'm not
> sure whether it couldn't cause a problem for concurrent inserts and
> deletes.

This is why the delete-mark bit has to be part of the same atomic word
as the next-pointer.  If somebody applies a delete-mark, a subsequent
attempt to insert an entry at that position will fail because the
CAS() of the next-word won't find the right value there - it will find
a delete-marked value, which will never be the value it's expecting.

> * One thing I noticed while looking at that part of code is:
> +               h = target_node->un.hashcode;
> +               if (h == hashcode)
> +                       cmp = memcmp(CHashNodeGetItem(target_node), key,
> +                                                table->desc.key_size);
> +               else if (h > hashcode)
> +                       cmp = 1;
> +               else
> +                       cmp = -1;
>
> Compilers can feel entirely free to reload local variables from memory
> (i.e. use target_node->un.hashcode instead of h) in situations like
> these. That's why I used the ACCESS_ONCE macros in my freelist
> patch. The same is done in a couple of other places (like looking at
> ->next). I'm *very* wary of relying on the compiler to not do stuff like
> that. I unfortunately think you'll need similar things here.

Even if the compiler decides to fetch the target node's hash code
twice, it won't create a correctness issue, because it's guaranteed to
see the same value both times.  The hash code is set before adding the
item to the list (with a memory barrier to enforce ordering), and it's
never reset until the item is put back on a new list - which requires
an intervening cycle of garbage collection and hazard-pointer
scanning, so we can't still be looking at the old item at that point.

> * I noticed is the following comment:
> +        * Note: We can't begin this operation until the clearing of the
> +        * garbage list has been committed to memory, but since that was
> +        * done using an atomic operation no explicit barrier is needed
> +        * here.
> +        *
>
> Uh. I don't think that holds true. A memory barrier on one cpu doesn't
> forbid all kinds of reordering on others.

Hmm, you might be right about that.  I thought that was guaranteed,
but a quick look at http://en.wikipedia.org/wiki/Memory_ordering
suggests otherwise.

> * How is the consistency of the element data guaranteed? Afaics there
>   very well could be two concurrent CHashInsert()'s for the same key
>   interleaving inside their respective memcpy()s. It'll often be fine
>   for small element sizes, but even there the compiler might resort to
>   a char-by-char memcpy.

The two operations would have to CAS() the same next-pointer, and one
of those operations will fail.  Note that the item is completely
initialized *before* we insert it into the hash table, so the premise
that two memcpy() operations can target the same address is just
wrong.

> * Should CHashInsert perhaps be renamed to CHashSetKey or CHashUpdate -
>   since it's not actually just inserting?

This may be the root of the confusion.  It *is* just inserting.  It
does not update; there is no update operation, and like most
algorithms for highly-concurrent hash tables, the algorithm can't
support such an operation.  The copy goes the other way: if an insert
fails, the caller gets back the value already present in the table,
which cannot be concurrently getting modified for the same reasons the
hashcode can't be modified while a bucket scan is looking at it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


