Re: WIP: dynahash replacement for buffer table - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: WIP: dynahash replacement for buffer table
Date:
Msg-id: CA+TgmoYZ=0vO5W97v=EH18WrpsgLELOih3yYiSPyHQtBOa=XUw@mail.gmail.com
In response to: Re: WIP: dynahash replacement for buffer table (Andres Freund <andres@2ndquadrant.com>)
Responses: Re: WIP: dynahash replacement for buffer table
           Re: WIP: dynahash replacement for buffer table
List: pgsql-hackers
On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> The code in CHashSearch shows the problem there: you need to STORE the
>> hazard pointer before you begin to do the LOAD operations to scan the
>> bucket, and you must finish all of those LOADs before you STORE the
>> NULL hazard pointer.  A read or write barrier won't provide store/load
>> or load/store ordering.
>
> I'm not sure I understand the problem here - but a read + write barrier
> should suffice? To avoid falling back to two full barriers, we'd need a
> separate define pg_read_write_barrier(), but that's not a problem. IIUC
> that should allow us to avoid emitting any actual code on x86.

Well, thinking about x86 specifically, it definitely needs at least
one mfence, after setting the hazard pointer and before beginning to
read the bucket chain.  It probably doesn't need the second mfence,
before clearing the hazard pointer, because x86 allows loads to be
reordered ahead of earlier stores but not stores ahead of earlier
loads.  We could certainly try removing the second fence on x86 for
benchmarking purposes, and we could also check whether mfence
outperforms lock; addl in that scenario.
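
Roughly, the reader side looks like this; it's only a sketch of the
ordering requirement, and the identifiers (my_hazard_slot, scan_bucket,
HashNode) are illustrative, not the names the patch actually uses:

#include "postgres.h"
#include "storage/barrier.h"

typedef struct HashNode
{
    struct HashNode *next;
    uint32      hashcode;
    /* key and value would follow here */
} HashNode;

/* this backend's published hazard pointer (hypothetical) */
extern void *volatile *my_hazard_slot;

static HashNode *
scan_bucket(HashNode *volatile *bucket_head, uint32 hashcode)
{
    HashNode   *node;

    /* Advertise the bucket we are about to scan. */
    *my_hazard_slot = (void *) bucket_head;

    /*
     * Store/load fence: the hazard-pointer STORE must become visible
     * before any of the LOADs that walk the chain.  A read or write
     * barrier is not enough here; on x86 this is the mfence (or
     * lock; addl), on PPC a sync.
     */
    pg_memory_barrier();

    for (node = *bucket_head; node != NULL; node = node->next)
    {
        if (node->hashcode == hashcode)
            break;
    }

    /*
     * Load/store ordering: all of the chain LOADs must finish before the
     * NULL STORE below.  x86 already forbids that reordering, and lwsync
     * would do on PPC, so this is the fence we might be able to weaken.
     */
    pg_memory_barrier();
    *my_hazard_slot = NULL;

    return node;
}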

I tested PPC, because that's what I have.  I think we're emitting two
sync instructions there, but maybe lwsync or isync would be enough in
one or both cases.  The first of these links indicates that lwsync
ought to be enough for both cases; the second says that we need a sync
after setting the hazard pointer but that an lwsync is enough before
clearing it.  Or that's my reading, anyway.

http://www.ibm.com/developerworks/systems/articles/powerpc.html
http://peeterjoot.wordpress.com/2010/07/23/use-of-lwsync-instead-of-isync-as-a-mutex-acquire-memory-barrier/
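
For what it's worth, a pg_read_write_barrier() along the lines you
describe might look roughly like this; no such macro exists in the tree
today, so this is just a sketch of the idea:

/*
 * Hypothetical pg_read_write_barrier(): orders loads against later loads
 * and stores against later stores, but provides no store/load ordering.
 * x86 hardware already gives those orderings, so only a compiler barrier
 * is needed there; on PPC, lwsync would do.  Note that it would still be
 * too weak for the fence after setting the hazard pointer, which needs
 * store/load ordering (mfence or lock; addl on x86, sync on PPC).
 */
#if defined(__i386__) || defined(__x86_64__)
#define pg_read_write_barrier()  __asm__ __volatile__("" : : : "memory")
#elif defined(__powerpc__) || defined(__powerpc64__)
#define pg_read_write_barrier()  __asm__ __volatile__("lwsync" : : : "memory")
#else
#define pg_read_write_barrier()  pg_memory_barrier()
#endif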

>> > My guess is that the additional indirection via the arena explains the
>> > difference in cache misses? But I might be completely off here.
>>
>> The arena makes the calculation of the pointer address less
>> predictable, which might make things more difficult for the CPU
>> pipeline.  But aside from that, I think it's the same basic idea: you
>> bounce from some sort of bucket header to the first element of the
>> chain and then, hopefully, you're done.  Am I missing something?
>
> I haven't really read much of the code yet (wrote most of this email on
> my workstation before embarking on a plane), but afaics there's one
> indirection to the bucket, and then one to the first node in the linked
> list. Right?
> In dynahash it's first to the segment, but there are only a few, so
> it's likely to be cached. Then to the bucket - which, afaics, can
> already contain the key.
>
> So it's not really the arena, but that you chose not to store the first
> element in a chain inline...

Hmm, you have a point.  I think the reason I did it that way is that I
don't know how to make the memory management work out otherwise.  If
you store the first element inline, then even an empty bucket uses as
much memory as one with a single element.  More seriously, it breaks
the deletion algorithm: there's no good way to atomically swap out the
bucket header if it is located at a fixed position and is larger than
the largest object the machine can manipulate atomically.
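
To illustrate (these structs are made up for the example, not taken
from the patch): with a pointer-only bucket header, unlinking the first
node is a single pointer-sized compare-and-swap, but with the first
entry stored inline the header grows to pointer + hash + key and can no
longer be replaced in one atomic operation:

#include "postgres.h"

#define KEY_SIZE 20             /* e.g. roughly sizeof(BufferTag) */

typedef struct ChainNode
{
    struct ChainNode *next;
    uint32      hashcode;
    char        key[KEY_SIZE];
} ChainNode;

/*
 * Pointer-only bucket header: deleting the first node is one
 * pointer-sized compare-and-swap, and an empty bucket costs one pointer.
 */
typedef struct Bucket
{
    ChainNode  *head;
} Bucket;

/*
 * First entry stored inline: an empty bucket is as large as a populated
 * one, and the header is now wider than anything the hardware can
 * compare-and-swap in a single instruction, which is what breaks the
 * lockless deletion algorithm.
 */
typedef struct FatBucket
{
    ChainNode  *next;           /* remainder of the chain */
    uint32      hashcode;       /* hash of the inline entry */
    char        key[KEY_SIZE];  /* inline entry's key */
} FatBucket;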

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


