Re: WIP: dynahash replacement for buffer table - Mailing list pgsql-hackers
| From | Robert Haas |
|---|---|
| Subject | Re: WIP: dynahash replacement for buffer table |
| Date | |
| Msg-id | CA+TgmoYZ=0vO5W97v=EH18WrpsgLELOih3yYiSPyHQtBOa=XUw@mail.gmail.com |
| In response to | Re: WIP: dynahash replacement for buffer table (Andres Freund <andres@2ndquadrant.com>) |
| Responses | Re: WIP: dynahash replacement for buffer table; Re: WIP: dynahash replacement for buffer table |
| List | pgsql-hackers |
On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> The code in CHashSearch shows the problem there: you need to STORE the
>> hazard pointer before you begin to do the LOAD operations to scan the
>> bucket, and you must finish all of those LOADs before you STORE the
>> NULL hazard pointer. A read or write barrier won't provide store/load
>> or load/store ordering.
>
> I'm not sure I understand the problem here - but a read + write barrier
> should suffice? To avoid falling back to two full barriers, we'd need a
> separate define pg_read_write_barrier(), but that's not a problem. IIUC
> that should allow us to avoid emitting any actual code on x86.

Well, thinking about x86 specifically, it definitely needs at least one
mfence, after setting the hazard pointer and before beginning to read the
bucket chain. It probably doesn't need the second mfence, before clearing
the hazard pointer, because x86 allows loads to be reordered before stores
but not stores before loads. We could certainly try removing the second
fence on x86 for benchmarking purposes, and we could also check whether
mfence outperforms lock; addl in that scenario.

I tested PPC, because that's what I have. I think we're emitting two sync
instructions there, but maybe lwsync or isync would be enough in one or
both cases. The first of these links indicates that lwsync ought to be
enough for both cases; the second says that we need a sync after setting
the hazard pointer but that an lwsync is enough before clearing it. Or
that's my reading anyway.

http://www.ibm.com/developerworks/systems/articles/powerpc.html
http://peeterjoot.wordpress.com/2010/07/23/use-of-lwsync-instead-of-isync-as-a-mutex-acquire-memory-barrier/

>> > My guess is that the additional indirection via the arena explains the
>> > difference in cache misses? But I might be completely off here.
>>
>> The arena makes the calculation of the pointer address less
>> predictable, which might make things more difficult for the CPU
>> pipeline. But aside from that, I think it's the same basic idea: you
>> bounce from some sort of bucket header to the first element of the
>> chain and then, hopefully, you're done. Am I missing something?
>
> I haven't really read much of the code yet (wrote most of this email on
> my workstation before embarking on a plane), but afaics there's one
> indirection to the bucket, and then one to the first node in the linked
> list. Right?
> In dynahash it's first to the segment, but there's only few, so it's
> likely to be cached. Then to the bucket - which, afaics, already can
> contain the key.
>
> So it's not really the arena, but that you chose not to store the first
> element in a chain inline...

Hmm, you have a point. I think the reason I did it that way is because I
don't know how to make the memory management work out otherwise. If you
store the first element inline, then even an empty bucket uses up as much
memory space as one with a single element. More seriously, it breaks the
deletion algorithm. There's no good way to atomically swap out the bucket
header if it is located in a fixed position and is bigger than the largest
object the machine can manipulate atomically.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
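
For readers following the barrier discussion, here is a minimal sketch of the ordering constraint being described, written with C11 atomics rather than PostgreSQL's barrier macros; the names (`bucket_head`, `my_hazard_slot`, `chash_search_sketch`) are illustrative and not taken from the patch:

```c
/*
 * Sketch only: hazard-pointer style bucket scan, assuming made-up names.
 * The point is where a full (store/load) fence is required and where a
 * release store suffices.
 */
#include <stdatomic.h>
#include <stddef.h>

typedef struct CHashNodeSketch
{
    _Atomic(struct CHashNodeSketch *) next;
    int         key;
} CHashNodeSketch;

extern _Atomic(CHashNodeSketch *) bucket_head;   /* one bucket, for brevity */
extern _Atomic(void *) my_hazard_slot;           /* this backend's hazard pointer */

CHashNodeSketch *
chash_search_sketch(int key)
{
    CHashNodeSketch *node;

    /* Advertise the bucket we are about to scan. */
    atomic_store_explicit(&my_hazard_slot, (void *) &bucket_head,
                          memory_order_relaxed);

    /*
     * Full fence: the hazard-pointer store must become visible before any
     * load from the bucket chain, or a concurrent deleter can miss the
     * hazard pointer and free a node we are still reading.  Read or write
     * barriers alone do not order a store against later loads; on x86 this
     * is the one spot that needs mfence (or lock; addl).
     */
    atomic_thread_fence(memory_order_seq_cst);

    for (node = atomic_load_explicit(&bucket_head, memory_order_acquire);
         node != NULL;
         node = atomic_load_explicit(&node->next, memory_order_acquire))
    {
        if (node->key == key)
            break;
    }

    /*
     * Clearing the hazard pointer only has to keep the loads above from
     * moving past it (load/store ordering), so a release store is enough;
     * on x86 that is a plain store, which is why the second mfence is
     * probably unnecessary there.
     */
    atomic_store_explicit(&my_hazard_slot, NULL, memory_order_release);

    return node;
}
```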
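A similarly hedged sketch of the deletion point: with a pointer-only bucket header, unlinking the first node of a chain is a single pointer-width compare-and-swap, but with the first element stored inline the "header" becomes a multi-word object at a fixed address, with nothing of CAS width to swap. The types and names below are hypothetical, not from the patch:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct NodeSketch
{
    _Atomic(struct NodeSketch *) next;
    uint32_t    key;
} NodeSketch;

/* Pointer-only bucket header, as in the layout described above. */
typedef struct
{
    _Atomic(NodeSketch *) head;
} PtrBucketSketch;

/*
 * Unlinking the first node is one pointer-sized CAS (a real lock-free
 * deletion also has to mark the victim's next pointer; omitted here).
 */
static bool
unlink_first_sketch(PtrBucketSketch *b, NodeSketch *victim)
{
    NodeSketch *expected = victim;
    NodeSketch *succ = atomic_load_explicit(&victim->next,
                                            memory_order_acquire);

    return atomic_compare_exchange_strong(&b->head, &expected, succ);
}

/*
 * Hypothetical inline layout: the first element lives in the bucket at a
 * fixed address.  Deleting it would mean replacing key + next together,
 * which is wider than a single CAS on most machines, so the deletion
 * algorithm has no atomic step to rely on.
 */
typedef struct
{
    uint32_t    key;                 /* first element's key, stored inline */
    _Atomic(NodeSketch *) next;      /* rest of the chain */
} InlineBucketSketch;
```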