Home > mailing lists

Re: WIP: dynahash replacement for buffer table - Mailing list pgsql-hackers

From	Ryan Johnson
Subject	Re: WIP: dynahash replacement for buffer table
Date	October 16, 2014 12:03:37
Msg-id	543FB408.6020206@cs.utoronto.ca Whole thread Raw
In response to	Re: WIP: dynahash replacement for buffer table (Robert Haas <robertmhaas@gmail.com>)
Responses	Re: WIP: dynahash replacement for buffer table
List	pgsql-hackers

Tree view

<div class="moz-cite-prefix">On 15/10/2014 11:41 AM, Robert Haas wrote:<br /></div><blockquote
cite="mid:CA+TgmoZ=xR14my-mMPZcuO1RGwXbdppytUU3vfT8Oca-2WYRRQ@mail.gmail.com"type="cite"><pre wrap="">On Wed, Oct 15,
2014at 2:03 AM, Andres Freund <a class="moz-txt-link-rfc2396E"
href="mailto:andres@2ndquadrant.com"><andres@2ndquadrant.com></a>wrote:
 
</pre><blockquote type="cite"><pre wrap="">On 2014-10-14 17:53:10 -0400, Robert Haas wrote:
</pre><blockquote type="cite"><pre wrap="">On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund <a
class="moz-txt-link-rfc2396E"href="mailto:andres@2ndquadrant.com"><andres@2ndquadrant.com></a> wrote:
 
</pre><blockquote type="cite"><blockquote type="cite"><pre wrap="">The code in CHashSearch shows the problem there: you
needto STORE the
 
hazard pointer before you begin to do the LOAD operations to scan the
bucket, and you must finish all of those LOADs before you STORE the
NULL hazard pointer.  A read or write barrier won't provide store/load
or load/store ordering.
</pre></blockquote><pre wrap="">
I'm not sure I understand the problem here - but a read + write barrier
should suffice? To avoid falling back to two full barriers, we'd need a
separate define pg_read_write_barrier(), but that's not a problem. IIUC
that should allow us to avoid emitting any actual code on x86.
</pre></blockquote><pre wrap="">
Well, thinking about x86 specifically, it definitely needs at least
one mfence, after setting the hazard pointer and before beginning to
read the bucket chain.
</pre></blockquote><pre wrap="">
Yes, I can see that now. I do wonder if that's not going to prove quite
expensive... I think I can see some ways around needing that. But I
think that needs some benchmarking first - no need to build a even more
complex solution if not needed.

The solution I'm thinking of is to essentially move away from hazard
pointers and store something like a generation counter per
backend. Which is updated less often, and in combination with other
operations. When a backend need to clean up and sees that there's a
straggler with a old generation it sends that backend a signal to ensure
it sets the latest generation.
</pre></blockquote><pre wrap="">
It's possible that might work ... but on the timescale we're talking
about here, that's asking the garbage collecting process to wait for
practically geologic time.

Back when I first wrote this code, I spent a fair amount of time
looking at existing work in the area of lock-free hash tables.
Essentially all of the work I was able to find on this topic assumes a
threaded model - or more precisely, it assumes that shared memory
allocation is cheap and easy and you'll have no problem getting as
much of it as you need whenever you need it.  This assumption often
isn't even spelled out explicitly: it's just assumed that that's the
programming environment you're working in.  Finding highly parallel
algorithms that don't rely on memory allocation as a primitive is
hard.  Hazard pointers are one of the few techniques I found that
seems like it can work in our architecture.  I'm quite reluctant to
leave aside techniques that have been proven to work well in favor of
inventing something entirely novel to PostgreSQL.

That having been said, there is some literature on generation numbers,
and I think something like that could be made to work.  It does have
some significant disadvantages, though.  One is that a single process
which fails to update its generation number prevents all reclamation,
system-wide.    In my algorithm, a process whose execution is
suspended while a hazard pointer is set prevents recycling of only one
of many garbage lists.  A process searching for a reusable element can
mostly likely find some other garbage list to reclaim instead.  Also,
a generation number implies that we update the value periodically,
rather than before and after each critical section.  Anything that
forces garbage collection to be postponed longer than absolutely
necessary seems to me likely to be a loser.  It might be a little
faster as long as we have free elements to allocate, but as soon as
we're out and have to wait longer than we otherwise would for garbage
collection, and all system activity halts as a result, even for a few
milliseconds, it's going to be a whole lot slower.  Or at least, I
think.

That having been said, I don't know what to do about the fact that the
fence is too expensive.  I don't know that we've really established
that that's the true root of the problem rather than some other
pedestrian optimization failure.  But the existing code is using an
atomic operation to acquire a spinlock, then releasing it, walking the
bucket chain, and then using another atomic operation to acquire a
spinlock and then releasing it again.  Surely a pure fence shouldn't
cost more than a spinlock cycle?  Even with arguably one extra cache
line touch, that seems like it ought to be a win.  But my intuitions
in this area are shaky.</pre></blockquote> Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems like
anideal fit...<br /><br /> In brief, RCU has the following requirements:<br /><ul><li>Read-heavy access pattern<br
/><li>Writersmust be able to make dead objects unreachable to new readers (easily done for most data
structures)<li>Writersmust be able to mark dead objects in such a way that existing readers know to ignore their
contentsbut can still traverse the data structure properly (usually straightforward)<li> Readers must occasionally
informthe system that they are not currently using any RCU-protected pointers (to allow resource reclamation)</ul> [1]
<aclass="moz-txt-link-freetext" href="http://www.rdrop.com/~paulmck/RCU/">http://www.rdrop.com/~paulmck/RCU/</a><br
/><br/> If the above are satisfied, then:<br /><ul><li>Readers need no synchronization at all (not even compiler
fences)<li>Writerscan use the synchronization mechanism of their choice to coordinate with each other (lwlock, atomic
CAS,etc.)<br /><li>Object reclamation can be performed in the background, or synchronously (and incrementally, if
desired)</ul><br /> I've had very good results implementing user-level RCU for my own database projects. It slashes the
complexityof reasoning about concurrent reader accesses. Implemented properly, it requires only a bit of shared memory,
hasextremely low overhead in the common case of no stragglers, and requires only minimal communication except when
stragglersare present (and even then mostly between stragglers). It's reasonably simple to implement in userspace dbms
context(< 1kLoC with comments, assertions, etc.). Just don't forget to quiesce all in-flight back ends at regular
intervals,or the system won't be able to reclaim anything...<br /><br /> Thoughts?<br /> Ryan<br /><br /><br />

pgsql-hackers by date:

From: Michael Paquier
Date: 16 October 2014, 12:01:41
Subject: Re: Incorrect initialization of sentPtr in walsender.c

From: Jan Wieck
Date: 16 October 2014, 12:04:39
Subject: Moving of INT64_FORMAT to c.h

Re: WIP: dynahash replacement for buffer table - Mailing list pgsql-hackers

Previous

Next