Re: WIP: dynahash replacement for buffer table - Mailing list pgsql-hackers

From Andres Freund
Subject Re: WIP: dynahash replacement for buffer table
Msg-id 20141015202339.GI7242@alap3.anarazel.de
In response to Re: WIP: dynahash replacement for buffer table  (Robert Haas <robertmhaas@gmail.com>)
On 2014-10-15 13:41:25 -0400, Robert Haas wrote:
> On Wed, Oct 15, 2014 at 2:03 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > The solution I'm thinking of is to essentially move away from hazard
> > pointers and store something like a generation counter per
> > backend, which is updated less often, and in combination with other
> > operations. When a backend needs to clean up and sees that there's a
> > straggler with an old generation, it sends that backend a signal to
> > ensure it sets the latest generation.
> 
> It's possible that might work ... but on the timescale we're talking
> about here, that's asking the garbage collecting process to wait for
> practically geologic time.

I think it depends on what we're tying the generation increase to. We
could very well add an implicit generation increment to, say, lwlock
acquisition - which is a full barrier anyway. And I don't think we'll
normally have that high a rate of garbage collection - there should be
plenty of headroom.

> Back when I first wrote this code, I spent a fair amount of time
> looking at existing work in the area of lock-free hash tables.
> Essentially all of the work I was able to find on this topic assumes a
> threaded model - or more precisely, it assumes that shared memory
> allocation is cheap and easy and you'll have no problem getting as
> much of it as you need whenever you need it.

FWIW, I think many of the solutions that are actually used in practice
don't lean on cheap allocation that heavily. I know of at least some
that require memory to be reserved beforehand, in special contexts.

> This assumption often
> isn't even spelled out explicitly: it's just assumed that that's the
> programming environment you're working in.  Finding highly parallel
> algorithms that don't rely on memory allocation as a primitive is
> hard.  Hazard pointers are one of the few techniques I found that
> seems like it can work in our architecture.  I'm quite reluctant to
> leave aside techniques that have been proven to work well in favor of
> inventing something entirely novel to PostgreSQL.

I don't think something like generation numbers is really that new and
unproven - it's essentially a simpler version of RCU, which is used
quite heavily, under various names.

That said, I'm far from convinced that they are the solution - they just
were the first thing that came to my mind thinking about the problem.

> That having been said, there is some literature on generation numbers,
> and I think something like that could be made to work.  It does have
> some significant disadvantages, though.  One is that a single process
> which fails to update its generation number prevents all reclamation,
> system-wide.  In my algorithm, a process whose execution is
> suspended while a hazard pointer is set prevents recycling of only one
> of many garbage lists.  A process searching for a reusable element can
> most likely find some other garbage list to reclaim instead.

Hm. Couldn't a similar scheme with several lists be used with generation
numbers?

> Also, a generation number implies that we update the value
> periodically, rather than before and after each critical section.

Hm. I don't think it necessarily has to do that.

> Anything that forces garbage collection to be postponed longer than
> absolutely necessary seems to me likely to be a loser.  It might be a
> little faster as long as we have free elements to allocate, but as
> soon as we're out and have to wait longer than we otherwise would for
> garbage collection, and all system activity halts as a result, even
> for a few milliseconds, it's going to be a whole lot slower.  Or at
> least, I think.

I think it really depends on the user of the facility. If the generation
increment were e.g. also tied to lwlock acquisition, I can't
realistically see bufmgr.c outrunning it.

> That having been said, I don't know what to do about the fact that the
> fence is too expensive.

I'm far from sure that it's the problem at hand here - the reason I'm
wondering about the barriers is primarily that the buffer mapping hash
table is one of the top profile entries in all concurrent workloads -
often even the top one, once some other locking bottleneck is
removed. Nearly all of the time spent there is due to cache
misses. While I think we can make the table smaller and more efficient
for the same NBuffers value, realistically we need to cope with much
bigger NBuffers values.

Since cache misses are a problem we're going to have to deal with,
restricting the processor's tools for dealing with them efficiently
(out-of-order execution, SMT) doesn't seem like a wise choice.

> I don't know that we've really established
> that that's the true root of the problem rather than some other
> pedestrian optimization failure.

Absolutely.

> But the existing code is using an
> atomic operation to acquire a spinlock, then releasing it, walking the
> bucket chain, and then using another atomic operation to acquire a
> spinlock and then releasing it again.

Well, we don't do that for lookups right now. Just for
insertions/removals. Or are you talking about the buffer mapping lock?

> Surely a pure fence shouldn't cost more than a spinlock cycle?

It really shouldn't - the difference right now is that there's other
work the processor can do while dealing with the cache miss.

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


