
From: Thomas Munro
Subject: Re: WIP: Barriers
Msg-id: CAEepm=3yJ65sQZUAhfF3S7UfEv83X_rnH5a4-JXmqxGQRQ+7qQ@mail.gmail.com
In response to: Re: WIP: Barriers (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Tue, Aug 16, 2016 at 1:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Aug 13, 2016 at 7:18 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> My use case for this is coordinating the phases of parallel hash
>> joins, but I strongly suspect there are other cases.  Parallel sort
>> springs to mind, which is why I wanted to post this separately and
>> earlier than my larger patch series, to get feedback from people
>> working on other parallel features.
>
> I was thinking about this over the weekend and I started to wonder
> whether this is really going to work well for hash joins.  For
> example, suppose that 6GB of work_mem is available and the projected
> size of the hash table is 8GB.  Clearly, we're going to need 2
> batches, but, if our estimates are accurate and the tuples split
> evenly between batches, each batch will be only 4GB!  That means that
> we can build up to 2GB of the hash table for the next batch before
> we've finished with the hash table for the previous batch.  It seems
> like a really good idea to try to take advantage of that as much as
> possible.

Right.  Thanks for that example.  A naive approach with barriers
between building and probing does indeed introduce a kind of pipeline
stall where workers twiddle their thumbs, despite there being unused
work_mem and useful things that could be done concurrently to fill it.
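
Just to spell out the shape I mean, the naive version is basically the
following (illustration only: BarrierInit()/BarrierWait() are in the
spirit of barrier-v1.patch, and the hash join helpers are names I've
made up here, not anything from a patch):

    /* Done once by whoever sets up the shared state, pthread style. */
    BarrierInit(&build_barrier, nworkers);

    /* Then, in every worker: */

    /* Build: load my share of inner tuples into the shared hash table. */
    while ((tuple = next_inner_tuple()) != NULL)
        hash_table_insert(hash_table, tuple);

    /*
     * Nobody may probe an incomplete hash table, so everyone waits here.
     * That's the stall: workers idle even though work_mem may not be
     * full and the next batch could usefully be loading.
     */
    BarrierWait(&build_barrier);

    /* Probe: probe with my share of outer tuples. */
    while ((tuple = next_outer_tuple()) != NULL)
        hash_table_probe(hash_table, tuple);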

> The simple version of this is that when a worker gets done with its
> own probe phase for batch X, it can immediately start building the
> hash table for phase X+1, stopping if it fills up the unused portion
> of work_mem before the old hash table goes away.  Of course, there are
> some tricky issues with reading tapes that were originally created by
> other backends, but if I understand correctly, Peter Geoghegan has
> already done some work on that problem, and it seems like something we
> can eventually solve, even if not in the first version.
>
> The more complicated version of this is that we might want to delegate
> one or more workers to start building as much of the next-batch hash
> table as will fit instead of assisting with the current probe phase.
> Once work_mem is full, they join the probe phase and continue until
> it's done.  Again, tape management is an issue.  But you can see that
> if you can make this work, in this example, you can reduce the
> enforced pause between batches by about 50%; half the work is already
> done by the time the old hash table goes away.  I bet that has a
> chance of being fairly significant, especially for hash joins that
> have tons of batches.  I once saw a 64-batch hash join outperform a
> nested loop with inner index scan!
>
> Anyway, my point here is that I'm not sure whether the barrier
> mechanic is going to work well for computations with overlapping
> phases, and I suspect that overlapping phases is going to be an
> important technique, so we should make sure not to settle into a
> synchronization model that makes it hard.

I have a draft scheme worked out to do what you called the "simple
version", though not in the first version I will post (the "naive
version").  I don't really want to go into all the details in this
thread, not least because I'm still working on it, but I do want to
point out that barriers as a synchronisation primitive are not at all
incompatible with overlapping work: on the contrary, they can help
coordinate it while keeping the code tractable.

Rough sketch*:  There will be a secondary hash table, used for
preloading the next batch.  As in the naive version, there is a
barrier after the hash table is loaded: you can't start probing an
incomplete hash table.  But after probing the primary table for batch
n, workers will immediately begin preloading the secondary table with
tuples for batch n + 1, until either work_mem is exhausted or batch n
+ 1 is entirely loaded.  Then they will synchronise, elect one worker
to promote the secondary table to primary, synchronise again, finish
loading the newly promoted primary table, and then rinse and repeat:
synchronise to coordinate the beginning of probing for batch n + 1...

Of course there are also several other phases relating to
initialization, hashing, resizing and special cases for first and last
batches, as I will describe in a future thread with code.

*Actual patch may not resemble illustration.
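
In code shape, one worker's view of the per-batch loop would be
something like this (same caveat: every name here is invented for
illustration, and I'm assuming that BarrierWait() reports exactly one
participant as "elected", in the spirit of
PTHREAD_BARRIER_SERIAL_THREAD):

    /* Assumes batch 0's primary table was loaded before entering the loop. */
    for (batchno = 0; batchno < nbatch; batchno++)
    {
        /* Probe the primary table for this batch. */
        while ((tuple = next_outer_tuple(batchno)) != NULL)
            probe(primary, tuple);

        /* Preload batch n + 1 until work_mem runs out or it's all in. */
        while (!work_mem_exhausted() &&
               (tuple = next_inner_tuple(batchno + 1)) != NULL)
            insert(secondary, tuple);

        if (BarrierWait(&batch_barrier))        /* everyone finished probing */
            promote_secondary_to_primary();     /* elected worker only */
        BarrierWait(&batch_barrier);            /* wait for the promotion */

        /* Finish loading the newly promoted primary table. */
        while ((tuple = next_inner_tuple(batchno + 1)) != NULL)
            insert(primary, tuple);

        BarrierWait(&batch_barrier);            /* probing of batch n + 1 may begin */
    }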

>> A problem that I'm still grappling with is how to deal with workers
>> that fail to launch.  What I'm proposing so far is based on static
>> worker sets, where you can only give the number of workers at
>> initialisation time, just like pthread_barrier_init.  Some other
>> libraries allow for adjustable worker sets, and I'm wondering if a
>> parallel leader might need to be able to adjust the barrier when it
>> hears of a worker not starting.  More on that soon.
>
> I think tying this into the number of workers for the parallel context
> in general is going to get you into trouble.  For example, suppose
> that we have an Append plan and beneath that we have two children of
> each of which is a Hash Join.  Right now, Append is dumb, so it will
> blindly throw all of the workers at the first Hash Join and then, as
> they emerge, it will throw them at the second one.  However, we've had
> previous discussions which I'm too lazy to look up right now about
> making a Parallel Append that would split the workers between the two
> hash joins; if one of them finished, then those workers could go join
> the other hash join in medias res.

Yeah.  I continued grappling with this and found a way forward that I
am quite happy with for now.

Picture an N-lane Olympic swimming pool with N identical swimmers at
the starting blocks waiting for the gun to fire, with a rule that they
must wait for all swimmers to reach the end before the next heat of
the race begins.  That's what textbook parallel computing of the type
supported by pthread_barrier_t looks like, and what barrier-v1.patch
provided.

Postgres parallel query is not at all like that: it's more like a
flash mob pool party.  The referee fires the starting gun, and some
number of swimmers show up some time soonish, or not, and they jump in
anywhere they like; the referee (the backend executing the gather
node) may get bored and decide to dive-bomb into the middle of it all
too.  In future, the executor may send in new swimmers at random by
parachute!  OK, sorry for the tortured analogy... but the point is
that the size of the party may change at any time and new workers need
to synchronise with the active phase.

Here's a new version of barriers extended to allow for dynamic worker
parties.  It's something like .NET's System.Threading.Barrier, Java's
java.util.concurrent.Phaser and X10's clock, so I have at least a
small amount of confidence that it's not completely crackpot.  This
addition allows new workers to arrive at any time and get in sync with
the ongoing work: for example, if the gather node feels the impulse to
jump in, it will join the current activity and pitch in.
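
The intended usage pattern is roughly this (paraphrased; the attached
patch is authoritative for the actual names and signatures, and I'm
assuming here that attaching tells you which phase you are joining at):

    phase = BarrierAttach(&barrier);    /* join the party at the current phase */

    while (!all_work_done())
    {
        do_my_share_of(phase);          /* whatever the current phase needs */
        BarrierWait(&barrier);          /* all attached workers advance together */
        phase++;                        /* or re-read the phase from the barrier */
    }

    BarrierDetach(&barrier);            /* leave without holding anyone else up */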

There are some complexities I'm still working out: it would be nice to
figure out how to remain attached to the barrier if I know for certain
that the caller will call again (basically an ExecutorRun(..., ..., 0)
loop, as opposed to a one-off call from a gather node).

> [...]
> That approach is not entirely satisfying, though.  As discussed on the
> thread about asynchronous and vectorized execution, it's desirable
> that, when a particular branch of a parallel query can't use any more
> workers - e.g. because they are all waiting for the next phase to
> begin - those workers can leave and go do something else.
> [...]

I believe my current approach is not incompatible with those future
work scheduling ideas.  The key requirement seems to be the ability to
detach and walk away without holding anyone else up, and reattach and
help out with whatever is currently going on, synchronising with the
active phase as appropriate.  From the point of view of an
executor node, there are basically two things you need to support a
user space coroutine/green thread/fibre/event style scheduler as I see
it: a way to yield and a way to continue from the right point, and the
latter is already covered.  Whatever yield-like operations are
invented in future will hopefully just slot into the right places,
without changing the underlying parallel hash join algorithm.
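
Concretely, "continuing from the right point" just means dispatching on
the barrier's current phase when (re)joining, assuming some way to read
it, say a BarrierPhase() accessor, something like this (phase names
invented; the real phase machine belongs in the hash join thread):

    switch (BarrierPhase(&batch_barrier))       /* where did I come in? */
    {
        case PHASE_LOADING:
            help_load_primary_table();
            break;
        case PHASE_PROBING:
            help_probe_and_preload_next_batch();
            break;
        /* ... and so on for the other phases ... */
    }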

>> I thought about using a different name to avoid colliding with
>> barrier.h and overloading the term: there are of course also compiler
>> barriers and memory barriers.  But then I realised that that header
>> was basically vacant real estate, and 'barrier' is the super-well
>> established standard term for this parallel computing primitive.
>
> If we're going to remove barrier.h, I think that should be a separate
> commit from creating a new barrier.h.

OK.  Here's a patch to remove the old header, and the v2 barrier patch
which adds phases and attach/detach.  As before, it depends on
condition-variable-v2.patch.

--
Thomas Munro
http://www.enterprisedb.com
