Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date
Msg-id CAH2-WzmrSsQV-L4n+Qd3dOddAvxbOFEe99Ciy8xaqpc+T=aB8Q@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
List pgsql-hackers
On Thu, Jan 18, 2018 at 5:53 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I guess the fact that tuplesort_leader_wait() could be optional means
> that it could be removed, which means that we could in fact throw out
> the last condition variable within tuplesort.c, and fully rely on
> using a barrier for everything within nbtsort.c. However,
> tuplesort_leader_wait() seems kind of like something that we should
> have on general principle. And, more importantly, it would be tricky
> to use a barrier even for this, because we still have that baked-in
> assumption that nworkers_launched is the single source of truth about
> the number of participants.

On third thought, tuplesort_leader_wait() should be removed entirely,
and tuplesort.c should get entirely out of the IPC business (it should
do the bare minimum of recording/reading a little state in shared
memory, while knowing nothing about condition variables, barriers, or
anything declared in parallel.h). Thinking about dealing with 2 spools
at once clinched it for me -- calling tuplesort_leader_wait() for both
underlying Tuplesortstates was silly, especially because there is
still a need for an nbtsort.c-specific wait for workers to fill-in
ambuild stats. When I said "tuplesort_leader_wait() seems kind of like
something that we should have on general principle", I was wrong.

It's normal for parallel workers to have all kinds of overlapping
responsibilities, and tuplesort_leader_wait() was doing something that
I now imagine isn't desirable to most callers. They can easily provide
something equivalent at a higher level. Besides, they'll very likely
be forced to anyway, due to some high level, caller-specific need --
which is exactly what we see within nbtsort.c.

Attached patch details:

* The patch synchronizes processes used the approach just described.
Note that this allowed me to remove several #include statements within
tuplesort.c.

* The patch uses only a single condition variable for a single wait
within nbtsort.c, for the leader. No barriers are used at all (and, as
I said, tuplesort.c doesn't use condition variables anymore). Since
things are now very simple, I can't imagine anyone still arguing for
the use of barriers.

Note that I'm still relying on nworkers_launched as the single source
of truth on the number of participants that can be expected to
eventually show up (even if they end up doing zero real work). This
should be fine, because I think that it will end up being formally
guaranteed to be reliable by the work currently underway from Robert
and Amit. But even if I'm wrong about things going that way, and it
turns out that the leader needs to decide how many putative launched
workers don't "get lost" due to fork() failure (the approach which
Amit apparently advocates), then there *still* isn't much that needs
to change.

Ultimately, the leader needs to have the exact number of workers that
participated, because that's fundamental to the tuplesort approach to
parallel sort. If necessary, the leader can just figure it out in
whatever way it likes at one central point within nbtsort.c, before
the leader calls its main spool's tuplesort_begin_index_btree() --
that can happen fairly late in the process. Actually doing that (and
not just using nworkers_launched) seems questionable to me, because it
would be almost the first thing that the leader would do after
starting parallel mode -- why not just have the parallel
infrastructure do it for us, and for everyone else?

If the new tuplesort infrastructure is used in the executor at some
future date, then the leader will still need to figure out the number
of workers that reached tuplesort_begin* some other way. This
shouldn't be surprising to anyone -- tuplesort.h is very clear on this
point.

* I revised the tuplesort.h contract to account for the developments
already described (mostly that I've removed tuplesort_leader_wait()).

* The patch makes the IPC wait event CREATE INDEX specific, since
tuplesort no longer does any waits of its own -- it's now called
ParallelCreateIndexScan. Patch also removes the second wait event
entirely (the one that we called ParallelSortTapeHandover).

* We now support index builds on catalogs.

I rebased on top of Robert's recent "REINDEX state in parallel
workers" commit, 29d58fd3. Note that there was a bug here in error
paths that caused Robert's "can't happen" error to be raised (the
PG_CATCH() block call to ResetReindexProcessing()). I fixed this in
passing, by simply removing that one "can't happen" error. Note that
ResetReindexProcessing() is only called directly within
reindex_index()/IndexCheckExclusion(). This made the idea of
preserving the check in a diminished form (#includ'ing parallel.h
within index.c, in order to check if we're a parallel worker as a
condition of raising that "can't happen" error) seem unnecessary.

* The patch does not alter anything about
parallel_leader_participation, except the alterations that Robert
requested to the docs (he requested these alterations on the
assumption that we won't end up doing nothing special with
parallel_leader_participation).

I am waiting for a final decision on what is to be done about
parallel_leader_participation, but for now I've changed nothing.

* I removed BufFileView(). I also renamed BufFileViewAppend() to
BufFileAppend().

* I performed some other minor tweaks, including some requested by
Robert in his most recent round of review.

Thanks
-- 
Peter Geoghegan

Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: [HACKERS] postgres_fdw bug in 9.6
Next
From: Amit Kapila
Date:
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)