Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date:
Msg-id: CAH2-WzmrSsQV-L4n+Qd3dOddAvxbOFEe99Ciy8xaqpc+T=aB8Q@mail.gmail.com
In response to: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
List: pgsql-hackers
On Thu, Jan 18, 2018 at 5:53 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I guess the fact that tuplesort_leader_wait() could be optional means
> that it could be removed, which means that we could in fact throw out
> the last condition variable within tuplesort.c, and fully rely on
> using a barrier for everything within nbtsort.c. However,
> tuplesort_leader_wait() seems kind of like something that we should
> have on general principle. And, more importantly, it would be tricky
> to use a barrier even for this, because we still have that baked-in
> assumption that nworkers_launched is the single source of truth about
> the number of participants.

On third thought, tuplesort_leader_wait() should be removed entirely, and tuplesort.c should get entirely out of the IPC business (it should do the bare minimum of recording/reading a little state in shared memory, while knowing nothing about condition variables, barriers, or anything declared in parallel.h). Thinking about dealing with 2 spools at once clinched it for me -- calling tuplesort_leader_wait() for both underlying Tuplesortstates was silly, especially because there is still a need for an nbtsort.c-specific wait for workers to fill in ambuild stats.

When I said "tuplesort_leader_wait() seems kind of like something that we should have on general principle", I was wrong. It's normal for parallel workers to have all kinds of overlapping responsibilities, and tuplesort_leader_wait() was doing something that I now imagine isn't desirable to most callers. They can easily provide something equivalent at a higher level. Besides, they'll very likely be forced to anyway, due to some high-level, caller-specific need -- which is exactly what we see within nbtsort.c.

Attached patch details:

* The patch synchronizes processes using the approach just described. Note that this allowed me to remove several #include statements within tuplesort.c.

* The patch uses only a single condition variable, for a single wait within nbtsort.c, for the leader. No barriers are used at all (and, as I said, tuplesort.c doesn't use condition variables anymore). Since things are now very simple, I can't imagine anyone still arguing for the use of barriers.

Note that I'm still relying on nworkers_launched as the single source of truth on the number of participants that can be expected to eventually show up (even if they end up doing zero real work). This should be fine, because I think that it will end up being formally guaranteed to be reliable by the work currently underway from Robert and Amit. But even if I'm wrong about things going that way, and it turns out that the leader needs to decide how many putatively launched workers don't "get lost" due to fork() failure (the approach which Amit apparently advocates), then there *still* isn't much that needs to change.

Ultimately, the leader needs to have the exact number of workers that participated, because that's fundamental to the tuplesort approach to parallel sort. If necessary, the leader can just figure it out in whatever way it likes at one central point within nbtsort.c, before the leader calls its main spool's tuplesort_begin_index_btree() -- that can happen fairly late in the process. Actually doing that (and not just using nworkers_launched) seems questionable to me, because it would be almost the first thing that the leader would do after starting parallel mode -- why not just have the parallel infrastructure do it for us, and for everyone else?
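To make the shape of that single, higher-level wait concrete, here is a minimal sketch of how a leader-only wait on one condition variable might look at the nbtsort.c level, assuming nworkers_launched is trusted as the participant count. The struct and function names (ParallelSortShared, worker_report_done, leader_wait_for_workers) are hypothetical, and the WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN identifier is assumed from the wait-event rename described below; only the condition-variable and spinlock primitives are existing PostgreSQL infrastructure.

```c
#include "postgres.h"

#include "pgstat.h"
#include "storage/condition_variable.h"
#include "storage/spin.h"

/*
 * Hypothetical bookkeeping placed in the DSM segment at setup time.
 * ConditionVariableInit() and SpinLockInit() are assumed to have been
 * called when the segment was initialized.
 */
typedef struct ParallelSortShared
{
	slock_t		mutex;			/* protects the fields below */
	int			nworkersdone;	/* workers that finished their sort */
	double		reltuples;		/* accumulated for ambuild stats */
	ConditionVariable workersdonecv;	/* leader sleeps on this */
} ParallelSortShared;

/*
 * Worker side: bump the done-count, contribute stats, wake the leader.
 */
static void
worker_report_done(ParallelSortShared *shared, double worker_reltuples)
{
	SpinLockAcquire(&shared->mutex);
	shared->nworkersdone++;
	shared->reltuples += worker_reltuples;
	SpinLockRelease(&shared->mutex);

	ConditionVariableSignal(&shared->workersdonecv);
}

/*
 * Leader side: the single wait.  nworkers_launched is taken at face value
 * as the number of participants that will eventually report in.
 */
static void
leader_wait_for_workers(ParallelSortShared *shared, int nworkers_launched)
{
	for (;;)
	{
		bool		done;

		SpinLockAcquire(&shared->mutex);
		done = (shared->nworkersdone == nworkers_launched);
		SpinLockRelease(&shared->mutex);

		if (done)
			break;

		/* wait event name assumed, per the CREATE INDEX rename below */
		ConditionVariableSleep(&shared->workersdonecv,
							   WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
	}
	ConditionVariableCancelSleep();
}
```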
If the new tuplesort infrastructure is used in the executor at some future date, then the leader will still need to figure out the number of workers that reached tuplesort_begin* some other way. This shouldn't be surprising to anyone -- tuplesort.h is very clear on this point.

* I revised the tuplesort.h contract to account for the developments already described (mostly that I've removed tuplesort_leader_wait()).

* The patch makes the IPC wait event CREATE INDEX specific, since tuplesort no longer does any waits of its own -- it's now called ParallelCreateIndexScan. The patch also removes the second wait event entirely (the one that we called ParallelSortTapeHandover).

* We now support index builds on catalogs. I rebased on top of Robert's recent "REINDEX state in parallel workers" commit, 29d58fd3. Note that there was a bug here in error paths that caused Robert's "can't happen" error to be raised (the PG_CATCH() block call to ResetReindexProcessing()). I fixed this in passing, by simply removing that one "can't happen" error. Note that ResetReindexProcessing() is only called directly within reindex_index()/IndexCheckExclusion(). This made the idea of preserving the check in a diminished form (#include'ing parallel.h within index.c, in order to check if we're a parallel worker as a condition of raising that "can't happen" error) seem unnecessary (a simplified sketch of the error-path shape in question appears below).

* The patch does not alter anything about parallel_leader_participation, except for the alterations that Robert requested to the docs (he requested those on the assumption that we won't end up doing anything special with parallel_leader_participation). I am waiting for a final decision on what is to be done about parallel_leader_participation, but for now I've changed nothing.

* I removed BufFileView(). I also renamed BufFileViewAppend() to BufFileAppend().

* I performed some other minor tweaks, including some requested by Robert in his most recent round of review.

Thanks
--
Peter Geoghegan
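For readers unfamiliar with the error path referenced in the catalog-index-builds item, here is a simplified, hypothetical sketch of the pattern in question. It is not the actual index.c code (the real SetReindexProcessing()/ResetReindexProcessing() are static functions there with different signatures); it only illustrates why a sanity check inside the cleanup routine can itself fail while an error is already being unwound.

```c
#include "postgres.h"

/* Simplified stand-in for index.c's reindex-state bookkeeping. */
static Oid	currentlyReindexedIndex = InvalidOid;

static void
SetReindexProcessing(Oid indexOid)
{
	currentlyReindexedIndex = indexOid;
}

static void
ResetReindexProcessing(void)
{
	/*
	 * A "can't happen" check here (for example, erroring out when called
	 * from a parallel worker) is the kind of check the message describes
	 * removing: if it fires while we are already handling an error, the
	 * cleanup path itself breaks.
	 */
	currentlyReindexedIndex = InvalidOid;
}

/*
 * Hypothetical caller showing the PG_TRY/PG_CATCH shape: the flag must be
 * cleared on both the success path and the error path, so
 * ResetReindexProcessing() has to be safe to call from PG_CATCH().
 */
static void
build_index_guarded(Oid indexOid)
{
	SetReindexProcessing(indexOid);

	PG_TRY();
	{
		/* ... index_build() and related work would run here ... */
	}
	PG_CATCH();
	{
		/* make sure the flag gets cleared on error exit */
		ResetReindexProcessing();
		PG_RE_THROW();
	}
	PG_END_TRY();

	ResetReindexProcessing();
}
```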