Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Msg-id: CAH2-Wzk0XqQdAVE7duFx5h_vExKEELgKVf2KwEmUAAYj=zKShA@mail.gmail.com
In response to: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Thu, Jan 18, 2018 at 9:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I just went back to the thread on "parallel.c oblivion of
> worker-startup failures" and refreshed my memory about what's going on
> over there. What's going on over there is (1) currently,
> nworkers_launched can be over-reported in a scenario that doesn't
> error out or crash and (2) I'm proposing to tighten things up so that
> this is no longer the case.

I think that we need to be able to rely on nworkers_launched to not over-report the number of workers launched. To be fair to Amit, I haven't actually gone off and studied the problem myself, so it's not fair to dismiss his point of view. It nevertheless seems to me that being able to rely on nworkers_launched makes life an awful lot easier.

> So, thinking about this, I think that my proposal to use dynamic
> barriers here seems like it will work regardless of who wins that
> argument. Your proposal to use static barriers and decrement the
> party size based on the number of participants which fail to start
> will work if I win that argument, but will not work if Amit wins that
> argument.

Sorry, but I've changed my mind. I don't think barriers owned by tuplesort.c will work for us (though I think we will still need a synchronization primitive within nbtsort.c). The design that Robert sketched for using barriers seemed fine at first. But then I realized: what about the case where you have 2 spools?

I now understand why Thomas thought that I'd end up using static barriers, because I now see that dynamic barriers have problems of their own if used by tuplesort.c, even with the trick of only having participants actually participate on the condition that they show up before the party is over (before there are no tuples left for the worker to consume). The idea of the leader using nworkers_launched as the assumed-launched number of workers is pretty much baked into my patch, because my patch makes tuplesorts composable (e.g. nbtsort.c uses two tuplesorts when there is a unique index build/2 spools).

Do individual workers need to be prepared to back out of the main spool's sort but not the spool2 sort (for unique index builds), or vice-versa? Clearly that's untenable, because they're going to need both for as long as they're participating in a parallel CREATE INDEX of a unique index -- IndexBuildHeapScan() expects both at the same time, yet there is a race condition when launching workers with 2 spools.

So does nbtsort.c need to own the barrier instead? If it does, and if that barrier subsumes the responsibilities of tuplesort.c's condition variables, then I don't see how that can avoid causing a mess due to confusion about phases across tuplesorts/spools. nbtsort.c *will* need some synchronization primitive, actually (I'm thinking of a condition variable), but only because nbtsort.c happens to want to aggregate statistics about the sort at the end (for pg_index) -- that doesn't seem like tuplesort's problem at all. In general, it's very natural to just call tuplesort_leader_wait(), and have all the relevant details encapsulated within tuplesort.c. We could make tuplesort_leader_wait() totally optional, and just use the barrier within nbtsort.c for the wait (more on that later).

> In particular, for parallel query, there is absolutely zero guarantee
> that every worker reaches every plan node. For a parallel utility
> command, things seem a little better: we can assume that workers are
> started only for one particular purpose. But even that might not be
> true in the future.

I expect workers that are reported launched to show up eventually, or report failure. They don't strictly have to do any work beyond just showing up (finding no tuples, reaching tuplesort_performsort(), then finally reaching tuplesort_end()). The spool2 issue I describe above shows why this is.
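To make the two-spool problem concrete, here is a toy model of why a late-joining worker has to attach to both spools or to neither. This is only an illustration, not PostgreSQL's Barrier API (which lives in barrier.c): the Spool struct, its party_over flag, and every function name here are invented for the sketch, with plain pthread mutexes standing in for the real machinery.

```c
#include <pthread.h>
#include <stdbool.h>

/* Invented model of a dynamic-barrier-style "party" per spool:
 * once party_over is set, the tuples are gone and late arrivals
 * may no longer join. */
typedef struct Spool
{
    pthread_mutex_t lock;
    bool            party_over; /* no tuples left for new workers */
    int             nattached;
} Spool;

static bool
spool_attach(Spool *s)
{
    bool ok;

    pthread_mutex_lock(&s->lock);
    ok = !s->party_over;
    if (ok)
        s->nattached++;
    pthread_mutex_unlock(&s->lock);
    return ok;
}

static void
spool_detach(Spool *s)
{
    pthread_mutex_lock(&s->lock);
    s->nattached--;
    pthread_mutex_unlock(&s->lock);
}

/* A worker in a parallel unique-index build needs the main spool and
 * spool2 together (IndexBuildHeapScan() expects both at once), so a
 * join must be all-or-nothing: if spool2's party ended between the
 * two attach calls, the worker has to back out of spool1 entirely. */
static bool
worker_join_both(Spool *spool1, Spool *spool2)
{
    if (!spool_attach(spool1))
        return false;
    if (!spool_attach(spool2))
    {
        spool_detach(spool1);   /* back out: all-or-nothing */
        return false;
    }
    return true;
}
```

The point of the sketch is the failure path: even with the back-out, a launched worker can end up attached to nothing, so the leader cannot infer the party size from who attached -- which is exactly why the patch leans on nworkers_launched instead.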
Workers own the state (tuplesort tuples) that they consume, and may possibly have 2 or more tuplesorts. If they cannot do the bare minimum of checking in with us, then we're in big trouble, because that's indistinguishable from their having actually sorted some tuples, without our knowing it, that the leader ultimately gets to consume.

It wouldn't be impossible to use barriers for everything; that just seems incompatible with tuplesorts being composable. Long ago, nbtsort.c actually did the sorting, too. If that were still true, then it would look rather a lot more like parallel hash join, I think: you could then just have one barrier for one state machine (with one or two spools). It seems clear that we should avoid teaching tuplesort.c about nbtsort.c.

> But, hang on a minute. Why do the workers need to wait for the leader
> anyway? Can't they just exit once they're done sorting? I think the
> original reason why this ended up in the patch is that we needed the
> leader to assume ownership of the tapes to avoid having the tapes get
> blown away when the worker exits. But, IIUC, with sharedfileset.c,
> that problem no longer exists.

You're right. This is why we could make calling tuplesort_leader_wait() optional. We only need one condition variable in tuplesort.c, which makes me even less inclined to turn the remaining workersFinishedCv condition variable into a barrier, since it's not at all barrier-like. After all, workers don't care about each other's progress, or where the leader is. The leader needs to wait until all known-launched participants report having finished, which seems like a typical reason to use a condition variable; that doesn't seem phase-like at all. As for workers, they don't have phases ("done" isn't a phase for them, because, as you say, there is no need for them to wait until the leader says they can go -- with the shared fileset stuff, that's the leader's problem alone).
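The shape of that wait can be sketched in a few lines. This is a toy model only: it uses POSIX pthreads in place of PostgreSQL's ConditionVariable API, and the SortCoordination struct, worker_done(), leader_wait(), and run_demo() names are all invented for illustration. The one real name it borrows is workersFinishedCv, the condition variable under discussion.

```c
#include <pthread.h>

/* Invented model of the workersFinishedCv pattern: one counter, one
 * condition variable, and nworkers_launched as the single source of
 * truth about the party size. */
typedef struct SortCoordination
{
    pthread_mutex_t lock;
    pthread_cond_t  workersFinishedCv;
    int             nworkers_launched;  /* assumed-launched workers */
    int             nfinished;          /* workers that checked in */
} SortCoordination;

/* Worker side: after sorting (or finding no tuples at all), just
 * check in and exit -- workers never wait on the leader or on each
 * other, so there is nothing phase-like here. */
static void
worker_done(SortCoordination *c)
{
    pthread_mutex_lock(&c->lock);
    c->nfinished++;
    pthread_cond_signal(&c->workersFinishedCv);
    pthread_mutex_unlock(&c->lock);
}

/* Leader side: sleep until every known-launched worker has reported
 * finishing -- the tuplesort_leader_wait()-style wait. */
static void
leader_wait(SortCoordination *c)
{
    pthread_mutex_lock(&c->lock);
    while (c->nfinished < c->nworkers_launched)
        pthread_cond_wait(&c->workersFinishedCv, &c->lock);
    pthread_mutex_unlock(&c->lock);
}

static void *
worker_main(void *arg)
{
    worker_done((SortCoordination *) arg);
    return NULL;
}

/* Demo: launch nworkers (at most 16 here), wait for all of them,
 * and return how many checked in. */
static int
run_demo(int nworkers)
{
    SortCoordination c = {PTHREAD_MUTEX_INITIALIZER,
                          PTHREAD_COND_INITIALIZER,
                          nworkers, 0};
    pthread_t   tids[16];
    int         i;

    for (i = 0; i < nworkers; i++)
        pthread_create(&tids[i], NULL, worker_main, &c);
    leader_wait(&c);            /* returns once all workers report */
    for (i = 0; i < nworkers; i++)
        pthread_join(tids[i], NULL);
    return c.nfinished;
}
```

Note what breaks if nworkers_launched over-reports: leader_wait()'s predicate can never become true, and the leader sleeps forever -- the model's version of why the counter has to be reliable.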
I guess the fact that tuplesort_leader_wait() could be optional means that it could be removed, which means that we could in fact throw out the last condition variable within tuplesort.c and rely entirely on a barrier within nbtsort.c. However, tuplesort_leader_wait() seems like something we should have on general principle. And, more importantly, it would be tricky to use a barrier even for this, because we still have that baked-in assumption that nworkers_launched is the single source of truth about the number of participants.

--
Peter Geoghegan