Re: ExecGather() + nworkers - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: ExecGather() + nworkers
Msg-id CAM3SWZTaLTixbC_YnaJHnEU12hQwrjGhZqjpj5YgBcktvxonLA@mail.gmail.com
In response to Re: ExecGather() + nworkers  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Sun, Jan 10, 2016 at 5:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Well, in general, the parallel sort code doesn't really get to pick
> whether or not a BackgroundWorkerSlot gets used or not.  Whoever
> created the parallel context decides how many workers to request, and
> then the context got as many of those as it could.  It then did
> arbitrary computation, which at some point in the middle involves one
> or more parallel sorts.  You can't just have one of those workers up
> and exit in the middle.  Now, in the specific case of parallel index
> build, you probably can do that, if you want to.  But to be honest,
> I'd be inclined not to include that in the first version.  If you get
> fewer workers than you asked for, just use the number you got.  Let's
> see how that actually works out before we decide that we need a lot
> more mechanism here.  You may find that it's surprisingly effective to
> do it this way.

I am inclined to just accept for the time being (until I post the
patch) that one worker and one leader may sometimes be all that we
get. I will put off dealing with the problem until I show code, in
other words.

>> Now, you might wonder why it is that the leader cannot also sort runs,
>> just as a worker would. It's possible, but it isn't exactly
>> straightforward.

> I am surprised that this is not straightforward.  I don't see why it
> shouldn't be, and it worries me that you think it isn't.

It isn't entirely straightforward because it requires special-case
handling. For example, I must teach the leader not to try to wait on
itself to finish sorting runs, which it might otherwise attempt ahead
of its final on-the-fly merge.

>> More importantly, I have other, entirely general concerns. Other major
>> RDBMSs have settings that are very similar to max_parallel_degree,
>> with a setting of 1 effectively disabling all parallelism. Both Oracle
>> and SQL Server have a setting that they call the "maximum degree
>> of parallelism". I think it's a bit odd that with Postgres,
>> max_parallel_degree = 1 can still use parallelism at all. I have to
>> wonder: are we conflating controlling the resources used by parallel
>> operations with how shared memory is doled out?
>
> We could redefined things so that max_parallel_degree = N means use N
> - 1 workers, with a minimum value of 1 rather than 0, if there's a
> consensus that that's better.  Personally, I prefer it the way we've
> got it: it's real darned clear in my mind that max_parallel_degree=0
> means "not parallel".  But I won't cry into my beer if a consensus
> emerges that the other way would be better.

The fact that we don't do that isn't quite the issue, though. It may
or may not make sense to count the leader as an additional worker when
the leader has very little work of its own to do. In good cases for
parallel sequential scan, the leader has very little leader-specific
work to do, because most of the time is spent in the workers (with the
leader acting as one of them) filtering out tuples that don't need to
be returned to the leader. When that is less true, maybe the leader
could reasonably count as a fixed cost for a parallel operation. Hard
to say.

I'm sorry that that's not really actionable, but I'm still working
this stuff out.

>> I could actually "give back" my parallel worker slots early if I
>> really wanted to (that would be messy, but the then-quiescent workers
>> do nothing for the duration of the merge beyond conceptually owning
>> the shared tape temp files). I don't think releasing the slots early
>> makes sense, because I tend to think that hanging on to the workers
>> helps the DBA in managing the server's resources. The still-serial
>> merge phase is likely to become a big bottleneck with parallel sort.
>
> Like I say, the sort code better not know anything about this
> directly, or it's going to break when embedded in a query.

tuplesort.c knows very little. nbtsort.c manages workers, and their
worker tuplesort states, as well as the leader and its tuplesort
state. So tuplesort.c knows a little bit about how the leader
tuplesort state may need to reconstruct worker state in order to do
its final on-the-fly merge. It knows nothing else, though, and
provides generic hooks for assigning worker numbers to worker
processes, or logical run numbers (this keeps trace_sort output
straight, plus a few other things).

Parallel workers are all managed in nbtsort.c, which seems
appropriate. Note that I have introduced a way in which a single
tuplesort state doesn't perfectly encapsulate a single sort operation,
though.

> This seems dead wrong.  A max_parallel_degree of 8 means you have a
> leader and 8 workers.  Where are the other 7 processes coming from?
> What you should have is 8 processes each of which is participating in
> both the parallel seq scan and the parallel sort, not 8 processes
> scanning and 8 separate processes sorting.

I simply conflated max_parallel_degree and max_worker_processes for a
moment. The point is that max_worker_processes rests on one particular
definition of a worker process, which could in theory be quite narrow.

A related issue is that I have no way to force Postgres to use however
many worker processes I feel like. Currently, I don't have a cost
model for parallel index builds; I just use max_worker_processes
directly. That will probably change soon, because I think something
better than simply using max_worker_processes is *essential* for
parallel index builds. There is already such a model for parallel seq
scan in the optimizer, of course.

The inability to force a certain number of workers is less of a
concern with parallel sequential scan, but it would still be nice to
be able to test it that way. For parallel index builds, it seems
reasonable to suppose that advanced DBAs will sometimes want to bypass
the cost model, for example because there are no statistics available
following a bulk load. The sensible range of nworkers could be
drastically different from max_parallel_degree.

Other systems directly support a way of doing this. Have you thought
about adding, say, a "parallel_degree" GUC, defaulting to 0 or -1
("use cost model"), but otherwise used directly as an nworkers input
for CreateParallelContext()?

>>> I think that's probably over-engineered.  I mean, it wouldn't be that
>>> hard to have the workers just exit if you decide you don't want them,
>>> and I don't really want to make the signaling here more complicated
>>> than it really needs to be.
>>
>> I worry about the additional overhead of constantly starting and
>> stopping a single worker in some cases (not so much with parallel
>> index build, but other uses of parallel sort beyond 9.6). Furthermore,
>> the coordination between worker and leader processes to make this
>> happen seems messy -- you actually have the postmaster launch
>> processes, but they must immediately get permission to do anything.
>>
>> It wouldn't be that hard to offer a general way of doing this, so why not?
>
> Well, if these things become actual problems, fine, we can fix them.
> But let's not decide to add the API before we're agreed that we need
> it to solve an actual problem that we both agree we have.  We are not
> there yet.

That's fair. Before too long, I'll post a patch that doesn't attempt
to deal with there only being one worker process, where that isn't
expected to be any faster than a serial sort. We can iterate on that.

-- 
Peter Geoghegan


