Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date
Msg-id CA+TgmoY-kQTA4fjHYhXG6bW0f2ruGe2E=DkoYmb_bVqWR+SGJA@mail.gmail.com
In response to Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
On Tue, Mar 21, 2017 at 2:03 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I think that since the comment refers to code from before 1999, it can
> go. Any separate patch to remove it would have an entirely negative
> linediff.

It's a good general principle that a patch should do one thing well
and not make unrelated changes.  I try hard to adhere to that
principle in my commits, and I think other committers generally do
(and should), too.  Of course, different people draw the line in
different places.  If you can convince another committer to include
that change in their commit of this patch, well, that's not my cup of
tea, but so be it.  If you want me to consider committing this, you're
going to have to submit that part separately, preferably on a separate
thread with a suitably descriptive subject line.

> Obviously iff you write to the file in the leader, there is little
> that the worker can do afterwards, but it's not a given that you'd
> want to do that, and this patch actually never does. You could equally
> well say that PHJ fails to provide for my requirement for having the
> leader write to the files sensibly in order to recycle blocks, a
> requirement that its shared BufFile mechanism expressly does not
> support.

From my point of view, the main point is that having two completely
separate mechanisms for managing temporary files that need to be
shared across cooperating workers is not a good decision.  That's a
need that's going to come up over and over again, and it's not
reasonable for everybody who needs it to add a separate mechanism for
doing it.  We need to have ONE mechanism for it.

The second point is that I'm pretty convinced that the design you've
chosen is fundamentally wrong.  I've attempted to explain that multiple
times, starting about three months ago with
http://postgr.es/m/CA+TgmoYP0vzPw64DfMQT1JHY6SzyAvjogLkj3erMZzzN2f9xLA@mail.gmail.com
and continuing across many subsequent emails on multiple threads.
It's just not OK in my book for a worker to create something that it
initially owns and then later transfer it to the leader.  The
cooperating backends should have joint ownership of the objects from
the beginning, and the last process to exit the set should clean up
those resources.
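
To make the ownership model I mean concrete, here is a minimal
standalone sketch.  It is plain C with pthreads and made-up names --
nothing to do with our actual shared memory or resource owner
machinery -- but it shows the shape of the thing: every participant
attaches to the fileset before touching it, and whoever detaches last
removes the underlying file, so a worker can exit the moment it is
done writing and nobody ever has to hand anything off.

    /*
     * Toy illustration of joint ownership: every participant attaches to
     * the fileset before using it, and the last one to detach unlinks the
     * file.  Hypothetical names, not PostgreSQL code.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <pthread.h>

    typedef struct shared_fileset_t
    {
        pthread_mutex_t lock;
        int             refcount;   /* number of attached participants */
        char            path[64];   /* temp file shared by all of them */
    } shared_fileset_t;

    static void
    fileset_attach(shared_fileset_t *fs)
    {
        pthread_mutex_lock(&fs->lock);
        fs->refcount++;
        pthread_mutex_unlock(&fs->lock);
    }

    static void
    fileset_detach(shared_fileset_t *fs)
    {
        pthread_mutex_lock(&fs->lock);
        if (--fs->refcount == 0)
        {
            /* last participant out cleans up; ownership is never transferred */
            unlink(fs->path);
            printf("last participant removed %s\n", fs->path);
        }
        pthread_mutex_unlock(&fs->lock);
    }

    static void *
    worker_main(void *arg)
    {
        shared_fileset_t *fs = arg;

        /* ... worker writes its sorted run into the shared file here ... */
        fileset_detach(fs);         /* then exits immediately, no waiting */
        return NULL;
    }

    int
    main(void)
    {
        shared_fileset_t fs;
        pthread_t        workers[2];
        FILE            *f;

        pthread_mutex_init(&fs.lock, NULL);
        fs.refcount = 0;
        snprintf(fs.path, sizeof(fs.path), "/tmp/shared_sort_run.tmp");
        f = fopen(fs.path, "w");
        if (f)
            fclose(f);

        /* leader and workers attach up front: joint ownership from the start */
        fileset_attach(&fs);                        /* leader */
        for (int i = 0; i < 2; i++)
        {
            fileset_attach(&fs);                    /* worker i */
            pthread_create(&workers[i], NULL, worker_main, &fs);
        }

        for (int i = 0; i < 2; i++)
            pthread_join(workers[i], NULL);

        /* leader merges here, then detaches; being last, it removes the file */
        fileset_detach(&fs);
        return 0;
    }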

>> That would cut hundreds of
>> lines from this patch with no real disadvantage that I can see --
>> including things like worker_wait(), which are only needed because of
>> the shortcomings of the underlying mechanism.
>
> I think it would definitely be a significant net gain in LOC. And,
> worker_wait() will probably be replaced by the use of the barrier
> abstraction anyway.

No, because if you do it Thomas's way, the worker can exit right away,
without waiting.  You don't have to wait via a different method; you
escape waiting altogether.  I understand that your point is that the
wait will always be brief, but I think that's probably an optimistic
assumption and definitely an unnecessary assumption.  It's optimistic
because there is absolutely no guarantee that all workers will take
the same amount of time to sort the data they read.  It is absolutely
not the case that all data sets sort at the same speed.  Because of
the way parallel sequential scan works, we're somewhat insulated from
that; workers that sort faster will get a larger chunk of the table.
However, that only means that workers will finish generating their
sorted runs at about the same time, not that they will finish merging
at the same time.  And, indeed, if some workers end up with more data
than others (so that they finish building runs at about the same time)
then some will probably take longer to complete the merging than
others.

But even if it were true that the waits will always be brief, I still
think the way you've done it is a bad idea, because now tuplesort.c
has to know that it needs to wait because of some detail of
lower-level resource management about which it should not have to
care.  That alone is a sufficient reason to want a better approach.

I completely accept that whatever abstraction we use at the BufFile
level has to be something that can be plumbed into logtape.c, and if
Thomas's mechanism can't be bolted in there in a sensible way then
that's a problem.  But I feel quite strongly that the solution to that
problem isn't to adopt the approach you've taken here.

>> + * run.  Parallel workers always use quicksort, however.
>>
>> Comment fails to mention a reason.
>
> Well, I don't think that there is any reason to use replacement
> selection at all, what with the additional merge heap work last year.
> But, the theory there remains that RS is good when you can get one big
> run and no merge. You're not going to get that with parallel sort in
> any case, since the leader must merge. Besides, merging in the workers
> happens in the workers. And, the backspace requirement of 32MB of
> workMem per participant pretty much eliminates any use of RS that
> you'd get otherwise.

So, please mention that briefly in the comment.
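For example, something along these lines -- just a sketch of the kind
of wording I mean, based on what you said above; adjust as you see fit:

 * run.  Parallel workers always use quicksort, however: the leader
 * must perform a final merge of the workers' output in any case, so
 * replacement selection's only advantage (producing a single run and
 * avoiding a merge altogether) can never apply, and the limited
 * workMem available to each participant makes it a poor fit anyway.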

> I believe that the main reason that you like the design I came up with
> on the whole is that it's minimally divergent from the serial case.

That's part of it, I guess, but it's more that the code you've added
to do parallelism here looks an awful lot like what's gotten added to
do parallelism in other cases, like parallel query.  That's probably a
good sign.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


