Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date
Msg-id CAH2-Wzn+f7jOmKwkVScTryu_GVN3GYv57cy1YnoOJ4740WmJcA@mail.gmail.com
In response to Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Tue, Mar 21, 2017 at 12:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> From my point of view, the main point is that having two completely
> separate mechanisms for managing temporary files that need to be
> shared across cooperating workers is not a good decision.  That's a
> need that's going to come up over and over again, and it's not
> reasonable for everybody who needs it to add a separate mechanism for
> doing it.  We need to have ONE mechanism for it.

Obviously I understand that there is value in code reuse in general.
The exact extent to which code reuse is possible here has been unclear
throughout, because it's complicated for all kinds of reasons. That's
why Thomas and I had 2 multi-hour Skype calls all about it.

> It's just not OK in my book for a worker to create something that it
> initially owns and then later transfer it to the leader.

Isn't that an essential part of having a refcount, in general? You
were the one that suggested refcounting.

> The cooperating backends should have joint ownership of the objects from
> the beginning, and the last process to exit the set should clean up
> those resources.

That seems like a facile summary of the situation. There is a sense in
which there is always joint ownership of files with my design. But
there is also a sense in which there isn't, because it's impossible to
do that while not completely reinventing resource management of temp
files. I wanted to preserve resowner.c ownership of fd.c segments.
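
For illustration only, the "joint ownership, last one out cleans up" model being debated can be sketched roughly as below. This is not code from the patch: SharedTempFile and the shared_temp_* functions are hypothetical names, and the sketch assumes the counter lives somewhere every participant can reach (in PostgreSQL that would presumably be shared memory). It also says nothing about how resowner.c would track the file, which is the hard part discussed here.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical illustration only; none of these names exist in PostgreSQL. */
typedef struct SharedTempFile
{
    atomic_int  refcount;   /* number of participants still attached */
    char        path[64];   /* temp file shared by all participants */
} SharedTempFile;

/* Create the file; the creator holds the first reference. */
static bool
shared_temp_create(SharedTempFile *f, const char *path)
{
    FILE   *fp;

    snprintf(f->path, sizeof(f->path), "%s", path);
    fp = fopen(f->path, "w");
    if (fp == NULL)
        return false;
    fclose(fp);
    atomic_init(&f->refcount, 1);
    return true;
}

/* Another participant takes a reference. */
static void
shared_temp_attach(SharedTempFile *f)
{
    atomic_fetch_add(&f->refcount, 1);
}

/*
 * Drop a reference.  Whoever drops the last one unlinks the file.
 * Returns true if the caller was that last participant.
 */
static bool
shared_temp_detach(SharedTempFile *f)
{
    if (atomic_fetch_sub(&f->refcount, 1) == 1)
    {
        unlink(f->path);
        return true;
    }
    return false;
}
```

In this model no participant "owns" the file in the resowner.c sense; the file simply lives until the count reaches zero, regardless of which process created it.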

You maintain that it's better to have the leader unlink() everything
at the end, and suppress the errors when that doesn't work, so that
that path always just plows through. I disagree with that. It is a
trade-off, I suppose. I have now run out of time to work through it
with you or Thomas, though.
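
As a rough sketch of the alternative described above (the leader unlinks everything and keeps going when a file is already gone), something like the following would do. leader_unlink_all is a hypothetical helper written for this illustration, not a PostgreSQL function, and treating ENOENT as the only silently ignored error is an assumption of the sketch.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper, not a PostgreSQL function: try to remove every
 * path, never stopping on failure.  Returns how many files were removed.
 */
static int
leader_unlink_all(const char *const *paths, int npaths)
{
    int     removed = 0;

    for (int i = 0; i < npaths; i++)
    {
        if (unlink(paths[i]) == 0)
            removed++;
        else if (errno != ENOENT)
            fprintf(stderr, "could not remove \"%s\": errno %d\n",
                    paths[i], errno);
        /* Either way, plow on to the next file. */
    }
    return removed;
}
```

The trade-off is visible in the loop: cleanup never fails, but a missing file that indicates a real bug is indistinguishable from one that a worker never got around to creating.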

> But even if it were true that the waits will always be brief, I still
> think the way you've done it is a bad idea, because now tuplesort.c
> has to know that it needs to wait because of some detail of
> lower-level resource management about which it should not have to
> care.  That alone is a sufficient reason to want a better approach.

There is already a point at which the leader needs to wait, so that it
can accumulate stats that nbtsort.c cares about. So we already need a
leader wait point within nbtsort.c (that one is called directly by
nbtsort.c). Doesn't seem like too bad of a wart to have the same thing
for workers.

>> I believe that the main reason that you like the design I came up with
>> on the whole is that it's minimally divergent from the serial case.
>
> That's part of it, I guess, but it's more that the code you've added
> to do parallelism here looks an awful lot like what's gotten added to
> do parallelism in other cases, like parallel query.  That's probably a
> good sign.

It's also a good sign that it makes CREATE INDEX approximately 3 times faster.

-- 
Peter Geoghegan


