Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

From	Peter Geoghegan
Subject	Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date	March 22, 2017 01:50:53
Msg-id	CAH2-Wzn+f7jOmKwkVScTryu_GVN3GYv57cy1YnoOJ4740WmJcA@mail.gmail.com
In response to	Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) (Robert Haas <robertmhaas@gmail.com>)
Responses	Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
List	pgsql-hackers

On Tue, Mar 21, 2017 at 12:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> From my point of view, the main point is that having two completely
> separate mechanisms for managing temporary files that need to be
> shared across cooperating workers is not a good decision.  That's a
> need that's going to come up over and over again, and it's not
> reasonable for everybody who needs it to add a separate mechanism for
> doing it.  We need to have ONE mechanism for it.

Obviously I understand that there is value in code reuse in general.
The exact extent to which code reuse is possible here has been unclear
throughout, because it's complicated for all kinds of reasons. That's
why Thomas and I had 2 multi-hour Skype calls all about it.

> It's just not OK in my book for a worker to create something that it
> initially owns and then later transfer it to the leader.

Isn't that an essential part of having a refcount, in general? You
were the one that suggested refcounting.

> The cooperating backends should have joint ownership of the objects from
> the beginning, and the last process to exit the set should clean up
> those resources.

That seems like a facile summary of the situation. There is a sense in
which there is always joint ownership of files with my design. But
there is also a sense in which there isn't, because it's impossible to
do that while not completely reinventing resource management of temp
files. I wanted to preserve resowner.c ownership of fd.c segments.

You maintain that it's better to have the leader unlink() everything
at the end, and suppress the errors when that doesn't work, so that
that path always just plows through. I disagree with that. It is a
trade-off, I suppose. I have now run out of time to work through it
with you or Thomas, though.
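
For readers following the thread, the "joint ownership, last process out cleans up" scheme under discussion can be sketched roughly as below. This is only an illustration under stated assumptions, not code from the patch or from PostgreSQL: `SharedTempSet`, `shared_set_init`, `shared_set_add_file`, `shared_set_attach`, and `shared_set_detach` are hypothetical names, the counter would have to live in shared memory in a real multi-process setup, and the sketch ignores the resowner.c and fd.c integration that the disagreement is actually about.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_FILES 8

/* Hypothetical shared state; real code would place this in shared memory. */
typedef struct SharedTempSet
{
    atomic_int  refcount;           /* number of attached participants */
    int         nfiles;             /* number of registered temp files */
    const char *paths[MAX_FILES];   /* paths owned jointly by all participants */
} SharedTempSet;

static void
shared_set_init(SharedTempSet *set)
{
    atomic_init(&set->refcount, 0);
    set->nfiles = 0;
}

/* Register a temp file with the set (not thread-safe in this sketch). */
static void
shared_set_add_file(SharedTempSet *set, const char *path)
{
    assert(set->nfiles < MAX_FILES);
    set->paths[set->nfiles++] = path;
}

/* A participant (leader or worker) joins the set. */
static void
shared_set_attach(SharedTempSet *set)
{
    atomic_fetch_add(&set->refcount, 1);
}

/*
 * A participant leaves the set.  Only the last one out unlinks the files.
 * Returns the number of files successfully unlinked (0 if not the last).
 * unlink() failures are skipped rather than reported, which mirrors the
 * "plow through" behavior described above.
 */
static int
shared_set_detach(SharedTempSet *set)
{
    int removed = 0;

    /* atomic_fetch_sub returns the value held before the decrement */
    if (atomic_fetch_sub(&set->refcount, 1) == 1)
    {
        for (int i = 0; i < set->nfiles; i++)
        {
            if (unlink(set->paths[i]) == 0)
                removed++;
        }
    }
    return removed;
}
```

In this sketch no participant ever hands ownership to another; each one simply attaches and detaches, and whichever detaches last performs the cleanup. The trade-off argued over in this message is whether that simplicity is worth bypassing the existing per-backend resource owner bookkeeping.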

> But even if it were true that the waits will always be brief, I still
> think the way you've done it is a bad idea, because now tuplesort.c
> has to know that it needs to wait because of some detail of
> lower-level resource management about which it should not have to
> care.  That alone is a sufficient reason to want a better approach.

There is already a point at which the leader needs to wait, so that it
can accumulate stats that nbtsort.c cares about. So we already need a
leader wait point within nbtsort.c (that one is called directly by
nbtsort.c). Doesn't seem like too bad of a wart to have the same thing
for workers.

>> I believe that the main reason that you like the design I came up with
>> on the whole is that it's minimally divergent from the serial case.
>
> That's part of it, I guess, but it's more that the code you've added
> to do parallelism here looks an awful lot like what's gotten added to
> do parallelism in other cases, like parallel query.  That's probably a
> good sign.

It's also a good sign that it makes CREATE INDEX approximately 3 times faster.

-- 
Peter Geoghegan
