Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Msg-id CA+TgmoY0pB4qxgOH=bD_VfvLktj8f2w58s2tYF_NEaj2QJdNxQ@mail.gmail.com
In response to Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)  (Thomas Munro <thomas.munro@enterprisedb.com>)
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
On Tue, Mar 21, 2017 at 3:50 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Mar 21, 2017 at 12:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> From my point of view, the main point is that having two completely
>> separate mechanisms for managing temporary files that need to be
>> shared across cooperating workers is not a good decision.  That's a
>> need that's going to come up over and over again, and it's not
>> reasonable for everybody who needs it to add a separate mechanism for
>> doing it.  We need to have ONE mechanism for it.
>
> Obviously I understand that there is value in code reuse in general.
> The exact extent to which code reuse is possible here has been unclear
> throughout, because it's complicated for all kinds of reasons. That's
> why Thomas and I had 2 multi-hour Skype calls all about it.

I agree that the extent to which code reuse is possible here is
somewhat unclear, but I am 100% confident that the answer is non-zero.
You and Thomas both need BufFiles that can be shared across multiple
backends associated with the same ParallelContext.  I don't understand
how you can argue that it's reasonable to have two different ways of
sharing the same kind of object across the same set of processes.  And
if that's not reasonable, then somehow we need to come up with a
single mechanism that can meet both your requirements and Thomas's
requirements.

>> It's just not OK in my book for a worker to create something that it
>> initially owns and then later transfer it to the leader.
>
> Isn't that an essential part of having a refcount, in general? You
> were the one that suggested refcounting.

No, quite the opposite.  My point in suggesting adding a refcount was
to avoid needing to have a single owner.  Instead, the process that
decrements the reference count to zero becomes responsible for doing
the cleanup.  What you've done with the ref count is use it as some
kind of medium for transferring responsibility from backend A to
backend B; what I want is to allow backends A, B, C, D, E, and F to
attach to the same shared resource, and whichever one of them happens
to be the last one out of the room shuts off the lights.
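
As a rough illustration of that "last one out" rule (a minimal sketch
only, not PostgreSQL code; SharedThing, shared_thing_attach(),
shared_thing_detach() and simulate_backends() are hypothetical names
invented for this example), every participant attaches and detaches
the same way, and whichever detach call takes the count to zero does
the cleanup:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared object; a real one would live in shared memory. */
typedef struct SharedThing
{
    atomic_int  refcnt;     /* number of currently attached backends */
    atomic_int  cleanups;   /* times cleanup ran (demonstration only) */
} SharedThing;

static void
shared_thing_init(SharedThing *t)
{
    atomic_init(&t->refcnt, 0);
    atomic_init(&t->cleanups, 0);
}

/* Any backend may attach; nobody is "the owner". */
static void
shared_thing_attach(SharedThing *t)
{
    atomic_fetch_add(&t->refcnt, 1);
}

/*
 * Detach from the object.  Returns true if this caller took the count
 * to zero and therefore performed the cleanup.
 */
static bool
shared_thing_detach(SharedThing *t)
{
    int         old = atomic_fetch_sub(&t->refcnt, 1);

    assert(old > 0);
    if (old == 1)
    {
        /* Stand-in for the real cleanup work. */
        atomic_fetch_add(&t->cleanups, 1);
        return true;
    }
    return false;
}

/* Attach and detach nbackends participants; report how often cleanup ran. */
static int
simulate_backends(int nbackends)
{
    SharedThing t;
    int         i;

    shared_thing_init(&t);
    for (i = 0; i < nbackends; i++)
        shared_thing_attach(&t);
    for (i = 0; i < nbackends; i++)
        shared_thing_detach(&t);
    return atomic_load(&t.cleanups);
}
```

The sketch assumes every participant attaches before the count can
reach zero; it does not try to handle a late attach racing with the
final detach.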

>> The cooperating backends should have joint ownership of the objects from
>> the beginning, and the last process to exit the set should clean up
>> those resources.
>
> That seems like a facile summary of the situation. There is a sense in
> which there is always joint ownership of files with my design. But
> there is also a sense in which there isn't, because it's impossible to
> do that while not completely reinventing resource management of temp
> files. I wanted to preserve resowner.c ownership of fd.c segments.

As I've said before, I think that's an anti-goal.  This is a different
problem, and trying to reuse the solution we chose for the
non-parallel case doesn't really work.  resowner.c could end up owning
a shared reference count which it's responsible for decrementing --
and then decrementing it removes the file if the result is zero.  But
it can't own performing the actual unlink(), because then we can't
support cases where the file may have multiple readers, since whoever
owns the unlink() might try to zap the file out from under one of the
others.
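
To make the distinction concrete, here is a small sketch under stated
assumptions: it runs in a single process, uses a plain int where real
code would need an atomic or lock-protected counter in shared memory,
and every name in it (SharedTempFile, shared_temp_file_release(), and
so on) is hypothetical rather than anything in fd.c or resowner.c.
The release function is what a resource owner would call; it only
drops a reference, and the unlink() happens only when the count
reaches zero:

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical shared temp file; refcnt is a plain int in this sketch. */
typedef struct SharedTempFile
{
    char        path[64];
    int         refcnt;
} SharedTempFile;

/* Create the file; the creator counts as the first attached reader. */
static bool
shared_temp_file_create(SharedTempFile *f)
{
    int         fd;

    strcpy(f->path, "/tmp/shared_sketch_XXXXXX");
    fd = mkstemp(f->path);
    if (fd < 0)
        return false;
    close(fd);
    f->refcnt = 1;
    return true;
}

static void
shared_temp_file_attach(SharedTempFile *f)
{
    f->refcnt++;
}

/*
 * Drop one reference.  Only the call that reaches zero removes the
 * file.  Returns true if this call did the unlink().
 */
static bool
shared_temp_file_release(SharedTempFile *f)
{
    assert(f->refcnt > 0);
    if (--f->refcnt > 0)
        return false;
    unlink(f->path);
    return true;
}

static bool
file_exists(const char *path)
{
    return access(path, F_OK) == 0;
}

/* Two readers share one file; returns 0 if the file outlives the first. */
static int
demo_two_readers(void)
{
    SharedTempFile f;
    bool        removed_early;
    bool        still_there;
    bool        removed_last;
    bool        gone;

    if (!shared_temp_file_create(&f))
        return -1;
    shared_temp_file_attach(&f);                    /* second reader */

    removed_early = shared_temp_file_release(&f);   /* first reader done */
    still_there = file_exists(f.path);
    removed_last = shared_temp_file_release(&f);    /* last reader done */
    gone = !file_exists(f.path);

    return (!removed_early && still_there && removed_last && gone) ? 0 : 1;
}
```

Neither reader "owns" the unlink() here, so the order in which they
finish doesn't matter.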

> You maintain that it's better to have the leader unlink() everything
> at the end, and suppress the errors when that doesn't work, so that
> that path always just plows through.

I don't want the leader to be responsible for anything.  I want the
last process to detach to be responsible for cleanup, regardless of
which process that ends up being.  I want that for lots of good
reasons which I have articulated including (1) it's how all other
resource management for parallel query already works, e.g. DSM, DSA,
and group locking; (2) it avoids the need for one process to sit and
wait until another process assumes ownership, which isn't a feature
even if (as you contend, and I'm not convinced) it doesn't hurt much;
and (3) it allows for use cases where multiple processes are reading
from the same shared BufFile without the risk that some other process
will try to unlink() the file while it's still in use.  The point for
me isn't so much whether unlink() ever ignores errors as whether
cleanup (however defined) is an operation guaranteed to happen exactly
once.

> I disagree with that. It is a
> trade-off, I suppose. I have now run out of time to work through it
> with you or Thomas, though.

Bummer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


