Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date
Msg-id CAH2-WzkCsHm3nYJAqDotbof61XM1b_ug9bMDp4ftBNTiTv6i0A@mail.gmail.com
In response to Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Tue, Mar 21, 2017 at 2:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I agree that the extent to which code reuse is possible here is
> somewhat unclear, but I am 100% confident that the answer is non-zero.
> You and Thomas both need BufFiles that can be shared across multiple
> backends associated with the same ParallelContext.  I don't understand
> how you can argue that it's reasonable to have two different ways of
> sharing the same kind of object across the same set of processes.

I didn't argue that. Rather, I argued that there are going to be
significant additional requirements for PHJ, because it has to support
arbitrarily many BufFiles, rather than either 1 or 2 (one per
tuplesort/logtape set). Just how "significant" that would be I cannot
say, regrettably. (Or, we're going to have to make logtape.c multiplex
BufFiles, which risks breaking other logtape.c routines that aren't
even used just yet.)

>> Isn't that an essential part of having a refcount, in general? You
>> were the one that suggested refcounting.
>
> No, quite the opposite.  My point in suggesting adding a refcount was
> to avoid needing to have a single owner.  Instead, the process that
> decrements the reference count to zero becomes responsible for doing
> the cleanup.  What you've done with the ref count is use it as some
> kind of medium for transferring responsibility from backend A to
> backend B; what I want is to allow backends A, B, C, D, E, and F to
> attach to the same shared resource, and whichever one of them happens
> to be the last one out of the room shuts off the lights.

Actually, that's quite possible with the design I came up with. The
restriction that Thomas can't live with as I've left things is that
you have to know the number of BufFiles ahead of time. I'm pretty sure
that that's all it is. (I do sympathize with the fact that that isn't
very helpful to him, though.)

> As I've said before, I think that's an anti-goal.  This is a different
> problem, and trying to reuse the solution we chose for the
> non-parallel case doesn't really work.  resowner.c could end up owning
> a shared reference count which it's responsible for decrementing --
> and then decrementing it removes the file if the result is zero.  But
> it can't own performing the actual unlink(), because then we can't
> support cases where the file may have multiple readers, since whoever
> owns the unlink() might try to zap the file out from under one of the
> others.

Define "zap the file". I think, based on your remarks here, that
you've misunderstood my design. I think you should at least understand
it fully if you're going to dismiss it.

It is true that a worker's resowner can unlink() the files
mid-unification, in the same manner as with conventional temp files,
and not decrement its refcount in shared memory, or care at all in any
special way. This is okay because the leader (in the case of parallel
tuplesort) will realize that it should not "turn out the lights",
finding that remaining reference when it calls BufFileClose() in its
registered callback, as it alone must. It doesn't matter that the
unlink() may have already occurred, or may be just about to occur,
because we are only operating on already-opened files, and never on the
link itself (we don't have to stat() the file link, for example, which
is naturally only a task for the unlink()'ing backend anyway). You
might say that the worker only blows away the link itself, not the file
proper, since the file may still be open in the leader (say).

** We rely on the fact that files are themselves a kind of reference
counted thing, in general; they have an independent existence from the
link originally used to open() them. **
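
For illustration, here is a tiny standalone program (nothing from the
patch; just the stock POSIX behavior I'm relying on) showing that an
already-open file descriptor keeps working after its link is gone:

/* Standalone illustration (not patch code): an already-open file
 * descriptor keeps working after its directory entry is unlink()'d,
 * because the kernel reference-counts the underlying file. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    char        buf[16] = {0};
    int         fd = open("demo_tmpfile", O_CREAT | O_RDWR, 0600);

    if (fd < 0)
        return 1;
    write(fd, "still here", 10);

    /* Remove the link; the file itself lives on while fd is open. */
    unlink("demo_tmpfile");

    lseek(fd, 0, SEEK_SET);
    read(fd, buf, 10);
    printf("%s\n", buf);        /* prints "still here" */

    close(fd);                  /* last reference gone; storage reclaimed */
    return 0;
}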

The reason there is a brief wait in workers for parallel tuplesort is
that it gives us the opportunity to have the immediately subsequent
worker BufFileClose() not turn out the lights in the worker: the leader
must already hold a reference on the BufFile by the time workers are
released. So there is a kind of interlock that makes sure that there is
always at least 1 owner.

** There would be no need for an additional wait but for the fact that
the leader wants to unify multiple worker BufFiles as one, and must
open them all at once for the sake of simplicity. But that's just how
parallel tuplesort in particular happens to work, since it has only
one BufFile in the leader, which it wants to operate on with
everything set up up-front. **
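
To sketch the shape of that interlock (simplified, with made-up names;
threads stand in for backends and an atomic flag stands in for the real
shared-memory state -- this is not the patch code):

/* Rough sketch only.  The worker waits until the leader holds a
 * reference before its own close runs, so closing in the worker can
 * never drop the last reference. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

typedef struct SharedFileState
{
    atomic_int  refcount;           /* references held across "backends" */
    atomic_int  leader_attached;    /* interlock: leader took its reference */
} SharedFileState;

static SharedFileState state;

static void *
worker_main(void *arg)
{
    /* ... worker writes its sorted runs to its BufFile here ... */

    /* Brief wait: don't proceed until the leader holds a reference. */
    while (!atomic_load(&state.leader_attached))
        ;                           /* real code would sleep on a latch */

    /* Now the worker's BufFileClose() cannot "turn out the lights". */
    if (atomic_fetch_sub(&state.refcount, 1) == 1)
        printf("worker did final cleanup (cannot happen here)\n");
    else
        printf("worker closed; leader still holds a reference\n");
    return NULL;
}

static void *
leader_main(void *arg)
{
    /* Leader opens/unifies the worker BufFiles, taking its reference... */
    atomic_fetch_add(&state.refcount, 1);
    /* ...and only then releases the worker from its brief wait. */
    atomic_store(&state.leader_attached, 1);
    return NULL;
}

int
main(void)
{
    pthread_t   worker, leader;

    atomic_init(&state.refcount, 1);        /* worker's own reference */
    atomic_init(&state.leader_attached, 0);

    pthread_create(&worker, NULL, worker_main, NULL);
    pthread_create(&leader, NULL, leader_main, NULL);
    pthread_join(worker, NULL);
    pthread_join(leader, NULL);
    return 0;
}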

In Thomas' design, there is no way to reliably know how many segments
exist in workers on error paths, which necessitates his
unlink()-ENOENT-ignore hack. My solution is that workers/owners look
after their own temp segments in the conventional way until they reach
BufFileClose(), which may never come if there is an error. The only
case in which clean-up doesn't happen in the conventional
resowner.c-in-worker fashion is when BufFileClose() is reached in the
owner/worker. And BufFileClose() must be reached when there is no
error, which has to happen anyway when using temp files (otherwise
there is a temp file leak warning from resowner.c).

This is the only way to avoid the unlink()-ENOENT-ignore hack, AFAICT,
since only the worker itself can reliably know how many segments it
has opened at every single instant in time. Because it's the owner!
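
Roughly, the difference looks like this (hypothetical helper names and
segment naming, not code from either patch): a non-owner has to probe
segment numbers until ENOENT and treat that as success, while the owner
simply remembers how many segments it created.

/* Illustration only -- hypothetical helpers, not from either patch. */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Non-owner cleanup: another backend cannot know how many segments the
 * owner created, so it probes segment numbers until unlink() fails and
 * has to assume ENOENT means "no more segments". */
void
cleanup_as_non_owner(const char *basename)
{
    for (int segno = 0;; segno++)
    {
        char        path[128];

        snprintf(path, sizeof(path), "%s.%d", basename, segno);
        if (unlink(path) != 0)
        {
            if (errno != ENOENT)
                perror(path);   /* a real error, but hard to tell apart */
            break;
        }
    }
}

/* Owner cleanup: the backend that created the segments knows exactly
 * how many exist at every instant, so nothing has to be guessed at. */
void
cleanup_as_owner(const char *basename, int nsegments)
{
    for (int segno = 0; segno < nsegments; segno++)
    {
        char        path[128];

        snprintf(path, sizeof(path), "%s.%d", basename, segno);
        if (unlink(path) != 0)
            perror(path);       /* a real error, worth reporting */
    }
}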

>> You maintain that it's better to have the leader unlink() everything
>> at the end, and suppress the errors when that doesn't work, so that
>> that path always just plows through.
>
> I don't want the leader to be responsible for anything.

I meant in the case of parallel CREATE INDEX specifically, were it to
use this other mechanism. Substitute "the last backend" for "leader"
when reading my remarks here.

> I want the
> last process to detach to be responsible for cleanup, regardless of
> which process that ends up being.  I want that for lots of good
> reasons which I have articulated including (1) it's how all other
> resource management for parallel query already works, e.g. DSM, DSA,
> and group locking; (2) it avoids the need for one process to sit and
> wait until another process assumes ownership, which isn't a feature
> even if (as you contend, and I'm not convinced) it doesn't hurt much;
> and (3) it allows for use cases where multiple processes are reading
> from the same shared BufFile without the risk that some other process
> will try to unlink() the file while it's still in use.  The point for
> me isn't so much whether unlink() ever ignores errors as whether
> cleanup (however defined) is an operation guaranteed to happen exactly
> once.

My patch demonstrably has these properties. I've done quite a bit of
fault injection testing to prove it.

(Granted, I need to take extra steps for the leader-as-worker backend,
a special case, which I haven't done yet because I was waiting on your
feedback about the appropriate trade-off there.)

-- 
Peter Geoghegan


