From: Robert Haas
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Msg-id: CA+TgmoaWCfJ2Z_9cDO09NBmtLi4BDJ=mCK1bV62mc13ULvJ+5w@mail.gmail.com
In response to: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)  (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)  (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On Tue, Mar 21, 2017 at 7:37 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>>> Isn't that an essential part of having a refcount, in general? You
>>> were the one that suggested refcounting.
>>
>> No, quite the opposite.  My point in suggesting adding a refcount was
>> to avoid needing to have a single owner.  Instead, the process that
>> decrements the reference count to zero becomes responsible for doing
>> the cleanup.  What you've done with the ref count is use it as some
>> kind of medium for transferring responsibility from backend A to
>> backend B; what I want is to allow backends A, B, C, D, E, and F to
>> attach to the same shared resource, and whichever one of them happens
>> to be the last one out of the room shuts off the lights.
>
> Actually, that's quite possible with the design I came up with.

I don't think it is.  What sequence of calls to the APIs you've
proposed would accomplish that goal?  I don't see anything in this
patch set that would permit anything other than a handoff from the
worker to the leader.  There seems to be no way for the ref count to
be more than 1 (or 2?).
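
For concreteness, here's a minimal sketch of the shape I have in
mind, where the last backend to detach does the cleanup.  All of the
names here (SharedFileSet, SharedFileSetDetach, DeleteAllTempFiles)
are hypothetical, not something from your patch set:

    #include "postgres.h"
    #include "storage/spin.h"

    /* shared state, e.g. placed in a DSM segment */
    typedef struct SharedFileSet
    {
        slock_t     mutex;
        int         refcnt;         /* number of attached backends */
    } SharedFileSet;

    extern void DeleteAllTempFiles(SharedFileSet *fileset); /* hypothetical */

    static void
    SharedFileSetDetach(SharedFileSet *fileset)
    {
        bool        last;

        SpinLockAcquire(&fileset->mutex);
        last = (--fileset->refcnt == 0);
        SpinLockRelease(&fileset->mutex);

        /* whoever is last out of the room shuts off the lights */
        if (last)
            DeleteAllTempFiles(fileset);
    }

Any of backends A through F can attach, increment the refcount, and
call this on the way out; no one process ever has to be "the owner".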

> The
> restriction that Thomas can't live with as I've left things is that
> you have to know the number of BufFiles ahead of time. I'm pretty sure
> that that's all it is. (I do sympathize with the fact that that isn't
> very helpful to him, though.)

I feel like there's some cognitive dissonance here.  On the one hand,
you're saying we should use your design.  On the other hand, you are
admitting that in at least one key respect, it won't meet Thomas's
requirements.  On the third hand, you just said that you weren't
arguing for two mechanisms for sharing a BufFile across cooperating
parallel processes.  I don't see how you can hold all three of those
positions simultaneously.

>> As I've said before, I think that's an anti-goal.  This is a different
>> problem, and trying to reuse the solution we chose for the
>> non-parallel case doesn't really work.  resowner.c could end up owning
>> a shared reference count which it's responsible for decrementing --
>> and then decrementing it removes the file if the result is zero.  But
>> it can't own performing the actual unlink(), because then we can't
>> support cases where the file may have multiple readers, since whoever
>> owns the unlink() might try to zap the file out from under one of the
>> others.
>
> Define "zap the file". I think, based on your remarks here, that
> you've misunderstood my design. I think you should at least understand
> it fully if you're going to dismiss it.

"Zap" was a colloquialism for unlink().  I concede that I don't fully
understand your design, and am trying to understand those things I do
not yet understand.

> It is true that a worker resowner can unlink() the files
> mid-unification, in the same manner as with conventional temp files,
> and not decrement its refcount in shared memory, or care at all in any
> special way. This is okay because the leader (in the case of parallel
> tuplesort) will realize that it should not "turn out the lights",
> finding that remaining reference when it calls BufFileClose() in a
> registered callback, as it alone must. It doesn't matter that the
> unlink() may have already occurred, or may be just about to occur,
> because we are only operating on already-opened files, and never on
> the link itself (we don't have to stat() the file link for example,
> which is naturally only a task for the unlink()'ing backend anyway).
> You might say that the worker only blows away the link itself, not the
> file proper, since it may still be open in leader (say).

Well, that sounds like it's counting on fd.c not to close the file
descriptor at an inconvenient point in time and reopen it later, which
is not guaranteed.
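
To spell the hazard out: fd.c keeps a pool of virtual file
descriptors, and it can close the real descriptor under FD pressure
and reopen the file by pathname later.  In plain POSIX terms (the
path here is made up for illustration):

    #include <fcntl.h>
    #include <unistd.h>

    static void
    illustrate_reopen_hazard(void)
    {
        int     fd = open("base/pgsql_tmp/t.0", O_RDWR);

        /* ... worker unlink()s the path; I/O through fd still works ... */
        close(fd);          /* fd.c may do this at any time under pressure */
        fd = open("base/pgsql_tmp/t.0", O_RDWR);    /* now fails: ENOENT */
    }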

> Thomas' design cannot reliably know how many segments there are in
> workers in error paths, which necessitates his unlink()-ENOENT-ignore
> hack. My solution is that workers/owners look after their own temp
> segments in the conventional way, until they reach BufFileClose(),
> which may never come if there is an error. The only way that clean-up
> won't happen in conventional resowner.c-in-worker fashion is if
> BufFileClose() is reached in owner/worker. BufFileClose() must be
> reached when there is no error, which has to happen anyway when using
> temp files. (Else there is a temp file leak warning from resowner.c.)
>
> This is the only way to avoid the unlink()-ENOENT-ignore hack, AFAICT,
> since only the worker itself can reliably know how many segments it
> has opened at every single instant in time. Because it's the owner!

Above, you said that your design would allow for a group of processes
to share access to a file, with the last one that abandons it "turning
out the lights".  But here, you are referring to it as having one
owner - the "only the worker itself" can know the number of segments.
Those things are exact opposites of each other.

I don't think there's any problem with ignoring ENOENT, and I don't
think there's any need for a process to know the exact number of
segments in some temporary file.  In a shared-ownership environment,
that information can't be stored in a backend-private cache; it's got
to be available to whichever backend ends up being the last one out.
There are only two ways to do that.  One is to store it in shared
memory, and the other is to discover it from the filesystem.  The
former is conceptually more appealing, but it can't handle Thomas's
requirement of an unlimited number of files, so I think it makes sense
to go with the latter.  The only problem with that which I can see is
that we might orphan some temporary files if the disk is flaky and
filesystem operations are failing intermittently, but that's already a
pretty bad situation which we're not going to make much worse with
this approach.
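
Concretely, the last backend to detach could discover and remove the
segments by probing the filesystem, with ENOENT as the natural
stopping point.  A sketch under those assumptions; the function name
and the ".segno" suffix convention are made up for illustration:

    #include "postgres.h"

    #include <errno.h>
    #include <unistd.h>

    static void
    CleanupSegmentsByProbing(const char *basepath)
    {
        int         segno;

        for (segno = 0;; segno++)
        {
            char        path[MAXPGPATH];

            snprintf(path, sizeof(path), "%s.%d", basepath, segno);
            if (unlink(path) < 0)
            {
                if (errno != ENOENT)
                    elog(LOG, "could not unlink \"%s\": %m", path);
                break;      /* ENOENT: ran off the end, we're done */
            }
        }
    }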

>> I want the
>> last process to detach to be responsible for cleanup, regardless of
>> which process that ends up being.  I want that for lots of good
>> reasons which I have articulated including (1) it's how all other
>> resource management for parallel query already works, e.g. DSM, DSA,
>> and group locking; (2) it avoids the need for one process to sit and
>> wait until another process assumes ownership, which isn't a feature
>> even if (as you contend, and I'm not convinced) it doesn't hurt much;
>> and (3) it allows for use cases where multiple processes are reading
>> from the same shared BufFile without the risk that some other process
>> will try to unlink() the file while it's still in use.  The point for
>> me isn't so much whether unlink() ever ignores errors as whether
>> cleanup (however defined) is an operation guaranteed to happen exactly
>> once.
>
> My patch demonstrably has these properties. I've done quite a bit of
> fault injection testing to prove it.

I don't understand this comment, because 0 of the 3 properties that I
just articulated are things which can be proved or disproved by fault
injection.  Fault injection can confirm the presence of bugs or
suggest their absence, but none of those properties have to do with
whether there are bugs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company