Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management
Date
Msg-id CAH2-Wzka_qj2V_mpOsimF_M9gahsX7gsZDBYF=czvExyWKb4Zg@mail.gmail.com
In response to Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
On Thu, Mar 9, 2017 at 4:29 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Fri, Mar 10, 2017 at 8:19 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>> by having state for each segment, it ends up actually *relying on*
>> ENOENT-on-unlink() as a condition that terminates execution:
>
> Yeah, this seems to fall out of the requirement to manage a growable
> number of partition files in a fixed space.  I wonder how this could
> go wrong.  One way would be for a crash-restart to happen (which
> leaves all temporary files in place by design, though it could clean
> them up like a normal restart if it wanted to), followed by a very
> long running cluster eventually generating the same (pid, set number)
> pair.  I think I see a simple way to defend against that, which I'll
> write about in the PHJ thread.

I am not expressing any real opinion about the idea of relying on or
suppressing ENOENT-on-unlink() just yet. What's clear is that that's
unorthodox. I seldom have any practical reason to make a distinction
between unorthodox and unacceptable. It's usually easier to just not
do the unorthodox thing. Maybe this is one of the rare exceptions.
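
To spell the pattern out for anyone following along (an illustrative
sketch only, with made-up names and path construction, nothing lifted
from either patch): since the number of segment files isn't recorded
anywhere, deletion just walks segment numbers until unlink() fails,
and treats ENOENT as the signal to stop rather than as an error:

    #include <stdio.h>
    #include <unistd.h>

    /*
     * Delete the on-disk segments backing one shared temp file.  The
     * segment count isn't stored anywhere, so unlink() failure
     * (expected to be ENOENT) is what terminates the loop.
     */
    static void
    delete_segments(const char *base_path)
    {
        int         segno;

        for (segno = 0;; segno++)
        {
            char        path[1024];

            snprintf(path, sizeof(path), "%s.%d", base_path, segno);
            if (unlink(path) < 0)
                break;      /* ENOENT means no more segments; any
                             * other errno would need real handling */
        }
    }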

> Thanks.  I will respond with code and comments over on the PHJ thread.
> Aside from the broken extendBufFile behaviour you mentioned, I'll look
> into the general modularity complaints I'm hearing about fd.c and
> buffile.c interaction.

buffile.c should stop pretending to care about anything other than
temp files, IMV. 100% of all clients that want temporary files go
through buffile.c. 100% of all clients that want non-temp files (files
which are not marked FD_TEMPORARY) access fd.c directly, rather than
going through buffile.c.
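
To illustrate what I mean, the only usage pattern that buffile.c
actually has to serve looks something like this (a minimal sketch,
assuming I have the details of the current BufFile interface right,
with error handling kept to a bare minimum):

    #include "postgres.h"

    #include "storage/buffile.h"

    /*
     * Create a backend-private temp file, write to it, rewind, and
     * read the data back.  The underlying fd.c file is marked
     * FD_TEMPORARY, and goes away at BufFileClose() (or at
     * transaction end, via resource owner cleanup).
     */
    static void
    buffile_temp_example(void)
    {
        BufFile    *file = BufFileCreateTemp(false);    /* interXact = false */
        int         val = 42;

        if (BufFileWrite(file, &val, sizeof(val)) != sizeof(val))
            elog(ERROR, "could not write to temporary file");

        if (BufFileSeek(file, 0, 0L, SEEK_SET) != 0)
            elog(ERROR, "could not rewind temporary file");

        if (BufFileRead(file, &val, sizeof(val)) != sizeof(val))
            elog(ERROR, "could not read from temporary file");

        BufFileClose(file);
    }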

>> But my observations illustrate the difficulty with tying
>> resource manager resources in local memory (temp segments, and
>> associated BufFile(s)) with clean-up at shared memory segment
>> detachment. It's possible, but ticklish enough that you'll end up with
>> some wart or other when you're done.

> Suppressing ENOENT because it's not clear which backend actually owns
> a file is a bit different from using it to detect that there are no
> more segments...  Obviously I screwed some things up in the code I
> posted, but I think the general idea that the DSM segment owns the
> files on disk is a good one.  I figure that it (via the detach hook)
> should be able to delete the files using only data in shmem, without
> reference to any backend-local memory, so that file cleanup is
> entirely disconnected from backend-local resource cleanup (fds and
> memory).

The problem with that is that it kind of implies that we have to
invent a mechanism that 100% supersedes fd.c resource management. We
still want to free backend-local vFDs and so on, using fd.c cleanup,
so that file descriptors are not leaked, but it's not clear that we
can do that by half measures. Wouldn't even that have to be
duplicated to deal with shared memory state, by way of your
on_dsm_detach() callback? And, if it did, it would presumably need to
keep track of every single segment, much like fd.c. But the number of
segments isn't predictable in advance, so you're forced to confront
the very problem you hoped to avoid. And where does all of this leave
max_files_per_process enforcement?
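
For concreteness, here is roughly the shape of the arrangement I
understand you to be describing (a hypothetical sketch with invented
struct and function names, not your actual code; the file name layout
is made up as well). The DSM segment carries just enough state to
name the files on disk, and an on_dsm_detach() callback unlinks them
without touching any backend-local memory. A real version would also
need a refcount in shmem, so that only the last backend to detach
performs the unlinks:

    #include "postgres.h"

    #include <unistd.h>

    #include "storage/dsm.h"

    /* lives inside the DSM segment itself */
    typedef struct SharedTempFileSet
    {
        pid_t       creator_pid;    /* enough to reconstruct on-disk */
        int         set_number;     /* names, plus a segment number */
    } SharedTempFileSet;

    /* on_dsm_detach() callback; "arg" points into the mapped segment */
    static void
    shared_tempfile_cleanup(dsm_segment *seg, Datum arg)
    {
        SharedTempFileSet *set = (SharedTempFileSet *) DatumGetPointer(arg);
        int         segno;

        for (segno = 0;; segno++)
        {
            char        path[MAXPGPATH];

            snprintf(path, sizeof(path), "base/pgsql_tmp/pgsql_tmp%d.%d.%d",
                     (int) set->creator_pid, set->set_number, segno);
            if (unlink(path) < 0)
                break;          /* ENOENT terminates, as above */
        }
    }

    /* registration, in each participant, right after attaching: */
    /*   on_dsm_detach(seg, shared_tempfile_cleanup, PointerGetDatum(set)); */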

> Looking forward to seeing your v9.  We need to figure out what common
> ground we can find here.  Quality issues with the code I posted aside
> (which I'm sorting out), it seems that the underlying reason we're not
> converging yet is my requirement for a variable number of partitions
> managed in a fixed shmem space, resulting in the ENOENT-as-terminator
> behaviour that you don't like.

I agree that that's the snag we're getting caught on here. In even
simpler terms, what you need to do with BufFiles is somewhat more
difficult than what I need to do (for either of us), and now that
time grows short, I feel the need to draw a line under those
differences. At the same time, I would very much like to avoid doing
you a disservice by failing to anticipate your needs.

> I'm 100% happy to switch parts of what
> I'm doing to using anything you come up with if it can support a
> dynamic set of N partitions from fixed set of M participants.  I'm
> fairly sure that people won't be happy with two separate ways to
> manage shared temporary files.  I also don't want patch A to be
> derailed by patch B, but even if these things finish up in different
> Postgres releases we'll still have to solve the problem.

That's all fair. And, in case it isn't clear, I think that all of this
BufFile + fd.c + resowner.c + on_dsm_detach() stuff might be the
single biggest yak shaving expedition I've ever found myself on. It
just seems like the kind of thing where the only workable approach is
to try to converge on a good plan iteratively. That isn't anybody's
fault.

So, yes, there is a bit of friction between the requirements of
parallel tuplesort and those of parallel hash join, but I see that as
useful, and as part of the process. Also, for what it's worth, I
think that you have the right general idea about parallel hash join.

-- 
Peter Geoghegan


