Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management
Date
Msg-id CAEepm=3g9hjf64WR=ViUe=OFfBZYWi-q53cBcaKeMOruc441Ng@mail.gmail.com
In response to Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management  (Peter Geoghegan <pg@bowt.ie>)
Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Fri, Mar 10, 2017 at 8:19 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> by having state for each segment, it ends up actually *relying on*
> ENOENT-on-unlink() as a condition that terminates execution:

Yeah, this seems to fall out of the requirement to manage a growable
number of partition files in a fixed space.  I wonder how this could
go wrong.  One way would be for a crash-restart to happen (which
leaves all temporary files in place by design, though it could clean
them up like a normal restart if it wanted to), followed by a very
long-running cluster eventually generating the same (pid, set number)
pair.  I think I see a simple way to defend against that, which I'll
write about in the PHJ thread.
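
To make the hazard concrete, the pattern in question is roughly the
sketch below.  It's only an illustration of the ENOENT-as-terminator
idea, not the code from the patch; the function name and path scheme
are made up.

#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Illustration only: with nothing but a fixed-size identifier to go on,
 * cleanup can't know how many segment files were created, so it unlinks
 * successive segment numbers until unlink() reports ENOENT and treats
 * that as the terminator.
 */
void
cleanup_segments_until_enoent(pid_t creator_pid, int set_number)
{
    for (int segment = 0;; segment++)
    {
        char    path[128];

        /* hypothetical naming scheme: shared.<pid>.<set>.<segment> */
        snprintf(path, sizeof(path), "base/pgsql_tmp/shared.%d.%d.%d",
                 (int) creator_pid, set_number, segment);

        if (unlink(path) != 0)
        {
            if (errno != ENOENT)
                perror(path);   /* report unexpected errors */
            break;              /* either way, stop at the first miss */
        }
    }
}

A crash-restart that leaves old files behind can confuse a loop like
that if the same (pid, set number) pair ever comes around again, which
is the collision described above.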

On Fri, Mar 10, 2017 at 10:01 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> IOW, I cannot see why 0007-hj-shared-buf-file-v6.patch leaves parallel
> hash join with BufFiles that are even capable of being expanded past
> 1GiB (1 segment), which is surely not acceptable -- it's a bug. Have I
> just missed something?

Bleugh, yeah that's completely busted.  Fixing.

> segments). I suggest that the MAX_PHYSICAL_FILESIZE stress-test I
> sometimes perform [1] be incorporated in Thomas' testing.

Right, will do.
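
For anyone following along: buffile.c caps each physical segment file
at MAX_PHYSICAL_FILESIZE (1GiB) and maps a logical position onto a
(segment number, offset within segment) pair, so a BufFile that can't
grow new segments is hard-capped at 1GiB.  A simplified, standalone
sketch of that mapping (the real logic lives in buffile.c, around
BufFileDumpBuffer() and extendBufFile()):

#include <stdint.h>
#include <stdio.h>

/* the same 1GiB cap buffile.c applies to each physical segment file */
#define MAX_PHYSICAL_FILESIZE 0x40000000

int
main(void)
{
    /* a logical position a bit past 3GiB */
    int64_t logical = (int64_t) 3 * MAX_PHYSICAL_FILESIZE + 12345;

    /* split into segment number and offset within that segment */
    int     seg = (int) (logical / MAX_PHYSICAL_FILESIZE);
    int64_t seg_offset = logical % MAX_PHYSICAL_FILESIZE;

    printf("segment %d, offset %lld\n", seg, (long long) seg_offset);
    return 0;
}

Cranking MAX_PHYSICAL_FILESIZE down forces those segment boundaries to
be crossed constantly, which I assume is the point of the stress test
you mention.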

> (Looks more carefully...)
>
> I now see that PathNameCreateFile() is being called, a new function
> which is largely just a new interface to the existing
> OpenTemporaryFileInTablespace() temp files. So, if you look carefully,
> you notice that you do in fact end up with FD_TEMPORARY fd.c segments
> here...so maybe temp_file_limit isn't broken, because, it turns out,
> 0007-hj-shared-buf-file-v6.patch is still *halfway* buying into the
> conventional idea of file temp-ness. But buffile.c is left behind
> ("file->isTemp == false"), so AFAICT it must still be broken simply
> because new segments cannot be written, per BufFileCreate() comments
> quoted above.
>
> This has become more of a PHJ review, which is off-topic for this
> thread.

Thanks.  I will respond with code and comments over on the PHJ thread.
Aside from the broken extendBufFile behaviour you mentioned, I'll look
into the general modularity complaints I'm hearing about fd.c and
buffile.c interaction.

> But my observations illustrate the difficulty with tying
> resource manager resources in local memory (temp segments, and
> associated BufFile(s)) with clean-up at shared memory segment
> detachment. It's possible, but ticklish enough that you'll end up with
> some wart or other when you're done. The question, then, is what wart
> is the most acceptable for parallel tuplesort? I see two plausible
> paths forward right now:
>
> 1. Selectively suppress spurious ENOENT-on-unlink() LOG message in the
> brief window where it might be spurious. This is the easiest option.
> (There is less code this way.)
>
> 2. Do more or less what I've already drafted as my V9, which is
> described in opening mail to this thread, while accepting the ugliness
> that that approach brings with it.

Suppressing ENOENT because it's not clear which backend actually owns
a file is a bit different from using it to detect that there are no
more segments...  Obviously I screwed some things up in the code I
posted, but I think the general idea that the DSM segment owns the
files on disk is a good one.  I figure that it (via the detach hook)
should be able to delete the files using only data in shmem, without
reference to any backend-local memory, so that file cleanup is
entirely disconnected from backend-local resource cleanup (fds and
memory).
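
Concretely, I'm imagining something along the lines of the sketch
below.  The struct layout, function name and path scheme are all
hypothetical, not the code I posted; the point is only that the detach
hook works purely from a small fixed-size control object stored in the
DSM segment itself.

#include "postgres.h"

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#include "storage/dsm.h"

/* everything cleanup needs lives here, in the DSM segment */
typedef struct SharedTempFileControl
{
    pid_t   creator_pid;    /* namespace for the on-disk file names */
    int     set_number;     /* distinguishes sets made by one backend */
    int     npartitions;    /* how many partition files exist */
} SharedTempFileControl;

static void
shared_temp_file_detach_hook(dsm_segment *seg, Datum arg)
{
    SharedTempFileControl *ctl;

    /* no backend-local state consulted, only the segment's contents */
    ctl = (SharedTempFileControl *) dsm_segment_address(seg);

    for (int p = 0; p < ctl->npartitions; p++)
    {
        char    path[MAXPGPATH];

        snprintf(path, sizeof(path), "base/pgsql_tmp/shared.%d.%d.p%d",
                 (int) ctl->creator_pid, ctl->set_number, p);
        if (unlink(path) != 0 && errno != ENOENT)
            elog(LOG, "could not unlink \"%s\": %m", path);
    }
}

A participant would register that with on_dsm_detach(seg,
shared_temp_file_detach_hook, (Datum) 0).  In the real thing the
control object would also need a reference count of some kind so that
only the last process to detach actually unlinks the files, and when
the partition count isn't tracked exactly you end up with the
unlink-until-ENOENT loop discussed earlier instead of the npartitions
loop shown here.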

Looking forward to seeing your v9.  We need to figure out what common
ground we can find here.  Quality issues with the code I posted aside
(which I'm sorting out), it seems that the underlying reason we're not
converging yet is my requirement for a variable number of partitions
managed in a fixed shmem space, resulting in the ENOENT-as-terminator
behaviour that you don't like.  I'm 100% happy to switch parts of what
I'm doing to using anything you come up with if it can support a
dynamic set of N partitions from a fixed set of M participants.  I'm
fairly sure that people won't be happy with two separate ways to
manage shared temporary files.  I also don't want patch A to be
derailed by patch B, but even if these things finish up in different
Postgres releases we'll still have to solve the problem.
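
The shape of that requirement, reduced to a sketch (made-up naming
scheme, just to show what "dynamic N partitions from a fixed M
participants" demands of the file layer): any participant must be able
to compute the path of any partition's file from indices alone, so no
per-file state has to live in shared memory.

#include <stdio.h>
#include <sys/types.h>

static void
make_shared_temp_path(char *out, size_t outlen, pid_t creator_pid,
                      int set_number, int participant, int partition)
{
    /* hypothetical layout: shared.<pid>.<set>.<participant>.<partition> */
    snprintf(out, outlen, "base/pgsql_tmp/shared.%d.%d.%d.%d",
             (int) creator_pid, set_number, participant, partition);
}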

-- 
Thomas Munro
http://www.enterprisedb.com


