Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management
Date
Msg-id CAH2-Wz=W8ZCYMTbNKqikLf+YftQvFen45WACu=PLMM41toZAMg@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management  (Peter Geoghegan <pg@bowt.ie>)
Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Wed, Mar 8, 2017 at 5:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think something like 0007-hj-shared-buf-file-v6.patch from
> https://www.postgresql.org/message-id/CAEepm=3g=dMG+84083fkFzLvgMJ7HdhbGB=AeZABNukbZm3hpA@mail.gmail.com
> is probably a good approach to this problem.  In essence, it dodges
> the problem of trying to transfer ownership by making ownership be
> common from the beginning.  That's what I've been recommending, or
> trying to, anyway.

I think that I'm already doing that, to at least the same extent that
0007-hj-shared-buf-file-v6.patch is.

That patch seems to be solving the problem by completely taking over
management of temp files from fd.c. That is, these temp files are not
marked as temp files in the way ordinary temp BufFiles are, with
explicit buy-in from resowner.c about their temp-ness. It adds
"#include "catalog/pg_tablespace.h" to buffile.c, even though that
kind of thing generally lives within fd.c. It does use
PG_TEMP_FILE_PREFIX, but never manages to use temp_tablespaces if
that's set by user. It also doesn't do anything about temp_file_limit
enforcement.

Quite a lot of thought seems to have gone into making the
fd.c/resowner mechanism leak-proof in the face of errors. So, quite
apart from what that approaches misses out on, I really don't want to
change resource management too much. As I went into in my "recap", it
seems useful to change as little as possible about temp files.
Besides, because parallel hash join cannot know the size of the
BufFiles from workers in advance, and thus cannot replicate fd.c fully
by having state for each segment, it ends up actually *relying on*
ENOENT-on-unlink() as a condition that terminates execution:

> +bool
> +BufFileDeleteShared(Oid tablespace, pid_t pid, int set, int partition,
> +                   int participant)
> +{
> +   char        tempdirpath[MAXPGPATH];
> +   char        tempfilepath[MAXPGPATH];
> +   int         segment = 0;
> +   bool        found = false;
> +
> +   /*
> +    * We don't know if the BufFile really exists, because SharedBufFile
> +    * tracks only the range of file numbers.  If it does exists, we don't
> +    * khow many 1GB segments it has, so we'll delete until we hit ENOENT or
> +    * an IO error.
> +    */
> +   for (;;)
> +   {
> +       make_shareable_path(tempdirpath, tempfilepath,
> +                           tablespace, pid, set, partition,
> +                           participant, segment);
> +       if (!PathNameDelete(tempfilepath))
> +           break;
> +       found = true;
> +       ++segment;
> +   }
> +
> +   return found;
> +}

However, the whole point of my fortifying the refcount mechanism for
V9 of parallel tuplesort is to not have to ignore a ENOENT like this,
on the grounds that that's ugly/weird (I pointed this out when I
posted my V8, actually). Obviously I could very easily teach fd.c not
to LOG a complaint about an ENOENT on unlink() for the relevant
parallel case, on the assumption that it's only down to an error
during the brief period of co-ownership of a Buffile at the start of
leader unification. Is that acceptable?

Anyway, the only advantage I immediately see with the approach
0007-hj-shared-buf-file-v6.patch takes (over what I've provisionally
written as my V9) is that by putting everything in shared memory all
along, there is no weirdness with tying local memory clean-up to a
shared memory on_dsm_detach() callback. As I said, stashing a local
pointer for the leader in shared memory is weird, even if it
accomplishes the stated goals for V9.

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: [HACKERS] bytea_output vs make installcheck
Next
From: Andres Freund
Date:
Subject: Re: [HACKERS] Performance issue after upgrading from 9.4 to 9.6