Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management
Date
Msg-id CAH2-Wz=LLFYoecXyALK6V4zNj_Hsf-AkbBiu1FOJBA6zC1JiFg@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
On Thu, Mar 9, 2017 at 11:19 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> That patch seems to be solving the problem by completely taking over
> management of temp files from fd.c. That is, these temp files are not
> marked as temp files in the way ordinary temp BufFiles are, with
> explicit buy-in from resowner.c about their temp-ness. It adds
> "#include "catalog/pg_tablespace.h" to buffile.c, even though that
> kind of thing generally lives within fd.c. It does use
> PG_TEMP_FILE_PREFIX, but never manages to use temp_tablespaces if
> that's set by user. It also doesn't do anything about temp_file_limit
> enforcement.

Actually, it's a bigger departure than I suggest here, even.
Currently, literally no BufFile caller ever gets anything other than
fd.c-wise temp files (those marked FD_TEMPORARY). This is the closest
that buffile.c comes to allowing this:

#ifdef NOT_USED
/** Create a BufFile and attach it to an already-opened virtual File.** This is comparable to fdopen() in stdio.  This
isthe only way at present* to attach a BufFile to a non-temporary file.  Note that BufFiles created* in this way CANNOT
beexpanded into multiple files.*/
 
BufFile *
BufFileCreate(File file)
{   return makeBufFile(file);
}
#endif

IOW, I cannot see why 0007-hj-shared-buf-file-v6.patch leaves parallel
hash join with BufFiles that are even capable of being expanded past
1GiB (1 segment), which is surely not acceptable -- it's a bug. Have I
just missed something?

(This is not to be confused with BufFiles that are interXact -- those
are allowed, but won't work here either. This is why Thomas changed
PHJ to not do it that way some months back.)

At first I thought that it was okay because BufFiles are
immutable/"readonly" once handed over from workers, but of course the
PHJ SharedBufFile mechanism is up-front, and does all resource
management in shared memory. Thus, it must even create BufFiles in
workers that have this only-one-segment restriction as a consequence
of not being temp files (in the conventional sense: FD_TEMPORARY fd.c
segments). I suggest that the MAX_PHYSICAL_FILESIZE stress-test I
sometimes perform [1] be incorporated in Thomas' testing.

(Looks more carefully...)

I now see that PathNameCreateFile() is being called, a new function
which is largely just a new interface to the existing
OpenTemporaryFileInTablespace() temp files. So, if you look carefully,
you notice that you do in fact end up with FD_TEMPORARY fd.c segments
here...so maybe temp_file_limit isn't broken, because, it turns out,
0007-hj-shared-buf-file-v6.patch is still *halfway* buying into the
conventional idea of file temp-ness. But buffile.c if left behind
("file->isTemp == false"), so AFAICT it must still be broken simply
because new segments cannot be written, per BufFileCreate() comments
quoted above.

This has become more of a PHJ review, which is off-topic for this
thread. But my observations illustrate the difficulty with tying
resource manager resources in local memory (temp segments, and
associated BufFile(s)) with clean-up at shared memory segment
detachment. It's possible, but ticklish enough that you'll end up with
some wart or other when you're done. The question, then, is what wart
is the most acceptable for parallel tuplesort? I see two plausible
paths forward right now:

1. Selectively suppress spurious ENOENT-on-unlink() LOG message in the
brief window where it might be spurious. This is the easiest option.
(There is less code this way.)

2. Do more or less what I've already drafted as my V9, which is
described in opening mail to this thread, while accepting the ugliness
that that approach brings with it.

[1] postgr.es/m/CAM3SWZRWdNtkhiG0GyiX_1mUAypiK3dV6-6542pYe2iEL-foTA@mail.gmail.com
-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Joe Conway
Date:
Subject: Re: [HACKERS] Upgrading postmaster's log messages about bind/listenerrors
Next
From: Andres Freund
Date:
Subject: Re: [HACKERS] exposing wait events for non-backends (was: Trackingwait event for latches)