Re: WIP: [[Parallel] Shared] Hash - Mailing list pgsql-hackers

From Andres Freund
Subject Re: WIP: [[Parallel] Shared] Hash
Date
Msg-id 20170326204128.ye26eywdqyha52on@alap3.anarazel.de
In response to Re: [HACKERS] WIP: [[Parallel] Shared] Hash  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses Re: WIP: [[Parallel] Shared] Hash  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
Hi,


SharedBufFile allows temporary files to be created by one backend and
then exported for read-only access by other backends, with clean-up
managed by reference counting associated with a DSM segment.  This includes
changes to fd.c and buffile.c to support new kinds of temporary file.


diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 4ca0ea4..a509c05 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c

I think the new facilities should be explained in the file's header.


@@ -68,9 +71,10 @@ struct BufFile
      * avoid making redundant FileSeek calls.
      */
-    bool        isTemp;            /* can only add files if this is TRUE */
+    bool        isSegmented;    /* can only add files if this is TRUE */

That's a bit of a weird and uncommented upon change.


@@ -79,6 +83,8 @@ struct BufFile
      */
     ResourceOwner resowner;
+    BufFileTag    tag;            /* for discoverability between backends */

Not perfectly happy with the name tag here, the name is a bit too
similar to BufferTag - something quite different.



+static void
+make_tagged_path(char *tempdirpath, char *tempfilepath,
+                 const BufFileTag *tag, int segment)
+{
+    if (tag->tablespace == DEFAULTTABLESPACE_OID ||
+        tag->tablespace == GLOBALTABLESPACE_OID)
+        snprintf(tempdirpath, MAXPGPATH, "base/%s", PG_TEMP_FILES_DIR);
+    else
+    {
+        snprintf(tempdirpath, MAXPGPATH, "pg_tblspc/%u/%s/%s",
+                 tag->tablespace, TABLESPACE_VERSION_DIRECTORY,
+                 PG_TEMP_FILES_DIR);
+    }
+
+    snprintf(tempfilepath, MAXPGPATH, "%s/%s%d.%d.%d.%d.%d", tempdirpath,
+             PG_TEMP_FILE_PREFIX,
+             tag->creator_pid, tag->set, tag->partition, tag->participant,
+             segment);

Is there a risk that this ends up running afoul of filename length
limits on some platforms?

+}
+
+static File
+make_tagged_segment(const BufFileTag *tag, int segment)
+{
+    File        file;
+    char        tempdirpath[MAXPGPATH];
+    char        tempfilepath[MAXPGPATH];
+
+    /*
+     * There is a remote chance that disk files with this (pid, set) pair
+     * already exist after a crash-restart.  Since the presence of
+     * consecutively numbered segment files is used by BufFileOpenShared to
+     * determine the total size of a shared BufFile, we'll defend against
+     * confusion by unlinking segment 1 (if it exists) before creating segment
+     * 0.
+     */

Gah.  Why on earth aren't we removing temp files when restarting, not
just on the initial start?  That seems completely wrong?


If we do decide not to change this: Why is that sufficient? Doesn't the
same problem exist for segments later than the first?

+/*
+ * Open a file that was previously created in another backend with
+ * BufFileCreateShared.
+ */
+BufFile *
+BufFileOpenTagged(const BufFileTag *tag)
+{
+    BufFile    *file = (BufFile *) palloc(sizeof(BufFile));
+    char        tempdirpath[MAXPGPATH];
+    char        tempfilepath[MAXPGPATH];
+    Size        capacity = 1024;
+    File       *files = palloc(sizeof(File) * capacity);
+    int            nfiles = 0;
+
+    /*
+     * We don't know how many segments there are, so we'll probe the
+     * filesystem to find out.
+     */
+    for (;;)
+    {
+        /* See if we need to expand our file space. */
+        if (nfiles + 1 > capacity)
+        {
+            capacity *= 2;
+            files = repalloc(files, sizeof(File) * capacity);
+        }
+        /* Try to load a segment. */
+        make_tagged_path(tempdirpath, tempfilepath, tag, nfiles);
+        files[nfiles] = PathNameOpenTemporaryFile(tempfilepath);
+        if (files[nfiles] <= 0)
+            break;

Isn't 0 a theoretically valid return value from
PathNameOpenTemporaryFile?

+/*
+ * Delete a BufFile that was created by BufFileCreateTagged.  Return true if
+ * at least one segment was deleted; false indicates that no segment was
+ * found, or an error occurred while trying to delete.  Errors are logged but
+ * the function returns normally because this is assumed to run in a clean-up
+ * path that might already involve an error.
+ */
+bool
+BufFileDeleteTagged(const BufFileTag *tag)
+{
+    char        tempdirpath[MAXPGPATH];
+    char        tempfilepath[MAXPGPATH];
+    int            segment = 0;
+    bool        found = false;
+
+    /*
+     * We don't know if the BufFile really exists, because SharedBufFile
+     * tracks only the range of file numbers.  If it does exist, we don't
+     * know how many 1GB segments it has, so we'll delete until we hit ENOENT or
+     * an IO error.
+     */
+    for (;;)
+    {
+        make_tagged_path(tempdirpath, tempfilepath, tag, segment);
+        if (!PathNameDeleteTemporaryFile(tempfilepath, false))
+            break;
+        found = true;
+        ++segment;
+    }
+
+    return found;
+}

If we crash in the middle of this, we'll leave the later files abandoned,
no?


+/*
+ * BufFileSetReadOnly --- flush and make read-only, in preparation for sharing
+ */
+void
+BufFileSetReadOnly(BufFile *file)
+{
+    BufFileFlush(file);
+    file->readOnly = true;
+}

That flag is unused, right?


+ * PathNameCreateTemporaryFile, PathNameOpenTemporaryFile and
+ * PathNameDeleteTemporaryFile are used for temporary files that may be shared
+ * between backends.  A File created or opened with these functions is not
+ * automatically deleted when the file is closed, but it is automatically
+ * closed at end of transaction and counts against the temporary file limit of
+ * the backend that created it.  Any File created this way must be explicitly
+ * deleted with PathNameDeleteTemporaryFile.  Automatic file deletion is not
+ * provided because this interface is designed for use by buffile.c and
+ * indirectly by sharedbuffile.c to implement temporary files with shared
+ * ownership and cleanup.

Hm. Those names are pretty easy to misunderstand, no? s/Temp/Shared/?


+/*
+ * Called whenever a temporary file is deleted to report its size.
+ */
+static void
+ReportTemporaryFileUsage(const char *path, off_t size)
+{
+    pgstat_report_tempfile(size);
+
+    if (log_temp_files >= 0)
+    {
+        if ((size / 1024) >= log_temp_files)
+            ereport(LOG,
+                    (errmsg("temporary file: path \"%s\", size %lu",
+                            path, (unsigned long) size)));
+    }
+}

Man, the code for this sucks (not your fault).  Shouldn't this properly
be at the buffile.c level, where we could implement limits above 1GB
properly?


+/*
+ * Open a file that was created with PathNameCreateTemporaryFile in another
+ * backend.  Files opened this way don't count against the temp_file_limit of
+ * the caller, are read-only and are automatically closed at the end of the
+ * transaction but are not deleted on close.
+ */

This really reinforces my issues with the naming scheme.  This ain't a
normal tempfile.

+File
+PathNameOpenTemporaryFile(char *tempfilepath)
+{
+    File file;
+
+    /*
+     * Open the file.  Note: we don't use O_EXCL, in case there is an orphaned
+     * temp file that can be reused.
+     */
+    file = PathNameOpenFile(tempfilepath, O_RDONLY | PG_BINARY, 0);

If so, wouldn't we need to truncate the file?


+ * A single SharedBufFileSet can manage any number of 'tagged' BufFiles that
+ * are shared between a fixed number of participating backends.  Each shared
+ * BufFile can be written to by a single participant but can be read by any
+ * backend after it has been 'exported'.  Once a given BufFile is exported, it
+ * becomes read-only and cannot be extended.  To create a new shared BufFile,
+ * a participant needs its own distinct participant number, and needs to
+ * specify an arbitrary partition number for the file.  To make it available
+ * to other backends, it must be explicitly exported, which flushes internal
+ * buffers and renders it read-only.  To open a file that has been shared, a
+ * backend needs to know the number of the participant that created the file,
+ * and the partition number.  It is the responsibility of calling code to ensure
+ * that files are not accessed before they have been shared.

Hm. One way to make this safer would be to rename files when exporting.
Should be sufficient to do this to the first segment, I guess.


+ * Each file is identified by a partition number and a participant number, so
+ * that a SharedBufFileSet can be viewed as a 2D table of individual files.

I think using "files" as a term here is a bit dangerous - they're
individually segmented again, right?



+/*
+ * The number of bytes of shared memory required to construct a
+ * SharedBufFileSet.
+ */
+Size
+SharedBufFileSetSize(int participants)
+{
+    return offsetof(SharedBufFileSet, participants) +
+        sizeof(SharedBufFileParticipant) * participants;
+}

The function name sounds a bit like a function actually setting some
size...  s/Size/DetermineSize/?


+/*
+ * Create a new file suitable for sharing.  Each backend that calls this must
+ * use a distinct participant number.  Behavior is undefined if a participant
+ * calls this more than once for the same partition number.  Partitions should
+ * ideally be numbered consecutively or in as small a range as possible,
+ * because file cleanup will scan the range of known partitions looking for
+ * files.
+ */

Wonder if we shouldn't just create a directory for all such files.


I'm a bit unhappy with the partition terminology around this. It's
getting a bit confusing. We have partitions, participants and
segments. Most of them could be understood as something entirely
different from the meaning you have here...


+static void
+shared_buf_file_on_dsm_detach(dsm_segment *segment, Datum datum)
+{
+    bool unlink_files = false;
+    SharedBufFileSet *set = (SharedBufFileSet *) DatumGetPointer(datum);
+
+    SpinLockAcquire(&set->mutex);
+    Assert(set->refcount > 0);
+    if (--set->refcount == 0)
+        unlink_files = true;
+    SpinLockRelease(&set->mutex);

I'm a bit uncomfortable with releasing a refcount, and then still using
the memory from the set...  I don't think there's a concrete danger
here as the code stands, but it's a fairly dangerous pattern.


- Andres


