
From Thomas Munro
Subject Re: [HACKERS] Parallel Hash take II
Msg-id CAEepm=1zWGPpok6BKCZyzm46jz8Zy1fR9+tWcnHPMF6DKgBwiA@mail.gmail.com
In response to Re: [HACKERS] Parallel Hash take II  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
Hi Peter,

See responses to a couple of points below.  I'll respond to the other
points separately (i.e. with code/comment changes).

On Wed, Nov 8, 2017 at 10:32 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Nov 7, 2017 at 1:01 PM, Andres Freund <andres@anarazel.de> wrote:
>> +/*
>> + * Delete a BufFile that was created by BufFileCreateShared in the given
>> + * SharedFileSet using the given name.
>> + *
>> + * It is not necessary to delete files explicitly with this function.  It is
>> + * provided only as a way to delete files proactively, rather than waiting for
>> + * the SharedFileSet to be cleaned up.
>> + *
>> + * Only one backend should attempt to delete a given name, and should know
>> + * that it exists and has been exported or closed.
>> + */
>
> This part is new to me. We now want one backend to delete a given
> filename. What changed? Please provide a Message-Id reference if that
> will help me to understand.
>
> For now, I'm going to guess that this development had something to do
> with the need to deal with virtual FDs that do a close() on an FD to
> keep under backend limits. Do I have that right?

No -- this is simply an option available to client code that wants to
clean up individual temporary files earlier.  Using it is entirely
optional, and any client code that does is responsible for meeting the
synchronisation requirements described in the comment.  For example,
SharedTuplestore is backed by multiple files, and it could take the
opportunity to delete individual files after they've been scanned, if
you told it you were going to scan only once
(SHARED_TUPLESTORE_SINGLE_PASS) and it could prove that no other
backend could still need to read from them, as mentioned in a comment
in sts_end_parallel_scan().
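
To make that concrete, the kind of thing I had in mind is sketched
below.  This is not in the patch, just an illustration:
sts_last_reader() and sts_participant_filename() are made-up helpers
and the field names are approximate; only BufFileDeleteShared() is the
real new interface.

static void
sts_maybe_delete_file(SharedTuplestoreAccessor *accessor, int participant)
{
    char        name[MAXPGPATH];

    /*
     * Only safe if the caller promised a single pass and we can prove
     * that no other backend could still need to read this participant's
     * file.
     */
    if ((accessor->sts->flags & SHARED_TUPLESTORE_SINGLE_PASS) != 0 &&
        sts_last_reader(accessor, participant))     /* made-up check */
    {
        /* made-up helper that builds the per-participant file name */
        sts_participant_filename(name, sizeof(name), accessor, participant);
        BufFileDeleteShared(accessor->fileset, name);
    }
}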

However, since I changed to the page/chunk-based model (spinlock while
advancing a block counter, mimicking Parallel Seq Scan) instead of the
earlier Fisher Price version (LWLock while reading each tuple, with a
shared read head maintained with tell/seek), I don't actually do that.
Hitting the end of the file no longer means that no one else is
reading from it: someone else might still be reading an earlier chunk
even though you've finished reading the final chunk, and the VFD
system means that they must be free to close and reopen the file at
any time.  In the current patch version the files are cleaned up
wholesale at two times: SharedFileSet cleanup triggered by DSM
destruction, and SharedFileSet reset triggered by rescan.
Practically, it's always the former.  It's vanishingly rare that you'd
actually want to rescan a Parallel Hash that spills to disk, but in
that case we delete the old files and recreate them, and that case is
covered by the regression tests.
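
For reference, the chunk-claiming step in the current model is
conceptually just the following (a simplified sketch; field and
constant names are approximate):

static bool
sts_claim_next_chunk(SharedTuplestoreParticipant *p, BlockNumber *chunk)
{
    bool        success = true;

    /*
     * Advance the shared block counter under a spinlock, as in
     * Parallel Seq Scan's block allocator.
     */
    SpinLockAcquire(&p->mutex);
    if (p->read_page < p->npages)
    {
        *chunk = p->read_page;
        p->read_page += STS_CHUNK_PAGES;
    }
    else
        success = false;        /* nothing left for this reader to claim */
    SpinLockRelease(&p->mutex);

    return success;
}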

If it bothers you that I have an API there that I'm not actually using
yet, I will remove it.

>> +       if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
>> +       {
>> +               /* Subtract its size from current usage (do first in case of error) */
>> +               temporary_files_size -= vfdP->fileSize;
>> +               vfdP->fileSize = 0;
>> +       }
>>
>> So, is it right to do so unconditionally and without regard for errors?
>> If the file isn't deleted, it shouldn't be subtracted from fileSize. I
>> guess you're managing that through the flag, but that's not entirely
>> obvious.
>
> I think that the problem here is that the accounting is expected to
> always work. It's not like there is a resowner style error path in
> which temporary_files_size gets reset to 0.

But there is a resowner error path in which File handles get
automatically closed and temporary_files_size gets adjusted.  The
counter goes up when you create a file, and goes down when you close
it or the resowner closes it for you.  Eventually you either close it
and the bookkeeping is consistent, or you crash and it doesn't matter.
Any freak multiple-close attempt is guarded against by setting
fileSize to 0 so we can't double-subtract.  Do you see a bug?
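
Spelled out, the intended bookkeeping is symmetric.  Here's a sketch
of the logic rather than the literal fd.c code; "newsize" just stands
for the file's size after a write:

/* When a temporary file grows: charge the difference against the limit. */
if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
{
    temporary_files_size += newsize - vfdP->fileSize;
    vfdP->fileSize = newsize;
}

/*
 * When it is closed or deleted (explicitly or via resowner cleanup):
 * give the whole amount back, and zero fileSize so that a second close
 * attempt can't subtract it again.
 */
if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
{
    temporary_files_size -= vfdP->fileSize;
    vfdP->fileSize = 0;
}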

None of this has any impact on whether files are leaked: either
SharedFileSet removes the files, or you crash (or take a filesystem
snapshot, etc) and RemovePgTempFiles() mops them up at the next clean
startup.

-- 
Thomas Munro
http://www.enterprisedb.com

