Re: [HACKERS] WIP: [[Parallel] Shared] Hash - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: [HACKERS] WIP: [[Parallel] Shared] Hash
Date
Msg-id CAEepm=2ug5CZ7S9c6yqaDH0er9oLNH=eXa7WzkFHrBV_FDmTag@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] WIP: [[Parallel] Shared] Hash  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: [HACKERS] WIP: [[Parallel] Shared] Hash  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
Hi,

Here is a new version addressing feedback from Peter and Andres.
Please see below.

On Wed, Mar 22, 2017 at 3:18 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Mar 21, 2017 at 5:07 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>>> buffile.c should stop pretending to care about anything other than
>>> temp files, IMV. 100% of all clients that want temporary files go
>>> through buffile.c. 100% of all clients that want non-temp files (files
>>> which are not marked FD_TEMPORARY) access fd.c directly, rather than
>>> going through buffile.c.
>>
>> I still need BufFile because I want buffering.
>>
>> There are 3 separate characteristics enabled by flags with 'temporary'
>> in their name.  I think we should consider separating the concerns by
>> splitting and renaming them:
>>
>> 1.  Segmented BufFile behaviour.  I propose renaming BufFile's isTemp
>> member to isSegmented, because that is what it really does.   I want
>> that feature independently without getting confused about lifetimes.
>> Tested with small MAX_PHYSICAL_FILESIZE as you suggested.
>
> I would have proposed to get rid of the isTemp field entirely. It is
> always true with current usage, any only #ifdef NOT_USED code presumes
> that it could be any other way. BufFile is all about temp files, which
> ISTM should be formalized. The whole point of BufFile is to segment
> fd.c temp file segments. Who would ever want to use BufFile without
> that capability anyway?

Yeah, it looks like you're probably right, but I guess others could
have uses for BufFile that we don't know about.  It doesn't seem like
it hurts to leave the variable in existence.

>> 2.  The temp_file_limit system.  Currently this applies to fd.c files
>> opened with FD_TEMPORARY.  You're right that we shouldn't be able to
>> escape that sanity check on disk space just because we want to manage
>> disk file ownership differently.  I propose that we create a new flag
>> FD_TEMP_FILE_LIMIT that can be set independentlyisTemp of the flags
>> controlling disk file lifetime.  When working with SharedBufFileSet,
>> the limit applies to each backend in respect of files it created,
>> while it has them open.  This seems a lot simpler than any
>> shared-temp-file-limit type scheme and is vaguely similar to the way
>> work_mem applies in each backend for parallel query.
>
> I agree that that makes sense as a user-visible behavior of
> temp_file_limit. This user-visible behavior is what I actually
> implemented for parallel CREATE INDEX.

Ok, good.

>> 3.  Delete-on-close/delete-at-end-of-xact.  I don't want to use that
>> facility so I propose disconnecting it from the above.  We c{ould
>> rename those fd.c-internal flags FD_TEMPORARY and FD_XACT_TEMPORARY to
>> FD_DELETE_AT_CLOSE and FD_DELETE_AT_EOXACT.
>
> This reliably unlink()s all files, albeit while relying on unlink()
> ENOENT as a condition that terminates deletion of one particular
> worker's BufFile's segments. However, because you effectively no
> longer use resowner.c, ISTM that there is still a resource leak in
> error paths. ResourceOwnerReleaseInternal() won't call FileClose() for
> temp-ish files (that are not quite temp files in the current sense) in
> the absence of no other place managing to do so, such as
> BufFileClose(). How can you be sure that you'll actually close() the
> FD itself (not vFD) within fd.c in the event of an error? Or Delete(),
> which does some LRU maintenance for backend's local VfdCache?

Yeah, I definitely need to use resowner.c.  The only thing I want to
opt out of is automatic file deletion in that code path.

> If I follow the new code correctly, then it doesn't matter that you've
> unlink()'d to take care of the more obvious resource management chore.
> You can still have a reference leak like this, if I'm not mistaken,
> because you still have backend local state (local VfdCache) that is
> left totally decoupled with the new "shadow resource manager" for
> shared BufFiles.

You're right.  The attached version fixes these problems.  The
BufFiles created or opened in this new way now participate in both of
our leak-detection and clean-up schemes:  the one in resowner.c
(because I'm now explicitly registering with it as I had failed to do
before) and the one in CleanupTempFiles (because FD_CLOSE_AT_EOXACT is
set, which I already had in the previous version for the creator, but
not the opener of such a file).  I tested by commenting out my
explicit BufFileClose calls to check that resowner.c starts
complaining, and then by commenting out the resowner registration too
to check that CleanupTempFiles starts complaining.

>> As shown in 0008-hj-shared-buf-file-v8.patch.  Thoughts?
>
> A less serious issue I've also noticed is that you add palloc() calls,
> implicitly using the current memory context, within buffile.c.
> BufFileOpenTagged() has some, for example. However, there is a note
> that we don't need to save the memory context when we open a BufFile
> because we always repalloc(). That is no longer the case here.

I don't see a problem here.  BufFileOpenTagged() is similar to
BufFileCreateTemp() which calls makeBufFile() and thereore returns a
result that is allocated in the current memory context.  This seems
like the usual deal.

Thanks for the review!

On Wed, Mar 22, 2017 at 1:07 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Thu, Feb 16, 2017 at 3:36 PM, Andres Freund <andres@anarazel.de> wrote:
>> I think the synchronization protocol with the various phases needs to be
>> documented somewhere.  Probably in nodeHashjoin.c's header.
>
> I will supply that shortly.

Added in the attached version.

-- 
Thomas Munro
http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

pgsql-hackers by date:

Previous
From: Pavan Deolasee
Date:
Subject: Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
Next
From: Etsuro Fujita
Date:
Subject: Re: [HACKERS] Push down more UPDATEs/DELETEs in postgres_fdw