[HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management - Mailing list pgsql-hackers

From Peter Geoghegan
Subject [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management
Msg-id CAH2-WznJ_UgLux=_jTgCQ4yFz0iBntudsNKa1we3kN1BAG=88w@mail.gmail.com
List pgsql-hackers
Recap
=====

A design goal of parallel tuplesort is that the parallel case stay as
close as possible to the existing serial case. Very little new code
is needed in tuplesort.c and logtape.c, most of which is about using a
new lower-level facility that is well encapsulated. Even buffile.c
"unification", the guts of everything, doesn't have many new special
cases.

So, workers start with "unifiable" BufFiles, which differ very little
from conventional temp BufFiles -- they just create their filenames in
a deterministic fashion, making them discoverable to the leader
process through a process of unification (plus there is a small amount
of refcount state in shared memory). In principle, workers could
decide not to go through with unification after all, and close their
unifiable temp files without doing anything for the leader, and that
would be just fine. By the same token, the unified BufFile owned by
the leader can be extended if need be, in a way that is virtually
indistinguishable from extending the end of any other BufFile.
logtape.c recycling for future randomAccess cases works just the same
as before.
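
To make "discoverable" concrete, here is a rough sketch of the idea
(the helper and naming scheme are illustrative only, not the actual
patch code): a worker derives its file name purely from state that all
participants agree on, so the leader can later reconstruct the same
name during unification.

#include <stdio.h>

static void
make_unifiable_name(char *dst, size_t dstlen,
                    unsigned long fileset_id, int worker_num)
{
    /* Hypothetical naming scheme; the actual patch may differ */
    snprintf(dst, dstlen, "pgsql_tmp.%lu.%d", fileset_id, worker_num);
}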

I think that this is a nice state of affairs, since having no special
cases will tend to make the code robust. Applying leader-wise offsets
within logtape.c is done at one precise choke-point (one function),
and so it seems very likely that logtape.c routines that CREATE INDEX
happens to not use will "just work" when parallel tuplesort has more
clients at some point in the future. A logical tapeset doesn't even
remember specifically that it is using unification (unless you count
that it has a non-zero offset to apply).
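
As a rough illustration of that choke-point (identifiers here are made
up for the sketch, not the actual logtape.c symbols):

typedef struct LogicalTapeSetOffsets
{
    long    unified_offset;     /* 0 unless this tapeset was unified */
} LogicalTapeSetOffsets;

/* Every block number logtape.c hands to buffile.c passes through here */
static long
ltsTranslateBlock(LogicalTapeSetOffsets *lts, long blocknum)
{
    return blocknum + lts->unified_offset;
}

A tapeset with a zero offset behaves exactly as before, which is why
the serial path is untouched.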

V8 of the parallel tuplesort patch added a refcount mechanism that is
associated with a "unified" BufFile, consisting of $nworker unifiable
BufFiles, with metadata in shared memory per worker/unifiable BufFile.
We have a refcount per unifiable BufFile, not per fd.c segment. The
refcount thing lets worker processes go away as soon as the leader
begins its final on-the-fly merge -- the leader assumes ownership of
the BufFiles/segments initially owned only by workers. There is a very
brief period of co-ownership, which occurs during unification in the
leader.
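
The per-worker shared memory metadata amounts to something like the
following sketch (type and field names are illustrative only):

#include "postgres.h"
#include "port/atomics.h"

typedef struct UnifiableBufFileShared
{
    pg_atomic_uint32    refcount;   /* held by worker and, briefly, leader */
    int                 nsegments;  /* fd.c segments written by the worker */
} UnifiableBufFileShared;

/* Drop one reference; returns true if caller was the last holder */
static bool
ReleaseUnifiableBufFile(UnifiableBufFileShared *shared)
{
    return pg_atomic_sub_fetch_u32(&shared->refcount, 1) == 0;
}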

Concern
=======

My pending V9 of the patch has to solve a problem that V8 had, namely
that it's not very nice that an error in the leader (e.g. palloc()
failure) during this brief period of co-ownership will make the
resource manager attempt a double unlink() (one from the worker,
another from the leader). There is no danger of data loss, but it's
ugly. The general consensus is that the way to handle it is with an
on_dsm_detach() callback on the top-level parallel context's shared
memory segment. I have already written some rough code that does all
this, and demonstrably removes the double unlink() hazard, but I have
some outstanding concerns, that I'd like to clear up.
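
Registering the callback itself is the easy part; roughly like this
(on_dsm_detach() and ParallelContext are the real interfaces, but the
callback name and argument layout are only illustrative):

#include "postgres.h"

#include "access/parallel.h"
#include "storage/dsm.h"

/* callback body is sketched further down */
static void unified_buffile_detach(dsm_segment *seg, Datum arg);

static void
register_unified_cleanup(ParallelContext *pcxt, void *leader_state)
{
    /*
     * Fires at most once, when the leader detaches from the parallel
     * context's segment (including during error cleanup), so the
     * unlink() happens exactly once even if the leader errors out
     * mid-unification.
     */
    on_dsm_detach(pcxt->seg, unified_buffile_detach,
                  PointerGetDatum(leader_state));
}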

We have at least 3 different "resource contexts" in play in this rough code:

1. The DSM segment.

2. fd.c segment clean-up (i.e., cleanup that falls to the
ResourceOwner stored within buffile.c for BufFile segments).

3. The memory context in which BufFile stuff is allocated.

I am bridging the gap between local memory and shared memory here, in
essence. On the one hand, local state like virtual FDs is what we need
to mark as non-temp, so that the second unlink() is never attempted
when fd.c resource cleanup runs, which comes last of all (per the
resource manager). We do need to touch refcounts in shared memory too,
but shared memory holds only some of the relevant state, and that
seems like something to avoid changing, because we want to keep all of
those nice advantages I went into above. On the other hand, it seems
wrong to jump from shared memory to local memory in my on_dsm_detach()
callback, though that's what works for me right now. The callback is
only registered in the leader process, and only really needs to be
around for as long as there is a brief period of co-ownership of
BufFiles.
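
For concreteness, the arrangement that bothers me looks roughly like
this (field names invented for this sketch, not the patch's):

#include "postgres.h"
#include "port/atomics.h"

typedef struct UnifiedBufFileEntry
{
    pg_atomic_uint32    refcount;       /* per unifiable BufFile, as above */
    void               *leader_buffile; /* leader-local BufFile pointer, or
                                         * NULL once BufFileClose() has run;
                                         * meaningless in worker processes */
} UnifiedBufFileEntry;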

Say the caller never gets around to a BufFileClose() call. That's a
bug anyway, and will produce a temp file reference leak warning today.
But this change turns that bug into a hard crash, which is a more
serious bug. This happens because the on_dsm_detach() callback ends up
calling BufFileClose() against a pointer to garbage (freed local
memory). Had BufFileClose() been called before resource cleanup
invoked the callback, as it should be anyway, then it would have set
the pointer to local memory (stored in *shared* memory) to NULL. No
hard crash would happen from the callback, which returns early, never
spuriously calling BufFileClose() a second time within the leader.
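
Continuing the sketch above, the callback's guard against that hazard
looks roughly like this (again, illustrative names only):

#include "postgres.h"

#include "storage/buffile.h"
#include "storage/dsm.h"

static void
unified_buffile_detach(dsm_segment *seg, Datum arg)
{
    UnifiedBufFileEntry *entry = (UnifiedBufFileEntry *) DatumGetPointer(arg);

    /*
     * In the normal case BufFileClose() has already cleared this pointer,
     * and there is nothing left to do.  If the caller never called
     * BufFileClose(), the pointer may now reference freed local memory,
     * which is exactly the leak-turned-crash hazard described above.
     */
    if (entry->leader_buffile == NULL)
        return;

    BufFileClose((BufFile *) entry->leader_buffile);
    entry->leader_buffile = NULL;
}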

ISTM that even today, the buffile.c caller's memory context tacitly
shares the lifetime of the caller's ResourceOwner. Otherwise,
everything would break. It doesn't seem much of a stretch to assume
that the DSM segment callback implicitly has the opportunity to "go
first", since ISTM that things are ordered within AbortTransaction()
and other similar places such that that always happens. Testing shows
my rough code
does the right thing when an error is thrown from a variety of places.
But:

* I'm still stashing a pointer to local memory in shared memory,
purely out of convenience, which is just weird.

* It seems wrong that a piece of infrastructure pretty far removed
from the core transaction manager code relies on ordering like that.
It's not as if the BufFile contract requires that the segment
originate from the parallel infrastructure at all, though that's what
I do right this minute. And it seems wrong that the memory had better
not have been freed already, through a normal pfree() or similar,
without the caller actually going through BufFileClose(), which knows
to unset the pointer within shared memory (or, alternatively, without
the callback having fired before local memory is freed, in the case
where the callback "really needs to fire").

I certainly don't grok resource managers like Robert does, so a much
better approach might be right in front of me. The "obvious solution"
is to have a lot more state in shared memory, but per-segment state
needs to grow unpredictably. Plus, that would be a big, risky change
to make for only a small benefit, at least as far as parallel
tuplesort goes. Thomas does require something significantly more
sophisticated for his parallel hash join patch, but it looks
increasingly likely that that will end up being specialized to the
problem he is trying to solve.

I'm also slightly tempted to hard-code BufFiles as a new type of
resource that a resource manager can take ownership of, but that also
seems unappealing.

What I've come up with may be as robust as anything will be for
parallel CREATE INDEX alone, but I want to have confidence that any
assumptions made are clear, and are correct.

Thanks
-- 
Peter Geoghegan


