Re: Dynamic Shared Memory stuff

From: Noah Misch
Subject: Re: Dynamic Shared Memory stuff
Date: 2014-01-21 19:58:21
Msg-id: 20140121195821.GB1929456@tornado.leadboat.com
In response to: Re: Dynamic Shared Memory stuff (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Wed, Dec 18, 2013 at 12:21:08PM -0500, Robert Haas wrote:
> On Tue, Dec 10, 2013 at 6:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > The larger point is that such a shutdown process has never in the history
> > of Postgres been successful at removing shared-memory (or semaphore)
> > resources.  I do not feel a need to put a higher recoverability standard
> > onto the DSM code than what we've had for fifteen years with SysV shmem.
> >
> > But, by the same token, it shouldn't be any *less* recoverable.  In this
> > context that means that starting a new postmaster should recover the
> > resources, at least 90% of the time (100% if we still have the old
> > postmaster lockfile).  I think the idea of keeping enough info in the
> > SysV segment to permit recovery of DSM resources is a good one.  Then,
> > any case where the existing logic is able to free the old SysV segment
> > is guaranteed to also work for DSM.
> 
> So I'm taking a look at this.  There doesn't appear to be anything
> intrinsically intractable here, but it seems important to order the
> operations so as to minimize the chances that things fail in
> awkward places.  The point where we remove the old shared memory
> segment from the previous postmaster invocation is here:
> 
>         /*
>          * The segment appears to be from a dead Postgres process, or from a
>          * previous cycle of life in this same process.  Zap it, if possible.
>          * This probably shouldn't fail, but if it does, assume the segment
>          * belongs to someone else after all, and continue quietly.
>          */
>         shmdt(memAddress);
>         if (shmctl(shmid, IPC_RMID, NULL) < 0)
>             continue;
> 
> My first thought was to remember the control segment ID from the
> header just before the shmdt() there, and then later when the DSM
> module is starting, do cleanup.  But that doesn't seem like a good
> idea, because then if startup fails after we remove the old shared
> memory segment and before we get to the DSM initialization code, we'll
> have lost the information on what control segment needs to be cleaned
> up.  A subsequent startup attempt won't see the old shm again, because
> it's already gone.  I'm fairly sure that this would be a net reduction
> in reliability vs. the status quo.
> 
> So now what I'm thinking is that we ought to actually perform the DSM
> cleanup before detaching the old segment and trying to remove it.
> That shouldn't fail, but if it does, or if we get killed before
> completing it, the next run will hopefully find the same old shm and
> finish the cleanup.  But that kind of flies in the face of the comment
> above: if we perform DSM cleanup and then discover that the segment
> wasn't ours after all, that means we just stomped all over somebody
> else's stuff.  That's not too good.  But trying to remove the segment
> first and then perform the cleanup creates a window where, if we get
> killed, the next restart will have lost information about how to
> finish cleaning up.  So it seems that some kind of logic tweak is
> needed here, but I'm not sure exactly what.  As I see it, the options
> are:
> 
> 1. Make failure to remove the shared memory segment we thought was
> ours an error.  This will definitely show up any problems, but only
> after we've burned down some other process's dynamic shared memory
> segments.  The most likely outcome is creating failure-to-start
> problems that don't exist today.
> 
> 2. Make it a warning.  We'll still burn down somebody else's DSMs, but
> at least we'll still start ourselves.  Sadly, "WARNING: You have just
> done a bad thing.  It's too late to fix it.  Sorry!" is not very
> appealing.

It has long been the responsibility of PGSharedMemoryCreate() to determine
that a segment is unimportant before calling IPC_RMID.  The success or failure
of IPC_RMID is an unreliable guide to the correctness of that determination.
IPC_RMID will succeed on an important segment owned by the same UID, and it
will fail if some other process removed the segment after our shmat().  Such a
failure does not impugn our having requested DSM cleanup on the basis of the
PGShmemHeader we did read, so an apologetic WARNING would be wrong.
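
For concreteness, here is a condensed sketch of that determination, loosely
modeled on the sysv_shmem.c logic (simplified, not the verbatim source):

#include <errno.h>
#include <signal.h>
#include <unistd.h>

/*
 * Condensed sketch, not verbatim sysv_shmem.c: how a colliding segment
 * is judged unimportant before IPC_RMID.  The judgment rests on header
 * contents and creator liveness, never on whether IPC_RMID succeeds.
 */
static bool
segment_looks_dead(PGShmemHeader *hdr)
{
    if (hdr->magic != PGShmemMagic)
        return false;           /* not a Postgres segment at all */
    if (hdr->creatorPID == getpid())
        return true;            /* previous cycle of life in this process */
    if (kill(hdr->creatorPID, 0) == 0 || errno != ESRCH)
        return false;           /* creator may still be running */
    return true;                /* creator is gone; segment is dead */
}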

> 3. Remove the old shared memory segment first, then perform the
> cleanup immediately afterwards.  If we get killed before completing
> the cleanup, we'll leak the un-cleaned-up stuff.  Decide that's OK and
> move on.

The period in which process exit would leak segments is narrow enough that
this seems fine.  I see no net advantage over performing the cleanup before
shmdt(), though.

> 4. Adopt some sort of belt-and-suspenders approach, keeping the state
> file we have now and backstopping it with this mechanism, so that we
> really only need this to work when $PGDATA has been blown away and
> recreated.  This seems pretty inelegant, and I'm not sure who it'll
> benefit other than those (few?) people who kill -9 the postmaster (or
> it segfaults or otherwise dies without running the code to remove
> shared memory segments) and then remove $PGDATA and then re-initdb and
> then expect cleanup to find the old DSMs.

Independent of the choice we make here, I would keep the state file.  POSIX
permits shm_open() segments to survive reboots, so the state file would come
into play after crashes on such systems.  Also, folks who use "ipcrm" after a
"kill -9" of the postmaster would benefit from the state file.  Nobody should
do that, but hey, catering to things nobody should do is what brought us here.

An option I had envisioned was to make PGSharedMemoryCreate() create a state
file based on the old sysv segment.  Then the later dsm_postmaster_startup()
would proceed normally, and a restart after a failure between shmdt() and
dsm_postmaster_startup() would also do the cleanup.  A wrinkle comes from the
fact that an existing state file could reference a different control segment;
this would happen if we picked a different sysv shm key than the last
postmaster start used.  In that case, we should presumably clean both control
segments.  Performing the cleanup per #1/#2 achieves that.
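
In rough code, that both-segments cleanup could look like the following;
dsm_cleanup_using_control_segment(), dsm_read_state_file(), and a dsm_control
field in PGShmemHeader are illustrative names here, not existing API:

/* All names below are illustrative, not existing API. */
static void
cleanup_stale_dsm_control(PGShmemHeader *old_hdr)
{
    dsm_handle  from_header = old_hdr->dsm_control;      /* assumed field */
    dsm_handle  from_statefile = dsm_read_state_file();  /* hypothetical */

    if (from_header != 0)
        dsm_cleanup_using_control_segment(from_header);

    /*
     * A prior postmaster start may have used a different sysv shm key,
     * leaving a state file that names a second control segment.
     */
    if (from_statefile != 0 && from_statefile != from_header)
        dsm_cleanup_using_control_segment(from_statefile);
}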

> 5. Give up on this approach.  We could keep what we have now, or make
> the DSM control segment land at a well-known address as we do for the
> main segment.

How would having the DSM control segment at a well-known address affect the
problem at hand?  Did you mean a well-known dsm_handle?

> What do people prefer?

I recommend performing cleanup on the control segment named in PGShmemHeader
just before shmdt() in PGSharedMemoryCreate().  No new ERROR or WARNING sites
are necessary.  Have dsm_postmaster_startup() continue to perform a cleanup on
the control segment named in the state file.
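
Concretely, the collision loop in PGSharedMemoryCreate() would gain one step
just before the existing shmdt()/IPC_RMID calls; hdr->dsm_control and
dsm_cleanup_using_control_segment() are again assumed additions, not code
that exists today:

        /*
         * The segment appears to be from a dead Postgres process, or from
         * a previous cycle of life in this same process.  First clean up
         * any dynamic shared memory its header names: if we die between
         * shmdt() and dsm_postmaster_startup(), a restart still finds the
         * old segment and finishes the job.  (hdr->dsm_control and
         * dsm_cleanup_using_control_segment() are assumed additions.)
         */
        if (hdr->dsm_control != 0)
            dsm_cleanup_using_control_segment(hdr->dsm_control);

        /* Now zap the old segment itself, if possible. */
        shmdt(memAddress);
        if (shmctl(shmid, IPC_RMID, NULL) < 0)
            continue;           /* assume it belongs to someone else */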

> The other thing I'm a bit squidgy about is injecting more code that
> runs before we get the new shared memory segment up and running.
> During that time, we don't have effective mutual exclusion, so it's
> possible that someone else could be trying to do cleanup at the same
> time we are.  That should be OK as far as the DSM code itself is
> concerned, but there may be other downsides to lengthening the period
> of time that's required to get mutual exclusion in place that I should
> be worrying about.

I see no general need to avoid small increases to that time period.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com


