Thread: Improving backend startup interlock

Improving backend startup interlock

From
Tom Lane
Date:
I have the beginnings of an idea about improving our interlock logic
for postmaster startup.  The existing method is pretty good, but we
have had multiple reports that it can fail during system boot if the
old postmaster wasn't given a chance to shut down cleanly: there's
a fair-sized chance that the old postmaster PID will have been assigned
to some other process, and that fools the interlock check.

I think we can improve matters by combining the existing checks for
old-postmaster-PID and old-shared-memory-segment into one cohesive
entity.  To do this, we must abandon the existing special case for
"private memory" when running a bootstrap or standalone backend.
Even a standalone backend will be required to get a shmem segment
just like a postmaster would.  This ensures that we can use both
parts of the safety check, even when the old holder of the data
directory interlock was a standalone backend.

Here's a sketch of the improved startup procedure:

1. Try to open and read the $PGDATA/postmaster.pid file.  If we fail
because it's not there, okay to continue, because old postmaster must
have shut down cleanly; skip to step 8.  If we fail for any other reason
(eg, permissions failure), complain and abort startup.  (Because we
write the postmaster.pid file mode 600, getting past this step
guarantees we are either the same UID as the old postmaster or root;
else we'd have failed to read the old file.  This fact justifies some
assumptions below.)

2. Extract old postmaster PID and old shared memory key from file.
(Both will now always be there, per above; abort if file contents are
not as expected.)  We do not bother with trying kill(PID, 0) anymore,
because it doesn't prove anything.

3. Try to attach to the old shared memory segment using the old key.
There are three possible outcomes:
   A: fail because it's not there.  Then we know the old postmaster
      (or standalone backend) is gone, and so are all its children.
      Okay to skip to step 7.

   B: fail for some other reason, eg permissions violation.  Because
      we know we are the same UID (or root) as before, this must
      indicate that the "old" shmem segment actually belongs to someone
      else; so we have a chance collision with someone else's shmem
      key.  Ignore the shmem segment, skip to step 7.  (In short, we
      can treat all failures alike, which is a Good Thing.)

   C: attach succeeds.  Continue to step 4.

4. Examine header of old shmem segment to see if it contains the right
magic number *and* old postmaster PID.  If not, it isn't really
a Postgres shmem segment, so ignore it; detach and skip to step 7.

5. If old shmem segment still has other processes attached to it,
abort: these must be an old postmaster and/or old backends still
alive.  (We can check nattach > 1 in the SysV case, or just assume
they are there in the hugepages-segment case that Neil wants to add.)

6. Detach from and delete the old shmem segment.  (Deletion isn't
strictly necessary, but we should do it to avoid sucking resources.)

7. Delete the old postmaster.pid file.  If this fails for any reason,
abort.  (Either we've got permissions problems or a race condition
with someone else trying to start up.)

8. Create a shared memory segment.

9. Create a new postmaster.pid file and record my PID and segment key.
If we fail to do this (with O_EXCL create), abort; someone else
must be trying to start up at the same time.  Be careful to create
the lockfile mode 600, per notes above.

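For concreteness, here is a rough C sketch of the shmem half of the
above (steps 3 through 6, SysV flavour only).  The header layout, magic
constant, and function name are made up for illustration, and most
error handling is omitted:

    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    #define OLD_SHMEM_MAGIC 0x50474253      /* hypothetical magic value */

    typedef struct
    {
        long    magic;                      /* must match OLD_SHMEM_MAGIC */
        pid_t   creator_pid;                /* PID recorded by the creator */
    } OldShmemHeader;                       /* illustrative layout only */

    /* Returns 1 if it's safe to remove the old lockfile, 0 if we must abort. */
    static int
    old_shmem_is_dead(pid_t old_pid, key_t old_key)
    {
        int             shmid;
        OldShmemHeader *hdr;
        struct shmid_ds buf;

        /* Step 3: try to find and attach the old segment by its key. */
        shmid = shmget(old_key, 0, 0);
        if (shmid < 0)
            return 1;                       /* outcomes A and B: treat alike */
        hdr = (OldShmemHeader *) shmat(shmid, NULL, 0);
        if (hdr == (void *) -1)
            return 1;                       /* likewise: not ours, ignore it */

        /* Step 4: is it really a Postgres segment left by the old postmaster? */
        if (hdr->magic != OLD_SHMEM_MAGIC || hdr->creator_pid != old_pid)
        {
            shmdt(hdr);
            return 1;
        }

        /* Step 5: anyone else still attached?  (We ourselves count as one.) */
        if (shmctl(shmid, IPC_STAT, &buf) == 0 && buf.shm_nattch > 1)
        {
            shmdt(hdr);
            return 0;                       /* old postmaster/backends still alive */
        }

        /* Step 6: detach from and delete the orphaned segment. */
        shmdt(hdr);
        shmctl(shmid, IPC_RMID, NULL);
        return 1;
    }
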
This is not quite ready for prime time yet, because it's not very
bulletproof against the scenario where two would-be postmasters are
starting concurrently.  The first one might get all the way through the
sequence before the second one arrives at step 7 --- in which case the
second one will be deleting the first one's lockfile.  Oops.  A possible
answer is to create a second lockfile that only exists for the duration
of the startup sequence, and use that to ensure that only one process is
trying this sequence at a time.  This reintroduces the same problem
we're trying to get away from (must rely on kill(PID, 0) to determine
validity of the lock file), but at least the window of vulnerability is
much smaller than before.  Does anyone see a better way?
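
A minimal sketch of that transient startup lockfile (file name made up;
note that the recorded PID is exactly the kill(PID, 0) crutch mentioned
above, should a crash leave the file behind):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int
    acquire_startup_lock(const char *path)  /* e.g. "$PGDATA/postmaster.startup" (made up) */
    {
        char    buf[32];
        int     fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);

        if (fd < 0)
            return -1;              /* someone else is mid-startup, or a stale file exists */
        snprintf(buf, sizeof(buf), "%d\n", (int) getpid());
        write(fd, buf, strlen(buf));
        close(fd);
        return 0;
    }

    /* ... perform steps 1-9 ..., then release with unlink(path). */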

A more general objection is that this approach will hardwire, even more
solidly than before, the assumption that we are using a shared-memory
API that provides identifiable shmem segments (ie, something we can
record a key for and later try to attach to).  I think some people
wanted to get away from that.  But so far I've not seen any proposal
for an alternative startup interlock that doesn't require attachable
shared memory.
        regards, tom lane


Re: Improving backend startup interlock

From
Giles Lean
Date:
Tom Lane wrote:

[ discussion of new startup interlock ]

> This is not quite ready for prime time yet, because it's not very
> bulletproof against the scenario where two would-be postmasters are
> starting concurrently.

A solution to this is to require would-be postmasters to obtain an
exclusive lock on a lock file before touching the pid file.  (The lock
file perhaps could be the pid file, but it doesn't have to be.)

Is there some reason that file locking is not acceptable?  Is there
any platform or filesystem supported for use with PostgreSQL which
doesn't have working exclusive file locking?

> A possible answer is to create a second lockfile that only exists
> for the duration of the startup sequence, and use that to ensure
> that only one process is trying this sequence at a time.
> ...
> This reintroduces the same problem
> we're trying to get away from (must rely on kill(PID, 0) to determine
> validity of the lock file), but at least the window of vulnerability is
> much smaller than before.

A lock file locked for the whole time the postmaster is running can be
responsible for preventing multiple postmasters running without
relying on pids.  All that is needed is that the OS drop exclusive
file locks on process exit and that locks not survive across reboots.

The checks of the shared memory segment (number of attachments etc.)
look after orphaned backend processes, per the proposal.
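
A minimal fcntl()-based sketch of that scheme, with made-up names, and
assuming (as POSIX specifies) that the kernel releases the lock when
the holding process exits:

    #include <fcntl.h>
    #include <unistd.h>

    /* Returns an fd holding an exclusive lock on the lock file, or -1 if
     * another postmaster already holds it (or on error). */
    static int
    hold_data_dir_lock(const char *lockpath)    /* e.g. "$PGDATA/postmaster.lock" (made up) */
    {
        struct flock lk;
        int     fd = open(lockpath, O_RDWR | O_CREAT, 0600);

        if (fd < 0)
            return -1;

        lk.l_type = F_WRLCK;        /* exclusive (write) lock */
        lk.l_whence = SEEK_SET;
        lk.l_start = 0;
        lk.l_len = 0;               /* lock the whole file */

        if (fcntl(fd, F_SETLK, &lk) < 0)
        {
            close(fd);              /* another postmaster holds the lock */
            return -1;
        }
        /* Keep fd open for the postmaster's lifetime; do NOT close it. */
        return fd;
    }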

Regards,

Giles


Re: Improving backend startup interlock

From
Tom Lane
Date:
Giles Lean <giles@nemeton.com.au> writes:
> Is there some reason that file locking is not acceptable?  Is there
> any platform or filesystem supported for use with PostgreSQL which
> doesn't have working exclusive file locking?

How would we know?  We have never tried to use such a feature.

For sure I would not trust it on an NFS filesystem.  (Although we
disparage running an NFS-mounted database, people do it anyway.)
        regards, tom lane


Re: Improving backend startup interlock

From
Giles Lean
Date:
Tom Lane wrote:

> Giles Lean <giles@nemeton.com.au> writes:
> > Is there some reason that file locking is not acceptable?  Is there
> > any platform or filesystem supported for use with PostgreSQL which
> > doesn't have working exclusive file locking?
> 
> How would we know?  We have never tried to use such a feature.

I asked because I've not been following this project long enough to
know if it had been tried and rejected previously.  Newcomers being
prone to making silly suggestions and all that. :-)

> For sure I would not trust it on an NFS filesystem.  (Although we
> disparage running an NFS-mounted database, people do it anyway.)

<scratches head>

I can't work out if that's an objection or not.

I'm certainly no fan of NFS locking, but if someone trusts their NFS
client and server implementations enough to put their data on, they
might as well trust it to get a single lock file for startup right
too.  IMHO.  Your mileage may vary.

Regards,

Giles


Re: Improving backend startup interlock

From
Tom Lane
Date:
Giles Lean <giles@nemeton.com.au> writes:
> I'm certainly no fan of NFS locking, but if someone trusts their NFS
> client and server implementations enough to put their data on, they
> might as well trust it to get a single lock file for startup right
> too.  IMHO.  Your mileage may vary.

Well, my local man page for lockf() sez

     The advisory record-locking capabilities of lockf() are implemented
     throughout the network by the ``network lock daemon'' (see lockd(1M)).
     If the file server crashes and is rebooted, the lock daemon attempts
     to recover all locks associated with the crashed server.  If a lock
     cannot be reclaimed, the process that held the lock is issued a
     SIGLOST signal.

and the lockd man page mentions that not only lockd but statd have to be
running locally *and* at the NFS server.

This sure sounds like file locking on NFS introduces additional
failure modes above and beyond what we have already.

Since the entire point of this locking exercise is to improve PG's
robustness, solutions that depend on other daemons not crashing
don't sound like a step forward to me.  I'm willing to trust the local
kernel, but I get antsy if I have to trust more than that.
        regards, tom lane


Re: Improving backend startup interlock

From
Bruce Momjian
Date:
Have people considered flock (advisory locking) on the postmaster.pid
file for backend detection?   It has a nonblocking option.  Don't most
OS's support it?

I can't understand why we can't get an easier solution to postmaster
detection than shared memory.
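
Something like this minimal sketch, assuming flock() exists on the
platform at all (names made up):

    #include <fcntl.h>
    #include <sys/file.h>
    #include <unistd.h>

    /* Nonblocking probe: is some postmaster already holding the lock?
     * Returns 1 if yes, 0 if we grabbed it (fd kept open to hold it),
     * -1 on error.  Illustrative only. */
    static int
    probe_postmaster_lock(const char *pidpath, int *lockfd)
    {
        int     fd = open(pidpath, O_RDWR);

        if (fd < 0)
            return -1;
        if (flock(fd, LOCK_EX | LOCK_NB) < 0)
        {
            close(fd);
            return 1;               /* EWOULDBLOCK: a live postmaster has it */
        }
        *lockfd = fd;               /* hold the lock for our own lifetime */
        return 0;
    }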

---------------------------------------------------------------------------

Tom Lane wrote:
> Giles Lean <giles@nemeton.com.au> writes:
> > I'm certainly no fan of NFS locking, but if someone trusts their NFS
> > client and server implementations enough to put their data on, they
> > might as well trust it to get a single lock file for startup right
> > too.  IMHO.  Your mileage may vary.
> 
> Well, my local man page for lockf() sez
> 
>      The advisory record-locking capabilities of lockf() are implemented
>      throughout the network by the ``network lock daemon'' (see lockd(1M)).
>      If the file server crashes and is rebooted, the lock daemon attempts
>      to recover all locks associated with the crashed server.  If a lock
>      cannot be reclaimed, the process that held the lock is issued a
>      SIGLOST signal.
> 
> and the lockd man page mentions that not only lockd but statd have to be
> running locally *and* at the NFS server.
> 
> This sure sounds like file locking on NFS introduces additional
> failure modes above and beyond what we have already.
> 
> Since the entire point of this locking exercise is to improve PG's
> robustness, solutions that depend on other daemons not crashing
> don't sound like a step forward to me.  I'm willing to trust the local
> kernel, but I get antsy if I have to trust more than that.
> 
>             regards, tom lane
> 

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Improving backend startup interlock

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Have people considered flock (advisory locking) on the postmaster.pid
> file for backend detection?

$ man flock
No manual entry for flock.
$

HPUX has generally taken the position of adopting both BSD and SysV
features, so if it doesn't exist here, it's not portable to older
Unixen ...
        regards, tom lane


Re: Improving backend startup interlock

From
Giles Lean
Date:
Tom Lane writes:

> $ man flock
> No manual entry for flock.
> $
> 
> HPUX has generally taken the position of adopting both BSD and SysV
> features, so if it doesn't exist here, it's not portable to older
> Unixen ...

If only local locking is at issue then finding any one of fcntl()
locking, flock(), or lockf() would do.  All Unixen will have one or
more of these and autoconf machinery exists to find them.
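
A configure-driven fallback might look roughly like this; the HAVE_*
symbols are invented here for illustration, not existing autoconf
results:

    #include <fcntl.h>
    #include <unistd.h>
    #ifdef HAVE_FLOCK
    #include <sys/file.h>
    #endif

    /* Take a nonblocking exclusive lock on fd with whatever primitive the
     * platform offers; returns 0 on success, nonzero if the lock is held
     * elsewhere or on error. */
    static int
    lock_file_exclusive(int fd)
    {
    #if defined(HAVE_FCNTL_SETLK)
        struct flock lk;

        lk.l_type = F_WRLCK;
        lk.l_whence = SEEK_SET;
        lk.l_start = 0;
        lk.l_len = 0;               /* whole file */
        return fcntl(fd, F_SETLK, &lk);
    #elif defined(HAVE_FLOCK)
        return flock(fd, LOCK_EX | LOCK_NB);
    #elif defined(HAVE_LOCKF)
        return lockf(fd, F_TLOCK, 0);
    #else
    #error "no usable file-locking primitive found"
    #endif
    }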

The issue Tom raised about NFS support remains: locking over NFS
introduces new failure modes.  It also only works for NFS clients
that support NFS locking, which not all do.

Mind you NFS users are currently entirely unprotected from someone
starting a postmaster on a different NFS client using the same data
directory right now, which file locking would prevent. So there is
some win for NFS users as well as local filesystem users.  (Anyone
using NFS care to put their hand up?  Maybe nobody does?)

Is the benefit of better local filesystem behaviour plus multiple
client protection for NFS users who have file locking enough to
outweigh the drawbacks?  My two cents says it is, but my two cents are
worth approximately USD$0.01, which is to say not very much ...

Regards,

Giles


Re: Improving backend startup interlock

From
"Michael Paesold"
Date:
Giles Lean <giles@nemeton.com.au> wrote:

> Tom Lane writes:
> 
> > $ man flock
> > No manual entry for flock.
> > $
> > 
> > HPUX has generally taken the position of adopting both BSD and SysV
> > features, so if it doesn't exist here, it's not portable to older
> > Unixen ...
> 
> If only local locking is at issue then finding any one of fcntl()
> locking, flock(), or lockf() would do.  All Unixen will have one or
> more of these and autoconf machinery exists to find them.
> 
> The issue Tom raised about NFS support remains: locking over NFS
> introduces new failure modes.  It also only works for NFS clients
> that support NFS locking, which not all do.
> 
> Mind you NFS users are currently entirely unprotected from someone
> starting a postmaster on a different NFS client using the same data
> directory right now, which file locking would prevent. So there is
> some win for NFS users as well as local filesystem users.  (Anyone
> using NFS care to put their hand up?  Maybe nobody does?)
> 
> Is the benefit of better local filesystem behaviour plus multiple
> client protection for NFS users who have file locking enough to
> outweigh the drawbacks?  My two cents says it is, but my two cents are
> worth approximately USD$0.01, which is to say not very much ...

Well, I am going to do some tests with postgresql and our netapp
filer later in October.  If that setup proves to be fast and reliable,
I would also be interested in such locking.  I don't care about
the feature if I find the postgresql/NFS/netapp-filer setup to be
unreliable or poorly performing.

I'll see.

Regards,
Michael Paesold



Re: Improving backend startup interlock

From
Joe Conway
Date:
Michael Paesold wrote:
> Giles Lean <giles@nemeton.com.au> wrote:
>>Mind you NFS users are currently entirely unprotected from someone
>>starting a postmaster on a different NFS client using the same data
>>directory right now, which file locking would prevent. So there is
>>some win for NFS users as well as local filesystem users.  (Anyone
>>using NFS care to put their hand up?  Maybe nobody does?)
>>
>>Is the benefit of better local filesystem behaviour plus multiple
>>client protection for NFS users who have file locking enough to
>>outweigh the drawbacks?  My two cents says it is, but my two cents are
>>worth approximately USD$0.01, which is to say not very much ...
> 
> 
> Well, I am going to do some tests with postgresql and our netapp
> filer later in October.  If that setup proves to be fast and reliable,
> I would also be interested in such locking.  I don't care about
> the feature if I find the postgresql/NFS/netapp-filer setup to be
> unreliable or poorly performing.
> 

We have multiple Oracle databases running over NFS from an HPUX server to a 
netapp and have been pleased with the performance overall. It does require 
some tuning to get it right, and it hasn't been entirely without issues, but I 
don't see us going back to local storage. We also just recently set up a Linux 
box running Oracle against an NFS mounted netapp. Soon I'll be adding Postgres 
on the same machine, initially using locally attached storage, but at some 
point I may need to shift to the netapp due to data volume.

If you do try Postgres on the netapp, please post your results/experience and 
I'll do the same.

Anyway, I guess I qualify as interested in an NFS safe locking method.

Joe