Thread: CreateLockFile() race condition

CreateLockFile() race condition

From

Noah Misch

Date:

03 August 2012, 11:56:53

Despite its efforts, CreateLockFile() does not reliably prevent multiple
closely-timed postmasters from completing startup on the same data directory.
This test procedure evades its defenses:

1. kill -9 the running postmaster for $PGDATA
2. "mkdir /tmp/sock1 /tmp/sock2"
3. "gdb postgres", "b unlink", "run -k /tmp/sock1". Hits breakpoint.
4. "postgres -k /tmp/sock2"
5. In gdb: "d 1", "c".

The use of two socket directories is probably not essential, but it makes the
test case simpler.

The problem here is a race between concluding the assessment of a PID file as
defunct and unlinking it; during that period, another postmaster may have
replaced the PID file and proceeded. As far as I've been able to figure, this
flaw is fundamental to any PID file invalidation algorithm relying solely on
atomic filesystem operations like unlink(2), link(2), rename(2) and small
write(2) for mutual exclusion. Do any of you see a way to remove the race?

I think we should instead implement postmaster mutual exclusion by way of
fcntl(F_SETLK) on Unix and CreateFile(..., FILE_SHARE_READ, ...) on Windows.
PostgreSQL used fcntl(F_SETLK) on its Unix socket a decade ago, and that usage
was removed[1] due to portability problems[2][3]. POSIX does not require it
to work on other than regular files, but I can't find evidence of a Unix-like
platform not supporting it for regular files. Perl's metaconfig does test it,
with a note about VMS and DOS/DJGPP lacking support. Gnulib reports only the
lack of Windows support, though the Gnulib test suite does not cover F_SETLK.

CreateLockFile() would have this algorithm:

1. open("postmaster.pid", O_RDWR | O_CREAT, 0600)
2. Request a fcntl write lock on the entire postmaster.pid file. If this
fails, abandon startup: another postmaster is running.
3. Read the PID from the file. If the read returns 0 bytes, either we created
the file or the postmaster creating it crashed before writing and syncing any
content. Such a postmaster would not have reached the point of allocating
shared memory or forking children, so we're safe to proceed in either case.
Any other content problems should never happen and can remain fatal errors.
4. Check any old PID as we do today. In theory, this should always be
redundant with the fcntl lock. However, since the PID check code is mature,
let's keep it around indefinitely as a backstop. Perhaps adjust the error
message to better reflect its "can't happen" nature, though.
5. Read the shared memory key from the file. If we find one, probe it as we
do today. Otherwise, the PID file's creating postmaster crashed before
writing and syncing that value. Such a postmaster would not have reached the
point of forking children, so we're safe to proceed.
6. ftruncate() the file and write our own content.
7. Set FD_CLOEXEC on the file, leaving it open to retain the lock.

AddToDataDirLockFile() would use the retained file descriptor. On a related
but independent note, I think the ereport(LOG) calls in that function should
be ereport(FATAL) as they are in CreateLockFile(). We're not significantly
further into the startup process, and failing to add the shared memory key to
the file creates a hazardous situation.

The hazard[4] keeping fcntl locking from replacing the PGSharedMemoryIsInUse()
check does not apply here, because the postmaster itself does not run
arbitrary code that might reopen postmaster.pid.

This change adds some interlock to data directories shared over NFS. It also
closes the occasionally-reported[5][6] problem that an empty postmaster.pid
blocks cluster startup; we no longer face the possibility that an empty file
is the just-created product of a concurrent postmaster. It's probably no
longer necessary to fsync postmaster.pid, though I'm disinclined to change
that straightway.

Do any of you see holes in that design or have better ideas?

Thanks,
nm

[1] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=792b0f4666b6ea6346aa8d29b568e5d3fe1fcef5
[2] http://archives.postgresql.org/pgsql-hackers/2000-07/msg00359.php
[3] http://archives.postgresql.org/pgsql-hackers/2000-11/msg01306.php
[4] http://archives.postgresql.org/pgsql-hackers/2012-06/msg01598.php
[5] http://archives.postgresql.org/pgsql-hackers/2010-08/msg01094.php
[6] http://archives.postgresql.org/pgsql-hackers/2012-01/msg00233.php

Re: CreateLockFile() race condition

From

Tom Lane

Date:

03 August 2012, 12:59:19

Noah Misch <noah@leadboat.com> writes:
> The problem here is a race between concluding the assessment of a PID file as
> defunct and unlinking it; during that period, another postmaster may have
> replaced the PID file and proceeded.  As far as I've been able to figure, this
> flaw is fundamental to any PID file invalidation algorithm relying solely on
> atomic filesystem operations like unlink(2), link(2), rename(2) and small
> write(2) for mutual exclusion.  Do any of you see a way to remove the race?

Nasty.  Still, the issue only exists for two postmasters launched at
just about exactly the same time, which is an unlikely case.

> I think we should instead implement postmaster mutual exclusion by way of
> fcntl(F_SETLK) on Unix and CreateFile(..., FILE_SHARE_READ, ...) on Windows.

I'm a bit worried about what new problems this solution is going to open
up.  It seems not unlikely that the cure is worse than the disease.
Having locking that actually works on (some) NFS setups would be nice,
but ...

> The hazard[4] keeping fcntl locking from replacing the PGSharedMemoryIsInUse()
> check does not apply here, because the postmaster itself does not run
> arbitrary code that might reopen postmaster.pid.

False.  See shared_preload_libraries.
        regards, tom lane

Re: CreateLockFile() race condition

From

Robert Haas

Date:

03 August 2012, 13:43:25

On Fri, Aug 3, 2012 at 11:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think we should instead implement postmaster mutual exclusion by way of
>> fcntl(F_SETLK) on Unix and CreateFile(..., FILE_SHARE_READ, ...) on Windows.
>
> I'm a bit worried about what new problems this solution is going to open
> up.  It seems not unlikely that the cure is worse than the disease.
> Having locking that actually works on (some) NFS setups would be nice,
> but ...
>
>> The hazard[4] keeping fcntl locking from replacing the PGSharedMemoryIsInUse()
>> check does not apply here, because the postmaster itself does not run
>> arbitrary code that might reopen postmaster.pid.
>
> False.  See shared_preload_libraries.

It strikes me that it would be sufficient to hold the fcntl() lock
just long enough to establish one of the other interlocks.  We don't
really need to hold it for the entire lifetime of the postmaster.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: CreateLockFile() race condition

From

Noah Misch

Date:

03 August 2012, 14:56:15

On Fri, Aug 03, 2012 at 11:59:00AM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > I think we should instead implement postmaster mutual exclusion by way of
> > fcntl(F_SETLK) on Unix and CreateFile(..., FILE_SHARE_READ, ...) on Windows.
> 
> I'm a bit worried about what new problems this solution is going to open
> up.  It seems not unlikely that the cure is worse than the disease.

That's a fair concern.  There's only so much we'll know in advance.

> Having locking that actually works on (some) NFS setups would be nice,
> but ...
> 
> > The hazard[4] keeping fcntl locking from replacing the PGSharedMemoryIsInUse()
> > check does not apply here, because the postmaster itself does not run
> > arbitrary code that might reopen postmaster.pid.
> 
> False.  See shared_preload_libraries.

Quite right.  Even so, that code has a special role and narrower goals to
which it can reasonable aspire, giving credibility to ignoring the problem or
documenting the problem away.  (I don't see that we document any of the other
constraints on _PG_init of libraries named in shared_preload_libraries.)

Thanks,
nm