Thread: Nameless IPC on POSIX systems

Nameless IPC on POSIX systems

From
des@des.no (Dag-Erling Smørgrav)
Date:
The attached patch implements new semaphore and shared memory
mechanisms for POSIX systems.

Semaphores are implemented using unnamed pipes.  A semaphore is
incremented by writing a single character to the pipe, and decremented
by reading a single character.  The only semaphore operation we can't
reliably simulate in this manner is sem_getvalue(), but PostgreSQL
doesn't use it.
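
As a rough illustration of the pipe technique described above (this is
not the code from the patch; the pipe_sem type and function names are
hypothetical), a minimal sketch might look like this:

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical type: a counting semaphore backed by an unnamed pipe. */
typedef struct {
    int rfd;    /* read end: reading one byte decrements */
    int wfd;    /* write end: writing one byte increments */
} pipe_sem;

/* Increment: write one token. Returns 0 on success, -1 on error. */
static int pipe_sem_post(const pipe_sem *s)
{
    char token = 'x';
    ssize_t n;

    assert(s != NULL);
    do {
        n = write(s->wfd, &token, 1);
    } while (n < 0 && errno == EINTR);
    return n == 1 ? 0 : -1;
}

/* Decrement: read one token, blocking while the pipe is empty. */
static int pipe_sem_wait(const pipe_sem *s)
{
    char token;
    ssize_t n;

    assert(s != NULL);
    do {
        n = read(s->rfd, &token, 1);
    } while (n < 0 && errno == EINTR);
    return n == 1 ? 0 : -1;
}

/* Create the pipe and preload it with `initial` tokens. */
static int pipe_sem_init(pipe_sem *s, unsigned initial)
{
    int fds[2];

    assert(s != NULL);
    if (pipe(fds) < 0)
        return -1;
    s->rfd = fds[0];
    s->wfd = fds[1];
    while (initial-- > 0) {
        if (pipe_sem_post(s) < 0)
            return -1;
    }
    return 0;
}
```

In this sketch the semaphore's maximum value is bounded by the pipe's
buffer capacity, since pipe_sem_post() blocks once the pipe is full, and
each semaphore costs two file descriptors.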

Shared memory is implemented using file-less (swap-backed) mmap(),
either with MAP_ANON on systems which support it, or with /dev/zero
(SysV-style).  Note that I've only tested this on systems which
support MAP_ANON, so there may be bugs in the /dev/zero code.
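
The two mapping variants can be sketched roughly as follows. This is an
illustration under assumptions, not the patch itself: the helper names
anon_shared_alloc() and anon_shared_demo() are made up, and the demo
simply checks that a forked child's write is visible to the parent.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: return `size` bytes of swap-backed shared memory,
 * or NULL on failure. */
static void *anon_shared_alloc(size_t size)
{
    void *p;

    assert(size > 0);
#if defined(MAP_ANON)
    p = mmap(NULL, size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANON, -1, 0);
#else
    /* SysV-style fallback: a shared mapping of /dev/zero. */
    int fd = open("/dev/zero", O_RDWR);

    if (fd < 0)
        return NULL;
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
#endif
    return p == MAP_FAILED ? NULL : p;
}

/* Usage sketch: a child created by fork() inherits the mapping, so its
 * write is seen by the parent. Returns the value read, or -1 on error. */
static int anon_shared_demo(void)
{
    int *shared = anon_shared_alloc(sizeof(int));
    pid_t pid;
    int value;

    if (shared == NULL)
        return -1;
    *shared = 0;
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        *shared = 42;           /* child writes into the shared page */
        _exit(0);
    }
    if (waitpid(pid, NULL, 0) < 0)
        return -1;
    value = *shared;
    munmap(shared, sizeof(int));
    return value;
}
```

Because the mapping has no name, only processes descended from the one
that created it can reach it, and it disappears when the last of them
exits.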

One system which will definitely benefit from this is FreeBSD.
FreeBSD has both SysV and POSIX semaphores and shared memory, but
unnamed POSIX semaphores can't be shared between processes, and POSIX
shared memory is implemented using plain files, so the POSIX
primitives can't be used.  The SysV primitives use a global namespace,
which causes problems when multiple PostgreSQL instances run in
separate jails (they can't run on the same port, and a compromised
postmaster in one jail can be used to crash postmasters in other
jails).

The patch was developed and tested on FreeBSD 6, and has also been
tested cursorily on SuSE Linux 9.2.  It passes 'make check', and osdb
(for what it's worth) shows no difference in performance between
patched and unpatched postmasters built from the same source.

Remember to run autoconf and configure before testing, as the patch
modifies configure.in and the FreeBSD and Linux templates.

DES
--
Dag-Erling Smørgrav - des@des.no



Re: Nameless IPC on POSIX systems

From
Tom Lane
Date:
des@des.no (Dag-Erling Smørgrav) writes:
> The attached patch implements new semaphore and shared memory
> mechanisms for POSIX systems.

I'm afraid we'll have to reject this out of hand:

> +bool
> +PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2)
> +{
> +    /*
> +     * This is never the case when using mmap(), since the segments will
> +     * vanish into thin air when postmaster exits or crashes.
> +     */
> +    return false;
> +}

This is not acceptable in the slightest, because it offers no protection
against the situation where the old postmaster has crashed but there are
still live backends.  If a new postmaster and new backends are allowed
to start in that situation, using a new shared memory segment, you
*will* have major database corruption (eg, duplicate use of transaction
IDs).  We need the SysV ability to detect whether any backends are still
connected to the old shared memory segment in order to be safe against
this scenario.

The semaphore code may be functionally OK, but I'm not thrilled with the
fact that it requires two open file descriptors per semaphore, which
have to be passed down to each postmaster child process.  That's a lot
of files if MaxBackends is large; not only does it constrain the number
of file slots available for fd.c to use, but you run the risk of
overflowing what an fd_set can handle, which I notice breaks this code
:-(.  For comparison, the Darwin implementation needs one descriptor per
semaphore, and we have seen performance issues with that.

            regards, tom lane

Re: Nameless IPC on POSIX systems

From
des@des.no (Dag-Erling Smørgrav)
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:
> This is not acceptable in the slightest, because it offers no protection
> against the situation where the old postmaster has crashed but there are
> still live backends.  If a new postmaster and new backends are allowed
> to start in that situation, using a new shared memory segment, you
> *will* have major database corruption (eg, duplicate use of transaction
> IDs).

I assumed the backends would terminate if postmaster crashed, and that
"reattach" was only necessary for the EXEC_BACKEND case.

You can use file-backed shared memory instead.  You need a directory
which you know is writeable and unique to this instance, on a file
system with enough free space to accommodate the full size of the
shared memory segment.  DataDir is probably a good choice.  If the
file does not exist, you create it at startup.  If it does exist, you
map it in and perform the same checks as in the SysV case.
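
A minimal sketch of the create-or-map step, assuming a hypothetical
helper file_shared_map() and a caller-chosen path (for example, a file
under DataDir); the "same checks as in the SysV case" are not shown:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map `path` as shared memory of `size` bytes.
 * Creates the file if it does not exist. Sets *created to 1 if the file
 * was newly created, 0 if it already existed. Returns NULL on failure. */
static void *file_shared_map(const char *path, size_t size, int *created)
{
    struct stat st;
    void *p;
    int fd;

    assert(path != NULL && created != NULL && size > 0);

    fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd >= 0) {
        *created = 1;
        /* Size the new file; this does not reserve disk blocks. */
        if (ftruncate(fd, (off_t) size) < 0) {
            close(fd);
            unlink(path);
            return NULL;
        }
    } else {
        if (errno != EEXIST)
            return NULL;
        *created = 0;
        fd = open(path, O_RDWR);
        if (fd < 0)
            return NULL;
        /* Refuse a file too small to back the whole mapping. */
        if (fstat(fd, &st) < 0 || st.st_size < (off_t) size) {
            close(fd);
            return NULL;
        }
    }
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}
```

When *created comes back as 0, the caller would inspect the existing
segment before deciding whether it is safe to reuse or replace it.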

> The semaphore code may be functionally OK, but I'm not thrilled with the
> fact that it requires two open file descriptors per semaphore, which
> have to be passed down to each postmaster child process.  That's a lot
> of files if MaxBackends is large; not only does it constrain the number
> of file slots available for fd.c to use, but you run the risk of
> overflowing what an fd_set can handle, which I notice breaks this code
> :-(.

#define FD_SETSIZE BIG_NUMBER

Anyway, I'm not sure you fully understand the problem this patch
addresses.  It is currently impractical if not impossible to run
PostgreSQL in jails on FreeBSD, because:

 - SysV IPC is normally not allowed in jails, and must be explicitly
   enabled.

 - the namespace is global, not per-jail, so separate instances in
   separate jails risk collision (I believe there is a workaround for
   this in 8.0, but I haven't tested it)

 - even if collision is avoided, SysV IPC breaches the separation
   between jails, allowing anyone who manages to compromise one jail
   to crash or corrupt any process using SysV IPC in any other jail on
   the system.

DES
--
Dag-Erling Smørgrav - des@des.no

Re: Nameless IPC on POSIX systems

From
Tom Lane
Date:
des@des.no (Dag-Erling Smørgrav) writes:
> You can use file-backed shared memory instead.  You need a directory
> which you know is writeable and unique to this instance, on a file
> system with enough free space to accommodate the full size of the
> shared memory segment.  DataDir is probably a good choice.  If the
> file does not exist, you create it at startup.  If it does exist, you
> map it in and perform the same checks as in the SysV case.

The check we need is "are there any other processes (still) attached to
this shmem" and AFAIK that is not available in the mmap API.  Do you
know how to get it?

> Anyway, I'm not sure you fully understand the problem this patch
> addresses.

Yes, I do.  I'm not interested in substituting a risk of data corruption
for those problems.

            regards, tom lane

Re: Nameless IPC on POSIX systems

From
des@des.no (Dag-Erling Smørgrav)
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:
> The check we need is "are there any other processes (still) attached to
> this shmem" and AFAIK that is not available in the mmap API.  Do you
> know how to get it?

You can hack something up with fcntl() locks.  If a process has a
shared lock on the shm file, F_GETLK will get you its pid.  Then grab
your own shared lock.
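
Roughly, and only as a sketch (shm_lock_holder() and shm_lock_shared()
are hypothetical names, and fd is assumed to be an open descriptor on
the shm file):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: find another process holding a lock on fd.
 * Returns its pid, 0 if no other process holds one, -1 on error. */
static pid_t shm_lock_holder(int fd)
{
    struct flock fl = {0};

    assert(fd >= 0);
    fl.l_type = F_WRLCK;    /* probe as a writer so shared locks conflict */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;           /* 0 means "through end of file" */
    if (fcntl(fd, F_GETLK, &fl) == -1)
        return -1;
    return fl.l_type == F_UNLCK ? 0 : fl.l_pid;
}

/* Hypothetical helper: take our own shared lock on the whole file.
 * Returns 0 on success, -1 if the lock could not be taken. */
static int shm_lock_shared(int fd)
{
    struct flock fl = {0};

    assert(fd >= 0);
    fl.l_type = F_RDLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    return fcntl(fd, F_SETLK, &fl) == -1 ? -1 : 0;
}
```

F_GETLK never reports locks held by the calling process itself, and it
returns only one conflicting lock, so a nonzero pid means "at least one
other process is attached", not a full list.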

DES
--
Dag-Erling Smørgrav - des@des.no

Re: Nameless IPC on POSIX systems

From
Tom Lane
Date:
des@des.no (Dag-Erling Smørgrav) writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> The check we need is "are there any other processes (still) attached to
>> this shmem" and AFAIK that is not available in the mmap API.  Do you
>> know how to get it?

> You can hack something up with fcntl() locks.  If a process has a
> shared lock on the shm file, F_GETLK will get you its pid.  Then grab
> your own shared lock.

Seems fairly race-condition-prone: what about recently spawned child
processes that haven't yet taken their own locks?  If I read the fork()
page correctly, a forked child doesn't inherit any file locks.
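
One way to check the inheritance behaviour on a given platform is a
small probe like the following. It is a sketch under assumptions: the
function name child_lacks_parent_lock() is made up, and fd is assumed
to be open for reading and writing on a scratch file.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The parent takes a shared lock, forks, then probes the file as a
 * writer. F_GETLK ignores the caller's own locks, so any lock it reports
 * would have to belong to the child. Returns 1 if the child holds no
 * lock, 0 if it appears to hold one, -1 on error. */
static int child_lacks_parent_lock(int fd)
{
    struct flock fl = {0};
    int gate[2];
    pid_t pid;
    int result;
    char c;

    assert(fd >= 0);
    fl.l_type = F_RDLCK;
    fl.l_whence = SEEK_SET;
    if (fcntl(fd, F_SETLK, &fl) == -1)
        return -1;
    if (pipe(gate) < 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: stay alive until the parent closes the write end. */
        close(gate[1]);
        if (read(gate[0], &c, 1) < 0)
            _exit(1);
        _exit(0);
    }
    close(gate[0]);

    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    if (fcntl(fd, F_GETLK, &fl) == -1)
        result = -1;
    else
        result = (fl.l_type == F_UNLCK);

    close(gate[1]);             /* let the child exit */
    waitpid(pid, NULL, 0);
    return result;
}
```

A result of 1 means the child is alive and attached but invisible to a
lock probe until it takes its own lock, which is the window in question.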

            regards, tom lane