Thread: Nameless IPC on POSIX systems
The attached patch implements new semaphore and shared memory mechanisms for POSIX systems.

Semaphores are implemented using unnamed pipes. A semaphore is incremented by writing a single character to the pipe, and decremented by reading a single character. The only semaphore operation we can't reliably simulate in this manner is sem_getvalue(), but PostgreSQL doesn't use it.

Shared memory is implemented using file-less (swap-backed) mmap(), either with MAP_ANON on systems which support it, or with /dev/zero (SysV-style). Note that I've only tested this on systems which support MAP_ANON, so there may be bugs in the /dev/zero code.

One system which will definitely benefit from this is FreeBSD. FreeBSD has both SysV and POSIX semaphores and shared memory, but unnamed POSIX semaphores can't be shared between processes, and POSIX shared memory is implemented using plain files, so the POSIX primitives can't be used. The SysV primitives use a global namespace, which causes problems when multiple PostgreSQL instances run in separate jails (they can't run on the same port, and a compromised postmaster in one jail can be used to crash postmasters in other jails).

The patch was developed and tested on FreeBSD 6, and has also been tested cursorily on SuSE Linux 9.2. It passes 'make check', and osdb (for what it's worth) shows no difference in performance between patched and unpatched postmasters built from the same source.

Remember to run autoconf and configure before testing, as the patch modifies configure.in and the FreeBSD and Linux templates.

DES
-- 
Dag-Erling Smørgrav - des@des.no

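
To make the pipe-as-semaphore idea above concrete, here is a minimal sketch in C. It is not the code from the patch: the type pipe_sem and the functions pipe_sem_init, pipe_sem_post and pipe_sem_wait are hypothetical names chosen for illustration, and the sketch assumes the initial count is small enough to fit in the pipe buffer.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical type: one semaphore is one pipe, i.e. two descriptors. */
typedef struct {
    int rfd;    /* read end: used to decrement */
    int wfd;    /* write end: used to increment */
} pipe_sem;

/* Increment: write one byte (a "token") into the pipe. */
int pipe_sem_post(pipe_sem *s)
{
    char token = 'x';
    ssize_t n;

    do {
        n = write(s->wfd, &token, 1);
    } while (n < 0 && errno == EINTR);
    return n == 1 ? 0 : -1;
}

/* Decrement: read one byte; blocks while the pipe is empty. */
int pipe_sem_wait(pipe_sem *s)
{
    char token;
    ssize_t n;

    do {
        n = read(s->rfd, &token, 1);
    } while (n < 0 && errno == EINTR);
    return n == 1 ? 0 : -1;
}

/*
 * Create the pipe and preload it with `initial` tokens.
 * Assumption: `initial` fits in the pipe buffer, otherwise write() blocks.
 */
int pipe_sem_init(pipe_sem *s, unsigned initial)
{
    int fds[2];

    assert(s != NULL);
    if (pipe(fds) != 0)
        return -1;
    s->rfd = fds[0];
    s->wfd = fds[1];
    while (initial-- > 0) {
        if (pipe_sem_post(s) != 0) {
            close(s->rfd);
            close(s->wfd);
            return -1;
        }
    }
    return 0;
}
```

Because both descriptors survive fork(), child processes can post and wait on the same pipe. The count lives only in the kernel's pipe buffer, which is why a sem_getvalue() equivalent is hard to provide portably.
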
des@des.no (Dag-Erling Smørgrav) writes:
> The attached patch implements new semaphore and shared memory
> mechanisms for POSIX systems.

I'm afraid we'll have to reject this out of hand:

> +bool
> +PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2)
> +{
> +	/*
> +	 * This is never the case when using mmap(), since the segments will
> +	 * vanish into thin air when postmaster exits or crashes.
> +	 */
> +	return false;
> +}

This is not acceptable in the slightest, because it offers no protection against the situation where the old postmaster has crashed but there are still live backends. If a new postmaster and new backends are allowed to start in that situation, using a new shared memory segment, you *will* have major database corruption (eg, duplicate use of transaction IDs). We need the SysV ability to detect whether any backends are still connected to the old shared memory segment in order to be safe against this scenario.

The semaphore code may be functionally OK, but I'm not thrilled with the fact that it requires two open file descriptors per semaphore, which have to be passed down to each postmaster child process. That's a lot of files if MaxBackends is large; not only does it constrain the number of file slots available for fd.c to use, but you run the risk of overflowing what an fd_set can handle, which I notice breaks this code :-(. For comparison, the Darwin implementation needs one descriptor per semaphore, and we have seen performance issues with that.

regards, tom lane

Tom Lane <tgl@sss.pgh.pa.us> writes:
> This is not acceptable in the slightest, because it offers no protection
> against the situation where the old postmaster has crashed but there are
> still live backends. If a new postmaster and new backends are allowed
> to start in that situation, using a new shared memory segment, you
> *will* have major database corruption (eg, duplicate use of transaction
> IDs).

I assumed the backends would terminate if postmaster crashed, and that "reattach" was only necessary for the EXEC_BACKEND case.

You can use file-backed shared memory instead. You need a directory which you know is writeable and unique to this instance, on a file system with enough free space to accommodate the full size of the shared memory segment. DataDir is probably a good choice. If the file does not exist, you create it at startup. If it does exist, you map it in and perform the same checks as in the SysV case.

> The semaphore code may be functionally OK, but I'm not thrilled with the
> fact that it requires two open file descriptors per semaphore, which
> have to be passed down to each postmaster child process. That's a lot
> of files if MaxBackends is large; not only does it constrain the number
> of file slots available for fd.c to use, but you run the risk of
> overflowing what an fd_set can handle, which I notice breaks this code
> :-(.

#define FD_SETSIZE BIG_NUMBER

Anyway, I'm not sure you fully understand the problem this patch addresses. It is currently impractical if not impossible to run PostgreSQL in jails on FreeBSD, because:

 - SysV IPC is normally not allowed in jails, and must be explicitly enabled.

 - the namespace is global, not per-jail, so separate instances in separate jails risk collision (I believe there is a workaround for this in 8.0, but I haven't tested it).

 - even if collision is avoided, SysV IPC breaches the separation between jails, allowing anyone who manages to compromise one jail to crash or corrupt any process using SysV IPC in any other jail on the system.

DES
-- 
Dag-Erling Smørgrav - des@des.no

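
The file-backed proposal above (create the file if it is missing, otherwise map the existing one) might look roughly like the following C sketch. map_shm_file is a hypothetical name, and the "same checks as in the SysV case" are deliberately left out because the thread does not spell them out.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `size` bytes of the file at `path` as shared memory.
 * Sets *created to 1 if the file was newly created, 0 if it already existed.
 * Returns the mapping, or NULL on failure.
 */
void *map_shm_file(const char *path, size_t size, int *created)
{
    struct stat st;
    void *p;
    int fd;

    assert(path != NULL && created != NULL && size > 0);

    fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd >= 0) {
        /* Fresh file: extend it to the full segment size. */
        *created = 1;
        if (ftruncate(fd, (off_t) size) != 0) {
            close(fd);
            unlink(path);
            return NULL;
        }
    } else if (errno == EEXIST) {
        /* Leftover file from an earlier instance: reuse it. */
        *created = 0;
        fd = open(path, O_RDWR);
        if (fd < 0)
            return NULL;
        /* Touching pages beyond end of file would raise SIGBUS. */
        if (fstat(fd, &st) != 0 || st.st_size < (off_t) size) {
            close(fd);
            return NULL;
        }
    } else {
        return NULL;
    }

    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : p;
}
```

A caller that also wants fcntl() locks on the file would have to keep the descriptor open instead of closing it here, since closing a descriptor releases the process's locks on that file.
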
des@des.no (Dag-Erling Smørgrav) writes:
> You can use file-backed shared memory instead. You need a directory
> which you know is writeable and unique to this instance, on a file
> system with enough free space to accommodate the full size of the
> shared memory segment. DataDir is probably a good choice. If the
> file does not exist, you create it at startup. If it does exist, you
> map it in and perform the same checks as in the SysV case.

The check we need is "are there any other processes (still) attached to this shmem" and AFAIK that is not available in the mmap API. Do you know how to get it?

> Anyway, I'm not sure you fully understand the problem this patch
> addresses.

Yes, I do. I'm not interested in substituting a risk of data corruption for those problems.

regards, tom lane

Tom Lane <tgl@sss.pgh.pa.us> writes:
> The check we need is "are there any other processes (still) attached to
> this shmem" and AFAIK that is not available in the mmap API. Do you
> know how to get it?

You can hack something up with fcntl() locks. If a process has a shared lock on the shm file, F_GETLK will get you its pid. Then grab your own shared lock.

DES
-- 
Dag-Erling Smørgrav - des@des.no

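
A rough C sketch of the fcntl() approach suggested above: probe with F_GETLK as if requesting an exclusive lock, so that any other process's shared lock shows up as a conflict, then take a shared lock of one's own. The names shm_lock_holder and shm_take_shared_lock are hypothetical, and locking the whole file is an assumption of this sketch.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the pid of another process holding a lock
 * on fd, 0 if no other process holds one, or -1 on error.
 */
pid_t shm_lock_holder(int fd)
{
    struct flock fl;

    assert(fd >= 0);
    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;    /* ask: could an exclusive lock be placed? */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;           /* 0 means "through end of file" */
    if (fcntl(fd, F_GETLK, &fl) == -1)
        return -1;
    /* F_UNLCK means nothing conflicts; otherwise l_pid names one holder. */
    return fl.l_type == F_UNLCK ? 0 : fl.l_pid;
}

/* Hypothetical helper: take a shared (read) lock on the whole file. */
int shm_take_shared_lock(int fd)
{
    struct flock fl;

    assert(fd >= 0);
    memset(&fl, 0, sizeof fl);
    fl.l_type = F_RDLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    return fcntl(fd, F_SETLK, &fl) == -1 ? -1 : 0;
}
```

A process's own locks never conflict with its own probe, so shm_lock_holder() only ever reports other processes. It reports a single holder at a time, not the full set.
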
des@des.no (Dag-Erling Smørgrav) writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> The check we need is "are there any other processes (still) attached to
>> this shmem" and AFAIK that is not available in the mmap API. Do you
>> know how to get it?

> You can hack something up with fcntl() locks. If a process has a
> shared lock on the shm file, F_GETLK will get you its pid. Then grab
> your own shared lock.

Seems fairly race-condition-prone: what about recently spawned child processes that haven't yet taken their own locks? If I read the fork() page correctly, a forked child doesn't inherit any file locks.

regards, tom lane

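
The inheritance point above can be checked with a short C experiment: the parent takes a shared lock, forks, and the child probes the file with F_GETLK. If the child had inherited the lock, the probe would find no conflict; POSIX says it does not inherit, so the child should see the parent's pid as the holder. The function name child_lacks_parent_lock is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical experiment. Returns 1 if a forked child sees the parent's
 * lock as a conflicting lock (so it did not inherit it), 0 if the child
 * sees no conflict, and -1 on error.
 */
int child_lacks_parent_lock(int fd)
{
    struct flock fl;
    pid_t child;
    int status;

    assert(fd >= 0);

    /* Parent: shared lock on the whole file. */
    memset(&fl, 0, sizeof fl);
    fl.l_type = F_RDLCK;
    fl.l_whence = SEEK_SET;
    if (fcntl(fd, F_SETLK, &fl) == -1)
        return -1;

    child = fork();
    if (child == -1)
        return -1;
    if (child == 0) {
        /* Child: would an exclusive lock conflict with anyone? */
        struct flock probe;

        memset(&probe, 0, sizeof probe);
        probe.l_type = F_WRLCK;
        probe.l_whence = SEEK_SET;
        if (fcntl(fd, F_GETLK, &probe) == -1)
            _exit(2);
        _exit(probe.l_type != F_UNLCK && probe.l_pid == getppid() ? 0 : 1);
    }

    if (waitpid(child, &status, 0) != child || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 0)
        return 1;
    if (WEXITSTATUS(status) == 1)
        return 0;
    return -1;
}
```

This shows the window in question: between fork() and the child taking its own lock, the child is attached to the shared memory but invisible to an F_GETLK probe if the parent then dies.
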