Thread: [PATCH] Make ENOSPC not fatal in semaphore creation
From: Mikhail <mp39590@gmail.com>

We might be in situation when we have "just enough" semaphores in the
system limit to start but previously crashed unexpectedly, in that case
we won't be able to start again - semget() will return ENOSPC, despite
the semaphores are ours, and we can recycle them, so check this
situation and try to remove the semaphore, if we are unable - give up
and abort.
---
 src/backend/port/sysv_sema.c | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/src/backend/port/sysv_sema.c b/src/backend/port/sysv_sema.c
index 21c883ba9a..a889591dba 100644
--- a/src/backend/port/sysv_sema.c
+++ b/src/backend/port/sysv_sema.c
@@ -88,10 +88,6 @@ static void ReleaseSemaphores(int status, Datum arg);
  *
  * Attempt to create a new semaphore set with the specified key.
  * Will fail (return -1) if such a set already exists.
- *
- * If we fail with a failure code other than collision-with-existing-set,
- * print out an error and abort. Other types of errors suggest nonrecoverable
- * problems.
  */
 static IpcSemaphoreId
 InternalIpcSemaphoreCreate(IpcSemaphoreKey semKey, int numSems)
@@ -118,10 +114,33 @@ InternalIpcSemaphoreCreate(IpcSemaphoreKey semKey, int numSems)
 		return -1;
 
 	/*
-	 * Else complain and abort
+	 * We might be in situation when we have "just enough" semaphores in the system
+	 * limit to start but previously crashed unexpectedly, in that case we won't be
+	 * able to start again - semget() will return ENOSPC, despite the semaphores
+	 * are ours, and we can recycle them, so check this situation and try to remove
+	 * the semaphore, if we are unable - give up and abort.
+	 *
+	 * We use same semkey for every start - it's gotten from inode number of the
+	 * data folder. So on repeated starts we will use the same key.
 	 */
+	if (saved_errno == ENOSPC)
+	{
+		union semun semun;
+
+		semId = semget(semKey, 0, 0);
+
+		semun.val = 0;			/* unused, but keep compiler quiet */
+		if (semctl(semId, 0, IPC_RMID, semun) == 0)
+		{
+			/* Recycled - get the same semaphore again */
+			semId = semget(semKey, numSems, IPC_CREAT | IPC_EXCL | IPCProtection);
+
+			return semId;
+		}
+	}
+
 	ereport(FATAL,
-			(errmsg("could not create semaphores: %m"),
+			(errmsg("could not create semaphores: %s", strerror(saved_errno)),
 			 errdetail("Failed system call was semget(%lu, %d, 0%o).",
 					   (unsigned long) semKey, numSems,
 					   IPC_CREAT | IPC_EXCL | IPCProtection),
--
2.33.0
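As an aside for anyone who wants to reproduce the condition the patch is
aimed at, below is a small standalone program (not part of the patch) that
keeps allocating System V semaphore sets until the kernel refuses, so you
can see which errno the limit produces on your platform; the set size of 17
is arbitrary for the example. Run it on a scratch system and remove the
leftover sets with ipcrm afterwards.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int
main(void)
{
	int			nsets = 0;

	/*
	 * Keep allocating private semaphore sets until the system-wide limit is
	 * hit. With small default limits (as on OpenBSD) this happens quickly,
	 * and semget() reports ENOSPC ("No space left on device").
	 */
	for (;;)
	{
		int			id = semget(IPC_PRIVATE, 17, IPC_CREAT | 0600);

		if (id < 0)
		{
			printf("semget() failed after %d sets: %s\n",
				   nsets, strerror(errno));
			/* The sets created above are left behind; remove them with ipcrm. */
			return 0;
		}
		nsets++;
	}
}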
mp39590@gmail.com writes:
> We might be in situation when we have "just enough" semaphores in the
> system limit to start but previously crashed unexpectedly, in that case
> we won't be able to start again - semget() will return ENOSPC, despite
> the semaphores are ours, and we can recycle them, so check this
> situation and try to remove the semaphore, if we are unable - give up
> and abort.

AFAICS, this patch could be disastrous. What if the semaphore in
question belongs to some other postmaster?

Also, you haven't explained why the existing (and much safer) recycling
logic in IpcSemaphoreCreate doesn't solve your problem.

			regards, tom lane
On Sun, Oct 17, 2021 at 10:29:24AM -0400, Tom Lane wrote:
> mp39590@gmail.com writes:
> > We might be in situation when we have "just enough" semaphores in the
> > system limit to start but previously crashed unexpectedly, in that case
> > we won't be able to start again - semget() will return ENOSPC, despite
> > the semaphores are ours, and we can recycle them, so check this
> > situation and try to remove the semaphore, if we are unable - give up
> > and abort.
>
> AFAICS, this patch could be disastrous. What if the semaphore in
> question belongs to some other postmaster?

Is running more than one postmaster on the same PGDATA supported at all?
Currently the seed for the semaphore key is the inode number of PGDATA.

> Also, you haven't explained why the existing (and much safer) recycling
> logic in IpcSemaphoreCreate doesn't solve your problem.

The logic of creating semas:

218     /* Loop till we find a free IPC key */
219     for (nextSemaKey++;; nextSemaKey++)
220     {
221         pid_t       creatorPID;
222
223         /* Try to create new semaphore set */
224         semId = InternalIpcSemaphoreCreate(nextSemaKey, numSems + 1);
225         if (semId >= 0)
226             break;          /* successful create */

InternalIpcSemaphoreCreate:

101     semId = semget(semKey, numSems, IPC_CREAT | IPC_EXCL | IPCProtection);
102
103     if (semId < 0)
104     {
105         int         saved_errno = errno;
106
[...]
113         if (saved_errno == EEXIST || saved_errno == EACCES
114 #ifdef EIDRM
115             || saved_errno == EIDRM
116 #endif
117             )
118             return -1;
119
120         /*
121          * Else complain and abort
122          */
123         ereport(FATAL,
[...]

semget() returns ENOSPC, so InternalIpcSemaphoreCreate doesn't return -1,
and the rest of IpcSemaphoreCreate's recycling logic is never reached.
Mikhail <mp39590@gmail.com> writes:
> On Sun, Oct 17, 2021 at 10:29:24AM -0400, Tom Lane wrote:
>> AFAICS, this patch could be disastrous. What if the semaphore in
>> question belongs to some other postmaster?

> Is running more than one postmaster on the same PGDATA supported at all?
> Currently the seed for the semaphore key is the inode number of PGDATA.

That hardly guarantees no collisions. If it did, we'd never have bothered
with the PGSemaMagic business or the IpcSemaphoreGetLastPID check.

>> Also, you haven't explained why the existing (and much safer) recycling
>> logic in IpcSemaphoreCreate doesn't solve your problem.

> semget() returns ENOSPC, so InternalIpcSemaphoreCreate doesn't return -1,
> and the rest of IpcSemaphoreCreate's recycling logic is never reached.

Hmm. Maybe you could improve this by removing the first
InternalIpcSemaphoreCreate call in IpcSemaphoreCreate, and
rearranging the logic so that the first step consists of seeing
whether a sema set is already there (and can safely be zapped),
and only then proceed with creation.

I am, however, concerned that this'll just trade off one hazard for
another. Instead of a risk of failing with ENOSPC (which the DBA
can fix), we'll have a risk of kneecapping some other process at
random (which the DBA can do nothing to prevent).

I'm also fairly unclear on when the logic you propose would trigger
at all. If the sema set is already there, I'd expect EEXIST or
equivalent, not ENOSPC.

			regards, tom lane
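To make the rearrangement Tom describes concrete, here is a rough sketch of
what the per-key step could look like with the existence check done first.
This is not a real patch: SetLooksSafeToRecycle() is only a placeholder for
the PGSemaMagic / IpcSemaphoreGetLastPID safety checks the existing
recycling path performs, and error handling plus the usual union semun
portability dance are omitted.

/* Hypothetical reordering sketch; not actual backend code. */
static IpcSemaphoreId
IpcSemaphoreCreateRecycleFirst(IpcSemaphoreKey semKey, int numSems)
{
	int			semId;
	union semun semun;

	/* First, see whether a set already exists under this key. */
	semId = semget(semKey, 0, 0);
	if (semId >= 0)
	{
		/*
		 * Only remove it if it passes the same safety checks the current
		 * recycling code applies (marker semaphore carries PGSemaMagic and
		 * the last operating process is gone); otherwise move on to the
		 * next candidate key.
		 */
		if (!SetLooksSafeToRecycle(semId))	/* placeholder check */
			return -1;

		semun.val = 0;
		if (semctl(semId, 0, IPC_RMID, semun) < 0)
			return -1;
	}

	/* Only now try to create; our own orphan can no longer cause ENOSPC. */
	return semget(semKey, numSems, IPC_CREAT | IPC_EXCL | IPCProtection);
}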
On Sun, Oct 17, 2021 at 10:52:38AM -0400, Tom Lane wrote:
> Mikhail <mp39590@gmail.com> writes:
> > On Sun, Oct 17, 2021 at 10:29:24AM -0400, Tom Lane wrote:
> >> AFAICS, this patch could be disastrous. What if the semaphore in
> >> question belongs to some other postmaster?
>
> > Is running more than one postmaster on the same PGDATA supported at all?
> > Currently the seed for the semaphore key is the inode number of PGDATA.
>
> That hardly guarantees no collisions. If it did, we'd never have bothered
> with the PGSemaMagic business or the IpcSemaphoreGetLastPID check.

Got it, makes sense. Also, I was shown examples where the inode number
can be reused across mount points for different clusters.

> >> Also, you haven't explained why the existing (and much safer) recycling
> >> logic in IpcSemaphoreCreate doesn't solve your problem.
>
> > semget() returns ENOSPC, so InternalIpcSemaphoreCreate doesn't return -1,
> > and the rest of IpcSemaphoreCreate's recycling logic is never reached.
>
> Hmm. Maybe you could improve this by removing the first
> InternalIpcSemaphoreCreate call in IpcSemaphoreCreate, and
> rearranging the logic so that the first step consists of seeing
> whether a sema set is already there (and can safely be zapped),
> and only then proceed with creation.

I think I can look into this next weekend. At first glance the solution
works for me.

> I am, however, concerned that this'll just trade off one hazard for
> another. Instead of a risk of failing with ENOSPC (which the DBA
> can fix), we'll have a risk of kneecapping some other process at
> random (which the DBA can do nothing to prevent).

Good argument, but I'll try to make a second version of the patch with
the proposed logic change to see what we get. I think it's the "right"
behavior to recycle our own used semaphores, so the whole approach is
correct.

> I'm also fairly unclear on when the logic you propose would trigger
> at all. If the sema set is already there, I'd expect EEXIST or
> equivalent, not ENOSPC.

The logic works - the initial call to semget() in
InternalIpcSemaphoreCreate returns -1 and errno is set to ENOSPC - I
tested the patch on OpenBSD 7.0, it successfully recycles sem's after
previous "pkill -6 postgres". Verified it with 'ipcs -s'.
On Mon, Oct 18, 2021 at 4:49 AM Mikhail <mp39590@gmail.com> wrote:
> The logic works - the initial call to semget() in
> InternalIpcSemaphoreCreate returns -1 and errno is set to ENOSPC - I
> tested the patch on OpenBSD 7.0, it successfully recycles sem's after
> previous "pkill -6 postgres". Verified it with 'ipcs -s'.

Since you mentioned OpenBSD, what do you think of the idea of making
named POSIX semas the default on that platform? You can't run out of
those practically speaking, but then you get lots of little memory
mappings (from memory, at least it does close the fd for each one,
unlike some other OSes where we wouldn't want to use this technique).
Trivial patch:

https://www.postgresql.org/message-id/CA%2BhUKGJVSjiDjbJpHwUrvA1TikFnJRfyJanrHofAWhnqcDJayQ%40mail.gmail.com

No strong opinion on the tradeoffs here, as I'm not an OpenBSD user,
but it's something I think about whenever testing portability stuff
there and having to adjust the relevant sysctls.

Note: The best kind would be *unnamed* POSIX semas, where we get to
control their placement in existing memory; that's what we do on Linux
and FreeBSD. They weren't supported on OpenBSD last time we checked:
it rejects requests for shared ones. I wonder if someone could
implement them with just a few lines of user space code, using atomic
counters and futex() for waiting.
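For illustration, here is a minimal sketch of the named-POSIX-semaphore
technique Thomas refers to, assuming nothing about PostgreSQL's actual
posix_sema.c: create the semaphore under a throwaway name and unlink the
name immediately, so a crashed process leaves no persistent object behind
that would need ipcs/ipcrm-style cleanup. The naming scheme is invented
for the example.

#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Create a POSIX semaphore under a temporary name, then unlink the name
 * right away. The sem_t handle stays usable in this process and in
 * children forked later, but no named kernel object survives a crash.
 */
static sem_t *
create_throwaway_named_sema(unsigned int init_value)
{
	char		name[64];
	sem_t	   *sem;

	snprintf(name, sizeof(name), "/example.%d", (int) getpid());

	sem = sem_open(name, O_CREAT | O_EXCL, 0600, init_value);
	if (sem == SEM_FAILED)
		return NULL;

	sem_unlink(name);			/* name is gone; handle remains valid */
	return sem;
}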
On Mon, Oct 18, 2021 at 10:07:40AM +1300, Thomas Munro wrote:
> On Mon, Oct 18, 2021 at 4:49 AM Mikhail <mp39590@gmail.com> wrote:
> > The logic works - the initial call to semget() in
> > InternalIpcSemaphoreCreate returns -1 and errno is set to ENOSPC - I
> > tested the patch on OpenBSD 7.0, it successfully recycles sem's after
> > previous "pkill -6 postgres". Verified it with 'ipcs -s'.
>
> Since you mentioned OpenBSD, what do you think of the idea of making
> named POSIX semas the default on that platform? You can't run out of
> those practically speaking, but then you get lots of little memory
> mappings (from memory, at least it does close the fd for each one,
> unlike some other OSes where we wouldn't want to use this technique).
> Trivial patch:
>
> https://www.postgresql.org/message-id/CA%2BhUKGJVSjiDjbJpHwUrvA1TikFnJRfyJanrHofAWhnqcDJayQ%40mail.gmail.com
>
> No strong opinion on the tradeoffs here, as I'm not an OpenBSD user,
> but it's something I think about whenever testing portability stuff
> there and having to adjust the relevant sysctls.
>
> Note: The best kind would be *unnamed* POSIX semas, where we get to
> control their placement in existing memory; that's what we do on Linux
> and FreeBSD. They weren't supported on OpenBSD last time we checked:
> it rejects requests for shared ones. I wonder if someone could
> implement them with just a few lines of user space code, using atomic
> counters and futex() for waiting.

Hello, sorry for not replying earlier - I was able to think about and
test the patch only on the weekend.

I fully agree with your approach; in a conversation with one of the
OpenBSD developers he supported using sem_open(), because most ports use
it and consistency across the ports tree is desirable. It looks like
PostgreSQL was the only port to use semget(). Switching to sem_open()
also looks much safer than patching sysv_sema.c for the ENOSPC corner
case, as Tom already mentioned.

In your patch I've removed testing for 5.x versions, because official
releases are supported only for one year, no need to worry about them.

The patch is tested with 'make installcheck'; I can also confirm that
'ipcs' shows that no semaphores are used, and the server starts normally
after 'pkill -6 postgres' with the default semmns sysctl, which was the
original motivation for this work.

diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index d74d1ed7af..2dfea0662b 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -998,21 +998,7 @@ psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such
 
    <para>
     The default shared memory settings are usually good enough, unless
     you have set <literal>shared_memory_type</literal> to <literal>sysv</literal>.
-    You will usually want to
-    increase <literal>kern.seminfo.semmni</literal>
-    and <literal>kern.seminfo.semmns</literal>,
-    as <systemitem class="osname">OpenBSD</systemitem>'s default settings
-    for these are uncomfortably small.
-   </para>
-
-   <para>
-    IPC parameters can be adjusted using <command>sysctl</command>,
-    for example:
-<screen>
-<prompt>#</prompt> <userinput>sysctl kern.seminfo.semmni=100</userinput>
-</screen>
-    To make these settings persist over reboots, modify
-    <filename>/etc/sysctl.conf</filename>.
+    System V semaphores are not used on this platform.
    </para>
   </listitem>

diff --git a/src/template/openbsd b/src/template/openbsd
index 365268c489..41221af382 100644
--- a/src/template/openbsd
+++ b/src/template/openbsd
@@ -2,3 +2,7 @@
 
 # Extra CFLAGS for code that will go into a shared library
 CFLAGS_SL="-fPIC -DPIC"
+
+# OpenBSD 5.5 (2014) gained named POSIX semaphores. They work out of the box
+# without changing any sysctl settings, unlike System V semaphores.
+USE_NAMED_POSIX_SEMAPHORES=1
Mikhail <mp39590@gmail.com> writes:
> In your patch I've removed testing for 5.x versions, because official
> releases are supported only for one year, no need to worry about them.

Official support or no, we have OpenBSD 5.9 in our buildfarm, so
ignoring the case isn't going to fly.

			regards, tom lane
On Sat, Oct 23, 2021 at 8:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Mikhail <mp39590@gmail.com> writes:
> > In your patch I've removed testing for 5.x versions, because official
> > releases are supported only for one year, no need to worry about them.
>
> Official support or no, we have OpenBSD 5.9 in our buildfarm, so
> ignoring the case isn't going to fly.

It was a test for < 5.5, so that aspect's OK.
On Fri, Oct 22, 2021 at 03:43:00PM -0400, Tom Lane wrote:
> Mikhail <mp39590@gmail.com> writes:
> > In your patch I've removed testing for 5.x versions, because official
> > releases are supported only for one year, no need to worry about them.
>
> Official support or no, we have OpenBSD 5.9 in our buildfarm, so
> ignoring the case isn't going to fly.

5.9 has support for unnamed POSIX semas. Do you think a new machine with
OpenBSD <5.5 (when unnamed POSIX semas were introduced) can appear in the
buildfarm or be used by a real customer?

I have no objection to testing for "openbsd5.[01234]" and using SysV
semas there, and I can redo and test the patch, but isn't that overly
cautious?
Mikhail <mp39590@gmail.com> writes:
> On Fri, Oct 22, 2021 at 03:43:00PM -0400, Tom Lane wrote:
>> Official support or no, we have OpenBSD 5.9 in our buildfarm, so
>> ignoring the case isn't going to fly.

> 5.9 has support for unnamed POSIX semas. Do you think a new machine with
> OpenBSD <5.5 (when unnamed POSIX semas were introduced) can appear in the
> buildfarm or be used by a real customer?

Nah, I misunderstood you to say that 5.9 would also be affected.

			regards, tom lane
Mikhail <mp39590@gmail.com> writes:
> +# OpenBSD 5.5 (2014) gained named POSIX semaphores. They work out of the box
> +# without changing any sysctl settings, unlike System V semaphores.
> +USE_NAMED_POSIX_SEMAPHORES=1

I tried this on an OpenBSD 6.0 image I had handy. The good news is
that it works, and I can successfully start the postmaster with a lot
of semaphores (I tried with max_connections=10000) without any special
system configuration. The bad news is it's *slow*. It takes the
postmaster over a minute to start up at 10000 max_connections, and
also about 15 seconds to shut down. The regression tests also appear
noticeably slower, even at the default max_connections=100. I'm
afraid that those "lots of tiny mappings" that Thomas noted have
a nasty impact on our process launch times, since the kernel
presumably has to do work to clone them into the child process.

Now this lashup that I'm testing on is by no means well suited for
performance tests, so maybe my numbers are bogus. Also, maybe it's
better in more recent OpenBSD releases. But I think we need to take a
harder look at performance before we decide that it's okay to change
the default semaphore type for this platform.

			regards, tom lane
On Fri, Oct 22, 2021 at 09:00:31PM -0400, Tom Lane wrote:
> I tried this on an OpenBSD 6.0 image I had handy. The good news is
> that it works, and I can successfully start the postmaster with a lot
> of semaphores (I tried with max_connections=10000) without any special
> system configuration. The bad news is it's *slow*. It takes the
> postmaster over a minute to start up at 10000 max_connections, and
> also about 15 seconds to shut down. The regression tests also appear
> noticeably slower, even at the default max_connections=100. I'm
> afraid that those "lots of tiny mappings" that Thomas noted have
> a nasty impact on our process launch times, since the kernel
> presumably has to do work to clone them into the child process.
>
> Now this lashup that I'm testing on is by no means well suited for
> performance tests, so maybe my numbers are bogus. Also, maybe it's
> better in more recent OpenBSD releases. But I think we need to take a
> harder look at performance before we decide that it's okay to change
> the default semaphore type for this platform.

I got the following results for "time make installcheck" on a laptop
with OpenBSD 7.0 (amd64):

POSIX (max_connections=100) (default):
    1m32.39s real     0m03.82s user     0m05.75s system
POSIX (max_connections=10000):
    2m13.11s real     0m03.56s user     0m07.06s system
SysV (max_connections=100) (default):
    1m24.39s real     0m03.30s user     0m04.94s system
SysV (max_connections=10000): failed to start

After sysctl tuning:
SysV (max_connections=10000):
    1m47.51s real     0m03.78s user     0m05.61s system

I can confirm that starting and stopping the server was slower in the
POSIX case, but not terribly different (seconds, not a minute as in your
case). As the OpenBSD developers said, those who use OpenBSD are never
after peak performance; the system has a lot of bottlenecks besides IPC.

I see the following reasons to switch from SysV to POSIX:

- consistency in the ports tree: all major ports use POSIX, which means
  better testing of the API
- as already pointed out, OpenBSD isn't about performance, and the
  results for the default max_connections are pretty close
- crash recovery with the OS defaults is automatic and doesn't require
  DBA intervention or knowledge of ipcs and ipcrm
- higher density is available without system tuning

The disadvantage is worse performance in extreme cases, but I'm not sure
OpenBSD is used for those in production.
On Sun, Oct 17, 2021 at 10:52:38AM -0400, Tom Lane wrote:
> I am, however, concerned that this'll just trade off one hazard for
> another. Instead of a risk of failing with ENOSPC (which the DBA
> can fix), we'll have a risk of kneecapping some other process at
> random (which the DBA can do nothing to prevent).

I tend to agree, and along with the semas patch I would like to suggest
an error message improvement; it would have saved me about half a day of
digging. Tested on OpenBSD 7.0. I'm not a native speaker though, so the
grammar needs to be checked.

diff --git a/src/backend/port/sysv_sema.c b/src/backend/port/sysv_sema.c
index 21c883ba9a..b84f70b5e2 100644
--- a/src/backend/port/sysv_sema.c
+++ b/src/backend/port/sysv_sema.c
@@ -133,7 +133,10 @@ InternalIpcSemaphoreCreate(IpcSemaphoreKey semKey, int numSems)
 				 "respective kernel parameter. Alternatively, reduce PostgreSQL's "
 				 "consumption of semaphores by reducing its max_connections parameter.\n"
 				 "The PostgreSQL documentation contains more information about "
-				 "configuring your system for PostgreSQL.") : 0));
+				 "configuring your system for PostgreSQL.\n"
+				 "If server has crashed previously there may be resources left "
+				 "after it - take a look at ipcs(1) and ipcrm(1) man pages to see "
+				 "how to remove them.") : 0));
 	}
 
 	return semId;
On Mon, Oct 18, 2021 at 10:07 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Note: The best kind would be *unnamed* POSIX semas, where we get to
> control their placement in existing memory; that's what we do on Linux
> and FreeBSD. They weren't supported on OpenBSD last time we checked:
> it rejects requests for shared ones. I wonder if someone could
> implement them with just a few lines of user space code, using atomic
> counters and futex() for waiting.

I meant that it'd be cool if OpenBSD implemented shared memory unnamed
semas that way (as other OSes do), but just for fun I tried implementing
that in PostgreSQL. I already had a patch to provide a wrapper API for
futexes on a bunch of OSes including OpenBSD (because I've been looking
into ways to rewrite lwlock.c to use futexes directly and skip all the
per-backend semaphore stuff). That made it easy to write a
quick-and-dirty clone of sem_{init,wait,post}() using atomics and
futexes.

Sadly, although the attached proof-of-concept patch allows a
PREFERRED_SEMAPHORES=FUTEX build to pass tests on macOS (which also
lacks native unnamed semas), FreeBSD and Linux (which don't need this
but are interesting to test), and it also works on OpenBSD with
shared_memory_type=sysv, it doesn't work on OpenBSD with
shared_memory_type=mmap (the default). I suspect OpenBSD's futex(2)
has a bug: inherited anonymous shared mmap memory seems to confuse it
so that wakeups are lost. Arrrgh!
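For readers unfamiliar with the trick, here is a generic sketch of a
counting semaphore built from an atomic counter plus futex wait/wake, in
the spirit of the quick-and-dirty sem_{init,wait,post}() clone Thomas
describes; it is not his attached patch. The Linux futex syscall is used
for concreteness (OpenBSD's futex(2) takes slightly different arguments),
and the counter must of course live in memory shared by all participating
processes.

#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef struct FutexSema
{
	_Atomic uint32_t count;		/* number of available "permits" */
} FutexSema;

static long
sys_futex(void *addr, int op, uint32_t val)
{
	return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void
futex_sema_init(FutexSema *s, uint32_t value)
{
	atomic_init(&s->count, value);
}

static void
futex_sema_post(FutexSema *s)
{
	atomic_fetch_add(&s->count, 1);
	/* Wake one waiter, if any; waking when nobody waits is harmless. */
	sys_futex(&s->count, FUTEX_WAKE, 1);
}

static void
futex_sema_wait(FutexSema *s)
{
	for (;;)
	{
		uint32_t	c = atomic_load(&s->count);

		/* Fast path: try to grab a permit without sleeping. */
		while (c > 0)
		{
			if (atomic_compare_exchange_weak(&s->count, &c, c - 1))
				return;
			/* on failure, c has been reloaded; retry */
		}

		/* Sleep only if the count is still 0 when the kernel re-checks it. */
		if (sys_futex(&s->count, FUTEX_WAIT, 0) < 0 &&
			errno != EAGAIN && errno != EINTR)
			return;				/* unexpected error; give up in this sketch */
	}
}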
On Sun, Oct 17, 2021 at 10:29:24AM -0400, Tom Lane wrote:
> Also, you haven't explained why the existing (and much safer) recycling
> logic in IpcSemaphoreCreate doesn't solve your problem.

I think I'll drop the diffs; you're right that the current proven logic
needn't be changed for such a rare corner case, which the DBA can fix.

I've added references to ipcs(1) and ipcrm(1) to OpenBSD's semget(2) man
page, so a newcomer who runs into the same situation I did won't need to
spend hours digging into SysV sema management.

Thanks for the reviews.
On Sun, Oct 24, 2021 at 10:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Sadly, although the attached proof-of-concept patch allows a
> PREFERRED_SEMAPHORES=FUTEX build to pass tests on macOS (which also
> lacks native unnamed semas), FreeBSD and Linux (which don't need this
> but are interesting to test), and it also works on OpenBSD with
> shared_memory_type=sysv, it doesn't work on OpenBSD with
> shared_memory_type=mmap (the default). I suspect OpenBSD's futex(2)
> has a bug: inherited anonymous shared mmap memory seems to confuse it
> so that wakeups are lost. Arrrgh!

FWIW I'm trying to follow up with the OpenBSD list over here, because
it'd be nice to get that working:

https://marc.info/?l=openbsd-misc&m=163524454303022&w=2
On Fri, Oct 29, 2021 at 4:54 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sun, Oct 24, 2021 at 10:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Sadly, although the attached proof-of-concept patch allows a
> > PREFERRED_SEMAPHORES=FUTEX build to pass tests on macOS (which also
> > lacks native unnamed semas), FreeBSD and Linux (which don't need this
> > but are interesting to test), and it also works on OpenBSD with
> > shared_memory_type=sysv, it doesn't work on OpenBSD with
> > shared_memory_type=mmap (the default). I suspect OpenBSD's futex(2)
> > has a bug: inherited anonymous shared mmap memory seems to confuse it
> > so that wakeups are lost. Arrrgh!
>
> FWIW I'm trying to follow up with the OpenBSD list over here, because
> it'd be nice to get that working:
>
> https://marc.info/?l=openbsd-misc&m=163524454303022&w=2

This has been fixed. So now there are working basic futexes on Linux,
macOS, {Free,Open,Net,Dragonfly}BSD (though capabilities beyond basic
wait/wake vary, as do APIs). So the question is whether it would be
worth trying to do our own futex-based semaphores, as sketched above,
just for the benefit of the OSes where the available built-in
semaphores are of the awkward SysV kind, namely macOS, NetBSD and
OpenBSD. Perhaps we shouldn't waste our time with that, and should
instead plan to use futexes for a more ambitious lwlock rewrite.
Thomas Munro <thomas.munro@gmail.com> writes:
> This has been fixed. So now there are working basic futexes on Linux,
> macOS, {Free,Open,Net,Dragonfly}BSD (though capabilities beyond basic
> wait/wake vary, as do APIs). So the question is whether it would be
> worth trying to do our own futex-based semaphores, as sketched above,
> just for the benefit of the OSes where the available built-in
> semaphores are of the awkward SysV kind, namely macOS, NetBSD and
> OpenBSD. Perhaps we shouldn't waste our time with that, and should
> instead plan to use futexes for a more ambitious lwlock rewrite.

I kind of like the latter idea, but I wonder how we make it coexist
with (admittedly legacy) code for OSes that don't have usable futexes.

			regards, tom lane
On Sat, Nov 20, 2021 at 9:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > This has been fixed. So now there are working basic futexes on Linux,
> > macOS, {Free,Open,Net,Dragonfly}BSD (though capabilities beyond basic
> > wait/wake vary, as do APIs). So the question is whether it would be
> > worth trying to do our own futex-based semaphores, as sketched above,
> > just for the benefit of the OSes where the available built-in
> > semaphores are of the awkward SysV kind, namely macOS, NetBSD and
> > OpenBSD. Perhaps we shouldn't waste our time with that, and should
> > instead plan to use futexes for a more ambitious lwlock rewrite.
>
> I kind of like the latter idea, but I wonder how we make it coexist
> with (admittedly legacy) code for OSes that don't have usable futexes.

One very rough idea, not yet tried, is that they could keep using
semaphores, but use them to implement fake futexes. We'd put them in
wait lists that live in a shared memory hash table (the futex address
is the key, with some extra work needed for DSM-resident futexes),
with per-bucket spinlocks so that you can perform the value check
atomically with the decision to start waiting.
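To make the fake-futex idea a little more tangible, a rough standalone
sketch of the shape it could take follows. A real version would live in
PostgreSQL shared memory, use PGSemaphores and proper spinlocks, and deal
with DSM-resident futex words; here plain sem_t stands in for both the
per-bucket lock and the per-process sleep primitive, and semaphore
initialization is omitted, purely to keep the sketch short.

#include <semaphore.h>
#include <stddef.h>
#include <stdint.h>

#define FF_NBUCKETS 128

typedef struct FakeFutexWaiter
{
	volatile uint32_t *addr;	/* futex word this waiter is blocked on */
	sem_t		sleep_sem;		/* what the waiter actually sleeps on */
	struct FakeFutexWaiter *next;
} FakeFutexWaiter;

typedef struct FakeFutexBucket
{
	sem_t		lock;			/* stand-in for a per-bucket spinlock */
	FakeFutexWaiter *waiters;
} FakeFutexBucket;

/* Imagine this table living in shared memory; sem_init calls not shown. */
static FakeFutexBucket buckets[FF_NBUCKETS];

static FakeFutexBucket *
ff_bucket(volatile uint32_t *addr)
{
	return &buckets[((uintptr_t) addr >> 2) % FF_NBUCKETS];
}

/* Like FUTEX_WAIT: sleep only if *addr still equals expected. */
static void
ff_wait(volatile uint32_t *addr, uint32_t expected, FakeFutexWaiter *self)
{
	FakeFutexBucket *b = ff_bucket(addr);

	sem_wait(&b->lock);
	if (*addr != expected)
	{
		/* Value already changed; don't sleep. */
		sem_post(&b->lock);
		return;
	}

	/*
	 * Enqueue ourselves while still holding the bucket lock, so the value
	 * check and the decision to wait are atomic with respect to wakers.
	 */
	self->addr = addr;
	self->next = b->waiters;
	b->waiters = self;
	sem_post(&b->lock);

	sem_wait(&self->sleep_sem);	/* block until a waker posts us */
}

/* Like FUTEX_WAKE: wake up to nwake waiters blocked on addr. */
static void
ff_wake(volatile uint32_t *addr, int nwake)
{
	FakeFutexBucket *b = ff_bucket(addr);
	FakeFutexWaiter **prev;

	sem_wait(&b->lock);
	prev = &b->waiters;
	while (*prev != NULL && nwake > 0)
	{
		FakeFutexWaiter *w = *prev;

		if (w->addr == addr)
		{
			*prev = w->next;	/* unlink */
			sem_post(&w->sleep_sem);
			nwake--;
		}
		else
			prev = &w->next;
	}
	sem_post(&b->lock);
}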