Reducing sema usage (was Postmaster dies with many child processes) - Mailing list pgsql-hackers

I said:
> Another thing we ought to look at is changing the use of semaphores so
> that Postgres uses a fixed number of semaphores, not a number that
> increases as more and more backends are started.  Kernels are
> traditionally configured with very low limits for the SysV IPC
> resources, so having a big appetite for semaphores is a Bad Thing.

I've been looking into this issue today, and it looks possible but messy.

The source of the problem is the lock manager
(src/backend/storage/lmgr/proc.c), which wants to be able to wake up a
specific process that is blocked on a lock.  I had first thought that it
would be OK to wake up any one of the processes waiting for a lock, but
after looking at the lock manager that seems a bad idea --- considerable
thought has gone into the queuing order of waiting processes, and we
don't want to give that up.  So we need to preserve this ability.

The way it's currently done is that each extant backend has its own
SysV-style semaphore, and when you want to wake up a particular backend
you just V() its semaphore.  (BTW, the semaphores get allocated in
chunks of 16, so an out-of-semaphores condition will always occur when
trying to start the 16*N+1'th backend...)  This is simple and reliable
but fails if you want to have more backends than the kernel has SysV
semaphores.  Unfortunately kernels are usually configured with not
very many semaphores --- 64 or so is typical.  Also, running the system
down to nearly zero free semaphores is likely to cause problems for
other subsystems even if Postgres itself doesn't run out.

What seems practical to do instead is this:
* At postmaster startup, allocate a fixed number of semaphores for use
by all child backends.  ("Fixed" can really mean "configurable", of
course, but the point is we won't ask for more later.)

* The semaphores aren't dedicated to use by particular backends.
Rather, when a backend needs to block, it finds a currently free
semaphore and grabs it for the duration of its wait.  The number of the
semaphore a backend is using to wait with would be recorded in its PROC
struct, and we'd also need an array of per-sema data to keep track of
free and in-use semaphores.

* This works with very little extra overhead until we have more
simultaneously-blocked backends than we have semaphores.  When that
happens (which we hope is really seldom), we overload semaphores ---
that is, we use the same sema to block two or more backends.  Then the
V() operation by the lock's releaser might wake the wrong backend.  So,
we need an extra field in the LOCK struct to identify the intended
wake-ee.  When a backend is released in ProcSleep, it has to look at the
lock it is waiting on to see if it is supposed to be wakened right now.
If not, it V()s its shared semaphore a second time (to release the
intended wakee), then P()s the semaphore again to go back to sleep
itself.  There probably has to be a delay in here, to ensure that the
intended wakee gets woken and we don't have its bed-mates indefinitely
trading wakeups among the wrong processes.  This is why we don't want
this scenario happening often.

I think this could be made to work, but it would be a delicate and
hard-to-test change in what is already pretty subtle code.

A considerably more straightforward approach is just to forget about
incremental allocation of semaphores and grab all we could need at
postmaster startup.  ("OK, Mac, you told me to allow up to N backends?
Fine, I'm going to grab N semaphores at startup, and if I can't get them
I won't play.")  This would force the DB admin to either reconfigure the
kernel or reduce MaxBackendId to something the kernel can support right
off the bat, rather than allowing the problem to lurk undetected until
too many clients are started simultaneously.  (Note there are still
potential gotchas with running out of processes, swap space, or file
table slots, so we wouldn't have really guaranteed that N backends can
be started safely.)

If we make MaxBackendId settable from a postmaster command-line switch
then this second approach is probably not too inconvenient, though it
surely isn't pretty.

Any thoughts about which way to jump?  I'm sort of inclined to take
the simpler approach myself...
        regards, tom lane

