Andreas Rieke <andreas.rieke@isl.de> writes:
> I am the guy who posted the problem to mod_perl, and yes, I am quite
> sure that we are talking about the right numbers. The best argument is
> that the machine in fact starts swapping when memory is gone - and this
> means there is neither free nor cached memory left.

Andreas, this sounds to me like a kernel memory leak, probably
triggered by Postgres' use of SysV shared memory (which is not a heavily
used kernel feature these days, so bugs in it are hardly out of the
question).

A couple of facts that might help you narrow your theories:

1. When the postmaster starts up, it allocates one, count 'em one,
shared memory segment that is never thereafter changed in size.

2. When the postmaster shuts down, it issues a shmctl(IPC_RMID)
call against that segment. The kernel should thereupon mark the
segment for destruction, and then actually destroy it when the
last process connected to it is gone. In a normal shutdown that
would mean immediately (because the postmaster waits for all its
child processes to die first), but in an "immediate mode" shutdown
there might still be children alive at the instant of the shmctl.

Within this context, the only way to cause a memory leak is to
"kill -9" the postmaster instead of giving it a chance to exit
gracefully. In that case the shmctl(IPC_RMID) never happens and
the memory segment isn't reclaimed. However, if that were your
problem then the evidence would be real clear in "ipcs -m -a"
output: lots of postgres-owned segments with zero attached processes.
(There actually is code in the postmaster to try to find and
destroy such orphaned segments during postmaster restart, but
it's not 100% guaranteed to find everything.)

If the shared segment is no longer present according to ipcs,
and there are no postgres processes still running, then it's
simply not possible for it to be postgres' fault if memory has
not been reclaimed. So you're looking at a kernel bug.
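
The ipcs check described above can be scripted. What follows is only a
minimal sketch: it assumes the Linux (util-linux) column layout of
"ipcs -m" (key, shmid, owner, perms, bytes, nattch, status) and that the
segments belong to an OS user named postgres. Other platforms print
different columns, and the function names are made up for illustration.

```python
import subprocess


def orphaned_segments(ipcs_output, owner="postgres"):
    """Return (shmid, bytes) for segments owned by `owner` with nattch == 0.

    Assumes Linux-style `ipcs -m` rows:
        key shmid owner perms bytes nattch [status]
    """
    orphans = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows start with a hex key; header and separator lines do not.
        if len(fields) < 6 or not fields[0].startswith("0x"):
            continue
        shmid, seg_owner, size, nattch = fields[1], fields[2], fields[4], fields[5]
        if not (shmid.isdigit() and size.isdigit() and nattch.isdigit()):
            continue
        if seg_owner == owner and int(nattch) == 0:
            orphans.append((int(shmid), int(size)))
    return orphans


def read_ipcs():
    """Run `ipcs -m` and return its text output (needs ipcs on the PATH)."""
    return subprocess.run(
        ["ipcs", "-m"], capture_output=True, text=True, check=True
    ).stdout
```

If orphaned_segments(read_ipcs()) comes back empty while no postgres
processes are running, nothing on the Postgres side is holding the
memory, which is the situation described above.
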
As to the nature of the bug ... we saw something similar in older
versions of OS X:
http://archives.postgresql.org/pgsql-general/2004-08/msg00972.php
Since Darwin is BSD-derived, an ancient common bug seems possible.
(BTW, I just repeated the above experiment in OS X 10.4.8, and see
no leak, so Apple did fix it somewhere along the line.)

Anyway, I'd suggest trying to duplicate the problem without Apache
by firing new backends rapidly as in the above message. If you can,
file a kernel bug report.
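
One way to try that is a small loop that opens and closes many
connections while watching free memory. This is a rough sketch and not
the exact procedure from the linked message: it assumes psql is on the
PATH, that a database named "test" exists (a placeholder), and that a
Linux-style /proc/meminfo is available. The function names are
hypothetical.

```python
import subprocess


def mem_free_kb(meminfo_text):
    """Extract the MemFree value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemFree:"):
            return int(line.split()[1])
    raise ValueError("no MemFree line found")


def churn_backends(count, dbname="test", run=subprocess.run):
    """Start `count` short-lived backends, one per psql invocation.

    Each psql connection makes the postmaster fork a new backend, which
    exits when psql disconnects. `dbname` is a placeholder.
    """
    cmd = ["psql", "-d", dbname, "-q", "-c", "SELECT 1"]
    for _ in range(count):
        run(cmd, stdout=subprocess.DEVNULL, check=True)


def free_memory_delta(count, meminfo_path="/proc/meminfo"):
    """Return MemFree after the churn minus MemFree before it, in kB."""
    with open(meminfo_path) as f:
        before = mem_free_kb(f.read())
    churn_backends(count)
    with open(meminfo_path) as f:
        after = mem_free_kb(f.read())
    return after - before
```

MemFree moves around for unrelated reasons (page cache, other
processes), so a single run proves little. A drop that keeps growing
over repeated runs, with ipcs showing nothing left behind, would be the
kind of evidence worth attaching to the bug report.
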
regards, tom lane