Re: Quite strange crash - Mailing list pgsql-hackers
From | ncm@zembu.com (Nathan Myers)
---|---
Subject | Re: Quite strange crash
Date |
Msg-id | 20010108161030.B571@store.zembu.com
In response to | Re: Quite strange crash (Tom Lane <tgl@sss.pgh.pa.us>)
Responses | Re: Quite strange crash; Re: Quite strange crash
List | pgsql-hackers
On Mon, Jan 08, 2001 at 12:21:38PM -0500, Tom Lane wrote:
> Denis Perchine <dyp@perchine.com> writes:
> >>>>>>> FATAL: s_lock(401f7435) at bufmgr.c:2350, stuck spinlock. Aborting.
> >>>>>
> >>>>> Were there any errors before that?
>
> > Actually you can have a look on the logs yourself.
>
> Well, I found a smoking gun: ...
>
> What seems to have happened is that 2501 curled up and died, leaving
> one or more buffer spinlocks locked. ...
>
> There is something pretty fishy about this.  You aren't by any chance
> running the postmaster under a ulimit setting that might cut off
> individual backends after a certain amount of CPU time, are you?
> What signal does a ulimit violation deliver on your machine, anyway?

It's worth noting here that modern Unixes run around killing user-level
processes more or less at random when free swap space (and sometimes just
RAM) runs low.  AIX was the first to do this, but it would send SIGDANGER
to processes first to try to reclaim some RAM; critical daemons were
expected to explicitly ignore SIGDANGER.  Other Unixes picked up the idea
without picking up the SIGDANGER behavior.

This common pathological behavior usually traces back to sloppy resource
accounting.  It shows up as the bad policy of having malloc() (and sbrk()
or mmap() underneath) return a valid pointer rather than NULL, on the
assumption that most of the memory asked for won't be used just yet.
Anyhow, the system doesn't know how much memory is really available at
that moment.

The problem is usually explained with the example of a very large process
that forks, suddenly demanding twice as much memory.  (Apache is
particularly egregious this way, allocating lots of memory and then
forking several times.)  Instead of failing the fork, the kernel waits
until a process touches memory it was granted, then sees whether any
RAM/swap has turned up to satisfy it, and kills the process (or some
random other process!) if not.  Now that programs have come to depend on
this behavior, it has become very hard to fix.

The implication for the rest of us is that we should expect our processes
to be killed at random, just for touching memory they were granted, or for
no reason at all.  (Kernel people say, "They're just user-level programs,
restart them," or, "Maybe we can designate some critical processes that
don't get killed.")  In Linux they try to invent heuristics to avoid
killing the X server, because so many programs depend on it.  It's a
disgraceful mess, really.

The relevance to the issue at hand is that processes dying under heavy
memory load is a documented feature of our supported platforms.

Nathan Myers
ncm@zembu.com
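[A minimal sketch of the overcommit behavior described above, added for
illustration; it is not part of the original post.  It assumes a Linux
system with the default overcommit policy (vm.overcommit_memory = 0) and a
request larger than available RAM plus swap.]

```c
/*
 * Demonstration of memory overcommit: malloc() "succeeds" even when the
 * request cannot be backed by RAM + swap, and the failure only surfaces
 * later, when the pages are first touched and the kernel's OOM killer
 * delivers SIGKILL to this process (or some other one).
 */
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    /* 64 GB: assumed to exceed RAM + swap on the test machine */
    size_t      huge = (size_t) 64 * 1024 * 1024 * 1024;
    char       *p;
    size_t      off;

    p = malloc(huge);           /* typically returns non-NULL: no pages
                                 * are actually backed yet */
    if (p == NULL)
    {
        fprintf(stderr, "malloc failed up front; overcommit is disabled\n");
        return 1;
    }
    printf("malloc of %zu bytes \"succeeded\"\n", huge);

    /*
     * Touch one byte per page.  This is where the kernel must find real
     * memory, and where a process may be killed without ever seeing an
     * error return from malloc().
     */
    for (off = 0; off < huge; off += 4096)
        p[off] = 1;

    printf("survived touching all pages\n");
    free(p);
    return 0;
}
```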