Wayne Piekarski <wayne@senet.com.au> writes:
> Unfortunately, this is not the kind of thing I can reproduce with a
> testing program, and so I can't try it against 6.5 - but it still exists
> in 6.4.2, so unless someone's made more changes related to this area,
> there is a chance it is still in 6.5 - although the locking code has
> been changed a lot, so maybe not?
I honestly don't know what to tell you here. There have been a huge
number of changes and bugfixes between 6.4.2 and 6.5, but there's really
no way to guess from your report whether any of them will cure your
problem (or, perhaps, make it worse :-(). I wish you could run 6.5-
current for a while under your live load and see how it fares. But
I understand your reluctance to do that.
> Is there anything I can do, like enable some extra debugging code,
There is some debug logging code in the lock manager, but it produces
a huge volume of log output when turned on, and I for one am not
qualified to decipher it (perhaps one of the other list members can
offer more help). What I'd suggest first is trying to verify that
it *is* a lock problem. Attaching to some of the hung backends with
gdb and dumping their call stacks with "bt" could be very illuminating.
Especially if you compile the backend with -g first.
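A rough sketch of such a session follows; the executable path and the
PID are only placeholders, so substitute the real location of your
postgres binary and the PID of one of the stuck backends:

    # find the PIDs of the hung backends
    ps ax | grep postgres

    # attach to one of them (path and PID are examples only)
    gdb /usr/local/pgsql/bin/postgres 12345
    (gdb) bt          # dump the call stack
    (gdb) detach      # let the backend continue
    (gdb) quit

Collecting "bt" output from several hung backends and comparing where
they are stopped should show whether they are all waiting inside the
lock code or somewhere else entirely.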
> One thing I thought is this problem could still be related to the
> spinlock/semget problem. i.e., too many backends start up, something
> fails and dies off, but leaves a semaphore lying around, and so from
> then onwards, all the backends are waiting for this semaphore to go
> away when it is actually still hanging around, causing problems ...
IIRC, 6.4.* will absolutely *not* recover from running out of kernel
semaphores or backend process slots. This is fixed in 6.5, and I think
someone posted a patch for 6.4 that covers the essentials, but I do
not recall the details.
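If you want to check whether stale SysV semaphores really are piling up,
the stock ipcs/ipcrm utilities are one way to look. This is only a
sketch: output format and option spelling vary by platform, and the id
shown is just an example:

    # list the SysV semaphore sets currently allocated
    ipcs -s

    # with the postmaster stopped, a leftover set can be removed by id
    # (example id; some platforms spell this "ipcrm sem 123456")
    ipcrm -s 123456

Comparing the ipcs output against the semaphores a freshly started
postmaster creates would show whether anything is being left behind.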
regards, tom lane