Thread: Backends waiting, spinlocks, shared mem patches

Backends waiting, spinlocks, shared mem patches

From: Wayne Piekarski

Hi everyone,

Sorry this has taken me so long to get back to you. Just to refresh
everyone's memory, I was the one who was having problems with Postgres
backends just hanging around in a waiting state, not doing anything. Tom
Lane sent me a patch to fix this for 6.4.2.

We didn't just install the patch on our live system and run it, as we were
worried about breaking something, so we spent a lot of time thrashing it
around, trying to reproduce the problem to check whether it had been fixed
(which is why it's taken me a while to do this). We captured a few hundred
of the sessions our CGIs have with the database, including begin..commit
pairs and everything, in order to accurately simulate a heavy load on the
DBMS. We ran this replay program, keeping about 40-50 connections going
the whole time, but we could not get the waiting problem to occur even
with the normal 6.4.2, so it was not possible to test whether the patch
had fixed our particular problem.

That was disappointing; we had figured that because we were hammering it
so hard it would fail quickly, and that we could then use this as a good
test program. The patched 6.4.2 ran fine as well, which was good. It seems
the problem was caused by very rare circumstances which we just couldn't
reproduce during testing.
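
For what it's worth, the replay program is essentially just the
following sort of thing. This is only a rough sketch, not our actual
harness - the file name, connection string and one-statement-per-line
format are made up for illustration:

/* Rough sketch of a session-replay harness (illustrative only).  Each
 * child process replays one captured CGI session against the database,
 * so 40-50 connections can be kept busy at once. */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
#include "libpq-fe.h"

#define NUM_CLIENTS 40

static void replay_session(const char *conninfo, FILE *script)
{
    PGconn *conn = PQconnectdb(conninfo);
    char    line[8192];

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connect failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return;
    }
    /* one captured statement (including BEGIN/COMMIT) per line */
    while (fgets(line, sizeof(line), script) != NULL)
    {
        PGresult *res = PQexec(conn, line);

        if (PQresultStatus(res) == PGRES_FATAL_ERROR)
            fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        PQclear(res);
    }
    PQfinish(conn);
}

int main(void)
{
    int i;

    for (i = 0; i < NUM_CLIENTS; i++)
    {
        if (fork() == 0)
        {
            FILE *script = fopen("captured_session.sql", "r");

            if (script != NULL)
            {
                replay_session("dbname=test", script);
                fclose(script);
            }
            _exit(0);
        }
    }
    while (wait(NULL) > 0)          /* wait for every child to finish */
        ;
    return 0;
}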

One thing we did notice is that when we tried to open more than say 50
backends, we would get the following:

InitPostgres
IpcSemaphoreCreate: semget failed (No space left on device) key=5432017,
num=16, permission=600
proc_exit(3) [#0]         

Shortly after, we would get:

FATAL: s_lock(18001065) at spin.c:125, stuck spinlock. Aborting.


Our FreeBSD machine was not set up for a huge number of semaphores, so
the semget was failing. That was fair enough, but the postmaster would
then die with the spinlock error. I saw a post by Hiroshi Inoue saying
the following:

>Hi all,
>
>ProcReleaseSpins() does nothing unless MyProc is set.
>So both elog(ERROR/FATAL) and proc_exit(0) before 
>InitProcess() don't release spinlocks.
>
>Comments ?
>
>Hiroshi Inoue
>Inoue@tpf.co.jp

I would have to agree with him here. I'm not familiar with the Postgres
internals, but it looks like when semget fails, the backend doesn't clean
up the resources it already owns. I'm not sure whether this has been
fixed, as I can't always read the hackers list, but I thought I'd mention
it in case someone found it interesting.
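
Just to make the ordering concrete, here's a rough sketch of what I
think is going on. This is only my own simplified illustration, not the
real PostgreSQL source - apart from MyProc the names are invented:

/* Simplified illustration only, not the actual backend code.  The point
 * (as Hiroshi describes) is that if semget() fails before MyProc has
 * been set by InitProcess(), the exit path has nothing to hang the
 * spinlock release on, so any spinlock already held stays locked and
 * the next backend that needs it reports "stuck spinlock". */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/sem.h>

struct proc_struct;                    /* stand-in for the real struct */
struct proc_struct *MyProc = NULL;     /* only set by InitProcess()    */

static void release_spinlocks_on_exit(void)
{
    if (MyProc == NULL)
        return;        /* <-- nothing released: this is the problem    */
    /* ... release spinlocks/semaphores owned by this backend ...      */
}

static void backend_startup(key_t key)
{
    int semid;

    /* imagine a spinlock protecting a shared table is held right here */
    semid = semget(key, 16, IPC_CREAT | 0600);
    if (semid < 0)
    {
        /* ENOSPC: the kernel has run out of semaphores (SEMMNI/SEMMNS) */
        fprintf(stderr, "IpcSemaphoreCreate: semget failed (%s)\n",
                strerror(errno));
        release_spinlocks_on_exit();   /* does nothing, MyProc not set */
        exit(3);                       /* the spinlock is left locked  */
    }
    /* InitProcess() would set MyProc only after this point */
    semctl(semid, 0, IPC_RMID);        /* tidy up again in this demo   */
}

int main(void)
{
    backend_startup((key_t) 5432017);
    return 0;
}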


We tried the same massive-number-of-connections test with 6.5, and it
refuses to accept new connections after a while, which is good. I'm
reading through the archives about MaxBackendId now, so I'm going to play
with that.



So anyway, we installed the 6.4.2 patch a few days ago, and it seems to
be running OK. I haven't seen any cases where we get processes waiting on
nothing (yet, anyway - I'll have to wait and see for a few days). However,
we are now getting the stuck spinlock errors due to too many backends
being open, which I'm trying to prevent, so hopefully these two problems
will both go away.


Now that I've learned more about the stuck spinlock problem, I realise
that when I emailed the first time it was not just one problem but two or
three at the same time, which made it harder to nail down what was going
on. We will watch it over the week.

We have also been doing some testing with the latest 6.5 from the other
day, to check that certain problems we've bumped into have been fixed. We
can't run it live, but we'll try to run our testing programs on it as a
best approximation to help flush out any bugs that might be left.


Thanks for your help everyone, I hope that this has been helpful for
everyone else as well. I'm really looking forward to 6.5  :)


bye,
Wayne

------------------------------------------------------------------------------
Wayne Piekarski                               Tel:     (08) 8221 5221
Research & Development Manager                Fax:     (08) 8221 5220
SE Network Access Pty Ltd                     Mob:     0407 395 889
222 Grote Street                              Email:   wayne@senet.com.au
Adelaide SA 5000                              WWW:     http://www.senet.com.au


Re: [HACKERS] Backends waiting, spinlocks, shared mem patches

From: Tom Lane

Wayne Piekarski <wayne@senet.com.au> writes:
> Sorry this has taken me so long to get back to you.

Thanks for reporting back, Wayne.

> One thing we did notice is that when we tried to open more than say 50
> backends, we would get the following:
> InitPostgres
> IpcSemaphoreCreate: semget failed (No space left on device) key=5432017,
> num=16, permission=600
> proc_exit(3) [#0]         
> Shortly after, we would get:
> FATAL: s_lock(18001065) at spin.c:125, stuck spinlock. Aborting.

Yes, 6.4.* does not cope gracefully at all with running out of kernel
semaphores.  This is "fixed" in 6.5 by the brute-force approach of
grabbing all the semaphores we could want at postmaster startup, rather
than trying to allocate them on-the-fly during backend startup.  Either
way, you want your kernel to be able to provide one semaphore per
potential backend.
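
Roughly speaking, the 6.5 behaviour is the moral equivalent of this
sketch (illustration only, not the actual code; the numbers and names
are made up):

/* Grab every semaphore we could ever need at postmaster startup, so a
 * kernel shortage shows up as one clean failure before any backend
 * exists or any spinlock is held. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/sem.h>

#define MAX_BACKENDS  64        /* one semaphore per potential backend */
#define SEMS_PER_SET  16
#define NUM_SETS      ((MAX_BACKENDS + SEMS_PER_SET - 1) / SEMS_PER_SET)

static int sema_set[NUM_SETS];

static void preallocate_semaphores(key_t base_key)
{
    int i;

    for (i = 0; i < NUM_SETS; i++)
    {
        sema_set[i] = semget(base_key + i, SEMS_PER_SET,
                             IPC_CREAT | 0600);
        if (sema_set[i] < 0)
        {
            perror("postmaster: semget");   /* fail once, up front */
            exit(1);
        }
    }
}

int main(void)
{
    int i;

    preallocate_semaphores((key_t) 5432017);
    /* a real postmaster would hand these out to backends as they
     * start; here we just remove them so the demo leaves nothing
     * behind in the kernel */
    for (i = 0; i < NUM_SETS; i++)
        semctl(sema_set[i], 0, IPC_RMID);
    return 0;
}

The win is just that the failure happens up front, instead of in the
middle of a backend's startup while it may be holding shared locks.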

> We tried the same massive number of connections test with 6.5 and it
> refuses to accept the connection after a while, which is good. I'm reading
> through archives about MaxBackendId now, so I'm going to play with that.

In 6.5 you just need to set the postmaster's -N switch.

> We have also been doing some testing with the latest 6.5 from the other
> day, to check that certain problems we've bumped into have been fixed. We
> can't run it live, but we'll try to run our testing programs on it as a
> best approximation to help flush out any bugs that might be left.

OK, please let us know ASAP if you spot problems... we are shooting for
formal 6.5 release one week from today...
        regards, tom lane


Re: [HACKERS] Backends waiting, spinlocks, shared mem patches

From: Wayne Piekarski

Hi,

> Yes, 6.4.* does not cope gracefully at all with running out of kernel
> semaphores.  This is "fixed" in 6.5 by the brute-force approach of
> grabbing all the semaphores we could want at postmaster startup, rather
> than trying to allocate them on-the-fly during backend startup.  Either
> way, you want your kernel to be able to provide one semaphore per
> potential backend.

Right now, every so often we have a problem where all of a sudden the
backends just start piling up, we exceed 50-60 backends, and then the
thing fails. The weird part is that sometimes it happens during times of
the day which are very quiet, when I wouldn't expect there to be that many
tasks being done. I'm thinking something is getting jammed up in Postgres
and then this occurs [more about this later]. We get the spinlock failure
message and then we just restart, so it does "recover" in a way, although
it would be better if it didn't die. At least I understand what is
happening here.

> > We have also been doing some testing with the latest 6.5 from the other
> > day, to check that certain problems we've bumped into have been fixed. We
> > can't run it live, but we'll try to run our testing programs on it as a
> > best approximation to help flush out any bugs that might be left.
> 
> OK, please let us know ASAP if you spot problems... we are shooting for
> formal 6.5 release one week from today...

OK, well for the past two days or so we've still had the backends-waiting
problem like before, even though we installed the 6.4.2 shared memory
patches (i.e., lots of backends waiting for nothing to happen - some kind
of lock is getting left around by a backend). It has been running better
than it was before, but we still get one or two problems per day, which
isn't very good. This time, when we kill all the waiting backends, new
backends still jam anyway, so we kill and restart the whole thing. The
problem appears to have changed from what it was before, when we could
selectively kill off backends and eventually it would start working
again.

Unfortunately, this is not the kind of thing I can reproduce with a
testing program, and so I can't try it against 6.5 - but it still exists
in 6.4.2, so unless someone has made more changes related to this area,
there is a chance it is still in 6.5 - although since the locking code
has been changed a lot, maybe not?

Is there anything I can do, like enabling some extra debugging code or a
#define (I've tried turning on a few of the locking defines already), to
record what the backends are waiting for, so I or someone else can have a
look and see if the problem can be spotted? I can get it to happen once
or twice per day, but I can only test against 6.4.2, and it can't
adversely affect performance.

One thing I thought is that this problem could still be related to the
spinlock/semget problem: i.e., too many backends start up, something fails
and dies off but leaves a semaphore lying around, and from then onwards
all the backends are waiting for that semaphore to go away when it never
will, causing problems. The postmaster then fails to detect the stuck
spinlock, so it looks like a different problem? Hope that made sense.

thanks,
Wayne

------------------------------------------------------------------------------
Wayne Piekarski                               Tel:     (08) 8221 5221
Research & Development Manager                Fax:     (08) 8221 5220
SE Network Access Pty Ltd                     Mob:     0407 395 889
222 Grote Street                              Email:   wayne@senet.com.au
Adelaide SA 5000                              WWW:     http://www.senet.com.au


Re: [HACKERS] Backends waiting, spinlocks, shared mem patches

From: Tom Lane

Wayne Piekarski <wayne@senet.com.au> writes:
> Unfortunately, this is not the kind of thing I can reproduce with a
> testing program, and so I can't try it against 6.5 - but it still exists
> in 6.4.2 so unless someones made more changes related to this area, there
> might be a chance it is still in 6.5 - although the locking code has been
> changed a lot maybe not?

I honestly don't know what to tell you here.  There have been a huge
number of changes and bugfixes between 6.4.2 and 6.5, but there's really
no way to guess from your report whether any of them will cure your
problem (or, perhaps, make it worse :-().  I wish you could run 6.5-
current for a while under your live load and see how it fares.  But
I understand your reluctance to do that.

> Is there anything I can do, like enable some extra debugging code,

There is some debug logging code in the lockmanager, but it produces
a huge volume of log output when turned on, and I for one am not
qualified to decipher it (perhaps one of the other list members can
offer more help).  What I'd suggest first is trying to verify that
it *is* a lock problem.  Attaching to some of the hung backends with
gdb and dumping their call stacks with "bt" could be very illuminating.
Especially if you compile the backend with -g first.

> One thing I thought is this problem could still be related to the
> spinlock/semget problem. ie, too many backends start up, something fails
> and dies off, but leaves a semaphore laying around, and so from then
> onwards, all the backends are waiting for this semaphore to go when it is
> still hanging around, causing problems ...

IIRC, 6.4.* will absolutely *not* recover from running out of kernel
semaphores or backend process slots.  This is fixed in 6.5, and I think
someone posted a patch for 6.4 that covers the essentials, but I do
not recall the details.
        regards, tom lane