Re: How to cripple a postgres server - Mailing list pgsql-general

From Stephen Robert Norris
Subject Re: How to cripple a postgres server
Msg-id 1022628439.25604.2.camel@chinstrap
In response to Re: How to cripple a postgres server  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
On Wed, 2002-05-29 at 09:08, Tom Lane wrote:
> Stephen Robert Norris <srn@commsecure.com.au> writes:
> > I've already strace'ed the idle backend, and I can see the SIGUSR2 being
> > delivered just before everything goes bad.
>
> >> Yes, but what happens after that?
>
> > The strace stops until I manually kill the connecting process - the
> > machine stops in general until then (vmstat 1 stops producing output,
> > shells stop responding ...). So who knows what happens :(
>
> Hmm, I hadn't quite understood that you were complaining of a
> system-wide lockup and not just Postgres getting wedged.  I think the
> chances are very good that this *is* a kernel bug.  In any case, no
> self-respecting kernel hacker would be happy with the notion that
> a completely unprivileged user program can lock up the whole machine.
> So even if Postgres has got a problem, the kernel is clearly failing
> to defend itself adequately.
>
> Are you able to reproduce the problem with fewer than 800 backends?
> How about if you try it on a smaller machine?

Yep, on a PIII-800 with 256MB of RAM I can do it with fewer backends (I forget
how many) and only a few vacuums. It's much easier to trigger there, basically,
but that machine also has much less CPU. It also locks the machine up for
several minutes...

> Another thing that would be entertaining to try is other ways of
> releasing 800 queries at once.  For example, on connection 1 do
>     BEGIN; LOCK TABLE foo;
> then issue a "SELECT COUNT(*) FROM foo" on each other connection,
> and finally COMMIT on connection 1.  If that creates similar misbehavior
> then I think the SI-overrun mechanism is probably not to be blamed.
>
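A minimal sketch of the lock-release experiment Tom describes, assuming a DB-API 2.0 driver (such as psycopg2) that opens a transaction implicitly on the first `execute()`, and an existing table `foo`. The function name `run_lock_release_test`, the `settle_seconds` delay, and the one-thread-per-connection layout are hypothetical illustrative choices, not anything from the thread.

```python
# Sketch of the lock-release experiment: one connection holds LOCK TABLE,
# n_waiters others block on SELECT COUNT(*), then the holder commits so that
# every waiter is released at the same moment.
# Assumptions (not from the thread): a DB-API 2.0 driver such as psycopg2
# that opens a transaction implicitly on the first execute(), and an existing
# table. run_lock_release_test and settle_seconds are hypothetical names.
import threading
import time
from typing import Any, Callable, List


def run_lock_release_test(connect: Callable[[], Any], n_waiters: int,
                          table: str = "foo",
                          settle_seconds: float = 1.0) -> List[int]:
    """Return the COUNT(*) each waiter saw once the lock was released."""
    # Connection 1: the driver's implicit BEGIN plus LOCK TABLE
    # (ACCESS EXCLUSIVE mode when no mode is given).
    # `table` is interpolated directly, so it must be a trusted identifier.
    holder = connect()
    holder.cursor().execute(f"LOCK TABLE {table}")

    waiters = [connect() for _ in range(n_waiters)]
    results: List[int] = [0] * n_waiters

    def wait_on_lock(i: int) -> None:
        cur = waiters[i].cursor()
        cur.execute(f"SELECT COUNT(*) FROM {table}")  # blocks until COMMIT
        results[i] = cur.fetchone()[0]
        waiters[i].rollback()  # end the waiter's read-only transaction

    threads = [threading.Thread(target=wait_on_lock, args=(i,))
               for i in range(n_waiters)]
    for t in threads:
        t.start()
    time.sleep(settle_seconds)  # crude: give every SELECT time to queue
    holder.commit()             # drop the lock; all waiters wake together
    for t in threads:
        t.join()
    for conn in waiters + [holder]:
        conn.close()
    return results
```

With psycopg2, something like `run_lock_release_test(lambda: psycopg2.connect("dbname=test"), 800)` would approximate the 800-backend case, provided `max_connections` is raised well above its default of 100. The fixed sleep is the weakest part; polling `pg_locks` for ungranted entries before committing would be a more reliable way to confirm every SELECT is actually queued.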
> > ... Sometimes, the
> > SIGUSR2 does just create a very brief load spike (vmstat shows >500
> > processes on the run queue, but the next second everything is back to
> > normal and no unusual amount of CPU is consumed).
>
> That's the behavior I'd expect.  We need to figure out what's different
> between that case and the cases where it locks up.
>
>             regards, tom lane

Yeah. I'll try your suggestion above and report back.

    Stephen
