Re: How to cripple a postgres server - Mailing list pgsql-general

From Stephen Robert Norris
Subject Re: How to cripple a postgres server
Msg-id 1022721096.6066.36.camel@ws12
In response to Re: How to cripple a postgres server  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
On Thu, 2002-05-30 at 01:52, Tom Lane wrote:
> I spent some time this morning trying to reproduce your problem, with
> not much luck.  I used the attached test program, in case anyone else
> wants to try --- it fires up the specified number of connections
> (issuing a trivial query on each one, just so that the backend is not
> completely virgin) and goes to sleep.  I ran that in one window and did
> manual "vacuum full"s in psql in another window.  I was doing the
> vacuums in the regression database which has about 150 tables, so there
> was an SI overrun event (resulting in SIGUSR2) every third or so vacuum.
>
> Using stock Red Hat Linux 7.2 (kernel 2.4.7-10) on a machine with 256M
> of RAM, I was able to run up to about 400 backends without seeing much
> of any performance problem.  (I had the postmaster running with
> postmaster -i -F -N 1000 -B 2000 and defaults in postgresql.conf.)
> Each SI overrun fired up all the idle backends, but they went back to
> sleep after a couple of kernel calls and not much computation.

Similar setup here, but with 1GB of RAM. If this problem is some sort of
O(n^2) effect, it could well be that it only shows up above (for
example) 600 backends, and is fine at 400...
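
As a rough illustration of that guess: assuming, purely hypothetically, that the work per SI overrun grows with the square of the number of backends, the relative cost at different backend counts can be worked out as below. The quadratic model, the 400-backend baseline and the function name are assumptions for illustration only, not measurements.

```python
def relative_cost(n_backends, baseline=400, exponent=2):
    """Cost relative to the baseline under a hypothetical n**exponent model."""
    return (n_backends / baseline) ** exponent

# Compare a few backend counts against the 400-backend baseline.
for n in (400, 500, 600):
    print(n, round(relative_cost(n), 2))
```

Under that assumed model, 600 backends would cost a little over twice what 400 do per overrun, so a quadratic term by itself would have to be combined with some other limit to produce a sudden stall.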

I also wonder if SMP has any impact - if there's lots of semops going
on, and the memory is being thrashed between CPU caches, that won't be
nice...

> Above 500 backends the thing went into swap hell --- it took minutes of
> realtime to finish out the SI overrun cycle, even though the CPU was
> idle (waiting for swap-in) most of the time.

I never swap.

Some more data from this end - with the lines you asked me to remove
yesterday taken out, I have only managed to reproduce the problem once
in about 2 hours. With the lines still in, the problem happens after a
minute or two pretty much every time.

I still see the high numbers of processes in the run queue, and the load
rises, but neither postgres nor the machine stalls.


> What does your strace look like?
>
>             regards, tom lane

In "normal" SI overruns, about the same:

--- SIGUSR2 (User defined signal 2) ---
gettimeofday({1022719053, 355014}, NULL) = 0
close(7)                                = 0
close(6)                                = 0
close(4)                                = 0
close(3)                                = 0
close(9)                                = 0
semop(6258745, 0xbfffeb04, 1)           = 0
semop(6193207, 0xbfffeb04, 1)           = 0
open("/var/lib/pgsql/data/base/504592641/1259", O_RDWR) = 3
open("/var/lib/pgsql/data/base/504592641/16429", O_RDWR) = 4
semop(6258745, 0xbfffe8e4, 1)           = 0
semop(6225976, 0xbfffe8e4, 1)           = 0
open("/var/lib/pgsql/data/base/504592641/1249", O_RDWR) = 6
open("/var/lib/pgsql/data/base/504592641/16427", O_RDWR) = 7
open("/var/lib/pgsql/data/base/504592641/16414", O_RDWR) = 9
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}}, {it_interval={0, 0}, it_value={0, 0}}) = 0
semop(6258745, 0xbfffea24, 1)           = 0
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, {it_interval={0, 0}, it_value={0, 870000}}) = 0
lseek(9, 0, SEEK_END)                   = 0
semop(4456450, 0xbfffeac4, 1)           = 0
sigreturn()                             = ? (mask now [])
recv(8, 0x839a0a0, 8192, 0)             = ? ERESTARTSYS (To be restarted)

Sometimes, though, I see as many as 9 or even 15 semop() calls and file
close/open pairs.

When it went mad, this happened:

--- SIGUSR2 (User defined signal 2) ---
gettimeofday({1022720979, 494838}, NULL) = 0
semop(10551353, 0xbfffeb04, 1)          = 0
close(7)                                = 0
close(6)                                = 0
close(4)                                = 0
close(3)                                = 0
select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout)
close(9)                                = 0
open("/var/lib/pgsql/data/base/504592641/1259", O_RDWR) = 3
select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout)
open("/var/lib/pgsql/data/base/504592641/16429", O_RDWR) = 4
open("/var/lib/pgsql/data/base/504592641/1249", O_RDWR) = 6
open("/var/lib/pgsql/data/base/504592641/16427", O_RDWR) = 7
open("/var/lib/pgsql/data/base/504592641/16414", O_RDWR) = 9
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}}, {it_interval={0, 0}, it_value={0, 0}}) = 0
semop(10551353, 0xbfffea24, 1)          = -1 EINTR (Interrupted system call)
--- SIGALRM (Alarm clock) ---
semop(10551353, 0xbfffe694, 1)          = 0
semop(8716289, 0xbfffe694, 1)           = 0
sigreturn()                             = ? (mask now [USR2])

However, the strace output stopped just before the ")" of the first
semop, which I think means that call hadn't completed. The whole thing
(postgres, vmstat and all) stopped for about 10 seconds, then it went on.

This was only a short version of the problem (it can lock up for 20-30
seconds), but I think it's the same thing.
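
For anyone wanting to compare traces like the two above without reading them by eye, here is a minimal sketch that tallies syscalls in strace text, along with interrupted calls and select() timeouts. The function name and the sample excerpt are just for illustration; it assumes plain strace output in the format shown in this mail, and it skips signal markers and wrapped continuation lines.

```python
import re
from collections import Counter

# Matches the syscall name at the start of an strace line, e.g. "semop(".
SYSCALL_RE = re.compile(r"^([a-z_0-9]+)\(")

def tally_strace(text):
    """Count syscalls by name, plus EINTR results and select() timeouts."""
    calls = Counter()
    interrupted = 0
    timeouts = 0
    for line in text.splitlines():
        match = SYSCALL_RE.match(line.strip())
        if not match:
            continue  # signal markers and wrapped continuation lines
        calls[match.group(1)] += 1
        if "EINTR" in line:
            interrupted += 1
        if match.group(1) == "select" and "(Timeout)" in line:
            timeouts += 1
    return calls, interrupted, timeouts

# Short excerpt in the same format as the "went mad" trace.
SAMPLE = """\
--- SIGUSR2 (User defined signal 2) ---
semop(10551353, 0xbfffeb04, 1)          = 0
close(7)                                = 0
select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout)
semop(10551353, 0xbfffea24, 1)          = -1 EINTR (Interrupted system
call)
--- SIGALRM (Alarm clock) ---
semop(10551353, 0xbfffe694, 1)          = 0
"""

calls, interrupted, timeouts = tally_strace(SAMPLE)
print(dict(calls), interrupted, timeouts)
```

Running the same tally over a "normal" trace and a stalled one makes the differences (the select() timeouts and the interrupted semop) easy to spot.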

    Stephen
