Re: How to cripple a postgres server - Mailing list pgsql-general
From | Stephen Robert Norris |
---|---|
Subject | Re: How to cripple a postgres server |
Date | |
Msg-id | 1022721096.6066.36.camel@ws12 |
In response to | Re: How to cripple a postgres server (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-general |
On Thu, 2002-05-30 at 01:52, Tom Lane wrote:
> I spent some time this morning trying to reproduce your problem, with
> not much luck. I used the attached test program, in case anyone else
> wants to try --- it fires up the specified number of connections
> (issuing a trivial query on each one, just so that the backend is not
> completely virgin) and goes to sleep. I ran that in one window and did
> manual "vacuum full"s in psql in another window. I was doing the
> vacuums in the regression database which has about 150 tables, so there
> was an SI overrun event (resulting in SIGUSR2) every third or so vacuum.
>
> Using stock Red Hat Linux 7.2 (kernel 2.4.7-10) on a machine with 256M
> of RAM, I was able to run up to about 400 backends without seeing much
> of any performance problem. (I had the postmaster running with
> postmaster -i -F -N 1000 -B 2000 and defaults in postgresql.conf.)
> Each SI overrun fired up all the idle backends, but they went back to
> sleep after a couple of kernel calls and not much computation.

Similar setup here, but 1GB RAM. If this problem is some sort of O(n^2) thing, it could well be the case that it only happens on (for example) > 600 backends, and is fine at 400...

I also wonder if SMP has any impact - if there's lots of semops going on, and the memory is being thrashed between CPU caches, that won't be nice...

> Above 500 backends the thing went into swap hell --- it took minutes of
> realtime to finish out the SI overrun cycle, even though the CPU was
> idle (waiting for swap-in) most of the time.

I never swap.

Some more data from this end - I have only managed to reproduce the problem once in about 2 hours with those lines removed that you asked me to remove yesterday. With the lines still in, the problem happens after a minute or two pretty much every time.

I still see the high numbers of processes in the run queue, and the load rises, but neither postgres nor the machine stalls.

> What does your strace look like?
>
> regards, tom lane

In "normal" SI overruns, about the same:

--- SIGUSR2 (User defined signal 2) ---
gettimeofday({1022719053, 355014}, NULL) = 0
close(7) = 0
close(6) = 0
close(4) = 0
close(3) = 0
close(9) = 0
semop(6258745, 0xbfffeb04, 1) = 0
semop(6193207, 0xbfffeb04, 1) = 0
open("/var/lib/pgsql/data/base/504592641/1259", O_RDWR) = 3
open("/var/lib/pgsql/data/base/504592641/16429", O_RDWR) = 4
semop(6258745, 0xbfffe8e4, 1) = 0
semop(6225976, 0xbfffe8e4, 1) = 0
open("/var/lib/pgsql/data/base/504592641/1249", O_RDWR) = 6
open("/var/lib/pgsql/data/base/504592641/16427", O_RDWR) = 7
open("/var/lib/pgsql/data/base/504592641/16414", O_RDWR) = 9
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}}, {it_interval={0, 0}, it_value={0, 0}}) = 0
semop(6258745, 0xbfffea24, 1) = 0
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, {it_interval={0, 0}, it_value={0, 870000}}) = 0
lseek(9, 0, SEEK_END) = 0
semop(4456450, 0xbfffeac4, 1) = 0
sigreturn() = ? (mask now [])
recv(8, 0x839a0a0, 8192, 0) = ? ERESTARTSYS (To be restarted)

Although, I see anything up to 9 or even 15 semop() calls and file close/open pairs.
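For comparing traces like this across backends, the calls made after each SIGUSR2 can be tallied mechanically. What follows is only a rough sketch, assuming the strace output has one call per line as strace normally prints it. The helper name count_calls_per_signal is hypothetical and not part of strace or postgres, and nested signals (such as a SIGALRM arriving inside the handler) are not treated specially: the burst simply ends at the first sigreturn().

```python
import re
from collections import Counter

# Matches the start of a syscall line such as 'semop(6258745, ...) = 0'.
SYSCALL_RE = re.compile(r"^(\w+)\(")


def count_calls_per_signal(lines, signal_name="SIGUSR2"):
    """Return one Counter of syscall names per delivery of signal_name.

    A burst starts at a '--- SIGUSR2 ... ---' marker and ends at the
    next sigreturn() line. Hypothetical helper, for illustration only.
    """
    bursts = []
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("---") and signal_name in line:
            current = Counter()
            bursts.append(current)
            continue
        if current is None:
            continue  # outside any handler burst
        match = SYSCALL_RE.match(line)
        if not match:
            continue  # other signal markers, blank lines, etc.
        name = match.group(1)
        if name == "sigreturn":
            current = None  # handler finished
        else:
            current[name] += 1
    return bursts


# A shortened excerpt of the trace above, used as sample input.
sample = """\
--- SIGUSR2 (User defined signal 2) ---
gettimeofday({1022719053, 355014}, NULL) = 0
close(7) = 0
semop(6258745, 0xbfffeb04, 1) = 0
semop(6193207, 0xbfffeb04, 1) = 0
open("/var/lib/pgsql/data/base/504592641/1259", O_RDWR) = 3
sigreturn() = ? (mask now [])
recv(8, 0x839a0a0, 8192, 0) = ? ERESTARTSYS (To be restarted)
""".splitlines()

for burst in count_calls_per_signal(sample):
    print(dict(burst))
```

Running it over a saved trace from each backend would show whether the semop() count per overrun really varies between 9 and 15, or whether a few backends account for most of the calls.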
When it went mad, this happened:

--- SIGUSR2 (User defined signal 2) ---
gettimeofday({1022720979, 494838}, NULL) = 0
semop(10551353, 0xbfffeb04, 1) = 0
close(7) = 0
close(6) = 0
close(4) = 0
close(3) = 0
select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout)
close(9) = 0
open("/var/lib/pgsql/data/base/504592641/1259", O_RDWR) = 3
select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout)
open("/var/lib/pgsql/data/base/504592641/16429", O_RDWR) = 4
open("/var/lib/pgsql/data/base/504592641/1249", O_RDWR) = 6
open("/var/lib/pgsql/data/base/504592641/16427", O_RDWR) = 7
open("/var/lib/pgsql/data/base/504592641/16414", O_RDWR) = 9
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}}, {it_interval={0, 0}, it_value={0, 0}}) = 0
semop(10551353, 0xbfffea24, 1) = -1 EINTR (Interrupted system call)
--- SIGALRM (Alarm clock) ---
semop(10551353, 0xbfffe694, 1) = 0
semop(8716289, 0xbfffe694, 1) = 0
sigreturn() = ? (mask now [USR2])

However, the strace stopped just before the ) on the first semop, which I think means it hadn't completed. The whole thing (postgres, vmstat and all) stopped for about 10 seconds, then it went on.

This was only a short version of the problem (it can lock up for 20-30 seconds), but I think it's the same thing.

Stephen