Thread: How to cripple a postgres server
I've observed the following, and I wonder if anyone has seen it or has a workaround, before I report it as a bug.

Doing the following will result in the load on the server reaching ~600, and the server becoming very unresponsive. This seems to be as a result of the SIGUSR2 which is delivered for async notify to free space in the event queue thingy (I'm sort of in the dark here).

1) Open many (I used 800) database connections and leave them idle.
2) run: while true; do vacuum {database} ; done

Wait. Observe db has stopped doing useful work.

This causes glitches of up to many minutes, during which the db fails to respond in any way. We noticed this on a production system, so it's fairly dire.

I know one workaround is to reduce the number of idle connections, but that will make our system less responsive.

It looks like a bug to me - anyone got any ideas?

	Stephen
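For anyone who wants to reproduce the vacuum half of that recipe without guessing at shell syntax, here is a minimal libpq sketch in the same spirit as the test programs later in this thread. It is illustrative only: "mydb" is a placeholder database name, and connection settings are taken from the usual libpq defaults/environment.

/* vacuumloop.c -- issue back-to-back VACUUMs forever.
 * Build with: gcc vacuumloop.c -o vacuumloop -lpq
 * "mydb" is a placeholder; adjust to taste. */
#include <stdio.h>
#include <stdlib.h>
#include "libpq-fe.h"

int
main(void)
{
	/* NULLs mean: use environment variables / built-in defaults */
	PGconn	   *conn = PQsetdb(NULL, NULL, NULL, NULL, "mydb");

	if (PQstatus(conn) == CONNECTION_BAD)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}

	for (;;)
	{
		PGresult   *res = PQexec(conn, "VACUUM");

		if (PQresultStatus(res) != PGRES_COMMAND_OK)
			fprintf(stderr, "VACUUM failed: %s", PQerrorMessage(conn));
		PQclear(res);
	}

	/* not reached */
	PQfinish(conn);
	return 0;
}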
Stephen Robert Norris <srn@commsecure.com.au> writes:
> 1) Open many (I used 800) database connections and leave them idle.
> 2) run: while true; do vacuum {database} ; done
> Wait. Observe db has stopped doing useful work.

Is your machine capable of supporting 800 active backends in the absence of any vacuum processing? I'm not sure that "we're driving our server into swapping hell" qualifies as a Postgres bug ...

If you're seeing load peaks in excess of what would be observed with 800 active queries, then I would agree there's something to investigate here.

			regards, tom lane
On Tue, 2002-05-28 at 11:57, Tom Lane wrote:
> Stephen Robert Norris <srn@commsecure.com.au> writes:
> > 1) Open many (I used 800) database connections and leave them idle.
> > 2) run: while true; do vacuum {database} ; done
> > Wait. Observe db has stopped doing useful work.
>
> Is your machine capable of supporting 800 active backends in the absence of any vacuum processing? I'm not sure that "we're driving our server into swapping hell" qualifies as a Postgres bug ...
>
> If you're seeing load peaks in excess of what would be observed with 800 active queries, then I would agree there's something to investigate here.
>
> 			regards, tom lane

Yep, indeed with 800 backends doing a query every second, nothing happens.

The machine itself has 1GB of RAM, and uses no swap in the above situation. Instead, system time goes to 99% of CPU. The machine is a dual-CPU athlon 1900 (1.5GHz).

It _only_ happens with idle connections!

	Stephen
Stephen Robert Norris <srn@commsecure.com.au> writes:
>> If you're seeing load peaks in excess of what would be observed with 800 active queries, then I would agree there's something to investigate here.

> Yep, indeed with 800 backends doing a query every second, nothing happens.

> The machine itself has 1GB of RAM, and uses no swap in the above situation. Instead, system time goes to 99% of CPU. The machine is a dual-CPU athlon 1900 (1.5GHz).

> It _only_ happens with idle connections!

Hmm, you mean if the 800 other connections are *not* idle, you can do VACUUMs with impunity? If so, I'd agree we got a bug ...

			regards, tom lane
On Tue, 2002-05-28 at 12:44, Tom Lane wrote:
> Stephen Robert Norris <srn@commsecure.com.au> writes:
> >> If you're seeing load peaks in excess of what would be observed with 800 active queries, then I would agree there's something to investigate here.
>
> > Yep, indeed with 800 backends doing a query every second, nothing happens.
>
> > The machine itself has 1GB of RAM, and uses no swap in the above situation. Instead, system time goes to 99% of CPU. The machine is a dual-CPU athlon 1900 (1.5GHz).
>
> > It _only_ happens with idle connections!
>
> Hmm, you mean if the 800 other connections are *not* idle, you can do VACUUMs with impunity? If so, I'd agree we got a bug ...
>
> 			regards, tom lane

Yes, that's what I'm saying.

If you put a sleep 5 into my while loop above, the problem still happens (albeit after a longer time). If you put a delay in, and also make sure that each of the 800 connections does a query (I'm using "select 1;"), then the problem never happens. Without the delay, you get a bit of load, but only up to about 10 or so, rather than 600.

What seems to happen is that all the idle backends get woken up every now and then to do _something_ (it seemed to coincide with getting the SIGUSR2 to trigger an async_notify), and that shoots up the load. Making sure all the backends have done something every now and then seems to avoid the problem.

I suspect the comment near the async_notify code explains the problem - where it talks about idle backends having to be woken up to clear out pending events...

We only notice it on our production system when it's got lots of idle db connections. Other people seem to have spotted it (I found references to similar effects with Google) but nobody seems to have worked out what it is - the other people typically had the problem with web servers which keep a pool of db connections.

Just to reiterate; this machine never swaps, it can handle > 1000 queries per second and it doesn't seem to run out of resources...

	Stephen
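The keep-alive workaround described above amounts to something like the following libpq sketch (hypothetical names, not Stephen's actual test program): walk the pool periodically and run a trivial query on each connection so no backend stays idle for long.

/* keepalive.c -- sketch of the "select 1 on every connection" workaround.
 * conns[] is assumed to hold nconn already-opened libpq connections;
 * the 1-second interval is arbitrary. */
#include <unistd.h>
#include "libpq-fe.h"

void
keepalive_loop(PGconn **conns, int nconn)
{
	for (;;)
	{
		int			i;

		for (i = 0; i < nconn; i++)
		{
			/* trivial query just to make the backend do *something* */
			PGresult   *res = PQexec(conns[i], "select 1;");

			PQclear(res);
		}
		sleep(1);
	}
}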
Stephen Robert Norris <srn@commsecure.com.au> writes:
> On Tue, 2002-05-28 at 12:44, Tom Lane wrote:
>> Hmm, you mean if the 800 other connections are *not* idle, you can do VACUUMs with impunity? If so, I'd agree we got a bug ...

> Yes, that's what I'm saying.

Fascinating.

> I suspect the comment near the async_notify code explains the problem - where it talks about idle backends having to be woken up to clear out pending events...

Well, we need to do that, but your report seems to suggest that that code path isn't getting the job done completely --- my first guess is that a new normal query has to arrive before a SIGUSR2'd backend is completely happy again. Interesting.

You didn't specify: what PG version are you running, and on what platform?

			regards, tom lane
On Tue, 2002-05-28 at 13:21, Tom Lane wrote:
> Stephen Robert Norris <srn@commsecure.com.au> writes:
> > On Tue, 2002-05-28 at 12:44, Tom Lane wrote:
> >> Hmm, you mean if the 800 other connections are *not* idle, you can do VACUUMs with impunity? If so, I'd agree we got a bug ...
>
> > Yes, that's what I'm saying.
>
> Fascinating.
>
> > I suspect the comment near the async_notify code explains the problem - where it talks about idle backends having to be woken up to clear out pending events...
>
> Well, we need to do that, but your report seems to suggest that that code path isn't getting the job done completely --- my first guess is that a new normal query has to arrive before a SIGUSR2'd backend is completely happy again. Interesting.
>
> You didn't specify: what PG version are you running, and on what platform?
>
> 			regards, tom lane

Sorry, forgot that. PG 7.1 or 7.2 (both have the problem) on linux 2.2.x (x>=17) and 2.4.x (4 <= x <= 18). I've tried a variety of intel machines (and AMD machines) and managed to reproduce the same problem.

To clarify my comments, I suspect the problem is that all 800 of the backends get the SIGUSR2 at the same time, and all wake up, causing the kernel scheduler to go mad trying to decide which one to schedule... If the connections aren't idle, the queue never fills up enough to require the signals...

	Stephen
Stephen Robert Norris <srn@commsecure.com.au> writes:
> To clarify my comments, I suspect the problem is that all 800 of the backends get the SIGUSR2 at the same time, and all wake up,

That would indeed happen ...

> causing the kernel scheduler to go mad trying to decide which one to schedule... If the connections aren't idle, the queue never fills up enough to require the signals...

... but it doesn't follow that this is causing your problem. All 800 are going to get woken up at the same time anyway (or as close to the same time as the postmaster can issue the signals, that is). Why does it make a difference whether they have or soon will process a normal query? If you are able to prevent the problem by issuing "select 1" once per second on each connection, then for sure it's not a case of whether they are busy at receipt of the signal or not. Most of 'em will not be busy at any given instant, with a load as trivial as that.

Do you use LISTEN/NOTIFY at all? When was your pg_listener table last vacuumed?

			regards, tom lane
On Tue, 2002-05-28 at 13:42, Tom Lane wrote:
> Stephen Robert Norris <srn@commsecure.com.au> writes:
> > To clarify my comments, I suspect the problem is that all 800 of the backends get the SIGUSR2 at the same time, and all wake up,
>
> That would indeed happen ...
>
> > causing the kernel scheduler to go mad trying to decide which one to schedule... If the connections aren't idle, the queue never fills up enough to require the signals...
>
> ... but it doesn't follow that this is causing your problem. All 800 are going to get woken up at the same time anyway (or as close to the same time as the postmaster can issue the signals, that is). Why does it make a difference whether they have or soon will process a normal query? If you are able to prevent the problem by issuing "select 1" once per second on each connection, then for sure it's not a case of whether they are busy at receipt of the signal or not. Most of 'em will not be busy at any given instant, with a load as trivial as that.
>
> Do you use LISTEN/NOTIFY at all? When was your pg_listener table last vacuumed?
>
> 			regards, tom lane

My reading of the code was that the signals didn't get delivered unless the queue got too full, and that entries on the queue are created by the vacuum (and other stuff) and processed when a backend does something, thus the queue never gets too full.

My test program just cycles through each connection doing the select 1 on each in series.

No LISTEN/NOTIFY at all. I've been creating entirely new db setups to test it (with initdb).

	Stephen
Stephen Robert Norris <srn@commsecure.com.au> writes:
> My reading of the code was that the signals didn't get delivered unless the queue got too full, and that entries on the queue are created by the vacuum (and other stuff) and processed when a backend does something, thus the queue never gets too full.

Good point. And certainly the async-notify code (which scans through pg_listener) is a lot more expensive than is needed just to respond to an SInval-queue-full condition; that looks to be a quick hack that was inserted without thought to performance. But I don't think we quite understand this issue yet. If your system can support 800 simultaneous useful queries then it shouldn't have a problem with 800 simultaneous scans of an empty pg_listener table.

I'll ask again: is your system sized to support 800 *simultaneous* user queries? (One query per second on 800 connections is hardly the same thing.)

			regards, tom lane
On Tue, 2002-05-28 at 13:57, Tom Lane wrote:
> Stephen Robert Norris <srn@commsecure.com.au> writes:
> > My reading of the code was that the signals didn't get delivered unless the queue got too full, and that entries on the queue are created by the vacuum (and other stuff) and processed when a backend does something, thus the queue never gets too full.
>
> Good point. And certainly the async-notify code (which scans through pg_listener) is a lot more expensive than is needed just to respond to an SInval-queue-full condition; that looks to be a quick hack that was inserted without thought to performance. But I don't think we quite understand this issue yet. If your system can support 800 simultaneous useful queries then it shouldn't have a problem with 800 simultaneous scans of an empty pg_listener table.
>
> I'll ask again: is your system sized to support 800 *simultaneous* user queries? (One query per second on 800 connections is hardly the same thing.)
>
> 			regards, tom lane

I'm not sure; it can certainly do >1k queries/second through 800 simultaneous connections (about 1/connection second), but it's hard to find enough machines to load it up that much...

One big difference, though, is that with the vacuum problem, the CPU used is almost all (99%) system time; loading up the db with lots of queries increases user time mostly, with little system time...

In any event, it seems a bug that merely having connections open causes this problem! They aren't even in transactions...

	Stephen
Stephen Robert Norris <srn@commsecure.com.au> writes:
> One big difference, though, is that with the vacuum problem, the CPU used is almost all (99%) system time; loading up the db with lots of queries increases user time mostly, with little system time...

Hmm, that's a curious point; leaves one wondering about possible kernel bugs.

> In any event, it seems a bug that merely having connections open causes this problem! They aren't even in transactions...

If the problem is that you've launched far more backends than the system can really support, I'd have no hesitation in writing it off as user error. "Idle" processes are not without cost. But at this point I can't tell whether that's the case, or whether you're looking at a genuine performance bug in either Postgres or the kernel.

Can you run strace (or truss or kernel-call-tracer-of-your-choice) on the postmaster, and also on one of the putatively idle backends, so we can see some more data about what's happening?

			regards, tom lane
On Tue, 2002-05-28 at 14:24, Tom Lane wrote:
> Stephen Robert Norris <srn@commsecure.com.au> writes:
> > One big difference, though, is that with the vacuum problem, the CPU used is almost all (99%) system time; loading up the db with lots of queries increases user time mostly, with little system time...
>
> Hmm, that's a curious point; leaves one wondering about possible kernel bugs.
>
> > In any event, it seems a bug that merely having connections open causes this problem! They aren't even in transactions...
>
> If the problem is that you've launched far more backends than the system can really support, I'd have no hesitation in writing it off as user error. "Idle" processes are not without cost. But at this point I can't tell whether that's the case, or whether you're looking at a genuine performance bug in either Postgres or the kernel.
>
> Can you run strace (or truss or kernel-call-tracer-of-your-choice) on the postmaster, and also on one of the putatively idle backends, so we can see some more data about what's happening?
>
> 			regards, tom lane

I've already strace'ed the idle backend, and I can see the SIGUSR2 being delivered just before everything goes bad.

What resource would you think idle backends might be exhausting? On the production system, the problem doesn't happen under load (of about 60-80 non-trivial queries/second) but does happen when the system is largely idle. The number of connections is exactly the same in both cases...

	Stephen
On Tue, 28 May 2002, Tom Lane wrote:
> Stephen Robert Norris <srn@commsecure.com.au> writes:
> > One big difference, though, is that with the vacuum problem, the CPU used is almost all (99%) system time; loading up the db with lots of queries increases user time mostly, with little system time...
>
> Hmm, that's a curious point; leaves one wondering about possible kernel bugs.

If it turns out to be so, this would not be the first problem I've heard about in the Linux scheduler. (It was notoriously bad for years.)

I'd suggest a good test would be to try this on a BSD machine and see if the problem exists there, too. That will at least tell you if it's Postgres or Linux.

cjs
--
Curt Sampson <cjs@cynic.net>  +81 90 7737 2974  http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
On Tue, 2002-05-28 at 15:43, Curt Sampson wrote:
> On Tue, 28 May 2002, Tom Lane wrote:
> > Stephen Robert Norris <srn@commsecure.com.au> writes:
> > > One big difference, though, is that with the vacuum problem, the CPU used is almost all (99%) system time; loading up the db with lots of queries increases user time mostly, with little system time...
> >
> > Hmm, that's a curious point; leaves one wondering about possible kernel bugs.
>
> If it turns out to be so, this would not be the first problem I've heard about in the Linux scheduler. (It was notoriously bad for years.)
>
> I'd suggest a good test would be to try this on a BSD machine and see if the problem exists there, too. That will at least tell you if it's Postgres or Linux.
>
> cjs
> --
> Curt Sampson <cjs@cynic.net>  +81 90 7737 2974  http://www.netbsd.org
>     Don't you know, in this new Dark Age, we're all light.  --XTC

Can someone with access to a BSD machine try this? It's pretty easy to set up, a simple loop to open a few hundred connections and a 1-line shell script. It doesn't seem to matter what's in the database...

I'm going to try the O(1) scheduler patch for the linux kernel and see if that helps...

	Stephen
On Tue, 2002-05-28 at 14:24, Tom Lane wrote:
> Stephen Robert Norris <srn@commsecure.com.au> writes:
> > One big difference, though, is that with the vacuum problem, the CPU used is almost all (99%) system time; loading up the db with lots of queries increases user time mostly, with little system time...
>
> Hmm, that's a curious point; leaves one wondering about possible kernel bugs.
>
> > In any event, it seems a bug that merely having connections open causes this problem! They aren't even in transactions...
>
> If the problem is that you've launched far more backends than the system can really support, I'd have no hesitation in writing it off as user error. "Idle" processes are not without cost. But at this point I can't tell whether that's the case, or whether you're looking at a genuine performance bug in either Postgres or the kernel.
>
> Can you run strace (or truss or kernel-call-tracer-of-your-choice) on the postmaster, and also on one of the putatively idle backends, so we can see some more data about what's happening?
>
> 			regards, tom lane

I wonder if it's a problem with a second SIGUSR2 arriving before the first is finished? It seems much easier to trigger the effect with more rapid vacuums than with a delay (even accounting for the reduced number of vacuums occurring).

	Stephen
> Can someone with access to a BSD machine try this? It's pretty easy to set up, a simple loop to open a few hundred connections and a 1-line shell script. It doesn't seem to matter what's in the database...
>
> I'm going to try the O(1) scheduler patch for the linux kernel and see if that helps...
>
> 	Stephen

I'm willing, send over what you were using.. Also the schema and dump of any database you were testing against (if it matters)..

I don't think I have a machine that I can get to tomorrow that can do 800 connections at the same time -- does a smaller number of connections produce the same result on your machine(s)?

-Mitch
Hi Stephen,

Are you able to give some detailed technical specs of the hardware you're running?

I consider yours to be a higher-end PostgreSQL server, and I'd like to have a good practical understanding of what is required (hardware wise) to put together a 1k/second transaction PostgreSQL server.

:-)

Regards and best wishes,

Justin Clift

<snip>
> I'm not sure; it can certainly do >1k queries/second through 800 simultaneous connections (about 1/connection second), but it's hard to find enough machines to load it up that much...
>
> One big difference, though, is that with the vacuum problem, the CPU used is almost all (99%) system time; loading up the db with lots of queries increases user time mostly, with little system time...
>
> In any event, it seems a bug that merely having connections open causes this problem! They aren't even in transactions...
>
> 	Stephen

--
"My grandfather once told me that there are two kinds of people: those who work and those who take the credit. He told me to try to be in the first group; there was less competition there."
- Indira Gandhi
Stephen Robert Norris <srn@commsecure.com.au> writes:
> I've already strace'ed the idle backend, and I can see the SIGUSR2 being delivered just before everything goes bad.

Yes, but what happens after that?

If you don't see anything obvious by examining a single process, maybe strace'ing the postmaster + all descendant processes would offer a better viewpoint.

> What resource would you think idle backends might be exhausting?

Difficult to say. I suspect your normal load doesn't have *all* the backends trying to run queries at once. But in any case the SIGUSR2 event should only produce a momentary spike in load, AFAICS.

			regards, tom lane
On Tue, 2002-05-28 at 22:26, Justin Clift wrote:
> Hi Stephen,
>
> Are you able to give some detailed technical specs of the hardware you're running?
>
> I consider yours to be a higher-end PostgreSQL server, and I'd like to have a good practical understanding of what is required (hardware wise) to put together a 1k/second transaction PostgreSQL server.
>
> :-)
>
> Regards and best wishes,
>
> Justin Clift

It's nothing special - a dual Athlon (1600MHz or whatever), 1GB RAM, 4 UDMA IDE HDD. The key trick is that we only have about 200MB of data to serve, and thus never actually hit the disks...

On similar hardware, but with 3GB RAM and 160GB of data, we go back to a more usual 30-50 random queries/second.

For background, the first system is part of a stock exchange information system; the second is part of a bill presentation system.

	Stephen
I spent some time this morning trying to reproduce your problem, with not much luck. I used the attached test program, in case anyone else wants to try --- it fires up the specified number of connections (issuing a trivial query on each one, just so that the backend is not completely virgin) and goes to sleep. I ran that in one window and did manual "vacuum full"s in psql in another window. I was doing the vacuums in the regression database which has about 150 tables, so there was an SI overrun event (resulting in SIGUSR2) every third or so vacuum.

Using stock Red Hat Linux 7.2 (kernel 2.4.7-10) on a machine with 256M of RAM, I was able to run up to about 400 backends without seeing much of any performance problem. (I had the postmaster running with
	postmaster -i -F -N 1000 -B 2000
and defaults in postgresql.conf.) Each SI overrun fired up all the idle backends, but they went back to sleep after a couple of kernel calls and not much computation.

Above 500 backends the thing went into swap hell --- it took minutes of realtime to finish out the SI overrun cycle, even though the CPU was idle (waiting for swap-in) most of the time. I could not see any mode where system CPU percentage was significant. I did notice the user CPU spiking up at times (this seems to be due to the deadlock detection algorithm, which gets invoked once a process has waited "too long" for a lock).

I was doing this with current development sources, but I have no reason to think that PG 7.2 would act differently.

It's fairly clear from looking at this example that we're handling SI overrun notification in a very inefficient manner; probably it should have its own signal so that receiving backends can just read the SI queue and not bother with scanning pg_listener. But there's something else going on in your system that I don't understand. You're not swapping and you are seeing 99% system CPU --- what's causing that? And why don't I see it?

When there are not enough backends to cause swapping, strace'ing a randomly-chosen backend produces results like this for each SIGUSR2 cycle:

11:40:24.824914 recv(8, 0x83ff600, 8192, 0) = ? ERESTARTSYS (To be restarted)
11:40:32.549941 --- SIGUSR2 (User defined signal 2) ---
11:40:32.550243 gettimeofday({1022686832, 550363}, NULL) = 0
11:40:32.550592 semop(5898249, 0xbfffeba4, 1) = 0
11:40:32.639593 lseek(4, 0, SEEK_END) = 0
11:40:32.639729 sigreturn() = ? (mask now [])
11:40:32.639845 recv(8,

On the other hand, with just a few too many backends I get:

11:42:46.913127 recv(8, 0x83ff600, 8192, 0) = ? ERESTARTSYS (To be restarted)
11:43:56.115481 --- SIGUSR2 (User defined signal 2) ---
11:43:56.577389 semop(5767173, 0xbfffebe4, 1) = 0
11:44:00.441212 semop(7438392, 0xbfffebe4, 1) = 0
11:44:00.441306 semop(5767173, 0xbfffeb34, 1) = 0
11:44:03.929909 semop(6455322, 0xbfffeb34, 1) = 0
11:44:03.930026 gettimeofday({1022687043, 930059}, NULL) = 0
11:44:03.931215 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}}, {it_interval={0, 0}, it_value={0, 0}}) = 0
11:44:03.931321 semop(5767173, 0xbfffeac4, 1) = -1 EINTR (Interrupted system call)
11:44:04.923771 --- SIGALRM (Alarm clock) ---
11:44:04.923969 semop(5767173, 0xbfffe734, 1) = 0
11:44:05.296190 semop(7602237, 0xbfffe734, 1) = 0
11:44:05.296368 sigreturn() = ? (mask now [USR2])
11:44:05.296500 semop(5767173, 0xbfffeac4, 1) = 0
11:44:09.216795 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, {it_interval={0, 0}, it_value={0, 0}}) = 0
11:44:09.217014 lseek(4, 0, SEEK_END) = 0
11:44:09.286696 semop(6127632, 0xbfffeb54, 1) = 0
11:44:09.555040 sigreturn() = ? (mask now [])
11:44:09.555251 recv(8,

(The number of semops can vary drastically depending on timing --- each one indicates blocking on an internal lock. BTW, does anyone know how to get Linux strace to show the contents of the second arg to semop?)

What does your strace look like?

			regards, tom lane

/*
 * startlots.c
 *
 * Build with libpq (eg, gcc startlots.c -o startlots -lpq).
 *
 * Usage: startlots N to start N backends. Control-C to exit.
 */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#include "libpq-fe.h"

static void
exit_nicely(PGconn *conn)
{
	PQfinish(conn);
	exit(1);
}

static void
start1conn()
{
	char	   *pghost,
			   *pgport,
			   *pgoptions,
			   *pgtty;
	char	   *dbName;
	PGconn	   *conn;
	PGresult   *res;

	/*
	 * begin, by setting the parameters for a backend connection if the
	 * parameters are null, then the system will try to use reasonable
	 * defaults by looking up environment variables or, failing that,
	 * using hardwired constants
	 */
	pghost = NULL;				/* host name of the backend server */
	pgport = NULL;				/* port of the backend server */
	pgoptions = NULL;			/* special options to start up the backend
								 * server */
	pgtty = NULL;				/* debugging tty for the backend server */
	dbName = "template1";

	/* make a connection to the database */
	conn = PQsetdb(pghost, pgport, pgoptions, pgtty, dbName);

	/* check to see that the backend connection was successfully made */
	if (PQstatus(conn) == CONNECTION_BAD)
	{
		fprintf(stderr, "Connection to database '%s' failed.\n", dbName);
		fprintf(stderr, "%s", PQerrorMessage(conn));
		exit_nicely(conn);
	}

	/* run a transaction */
	res = PQexec(conn, "select version()");
	if (PQresultStatus(res) != PGRES_TUPLES_OK)
	{
		fprintf(stderr, "select version() command didn't return tuples properly\n");
		PQclear(res);
		exit_nicely(conn);
	}
	PQclear(res);

	/* leave the connection open */
}

int
main(int argc, char **argv)
{
	int			nconn = atoi(argv[1]);

	while (nconn-- > 0)
		start1conn();

	sleep(100000);

	return 0;
}
On Thu, 2002-05-30 at 01:52, Tom Lane wrote:
> I spent some time this morning trying to reproduce your problem, with not much luck. I used the attached test program, in case anyone else wants to try --- it fires up the specified number of connections (issuing a trivial query on each one, just so that the backend is not completely virgin) and goes to sleep. I ran that in one window and did manual "vacuum full"s in psql in another window. I was doing the vacuums in the regression database which has about 150 tables, so there was an SI overrun event (resulting in SIGUSR2) every third or so vacuum.
>
> Using stock Red Hat Linux 7.2 (kernel 2.4.7-10) on a machine with 256M of RAM, I was able to run up to about 400 backends without seeing much of any performance problem. (I had the postmaster running with postmaster -i -F -N 1000 -B 2000 and defaults in postgresql.conf.) Each SI overrun fired up all the idle backends, but they went back to sleep after a couple of kernel calls and not much computation.

Similar setup here, but 1GB RAM. If this problem is some sort of O(n^2) thing, it could well be the case that it only happens on (for example) > 600 backends, and is fine at 400...

I also wonder if SMP has any impact - if there's lots of semops going on, and the memory is being thrashed between CPU caches, that won't be nice...

> Above 500 backends the thing went into swap hell --- it took minutes of realtime to finish out the SI overrun cycle, even though the CPU was idle (waiting for swap-in) most of the time.

I never swap.

Some more data from this end - I have only managed to reproduce the problem once in about 2 hours with those lines removed that you asked me to remove yesterday. With the lines still in, the problem happens after a minute or two pretty much every time. I still see the high numbers of processes in the run queue, and the load rises, but neither postgres nor the machine stalls.

> What does your strace look like?
>
> 			regards, tom lane

In "normal" SI overruns, about the same:

--- SIGUSR2 (User defined signal 2) ---
gettimeofday({1022719053, 355014}, NULL) = 0
close(7) = 0
close(6) = 0
close(4) = 0
close(3) = 0
close(9) = 0
semop(6258745, 0xbfffeb04, 1) = 0
semop(6193207, 0xbfffeb04, 1) = 0
open("/var/lib/pgsql/data/base/504592641/1259", O_RDWR) = 3
open("/var/lib/pgsql/data/base/504592641/16429", O_RDWR) = 4
semop(6258745, 0xbfffe8e4, 1) = 0
semop(6225976, 0xbfffe8e4, 1) = 0
open("/var/lib/pgsql/data/base/504592641/1249", O_RDWR) = 6
open("/var/lib/pgsql/data/base/504592641/16427", O_RDWR) = 7
open("/var/lib/pgsql/data/base/504592641/16414", O_RDWR) = 9
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}}, {it_interval={0, 0}, it_value={0, 0}}) = 0
semop(6258745, 0xbfffea24, 1) = 0
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, {it_interval={0, 0}, it_value={0, 870000}}) = 0
lseek(9, 0, SEEK_END) = 0
semop(4456450, 0xbfffeac4, 1) = 0
sigreturn() = ? (mask now [])
recv(8, 0x839a0a0, 8192, 0) = ? ERESTARTSYS (To be restarted)

Although, I see anything up to 9 or even 15 semop() calls and file close/open pairs.

When it went mad, this happened:

--- SIGUSR2 (User defined signal 2) ---
gettimeofday({1022720979, 494838}, NULL) = 0
semop(10551353, 0xbfffeb04, 1) = 0
close(7) = 0
close(6) = 0
close(4) = 0
close(3) = 0
select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout)
close(9) = 0
open("/var/lib/pgsql/data/base/504592641/1259", O_RDWR) = 3
select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout)
open("/var/lib/pgsql/data/base/504592641/16429", O_RDWR) = 4
open("/var/lib/pgsql/data/base/504592641/1249", O_RDWR) = 6
open("/var/lib/pgsql/data/base/504592641/16427", O_RDWR) = 7
open("/var/lib/pgsql/data/base/504592641/16414", O_RDWR) = 9
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}}, {it_interval={0, 0}, it_value={0, 0}}) = 0
semop(10551353, 0xbfffea24, 1) = -1 EINTR (Interrupted system call)
--- SIGALRM (Alarm clock) ---
semop(10551353, 0xbfffe694, 1) = 0
semop(8716289, 0xbfffe694, 1) = 0
sigreturn() = ? (mask now [USR2])

However, the strace stopped just before the ) on the first semop, which I think means it hadn't completed. The whole thing (postgres, vmstat and all) stopped for about 10 seconds, then it went on.

This was only a short version of the problem (it can lock up for 20-30 seconds), but I think it's the same thing.

	Stephen
On Tue, 2002-05-28 at 23:29, Tom Lane wrote:
> Stephen Robert Norris <srn@commsecure.com.au> writes:
> > I've already strace'ed the idle backend, and I can see the SIGUSR2 being delivered just before everything goes bad.
>
> Yes, but what happens after that?

The strace stops until I manually kill the connecting process - the machine stops in general until then (vmstat 1 stops producing output, shells stop responding ...). So who knows what happens :(

> If you don't see anything obvious by examining a single process, maybe strace'ing the postmaster + all descendant processes would offer a better viewpoint.
>
> > What resource would you think idle backends might be exhausting?
>
> Difficult to say. I suspect your normal load doesn't have *all* the backends trying to run queries at once. But in any case the SIGUSR2 event should only produce a momentary spike in load, AFAICS.
>
> 			regards, tom lane

I agree (about not having 800 simultaneous queries). Sometimes, the SIGUSR2 does just create a very brief load spike (vmstat shows >500 processes on the run queue, but the next second everything is back to normal and no unusual amount of CPU is consumed). This sort of rules out (to me) a kernel problem, unless it's something triggered at a specific number of processes (like 700 is bad, 699 is OK).

	Stephen
Stephen Robert Norris <srn@commsecure.com.au> writes:
> I've already strace'ed the idle backend, and I can see the SIGUSR2 being delivered just before everything goes bad.
>> Yes, but what happens after that?
> The strace stops until I manually kill the connecting process - the machine stops in general until then (vmstat 1 stops producing output, shells stop responding ...). So who knows what happens :(

Hmm, I hadn't quite understood that you were complaining of a system-wide lockup and not just Postgres getting wedged. I think the chances are very good that this *is* a kernel bug. In any case, no self-respecting kernel hacker would be happy with the notion that a completely unprivileged user program can lock up the whole machine. So even if Postgres has got a problem, the kernel is clearly failing to defend itself adequately.

Are you able to reproduce the problem with fewer than 800 backends? How about if you try it on a smaller machine?

Another thing that would be entertaining to try is other ways of releasing 800 queries at once. For example, on connection 1 do
	BEGIN; LOCK TABLE foo;
then issue a "SELECT COUNT(*) FROM foo" on each other connection, and finally COMMIT on connection 1. If that creates similar misbehavior then I think the SI-overrun mechanism is probably not to be blamed.

> ... Sometimes, the SIGUSR2 does just create a very brief load spike (vmstat shows >500 processes on the run queue, but the next second everything is back to normal and no unusual amount of CPU is consumed).

That's the behavior I'd expect. We need to figure out what's different between that case and the cases where it locks up.

			regards, tom lane
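A rough libpq sketch of that lock-release experiment, assuming the connections are already open; "foo", the control connection, and the function names are placeholders. PQsendQuery is used so the driving program does not itself block while the SELECTs queue up behind the lock.

/* lockstorm.c -- release nconn queries simultaneously by dropping a table lock.
 * conns[] holds nconn open connections; ctl is one extra control connection.
 * "foo" is a placeholder table that must already exist. */
#include <stdio.h>
#include "libpq-fe.h"

static void
run(PGconn *conn, const char *sql)
{
	PGresult   *res = PQexec(conn, sql);

	if (PQresultStatus(res) != PGRES_COMMAND_OK &&
		PQresultStatus(res) != PGRES_TUPLES_OK)
		fprintf(stderr, "%s: %s", sql, PQerrorMessage(conn));
	PQclear(res);
}

void
lock_storm(PGconn *ctl, PGconn **conns, int nconn)
{
	int			i;

	run(ctl, "BEGIN");
	run(ctl, "LOCK TABLE foo");

	/* dispatch the SELECTs asynchronously; each backend blocks on the lock */
	for (i = 0; i < nconn; i++)
		PQsendQuery(conns[i], "SELECT count(*) FROM foo");

	/* releasing the lock wakes all nconn backends at (nearly) the same instant */
	run(ctl, "COMMIT");

	/* collect and discard the results */
	for (i = 0; i < nconn; i++)
	{
		PGresult   *res;

		while ((res = PQgetResult(conns[i])) != NULL)
			PQclear(res);
	}
}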
On Wed, 2002-05-29 at 09:08, Tom Lane wrote:
> Stephen Robert Norris <srn@commsecure.com.au> writes:
> > I've already strace'ed the idle backend, and I can see the SIGUSR2 being delivered just before everything goes bad.
> >> Yes, but what happens after that?
> > The strace stops until I manually kill the connecting process - the machine stops in general until then (vmstat 1 stops producing output, shells stop responding ...). So who knows what happens :(
>
> Hmm, I hadn't quite understood that you were complaining of a system-wide lockup and not just Postgres getting wedged. I think the chances are very good that this *is* a kernel bug. In any case, no self-respecting kernel hacker would be happy with the notion that a completely unprivileged user program can lock up the whole machine. So even if Postgres has got a problem, the kernel is clearly failing to defend itself adequately.
>
> Are you able to reproduce the problem with fewer than 800 backends? How about if you try it on a smaller machine?

Yep, on a PIII-800 with 256MB I can do it with fewer backends (I forget how many) and only a few vacuums. It's much easier, basically, but there's much less CPU on that machine. It also locks the machine up for several minutes...

> Another thing that would be entertaining to try is other ways of releasing 800 queries at once. For example, on connection 1 do BEGIN; LOCK TABLE foo; then issue a "SELECT COUNT(*) FROM foo" on each other connection, and finally COMMIT on connection 1. If that creates similar misbehavior then I think the SI-overrun mechanism is probably not to be blamed.
>
> > ... Sometimes, the SIGUSR2 does just create a very brief load spike (vmstat shows >500 processes on the run queue, but the next second everything is back to normal and no unusual amount of CPU is consumed).
>
> That's the behavior I'd expect. We need to figure out what's different between that case and the cases where it locks up.
>
> 			regards, tom lane

Yeah. I'll try your suggestion above and report back.

	Stephen
On Wed, 2002-05-29 at 13:55, Tom Lane wrote:
> Stephen Robert Norris <srn@commsecure.com.au> writes:
> > Not at all like the vacuum problem. Do you have any other tests I can run? Other diagnostics that might help?
>
> This is a long shot, but ... does the behavior change if you remove the two lines
>
> 	if (signo != SIGALRM)
> 		act.sa_flags |= SA_RESTART;
>
> near the bottom of src/backend/libpq/pqsignal.c ? I am wondering if Linux gets unhappy if we try to do much in a signal handler. This looks to be the simplest alteration that might address such a problem.
>
> 			regards, tom lane

Maybe it does help. I've only seen the problem once now (just after startup) rather than every few minutes...

I'm still seeing lots of processes in the ready queue every now and then, but they are doing more sensible things.

I'll investigate some more and be more certain.

	Stephen
Stephen Robert Norris <srn@commsecure.com.au> writes:
> Not at all like the vacuum problem. Do you have any other tests I can run? Other diagnostics that might help?

This is a long shot, but ... does the behavior change if you remove the two lines

	if (signo != SIGALRM)
		act.sa_flags |= SA_RESTART;

near the bottom of src/backend/libpq/pqsignal.c ? I am wondering if Linux gets unhappy if we try to do much in a signal handler. This looks to be the simplest alteration that might address such a problem.

			regards, tom lane
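For context, pqsignal() is a thin wrapper around sigaction(). The sketch below is a reconstruction for illustration, not the exact 7.2 source; it shows where the two lines in question sit. With SA_RESTART set, system calls interrupted by the caught signal are restarted transparently after the handler returns; with the lines removed, they fail with EINTR instead.

/* Sketch of a pqsignal()-style wrapper around sigaction().
 * Not the actual PostgreSQL 7.2 source, just an illustration of where
 * the two SA_RESTART lines sit and what they change. */
#include <signal.h>

typedef void (*pqsigfunc) (int);

pqsigfunc
pqsignal_sketch(int signo, pqsigfunc func)
{
	struct sigaction act,
				oact;

	act.sa_handler = func;
	sigemptyset(&act.sa_mask);
	act.sa_flags = 0;
	if (signo != SIGALRM)			/* <-- the two lines Tom suggests removing: */
		act.sa_flags |= SA_RESTART;	/* restart syscalls interrupted by this signal */
	if (sigaction(signo, &act, &oact) < 0)
		return SIG_ERR;
	return oact.sa_handler;
}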