Thread: How to simulate crashes of PostgreSQL?
Hello! To make my client application tolerant of PostgreSQL failures I first need to be able to simulate them in a safe manner (hard reset isn't a solution I'm looking for :) Is there a way to disconnect all the clients as if the server has crashed? It should look like a real crash from the client's point of view. Is kill what everyone uses for this purpose? Thanks. -- Sergey Samokhin
On Sat, Aug 22, 2009 at 01:03:43PM -0700, Sergey Samokhin wrote: > Is there a way to disconnect all the clients as if the server has > crashed? It should look like a real crash from the client's point of > view. ifconfig ethx down ?
>> Is there a way to disconnect all the clients as if the server has >> crashed? It should look like a real crash from the client's point of >> view. > ifconfig ethx down ? Or even: iptables -I INPUT -p tcp --dport 5432 -j DROP Keep in mind that both of those are simulating network failures, not a "server crash". But network failures are something your application should handle gracefully too. :) To make something look like a real crash, you should do a real crash. In this case, kill -9 the backend(s). A server crash is a pretty rare event in the Postgres world, so I would not spend too many cycles on this... -- Greg Sabino Mullane greg@turnstep.com End Point Corporation PGP Key: 0x14964AC8 http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
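A minimal sketch of the two approaches described above, assuming the default port 5432 and root access on the database host; <backend_pid> is a placeholder for a backend process ID you look up yourself:

    # Simulate a network failure: silently drop all incoming client traffic
    iptables -I INPUT -p tcp --dport 5432 -j DROP
    # ...and remove the rule again once the test is done
    iptables -D INPUT -p tcp --dport 5432 -j DROP

    # Simulate a crash of one backend: find its PID (ps, or pg_stat_activity
    # from the SQL side) and kill it hard
    kill -9 <backend_pid>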
On Sat, Aug 22, 2009 at 4:55 PM, Greg Sabino Mullane<greg@turnstep.com> wrote: > A server crash is a pretty rare event in the Postgres world, so I > would not spend too many cycles on this... I've been running pg in production since 7.0 came out. zero server crashes.
On Sat, 2009-08-22 at 13:03 -0700, Sergey Samokhin wrote: > Hello! > > To make my client application tolerant of PostgreSQL failures I first > need to be able to simulate them in a safe manner (hard reset isn't a > solution I'm looking for :) > > Is there a way to disconnect all the clients as if the server has > crashed? It should look like a real crash from the client's point of > view. If you mean a PostgreSQL server crash: write a C extension function that de-references a null pointer or calls abort(). Instant crash on demand. `kill -9' on a backend should have much the same effect, though, and is easier - it's just not something a client can trigger through an SQL query. Remember to keep backups - Pg's designed to be fault tolerant, but it's still good to be careful just in case. If, however, you mean a crash of the server machine PostgreSQL is running on, which is MUCH more likely and will have different effects/behaviour, then Ray Stell's advice to bring the interface down is probably pretty good. The machine should stop responding to ARP requests or to any packets directed to its MAC address and will stop sending packets, so it'll look to the client like a hard server crash. You should also test your client's response to the Pg server remaining up but becoming non-responsive (eg: failed disk array causes Pg backends to remain in uninterruptible disk I/O system calls in the kernel). A possibly good way to do this is to SIGSTOP the backend(s). -- Craig Ringer
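A bare-bones sketch of the signal-based tests described above; <backend_pid> is a placeholder for a backend process ID you find yourself (via ps, or pg_stat_activity from another session):

    # Instant backend crash: the postmaster notices, kills the other
    # backends and runs crash recovery
    kill -9 <backend_pid>

    # Non-responsive but still "up": freeze a backend, then resume it later
    kill -STOP <backend_pid>
    kill -CONT <backend_pid>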
On Mon, Aug 24, 2009 at 12:49 AM, Craig Ringer<craig@postnewspapers.com.au> wrote: > You should also test your client's response to the Pg server remaining > up but becoming non-responsive (eg: failed disk array causes Pg backends > to remain in uninterruptible disk I/O system calls in the kernel). A > possibly good way to do this is to SIGSTOP the backend(s). This is a far more common and likely problem than the server crash scenario. I've had servers go unresponsive under load before. Took the load away and they came back, but the way the app responded has not always been optimal. Many apps get jammed up from something like this and require the app servers to be restarted.
On Mon, Aug 24, 2009 at 12:10:30AM -0600, Scott Marlowe wrote: > On Sat, Aug 22, 2009 at 4:55 PM, Greg Sabino Mullane<greg@turnstep.com> wrote: > > A server crash is a pretty rare event in the Postgres world, so I > > would not spend too many cycles on this... > > I've been running pg in production since 7.0 came out. zero server > crashes. In my experience, OS crashes are much more common than PostgreSQL crashes. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Mon, Aug 24, 2009 at 12:41 PM, David Fetter<david@fetter.org> wrote: > On Mon, Aug 24, 2009 at 12:10:30AM -0600, Scott Marlowe wrote: >> On Sat, Aug 22, 2009 at 4:55 PM, Greg Sabino Mullane<greg@turnstep.com> wrote: >> > A server crash is a pretty rare event in the Postgres world, so I >> > would not spend too many cycles on this... >> >> I've been running pg in production since 7.0 came out. zero server >> crashes. > > In my experience, OS crashes are much more common than PostgreSQL > crashes. Also, admin mistakes are more common than pgsql crashes. I've done things like type "sudo reboot" into my workstation only to realize seconds later that I'm logged into a production server (long time ago, but still).
Hello! > If, however, you mean a crash of the server machine PostgreSQL is > running on, which is MUCH more likely and will have different > effects/behaviour, then Ray Stell's advice to bring the interface down > is probably pretty good. Sorry for a bit ambiguous usage of both "crash" and "fault" terms. By those words I meant crash of the server machine PostgreSQL is running on, not the PostgreSQL itself. Network outages between client and PostgreSQL are also kind of something I would like to simulate in any way. Though I don't think there are any differences between the crash of PostgreSQL itself and the crash of the machine PostgreSQL is running on from the client's point of view. Yet another way to simulate this terrible behaviour I've found is to stop PostgreSQL with the "pg_ctl stop -m immediate" command. Thanks to all who have answered in this topic! It was very helpful to read it! -- Sergey Samokhin
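For anyone trying the same thing, a sketch of the immediate-shutdown test; the data directory path here is an assumption:

    # Abort all server processes without a clean shutdown -- to clients it
    # looks much like a crash, and recovery runs on the next start
    pg_ctl stop -D /path/to/data -m immediate

    # Bring the server back up afterwards
    pg_ctl start -D /path/to/data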
Hello! > You should also test your client's response to the Pg server remaining > up but becoming non-responsive (eg: failed disk array causes Pg backends > to remain in uninterruptible disk I/O system calls in the kernel). A > possibly good way to do this is to SIGSTOP the backend(s). I haven't thought about that yet. This is probably a place where I should use timeouts on the operations that involve calls to PostgreSQL. -- Sergey Samokhin
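If the client goes through libpq, timeouts can be sketched roughly like this (host and database names are placeholders, and parameter availability depends on the client library and libpq version, so treat this as an assumption to verify):

    # Give up on unresponsive connection attempts after 5 seconds
    export PGCONNECT_TIMEOUT=5

    # TCP keepalives help detect a dead or unreachable host on an established
    # connection; they will NOT catch a backend that is alive but stopped, so
    # an application-level timeout around each query is still needed for that
    psql "host=dbhost dbname=mydb connect_timeout=5 keepalives=1 keepalives_idle=30 keepalives_interval=10 keepalives_count=3" -c "SELECT 1"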
On Tue, 2009-08-25 at 00:26 +0400, Sergey Samokhin wrote: > Hello! > > > If, however, you mean a crash of the server machine PostgreSQL is > > running on, which is MUCH more likely and will have different > > effects/behaviour, then Ray Stell's advice to bring the interface down > > is probably pretty good. > > Sorry for a bit ambiguous usage of both "crash" and "fault" terms. By > those words I meant crash of the server machine PostgreSQL is running > on, not the PostgreSQL itself. Network outages between client and > PostgreSQL are also kind of something I would like to simulate in any > way. Get a cheap PC with two Ethernet cards running Linux, and put it between your Pg server and the rest of the network - or between your client and the rest of the network. Set it up to route packets between the two interfaces and filter them using iptables. You can now easily introduce rules to do things like drop random packets, drop packets of particular sizes, drop a regular percentage of packets, etc. You can also introduce latency using iproute2's `tc'. See http://lartc.org/ and, for an example, http://www.kdedevelopers.org/node/1878 showing the use of the "delay" option of the network emulation (netem) qdisc. Alternatively: ebtables lets you do some network issue simulation on a Linux machine that's bridging between two interfaces instead of routing between them, so you can make your router transparent to the network. Unless you've worked a bit with iptables before or at least done a lot of general networking work you'll need to do a bit of learning to get much of this up and running smoothly. It's not a trivial drop-in. I'm not going to give detailed instructions and support, as I just don't have the time to go into it at present - sorry. -- Craig Ringer
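To make the idea concrete, rules of this sort could look roughly like the following; the interface name, port, delays and percentages are made-up values for illustration, and the commands run on the middle box:

    # Add 200ms +/- 50ms of latency to everything leaving eth1
    tc qdisc add dev eth1 root netem delay 200ms 50ms
    # Also drop 5% of packets
    tc qdisc change dev eth1 root netem delay 200ms 50ms loss 5%
    # Remove the emulation again
    tc qdisc del dev eth1 root netem

    # Or drop a random 10% of forwarded Postgres packets with iptables
    iptables -I FORWARD -p tcp --dport 5432 -m statistic --mode random --probability 0.1 -j DROP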
On Tue, 2009-08-25 at 00:26 +0400, Sergey Samokhin wrote: > Hello! > > > If, however, you mean a crash of the server machine PostgreSQL is > > running on, which is MUCH more likely and will have different > > effects/behaviour, then Ray Stell's advice to bring the interface down > > is probably pretty good. > > Sorry for a bit ambiguous usage of both "crash" and "fault" terms. By > those words I meant crash of the server machine PostgreSQL is running > on, not the PostgreSQL itself. Network outages between client and > PostgreSQL are also kind of something I would like to simulate in any > way. This is the reference I should've given: http://www.linuxfoundation.org/en/Net:Netem -- Craig Ringer
On Tue, 2009-08-25 at 00:26 +0400, Sergey Samokhin wrote: > Though I don't think there are any differences between the crash of > PostgreSQL itself and the crash of the machine PostgreSQL is running on > from the client's point of view. There certainly are! For one thing, if a client with an established connection sends a packet to a machine where PostgreSQL has crashed (the backend process has exited on a signal) it'll receive a TCP RST indicating that the connection has been broken. The OS will also generally send a FIN to the client when the backend crashes to inform it that the connection is closing, so you'll often find out as soon as the backend dies or at least as soon as you next try to use the connection. If the issue was just with that backend, your client can just reconnect, retry its most recent work, and keep on going. Similarly, a new client trying to connect to a machine where the postmaster has crashed will receive a TCP RST packet indicating that the connection attempt was actively refused. It'll know immediately that something's not right and will get a useful error from the TCP stack. If, on the other hand, the server has crashed, clients may not receive any response at all to packets. The server may even stop responding to ARP requests, in which case the nearest router to it will - eventually, maybe - send your client an ICMP destination-unreachable message. There will be long delays either way before the TCP/IP stack decides the connection has died. Your client will probably block on recv(...) / read(...) for an extended period. If a backend is still running but in a nonresponsive state, the TCP/IP stack on the server will still ACK packets you send to the backend (at least until the buffers fill up), but the backend won't be doing anything with the data. The client's TCP stack won't see anything wrong because, at the TCP level, nothing is wrong - something that can't happen when the whole server machine has crashed. So, yes, there's a pretty big difference between a crash of PostgreSQL and a server crash. Behaviour is different from the client perspective and you need to consider that. Intermediate network issues are different again, as you might encounter huge latency (possibly randomly only on some packets), random packet loss, etc. This will cause weird pauses and delays in communication that your client must cope with. This, by the way, is one of the reasons you *really* should do all your database work in a separate worker thread on GUI clients. The GUI must remain responsive even when you're waiting for a response that'll never come, or being held up by multi-second network latencies. -- Craig Ringer
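One way to see the difference from the client side; the host and port are placeholders, and the iptables rule goes on the server, not the client:

    # Postmaster/backend dead but machine up: the kernel answers with RST,
    # so the client fails almost immediately with a connection-refused error
    psql -h dbhost -p 5432 -c 'SELECT 1'

    # Whole machine "gone" (simulated here by silently dropping packets on
    # the server): the same attempt simply hangs until TCP gives up
    iptables -I INPUT -p tcp --dport 5432 -j DROP
    time psql -h dbhost -p 5432 -c 'SELECT 1'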
On Sat, Aug 22, 2009 at 6:55 PM, Greg Sabino Mullane<greg@turnstep.com> wrote: > A server crash is a pretty rare event in the Postgres world, so I > would not spend too many cycles on this... > I had one the other day caused by server resource issues: I ran out of file descriptors when I had a very large surge in activity. Pg rightfully panicked and disconnected all my clients. Only the well-written ones recovered automagically. I had to restart a handful of services :-( It is wise to put in the effort to test that your client recovery strategy does work. I must say that I haven't had a Postgres crash due to a Postgres bug since version 7.2 or so.
On Mon, Aug 24, 2009 at 2:10 AM, Scott Marlowe<scott.marlowe@gmail.com> wrote: > On Sat, Aug 22, 2009 at 4:55 PM, Greg Sabino Mullane<greg@turnstep.com> wrote: >> A server crash is a pretty rare event in the Postgres world, so I >> would not spend too many cycles on this... > > I've been running pg in production since 7.0 came out. zero server crashes. I've found a few... I discovered the aggregate problem in 8.4. I also co-discovered the prepared query/ALTER TABLE bug that can trivially crash any pg server up to 8.2. merlin
Vick Khera wrote: > On Sat, Aug 22, 2009 at 6:55 PM, Greg Sabino Mullane<greg@turnstep.com> wrote: > > A server crash is a pretty rare event in the Postgres world, so I > > would not spend too many cycles on this... > > I had one the other day caused by server resource issues: I ran out of > file descriptors when I had a very large surge in activity. Pg > rightfully panicked and disconnected all my clients. PG is not supposed to crash when it runs out of file descriptors. In fact there's a whole abstraction layer to ensure this does not happen. What you saw was either misconfiguration or a bug somewhere (for example maybe you have untrusted functions that try to open files?) -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Tue, Aug 25, 2009 at 1:09 PM, Alvaro Herrera<alvherre@commandprompt.com> wrote: > Vick Khera wrote: >> On Sat, Aug 22, 2009 at 6:55 PM, Greg Sabino Mullane<greg@turnstep.com> wrote: >> > A server crash is a pretty rare event in the Postgres world, so I >> > would not spend too many cycles on this... >> >> I had one the other day caused by server resource issues: I ran out of >> file descriptors when I had a very large surge in activity. Pg >> rightfully panicked and disconnected all my clients. > > PG is not supposed to crash when it runs out of file descriptors. In > fact there's a whole abstraction layer to ensure this does not happen. > What you saw was either misconfiguration or a bug somewhere (for example > maybe you have untrusted functions that try to open files?) From my syslog: Aug 21 15:11:13 d01 postgres[12037]: [156-1] PANIC: could not open file "pg_xlog/00000001000013E300000014" (log file 5091, segment 20): Too many open files in system Then all other processes did this: Aug 21 15:11:15 d01 postgres[38452]: [71-1] WARNING: terminating connection because of crash of another server process Then recovery began. Luckily it only took 3 minutes because I limit the number of log segments when in production mode. Seems to me to be a part of the core server that caused the panic, not any external functions (only external modules I use are pl/pgsql and slony1).
Vick Khera <vivek@khera.org> writes: > On Tue, Aug 25, 2009 at 1:09 PM, Alvaro > Herrera<alvherre@commandprompt.com> wrote: >> PG is not supposed to crash when it runs out of file descriptors. In >> fact there's a whole abstraction layer to ensure this does not happen. > From my syslog: > Aug 21 15:11:13 d01 postgres[12037]: [156-1] PANIC: could not open > file "pg_xlog/00000001000013E300000014" (log file 5091, segment 20): > Too many open files in system This is probably coming from walwriter, which might not have very much of a cushion of "extra" open files to close. regards, tom lane
Tom Lane wrote: > Vick Khera <vivek@khera.org> writes: > > On Tue, Aug 25, 2009 at 1:09 PM, Alvaro > > Herrera<alvherre@commandprompt.com> wrote: > >> PG is not supposed to crash when it runs out of file descriptors. In > >> fact there's a whole abstraction layer to ensure this does not happen. > > > From my syslog: > > Aug 21 15:11:13 d01 postgres[12037]: [156-1] PANIC: could not open > > file "pg_xlog/00000001000013E300000014" (log file 5091, segment 20): > > Too many open files in system > > This is probably coming from walwriter, which might not have very much > of a cushion of "extra" open files to close. Note that this is ENFILE, not EMFILE; so if the load is high, it's possible that the released file descriptor is immediately taken by another process before BasicOpenFile is able to grab it (assuming there's any open file to close). Vivek, do you see this error message before the PANIC? LOG: out of file descriptors: %m; release and retry Would it be worth for walwriter to grab a dozen of dummy fd's? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, Aug 25, 2009 at 2:49 PM, Alvaro Herrera<alvherre@commandprompt.com> wrote: > Vivek, do you see this error message before the PANIC? > LOG: out of file descriptors: %m; release and retry > Nope, no mention of "release" in that log file. I have a handful of lines like these: ERROR: could not load library "/usr/local/lib/postgresql/slony1_funcs.so": dlopen (/usr/local/lib/postgresql/slony1_funcs.so) failed: ERROR: could not load library "/usr/local/lib/postgresql/plpgsql.so": dlopen (/usr/local/lib/postgresql/plpgsql.so) failed:
Alvaro Herrera <alvherre@commandprompt.com> writes: > Would it be worth for walwriter to grab a dozen of dummy fd's? I don't think so. As you point out, we could never positively guarantee no ENFILE failures anyway. If we were in an out-of-FDs situation, any such cushion would get whittled down to nothing pretty quickly, too. I've always thought that the fd.c layer is more about not having to configure the code explicitly for max-files-per-process limits. Once you get into ENFILE conditions, even if Postgres manages to stay up, everything else on the box is going to start falling over. So the sysadmin is likely to have to resort to a reboot anyway. (Hm, I wonder if that sort of thing explains the complaints we occasionally get about systems becoming completely nonresponsive under load? I'll bet you can't ssh into a machine that's up against the ENFILE limit, for instance.) regards, tom lane
On Tue, Aug 25, 2009 at 4:55 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > I've always thought that the fd.c layer is more about not having to > configure the code explicitly for max-files-per-process limits. Once > you get into ENFILE conditions, even if Postgres manages to stay up, > everything else on the box is going to start falling over. So the > sysadmin is likely to have to resort to a reboot anyway. In my case, all sorts of processes were complaining about being unable to open files. Once Pg panicked and closed all its files, everything came back to normal. I didn't have to reboot because most everything was written to retry and/or restart itself, and nothing critical like sshd croaked. I think we'll be adding a nagios check to track maxfiles vs. openfiles from the kernel and alarm when they get close.
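Such a check could look roughly like the following; it is Linux-specific and the thresholds are made up (FreeBSD exposes the same numbers via the kern.openfiles and kern.maxfiles sysctls instead):

    #!/bin/sh
    # Alarm when system-wide open file handles approach the kernel limit
    read open_fds free_fds max_fds < /proc/sys/fs/file-nr
    pct=$(( open_fds * 100 / max_fds ))
    if [ "$pct" -ge 90 ]; then
        echo "CRITICAL: ${open_fds}/${max_fds} file handles in use (${pct}%)"; exit 2
    elif [ "$pct" -ge 80 ]; then
        echo "WARNING: ${open_fds}/${max_fds} file handles in use (${pct}%)"; exit 1
    fi
    echo "OK: ${open_fds}/${max_fds} file handles in use (${pct}%)"; exit 0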
Vick Khera wrote: > On Tue, Aug 25, 2009 at 4:55 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > > I've always thought that the fd.c layer is more about not having to > > configure the code explicitly for max-files-per-process limits. Once > > you get into ENFILE conditions, even if Postgres manages to stay up, > > everything else on the box is going to start falling over. So the > > sysadmin is likely to have to resort to a reboot anyway. > > In my case, all sorts of processes were complaining about being unable > to open files. Once Pg panicked and closed all its files, everything > came back to normal. I didn't have to reboot because most everything > was written to retry and/or restart itself, and nothing critical like > sshd croaked. Hmm. How many DB connections were there at the time? Are they normally long-lived? I'm wondering if the problem could be caused by too many backends holding the maximum of open files each. In my system, /proc/sys/fs/file-max says ~200k, and per-process limit is 1024, so it would take about 200 backends with all FDs in use to bring the system to a near collapse that won't be solved until Postgres is restarted. This doesn't sound so far-fetched if the connections are long lived, perhaps from a pooler. Maybe we should have another inter-backend signal: when a process gets ENFILE, signal all other backends and they close a bunch of files each. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Fri, Aug 28, 2009 at 4:13 AM, Alvaro Herrera<alvherre@commandprompt.com> wrote: > Maybe we should have another inter-backend signal: when a process gets > ENFILE, signal all other backends and they close a bunch of files each. I wonder if this is a new problem due to the FSM and VM using up extra file handles? -- greg http://mit.edu/~gsstark/resume.pdf
Alvaro Herrera <alvherre@commandprompt.com> writes: > Maybe we should have another inter-backend signal: when a process gets > ENFILE, signal all other backends and they close a bunch of files each. I was wondering about that myself, but on balance I think it'd be a lot of work to achieve not much. What you would have is that Postgres would ramp its FD usage up to hit the kernel limit, things outside the database would fail for some period of time, then a backend would get ENFILE and we'd cut down our FD usage. Lather, rinse, repeat, ad infinitum. You'd have intermittent hard-to-reproduce failures of every other service on the box; and you'd *still* be at risk of the DB crashing, if walwriter or another low-cushion process hit the problem first. The only really reliable setup is to have max_connections times max_files_per_process less than the kernel limit. If we do something to mask the problem when it happens, I don't think we're doing the DBA a service in the long run. Thought: it's probably possible to find out the kernel limit on many platforms. Maybe postmaster startup should try to get that limit, and print an annoying warning if it's less than max_connections times max_files_per_process plus some safety factor? regards, tom lane
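A rough by-hand illustration of that sanity check (not what the postmaster does; the connection defaults and the Linux-specific /proc path are assumptions):

    # Worst-case FD demand from Postgres vs. the system-wide kernel limit
    need=$(psql -At -c "SELECT current_setting('max_connections')::int * current_setting('max_files_per_process')::int")
    have=$(cat /proc/sys/fs/file-max)
    echo "Postgres may want up to $need FDs; the kernel allows $have in total"
    [ "$need" -lt "$have" ] || echo "WARNING: max_connections * max_files_per_process exceeds the kernel limit"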