Thread: How to simulate crashes of PostgreSQL?

How to simulate crashes of PostgreSQL?

From
Sergey Samokhin
Date:
Hello!

To make my client application tolerant of PostgreSQL failures I first
need to be able to simulate them in a safe manner (hard reset isn't a
solution I'm looking for :)

Is there a way to disconnect all the clients as if the server has
crashed? It should look like a real crash from the client's point of
view.

Is kill what everyone uses for this purpose?

Thanks.

--
Sergey Samokhin

Re: How to simulate crashes of PostgreSQL?

From
Ray Stell
Date:
On Sat, Aug 22, 2009 at 01:03:43PM -0700, Sergey Samokhin wrote:
> Is there a way to disconnect all the clients as if the server has
> crashed? It should look like a real crash from the client's point of
> view.

ifconfig ethx down ?

Re: How to simulate crashes of PostgreSQL?

From
"Greg Sabino Mullane"
Date:
>> Is there a way to disconnect all the clients as if the server has
>> crashed? It should look like a real crash from the client's point of
>> view.

> ifconfig ethx down ?

Or even:

iptables -I INPUT -p tcp --dport 5432 -j DROP

Keep in mind that both of those are simulating network failures, not
a "server crash". But network failures are something your application
should handle gracefully too. :) To make something look like a real
crash, you should do a real crash. In this case, kill -9 the backend(s).
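
A minimal sketch of the `kill -9` approach, written in Python so it can run in a test harness. It only shows the signal delivery: the victim here is a sleeping stand-in child process, not a real PostgreSQL backend, and the helper names `kill_dash_nine` and `spawn_stand_in` are made up for this illustration.

```python
import os
import signal
import subprocess
import sys


def kill_dash_nine(pid: int) -> None:
    """Deliver SIGKILL, the signal that `kill -9 <pid>` sends."""
    os.kill(pid, signal.SIGKILL)


def spawn_stand_in() -> subprocess.Popen:
    """Start a sleeping child process that plays the role of a backend."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
```

Against a real server the PID would be that of a backend process, and the postmaster would then terminate the remaining backends and run crash recovery, so this belongs on a test instance only.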

A server crash is a pretty rare event in the Postgres world, so I
would not spend too many cycles on this...

--
Greg Sabino Mullane greg@turnstep.com
End Point Corporation
PGP Key: 0x14964AC8 200908221849
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8

Re: How to simulate crashes of PostgreSQL?

From
Scott Marlowe
Date:
On Sat, Aug 22, 2009 at 4:55 PM, Greg Sabino Mullane<greg@turnstep.com> wrote:
> A server crash is a pretty rare event in the Postgres world, so I
> would not spend too many cycles on this...

I've been running pg in production since 7.0 came out.  zero server crashes.

Re: How to simulate crashes of PostgreSQL?

From
Craig Ringer
Date:
On Sat, 2009-08-22 at 13:03 -0700, Sergey Samokhin wrote:
> Hello!
>
> To make my client application tolerant of PostgreSQL failures I first
> need to be able to simulate them in a safe manner (hard reset isn't a
> solution I'm looking for :)
>
> Is there a way to disconnect all the clients as if the server has
> crashed? It should look like a real crash from the client's point of
> view.

If you mean a PostgreSQL server crash: write a C extension function that
de-references a null pointer or calls abort() . Instant crash on demand.
`kill -9' on a backend should have much the same effect, though, and is
easier - it's just not something a client can trigger through an SQL
query.

Remember to keep backups - Pg's designed to be fault tolerant, but it's
still good to be careful just in case.


If, however, you mean a crash of the server machine PostgreSQL is
running on, which is MUCH more likely and will have different
effects/behaviour, then Ray Stell's advice to bring the interface down
is probably pretty good. The machine should stop responding to ARP
requests or to any packets directed to its MAC address and will stop
sending packets, so it'll look to the client like it's a hard server
crash.

You should also test your client's response to the Pg server remaining
up but becoming non-responsive (eg: failed disk array causes Pg backends
to remain in uninterruptible disk I/O system calls in the kernel). A
possibly good way to do this is to SIGSTOP the backend(s).
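
A small sketch of the SIGSTOP idea, assuming a POSIX system. The context manager name `frozen` is hypothetical; it suspends one process for the duration of a test block and resumes it afterwards with SIGCONT, so the "non-responsive but still up" state is reversible.

```python
import contextlib
import os
import signal


@contextlib.contextmanager
def frozen(pid: int):
    """Suspend a process with SIGSTOP, then resume it with SIGCONT on exit."""
    os.kill(pid, signal.SIGSTOP)
    try:
        yield
    finally:
        # Always resume, even if the test body raised.
        os.kill(pid, signal.SIGCONT)
```

Inside `with frozen(backend_pid): ...` the kernel still acknowledges TCP data on the process's behalf, but the process itself does nothing with it, which is the situation the client's timeouts need to handle.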

--
Craig Ringer


Re: How to simulate crashes of PostgreSQL?

From
Scott Marlowe
Date:
On Mon, Aug 24, 2009 at 12:49 AM, Craig
Ringer<craig@postnewspapers.com.au> wrote:
> You should also test your client's response to the Pg server remaining
> up but becoming non-responsive (eg: failed disk array causes Pg backends
> to remain in uninterruptible disk I/O system calls in the kernel). A
> possibly good way to do this is to SIGSTOP the backend(s).

This is a far more common and likely problem than the server crash
scenario.  I've had servers go unresponsive under load before.  Took
the load away and they came back, but the way the app responded has
not always been optimal.  Many apps get jammed up from something like
this and require the app servers to be restarted.

Re: How to simulate crashes of PostgreSQL?

From
David Fetter
Date:
On Mon, Aug 24, 2009 at 12:10:30AM -0600, Scott Marlowe wrote:
> On Sat, Aug 22, 2009 at 4:55 PM, Greg Sabino Mullane<greg@turnstep.com> wrote:
> > A server crash is a pretty rare event in the Postgres world, so I
> > would not spend too many cycles on this...
>
> I've been running pg in production since 7.0 came out.  zero server
> crashes.

In my experience, OS crashes are much more common than PostgreSQL
crashes.

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Re: How to simulate crashes of PostgreSQL?

From
Scott Marlowe
Date:
On Mon, Aug 24, 2009 at 12:41 PM, David Fetter<david@fetter.org> wrote:
> On Mon, Aug 24, 2009 at 12:10:30AM -0600, Scott Marlowe wrote:
>> On Sat, Aug 22, 2009 at 4:55 PM, Greg Sabino Mullane<greg@turnstep.com> wrote:
>> > A server crash is a pretty rare event in the Postgres world, so I
>> > would not spend too many cycles on this...
>>
>> I've been running pg in production since 7.0 came out.  zero server
>> crashes.
>
> In my experience, OS crashes are much more common than PostgreSQL
> crashes.

Also, admin mistakes are more common than pgsql crashes.  I've done
things like type "sudo reboot" into my workstation only to realize
seconds later that I'm logged into a production server (long time ago,
but still).

Re: How to simulate crashes of PostgreSQL?

From
Sergey Samokhin
Date:
Hello!

> If, however, you mean a crash of the server machine PostgreSQL is
> running on, which is MUCH more likely and will have different
> effects/behaviour, then Ray Stell's advice to bring the interface down
> is probably pretty good.

Sorry for the somewhat ambiguous use of the terms "crash" and "fault". By
those words I meant a crash of the server machine PostgreSQL is running
on, not of PostgreSQL itself. Network outages between the client and
PostgreSQL are also something I would like to be able to simulate.

Though I don't think there are any differences between the crash of
PostgreSQL itself and the crash of the machine PostgreSQL is running on
from the client's point of view.

Yet another way I've found to simulate this terrible behaviour is to
stop PostgreSQL with the "pg_ctl stop -m immediate" command.

Thanks to all who have answered in this thread! It was very helpful to read!

--
Sergey Samokhin

Re: How to simulate crashes of PostgreSQL?

From
Sergey Samokhin
Date:
Hello!

> You should also test your client's response to the Pg server remaining
> up but becoming non-responsive (eg: failed disk array causes Pg backends
> to remain in uninterruptible disk I/O system calls in the kernel). A
> possibly good way to do this is to SIGSTOP the backend(s).

I hadn't thought about that yet. That is possibly where I should use
timeouts on the operations involving calls to PostgreSQL.
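
As a rough illustration of the timeout idea at the socket level (an assumption-laden sketch, not the PostgreSQL wire protocol and not any driver's API), the hypothetical helper below puts a deadline on connect, send and receive, so a silent peer produces an exception instead of an indefinite block.

```python
import socket


def query_with_deadline(host: str, port: int, payload: bytes, timeout_s: float) -> bytes:
    """Send payload and wait at most timeout_s seconds per socket operation."""
    # The timeout given to create_connection also stays set on the socket,
    # so it covers sendall() and recv() as well as the connect itself.
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        sock.sendall(payload)
        return sock.recv(4096)  # raises socket.timeout if the peer stays silent
```

A real client would normally rely on its database driver's own timeout settings; the point here is only that every blocking call gets an upper bound.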

--
Sergey Samokhin

Re: How to simulate crashes of PostgreSQL?

From
Craig Ringer
Date:
On Tue, 2009-08-25 at 00:26 +0400, Sergey Samokhin wrote:
> Hello!
>
> > If, however, you mean a crash of the server machine PostgreSQL is
> > running on, which is MUCH more likely and will have different
> > effects/behaviour, then Ray Stell's advice to bring the interface down
> > is probably pretty good.
>
> Sorry for the somewhat ambiguous use of the terms "crash" and "fault". By
> those words I meant a crash of the server machine PostgreSQL is running
> on, not of PostgreSQL itself. Network outages between the client and
> PostgreSQL are also something I would like to be able to simulate.

Get a cheap PC with two Ethernet cards running Linux, and put it between
your Pg server and the rest of the network - or between your client and
the rest of the network.

Set it up to route packets between the two interfaces using iptables.
You can now easily introduce rules to do things like drop random
packets, drop packets of particular sizes, drop a regular percentage of
packets, etc.


You can also introduce latency using iproute2's `tc' .

http://lartc.org/

example:

http://www.kdedevelopers.org/node/1878

showing the use of the "delay" option of the network emulation (netem)
qdisc.
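
For reference, a hedged sketch of what driving netem from a test script might look like. The helper names are hypothetical, the interface name is whatever the routing box uses, and the functions only build the `tc` argument lists; actually running them needs root on the Linux machine in the middle.

```python
from typing import List


def netem_delay_cmd(dev: str, delay_ms: int) -> List[str]:
    """Build the tc command that adds a fixed delay to packets leaving dev."""
    return ["tc", "qdisc", "add", "dev", dev, "root", "netem", "delay", f"{delay_ms}ms"]


def netem_clear_cmd(dev: str) -> List[str]:
    """Build the tc command that removes the root qdisc again."""
    return ["tc", "qdisc", "del", "dev", dev, "root"]
```

A test could pass these lists to `subprocess.run(..., check=True)` before and after exercising the client, for example `netem_delay_cmd("eth0", 100)` for 100 ms of added latency.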

Alternatively: ebtables lets you do some network issue simulation on a
Linux machine that's bridging between two interfaces instead of routing
between them, so you can make your router transparent to the network.

Unless you've worked a bit with iptables before or at least done a lot
of general networking work you'll need to do a bit of learning to get
much of this up and running smoothly. It's not a trivial drop-in. I'm
not going to give detailed instructions and support, as I just don't
have the time to go into it at present - sorry.

--
Craig Ringer


Re: How to simulate crashes of PostgreSQL?

From
Craig Ringer
Date:
On Tue, 2009-08-25 at 00:26 +0400, Sergey Samokhin wrote:
> Hello!
>
> > If, however, you mean a crash of the server machine PostgreSQL is
> > running on, which is MUCH more likely and will have different
> > effects/behaviour, then Ray Stell's advice to bring the interface down
> > is probably pretty good.
>
> Sorry for the somewhat ambiguous use of the terms "crash" and "fault". By
> those words I meant a crash of the server machine PostgreSQL is running
> on, not of PostgreSQL itself. Network outages between the client and
> PostgreSQL are also something I would like to be able to simulate.

This is the reference I should've given:

http://www.linuxfoundation.org/en/Net:Netem

--
Craig Ringer


Re: How to simulate crashes of PostgreSQL?

From
Craig Ringer
Date:
On Tue, 2009-08-25 at 00:26 +0400, Sergey Samokhin wrote:

> Though I don't think there are any differences between the crash of
> PostgreSQL itself and the crash of the machine PostgreSQL is running on
> from the client's point of view.

There certainly are!

For one thing, if a client with an established connection sends a packet
to a machine where PostgreSQL has crashed (the backend process has
exited on a signal) it'll receive a TCP RST indicating that the
connection has been broken. The OS will also generally FIN to the client
when the backend crashes to inform it that the connection is closing, so
you'll often find out as soon as the backend dies or at least as soon as
you next try to use the connection. If the issue was just with that
backend, your client can just reconnect, retry its most recent work, and
keep on going.

Similarly, a new client trying to connect to a machine where the
postmaster has crashed will receive a TCP RST packet indicating that the
connection attempt was actively refused. It'll know immediately that
something's not right and will get a useful error from the TCP stack.

If, on the other hand, the server has crashed, clients may not receive
any response at all to packets. The server may even stop responding to
ARP requests, in which case the nearest router to it will - eventually,
maybe - send your client an ICMP destination-unreachable . There will be
long delays either way before the TCP/IP stack decides the connection
has died. Your client will probably block on recv(...) / read(...) for
an extended period.

If a backend is still running but in a nonresponsive state, the TCP/IP
stack on the server will still ACK packets you send to the backend (at
least until the buffers fill up), but the backend won't be doing
anything with the data. The local TCP stack won't see anything wrong
because, at the TCP level, there isn't - something that can't happen in
a server crash.

So, yes, there's a pretty big difference between a crash of PostgreSQL
and a server crash. Behaviour is different from the client perspective
and you need to consider that. Intermediate network issues are different
again, as you might encounter huge latency (possibly randomly only on
some packets), random packet loss, etc. This will cause weird pauses and
delays in communication that your client must cope with.
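
The distinctions above can be made visible with plain sockets. This is a generic TCP sketch under stated assumptions, not PostgreSQL protocol code: `observe` is a hypothetical helper that sends one request to a stand-in server and names what the client saw.

```python
import socket


def observe(host: str, port: int, request: bytes, timeout_s: float = 1.0) -> str:
    """Send one request over TCP and name the failure mode the client sees."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout_s)
    except ConnectionRefusedError:
        return "refused"      # RST in reply to SYN: no listener on that port
    except socket.timeout:
        return "no-response"  # SYN never answered: host down or packets dropped
    with sock:
        try:
            sock.sendall(request)
            reply = sock.recv(4096)
        except socket.timeout:
            return "hung"     # connection is up but the peer stays silent
        except (ConnectionResetError, BrokenPipeError):
            return "reset"    # the connection was torn down under us
        return "closed" if reply == b"" else "replied"
```

Only the "refused", "reset" and "closed" outcomes arrive promptly; "no-response" and "hung" are discovered solely because a timeout was set, which is why the machine-crash and frozen-backend cases need separate testing.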


This, by the way, is one of the reasons you *really* should do all your
database work in a separate worker thread on GUI clients. The GUI must
remain responsive even when you're waiting for a response that'll never
come, or being held up by multi-second network latencies.
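
One way the worker-thread advice might look, as a minimal sketch: the hypothetical `run_in_worker` helper runs any blocking callable (standing in for a database call) on a background thread and posts the outcome to a queue that the GUI's event loop can poll without blocking.

```python
import queue
import threading


def run_in_worker(task, results: "queue.Queue") -> threading.Thread:
    """Run a blocking callable off the main thread; post ('ok'|'error', value)."""
    def body():
        try:
            results.put(("ok", task()))
        except Exception as exc:  # report failures instead of losing them
            results.put(("error", exc))

    worker = threading.Thread(target=body, daemon=True)
    worker.start()
    return worker
```

The GUI side would call `results.get_nowait()` from a timer callback and ignore `queue.Empty`, so the interface keeps repainting while the query is outstanding.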

--
Craig Ringer


Re: How to simulate crashes of PostgreSQL?

From
Vick Khera
Date:
On Sat, Aug 22, 2009 at 6:55 PM, Greg Sabino Mullane<greg@turnstep.com> wrote:
> A server crash is a pretty rare event in the Postgres world, so I
> would not spend too many cycles on this...
>

I had one the other day caused by server resource issues: I ran out of
file descriptors when I had a very large surge in activity.  Pg
rightfully panicked and disconnected all my clients.

Only the well written ones recovered automagically. I had to restart a
handful of services :-(  It is wise to put effort into testing that your
client recovery strategy actually works.
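
A recovery strategy of the "well written" kind could be as simple as retrying the connection with a growing pause. The sketch below is an assumption, not anyone's production code: `connect_with_retry` and its parameters are hypothetical, `connect` is any callable that opens a connection, and it catches `OSError` where a real database driver would typically raise its own exception type.

```python
import time


def connect_with_retry(connect, attempts: int = 5, base_delay_s: float = 0.5,
                       sleep=time.sleep):
    """Call connect() until it succeeds, doubling the pause after each failure."""
    delay = base_delay_s
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except OSError:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the last error
            sleep(delay)
            delay *= 2
```

Passing `sleep` in as a parameter keeps the function testable without real waiting; a crash-simulation test can then kill the server and check that the client comes back on its own.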

I must say that I haven't had a Postgres crash due to a Postgres bug
since version 7.2 or so.

Re: How to simulate crashes of PostgreSQL?

From
Merlin Moncure
Date:
On Mon, Aug 24, 2009 at 2:10 AM, Scott Marlowe<scott.marlowe@gmail.com> wrote:
> On Sat, Aug 22, 2009 at 4:55 PM, Greg Sabino Mullane<greg@turnstep.com> wrote:
>> A server crash is a pretty rare event in the Postgres world, so I
>> would not spend too many cycles on this...
>
> I've been running pg in production since 7.0 came out.  zero server crashes.

I've found a few...I discovered the aggregate problem in 8.4.  I also
co-discovered the prepared query/alter table bug that can trivially crash
any pg server up to 8.2.

merlin

Re: How to simulate crashes of PostgreSQL?

From
Alvaro Herrera
Date:
Vick Khera wrote:
> On Sat, Aug 22, 2009 at 6:55 PM, Greg Sabino Mullane<greg@turnstep.com> wrote:
> > A server crash is a pretty rare event in the Postgres world, so I
> > would not spend too many cycles on this...
>
> I had one the other day caused by server resource issues: I ran out of
> file descriptors when I had a very large surge in activity.  Pg
> rightfully panicked and disconnected all my clients.

PG is not supposed to crash when it runs out of file descriptors.  In
fact there's a whole abstraction layer to ensure this does not happen.
What you saw was either misconfiguration or a bug somewhere (for example
maybe you have untrusted functions that try to open files?)

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: How to simulate crashes of PostgreSQL?

From
Vick Khera
Date:
On Tue, Aug 25, 2009 at 1:09 PM, Alvaro
Herrera<alvherre@commandprompt.com> wrote:
> Vick Khera wrote:
>> On Sat, Aug 22, 2009 at 6:55 PM, Greg Sabino Mullane<greg@turnstep.com> wrote:
>> > A server crash is a pretty rare event in the Postgres world, so I
>> > would not spend too many cycles on this...
>>
>> I had one the other day caused by server resource issues: I ran out of
>> file descriptors when I had a very large surge in activity.  Pg
>> rightfully panicked and disconnected all my clients.
>
> PG is not supposed to crash when it runs out of file descriptors.  In
> fact there's a whole abstraction layer to ensure this does not happen.
> What you saw was either misconfiguration or a bug somewhere (for example
> maybe you have untrusted functions that try to open files?)

From my syslog:

Aug 21 15:11:13 d01 postgres[12037]: [156-1] PANIC:  could not open
file "pg_xlog/00000001000013E300000014" (log file 5091, segment 20):
Too many open files in system

Then all other processes did this:

Aug 21 15:11:15 d01 postgres[38452]: [71-1] WARNING:  terminating
connection because of crash of another server process

Then recovery began.  Luckily it only took 3 minutes because I limit
the number of log segments when in production mode.

Seems to me to be a part of the core server that caused the panic, not
any external functions (only external modules I use are pl/pgsql and
slony1).

Re: How to simulate crashes of PostgreSQL?

From
Tom Lane
Date:
Vick Khera <vivek@khera.org> writes:
> On Tue, Aug 25, 2009 at 1:09 PM, Alvaro
> Herrera<alvherre@commandprompt.com> wrote:
>> PG is not supposed to crash when it runs out of file descriptors.  In
>> fact there's a whole abstraction layer to ensure this does not happen.

> From my syslog:
> Aug 21 15:11:13 d01 postgres[12037]: [156-1] PANIC:  could not open
> file "pg_xlog/00000001000013E300000014" (log file 5091, segment 20):
> Too many open files in system

This is probably coming from walwriter, which might not have very much
of a cushion of "extra" open files to close.

            regards, tom lane

Re: How to simulate crashes of PostgreSQL?

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Vick Khera <vivek@khera.org> writes:
> > On Tue, Aug 25, 2009 at 1:09 PM, Alvaro
> > Herrera<alvherre@commandprompt.com> wrote:
> >> PG is not supposed to crash when it runs out of file descriptors.  In
> >> fact there's a whole abstraction layer to ensure this does not happen.
>
> > From my syslog:
> > Aug 21 15:11:13 d01 postgres[12037]: [156-1] PANIC:  could not open
> > file "pg_xlog/00000001000013E300000014" (log file 5091, segment 20):
> > Too many open files in system
>
> This is probably coming from walwriter, which might not have very much
> of a cushion of "extra" open files to close.

Note that this is ENFILE, not EMFILE; so if the load is high, it's
possible that the released file descriptor is immediately taken by
another process before BasicFileOpen is able to grab it (assuming
there's any open file to close).
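
To make the ENFILE/EMFILE distinction concrete, here is a small illustrative helper (the name `fd_exhaustion_scope` is made up) that tells the two errno values apart: EMFILE means the calling process hit its own descriptor limit, while ENFILE means the system-wide file table is full.

```python
import errno


def fd_exhaustion_scope(exc: OSError) -> str:
    """Say whether an open() failure was a per-process or system-wide limit."""
    if exc.errno == errno.EMFILE:
        return "process"  # this process ran out of descriptors
    if exc.errno == errno.ENFILE:
        return "system"   # the kernel-wide file table is full
    return "other"
```

Closing one of your own files reliably helps with the "process" case; in the "system" case any other process can claim the freed slot first, which is the race described above.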

Vivek, do you see this error message before the PANIC?
LOG:    out of file descriptors: %m; release and retry

Would it be worth having walwriter grab a dozen dummy fds?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: How to simulate crashes of PostgreSQL?

From
Vick Khera
Date:
On Tue, Aug 25, 2009 at 2:49 PM, Alvaro
Herrera<alvherre@commandprompt.com> wrote:
> Vivek, do you see this error message before the PANIC?
> LOG:    out of file descriptors: %m; release and retry
>

Nope.  no mention of "release" in that log file.  I have a handful of
lines like these:

ERROR:  could not load library
"/usr/local/lib/postgresql/slony1_funcs.so": dlopen
(/usr/local/lib/postgresql/slony1_funcs.so) failed:

ERROR:  could not load library "/usr/local/lib/postgresql/plpgsql.so":
dlopen (/usr/local/lib/postgresql/plpgsql.so) failed:

Re: How to simulate crashes of PostgreSQL?

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Would it be worth having walwriter grab a dozen dummy fds?

I don't think so.  As you point out, we could never positively guarantee
no ENFILE failures anyway.  If we were in an out-of-FDs situation, any
such cushion would get whittled down to nothing pretty quickly, too.

I've always thought that the fd.c layer is more about not having to
configure the code explicitly for max-files-per-process limits.  Once
you get into ENFILE conditions, even if Postgres manages to stay up,
everything else on the box is going to start falling over.  So the
sysadmin is likely to have to resort to a reboot anyway.

(Hm, I wonder if that sort of thing explains the complaints we
occasionally get about systems becoming completely nonresponsive under
load?  I'll bet you can't ssh into a machine that's up against the
ENFILE limit, for instance.)

            regards, tom lane

Re: How to simulate crashes of PostgreSQL?

From
Vick Khera
Date:
On Tue, Aug 25, 2009 at 4:55 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
> I've always thought that the fd.c layer is more about not having to
> configure the code explicitly for max-files-per-process limits.  Once
> you get into ENFILE conditions, even if Postgres manages to stay up,
> everything else on the box is going to start falling over.  So the
> sysadmin is likely to have to resort to a reboot anyway.

In my case, all sorts of processes were complaining about being unable
to open files.  Once Pg panicked and closed all its files, everything
came back to normal.  I didn't have to reboot because most everything
was written to retry and/or restart itself, and nothing critical like
sshd croaked.

I think we'll be adding a nagios check to track maxfiles vs. openfiles
from the kernel and alarm when they get close.
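
Such a check might look roughly like the sketch below. It assumes a Linux host, where `/proc/sys/fs/file-nr` holds three numbers (allocated handles, unused handles, system maximum); other kernels expose the same counters differently. The thresholds and function names are illustrative, and the return values follow the usual Nagios plugin exit-code convention.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def open_file_usage(file_nr_text: str) -> float:
    """Return the in-use fraction from the contents of /proc/sys/fs/file-nr."""
    allocated, unused, maximum = (int(field) for field in file_nr_text.split())
    return (allocated - unused) / maximum


def check_open_files(file_nr_text: str, warn: float = 0.80, crit: float = 0.90) -> int:
    """Map the usage fraction onto OK / WARNING / CRITICAL."""
    usage = open_file_usage(file_nr_text)
    if usage >= crit:
        return CRITICAL
    if usage >= warn:
        return WARNING
    return OK
```

A plugin wrapper would read the file, call `check_open_files` on its contents, print a one-line status and exit with the returned code.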

Re: How to simulate crashes of PostgreSQL?

From
Alvaro Herrera
Date:
Vick Khera wrote:
> On Tue, Aug 25, 2009 at 4:55 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
> > I've always thought that the fd.c layer is more about not having to
> > configure the code explicitly for max-files-per-process limits.  Once
> > you get into ENFILE conditions, even if Postgres manages to stay up,
> > everything else on the box is going to start falling over.  So the
> > sysadmin is likely to have to resort to a reboot anyway.
>
> In my case, all sorts of processes were complaining about being unable
> to open files.  Once Pg panicked and closed all its files, everything
> came back to normal.  I didn't have to reboot because most everything
> was written to retry and/or restart itself, and nothing critical like
> sshd croaked.

Hmm.  How many DB connections were there at the time?  Are they normally
long-lived?

I'm wondering if the problem could be caused by too many backends
holding the maximum of open files each.  In my system,
/proc/sys/fs/file-max says ~200k, and per-process limit is 1024, so it
would take about 200 backends with all FDs in use to bring the system to
a near collapse that won't be solved until Postgres is restarted.  This
doesn't sound so far-fetched if the connections are long lived, perhaps
from a pooler.
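
The arithmetic behind "about 200 backends" can be spelled out. This is just the division from the paragraph above, with a hypothetical function name and the ~200k figure rounded to 200000 for the example.

```python
def processes_to_fill(system_limit: int, per_process_limit: int) -> int:
    """Fewest processes that fill the system table if each holds its full quota."""
    return -(-system_limit // per_process_limit)  # ceiling division on integers
```

With a system limit of 200000 and a per-process limit of 1024 this gives 196, so a couple of hundred long-lived connections are enough in the worst case.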

Maybe we should have another inter-backend signal: when a process gets
ENFILE, signal all other backends and they close a bunch of files each.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: How to simulate crashes of PostgreSQL?

From
Greg Stark
Date:
On Fri, Aug 28, 2009 at 4:13 AM, Alvaro
Herrera<alvherre@commandprompt.com> wrote:
> Maybe we should have another inter-backend signal: when a process gets
> ENFILE, signal all other backends and they close a bunch of files each.

I wonder if this is a new problem due to the FSM and VM using up extra
file handles?


--
greg
http://mit.edu/~gsstark/resume.pdf

Re: How to simulate crashes of PostgreSQL?

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Maybe we should have another inter-backend signal: when a process gets
> ENFILE, signal all other backends and they close a bunch of files each.

I was wondering about that myself, but on balance I think it'd be a lot
of work to achieve not much.  What you would have is that Postgres would
ramp its FD usage up to hit the kernel limit, things outside the
database would fail for some period of time, then a backend would get
ENFILE and we'd cut down our FD usage.  Lather, rinse, repeat, ad
infinitum.  You'd have intermittent hard-to-reproduce failures of every
other service on the box; and you'd *still* be at risk of the DB
crashing, if walwriter or another low-cushion process hit the problem
first.

The only really reliable setup is to have max_connections times
max_files_per_process less than the kernel limit.  If we do something to
mask the problem when it happens, I don't think we're doing the DBA a
service in the long run.

Thought: it's probably possible to find out the kernel limit on many
platforms.  Maybe postmaster startup should try to get that limit, and
print an annoying warning if it's less than max_connections times
max_files_per_process plus some safety factor?
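
A sketch of what such a startup check might compute, written in Python purely for illustration (the server itself is C, and nothing here is taken from its source). The function name and the `safety_margin` parameter are assumptions; `max_connections` and `max_files_per_process` are the configuration settings named above.

```python
from typing import Optional


def fd_headroom_warning(kernel_limit: int, max_connections: int,
                        max_files_per_process: int,
                        safety_margin: int = 0) -> Optional[str]:
    """Return a warning if the worst-case FD demand exceeds the kernel limit."""
    worst_case = max_connections * max_files_per_process + safety_margin
    if kernel_limit < worst_case:
        return (f"WARNING: kernel file limit {kernel_limit} is below "
                f"worst-case demand {worst_case}")
    return None
```

A DBA can run the same comparison by hand: if the product of the two settings plus some margin exceeds the kernel's file limit, either the limit goes up or one of the settings comes down.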

            regards, tom lane