Thread: select on 22 GB table causes "An I/O error occured while sending to the backend." exception

Hi,

We're currently having a problem with queries on a medium sized table. This table is 22 GB in size (via select
pg_size_pretty(pg_relation_size('table'));). It has 7 indexes, which bring the total size of the table to 35 GB
(measured with pg_total_relation_size).
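
For reference, the sizes above come from queries along these lines ('table' standing in for the real table name):

    select pg_size_pretty(pg_relation_size('table'));        -- heap only, ~22 GB
    select pg_size_pretty(pg_total_relation_size('table'));  -- heap plus indexes (and TOAST), ~35 GB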

On this table we're inserting records with a relatively low frequency of +- 6~10 per second. We're using PG 8.3.1 on a
machine with two dual core 2.4Ghz XEON CPUs, 16 GB of memory and Debian Linux. The machine is completely devoted to PG,
nothing else runs on the box.

Lately we're getting a lot of exceptions from the Java process that does these inserts: "An I/O error occured while
sending to the backend." No other information is provided with this exception (besides the stack trace of course). The
pattern is that for about a minute, almost every insert to this 22 GB table results in this exception. After this minute
everything is suddenly fine and PG happily accepts all inserts again. We tried to nail the problem down, and it seems
that every time this happens, a select query on this same table is in progress. This select query starts right before
the insert problems begin and most often right after this select query finishes executing, inserts are fine again.
Sometimes though inserts only fail in the middle of the execution of this select query. E.g. if the select query starts
at 12:00 and ends at 12:03, inserts fail from 12:01 to 12:02.

We have spent a lot of hours in getting to the bottom of this, but our ideas for fixing this problem are more or less
exhausted at the moment.

I wonder if anyone recognizes this problem and could give some pointers to stuff that we could investigate next.

Thanks a lot in advance.


On Tue, 2008-08-26 at 18:44 +0200, henk de wit wrote:
> Hi,
>
> We're currently having a problem with queries on a medium sized table. This table is 22 GB in size (via select
> pg_size_pretty(pg_relation_size('table'));). It has 7 indexes, which bring the total size of the table to 35 GB
> (measured with pg_total_relation_size).
>
> On this table we're inserting records with a relatively low frequency of +- 6~10 per second. We're using PG 8.3.1 on
> a machine with two dual core 2.4Ghz XEON CPUs, 16 GB of memory and Debian Linux. The machine is completely devoted to
> PG, nothing else runs on the box.
>
> Lately we're getting a lot of exceptions from the Java process that does these inserts: "An I/O error occured while
> sending to the backend." No other information is provided with this exception (besides the stack trace of course). The
> pattern is that for about a minute, almost every insert to this 22 GB table results in this exception. After this minute
> everything is suddenly fine and PG happily accepts all inserts again. We tried to nail the problem down, and it seems
> that every time this happens, a select query on this same table is in progress. This select query starts right before
> the insert problems begin and most often right after this select query finishes executing, inserts are fine again.
> Sometimes though inserts only fail in the middle of the execution of this select query. E.g. if the select query starts
> at 12:00 and ends at 12:03, inserts fail from 12:01 to 12:02.
>
> We have spent a lot of hours in getting to the bottom of this, but our ideas for fixing this problem are more or less
> exhausted at the moment.
>
> I wonder if anyone recognizes this problem and could give some pointers to stuff that we could investigate next.
>
> Thanks a lot in advance.

If the select returns a lot of data and you haven't enabled cursors (by
calling setFetchSize), then the entire SQL response will be loaded in
memory at once, so there could be an out-of-memory condition on the
client.
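
For what it's worth, a rough sketch of what that looks like with the PostgreSQL JDBC driver (connection details and
table name are placeholders; the driver only uses a cursor when autocommit is off and a fetch size is set):

    import java.sql.*;

    public class CursorFetch {
        public static void main(String[] args) throws SQLException {
            Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://dbhost/mydb", "user", "password");
            conn.setAutoCommit(false);     // cursor-based fetching needs autocommit off
            Statement st = conn.createStatement();
            st.setFetchSize(1000);         // pull 1000 rows at a time from the backend
            ResultSet rs = st.executeQuery("SELECT * FROM big_table");
            while (rs.next()) {
                // handle one row at a time; the full result never sits in client memory
            }
            rs.close();
            st.close();
            conn.commit();
            conn.close();
        }
    }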

Or if the select uses sorts and PG thinks it has access to more sort
memory than is actually available on the system (due to ulimits,
physical memory restrictions, etc) then you could run into problems that
look like out-of-memory errors on the server.

It could also be something else entirely; exceeding your max
connections, something like that.

A really good place to start would be to enable tracing on the JDBC
driver.  Look at the docs for the PostgreSQL JDBC driver for how to
enable logging; that should give you a much better picture of exactly
where and what is failing.
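
If it helps, on the 8.x drivers I believe tracing can be switched on roughly like this (parameter names vary between
driver versions, so double-check the driver docs):

    // during application startup: give the driver somewhere to write its trace
    java.sql.DriverManager.setLogWriter(new java.io.PrintWriter(System.err, true));

    // and request DEBUG-level output from the driver via the connection URL
    String url = "jdbc:postgresql://dbhost/mydb?loglevel=2";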

If the issue is server-side, then you will also want to look at the
PostgreSQL logs on the server; anything as serious as a backend aborting
should write an entry in the log.

-- Mark

On Tue, Aug 26, 2008 at 10:44 AM, henk de wit <henk53602@hotmail.com> wrote:
>
> Hi,
>
> We're currently having a problem with queries on a medium sized table. This table is 22 GB in size (via select
> pg_size_pretty(pg_relation_size('table'));). It has 7 indexes, which bring the total size of the table to 35 GB
> (measured with pg_total_relation_size).
>
> On this table we're inserting records with a relatively low frequency of +- 6~10 per second. We're using PG 8.3.1 on
> a machine with two dual core 2.4Ghz XEON CPUs, 16 GB of memory and Debian Linux. The machine is completely devoted to
> PG, nothing else runs on the box.
>
> Lately we're getting a lot of exceptions from the Java process that does these inserts: "An I/O error occured while
> sending to the backend." No other information is provided with this exception (besides the stack trace of course).

What do your various logs (pgsql, application, etc...) have to say?
Can you read a java stack trace?  Sometimes slogging through them will
reveal some useful information.

> The pattern is that for about a minute, almost every insert to this 22 GB table results in this exception. After this
> minute everything is suddenly fine and PG happily accepts all inserts again. We tried to nail the problem down, and it
> seems that every time this happens, a select query on this same table is in progress. This select query starts right
> before the insert problems begin and most often right after this select query finishes executing, inserts are fine
> again. Sometimes though inserts only fail in the middle of the execution of this select query. E.g. if the select query
> starts at 12:00 and ends at 12:03, inserts fail from 12:01 to 12:02.

Sounds to me like your connections are timing out (what's your timeout
in jdbc set to?)

A likely cause is that you're getting big checkpoint spikes.  What
does vmstat 10 say during these spikes?  If you're running the
sysstat service with data collection then sar can tell you a lot.
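
Something along these lines is usually enough to see whether I/O wait and writeback spike (the sar file path is the
Debian default; the day number is just an example):

    vmstat 10                              # watch the bi/bo/wa columns while it's happening
    sar -u -f /var/log/sysstat/sa27        # afterwards: CPU/iowait history for the 27th
    sar -d -f /var/log/sysstat/sa27        # per-device I/O, if your data collection includes it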

If it is a checkpoint issue then you need more aggressive bgwriter
settings, and possibly more bandwidth on your storage array.

Note that you can force a checkpoint from a superuser account at the
command line.  You can always force one and see what happens to
performance during it.  You'll need to wait a few minutes or so
between runs to see an effect.
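
i.e. from psql:

    -- as a superuser, while watching vmstat/iostat in another window:
    CHECKPOINT;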

> If the select returns a lot of data and you haven't enabled cursors (by
> calling setFetchSize), then the entire SQL response will be loaded in
> memory at once, so there could be an out-of-memory condition on the
> client.

I hear you. This is absolutely not the case though. There is no other exception anywhere besides the "An I/O error occured while sending to the backend.". The select query eventually returns something between 10 and 50 rows. Also, the select query runs from another machine than the one that issues the inserts to the database. I failed to mention in the opening post that simultaneously with this select/insert query for this single 22 GB table, thousands if not tens of thousands of other queries are hitting other tables in the same database. There are about 70 other tables, with a combined size of about 40 GB. None of those 70 other tables has a size anywhere near the 22 GB of the problematic table. There is never even a single problem of this kind with any of those other tables.

When doing research on the net, it became clear that a lot of these "An I/O error..." exceptions are caused by malfunctioning switches or routers in the network between the application server(s) and the database. In our case this can hardly be true. As mentioned, a great number of other queries are hitting the database. Some of these are very small (execution times of about 10 ms), while others are quite large (hundreds of lines of SQL with over 12 joins and an execution time of several minutes). Not a single one of those results in this I/O error.

> If could also be something else entirely; exceeding your max
> connections, something like that.

We indeed ran into that, but I think more as collateral damage. When this single select query for the 22 GB table is executing and those inserts start to fail, this also starts holding up things. As a result the 6 writes per second start queuing up and requesting more and more connections from our connection pool (DBCP as supplied by Apache Tomcat). We had the maximum of our connection pool set too high, and after a while Postgres responded with a message that the connection limit was exceeded. We thereafter lowered the max of our connection pool and didn't see that particular message anymore.

So, it seems likely that "An I/O error occured while sending to the backend." does not mean "connection limit exceeded", since the latter message is given explicitly when that is the case.

> A really good place to start would be to enable tracing on the JDBC
> driver.

That's a good idea indeed. I'll try to enable this and see what it does.

> If the issue is server-side, then you will also want to look at the
> PostgreSQL logs on the server; anything as serious as a backend aborting
> should write an entry in the log.

We studied the PG logs extensively but unfortunately could not really detect anything that could point to the problem there.



> What do your various logs (pgsql, application, etc...) have to say?

There is hardly anything helpful in the pgsql log. The application log doesn't mention anything either. We log a great deal of information in our application, but there's nothing out of the ordinary there, although there's of course always a chance that somewhere we missed something.

> Can you read a java stack trace? Sometimes slogging through them will
> reveal some useful information.

I can read a java stack trace very well, I'm primarily a Java developer. The stack trace is the following one:

org.postgresql.util.PSQLException: An I/O error occured while sending to the backend.
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:218)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:451)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:350)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:304)
    at org.apache.tomcat.dbcp.dbcp.DelegatingPreparedStatement.executeUpdate(DelegatingPreparedStatement.java:102)
    at org.apache.tomcat.dbcp.dbcp.DelegatingPreparedStatement.executeUpdate(DelegatingPreparedStatement.java:102)


Our code simply executes the same statement.executeUpdate() that it does about 500,000 times during a business day. As soon as this select query is hitting this 22 GB table, there's a chance that 'suddenly' all these utterly simple insert queries start failing. The insert query is nothing special either. It's just an "INSERT INTO ... VALUES (...)" type of thing. The select query can actually be a few different kinds of queries, but basically the common thing between them is reading from this 22 GB table. In fact, our system administrator just told me that even the DB backup is able to trigger this behaviour. As soon as the backup process is reading from this 22 GB table, the inserts on it -may- start to fail.

> Sounds to me like your connections are timing out (what's your timeout
> in jdbc set to?)

There's no explicit timeout being set. Queries can theoretically execute for hours. In some rare cases, some queries indeed run for that long.
 
> A likely cause is that you're getting big checkpoint spikes. What
> does vmstat 10 say during these spikes?

It's hard to reproduce the problem. We're trying to simulate it on our testing servers but haven't been successful yet. The problem typically lasts for only a minute at a time on the production server and there's no saying when it will occur again. Of course we could try to force it by running this select query continuously, but for a production server it's not an easy decision to actually do that. So basically all we were able to do for now is investigate the logs afterwards. I'll try to run vmstat when the problem happens while I'm at the console.

> If you're running the
> sysstate service with data collection then sar can tell you a lot.

Ok, I'm not a big PG expert so I'll have to look into what that means exactly ;) Thanks for the advice though.

Kind regards,
Henk

* henk de wit:

> On this table we're inserting records with a relatively low
> frequency of +- 6~10 per second. We're using PG 8.3.1 on a machine
> with two dual core 2.4Ghz XEON CPUs, 16 GB of memory and Debian
> Linux. The machine is completely devoted to PG, nothing else runs on
> the box.

Have you disabled the OOM killer?

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99

In response to henk de wit <henk53602@hotmail.com>:

> > What do your various logs (pgsql, application, etc...) have to say?
>
> There
> is hardly anything helpful in the pgsql log. The application log
> doesn't mention anything either. We log a great deal of information in
> our application, but there's nothing out of the ordinary there,
> although there's of course always a chance that somewhere we missed
> something.

There should be something in a log somewhere.  Someone suggested the oom
killer might be getting you, if so there should be something in one of
the system logs.

If you can't find anything, then you need to beef up your logs.  Try
increasing the amount of stuff that gets logged by PG by tweaking the
postgresql.conf settings.  Then run iostat, vmstat and top in an endless
loop dumping their output to files (recommend you run date(1) in between
each run, otherwise you can't correlate the output to the time of
occurrence ;)
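
A rough sketch of that loop (log locations and intervals are just examples):

    #!/bin/sh
    # endless sampling loop; adjust paths and intervals to taste
    while true; do
        date           >> /var/log/pgmon/vmstat.log
        vmstat 10 6    >> /var/log/pgmon/vmstat.log    # one minute of samples
        date           >> /var/log/pgmon/iostat.log
        iostat -x 10 6 >> /var/log/pgmon/iostat.log
        date           >> /var/log/pgmon/top.log
        top -b -n 1    >> /var/log/pgmon/top.log
    done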

While you've got all this extra logging going and you're waiting for the
problem to happen again, do an audit of your postgresql.conf settings for
memory usage and see if they actually add up.  How much RAM does the
system have?  How much of it is free?  How much of that are you eating
with shared_buffers?  How much sort_mem did you tell PG it has?  Have
you told PG that it has more memory than the machine actually has?

I've frequently recommended installing pg_buffercache and using mrtg
or something similar to graph various values from it and other easily
accessible statistics in PG and the operating system.  The overhead of
collecting and graphing those values is minimal, and having the data
from those graphs can often be the little red arrow that points you to
the solution to problems like these.  Not to mention the historical
data generally tells you months ahead of time when you're going to
need to scale up to bigger hardware.
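
For example, something like this (assuming the pg_buffercache contrib module is installed; it's only a rough
per-relation count) gives numbers that are easy to graph:

    SELECT c.relname, count(*) AS buffers
      FROM pg_buffercache b
      JOIN pg_class c ON b.relfilenode = c.relfilenode
     WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database
                                  WHERE datname = current_database()))
     GROUP BY c.relname
     ORDER BY buffers DESC
     LIMIT 10;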

On a side note, what version of PG are you using?  If it was in a
previous email, I missed it.

Hope this helps.

--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

Maybe strace could help you find the problem, but it could cause a great
deal of overhead...

"Bill Moran" <wmoran@collaborativefusion.com> escreveu:
> ...
--
Daniel Cristian Cruz
Database Administrator
Direção Regional - Núcleo de Tecnologia da Informação
SENAI - SC
Phone: 48-3239-1422 (ext. 1422)




Bill Moran wrote:

> On a side note, what version of PG are you using?  If it was in a
> previous email, I missed it.
>
He mentioned 8.3.1 in the first email.
Although nothing stands out in the 8.3.2 or 8.3.3 fix list (without
knowing his table structure or any contrib modules used), I wonder if
one of them may resolve his issue.

I also wonder if the error is actually sent back from postgresql or
whether jdbc is throwing the exception because of a timeout waiting for
a response. I would think that with the table in use having 22GB of data
and 13GB of indexes, the long running query has a chance of creating a
delay on the connections that is long enough to give jdbc the impression
that it isn't responding - generating a misleading error code of "An I/O
error" (meaning we know the server got the request but the response from
the server isn't coming back).

Can you increase the timeout settings on the insert connections that are
failing?
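
For example, if your driver version supports it (check the pgjdbc docs for your release), a socketTimeout parameter
on the insert connections' URL, set well above the longest stall you expect:

    // hypothetical example - socketTimeout is in seconds; 0 (the default) means no timeout
    String url = "jdbc:postgresql://dbhost/mydb?socketTimeout=300";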




--

Shane Ambler
pgSQL (at) Sheeky (dot) Biz

Get Sheeky @ http://Sheeky.Biz

On Wed, 27 Aug 2008, Florian Weimer wrote:

> * henk de wit:
>
>> On this table we're inserting records with a relatively low
>> frequency of +- 6~10 per second. We're using PG 8.3.1 on a machine
>> with two dual core 2.4Ghz XEON CPUs, 16 GB of memory and Debian
>> Linux. The machine is completely devoted to PG, nothing else runs on
>> the box.
>
> Have you disabled the OOM killer?

my understanding of the OOM killer is that 'disabling' it is disabling
memory overcommit, making it impossible for you to get into a situation
where the OOM killer would activate, but this means that any load that
would have triggered the OOM killer will always start getting memory
allocation errors before that point.

the OOM killer exists because there are many things that can happen on a
system that allocate memory that 'may' really be needed, but also 'may
not' really be needed.

for example if you have a process that uses 1G of ram (say firefox) and it
needs to start a new process (say acroread to handle a pdf file), what it
does is fork the firefox process (each copy of which has 1G of ram
allocated), and then exec the acroread process (releasing the
1G of ram previously held by that copy of the firefox process).

with memory overcommit enabled (the default), the kernel recognises that
most programs that fork don't write to all the memory they have allocated,
so it marks the 1G of ram that firefox uses as read-only, and if either
copy of firefox writes to a page of memory it splits that page into
separate copies for the separate processes (and if at this time it runs out
of memory it invokes the OOM killer to free some space). when firefox does
an exec almost immediately after the fork it touches basically none of the
pages, so the process only uses 1G of ram total.

if memory overcommit is disabled, the kernel checks to see if you have an
extra 1G of ram available, if you do it allows the process to continue, if
you don't it tries to free memory (by throwing away cache, swapping to
disk, etc), and if it can't free the memory it will return a memory
allocation error (which I believe will cause firefox to exit).


so you can avoid the OOM killer, but the cost of doing so is that you
make far less efficient use of your ram overall.

David Lang

On Wed, Aug 27, 2008 at 02:45:47PM -0700, david@lang.hm wrote:

> with memory overcommit enabled (the default), the kernel recognises that
> most programs that fork don't write to all the memory they have
> allocated,

It doesn't "recognise" it; it "hopes" it.  It happens to hope
correctly in many cases, because you're quite right that many programs
don't actually need all the memory they allocate.  But there's nothing
about the allocation that hints, "By the way, I'm not really planning
to use this."  Also. . .

> seperate copies for the seperate processes (and if at this time it runs of
> of memory it invokes the OOM killer to free some space),

. . .it kills processes that are using a lot of memory.  Those are not
necessarily the processes that are allocating memory they don't need.

The upshot of this is that postgres tends to be a big target for the
OOM killer, with seriously bad effects to your database.  So for good
Postgres operation, you want to run on a machine with the OOM killer
disabled.

A

--
Andrew Sullivan
ajs@commandprompt.com
+1 503 667 4564 x104
http://www.commandprompt.com/

On Wed, 27 Aug 2008, Andrew Sullivan wrote:

> On Wed, Aug 27, 2008 at 02:45:47PM -0700, david@lang.hm wrote:
>
>> with memory overcommit enabled (the default), the kernel recognises that
>> most programs that fork don't write to all the memory they have
>> allocated,
>
> It doesn't "recognise" it; it "hopes" it.  It happens to hope
> correctly in many cases, because you're quite right that many programs
> don't actually need all the memory they allocate.  But there's nothing
> about the allocation that hints, "By the way, I'm not really planning
> to use this."  Also. . .

Ok, I was meaning to say "recognises the fact that a common pattern is to
not use the memory, and so it..."

>> seperate copies for the seperate processes (and if at this time it runs of
>> of memory it invokes the OOM killer to free some space),
>
> . . .it kills processes that are using a lot of memory.  Those are not
> necessarily the processes that are allocating memory they don't need.

the behavior of the OOM killer has changed over time, but so far nobody has
been able to come up with a 'better' strategy for it to follow.

> The upshot of this is that postgres tends to be a big target for the
> OOM killer, with seriously bad effects to your database.  So for good
> Postgres operation, you want to run on a machine with the OOM killer
> disabled.

I disagree with you. I think good Postgres operation is so highly
dependent on caching as much data as possible that disabling overcommit
(and throwing away a lot of memory that could be used for cache) is a
solution that's as bad or worse than the problem it's trying to solve.

I find that adding a modest amount of swap to the system and leaving
overcommit enabled works better for me: if the system starts swapping I
have a chance of noticing and taking action, but it will ride out small
overloads. but the biggest thing is that it's not really any more
acceptable for me to have other programs on the box failing due to memory
allocation errors, and those will be much more common with overcommit
disabled than the OOM killer would be with it enabled.

David Lang

On Wed, Aug 27, 2008 at 4:22 PM,  <david@lang.hm> wrote:
> I disagree with you. I think goof Postgres operation is so highly dependant
> on caching as much data as possible that disabling overcommit (and throwing
> away a lot of memory that could be used for cache) is a solution that's as
> bad or worse than the problem it's trying to solve.
>
> I find that addign a modest amount of swap to the system and leaving
> overcommit enabled works better for me, if the system starts swapping I have
> a chance of noticing and taking action, but it will ride out small
> overloads. but the biggest thing is that it's not that much more acceptable
> for me to have other programs on the box failing due to memory allocation
> errors, and those will be much more common with overcommit disabled then the
> OOM killer would be with it enabled

I don't generally find this to be the case because I usually allocate
about 20-25% of memory to shared_buffers, use another 10-20% for
work_mem across all backends, and let the OS cache with the other
50-60% or so of memory.  In this situation allocations rarely, if
ever, fail.

Note I also turn swappiness to 0 or 1 or something small too.
Otherwise linux winds up swapping out seldom used shared_buffers to
make more kernel cache, which is counterproductive in the extreme.
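
i.e. something like:

    # discourage the kernel from swapping out process memory in favour of more cache
    sysctl -w vm.swappiness=1
    echo "vm.swappiness = 1" >> /etc/sysctl.conf    # make it stick across reboots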

david@lang.hm wrote:
> On Wed, 27 Aug 2008, Andrew Sullivan wrote:

>>> seperate copies for the seperate processes (and if at this time it runs of
>>> of memory it invokes the OOM killer to free some space),
>>
>> . . .it kills processes that are using a lot of memory.  Those are not
>> necessarily the processes that are allocating memory they don't need.
>
> the bahavior of the OOM killer has changed over time, so far nobody has
> been able to come up with a 'better' strategy for it to follow.

The problem with OOM killer for Postgres is that it tends to kill the
postmaster.  That's really dangerous.  If it simply killed a backend
then it wouldn't be so much of a problem.

Some time ago I found that it was possible to fiddle with a /proc entry
to convince the OOM killer not to touch the postmaster.  A postmaster with
the raw IO capability bit set would also be skipped by the OOM killer (this
is an Oracle tweak AFAIK).

These are tricks that people could use in their init scripts to protect
themselves.

(I wonder if the initscript supplied by the RPMs or Debian should
contain such a hack.)

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Alvaro Herrera <alvherre@commandprompt.com> writes:
> Some time ago I found that it was possible to fiddle with a /proc entry
> to convince the OOM to not touch the postmaster.  A postmaster with the
> raw IO capability bit set would be skipped by the OOM too killer (this
> is an Oracle tweak AFAIK).
> These are tricks that people could use in their init scripts to protect
> themselves.

Yeah?  Details please?  Does the bit get inherited by child processes?

> (I wonder if the initscript supplied by the RPMs or Debian should
> contain such a hack.)

It would certainly make sense for my RHEL/Fedora-specific packages,
since those are targeting a very limited range of kernel versions.
Not sure about the situation for other distros.

            regards, tom lane

david@lang.hm writes:
> On Wed, 27 Aug 2008, Andrew Sullivan wrote:
>> The upshot of this is that postgres tends to be a big target for the
>> OOM killer, with seriously bad effects to your database.  So for good
>> Postgres operation, you want to run on a machine with the OOM killer
>> disabled.

> I disagree with you.

Actually, the problem with Linux' OOM killer is that it
*disproportionately targets the PG postmaster*, on the basis not of
memory that the postmaster is using but of memory its child processes
are using.  This was discussed in the PG archives a few months ago;
I'm too lazy to search for the link right now, but the details and links
to confirming kernel documentation are in our archives.

This is one hundred percent antithetical to the basic design philosophy
of Postgres, which is that no matter how badly the child processes screw
up, the postmaster should live to fight another day.  The postmaster
basically exists to restart things after children die ungracefully.
If the OOM killer takes out the postmaster itself (rather than the child
that was actually eating the unreasonable amount of memory), we have no
chance of recovering.

So, if you want a PG installation that is as robust as it's designed to
be, you *will* turn off Linux' OOM killer.  Otherwise, don't complain to
us when your database unexpectedly stops responding.

(Alternatively, if you know how an unprivileged userland process can
defend itself against such exceedingly brain-dead kernel policy, we are
all ears.)

            regards, tom lane

PS: I think this is probably unrelated to the OP's problem, since he
stated there was no sign of any problem from the database server's
side.

The OOM killer is a terrible idea for any serious database server.  I wrote a detailed technical paper on this almost
15 years ago when Silicon Graphics had this same feature, and Oracle and other critical server processes couldn't be
made reliable.

The problem with "overallocating memory" as Linux does by default is that EVERY application, no matter how well
designedand written, becomes unreliable: It can be killed because of some OTHER process.  You can be as clever as you
like,and do all the QA possible, and demonstrate that there isn't a single bug in Postgres, and it will STILL be
unreliableif you run it on a Linux system that allows overcommitted memory. 

IMHO, all Postgres servers should run with memory-overcommit disabled.  On Linux, that means
/proc/sys/vm/overcommit_memory=2.
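
In other words:

    echo 2 > /proc/sys/vm/overcommit_memory              # takes effect immediately
    echo "vm.overcommit_memory = 2" >> /etc/sysctl.conf  # and persists across reboots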

Craig

On Wed, 27 Aug 2008, Craig James wrote:

> The OOM killer is a terrible idea for any serious database server.  I wrote a
> detailed technical paper on this almost 15 years ago when Silicon Graphics
> had this same feature, and Oracle and other critical server processes
> couldn't be made reliable.
>
> The problem with "overallocating memory" as Linux does by default is that
> EVERY application, no matter how well designed and written, becomes
> unreliable: It can be killed because of some OTHER process.  You can be as
> clever as you like, and do all the QA possible, and demonstrate that there
> isn't a single bug in Postgres, and it will STILL be unreliable if you run it
> on a Linux system that allows overcommitted memory.
>
> IMHO, all Postgres servers should run with memory-overcommit disabled.  On
> Linux, that means  /proc/sys/vm/overcommit_memory=2.

it depends on how much stuff you allow others to run on the box. if you
have no control of that then yes, the box is unreliable (but it's not just
because of the OOM killer, it's because those other users can eat up all
the other box resources as well: CPU, network bandwidth, disk bandwidth,
etc)

even with overcommit disabled, the only way you can be sure that a program
will not fail is to make sure that it never needs to allocate memory. with
overcommit off you could have one program that eats up 100% of your ram
without failing (handling the error on memory allocation such that it
doesn't crash), but which will cause _every_ other program on the system
to fail, including any scripts (because every command executed will
require forking, and without overcommit that will require allocating the
total memory that your shell has allocated so that it can run a trivial
command like ps or kill that you are trying to use to fix the problem)

if you have a box with unpredictable memory use, disabling overcommit will
not make it reliable. it may make it less unreliable (the fact that the
linux OOM killer will pick one of the worst possible processes to kill is
a problem), but less unreliable is not the same as reliable.

it's also not that hard to have a process monitor the postmaster (along
with other box resources) and restart it if it is killed. at some point you
can get init to watch your watchdog, and the OOM killer will not kill init.
so while you can't prevent the postmaster from being killed, you can set
things up to recover from it.

David Lang

On Thu, 28 Aug 2008, Tom Lane wrote:

> david@lang.hm writes:
>> On Wed, 27 Aug 2008, Andrew Sullivan wrote:
>>> The upshot of this is that postgres tends to be a big target for the
>>> OOM killer, with seriously bad effects to your database.  So for good
>>> Postgres operation, you want to run on a machine with the OOM killer
>>> disabled.
>
>> I disagree with you.
>
> Actually, the problem with Linux' OOM killer is that it
> *disproportionately targets the PG postmaster*, on the basis not of
> memory that the postmaster is using but of memory its child processes
> are using.  This was discussed in the PG archives a few months ago;
> I'm too lazy to search for the link right now, but the details and links
> to confirming kernel documentation are in our archives.
>
> This is one hundred percent antithetical to the basic design philosophy
> of Postgres, which is that no matter how badly the child processes screw
> up, the postmaster should live to fight another day.  The postmaster
> basically exists to restart things after children die ungracefully.
> If the OOM killer takes out the postmaster itself (rather than the child
> that was actually eating the unreasonable amount of memory), we have no
> chance of recovering.
>
> So, if you want a PG installation that is as robust as it's designed to
> be, you *will* turn off Linux' OOM killer.  Otherwise, don't complain to
> us when your database unexpectedly stops responding.
>
> (Alternatively, if you know how an unprivileged userland process can
> defend itself against such exceedingly brain-dead kernel policy, we are
> all ears.)

there are periodic flamefests on the kernel mailing list over the OOM
killer, if you can propose a better algorithm for it to use than the
current one that doesn't end up being just as bad for some other workload
the kernel policy can be changed.

IIRC the reason why it targets the parent process is to deal with a
fork-bomb type of failure where a program doesn't use much memory itself,
but forks off memory hogs as quickly as it can. if the OOM killer only
kills the children the problem never gets solved.

I assume that the postmaster process is monitoring the back-end processes
by being their parent; is there another way that this monitoring could
be done so that the back-end processes become independent of the
monitoring tool after they are started (the equivalent of nohup)?

while this approach to monitoring may not be as quick to react as a wait
for a child exit, it may be worth doing if it makes the postmaster not be
the prime target of the OOM killer when things go bad on the system.

>             regards, tom lane
>
> PS: I think this is probably unrelated to the OP's problem, since he
> stated there was no sign of any problem from the database server's
> side.

agreed.

David Lang


On Wed, Aug 27, 2008 at 03:22:09PM -0700, david@lang.hm wrote:
>
> I disagree with you. I think goof Postgres operation is so highly dependant
> on caching as much data as possible that disabling overcommit (and throwing
> away a lot of memory that could be used for cache) is a solution that's as
> bad or worse than the problem it's trying to solve.

Ok, but the danger is that the OOM killer kills your postmaster.  To
me, this is a cure way worse than the disease it's trying to treat.
YMMV &c. &c.

A

--
Andrew Sullivan
ajs@commandprompt.com
+1 503 667 4564 x104
http://www.commandprompt.com/

On Wed, 27 Aug 2008, david@lang.hm wrote:
> if memory overcommit is disabled, the kernel checks to see if you have an
> extra 1G of ram available, if you do it allows the process to continue, if
> you don't it tries to free memory (by throwing away cache, swapping to disk,
> etc), and if it can't free the memory will return a memroy allocation error
> (which I believe will cause firefox to exit).

Remember that the memory overcommit check is checking against the amount
of RAM + swap you have - not just the amount of RAM. When a fork occurs,
hardly any extra actual RAM is used (due to copy on write), but the
potential is there for the process to use it. If overcommit is switched
off, then you just need to make sure there is *plenty* of swap to convince
the kernel that it can actually fulfil all of the memory requests if all
the processes behave badly and all shared pages become unshared. Then the
consequences of processes actually using that memory are that the machine
will swap, rather than the OOM killer having to act.

Of course, it's generally bad to run a machine with more going on than
will fit in RAM.

Neither swapping nor OOM killing are particularly good - it's just a
consequence of the amount of memory needed being unpredictable.

Probably the best solution is to just tell the kernel somehow to never
kill the postmaster.

Matthew

--
<Taking apron off> And now you can say honestly that you have been to a
lecture where you watched paint dry.
         - Computer Graphics Lecturer

In response to Matthew Wakeling <matthew@flymine.org>:
>
> Probably the best solution is to just tell the kernel somehow to never
> kill the postmaster.

This thread interested me enough to research this a bit.

In linux, it's possible to tell the OOM killer never to consider
certain processes for the axe, using /proc magic.  See this page:
http://linux-mm.org/OOM_Killer

Perhaps this should be in the PostgreSQL docs somewhere?

--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@collaborativefusion.com
Phone: 412-422-3463x4023

>>> Bill Moran <wmoran@collaborativefusion.com> wrote:
> In response to Matthew Wakeling <matthew@flymine.org>:
>>
>> Probably the best solution is to just tell the kernel somehow to
>> never kill the postmaster.
>
> This thread interested me enough to research this a bit.
>
> In linux, it's possible to tell the OOM killer never to consider
> certain processes for the axe, using /proc magic.  See this page:
> http://linux-mm.org/OOM_Killer
>
> Perhaps this should be in the PostgreSQL docs somewhere?

That sure sounds like a good idea.

Even though the one time the OOM killer kicked in on one of our
servers, it killed a runaway backend and not the postmaster
( http://archives.postgresql.org/pgsql-bugs/2008-07/msg00105.php ),
I think I will modify our service scripts in /etc/init.d/ to pick off
the postmaster pid after a start and echo -16 (or some such) into the
/proc/<pid>/oom_adj file (which is where I found the file on my SuSE
system).
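
Roughly this, I expect (the data directory path is just an example; on these kernels -17 exempts the process from the
OOM killer entirely, -16 just makes it very unlikely to be picked):

    # run after the postmaster has started
    PGPID=$(head -1 /var/lib/pgsql/data/postmaster.pid)
    echo -16 > /proc/$PGPID/oom_adj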

Thanks for the research and the link!

-Kevin

On Aug 28, 2008, at 6:26 AM, Matthew Wakeling wrote:

> On Wed, 27 Aug 2008, david@lang.hm wrote:
>> if memory overcommit is disabled, the kernel checks to see if you
>> have an extra 1G of ram available, if you do it allows the process
>> to continue, if you don't it tries to free memory (by throwing away
>> cache, swapping to disk, etc), and if it can't free the memory will
>> return a memroy allocation error (which I believe will cause
>> firefox to exit).
>
> Remember that the memory overcommit check is checking against the
> amount of RAM + swap you have - not just the amount of RAM. When a
> fork occurs, hardly any extra actual RAM is used (due to copy on
> write), but the potential is there for the process to use it. If
> overcommit is switched off, then you just need to make sure there is
> *plenty* of swap to convince the kernel that it can actually fulfil
> all of the memory requests if all the processes behave badly and all
> shared pages become unshared. Then the consequences of processes
> actually using that memory are that the machine will swap, rather
> than the OOM killer having to act.
>
> Of course, it's generally bad to run a machine with more going on
> than will fit in RAM.
>
> Neither swapping nor OOM killing are particularly good - it's just a
> consequence of the amount of memory needed being unpredictable.
>
> Probably the best solution is to just tell the kernel somehow to
> never kill the postmaster.

Or configure adequate swap space?

Cheers,
   Steve

Another approach we used successfully for a similar problem -- (we had lots
of free high memory but were running out of low memory; oom killer wiped out
MQ a couple times and postmaster a couple times) -- was to change the
settings for how aggressively the virtual memory system protected low memory
by changing /proc/sys/vm/lowmem_reserve_ratio (2.6.18?+ Kernel).  I don't
remember all of the details, but we looked at
Documentation/filesystems/proc.txt for the 2.6.25 kernel (it wasn't
documented for earlier kernel releases) to figure out how it worked and set
it appropriately for our system memory configuration.

-Jerry



david@lang.hm wrote:
> On Wed, 27 Aug 2008, Craig James wrote:
>> The OOM killer is a terrible idea for any serious database server.  I
>> wrote a detailed technical paper on this almost 15 years ago when
>> Silicon Graphics had this same feature, and Oracle and other critical
>> server processes couldn't be made reliable.
>>
>> The problem with "overallocating memory" as Linux does by default is
>> that EVERY application, no matter how well designed and written,
>> becomes unreliable: It can be killed because of some OTHER process.
>> You can be as clever as you like, and do all the QA possible, and
>> demonstrate that there isn't a single bug in Postgres, and it will
>> STILL be unreliable if you run it on a Linux system that allows
>> overcommitted memory.
>>
>> IMHO, all Postgres servers should run with memory-overcommit
>> disabled.  On Linux, that means  /proc/sys/vm/overcommit_memory=2.
>
> it depends on how much stuff you allow others to run on the box. if you
> have no control of that then yes, the box is unreliable (but it's not
> just becouse of the OOM killer, it's becouse those other users can eat
> up all the other box resources as well CPU, network bandwidth, disk
> bandwidth, etc)
>
> even with overcommit disabled, the only way you can be sure that a
> program will not fail is to make sure that it never needs to allocate
> memory. with overcommit off you could have one program that eats up 100%
> of your ram without failing (handling the error on memory allocation
> such that it doesn't crash), but which will cause _every_ other program
> on the system to fail, including any scripts (becouse every command
> executed will require forking and without overcommit that will require
> allocating the total memory that your shell has allocated so that it can
> run a trivial command (like ps or kill that you are trying to use to fix
> the problem)
>
> if you have a box with unpredictable memory use, disabling overcommit
> will not make it reliable. it may make it less unreliable (the fact that
> the linux OOM killer will pick one of the worst possible processes to
> kill is a problem), but less unreliable is not the same as reliable.

The problem with any argument in favor of memory overcommit and OOM is that there is a MUCH better, and simpler,
solution.  Buy a really big disk, say a terabyte, and allocate the whole thing as swap space.  Then do a decent job of
configuring your kernel so that any reasonable process can allocate huge chunks of memory that it will never use, but
can't use the whole terabyte.

Using real swap space instead of overallocated memory is a much better solution.

- It's cheap.
- There is no performance hit at all if you buy enough real memory
- If runaway processes start actually using memory, the system slows
  down, but server processes like Postgres *aren't killed*.
- When a runaway process starts everybody swapping, you can just
  find it and kill it.  Once it's dead, everything else goes back
  to normal.

It's hard to imagine a situation where any program or collection of programs would actually try to allocate more than a
terabyte of memory and exceed the swap space on a single terabyte disk.  The cost is almost nothing, a few hundred
dollars.

So turn off overcommit, and buy an extra disk if you actually need a lot of "virtual memory".

Craig

On Thu, 28 Aug 2008, Steve Atkins wrote:
>> Probably the best solution is to just tell the kernel somehow to never kill
>> the postmaster.
>
> Or configure adequate swap space?

Oh yes, that's very important. However, that gives the machine the
opportunity to thrash.

Matthew

--
The early bird gets the worm. If you want something else for breakfast, get
up later.

On Thu, 28 Aug 2008, Craig James wrote:

> david@lang.hm wrote:
>> On Wed, 27 Aug 2008, Craig James wrote:
>>> The OOM killer is a terrible idea for any serious database server.  I
>>> wrote a detailed technical paper on this almost 15 years ago when Silicon
>>> Graphics had this same feature, and Oracle and other critical server
>>> processes couldn't be made reliable.
>>>
>>> The problem with "overallocating memory" as Linux does by default is that
>>> EVERY application, no matter how well designed and written, becomes
>>> unreliable: It can be killed because of some OTHER process.  You can be as
>>> clever as you like, and do all the QA possible, and demonstrate that there
>>> isn't a single bug in Postgres, and it will STILL be unreliable if you run
>>> it on a Linux system that allows overcommitted memory.
>>>
>>> IMHO, all Postgres servers should run with memory-overcommit disabled.  On
>>> Linux, that means  /proc/sys/vm/overcommit_memory=2.
>>
>> it depends on how much stuff you allow others to run on the box. if you
>> have no control of that then yes, the box is unreliable (but it's not just
>> becouse of the OOM killer, it's becouse those other users can eat up all
>> the other box resources as well CPU, network bandwidth, disk bandwidth,
>> etc)
>>
>> even with overcommit disabled, the only way you can be sure that a program
>> will not fail is to make sure that it never needs to allocate memory. with
>> overcommit off you could have one program that eats up 100% of your ram
>> without failing (handling the error on memory allocation such that it
>> doesn't crash), but which will cause _every_ other program on the system to
>> fail, including any scripts (becouse every command executed will require
>> forking and without overcommit that will require allocating the total
>> memory that your shell has allocated so that it can run a trivial command
>> (like ps or kill that you are trying to use to fix the problem)
>>
>> if you have a box with unpredictable memory use, disabling overcommit will
>> not make it reliable. it may make it less unreliable (the fact that the
>> linux OOM killer will pick one of the worst possible processes to kill is a
>> problem), but less unreliable is not the same as reliable.
>
> The problem with any argument in favor of memory overcommit and OOM is that
> there is a MUCH better, and simpler, solution.  Buy a really big disk, say a
> terabyte, and allocate the whole thing as swap space.  Then do a decent job
> of configuring your kernel so that any reasonable process can allocate huge
> chunks of memory that it will never use, but can't use the whole terrabyte.
>
> Using real swap space instead of overallocated memory is a much better
> solution.
>
> - It's cheap.

cheap in dollars; if you actually use any of it, it's very expensive in
performance

> - There is no performance hit at all if you buy enough real memory
> - If runaway processes start actually using memory, the system slows
> down, but server processes like Postgres *aren't killed*.
> - When a runaway process starts everybody swapping, you can just
> find it and kill it.  Once it's dead, everything else goes back
> to normal.

all of these things are still true if you enable overcommit, the
difference is that with overcommit enabled your actual ram will be used
for cache as much as possible, with overcommit disabled you will keep
throwing away cache to make room for memory that's allocated but not
written to.

I generally allocate 2G of disk to swap, if the system ends up using even
that much it will have slowed to a crawl, but if you are worried that
that's not enough, by all means go ahead and allocate more, but allocating
a 1TB disk is overkill (do you realize how long it takes just to _read_ an
entire 1TB disk? try it sometime with dd if=/dev/drive of=/dev/null)

David Lang

> It's hard to imagine a situation where any program or collection of programs
> would actually try to allocate more than a terrabyte of memory and exceed the
> swap space on a single terrabyte disk.  The cost is almost nothing, a few
> hundred dollars.
>
> So turn off overcommit, and buy an extra disk if you actually need a lot of
> "virtual memory".
>
> Craig
>

On Thu, 28 Aug 2008, Matthew Wakeling wrote:

> On Wed, 27 Aug 2008, david@lang.hm wrote:
>> if memory overcommit is disabled, the kernel checks to see if you have an
>> extra 1G of ram available, if you do it allows the process to continue, if
>> you don't it tries to free memory (by throwing away cache, swapping to
>> disk, etc), and if it can't free the memory will return a memroy allocation
>> error (which I believe will cause firefox to exit).
>
> Remember that the memory overcommit check is checking against the amount of
> RAM + swap you have - not just the amount of RAM. When a fork occurs, hardly
> any extra actual RAM is used (due to copy on write), but the potential is
> there for the process to use it. If overcommit is switched off, then you just
> need to make sure there is *plenty* of swap to convince the kernel that it
> can actually fulfil all of the memory requests if all the processes behave
> badly and all shared pages become unshared. Then the consequences of
> processes actually using that memory are that the machine will swap, rather
> than the OOM killer having to act.

if you are correct that it just checks against memory+swap then it's not a
big deal, but I don't think it does that. I think it actually allocates
the memory, and if it does that it will push things out of ram to do the
allocation, I don't believe that it will allocate swap space directly.

David Lang

> Of course, it's generally bad to run a machine with more going on than will
> fit in RAM.
>
> Neither swapping nor OOM killing are particularly good - it's just a
> consequence of the amount of memory needed being unpredictable.
>
> Probably the best solution is to just tell the kernel somehow to never kill
> the postmaster.
>
> Matthew
>
>

On Thu, 2008-08-28 at 00:56 -0400, Tom Lane wrote:
> Actually, the problem with Linux' OOM killer is that it
> *disproportionately targets the PG postmaster*, on the basis not of
> memory that the postmaster is using but of memory its child processes
> are using.  This was discussed in the PG archives a few months ago;
> I'm too lazy to search for the link right now, but the details and links
> to confirming kernel documentation are in our archives.
>

http://archives.postgresql.org/pgsql-hackers/2008-02/msg00101.php

It's not so much that the OOM Killer targets the parent process for a
fraction of the memory consumed by the child. It may not be a great
design, but it's not what's causing the problem for the postmaster.

The problem for the postmaster is that the OOM killer counts the
children's total vmsize -- including *shared* memory -- against the
parent, which is such a bad idea I don't know where to start. If you
have shared_buffers set to 1GB and 25 connections, the postmaster will
be penalized as though it was using 13.5 GB of memory (roughly its own
~1GB mapping plus half of each child's ~1GB: 1 + 25/2), even though all
the processes together are only using about 1GB!

Not only that, killing a process doesn't free shared memory, so it's
just flat out broken.

Regards,
    Jeff Davis


Matthew Wakeling wrote:
> On Thu, 28 Aug 2008, Steve Atkins wrote:
>>> Probably the best solution is to just tell the kernel somehow to
>>> never kill the postmaster.
>>
>> Or configure adequate swap space?
>
> Oh yes, that's very important. However, that gives the machine the
> opportunity to thrash.

No, that's where the whole argument for allowing overcommitted memory falls flat.

The entire argument for allowing overcommitted memory hinges on the fact that processes *won't use the memory*.  If
they use it, then overcommitting causes problems everywhere, such as a Postmaster getting arbitrarily killed.

If a process *doesn't* use the memory, then there's no problem with thrashing, right?

So it never makes sense to enable overcommitted memory when Postgres, or any server, is running.

Allocating a big, fat terabyte swap disk is ALWAYS better than allowing overcommitted memory.  If your usage is such
that overcommitted memory would never be used, then the swap disk will never be used either.  If your processes do use
the memory, then your performance goes into the toilet, and you know it's time to buy more memory or a second server,
but in the mean time your server processes at least keep running while you kill the rogue processes.

Craig

On Thu, 28 Aug 2008, david@lang.hm wrote:

> On Thu, 28 Aug 2008, Matthew Wakeling wrote:
>
>> On Wed, 27 Aug 2008, david@lang.hm wrote:
>>> if memory overcommit is disabled, the kernel checks to see if you have an
>>> extra 1G of ram available, if you do it allows the process to continue, if
>>> you don't it tries to free memory (by throwing away cache, swapping to
>>> disk, etc), and if it can't free the memory will return a memroy
>>> allocation error (which I believe will cause firefox to exit).
>>
>> Remember that the memory overcommit check is checking against the amount of
>> RAM + swap you have - not just the amount of RAM. When a fork occurs,
>> hardly any extra actual RAM is used (due to copy on write), but the
>> potential is there for the process to use it. If overcommit is switched
>> off, then you just need to make sure there is *plenty* of swap to convince
>> the kernel that it can actually fulfil all of the memory requests if all
>> the processes behave badly and all shared pages become unshared. Then the
>> consequences of processes actually using that memory are that the machine
>> will swap, rather than the OOM killer having to act.
>
> if you are correct that it just checks against memory+swap then it's not a
> big deal, but I don't think it does that. I think it actually allocates the
> memory, and if it does that it will push things out of ram to do the
> allocation, I don't believe that it will allocate swap space directly.

I just asked on the kernel mailing list and Alan Cox responded.

he is saying that you are correct, it only allocates against the total
available, it doesn't actually allocate ram.

so with sufficient swap, overcommit off should be fine.

but you do need to allocate more swap, as the total memory 'used' can be
significantly higher than with overcommit on.

David Lang

> David Lang
>
>> Of course, it's generally bad to run a machine with more going on than will
>> fit in RAM.
>>
>> Neither swapping nor OOM killing are particularly good - it's just a
>> consequence of the amount of memory needed being unpredictable.
>>
>> Probably the best solution is to just tell the kernel somehow to never kill
>> the postmaster.
>>
>> Matthew
>>
>>
>

On Thu, 28 Aug 2008, Craig James wrote:

> Matthew Wakeling wrote:
>> On Thu, 28 Aug 2008, Steve Atkins wrote:
>>>> Probably the best solution is to just tell the kernel somehow to never
>>>> kill the postmaster.
>>>
>>> Or configure adequate swap space?
>>
>> Oh yes, that's very important. However, that gives the machine the
>> opportunity to thrash.
>
> No, that's where the whole argument for allowing overcommitted memory falls
> flat.
>
> The entire argument for allowing overcommitted memory hinges on the fact that
> processes *won't use the memory*.  If they use it, then overcommitting causes
> problems everywhere, such as a Postmaster getting arbitrarily killed.
>
> If a process *doesn't* use the memory, then there's no problem with
> thrashing, right?
>
> So it never makes sense to enable overcommitted memory when Postgres, or any
> server, is running.
>
> Allocating a big, fat terabyte swap disk is ALWAYS better than allowing
> overcommitted memory.  If your usage is such that overcommitted memory would
> never be used, then the swap disk will never be used either.  If your
> processes do use the memory, then your performance goes into the toilet, and
> you know it's time to buy more memory or a second server, but in the mean
> time your server processes at least keep running while you kill the rogue
> processes.

There was a misunderstanding (for me, if nobody else) that without
overcommit it was actual RAM that was getting allocated, which could push
things out to swap even if the memory ended up not being needed later.
With the clarification that this is not the case, and that the allocation
just reduces the available virtual memory, it's now clear that it is just
as efficient to run with overcommit off.

so the conclusion is:

no performance/caching/buffer difference between the two modes.

the differences between the two are:

with overcommit

   when all ram+swap is used, the OOM killer is activated.
   for the same amount of ram+swap, more allocations can be done before it
is all used up (how much more is unpredictable)

without overcommit

   when all ram+swap is allocated, programs (not necessarily the memory
hog) start getting memory allocation errors.
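
You can watch how close you are to that limit via the kernel's commit
accounting (numbers below are just an illustration):

   $ grep -i commit /proc/meminfo
   CommitLimit:    24675768 kB
   Committed_AS:    9102428 kB

With strict overcommit (vm.overcommit_memory=2), allocations start failing
once Committed_AS would exceed CommitLimit, which is swap plus
overcommit_ratio percent of RAM.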


David Lang

On Wed, 2008-08-27 at 23:23 -0700, david@lang.hm wrote:
> there are periodic flamefests on the kernel mailing list over the OOM
> killer, if you can propose a better algorithm for it to use than the
> current one that doesn't end up being just as bad for some other workload
> the kernel policy can be changed.
>

Tried that: http://lkml.org/lkml/2007/2/9/275

All they have to do is *not* count shared memory against the process (or
at least not count it against the parent of the process), and the system
may approximate sanity.

> IIRC the reason why it targets the parent process is to deal with a
> fork-bomb type of failure where a program doesn't use much memory itself,
> but forks off memory hogs as quickly as it can. if the OOM killer only
> kills the children the problem never gets solved.

But killing a process won't free shared memory. And there is already a
system-wide limit on shared memory. So what's the point of such a bad
design?

Regards,
    Jeff Davis


On Thu, 28 Aug 2008, Jeff Davis wrote:
> The problem for the postmaster is that the OOM killer counts the
> children's total vmsize -- including *shared* memory -- against the
> parent, which is such a bad idea I don't know where to start. If you
> have shared_buffers set to 1GB and 25 connections, the postmaster will
> be penalized as though it was using 13.5 GB of memory, even though all
> the processes together are only using about 1GB!

I find it really hard to believe that it counts shared memory like that.
That's just dumb.

Of course, there are two types of "shared" memory. There's explicit shared
memory, like Postgres uses, and there's copy-on-write "shared" memory,
caused by a process fork. The copy-on-write memory needs to be counted for
each child, but the explicit shared memory needs to be counted just once.

> Not only that, killing a process doesn't free shared memory, so it's
> just flat out broken.

Exactly. A cost-benefit model would work well here. Work out how much RAM
would be freed by killing a process, and use that when choosing which
process to kill.

Matthew

--
You will see this is a 3-blackboard lecture. This is the closest you are going
to get from me to high-tech teaching aids. Hey, if they put nooses on this, it
would be fun!                           -- Computer Science Lecturer

On Thu, 28 Aug 2008, david@lang.hm wrote:
> I just asked on the kernel mailing list and Alan Cox responded.
>
> he is saying that you are correct, it only allocates against the total
> available, it doesn't actually allocate ram.

That was remarkably graceful of you. Yes, operating systems have worked
that way for decades - it's the beauty of copy-on-write.

> but you do need to allocate more swap as the total memory 'used' can be
> significantly higher that with overcommit on.

Yes, that's right.

Matthew

--
Note: some countries impose serious penalties for a conspiracy to overthrow
      the political system. THIS DOES NOT FIX THE VULNERABILITY.
                          -- http://lcamtuf.coredump.cx/soft/trash/1apr.txt

On Thu, 28 Aug 2008, Craig James wrote:
> If your processes do use the memory, then your performance goes into the
> toilet, and you know it's time to buy more memory or a second server,
> but in the mean time your server processes at least keep running while
> you kill the rogue processes.

I'd argue against swap ALWAYS being better than overcommit. It's a choice
between your performance going into the toilet or your processes dying.

On the one hand, if someone fork-bombs you, the OOM killer has a chance of
solving the problem for you, rather than you having to log onto an
unresponsive machine to kill the process yourself. On the other hand, the
OOM killer may kill the wrong thing. Depending on what else you use your
machine for, either of the choices may be the right one.

Another point is that from a business perspective, a database that has
stopped responding is equally bad regardless of whether that is because
the OOM killer has appeared or because the machine is thrashing. In both
cases, there is a maximum throughput that the machine can handle, and if
requests appear quicker than that the system will collapse, especially if
the requests start timing out and being retried.

This problem really is caused by the kernel not having enough information
on how much memory a process is going to use. I would be much in favour of
replacing fork() with some more informative system call. For example,
forkandexec() instead of fork() then exec() - the kernel would know that
the new process will never need any of that duplicated RAM. However, there
is *far* too much legacy in the old fork() call to change that now.

Likewise, I would be all for Postgres managing its memory better. It would
be very nice to be able to set a maximum amount of work-memory, rather
than a maximum amount per backend. Each backend could then make do with
however much is left of the work-memory pool when it actually executes
queries. As it is, the server admin has no idea how many multiples of
work-mem are going to be actually used, even knowing the maximum number of
backends.
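
To put a number on that (purely illustrative figures), with something like

   max_connections = 200
   work_mem = 32MB

the worst case is not simply 200 x 32MB = 6.4GB, because work_mem is a
per-sort/per-hash limit, so one complex query can use several multiples of
it at once.  The real ceiling is some unknown multiple of that figure.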

Matthew

--
Of course it's your fault. Everything here's your fault - it says so in your
contract.                                    - Quark

On Thu, Aug 28, 2008 at 2:29 PM, Matthew Wakeling <matthew@flymine.org> wrote:

> Another point is that from a business perspective, a database that has
> stopped responding is equally bad regardless of whether that is because the
> OOM killer has appeared or because the machine is thrashing. In both cases,
> there is a maximum throughput that the machine can handle, and if requests
> appear quicker than that the system will collapse, especially if the
> requests start timing out and being retried.

But there's a HUGE difference between a machine that has bogged down
under load so badly that you have to reset it and a machine that's had
the postmaster slaughtered by the OOM killer.  In the first situation,
while the machine is unresponsive, it should come right back up with a
coherent database after the restart.

OTOH, a machine with a dead postmaster is far more likely to have a
corrupted database when it gets restarted.

> Likewise, I would be all for Postgres managing its memory better. It would
> be very nice to be able to set a maximum amount of work-memory, rather than
> a maximum amount per backend. Each backend could then make do with however
> much is left of the work-memory pool when it actually executes queries. As
> it is, the server admin has no idea how many multiples of work-mem are going
> to be actually used, even knowing the maximum number of backends.

Agreed.  It would be useful to have a cap on all work_mem, but it
might be an issue that causes all the backends to talk to each other,
which can be really slow if you're running a thousand or so
connections.

On Thu, 28 Aug 2008, Scott Marlowe wrote:

> On Thu, Aug 28, 2008 at 2:29 PM, Matthew Wakeling <matthew@flymine.org> wrote:
>
>> Another point is that from a business perspective, a database that has
>> stopped responding is equally bad regardless of whether that is because the
>> OOM killer has appeared or because the machine is thrashing. In both cases,
>> there is a maximum throughput that the machine can handle, and if requests
>> appear quicker than that the system will collapse, especially if the
>> requests start timing out and being retried.
>
> But there's a HUGE difference between a machine that has bogged down
> under load so badly that you have to reset it and a machine that's had
> the postmaster slaughtered by the OOM killer.  In the first situation,
> while the machine is unresponsive, it should come right back up with a
> coherent database after the restart.
>
> OTOH, a machine with a dead postmaster is far more likely to have a
> corrupted database when it gets restarted.

wait a minute here: postgres is supposed to be able to survive a complete box
failure without corrupting the database.  If killing a process can corrupt
the database, that sounds like a major problem.

David Lang

>> Likewise, I would be all for Postgres managing its memory better. It would
>> be very nice to be able to set a maximum amount of work-memory, rather than
>> a maximum amount per backend. Each backend could then make do with however
>> much is left of the work-memory pool when it actually executes queries. As
>> it is, the server admin has no idea how many multiples of work-mem are going
>> to be actually used, even knowing the maximum number of backends.
>
> Agreed.  It would be useful to have a cap on all work_mem, but it
> might be an issue that causes all the backends to talk to each other,
> which can be really slow if you're running a thousand or so
> connections.
>
>

On Thu, Aug 28, 2008 at 5:08 PM,  <david@lang.hm> wrote:
> On Thu, 28 Aug 2008, Scott Marlowe wrote:
>
>> On Thu, Aug 28, 2008 at 2:29 PM, Matthew Wakeling <matthew@flymine.org>
>> wrote:
>>
>>> Another point is that from a business perspective, a database that has
>>> stopped responding is equally bad regardless of whether that is because
>>> the
>>> OOM killer has appeared or because the machine is thrashing. In both
>>> cases,
>>> there is a maximum throughput that the machine can handle, and if
>>> requests
>>> appear quicker than that the system will collapse, especially if the
>>> requests start timing out and being retried.
>>
>> But there's a HUGE difference between a machine that has bogged down
>> under load so badly that you have to reset it and a machine that's had
>> the postmaster slaughtered by the OOM killer.  In the first situation,
>> while the machine is unresponsive, it should come right back up with a
>> coherent database after the restart.
>>
>> OTOH, a machine with a dead postmaster is far more likely to have a
>> corrupted database when it gets restarted.
>
> wait a min here, postgres is supposed to be able to survive a complete box
> failure without corrupting the database, if killing a process can corrupt
> the database it sounds like a major problem.

Yes it is a major problem, but not with postgresql.  It's a major
problem with the linux OOM killer killing processes that should not be
killed.

Would it be postgresql's fault if it corrupted data because my machine
had bad memory?  Or a bad hard drive?  This is the same kind of
failure.  The postmaster should never be killed.  It's the one thing
holding it all together.

On Thu, 28 Aug 2008, Scott Marlowe wrote:

> On Thu, Aug 28, 2008 at 5:08 PM,  <david@lang.hm> wrote:
>> On Thu, 28 Aug 2008, Scott Marlowe wrote:
>>
>>> On Thu, Aug 28, 2008 at 2:29 PM, Matthew Wakeling <matthew@flymine.org>
>>> wrote:
>>>
>>>> Another point is that from a business perspective, a database that has
>>>> stopped responding is equally bad regardless of whether that is because
>>>> the
>>>> OOM killer has appeared or because the machine is thrashing. In both
>>>> cases,
>>>> there is a maximum throughput that the machine can handle, and if
>>>> requests
>>>> appear quicker than that the system will collapse, especially if the
>>>> requests start timing out and being retried.
>>>
>>> But there's a HUGE difference between a machine that has bogged down
>>> under load so badly that you have to reset it and a machine that's had
>>> the postmaster slaughtered by the OOM killer.  In the first situation,
>>> while the machine is unresponsive, it should come right back up with a
>>> coherent database after the restart.
>>>
>>> OTOH, a machine with a dead postmaster is far more likely to have a
>>> corrupted database when it gets restarted.
>>
>> wait a min here, postgres is supposed to be able to survive a complete box
>> failure without corrupting the database, if killing a process can corrupt
>> the database it sounds like a major problem.
>
> Yes it is a major problem, but not with postgresql.  It's a major
> problem with the linux OOM killer killing processes that should not be
> killed.
>
> Would it be postgresql's fault if it corrupted data because my machine
> had bad memory?  Or a bad hard drive?  This is the same kind of
> failure.  The postmaster should never be killed.  It's the one thing
> holding it all together.

the ACID guarantees that postgres is making are supposed to mean that even
if the machine dies, the CPU goes up in smoke, etc, the transactions that
are completed will not be corrupted.

if killing the process voids all the ACID protection then something is
seriously wrong.

it may lose transactions that are in flight, but it should not corrupt
the database.

David Lang

On Thu, Aug 28, 2008 at 7:16 PM,  <david@lang.hm> wrote:
> the ACID guarantees that postgres is making are supposed to mean that even
> if the machine dies, the CPU goes up in smoke, etc, the transactions that
> are completed will not be corrupted.

And if any of those things happens, the machine will shut down and
you'll be safe.

> if killing the process voids all the ACID protection then something is
> seriously wrong.

No, your understanding of what postgresql can expect to have happen to
it is wrong.

You'll lose data integrity if:
  a CPU starts creating bad output that gets written to disk,
  your RAID controller starts writing garbage to disk,
  your memory has bad bits and you don't have ECC,
  some program hijacks a postgres process and starts writing random bits
into the code, or
  some program comes along and kills the postmaster, which coordinates
all the backends, and corrupts shared data in the process.

> it may loose transactions that are in flight, but it should not corrupt the
> database.

That's true for anything that just stops the machine or all the
postgresql processes dead.

It's not true for a machine that is misbehaving.  And any server that
randomly kills processes is misbehaving.

On Thu, Aug 28, 2008 at 8:11 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> wait a min here, postgres is supposed to be able to survive a complete box
> failure without corrupting the database, if killing a process can corrupt
> the database it sounds like a major problem.

Yes it is a major problem, but not with postgresql.  It's a major
problem with the linux OOM killer killing processes that should not be
killed.

Would it be postgresql's fault if it corrupted data because my machine
had bad memory?  Or a bad hard drive?  This is the same kind of
failure.  The postmaster should never be killed.  It's the one thing
holding it all together.

I fail to see the difference between the OOM killing it and the power going out.  And yes, if the power went out and PG came up with a corrupted DB (assuming I didn't turn off fsync, etc) I *would* blame PG.  I understand that killing the postmaster could stop all useful PG work, that it could cause it to stop responding to clients, that it could even "crash" PG, et cetera, but if a particular process dying causes corrupted DBs, that sounds borked to me.
On Thu, Aug 28, 2008 at 7:53 PM, Matthew Dennis <mdennis@merfer.net> wrote:
> On Thu, Aug 28, 2008 at 8:11 PM, Scott Marlowe <scott.marlowe@gmail.com>
> wrote:
>>
>> > wait a min here, postgres is supposed to be able to survive a complete
>> > box
>> > failure without corrupting the database, if killing a process can
>> > corrupt
>> > the database it sounds like a major problem.
>>
>> Yes it is a major problem, but not with postgresql.  It's a major
>> problem with the linux OOM killer killing processes that should not be
>> killed.
>>
>> Would it be postgresql's fault if it corrupted data because my machine
>> had bad memory?  Or a bad hard drive?  This is the same kind of
>> failure.  The postmaster should never be killed.  It's the one thing
>> holding it all together.
>
> I fail to see the difference between the OOM killing it and the power going
> out.

Then you fail to understand.

scenario 1:  There's a postmaster, it owns all the child processes.
It gets killed.  The Postmaster gets restarted.  Since there isn't one
running, it comes up and starts new child processes.  Meanwhile, the old
child processes that don't belong to it are busy writing to the data
store.  Instant corruption.

scenario 2: Someone pulls the plug.  Every postgres child dies a quick
death.  Data on the drives is coherent and recoverable.
>>  And yes, if the power went out and PG came up with a corrupted DB
> (assuming I didn't turn off fsync, etc) I *would* blame PG.

Then you might be wrong.  If you were using the LVM, or certain levels
of SW RAID, or a RAID controller with cache with no battery backing
that is set to write-back, or if you were using an IDE or SATA drive /
controller that didn't support write barriers, or using NFS mounts for
database storage, and so on.  My point being that PostgreSQL HAS to
make certain assumptions about its environment that it simply cannot
directly control or test for.  Not having the postmaster shot in the
head while the children keep running is one of those things.
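
(If you want to check whether you're in one of those situations, a couple
of quick sanity checks; the commands are the usual Linux ones and the
device and path are only examples:

   # hdparm -W /dev/sda                      # is the drive's volatile write cache on?
   # grep /var/lib/postgresql /proc/mounts   # what filesystem/options hold the data dir?

A drive with its write cache on and no working write barriers will happily
lie to fsync, which is exactly the kind of environment PostgreSQL cannot
detect on its own.)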

>  I understand
> that killing the postmaster could stop all useful PG work, that it could
> cause it to stop responding to clients, that it could even "crash" PG, et
> ceterabut if a particular process dying causes corrupted DBs, that sounds
> borked to me.

Well, design a better method and implement it.  If everything went
through the postmaster you'd be lucky to get 100 transactions per
second.  There are compromises between performance and reliability
under fire that have to be made.  It is not unreasonable to assume
that your OS is not going to randomly kill off processes because of a
dodgy VM implementation quirk.

P.s. I'm a big fan of linux, and I run my dbs on it.  But I turn off
overcommit and make a few other adjustments to make sure my database
is safe.  The OOM killer as a default is fine for workstations, but
it's an insane setting for servers, much like swappiness=60 is an
insane setting for a server too.
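
For what it's worth, the sort of adjustments I mean look roughly like this
in /etc/sysctl.conf (the values are a judgment call, not gospel):

   vm.overcommit_memory = 2
   vm.swappiness = 10

followed by "sysctl -p" to apply them.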

Scott Marlowe wrote:

> scenario 1:  There's a postmaster, it owns all the child processes.
> It gets killed.  The Postmaster gets restarted.  Since there isn't one
> running, it comes up.

Actually there's an additional step required at this point.  There isn't
a postmaster running, but a new one refuses to start, because the shmem
segment is in use.  In order for the second postmaster to start, the
sysadmin must remove the PID file by hand.

> starts new child processes.  Meanwhile, the old child processes that
> don't belong to it are busy writing to the data store.  Instant
> corruption.

In this scenario, it is both a kernel fault and sysadmin stupidity.  The
corruption that ensues is 100% deserved.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

On Thu, 28 Aug 2008, Scott Marlowe wrote:

> On Thu, Aug 28, 2008 at 7:53 PM, Matthew Dennis <mdennis@merfer.net> wrote:
>> On Thu, Aug 28, 2008 at 8:11 PM, Scott Marlowe <scott.marlowe@gmail.com>
>> wrote:
>>>
>>>> wait a min here, postgres is supposed to be able to survive a complete
>>>> box
>>>> failure without corrupting the database, if killing a process can
>>>> corrupt
>>>> the database it sounds like a major problem.
>>>
>>> Yes it is a major problem, but not with postgresql.  It's a major
>>> problem with the linux OOM killer killing processes that should not be
>>> killed.
>>>
>>> Would it be postgresql's fault if it corrupted data because my machine
>>> had bad memory?  Or a bad hard drive?  This is the same kind of
>>> failure.  The postmaster should never be killed.  It's the one thing
>>> holding it all together.
>>
>> I fail to see the difference between the OOM killing it and the power going
>> out.
>
> Then you fail to understand.
>
> scenario 1:  There's a postmaster, it owns all the child processes.
> It gets killed.  The Postmaster gets restarted.  Since there isn't one

when the postmaster gets killed doesn't that kill all its children as
well?

> running, it comes up.  starts new child processes.  Meanwhile, the old
> child processes that don't belong to it are busy writing to the data
> store.  Instant corruption.

if so then the postmaster should not only check if there is an existing
postmaster running, it should check for the presence of the child
processes as well.

> scenario 2: Someone pulls the plug.  Every postgres child dies a quick
> death.  Data on the drives is coherent and recoverable.
>>>  And yes, if the power went out and PG came up with a corrupted DB
>> (assuming I didn't turn off fsync, etc) I *would* blame PG.
>
> Then you might be wrong.  If you were using the LVM, or certain levels
> of SW RAID, or a RAID controller with cache with no battery backing
> that is set to write-back, or if you were using an IDE or SATA drive /
> controller that didn't support write barriers, or using NFS mounts for
> database storage, and so on.

these all fall under "(assuming I didn't turn off fsync, etc)"

> My point being that PostgreSQL HAS to
> make certain assumptions about its environment that it simply cannot
> directly control or test for.  Not having the postmaster shot in the
> head while the children keep running is one of those things.
>
>>  I understand
>> that killing the postmaster could stop all useful PG work, that it could
>> cause it to stop responding to clients, that it could even "crash" PG, et
>> ceterabut if a particular process dying causes corrupted DBs, that sounds
>> borked to me.
>
> Well, design a better method and implement it.  If everything went
> through the postmaster you'd be lucky to get 100 transactions per
> second.

well, if you aren't going through the postmaster, what process is
receiving network messages? it can't be a group of processes; only one can
be listening on a socket at one time.

and if the postmaster isn't needed for the child processes to write to the
datastore, how are multiple child processes prevented from writing to the
datastore normally? and why doesn't that mechanism continue to work?

>  There are compromises between performance and reliability
> under fire that have to be made.  It is not unreasonable to assume
> that your OS is not going to randomly kill off processes because of a
> dodgy VM implementation quirk.
>
> P.s. I'm a big fan of linux, and I run my dbs on it.  But I turn off
> overcommit and make a few other adjustments to make sure my database
> is safe.  The OOM killer as a default is fine for workstations, but
> it's an insane setting for servers, much like swappiness=60 is an
> insane setting for a server too.

so are you saying that the only possible thing that can kill the
postmaster is the OOM killer? it can't possibly exit in any other
situation without the children being shut down first?

I would be surprised if that was really true.

David Lang

david@lang.hm wrote:
> On Thu, 28 Aug 2008, Scott Marlowe wrote:

>> scenario 1:  There's a postmaster, it owns all the child processes.
>> It gets killed.  The Postmaster gets restarted.  Since there isn't one
>
> when the postmaster gets killed doesn't that kill all it's children as
> well?

Of course not.  The postmaster gets a SIGKILL, which is instant death.
There's no way to signal the children.  If they were killed too then
this wouldn't be much of a problem.

>> running, it comes up.  starts new child processes.  Meanwhile, the old
>> child processes that don't belong to it are busy writing to the data
>> store.  Instant corruption.
>
> if so then the postmaster should not only check if there is an existing
> postmaster running, it should check for the presense of the child
> processes as well.

See my other followup.  There are limited things it can check, but against
sysadmin stupidity there's no silver bullet.

> well, if you aren't going through the postmaster, what process is
> recieving network messages? it can't be a group of processes, only one
> can be listening to a socket at one time.

Huh?  Each backend has its own socket.

> and if the postmaster isn't needed for the child processes to write to
> the datastore, how are multiple child processes prevented from writing to
> the datastore normally? and why doesn't that mechanism continue to work?

They use locks.  Those locks are implemented using shared memory.  If a
new postmaster starts, it gets a new shared memory, and a new set of
locks, that do not conflict with the ones already held by the first gang
of backends.  This is what causes the corruption.
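
(You can see the live segment from the shell, e.g.

   $ ipcs -m | grep postgres
   0x0052e2c1 1234567    postgres   600        37879808   26

where the numbers above are only an illustration; the key and size depend
on your port and shared_buffers settings.)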


> so are you saying that the only possible thing that can kill the
> postmaster is the OOM killer? it can't possilby exit in any other
> situation without the children being shutdown first?
>
> I would be surprised if that was really true.

If the sysadmin sends a SIGKILL then obviously the same thing happens.

Any other signal gives it the chance to signal the children before
dying.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

On Thu, 28 Aug 2008, Alvaro Herrera wrote:

> david@lang.hm wrote:
>> On Thu, 28 Aug 2008, Scott Marlowe wrote:
>
>>> scenario 1:  There's a postmaster, it owns all the child processes.
>>> It gets killed.  The Postmaster gets restarted.  Since there isn't one
>>
>> when the postmaster gets killed doesn't that kill all it's children as
>> well?
>
> Of course not.  The postmaster gets a SIGKILL, which is instant death.
> There's no way to signal the children.  If they were killed too then
> this wouldn't be much of a problem.

I'm not saying that it would signal its children; I thought that the OS
killed the children (unless steps were taken to allow them to re-parent).

>> well, if you aren't going through the postmaster, what process is
>> recieving network messages? it can't be a group of processes, only one
>> can be listening to a socket at one time.
>
> Huh?  Each backend has its own socket.

we must be talking about different things. I'm talking about the socket
that would be used for clients to talk to postgres; this is either a TCP
socket or a unix socket, and in either case only one process can listen on it.

>> and if the postmaster isn't needed for the child processes to write to
>> the datastore, how are multiple child processes prevented from writing to
>> the datastore normally? and why doesn't that mechanism continue to work?
>
> They use locks.  Those locks are implemented using shared memory.  If a
> new postmaster starts, it gets a new shared memory, and a new set of
> locks, that do not conflict with the ones already held by the first gang
> of backends.  This is what causes the corruption.

so the new postmaster needs to detect that there is a shared memory
segment out there that is used by backends for this database.

this doesn't sound that hard, basically something similar to a pid file in
the db directory that records what backends are running and what shared
memory segment they are using.

this would be similar to the existing pid file that would have to be
removed manually before a new postmaster can start (if it's not a graceful
shutdown)

besides, some watchdog would need to start the new postmaster; that
watchdog can be taught to kill off the child processes and clear the pid
file before starting a new postmaster.

>> so are you saying that the only possible thing that can kill the
>> postmaster is the OOM killer? it can't possilby exit in any other
>> situation without the children being shutdown first?
>>
>> I would be surprised if that was really true.
>
> If the sysadmin sends a SIGKILL then obviously the same thing happens.
>
> Any other signal gives it the chance to signal the children before
> dying.

are you sure that it's not going to die from a memory allocation error, or
any other similar type of error, without _always_ killing the children?

David Lang

On Tue, 26 Aug 2008, Scott Marlowe wrote:

> If it is a checkpoint issue then you need more aggresive bgwriter
> settings, and possibly more bandwidth on your storage array.

Since this is 8.3.1, the main useful thing to do is increase
checkpoint_segments and checkpoint_completion_target to spread the I/O
over a longer period.  Making the background writer more aggressive
doesn't really help with checkpoint I/O spikes in this version.

What is checkpoint_segments set to on this system?  If it's still at the
default of 3, you should increase that dramatically.
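
As a purely illustrative starting point (tune it to your write volume and
how long a crash recovery you can tolerate), something like this in
postgresql.conf:

   checkpoint_segments = 30
   checkpoint_completion_target = 0.9

spreads the checkpoint writes over roughly 90% of the interval between
checkpoints instead of issuing them in one burst.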

> What does vmstat 10 say during these spikes?  If you're running the
> sysstate service with data collection then sar can tell you a lot.

Henk seemed a bit confused about this suggestion, and the typo doesn't
help.  You can install the sysstat package with:

# apt-get install sysstat

This allows collecting system load info data at regular periods,
automatically, and sar is the tool you can use to look at it.  On Debian,
in order to get it to collect that information for you, I believe you just
need to do:

# dpkg-reconfigure sysstat

Then answer "yes" to "Do you want to activate sysstat's cron job?"  This
will install a crontab file that collects all the data you need for sar to
work.  You may need to restart the service after that.  There's a useful
walkthrough for this at
http://www.linuxweblog.com/blogs/wizap/20080126/sysstat-ubuntu
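
Once it's collecting, a few invocations worth knowing (options as in the
sysstat versions I've used; check your man page):

   $ sar -u         # CPU utilization for today, sampled every 10 minutes
   $ sar -b         # I/O and transfer rate statistics
   $ vmstat 10      # live numbers while a bad minute is actually happening

and then see whether the failing inserts line up with an I/O spike.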

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

On Thu, 28 Aug 2008, Bill Moran wrote:

> In linux, it's possible to tell the OOM killer never to consider
> certain processes for the axe, using /proc magic.  See this page:
> http://linux-mm.org/OOM_Killer
>
> Perhaps this should be in the PostgreSQL docs somewhere?

The fact that
http://www.postgresql.org/docs/current/static/kernel-resources.html#AEN22218
tells you to flat-out turn off overcommit is the right conservative thing
to be in the documentation as I see it.  Sure, it's possible to keep it on
but disable the worst side-effect in some kernels (looks like 2.6.11+, so
no RHEL4 for example).  Trying to get into all of that in the manual is
kind of pushing what's appropriate for the PostgreSQL docs, I think.
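
(For anyone who does go that route on a 2.6.11+ kernel, the knob on the
page Bill linked is the per-process oom_adj file, with -17 meaning "never
kill this".  Something along the lines of

   # echo -17 > /proc/$(head -1 /path/to/data/postmaster.pid)/oom_adj

run after each postmaster start would do it; the path is a placeholder and
the details are from memory, so check that page before relying on it.)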

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

david@lang.hm wrote:
> for example if you have a process that uses 1G of ram (say firefox)
> and it needs to start a new process (say acroread to handle a pdf
> file), what it does is it forks the firefox process (each of which
> have 1G of ram allocated), and then does an exec of the acroread
> process (releasing the 1G of ram previously held by that copy of the
> firefox process)
>
Indeed, which is why we have vfork.  And, OK, vfork is busted if you
have a threaded environment, so we have posix_spawn and posix_spawnp.

It is also worth noting that the copy isn't really a full copy on any
decent modern UNIX - it is a reservation against the total swap space
available.  Most pages will be happily shared copy-on-write and never
fully copied to the child before the exec.

I can't see how an OS can lie to processes about memory being allocated
to them and not be ridiculed as a toy, but there you go.  I don't think
Linux is the only perpetrator - doesn't AIX do this too?

The 'best strategy' for the OOM killer is not to have one, and accept
that you need some swap space available (it doesn't have to be fast
since it won't actually be touched) to help out when fork/exec happens
in big process images.

James


david@lang.hm wrote:
> On Thu, 28 Aug 2008, Scott Marlowe wrote:
>>> wait a min here, postgres is supposed to be able to survive a
>>> complete box
>>> failure without corrupting the database, if killing a process can
>>> corrupt
>>> the database it sounds like a major problem.
>>
>> Yes it is a major problem, but not with postgresql.  It's a major
>> problem with the linux OOM killer killing processes that should not be
>> killed.
>>
>> Would it be postgresql's fault if it corrupted data because my machine
>> had bad memory?  Or a bad hard drive?  This is the same kind of
>> failure.  The postmaster should never be killed.  It's the one thing
>> holding it all together.
>
> the ACID guarantees that postgres is making are supposed to mean that
> even if the machine dies, the CPU goes up in smoke, etc, the
> transactions that are completed will not be corrupted.
>
> if killing the process voids all the ACID protection then something is
> seriously wrong.
>
> it may loose transactions that are in flight, but it should not corrupt
> the database.

AFAIK, it's not the killing of the postmaster that's the problem. The
backends will continue running and *not* corrupt anything, because the
shared memory and locking sticks around between them.

The issue is if you manage to start a *new* postmaster against the same
data directory. But there's a whole bunch of safeguards against that, so
it certainly shouldn't be something you manage to do by mistake.

I may end up being corrected by someone who knows more, but that's how
I've understood it works. Meaning it is safe against OOM killer, except
it requires manual work to come back up. But it shouldn't corrupt your data.

//Magnus

On Thu, 28 Aug 2008, david@lang.hm wrote:
>> Huh?  Each backend has its own socket.
>
> we must be talking about different things. I'm talking about the socket that
> would be used for clients to talk to postgres, this is either a TCP socket or
> a unix socket. in either case only one process can listen on it.

The postmaster opens a socket for listening. Only one process can do that.
When an incoming connection is received, postmaster passes that connection
on to a child backend process. The child then has a socket, but it is a
connected socket, not a listening socket.

Matthew

--
Anyone who goes to a psychiatrist ought to have his head examined.

In response to Greg Smith <gsmith@gregsmith.com>:

> On Thu, 28 Aug 2008, Bill Moran wrote:
>
> > In linux, it's possible to tell the OOM killer never to consider
> > certain processes for the axe, using /proc magic.  See this page:
> > http://linux-mm.org/OOM_Killer
> >
> > Perhaps this should be in the PostgreSQL docs somewhere?
>
> The fact that
> http://www.postgresql.org/docs/current/static/kernel-resources.html#AEN22218
> tells you to flat-out turn off overcommit is the right conservative thing
> to be in the documentation as I see it.  Sure, it's possible to keep it on
> but disable the worst side-effect in some kernels (looks like 2.6.11+, so
> no RHEL4 for example).  Trying to get into all in the manual is kind of
> pushing what's appropriate for the PostgreSQL docs I think.

I don't know, Greg.  First off, the solution of making the postmaster
immune to the OOM killer seems better than disabling overcommit to me
anyway; and secondly, I don't understand why we should avoid making
the PG documentation as comprehensive as possible, which seems to be
what you are saying: "we shouldn't make the PG documentation too
comprehensive, because then it will get very big"

--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@collaborativefusion.com
Phone: 412-422-3463x4023

Bill Moran wrote:

> In response to Greg Smith <gsmith@gregsmith.com>:
>
>> <snipped...>
>
> I don't know, Greg.  First off, the solution of making the postmaster
> immune to the OOM killer seems better than disabling overcommit to me
> anyway; and secondly, I don't understand why we should avoid making
> the PG documentation as comprehensive as possible, which seems to be
> what you are saying: "we shouldn't make the PG documentation too
> comprehensive, because then it will get very big"

I think it would be a hopeless morass for PostgreSQL to try to document each evolution of each OS it runs under; the general caveat seems fine, although perhaps adding something to the effect of "search the archives for possible specifics" might be in order. But tracking postgres's own shifts and requirements seems daunting enough w/out adding in endless flavours of different OSs.

My $0.02 worth ...

Greg Williamson
Senior DBA
DigitalGlobe


david@lang.hm wrote:
> On Thu, 28 Aug 2008, Alvaro Herrera wrote:
>
>> david@lang.hm wrote:
>>> On Thu, 28 Aug 2008, Scott Marlowe wrote:
>>
>>>> scenario 1:  There's a postmaster, it owns all the child processes.
>>>> It gets killed.  The Postmaster gets restarted.  Since there isn't one
>>>
>>> when the postmaster gets killed doesn't that kill all it's children as
>>> well?
>>
>> Of course not.  The postmaster gets a SIGKILL, which is instant death.
>> There's no way to signal the children.  If they were killed too then
>> this wouldn't be much of a problem.
>
> I'm not saying that it would signal it's children, I thought that the OS
> killed children (unless steps were taken to allow them to re-parent)

Oh, you were mistaken then.

>>> well, if you aren't going through the postmaster, what process is
>>> recieving network messages? it can't be a group of processes, only one
>>> can be listening to a socket at one time.
>>
>> Huh?  Each backend has its own socket.
>
> we must be talking about different things. I'm talking about the socket
> that would be used for clients to talk to postgres, this is either a TCP
> socket or a unix socket. in either case only one process can listen on
> it.

Obviously only one process (the postmaster) can call listen() on a given
TCP address/port.  Once connected, the socket is passed to the
backend, and the postmaster is no longer involved in the communication
between backend and client.  Each backend has its own socket.  If the
postmaster dies, the established communication is still alive.
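
You can see that from the shell on a running box; for example (needs root
to see other users' processes, and the grep pattern is only illustrative,
since the program name may show as postgres or postmaster):

   # netstat -tnp | grep post     # one ESTABLISHED socket per connected backend
   # netstat -ltnp | grep post    # the single LISTEN socket, owned by the postmaster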


>>> and if the postmaster isn't needed for the child processes to write to
>>> the datastore, how are multiple child processes prevented from writing to
>>> the datastore normally? and why doesn't that mechanism continue to work?
>>
>> They use locks.  Those locks are implemented using shared memory.  If a
>> new postmaster starts, it gets a new shared memory, and a new set of
>> locks, that do not conflict with the ones already held by the first gang
>> of backends.  This is what causes the corruption.
>
> so the new postmaster needs to detect that there is a shared memory
> segment out that used by backends for this database.

> this doesn't sound that hard,

You're welcome to suggest actual improvements to our interlocking
system, after you've read the current code and understood its rationale.


>> Any other signal gives it the chance to signal the children before
>> dying.
>
> are you sure that it's not going to die from a memory allocation error?
> or any other similar type of error without _always_ killing the children?

I am sure.  There are no memory allocations in that code.  It is
carefully written with that one purpose.

There may be bugs, but that's another matter.  This code was written
eons ago and has proven very healthy.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

James Mansion wrote:
> I can't see how an OS can lie to processes about memory being allocated
> to them and not be ridiculed as a toy, but there you go.  I don't think
> Linux is the only perpetrator - doesn't AIX do this too?

This is a leftover from the days of massive physical modeling (chemistry, physics, astronomy, ...) programs written in
FORTRAN.  Since FORTRAN didn't have pointers, scientists would allocate massive three-dimensional arrays, and their code
might only access a tiny fraction of the memory.  The operating-system vendors, particularly SGI, added features to the
various flavors of UNIX, including the ability to overcommit memory, to support these FORTRAN programs, which at the
time were some of the most important applications driving computer science and computer architectures of
workstation-class computers.

When these workstation-class computers evolved enough to rival mainframes, companies started shifting apps like Oracle
onto the cheaper workstation-class computers.  Unfortunately, the legacy of the days of these FORTRAN programs is still
with us, and every few years we have to go through this discussion again.

Disable overcommitted memory.  There is NO REASON to use it on any modern server-class computer, and MANY REASONS WHY
IT IS A BAD IDEA.

Craig

On Fri, 29 Aug 2008, Craig James wrote:
> Disable overcommitted memory.  There is NO REASON to use it on any modern
> server-class computer, and MANY REASONS WHY IT IS A BAD IDEA.

As far as I can see, the main reason nowadays for overcommit is when a
large process forks and then execs. Are there any other modern programs
that allocate lots of RAM and never use it?

Matthew

--
Nog:     Look! They've made me into an ensign!
O'Brien: I didn't know things were going so badly.
Nog:     Frightening, isn't it?

On Fri, 29 Aug 2008, Bill Moran wrote:

> First off, the solution of making the postmaster immune to the OOM
> killer seems better than disabling overcommit to me anyway

I really side with Craig James here that the right thing to do here is to
turn off overcommit altogether.  While it's possible that servers being
used for things other than PostgreSQL might need it, I feel that's a rare
case that's really hostile to the database environment and it shouldn't be
encouraged.

> I don't understand why we should avoid making the PG documentation as
> comprehensive as possible

The main problem here is that this whole area is a moving target.  To
really document this properly, you end up going through this messy
exercise where you have to create a table showing what kernel versions
support the option you're trying to document.  And even that varies based
on what distribution you're dealing with.  Just because you know when
something showed up in the mainline kernel, without some research you
can't really know for sure when RedHat or SuSE slipped that into their
kernel--sometimes they lead mainline, sometimes they lag.

The PG documentation shies away from mentioning thing that are so
volatile, because the expectation is that people will still be using that
as a reference many years from now.  For all we know, 2.6.27 will come out
and make any documentation about this obscure oomadj feature obsolete.  I
think the only thing that might make sense here is to write something on
the Wiki that goes over in what situations you might run with overcommit
enabled, how that can interact with database operation, and then tries to
address the version issue.  Then that can be updated as new versions are
released so it stays current.  If that's available with a stable URL, it
might make sense to point to it in the documentation near where turning
overcommit off is mentioned.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

In 2003 I met this guy who was doing Computational Fluid Dynamics and he had to use software written by physics
engineers in FORTRAN.  1 Gig of RAM wasn't yet the standard for a desktop PC at that time, but the software required at
least 1 Gig just to get started.  So I thought, what is the problem?  After all, you are supposed to be able to allocate
up to 2G on a 32-bit system, even if you don't quite have the memory, as long as you have a sufficiently big swap space.
Still, the software didn't load on Windows.  So it seems that Windows does not overcommit.

regards


--- On Fri, 29/8/08, Matthew Wakeling <matthew@flymine.org> wrote:

> From: Matthew Wakeling <matthew@flymine.org>
> Subject: Re: [PERFORM] select on 22 GB table causes "An I/O error occured while sending to the backend." exception
> To: pgsql-performance@postgresql.org
> Date: Friday, 29 August, 2008, 4:56 PM
> On Fri, 29 Aug 2008, Craig James wrote:
> > Disable overcommitted memory.  There is NO REASON to
> use it on any modern
> > server-class computer, and MANY REASONS WHY IT IS A
> BAD IDEA.
>
> As far as I can see, the main reason nowadays for
> overcommit is when a
> large process forks and then execs. Are there any other
> modern programs
> that allocate lots of RAM and never use it?
>
> Matthew
>
> --
> Nog:     Look! They've made me into an ensign!
> O'Brien: I didn't know things were going so badly.
> Nog:     Frightening, isn't it?
>


Gregory Williamson wrote:
>Bill Moran wrote:

>> In response to Greg Smith <gsmith@gregsmith.com>:
>>
>> <snipped...>
>>
>> I don't know, Greg.  First off, the solution of making the postmaster
>> immune to the OOM killer seems better than disabling overcommit to me
>> anyway; and secondly, I don't understand why we should avoid making
>> the PG documentation as comprehensive as possible, which seems to be
>> what you are saying: "we shouldn't make the PG documentation too
>> comprehensive, because then it will get very big"

> I think it would be a hopeless morass for PostgreSQL to try to document
> each evolution of each OS it runs under; the general caveat seems
> fine, although perhaps adding something to the effect of "search the
> archives for possible specifics" might be in order. But tracking postgres's
> own shifts and requirements seems daunting enough w/out adding in
> endless flavours of different OSs.
>
> My $0.02 worth ...

In some aspects I agree; however, in this specific case I think the docs should include the details about options to protect the postmaster from the OOM killer.

So far I've seen three basic solutions to this problem:

(1) Disable overcommit

(2) Be generous with swap space

(3) Protect the postmaster from the OOM killer

As we've seen so far, there is not one solution that makes everybody happy. Each option has its merits and downsides. Personally, I think in this case the docs should present all 3 options, perhaps in a Linux-specific note or section, so each DBA can decide for themselves the appropriate method.

Going one step further, I'm thinking that making the third option the default on Linux systems might not be a bad thing either. And, if that is done, the docs definitely need to contain information about it.

Another couple of cents in the pot...

Greg

* Craig James:

> So it never makes sense to enable overcommitted memory when
> Postgres, or any server, is running.

There are some run-time environments which allocate huge chunks of
memory on startup, without marking them as not yet in use.  SBCL is in
this category, and also the Hotspot VM (at least to some extent).
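
(Hotspot, for instance, reserves the full maximum heap's worth of address
space when it starts, so a JVM launched with something like

   $ java -Xms2g -Xmx8g ...

has an 8GB reservation from the outset even if it only ever touches a
fraction of it; how much of that the kernel counts against the commit
limit depends on the JVM and kernel version.  The flags above are just an
illustration.)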

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99

Florian Weimer wrote:
> * Craig James:
>> So it never makes sense to enable overcommitted memory when
>> Postgres, or any server, is running.
>
> There are some run-time environments which allocate huge chunks of
> memory on startup, without marking them as not yet in use.  SBCL is in
> this category, and also the Hotspot VM (at least some extent).

I stand by my assertion: It never makes sense.  Do these applications allocate a terabyte of memory?  I doubt it.  Buy
a terabyte swap disk and disable overcommitted memory.

Craig

* Craig James:

>> There are some run-time environments which allocate huge chunks of
>> memory on startup, without marking them as not yet in use.  SBCL is in
>> this category, and also the Hotspot VM (at least some extent).
>
> I stand by my assertion: It never makes sense.  Do these
> applications allocate a terrabyte of memory?  I doubt it.

SBCL sizes its allocated memory region based on the total amount of
RAM and swap space.  In this case, buying larger disks does not
help. 8-P

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99

Florian Weimer wrote:
> * Craig James:
>
>>> There are some run-time environments which allocate huge chunks of
>>> memory on startup, without marking them as not yet in use.  SBCL is in
>>> this category, and also the Hotspot VM (at least some extent).
>> I stand by my assertion: It never makes sense.  Do these
>> applications allocate a terrabyte of memory?  I doubt it.
>
> SBCL sizes its allocated memory region based on the total amount of
> RAM and swap space.  In this case, buying larger disks does not
> help. 8-P

SBCL, as in Steel Bank Common Lisp? Why would you run that on a server machine alongside Postgres? If I had to use SBCL
and Postgres, I'd put SBCL on a separate machine all its own, so that it couldn't corrupt Postgres or other servers that
had to be reliable.

Are you saying that if I bought a terabyte of swap disk, SBCL would allocate a terabyte of space?

Craig