Thread: System overload / context switching / oom, 8.3

System overload / context switching / oom, 8.3

From
Rob
Date:
pg 8.3.9, Debian Etch, 8gb ram, quadcore xeon, megaraid (more details at end)
~240 active databases, 800+ db connections via tcp.

Everything goes along fairly well, load average from 0.5 to 4.0.  Disk
IO is writing about 12-20 MB every 4 or 5 seconds.  Cache memory about
4gb.  Then under load, we see swapping and then context switch storm and
then oom-killer.

I'm hoping to find some ideas for spreading out the load of bgwriter
and/or autovacuum somehow or possibly reconfiguring memory to help
alleviate the problem, or at least to avoid crashing.

(Hardware/software/configuration specs are below the following dstat output).

I've been able to recreate the context switch storm (without the crash)
by running 4 simultaneous 'vacuum analyze' tasks during a pg_dump.
During these times, htop shows all 8 cpu going red bar 100% for a second
or two or three, and this is when I see the context switch storm.
The following stat data however is from a production workload crash.

During the dstat output below, postgresql was protected by oom_adj -17.
vm_overcommit_memory set to 2, but at this time vm_overcommit_ratio was
still at 50 (has since been changed to 90, should this be 100?).  The
memory usage was fairly constant 4056M 91M 3906M, until the end and after
heavier swapping it went to 4681M  984k 3305M (used/buf/cache).

dstat output under light to normal load:
---procs--- ---paging-- -dsk/total- ---system-- ----total-cpu-usage----
run blk new|__in_ _out_|_read _writ|_int_ _csw_|usr sys idl wai hiq siq
  0   2   5|   0     0 | 608k  884k| 756   801 | 11   2  83   4   0   0
  1   0   4|   0     0 | 360k 1636k|1062  1147 | 13   1  83   2   0   0
  2   2   5|   0     0 | 664k 1404k| 880   998 | 13   2  82   4   0   0
  0   4   4|   0     0 |2700k 6724k|1004   909 | 10   1  72  16   0   0
  0   2   4|   0     0 |  13M   14M|1490  1496 | 13   2  72  12   0   0
  1   1   4|   0     0 |  21M 1076k|1472  1413 | 12   2  74  11   0   0
  0   3   5|   0     0 |  15M 1712k|1211  1192 | 10   1  76  12   0   0
  1   0   4|   0     0 |7384k 1124k|1277  1403 | 15   2  75   9   0   0
  0   7   4|   0     0 |8864k 9528k|1431  1270 | 11   2  63  24   0   0
  1   3   4|   0     0 |2520k   15M|2225  3410 | 13   2  66  19   0   0
  2   1   5|   0     0 |4388k 1720k|1823  2246 | 14   2  70  13   0   0
  2   0   4|   0     0 |2804k 1276k|1284  1378 | 12   2  80   6   0   0
  0   0   4|   0     0 | 224k  884k| 825   900 | 12   2  86   1   0   0

under heavy load, just before crash, swap use has been increasing for
several seconds or minutes:
---procs--- ---paging-- -dsk/total- ---system-- ----total-cpu-usage----
run blk new|__in_ _out_|_read _writ|_int_ _csw_|usr sys idl wai hiq siq
  2  22   9| 124k   28k|  12M 1360k|1831  2536 |  7   4  46  44   0   0
  4   7   8| 156k   80k|  14M  348k|1742  2625 |  5   3  53  38   0   0
  1  14   7|  60k  232k|9028k   24M|1278  1642 |  4   3  50  42   0   0
  0  24   7| 564k    0 |  15M 5832k|1640  2199 |  7   2  41  50   0   0
  1  26   7| 172k    0 |  13M 1052k|1433  2121 |  5   3  54  37   0   0
  0  15   6|  36k    0 |6912k   35M|1295  3486 |  2   3  58  37   0   0
  3  30   2|   0     0 |9724k   13M|1373  2378 |  4   3  48  45   0   0
  5  20   4|4096B    0 |  10M   26M|2945   87k |  0   1  44  55   0   0
  1  29   8|   0     0 |  19M 8192B| 840   19k |  0   0  12  87   0   0
  4  33   3|   0     0 |4096B    0 |  14    39 | 17  17   0  67   0   0
  3  31   0|  64k    0 | 116k    0 | 580  8418 |  0   0   0 100   0   0
  0  36   0|   0     0 |8192B    0 | 533   12k |  0   0   9  91   0   0
  2  32   1|   0     0 |   0     0 | 519   12k |  0   0  11  89   0   0
  2  34   1|   0     0 |  16k    0 |  28    94 |  9   0   0  91   0   0
  1  32   0|   0     0 |  20k    0 | 467  2295 |  1   0  13  87   0   0
  2  32   0|   0     0 |   0     0 | 811   21k |  0   0  12  87   0   0
  4  35   3|   0     0 |  44k    0 | 582   11k |  0   0   0 100   0   0
  3  37   0|   0     0 |   0     0 |  16    67 |  0   9   0  91   0   0
  2  35   0|   0     0 |   0     0 | 519  8205 |  0   2  21  77   0   0
  0  37   0|   0     0 |   0     0 |  11    60 |  0   4  12  85   0   0
  1  35   1|   0     0 |  20k    0 | 334  2499 |  0   0  23  77   0   0
  0  36   1|   0     0 |  80k    0 | 305  8144 |  0   1  23  76   0   0
  0  35   3|   0     0 | 952k    0 | 541  2537 |  0   0  16  84   0   0
  2  35   2|   0     0 |  40k    0 | 285  8162 |  0   0  24  75   0   0
  2  35   0| 100k    0 | 108k    0 | 550  9595 |  0   0  37  63   0   0
  0  40   3|   0     0 |  16k    0 |1092   26k |  0   0  26  74   0   0
  4  37   3|   0     0 |  96k    0 | 790   12k |  0   0  34  66   0   0
  2  39   2|   0     0 |  24k    0 |  77   116 |  8   8   0  83   0   0
  2  37   1|   0     0 |   0     0 | 354  2457 |  0   0  29  71   0   0
  2  37   0|4096B    0 |  28k    0 |1909   57k |  0   0  27  73   0   0
  0  39   1|   0     0 |  32k    0 |1060   25k |  0   0  12  88   0   0
---procs--- ---paging-- -dsk/total- ---system-- ----total-cpu-usage----
run blk new|__in_ _out_|_read _writ|_int_ _csw_|usr sys idl wai hiq siq

SPECS:

PostgreSQL 8.3.9 on i486-pc-linux-gnu, compiled by GCC cc (GCC) 4.1.2
20061115 (prerelease) (Debian 4.1.1-21)
Installed from the debian etch-backports package.

Linux 2.6.18-6-686-bigmem #1 SMP Thu Nov 5 17:30:05 UTC 2009 i686
GNU/Linux (Debian Etch)

8 MB RAM
4 Quad Core Intel(R) Xeon(R) CPU           E5440  @ 2.83GHz stepping 06
L1 I cache: 32K, L1 D cache: 32K,  L2 cache: 6144K

LSI Logic SAS based MegaRAID driver (batter backed/write cache enabled)
Dell PERC 6/i
# 8 SEAGATE   Model: ST973451SS Rev: SM04  (72 GB) ANSI SCSI revision: 05

RAID Configuration:
sda RAID1  2 disks (with pg_xlog wal files on it's own partition)
sdb RAID10 6 disks (pg base dir only)

POSTGRES:

261 databases
238 active databases (w/connection processes)
863 connections to those 238 databases

postgresql.conf:
max_connections = 1100
shared_buffers = 800MB
max_prepared_transactions = 0
work_mem = 32MB
maintenance_work_mem = 64MB
max_fsm_pages = 3300000
max_fsm_relations = 10000
vacuum_cost_delay = 50ms
bgwriter_delay = 150ms
bgwriter_lru_maxpages = 250
bgwriter_lru_multiplier = 2.5
wal_buffers = 8MB
checkpoint_segments = 32
checkpoint_timeout = 5min
checkpoint_completion_target = 0.9
effective_cache_size = 5000MB
default_statistics_target = 100
log_min_duration_statement = 1000
log_checkpoints = on
log_connections = on
log_disconnections = on
log_temp_files = 0
track_counts = on
autovacuum = on
log_autovacuum_min_duration = 0

Thanks for any ideas!
Rob



Re: System overload / context switching / oom, 8.3

From
Scott Marlowe
Date:
On Tue, Feb 2, 2010 at 12:11 PM, Rob <rclemley@yahoo.com> wrote:
> pg 8.3.9, Debian Etch, 8gb ram, quadcore xeon, megaraid (more details at end)
> ~240 active databases, 800+ db connections via tcp.
>
> Everything goes along fairly well, load average from 0.5 to 4.0.  Disk
> IO is writing about 12-20 MB every 4 or 5 seconds.  Cache memory about
> 4gb.  Then under load, we see swapping and then context switch storm and
> then oom-killer.

SNIP

> postgresql.conf:
> max_connections = 1100
> work_mem = 32MB

32MB * 1000 = 32,000MB...  And that's if you max out connections and
they each only do 1 sort.  If you're running many queries that run > 1
sorts it'll happen a lot sooner.

Either drop max connections or work_mem is what I'd do to start with.
If you have one or two reporting apps that need it higher, then set it
higher for just those connections / users.

Re: System overload / context switching / oom, 8.3

From
"Kevin Grittner"
Date:
Rob <rclemley@yahoo.com> wrote:

> 8gb ram
> ~240 active databases
> 800+ db connections via tcp.

8 GB RAM divided by 800 DB connections is 10 MB per connection.  You
seriously need to find some way to use connection pooling.  I'm not
sure the best way to do that with 240 active databases.

-Kevin

Re: System overload / context switching / oom, 8.3

From
Andy Colson
Date:
On 2/2/2010 1:11 PM, Rob wrote:
>
> Linux 2.6.18-6-686-bigmem #1 SMP Thu Nov 5 17:30:05 UTC 2009 i686
> GNU/Linux (Debian Etch)
>
> 8 MB RAM
> 4 Quad Core Intel(R) Xeon(R) CPU           E5440  @ 2.83GHz stepping 06
> L1 I cache: 32K, L1 D cache: 32K,  L2 cache: 6144K
>

Well _there's_ your problem!  Ya need more RAM!  hee hee, I know, I
know, probably 8 gig, but just had to be done.

-Andy

Re: System overload / context switching / oom, 8.3

From
Andy Colson
Date:
On 2/2/2010 1:11 PM, Rob wrote:
>
> postgresql.conf:
> max_connections = 1100
> shared_buffers = 800MB
> max_prepared_transactions = 0
> work_mem = 32MB
> maintenance_work_mem = 64MB
> max_fsm_pages = 3300000
> max_fsm_relations = 10000
> vacuum_cost_delay = 50ms
> bgwriter_delay = 150ms
> bgwriter_lru_maxpages = 250
> bgwriter_lru_multiplier = 2.5
> wal_buffers = 8MB
> checkpoint_segments = 32
> checkpoint_timeout = 5min
> checkpoint_completion_target = 0.9
> effective_cache_size = 5000MB
> default_statistics_target = 100
> log_min_duration_statement = 1000
> log_checkpoints = on
> log_connections = on
> log_disconnections = on
> log_temp_files = 0
> track_counts = on
> autovacuum = on
> log_autovacuum_min_duration = 0


Ok, seriously this time.

 > work_mem = 32MB
 > maintenance_work_mem = 64MB


if you have lots and lots of connections, you might need to cut these down?

 > effective_cache_size = 5000MB

I see your running a 32bit, but with bigmem support, but still, one
process is limited to 4gig.  You'd make better use of all that ram if
you switched to 64bit.  And this cache, I think, would be limited to 4gig.

The oom-killer is kicking in, at some point, so somebody is using too
much ram.  There should be messages or logs or something, right?  (I've
never enabled the oom stuff so dont know much about it).  But the log
messages might be helpful.

Also, do you know what the oom max memory usage is set to?  You said:
"oom_adj -17.  vm_overcommit_memory set to 2, but at this time
vm_overcommit_ratio was still at 50 (has since been changed to 90,
should this be 100?)"

but I have no idea what that means.

-Andy


Re: System overload / context switching / oom, 8.3

From
Rob
Date:
Scott Marlowe wrote:
On Tue, Feb 2, 2010 at 12:11 PM, Rob <rclemley@yahoo.com> wrote: 
postgresql.conf:
max_connections = 1100
work_mem = 32MB   
32MB * 1000 = 32,000MB...  And that's if you max out connections and
they each only do 1 sort.  If you're running many queries that run > 1
sorts it'll happen a lot sooner.

Either drop max connections or work_mem is what I'd do to start with.
If you have one or two reporting apps that need it higher, then set it
higher for just those connections / users

Thanks much.  So does dropping work_mem to the default of 1MB sound good?

By moving databases around we're getting max_connections below 600 or 700.

Re: System overload / context switching / oom, 8.3

From
Rob
Date:
Kevin Grittner wrote:
Rob <rclemley@yahoo.com> wrote: 
8gb ram
~240 active databases
800+ db connections via tcp.   
 
8 GB RAM divided by 800 DB connections is 10 MB per connection.  You
seriously need to find some way to use connection pooling.  I'm not
sure the best way to do that with 240 active databases. 

By wrangling the applications, We've got the number of connections down to 530 and number of active databases down to 186.

The application's poor connection management exacerbates the problem.

Thanks for the idea,
Rob

Re: System overload / context switching / oom, 8.3

From
Greg Smith
Date:
Andy Colson wrote:
> The oom-killer is kicking in, at some point, so somebody is using too
> much ram.  There should be messages or logs or something, right?
> (I've never enabled the oom stuff so dont know much about it).  But
> the log messages might be helpful.

They probably won't be.  The information logged about what the OOM
killer decided to kill is rarely sufficient to tell anything interesting
about the true cause in a PostgreSQL context--only really helpful if
you've got some memory hog process it decided to kill.  In this case,
seems to be a simple situation:  way too many connections for the
work_mem setting used for a 8GB server to support.  I'd take a look at
the system using "top -c" as well, in good times and bad if possible,
just to see if any weird memory use is showing up somewhere, perhaps
even outside the database.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: System overload / context switching / oom, 8.3

From
Matthew Wakeling
Date:
On Tue, 2 Feb 2010, Rob wrote:
> pg 8.3.9, Debian Etch, 8gb ram, quadcore xeon, megaraid (more details at end)
> ~240 active databases, 800+ db connections via tcp.

> Linux 2.6.18-6-686-bigmem #1 SMP Thu Nov 5 17:30:05 UTC 2009 i686
> GNU/Linux (Debian Etch)
>
> 8 MB RAM
> 4 Quad Core Intel(R) Xeon(R) CPU           E5440  @ 2.83GHz stepping 06

My advice?

1. Switch to 64-bit operating system and Postgres. Debian provides that,
and it works a charm. You have a 64-bit system, so why not use it?

2. Buy more RAM. Think about it - you have 800 individual processes
running on your box, and they will all want their own process space. To be
honest, I'm impressed that the current machine works at all. You can get
an idea of how much RAM you might need by multiplying the number of
connections by (work_mem + about 3MB), and add on shared_buffers. So even
when the system is idle you're currently burning 3200MB just sustaining
800 processes - more if they are actually doing something.

3. Try to reduce the number of connections to the database server.

4. Think about your work_mem. Finding the correct value for you is going
to be a matter of testing. Smaller values will result in large queries
running slowly, but have the danger of driving the system to swap and OOM.

Matthew

--
 A good programmer is one who looks both ways before crossing a one-way street.
 Considering the quality and quantity of one-way streets in Cambridge, it
 should be no surprise that there are so many good programmers there.

Re: System overload / context switching / oom, 8.3

From
Robert Haas
Date:
On Tue, Feb 2, 2010 at 3:47 PM, Andy Colson <andy@squeakycode.net> wrote:
>> effective_cache_size = 5000MB
>
> I see your running a 32bit, but with bigmem support, but still, one process
> is limited to 4gig.  You'd make better use of all that ram if you switched
> to 64bit.  And this cache, I think, would be limited to 4gig.

Just to be clear, effective_cache_size does not allocate any memory of
any kind, in any way, ever...

...Robert

Re: System overload / context switching / oom, 8.3

From
Rob Lemley
Date:
Andy Colson wrote:
> > work_mem = 32MB
> > maintenance_work_mem = 64MB
>
>
> if you have lots and lots of connections, you might need to cut these
> down?

definitely, work_mem is the main focus.

If I understand correctly, th 64MB maintenance_work_mem is per vacuum
task, and on this system there are 3 autovacuums.  I was wondering if
with this many databases, possibly decreasing the maintenance_work_mem
significantly and starting up more autovacuums.

Yes, also moving databases to other servers in order to decrease the
number of connections.
>
> > effective_cache_size = 5000MB
>
> I see your running a 32bit, but with bigmem support, but still, one
> process is limited to 4gig.  You'd make better use of all that ram if
> you switched to 64bit.  And this cache, I think, would be limited to
> 4gig.
All of the cache is being used because the operating system kernel is
built with the memory extensions to access outside the 32bit range.
This is the cache size reported by free(1). However, there may be
advantages to switch to 64bit.

>
> The oom-killer is kicking in, at some point, so somebody is using too
> much ram.  There should be messages or logs or something, right?
> (I've never enabled the oom stuff so dont know much about it).  But
> the log messages might be helpful.
>
> Also, do you know what the oom max memory usage is set to?  You said:
> "oom_adj -17.  vm_overcommit_memory set to 2, but at this time
> vm_overcommit_ratio was still at 50 (has since been changed to 90,
> should this be 100?)"

Oh man.  I encourage everyone to find out what /proc/<pid>/oom_adj
means.  You have to set this to keep the Linux "oom-killer" from doing a
kill -9 on postgres postmaster.  On Debian:

 echo -17 >> /proc/$(cat /var/run/postgresql/8.3-main.pid)/oom_adj

This is my experience with oom-killer.  After putting -17 into
/proc/pid/oom_adj, oom-killer seemed to kill one of the database
connection processes.  Then the postmaster attempted to shut down all
processes because of possible shared memory corruption.  The database
then went into recovery mode.  After stopping the database some of the
processes were stuck and could not be killed.  The operating system was
rebooted and the database returned with no data loss.

My earlier experience with oom-killer: If you don't have this setting in
oom_adj, then it seems likely (certain?) that oom-killer kills the
postmaster because of the algorithm oom-killer uses (called badness())
which adds children process scores to their parent's scores.  I don't
know if sshd was killed but I don't think anyone could log in to the
OS.  After rebooting there was a segmentation violation when trying to
start the postmaster.  I don't think that running pg_resetxlog with
defaults is a good idea.  My colleague who has been investigating the
crash believes that we could have probably eliminated at least some of
the data loss with more judicious use of pg_resetxlog.

There was a discussion on the postgres lists about somehow having the
postgres distribution include the functionality to set oom_adj on
startup.  To my knowledge, that's not in 8.3 so I wrote a script and
init.d script to do this on Debian systems.

As far as vm.over_commit memory goes, there are three settings and most
recommend setting it to 2 for postgres.  However, this does not turn off
oom-killer!  You need to put -17 in /proc/<pid>/oom_adj whether you do
anything about vm.over_commit memory or not  We had vm_overcommit_memory
set to 2 and oom-killer became active and killed the postmaster.

Kind of off-topic, but a Linux kernel parameter that's often not set on
database servers is elevator=deadline which sets up the io scheduling
algorithm.  The algorithm can be viewed/set at runtime for example the
disk /dev/sdc in /sys/block/sdc/queue/scheduler.

Rob




Re: System overload / context switching / oom, 8.3

From
Greg Smith
Date:
Rob Lemley wrote:
> here was a discussion on the postgres lists about somehow having the
> postgres distribution include the functionality to set oom_adj on
> startup.  To my knowledge, that's not in 8.3 so I wrote a script and
> init.d script to do this on Debian systems.

That's not in anything earlier than the upcoming 9.0 because the support
code involved just showed up:
http://archives.postgresql.org/pgsql-committers/2010-01/msg00169.php

It was always possible to do this in an init script as you describe.
The specific new feature added is the ability to remove client child
processes from having that protection, so that they can still be killed
normally.  Basically, limiting the protection just at the process that
you really need it on.  The updated documentation for the new version
has more details about this whole topic, useful to people running older
versions too:
http://developer.postgresql.org/pgdocs/postgres/kernel-resources.html

> Kind of off-topic, but a Linux kernel parameter that's often not set
> on database servers is elevator=deadline which sets up the io
> scheduling algorithm.  The algorithm can be viewed/set at runtime for
> example the disk /dev/sdc in /sys/block/sdc/queue/scheduler.

I've never seen a real-world PostgreSQL workload where deadline worked
better than CFQ, and I've seen a couple where it was significantly
worse.  Playing with that parameter needs a heavy disclaimer that you
should benchmark *your app* before and after changing it to make sure it
was actually useful.  Actually, three times:  return to CFQ again
afterwards, too, just to confirm it's not a "faster on the second run"
effect.

The important things to get right on Linux are read-ahead and reducing
the size of the write cache size--the latter being the more direct and
effective way to improve the problem that the scheduler change happens
to impact too.  Those have dramatically more importance than sensible
changes to the scheduler used (with using the anticipatory one on a
server system or the no-op one on a desktop would be non-sensible changes).

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us