Thread: System overload / context switching / oom, 8.3
pg 8.3.9, Debian Etch, 8gb ram, quadcore xeon, megaraid (more details at end).
~240 active databases, 800+ db connections via tcp.

Everything goes along fairly well, load average from 0.5 to 4.0. Disk IO is
writing about 12-20 MB every 4 or 5 seconds. Cache memory is about 4gb. Then
under load, we see swapping, then a context switch storm, and then the
oom-killer.

I'm hoping to find some ideas for spreading out the load of bgwriter and/or
autovacuum, or possibly reconfiguring memory, to help alleviate the problem,
or at least to avoid crashing. (Hardware/software/configuration specs are
below the following dstat output.)

I've been able to recreate the context switch storm (without the crash) by
running 4 simultaneous 'vacuum analyze' tasks during a pg_dump. During these
times, htop shows all 8 CPUs going red-bar 100% for two or three seconds, and
this is when I see the context switch storm. The dstat data below, however,
is from a production workload crash.

During the dstat output below, postgresql was protected by oom_adj -17.
vm.overcommit_memory was set to 2, but at this time vm.overcommit_ratio was
still at 50 (it has since been changed to 90; should this be 100?). The
memory usage was fairly constant at 4056M 91M 3906M (used/buf/cache) until
the end; after heavier swapping it went to 4681M 984k 3305M.

dstat output under light to normal load:

---procs--- ---paging-- -dsk/total- ---system-- ----total-cpu-usage----
run blk new|__in_ _out_|_read _writ|_int_ _csw_|usr sys idl wai hiq siq
  0   2   5|   0     0 | 608k  884k| 756   801 | 11   2  83   4   0   0
  1   0   4|   0     0 | 360k 1636k|1062  1147 | 13   1  83   2   0   0
  2   2   5|   0     0 | 664k 1404k| 880   998 | 13   2  82   4   0   0
  0   4   4|   0     0 |2700k 6724k|1004   909 | 10   1  72  16   0   0
  0   2   4|   0     0 |  13M   14M|1490  1496 | 13   2  72  12   0   0
  1   1   4|   0     0 |  21M 1076k|1472  1413 | 12   2  74  11   0   0
  0   3   5|   0     0 |  15M 1712k|1211  1192 | 10   1  76  12   0   0
  1   0   4|   0     0 |7384k 1124k|1277  1403 | 15   2  75   9   0   0
  0   7   4|   0     0 |8864k 9528k|1431  1270 | 11   2  63  24   0   0
  1   3   4|   0     0 |2520k   15M|2225  3410 | 13   2  66  19   0   0
  2   1   5|   0     0 |4388k 1720k|1823  2246 | 14   2  70  13   0   0
  2   0   4|   0     0 |2804k 1276k|1284  1378 | 12   2  80   6   0   0
  0   0   4|   0     0 | 224k  884k| 825   900 | 12   2  86   1   0   0

Under heavy load, just before the crash; swap use has been increasing for
several seconds or minutes:

---procs--- ---paging-- -dsk/total- ---system-- ----total-cpu-usage----
run blk new|__in_ _out_|_read _writ|_int_ _csw_|usr sys idl wai hiq siq
  2  22   9| 124k   28k|  12M 1360k|1831  2536 |  7   4  46  44   0   0
  4   7   8| 156k   80k|  14M  348k|1742  2625 |  5   3  53  38   0   0
  1  14   7|  60k  232k|9028k   24M|1278  1642 |  4   3  50  42   0   0
  0  24   7| 564k    0 |  15M 5832k|1640  2199 |  7   2  41  50   0   0
  1  26   7| 172k    0 |  13M 1052k|1433  2121 |  5   3  54  37   0   0
  0  15   6|  36k    0 |6912k   35M|1295  3486 |  2   3  58  37   0   0
  3  30   2|   0     0 |9724k   13M|1373  2378 |  4   3  48  45   0   0
  5  20   4|4096B    0 |  10M   26M|2945   87k |  0   1  44  55   0   0
  1  29   8|   0     0 |  19M 8192B| 840   19k |  0   0  12  87   0   0
  4  33   3|   0     0 |4096B    0 |  14    39 | 17  17   0  67   0   0
  3  31   0|  64k    0 | 116k    0 | 580  8418 |  0   0   0 100   0   0
  0  36   0|   0     0 |8192B    0 | 533   12k |  0   0   9  91   0   0
  2  32   1|   0     0 |   0     0 | 519   12k |  0   0  11  89   0   0
  2  34   1|   0     0 |  16k    0 |  28    94 |  9   0   0  91   0   0
  1  32   0|   0     0 |  20k    0 | 467  2295 |  1   0  13  87   0   0
  2  32   0|   0     0 |   0     0 | 811   21k |  0   0  12  87   0   0
  4  35   3|   0     0 |  44k    0 | 582   11k |  0   0   0 100   0   0
  3  37   0|   0     0 |   0     0 |  16    67 |  0   9   0  91   0   0
  2  35   0|   0     0 |   0     0 | 519  8205 |  0   2  21  77   0   0
  0  37   0|   0     0 |   0     0 |  11    60 |  0   4  12  85   0   0
  1  35   1|   0     0 |  20k    0 | 334  2499 |  0   0  23  77   0   0
  0  36   1|   0     0 |  80k    0 | 305  8144 |  0   1  23  76   0   0
  0  35   3|   0     0 | 952k    0 | 541  2537 |  0   0  16  84   0   0
  2  35   2|   0     0 |  40k    0 | 285  8162 |  0   0  24  75   0   0
  2  35   0| 100k    0 | 108k    0 | 550  9595 |  0   0  37  63   0   0
  0  40   3|   0     0 |  16k    0 |1092   26k |  0   0  26  74   0   0
  4  37   3|   0     0 |  96k    0 | 790   12k |  0   0  34  66   0   0
  2  39   2|   0     0 |  24k    0 |  77   116 |  8   8   0  83   0   0
  2  37   1|   0     0 |   0     0 | 354  2457 |  0   0  29  71   0   0
  2  37   0|4096B    0 |  28k    0 |1909   57k |  0   0  27  73   0   0
  0  39   1|   0     0 |  32k    0 |1060   25k |  0   0  12  88   0   0
---procs--- ---paging-- -dsk/total- ---system-- ----total-cpu-usage----
run blk new|__in_ _out_|_read _writ|_int_ _csw_|usr sys idl wai hiq siq

SPECS:

PostgreSQL 8.3.9 on i486-pc-linux-gnu, compiled by GCC cc (GCC) 4.1.2
20061115 (prerelease) (Debian 4.1.1-21)
Installed from the debian etch-backports package.

Linux 2.6.18-6-686-bigmem #1 SMP Thu Nov 5 17:30:05 UTC 2009 i686 GNU/Linux
(Debian Etch)

8 MB RAM
4 Quad Core Intel(R) Xeon(R) CPU E5440 @ 2.83GHz stepping 06
L1 I cache: 32K, L1 D cache: 32K, L2 cache: 6144K

LSI Logic SAS based MegaRAID driver (battery backed/write cache enabled)
Dell PERC 6/i
8 SEAGATE Model: ST973451SS Rev: SM04 (72 GB) ANSI SCSI revision: 05

RAID Configuration:
sda RAID1  2 disks (with pg_xlog WAL files on its own partition)
sdb RAID10 6 disks (pg base dir only)

POSTGRES:
261 databases
238 active databases (w/connection processes)
863 connections to those 238 databases

postgresql.conf:
max_connections = 1100
shared_buffers = 800MB
max_prepared_transactions = 0
work_mem = 32MB
maintenance_work_mem = 64MB
max_fsm_pages = 3300000
max_fsm_relations = 10000
vacuum_cost_delay = 50ms
bgwriter_delay = 150ms
bgwriter_lru_maxpages = 250
bgwriter_lru_multiplier = 2.5
wal_buffers = 8MB
checkpoint_segments = 32
checkpoint_timeout = 5min
checkpoint_completion_target = 0.9
effective_cache_size = 5000MB
default_statistics_target = 100
log_min_duration_statement = 1000
log_checkpoints = on
log_connections = on
log_disconnections = on
log_temp_files = 0
track_counts = on
autovacuum = on
log_autovacuum_min_duration = 0

Thanks for any ideas!

Rob
On Tue, Feb 2, 2010 at 12:11 PM, Rob <rclemley@yahoo.com> wrote:
> pg 8.3.9, Debian Etch, 8gb ram, quadcore xeon, megaraid (more details at end).
> ~240 active databases, 800+ db connections via tcp.
>
> Everything goes along fairly well, load average from 0.5 to 4.0. Disk IO is
> writing about 12-20 MB every 4 or 5 seconds. Cache memory is about 4gb. Then
> under load, we see swapping, then a context switch storm, and then the
> oom-killer.

SNIP

> postgresql.conf:
> max_connections = 1100
> work_mem = 32MB

32MB * 1000 = 32,000MB... And that's if you max out connections and they
each only do 1 sort. If you're running many queries that run > 1 sorts it'll
happen a lot sooner.

Either drop max_connections or work_mem is what I'd do to start with. If you
have one or two reporting apps that need it higher, then set it higher for
just those connections / users.
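[A minimal sketch of the per-user override Scott describes. The role name and
the values below are only illustrative, not from the thread:]

  # in postgresql.conf, use a conservative default for the ~800 ordinary
  # connections, e.g.:  work_mem = 2MB
  # then raise it only for a hypothetical reporting role, via psql:
  psql -U postgres -c "ALTER ROLE reporting_role SET work_mem = '64MB';"
  # or raise it inside a single session, just before one heavy query:
  #   SET work_mem = '64MB';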
Rob <rclemley@yahoo.com> wrote:
> 8gb ram
> ~240 active databases
> 800+ db connections via tcp.

8 GB RAM divided by 800 DB connections is 10 MB per connection. You
seriously need to find some way to use connection pooling. I'm not sure the
best way to do that with 240 active databases.

-Kevin
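[One commonly used pooler is pgbouncer, not mentioned in the thread itself.
It pools per database/user pair, so with ~240 databases the per-database pool
size has to stay very small for the total to drop. A minimal pgbouncer.ini
sketch, every value illustrative:]

  [databases]
  ; one fallback entry covers all databases by name
  * = host=127.0.0.1 port=5432

  [pgbouncer]
  listen_port = 6432
  ; transaction pooling keeps server-side connections far below client count
  pool_mode = transaction
  ; clients pgbouncer will accept
  max_client_conn = 1000
  ; server connections per database/user pair
  default_pool_size = 2
  ; auth settings omitted; they depend on the local setup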
On 2/2/2010 1:11 PM, Rob wrote:
> Linux 2.6.18-6-686-bigmem #1 SMP Thu Nov 5 17:30:05 UTC 2009 i686
> GNU/Linux (Debian Etch)
>
> 8 MB RAM
> 4 Quad Core Intel(R) Xeon(R) CPU E5440 @ 2.83GHz stepping 06
> L1 I cache: 32K, L1 D cache: 32K, L2 cache: 6144K

Well _there's_ your problem! Ya need more RAM!

hee hee, I know, I know, probably 8 gig, but just had to be done.

-Andy
On 2/2/2010 1:11 PM, Rob wrote:
> postgresql.conf:
> max_connections = 1100
> shared_buffers = 800MB
> max_prepared_transactions = 0
> work_mem = 32MB
> maintenance_work_mem = 64MB
> max_fsm_pages = 3300000
> max_fsm_relations = 10000
> vacuum_cost_delay = 50ms
> bgwriter_delay = 150ms
> bgwriter_lru_maxpages = 250
> bgwriter_lru_multiplier = 2.5
> wal_buffers = 8MB
> checkpoint_segments = 32
> checkpoint_timeout = 5min
> checkpoint_completion_target = 0.9
> effective_cache_size = 5000MB
> default_statistics_target = 100
> log_min_duration_statement = 1000
> log_checkpoints = on
> log_connections = on
> log_disconnections = on
> log_temp_files = 0
> track_counts = on
> autovacuum = on
> log_autovacuum_min_duration = 0

Ok, seriously this time.

> work_mem = 32MB
> maintenance_work_mem = 64MB

If you have lots and lots of connections, you might need to cut these down?

> effective_cache_size = 5000MB

I see you're running 32-bit, but with bigmem support; still, one process is
limited to 4 gig. You'd make better use of all that RAM if you switched to
64-bit. And this cache, I think, would be limited to 4 gig.

The oom-killer is kicking in at some point, so somebody is using too much
RAM. There should be messages or logs or something, right? (I've never
enabled the oom stuff so I don't know much about it.) But the log messages
might be helpful.

Also, do you know what the oom max memory usage is set to? You said:
"oom_adj -17. vm.overcommit_memory set to 2, but at this time
vm.overcommit_ratio was still at 50 (has since been changed to 90, should
this be 100?)" but I have no idea what that means.

-Andy
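[For anyone wanting to inspect those knobs, a short sketch; the pid file path
is the Debian 8.3 location quoted later in the thread:]

  # current overcommit policy and ratio
  sysctl vm.overcommit_memory vm.overcommit_ratio
  # the postmaster's oom_adj score (-17 protects it from the oom-killer)
  cat /proc/$(cat /var/run/postgresql/8.3-main.pid)/oom_adj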
Scott Marlowe wrote:
> On Tue, Feb 2, 2010 at 12:11 PM, Rob <rclemley@yahoo.com> wrote:
>> postgresql.conf:
>> max_connections = 1100
>> work_mem = 32MB
>
> 32MB * 1000 = 32,000MB... And that's if you max out connections and they
> each only do 1 sort. If you're running many queries that run > 1 sorts
> it'll happen a lot sooner.
>
> Either drop max_connections or work_mem is what I'd do to start with. If
> you have one or two reporting apps that need it higher, then set it higher
> for just those connections / users.
Thanks much. So does dropping work_mem to the default of 1MB sound good?
By moving databases around we're getting max_connections below 600 or 700.
Kevin Grittner wrote:
> Rob <rclemley@yahoo.com> wrote:
>> 8gb ram
>> ~240 active databases
>> 800+ db connections via tcp.
>
> 8 GB RAM divided by 800 DB connections is 10 MB per connection. You
> seriously need to find some way to use connection pooling. I'm not sure
> the best way to do that with 240 active databases.
By wrangling the applications, we've got the number of connections down to 530 and the number of active databases down to 186.
The application's poor connection management exacerbates the problem.
Thanks for the idea,
Rob
Andy Colson wrote:
> The oom-killer is kicking in at some point, so somebody is using too much
> RAM. There should be messages or logs or something, right? (I've never
> enabled the oom stuff so I don't know much about it.) But the log messages
> might be helpful.

They probably won't be. The information logged about what the OOM killer
decided to kill is rarely sufficient to tell anything interesting about the
true cause in a PostgreSQL context--it's only really helpful if you've got
some memory hog process it decided to kill. In this case, it seems to be a
simple situation: way too many connections for the work_mem setting used,
for an 8GB server to support.

I'd take a look at the system using "top -c" as well, in good times and bad
if possible, just to see if any weird memory use is showing up somewhere,
perhaps even outside the database.

--
Greg Smith    2ndQuadrant US    Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
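[A rough sketch of one way to capture those good-time/bad-time snapshots for
later comparison; the log path and interval are arbitrary choices, not from
the thread:]

  # append a memory/process snapshot every 30 seconds
  while true; do
      { date; free -m; top -c -b -n 1 | head -40; } >> /var/log/mem-snapshots.log
      sleep 30
  done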
On Tue, 2 Feb 2010, Rob wrote:
> pg 8.3.9, Debian Etch, 8gb ram, quadcore xeon, megaraid (more details at end).
> ~240 active databases, 800+ db connections via tcp.
>
> Linux 2.6.18-6-686-bigmem #1 SMP Thu Nov 5 17:30:05 UTC 2009 i686
> GNU/Linux (Debian Etch)
>
> 8 MB RAM
> 4 Quad Core Intel(R) Xeon(R) CPU E5440 @ 2.83GHz stepping 06

My advice?

1. Switch to a 64-bit operating system and Postgres. Debian provides that,
   and it works a charm. You have a 64-bit system, so why not use it?

2. Buy more RAM. Think about it - you have 800 individual processes running
   on your box, and they will all want their own process space. To be
   honest, I'm impressed that the current machine works at all. You can get
   an idea of how much RAM you might need by multiplying the number of
   connections by (work_mem + about 3MB), and adding on shared_buffers. So
   even when the system is idle you're currently burning 3200MB just
   sustaining 800 processes - more if they are actually doing something
   (see the worked example after this message).

3. Try to reduce the number of connections to the database server.

4. Think about your work_mem. Finding the correct value for you is going to
   be a matter of testing. Smaller values will result in large queries
   running slowly, while larger values carry the danger of driving the
   system to swap and OOM.

Matthew

--
A good programmer is one who looks both ways before crossing a one-way
street. Considering the quality and quantity of one-way streets in
Cambridge, it should be no surprise that there are so many good programmers
there.
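[Plugging the figures quoted in this thread into that estimate gives a rough
upper bound, assuming every backend uses one full work_mem allocation at
once:]

  # 800 connections * (32MB work_mem + ~3MB backend overhead) + 800MB shared_buffers
  echo $(( 800 * (32 + 3) + 800 ))   # = 28800 MB of potential demand, vs. 8192 MB of RAM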
On Tue, Feb 2, 2010 at 3:47 PM, Andy Colson <andy@squeakycode.net> wrote:
>> effective_cache_size = 5000MB
>
> I see you're running 32-bit, but with bigmem support; still, one process
> is limited to 4 gig. You'd make better use of all that RAM if you switched
> to 64-bit. And this cache, I think, would be limited to 4 gig.

Just to be clear, effective_cache_size does not allocate any memory of any
kind, in any way, ever...

...Robert
Andy Colson wrote:
>> work_mem = 32MB
>> maintenance_work_mem = 64MB
>
> If you have lots and lots of connections, you might need to cut these
> down?

Definitely; work_mem is the main focus. If I understand correctly, the 64MB
maintenance_work_mem is per vacuum task, and on this system there are 3
autovacuum workers. With this many databases, I'm wondering about decreasing
maintenance_work_mem significantly and starting up more autovacuum workers.

Yes, we're also moving databases to other servers in order to decrease the
number of connections.

>> effective_cache_size = 5000MB
>
> I see you're running 32-bit, but with bigmem support; still, one process
> is limited to 4 gig. You'd make better use of all that RAM if you switched
> to 64-bit. And this cache, I think, would be limited to 4 gig.

All of the cache is being used, because the operating system kernel is built
with the memory extensions to access outside the 32-bit range. This is the
cache size reported by free(1). However, there may be advantages to
switching to 64-bit.

> The oom-killer is kicking in at some point, so somebody is using too much
> RAM. There should be messages or logs or something, right? (I've never
> enabled the oom stuff so I don't know much about it.) But the log messages
> might be helpful.
>
> Also, do you know what the oom max memory usage is set to? You said:
> "oom_adj -17. vm.overcommit_memory set to 2, but at this time
> vm.overcommit_ratio was still at 50 (has since been changed to 90, should
> this be 100?)"

Oh man. I encourage everyone to find out what /proc/<pid>/oom_adj means. You
have to set this to keep the Linux "oom-killer" from doing a kill -9 on the
postgres postmaster. On Debian:

echo -17 >> /proc/$(cat /var/run/postgresql/8.3-main.pid)/oom_adj

This is my experience with oom-killer. After putting -17 into
/proc/<pid>/oom_adj, oom-killer seemed to kill one of the database
connection processes. Then the postmaster attempted to shut down all
processes because of possible shared memory corruption. The database then
went into recovery mode. After stopping the database, some of the processes
were stuck and could not be killed. The operating system was rebooted and
the database returned with no data loss.

My earlier experience with oom-killer: if you don't have this setting in
oom_adj, then it seems likely (certain?) that oom-killer kills the
postmaster, because of the algorithm oom-killer uses (called badness()),
which adds children's process scores to their parent's score. I don't know
if sshd was killed, but I don't think anyone could log in to the OS. After
rebooting, there was a segmentation violation when trying to start the
postmaster. I don't think that running pg_resetxlog with defaults is a good
idea. My colleague who has been investigating the crash believes that we
could probably have eliminated at least some of the data loss with more
judicious use of pg_resetxlog.

There was a discussion on the postgres lists about somehow having the
postgres distribution include the functionality to set oom_adj on startup.
To my knowledge, that's not in 8.3, so I wrote a script and init.d script to
do this on Debian systems.

As far as vm.overcommit_memory goes, there are three settings, and most
recommend setting it to 2 for postgres. However, this does not turn off
oom-killer! You need to put -17 in /proc/<pid>/oom_adj whether you do
anything about vm.overcommit_memory or not. We had vm.overcommit_memory set
to 2 and oom-killer became active and killed the postmaster.
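[A minimal sketch of the kind of init hook described above, not the actual
script from this thread; it just re-applies the protection after each
postmaster start, using the same Debian pid file:]

  #!/bin/sh
  # hypothetical helper, e.g. run at the end of "/etc/init.d/postgresql-8.3 start"
  PIDFILE=/var/run/postgresql/8.3-main.pid
  if [ -f "$PIDFILE" ]; then
      echo -17 > /proc/$(cat "$PIDFILE")/oom_adj
  fi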
Kind of off-topic, but a Linux kernel parameter that's often not set on
database servers is elevator=deadline, which selects the I/O scheduling
algorithm. The scheduler can also be viewed/set at runtime, for example for
the disk /dev/sdc in /sys/block/sdc/queue/scheduler.

Rob
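[A sketch of those runtime commands, assuming /dev/sdc as in the example
above; adjust for your own devices:]

  # show the available schedulers; the active one is in brackets
  cat /sys/block/sdc/queue/scheduler
  # e.g.: noop anticipatory deadline [cfq]

  # switch this one disk to deadline at runtime
  echo deadline > /sys/block/sdc/queue/scheduler

  # or set it for every disk at boot by adding elevator=deadline
  # to the kernel line (e.g. in /boot/grub/menu.lst on Etch)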
Rob Lemley wrote:
> There was a discussion on the postgres lists about somehow having the
> postgres distribution include the functionality to set oom_adj on
> startup. To my knowledge, that's not in 8.3, so I wrote a script and
> init.d script to do this on Debian systems.

That's not in anything earlier than the upcoming 9.0, because the support
code involved just showed up:
http://archives.postgresql.org/pgsql-committers/2010-01/msg00169.php

It was always possible to do this in an init script as you describe. The
specific new feature added is the ability to remove client child processes
from having that protection, so that they can still be killed normally.
Basically, it limits the protection to just the process that you really need
it on. The updated documentation for the new version has more details about
this whole topic, useful to people running older versions too:
http://developer.postgresql.org/pgdocs/postgres/kernel-resources.html

> Kind of off-topic, but a Linux kernel parameter that's often not set on
> database servers is elevator=deadline, which selects the I/O scheduling
> algorithm. The scheduler can also be viewed/set at runtime, for example
> for the disk /dev/sdc in /sys/block/sdc/queue/scheduler.

I've never seen a real-world PostgreSQL workload where deadline worked
better than CFQ, and I've seen a couple where it was significantly worse.
Playing with that parameter needs a heavy disclaimer that you should
benchmark *your app* before and after changing it to make sure it was
actually useful. Actually, benchmark three times: return to CFQ again
afterwards, too, just to confirm it's not a "faster on the second run"
effect.

The important things to get right on Linux are read-ahead and reducing the
size of the write cache--the latter being the more direct and effective way
to improve the problem that the scheduler change happens to impact too.
Those have dramatically more importance than sensible changes to the
scheduler (using the anticipatory one on a server system, or the no-op one
on a desktop, would be the non-sensible changes).

--
Greg Smith    2ndQuadrant US    Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
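[A hedged sketch of the read-ahead and write-cache knobs referred to above;
the device name and values are illustrative starting points to benchmark
from, not recommendations made in this thread:]

  # read-ahead on the data array (value is in 512-byte sectors)
  blockdev --getra /dev/sdb
  blockdev --setra 4096 /dev/sdb

  # shrink the kernel's dirty-page (write cache) thresholds so there is
  # less to flush at checkpoint/fsync time; measure before keeping these
  sysctl -w vm.dirty_background_ratio=1
  sysctl -w vm.dirty_ratio=10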