Thread: Postgres 8.2 memory weirdness
I'm not sure what is going on but looking for some advice, knowledge. I'm running multiple postgres servers in a slon relationship. I have hundreds of thousands of updates, inserts a day. But what I'm seeing is my server appears to "deallocate" memory (for the lack of a better term) and performance goes to heck, slow response, a sub second query takes anywhere from 6-40 seconds to complete when this happens. My guess is that postgres does not have enough memory to handle the query and starts to swap (although I can't see any swapping happening (vmstat) Some examples: Also when the memory deallocates, I start stacking connections (again, performance goes to heck). FC6, postgres 8.2 on a dual quad core intel box, 8 gigs of physical RAM Between 8:17:36 and 8:17:57, my system goes from almost all memory consumed to almost all memory free.. Previous to this big swing, Cache goes from 7 gigs down to what you see here, connections start stacking and IOWAIT goes thru the roof. 1200932256 - Mon Jan 21 08:17:36 2008 - qdb02.gc.sv.admission.net - 0 sec. elapsed. procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ r b swpd free buff cache si so bi bo in cs us sy id wa st 0 2 9072 297460 4904 1543316 0 0 52 175 2 2 21 3 75 2 0 1200932277 - Mon Jan 21 08:17:57 2008 - qdb02.gc.sv.admission.net - 2 sec. elapsed. procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ r b swpd free buff cache si so bi bo in cs us sy id wa st 18 6 9072 6354560 6128 1574324 0 0 52 175 2 2 21 3 75 2 0 At this point I have connections stacking (simple queries), Until what I believe is that the kernel starts to allocate memory for the dB (or the db starts to be moved from disk to memory?) (queries, tables etc).. However performance is still dismal and I do see high IOWAIT when I don't have sufficient memory for the DB (again my feeling). If I do a reindex at this time, the system alocates most of my RAM and my queries are subsecond agiain and I work fine until the next time, the DB appears to be flushed from RAM. I'm having a hard time understanding what I'm seeing. I do grab netstat/ps/free/vmstat/netstat every few seconds so I have insight as to the systems health when these things happen. I suspect I've outgrown our initial postgres config, or there are more sysctl or other kernel tweaks that need to happen. I really am just trying to understand the memory allocation etc. And why my memory goes from fully utilized to 90% free and DB performance goes to heck. I'll provide more info as needed, and my apologies for this being a bit scattered, but I'm really confused. I'm either running out of a resource or other (but no errors in any logs, postgres or otherwise).. Thanks Tory
On Wed, 23 Jan 2008, Tory M Blue wrote: > I have hundreds of thousands of updates, inserts a day. But what I'm > seeing is my server appears to "deallocate" memory (for the lack of a > better term) and performance goes to heck, slow response, a sub second > query takes anywhere from 6-40 seconds to complete when this happens. Generally if you have a system doing many updates and inserts that slows for that long, it's because it hit a checkpoint. I'm not sure what your memory-related issues are but it's possible that might be from a backlog of sessions using memory that are stuck behind the checkpoint, particularly since you mention simple query connections stacking up during these periods. In any case you should prove/disprove this is checkpoint-related behavior before you chase down something more esoteric. There's a quick intro to this area in the "Monitoring checkpoints" section of http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm and the later sections go into what you can do about it. > I suspect I've outgrown our initial postgres config, or there are more > sysctl or other kernel tweaks that need to happen. You should post a list of what you're changed from the defaults. You're analyzing from the perspective where you assume it's a memory problem and a look at your config will give a better idea whether that's possible or not. Other good things to mention: exact 8.2 version, OS, total memory, outline of disk configuration. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Jan 23, 2008 4:31 PM, Greg Smith <gsmith@gregsmith.com> wrote: > Generally if you have a system doing many updates and inserts that slows > for that long, it's because it hit a checkpoint. I'm not sure what your > memory-related issues are but it's possible that might be from a backlog > of sessions using memory that are stuck behind the checkpoint, > particularly since you mention simple query connections stacking up during > these periods. > > In any case you should prove/disprove this is checkpoint-related behavior > before you chase down something more esoteric. There's a quick intro to > this area in the "Monitoring checkpoints" section of > http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm and the > later sections go into what you can do about it. > > Thanks Greg, Particulars: Postg: 8.2.1fc6 Fedora FC6: 2.6.19-1.2911.fc6 Dell 2950, Dual quad core 8 Gigs of Ram Lefthand Iscsi; 48 drives Postgres.conf max_connections = 300 shared_buffers = 75000 <--- Believe these need tuning (based on the reading last night) max_prepared_transactions = 0 work_mem = 102400 maintenance_work_mem = 65536 max_fsm_pages = 1087500 <-- modified last night, based on warnings in log max_fsm_relations = 430 fsync = true checkpoint_segments = 50 checkpoint_timeout = 300 checkpoint_warning = 3600s <--- set this last night and already see instances of "2008-01-24 03:54:39 PST LOG: checkpoints are occurring too frequently (89 seconds apart) 2008-01-24 03:54:39 PST HINT: Consider increasing the configuration parameter "checkpoint_segments"." effective_cache_size = 330000 <-- This appears totally wrong and something I noticed last night. left over from previous versions of postgres on different hardware. (thinking to set this to 6-7G) autovacuum = on autovacuum_analyze_threshold = 2000 Thanks for the link, I read lots of good information last night and will start pushing forward with some changes in my test area. Any insight into what my current settings are telling you is appreciated -Tory
On Thu, 24 Jan 2008, Tory M Blue wrote: > Postg: 8.2.1fc6 8.2.1 has a nasty bug related to statistics collection that causes performance issues exactly in the kind of heavy update situation you're in. That's actually why i asked for the exact 8.2 version. You should plan an upgrade as soon as feasible to the current release just to eliminate this as a possible influence on your problems. No need to dump the database or do anything fancy, just get the new version going and point it at the existing database. To do a quick check on whether this is impacting things, run top, press "c" to show the full process lines, and note whether the statistics collector process is taking up a significant amount of CPU time. If it is, you're being nailed by the bug, and you really need that ugprade. > 8 Gigs of Ram > shared_buffers = 75000 <--- Believe these need tuning (based on the > reading last night) Probably, but if you're having checkpoint problems now making shared_buffers bigger will likely make them worse. Some people with update-heavy workloads end up reducing this to a very small value (<250MB) even with large amounts of RAM because that makes less information to dump at checkpoint time. > checkpoint_segments = 50 > checkpoint_timeout = 300 > checkpoint_warning = 3600s <--- set this last night and > already see instances of > > "2008-01-24 03:54:39 PST LOG: checkpoints are occurring too > frequently (89 seconds apart) > 2008-01-24 03:54:39 PST HINT: Consider increasing the > configuration parameter "checkpoint_segments"." If you're getting checkpoints every 89 seconds it's no wonder your system is dying. You may need to consider a large increase to checkpoint_segments to get the interval between checkpoints to increase. It should at least be a few minutes between them if you want any reasonable performance level. > effective_cache_size = 330000 <-- This appears totally wrong and > something I noticed last night. left over from previous versions of > postgres on different hardware. (thinking to set this to 6-7G) Right, that's where it should be. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Jan 24, 2008 10:49 AM, Greg Smith <gsmith@gregsmith.com> wrote: > 8.2.1 has a nasty bug related to statistics collection that causes > performance issues exactly in the kind of heavy update situation you're > in. That's actually why i asked for the exact 8.2 version. You should > plan an upgrade as soon as feasible to the current release just to > eliminate this as a possible influence on your problems. No need to dump > the database or do anything fancy, just get the new version going and > point it at the existing database. Not seeing any excessive cpu from the stats collector process.. So maybe not being hit with this bug. PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P SWAP TIME COMMAND 28445 postgres 15 0 7432 828 408 S 0 0.0 4:15.47 3 6604 4:15 postgres: stats collector process With the above said, we had started sometime ago to move 8.2.5 into our environments, so that should be on these servers next week (the push is a slow process, but we are really liking what we are seeing for 8.3, so I'm hoping once blessed, i'll push it thru quickly).. > > checkpoint_segments = 50 > > checkpoint_timeout = 300 > > checkpoint_warning = 3600s <--- set this last night and > > If you're getting checkpoints every 89 seconds it's no wonder your system > is dying. You may need to consider a large increase to > checkpoint_segments to get the interval between checkpoints to increase. > It should at least be a few minutes between them if you want any > reasonable performance level. I doubled the checkpoint segments yesterday and have not seen any warnings. Will run with segments of 100 for a while and see how things look.. Anyway to make sure that there is not a number between 50 and 100 that makes more sense? > > effective_cache_size = 330000 <-- This appears totally wrong and > > something I noticed last night. left over from previous versions of > > postgres on different hardware. (thinking to set this to 6-7G) > > Right, that's where it should be. We have started some performance analysis and this numvber is sure affecting performance in good ways by having it set semi correctly. This has not been pushed (too many changes), but we will continue performance testing and it will probably make it to prod next week. Thanks for some sanity checks here Greg, it's truly appreciated. Tory
On Fri, 25 Jan 2008, Tory M Blue wrote: > I doubled the checkpoint segments yesterday and have not seen any > warnings. Will run with segments of 100 for a while and see how things > look.. Anyway to make sure that there is not a number between 50 and > 100 that makes more sense? More segments means more disk space taken up with them and a longer crash recovery. Those are the downsides; if you can live with those there's no reason to run at <100 if that works for you. Fine-tuning here isn't really that helpful. I'm a little confused by your report through because you should still be seeing regular checkpoint warnings if you set checkpoint_warning = 3600s , they should just be spaced further apart. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Jan 27, 2008 4:08 PM, Greg Smith <gsmith@gregsmith.com> wrote: > > More segments means more disk space taken up with them and a longer crash > recovery. Those are the downsides; if you can live with those there's no > reason to run at <100 if that works for you. Fine-tuning here isn't > really that helpful. > > I'm a little confused by your report through because you should still be > seeing regular checkpoint warnings if you set checkpoint_warning = 3600s , > they should just be spaced further apart. I'm not seeing any warnings at all. [idb01 ~]$ sudo cat /data/logs/pgsql-27.log | grep -i check [idb01 ~]$ sudo cat /data/logs/pgsql-26.log | grep -i check [idb01 ~]$ sudo cat /data/logs/pgsql-25.log | grep -i check [idb01 ~]$ sudo cat /data/logs/pgsql-24.log | grep -i check 2008-01-24 03:54:39 PST LOG: checkpoints are occurring too frequently (89 seconds apart) 2008-01-24 03:54:39 PST HINT: Consider increasing the configuration parameter "checkpoint_segments". 2008-01-24 07:26:25 PST LOG: checkpoints are occurring too frequently (106 seconds apart) 2008-01-24 07:26:25 PST HINT: Consider increasing the configuration parameter "checkpoint_segments". 2008-01-24 11:34:18 PST LOG: checkpoints are occurring too frequently (173 seconds apart) 2008-01-24 11:34:18 PST HINT: Consider increasing the configuration parameter "checkpoint_segments". Segment config still: # - Checkpoints - checkpoint_segments = 100 # bumped from 50 checkpoint_timeout = 300 # range 30s-1h #checkpoint_warning = 30s # 0 is off checkpoint_warning = 3600s # 0 is off No warnings in my logs, I see some LOG information but it pertains to slon and not postgres directly. Ideas?! Thanks again Tory