Thread: PG quitting sporadically!!
Hi, I have been running PG for quite a while now. Suddenly, the postmaster either hogs up memory or just croaks and doesn't respond. The write process has become horribly slow. Could it be that my data has grown so large that it's becoming an issue? Even connecting to PGSQL as postgres user in SSH is way too slow. I am on a 4GB server with Apache 2.2.6 and PgSql 8.2.3. The same server runs Apache, PgSQL, and MySQL. The last is not used for much, just some simple logging and such. It seldom shows up in "TOP" output. So the main heavy lifters are Apache and PgSQL. In my PG_LOG files, I have many messages like these: LOG: could not send data to client: Broken pipe LOG: could not receive data from client: Connection reset by peer LOG: unexpected EOF on client connection What are these things? My conf file settings (which were working fine until recently!) are as follows: max_connections = 150 shared_buffers = 330MB effective_cache_size = 512000 max_fsm_relations = 100 max_fsm_pages = 300000 work_mem = 100MB temp_buffers = 4096 authentication_timeout = 10s ssl = off autovacuum = on autovacuum_vacuum_cost_delay = 20 vacuum_cost_delay = 20 autovacuum_naptime = 10 stats_start_collector = on stats_row_level = on autovacuum_vacuum_threshold = 75 autovacuum_analyze_threshold = 25 autovacuum_analyze_scale_factor = 0.02 autovacuum_vacuum_scale_factor = 0.01 wal_buffers = 128 checkpoint_segments = 64 checkpoint_timeout = 900 fsync = on maintenance_work_mem = 512MB ------------------------------------------------ 1. VMSTAT -S ~ > vmstat -s 4148844 total memory 3771392 used memory 3232328 active memory 265652 inactive memory 377452 free memory 75044 buffer memory 2890740 swap cache 2096440 total swap 6436 used swap 2090004 free swap 24821214 non-nice user cpu ticks 229660 nice user cpu ticks 14153129 system cpu ticks 566198940 idle cpu ticks 22918768 IO-wait cpu ticks 75387 IRQ cpu ticks 0 softirq cpu ticks 153625372 pages paged in 287656751 pages paged out 175855 pages swapped in 859894 pages swapped out 1844498357 interrupts 889200515 CPU context switches 1201419806 boot time 15735402 forks ------------------------------------------------ 2. FREE -M total used free shared buffers cached Mem: 4051 4025 26 0 74 2823 -/+ buffers/cache: 1127 2924 Swap: 2047 6 2041 ------------------------------------------------ 3. VACUUM ANALYE OUTPUT This is the last few lines of this output: 0 index pages have been deleted, 0 are currently reusable. CPU 0.00s/0.00u sec elapsed 0.00 sec. INFO: "clients": found 1 removable, 3031359 nonremovable row versions in 199194 pages DETAIL: 0 dead row versions cannot be removed yet. There were 2624366 unused item pointers. 97747 pages contain useful free space. 0 pages are entirely empty. CPU 0.00s/0.00u sec elapsed 235.21 sec. INFO: vacuuming "pg_toast.pg_toast_50002" INFO: index "pg_toast_50002_index" now contains 20 row versions in 2 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU 0.00s/0.00u sec elapsed 0.00 sec. INFO: "pg_toast_50002": found 0 removable, 20 nonremovable row versions in 7 pages DETAIL: 0 dead row versions cannot be removed yet. There were 12 unused item pointers. 6 pages contain useful free space. 0 pages are entirely empty. CPU 0.00s/0.00u sec elapsed 0.02 sec. INFO: analyzing "public.clients" INFO: "clients": scanned 7500 of 199194 pages, containing 114197 live rows and 1 dead rows; 7500 rows in sample, 3032981 estimated total rows INFO: free space map contains 98498 pages in 25 relations DETAIL: A total of 98720 page slots are in use (including overhead). 98720 page slots are required to track all free space. Current limits are: 300000 page slots, 100 relations, using 1766 kB. ------------------------------------------------
On Thu, 14 Feb 2008, Phoenix Kiula wrote: > Suddenly, the postmaster either hogs up memory or just croaks and > doesn't respond. The write process has become horribly slow. You should consider whether that's because a checkpoint is happening at that point. You didn't mention anything about your disk+controller information to have an idea how likely that is. Consider increasing checkpoint_warning=3600 so that you'll always get a note in the logs when a checkpoint happens; if those line up with your client disconnects that will be telling you something. On the logging size, you may want to also enable log_min_duration_statement ; around 500 (milliseconds) would be a reasonable starting value. That will show you what queries are taking a long time to handle. > I am on a 4GB server with Apache 2.2.6 and PgSql 8.2.3. Be aware that when 8.2.3 was released, 8.2 had only been out for two months. There's another 11 months worth of accumulated bug fixes in 8.2.6, including some that can cause the server to slow or crash. It's not a difficult upgrade (no changes to the database) and you should consider it. There are plenty of known and already fixed problems in 8.2.3 you could be running into. > max_connections = 150 > maintenance_work_mem = 512MB > shared_buffers = 330MB > work_mem = 100MB That's a really high setting for work_mem with this many connections; are you aware that combination can easily use 15GB of RAM? You should decrease that to around 10MB with the size of your server and greatly reduce one possible source for running out of memory. It's possible to increase that value just for some individual queries if there's some known set of ones that really need more memory to work efficiently (looking at the minimum duration logs should give you guidance here). I'd normally suggest increasing shared_buffers instead as well, but you should rule out checkpoints before doing that (increasing shared_buffes can make checkpoint issues worse). You could also decrease maintenance_work_mem quite a bit from where you've got it at now to reduce another possible source for large memory allocations. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Thanks. Comments below. (PS: I am still unable to connect to postgresql even in SSH! I see this message: psql: could not connect to server: Connection timed out Is the server running on host "localhost" and accepting TCP/IP connections on port 5432? Yes of course the localhost is running the pgsql server and that port is allowed! > You should consider whether that's because a checkpoint is happening at > that point. You didn't mention anything about your disk+controller > information to have an idea how likely that is. Consider increasing > checkpoint_warning=3600 so that you'll always get a note in the logs when > a checkpoint happens; if those line up with your client disconnects that > will be telling you something. > I am not sure what checkpoint stuff means. But I added that entry, and now my log has ONLY this: LOG: test message did not get through on socket for statistics collector LOG: disabling statistics collector for lack of working socket LOG: database system was shut down at 2008-02-14 17:53:13 CST LOG: checkpoint record is at 8/E4BE7CF8 LOG: redo record is at 8/E4BE7CF8; undo record is at 0/0; shutdown TRUE LOG: next transaction ID: 0/296662974; next OID: 89700 LOG: next MultiXactId: 1; next MultiXactOffset: 0 LOG: database system is ready > On the logging size, you may want to also enable > log_min_duration_statement ; around 500 (milliseconds) would be a > reasonable starting value. That will show you what queries are taking a > long time to handle. I have this at 5000. I think I first want to track those that take more than 5 seconds. But for now, the log is useless as mentioned above. > Be aware that when 8.2.3 was released, 8.2 had only been out for two > months. There's another 11 months worth of accumulated bug fixes in > 8.2.6, including some that can cause the server to slow or crash. It's > not a difficult upgrade (no changes to the database) and you should > consider it. There are plenty of known and already fixed problems in > 8.2.3 you could be running into. Thanks, I can do that, but the upgrade process is not very simple or automated and will take backup of database and all that rigmarole.Last time I remember, I had to seek help from this forum to do that simple task, because something had gone remiss. Is there an easy RPM method of upgrading postgresql without backingup/restoring etc? I am on CentOS 4. > > max_connections = 150 > > maintenance_work_mem = 512MB > > shared_buffers = 330MB > > work_mem = 100MB > > That's a really high setting for work_mem with this many connections; are > you aware that combination can easily use 15GB of RAM? I didn't know that, but it had been working for over a year. I have 4GB of RAM. I have changed this now to 10MB, the work_mem, but that isn't helping. I'd appreciate any other info. Thanks!
> > Be aware that when 8.2.3 was released, 8.2 had only been out for two > > months. There's another 11 months worth of accumulated bug fixes in > > 8.2.6, including some that can cause the server to slow or crash. It's > > not a difficult upgrade (no changes to the database) and you should > > consider it. There are plenty of known and already fixed problems in > > 8.2.3 you could be running into. > > > > Thanks, I can do that, but the upgrade process is not very simple or > automated and will take backup of database and all that rigmarole.Last > time I remember, I had to seek help from this forum to do that simple > task, because something had gone remiss. > > Is there an easy RPM method of upgrading postgresql without > backingup/restoring etc? I am on CentOS 4. You don't need a dump/restore to go between minor releases (e.g. 8.2.3 to 8.2.6). Just install the new binaries and fire it up. Taking a backup beforehand isn't a bad idea, of course. :) -Doug
Phoenix Kiula escribió: > Thanks. Comments below. (PS: I am still unable to connect to > postgresql even in SSH! I see this message: > > psql: could not connect to server: Connection timed out > Is the server running on host "localhost" and accepting > TCP/IP connections on port 5432? > > Yes of course the localhost is running the pgsql server and that port > is allowed! My educated guess is that your network is behaving funny, or your firewall is crazy. Or your kernel has bugs. The fact that the connections work sometimes is one clue; this is the other: > LOG: test message did not get through on socket for statistics collector > LOG: disabling statistics collector for lack of working socket This is a bad sign. For one thing, it means autovacuum, if enabled, is not really working at all (which can, in turn, explain slowness). -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Fri, 15 Feb 2008, Phoenix Kiula wrote: > I am not sure what checkpoint stuff means. But I added that entry, and > now my log has ONLY this: > > LOG: test message did not get through on socket for statistics collector > LOG: disabling statistics collector for lack of working socket If you're getting that, as already suggested you should be chasing down whatever is going on there before you touch anything else. I was throwing out a list of possible things that might cause your problems on a fully working system, but you don't have one of those right now and that's really strange. Have you ever run top and hit the C key to see what all the processes were doing? It labels the major PostgreSQL processes more usefully if you do that. I'd be curious just what is gobbling up resources on your machine, this socket error is somewhat disturbing. Since it sounds like a fairly critical machine you've got going here, if I were you I'd be thinking a bit about whether it might make sense to purchase an hour or two of consulting time from someone who really knows this area. You're doing the right thing asking for help here, but I wonder whether there's something else going on that would be obvious to an expert if they logged into your system and poked around a bit. (This is certainly not an ad for me--I'm not doing consulting right now). > I can do that, but the upgrade process is not very simple or automated > and will take backup of database and all that rigmarole...Is there an > easy RPM method of upgrading postgresql without backingup/restoring etc? > I am on CentOS 4. First off: I wouldn't want to introduce another variable here until there's a better understanding of what's wrong with your existing system. You certainly should do a true backup here oriented at disaster recovery before upgrading given the weirdness involved. But the upgrade itself doesn't require one, just installing new packages. If you've already installed the PGDG RPMs on your system (I don't know how else you'd have gotten 8.2.3 onto Centos 4 via RPM) you should be able to download the new ones for the latest 8.2, put them all into a directory, and do "rpm -Uvh *.rpm" to get the new ones replacing the old (with the server shutdown!). *Should* only take a few minutes. I wrote a little guide to sorting through more complicated upgrades if it comes to that you can find at http://www.westnet.com/~gsmith/content/postgresql/pgrpm.htm , but if the packages are from the same underlying source it's unlikely you'll have that large of a mess. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD