Thread: WAL partition overloaded--by autovacuum?
Hi everyone,

I'm running 8.4.2 on a CentOS machine, and postgres recently died with signal 6 because the pg_xlog partition filled up (33GB) on 7/4/10 10:34:23 (perfect timing, as I was hiking in the mountains in the remotest parts of our country). I did some digging and found the following:

-- The current state of /db/data/pg_xlog (after postgres died on 7/4/10) indicates there are 2056 files, all created between 7/4/10 10:01 and 7/4/10 10:34, and a quick bash-kung-fu check shows that these files were created anywhere from 3 per minute (10:22) to 106 per minute (10:12)
-- wal_buffers=8MB; commit_siblings=200; commit_delay=0; checkpoint_segments=16
-- All other WAL-related parameters are either defaults or commented out
-- syslog shows that on 7/2/10 from 10:16:11 to 10:21:51, messages similar to this occurred 29 times:

> Jul 2 10:16:11 db4.sac postgres[20526]: [56-1] 2010-07-02 10:16:11.117 PDT [user=,db= PID:20526 XID:0]LOG: checkpoints are occurring too frequently (4 seconds apart)
> Jul 2 10:16:11 db4.sac postgres[20526]: [56-2] 2010-07-02 10:16:11.117 PDT [user=,db= PID:20526 XID:0]HINT: Consider increasing the configuration parameter "checkpoint_segments".

-- On 7/4/10 from 09:09:02 to 09:10:08, the same type of messages occurred 21 times in syslog
-- This message did not occur on 7/3/10 at all
-- Preceding the "checkpoints are occurring too frequently" syslog entries are autovacuum entries:

> Jul 2 10:16:04 db4.sac postgres[11357]: [7-1] 2010-07-02 10:16:04.576 PDT [user=,db= PID:11357 XID:0]LOG: automatic vacuum of table "tii._sac_global.sl_event": index scans: 1
> Jul 2 10:16:04 db4.sac postgres[11357]: [7-2] pages: 0 removed, 25 remain
> Jul 2 10:16:04 db4.sac postgres[11357]: [7-3] tuples: 676 removed, 955 remain
> Jul 2 10:16:04 db4.sac postgres[11357]: [7-4] system usage: CPU 0.01s/0.00u sec elapsed 0.02 sec
> Jul 2 10:16:04 db4.sac postgres[11357]: [8-1] 2010-07-02 10:16:04.580 PDT [user=,db= PID:11357 XID:197431667]LOG: automatic analyze of table "tii._sac_global.sl_event" system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
> Jul 2 10:16:04 db4.sac postgres[11357]: [9-1] 2010-07-02 10:16:04.965 PDT [user=,db= PID:11357 XID:0]LOG: automatic vacuum of table "tii._sac_global.sl_confirm": index scans: 1
> Jul 2 10:16:04 db4.sac postgres[11357]: [9-2] pages: 0 removed, 154 remain
> Jul 2 10:16:04 db4.sac postgres[11357]: [9-3] tuples: 1834 removed, 8385 remain
> Jul 2 10:16:04 db4.sac postgres[11357]: [9-4] system usage: CPU 0.32s/0.04u sec elapsed 0.37 sec
> Jul 4 09:08:56 db4.sac postgres[21798]: [13-1] 2010-07-04 09:08:56.107 PDT [user=,db= PID:21798 XID:0]LOG: automatic vacuum of table "tii.public.city": index scans: 0
> Jul 4 09:08:56 db4.sac postgres[21798]: [13-2] pages: 0 removed, 151 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [13-3] tuples: 0 removed, 20223 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [13-4] system usage: CPU 0.00s/0.00u sec elapsed 0.01 sec
> Jul 4 09:08:56 db4.sac postgres[21798]: [14-1] 2010-07-04 09:08:56.118 PDT [user=,db= PID:21798 XID:0]LOG: automatic vacuum of table "tii.public.country": index scans: 1
> Jul 4 09:08:56 db4.sac postgres[21798]: [14-2] pages: 0 removed, 2 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [14-3] tuples: 77 removed, 185 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [14-4] system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
> Jul 4 09:08:56 db4.sac postgres[21798]: [15-1] 2010-07-04 09:08:56.335 PDT [user=,db= PID:21798 XID:0]LOG: automatic vacuum of table "tii.public.gm3_clipboard": index scans: 1
> Jul 4 09:08:56 db4.sac postgres[21798]: [15-2] pages: 0 removed, 897 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [15-3] tuples: 2 removed, 121594 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [15-4] system usage: CPU 0.07s/0.08u sec elapsed 0.20 sec
...snip...
> Jul 4 09:10:25 db4.sac postgres[22066]: [23-1] 2010-07-04 09:10:25.921 PDT [user=,db= PID:22066 XID:0]LOG: automatic vacuum of table "tii.public.pm_assignment": index scans: 1
> Jul 4 09:10:25 db4.sac postgres[22066]: [23-2] pages: 0 removed, 995 remain
> Jul 4 09:10:25 db4.sac postgres[22066]: [23-3] tuples: 323 removed, 83964 remain
> Jul 4 09:10:25 db4.sac postgres[22066]: [23-4] system usage: CPU 0.01s/0.09u sec elapsed 0.52 sec
> Jul 4 09:10:25 db4.sac postgres[22073]: [7-1] 2010-07-04 09:10:25.978 PDT [user=,db= PID:22073 XID:0]LOG: automatic vacuum of table "tii.public.pm_question_type": index scans: 0
> Jul 4 09:10:25 db4.sac postgres[22073]: [7-2] pages: 0 removed, 1 remain
> Jul 4 09:10:25 db4.sac postgres[22073]: [7-3] tuples: 0 removed, 2 remain
> Jul 4 09:10:25 db4.sac postgres[22073]: [7-4] system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
> Jul 4 09:10:26 db4.sac postgres[22066]: [24-1] 2010-07-04 09:10:26.301 PDT [user=,db= PID:22066 XID:0]LOG: automatic vacuum of table "tii.public.pm_association_rule": index scans: 1
> Jul 4 09:10:26 db4.sac postgres[22066]: [24-2] pages: 0 removed, 286 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [24-3] tuples: 46 removed, 51321 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [24-4] system usage: CPU 0.00s/0.03u sec elapsed 0.34 sec
> Jul 4 09:10:26 db4.sac postgres[22066]: [25-1] 2010-07-04 09:10:26.328 PDT [user=,db= PID:22066 XID:0]LOG: automatic vacuum of table "tii.public.role": index scans: 0
> Jul 4 09:10:26 db4.sac postgres[22066]: [25-2] pages: 0 removed, 1 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [25-3] tuples: 0 removed, 5 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [25-4] system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
> Jul 4 09:10:26 db4.sac postgres[22066]: [26-1] 2010-07-04 09:10:26.373 PDT [user=,db= PID:22066 XID:0]LOG: automatic vacuum of table "tii.public.pm_review_type": index scans: 0
> Jul 4 09:10:26 db4.sac postgres[22066]: [26-2] pages: 0 removed, 1 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [26-3] tuples: 0 removed, 4 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [26-4] system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
> Jul 4 09:10:26 db4.sac postgres[22066]: [27-1] 2010-07-04 09:10:26.428 PDT [user=,db= PID:22066 XID:0]LOG: automatic vacuum of table "tii.public.permission": index scans: 0
> Jul 4 09:10:26 db4.sac postgres[22066]: [27-2] pages: 0 removed, 1 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [27-3] tuples: 0 removed, 8 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [27-4] system usage: CPU 0.00s/0.00u sec elapsed 0.01 sec

-- log_min_duration_statement=1000, so there is a chance that the volume of really-fast write traffic might be masked, but I'm doubtful, as this is our low-usage season (we run a web service for the education sector).
-- Another tidbit of information is that I am using slony 2.0.3 for replication, and my sync_interval is 500. However, I'm doubtful that slony traffic is responsible, because my other nodes use the exact same config and hardware, and none of them had this issue.
-- My last administrative action on this machine was a SIGHUP to push changes to pg_hba.conf on 6/29/10. Just now, I diff-ed my postgresql.conf file with a copy that I keep in svn--no changes at all.
This leads me to believe that there was a sudden flurry of write activity that occurred, and the process that would flush WAL files to /db/data/ couldn't keep up, thereby filling up the disk. I'm wondering if anyone else out there might be able to give me some insight or comments to my assessment--is it accurate? Any input would be helpful, and I'll try to make necessary architectural changes to keep this from happening again.

Your help is much appreciated!

--Richard
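[The bash-kung-fu count above could also be pulled from inside the database. A minimal sketch, assuming a superuser session (pg_ls_dir() and pg_stat_file() are superuser-only built-ins in 8.4), and with the caveat that these are the files' modification times, so recycled segments show when they were last reused:

    SELECT date_trunc('minute', (pg_stat_file('pg_xlog/' || f)).modification) AS minute,
           count(*) AS segments
    FROM pg_ls_dir('pg_xlog') AS f
    WHERE f ~ '^[0-9A-F]{24}$'   -- WAL segment files only
    GROUP BY 1
    ORDER BY 1;
]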
On Tue, Jul 6, 2010 at 7:10 PM, Richard Yen <richyen@iparadigms.com> wrote:

> This leads me to believe that there was a sudden flurry of write activity that occurred, and the process that would flush WAL files to /db/data/ couldn't keep up, thereby filling up the disk. I'm wondering if anyone else out there might be able to give me some insight or comments to my assessment--is it accurate? Any input would be helpful, and I'll try to make necessary architectural changes to keep this from happening again.

I tend to agree. What kind of disk setup is under your data store? If it's RAID-5 or RAID-6, is it possibly degraded? I'd run iostat -xd 10 while the write load is high to see how your xlog partition %util compares to the main data store partition. Do you have your main data store partition under a battery-backed caching RAID controller? Tell us what you can about your hardware setup.
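[Alongside iostat, the server's own pg_stat_bgwriter view gives a quick read on whether checkpoints are being forced by xlog consumption rather than fired by checkpoint_timeout. A sketch -- sample it twice and diff the counters to turn them into a rate:

    SELECT checkpoints_timed,    -- checkpoints started by checkpoint_timeout
           checkpoints_req,      -- checkpoints forced, e.g. by filling checkpoint_segments
           buffers_checkpoint,
           buffers_backend
    FROM pg_stat_bgwriter;
]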
On Tue, Jul 6, 2010 at 7:10 PM, Richard Yen <richyen@iparadigms.com> wrote:

One more thing: do you have long-running transactions during these periods?
And, read this: http://wiki.postgresql.org/wiki/Guide_to_reporting_problems
On 07/07/10 13:10, Richard Yen wrote:

> This leads me to believe that there was a sudden flurry of write activity that occurred, and the process that would flush WAL files to /db/data/ couldn't keep up, thereby filling up the disk. I'm wondering if anyone else out there might be able to give me some insight or comments to my assessment--is it accurate? Any input would be helpful, and I'll try to make necessary architectural changes to keep this from happening again.

Do you have WAL archiving enabled? (If so, let's see your archive_command.)

Cheers

Mark
Richard Yen wrote:

> the pg_xlog partition filled up (33GB)
> checkpoint_segments=16
> a sudden flurry of write activity

Was this database bulk-loaded in some way (like by restoring the output of pg_dump, for example)? If so, all rows inserted into all tables would have the same (or very nearly the same) xmin value. At some later time, virtually all tuples would need to be rewritten to freeze them. This would be a logged operation, so WAL files would be getting created rapidly. If you have WAL archiving turned on, and you can't copy the files out as fast as they're being created, this might happen.

To avoid such a crushing mass freeze at an unpredictable time, we always run VACUUM FREEZE ANALYZE on a bulk-loaded database before turning on WAL archiving.

Of course, if this database wasn't bulk-loaded, this can't be your problem....

-Kevin
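[One way to check for that situation is to look at how close each table's relfrozenxid is to autovacuum_freeze_max_age -- tables loaded in one batch tend to show nearly identical, uniformly high ages. A minimal sketch against the 8.4 catalogs:

    SELECT c.oid::regclass                          AS table_name,
           age(c.relfrozenxid)                      AS xid_age,
           pg_size_pretty(pg_relation_size(c.oid))  AS table_size
    FROM pg_class c
    WHERE c.relkind = 'r'
    ORDER BY age(c.relfrozenxid) DESC
    LIMIT 20;
]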
Sorry, I forgot to mention that archive_mode is "off" and commented out, and archive_command is '' and commented out. Thanks for following up!

--Richard

On Jul 7, 2010, at 1:58, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:

> On 07/07/10 13:10, Richard Yen wrote:
>> This leads me to believe that there was a sudden flurry of write activity that occurred, and the process that would flush WAL files to /db/data/ couldn't keep up, thereby filling up the disk. I'm wondering if anyone else out there might be able to give me some insight or comments to my assessment--is it accurate? Any input would be helpful, and I'll try to make necessary architectural changes to keep this from happening again.
>
> Do you have WAL archiving enabled? (If so, let's see your archive_command.)
>
> Cheers
>
> Mark
On Jul 6, 2010, at 8:25 PM, Scott Marlowe wrote:

> Tell us what you can about your hardware setup.

Sorry, I made the bad assumption that the hardware setup would be irrelevant--dunno why I thought that.

My hardware setup is 2 FusionIO 160GB drives in a RAID-1 configuration, running on an HP DL360 G5.

I think I figured out the problem:

-- I figured that pg_xlog and data/base could both be on the FusionIO drive, since there would be no latency when there are no spindles.
-- However, I didn't take into account the fact that pg_xlog might grow in size when autovacuum does its work when vacuuming to prevent XID wraparound. I *just* discovered this when one of my other replication nodes decided to die on me and fill up its disk.
-- Unfortunately, my db is 114GB (including indexes) or 60GB (without indexes), leaving ~37GB for pg_xlog (since they are sharing a partition). So I'm guessing what happened was that when autovacuum ran to prevent XID wraparound, it takes each table and changes the XIDs, and this gets recorded in WAL, causing WAL to bloat. Is this the correct understanding?

Question for now is, the documentation says:

> There will always be at least one WAL segment file, and will normally not be more than (2 + checkpoint_completion_target) * checkpoint_segments + 1 files. Each segment file is normally 16 MB (though this size can be altered when building the server). You can use this to estimate space requirements for WAL. Ordinarily, when old log segment files are no longer needed, they are recycled (renamed to become the next segments in the numbered sequence). If, due to a short-term peak of log output rate, there are more than 3 * checkpoint_segments + 1 segment files, the unneeded segment files will be deleted instead of recycled until the system gets back under this limit.

This means my pg_xlog partition should be (2 + checkpoint_completion_target) * checkpoint_segments + 1 = 41 files, or 656MB. Then, if there are more than 49 files, unneeded segment files will be deleted, but in this case all segment files are needed, so they never got deleted. Perhaps we should add in the docs that pg_xlog should be the size of the DB or larger?

--Richard
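[For reference, those numbers can be derived straight from the running server's settings rather than worked out by hand. A small sketch; the 41 / 49 / 656MB figures above assume checkpoint_completion_target is at its 0.5 default:

    SELECT ((2 + current_setting('checkpoint_completion_target')::numeric)
              * current_setting('checkpoint_segments')::int + 1)::int        AS normal_max_segments,
           3 * current_setting('checkpoint_segments')::int + 1               AS delete_threshold_segments,
           (((2 + current_setting('checkpoint_completion_target')::numeric)
              * current_setting('checkpoint_segments')::int + 1) * 16)::int  AS normal_max_mb;
]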
On Wed, Jul 7, 2010 at 3:32 PM, Richard Yen <richyen@iparadigms.com> wrote:
> On Jul 6, 2010, at 8:25 PM, Scott Marlowe wrote:
>
>> Tell us what you can about your hardware setup.
>
> Sorry, I made the bad assumption that the hardware setup would be irrelevant--dunno why I thought that.
>
> My hardware setup is 2 FusionIO 160GB drives in a RAID-1 configuration, running on an HP DL360 G5.
>
> I think I figured out the problem:
>
> -- I figured that pg_xlog and data/base could both be on the FusionIO drive, since there would be no latency when there are no spindles.
> -- However, I didn't take into account the fact that pg_xlog might grow in size when autovacuum does its work when vacuuming to prevent XID wraparound. I *just* discovered this when one of my other replication nodes decided to die on me and fill up its disk.
> -- Unfortunately, my db is 114GB (including indexes) or 60GB (without indexes), leaving ~37GB for pg_xlog (since they are sharing a partition). So I'm guessing what happened was that when autovacuum ran to prevent XID wraparound, it takes each table and changes the XIDs, and this gets recorded in WAL, causing WAL to bloat. Is this the correct understanding?

That seems logical (and un-fun), but I don't understand how you managed to fill up 37GB of disk with WAL files. Every time you fill up checkpoint_segments * 16MB of WAL files, you ought to get a checkpoint. When it's complete, WAL segments completely written before the start of the checkpoint should be recyclable. Unless I'm confused, which apparently I am.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> wrote:

> I don't understand how you managed to fill up 37GB of disk with
> WAL files. Every time you fill up checkpoint_segments * 16MB of
> WAL files, you ought to get a checkpoint. When it's complete, WAL
> segments completely written before the start of the checkpoint
> should be recyclable. Unless I'm confused, which apparently I am.

You're not alone. At first I was assuming that it was because of archiving, but the OP says that's turned off. Unless it had been on and there wasn't a *restart* after changing the configuration, I can't see how this could happen, and was hoping someone else could cast some light on the issue.

The one setting that gave me pause was:

commit_siblings=200

but it doesn't seem like that should matter with:

commit_delay=0

-Kevin
On Jul 8, 2010, at 12:04 PM, Kevin Grittner wrote:

> Robert Haas <robertmhaas@gmail.com> wrote:
>
>> I don't understand how you managed to fill up 37GB of disk with
>> WAL files. Every time you fill up checkpoint_segments * 16MB of
>> WAL files, you ought to get a checkpoint. When it's complete, WAL
>> segments completely written before the start of the checkpoint
>> should be recyclable. Unless I'm confused, which apparently I am.
>
> You're not alone. At first I was assuming that it was because of
> archiving, but the OP says that's turned off. Unless it had been on
> and there wasn't a *restart* after changing the configuration, I
> can't see how this could happen, and was hoping someone else could
> cast some light on the issue.

I'm fairly confused myself. I'm beginning to think that because data/base and data/pg_xlog were on the same partition (/db), when the /db partition filled up, the WAL files couldn't get flushed to data/base, thereby preventing data/pg_xlog from being emptied out, as per the documentation.

My concern is that--as in the original post--there were moments where 129 WAL files were generated in one minute. Is it plausible that this autovacuum could be responsible for this?

--Richard
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > You're not alone. At first I was assuming that it was because of > archiving, but the OP says that's turned off. Unless it had been on > and there wasn't a *restart* after changing the configuration, Yeah, I was less than convinced by his eyeball report of that, too. "show archive_mode" would be a much more convincing check of the server's state. Or would have been, if the server hadn't been restarted since the problem occurred. archive_mode on with a bad archive_command would lead directly to the reported problem ... although it should also lead to pretty obvious complaints in the postmaster log. (Hmm ... but those complaints are logged at level WARNING, which as discussed elsewhere is really lower than LOG. Should we change them?) regards, tom lane
On Jul 8, 2010, at 12:27 PM, Tom Lane wrote:

> (Hmm ... but those complaints are logged at level WARNING, which as
> discussed elsewhere is really lower than LOG. Should we change them?)

Hmm, I did a grep on "WARNING" in my log, and the only thing that turns up are the "WARNING: terminating connection because of crash of another server process" entries when postgres died and shut down when the disks filled up.

Would this be conclusive evidence that archive_mode=off?

--Richard
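[A somewhat stronger check than grepping the log is to ask the running server for both the value and where it came from, along the lines of Tom's "show archive_mode" suggestion -- with the caveat, as Tom notes, that it only reflects the state since the last restart. A minimal sketch:

    SELECT name, setting, source
    FROM pg_settings
    WHERE name IN ('archive_mode', 'archive_command');
]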
Richard Yen <richyen@iparadigms.com> wrote:

> there were moments where 129 WAL files were generated in one
> minute. Is it plausible that this autovacuum could be responsible
> for this?

I don't remember seeing your autovacuum and vacuum config settings, or an answer to my question about whether there was a bulk load of a significant portion of current data. With aggressive autovacuum settings, hitting the freeze threshold for bulk-loaded data could do that.

-Kevin
Richard Yen <richyen@iparadigms.com> writes:

> My concern is that--as in the original post--there were moments where 129 WAL files were generated in one minute. Is it plausible that this autovacuum could be responsible for this?

That's not a particularly surprising WAL creation rate for a busy database. I wouldn't expect autovacuum to cause it by itself, but that's true only because autovacuum processing is typically throttled by autovacuum_vacuum_cost_delay. Perhaps you had that set to zero?

regards, tom lane
On Jul 8, 2010, at 12:50 PM, Tom Lane wrote:

> Richard Yen <richyen@iparadigms.com> writes:
>> My concern is that--as in the original post--there were moments where 129 WAL files were generated in one minute. Is it plausible that this autovacuum could be responsible for this?
>
> That's not a particularly surprising WAL creation rate for a busy
> database. I wouldn't expect autovacuum to cause it by itself, but
> that's true only because autovacuum processing is typically throttled
> by autovacuum_vacuum_cost_delay. Perhaps you had that set to zero?

Ah, yes, autovacuum_vacuum_cost_delay = 0 in my config. That explains it--guess I'm playing with knives if I set things that way...

--Richard
On Jul 8, 2010, at 12:50 PM, Kevin Grittner wrote:

> Richard Yen <richyen@iparadigms.com> wrote:
>
>> there were moments where 129 WAL files were generated in one
>> minute. Is it plausible that this autovacuum could be responsible
>> for this?
>
> I don't remember seeing your autovacuum and vacuum config settings,
> or an answer to my question about whether there was a bulk load of a
> significant portion of current data. With aggressive autovacuum
> settings, hitting the freeze threshold for bulk-loaded data could do
> that.

Yeah, autovacuum is pretty aggressive, as I recall:

autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_max_workers = 8
autovacuum_naptime = 1min
autovacuum_vacuum_threshold = 400
autovacuum_analyze_threshold = 200
autovacuum_vacuum_scale_factor = 0.2
autovacuum_analyze_scale_factor = 0.1
autovacuum_freeze_max_age = 200000000
autovacuum_vacuum_cost_delay = 0
autovacuum_vacuum_cost_limit = -1
vacuum_cost_delay = 0
vacuum_cost_limit = 200

When you say "bulk-loaded," I suppose that also includes loading the data via slony ADD NODE as well--correct? I created this node maybe 6 months ago via slony ADD NODE.

Thanks much for your time,
--Richard
Richard Yen <richyen@iparadigms.com> wrote:

> When you say "bulk-loaded," I suppose that also includes loading
> the data via slony ADD NODE as well--correct?

I would think so. Anything which loads a lot of data in relatively few database transactions would qualify; I would think slony would do this, as it's generally required to get decent performance.

-Kevin
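[For the record, the manual freeze pass Kevin described earlier would look something like the following after a bulk load or slony ADD NODE; run in a quiet window, it spreads the freeze work out so a later anti-wraparound autovacuum doesn't have to rewrite everything at once. The table name is just an example taken from the logs above:

    -- whole database, during a maintenance window (8.4 syntax; VERBOSE is optional):
    VACUUM FREEZE VERBOSE ANALYZE;

    -- or one large table at a time, e.g.:
    VACUUM FREEZE VERBOSE ANALYZE public.gm3_clipboard;
]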
Richard Yen wrote:

> I figured that pg_xlog and data/base could both be on the FusionIO drive, since there would be no latency when there are no spindles.

(Rolls eyes) Please be careful about how much SSD Kool-Aid you drink, and be skeptical of vendor claims. They don't just make latency go away, particularly on heavy write workloads where the technology is at its weakest.

Also, random note, I'm seeing way too many FusionIO drive setups where people don't have any redundancy to cope with a drive failure, because the individual drives are so expensive they don't have more than one. Make sure that if you lose one of the drives, you won't have a massive data loss. Replication might help with that, if you can stand a little bit of data loss when the SSD dies. Not if--when. Even if you have a good one, they don't last forever.

> This means my pg_xlog partition should be (2 + checkpoint_completion_target) * checkpoint_segments + 1 = 41 files, or 656MB. Then, if there are more than 49 files, unneeded segment files will be deleted, but in this case all segment files are needed, so they never got deleted. Perhaps we should add in the docs that pg_xlog should be the size of the DB or larger?

Excessive write volume beyond the capacity of the hardware can end up delaying the normal checkpoint that would have cleaned up all the xlog files. There's a nasty spiral it can get into that I've seen a couple of times in similar form to what you reported. The pg_xlog should never exceed the size computed by that formula for very long, but it can burst above its normal size limits for a little bit. This is already mentioned as a possibility in the manual: "If, due to a short-term peak of log output rate, there are more than 3 * checkpoint_segments + 1 segment files, the unneeded segment files will be deleted instead of recycled until the system gets back under this limit."

Autovacuum is an easy way to get the sort of activity needed to cause this problem, but I don't know if it's a necessary component to see the problem. You have to be in an unusual situation before the sum of the xlog files is anywhere close to the size of the database, though.

--
Greg Smith   2ndQuadrant US   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
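[A small monitoring sketch in that spirit, again assuming a superuser session for pg_ls_dir(): compare the live segment count against the 3 * checkpoint_segments + 1 point where the server starts deleting rather than recycling, and alert well before the partition is actually at risk:

    SELECT count(*)                                             AS current_segments,
           count(*) * 16                                         AS current_mb,
           3 * current_setting('checkpoint_segments')::int + 1   AS delete_threshold_segments
    FROM pg_ls_dir('pg_xlog') AS f
    WHERE f ~ '^[0-9A-F]{24}$';
]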