Thread: Vacuum time degrading
We've been doing a vacuum every night. With about 254 million rows, it was taking about 0.8 hours. Now, a few months later, with about 456 million rows it is taking 4.8 hours. I did check the archives and didn't find anything like this.

Why is the vacuum time not going up linearly? Since vacuum has to sequentially read the entire database, I would have expected a linear increase - about 1.5 hours now. There are currently no deletes or modifies to the database - only inserts.

This is on PostgreSQL 7.4.5, RedHat ES 3.0.

Wes
Wes <wespvp@syntegra.com> writes:
> Why is the vacuum time not going up linearly?

I'm betting that the database is suffering from substantial bloat, requiring VACUUM to scan through lots of dead space that wasn't there before. Check your FSM settings (the tail end of the output from a full-database VACUUM VERBOSE command would give some info about what you need).

If you are suffering bloat, the fastest route to a solution would probably be to CLUSTER your larger tables. Although VACUUM FULL would work, it's likely to be very slow.

> There are currently no deletes or modifies to the database - only inserts.

You *certain* about that? It's hard to see how the vacuum time wouldn't be linear in table size if there's nothing to do and no dead space. Again, VACUUM VERBOSE info would be informative (it's sufficient to look at your larger tables for this).

			regards, tom lane
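As a rough sketch of the commands being discussed (the table and index names here are hypothetical, and the CLUSTER syntax shown is the 7.4 form):

    -- The tail of a database-wide VACUUM VERBOSE reports free space map usage
    -- versus what is actually needed.
    VACUUM VERBOSE;

    -- CLUSTER rewrites the table in the order of the named index, discarding
    -- dead space. It takes an exclusive lock and needs roughly the table's
    -- size in extra disk space while it runs.
    CLUSTER messages_pkey ON messages;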
On 2/28/05 6:53 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> If you are suffering bloat, the fastest route to a solution would
> probably be to CLUSTER your larger tables. Although VACUUM FULL
> would work, it's likely to be very slow.

How can there be bloat if there are no deletes or modifies? Even if there were deletes or modifies (there will be in about another year and a half), if a vacuum is being performed every night, how can there be bloat? The vacuum should release the dead space and it should be reused. Am I missing something?

>> There are currently no deletes or modifies to the database - only inserts.
>
> You *certain* about that? It's hard to see how the vacuum time wouldn't
> be linear in table size if there's nothing to do and no dead space.

Absolutely sure. The only case that would approach a delete is if a batch load fails and the transaction is rolled back. That very seldom happens.

Why am I running vacuum nightly if I have no deletes or updates, you ask? Two reasons: to have it in the cron schedule for when there are deletes (there will never be updates), and as a check on database integrity. If there is a database problem, vacuum at least has a chance of flagging it, since it reads the entire database. This was instituted after we had a couple of instances of corruption a while back that went undetected for too long. I'm also doing a weekly pg_dumpall as an additional check/fallback.

> Again, VACUUM VERBOSE info would be informative (it's sufficient to look
> at your larger tables for this).

I'll set that up to run tonight and see if it gives any clues. Last night, vacuum ran over 5 hours.

Wes
On 2/28/05 6:53 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> It's hard to see how the vacuum time wouldn't
> be linear in table size if there's nothing to do and no dead space.

I am doing 'vacuum analyze' rather than just 'vacuum'. Could that have anything to do with the non-linear behavior?

Wes
On 2/28/05 6:53 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> Again, VACUUM VERBOSE info would be informative (it's sufficient to look
> at your larger tables for this).

It took 5.2 hours again tonight to do the vacuum. I don't see anything out of the ordinary - no explanation for the non-linear increases in vacuum time. This is what shows up at the end:

INFO:  free space map: 93 relations, 282 pages stored; 1712 total pages needed
DETAIL:  Allocated FSM size: 1000 relations + 1000000 pages = 5920 kB shared memory.

The tables all show something like:

INFO:  "blah": found 0 removable, 366326534 nonremovable row versions in 3241829 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.

Wes
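For reference, the allocated FSM size in that DETAIL line is controlled by two postgresql.conf settings, which can also be checked from psql (a minimal sketch; the values shown correspond to the report above):

    SHOW max_fsm_relations;   -- 1000 relations in the DETAIL line above
    SHOW max_fsm_pages;       -- 1000000 pages in the DETAIL line above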
Wes <wespvp@syntegra.com> writes:
> It took 5.2 hours again tonight to do the vacuum. I don't see anything out
> of the ordinary - no explanation for the non-linear increases in vacuum time.

Would you post the complete VACUUM VERBOSE log? The CPU/elapsed time lines would help us identify where the time is going.

> This is what shows up at the end:
> INFO: free space map: 93 relations, 282 pages stored; 1712 total pages needed
> DETAIL: Allocated FSM size: 1000 relations + 1000000 pages = 5920 kB shared memory.

Well, you don't have a problem with FSM being too small anyway ;-)

			regards, tom lane
On 3/2/05 12:16 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> Would you post the complete VACUUM VERBOSE log? The CPU/elapsed time lines
> would help us identify where the time is going.

I'll send it to you directly - it's rather long.

>> DETAIL: Allocated FSM size: 1000 relations + 1000000 pages = 5920 kB shared
>> memory.
>
> Well, you don't have a problem with FSM being too small anyway ;-)

Nope... Preparation for when deletes start kicking in down the road. If I can only do a vacuum once a week, I've got to have lots of space.

Wes
On 3/2/05 12:16 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> Would you post the complete VACUUM VERBOSE log? The CPU/elapsed time lines
> would help us identify where the time is going.

Mailed. I do see stats like:

CPU 518.88s/25.17u sec elapsed 10825.33 sec.
CPU 884.96s/64.35u sec elapsed 13793.13 sec.
CPU 132.46s/6.66u sec elapsed 2435.42 sec.
CPU 49.25s/4.15u sec elapsed 414.71 sec.

This is a dual-CPU, hyperthreaded 2.4 GHz box (the hyperthreading probably makes little difference here) running RedHat ES 3.0. The database is on an 8-disk SCSI hardware RAID 5 with 10k rpm disks. pg_xlog is on a separate volume. I thought it was a 2.6 kernel, but it looks like it is 2.4.20.

I need to monitor the system when the vacuum is running to see if sar/top show anything. I wonder if it's hitting the kswapd thrashing problem?

Wes
Watching the system as vacuum is running, I can see that we are encountering the kswapd/kscand problem in the 2.4.20 kernel. This could very well account for the non-linear increase in vacuum time.

This problem is fixed in the 2.6 kernel, but we can't upgrade because Dell is dragging its feet in releasing hardware monitoring compatible with the 2.6 kernel. So we're going to try a 2.4.29 kernel and hope that the problem is fixed there. With any luck, by Friday I'll know whether the kswapd problem is fixed in 2.4.29 and whether that solves the excessive vacuum times.

Wes
Wes <wespvp@syntegra.com> writes:
> Watching the system as vacuum is running, I can see that we are encountering
> the kswapd/kscand problem in the 2.4.20 kernel. This could very well
> account for the non-linear increase in vacuum time.

Hmm. Looking at the vacuum verbose output you sent me, it's clear that the bulk of the time is going into scanning a couple of the larger indexes. On an index that's been growing for awhile, this involves a highly nonsequential access pattern (it wants to visit the index leaf pages in sort order, which will not look much like physical order after a lot of page splits have occurred). I don't know whether that would tend to set off the kswapd/kscand problems, but just in terms of physical I/O time it might be annoying.

I was going to suggest REINDEXing those indexes to see if that cuts the vacuum time at all.

			regards, tom lane
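A minimal sketch of that experiment, using hypothetical names ("messages" and "messages_big_idx" are placeholders, not objects from this thread); note that in 7.4 REINDEX takes an exclusive lock on the table, so it belongs in a maintenance window:

    -- Rebuild the fragmented index so its leaf pages come out in sorted,
    -- physically contiguous order.
    REINDEX INDEX messages_big_idx;

    -- Re-run the nightly vacuum on that table and compare the per-index
    -- "CPU ...s/...u sec elapsed ... sec" lines against the previous run.
    VACUUM VERBOSE ANALYZE messages;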
Wes Palmer <Wesley.R.Palmer@syntegra.com> writes:
> Any chance of changing that behavior to scan in physical storage order?

It wouldn't be easy --- there are some locking considerations that say btbulkdelete needs to scan the index in the same order that an ordinary scan would do. See the nbtree README for details.

			regards, tom lane
On 3/2/05 3:51 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> I was going to suggest
> REINDEXing those indexes to see if that cuts the vacuum time at all.

The problem with that is it takes a very long time. I've got a couple of things to try yet on the kswapd problem. If that doesn't work, maybe I can rebuild one of the indexes and see how much that one improves.

I wasn't aware that the indexes were scanned non-sequentially. The under-one-hour time was probably shortly after a full reload. Any chance of changing that behavior to scan in physical storage order?

The index on the largest table that shows:

CPU 216.15s/18.13u sec elapsed 2110.84 sec.

is inserted in sequential order. The index that shows:

CPU 518.88s/25.17u sec elapsed 10825.33 sec.

has records inserted in essentially a random order, and its key is also something like twice as large.

We're going to try to test the 2.4.29 kernel tomorrow.

Wes
Well, the good news is that the 2.4.29 kernel solved the kswapd problem. The bad news is that it didn't help the vacuum time. In fact, the vacuum time is now over 6 hours instead of 5 hours. Whether that is a direct result of the 2.4.29 kernel, or a coincidence, I don't know at this time.

I guess the next step is to try reindexing a couple of the big indexes and see if that helps.

Wes
On 3/2/05 10:50 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> It wouldn't be easy --- there are some locking considerations that say
> btbulkdelete needs to scan the index in the same order that an ordinary
> scan would do. See the nbtree README for details.

Just a follow-up on this. The vacuum time has been steadily increasing, at a seemingly increasing rate, although there are no deletes or updates to the database. The current DB size is just over 500 million rows. Last week it was up to 6.84 hours to do a vacuum.

Over the weekend I reindexed all the major indexes; the two largest took about 10 hours to reindex. After the reindexing, the vacuum took only 1.44 hours. This is pretty much a linear scaling from the original vacuum time I reported.

So the increasing vacuum times would appear to be, as Tom suggested, due to the fact that vacuum processes indexes in index order, not physical disk order. I guess we add a periodic reindex to our maintenance procedures...

Wes
Wes <wespvp@syntegra.com> writes:
> On 3/2/05 10:50 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>> It wouldn't be easy --- there are some locking considerations that say
>> btbulkdelete needs to scan the index in the same order that an ordinary
>> scan would do. See the nbtree README for details.

> Just a follow-up on this. The vacuum time has been steadily increasing, at a
> seemingly increasing rate, although there are no deletes or updates to the
> database. The current DB size is just over 500 million rows. Last week it
> was up to 6.84 hours to do a vacuum.

> Over the weekend I reindexed all the major indexes; the two largest took
> about 10 hours to reindex. After the reindexing, the vacuum took only 1.44
> hours. This is pretty much a linear scaling from the original vacuum time I
> reported.

> So the increasing vacuum times would appear to be, as Tom suggested, due to
> the fact that vacuum processes indexes in index order, not physical disk
> order. I guess we add a periodic reindex to our maintenance procedures...

That doesn't follow from what you said. Did you check that the physical sizes of the indexes were comparable before and after the reindex?

			regards, tom lane
On 4/4/05 8:50 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> That doesn't follow from what you said. Did you check that the physical
> sizes of the indexes were comparable before and after the reindex?

No, how do I do that (or where is it documented how to do it)?

How is it not consistent? I believe you suggested the reindex. The initial timing was shortly after a database reload. The index would have been built in sorted order, correct? This was the 1-hour time. After a period of months, the index values have been inserted in random order. The DB size is up 50% but the vacuum time is up a factor of 6+. Presumably the index is being read by moving the heads all over the place.

I reindex, and the index is rebuilt in sorted order. Vacuum is now down to 1.5 hours - a linear scaling from the original numbers. The locality of reference in reading the indexes in order should be much better.

Wes
Wes <wespvp@syntegra.com> writes:
> On 4/4/05 8:50 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>> That doesn't follow from what you said. Did you check that the physical
>> sizes of the indexes were comparable before and after the reindex?

> No, how do I do that (or where is it documented how to do it)?

The best way is probably to capture the output of VACUUM VERBOSE (not FULL) for the table before and after REINDEX.

> How is it not consistent?

I didn't say it wasn't consistent, just that it doesn't prove the point. The speedup you saw could have been from elimination of index bloat more than from bringing the index into physically sorted order. An estimate of the overall database size doesn't really tell us how much this particular table's indexes changed in size.

			regards, tom lane
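Another way to compare before/after index sizes, sketched here as an assumption rather than something from the thread: pg_class.relpages holds the page count recorded by the last VACUUM or ANALYZE (8 kB pages in a default build), so a quick query before and after the REINDEX shows how much each index shrank:

    -- Largest indexes by recorded page count; run before and after the REINDEX.
    SELECT relname, relpages
    FROM pg_class
    WHERE relkind = 'i'
    ORDER BY relpages DESC
    LIMIT 10;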
On 4/5/05 11:15 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> I didn't say it wasn't consistent, just that it doesn't prove the
> point. The speedup you saw could have been from elimination of index
> bloat more than from bringing the index into physically sorted order.
> An estimate of the overall database size doesn't really tell us how
> much this particular table's indexes changed in size.

Ok, now I follow. Taking the biggest indexes:

The weekend before:

INFO:  index "message_recipients_i_recip_date" now contains 393961361 row versions in 2435100 pages
INFO:  index "message_recipients_i_message" now contains 393934394 row versions in 1499853 pages

After reindex:

INFO:  index "message_recipients_i_recip_date" now contains 401798357 row versions in 1765613 pages
INFO:  index "message_recipients_i_message" now contains 401787237 row versions in 1322974 pages

Wes
Wes <wespvp@syntegra.com> writes:
> Ok, now I follow. Taking the biggest indexes:

> The weekend before:
> INFO: index "message_recipients_i_recip_date" now contains 393961361 row
> versions in 2435100 pages
> INFO: index "message_recipients_i_message" now contains 393934394 row
> versions in 1499853 pages

> After reindex:
> INFO: index "message_recipients_i_recip_date" now contains 401798357 row
> versions in 1765613 pages
> INFO: index "message_recipients_i_message" now contains 401787237 row
> versions in 1322974 pages

OK, that's certainly not a factor-of-four difference in size, so I'm now convinced you're right: bringing the index into physical order is having a big impact on the runtime.

From a development standpoint, that suggests a couple of TODO items:

* Look harder at whether VACUUM can scan the index in physical instead of logical order.
* See whether ordinary btree maintenance (ie page splits) can do anything to maintain/improve the physical ordering of the index.

			regards, tom lane