Thread: VACUUMs take twice as long across all nodes
Hullo, here's one of those dreadful touchy-feely hand-waving problems. Our 5-node 8.1.3 Slony system has just started taking /much/ longer to VACUUM ANALYZE. The data set has not increased more than usual (nightly backups stand at 1.3GB, growing by 10MB per day), and no configuration has changed on the machines.

Nodes 2 and 3 take only the tables necessary to run our search (10 out of the full 130) and are much lighter (only 7GB on disk cf. 30GB for the full master), yet the nightly VACUUM FULL has jumped from 2 hours to 4 in the space of one day! Like I say, no config changes, no reboots / postmaster restarts, no extra processes, and every machine has a comfortable overhead of free page slots + relations.

From a few days ago:

2006-10-20 03:04:29 UTC INFO: "Allocation": found 786856 removable, 4933448 nonremovable row versions in 53461 pages
2006-10-20 03:04:29 UTC DETAIL: 0 dead row versions cannot be removed yet.
2006-10-20 03:07:32 UTC INFO: index "allocation_pkey" now contains 4933448 row versions in 93918 pages
2006-10-20 03:07:32 UTC DETAIL: 786856 index row versions were removed.
2006-10-20 03:14:21 UTC INFO: index "ix_date" now contains 4933448 row versions in 74455 pages
2006-10-20 03:14:21 UTC DETAIL: 786856 index row versions were removed.
2006-10-20 03:22:32 UTC INFO: index "ix_dateprice" now contains 4933448 row versions in 81313 pages
2006-10-20 03:22:32 UTC DETAIL: 786856 index row versions were removed.
2006-10-20 03:24:41 UTC INFO: index "ix_dateroom" now contains 4933448 row versions in 44610 pages
2006-10-20 03:24:41 UTC DETAIL: 786856 index row versions were removed.
2006-10-20 03:27:52 UTC INFO: index "ix_room" now contains 4933448 row versions in 35415 pages
2006-10-20 03:27:52 UTC DETAIL: 786856 index row versions were removed.
2006-10-20 03:31:43 UTC INFO: "Allocation": moved 348324 row versions, truncated 53461 to 46107 pages
2006-10-20 03:31:43 UTC DETAIL: CPU 4.72s/17.63u sec elapsed 230.81 sec.

From last night:

2006-10-26 01:00:30 UTC INFO: vacuuming "public.Allocation"
2006-10-26 01:00:36 UTC INFO: "Allocation": found 774057 removable, 4979938 nonremovable row versions in 53777 pages
2006-10-26 01:00:36 UTC DETAIL: 0 dead row versions cannot be removed yet.
2006-10-26 01:06:18 UTC INFO: index "allocation_pkey" now contains 4979938 row versions in 100800 pages
2006-10-26 01:06:18 UTC DETAIL: 774057 index row versions were removed.
2006-10-26 01:19:22 UTC INFO: index "ix_date" now contains 4979938 row versions in 81630 pages
2006-10-26 01:19:22 UTC DETAIL: 774057 index row versions were removed.
2006-10-26 01:35:17 UTC INFO: index "ix_dateprice" now contains 4979938 row versions in 87750 pages
2006-10-26 01:35:17 UTC DETAIL: 774057 index row versions were removed.
2006-10-26 01:41:27 UTC INFO: index "ix_dateroom" now contains 4979938 row versions in 46320 pages
2006-10-26 01:41:27 UTC DETAIL: 774057 index row versions were removed.
2006-10-26 01:48:18 UTC INFO: index "ix_room" now contains 4979938 row versions in 36513 pages
2006-10-26 01:48:18 UTC DETAIL: 774057 index row versions were removed.
2006-10-26 01:56:35 UTC INFO: "Allocation": moved 322744 row versions, truncated 53777 to 46542 pages
2006-10-26 01:56:35 UTC DETAIL: CPU 4.21s/15.90u sec elapsed 496.30 sec.

As you can see, the amount of system + user time for these runs is comparable, but the amount of real time has more than doubled.
This isn't even a case for making the cost-based vacuum delay more aggressive, because I already have vacuum_cost_delay = 0 on all machines to make the vacuum run as quickly as possible.

Any ideas warmly received! :)

Cheers,
Gavin.
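(For reference, the setting in question can be inspected and overridden per session - a minimal sketch, not specific to this installation:)

SHOW vacuum_cost_delay;    -- 0 means the cost-based delay is disabled
SET vacuum_cost_delay = 0; -- session-level override; postgresql.conf holds the server-wide default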
Gavin Hamill <gdh@laterooms.com> writes:
> Nodes 2 and 3 take only the tables necessary to run our search (10 out of the full 130) and are much lighter (only 7GB on disk cf. 30GB for the full master), yet the nightly VACUUM FULL has jumped from 2 hours to 4 in the space of one day!

I guess the most useful question to ask is "why are you doing VACUUM FULL?" Plain VACUUM should be considerably faster, and for the level of row turnover shown by your log, there doesn't seem to be a reason to use FULL.

regards, tom lane
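To make the suggestion concrete, the two forms compare roughly like this (a sketch only; the table name is taken from the log excerpts above):

-- Plain VACUUM: marks dead space reusable via the free space map and takes only a lightweight lock
VACUUM VERBOSE ANALYZE "Allocation";

-- VACUUM FULL: physically compacts and shrinks the table, but takes an exclusive lock and is far slower
VACUUM FULL VERBOSE ANALYZE "Allocation";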
On Thu, 26 Oct 2006 10:47:21 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Gavin Hamill <gdh@laterooms.com> writes:
> > Nodes 2 and 3 take only the tables necessary to run our search (10 out of the full 130) and are much lighter (only 7GB on disk cf. 30GB for the full master), yet the nightly VACUUM FULL has jumped from 2 hours to 4 in the space of one day!
>
> I guess the most useful question to ask is "why are you doing VACUUM FULL?" Plain VACUUM should be considerably faster, and for the level of row turnover shown by your log, there doesn't seem to be a reason to use FULL.

I do FULL on the 'light' clients simply because 'I can'. The example posted was a poor choice - the other tables have a larger churn.

Anyway, once it starts, the load balancer takes it out of rotation so no love is lost.

The same behaviour is shown on the 'heavy' clients (master + 2 slaves) which take all tables - although I cannot afford to VACUUM FULL on there, the usual VACUUM ANALYZE has begun to take vastly more time since yesterday than in the many previous months we've been using pg.

Cheers,
Gavin.
On Thu, Oct 26, 2006 at 04:06:09PM +0100, Gavin Hamill wrote:
> On Thu, 26 Oct 2006 10:47:21 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Gavin Hamill <gdh@laterooms.com> writes:
> > > Nodes 2 and 3 take only the tables necessary to run our search (10 out of the full 130) and are much lighter (only 7GB on disk cf. 30GB for the full master), yet the nightly VACUUM FULL has jumped from 2 hours to 4 in the space of one day!
> >
> > I guess the most useful question to ask is "why are you doing VACUUM FULL?" Plain VACUUM should be considerably faster, and for the level of row turnover shown by your log, there doesn't seem to be a reason to use FULL.
>
> I do FULL on the 'light' clients simply because 'I can'. The example posted was a poor choice - the other tables have a larger churn.
>
> Anyway, once it starts, the load balancer takes it out of rotation so no love is lost.
>
> The same behaviour is shown on the 'heavy' clients (master + 2 slaves) which take all tables - although I cannot afford to VACUUM FULL on there, the usual VACUUM ANALYZE has begun to take vastly more time since yesterday than in the many previous months we've been using pg.

Are you sure that there's nothing else happening on the machine that could affect the vacuum times? Like, say a backup? Or perhaps updates coming in from Slony that didn't used to be there?
--
Jim Nasby    jim@nasby.net
EnterpriseDB    http://enterprisedb.com    512.569.9461 (cell)
On Thu, 26 Oct 2006 14:17:29 -0500 "Jim C. Nasby" <jim@nasby.net> wrote:
> Are you sure that there's nothing else happening on the machine that could affect the vacuum times? Like, say a backup? Or perhaps updates coming in from Slony that didn't used to be there?

I'm absolutely certain. The backups run from only one slave, given that it is a full copy of node 1. Our overnight traffic has not increased any, and the nightly backups show that the overall size of the DB has not increased more than usual growth.

Plus, I have fairly verbose logging, and it's not showing anything out of the ordinary.

Like I said, it's one of those awful hypothesis/hand-waving problems :)

Cheers,
Gavin.
On Thu, Oct 26, 2006 at 09:35:56PM +0100, Gavin Hamill wrote:
> On Thu, 26 Oct 2006 14:17:29 -0500 "Jim C. Nasby" <jim@nasby.net> wrote:
> > Are you sure that there's nothing else happening on the machine that could affect the vacuum times? Like, say a backup? Or perhaps updates coming in from Slony that didn't used to be there?
>
> I'm absolutely certain. The backups run from only one slave, given that it is a full copy of node 1. Our overnight traffic has not increased any, and the nightly backups show that the overall size of the DB has not increased more than usual growth.
>
> Plus, I have fairly verbose logging, and it's not showing anything out of the ordinary.
>
> Like I said, it's one of those awful hypothesis/hand-waving problems :)

Well, the fact that it's happening on all your nodes leads me to think Slony is somehow involved. Perhaps it suddenly decided to change how often it's issuing syncs? I know it issues vacuums as well, so maybe that's got something to do with it... (though I'm guessing you've already looked in pg_stat_activity/logs to see if anything correlates...)

Still, it might be worth asking about this on the slony list...
--
Jim Nasby    jim@nasby.net
EnterpriseDB    http://enterprisedb.com    512.569.9461 (cell)
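For that kind of check, a query along these lines run while the vacuum is underway will show any other sessions active at the same time (a sketch; column names are as of 8.1, and current_query is only populated when stats_command_string is enabled in postgresql.conf):

SELECT procpid, usename, query_start, current_query
FROM pg_stat_activity
ORDER BY query_start;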
On Thu, Oct 26, 2006 at 09:35:56PM +0100, Gavin Hamill wrote:
>
> I'm absolutely certain. The backups run from only one slave, given that it is a full copy of node 1. Our overnight traffic has not increased any, and the nightly backups show that the overall size of the DB has not increased more than usual growth.

A couple things from your posts:

1. Don't do VACUUM FULL, please. It takes longer, and blocks other things while it's going on, which might mean you're having table bloat in various slony-related tables.

2. Are your slony logs showing increased time too? Are your targets getting further behind?

3. Your backups "from the slave" aren't done with pg_dump, right?

But I suspect Slony has a role here, too. I'd look carefully at the slony tables -- especially the sl_log and pg_listen things, which both are implicated.

A
--
Andrew Sullivan | ajs@crankycanuck.ca
"The year's penultimate month" is not in truth a good way of saying November. --H.W. Fowler
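A quick way to see whether those tables have bloated is to compare their page counts over time (a sketch; sl_log_1 and sl_log_2 are assumptions about the Slony-I version in use, and pg_listener is the catalog behind the LISTEN/NOTIFY traffic Slony generates):

SELECT relname, relpages, reltuples
FROM pg_class
WHERE relname IN ('pg_listener', 'sl_log_1', 'sl_log_2')
ORDER BY relpages DESC;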
On Thu, 26 Oct 2006 18:09:37 -0400 Andrew Sullivan <ajs@crankycanuck.ca> wrote:
> On Thu, Oct 26, 2006 at 09:35:56PM +0100, Gavin Hamill wrote:
> >
> > I'm absolutely certain. The backups run from only one slave, given that it is a full copy of node 1. Our overnight traffic has not increased any, and the nightly backups show that the overall size of the DB has not increased more than usual growth.
>
> A couple things from your posts:
>
> 1. Don't do VACUUM FULL, please. It takes longer, and blocks other things while it's going on, which might mean you're having table bloat in various slony-related tables.

I know it takes longer, I know it blocks. It's never been a problem.

> 2. Are your slony logs showing increased time too? Are your targets getting further behind?

Nope, the slaves are keeping up just great - once the vacuums are finished, all machines are running at about 50%-75% of full load in duty.

> 3. Your backups "from the slave" aren't done with pg_dump, right?

Em, they are indeed. I assumed that MVCC would ensure I got a consistent snapshot from the instant when pg_dump began. Am I wrong?

> But I suspect Slony has a role here, too. I'd look carefully at the slony tables -- especially the sl_log and pg_listen things, which both are implicated.

Slony is an easy target to point the finger at, so I tried a little test. I took one of the 'light' slaves (only 10 tables..), stopped its slon daemon, removed it from the load-balancer, and restarted postgres so there were no active connections.

With the removal of both replication overhead and normal queries from clients, the machine should be completely clear to run at full tilt. Then I launched a 'vacuum verbose' and I was able to see exactly the same poor speeds as before, even with vacuum_cost_delay = 0 as it was previously...

2006-10-27 08:37:12 UTC INFO: vacuuming "public.Allocation"
2006-10-27 08:37:21 UTC INFO: "Allocation": found 56449 removable, 4989360 nonremovable row versions in 47158 pages
2006-10-27 08:37:21 UTC DETAIL: 0 dead row versions cannot be removed yet.
Nonremovable row versions range from 64 to 72 bytes long.
There were 1 unused item pointers.
Total free space (including removable row versions) is 5960056 bytes.
13 pages are or will become empty, including 0 at the end of the table.
5258 pages containing 4282736 free bytes are potential move destinations.
CPU 0.16s/0.07u sec elapsed 9.55 sec.
2006-10-27 08:44:25 UTC INFO: index "allocation_pkey" now contains 4989360 row versions in 102198 pages
2006-10-27 08:44:25 UTC DETAIL: 56449 index row versions were removed.
1371 index pages have been deleted, 1371 are currently reusable.
CPU 1.02s/0.38u sec elapsed 423.22 sec.

If I've read this correctly, then on an otherwise idle system, it has taken seven minutes to perform 1.4 seconds-worth of actual work. Surely that's nonsense?

That would suggest that the issue is poor IO; "vmstat 5" output during this run showed nothing impressive - maybe averaging 3MB/sec in and out. I know the peak IO on this machine is rather much better than that:

joltpg2:/root# dd if=/dev/zero of=/tmp/t bs=1024k count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 8.02106 seconds, 131 MB/s

The test "system" is one CPU's-worth (two cores) of a 4 x Opteron 880 machine split up by Xen, and I can confirm the IO on the other Xen partitions was minimal.

I appreciate the time, help and advice people are offering, however I really don't think Slony is the culprit here.

Cheers,
Gavin.
Gavin Hamill <gdh@laterooms.com> writes:
> 2006-10-27 08:37:12 UTC INFO: vacuuming "public.Allocation"
> 2006-10-27 08:37:21 UTC INFO: "Allocation": found 56449 removable, 4989360 nonremovable row versions in 47158 pages
> 2006-10-27 08:37:21 UTC DETAIL: 0 dead row versions cannot be removed yet.
> Nonremovable row versions range from 64 to 72 bytes long.
> There were 1 unused item pointers.
> Total free space (including removable row versions) is 5960056 bytes.
> 13 pages are or will become empty, including 0 at the end of the table.
> 5258 pages containing 4282736 free bytes are potential move destinations.
> CPU 0.16s/0.07u sec elapsed 9.55 sec.
> 2006-10-27 08:44:25 UTC INFO: index "allocation_pkey" now contains 4989360 row versions in 102198 pages
> 2006-10-27 08:44:25 UTC DETAIL: 56449 index row versions were removed.
> 1371 index pages have been deleted, 1371 are currently reusable.
> CPU 1.02s/0.38u sec elapsed 423.22 sec.

So the time is all in index vacuuming, eh? I think what's happening is that the physical order of the index is degrading over time, and so the vacuum scan takes longer due to more seeking. Can you afford to do a REINDEX? If this theory is correct that should drive the time back down.

8.2 has rewritten btree index vacuuming code that scans the index in physical not logical order, so this problem should largely go away in 8.2, but in existing releases I can't see much you can do about it except REINDEX when things get slow.

regards, tom lane
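For the table in question, the suggestion amounts to something like the following (a sketch using the names from the log above; note that REINDEX holds an exclusive lock on the table while it rebuilds):

-- rebuild a single index
REINDEX INDEX allocation_pkey;

-- or rebuild every index on the table in one statement
REINDEX TABLE "Allocation";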
On Fri, 27 Oct 2006 14:07:43 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So the time is all in index vacuuming, eh? I think what's happening is that the physical order of the index is degrading over time, and so the vacuum scan takes longer due to more seeking. Can you afford to do a REINDEX? If this theory is correct that should drive the time back down.

Tom,

You wonderful, wonderful man. I tried a test reindex on "Allocation", and noticed a vacuum had turbo-charged... then reindexed the whole db, did a vacuum, and lo! The whole db had turbo-charged :)

When I say 'turbo-charged', I mean it. The vacuum times have dropped to 20% of what we were seeing even before it 'got much slower a couple of days ago.'

It sucks that the new reindex code is only in 8.2, but now that I know this is an issue in 8.1 I can plan for it.

Thanks so much :)

Cheers,
Gavin.
Ok, I see Tom has diagnosed your problem. Here are more hints anyway:

On Fri, Oct 27, 2006 at 10:20:25AM +0100, Gavin Hamill wrote:
> > table bloat in various slony-related tables.
>
> I know it takes longer, I know it blocks. It's never been a problem

The problem from a VACUUM FULL is that its taking longer causes the vacuums on (especially) pg_listen and sl_log_[n] to be unable to recover as many rows (because there's an older transaction around). This is a significant area of vulnerability in Slony. You really have to readjust your vacuum assumptions when using Slony.

> > 3. Your backups "from the slave" aren't done with pg_dump, right?
>
> Em, they are indeed. I assumed that MVCC would ensure I got a consistent snapshot from the instant when pg_dump began. Am I wrong?

That's not the problem. The problem is that when you restore the dump of the slave, you'll have garbage. Slony fools with the catalogs on the replicas. This is documented in the Slony docs, but probably not in sufficiently large-type bold italics in red with the <blink> tag set as would be appropriate for such a huge gotcha. Anyway, don't use pg_dump on a replica. There's a tool that comes with slony that will allow you to take consistent, restorable dumps from replicas if you like. (And you might as well throw away the dumpfiles from the replicas that you have. They won't work when you restore them.)

A
--
Andrew Sullivan | ajs@crankycanuck.ca
"The year's penultimate month" is not in truth a good way of saying November. --H.W. Fowler
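In practice, "readjusting your vacuum assumptions" usually means vacuuming the high-churn Slony bookkeeping tables far more often than the application tables (a sketch; the _replication schema name is the one mentioned later in this thread, and sl_log_1 is an assumption about the Slony-I version in use):

VACUUM ANALYZE pg_catalog.pg_listener;
VACUUM ANALYZE _replication.sl_log_1;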
Hi :)

[pg_dump from a Slony replica]

> That's not the problem. The problem is that when you restore the dump of the slave, you'll have garbage. Slony fools with the catalogs on the replicas.
> (And you might as well throw away the dumpfiles from the replicas that you have. They won't work when you restore them.)

This is interesting, but I don't understand.. We've done a full restore from one of these pg_dump backups before now and it worked just great.

Sure I had to DROP SCHEMA _replication CASCADE to clear out all the slony-specific triggers etc., but the new-master ran fine, as did firing up new replication to the other nodes :)

Was I just lucky?

Cheers,
Gavin.
On Sun, Oct 29, 2006 at 03:08:26PM +0000, Gavin Hamill wrote:
>
> This is interesting, but I don't understand.. We've done a full restore from one of these pg_dump backups before now and it worked just great.
>
> Sure I had to DROP SCHEMA _replication CASCADE to clear out all the slony-specific triggers etc., but the new-master ran fine, as did firing up new replication to the other nodes :)
>
> Was I just lucky?

Yes. Slony alters data in the system catalog for a number of database objects on the replicas. It does this in order to prevent, for example, triggers from firing both on the origin and the replica. (That is the one that usually bites people hardest, but IIRC it's not the only such hack in there.) This was a bit of a dirty hack that was supposed to be cleaned up, but that hasn't been yet. In general, you can't rely on a pg_dump of a replica giving you a dump that, when restored, actually works.

A
--
Andrew Sullivan | ajs@crankycanuck.ca
Everything that happens in the world happens at some place. --Jane Jacobs
On Sunday, 29.10.2006, 10:34 -0500, Andrew Sullivan wrote:
> On Sun, Oct 29, 2006 at 03:08:26PM +0000, Gavin Hamill wrote:
> >
> > This is interesting, but I don't understand.. We've done a full restore from one of these pg_dump backups before now and it worked just great.
> >
> > Sure I had to DROP SCHEMA _replication CASCADE to clear out all the slony-specific triggers etc., but the new-master ran fine, as did firing up new replication to the other nodes :)
> >
> > Was I just lucky?
>
> Yes. Slony alters data in the system catalog for a number of database objects on the replicas. It does this in order to prevent, for example, triggers from firing both on the origin and the replica. (That is the one that usually bites people hardest, but IIRC it's not the only such hack in there.) This was a bit of a dirty hack that was supposed to be cleaned up, but that hasn't been yet. In general, you can't rely on a pg_dump of a replica giving you a dump that, when restored, actually works.

Actually, you need to get the schema from the master node, and can take the data from a slave. In mixing dumps like that, you must realize that there are two separate parts in the schema dump: "table definitions" and "constraints". To get a restorable backup you need to put the table definitions stuff before your data, and the constraints after the data copy.

Andreas
On Sun, Oct 29, 2006 at 05:24:33PM +0100, Andreas Kostyrka wrote:
> Actually, you need to get the schema from the master node, and can take the data from a slave. In mixing dumps like that, you must realize that there are two separate parts in the schema dump: "table definitions" and "constraints". To get a restorable backup you need to put the table definitions stuff before your data, and the constraints after the data copy.

This will work, yes, but you don't get a real point-in-time dump this way. (In any case, we're off the -performance charter now, so if anyone wants to pursue this, I urge you to take it to the Slony list.)

A
--
Andrew Sullivan | ajs@crankycanuck.ca
Windows is a platform without soap, where rats run around in open sewers. --Daniel Eran
On Sunday, 29.10.2006, 11:43 -0500, Andrew Sullivan wrote:
> On Sun, Oct 29, 2006 at 05:24:33PM +0100, Andreas Kostyrka wrote:
> > Actually, you need to get the schema from the master node, and can take the data from a slave. In mixing dumps like that, you must realize that there are two separate parts in the schema dump: "table definitions" and "constraints". To get a restorable backup you need to put the table definitions stuff before your data, and the constraints after the data copy.
>
> This will work, yes, but you don't get a real point-in-time dump this

But one does, because one can dump all data in one pg_dump call. And with slony enabled, schema changes won't happen by mistake - they tend to be a thing for the Slony High Priest, nothing for mere developers ;)

Andreas
On Oct 27, 2006, at 2:07 PM, Tom Lane wrote:
> 8.2, but in existing releases I can't see much you can do about it except REINDEX when things get slow.

This will be so nice for me. I have one huge table with a massive amount of churn and bulk deletes. I have to reindex it once every other month. It takes about 60 to 75 minutes per index (times two indexes), else I'd do it monthly. It shaves nearly 1/3 of the relpages off of the index size.