Thread: VACUUMs take twice as long across all nodes

VACUUMs take twice as long across all nodes

From
Gavin Hamill
Date:
Hullo, here's one of those dreadful touchy-feely hand-waving problems.
Our 5-node 8.1.3 Slony system has just started taking /much/ longer to
VACUUM ANALYZE..

The data set has not increased more than usual (nightly backups stand
at 1.3GB, growing by 10MB per day), and no configuration has changed on
the machines.

Nodes 2 and 3 take only the tables necessary to run our search (10 out
of the full 130) and are much lighter (only 7GB on disk cf. 30GB for
the full master), yet the nightly VACUUM FULL has jumped from 2 hours
to 4 in the space of one day!

Like I say, no config changes, no reboots / postmaster restarts, no extra processes, and every machine has a
comfortable overhead of free page slots + relations.

From a few days ago:
2006-10-20 03:04:29 UTC INFO:  "Allocation": found 786856 removable, 4933448 nonremovable row versions in 53461 pages
2006-10-20 03:04:29 UTC DETAIL:  0 dead row versions cannot be removed yet.
2006-10-20 03:07:32 UTC INFO:  index "allocation_pkey" now contains 4933448 row versions in 93918 pages
2006-10-20 03:07:32 UTC DETAIL:  786856 index row versions were removed.
2006-10-20 03:14:21 UTC INFO:  index "ix_date" now contains 4933448 row versions in 74455 pages
2006-10-20 03:14:21 UTC DETAIL:  786856 index row versions were removed.
2006-10-20 03:22:32 UTC INFO:  index "ix_dateprice" now contains 4933448 row versions in 81313 pages
2006-10-20 03:22:32 UTC DETAIL:  786856 index row versions were removed.
2006-10-20 03:24:41 UTC INFO:  index "ix_dateroom" now contains 4933448 row versions in 44610 pages
2006-10-20 03:24:41 UTC DETAIL:  786856 index row versions were removed.
2006-10-20 03:27:52 UTC INFO:  index "ix_room" now contains 4933448 row versions in 35415 pages
2006-10-20 03:27:52 UTC DETAIL:  786856 index row versions were removed.
2006-10-20 03:31:43 UTC INFO:  "Allocation": moved 348324 row versions, truncated 53461 to 46107 pages
2006-10-20 03:31:43 UTC DETAIL:  CPU 4.72s/17.63u sec elapsed 230.81 sec.

From last night:
2006-10-26 01:00:30 UTC INFO:  vacuuming "public.Allocation"
2006-10-26 01:00:36 UTC INFO:  "Allocation": found 774057 removable, 4979938 nonremovable row versions in 53777 pages
2006-10-26 01:00:36 UTC DETAIL:  0 dead row versions cannot be removed yet.
2006-10-26 01:06:18 UTC INFO:  index "allocation_pkey" now contains 4979938 row versions in 100800 pages
2006-10-26 01:06:18 UTC DETAIL:  774057 index row versions were removed.
2006-10-26 01:19:22 UTC INFO:  index "ix_date" now contains 4979938 row versions in 81630 pages
2006-10-26 01:19:22 UTC DETAIL:  774057 index row versions were removed.
2006-10-26 01:35:17 UTC INFO:  index "ix_dateprice" now contains 4979938 row versions in 87750 pages
2006-10-26 01:35:17 UTC DETAIL:  774057 index row versions were removed.
2006-10-26 01:41:27 UTC INFO:  index "ix_dateroom" now contains 4979938 row versions in 46320 pages
2006-10-26 01:41:27 UTC DETAIL:  774057 index row versions were removed.
2006-10-26 01:48:18 UTC INFO:  index "ix_room" now contains 4979938 row versions in 36513 pages
2006-10-26 01:48:18 UTC DETAIL:  774057 index row versions were removed.
2006-10-26 01:56:35 UTC INFO:  "Allocation": moved 322744 row versions, truncated 53777 to 46542 pages
2006-10-26 01:56:35 UTC DETAIL:  CPU 4.21s/15.90u sec elapsed 496.30 sec.

As you can see, the system + user CPU time for these runs is comparable, but the real (elapsed) time has more
than doubled.

This isn't even a case where tuning the cost-based vacuum delay would help, because I already have vacuum_cost_delay
= 0 on all machines to make the vacuum run as quickly as possible.
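
For reference, the relevant settings are quick to confirm from psql - a minimal sketch:

    -- 0 means the cost-based delay is disabled and vacuum runs flat out
    SHOW vacuum_cost_delay;
    SHOW vacuum_cost_limit;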

Any ideas warmly received! :)

Cheers,
Gavin.


Re: VACUUMs take twice as long across all nodes

From
Tom Lane
Date:
Gavin Hamill <gdh@laterooms.com> writes:
> Nodes 2 and 3 take only the tables necessary to run our search (10 out
> of the full 130) and are much lighter (only 7GB on disk cf. 30GB for
> the full master) , yet the nightly VACUUM FULL has jumped from 2 hours
> to 4 in the space of one day!

I guess the most useful question to ask is "why are you doing VACUUM FULL?"
Plain VACUUM should be considerably faster, and for the level of row
turnover shown by your log, there doesn't seem to be a reason to use FULL.
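
For illustration only, using the table from your log, the two forms would be something like:

    -- plain vacuum: marks dead rows reusable, no exclusive lock on the table
    VACUUM VERBOSE ANALYZE "Allocation";

    -- full vacuum: compacts the table, but holds an exclusive lock for the duration
    VACUUM FULL VERBOSE "Allocation";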

            regards, tom lane

Re: VACUUMs take twice as long across all nodes

From
Gavin Hamill
Date:
On Thu, 26 Oct 2006 10:47:21 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Gavin Hamill <gdh@laterooms.com> writes:
> > Nodes 2 and 3 take only the tables necessary to run our search (10
> > out of the full 130) and are much lighter (only 7GB on disk cf.
> > 30GB for the full master) , yet the nightly VACUUM FULL has jumped
> > from 2 hours to 4 in the space of one day!
>
> I guess the most useful question to ask is "why are you doing VACUUM
> FULL?" Plain VACUUM should be considerably faster, and for the level
> of row turnover shown by your log, there doesn't seem to be a reason
> to use FULL.

I do FULL on the 'light' clients simply because 'I can'. The example
posted was a poor choice - the other tables have a larger churn.

Anyway, once it starts, the load balancer takes it out of rotation so
no love is lost.

The same behaviour is shown on the 'heavy' clients (master + 2 slaves)
which take all tables - although I cannot afford to VACUUM FULL on
there, the usual VACUUM ANALYZE has begun to take vastly more time
since yesterday than in the many previous months we've been using pg.

Cheers,
Gavin.

Re: VACUUMs take twice as long across all nodes

From
"Jim C. Nasby"
Date:
On Thu, Oct 26, 2006 at 04:06:09PM +0100, Gavin Hamill wrote:
> On Thu, 26 Oct 2006 10:47:21 -0400
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > Gavin Hamill <gdh@laterooms.com> writes:
> > > Nodes 2 and 3 take only the tables necessary to run our search (10
> > > out of the full 130) and are much lighter (only 7GB on disk cf.
> > > 30GB for the full master) , yet the nightly VACUUM FULL has jumped
> > > from 2 hours to 4 in the space of one day!
> >
> > I guess the most useful question to ask is "why are you doing VACUUM
> > FULL?" Plain VACUUM should be considerably faster, and for the level
> > of row turnover shown by your log, there doesn't seem to be a reason
> > to use FULL.
>
> I do FULL on the 'light' clients simply because 'I can'. The example
> posted was a poor choice - the other tables have a larger churn.
>
> Anyway, once it starts, the load balancer takes it out of rotation so
> no love is lost.
>
> The same behaviour is shown on the 'heavy' clients (master + 2 slaves)
> which take all tables - although I cannot afford to VACUUM FULL on
> there, the usual VACUUM ANALYZE has begun to take vastly more time
> since yesterday than in the many previous months we've been using pg.

Are you sure that there's nothing else happening on the machine that
could affect the vacuum times? Like, say a backup? Or perhaps updates
coming in from Slony that didn't used to be there?
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: VACUUMs take twice as long across all nodes

From
Gavin Hamill
Date:
On Thu, 26 Oct 2006 14:17:29 -0500
"Jim C. Nasby" <jim@nasby.net> wrote:

> Are you sure that there's nothing else happening on the machine that
> could affect the vacuum times? Like, say a backup? Or perhaps updates
> coming in from Slony that didn't used to be there?

I'm absolutely certain. The backups run from only one slave, given that
it is a full copy of node 1. Our overnight traffic has not increased
any, and the nightly backups show that the overall size of the DB has
not increased more than usual growth.

Plus, I have fairly verbose logging, and it's not showing anything out
of the ordinary.

Like I said, it's one of those awful hypothesis/hand-waving problems :)

Cheers,
Gavin.

Re: VACUUMs take twice as long across all nodes

From
"Jim C. Nasby"
Date:
On Thu, Oct 26, 2006 at 09:35:56PM +0100, Gavin Hamill wrote:
> On Thu, 26 Oct 2006 14:17:29 -0500
> "Jim C. Nasby" <jim@nasby.net> wrote:
>
> > Are you sure that there's nothing else happening on the machine that
> > could affect the vacuum times? Like, say a backup? Or perhaps updates
> > coming in from Slony that didn't used to be there?
>
> I'm absolutely certain. The backups run from only one slave, given that
> it is a full copy of node 1. Our overnight traffic has not increased
> any, and the nightly backups show that the overall size of the DB has
> not increased more than usual growth.
>
> Plus, I have fairly verbose logging, and it's not showing anything out
> of the ordinary.
>
> Like I said, it's one of those awful hypothesis/hand-waving problems :)

Well, the fact that it's happening on all your nodes leads me to think
Slony is somehow involved. Perhaps it suddenly decided to change how
often it's issuing syncs? I know it issues vacuums as well, so maybe
that's got something to do with it... (though I'm guessing you've
already looked in pg_stat_activity/logs to see if anything
correlates...) Still, it might be worth asking about this on the slony
list...
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: VACUUMs take twice as long across all nodes

From
Andrew Sullivan
Date:
On Thu, Oct 26, 2006 at 09:35:56PM +0100, Gavin Hamill wrote:
>
> I'm absolutely certain. The backups run from only one slave, given that
> it is a full copy of node 1. Our overnight traffic has not increased
> any, and the nightly backups show that the overall size of the DB has
> not increased more than usual growth.

A couple things from your posts:

1.    Don't do VACUUM FULL, please.  It takes longer, and blocks
other things while it's going on, which might mean you're having
table bloat in various slony-related tables.

2.    Are your slony logs showing increased time too?  Are your
targets getting further behind?

3.    Your backups "from the slave" aren't done with pg_dump,
right?

But I suspect Slony has a role here, too.  I'd look carefully at the
slony tables -- especially the sl_log and pg_listen things, which
both are implicated.
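
A rough sketch of how to keep an eye on those relations (the sl_log names below are assumptions - they live in your
Slony cluster's schema and the exact set depends on the Slony version):

    -- page counts straight from the catalogs; rerun daily and watch for growth
    SELECT relname, relpages, reltuples
      FROM pg_class
     WHERE relname IN ('sl_log_1', 'sl_log_2', 'pg_listener')
     ORDER BY relpages DESC;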

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
"The year's penultimate month" is not in truth a good way of saying
November.
        --H.W. Fowler

Re: VACUUMs take twice as long across all nodes

From
Gavin Hamill
Date:
On Thu, 26 Oct 2006 18:09:37 -0400
Andrew Sullivan <ajs@crankycanuck.ca> wrote:

> On Thu, Oct 26, 2006 at 09:35:56PM +0100, Gavin Hamill wrote:
> >
> > I'm absolutely certain. The backups run from only one slave, given
> > that it is a full copy of node 1. Our overnight traffic has not
> > increased any, and the nightly backups show that the overall size
> > of the DB has not increased more than usual growth.
>
> A couple things from your posts:
>
> 1.    Don't do VACUUM FULL, please.  It takes longer, and blocks
> other things while it's going on, which might mean you're having
> table bloat in various slony-related tables.

I know it takes longer, I know it blocks. It's never been a problem.

> 2.    Are your slony logs showing increased time too?  Are your
> targets getting further behind?

Nope, the slaves are keeping up just great - once the vacuums are
finished, all machines are running at about 50%-75% of full load in
duty.

> 3.    Your backups "from the slave" aren't done with pg_dump,
> right?

Em, they are indeed. I assumed that MVCC would ensure I got a
consistent snapshot from the instant when pg_dump began. Am I wrong?

> But I suspect Slony has a role here, too.  I'd look carefully at the
> slony tables -- especially the sl_log and pg_listen things, which
> both are implicated.

Slony is an easy target to point the finger at, so I tried a
little test. I took one of the 'light' slaves (only 10 tables..),
stopped its slon daemon, removed it from the load-balancer, and
restarted postgres so there were no active connections.

With the removal of both replication overhead and normal queries from
clients, the machine should be completely clear to run at full tilt.

Then I launched a 'vacuum verbose' and I was able to see exactly the
same poor speeds as before, even with vacuum_cost_delay = 0 as it was
previously...

2006-10-27 08:37:12 UTC INFO:  vacuuming "public.Allocation"
2006-10-27 08:37:21 UTC INFO:  "Allocation": found 56449 removable, 4989360 nonremovable row versions in 47158 pages
2006-10-27 08:37:21 UTC DETAIL:  0 dead row versions cannot be removed yet.
        Nonremovable row versions range from 64 to 72 bytes long.
        There were 1 unused item pointers.
        Total free space (including removable row versions) is 5960056 bytes.
        13 pages are or will become empty, including 0 at the end of the table.
        5258 pages containing 4282736 free bytes are potential move destinations.
        CPU 0.16s/0.07u sec elapsed 9.55 sec.
2006-10-27 08:44:25 UTC INFO:  index "allocation_pkey" now contains 4989360 row versions in 102198 pages
2006-10-27 08:44:25 UTC DETAIL:  56449 index row versions were removed.
        1371 index pages have been deleted, 1371 are currently reusable.
        CPU 1.02s/0.38u sec elapsed 423.22 sec.

If I've read this correctly, then on an otherwise idle system it has taken seven minutes to perform 1.4 seconds' worth
of actual work. Surely that's nonsense?

That would suggest that the issue is poor IO; "vmstat 5" output during this run wasn't showing ripping performance -
maybe averaging 3MB/sec in and out.

I know the peak IO on this machine is rather much better than that:

joltpg2:/root# dd if=/dev/zero of=/tmp/t bs=1024k count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 8.02106 seconds, 131 MB/s

The test "system" is one CPU's-worth (two cores) of a 4 x Opteron 880 machine split up by Xen, and I can confirm the IO
onthe other Xen partitions was minimal. 

I appreciate the time, help and advice people are offering, however I really don't think Slony is the culprit here.

Cheers,
Gavin.

Re: VACUUMs take twice as long across all nodes

From
Tom Lane
Date:
Gavin Hamill <gdh@laterooms.com> writes:
> 2006-10-27 08:37:12 UTC INFO:  vacuuming "public.Allocation"
> 2006-10-27 08:37:21 UTC INFO:  "Allocation": found 56449 removable, 4989360 nonremovable row versions in 47158 pages
> 2006-10-27 08:37:21 UTC DETAIL:  0 dead row versions cannot be removed yet.
>         Nonremovable row versions range from 64 to 72 bytes long.
>         There were 1 unused item pointers.
>         Total free space (including removable row versions) is 5960056 bytes.
>         13 pages are or will become empty, including 0 at the end of the table.
>         5258 pages containing 4282736 free bytes are potential move destinations.
>         CPU 0.16s/0.07u sec elapsed 9.55 sec.
> 2006-10-27 08:44:25 UTC INFO:  index "allocation_pkey" now contains 4989360 row versions in 102198 pages
> 2006-10-27 08:44:25 UTC DETAIL:  56449 index row versions were removed.
>         1371 index pages have been deleted, 1371 are currently reusable.
>         CPU 1.02s/0.38u sec elapsed 423.22 sec.

So the time is all in index vacuuming, eh?  I think what's happening is
that the physical order of the index is degrading over time, and so the
vacuum scan takes longer due to more seeking.  Can you afford to do a
REINDEX?  If this theory is correct that should drive the time back
down.

8.2 has rewritten btree index vacuuming code that scans the index in
physical not logical order, so this problem should largely go away in
8.2, but in existing releases I can't see much you can do about it
except REINDEX when things get slow.
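
Something along these lines, using the table and index names from the log upthread (note that REINDEX holds an
exclusive lock on the table while it runs):

    -- rebuild a single index, or every index on the table in one go
    REINDEX INDEX allocation_pkey;
    REINDEX TABLE "Allocation";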

            regards, tom lane

Re: VACUUMs take twice as long across all nodes

From
Gavin Hamill
Date:
On Fri, 27 Oct 2006 14:07:43 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> So the time is all in index vacuuming, eh?  I think what's happening
> is that the physical order of the index is degrading over time, and
> so the vacuum scan takes longer due to more seeking.  Can you afford
> to do a REINDEX?  If this theory is correct that should drive the
> time back down.

Tom,

You wonderful, wonderful man.

I tried a test reindex on "Allocation", and noticed a vacuum had
turbo-charged... then reindexed the whole db, did a vacuum, and lo! The
whole db had turbo-charged :)

When I say 'turbo-charged', I mean it. The vacuum times have dropped to
20% of what we were seeing even before it 'got much slower a
couple of days ago.'

It sucks that the new reindex code is only in 8.2, but now that I know
this is an issue in 8.1 I can plan for it.

Thanks so much :)

Cheers,
Gavin.

Re: VACUUMs take twice as long across all nodes

From
Andrew Sullivan
Date:
Ok, I see Tom has diagnosed your problem.  Here are more hints
anyway:

On Fri, Oct 27, 2006 at 10:20:25AM +0100, Gavin Hamill wrote:
> > table bloat in various slony-related tables.
>
> I know it takes longer, I know it blocks. It's never been a problem

The problem with a VACUUM FULL is that, because it takes longer, the
vacuums on (especially) pg_listen and sl_log_[n] are unable to
recover as many rows (because there's an older transaction around).
This is a significant area of vulnerability in Slony.  You really
have to readjust your vacuum assumptions when using Slony.
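
One quick way to look for old queries that could be holding back row recovery - a sketch, assuming
stats_command_string is enabled so current_query is populated:

    -- oldest still-running queries first; anything hours old is suspect
    SELECT procpid, usename, query_start, current_query
      FROM pg_stat_activity
     ORDER BY query_start
     LIMIT 10;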

> > 3.    Your backups "from the slave" aren't done with pg_dump,
> > right?
>
> Em, they are indeed. I assumed that MVCC would ensure I got a
> consistent snapshot from the instant when pg_dump began. Am I wrong?

That's not the problem.  The problem is that when you restore the
dump of the slave, you'll have garbage.  Slony fools with the
catalogs on the replicas.  This is documented in the Slony docs, but
probably not in sufficiently large-type bold italics in red with the
<blink> tag set as would be appropriate for such a huge gotcha.
Anyway, don't use pg_dump on a replica.  There's a tool that comes
with slony that will allow you to take consistent, restorable dumps
from replicas if you like.  (And you might as well throw away the
dumpfiles from the replicas that you have.  They won't work when you
restore them.)

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
"The year's penultimate month" is not in truth a good way of saying
November.
        --H.W. Fowler

Re: VACUUMs take twice as long across all nodes

From
Gavin Hamill
Date:
Hi :)

[pg_dump from a Slony replica]

> That's not the problem.  The problem is that when you restore the
> dump of the slave, you'll have garbage.  Slony fools with the
> catalogs on the replicas.

>  (And you might as well throw away the
> dumpfiles from the replicas that you have.  They won't work when you
> restore them.)

This is interesting, but I don't understand.. We've done a full restore
from one of these pg_dump backups before now and it worked just great.

Sure I had to DROP SCHEMA _replication CASCADE to clear out all the
slony-specific triggers etc., but the new-master ran fine, as did
firing up new replication to the other nodes :)

Was I just lucky?

Cheers,
Gavin.

Re: VACUUMs take twice as long across all nodes

From
Andrew Sullivan
Date:
On Sun, Oct 29, 2006 at 03:08:26PM +0000, Gavin Hamill wrote:
>
> This is interesting, but I don't understand.. We've done a full restore
> from one of these pg_dump backups before now and it worked just great.
>
> Sure I had to DROP SCHEMA _replication CASCADE to clear out all the
> slony-specific triggers etc., but the new-master ran fine, as did
> firing up new replication to the other nodes :)
>
> Was I just lucky?

Yes.  Slony alters data in the system catalog for a number of
database objects on the replicas.  It does this in order to prevent,
for example, triggers from firing both on the origin and the replica.
(That is the one that usually bites people hardest, but IIRC it's not
the only such hack in there.)  This was a bit of a dirty hack that
was supposed to be cleaned up, but that hasn't been yet.  In general,
you can't rely on a pg_dump of a replica giving you a dump that, when
restored, actually works.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
Everything that happens in the world happens at some place.
        --Jane Jacobs

Re: VACUUMs take twice as long across all nodes

From
Andreas Kostyrka
Date:
On Sunday, 29.10.2006, at 10:34 -0500, Andrew Sullivan wrote:
> On Sun, Oct 29, 2006 at 03:08:26PM +0000, Gavin Hamill wrote:
> >
> > This is interesting, but I don't understand.. We've done a full restore
> > from one of these pg_dump backups before now and it worked just great.
> >
> > Sure I had to DROP SCHEMA _replication CASCADE to clear out all the
> > slony-specific triggers etc., but the new-master ran fine, as did
> > firing up new replication to the other nodes :)
> >
> > Was I just lucky?
>
> Yes.  Slony alters data in the system catalog for a number of
> database objects on the replicas.  It does this in order to prevent,
> for example, triggers from firing both on the origin and the replica.
> (That is the one that usually bites people hardest, but IIRC it's not
> the only such hack in there.)  This was a bit of a dirty hack that
> was supposed to be cleaned up, but that hasn't been yet.  In general,
> you can't rely on a pg_dump of a replica giving you a dump that, when
> restored, actually works.

Actually, you need to get the schema from the master node, and can take
the data from a slave. When mixing dumps like that, you must realize that
there are two separate parts in the schema dump: "table definitions" and
"constraints". To get a restorable backup you need to put the table
definitions before your data, and the constraints after the data
copy.

Andreas

>
> A
>


Re: VACUUMs take twice as long across all nodes

From
Andrew Sullivan
Date:
On Sun, Oct 29, 2006 at 05:24:33PM +0100, Andreas Kostyrka wrote:
> Actually, you need to get the schema from the master node, and can take
> the data from a slave. In mixing dumps like that, you must realize that
> there are two seperate parts in the schema dump: "table definitions" and
> "constraints". Do get a restorable backup you need to put the table
> definitions stuff before your data, and the constraints after the data
> copy.

This will work, yes, but you don't get a real point-in-time dump this
way.  (In any case, we're off the -performance charter now, so if
anyone wants to pursue this, I urge you to take it to the Slony
list.)

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
Windows is a platform without soap, where rats run around
in open sewers.
        --Daniel Eran

Re: VACUUMs take twice as long across all nodes

From
Andreas Kostyrka
Date:
On Sunday, 29.10.2006, at 11:43 -0500, Andrew Sullivan wrote:
> On Sun, Oct 29, 2006 at 05:24:33PM +0100, Andreas Kostyrka wrote:
> > Actually, you need to get the schema from the master node, and can take
> > the data from a slave. In mixing dumps like that, you must realize that
> > there are two seperate parts in the schema dump: "table definitions" and
> > "constraints". Do get a restorable backup you need to put the table
> > definitions stuff before your data, and the constraints after the data
> > copy.
>
> This will work, yes, but you don't get a real point-in-time dump this way.

But one does, because one can dump all data in one pg_dump call. And
with Slony enabled, schema changes won't happen by mistake; they tend to
be a thing for the Slony High Priest, nothing for mere developers ;)

Andreas



Re: VACUUMs take twice as long across all nodes

From
Vivek Khera
Date:
On Oct 27, 2006, at 2:07 PM, Tom Lane wrote:

> 8.2, but in existing releases I can't see much you can do about it
> except REINDEX when things get slow.

This will be so nice for me.  I have one huge table with a massive
amount of churn and bulk deletes.  I have to reindex it once every
other month.  It takes about 60 to 75 minutes per index (times two
indexes) else I'd do it monthly.

It shaves nearly 1/3 of the relpages off of the index size.
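
That saving is easy to measure before and after the REINDEX; a small sketch (the index name here is just a
placeholder):

    -- compare the index's page count and on-disk size before and after
    SELECT relpages, pg_size_pretty(pg_relation_size('my_big_index'))
      FROM pg_class
     WHERE relname = 'my_big_index';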

