Thread: Vacuum time degrading

Vacuum time degrading

From: Wes

We've been doing a vacuum every night.  With about 254 million rows, it was
taking about 0.8 hours.  Now a few months later with about 456 million rows
it is taking 4.8 hours.  I did check the archives and didn't find anything
like this.

Why is the vacuum time not going up linearly?  Since vacuum has to
sequentially read the entire database, I would have expected a linear
increase - about 1.5 hours now.
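(Scaling the earlier numbers linearly, 0.8 hours * 456/254 comes out to
roughly 1.4 hours.)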

There are currently no deletes or modifies to the database - only inserts.

This is on PostgreSQL 7.4.5, RedHat ES 3.0.

Wes



Re: Vacuum time degrading

From: Tom Lane

Wes <wespvp@syntegra.com> writes:
> Why is the vacuum time not going up linearly?

I'm betting that the database is suffering from substantial bloat,
requiring VACUUM to scan through lots of dead space that wasn't there
before.  Check your FSM settings (the tail end of the output from a
full-database VACUUM VERBOSE command would give some info about what you
need).
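For example, something along these lines (using the 7.4 setting names) shows
both what is allocated and what the database actually needs:

  -- database-wide vacuum; the FSM summary is the last few lines of output
  VACUUM VERBOSE;

  -- current allocation (postgresql.conf settings; changing them needs a restart)
  SHOW max_fsm_relations;
  SHOW max_fsm_pages;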

If you are suffering bloat, the fastest route to a solution would
probably be to CLUSTER your larger tables.  Although VACUUM FULL
would work, it's likely to be very slow.
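As a rough sketch, with placeholder names for your table and one of its
indexes:

  -- CLUSTER rewrites the table in the order of the given index and rebuilds
  -- its indexes, leaving no dead space; it takes an exclusive lock and needs
  -- enough free disk for a second copy of the table while it runs
  CLUSTER some_big_table_pkey ON some_big_table;
  ANALYZE some_big_table;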

> There are currently no deletes or modifies to the database - only inserts.

You *certain* about that?  It's hard to see how the vacuum time wouldn't
be linear in table size if there's nothing to do and no dead space.

Again, VACUUM VERBOSE info would be informative (it's sufficient to look
at your larger tables for this).

            regards, tom lane

Re: Vacuum time degrading

From: Wes

On 2/28/05 6:53 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> If you are suffering bloat, the fastest route to a solution would
> probably be to CLUSTER your larger tables.  Although VACUUM FULL
> would work, it's likely to be very slow.

How can there be bloat if there are no deletes or modifies?

Even if there were deletes or modifies (there will be in about another year
and a half), if a vacuum is being performed every night, how can there be
bloat?  The vacuum should release the dead space and it should be reused.
Am I missing something?

>> There are currently no deletes or modifies to the database - only inserts.
>
> You *certain* about that?  It's hard to see how the vacuum time wouldn't
> be linear in table size if there's nothing to do and no dead space.

Absolutely sure.  The only case that would approach a delete is if a batch
load fails and the transaction is rolled back.  That very seldom happens.

Why am I running vacuum nightly if I have no deletes or updates, you ask?
Two reasons - to have it in the cron schedule for when there are deletes
(there will never be updates), and as a check on database integrity.  If
there is a database problem, vacuum at least has a chance of flagging it
since it reads the entire database.  This was instigated after we had a
couple of instances of corruption a while back that went undetected for too
long.  I'm also doing a weekly pg_dumpall as an additional check/fallback.

> Again, VACUUM VERBOSE info would be informative (it's sufficient to look
> at your larger tables for this).

I'll set that up to run tonight and see if it gives any clues.  Last night,
vacuum ran over 5 hours.

Wes



Re: Vacuum time degrading

From: Wes

On 2/28/05 6:53 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> It's hard to see how the vacuum time wouldn't
> be linear in table size if there's nothing to do and no dead space.

I am doing 'vacuum analyze' rather than just 'vacuum'.  Could that have
anything to do with the non-linear behavior?

Wes



Re: Vacuum time degrading

From: Wes

On 2/28/05 6:53 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Again, VACUUM VERBOSE info would be informative (it's sufficient to look
> at your larger tables for this).

It took 5.2 hours again tonight to do the vacuum.  I don't see anything out
of the ordinary - no explanation for the non-linear increases in vacuum
time.

This is what shows up at the end:

INFO:  free space map: 93 relations, 282 pages stored; 1712 total pages
needed
DETAIL:  Allocated FSM size: 1000 relations + 1000000 pages = 5920 kB shared
memory.

The tables all show something like:

INFO:  "blah": found 0 removable, 366326534 nonremovable row versions in
3241829 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.

Wes



Re: Vacuum time degrading

From: Tom Lane

Wes <wespvp@syntegra.com> writes:
> It took 5.2 hours again tonight to do the vacuum.  I don't see anything out
> of the ordinary - no explanation for the non-linear increases in vacuum
> time.

Would you post the complete VACUUM VERBOSE log?  The CPU/elapsed time lines
would help us identify where the time is going.

> This is what shows up at the end:

> INFO:  free space map: 93 relations, 282 pages stored; 1712 total pages
> needed
> DETAIL:  Allocated FSM size: 1000 relations + 1000000 pages = 5920 kB shared
> memory.

Well, you don't have a problem with FSM being too small anyway ;-)

            regards, tom lane

Re: Vacuum time degrading

From: Wes

On 3/2/05 12:16 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Would you post the complete VACUUM VERBOSE log?  The CPU/elapsed time lines
> would help us identify where the time is going.

I'll send it to you directly - it's rather long.

>> DETAIL:  Allocated FSM size: 1000 relations + 1000000 pages = 5920 kB shared
>> memory.
>
> Well, you don't have a problem with FSM being too small anyway ;-)

Nope...  Preparation for when deletes start kicking in down the road.  If I
can only do a vacuum once a week, I've got to have lots of space.

Wes



Re: Vacuum time degrading

From: Wes

On 3/2/05 12:16 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Would you post the complete VACUUM VERBOSE log?  The CPU/elapsed time lines
> would help us identify where the time is going.

Mailed.

I do see stats like:

CPU 518.88s/25.17u sec elapsed 10825.33 sec.
CPU 884.96s/64.35u sec elapsed 13793.13 sec.
CPU 132.46s/6.66u sec elapsed 2435.42 sec.
CPU 49.25s/4.15u sec elapsed 414.71 sec.

This is a dual-CPU hyperthreaded (which probably makes little difference
here) 2.4 GHz machine running RedHat ES 3.0.  The database is on an 8-disk
SCSI hardware RAID 5 with 10k rpm disks.  Pg_xlog is on a separate volume.

I thought it was a 2.6 kernel, but it looks like it is 2.4.20.  I need to
monitor the system when the vacuum is running to see if sar/top show
anything.  I wonder if it's hitting the kswapd thrashing problem?

Wes



Re: Vacuum time degrading

From: Wes

Watching the system as vacuum is running, I can see that we are encountering
the kswapd/kscand problem in the 2.4.20 kernel.  This could very well
account for the non-linear increase in vacuum time.

This problem is fixed in the 2.6 kernel, but we can't upgrade because DELL
is dragging their feet in releasing hardware monitoring compatible with the
2.6 kernel.  So, we're going to try a 2.4.29 kernel and hope that the problem is
fixed there.  With any luck, by Friday I'll know if the kswapd problem is
fixed in 2.4.29 and if that solves the excessive vacuum times.

Wes



Re: Vacuum time degrading

From: Tom Lane

Wes <wespvp@syntegra.com> writes:
> Watching the system as vacuum is running, I can see that we are encountering
> the kswapd/kscand problem in the 2.4.20 kernel.  This could very well
> account for the non-linear increase in vacuum time.

Hmm.  Looking at the vacuum verbose output you sent me, it's clear that
the bulk of the time is going into scanning a couple of the larger
indexes.  On an index that's been growing for a while, this involves a
highly nonsequential access pattern (it wants to visit the index leaf
pages in sort order, which will not look much like physical order after
a lot of page splits have occurred).  I don't know whether that would
tend to set off the kswapd/kscand problems, but just in terms of
physical I/O time it might be annoying.  I was going to suggest
REINDEXing those indexes to see if that cuts the vacuum time at all.
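If you try it, it would be something like this per index (the name here is a
placeholder):

  -- rebuilds the index from scratch into freshly allocated, physically
  -- ordered pages; writes to the parent table are blocked while it runs
  REINDEX INDEX some_big_index;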

            regards, tom lane

Re: Vacuum time degrading

From: Tom Lane

Wes Palmer <Wesley.R.Palmer@syntegra.com> writes:
> Any chance of changing that
> behavior to scan in physical storage order?

It wouldn't be easy --- there are some locking considerations that say
btbulkdelete needs to scan the index in the same order that an ordinary
scan would do.  See the nbtree README for details.

            regards, tom lane

Re: Vacuum time degrading

From: Wes Palmer

On 3/2/05 3:51 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> I was going to suggest
> REINDEXing those indexes to see if that cuts the vacuum time at all.

The problem with that is it takes a very long time.  I've got a couple of
things to try yet on the kswapd problem.  If that doesn't work, maybe I can
rebuild one of the indexes and see how much that one improves.  I wasn't
aware that the indexes were scanned non-sequentially.  The under-one-hour
time was probably shortly after a full reload.  Any chance of changing that
behavior to scan in physical storage order?

The index on the largest table, which shows:

  CPU 216.15s/18.13u sec elapsed 2110.84 sec.

has its entries inserted in sequential order.  The index

  CPU 518.88s/25.17u sec elapsed 10825.33 sec.

has records inserted in essentially a random order, and is also something
like twice as large (key size).

We're going to try to test the 2.4.29 kernel tomorrow.

Wes


Re: Vacuum time degrading

From: Wes

Well, the good news is that the 2.4.29 kernel solved the kswapd problem.
The bad news is that it didn't help the vacuum time.  In fact, the vacuum
time is now over 6 hours instead of 5 hours.  Whether that is a direct
result of the 2.4.29 kernel, or a coincidence, I don't know at this time.

I guess the next step is to try reindexing a couple of the big indexes and
see if that helps.

Wes



Re: Vacuum time degrading

From: Wes

On 3/2/05 10:50 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> It wouldn't be easy --- there are some locking considerations that say
> btbulkdelete needs to scan the index in the same order that an ordinary
> scan would do.  See the nbtree README for details.

Just a follow-up on this...

The vacuum time has been steadily increasing, at a seemingly accelerating
rate, although there are no deletes or updates to the database.  The current
DB size is just over 500 million rows.  Last week it was up to 6.84 hours to
do a vacuum.  Over the weekend I reindexed all the major indexes.  The two
largest indexes alone took about 10 hours to reindex.  After the reindexing,
the vacuum took only 1.44 hours.  This is pretty much a linear scaling from
the original vacuum time I reported.

So, the increasing vacuum times would appear to be as Tom suggested - due to
the fact that vacuum processes indexes in index order, not physical disk
order.  I guess we add a periodic reindex to our maintenance procedures...
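That will probably amount to something like this in a weekend window
(placeholder names for our two largest indexes):

  -- each REINDEX blocks writes to its table while it runs, so these have
  -- to fit within the maintenance window
  REINDEX INDEX big_index_1;
  REINDEX INDEX big_index_2;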

Wes



Re: Vacuum time degrading

From: Tom Lane

Wes <wespvp@syntegra.com> writes:
> On 3/2/05 10:50 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>> It wouldn't be easy --- there are some locking considerations that say
>> btbulkdelete needs to scan the index in the same order that an ordinary
>> scan would do.  See the nbtree README for details.

> Just a follow-up on this...

> The vacuum time has been steadily increasing, at a seemingly accelerating
> rate, although there are no deletes or updates to the database.  The current
> DB size is just over 500 million rows.  Last week it was up to 6.84 hours to
> do a vacuum.  Over the weekend I reindexed all the major indexes.  The two
> largest indexes alone took about 10 hours to reindex.  After the reindexing,
> the vacuum took only 1.44 hours.  This is pretty much a linear scaling from
> the original vacuum time I reported.

> So, the increasing vacuum times would appear to be as Tom suggested - due to
> the fact that vacuum processes indexes in index order, not physical disk
> order.  I guess we add a periodic reindex to our maintenance procedures...

That doesn't follow from what you said.  Did you check that the physical
sizes of the indexes were comparable before and after the reindex?

            regards, tom lane

Re: Vacuum time degrading

From: Wes

On 4/4/05 8:50 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> That doesn't follow from what you said.  Did you check that the physical
> sizes of the indexes were comparable before and after the reindex?

No, how do I do that (or where is it documented how to do it)?

How is it not consistent?  I believe you suggested the reindex.  The initial
timing was shortly after a database reload.  The index would have been built
in sorted order, correct?  This was the 1 hour time.  After a period of
months, the index values have been inserted in random order.  The DB size is
up 50% but the vacuum time is up a factor of 6+.  Presumably the index is
being read by moving the heads all over the place.  I reindex, and the index
is rebuilt in sorted order.  Vacuum is now down to 1.5 hours - a linear
scaling from the original numbers.  The locality of reference in reading the
indexes in order should be much better.

Wes



Re: Vacuum time degrading

From: Tom Lane

Wes <wespvp@syntegra.com> writes:
> On 4/4/05 8:50 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>> That doesn't follow from what you said.  Did you check that the physical
>> sizes of the indexes were comparable before and after the reindex?

> No, how do I do that (or where is it documented how to do it)?

The best way is probably to capture the output of VACUUM VERBOSE (not FULL)
for the table before and after REINDEX.
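A rough cross-check, since VACUUM keeps pg_class up to date, is to look at
relpages for the indexes in question (multiply by the default 8 kB block size
to get bytes); the index names below are placeholders:

  SELECT relname, relpages, reltuples
  FROM pg_class
  WHERE relname IN ('index_one', 'index_two');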

> How is it not consistent?

I didn't say it wasn't consistent, just that it doesn't prove the
point.  The speedup you saw could have been from elimination of index
bloat more than from bringing the index into physically sorted order.
An estimate of the overall database size doesn't really tell us how
much this particular table's indexes changed in size.

            regards, tom lane

Re: Vacuum time degrading

From: Wes

On 4/5/05 11:15 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> I didn't say it wasn't consistent, just that it doesn't prove the
> point.  The speedup you saw could have been from elimination of index
> bloat more than from bringing the index into physically sorted order.
> An estimate of the overall database size doesn't really tell us how
> much this particular table's indexes changed in size.

Ok, now I follow.  Taking the biggest indexes:

The weekend before:

INFO:  index "message_recipients_i_recip_date" now contains 393961361 row
versions in 2435100 pages

INFO:  index "message_recipients_i_message" now contains 393934394 row
versions in 1499853 pages

After reindex:

INFO:  index "message_recipients_i_recip_date" now contains 401798357 row
versions in 1765613 pages

INFO:  index "message_recipients_i_message" now contains 401787237 row
versions in 1322974 pages
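(At the default 8 kB block size, that works out to roughly 20 GB shrinking to
14.5 GB for the recip_date index, and 12.3 GB shrinking to 10.8 GB for the
message index.)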



Wes



Re: Vacuum time degrading

From: Tom Lane

Wes <wespvp@syntegra.com> writes:
> Ok, now I follow.  Taking the biggest indexes:

> The weekend before:
> INFO:  index "message_recipients_i_recip_date" now contains 393961361 row
> versions in 2435100 pages
> INFO:  index "message_recipients_i_message" now contains 393934394 row
> versions in 1499853 pages

> After reindex:
> INFO:  index "message_recipients_i_recip_date" now contains 401798357 row
> versions in 1765613 pages
> INFO:  index "message_recipients_i_message" now contains 401787237 row
> versions in 1322974 pages

OK, that's certainly not a factor-of-four difference in size, so I'm
now convinced you're right: bringing the index into physical order is
having a big impact on the runtime.

From a development standpoint, that suggests a couple of TODO items:
* Look harder at whether VACUUM can scan the index in physical instead
  of logical order.
* See whether ordinary btree maintenance (ie page splits) can do
  anything to maintain/improve the physical ordering of the index.

            regards, tom lane