Thread: estimating the need for VACUUM FULL and REINDEX

estimating the need for VACUUM FULL and REINDEX

From
Guillaume Cottenceau
Date:
I'm trying to come up with a way to estimate the need for a
VACUUM FULL and/or a REINDEX on some tables.


According to documentation[1], VACUUM FULL's only benefit is
returning unused disk space to the operating system; am I correct
in assuming there's also the benefit of optimizing the
performance of scans, because rows are physically compacted on
the disk?

With that in mind, I've tried to estimate how much benefit would
be brought by running VACUUM FULL, with the output of VACUUM
VERBOSE. However, it seems that for example the "removable rows"
reported by each VACUUM VERBOSE run is actually reused by VACUUM,
so is not what I'm looking for.


Then according to documentation[2], REINDEX has some benefit when
all but a few index keys on a page have been deleted, because the
page remains allocated (thus, I assume it improves index scan
performance, am I correct?). However, again I'm unable to
estimate the expected benefit. With a slightly modified version
of a query found in documentation[3] to see the pages used by a
relation[4], I'm able to see that the index data from a given
table...

            relname         | relpages | reltuples
    ------------------------+----------+-----------
     idx_sessions_owner_key |       38 |      2166
     pk_sessions            |       25 |      2166

...is duly optimized after a REINDEX:

            relname         | relpages | reltuples
    ------------------------+----------+-----------
     idx_sessions_owner_key |       13 |      2166
     pk_sessions            |        7 |      2166

but what I'd need is really these 38-13 and 25-7 figures (or
estimates) prior to running REINDEX.


Thanks for any insight.


Ref:
[1] http://www.postgresql.org/docs/8.2/interactive/routine-vacuuming.html

[2] http://www.postgresql.org/docs/8.2/interactive/routine-reindex.html

[3] http://www.postgresql.org/docs/8.2/interactive/disk-usage.html

[4] SELECT c2.relname, c2.relpages, c2.reltuples
      FROM pg_class c, pg_class c2, pg_index i
     WHERE c.relname = 'sessions'
       AND c.oid = i.indrelid
       AND c2.oid = i.indexrelid
     ORDER BY c2.relname;

--
Guillaume Cottenceau, MNC Mobile News Channel SA, an Alcatel-Lucent Company
Av. de la Gare 10, 1003 Lausanne, Switzerland - direct +41 21 317 50 36

Re: estimating the need for VACUUM FULL and REINDEX

From
Heikki Linnakangas
Date:
Guillaume Cottenceau wrote:
> According to documentation[1], VACUUM FULL's only benefit is
> returning unused disk space to the operating system; am I correct
> in assuming there's also the benefit of optimizing the
> performance of scans, because rows are physically compacted on
> the disk?

That's right.

> With that in mind, I've tried to estimate how much benefit would
> be brought by running VACUUM FULL, with the output of VACUUM
> VERBOSE. However, it seems that for example the "removable rows"
> reported by each VACUUM VERBOSE run is actually reused by VACUUM,
> so is not what I'm looking for.

Take a look at contrib/pgstattuple. If a table has high percentage of
free space, VACUUM FULL will compact that out.

> Then according to documentation[2], REINDEX has some benefit when
> all but a few index keys on a page have been deleted, because the
> page remains allocated (thus, I assume it improves index scan
> performance, am I correct?). However, again I'm unable to
> estimate the expected benefit. With a slightly modified version
> of a query found in documentation[3] to see the pages used by a
> relation[4], I'm able to see that the index data from a given
> table...

See pgstatindex, in the same contrib-module. The number you're looking
for is avg_leaf_density. REINDEX will bring that to 90% (with default
fill factor), so if it's much lower than that REINDEX will help.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: estimating the need for VACUUM FULL and REINDEX

From
Bill Moran
Date:
In response to Guillaume Cottenceau <gc@mnc.ch>:

> I'm trying to come up with a way to estimate the need for a
> VACUUM FULL and/or a REINDEX on some tables.

You shouldn't vacuum full unless you have a good reason.  Vacuum full
causes index bloat.

> According to documentation[1], VACUUM FULL's only benefit is
> returning unused disk space to the operating system; am I correct
> in assuming there's also the benefit of optimizing the
> performance of scans, because rows are physically compacted on
> the disk?

In my experience, the smaller the overall database size, the less shared
memory it requires.  Keeping it vacuumed will reduce the amount of space
taken up in memory, which means it's more likely that the data you need
at any particular time is in memory.

Look up a thread with my name on it a lot related to reindexing.  I did
some experiments with indexes and reindexing and the only advantage I found
was that the space requirement for the indexes is reduced by reindexing.
I was not able to find any performance difference in newly created indexes
vs. indexes that were starting to bloat.

> With that in mind, I've tried to estimate how much benefit would
> be brought by running VACUUM FULL, with the output of VACUUM
> VERBOSE. However, it seems that for example the "removable rows"
> reported by each VACUUM VERBOSE run is actually reused by VACUUM,
> so is not what I'm looking for.

I'm not sure what you mean by that last sentence.

There are only two circumstances (I can think of) for running vacuum
full:
1) You've just made some major change to the database (such as adding
   an obscene # of records, making massive changes to a large
   percentage of the existing data, or issuing a lot of "alter table")
   and want to get the FSM back down to a manageable size.
2) You are desperately hurting for disk space, and need a holdover
   until you can get bigger drives.

Reindexing pretty much falls into the same 2 scenerios.  I do recommend
that you reindex after any vacuum full.

However, a much better approach is to either schedule frequent vacuums
(without the full) or configure/enable autovacuum appropriately for your
setup.

> Then according to documentation[2], REINDEX has some benefit when
> all but a few index keys on a page have been deleted, because the
> page remains allocated (thus, I assume it improves index scan
> performance, am I correct?). However, again I'm unable to
> estimate the expected benefit. With a slightly modified version
> of a query found in documentation[3] to see the pages used by a
> relation[4], I'm able to see that the index data from a given
> table...
>
>             relname         | relpages | reltuples
>     ------------------------+----------+-----------
>      idx_sessions_owner_key |       38 |      2166
>      pk_sessions            |       25 |      2166
>
> ...is duly optimized after a REINDEX:
>
>             relname         | relpages | reltuples
>     ------------------------+----------+-----------
>      idx_sessions_owner_key |       13 |      2166
>      pk_sessions            |        7 |      2166
>
> but what I'd need is really these 38-13 and 25-7 figures (or
> estimates) prior to running REINDEX.

Again, my experience shows that reindexing is only worthwhile if you're
really hurting for disk space/memory.

I don't know of any way to tell what size an index would be if it were
completely packed, but it doesn't seem as if this is the best approach
anyway.  Newer versions of PG have the option to create indexes with
empty space already there at creation time (I believe this is called
"fill factor") to allow for future growth.

The only other reason I can see for vacuum full/reindex is if you _can_.
For example, if there is a period that you know the database will be
unused that it sufficiently long that you know these operations can
complete.  Keep in mind that both reindex and vacuum full create performance
problems while they are running.  If you knew, however, that the system
was _never_ being used between 6:00 PM and 8:00 AM, you could run them
over night.  In that case, I would recommend replacing vacuum full with
cluster.

--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@collaborativefusion.com
Phone: 412-422-3463x4023

Re: estimating the need for VACUUM FULL and REINDEX

From
Guillaume Cottenceau
Date:
Heikki Linnakangas <heikki 'at' enterprisedb.com> writes:

> Guillaume Cottenceau wrote:
> > According to documentation[1], VACUUM FULL's only benefit is
> > returning unused disk space to the operating system; am I correct
> > in assuming there's also the benefit of optimizing the
> > performance of scans, because rows are physically compacted on
> > the disk?
>
> That's right.

Ok. Then I think the documentation should probably be updated? It
seems to totally miss this benefit.

We've been hit by degrading performance, probably because of too
seldom VACUUM ANALYZE, and in this situation it seems that the
two solutions are either VACUUM FULL or dumping and recreating
the database. Maybe this situation should be described in the
documentation. In this list, everyone always say "you should
VACUUM ANALYZE frequently" but little is done to consider the
case when we have to deal with an existing database on which this
hasn't been done properly.

> > With that in mind, I've tried to estimate how much benefit would
> > be brought by running VACUUM FULL, with the output of VACUUM
> > VERBOSE. However, it seems that for example the "removable rows"
> > reported by each VACUUM VERBOSE run is actually reused by VACUUM,
> > so is not what I'm looking for.
>
> Take a look at contrib/pgstattuple. If a table has high percentage of
> free space, VACUUM FULL will compact that out.

Thanks a lot. I've followed this path and I think it should be
said that free_space must also be large compared to 8K -
free_percent can be large for tables with very few tuples even on
already compacted tables.

> > Then according to documentation[2], REINDEX has some benefit when
> > all but a few index keys on a page have been deleted, because the
> > page remains allocated (thus, I assume it improves index scan
> > performance, am I correct?). However, again I'm unable to
> > estimate the expected benefit. With a slightly modified version
> > of a query found in documentation[3] to see the pages used by a
> > relation[4], I'm able to see that the index data from a given
> > table...
>
> See pgstatindex, in the same contrib-module. The number you're looking
> for is avg_leaf_density. REINDEX will bring that to 90% (with default
> fill factor), so if it's much lower than that REINDEX will help.

Woops, seems that this was not availabe in pgstattuple of pg 7.4 :/

--
Guillaume Cottenceau, MNC Mobile News Channel SA, an Alcatel-Lucent Company
Av. de la Gare 10, 1003 Lausanne, Switzerland - direct +41 21 317 50 36

Re: estimating the need for VACUUM FULL and REINDEX

From
Guillaume Cottenceau
Date:
Guillaume Cottenceau <gc 'at' mnc.ch> writes:

> With that in mind, I've tried to estimate how much benefit would
> be brought by running VACUUM FULL, with the output of VACUUM
> VERBOSE. However, it seems that for example the "removable rows"
> reported by each VACUUM VERBOSE run is actually reused by VACUUM,
> so is not what I'm looking for.

I've tried to better understand how autovacuum works (we use 7.4)
to see if a similar mechanism could be used in 7.4 (e.g. run
VACUUM ANALYZE often enough to not end up with a need to VACUUM
FULL).

The autovacuum daemon uses statistics collected thanks to
stats_row_level. However, inside pg_stat_user_tables, the values
n_tup_upd and n_tup_del seem to be reported from pg startup and
never reset, whereas the information from previous VACUUM would
be needed here, if I understand correctly. Is there anything that
can be done from that point on with existing pg information, or
I'd need e.g. to remember the values of my last VACUUM myself?

Thanks.

--
Guillaume Cottenceau, MNC Mobile News Channel SA, an Alcatel-Lucent Company
Av. de la Gare 10, 1003 Lausanne, Switzerland - direct +41 21 317 50 36

Re: estimating the need for VACUUM FULL and REINDEX

From
Alvaro Herrera
Date:
Guillaume Cottenceau wrote:
> Guillaume Cottenceau <gc 'at' mnc.ch> writes:
>
> > With that in mind, I've tried to estimate how much benefit would
> > be brought by running VACUUM FULL, with the output of VACUUM
> > VERBOSE. However, it seems that for example the "removable rows"
> > reported by each VACUUM VERBOSE run is actually reused by VACUUM,
> > so is not what I'm looking for.
>
> I've tried to better understand how autovacuum works (we use 7.4)
> to see if a similar mechanism could be used in 7.4 (e.g. run
> VACUUM ANALYZE often enough to not end up with a need to VACUUM
> FULL).
>
> The autovacuum daemon uses statistics collected thanks to
> stats_row_level. However, inside pg_stat_user_tables, the values
> n_tup_upd and n_tup_del seem to be reported from pg startup and
> never reset, whereas the information from previous VACUUM would
> be needed here, if I understand correctly. Is there anything that
> can be done from that point on with existing pg information, or
> I'd need e.g. to remember the values of my last VACUUM myself?

In 7.4 there was the pg_autovacuum daemon in contrib, wasn't there?  No
need to write one yourself.

AFAIR what it did was precisely to remember the numbers from the last
vacuum, which was cumbersome and not very effective (because they were
lost on restart for example).  Also, the new autovac has some features
that the old one didn't have.  Ability to set per-table configuration
for example.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: estimating the need for VACUUM FULL and REINDEX

From
"Jim C. Nasby"
Date:
On Fri, May 11, 2007 at 01:25:04PM -0400, Alvaro Herrera wrote:
> Guillaume Cottenceau wrote:
> > Guillaume Cottenceau <gc 'at' mnc.ch> writes:
> >
> > > With that in mind, I've tried to estimate how much benefit would
> > > be brought by running VACUUM FULL, with the output of VACUUM
> > > VERBOSE. However, it seems that for example the "removable rows"
> > > reported by each VACUUM VERBOSE run is actually reused by VACUUM,
> > > so is not what I'm looking for.
> >
> > I've tried to better understand how autovacuum works (we use 7.4)
> > to see if a similar mechanism could be used in 7.4 (e.g. run
> > VACUUM ANALYZE often enough to not end up with a need to VACUUM
> > FULL).
> >
> > The autovacuum daemon uses statistics collected thanks to
> > stats_row_level. However, inside pg_stat_user_tables, the values
> > n_tup_upd and n_tup_del seem to be reported from pg startup and
> > never reset, whereas the information from previous VACUUM would
> > be needed here, if I understand correctly. Is there anything that
> > can be done from that point on with existing pg information, or
> > I'd need e.g. to remember the values of my last VACUUM myself?
>
> In 7.4 there was the pg_autovacuum daemon in contrib, wasn't there?  No
> need to write one yourself.

Correct. But one important note: the default parameters in the 7.4
contrib autovac are *horrible*. They will let your table grow to 3x
minimum size, instead of 1.4x in 8.0/8.1 and 1.2x in 8.2. You must
specify a different scale if you want anything resembling good results.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)