Thread: estimating the need for VACUUM FULL and REINDEX
I'm trying to come up with a way to estimate the need for a VACUUM FULL and/or a REINDEX on some tables. According to documentation[1], VACUUM FULL's only benefit is returning unused disk space to the operating system; am I correct in assuming there's also the benefit of optimizing the performance of scans, because rows are physically compacted on the disk? With that in mind, I've tried to estimate how much benefit would be brought by running VACUUM FULL, with the output of VACUUM VERBOSE. However, it seems that for example the "removable rows" reported by each VACUUM VERBOSE run is actually reused by VACUUM, so is not what I'm looking for. Then according to documentation[2], REINDEX has some benefit when all but a few index keys on a page have been deleted, because the page remains allocated (thus, I assume it improves index scan performance, am I correct?). However, again I'm unable to estimate the expected benefit. With a slightly modified version of a query found in documentation[3] to see the pages used by a relation[4], I'm able to see that the index data from a given table... relname | relpages | reltuples ------------------------+----------+----------- idx_sessions_owner_key | 38 | 2166 pk_sessions | 25 | 2166 ...is duly optimized after a REINDEX: relname | relpages | reltuples ------------------------+----------+----------- idx_sessions_owner_key | 13 | 2166 pk_sessions | 7 | 2166 but what I'd need is really these 38-13 and 25-7 figures (or estimates) prior to running REINDEX. Thanks for any insight. Ref: [1] http://www.postgresql.org/docs/8.2/interactive/routine-vacuuming.html [2] http://www.postgresql.org/docs/8.2/interactive/routine-reindex.html [3] http://www.postgresql.org/docs/8.2/interactive/disk-usage.html [4] SELECT c2.relname, c2.relpages, c2.reltuples FROM pg_class c, pg_class c2, pg_index i WHERE c.relname = 'sessions' AND c.oid = i.indrelid AND c2.oid = i.indexrelid ORDER BY c2.relname; -- Guillaume Cottenceau, MNC Mobile News Channel SA, an Alcatel-Lucent Company Av. de la Gare 10, 1003 Lausanne, Switzerland - direct +41 21 317 50 36
Guillaume Cottenceau wrote: > According to documentation[1], VACUUM FULL's only benefit is > returning unused disk space to the operating system; am I correct > in assuming there's also the benefit of optimizing the > performance of scans, because rows are physically compacted on > the disk? That's right. > With that in mind, I've tried to estimate how much benefit would > be brought by running VACUUM FULL, with the output of VACUUM > VERBOSE. However, it seems that for example the "removable rows" > reported by each VACUUM VERBOSE run is actually reused by VACUUM, > so is not what I'm looking for. Take a look at contrib/pgstattuple. If a table has high percentage of free space, VACUUM FULL will compact that out. > Then according to documentation[2], REINDEX has some benefit when > all but a few index keys on a page have been deleted, because the > page remains allocated (thus, I assume it improves index scan > performance, am I correct?). However, again I'm unable to > estimate the expected benefit. With a slightly modified version > of a query found in documentation[3] to see the pages used by a > relation[4], I'm able to see that the index data from a given > table... See pgstatindex, in the same contrib-module. The number you're looking for is avg_leaf_density. REINDEX will bring that to 90% (with default fill factor), so if it's much lower than that REINDEX will help. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
In response to Guillaume Cottenceau <gc@mnc.ch>: > I'm trying to come up with a way to estimate the need for a > VACUUM FULL and/or a REINDEX on some tables. You shouldn't vacuum full unless you have a good reason. Vacuum full causes index bloat. > According to documentation[1], VACUUM FULL's only benefit is > returning unused disk space to the operating system; am I correct > in assuming there's also the benefit of optimizing the > performance of scans, because rows are physically compacted on > the disk? In my experience, the smaller the overall database size, the less shared memory it requires. Keeping it vacuumed will reduce the amount of space taken up in memory, which means it's more likely that the data you need at any particular time is in memory. Look up a thread with my name on it a lot related to reindexing. I did some experiments with indexes and reindexing and the only advantage I found was that the space requirement for the indexes is reduced by reindexing. I was not able to find any performance difference in newly created indexes vs. indexes that were starting to bloat. > With that in mind, I've tried to estimate how much benefit would > be brought by running VACUUM FULL, with the output of VACUUM > VERBOSE. However, it seems that for example the "removable rows" > reported by each VACUUM VERBOSE run is actually reused by VACUUM, > so is not what I'm looking for. I'm not sure what you mean by that last sentence. There are only two circumstances (I can think of) for running vacuum full: 1) You've just made some major change to the database (such as adding an obscene # of records, making massive changes to a large percentage of the existing data, or issuing a lot of "alter table") and want to get the FSM back down to a manageable size. 2) You are desperately hurting for disk space, and need a holdover until you can get bigger drives. Reindexing pretty much falls into the same 2 scenerios. I do recommend that you reindex after any vacuum full. However, a much better approach is to either schedule frequent vacuums (without the full) or configure/enable autovacuum appropriately for your setup. > Then according to documentation[2], REINDEX has some benefit when > all but a few index keys on a page have been deleted, because the > page remains allocated (thus, I assume it improves index scan > performance, am I correct?). However, again I'm unable to > estimate the expected benefit. With a slightly modified version > of a query found in documentation[3] to see the pages used by a > relation[4], I'm able to see that the index data from a given > table... > > relname | relpages | reltuples > ------------------------+----------+----------- > idx_sessions_owner_key | 38 | 2166 > pk_sessions | 25 | 2166 > > ...is duly optimized after a REINDEX: > > relname | relpages | reltuples > ------------------------+----------+----------- > idx_sessions_owner_key | 13 | 2166 > pk_sessions | 7 | 2166 > > but what I'd need is really these 38-13 and 25-7 figures (or > estimates) prior to running REINDEX. Again, my experience shows that reindexing is only worthwhile if you're really hurting for disk space/memory. I don't know of any way to tell what size an index would be if it were completely packed, but it doesn't seem as if this is the best approach anyway. Newer versions of PG have the option to create indexes with empty space already there at creation time (I believe this is called "fill factor") to allow for future growth. The only other reason I can see for vacuum full/reindex is if you _can_. For example, if there is a period that you know the database will be unused that it sufficiently long that you know these operations can complete. Keep in mind that both reindex and vacuum full create performance problems while they are running. If you knew, however, that the system was _never_ being used between 6:00 PM and 8:00 AM, you could run them over night. In that case, I would recommend replacing vacuum full with cluster. -- Bill Moran Collaborative Fusion Inc. http://people.collaborativefusion.com/~wmoran/ wmoran@collaborativefusion.com Phone: 412-422-3463x4023
Heikki Linnakangas <heikki 'at' enterprisedb.com> writes: > Guillaume Cottenceau wrote: > > According to documentation[1], VACUUM FULL's only benefit is > > returning unused disk space to the operating system; am I correct > > in assuming there's also the benefit of optimizing the > > performance of scans, because rows are physically compacted on > > the disk? > > That's right. Ok. Then I think the documentation should probably be updated? It seems to totally miss this benefit. We've been hit by degrading performance, probably because of too seldom VACUUM ANALYZE, and in this situation it seems that the two solutions are either VACUUM FULL or dumping and recreating the database. Maybe this situation should be described in the documentation. In this list, everyone always say "you should VACUUM ANALYZE frequently" but little is done to consider the case when we have to deal with an existing database on which this hasn't been done properly. > > With that in mind, I've tried to estimate how much benefit would > > be brought by running VACUUM FULL, with the output of VACUUM > > VERBOSE. However, it seems that for example the "removable rows" > > reported by each VACUUM VERBOSE run is actually reused by VACUUM, > > so is not what I'm looking for. > > Take a look at contrib/pgstattuple. If a table has high percentage of > free space, VACUUM FULL will compact that out. Thanks a lot. I've followed this path and I think it should be said that free_space must also be large compared to 8K - free_percent can be large for tables with very few tuples even on already compacted tables. > > Then according to documentation[2], REINDEX has some benefit when > > all but a few index keys on a page have been deleted, because the > > page remains allocated (thus, I assume it improves index scan > > performance, am I correct?). However, again I'm unable to > > estimate the expected benefit. With a slightly modified version > > of a query found in documentation[3] to see the pages used by a > > relation[4], I'm able to see that the index data from a given > > table... > > See pgstatindex, in the same contrib-module. The number you're looking > for is avg_leaf_density. REINDEX will bring that to 90% (with default > fill factor), so if it's much lower than that REINDEX will help. Woops, seems that this was not availabe in pgstattuple of pg 7.4 :/ -- Guillaume Cottenceau, MNC Mobile News Channel SA, an Alcatel-Lucent Company Av. de la Gare 10, 1003 Lausanne, Switzerland - direct +41 21 317 50 36
Guillaume Cottenceau <gc 'at' mnc.ch> writes: > With that in mind, I've tried to estimate how much benefit would > be brought by running VACUUM FULL, with the output of VACUUM > VERBOSE. However, it seems that for example the "removable rows" > reported by each VACUUM VERBOSE run is actually reused by VACUUM, > so is not what I'm looking for. I've tried to better understand how autovacuum works (we use 7.4) to see if a similar mechanism could be used in 7.4 (e.g. run VACUUM ANALYZE often enough to not end up with a need to VACUUM FULL). The autovacuum daemon uses statistics collected thanks to stats_row_level. However, inside pg_stat_user_tables, the values n_tup_upd and n_tup_del seem to be reported from pg startup and never reset, whereas the information from previous VACUUM would be needed here, if I understand correctly. Is there anything that can be done from that point on with existing pg information, or I'd need e.g. to remember the values of my last VACUUM myself? Thanks. -- Guillaume Cottenceau, MNC Mobile News Channel SA, an Alcatel-Lucent Company Av. de la Gare 10, 1003 Lausanne, Switzerland - direct +41 21 317 50 36
Guillaume Cottenceau wrote: > Guillaume Cottenceau <gc 'at' mnc.ch> writes: > > > With that in mind, I've tried to estimate how much benefit would > > be brought by running VACUUM FULL, with the output of VACUUM > > VERBOSE. However, it seems that for example the "removable rows" > > reported by each VACUUM VERBOSE run is actually reused by VACUUM, > > so is not what I'm looking for. > > I've tried to better understand how autovacuum works (we use 7.4) > to see if a similar mechanism could be used in 7.4 (e.g. run > VACUUM ANALYZE often enough to not end up with a need to VACUUM > FULL). > > The autovacuum daemon uses statistics collected thanks to > stats_row_level. However, inside pg_stat_user_tables, the values > n_tup_upd and n_tup_del seem to be reported from pg startup and > never reset, whereas the information from previous VACUUM would > be needed here, if I understand correctly. Is there anything that > can be done from that point on with existing pg information, or > I'd need e.g. to remember the values of my last VACUUM myself? In 7.4 there was the pg_autovacuum daemon in contrib, wasn't there? No need to write one yourself. AFAIR what it did was precisely to remember the numbers from the last vacuum, which was cumbersome and not very effective (because they were lost on restart for example). Also, the new autovac has some features that the old one didn't have. Ability to set per-table configuration for example. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Fri, May 11, 2007 at 01:25:04PM -0400, Alvaro Herrera wrote: > Guillaume Cottenceau wrote: > > Guillaume Cottenceau <gc 'at' mnc.ch> writes: > > > > > With that in mind, I've tried to estimate how much benefit would > > > be brought by running VACUUM FULL, with the output of VACUUM > > > VERBOSE. However, it seems that for example the "removable rows" > > > reported by each VACUUM VERBOSE run is actually reused by VACUUM, > > > so is not what I'm looking for. > > > > I've tried to better understand how autovacuum works (we use 7.4) > > to see if a similar mechanism could be used in 7.4 (e.g. run > > VACUUM ANALYZE often enough to not end up with a need to VACUUM > > FULL). > > > > The autovacuum daemon uses statistics collected thanks to > > stats_row_level. However, inside pg_stat_user_tables, the values > > n_tup_upd and n_tup_del seem to be reported from pg startup and > > never reset, whereas the information from previous VACUUM would > > be needed here, if I understand correctly. Is there anything that > > can be done from that point on with existing pg information, or > > I'd need e.g. to remember the values of my last VACUUM myself? > > In 7.4 there was the pg_autovacuum daemon in contrib, wasn't there? No > need to write one yourself. Correct. But one important note: the default parameters in the 7.4 contrib autovac are *horrible*. They will let your table grow to 3x minimum size, instead of 1.4x in 8.0/8.1 and 1.2x in 8.2. You must specify a different scale if you want anything resembling good results. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)