Thread: Auto ANALYZE criteria
The autovacuum daemon currently uses the number of inserted and updated
tuples to determine if it should run VACUUM ANALYZE on a table.  Why
doesn’t it consider deleted tuples as well?

For example, I have a table which gets initially loaded with several
million records.  A batch process grabs the records 100 at a time, does
some processing and deletes them from the table in the order of the
primary key.  Eventually, performance degrades because an autoanalyze
is never run.  The planner decides that it should do a sequential scan
instead of an index scan because the stats don't reflect reality.  See
the example below.

I can set up a cron job to run the ANALYZE manually, but it seems like
the autovacuum daemon should be smart enough to figure this out on its
own.  Deletes can have as big an impact on the stats as inserts and
updates.

Joe Miller

---------------------------

testdb=# \d test
     Table "public.test"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer | not null
 data   | bytea   |
Indexes:
    "test_pkey" PRIMARY KEY, btree (id)

testdb=# insert into public.test select s.a, gen_random_bytes(256) from generate_series(1,10000000) as s(a);
INSERT 0 10000000

testdb=# SELECT * FROM pg_stat_all_tables WHERE schemaname='public' AND relname='test';
  relid  | schemaname | relname | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd | n_live_tup | n_dead_tup | last_vacuum | last_autovacuum | last_analyze |       last_autoanalyze
---------+------------+---------+----------+--------------+----------+---------------+-----------+-----------+-----------+---------------+------------+------------+-------------+-----------------+--------------+-------------------------------
 5608158 | public     | test    |        1 |            0 |        0 |             0 |  10000000 |         0 |         0 |             0 |          0 |          0 |             |                 |              | 2010-09-20 10:46:37.283775-04
(1 row)

testdb=# explain analyze delete from public.test where id <= 100;
                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Index Scan using test_pkey on test  (cost=0.00..71.63 rows=1000 width=6) (actual time=13.251..22.916 rows=100 loops=1)
   Index Cond: (id <= 100)
 Total runtime: 23.271 ms
(3 rows)

{ delete records ad nauseam }

testdb=# explain analyze delete from public.test where id <= 7978800;
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..410106.17 rows=2538412 width=6) (actual time=48771.772..49681.562 rows=100 loops=1)
   Filter: (id <= 7978800)
 Total runtime: 49682.006 ms
(3 rows)

testdb=# SELECT * FROM pg_stat_all_tables WHERE schemaname='public' AND relname='test';
  relid  | schemaname | relname | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd | n_live_tup | n_dead_tup | last_vacuum |        last_autovacuum        | last_analyze |       last_autoanalyze
---------+------------+---------+----------+--------------+----------+---------------+-----------+-----------+-----------+---------------+------------+------------+-------------+-------------------------------+--------------+-------------------------------
 5608158 | public     | test    |        1 |            0 |    54345 |       5433206 |  10000000 |         0 |   5433200 |             0 |    5459506 |     725300 |             | 2010-09-20 14:45:54.757611-04 |              | 2010-09-20 10:46:37.283775-04
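A sketch of the workarounds Joe mentions: either run ANALYZE by hand
(e.g. from a cron job), or lower the analyze trigger for just this
table so autovacuum analyzes it sooner.  The per-table storage
parameters exist since 8.4; the values below are illustrative, not
recommendations.

    -- Manual fallback: refresh the planner stats directly.
    ANALYZE public.test;

    -- Per-table tuning: with these (illustrative) settings, roughly
    -- 1000 + 0.01 * n_live_tup changed tuples trigger an auto-analyze.
    ALTER TABLE public.test
      SET (autovacuum_analyze_threshold = 1000,
           autovacuum_analyze_scale_factor = 0.01);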
Joe Miller <joe.d.miller@gmail.com> wrote:

> I can set up a cron job to run the ANALYZE manually, but it seems
> like the autovacuum daemon should be smart enough to figure this
> out on its own.  Deletes can have as big an impact on the stats as
> inserts and updates.

But until the deleted rows are vacuumed from the indexes, an index
scan must read all the index entries for the deleted tuples, and
visit the heap to determine that they are not visible.  Does a
manual run of ANALYZE without a VACUUM change the stats much for
you, or are you running VACUUM ANALYZE?

-Kevin
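One way to answer Kevin's question on the test table from the first
message (a sketch: ANALYZE refreshes only the planner statistics, while
VACUUM ANALYZE also removes dead tuples from the heap and indexes):

    -- Stats only, no space reclamation:
    ANALYZE public.test;

    -- Stats plus removal of dead heap and index entries:
    VACUUM ANALYZE public.test;

    -- Compare the counters before and after either command:
    SELECT n_live_tup, n_dead_tup, last_analyze, last_autovacuum
      FROM pg_stat_all_tables
     WHERE schemaname = 'public' AND relname = 'test';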
Joe Miller <joe.d.miller@gmail.com> writes:
> The autovacuum daemon currently uses the number of inserted and
> updated tuples to determine if it should run VACUUM ANALYZE on a
> table.  Why doesn’t it consider deleted tuples as well?

I think you misread the code.

Now there *is* a problem, pre-9.0, if your update pattern is such that
most or all updates are HOT updates.  To quote from the 9.0 alpha
release notes:

       Revise pgstat's tracking of tuple changes to
       improve the reliability of decisions about when to
       auto-analyze.  The previous code depended on n_live_tuples +
       n_dead_tuples - last_anl_tuples, where all three of these
       numbers could be bad estimates from ANALYZE itself.  Even
       worse, in the presence of a steady flow of HOT updates and
       matching HOT-tuple reclamations, auto-analyze might never
       trigger at all, even if all three numbers are exactly right,
       because n_dead_tuples could hold steady.

It's not clear to me if that matches your problem, though.

			regards, tom lane
I was looking at the autovacuum documentation:
http://www.postgresql.org/docs/9.0/interactive/routine-vacuuming.html#AUTOVACUUM

   For analyze, a similar condition is used: the threshold, defined as:

       analyze threshold = analyze base threshold
           + analyze scale factor * number of tuples

   is compared to the total number of tuples inserted or updated since
   the last ANALYZE.

I guess that should be updated to read "inserted, updated, or deleted".

On Mon, Sep 20, 2010 at 10:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Joe Miller <joe.d.miller@gmail.com> writes:
>> The autovacuum daemon currently uses the number of inserted and
>> updated tuples to determine if it should run VACUUM ANALYZE on a
>> table.  Why doesn’t it consider deleted tuples as well?
>
> I think you misread the code.
>
> Now there *is* a problem, pre-9.0, if your update pattern is such that
> most or all updates are HOT updates.  To quote from the 9.0 alpha
> release notes:
>
>        Revise pgstat's tracking of tuple changes to
>        improve the reliability of decisions about when to
>        auto-analyze.  The previous code depended on n_live_tuples +
>        n_dead_tuples - last_anl_tuples, where all three of these
>        numbers could be bad estimates from ANALYZE itself.  Even
>        worse, in the presence of a steady flow of HOT updates and
>        matching HOT-tuple reclamations, auto-analyze might never
>        trigger at all, even if all three numbers are exactly right,
>        because n_dead_tuples could hold steady.
>
> It's not clear to me if that matches your problem, though.
>
>                        regards, tom lane
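For concreteness: with the default analyze base threshold of 50 and
analyze scale factor of 0.1, the 10M-row test table needs about
50 + 0.1 * 10,000,000 = 1,000,050 changed tuples before auto-analyze
fires, so the ~5.4 million deletes shown earlier should have crossed
that line several times over.  A quick sketch for checking the
settings in play:

    SHOW autovacuum_analyze_threshold;      -- default 50
    SHOW autovacuum_analyze_scale_factor;   -- default 0.1
    -- analyze threshold = 50 + 0.1 * 10000000 = 1000050 tuple changes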
On Mon, Sep 20, 2010 at 6:28 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Joe Miller <joe.d.miller@gmail.com> wrote:
>
>> I can set up a cron job to run the ANALYZE manually, but it seems
>> like the autovacuum daemon should be smart enough to figure this
>> out on its own.  Deletes can have as big an impact on the stats as
>> inserts and updates.
>
> But until the deleted rows are vacuumed from the indexes, an index
> scan must read all the index entries for the deleted tuples, and
> visit the heap to determine that they are not visible.  Does a
> manual run of ANALYZE without a VACUUM change the stats much for
> you, or are you running VACUUM ANALYZE?
>
> -Kevin

The autovacuum is running correctly, so the deleted rows are being
removed.  All I'm doing is an ANALYZE, not VACUUM ANALYZE.
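A sketch of the check Joe describes: a bare ANALYZE, with no VACUUM,
should be enough to flip the plan back once the statistics reflect the
deletes.

    ANALYZE public.test;
    EXPLAIN DELETE FROM public.test WHERE id <= 7978800;
    -- With fresh stats the planner should estimate far fewer matching
    -- rows and return to an Index Scan using test_pkey.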
Joe Miller <joe.d.miller@gmail.com> writes:
> I was looking at the autovacuum documentation:
> http://www.postgresql.org/docs/9.0/interactive/routine-vacuuming.html#AUTOVACUUM

>    For analyze, a similar condition is used: the threshold, defined as:
>        analyze threshold = analyze base threshold
>            + analyze scale factor * number of tuples
>    is compared to the total number of tuples inserted or updated since
>    the last ANALYZE.

> I guess that should be updated to read "inserted, updated, or deleted".

Mph.  We caught the other places where the docs explain what the
analyze threshold is, but missed that one.  Fixed, thanks for pointing
it out.

			regards, tom lane
Thanks for fixing the docs, but if that's the case, I shouldn't be
seeing the behavior that I'm seeing.

Should I flesh out this test case a little better and file a bug?

Thanks,

Joe

On Tue, Sep 21, 2010 at 4:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Joe Miller <joe.d.miller@gmail.com> writes:
>> I was looking at the autovacuum documentation:
>> http://www.postgresql.org/docs/9.0/interactive/routine-vacuuming.html#AUTOVACUUM
>
>>    For analyze, a similar condition is used: the threshold, defined as:
>>        analyze threshold = analyze base threshold
>>            + analyze scale factor * number of tuples
>>    is compared to the total number of tuples inserted or updated since
>>    the last ANALYZE.
>
>> I guess that should be updated to read "inserted, updated, or deleted".
>
> Mph.  We caught the other places where the docs explain what the
> analyze threshold is, but missed that one.  Fixed, thanks for pointing
> it out.
>
>                        regards, tom lane
On Wed, Oct 13, 2010 at 5:20 PM, Joe Miller <joe.d.miller@gmail.com> wrote:
> Thanks for fixing the docs, but if that's the case, I shouldn't be
> seeing the behavior that I'm seeing.
>
> Should I flesh out this test case a little better and file a bug?

A reproducible test case is always a good thing to have...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
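For reference, a condensed version of the original test case that could
serve as a starting point for such a bug report (a sketch;
gen_random_bytes() comes from the pgcrypto module, which is assumed to
be installed):

    CREATE TABLE public.test (id integer PRIMARY KEY, data bytea);

    INSERT INTO public.test
      SELECT s.a, gen_random_bytes(256)
        FROM generate_series(1, 10000000) AS s(a);

    -- Delete in primary-key order, 100 rows at a time, raising the
    -- bound each pass, and watch whether last_autoanalyze ever moves:
    DELETE FROM public.test WHERE id <= 100;

    SELECT n_tup_del, last_autovacuum, last_autoanalyze
      FROM pg_stat_all_tables
     WHERE schemaname = 'public' AND relname = 'test';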