Thread: Auto ANALYZE criteria

From: Joe Miller
The autovacuum daemon currently uses the number of inserted and
updated tuples to determine if it should run VACUUM ANALYZE on a
table.  Why doesn’t it consider deleted tuples as well?

For example, I have a table which gets initially loaded with several
million records. A batch process grabs the records 100 at a time, does
some processing and deletes them from the table in the order of the
primary key.  Eventually, performance degrades because an autoanalyze
is never run.  The planner decides that it should do a sequential scan
instead of an index scan because the stats don't reflect reality.  See
example below.

I can set up a cron job to run the ANALYZE manually, but it seems like
the autovacuum daemon should be smart enough to figure this out on its
own.  Deletes can have as big an impact on the stats as inserts and
updates.
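The batch-delete pattern described above can be sketched roughly as follows. This uses Python's stdlib sqlite3 purely as a stand-in database to illustrate the access pattern (grab a batch in primary-key order, process, delete); the table layout mirrors the `public.test` table from the example below, scaled down:

```python
import sqlite3

# Stand-in for the PostgreSQL table in the session below; sqlite3
# (stdlib) is used here only to illustrate the workload shape.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, data BLOB)")
conn.executemany("INSERT INTO test (id) VALUES (?)",
                 [(i,) for i in range(1, 1001)])  # scaled down from 10M rows
conn.commit()

BATCH = 100
while True:
    # Grab the next batch in primary-key order.
    rows = conn.execute(
        "SELECT id FROM test ORDER BY id LIMIT ?", (BATCH,)).fetchall()
    if not rows:
        break
    # ... process the batch here ...
    conn.execute("DELETE FROM test WHERE id <= ?", (rows[-1][0],))
    conn.commit()
```

Over time the live rows all sit at the high end of the key range, while the planner's stats still describe the original distribution.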

Joe Miller

---------------------------

testdb=# \d test
     Table "public.test"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer | not null
 data   | bytea   |
Indexes:
    "test_pkey" PRIMARY KEY, btree (id)

testdb=# insert into public.test select s.a, gen_random_bytes(256)
from generate_series(1,10000000) as s(a);
INSERT 0 10000000

testdb=# SELECT *
FROM pg_stat_all_tables
WHERE schemaname='public' AND relname='test';
-[ RECORD 1 ]----+------------------------------
relid            | 5608158
schemaname       | public
relname          | test
seq_scan         | 1
seq_tup_read     | 0
idx_scan         | 0
idx_tup_fetch    | 0
n_tup_ins        | 10000000
n_tup_upd        | 0
n_tup_del        | 0
n_tup_hot_upd    | 0
n_live_tup       | 0
n_dead_tup       | 0
last_vacuum      |
last_autovacuum  |
last_analyze     |
last_autoanalyze | 2010-09-20 10:46:37.283775-04


testdb=# explain analyze delete from public.test where id <= 100;
                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Index Scan using test_pkey on test  (cost=0.00..71.63 rows=1000 width=6) (actual time=13.251..22.916 rows=100 loops=1)
   Index Cond: (id <= 100)
 Total runtime: 23.271 ms
(3 rows)

{ delete records ad nauseam }

testdb=# explain analyze delete from public.test where id <= 7978800;
                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..410106.17 rows=2538412 width=6) (actual time=48771.772..49681.562 rows=100 loops=1)
   Filter: (id <= 7978800)
 Total runtime: 49682.006 ms
(3 rows)

testdb=# SELECT *
FROM pg_stat_all_tables
WHERE schemaname='public' AND relname='test';
-[ RECORD 1 ]----+------------------------------
relid            | 5608158
schemaname       | public
relname          | test
seq_scan         | 1
seq_tup_read     | 0
idx_scan         | 54345
idx_tup_fetch    | 5433206
n_tup_ins        | 10000000
n_tup_upd        | 0
n_tup_del        | 5433200
n_tup_hot_upd    | 0
n_live_tup       | 5459506
n_dead_tup       | 725300
last_vacuum      |
last_autovacuum  | 2010-09-20 14:45:54.757611-04
last_analyze     |
last_autoanalyze | 2010-09-20 10:46:37.283775-04

Re: Auto ANALYZE criteria

From: "Kevin Grittner"
Joe Miller <joe.d.miller@gmail.com> wrote:

> I can set up a cron job to run the ANALYZE manually, but it seems
> like the autovacuum daemon should be smart enough to figure this
> out on its own.  Deletes can have as big an impact on the stats as
> inserts and updates.

But until the deleted rows are vacuumed from the indexes, an index
scan must read all the index entries for the deleted tuples, and
visit the heap to determine that they are not visible.  Does a
manual run of ANALYZE without a VACUUM change the stats much for
you, or are you running VACUUM ANALYZE?

-Kevin

Re: Auto ANALYZE criteria

From: Tom Lane
Joe Miller <joe.d.miller@gmail.com> writes:
> The autovacuum daemon currently uses the number of inserted and
> updated tuples to determine if it should run VACUUM ANALYZE on a
> table.  Why doesn’t it consider deleted tuples as well?

I think you misread the code.

Now there *is* a problem, pre-9.0, if your update pattern is such that
most or all updates are HOT updates.  To quote from the 9.0 alpha
release notes:

         Revise pgstat's tracking of tuple changes to
         improve the reliability of decisions about when to
         auto-analyze.  The previous code depended on n_live_tuples +
         n_dead_tuples - last_anl_tuples, where all three of these
         numbers could be bad estimates from ANALYZE itself.  Even
         worse, in the presence of a steady flow of HOT updates and
         matching HOT-tuple reclamations, auto-analyze might never
         trigger at all, even if all three numbers are exactly right,
         because n_dead_tuples could hold steady.

It's not clear to me if that matches your problem, though.

            regards, tom lane

Re: Auto ANALYZE criteria

From: Joe Miller
I was looking at the autovacuum documentation:
http://www.postgresql.org/docs/9.0/interactive/routine-vacuuming.html#AUTOVACUUM

   For analyze, a similar condition is used: the threshold, defined as:
   analyze threshold = analyze base threshold + analyze scale factor * number of tuples
   is compared to the total number of tuples inserted or updated since the last ANALYZE.

I guess that should be updated to read "inserted, updated, or deleted".
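Plugging the numbers from the session above into the documented formula makes the point concrete. This is just arithmetic on the formula as quoted, using the default settings (autovacuum_analyze_threshold = 50, autovacuum_analyze_scale_factor = 0.1), not PostgreSQL's actual code:

```python
# Autoanalyze is supposed to fire when tuple changes since the last
# ANALYZE exceed: base threshold + scale factor * reltuples.
# Defaults shown; both are configurable GUCs / per-table options.
ANALYZE_BASE_THRESHOLD = 50      # autovacuum_analyze_threshold
ANALYZE_SCALE_FACTOR = 0.1       # autovacuum_analyze_scale_factor

def analyze_threshold(reltuples: int) -> float:
    return ANALYZE_BASE_THRESHOLD + ANALYZE_SCALE_FACTOR * reltuples

reltuples = 10_000_000           # rows loaded in the example
deleted = 5_433_200              # n_tup_del from pg_stat_all_tables above

threshold = analyze_threshold(reltuples)
print(threshold)                 # 1000050.0
print(deleted > threshold)       # True
```

By this formula the 5.4 million deletes exceed the threshold several times over, which is why one would expect an autoanalyze to have been triggered here.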


On Mon, Sep 20, 2010 at 10:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Joe Miller <joe.d.miller@gmail.com> writes:
>> The autovacuum daemon currently uses the number of inserted and
>> updated tuples to determine if it should run VACUUM ANALYZE on a
>> table.  Why doesn’t it consider deleted tuples as well?
>
> I think you misread the code.
>
> Now there *is* a problem, pre-9.0, if your update pattern is such that
> most or all updates are HOT updates.  To quote from the 9.0 alpha
> release notes:
>
>         Revise pgstat's tracking of tuple changes to
>         improve the reliability of decisions about when to
>         auto-analyze.  The previous code depended on n_live_tuples +
>         n_dead_tuples - last_anl_tuples, where all three of these
>         numbers could be bad estimates from ANALYZE itself.  Even
>         worse, in the presence of a steady flow of HOT updates and
>         matching HOT-tuple reclamations, auto-analyze might never
>         trigger at all, even if all three numbers are exactly right,
>         because n_dead_tuples could hold steady.
>
> It's not clear to me if that matches your problem, though.
>
>                        regards, tom lane
>

Re: Auto ANALYZE criteria

From: Joe Miller
On Mon, Sep 20, 2010 at 6:28 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Joe Miller <joe.d.miller@gmail.com> wrote:
>
>> I can set up a cron job to run the ANALYZE manually, but it seems
>> like the autovacuum daemon should be smart enough to figure this
>> out on its own.  Deletes can have as big an impact on the stats as
>> inserts and updates.
>
> But until the deleted rows are vacuumed from the indexes, an index
> scan must read all the index entries for the deleted tuples, and
> visit the heap to determine that they are not visible.  Does a
> manual run of ANALYZE without a VACUUM change the stats much for
> you, or are you running VACUUM ANALYZE?
>
> -Kevin
>

The autovacuum is running correctly, so the deleted rows are being
removed.  All I'm doing is an ANALYZE, not VACUUM ANALYZE.

Re: Auto ANALYZE criteria

From: Tom Lane
Joe Miller <joe.d.miller@gmail.com> writes:
> I was looking at the autovacuum documentation:
> http://www.postgresql.org/docs/9.0/interactive/routine-vacuuming.html#AUTOVACUUM

>    For analyze, a similar condition is used: the threshold, defined as:
>    analyze threshold = analyze base threshold + analyze scale factor *
> number of tuples
>    is compared to the total number of tuples inserted or updated since
> the last ANALYZE.

> I guess that should be updated to read "inserted, updated, or deleted".

Mph.  We caught the other places where the docs explain what the analyze
threshold is, but missed that one.  Fixed, thanks for pointing it out.

            regards, tom lane

Re: Auto ANALYZE criteria

From: Joe Miller
Thanks for fixing the docs, but if that's the case, I shouldn't be
seeing the behavior that I'm seeing.

Should I flesh out this test case a little better and file a bug?

Thanks,

Joe


On Tue, Sep 21, 2010 at 4:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Joe Miller <joe.d.miller@gmail.com> writes:
>> I was looking at the autovacuum documentation:
>> http://www.postgresql.org/docs/9.0/interactive/routine-vacuuming.html#AUTOVACUUM
>
>>    For analyze, a similar condition is used: the threshold, defined as:
>>    analyze threshold = analyze base threshold + analyze scale factor *
>> number of tuples
>>    is compared to the total number of tuples inserted or updated since
>> the last ANALYZE.
>
>> I guess that should be updated to read "inserted, updated, or deleted".
>
> Mph.  We caught the other places where the docs explain what the analyze
> threshold is, but missed that one.  Fixed, thanks for pointing it out.
>
>                        regards, tom lane
>

Re: Auto ANALYZE criteria

From: Robert Haas
On Wed, Oct 13, 2010 at 5:20 PM, Joe Miller <joe.d.miller@gmail.com> wrote:
> Thanks for fixing the docs, but if that's the case, I shouldn't be
> seeing the behavior that I'm seeing.
>
> Should I flesh out this test case a little better and file a bug?

A reproducible test case is always a good thing to have...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company