Thread: VACUUM and ANALYZE Follow-Up

VACUUM and ANALYZE Follow-Up

From: "Mark Dexter"

Several recent postings appear to confirm that there is an issue with the
use of VACUUM or ANALYZE on empty tables.  Specifically, if you VACUUM or
ANALYZE a table that is empty and then insert a large number of rows into
it, you will experience very poor performance.  For example, in our
testing, we suffered a 15X performance penalty when inserting 35,000 rows
into a table that had been VACUUM'd or ANALYZE'd when empty.  Also, in our
testing, it didn't matter whether we did plain VACUUM or VACUUM ANALYZE --
in both cases the subsequent inserts were slow.
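
To show the shape of the problem, here is a minimal sketch (the table name
and values are made up for illustration, with generate_series as a
convenient row factory):

    -- illustrative work table
    CREATE TABLE work_items (id integer PRIMARY KEY, payload text);
    ANALYZE work_items;   -- table is empty; stats now record zero rows

    -- bulk load; the stats still say "empty" until the next ANALYZE
    INSERT INTO work_items
        SELECT i, 'row ' || i FROM generate_series(1, 35000) AS i;

    -- lookups planned against the stale stats tend to get sequential
    -- scans, which is where the slowdown shows up
    SELECT * FROM work_items WHERE id = 12345;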

In the short run, the work-around appears to be either to avoid using
these commands on empty tables or to keep some "dummy" rows in these
tables that don't get deleted (and use DELETE instead of TRUNCATE).
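
A sketch of the dummy-row variant, reusing the illustrative table above
(the reserved negative ids are made up):

    -- keep permanent sentinel rows so the table is never truly empty
    -- when VACUUM or ANALYZE runs
    INSERT INTO work_items
        SELECT -i, 'sentinel' FROM generate_series(1, 100) AS i;

    -- clear application data with DELETE, not TRUNCATE, so the
    -- sentinels survive
    DELETE FROM work_items WHERE id > 0;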

However, in the long run, it would seem to make sense to address the
issue directly so DBAs and developers don't have to deal with it.  Several
possible solutions come to mind, and I'm sure there are others.

1. Provide an option with ANALYZE to force it to work as if a table had a
minimum number of rows (e.g., ANALYZE MINIMUM 1000 would analyze tables as
if they all had at least 1000 rows).
2. Provide an option during table creation to state the minimum number of rows to use for ANALYZE.
3. Just change ANALYZE to assume that all tables might have a reasonable number of rows at some point even if they are empty now.  (How much performance is actually gained currently when ANALYZE updates the stats for an empty table?)

In any case, it is hard to see how the present behaviour can be seen as
desirable.  It obviously causes problems at least for new Postgres users,
and we all hope there will be many more of these folks in the future.
Thanks for considering this.  Mark

Re: VACUUM and ANALYZE Follow-Up

From: "Joshua D. Drake"

>
> In any case, it is hard to see how the present behaviour can be seen as
> desirable.  It obviously causes problems at least for new Postgres
> users, and we
> all hope there will be many more of these folks in the future.  Thanks
> for considering this.  Mark

Uhmmm... analyze or vacuum on an empty table is fairly pointless. Those
utilities are supposed to be used on tables that have data.

So the answer is, use them on tables that have data.

Sincerely,

Joshua D. Drake


--
Command Prompt, Inc., home of PostgreSQL Replication, and plPHP.
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
Mammoth PostgreSQL Replicator. Integrated Replication for PostgreSQL


Re: VACUUM and ANALYZE Follow-Up

From: Tom Lane

"Mark Dexter" <MDEXTER@dexterchaney.com> writes:
> Several recent postings appear to confirm that there is an issue with
> the use of VACUUM or ANALYZE on empty tables.  Specifically, if you
> VACUUM or ANALYZE a table that is empty and then insert a large number
> of rows into this table, you will experience very poor performance.

Well-known issue.  See this thread for the latest discussion:
http://archives.postgresql.org/pgsql-hackers/2004-11/msg00985.php

            regards, tom lane

Re: VACUUM and ANALYZE Follow-Up

From: Martijn van Oosterhout

On Mon, Nov 29, 2004 at 11:48:37AM -0800, Mark Dexter wrote:
> 1. Provide an option with ANALYZE to force it to work as if a table had
> a minimum number of rows (e.g., ANALYZE MINIMUM 1000 would analyze
> tables as if they all had at least 1000 rows).
> 2. Provide an option during table creation to state the minimum number
> of rows to use for ANALYZE.

Ok, here's a scenario: you've executed ANALYZE on an empty table.  So
the system needs to gather, for each column, statistics on the largest
value, the smallest value, the approximate distribution, how well the
order correlates with the table order, the average column width, etc.

Hang on, it's an empty table. Is it supposed to fabricate these
statistics out of thin air? Any made up numbers will probably be worse
than none at all.
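
(For reference, a quick way to look at the per-column figures in
question, using the illustrative table from upthread; pg_stats is the
standard system view over pg_statistic:)

    SELECT attname, null_frac, avg_width, n_distinct, correlation
    FROM pg_stats
    WHERE tablename = 'work_items';
    -- for a table that has never held data there is nothing here to
    -- plan from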

> 3. Just change ANALYZE to assume that all tables might have a reasonable
> number of rows at some point even if they are empty now.  (How much
> performance is actually gained currently when ANALYZE updates the stats
> for an empty table?)

If the table is really empty, the performance is good since it will
never use any indexes.

> In any case, it is hard to see how the present behaviour can be seen as
> desirable.  It obviously causes problems at least for new Postgres
> users, and we
> all hope there will be many more of these folks in the future.  Thanks
> for considering this.  Mark

Seems to me that the solution is as the other poster suggested: don't
run ANALYZE on an empty table if it's not going to stay empty.  Run
ANALYZE when the table has data in it.  If you use TRUNCATE you never
need to VACUUM that table anyway.
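
(A sketch of that ordering, again with the illustrative table from
upthread:)

    TRUNCATE work_items;
    -- load first...
    INSERT INTO work_items
        SELECT i, 'row ' || i FROM generate_series(1, 35000) AS i;
    -- ...analyze after, so the stats describe the real contents
    ANALYZE work_items;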
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


Re: VACUUM and ANALYZE Follow-Up

From: Karsten Hilbert

> >In any case, it is hard to see how the present behaviour can be seen as
> >desirable.  It obviously causes problems at least for new Postgres
> >users, and we
> >all hope there will be many more of these folks in the future.  Thanks
> >for considering this.  Mark
>
> Uhmmm... analyze or vacuum on an empty table is fairly pointless. Those
> utilities are supposed to be used on tables that have data.
>
> So the answer is, use them on tables that have data.
What the OP seems to be saying is that to him it would make a
lot of sense to have vacuum by default not act on empty
tables.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: VACUUM and ANALYZE Follow-Up

From: "Mark Dexter"

Hmm... it seems that we're maybe not understanding one another here.
I'm going to try to be more clear.  Below are the questions that were
raised and my attempt to answer them clearly.

1. Why run VACUUM on an empty table?

This is a good question, but perhaps there is a valid answer.  Our
production database contains about 1500 tables.  It is MUCH more
convenient to be able to VACUUM ANALYZE the entire database than to list
specific tables.  Furthermore, our application uses "work" tables that
often will be empty (e.g., at night) but that, at times, will contain a
large number of rows.  The Postgres documentation says the following:
"We recommend that active production databases be vacuumed frequently
(at least nightly), in order to remove expired rows."  This is going to
be difficult without messing up performance.

2. Hang on, it's an empty table. Is it supposed to fabricate these
statistics out of thin air? Any made up numbers will probably be worse
than none at all.

Well, that's why I suggested some type of command line option so the
user could give it a number of rows to use for the analysis (e.g.,
ANALYZE MINIMUM 1000).

Another point of interest: if I DROP and then CREATE the table, without
doing ANALYZE, I get good performance when inserting rows.  So whatever
assumptions the database makes about a newly created table appear to be
different (and BETTER) than the assumptions made when doing ANALYZE on an
empty table.  It's not clear to me why this should be.  In both cases,
you don't really know anything about the table other than that at this
moment it has zero rows.  Obviously, it would be better (at least in this
instance) if running ANALYZE on an empty table had the same performance
result as using CREATE to make a new (empty) table.
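
(To make the comparison concrete, the two preparation paths look like
this, with the illustrative names from earlier; how much of the 15X a
given load reproduces depends on what per-row work it triggers:)

    -- path A: freshly created table, never analyzed -- subsequent
    -- inserts performed well
    DROP TABLE work_items;
    CREATE TABLE work_items (id integer PRIMARY KEY, payload text);

    -- path B: same definition, but ANALYZE ran while it was empty --
    -- subsequent inserts performed badly
    TRUNCATE work_items;
    ANALYZE work_items;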

Finally, my testing would seem to contradict the idea that any made-up
number is worse than none at all.  In my testing (inserting 35,000 rows
into an empty table), I could only measure two distinct outcomes -- one
good and one bad (with a 15X performance difference).  I got good
performance with any of the following scenarios: CREATE TABLE, or ANALYZE
or VACUUM with more than 94 rows in the table.  I got bad performance if
I did ANALYZE or VACUUM with fewer than 94 rows in the table.  I could
not measure any difference between other numbers of rows (between 0 and
35,000).  So I don't think in practice it is that sensitive, at least in
the simple test case I was doing.

Below are some additional questions I have.

3. Is there some benefit to having ANALYZE behave the way it now does on
empty or nearly empty tables?  Is there a large performance improvement
for really small tables (e.g., under 100 rows or under 1000 rows)?  Does
anyone really care about performance for small tables?

4. Isn't ANALYZE on a totally empty table really a special case?  The
presumption should be that the table will not remain empty.  To optimize
the performance assuming that there will be zero (or close to zero) rows
seems somewhat pointless.  However, there are valid reasons why a table
might be empty at the moment in time when the ANALYZE is run.  (In our
case, we use "work" tables that get cleared at the end of an application
process.)  And, as mentioned above, it is easier to VACUUM ANALYZE an
entire database than it is to list tables individually.

5. Why does DROP / CREATE work better than TRUNCATE / VACUUM in terms of
creating a fresh table in which to insert new rows?  Is this desirable?
In both cases, the optimizer doesn't really know anything about what to
expect for the table.  But CREATE provides a better starting point for
inserts than does VACUUM or ANALYZE, at least in my testing.

I am relatively new to Postgres, and I apologize if I'm repeating issues
that have been raised before.  However, it does seem to me to be an
important issue.  The easier it is to maintain a Postgres database, the
more likely it is to be widely used.  Thanks.  Mark

Re: VACUUM and ANALYZE Follow-Up

From: Alvaro Herrera

On Mon, Nov 29, 2004 at 02:57:28PM -0800, Mark Dexter wrote:

> 1. Why run VACUUM on an empty table?

Another angle: you run VACUUM on an "empty" table because it isn't
physically empty.  If you have a table with a lot of tuples and run
DELETE on it, the table will be empty after that, but it will be full of
dead tuples.  So you have to VACUUM it to reclaim the space.

Of course it would be better to use TRUNCATE rather than DELETE, but
that doesn't always work (e.g. because the table has foreign keys).
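
(A sketch of that sequence, names illustrative:)

    -- logically empties the table but leaves every deleted row behind
    -- as a dead tuple
    DELETE FROM work_items;

    -- reclaims the space; TRUNCATE would avoid the need, but is not
    -- always usable (e.g. because of foreign keys)
    VACUUM work_items;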

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"This is a foot just waiting to be shot"                (Andrew Dunstan)

Re: VACUUM and ANALYZE Follow-Up

From: Mage

Joshua D. Drake wrote:

>>
>> In any case, it is hard to see how the present behaviour can be seen as
>> desirable.  It obviously causes problems at least for new Postgres
>> users, and we
>> all hope there will be many more of these folks in the future.  Thanks
>> for considering this.  Mark
>
>
> Uhmmm... analyze or vacuum on an empty table is fairly pointless.
> Those utilities are supposed to be used on tables that have data.
>
> So the answer is, use them on tables that have data.

Every 5th page of the manual says that I should use VACUUM ANALYZE
frequently.  There are two command-line tools for this and another one
in contrib.  I don't think any of them takes special care with empty
tables.

This issue needs to be fixed.

       Mage

Re: VACUUM and ANALYZE Follow-Up

From: Pierre-Frédéric Caillaud

> 4. Isn't ANALYZE on a totally empty table really a special case?  The
> presumption should be that the table will not remain empty.  To optimize
> the performance assuming that there will be zero (or close to zero) rows
> seems somewhat pointless.  However, there are valid reasons why a table
> might be empty at the moment in time when the ANALYZE is run.  (In our
> case, we use "work" tables that get cleared at the end of an application
> process.)  And, as mentioned above, it is easier to VACUUM ANALYZE an
> entire database than it is to list tables individually.

    Well, for instance, I have a few tables which contain just a few rows: a
list of states in which an object in another table may be, or a list of tax
rates...  For these kinds of tables, with maybe 10 rows or just a few pages,
you don't want index scans, so VACUUM and ANALYZE are doing their job.

    If you were going to insert 5 rows in an empty table, you would also want
this behaviour. The problems start when you make a large INSERT in an
empty or almost empty table.

    So, how to solve your problem without slowing the other requests (i.e.
selecting and inserting a few rows into a very small table)?

    Nobody responded to my suggestion that the planner take into account the
number of rows to be inserted into the table when making its plan... so I'll
repost it:
    - INSERT ... SELECT:
    The planner has an estimate of how many rows the SELECT will yield.  So
it could plan the queries involving SELECTs on the target table (like UNIQUE
checks et al.) using the number of rows in the table plus the number of rows
to be inserted.  This solves your problem.

    Problems with this approach:
    - It only gives a number of rows, not more precise statistics.
    But it's the only information available, so why not use it?  And it's
enough to solve the OP's problem.

    - It can get recursive.
    What if there is a self-join?  I guess, just fall back to the previous
behaviour...

    - It does not work for COPY.
    Argument: COPY should act like it's going to insert many rows.  Most of
the time, that's how it's used.

    - The estimated number of rows to insert may be imprecise.
    For instance, with a SELECT involving UNIONs, DISTINCT, or a huge join,
the outcome would be uncertain.

    What do you think?
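
(For concreteness, a sketch of the kind of statement I mean; staging_items
is a made-up source table:)

    -- the SELECT half already carries a planner row estimate; the idea
    -- is to plan the target table's UNIQUE/FK checks as if work_items
    -- held its current rows plus that estimate
    INSERT INTO work_items (id, payload)
    SELECT id, payload FROM staging_items;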

Re: VACUUM and ANALYZE Follow-Up

From: "Joshua D. Drake"

>>
>> Uhmmm... analyze or vacuum on an empty table is fairly pointless.
>> Those utilities are supposed to be used on tables that have data.
>>
>> So the answer is, use them on tables that have data.
>
>
> Every 5th page of the manual says that I should use VACUUM ANALYZE
> frequently.  There are two command-line tools for this and another one
> in contrib.  I don't think any of them takes special care with empty
> tables.
>
> This issue needs to be fixed.

What issue? I don't see an issue in the least.

Sincerely,


Joshua D. Drake





--
Command Prompt, Inc., home of PostgreSQL Replication, and plPHP.
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
Mammoth PostgreSQL Replicator. Integrated Replication for PostgreSQL


Re: VACUUM and ANALYZE Follow-Up

From: Tom Lane

>>> Every 5th page of the manual says that I should use VACUUM ANALYZE
>>> frequently.  There are two command-line tools for this and another one
>>> in contrib.  I don't think any of them takes special care with empty
>>> tables.
>>
>> This issue needs to be fixed.

> What issue? I don't see an issue in the least.

Hasn't anybody read the other threads I posted links to?

(That's a rhetorical question, because the answer clearly is "no" :-()

            regards, tom lane

Re: VACUUM and ANALYZE Follow-Up

From: Pierre-Frédéric Caillaud

> Hasn't anybody read the other threads I posted links to?
> (That's a rhetorical question, because the answer clearly is "no" :-()

    You mean this one:

http://archives.postgresql.org/pgsql-hackers/2004-11/msg00985.php

    In which you write:

rel->pages = RelationGetNumberOfBlocks(relation);
if (relation->rd_rel->relpages > 0)
     density = relation->rd_rel->reltuples / relation->rd_rel->relpages;
else if (relation->rd_rel->reltuples > 0)  /* already a density */
     density = relation->rd_rel->reltuples;
else
     density = some_default_estimate;
rel->tuples = round(rel->pages * density);

> A variant of this is to set reltuples = density, relpages = 1 instead
> of 0, which makes the relpages value a lie but would be even less likely
> to confuse client-side code.

    I don't know how it works internally, but if an empty table has a
filesize of 0 that's a lie, but if an empty table is just one page with a
header saying "nothing here, go away", it's the truth.

    And I like your idea. I definitely think it would be useful. Your
proposed implementation is a bit hackish but quick and easy to do, and
with minimal breakage.

    What do you think of the idea of using the estimated number of rows to
be inserted into the table as a help in planning the queries made on this
table during the INSERT (like FK checks)?  Did you read my previous post on
this?

    Thanks!



Re: VACUUM and ANALYZE Follow-Up

From: "Mark Dexter"

Tom, I did read through the links you provided.  Unfortunately, I don't feel qualified to judge the technical merits of the possible solutions.  Since you appear to be well informed on this issue, may I ask you a few quick questions?

1. Would it be difficult to add an option to ANALYZE to force it to pretend that there are a minimum number of rows (e.g., ANALYZE MINIMUM 1000 or something)?  This would appear to be a simple-minded way to solve the problem without any concerns about backward compatibility.

2. Why does a newly CREATE'd table behave differently than an empty table after ANALYZE?  Does it make sense that it should?  In the CREATE case, the assumptions appear to be much more reasonable for a table that is going to grow. 

3. Has anyone ever tested whether there is a measurable performance gain from doing ANALYZE on empty or nearly empty tables?  We know that there is a very large (in my case 15X) performance loss when the table starts growing.  If the gain is small or negligible when the tables really are small, then perhaps maintaining the current behaviour is not as important.

The nice thing about option (1) is that it solves the slow-insert issue both for empty tables and for tables with a few rows.  It also causes absolutely no backward-compatibility issues.

Thanks very much for your comments on this.  Mark

Re: VACUUM and ANALYZE Follow-Up

From: Tom Lane

"Mark Dexter" <MDEXTER@dexterchaney.com> writes:
> 1. Would it be difficult to add an option to ANALYZE to force it to
> pretend that there are a minimum number of rows (e.g., ANALYZE MINIMUM
> 1000 or something)?  This would appear to be a simple-minded way to
> solve the problem without any concerns about backward compatibility.

This strikes me as useless, not to mention not backward-compatible at all.
Where is ANALYZE supposed to come up with the data to fill pg_statistic?
Shall we add the same kluge option to VACUUM and CREATE INDEX?

> 2. Why does a newly CREATE'd table behave differently than an empty
> table after ANALYZE?  Does it make sense that it should?

This is a long-standing hack, which I am proposing undoing; see
http://archives.postgresql.org/pgsql-patches/2004-11/msg00339.php
and in particular read the comments that the patch deletes.

> 3. Has anyone ever tested whether there is a measurable performance
> gained after doing ANALYZE on empty or nearly empty tables?

As long as the table *stays* empty or nearly so, the knowledge that it
is small is good for the planner to have.  The problem we are dealing
with here boils down to the fact that a table can grow substantially
without the planner being aware of the fact.  So IMHO the correct
solution is to attack that problem head-on, not to invent weird options
to make ANALYZE lie about what it found.  CREATE TABLE shouldn't be
lying either, but at one time that seemed like a good quick-hack
workaround ...
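
(The disconnect is visible directly in pg_class, which only moves at
VACUUM/ANALYZE time; a sketch, using the illustrative table from
upthread:)

    -- after an ANALYZE-while-empty followed by a 35,000-row load:
    SELECT relname, relpages, reltuples
    FROM pg_class
    WHERE relname = 'work_items';
    -- still reports zero until the next VACUUM or ANALYZE, so plans
    -- keep treating the table as if it were empty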

            regards, tom lane