Thread: VACUUM and ANALYZE Follow-Up
Several recent postings appear to confirm that there is an issue with the use of VACUUM or ANALYZE on empty tables. Specifically, if you
VACUUM or ANALYZE a table that is empty and then insert a large number of rows into this table, you will experience very poor performance.
For example, in our testing, we suffered a 15X performance penalty when inserting 35,000 rows into a table that had been VACUUM'd or
ANALYZE'd when empty. Also, in our testing, it didn't matter whether you did a plain VACUUM or VACUUM ANALYZE -- in both cases the
subsequent inserts were slow.
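For illustration, the shape of our test was roughly as follows (the table
and column names are simplified stand-ins, not our real schema, and
generate_series is just a convenient way to fabricate rows):

    -- ANALYZE the table while it is still empty, then time a large insert.
    CREATE TABLE work_items (id integer PRIMARY KEY, payload text);
    ANALYZE work_items;    -- table is empty at this point
    \timing
    INSERT INTO work_items
        SELECT i, 'row ' || i::text FROM generate_series(1, 35000) AS i;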
In the short run, the work-around appears to be either to avoid using these commands on empty tables or to keep some "dummy" rows in these
tables that don't get deleted (and use DELETE instead of TRUNCATE).
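A rough sketch of the dummy-row workaround, using the same illustrative
table as above:

    -- Seed dummy rows that never get deleted, so VACUUM and ANALYZE
    -- never see the table completely empty.
    INSERT INTO work_items
        SELECT -i, 'dummy' FROM generate_series(1, 100) AS i;
    -- Clear application rows with DELETE (not TRUNCATE) so the dummies survive.
    DELETE FROM work_items WHERE id > 0;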
However, in the long run, it would seem to make sense to address the issue directly so DBAs and developers don't have to deal with it. Several
possible solutions come to mind, and I'm sure there are others.
1. Provide an option with ANALYZE to force it to work as if a table had a minimum number of rows (e.g., ANALYZE MINIMUM 1000 would analyze
tables as if they all had at least 1000 rows); a hypothetical sketch of this syntax follows the list.
2. Provide an option during table creation to state the minimum number of rows to use for ANALYZE.
3. Just change ANALYZE to assume that all tables might have a reasonable number of rows at some point even if they are empty now. (How much performance is actually gained currently when ANALYZE updates the stats for an empty table?)
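To make options 1 and 2 concrete, the syntax might look something like the
following. Neither command exists in Postgres today; this is purely a
sketch of the proposal:

    -- Option 1: analyze as if the table held at least 1000 rows (hypothetical).
    ANALYZE MINIMUM 1000 work_items;

    -- Option 2: declare the assumption at creation time (hypothetical).
    CREATE TABLE work_items (id integer PRIMARY KEY, payload text)
        ANALYZE MINIMUM 1000;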
In any case, it is hard to see how the present behaviour can be seen as desirable. It obviously causes problems at least for new Postgres users, and we
all hope there will be many more of these folks in the future. Thanks for considering this. Mark
> In any case, it is hard to see how the present behaviour can be seen as
> desirable. It obviously causes problems at least for new Postgres
> users, and we all hope there will be many more of these folks in the
> future. Thanks for considering this. Mark

Uhmmm... analyze or vacuum on an empty table is fairly pointless. Those
utilities are supposed to be used on tables that have data.

So the answer is, use them on tables that have data.

Sincerely,

Joshua D. Drake

--
Command Prompt, Inc., home of PostgreSQL Replication, and plPHP.
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
Mammoth PostgreSQL Replicator. Integrated Replication for PostgreSQL
"Mark Dexter" <MDEXTER@dexterchaney.com> writes: > Several recent postings appear to confirm that there is an issue with > the use of VACUUM or ANALYZE on empty tables. Specifically, if you > VACUUM or ANALYZE a table that is empty and then insert a large number > of rows into this table, you will experience very poor performance. Well-known issue. See this thread for the latest discussion: http://archives.postgresql.org/pgsql-hackers/2004-11/msg00985.php regards, tom lane
On Mon, Nov 29, 2004 at 11:48:37AM -0800, Mark Dexter wrote:
> 1. Provide an option with ANALYZE to force it to work as if a table had
> a minimum number of rows (e.g., ANALYZE MINIMUM 1000 would analyze
> tables as if they all had at least 1000 rows).
> 2. Provide an option during table creation to state the minimum number
> of rows to use for ANALYZE.

Ok, here's a scenario: you've executed ANALYZE on an empty table. So the
system needs to get, for each column, statistics on largest value,
smallest value, approximate distribution, how well the column order
correlates with the table order, average column width, etc. Hang on,
it's an empty table. Is it supposed to fabricate these statistics out of
thin air? Any made-up numbers will probably be worse than none at all.

> 3. Just change ANALYZE to assume that all tables might have a reasonable
> number of rows at some point even if they are empty now. (How much
> performance is actually gained currently when ANALYZE updates the stats
> for an empty table?)

If the table is really empty, the performance is good since it will
never use any indexes.

> In any case, it is hard to see how the present behaviour can be seen as
> desirable. It obviously causes problems at least for new Postgres
> users, and we all hope there will be many more of these folks in the
> future. Thanks for considering this. Mark

Seems to me that the solution is as the other poster suggested: don't
run ANALYZE on an empty table if it's not going to be empty. Run ANALYZE
when the table has data in it. If you use TRUNCATE you never need to
VACUUM that table anyway.

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
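A quick way to see the per-column statistics in question (what ANALYZE
recorded, if anything) is the pg_stats view; the table name here is
illustrative:

    -- An empty table that has been ANALYZEd shows no rows here, because
    -- there was no data to sample.
    SELECT attname, avg_width, n_distinct, correlation
    FROM pg_stats
    WHERE tablename = 'work_items';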
> > In any case, it is hard to see how the present behaviour can be seen as
> > desirable. It obviously causes problems at least for new Postgres
> > users, and we all hope there will be many more of these folks in the
> > future. Thanks for considering this. Mark
>
> Uhmmm... analyze or vacuum on an empty table is fairly pointless. Those
> utilities are supposed to be used on tables that have data.
>
> So the answer is, use them on tables that have data.

What the OP seems to be saying is that to him it would make a lot of
sense to have vacuum by default not act on empty tables.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
Hmm... it seems that we're maybe not understanding one another here. I'm
going to try to be more clear. Below are the questions that were raised
and my attempt to answer them clearly.

1. Why run VACUUM on an empty table?

This is a good question, but perhaps there is a valid answer. Our
production database contains about 1,500 tables. It is MUCH more
convenient to be able to VACUUM ANALYZE the entire database than to list
specific tables. Furthermore, our application uses "work" tables that
often will be empty (e.g., at night) but that, at times, will contain a
large number of rows. The Postgres documentation says the following: "We
recommend that active production databases be vacuumed frequently (at
least nightly), in order to remove expired rows." This is going to be
difficult without messing up the performance.

2. Hang on, it's an empty table. Is it supposed to fabricate these
statistics out of thin air? Any made-up numbers will probably be worse
than none at all.

Well, that's why I suggested some type of command-line option so the
user could give it a number of rows to use for the analysis (e.g.,
ANALYZE MINIMUM 1000).

Another point of interest: if I DROP and then CREATE the table, without
doing ANALYZE, I get good performance when inserting rows. So whatever
assumptions the database is making about a newly created table appear to
be different (and BETTER) than the assumptions made when doing ANALYZE
on an empty table. It's not clear to me why this should be. In both
cases, you don't really know anything about the table other than that at
this moment it has zero rows. Obviously, it would be better (at least in
this instance) if running ANALYZE on an empty table had the same
performance result as using CREATE to make a new (empty) table.

Finally, my testing would seem to contradict the claim that any made-up
numbers will be worse than none at all. In my testing (inserting 35,000
rows into an empty table), I could only measure two distinct outcomes --
one good and one bad (with a 15X performance difference). I got good
performance with any of the following scenarios: CREATE TABLE, or
ANALYZE or VACUUM with more than 94 rows in the table. I got bad
performance if I did ANALYZE or VACUUM with fewer than 94 rows in the
table. I could not measure any difference between other numbers of rows
(between 0 and 35,000). So I don't think in practice it is that
sensitive, at least in the simple test case I was doing.

Below are a few additional questions I have.

3. Is there some benefit to having ANALYZE behave the way it now does on
empty or nearly empty tables? Is there a large performance improvement
for really small tables (e.g., under 100 rows or under 1,000 rows)? Does
anyone really care about performance for small tables?

4. Isn't ANALYZE on a totally empty table really a special case? The
presumption should be that the table will not remain empty. To optimize
the performance assuming that there will be zero (or close to zero) rows
seems somewhat pointless. However, there are valid reasons why a table
might be empty at the moment in time when the ANALYZE is run. (In our
case, we use "work" tables that get cleared at the end of an application
process.) And, as mentioned above, it is easier to VACUUM ANALYZE an
entire database than it is to list tables individually.

5. Why does DROP / CREATE work better than TRUNCATE / VACUUM in terms of
creating a fresh table in which to insert new rows? Is this desirable?
In both cases, the optimizer doesn't really know anything about what to
expect for the table. But CREATE provides a better starting point for
inserts than does VACUUM or ANALYZE, at least in my testing.

I am relatively new to Postgres, and I apologize if I'm repeating issues
that have been raised before. However, it does seem to me to be an
important issue. The easier it is to maintain a Postgres database, the
more likely it is to be widely used. Thanks. Mark
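A sketch of the threshold test just described, again with illustrative
names (the original schema isn't shown in the thread); vary the seed
count around the observed 94-row boundary:

    -- Seed N rows, analyze, then time the big insert; repeat for various N.
    TRUNCATE work_items;
    INSERT INTO work_items SELECT i, 'seed' FROM generate_series(1, 94) AS i;
    ANALYZE work_items;
    \timing
    INSERT INTO work_items
        SELECT i, 'bulk' FROM generate_series(100, 35099) AS i;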
On Mon, Nov 29, 2004 at 02:57:28PM -0800, Mark Dexter wrote:
> 1. Why run VACUUM on an empty table?

Another approach: you run VACUUM on an empty table to empty it. If you
had a table with a lot of tuples and ran DELETE on it, it will be empty
after that, but it will be full of dead tuples. So you have to VACUUM it
to reclaim space.

Of course it would be better to use TRUNCATE rather than DELETE, but
that doesn't always work (e.g. because the table has foreign keys).

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"This is a foot just waiting to be shot" (Andrew Dunstan)
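A brief sketch of the case Alvaro describes, with invented names. TRUNCATE
is rejected when another table holds a foreign key into the target, so
DELETE followed by VACUUM is the fallback:

    CREATE TABLE orders (id integer PRIMARY KEY);
    CREATE TABLE order_lines (order_id integer REFERENCES orders);
    -- TRUNCATE orders;   -- rejected: referenced by a foreign key constraint
    DELETE FROM order_lines;
    DELETE FROM orders;   -- table is now empty but full of dead tuples
    VACUUM orders;        -- reclaims the space the DELETE left behind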
Joshua D. Drake wrote:
>> In any case, it is hard to see how the present behaviour can be seen as
>> desirable. It obviously causes problems at least for new Postgres
>> users, and we all hope there will be many more of these folks in the
>> future. Thanks for considering this. Mark
>
> Uhmmm... analyze or vacuum on an empty table is fairly pointless.
> Those utilities are supposed to be used on tables that have data.
>
> So the answer is, use them on tables that have data.

Every 5th page of the manual says that I should use vacuum analyze
frequently. There are two command-line tools for this and another one
in the contrib. I think none of them cares about empty tables.

This issue needs to be fixed.

Mage
> 4. Isn't ANALYZE on a totally empty table really a special case? The
> presumption should be that the table will not remain empty. To optimize
> the performance assuming that there will be zero (or close to zero) rows
> seems somewhat pointless. However, there are valid reasons why a table
> might be empty at the moment in time when the ANALYZE is run. (In our
> case, we use "work" tables that get cleared at the end of an application
> process.) And, as mentioned above, it is easier to VACUUM ANALYZE an
> entire database than it is to list tables individually.

Well, for instance I have a few tables which contain just a few rows:
for instance a list of states in which an object in another table may
be, or a list of tax rates... For these kinds of tables with, like, 10
rows, or just a few pages, you don't want index scans, so VACUUM and
ANALYZE are doing their job. If you were going to insert 5 rows in an
empty table, you would also want this behaviour. The problems start when
you make a large INSERT into an empty or almost empty table.

So, how to solve your problem without slowing the other requests (i.e.,
selecting and inserting a few rows into a very small table)? Nobody
responded to my suggestion that the planner take into account the number
of rows to be inserted into the table in its plan... so I'll repost it:

- INSERT ... SELECT: the planner has an estimate of how many rows the
SELECT will yield. So it could plan the queries involving SELECTs on the
target table (like UNIQUE checks et al.) using the number of rows in the
table plus the number of rows to be inserted. This solves your problem.

Problems with this approach:

- This only gives a number of rows, not more precise statistics.
It's the only information available, so why not use it? And it's enough
to solve the OP's problem.

- Can get recursive.
What if there is a self-join? I guess, just fall back to the previous
behaviour...

- Does not work for COPY.
Argument: COPY should act like it's going to insert many rows. Most of
the time, that's how it's used.

- When the estimated number of rows to insert is imprecise (for instance
a SELECT with UNIONs or DISTINCT or a huge join), the outcome would be
uncertain.

What do you think?
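The estimate the suggestion relies on is already visible in EXPLAIN
output: for an INSERT ... SELECT, the planner reports a row-count
estimate for the SELECT feeding the insert (table names invented for the
example):

    -- The "rows" figure on the SELECT node is what the proposal would add
    -- to the target table's assumed size when planning FK and unique checks.
    EXPLAIN INSERT INTO work_items
        SELECT id, payload FROM staging_items;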
>> Uhmmm... analyze or vacuum on an empty table is fairly pointless.
>> Those utilities are supposed to be used on tables that have data.
>>
>> So the answer is, use them on tables that have data.
>
> Every 5th page of the manual says that I should use vacuum analyze
> frequently. There are two command-line tools for this and another one
> in the contrib. I think none of them cares about empty tables.
>
> This issue needs to be fixed.
>
> Mage

What issue? I don't see an issue in the least.

Sincerely,

Joshua D. Drake

--
Command Prompt, Inc., home of PostgreSQL Replication, and plPHP.
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
Mammoth PostgreSQL Replicator. Integrated Replication for PostgreSQL
>>> Every 5th page of the manual says that I should use vacuum analyze
>>> frequently. There are two command-line tools for this and another one
>>> in the contrib. I think none of them cares about empty tables.
>>
>> This issue needs to be fixed.
>
> What issue? I don't see an issue in the least.

Hasn't anybody read the other threads I posted links to?
(That's a rhetorical question, because the answer clearly is "no" :-()

regards, tom lane
> Hasn't anybody read the other threads I posted links to?
> (That's a rhetorical question, because the answer clearly is "no" :-()

You mean this one:
http://archives.postgresql.org/pgsql-hackers/2004-11/msg00985.php

In which you write:

    rel->pages = RelationGetNumberOfBlocks(relation);
    if (relation->rd_rel->relpages > 0)
        density = relation->rd_rel->reltuples / relation->rd_rel->relpages;
    else if (relation->rd_rel->reltuples > 0)   /* already a density */
        density = relation->rd_rel->reltuples;
    else
        density = some_default_estimate;
    rel->tuples = round(rel->pages * density);

> A variant of this is to set reltuples = density, relpages = 1 instead
> of 0, which makes the relpages value a lie but would be even less likely
> to confuse client-side code.

I don't know how it works internally, but if an empty table has a
filesize of 0 that's a lie, but if an empty table is just one page with
a header saying "nothing here, go away", it's the truth.

And I like your idea. I definitely think it would be useful. Your
proposed implementation is a bit hackish, but quick and easy to do, and
with minimal breakage.

What do you think of the idea of using the estimation of the number of
rows to be inserted in the table as a help in planning the queries on
this table made during the INSERT (like FK checks)? Did you read my
previous post on this?

Thanks!
Tom, I did read through the links you provided. Unfortunately, I don't feel qualified to judge the technical merits of the possible solutions. Since you appear to be well informed on this issue, can I ask you a few quick questions?
1. Would it be difficult to add an option to ANALYZE to force it to pretend that there are a minimum number of rows (e.g., ANALYZE MINIMUM 1000 or something)? This would appear to be a simple-minded way to solve the problem without any concerns about backward compatibility.
2. Why does a newly CREATE'd table behave differently than an empty table after ANALYZE? Does it make sense that it should? In the CREATE case, the assumptions appear to be much more reasonable for a table that is going to grow.
3. Has anyone ever tested whether there is a measurable performance gain after doing ANALYZE on empty or nearly empty tables? We know that there is a very large (in my case 15X) performance loss when the table starts growing. If the gain is small or negligible when the tables really are small, then perhaps worrying about maintaining current behaviour is not as important.
The nice thing about option (1) is that it solves the slow-insert issue both for empty tables and for tables with a few rows. It also causes absolutely no backward-compatibility issues.
Thanks very much for your comments on this. Mark
"Mark Dexter" <MDEXTER@dexterchaney.com> writes: > 1. Would it be difficult to add an option to ANALYZE to force it to > pretend that there are a minimum number of rows (e.g., ANALYZE MINIMUM > 1000 or something)? This would appear to be a simple-minded way to > solve the problem without any concerns about backward compatibility. This strikes me as useless, not to mention not backward-compatible at all. Where is ANALYZE supposed to come up with the data to fill pg_statistic? Shall we add the same kluge option to VACUUM and CREATE INDEX? > 2. Why does a newly CREATE'd table behave differently than an empty > table after ANALYZE? Does it make sense that it should? This is a long-standing hack, which I am proposing undoing; see http://archives.postgresql.org/pgsql-patches/2004-11/msg00339.php and in particular read the comments that the patch deletes. > 3. Has anyone ever tested whether there is a measurable performance > gained after doing ANALYZE on empty or nearly empty tables? As long as the table *stays* empty or nearly so, the knowledge that it is small is good for the planner to have. The problem we are dealing with here boils down to the fact that a table can grow substantially without the planner being aware of the fact. So IMHO the correct solution is to attack that problem head-on, not to invent weird options to make ANALYZE lie about what it found. CREATE TABLE shouldn't be lying either, but at one time that seemed like a good quick-hack workaround ... regards, tom lane