Thread: Autovacuum and Autoanalyze
Disabling autovacuum can have catastrophic effects, since it disables the ANALYZing of tables. Can we have a mode where we disable autoVACUUM yet enable autoANALYZE? ANALYZE times are fairly bounded because of the way we do sampling. VACUUM times are not bounded at all, and typically > O(n). So it makes sense to switch off the VACUUM at certain times, but never good to switch off ANALYZE. While we're there, it would be useful if CREATE TABLE AS SELECT was followed by an automatic ANALYZE. Especially important for temp tables. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
> Disabling autovacuum can have catastrophic effects, since it disables > the ANALYZing of tables. > > Can we have a mode where we disable autoVACUUM yet enable autoANALYZE? > > ANALYZE times are fairly bounded because of the way we do sampling. > VACUUM times are not bounded at all, and typically > O(n). So it makes > sense to switch off the VACUUM at certain times, but never good to > switch off ANALYZE. This seems reasonable. > While we're there, it would be useful if CREATE TABLE AS SELECT was > followed by an automatic ANALYZE. Especially important for temp tables. This seems like the wrong solution. There is a general problem that bulk data loads on an empty table tend to result in horrible query plans, but not all of those will be created using CREATE TABLE AS SELECT. Someone may easily do a COPY or just a bunch of INSERTs. Maybe something like: If the table has never been ANALYZEd, force an immediate ANALYZE before planning the first SELECT, UPDATE, or DELETE. And maybe also do the same thing if the table has grown significantly (not sure what the threshold should be) since the last ANALYZE. I'm not sure exactly what is practical here but it would certainly be nice to have some solution. This has got to be my #1 cause of extremely slow queries. ...Robert
"Robert Haas" <robertmhaas@gmail.com> writes: > This seems like the wrong solution. There is a general problem that > bulk data loads on an empty table tend to result in horrible query > plans, Please provide some specifics. It's been a very long time since the planner was completely unaware of the size of such a table. Lack of stats is certainly a handicap, but I'm not convinced it should result in horrible plans. Maybe a more appropriate answer to this type of issue is to tweak some of the default selectivity numbers. > And maybe also do the same thing if the table has grown significantly > (not sure what the threshold should be) since the last ANALYZE. Autovacuum already does this type of thing. regards, tom lane
> Please provide some specifics. It's been a very long time since the > planner was completely unaware of the size of such a table. Lack of > stats is certainly a handicap, but I'm not convinced it should result > in horrible plans. Maybe a more appropriate answer to this type of > issue is to tweak some of the default selectivity numbers. Sure. See attached output. This is from 8.2.9, but the behavior on HEAD is similar. The first query executed before and then again after the ANALYZE is OK, but the second, which involves an additional join condition, is 6X slower prior to the ANALYZE. I don't see how you're going to fix this problem by tweaking the selectivity estimates. If it were possible to generate good query plans without selectivity estimates derived from the actual table contents, we wouldn't need ANALYZE in the first place. >> And maybe also do the same thing if the table has grown significantly >> (not sure what the threshold should be) since the last ANALYZE. > > Autovacuum already does this type of thing. It's asynchronous, though. Frequently, you want to load a bunch of data into a table and then immediately execute a query against it, or possibly several queries. It's pretty annoying to have to write logic that says - ok, if the number of rows that we just inserted was really big relative to what was already in the table, then do an ANALYZE on the table before issuing the SELECT, otherwise skip it. I would be happy enough if we could recognize CREATE TABLE ... insert a bunch of data ... SELECT as a case where we need to force a synchronous ANALYZE - because in my experience you almost always do. Recognizing the case where the table has grown a lot since the last ANALYZE is probably harder, and a bit less important, but would surely be nice if it could be done. ...Robert
Attachment
Simon Riggs wrote: > Disabling autovacuum can have catastrophic effects, since it disables > the ANALYZing of tables. > > Can we have a mode where we disable autoVACUUM yet enable autoANALYZE? You mean something like autovacuum = on / off / analyze ? We can certainly do that, but is there buy-in? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Tue, Sep 16, 2008 at 08:59:08PM -0400, Alvaro Herrera wrote: > Simon Riggs wrote: > > Disabling autovacuum can have catastrophic effects, since it disables > > the ANALYZing of tables. > > > > Can we have a mode where we disable autoVACUUM yet enable autoANALYZE? > > You mean something like > autovacuum = on / off / analyze ? > > We can certainly do that, but is there buy-in? +1 Having autovacuum on during bulk loads can really tank performance, but having autoanalyze on is good :) Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
David Fetter wrote: > On Tue, Sep 16, 2008 at 08:59:08PM -0400, Alvaro Herrera wrote: >> Simon Riggs wrote: >>> Disabling autovacuum can have catastrophic effects, since it disables >>> the ANALYZing of tables. >>> >>> Can we have a mode where we disable autoVACUUM yet enable autoANALYZE? >> You mean something like >> autovacuum = on / off / analyze ? >> >> We can certainly do that, but is there buy-in? > > +1 > > Having autovacuum on during bulk loads can really tank performance, > but having autoanalyze on is good :) Isn't autoanalyze a waste of time during a bulk load? Seems better to run ANALYZE manually at the end. Adding that option feels natural to me, but it is a rather blunt instrument. You can already do that with pg_autovacuum, though that interface isn't very user-friendly. I whole-heartedly support the idea of controlling autovacuum with storage options, e.g "ALTER TABLE ... WITH (autoanalyze = on)". -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2008-09-17 at 10:09 +0300, Heikki Linnakangas wrote: > David Fetter wrote: > > On Tue, Sep 16, 2008 at 08:59:08PM -0400, Alvaro Herrera wrote: > >> Simon Riggs wrote: > >>> Disabling autovacuum can have catastrophic effects, since it disables > >>> the ANALYZing of tables. > >>> > >>> Can we have a mode where we disable autoVACUUM yet enable autoANALYZE? > >> You mean something like > >> autovacuum = on / off / analyze ? > >> > >> We can certainly do that, but is there buy-in? > > > > +1 > > > > Having autovacuum on during bulk loads can really tank performance, > > but having autoanalyze on is good :) > > Isn't autoanalyze a waste of time during a bulk load? Seems better to > run ANALYZE manually at the end. Its not a waste of time because it catches tables immediately they have been loaded, not just at the end of the bulk load. Running ANALYZE is a waste of time if autoanalyze has already caught it, which is why that's never been added onto the end of a pg_dump script. But currently this is true only when we have both autoVACUUM and autoANALYZE enabled. > Adding that option feels natural to me, but it is a rather blunt > instrument. You can already do that with pg_autovacuum, though that > interface isn't very user-friendly. I whole-heartedly support the idea > of controlling autovacuum with storage options, e.g "ALTER TABLE ... > WITH (autoanalyze = on)". Yes, have that option also, since it is fine tuning. I definitely want a blunt instrument! I don't want to have to run ALTER TABLE on *every* table. Even if you think that's possible, it won't work in conjunction with interfaces submitting standard SQL, plus it won't work if I forget either. This request comes from a real situation where a dump was reloaded during the day when autovacuum was off and so ANALYZE was missed. Not my mistake, but it took time to resolve that could have been avoided by the new option suggested here. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Simon Riggs wrote: > > On Wed, 2008-09-17 at 10:09 +0300, Heikki Linnakangas wrote: > > Isn't autoanalyze a waste of time during a bulk load? Seems better to > > run ANALYZE manually at the end. > > Its not a waste of time because it catches tables immediately they have > been loaded, not just at the end of the bulk load. Running ANALYZE is a > waste of time if autoanalyze has already caught it, which is why that's > never been added onto the end of a pg_dump script. But currently this is > true only when we have both autoVACUUM and autoANALYZE enabled. Hmm, one of the first complaints about defaulting autovacuum to on was that it made restores so much longer *because* it was choosing to do autoanalyzes on the tables as they were imported. It was then that the auto-cancel mechanism was introduced. http://pgsql.markmail.org/message/rqyjkafuw43426xy Why doesn't this new request conflict with that one? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote: > Simon Riggs wrote: >> On Wed, 2008-09-17 at 10:09 +0300, Heikki Linnakangas wrote: >>> Isn't autoanalyze a waste of time during a bulk load? Seems better to >>> run ANALYZE manually at the end. >> Its not a waste of time because it catches tables immediately they have >> been loaded, not just at the end of the bulk load. Running ANALYZE is a >> waste of time if autoanalyze has already caught it, which is why that's >> never been added onto the end of a pg_dump script. But currently this is >> true only when we have both autoVACUUM and autoANALYZE enabled. > > Hmm, one of the first complaints about defaulting autovacuum to on was > that it made restores so much longer *because* it was choosing to do > autoanalyzes on the tables as they were imported. It was then that the > auto-cancel mechanism was introduced. > > http://pgsql.markmail.org/message/rqyjkafuw43426xy > > Why doesn't this new request conflict with that one? The problem back then was that a CREATE INDEX was waiting on the autoanalyze to finish, and the autoanalyze took a long time to finish because of vacuum_cost_delay. Now that we have the auto-cancel mechanism, that's not a problem. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Alvaro Herrera wrote: >> Why doesn't this new request conflict with that one? > The problem back then was that a CREATE INDEX was waiting on the > autoanalyze to finish, and the autoanalyze took a long time to finish > because of vacuum_cost_delay. Now that we have the auto-cancel > mechanism, that's not a problem. Define "not a problem". With auto-cancel, what will happen is that whatever work the autoanalyze does will be wasted. It seems to me that the current complaint is about background autovacuum/autoanalyze wasting cycles during a bulk load, and there's certainly no purer waste than an analyze cycle that gets aborted. I tend to agree with Alvaro that there's not very much of a use case for an analyze-only autovacuum mode. Assuming that we get to the point of having a parallelizing pg_restore, it would be interesting to give it an option to include ANALYZE for each table it's loaded among the tasks that it schedules. (I'm visualizing these commands as being made up by pg_restore itself, *not* added to the pg_dump output.) Then you could have a reasonably optimal total workflow, whereas allowing autovacuum to try to schedule the ANALYZEs can't be. regards, tom lane
On Wed, 2008-09-17 at 10:52 -0400, Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > > Alvaro Herrera wrote: > >> Why doesn't this new request conflict with that one? > > > The problem back then was that a CREATE INDEX was waiting on the > > autoanalyze to finish, and the autoanalyze took a long time to finish > > because of vacuum_cost_delay. Now that we have the auto-cancel > > mechanism, that's not a problem. > > Define "not a problem". With auto-cancel, what will happen is that > whatever work the autoanalyze does will be wasted. It seems to me > that the current complaint is about background autovacuum/autoanalyze > wasting cycles during a bulk load, and there's certainly no purer waste > than an analyze cycle that gets aborted. OK, but that's an argument against auto-anything, not just against splitting out autoanalyze and autovacuum. > I tend to agree with Alvaro that there's not very much of a use case for > an analyze-only autovacuum mode. Did he say that? I thought he said "we could do that", what did that mean Alvaro? I have a customer saying this would be a good thing and I agree. The roles of Autovacuum and autoanalyze are not exactly matched, so why do we force them to be run together or not at all? Why not allow the user to specify whether they want both or not? It's an option, we're not forcing anyone to do it that way if they don't want to. > Assuming that we get to the point of > having a parallelizing pg_restore, it would be interesting to give it an > option to include ANALYZE for each table it's loaded among the tasks > that it schedules. (I'm visualizing these commands as being made up by > pg_restore itself, *not* added to the pg_dump output.) Then you could > have a reasonably optimal total workflow, whereas allowing autovacuum > to try to schedule the ANALYZEs can't be. That doesn't solve all problems, just ones with pg_restore. That's nice and I won't turn it away, but what will we do about plain pg_dump and about other table creations and loads? -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
> I tend to agree with Alvaro that there's not very much of a use case for > an analyze-only autovacuum mode. Assuming that we get to the point of > having a parallelizing pg_restore, it would be interesting to give it an > option to include ANALYZE for each table it's loaded among the tasks > that it schedules. (I'm visualizing these commands as being made up by > pg_restore itself, *not* added to the pg_dump output.) Then you could > have a reasonably optimal total workflow, whereas allowing autovacuum > to try to schedule the ANALYZEs can't be. In Simon's original email, he suggested forcing an automatic ANALYZE on the server side after CREATE TABLE AS. I objected on the grounds that this won't fix anything for people who are doing bulk data loads using any other mechanism. Here, you're proposing the exact same thing, except instead of restricting it to people who use CREATE TABLE AS, you're restricting it to people who use a hypothetical parallelized implementation of pg_restore. While either of these is better than doing nothing, ISTM it would be far better to give the database some smarts about what constitutes a bulk data load (a whole bunch of insert operations on a newly created table) and what to do about it (synchronous analyze just before the first operation on the table that isn't an insert - and perhaps not before). ...Robert