Thread: Autovacuum and Autoanalyze

Autovacuum and Autoanalyze

From
Simon Riggs
Date:
Disabling autovacuum can have catastrophic effects, since it disables
the ANALYZing of tables.

Can we have a mode where we disable autoVACUUM yet enable autoANALYZE?

ANALYZE times are fairly bounded because of the way we do sampling.
VACUUM times are not bounded at all, and typically > O(n). So it makes
sense to switch off the VACUUM at certain times, but never good to
switch off ANALYZE.

While we're there, it would be useful if CREATE TABLE AS SELECT was
followed by an automatic ANALYZE. Especially important for temp tables.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Autovacuum and Autoanalyze

From
"Robert Haas"
Date:
> Disabling autovacuum can have catastrophic effects, since it disables
> the ANALYZing of tables.
>
> Can we have a mode where we disable autoVACUUM yet enable autoANALYZE?
>
> ANALYZE times are fairly bounded because of the way we do sampling.
> VACUUM times are not bounded at all, and typically > O(n). So it makes
> sense to switch off the VACUUM at certain times, but never good to
> switch off ANALYZE.

This seems reasonable.

> While we're there, it would be useful if CREATE TABLE AS SELECT was
> followed by an automatic ANALYZE. Especially important for temp tables.

This seems like the wrong solution.  There is a general problem that
bulk data loads on an empty table tend to result in horrible query
plans, but not all of those will be created using CREATE TABLE AS
SELECT.  Someone may easily do a COPY or just a bunch of INSERTs.
Maybe something like: If the table has never been ANALYZEd, force an
immediate ANALYZE before planning the first SELECT, UPDATE, or DELETE.

And maybe also do the same thing if the table has grown significantly
(not sure what the threshold should be) since the last ANALYZE.

I'm not sure exactly what is practical here but it would certainly be
nice to have some solution.  This has got to be my #1 cause of
extremely slow queries.

...Robert


Re: Autovacuum and Autoanalyze

From
Tom Lane
Date:
"Robert Haas" <robertmhaas@gmail.com> writes:
> This seems like the wrong solution.  There is a general problem that
> bulk data loads on an empty table tend to result in horrible query
> plans,

Please provide some specifics.  It's been a very long time since the
planner was completely unaware of the size of such a table.  Lack of
stats is certainly a handicap, but I'm not convinced it should result
in horrible plans.  Maybe a more appropriate answer to this type of
issue is to tweak some of the default selectivity numbers.

> And maybe also do the same thing if the table has grown significantly
> (not sure what the threshold should be) since the last ANALYZE.

Autovacuum already does this type of thing.
        regards, tom lane


Re: Autovacuum and Autoanalyze

From
"Robert Haas"
Date:
> Please provide some specifics.  It's been a very long time since the
> planner was completely unaware of the size of such a table.  Lack of
> stats is certainly a handicap, but I'm not convinced it should result
> in horrible plans.  Maybe a more appropriate answer to this type of
> issue is to tweak some of the default selectivity numbers.

Sure.  See attached output.  This is from 8.2.9, but the behavior on
HEAD is similar.  The first query executed before and then again after
the ANALYZE is OK, but the second, which involves an additional join
condition, is 6X slower prior to the ANALYZE.

I don't see how you're going to fix this problem by tweaking the
selectivity estimates.  If it were possible to generate good query
plans without selectivity estimates derived from the actual table
contents, we wouldn't need ANALYZE in the first place.

>> And maybe also do the same thing if the table has grown significantly
>> (not sure what the threshold should be) since the last ANALYZE.
>
> Autovacuum already does this type of thing.

It's asynchronous, though.  Frequently, you want to load a bunch of
data into a table and then immediately execute a query against it, or
possibly several queries.  It's pretty annoying to have to write logic
that says - ok, if the number of rows that we just inserted was really
big relative to what was already in the table, then do an ANALYZE on
the table before issuing the SELECT, otherwise skip it.

I would be happy enough if we could recognize CREATE TABLE ... insert
a bunch of data ... SELECT as a case where we need to force a
synchronous ANALYZE - because in my experience you almost always do.
Recognizing the case where the table has grown a lot since the last
ANALYZE is probably harder, and a bit less important, but would surely
be nice if it could be done.

...Robert

Attachment

Re: Autovacuum and Autoanalyze

From
Alvaro Herrera
Date:
Simon Riggs wrote:
> Disabling autovacuum can have catastrophic effects, since it disables
> the ANALYZing of tables.
> 
> Can we have a mode where we disable autoVACUUM yet enable autoANALYZE?

You mean something like
autovacuum = on / off / analyze ?

We can certainly do that, but is there buy-in?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Autovacuum and Autoanalyze

From
David Fetter
Date:
On Tue, Sep 16, 2008 at 08:59:08PM -0400, Alvaro Herrera wrote:
> Simon Riggs wrote:
> > Disabling autovacuum can have catastrophic effects, since it disables
> > the ANALYZing of tables.
> > 
> > Can we have a mode where we disable autoVACUUM yet enable autoANALYZE?
> 
> You mean something like
> autovacuum = on / off / analyze ?
> 
> We can certainly do that, but is there buy-in?

+1

Having autovacuum on during bulk loads can really tank performance,
but having autoanalyze on is good :)

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Autovacuum and Autoanalyze

From
Heikki Linnakangas
Date:
David Fetter wrote:
> On Tue, Sep 16, 2008 at 08:59:08PM -0400, Alvaro Herrera wrote:
>> Simon Riggs wrote:
>>> Disabling autovacuum can have catastrophic effects, since it disables
>>> the ANALYZing of tables.
>>>
>>> Can we have a mode where we disable autoVACUUM yet enable autoANALYZE?
>> You mean something like
>> autovacuum = on / off / analyze ?
>>
>> We can certainly do that, but is there buy-in?
> 
> +1
> 
> Having autovacuum on during bulk loads can really tank performance,
> but having autoanalyze on is good :)

Isn't autoanalyze a waste of time during a bulk load? Seems better to 
run ANALYZE manually at the end.

Adding that option feels natural to me, but it is a rather blunt 
instrument. You can already do that with pg_autovacuum, though that 
interface isn't very user-friendly. I whole-heartedly support the idea 
of controlling autovacuum with storage options, e.g "ALTER TABLE ... 
WITH (autoanalyze = on)".

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Autovacuum and Autoanalyze

From
Simon Riggs
Date:
On Wed, 2008-09-17 at 10:09 +0300, Heikki Linnakangas wrote:
> David Fetter wrote:
> > On Tue, Sep 16, 2008 at 08:59:08PM -0400, Alvaro Herrera wrote:
> >> Simon Riggs wrote:
> >>> Disabling autovacuum can have catastrophic effects, since it disables
> >>> the ANALYZing of tables.
> >>>
> >>> Can we have a mode where we disable autoVACUUM yet enable autoANALYZE?
> >> You mean something like
> >> autovacuum = on / off / analyze ?
> >>
> >> We can certainly do that, but is there buy-in?
> > 
> > +1
> > 
> > Having autovacuum on during bulk loads can really tank performance,
> > but having autoanalyze on is good :)
> 
> Isn't autoanalyze a waste of time during a bulk load? Seems better to 
> run ANALYZE manually at the end.

Its not a waste of time because it catches tables immediately they have
been loaded, not just at the end of the bulk load. Running ANALYZE is a
waste of time if autoanalyze has already caught it, which is why that's
never been added onto the end of a pg_dump script. But currently this is
true only when we have both autoVACUUM and autoANALYZE enabled.

> Adding that option feels natural to me, but it is a rather blunt 
> instrument. You can already do that with pg_autovacuum, though that 
> interface isn't very user-friendly. I whole-heartedly support the idea 
> of controlling autovacuum with storage options, e.g "ALTER TABLE ... 
> WITH (autoanalyze = on)".

Yes, have that option also, since it is fine tuning.

I definitely want a blunt instrument! I don't want to have to run ALTER
TABLE on *every* table. Even if you think that's possible, it won't work
in conjunction with interfaces submitting standard SQL, plus it won't
work if I forget either.

This request comes from a real situation where a dump was reloaded
during the day when autovacuum was off and so ANALYZE was missed. Not my
mistake, but it took time to resolve that could have been avoided by the
new option suggested here.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Autovacuum and Autoanalyze

From
Alvaro Herrera
Date:
Simon Riggs wrote:
> 
> On Wed, 2008-09-17 at 10:09 +0300, Heikki Linnakangas wrote:

> > Isn't autoanalyze a waste of time during a bulk load? Seems better to 
> > run ANALYZE manually at the end.
> 
> Its not a waste of time because it catches tables immediately they have
> been loaded, not just at the end of the bulk load. Running ANALYZE is a
> waste of time if autoanalyze has already caught it, which is why that's
> never been added onto the end of a pg_dump script. But currently this is
> true only when we have both autoVACUUM and autoANALYZE enabled.

Hmm, one of the first complaints about defaulting autovacuum to on was
that it made restores so much longer *because* it was choosing to do
autoanalyzes on the tables as they were imported.  It was then that the
auto-cancel mechanism was introduced.

http://pgsql.markmail.org/message/rqyjkafuw43426xy

Why doesn't this new request conflict with that one?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Autovacuum and Autoanalyze

From
Heikki Linnakangas
Date:
Alvaro Herrera wrote:
> Simon Riggs wrote:
>> On Wed, 2008-09-17 at 10:09 +0300, Heikki Linnakangas wrote:
>>> Isn't autoanalyze a waste of time during a bulk load? Seems better to 
>>> run ANALYZE manually at the end.
>> Its not a waste of time because it catches tables immediately they have
>> been loaded, not just at the end of the bulk load. Running ANALYZE is a
>> waste of time if autoanalyze has already caught it, which is why that's
>> never been added onto the end of a pg_dump script. But currently this is
>> true only when we have both autoVACUUM and autoANALYZE enabled.
> 
> Hmm, one of the first complaints about defaulting autovacuum to on was
> that it made restores so much longer *because* it was choosing to do
> autoanalyzes on the tables as they were imported.  It was then that the
> auto-cancel mechanism was introduced.
> 
> http://pgsql.markmail.org/message/rqyjkafuw43426xy
> 
> Why doesn't this new request conflict with that one?

The problem back then was that a CREATE INDEX was waiting on the 
autoanalyze to finish, and the autoanalyze took a long time to finish 
because of vacuum_cost_delay. Now that we have the auto-cancel 
mechanism, that's not a problem.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Autovacuum and Autoanalyze

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Alvaro Herrera wrote:
>> Why doesn't this new request conflict with that one?

> The problem back then was that a CREATE INDEX was waiting on the 
> autoanalyze to finish, and the autoanalyze took a long time to finish 
> because of vacuum_cost_delay. Now that we have the auto-cancel 
> mechanism, that's not a problem.

Define "not a problem".  With auto-cancel, what will happen is that
whatever work the autoanalyze does will be wasted.  It seems to me
that the current complaint is about background autovacuum/autoanalyze
wasting cycles during a bulk load, and there's certainly no purer waste
than an analyze cycle that gets aborted.

I tend to agree with Alvaro that there's not very much of a use case for
an analyze-only autovacuum mode.  Assuming that we get to the point of
having a parallelizing pg_restore, it would be interesting to give it an
option to include ANALYZE for each table it's loaded among the tasks
that it schedules.  (I'm visualizing these commands as being made up by
pg_restore itself, *not* added to the pg_dump output.)  Then you could
have a reasonably optimal total workflow, whereas allowing autovacuum
to try to schedule the ANALYZEs can't be.
        regards, tom lane


Re: Autovacuum and Autoanalyze

From
Simon Riggs
Date:
On Wed, 2008-09-17 at 10:52 -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> > Alvaro Herrera wrote:
> >> Why doesn't this new request conflict with that one?
> 
> > The problem back then was that a CREATE INDEX was waiting on the 
> > autoanalyze to finish, and the autoanalyze took a long time to finish 
> > because of vacuum_cost_delay. Now that we have the auto-cancel 
> > mechanism, that's not a problem.
> 
> Define "not a problem".  With auto-cancel, what will happen is that
> whatever work the autoanalyze does will be wasted.  It seems to me
> that the current complaint is about background autovacuum/autoanalyze
> wasting cycles during a bulk load, and there's certainly no purer waste
> than an analyze cycle that gets aborted.

OK, but that's an argument against auto-anything, not just against
splitting out autoanalyze and autovacuum.

> I tend to agree with Alvaro that there's not very much of a use case for
> an analyze-only autovacuum mode. 

Did he say that? I thought he said "we could do that", what did that mean Alvaro?

I have a customer saying this would be a good thing and I agree. The
roles of Autovacuum and autoanalyze are not exactly matched, so why do
we force them to be run together or not at all? Why not allow
the user to specify whether they want both or not? It's an option, we're
not forcing anyone to do it that way if they don't want to.

>  Assuming that we get to the point of
> having a parallelizing pg_restore, it would be interesting to give it an
> option to include ANALYZE for each table it's loaded among the tasks
> that it schedules.  (I'm visualizing these commands as being made up by
> pg_restore itself, *not* added to the pg_dump output.)  Then you could
> have a reasonably optimal total workflow, whereas allowing autovacuum
> to try to schedule the ANALYZEs can't be.

That doesn't solve all problems, just ones with pg_restore. That's nice
and I won't turn it away, but what will we do about plain pg_dump and
about other table creations and loads?

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Autovacuum and Autoanalyze

From
"Robert Haas"
Date:
> I tend to agree with Alvaro that there's not very much of a use case for
> an analyze-only autovacuum mode.  Assuming that we get to the point of
> having a parallelizing pg_restore, it would be interesting to give it an
> option to include ANALYZE for each table it's loaded among the tasks
> that it schedules.  (I'm visualizing these commands as being made up by
> pg_restore itself, *not* added to the pg_dump output.)  Then you could
> have a reasonably optimal total workflow, whereas allowing autovacuum
> to try to schedule the ANALYZEs can't be.

In Simon's original email, he suggested forcing an automatic ANALYZE
on the server side after CREATE TABLE AS.  I objected on the grounds
that this won't fix anything for people who are doing bulk data loads
using any other mechanism.  Here, you're proposing the exact same
thing, except instead of restricting it to people who use CREATE TABLE
AS, you're restricting it to people who use a hypothetical
parallelized implementation of pg_restore.

While either of these is better than doing nothing, ISTM it would be
far better to give the database some smarts about what constitutes a
bulk data load (a whole bunch of insert operations on a newly created
table) and what to do about it (synchronous analyze just before the
first operation on the table that isn't an insert - and perhaps not
before).

...Robert