Thread: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Florian Helmberger <fh@25th-floor.com> writes:
> On 25.05.11 04:47, Tom Lane wrote:
>> Florian Helmberger <fh@25th-floor.com> writes:
>>> I'm running a production database with PostgreSQL 9.0.3 (64-bit) on
>>> Debian 5.0.4 and have an issue with a TOAST table and far too frequent
>>> autovacuum runs.
>>>
>>> I think I've pinned the problem down to the values pg_class holds for
>>> the affected TOAST table:
>>>
>>> relpages  | 433596
>>> reltuples | 1868538
>>>
>>> These values are significantly too low. Interestingly, the autovacuum
>>> log output reports the correct values:
>>>
>>> pages: 0 removed, 34788136 remain
>>> tuples: 932487 removed, 69599038 remain
>>>
>>> but these aren't stored in pg_class after each run.

>> That's exceedingly weird. Do the pg_stat_all_tables columns update
>> after autovacuums on that table?

> Yes they do:

I think I see what must be going on here: that toast table must contain
a long run of all-visible-according-to-the-VM pages (which is hardly a
surprising situation). This results in VACUUM choosing not to update
the pg_class entry:

    /*
     * Update statistics in pg_class.  But only if we didn't skip any pages;
     * the tuple count only includes tuples from the pages we've visited, and
     * we haven't frozen tuples in unvisited pages either.  The page count is
     * accurate in any case, but because we use the reltuples / relpages ratio
     * in the planner, it's better to not update relpages either if we can't
     * update reltuples.
     */
    if (vacrelstats->scanned_all)
        vac_update_relstats(onerel,
                            vacrelstats->rel_pages, vacrelstats->rel_tuples,
                            vacrelstats->hasindex,
                            FreezeLimit);

For an ordinary table this wouldn't be fatal because we'll still do an
ANALYZE from time to time, and that will update the entries with new
(approximate) values. But we never run ANALYZE on toast tables. And
this would *still* be okay, because as noted in the comment, the planner
only depends on the ratio being roughly correct, not on either
individual value being current.
But autovacuum didn't get the memo; it thinks it can use reltuples to
make decisions. I can see two basic approaches we might take here:

1. Modify autovacuum to use something from the stats collector, rather
than reltuples, to make its decisions. I'm not too clear on why
reltuples is being used there anyway; is there some good algorithmic or
statistical reason why AV should be looking at a number from the last
vacuum?

2. Revise the vacuum code so that it doesn't skip updating the pg_class
entries. We could have it count the number of pages it skipped, rather
than just keeping a bool, and then scale up the rel_tuples count to be
approximately right by assuming the skipped pages have tuple density
similar to the scanned ones.

Thoughts?

			regards, tom lane
Excerpts from Tom Lane's message of mié may 25 11:47:52 -0400 2011:

> I think I see what must be going on here: that toast table must contain
> a long run of all-visible-according-to-the-VM pages (which is hardly a
> surprising situation). This results in VACUUM choosing not to update
> the pg_class entry:
>
>     /*
>      * Update statistics in pg_class.  But only if we didn't skip any pages;
>      * the tuple count only includes tuples from the pages we've visited, and
>      * we haven't frozen tuples in unvisited pages either.  The page count is
>      * accurate in any case, but because we use the reltuples / relpages ratio
>      * in the planner, it's better to not update relpages either if we can't
>      * update reltuples.
>      */
>     if (vacrelstats->scanned_all)
>         vac_update_relstats(onerel,
>                             vacrelstats->rel_pages, vacrelstats->rel_tuples,
>                             vacrelstats->hasindex,
>                             FreezeLimit);
>
> For an ordinary table this wouldn't be fatal because we'll still do an
> ANALYZE from time to time, and that will update the entries with new
> (approximate) values. But we never run ANALYZE on toast tables.

Ouch.

> I can see two basic approaches we might take here:
>
> 1. Modify autovacuum to use something from the stats collector, rather
> than reltuples, to make its decisions. I'm not too clear on why
> reltuples is being used there anyway; is there some good algorithmic or
> statistical reason why AV should be looking at a number from the last
> vacuum?

It uses reltuples simply because it was what the original contrib code
was using. Since pgstat was considerably weaker at the time, reltuples
might have been the only thing available. It's certainly the case that
pgstat has improved a lot since autovacuum got in, and some things have
been revised, but not this one.

> 2. Revise the vacuum code so that it doesn't skip updating the pg_class
> entries. We could have it count the number of pages it skipped, rather
> than just keeping a bool, and then scale up the rel_tuples count to be
> approximately right by assuming the skipped pages have tuple density
> similar to the scanned ones.

Hmm, interesting idea. This would be done only for toast tables, or all
tables?

At this point I only wonder why we store tuples & pages rather than
just live tuple density.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Wed, May 25, 2011 at 11:47 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 2. Revise the vacuum code so that it doesn't skip updating the pg_class
> entries. We could have it count the number of pages it skipped, rather
> than just keeping a bool, and then scale up the rel_tuples count to be
> approximately right by assuming the skipped pages have tuple density
> similar to the scanned ones.

This approach doesn't seem like a good idea to me. The skipped portions
of the table are the ones that haven't been modified in a while, so this
embeds an assumption that the tuples in the hot and cold portions of the
table are the same size. That might be true, but it isn't hard to think
of cases where it might not be. There could also very easily be
sampling error, because it's easy to imagine a situation where 99% of
the table is getting skipped. Any error that creeps into the estimate
is going to get scaled up along with the estimate itself.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Tom Lane's message of mié may 25 11:47:52 -0400 2011:
>> I can see two basic approaches we might take here:
>>
>> 1. Modify autovacuum to use something from the stats collector, rather
>> than reltuples, to make its decisions. I'm not too clear on why
>> reltuples is being used there anyway; is there some good algorithmic or
>> statistical reason why AV should be looking at a number from the last
>> vacuum?

> It uses reltuples simply because it was what the original contrib code
> was using. Since pgstat was considerably weaker at the time, reltuples
> might have been the only thing available. It's certainly the case that
> pgstat has improved a lot since autovacuum got in, and some things have
> been revised but not this one.

On reflection I'm hesitant to do this, especially for a backpatched bug
fix, because it would be changing the feedback loop behavior for
autovacuum scheduling. That could have surprising consequences.

>> 2. Revise the vacuum code so that it doesn't skip updating the pg_class
>> entries. We could have it count the number of pages it skipped, rather
>> than just keeping a bool, and then scale up the rel_tuples count to be
>> approximately right by assuming the skipped pages have tuple density
>> similar to the scanned ones.

> Hmm, interesting idea. This would be done only for toast tables, or all
> tables?

I'm thinking just do it for all. The fact that these numbers don't
necessarily update after a vacuum is certainly surprising in and of
itself, and it did not work that way before the VM patch went in. I'm
concerned about other stuff besides AV not dealing well with obsolete
values.

> At this point I only wonder why we store tuples & pages rather than just
> live tuple density.

It's just for backwards compatibility. I've thought about doing that in
the past, but I don't know what client-side code might be looking at
relpages/reltuples. It's not like collapsing them into one field would
save much, anyway.

			regards, tom lane
From: "Kevin Grittner"
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I don't know what client-side code might be looking at
> relpages/reltuples.

I know that I find reltuples useful for getting an "accurate enough"
sense of rows in a table (or set of tables) without resorting to
count(*). I'd be OK with any two of pages, tuples, and density; but
dropping to one would make databases harder to manage, IMV.

Personally, I don't have much code that uses those columns; eliminating
an existing column wouldn't involve much pain for me as long as it
could still be derived.

-Kevin
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, May 25, 2011 at 11:47 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 2. Revise the vacuum code so that it doesn't skip updating the pg_class
>> entries. We could have it count the number of pages it skipped, rather
>> than just keeping a bool, and then scale up the rel_tuples count to be
>> approximately right by assuming the skipped pages have tuple density
>> similar to the scanned ones.

> This approach doesn't seem like a good idea to me. The skipped
> portions of the table are the ones that haven't been modified in a
> while, so this embeds an assumption that the tuples in the hot and
> cold portions of the table are the same size. That might be true,
> but it isn't hard to think of cases where it might not be. There
> could also very easily be sampling error, because it's easy to imagine
> a situation where 99% of the table is getting skipped.

Yeah, I had been thinking about the latter point. We could be
conservative and just keep the reported tuple density the same (ie,
update relpages to the new correct value, while setting reltuples so
that the density ratio doesn't change). But that has its own problems
when the table contents *do* change. What I'm currently imagining is to
do a smoothed moving average, where we factor in the new density
estimate with a weight dependent on the percentage of the table we did
scan. That is, the calculation goes something like

    old_density = old_reltuples / old_relpages
    new_density = counted_tuples / scanned_pages
    reliability = scanned_pages / new_relpages
    updated_density = old_density + (new_density - old_density) * reliability
    new_reltuples = updated_density * new_relpages

We could slow the moving-average convergence even further when
reliability is small; for instance if you were really paranoid you
might want to use the square of reliability in line 4. That might be
overdoing it, though.

			regards, tom lane
On Wed, May 25, 2011 at 12:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, I had been thinking about the latter point. We could be
> conservative and just keep the reported tuple density the same (ie,
> update relpages to the new correct value, while setting reltuples so
> that the density ratio doesn't change). But that has its own problems
> when the table contents *do* change. What I'm currently imagining is
> to do a smoothed moving average, where we factor in the new density
> estimate with a weight dependent on the percentage of the table we did
> scan. That is, the calculation goes something like
>
> old_density = old_reltuples / old_relpages
> new_density = counted_tuples / scanned_pages
> reliability = scanned_pages / new_relpages
> updated_density = old_density + (new_density - old_density) * reliability
> new_reltuples = updated_density * new_relpages
>
> We could slow the moving-average convergence even further when
> reliability is small; for instance if you were really paranoid you might
> want to use the square of reliability in line 4. That might be
> overdoing it, though.

I don't know. That's maybe better, but I'd be willing to wager that in
some cases it will just slow down the rate at which we converge to a
completely incorrect value, while in other cases it'll fail to update
the data when it really has changed.

I am wondering, though, why we're not just inserting a special-purpose
hack for TOAST tables. Your email seems to indicate that regular tables
are already handled well enough, and certainly if we only whack around
the TOAST behavior it's much less likely to fry anything.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Excerpts from Kevin Grittner's message of mié may 25 12:37:24 -0400 2011:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > I don't know what client-side code might be looking at
> > relpages/reltuples.
>
> I know that I find reltuples useful for getting an "accurate enough"
> sense of rows in a table (or set of tables) without resorting to
> count(*). I'd be OK with any two of pages, tuples, and density; but
> dropping to one would make databases harder to manage, IMV.
>
> Personally, I don't have much code that uses those columns;
> eliminating an existing column wouldn't involve much pain for me as
> long as it could still be derived.

Well, we only actually need to store one number, because you can figure
out a much more precise number-of-pages figure with pg_relation_size()
divided by the configured page size.

(We feel free to hack around catalogs in other places, and we regularly
break pgAdmin and the like when we drop columns -- people just live
with it and update their tools. I don't think it's such a big deal in
this particular case.)

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, May 25, 2011 at 12:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah, I had been thinking about the latter point. We could be
>> conservative and just keep the reported tuple density the same (ie,
>> update relpages to the new correct value, while setting reltuples so
>> that the density ratio doesn't change). But that has its own problems
>> when the table contents *do* change. What I'm currently imagining is
>> to do a smoothed moving average, where we factor in the new density
>> estimate with a weight dependent on the percentage of the table we did
>> scan. That is, the calculation goes something like
>>
>> old_density = old_reltuples / old_relpages
>> new_density = counted_tuples / scanned_pages
>> reliability = scanned_pages / new_relpages
>> updated_density = old_density + (new_density - old_density) * reliability
>> new_reltuples = updated_density * new_relpages
>>
>> We could slow the moving-average convergence even further when
>> reliability is small; for instance if you were really paranoid you might
>> want to use the square of reliability in line 4. That might be
>> overdoing it, though.

> I don't know. That's maybe better, but I'd be willing to wager that
> in some cases it will just slow down the rate at which we converge to
> a completely incorrect value, while in other cases it'll fail to
> update the data when it really has changed.

[ shrug... ] When you don't have complete information, it's *always*
the case that you will sometimes make a mistake. That's not
justification for paralysis, especially not when the existing code is
demonstrably broken.

What occurs to me after thinking a bit more is that the existing tuple
density is likely to be only an estimate, too (one coming from the last
ANALYZE, which could very well have scanned even less of the table than
VACUUM did). So what I now think is that both VACUUM and ANALYZE ought
to use a calculation of the above form to arrive at a new value for
pg_class.reltuples. In both cases it would be pretty easy to track the
number of pages we looked at while counting tuples, so the same raw
information is available.

> I am wondering, though, why we're not just inserting a special-purpose
> hack for TOAST tables.

Because the problem is not specific to TOAST tables. As things
currently stand, we will accept the word of an ANALYZE as gospel even
if it scanned 1% of the table, and completely ignore the results from a
VACUUM even if it scanned 99% of the table. This is not sane.

			regards, tom lane
Excerpts from Robert Haas's message of mié may 25 12:54:28 -0400 2011:
> I don't know. That's maybe better, but I'd be willing to wager that
> in some cases it will just slow down the rate at which we converge to
> a completely incorrect value, while in other cases it'll fail to
> update the data when it really has changed.

For regular tables I don't think there's a real problem, because it'll
get fixed the next time a full scan happens anyway. For toast tables, I
think the set of operations is limited enough that this is easy to
prove correct (or fixed so that it is) -- no HOT updates (in fact no
updates at all), etc.

BTW, one thing we haven't fixed at all is how HOT updates affect
vacuuming behavior ...

> I am wondering, though, why we're not just inserting a special-purpose
> hack for TOAST tables. Your email seems to indicate that regular
> tables are already handled well enough, and certainly if we only whack
> around the TOAST behavior it's much less likely to fry anything.

Well, having two paths, one of which is uncommonly used, means that it
will get all the bugs. As with the xl_commit WAL record comment from
Simon, I'd rather stick with having a single one.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Wed, May 25, 2011 at 1:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> [ shrug... ] When you don't have complete information, it's *always*
> the case that you will sometimes make a mistake. That's not
> justification for paralysis, especially not when the existing code is
> demonstrably broken.
>
> What occurs to me after thinking a bit more is that the existing tuple
> density is likely to be only an estimate, too (one coming from the last
> ANALYZE, which could very well have scanned even less of the table than
> VACUUM did). So what I now think is that both VACUUM and ANALYZE ought
> to use a calculation of the above form to arrive at a new value for
> pg_class.reltuples. In both cases it would be pretty easy to track the
> number of pages we looked at while counting tuples, so the same raw
> information is available.
>
>> I am wondering, though, why we're not just inserting a special-purpose
>> hack for TOAST tables.
>
> Because the problem is not specific to TOAST tables. As things
> currently stand, we will accept the word of an ANALYZE as gospel even if
> it scanned 1% of the table, and completely ignore the results from a
> VACUUM even if it scanned 99% of the table. This is not sane.

I agree that if VACUUM scanned 99% of the table, it's probably fine to
use its numbers. It's also fine to use the numbers from ANALYZE,
because those pages are chosen randomly. What bothers me is the idea of
using a small *non-random* sample, and I'm not sure that incorporating
possibly-bogus results slowly is any better than incorporating them
quickly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 25, 2011 at 1:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I agree that if VACUUM scanned 99% of the table, it's probably fine to
> use its numbers. It's also fine to use the numbers from ANALYZE,
> because those pages are chosen randomly. What bothers me is the idea
> of using a small *non-random* sample, and I'm not sure that
> incorporating possibly-bogus results slowly is any better than
> incorporating them quickly.

In particular, unless I'm misremembering, VACUUM *always* scans the
first few pages of the table, until it sees enough consecutive
all-visible bits that it decides to start skipping. If I'm right about
that, then those pages could easily end up being overweighted when
VACUUM does the counting, especially if ANALYZE or an actual full-table
vacuum aren't allowed to snap the count back to reality.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: Cédric Villemain
2011/5/25 Alvaro Herrera <alvherre@commandprompt.com>:
> Excerpts from Kevin Grittner's message of mié may 25 12:37:24 -0400 2011:
>> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> > I don't know what client-side code might be looking at
>> > relpages/reltuples.
>>
>> I know that I find reltuples useful for getting an "accurate enough"
>> sense of rows in a table (or set of tables) without resorting to
>> count(*). I'd be OK with any two of pages, tuples, and density; but
>> dropping to one would make databases harder to manage, IMV.
>>
>> Personally, I don't have much code that uses those columns;
>> eliminating an existing column wouldn't involve much pain for me as
>> long as it could still be derived.
>
> Well, we only actually need to store one number, because you can figure
> out a much more precise number-of-pages figure with pg_relation_size()
> divided by configured page size.
>
> (We feel free to hack around catalogs in other places, and we regularly
> break pgAdmin and the like when we drop columns -- people just live with
> it and update their tools. I don't think it's such a big deal in this
> particular case.)

I may be missing something, but we need the relation size in costsize.c
even if we have a reldensity (or we need reltuples). Otherwise, what
values should be used to estimate the relation size? (pg_relation_size()
goes down to the kernel/fs to ask for stat.st_size; is that really what
we want?)

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
Excerpts from Cédric Villemain's message of mié may 25 13:24:01 -0400 2011:

> > Well, we only actually need to store one number, because you can figure
> > out a much more precise number-of-pages figure with pg_relation_size()
> > divided by configured page size.

> I may miss something but we need relation size in costsize.c even if
> we have a reldensity (or we need a reltuples). Else what values should
> be used to estimate the relation size ? (pg_relation_size() goes down
> to kernel/fs to ask the stat.st_size, is it really what we want?)

Actually yes, the planner does go to the kernel to determine the
current relation size, and then multiplies by the density as computed
from catalog data to figure out a reasonably accurate current number of
tuples.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
From: Cédric Villemain
2011/5/25 Alvaro Herrera <alvherre@commandprompt.com>:
> Excerpts from Cédric Villemain's message of mié may 25 13:24:01 -0400 2011:
>
>> > Well, we only actually need to store one number, because you can figure
>> > out a much more precise number-of-pages figure with pg_relation_size()
>> > divided by configured page size.
>
>> I may miss something but we need relation size in costsize.c even if
>> we have a reldensity (or we need a reltuples). Else what values should
>> be used to estimate the relation size ? (pg_relation_size() goes down
>> to kernel/fs to ask the stat.st_size, is it really what we want?)
>
> Actually yes, the planner does go to kernel to determine the current
> relation size, and then multiplies by density as computed from catalog
> data to figure out current reasonably accurate number of tuples.

Okay! I just read that part. Interesting. (If I'm reading it
correctly, we find our last segment and then seek to the end of that
segment to get the size.) Makes more sense, suddenly :)

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, May 25, 2011 at 1:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Because the problem is not specific to TOAST tables. As things
>> currently stand, we will accept the word of an ANALYZE as gospel even if
>> it scanned 1% of the table, and completely ignore the results from a
>> VACUUM even if it scanned 99% of the table. This is not sane.

> I agree that if VACUUM scanned 99% of the table, it's probably fine to
> use its numbers. It's also fine to use the numbers from ANALYZE,
> because those pages are chosen randomly. What bothers me is the idea
> of using a small *non-random* sample, and I'm not sure that
> incorporating possibly-bogus results slowly is any better than
> incorporating them quickly.

The above is simply fuzzy thinking. The fact that ANALYZE looked at a
random subset of pages is *no guarantee whatsoever* that its results
are highly accurate. They might be more trustworthy than VACUUM's
nonrandom sample of a similar number of pages, but it doesn't hold even
a little bit of water to claim that we should believe ANALYZE
completely and VACUUM not at all even when the latter has looked at a
significantly larger sample of pages.

In any case, your line of thought doesn't help us for fixing the
problem with toast tables, because we aren't going to start doing
ANALYZEs on toast tables.

The bottom line here is that making use of the stats we have is a lot
better than not making use of them, even if they aren't entirely
trustworthy.

			regards, tom lane
On Wed, May 25, 2011 at 5:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, May 25, 2011 at 1:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Because the problem is not specific to TOAST tables. As things
>>> currently stand, we will accept the word of an ANALYZE as gospel even if
>>> it scanned 1% of the table, and completely ignore the results from a
>>> VACUUM even if it scanned 99% of the table. This is not sane.
>
>> I agree that if VACUUM scanned 99% of the table, it's probably fine to
>> use its numbers. It's also fine to use the numbers from ANALYZE,
>> because those pages are chosen randomly. What bothers me is the idea
>> of using a small *non-random* sample, and I'm not sure that
>> incorporating possibly-bogus results slowly is any better than
>> incorporating them quickly.
>
> The above is simply fuzzy thinking. The fact that ANALYZE looked at a
> random subset of pages is *no guarantee whatsoever* that its results are
> highly accurate. They might be more trustworthy than VACUUM's nonrandom
> sample of a similar number of pages, but it doesn't hold even a little
> bit of water to claim that we should believe ANALYZE completely and
> VACUUM not at all even when the latter has looked at a significantly
> larger sample of pages.

I think you're arguing against a straw man.

> In any case, your line of thought doesn't help us for fixing the problem
> with toast tables, because we aren't going to start doing ANALYZEs on
> toast tables.

Can we simply use a constant for tuple density on TOAST tables?

> The bottom line here is that making use of stats we have is a lot better
> than not making use of them, even if they aren't entirely trustworthy.

http://dilbert.com/strips/comic/2008-05-07/

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 25, 2011 at 10:05 PM, Greg Stark <gsstark@mit.edu> wrote:
>> updated_density = old_density + (new_density - old_density) * reliability
>> new_reltuples = updated_density * new_relpages
>
> This amounts to assuming that the pages observed in the vacuum have
> the density observed and the pages that weren't seen have the density
> that were previously in the reltuples/relpages stats.

In case it's not clear, Tom's expression for updated_density is
equivalent by simple algebra to:

    updated_density = (old_density * (1 - reliability)) + (new_density * reliability)

--
greg
On Wed, May 25, 2011 at 9:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, I had been thinking about the latter point. We could be
> conservative and just keep the reported tuple density the same (ie,
> update relpages to the new correct value, while setting reltuples so
> that the density ratio doesn't change). But that has its own problems
> when the table contents *do* change. What I'm currently imagining is
> to do a smoothed moving average, where we factor in the new density
> estimate with a weight dependent on the percentage of the table we did
> scan. That is, the calculation goes something like
>
> old_density = old_reltuples / old_relpages
> new_density = counted_tuples / scanned_pages
> reliability = scanned_pages / new_relpages
> updated_density = old_density + (new_density - old_density) * reliability
> new_reltuples = updated_density * new_relpages

This amounts to assuming that the pages observed in the vacuum have the
density observed, and the pages that weren't seen have the density that
was previously in the reltuples/relpages stats. That seems like a
pretty solid approach to me. If the numbers were sane before, it
follows that they should be sane after the update.

--
greg
Greg Stark <gsstark@mit.edu> writes:
> On Wed, May 25, 2011 at 9:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ... What I'm currently imagining is
>> to do a smoothed moving average, where we factor in the new density
>> estimate with a weight dependent on the percentage of the table we did
>> scan. That is, the calculation goes something like
>>
>> old_density = old_reltuples / old_relpages
>> new_density = counted_tuples / scanned_pages
>> reliability = scanned_pages / new_relpages
>> updated_density = old_density + (new_density - old_density) * reliability
>> new_reltuples = updated_density * new_relpages

> This amounts to assuming that the pages observed in the vacuum have
> the density observed and the pages that weren't seen have the density
> that were previously in the reltuples/relpages stats. That seems like
> a pretty solid approach to me. If the numbers were sane before it
> follows that they should be sane after the update.

Hm, that's an interesting way of looking at it, but I was coming at it
from a signal-processing point of view. What Robert is concerned about
is that if VACUUM is cleaning a non-representative sample of pages, and
repeated VACUUMs examine pretty much the same sample each time, then
over repeated applications of the above formula the estimated density
will eventually converge to what we are seeing in the sample. The
speed of convergence depends on the moving-average multiplier, ie the
"reliability" number above, and what I was after was just to slow down
convergence for smaller samples. So I wouldn't have any problem with
including a fudge factor to make the convergence even slower. But your
analogy makes it seem like this particular formulation is actually
"right" in some sense.

One other point here is that Florian's problem is really only with our
failing to update relpages. I don't think there is any part of the
system that particularly cares about reltuples for a toast table. So
even if the value did converge to some significantly-bad estimate over
time, it's not really an issue AFAICS. We do care about having a sane
reltuples estimate for regular tables, but for those we should have a
mixture of updates from ANALYZE and updates from VACUUM. Also, for
both regular and toast tables we will have an occasional
vacuum-for-wraparound that is guaranteed to scan all pages and hence do
a hard reset of reltuples to the correct value.

I'm still of the opinion that an incremental estimation process like
the above is a lot saner than what we're doing now, snarky Dilbert
references notwithstanding. The only thing that seems worthy of debate
from here is whether we should trust ANALYZE's estimates a bit more
than VACUUM's estimates, on the grounds that the former are more likely
to be from a random subset of pages. We could implement that by
applying a fudge factor when folding a VACUUM estimate into the moving
average (ie, multiply its reliability by something less than one). I
don't have any principled suggestion for just what the fudge factor
ought to be, except that I don't think "zero" is the best value, which
AFAICT is what Robert is arguing. I think Greg's argument shows that
"one" is the right value when dealing with an ANALYZE estimate, if you
believe that ANALYZE saw a random set of pages ... but using that for
VACUUM does seem overoptimistic.

			regards, tom lane
On Thu, May 26, 2011 at 11:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm still of the opinion that an incremental estimation process like > the above is a lot saner than what we're doing now, snarky Dilbert > references notwithstanding. The only thing that seems worthy of debate > from here is whether we should trust ANALYZE's estimates a bit more than > VACUUM's estimates, on the grounds that the former are more likely to be > from a random subset of pages. We could implement that by applying a > fudge factor when folding a VACUUM estimate into the moving average (ie, > multiply its reliability by something less than one). I don't have any > principled suggestion for just what the fudge factor ought to be, except > that I don't think "zero" is the best value, which AFAICT is what Robert > is arguing. I think Greg's argument shows that "one" is the right value > when dealing with an ANALYZE estimate, if you believe that ANALYZE saw a > random set of pages ... but using that for VACUUM does seem > overoptimistic.

The problem is that it's quite difficult to predict what the relative frequencies of full-relation-vacuum, vacuum-with-skips, and ANALYZE operations on the table will be. It matters how fast the table is being inserted into vs. updated/deleted; and it also matters how fast the table is being updated compared with the system's rate of XID consumption. So in general it seems hard to say, well, we know this number might drift off course a little bit, but there will be a freezing vacuum or analyze or something coming along soon enough to fix the problem. There might be, but it's difficult to be sure. My argument isn't so much that using a non-zero value here is guaranteed to have bad effects, but that we really have no idea what will work out well in practice, and therefore it seems dangerous to whack the behavior around ... especially in stable branches. If we changed this in 9.1, and that's the last time we ever get a complaint about it, problem solved. 
But I would feel bad if we changed this in the back-branches and then found that, while solving this particular problem, we had created others. It also seems likely that the replacement problems would be more subtle and more difficult to diagnose, because they'd depend in a very complicated way on the workload, and having, say, the latest table contents would not necessarily enable us to reproduce the problem. I would feel a lot better about something that is deterministic, like, I dunno, if VACUUM visits more than 25% of the table, we use its estimate. And we always use ANALYZE's estimate. Or something. Another thought: Couldn't relation_needs_vacanalyze() just scale up reltuples by the ratio of the current number of pages in the relation to relpages, just as the query planner does? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I would feel a lot better about something that is deterministic, like, > I dunno, if VACUUM visits more than 25% of the table, we use its > estimate. And we always use ANALYZE's estimate. Or something. This argument seems to rather miss the point. The data we are working from is fundamentally not deterministic, and you can't make it so by deciding to ignore what data we do have. That leads to a less useful estimate, not a more useful estimate. > Another thought: Couldn't relation_needs_vacanalyze() just scale up > reltuples by the ratio of the current number of pages in the relation > to relpages, just as the query planner does? Hmm ... that would fix Florian's immediate issue, and it does seem like a good change on its own merits. But it does nothing for the problem that we're failing to put the best available information into pg_class. Possibly we could compromise on doing just that much in the back branches, and the larger change for 9.1? regards, tom lane
On Thu, May 26, 2011 at 12:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Another thought: Couldn't relation_needs_vacanalyze() just scale up >> reltuples by the ratio of the current number of pages in the relation >> to relpages, just as the query planner does? > > Hmm ... that would fix Florian's immediate issue, and it does seem like > a good change on its own merits. But it does nothing for the problem > that we're failing to put the best available information into pg_class. > > Possibly we could compromise on doing just that much in the back > branches, and the larger change for 9.1? Do you think we need to worry about the extra overhead of determining the current size of every relation as we sweep through pg_class? It's not a lot, but OTOH I think we'd be doing it once a minute... not sure what would happen if there were tons of tables. Going back to your thought upthread, I think we should really consider replacing reltuples with reltupledensity at some point. I continue to be afraid that using a decaying average in this case is going to end up overweighting the values from some portion of the table that's getting scanned repeatedly, at the expense of other portions of the table that are not getting scanned at all. Now, perhaps experience will prove that's not a problem. But storing relpages and reltupledensity separately would give us more flexibility, because we could feel free to bump relpages even when we're not sure what to do about reltupledensity. That would help Florian's problem quite a lot, even if we did nothing else. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
From: "Kevin Grittner"
Robert Haas <robertmhaas@gmail.com> wrote: > I think we should really consider replacing reltuples with > reltupledensity at some point. I continue to be afraid that using > a decaying average in this case is going to end up overweighting > the values from some portion of the table that's getting scanned > repeatedly, at the expense of other portions of the table that are > not getting scanned at all. Now, perhaps experience will prove > that's not a problem. But storing relpages and reltupledensity > separately would give us more flexibility, because we could feel > free to bump relpages even when we're not sure what to do about > reltupledensity. That would help Florian's problem quite a lot, > even if we did nothing else. Given how trivial it would be to adjust reltuples to keep its ratio to relpages about the same when we don't have a new "hard" number, but some evidence that we should fudge our previous value, I don't see where this change buys us much. It seems to mostly obfuscate the fact that we're changing our assumption about how many tuples we have. I would rather that we did that explicitly with code comments about why it's justified than to slip it in the way you suggest. -Kevin
On Thu, May 26, 2011 at 1:28 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Robert Haas <robertmhaas@gmail.com> wrote: >> I think we should really consider replacing reltuples with >> reltupledensity at some point. I continue to be afraid that using >> a decaying average in this case is going to end up overweighting >> the values from some portion of the table that's getting scanned >> repeatedly, at the expense of other portions of the table that are >> not getting scanned at all. Now, perhaps experience will prove >> that's not a problem. But storing relpages and reltupledensity >> separately would give us more flexibility, because we could feel >> free to bump relpages even when we're not sure what to do about >> reltupledensity. That would help Florian's problem quite a lot, >> even if we did nothing else. > > Given how trivial it would be to adjust reltuples to keep its ratio > to relpages about the same when we don't have a new "hard" number, > but some evidence that we should fudge our previous value, I don't > see where this change buys us much. It seems to mostly obfuscate > the fact that we're changing our assumption about how many tuples we > have. I would rather that we did that explicitly with code comments > about why it's justified than to slip it in the way you suggest. I'm a bit confused by this - what the current design obfuscates is the fact that reltuples and relpages are not really independent columns; you can't update one without updating the other, unless you want screwy behavior. Replacing reltuples by reltupledensity would fix that problem - it would be logical and non-damaging to update either column independently. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote: > Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: >> Given how trivial it would be to adjust reltuples to keep its >> ratio to relpages about the same when we don't have a new "hard" >> number, but some evidence that we should fudge our previous >> value, I don't see where this change buys us much. It seems to >> mostly obfuscate the fact that we're changing our assumption >> about how many tuples we have. I would rather that we did that >> explicitly with code comments about why it's justified than to >> slip it in the way you suggest. > > I'm a bit confused by this - what the current design obfuscates is > the fact that reltuples and relpages are not really independent > columns; you can't update one without updating the other, unless > you want screwy behavior. Replacing reltuples by reltupledensity > would fix that problem - it would be logical and non-damaging to > update either column independently.

They don't always move in tandem. Certainly there can be available space in those pages from which tuples can be allocated or which increases as tuples are vacuumed. Your proposed change would make neither more nor less information available, because we've got two numbers which can be observed as raw counts, and a ratio between them. By storing the ratio and one count you make changes to the other count implied and less visible. It seems more understandable and less prone to error (to me, anyway) to keep the two "raw" numbers and calculate the ratio -- and when you observe a change in one raw number which you believe should force an adjustment to the other raw number before its next actual value is observed, to comment on why that's a good idea, and do the trivial arithmetic at that time.

As a thought exercise, what happens each way if a table is loaded with a low fillfactor and then a lot of inserts are done? What happens if mass deletes are done from a table which has a high density? -Kevin
On Thu, May 26, 2011 at 2:05 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: >> I'm a bit confused by this - what the current design obfuscates is >> the fact that reltuples and relpages are not really independent >> columns; you can't update one without updating the other, unless >> you want screwy behavior. Replacing reltuples by reltupledensity >> would fix that problem - it would be logical and non-damaging to >> update either column independently. > > They don't always move in tandem. Certainly there can be available > space in those pages from which tuples can be allocated or which > increases as tuples are vacuumed. Your proposed change would > neither make more or less information available, because we've got > two numbers which can be observed as raw counts, and a ratio between > them. So far I agree. > By storing the ratio and one count you make changes to the > other count implied and less visible. It seems more understandable > and less prone to error (to me, anyway) to keep the two "raw" > numbers and calculate the ratio -- and when you observe a change in > one raw number which you believe should force an adjustment to the > other raw number before its next actual value is observed, to > comment on why that's a good idea, and do the trivial arithmetic at > that time. Except that's not how it works. At least in the case of ANALYZE, we *aren't* counting all the tuples in the table. We're selecting a random sample of pages and inferring a tuple density, which we then extrapolate to the whole table and store. Then when we pull it back out of the table, we convert it back to a tuple density. The real value we are computing and using almost everywhere is tuple density; storing a total number of tuples in the table appears to be just confusing the issue. Unless, of course, I am misunderstanding, which is possible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Except that's not how it works. At least in the case of ANALYZE, we > *aren't* counting all the tuples in the table. We're selecting a > random sample of pages and inferring a tuple density, which we then > extrapolate to the whole table and store. Then when we pull it back > out of the table, we convert it back to a tuple density. The real > value we are computing and using almost everywhere is tuple density; > storing a total number of tuples in the table appears to be just > confusing the issue. If we were starting in a green field we might choose to store tuple density. However, the argument for changing it now is at best mighty thin; IMO it is not worth the risk of breaking client code. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, May 26, 2011 at 12:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Another thought: Couldn't relation_needs_vacanalyze() just scale up >>> reltuples by the ratio of the current number of pages in the relation >>> to relpages, just as the query planner does? >> Hmm ... that would fix Florian's immediate issue, and it does seem like >> a good change on its own merits. But it does nothing for the problem >> that we're failing to put the best available information into pg_class. >> >> Possibly we could compromise on doing just that much in the back >> branches, and the larger change for 9.1? > Do you think we need to worry about the extra overhead of determining > the current size of every relation as we sweep through pg_class? It's > not a lot, but OTOH I think we'd be doing it once a minute... not sure > what would happen if there were tons of tables.

Ugh ... that is a mighty good point, since the RelationGetNumberOfBlocks call would have to happen for each table, even the ones we then decide not to vacuum. We've already seen people complain about the cost of the AV launcher once they have a lot of databases, and this would probably increase it quite a bit.

> Going back to your thought upthread, I think we should really consider > replacing reltuples with reltupledensity at some point. I continue to > be afraid that using a decaying average in this case is going to end > up overweighting the values from some portion of the table that's > getting scanned repeatedly, at the expense of other portions of the > table that are not getting scanned at all.

Changing the representation of the information would change that issue not in the slightest. The fundamental point here is that we have new, possibly partial, information which we ought to somehow merge with the old, also possibly partial, information. Storing the data a little bit differently doesn't magically eliminate that issue.

regards, tom lane
Robert Haas <robertmhaas@gmail.com> wrote: > Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: >> By storing the ratio and one count you make changes to the >> other count implied and less visible. It seems more >> understandable and less prone to error (to me, anyway) to keep >> the two "raw" numbers and calculate the ratio -- and when you >> observe a change in one raw number which you believe should force >> an adjustment to the other raw number before its next actual >> value is observed, to comment on why that's a good idea, and do >> the trivial arithmetic at that time. > > Except that's not how it works. At least in the case of ANALYZE, > we *aren't* counting all the tuples in the table. We're selecting > a random sample of pages and inferring a tuple density, which we > then extrapolate to the whole table and store. Then when we pull > it back out of the table, we convert it back to a tuple density. > The real value we are computing and using almost everywhere is > tuple density; storing a total number of tuples in the table > appears to be just confusing the issue.

Well, if tuple density is the number which is most heavily used, it might shave a few nanoseconds doing the arithmetic in enough places to justify the change, but I'm skeptical. Basically I'm with Tom on the fact that this change would store neither more nor less information (and for that matter would not really change what information you can easily retrieve); and slightly changing the manner in which it is stored doesn't solve any of the problems you assert that it does.

When we prune or vacuum a page, I don't suppose we have enough information about that page's previous state to calculate a tuple count delta, do we? That would allow a far more accurate number to be maintained than anything suggested so far, as long as we tweak autovacuum to count inserts toward the need to vacuum. (It seems to me I saw a post giving some reasons why that would have benefits anyway.) 
Except for the full pass during transaction wrap-around protection, where it could just set a new actual count, autovacuum would be skipping pages where the bit is set to indicate that all tuples are visible, right? -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > When we prune or vacuum a page, I don't suppose we have enough > information about that page's previous state to calculate a tuple > count delta, do we? That would allow a far more accurate number to > be maintained than anything suggested so far, as long as we tweak > autovacuum to count inserts toward the need to vacuum. Well, that was the other direction that was suggested upthread: stop relying on reltuples at all, but use the stats collector's counts. That might be a good solution in the long run, but there are some issues: 1. It's not clear how using a current count, as opposed to time-of-last-vacuum count, would affect the behavior of the autovacuum control logic. At first glance I think it would break it, since the basic logic there is "how much of the table changed since it was last vacuumed?". Even if the equations could be modified to still work, I remember enough feedback control theory from undergrad EE to think that this is something to be seriously scared of tweaking without extensive testing. IMO it is far more risky than what Robert is worried about. 2. You still have the problem that we're exposing inaccurate (or at least less accurate than they could be) counts to the planner and to onlooker clients. We could change the planner to also depend on the stats collector instead of reltuples, but at that point you just removed the option for people to turn off the stats collector. The implications for plan stability might be unpleasant, too. So that's not a direction I want to go without a significant amount of work and testing. regards, tom lane
On Thu, May 26, 2011 at 5:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: >> When we prune or vacuum a page, I don't suppose we have enough >> information about that page's previous state to calculate a tuple >> count delta, do we? That would allow a far more accurate number to >> be maintained than anything suggested so far, as long as we tweak >> autovacuum to count inserts toward the need to vacuum. > > Well, that was the other direction that was suggested upthread: stop > relying on reltuples at all, but use the stats collector's counts. > That might be a good solution in the long run, but there are some > issues: > > 1. It's not clear how using a current count, as opposed to > time-of-last-vacuum count, would affect the behavior of the autovacuum > control logic. At first glance I think it would break it, since the > basic logic there is "how much of the table changed since it was last > vacuumed?". Even if the equations could be modified to still work, > I remember enough feedback control theory from undergrad EE to think that > this is something to be seriously scared of tweaking without extensive > testing. IMO it is far more risky than what Robert is worried about. Yeah, I think that would be broken. > 2. You still have the problem that we're exposing inaccurate (or at > least less accurate than they could be) counts to the planner and to > onlooker clients. We could change the planner to also depend on the > stats collector instead of reltuples, but at that point you just removed > the option for people to turn off the stats collector. The implications > for plan stability might be unpleasant, too. > > So that's not a direction I want to go without a significant amount > of work and testing. FWIW, I agree. Your proposed solution is certainly better than trying to do this; but it still seems a bit shaky to me. Still, maybe we don't have a better option. 
If it were me, I'd add an additional safety valve: use your formula if the percentage of the relation scanned is above some threshold where there's unlikely to be too much skew. But if the percentage scanned is too small, then don't use that formula. Instead, only update relpages/reltuples if the relation is now larger; set relpages to the new actual value, and scale up reltuples proportionately. However, I just work here. It's possible that I'm worrying about a problem that won't materialize in practice. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Still, maybe we don't have a better option. If it were me, I'd add an > additional safety valve: use your formula if the percentage of the > relation scanned is above some threshold where there's unlikely to be > too much skew. But if the percentage scanned is too small, then don't > use that formula. Instead, only update relpages/reltuples if the > relation is now larger; set relpages to the new actual value, and > scale up reltuples proportionately. Ah, progress: now we're down to arguing about the size of the fudge factor ;-). I'll do something involving derating the reliability when the number is coming from VACUUM. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > Still, maybe we don't have a better option. If it were me, I'd add an > additional safety valve: use your formula if the percentage of the > relation scanned is above some threshold where there's unlikely to be > too much skew. But if the percentage scanned is too small, then don't > use that formula. Instead, only update relpages/reltuples if the > relation is now larger; set relpages to the new actual value, and > scale up reltuples proportionately. > However, I just work here. It's possible that I'm worrying about a > problem that won't materialize in practice. Attached is a proposed patch to fix these issues. Experimentation convinced me that including a fudge factor for VACUUM's results made things *less* accurate, not more so. The reason seems to be bound up in Greg Stark's observation that the unmodified calculation is equivalent to assuming that the old average tuple density still applies to the unscanned pages. In a VACUUM, we know that the unscanned pages are exactly those that have had no changes since (at least) the last vacuum, which means that indeed the old density ought to be a good estimate. Now, this reasoning can break down if the table's tuple density is nonuniform, but what I found in my testing is that if you vacuum after a significant change in a table (such as deleting a lot of rows), and you don't apply the full unfudged correction, you get a badly wrong result. I think that's a more significant issue than the possibility of drift over time. I also found that Greg was right in thinking that it would help if we tweaked lazy_scan_heap to not always scan the first SKIP_PAGES_THRESHOLD-1 pages even if they were all_visible_according_to_vm. That seemed to skew the results if those pages weren't representative. And, for the case of a useless manual vacuum on a completely clean table, it would cause the reltuples value to drift when there was no reason to change it at all. 
Lastly, this patch removes a bunch of grotty interconnections between VACUUM and ANALYZE that were meant to prevent ANALYZE from updating the stats if VACUUM had done a full-table scan in the same command. With the new logic it's relatively harmless if ANALYZE does that, and anyway autovacuum frequently fires the two cases independently anyway, making all that logic quite useless in the normal case. (This simplification accounts for the bulk of the diff, actually.) Comments? regards, tom lane diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index 0568a1bcf86281a9b1086d343e7027557295065c..fa84989fc6fa8be90d4eecb9c33e94a232d79880 100644 *** a/src/backend/commands/analyze.c --- b/src/backend/commands/analyze.c *************** static MemoryContext anl_context = NULL; *** 84,91 **** static BufferAccessStrategy vac_strategy; ! static void do_analyze_rel(Relation onerel, VacuumStmt *vacstmt, ! bool update_reltuples, bool inh); static void BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize); static bool BlockSampler_HasMore(BlockSampler bs); --- 84,90 ---- static BufferAccessStrategy vac_strategy; ! static void do_analyze_rel(Relation onerel, VacuumStmt *vacstmt, bool inh); static void BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize); static bool BlockSampler_HasMore(BlockSampler bs); *************** static bool std_typanalyze(VacAttrStats *** 115,132 **** /* * analyze_rel() -- analyze one relation - * - * If update_reltuples is true, we update reltuples and relpages columns - * in pg_class. Caller should pass false if we're part of VACUUM ANALYZE, - * and the VACUUM didn't skip any pages. We only have an approximate count, - * so we don't want to overwrite the accurate values already inserted by the - * VACUUM in that case. VACUUM always scans all indexes, however, so the - * pg_class entries for indexes are never updated if we're part of VACUUM - * ANALYZE. */ void ! 
analyze_rel(Oid relid, VacuumStmt *vacstmt, ! BufferAccessStrategy bstrategy, bool update_reltuples) { Relation onerel; --- 114,122 ---- /* * analyze_rel() -- analyze one relation */ void ! analyze_rel(Oid relid, VacuumStmt *vacstmt, BufferAccessStrategy bstrategy) { Relation onerel; *************** analyze_rel(Oid relid, VacuumStmt *vacst *** 238,250 **** /* * Do the normal non-recursive ANALYZE. */ ! do_analyze_rel(onerel, vacstmt, update_reltuples, false); /* * If there are child tables, do recursive ANALYZE. */ if (onerel->rd_rel->relhassubclass) ! do_analyze_rel(onerel, vacstmt, false, true); /* * Close source relation now, but keep lock so that no one deletes it --- 228,240 ---- /* * Do the normal non-recursive ANALYZE. */ ! do_analyze_rel(onerel, vacstmt, false); /* * If there are child tables, do recursive ANALYZE. */ if (onerel->rd_rel->relhassubclass) ! do_analyze_rel(onerel, vacstmt, true); /* * Close source relation now, but keep lock so that no one deletes it *************** analyze_rel(Oid relid, VacuumStmt *vacst *** 267,274 **** * do_analyze_rel() -- analyze one relation, recursively or not */ static void ! do_analyze_rel(Relation onerel, VacuumStmt *vacstmt, ! bool update_reltuples, bool inh) { int attr_cnt, tcnt, --- 257,263 ---- * do_analyze_rel() -- analyze one relation, recursively or not */ static void ! do_analyze_rel(Relation onerel, VacuumStmt *vacstmt, bool inh) { int attr_cnt, tcnt, *************** do_analyze_rel(Relation onerel, VacuumSt *** 437,445 **** } /* ! * Quit if no analyzable columns and no pg_class update needed. */ ! if (attr_cnt <= 0 && !analyzableindex && !update_reltuples) goto cleanup; /* --- 426,434 ---- } /* ! * Quit if no analyzable columns. */ ! if (attr_cnt <= 0 && !analyzableindex) goto cleanup; /* *************** do_analyze_rel(Relation onerel, VacuumSt *** 549,558 **** } /* ! * Update pages/tuples stats in pg_class, but not if we're inside a VACUUM ! * that got a more precise number. */ ! 
if (update_reltuples) vac_update_relstats(onerel, RelationGetNumberOfBlocks(onerel), totalrows, hasindex, InvalidTransactionId); --- 538,547 ---- } /* ! * Update pages/tuples stats in pg_class ... but not if we're doing ! * inherited stats. */ ! if (!inh) vac_update_relstats(onerel, RelationGetNumberOfBlocks(onerel), totalrows, hasindex, InvalidTransactionId); *************** do_analyze_rel(Relation onerel, VacuumSt *** 562,568 **** * VACUUM ANALYZE, don't overwrite the accurate count already inserted by * VACUUM. */ ! if (!(vacstmt->options & VACOPT_VACUUM)) { for (ind = 0; ind < nindexes; ind++) { --- 551,557 ---- * VACUUM ANALYZE, don't overwrite the accurate count already inserted by * VACUUM. */ ! if (!inh && !(vacstmt->options & VACOPT_VACUUM)) { for (ind = 0; ind < nindexes; ind++) { *************** do_analyze_rel(Relation onerel, VacuumSt *** 577,589 **** } /* ! * Report ANALYZE to the stats collector, too; likewise, tell it to adopt ! * these numbers only if we're not inside a VACUUM that got a better ! * number. However, a call with inh = true shouldn't reset the stats. */ if (!inh) ! pgstat_report_analyze(onerel, update_reltuples, ! totalrows, totaldeadrows); /* We skip to here if there were no analyzable columns */ cleanup: --- 566,577 ---- } /* ! * Report ANALYZE to the stats collector, too. However, if doing ! * inherited stats we shouldn't report, because the stats collector only ! * tracks per-table stats. */ if (!inh) ! pgstat_report_analyze(onerel, totalrows, totaldeadrows); /* We skip to here if there were no analyzable columns */ cleanup: *************** acquire_sample_rows(Relation onerel, Hea *** 1243,1260 **** qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows); /* ! * Estimate total numbers of rows in relation. */ if (bs.m > 0) ! { ! *totalrows = floor((liverows * totalblocks) / bs.m + 0.5); ! *totaldeadrows = floor((deadrows * totalblocks) / bs.m + 0.5); ! 
} else - { - *totalrows = 0.0; *totaldeadrows = 0.0; - } /* * Emit some interesting relation info --- 1231,1249 ---- qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows); /* ! * Estimate total numbers of rows in relation. For live rows, use ! * vac_estimate_reltuples; for dead rows, we have no source of old ! * information, so we have to assume the density is the same in unseen ! * pages as in the pages we scanned. */ + *totalrows = vac_estimate_reltuples(onerel, true, + totalblocks, + bs.m, + liverows); if (bs.m > 0) ! *totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5); else *totaldeadrows = 0.0; /* * Emit some interesting relation info diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index 9606569617afafe16c55752183cb2a6de89bcad1..efae31acab7de0f6d4cbe45fb175a301486ce582 100644 *** a/src/backend/commands/vacuum.c --- b/src/backend/commands/vacuum.c *************** *** 20,25 **** --- 20,27 ---- */ #include "postgres.h" + #include <math.h> + #include "access/clog.h" #include "access/genam.h" #include "access/heapam.h" *************** static BufferAccessStrategy vac_strategy *** 62,68 **** static List *get_rel_oids(Oid relid, const RangeVar *vacrel); static void vac_truncate_clog(TransactionId frozenXID); static bool vacuum_rel(Oid relid, VacuumStmt *vacstmt, bool do_toast, ! bool for_wraparound, bool *scanned_all); /* --- 64,70 ---- static List *get_rel_oids(Oid relid, const RangeVar *vacrel); static void vac_truncate_clog(TransactionId frozenXID); static bool vacuum_rel(Oid relid, VacuumStmt *vacstmt, bool do_toast, ! bool for_wraparound); /* *************** vacuum(VacuumStmt *vacstmt, Oid relid, b *** 219,230 **** foreach(cur, relations) { Oid relid = lfirst_oid(cur); - bool scanned_all = false; if (vacstmt->options & VACOPT_VACUUM) { ! if (!vacuum_rel(relid, vacstmt, do_toast, for_wraparound, ! 
&scanned_all))
  				continue;
  		}
--- 221,230 ----
  	foreach(cur, relations)
  	{
  		Oid			relid = lfirst_oid(cur);
  
  		if (vacstmt->options & VACOPT_VACUUM)
  		{
! 			if (!vacuum_rel(relid, vacstmt, do_toast, for_wraparound))
  				continue;
  		}
*************** vacuum(VacuumStmt *vacstmt, Oid relid, b
*** 241,247 ****
  			PushActiveSnapshot(GetTransactionSnapshot());
  		}
  
! 		analyze_rel(relid, vacstmt, vac_strategy, !scanned_all);
  
  		if (use_own_xacts)
  		{
--- 241,247 ----
  			PushActiveSnapshot(GetTransactionSnapshot());
  		}
  
! 		analyze_rel(relid, vacstmt, vac_strategy);
  
  		if (use_own_xacts)
  		{
*************** vacuum_set_xid_limits(int freeze_min_age
*** 454,459 ****
--- 454,537 ----
  
  /*
+  * vac_estimate_reltuples() -- estimate the new value for pg_class.reltuples
+  *
+  * If we scanned the whole relation then we should just use the count of
+  * live tuples seen; but if we did not, we should not trust the count
+  * unreservedly, especially not in VACUUM, which may have scanned a quite
+  * nonrandom subset of the table.  When we have only partial information,
+  * we take the old value of pg_class.reltuples as a measurement of the
+  * tuple density in the unscanned pages.
+  *
+  * This routine is shared by VACUUM and ANALYZE.
+  */
+ double
+ vac_estimate_reltuples(Relation relation, bool is_analyze,
+ 					   BlockNumber total_pages,
+ 					   BlockNumber scanned_pages,
+ 					   double scanned_tuples)
+ {
+ 	BlockNumber old_rel_pages = relation->rd_rel->relpages;
+ 	double		old_rel_tuples = relation->rd_rel->reltuples;
+ 	double		old_density;
+ 	double		new_density;
+ 	double		multiplier;
+ 	double		updated_density;
+ 
+ 	/* If we did scan the whole table, just use the count as-is */
+ 	if (scanned_pages >= total_pages)
+ 		return scanned_tuples;
+ 
+ 	/*
+ 	 * If scanned_pages is zero but total_pages isn't, keep the existing
+ 	 * value of reltuples.
+ 	 */
+ 	if (scanned_pages == 0)
+ 		return old_rel_tuples;
+ 
+ 	/*
+ 	 * If old value of relpages is zero, old density is indeterminate; we
+ 	 * can't do much except scale up scanned_tuples to match total_pages.
+ 	 */
+ 	if (old_rel_pages == 0)
+ 		return floor((scanned_tuples / scanned_pages) * total_pages + 0.5);
+ 
+ 	/*
+ 	 * Okay, we've covered the corner cases.  The normal calculation is to
+ 	 * convert the old measurement to a density (tuples per page), then
+ 	 * update the density using an exponential-moving-average approach,
+ 	 * and finally compute reltuples as updated_density * total_pages.
+ 	 *
+ 	 * For ANALYZE, the moving average multiplier is just the fraction of
+ 	 * the table's pages we scanned.  This is equivalent to assuming
+ 	 * that the tuple density in the unscanned pages didn't change.  Of
+ 	 * course, it probably did, if the new density measurement is different.
+ 	 * But over repeated cycles, the value of reltuples will converge towards
+ 	 * the correct value, if repeated measurements show the same new density.
+ 	 *
+ 	 * For VACUUM, the situation is a bit different: we have looked at a
+ 	 * nonrandom sample of pages, but we know for certain that the pages we
+ 	 * didn't look at are precisely the ones that haven't changed lately.
+ 	 * Thus, there is a reasonable argument for doing exactly the same thing
+ 	 * as for the ANALYZE case, that is use the old density measurement as
+ 	 * the value for the unscanned pages.
+ 	 *
+ 	 * This logic could probably use further refinement.
+ 	 */
+ 	old_density = old_rel_tuples / old_rel_pages;
+ 	new_density = scanned_tuples / scanned_pages;
+ 	multiplier = (double) scanned_pages / (double) total_pages;
+ #ifdef NOT_USED
+ 	/* this makes things less accurate, not more so :-( */
+ 	if (!is_analyze && multiplier < 0.5)
+ 		multiplier *= 0.25;		/* apply fudge factor */
+ #endif
+ 	updated_density = old_density + (new_density - old_density) * multiplier;
+ 	return floor(updated_density * total_pages + 0.5);
+ }
+ 
+ 
+ /*
  * vac_update_relstats() -- update statistics for one relation
  *
  * Update the whole-relation statistics that are kept in its pg_class
*************** vacuum_set_xid_limits(int freeze_min_age
*** 480,486 ****
  * somebody vacuuming pg_class might think they could delete a tuple
  * marked with xmin = our xid.
  *
! * This routine is shared by VACUUM and stand-alone ANALYZE.
  */
 void
 vac_update_relstats(Relation relation,
--- 558,564 ----
  * somebody vacuuming pg_class might think they could delete a tuple
  * marked with xmin = our xid.
  *
! * This routine is shared by VACUUM and ANALYZE.
  */
 void
 vac_update_relstats(Relation relation,
*************** vac_truncate_clog(TransactionId frozenXI
*** 758,771 ****
  * many small transactions.  Otherwise, two-phase locking would require
  * us to lock the entire database during one pass of the vacuum cleaner.
  *
- * We'll return true in *scanned_all if the vacuum scanned all heap
- * pages, and updated pg_class.
- *
  * At entry and exit, we are not inside a transaction.
  */
 static bool
! 
vacuum_rel(Oid relid, VacuumStmt *vacstmt, bool do_toast, bool for_wraparound) { LOCKMODE lmode; Relation onerel; *************** vacuum_rel(Oid relid, VacuumStmt *vacstm *** 775,783 **** int save_sec_context; int save_nestlevel; - if (scanned_all) - *scanned_all = false; - /* Begin a transaction for vacuuming this relation */ StartTransactionCommand(); --- 849,854 ---- *************** vacuum_rel(Oid relid, VacuumStmt *vacstm *** 971,977 **** vacstmt->freeze_min_age, vacstmt->freeze_table_age); } else ! lazy_vacuum_rel(onerel, vacstmt, vac_strategy, scanned_all); /* Roll back any GUC changes executed by index functions */ AtEOXact_GUC(false, save_nestlevel); --- 1042,1048 ---- vacstmt->freeze_min_age, vacstmt->freeze_table_age); } else ! lazy_vacuum_rel(onerel, vacstmt, vac_strategy); /* Roll back any GUC changes executed by index functions */ AtEOXact_GUC(false, save_nestlevel); *************** vacuum_rel(Oid relid, VacuumStmt *vacstm *** 997,1003 **** * totally unimportant for toast relations. */ if (toast_relid != InvalidOid) ! vacuum_rel(toast_relid, vacstmt, false, for_wraparound, NULL); /* * Now release the session-level lock on the master table. --- 1068,1074 ---- * totally unimportant for toast relations. */ if (toast_relid != InvalidOid) ! vacuum_rel(toast_relid, vacstmt, false, for_wraparound); /* * Now release the session-level lock on the master table. diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c index 9393fa0727aaad7508e1163623322b4066412257..570f1032f3d4eba46ed54de3f6cd39d62e6d09f0 100644 *** a/src/backend/commands/vacuumlazy.c --- b/src/backend/commands/vacuumlazy.c *************** *** 77,93 **** * Before we consider skipping a page that's marked as clean in * visibility map, we must've seen at least this many clean pages. */ ! 
#define SKIP_PAGES_THRESHOLD 32 typedef struct LVRelStats { /* hasindex = true means two-pass strategy; false means one-pass */ bool hasindex; - bool scanned_all; /* have we scanned all pages (this far)? */ /* Overall statistics about rel */ ! BlockNumber rel_pages; double old_rel_tuples; /* previous value of pg_class.reltuples */ ! double rel_tuples; /* counts only tuples on scanned pages */ BlockNumber pages_removed; double tuples_deleted; BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */ --- 77,94 ---- * Before we consider skipping a page that's marked as clean in * visibility map, we must've seen at least this many clean pages. */ ! #define SKIP_PAGES_THRESHOLD ((BlockNumber) 32) typedef struct LVRelStats { /* hasindex = true means two-pass strategy; false means one-pass */ bool hasindex; /* Overall statistics about rel */ ! BlockNumber rel_pages; /* total number of pages */ ! BlockNumber scanned_pages; /* number of pages we examined */ ! double scanned_tuples; /* counts only tuples on scanned pages */ double old_rel_tuples; /* previous value of pg_class.reltuples */ ! double new_rel_tuples; /* new estimated total # of tuples */ BlockNumber pages_removed; double tuples_deleted; BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */ *************** static int vac_cmp_itemptr(const void *l *** 143,149 **** */ void lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt, ! BufferAccessStrategy bstrategy, bool *scanned_all) { LVRelStats *vacrelstats; Relation *Irel; --- 144,150 ---- */ void lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt, ! 
BufferAccessStrategy bstrategy) { LVRelStats *vacrelstats; Relation *Irel; *************** lazy_vacuum_rel(Relation onerel, VacuumS *** 175,181 **** vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats)); - vacrelstats->scanned_all = true; /* will be cleared if we skip a page */ vacrelstats->old_rel_tuples = onerel->rd_rel->reltuples; vacrelstats->num_index_scans = 0; --- 176,181 ---- *************** lazy_vacuum_rel(Relation onerel, VacuumS *** 205,228 **** FreeSpaceMapVacuum(onerel); /* ! * Update statistics in pg_class. But only if we didn't skip any pages; ! * the tuple count only includes tuples from the pages we've visited, and ! * we haven't frozen tuples in unvisited pages either. The page count is ! * accurate in any case, but because we use the reltuples / relpages ratio ! * in the planner, it's better to not update relpages either if we can't ! * update reltuples. */ ! if (vacrelstats->scanned_all) ! vac_update_relstats(onerel, ! vacrelstats->rel_pages, vacrelstats->rel_tuples, ! vacrelstats->hasindex, ! FreezeLimit); /* report results to the stats collector, too */ pgstat_report_vacuum(RelationGetRelid(onerel), onerel->rd_rel->relisshared, ! vacrelstats->scanned_all, ! vacrelstats->rel_tuples); /* and log the action if appropriate */ if (IsAutoVacuumWorkerProcess() && Log_autovacuum_min_duration >= 0) --- 205,224 ---- FreeSpaceMapVacuum(onerel); /* ! * Update statistics in pg_class. But don't change relfrozenxid if we ! * skipped any pages. */ ! vac_update_relstats(onerel, ! vacrelstats->rel_pages, vacrelstats->new_rel_tuples, ! vacrelstats->hasindex, ! (vacrelstats->scanned_pages < vacrelstats->rel_pages) ? ! InvalidTransactionId : ! FreezeLimit); /* report results to the stats collector, too */ pgstat_report_vacuum(RelationGetRelid(onerel), onerel->rd_rel->relisshared, ! 
vacrelstats->new_rel_tuples); /* and log the action if appropriate */ if (IsAutoVacuumWorkerProcess() && Log_autovacuum_min_duration >= 0) *************** lazy_vacuum_rel(Relation onerel, VacuumS *** 239,251 **** get_namespace_name(RelationGetNamespace(onerel)), RelationGetRelationName(onerel), vacrelstats->num_index_scans, ! vacrelstats->pages_removed, vacrelstats->rel_pages, ! vacrelstats->tuples_deleted, vacrelstats->rel_tuples, pg_rusage_show(&ru0)))); } - - if (scanned_all) - *scanned_all = vacrelstats->scanned_all; } /* --- 235,246 ---- get_namespace_name(RelationGetNamespace(onerel)), RelationGetRelationName(onerel), vacrelstats->num_index_scans, ! vacrelstats->pages_removed, ! vacrelstats->rel_pages, ! vacrelstats->tuples_deleted, ! vacrelstats->new_rel_tuples, pg_rusage_show(&ru0)))); } } /* *************** lazy_scan_heap(Relation onerel, LVRelSta *** 301,307 **** HeapTupleData tuple; char *relname; BlockNumber empty_pages, - scanned_pages, vacuumed_pages; double num_tuples, tups_vacuumed, --- 296,301 ---- *************** lazy_scan_heap(Relation onerel, LVRelSta *** 321,327 **** get_namespace_name(RelationGetNamespace(onerel)), relname))); ! empty_pages = vacuumed_pages = scanned_pages = 0; num_tuples = tups_vacuumed = nkeep = nunused = 0; indstats = (IndexBulkDeleteResult **) --- 315,321 ---- get_namespace_name(RelationGetNamespace(onerel)), relname))); ! 
empty_pages = vacuumed_pages = 0; num_tuples = tups_vacuumed = nkeep = nunused = 0; indstats = (IndexBulkDeleteResult **) *************** lazy_scan_heap(Relation onerel, LVRelSta *** 329,340 **** --- 323,363 ---- nblocks = RelationGetNumberOfBlocks(onerel); vacrelstats->rel_pages = nblocks; + vacrelstats->scanned_pages = 0; vacrelstats->nonempty_pages = 0; vacrelstats->latestRemovedXid = InvalidTransactionId; lazy_space_alloc(vacrelstats, nblocks); + /* + * We want to skip pages that don't require vacuuming according to the + * visibility map, but only if we've seen a streak of at least + * SKIP_PAGES_THRESHOLD pages marked as clean. Since we're reading + * sequentially, the OS should be doing readahead for us and there's no + * gain in skipping a page now and then. You need a longer run of + * consecutive skipped pages before it's worthwhile. Also, skipping even + * a single page means that we can't update relfrozenxid, so we only want + * to do it if there's a good chance to skip a goodly number of pages. + * + * Before entering the main loop, look ahead to see if there are at least + * SKIP_PAGES_THRESHOLD clean pages at the start of the relation; if so, + * we are going to skip at least one page, and we might as well skip the + * initial pages too. This extra logic prevents over-weighting these + * pages when we estimate the new tuple density in vac_estimate_reltuples. + */ all_visible_streak = 0; + if (!scan_all && nblocks >= SKIP_PAGES_THRESHOLD) + { + for (blkno = 0; blkno < SKIP_PAGES_THRESHOLD; blkno++) + { + if (!visibilitymap_test(onerel, blkno, &vmbuffer)) + break; + all_visible_streak++; + } + if (all_visible_streak < SKIP_PAGES_THRESHOLD) + all_visible_streak = 0; + } + for (blkno = 0; blkno < nblocks; blkno++) { Buffer buf; *************** lazy_scan_heap(Relation onerel, LVRelSta *** 351,379 **** bool all_visible; bool has_dead_tuples; - /* - * Skip pages that don't require vacuuming according to the visibility - * map. 
But only if we've seen a streak of at least - * SKIP_PAGES_THRESHOLD pages marked as clean. Since we're reading - * sequentially, the OS should be doing readahead for us and there's - * no gain in skipping a page now and then. You need a longer run of - * consecutive skipped pages before it's worthwhile. Also, skipping - * even a single page means that we can't update relfrozenxid or - * reltuples, so we only want to do it if there's a good chance to - * skip a goodly number of pages. - */ if (!scan_all) { all_visible_according_to_vm = visibilitymap_test(onerel, blkno, &vmbuffer); if (all_visible_according_to_vm) { all_visible_streak++; if (all_visible_streak >= SKIP_PAGES_THRESHOLD) - { - vacrelstats->scanned_all = false; continue; - } } else all_visible_streak = 0; --- 374,392 ---- bool all_visible; bool has_dead_tuples; if (!scan_all) { + /* + * Skip pages that don't require vacuuming according to the + * visibility map, as per comment above. + */ all_visible_according_to_vm = visibilitymap_test(onerel, blkno, &vmbuffer); if (all_visible_according_to_vm) { all_visible_streak++; if (all_visible_streak >= SKIP_PAGES_THRESHOLD) continue; } else all_visible_streak = 0; *************** lazy_scan_heap(Relation onerel, LVRelSta *** 381,387 **** vacuum_delay_point(); ! scanned_pages++; /* * If we are close to overrunning the available space for dead-tuple --- 394,400 ---- vacuum_delay_point(); ! vacrelstats->scanned_pages++; /* * If we are close to overrunning the available space for dead-tuple *************** lazy_scan_heap(Relation onerel, LVRelSta *** 764,772 **** } /* save stats for use later */ ! vacrelstats->rel_tuples = num_tuples; vacrelstats->tuples_deleted = tups_vacuumed; /* If any tuples need to be deleted, perform final vacuum cycle */ /* XXX put a threshold on min number of tuples here? */ if (vacrelstats->num_dead_tuples > 0) --- 777,791 ---- } /* save stats for use later */ ! 
vacrelstats->scanned_tuples = num_tuples; vacrelstats->tuples_deleted = tups_vacuumed; + /* now we can compute the new value for pg_class.reltuples */ + vacrelstats->new_rel_tuples = vac_estimate_reltuples(onerel, false, + nblocks, + vacrelstats->scanned_pages, + num_tuples); + /* If any tuples need to be deleted, perform final vacuum cycle */ /* XXX put a threshold on min number of tuples here? */ if (vacrelstats->num_dead_tuples > 0) *************** lazy_scan_heap(Relation onerel, LVRelSta *** 805,811 **** ereport(elevel, (errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages", RelationGetRelationName(onerel), ! tups_vacuumed, num_tuples, scanned_pages, nblocks), errdetail("%.0f dead row versions cannot be removed yet.\n" "There were %.0f unused item pointers.\n" "%u pages are entirely empty.\n" --- 824,831 ---- ereport(elevel, (errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages", RelationGetRelationName(onerel), ! tups_vacuumed, num_tuples, ! vacrelstats->scanned_pages, nblocks), errdetail("%.0f dead row versions cannot be removed yet.\n" "There were %.0f unused item pointers.\n" "%u pages are entirely empty.\n" *************** lazy_cleanup_index(Relation indrel, *** 977,986 **** ivinfo.index = indrel; ivinfo.analyze_only = false; ! ivinfo.estimated_count = !vacrelstats->scanned_all; ivinfo.message_level = elevel; ! /* use rel_tuples only if we scanned all pages, else fall back */ ! ivinfo.num_heap_tuples = vacrelstats->scanned_all ? vacrelstats->rel_tuples : vacrelstats->old_rel_tuples; ivinfo.strategy = vac_strategy; stats = index_vacuum_cleanup(&ivinfo, stats); --- 997,1005 ---- ivinfo.index = indrel; ivinfo.analyze_only = false; ! ivinfo.estimated_count = (vacrelstats->scanned_pages < vacrelstats->rel_pages); ivinfo.message_level = elevel; ! 
ivinfo.num_heap_tuples = vacrelstats->new_rel_tuples; ivinfo.strategy = vac_strategy; stats = index_vacuum_cleanup(&ivinfo, stats); *************** lazy_truncate_heap(Relation onerel, LVRe *** 1041,1048 **** new_rel_pages = RelationGetNumberOfBlocks(onerel); if (new_rel_pages != old_rel_pages) { ! /* might as well use the latest news when we update pg_class stats */ ! vacrelstats->rel_pages = new_rel_pages; UnlockRelation(onerel, AccessExclusiveLock); return; } --- 1060,1072 ---- new_rel_pages = RelationGetNumberOfBlocks(onerel); if (new_rel_pages != old_rel_pages) { ! /* ! * Note: we intentionally don't update vacrelstats->rel_pages with ! * the new rel size here. If we did, it would amount to assuming that ! * the new pages are empty, which is unlikely. Leaving the numbers ! * alone amounts to assuming that the new pages have the same tuple ! * density as existing ones, which is less unlikely. ! */ UnlockRelation(onerel, AccessExclusiveLock); return; } *************** lazy_truncate_heap(Relation onerel, LVRe *** 1076,1082 **** */ UnlockRelation(onerel, AccessExclusiveLock); ! /* update statistics */ vacrelstats->rel_pages = new_rel_pages; vacrelstats->pages_removed = old_rel_pages - new_rel_pages; --- 1100,1110 ---- */ UnlockRelation(onerel, AccessExclusiveLock); ! /* ! * Update statistics. Here, it *is* correct to adjust rel_pages without ! * also touching reltuples, since the tuple count wasn't changed by the ! * truncation. ! */ vacrelstats->rel_pages = new_rel_pages; vacrelstats->pages_removed = old_rel_pages - new_rel_pages; diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 5ed6e8337c11fdd599a0deb219e1cb0285228a88..1d80c311d879d9cf9009621860cda3ab19c6dea9 100644 *** a/src/backend/postmaster/pgstat.c --- b/src/backend/postmaster/pgstat.c *************** pgstat_report_autovac(Oid dboid) *** 1246,1253 **** * --------- */ void ! pgstat_report_vacuum(Oid tableoid, bool shared, bool adopt_counts, ! 
PgStat_Counter tuples) { PgStat_MsgVacuum msg; --- 1246,1252 ---- * --------- */ void ! pgstat_report_vacuum(Oid tableoid, bool shared, PgStat_Counter tuples) { PgStat_MsgVacuum msg; *************** pgstat_report_vacuum(Oid tableoid, bool *** 1257,1263 **** pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM); msg.m_databaseid = shared ? InvalidOid : MyDatabaseId; msg.m_tableoid = tableoid; - msg.m_adopt_counts = adopt_counts; msg.m_autovacuum = IsAutoVacuumWorkerProcess(); msg.m_vacuumtime = GetCurrentTimestamp(); msg.m_tuples = tuples; --- 1256,1261 ---- *************** pgstat_report_vacuum(Oid tableoid, bool *** 1271,1277 **** * -------- */ void ! pgstat_report_analyze(Relation rel, bool adopt_counts, PgStat_Counter livetuples, PgStat_Counter deadtuples) { PgStat_MsgAnalyze msg; --- 1269,1275 ---- * -------- */ void ! pgstat_report_analyze(Relation rel, PgStat_Counter livetuples, PgStat_Counter deadtuples) { PgStat_MsgAnalyze msg; *************** pgstat_report_analyze(Relation rel, bool *** 1308,1314 **** pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE); msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId; msg.m_tableoid = RelationGetRelid(rel); - msg.m_adopt_counts = adopt_counts; msg.m_autovacuum = IsAutoVacuumWorkerProcess(); msg.m_analyzetime = GetCurrentTimestamp(); msg.m_live_tuples = livetuples; --- 1306,1311 ---- *************** pgstat_recv_vacuum(PgStat_MsgVacuum *msg *** 4197,4204 **** tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true); ! if (msg->m_adopt_counts) ! tabentry->n_live_tuples = msg->m_tuples; /* Resetting dead_tuples to 0 is an approximation ... */ tabentry->n_dead_tuples = 0; --- 4194,4200 ---- tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true); ! tabentry->n_live_tuples = msg->m_tuples; /* Resetting dead_tuples to 0 is an approximation ... 
*/ tabentry->n_dead_tuples = 0; *************** pgstat_recv_analyze(PgStat_MsgAnalyze *m *** 4233,4243 **** tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true); ! if (msg->m_adopt_counts) ! { ! tabentry->n_live_tuples = msg->m_live_tuples; ! tabentry->n_dead_tuples = msg->m_dead_tuples; ! } /* * We reset changes_since_analyze to zero, forgetting any changes that --- 4229,4236 ---- tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true); ! tabentry->n_live_tuples = msg->m_live_tuples; ! tabentry->n_dead_tuples = msg->m_dead_tuples; /* * We reset changes_since_analyze to zero, forgetting any changes that diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h index 79c9f5d90fb674ca8c778a65ca540ef62bded0af..cfbe0c43924029843f636845d87a09996d706af3 100644 *** a/src/include/commands/vacuum.h --- b/src/include/commands/vacuum.h *************** extern void vacuum(VacuumStmt *vacstmt, *** 142,147 **** --- 142,151 ---- extern void vac_open_indexes(Relation relation, LOCKMODE lockmode, int *nindexes, Relation **Irel); extern void vac_close_indexes(int nindexes, Relation *Irel, LOCKMODE lockmode); + extern double vac_estimate_reltuples(Relation relation, bool is_analyze, + BlockNumber total_pages, + BlockNumber scanned_pages, + double scanned_tuples); extern void vac_update_relstats(Relation relation, BlockNumber num_pages, double num_tuples, *************** extern void vacuum_delay_point(void); *** 157,166 **** /* in commands/vacuumlazy.c */ extern void lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt, ! BufferAccessStrategy bstrategy, bool *scanned_all); /* in commands/analyze.c */ extern void analyze_rel(Oid relid, VacuumStmt *vacstmt, ! BufferAccessStrategy bstrategy, bool update_reltuples); #endif /* VACUUM_H */ --- 161,170 ---- /* in commands/vacuumlazy.c */ extern void lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt, ! 
BufferAccessStrategy bstrategy); /* in commands/analyze.c */ extern void analyze_rel(Oid relid, VacuumStmt *vacstmt, ! BufferAccessStrategy bstrategy); #endif /* VACUUM_H */ diff --git a/src/include/pgstat.h b/src/include/pgstat.h index f04be95b459b2053c66495979b95f0b697a2af13..5446fa04409ebefc31a3a1f282e7dd1224a5d734 100644 *** a/src/include/pgstat.h --- b/src/include/pgstat.h *************** typedef struct PgStat_MsgVacuum *** 322,328 **** PgStat_MsgHdr m_hdr; Oid m_databaseid; Oid m_tableoid; - bool m_adopt_counts; bool m_autovacuum; TimestampTz m_vacuumtime; PgStat_Counter m_tuples; --- 322,327 ---- *************** typedef struct PgStat_MsgAnalyze *** 339,345 **** PgStat_MsgHdr m_hdr; Oid m_databaseid; Oid m_tableoid; - bool m_adopt_counts; bool m_autovacuum; TimestampTz m_analyzetime; PgStat_Counter m_live_tuples; --- 338,343 ---- *************** extern void pgstat_reset_shared_counters *** 706,714 **** extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type); extern void pgstat_report_autovac(Oid dboid); ! extern void pgstat_report_vacuum(Oid tableoid, bool shared, bool adopt_counts, PgStat_Counter tuples); ! extern void pgstat_report_analyze(Relation rel, bool adopt_counts, PgStat_Counter livetuples, PgStat_Counter deadtuples); extern void pgstat_report_recovery_conflict(int reason); --- 704,712 ---- extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type); extern void pgstat_report_autovac(Oid dboid); ! extern void pgstat_report_vacuum(Oid tableoid, bool shared, PgStat_Counter tuples); ! extern void pgstat_report_analyze(Relation rel, PgStat_Counter livetuples, PgStat_Counter deadtuples); extern void pgstat_report_recovery_conflict(int reason);
On Sat, May 28, 2011 at 12:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I also found that Greg was right in thinking that it would help if we
> tweaked lazy_scan_heap to not always scan the first
> SKIP_PAGES_THRESHOLD-1 pages even if they were
> all_visible_according_to_vm. That seemed to skew the results if those
> pages weren't representative. And, for the case of a useless manual
> vacuum on a completely clean table, it would cause the reltuples value
> to drift when there was no reason to change it at all.

You fixed the logic only for the first 32 pages, which helps with the
skew. But really the logic is backwards in general. Instead of counting
how many missed opportunities for skipped pages we've seen in the past,
we should read the bits for the next 32 pages in advance and decide what
to do before we read those pages.

-- 
greg
Greg Stark <gsstark@mit.edu> writes: > On Sat, May 28, 2011 at 12:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I also found that Greg was right in thinking that it would help if we >> tweaked lazy_scan_heap to not always scan the first >> SKIP_PAGES_THRESHOLD-1 pages even if they were >> all_visible_according_to_vm. > You fixed the logic only for the first 32 pages which helps with the > skew. But really the logic is backwards in general. Instead of > counting how many missed opportunities for skipped pages we've seen in > the past we should read the bits for the next 32 pages in advance and > decide what to do before we read those pages. OK, do you like the attached version of that logic? (Other fragments of the patch as before.) regards, tom lane diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c index 9393fa0727aaad7508e1163623322b4066412257..231447b31223bc5350ce49a136cffafaa53bc5fb 100644 *** a/src/backend/commands/vacuumlazy.c --- b/src/backend/commands/vacuumlazy.c *************** lazy_scan_heap(Relation onerel, LVRelSta *** 311,317 **** int i; PGRUsage ru0; Buffer vmbuffer = InvalidBuffer; ! BlockNumber all_visible_streak; pg_rusage_init(&ru0); --- 305,312 ---- int i; PGRUsage ru0; Buffer vmbuffer = InvalidBuffer; ! BlockNumber next_not_all_visible_block; ! bool skipping_all_visible_blocks; pg_rusage_init(&ru0); *************** lazy_scan_heap(Relation onerel, LVRelSta *** 329,340 **** nblocks = RelationGetNumberOfBlocks(onerel); vacrelstats->rel_pages = nblocks; vacrelstats->nonempty_pages= 0; vacrelstats->latestRemovedXid = InvalidTransactionId; lazy_space_alloc(vacrelstats,nblocks); ! all_visible_streak = 0; for (blkno = 0; blkno < nblocks; blkno++) { Buffer buf; --- 324,369 ---- nblocks = RelationGetNumberOfBlocks(onerel); vacrelstats->rel_pages = nblocks; + vacrelstats->scanned_pages = 0; vacrelstats->nonempty_pages = 0; vacrelstats->latestRemovedXid = InvalidTransactionId; lazy_space_alloc(vacrelstats, nblocks); ! /* ! 
* We want to skip pages that don't require vacuuming according to the ! * visibility map, but only when we can skip at least SKIP_PAGES_THRESHOLD ! * consecutive pages. Since we're reading sequentially, the OS should be ! * doing readahead for us, so there's no gain in skipping a page now and ! * then; that's likely to disable readahead and so be counterproductive. ! * Also, skipping even a single page means that we can't update ! * relfrozenxid, so we only want to do it if we can skip a goodly number ! * of pages. ! * ! * Before entering the main loop, establish the invariant that ! * next_not_all_visible_block is the next block number >= blkno that's ! * not all-visible according to the visibility map, or nblocks if there's ! * no such block. Also, we set up the skipping_all_visible_blocks flag, ! * which is needed because we need hysteresis in the decision: once we've ! * started skipping blocks, we may as well skip everything up to the next ! * not-all-visible block. ! * ! * Note: if scan_all is true, we won't actually skip any pages; but we ! * maintain next_not_all_visible_block anyway, so as to set up the ! * all_visible_according_to_vm flag correctly for each page. ! */ ! for (next_not_all_visible_block = 0; ! next_not_all_visible_block < nblocks; ! next_not_all_visible_block++) ! { ! if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer)) ! break; ! } ! if (next_not_all_visible_block >= SKIP_PAGES_THRESHOLD) ! skipping_all_visible_blocks = true; ! else ! skipping_all_visible_blocks = false; ! for (blkno = 0; blkno < nblocks; blkno++) { Buffer buf; *************** lazy_scan_heap(Relation onerel, LVRelSta *** 347,387 **** OffsetNumber frozen[MaxOffsetNumber]; int nfrozen; Size freespace; ! bool all_visible_according_to_vm = false; bool all_visible; bool has_dead_tuples; ! /* ! * Skip pages that don't require vacuuming according to the visibility ! * map. But only if we've seen a streak of at least ! 
* SKIP_PAGES_THRESHOLD pages marked as clean. Since we're reading ! * sequentially, the OS should be doing readahead for us and there's ! * no gain in skipping a page now and then. You need a longer run of ! * consecutive skipped pages before it's worthwhile. Also, skipping ! * even a single page means that we can't update relfrozenxid or ! * reltuples, so we only want to do it if there's a good chance to ! * skip a goodly number of pages. ! */ ! if (!scan_all) { ! all_visible_according_to_vm = ! visibilitymap_test(onerel, blkno, &vmbuffer); ! if (all_visible_according_to_vm) { ! all_visible_streak++; ! if (all_visible_streak >= SKIP_PAGES_THRESHOLD) ! { ! vacrelstats->scanned_all = false; ! continue; ! } } else ! all_visible_streak = 0; } vacuum_delay_point(); ! scanned_pages++; /* * If we are close to overrunning the available space for dead-tuple --- 376,419 ---- OffsetNumber frozen[MaxOffsetNumber]; int nfrozen; Size freespace; ! bool all_visible_according_to_vm; bool all_visible; bool has_dead_tuples; ! if (blkno == next_not_all_visible_block) { ! /* Time to advance next_not_all_visible_block */ ! for (next_not_all_visible_block++; ! next_not_all_visible_block < nblocks; ! next_not_all_visible_block++) { ! if (!visibilitymap_test(onerel, next_not_all_visible_block, ! &vmbuffer)) ! break; } + + /* + * We know we can't skip the current block. But set up + * skipping_all_visible_blocks to do the right thing at the + * following blocks. + */ + if (next_not_all_visible_block - blkno > SKIP_PAGES_THRESHOLD) + skipping_all_visible_blocks = true; else ! skipping_all_visible_blocks = false; ! all_visible_according_to_vm = false; ! } ! else ! { ! /* Current block is all-visible */ ! if (skipping_all_visible_blocks && !scan_all) ! continue; ! all_visible_according_to_vm = true; } vacuum_delay_point(); ! vacrelstats->scanned_pages++; /* * If we are close to overrunning the available space for dead-tuple
From: Cédric Villemain
2011/5/29 Tom Lane <tgl@sss.pgh.pa.us>:
> Greg Stark <gsstark@mit.edu> writes:
>> On Sat, May 28, 2011 at 12:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I also found that Greg was right in thinking that it would help if we
>>> tweaked lazy_scan_heap to not always scan the first
>>> SKIP_PAGES_THRESHOLD-1 pages even if they were
>>> all_visible_according_to_vm.
>
>> You fixed the logic only for the first 32 pages which helps with the
>> skew. But really the logic is backwards in general. Instead of
>> counting how many missed opportunities for skipped pages we've seen in
>> the past we should read the bits for the next 32 pages in advance and
>> decide what to do before we read those pages.
>
> OK, do you like the attached version of that logic?  (Other fragments
> of the patch as before.)

The idea was that removing even one page from the VACUUM scan prevents
the relfrozenxid update and the reltuples (and relpages) update. Now, I
believe that once we've skipped at least one page thanks to
SKIP_PAGES_THRESHOLD, we should be more aggressive and remove as many
pages as possible from the VACUUM, thanks to the VM.

>                        regards, tom lane
>
> diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
> index 9393fa0727aaad7508e1163623322b4066412257..231447b31223bc5350ce49a136cffafaa53bc5fb 100644
> *** a/src/backend/commands/vacuumlazy.c
> --- b/src/backend/commands/vacuumlazy.c
> *************** lazy_scan_heap(Relation onerel, LVRelSta
> *** 311,317 ****
>   	int			i;
>   	PGRUsage	ru0;
>   	Buffer		vmbuffer = InvalidBuffer;
> ! 	BlockNumber	all_visible_streak;
>
>   	pg_rusage_init(&ru0);
>
> --- 305,312 ----
>   	int			i;
>   	PGRUsage	ru0;
>   	Buffer		vmbuffer = InvalidBuffer;
> ! 	BlockNumber	next_not_all_visible_block;
> ! 	bool		skipping_all_visible_blocks;
>
>   	pg_rusage_init(&ru0);
>
> *************** lazy_scan_heap(Relation onerel, LVRelSta
> *** 329,340 ****
>
>   	nblocks = RelationGetNumberOfBlocks(onerel);
>   	vacrelstats->rel_pages = nblocks;
>   	vacrelstats->nonempty_pages = 0;
>   	vacrelstats->latestRemovedXid = InvalidTransactionId;
>
>   	lazy_space_alloc(vacrelstats, nblocks);
>
> ! 	all_visible_streak = 0;
>   	for (blkno = 0; blkno < nblocks; blkno++)
>   	{
>   		Buffer		buf;
> --- 324,369 ----
>
>   	nblocks = RelationGetNumberOfBlocks(onerel);
>   	vacrelstats->rel_pages = nblocks;
> + 	vacrelstats->scanned_pages = 0;
>   	vacrelstats->nonempty_pages = 0;
>   	vacrelstats->latestRemovedXid = InvalidTransactionId;
>
>   	lazy_space_alloc(vacrelstats, nblocks);
>
> ! 	/*
> ! 	 * We want to skip pages that don't require vacuuming according to the
> ! 	 * visibility map, but only when we can skip at least SKIP_PAGES_THRESHOLD
> ! 	 * consecutive pages.  Since we're reading sequentially, the OS should be
> ! 	 * doing readahead for us, so there's no gain in skipping a page now and
> ! 	 * then; that's likely to disable readahead and so be counterproductive.
> ! 	 * Also, skipping even a single page means that we can't update
> ! 	 * relfrozenxid, so we only want to do it if we can skip a goodly number
> ! 	 * of pages.
> ! 	 *
> ! 	 * Before entering the main loop, establish the invariant that
> ! 	 * next_not_all_visible_block is the next block number >= blkno that's
> ! 	 * not all-visible according to the visibility map, or nblocks if there's
> ! 	 * no such block.  Also, we set up the skipping_all_visible_blocks flag,
> ! 	 * which is needed because we need hysteresis in the decision: once we've
> ! 	 * started skipping blocks, we may as well skip everything up to the next
> ! 	 * not-all-visible block.
> ! 	 *
> ! 	 * Note: if scan_all is true, we won't actually skip any pages; but we
> ! 	 * maintain next_not_all_visible_block anyway, so as to set up the
> ! 	 * all_visible_according_to_vm flag correctly for each page.
> ! 	 */
> ! 	for (next_not_all_visible_block = 0;
> ! 		 next_not_all_visible_block < nblocks;
> ! 		 next_not_all_visible_block++)
> ! 	{
> ! 		if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
> ! 			break;
> ! 	}
> ! 	if (next_not_all_visible_block >= SKIP_PAGES_THRESHOLD)
> ! 		skipping_all_visible_blocks = true;
> ! 	else
> ! 		skipping_all_visible_blocks = false;
> !
>   	for (blkno = 0; blkno < nblocks; blkno++)
>   	{
>   		Buffer		buf;
> *************** lazy_scan_heap(Relation onerel, LVRelSta
> *** 347,387 ****
>   		OffsetNumber frozen[MaxOffsetNumber];
>   		int			nfrozen;
>   		Size		freespace;
> ! 		bool		all_visible_according_to_vm = false;
>   		bool		all_visible;
>   		bool		has_dead_tuples;
>
> ! 		/*
> ! 		 * Skip pages that don't require vacuuming according to the visibility
> ! 		 * map. But only if we've seen a streak of at least
> ! 		 * SKIP_PAGES_THRESHOLD pages marked as clean. Since we're reading
> ! 		 * sequentially, the OS should be doing readahead for us and there's
> ! 		 * no gain in skipping a page now and then. You need a longer run of
> ! 		 * consecutive skipped pages before it's worthwhile. Also, skipping
> ! 		 * even a single page means that we can't update relfrozenxid or
> ! 		 * reltuples, so we only want to do it if there's a good chance to
> ! 		 * skip a goodly number of pages.
> ! 		 */
> ! 		if (!scan_all)
>   		{
> ! 			all_visible_according_to_vm =
> ! 				visibilitymap_test(onerel, blkno, &vmbuffer);
> ! 			if (all_visible_according_to_vm)
>   			{
> ! 				all_visible_streak++;
> ! 				if (all_visible_streak >= SKIP_PAGES_THRESHOLD)
> ! 				{
> ! 					vacrelstats->scanned_all = false;
> ! 					continue;
> ! 				}
>   			}
>   			else
> ! 				all_visible_streak = 0;
>   		}
>
>   		vacuum_delay_point();
>
> ! 		scanned_pages++;
>
>   		/*
>   		 * If we are close to overrunning the available space for dead-tuple
> --- 376,419 ----
>   		OffsetNumber frozen[MaxOffsetNumber];
>   		int			nfrozen;
>   		Size		freespace;
> ! 		bool		all_visible_according_to_vm;
>   		bool		all_visible;
>   		bool		has_dead_tuples;
>
> ! 		if (blkno == next_not_all_visible_block)
>   		{
> ! 			/* Time to advance next_not_all_visible_block */
> ! 			for (next_not_all_visible_block++;
> ! 				 next_not_all_visible_block < nblocks;
> ! 				 next_not_all_visible_block++)
>   			{
> ! 				if (!visibilitymap_test(onerel, next_not_all_visible_block,
> ! 										&vmbuffer))
> ! 					break;
>   			}
> +
> + 			/*
> + 			 * We know we can't skip the current block. But set up
> + 			 * skipping_all_visible_blocks to do the right thing at the
> + 			 * following blocks.
> + 			 */
> + 			if (next_not_all_visible_block - blkno > SKIP_PAGES_THRESHOLD)
> + 				skipping_all_visible_blocks = true;
>   			else
> ! 				skipping_all_visible_blocks = false;
> ! 			all_visible_according_to_vm = false;
> ! 		}
> ! 		else
> ! 		{
> ! 			/* Current block is all-visible */
> ! 			if (skipping_all_visible_blocks && !scan_all)
> ! 				continue;
> ! 			all_visible_according_to_vm = true;
>   		}
>
>   		vacuum_delay_point();
>
> ! 		vacrelstats->scanned_pages++;
>
>   		/*
>   		 * If we are close to overrunning the available space for dead-tuple
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
Cédric Villemain <cedric.villemain.debian@gmail.com> writes:
> 2011/5/29 Tom Lane <tgl@sss.pgh.pa.us>:
>> OK, do you like the attached version of that logic?  (Other fragments
>> of the patch as before.)

> The idea was that removing even one page from the VACUUM scan prevents
> the relfrozenxid update and the reltuples (and relpages) update.
> Now, I believe that once we've skipped at least one page thanks to
> SKIP_PAGES_THRESHOLD, we should be more aggressive and remove as many
> pages as possible from the VACUUM, thanks to the VM.

That would require proof, not just suggestion. Skipping pages will
defeat the OS read-ahead algorithm, and so could easily cost more than
reading them.

			regards, tom lane
On Sun, May 29, 2011 at 8:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Cédric Villemain <cedric.villemain.debian@gmail.com> writes:
>> 2011/5/29 Tom Lane <tgl@sss.pgh.pa.us>:
>>> OK, do you like the attached version of that logic? (Other fragments
>>> of the patch as before.)
>
>> The idea was that removing even one page from the VACUUM scan prevents
>> the relfrozenxid update and the reltuples (and relpages) update.
>> Now, I believe that once we've skipped at least one page thanks to
>> SKIP_PAGES_THRESHOLD, we should be more aggressive and remove as many
>> pages as possible from the VACUUM, thanks to the VM.
>
> That would require proof, not just suggestion. Skipping pages will
> defeat the OS read-ahead algorithm, and so could easily cost more than
> reading them.

My worry is that what we have right now is also based on assumptions
and gut feeling rather than any numbers.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
> On Sun, May 29, 2011 at 8:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That would require proof, not just suggestion.  Skipping pages will
>> defeat the OS read-ahead algorithm, and so could easily cost more than
>> reading them.

> My worry is that what we have right now is also based on assumptions
> and gut feeling rather than any numbers.

So go collect some numbers.

			regards, tom lane
On Sun, May 29, 2011 at 9:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Pavan Deolasee <pavan.deolasee@gmail.com> writes:
>> On Sun, May 29, 2011 at 8:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> That would require proof, not just suggestion. Skipping pages will
>>> defeat the OS read-ahead algorithm, and so could easily cost more than
>>> reading them.
>
>> My worry is that what we have right now is also based on assumptions
>> and gut feeling rather than any numbers.
>
> So go collect some numbers.

I am sorry if I sounded terse above. But my gripe is that sometimes we
are too reluctant to listen to ideas and insist on producing hard
numbers first, which might take significant effort. But we are not
equally strict when such changes are introduced initially. For example,
in this particular case, the change was introduced after this
discussion:

http://archives.postgresql.org/pgsql-hackers/2008-12/msg01316.php

Heikki suggested 20; Simon proposed 32 to make it a power of 2. But why
not 16? 20 is closer to 16 than to 32. And Greg yesterday said that 8
is a better number based on his testing.

Maybe a performance build farm, as is being discussed, is the solution:
we could throw simple patches at it and see whether something helps or
not.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
> I am sorry if I sounded terse above. But my gripe is that sometimes we
> are too reluctant to listen to ideas and insist on producing hard
> numbers first, which might take significant effort. But we are not
> equally strict when such changes are introduced initially.

The reason for not wanting to change it without some actual evidence is
that there is already evidence: the code has been in the field with this
setting since 8.4, and nobody's vacuum performance has fallen off a
cliff. So while I'd agree that there was little testing done before the
code went in, there is more than zero reason to leave it where it is.
Without some positive evidence showing that another value is better, I'm
disinclined to change it. I also think that you're not helping by
complaining about the code without being willing to do some work to try
to collect such evidence.

> Heikki suggested 20; Simon proposed 32 to make it a power of 2. But why
> not 16? 20 is closer to 16 than to 32. And Greg yesterday said that 8
> is a better number based on his testing.

Greg said he had found that the read speed was the same for reading
every page vs reading every 8th page. That's not the same as concluding
that 8 is the optimal skip distance for vacuum; or at least, he didn't
say that's what he had concluded. vacuum isn't just reading ...

			regards, tom lane
On Sun, May 29, 2011 at 10:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Pavan Deolasee <pavan.deolasee@gmail.com> writes:
>> I am sorry if I sounded terse above. But my gripe is that sometimes we
>> are too reluctant to listen to ideas and insist on producing hard
>> numbers first, which might take significant effort. But we are not
>> equally strict when such changes are introduced initially.
>
> The reason for not wanting to change it without some actual evidence
> is that there is already evidence: the code has been in the field with
> this setting since 8.4, and nobody's vacuum performance has fallen off
> a cliff.

Well, that's probably because there was definitely much improvement over
what existed before. But that does not mean we can't make it better.
IOW, there are no complaints because there is no regression.

> So while I'd agree that there was little testing done before the code
> went in, there is more than zero reason to leave it where it is.
> Without some positive evidence showing that another value is better,
> I'm disinclined to change it. I also think that you're not helping by
> complaining about the code without being willing to do some work to
> try to collect such evidence.

I am not complaining about the code. I am suggesting we can be more
receptive to ideas, especially when we know that what we have today was
not backed by any evidence either. I will do some tests anyway and post
numbers when I work on the single-pass vacuum patch; I'll try to
experiment with this stuff at that time.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
From: Cédric Villemain
2011/5/29 Tom Lane <tgl@sss.pgh.pa.us>:
> Cédric Villemain <cedric.villemain.debian@gmail.com> writes:
>> 2011/5/29 Tom Lane <tgl@sss.pgh.pa.us>:
>>> OK, do you like the attached version of that logic? (Other fragments
>>> of the patch as before.)
>
>> The idea was that removing even one page from the VACUUM scan prevents
>> the relfrozenxid update and the reltuples (and relpages) update.
>> Now, I believe that once we've skipped at least one page thanks to
>> SKIP_PAGES_THRESHOLD, we should be more aggressive and remove as many
>> pages as possible from the VACUUM, thanks to the VM.
>
> That would require proof, not just suggestion. Skipping pages will
> defeat the OS read-ahead algorithm, and so could easily cost more than
> reading them.

Correct, it needs proof.

As an aside: I have also learned that reading the first block of a file
makes read-ahead start out with its larger window (the same one that
POSIX_FADV_SEQUENTIAL sets), so removing those initial reads might be
counterproductive as well. But this is damn hard to benchmark, because
read-ahead is also influenced by memory pressure, for example.

In theory: 1. the readahead algorithm is a bit smarter than that and can
cope with reads-with-holes (as long as the holes are not larger than the
read-ahead window), and 2. if the holes are that large, then it may not
be so good to keep a large read-ahead window anyway (which keeps
thrashing our buffer cache).

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support