Thread: Remaining case where reltuples can become distorted across multiple VACUUM operations

My bugfix commit 74388a1a (which was pushed back in February) added
heuristics to VACUUM's reltuples calculation/estimate. This prevented
VACUUM from distorting our estimate of reltuples over time, across
successive VACUUM operations run against the same table. The problem
was that VACUUM could scan the same single heap page again and again,
while believing it saw a random sample each time. This eventually
leads to a pg_class.reltuples value that is based on the assumption
that every single heap page in the table is just like the heap page
that gets "sampled" again and again. This was always the last heap
page (due to implementation details related to the work done by commit
e8429082), which in practice tends to be a particularly poor
representation of the table's overall tuple density.
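
To make the drift concrete, here is a small standalone simulation (plain
C, not PostgreSQL code) of that failure mode: each VACUUM "samples" only
the last heap page and blends it with the stored density, roughly the
extrapolation described later in this thread. All numbers are invented
for illustration.

#include <stdio.h>

int
main(void)
{
    double      rel_pages = 45.0;        /* roughly a 10000-row int table */
    double      reltuples = 10000.0;     /* correct starting estimate */
    double      last_page_tuples = 50.0; /* sparsely filled last page */

    for (int pass = 1; pass <= 20; pass++)
    {
        double      old_density = reltuples / rel_pages;
        double      scanned_pages = 1.0; /* only the last page is scanned */
        double      unscanned_pages = rel_pages - scanned_pages;

        /* keep the old density for unscanned pages, add what was counted */
        reltuples = old_density * unscanned_pages + last_page_tuples;
        printf("after VACUUM #%2d: reltuples = %.0f\n", pass, reltuples);
    }

    /* The estimate converges on last_page_tuples * rel_pages = 2250. */
    return 0;
}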

I have discovered a gap in these heuristics: there are remaining cases
where their percentage threshold doesn't prevent reltuples distortion as
intended. It can still happen with tables that are small enough that a
cutoff of 2% of rel_pages is less than a single page, yet still large
enough that vacuumlazy.c will consider it worth its while to skip some
pages using the visibility map. It will typically skip all but the
final heap page from the relation (same as the first time around).
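
For what it's worth, the window of affected table sizes can be worked
out directly. This is another standalone sketch; it assumes the 2%
cutoff from 74388a1a, and (as discussed further down) that VACUUM won't
skip any pages at all in tables of 32 pages or less.

#include <stdbool.h>
#include <stdio.h>

int
main(void)
{
    for (int rel_pages = 30; rel_pages <= 55; rel_pages++)
    {
        /* Does a one-page scan stay below the 2% threshold? */
        bool        protected_by_cutoff = 1 < rel_pages * 0.02;
        /* Is the table big enough for VACUUM to skip pages at all? */
        bool        can_skip_pages = rel_pages > 32;

        if (can_skip_pages && !protected_by_cutoff)
            printf("rel_pages = %d is in the gap\n", rel_pages);
    }

    /* Prints 33 through 50; the test case table below (about 45 pages)
     * falls inside that range. */
    return 0;
}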

Here is a test case that shows how this can still happen on HEAD (and
in Postgres 15):

regression=# create table foo(bar int);insert into foo select i from
generate_series(1, 10000) i;
CREATE TABLE
INSERT 0 10000

Now run vacuum verbose against the table several times:

regression=# vacuum verbose foo;
*** SNIP ***
regression=# vacuum verbose foo;

The first vacuum shows "tuples: 0 removed, 10000 remain...", which is
correct. However, each subsequent vacuum verbose revises the estimate
downwards, eventually making pg_class.reltuples significantly
underestimate tuple density (same as the first time around).

The attached patch closes the remaining gap. With the patch applied,
the second and subsequent vacuum verbose operations from the test case
will show that reltuples is still 10000 (it won't ever change). The
patch just extends an old behavior that was applied when scanned_pages
== 0 to cases where scanned_pages <= 1 (unless we happened to scan all
of the relation's pages, of course). It doesn't remove the original
test from commit 74388a1a, which still seems like a good idea to me.
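
To show the shape of the change, here is a simplified standalone sketch
of the early-return logic (it is not the patch itself; names are
approximate and corner cases such as reltuples = -1 are glossed over).

#include <stdio.h>

/* Sketch only: stands in for the real reltuples estimator. */
static double
estimate_reltuples_sketch(double old_rel_pages, double old_rel_tuples,
                          double total_pages, double scanned_pages,
                          double scanned_tuples)
{
    /* Scanned everything: the count is simply authoritative. */
    if (scanned_pages >= total_pages)
        return scanned_tuples;

    /*
     * Existing test from 74388a1a: relpages unchanged and less than 2%
     * of the table scanned -- don't trust such a tiny "sample".
     */
    if (old_rel_pages == total_pages &&
        scanned_pages < total_pages * 0.02)
        return old_rel_tuples;

    /* New case: a single scanned page (or none) is never representative. */
    if (scanned_pages <= 1)
        return old_rel_tuples;

    /* Otherwise extrapolate: old density for unscanned pages, plus the
     * tuples just counted on the scanned pages. */
    return (old_rel_tuples / old_rel_pages) * (total_pages - scanned_pages) +
        scanned_tuples;
}

int
main(void)
{
    /* 45-page, 10000-tuple table; VACUUM scanned only the sparse last page. */
    printf("%.0f\n", estimate_reltuples_sketch(45, 10000, 45, 1, 50));
    return 0;
}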

-- 
Peter Geoghegan

On Fri, Aug 5, 2022 at 5:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The attached patch closes the remaining gap. With the patch applied,
> the second and subsequent vacuum verbose operations from the test case
> will show that reltuples is still 10000 (it won't ever change). The
> patch just extends an old behavior that was applied when scanned_pages
> == 0 to cases where scanned_pages <= 1 (unless we happened to scan all
> of the relation's pages, of course).

My plan is to commit this later in the week, barring objections. Maybe
on Thursday.

-- 
Peter Geoghegan



Re: Remaining case where reltuples can become distorted across multiple VACUUM operations

From: Matthias van de Meent

On Mon, 8 Aug 2022 at 16:52, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Fri, Aug 5, 2022 at 5:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > The attached patch closes the remaining gap. With the patch applied,
> > the second and subsequent vacuum verbose operations from the test case
> > will show that reltuples is still 10000 (it won't ever change). The
> > patch just extends an old behavior that was applied when scanned_pages
> > == 0 to cases where scanned_pages <= 1 (unless we happened to scan all
> > of the relation's pages, of course).
>
> My plan is to commit this later in the week, barring objections. Maybe
> on Thursday.

I do not have intimate knowledge of this code, but shouldn't we also
add some safety guarantees like the following in these blocks? Right
now, we'll keep underestimating the table size even when we know that
the count is incorrect.

if (scanned_tuples > old_rel_tuples)
    return some_weighted_scanned_tuples;

Kind regards,

Matthias van de Meent



On Mon, Aug 8, 2022 at 8:14 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> I do not have intimate knowledge of this code, but shouldn't we also
> add some safety guarantees like the following in these blocks? Right
> now, we'll keep underestimating the table size even when we know that
> the count is incorrect.
>
> if (scanned_tuples > old_rel_tuples)
>     return some_weighted_scanned_tuples;

Not sure what you mean -- we do something very much like that already.

We take the existing tuple density, and assume that that hasn't
changed for any unscanned pages -- that is used to build a total
number of tuples for the unscanned pages. Then we add the number of
live tuples/scanned_tuples that the vacuumlazy.c caller just
encountered on scanned_pages. That's often where the final reltuples
value comes from.

-- 
Peter Geoghegan



Re: Remaining case where reltuples can become distorted across multiple VACUUM operations

From: Matthias van de Meent

On Mon, 8 Aug 2022 at 17:26, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Aug 8, 2022 at 8:14 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > I do not have intimate knowledge of this code, but shouldn't we also
> > add some safety guarantees like the following in these blocks? Right
> > now, we'll keep underestimating the table size even when we know that
> > the count is incorrect.
> >
> > if (scanned_tuples > old_rel_tuples)
> >     return some_weighted_scanned_tuples;
>
> Not sure what you mean -- we do something very much like that already.
>
> We take the existing tuple density, and assume that that hasn't
> changed for any unscanned pages -- that is used to build a total
> number of tuples for the unscanned pages. Then we add the number of
> live tuples/scanned_tuples that the vacuumlazy.c caller just
> encountered on scanned_pages. That's often where the final reltuples
> value comes from.

Indeed, we often apply this, but not always. It is the default case,
but it is never applied in the special cases.

For example, if currently the measured 2% of the pages contains more
than 100% of the previous count of tuples, or with your patch the last
page contains more than 100% of the previous count of the tuples, that
new count is ignored, which seems silly considering that the vacuum
count is supposed to be authoritative.

Kind regards,

Matthias van de Meent



On Mon, Aug 8, 2022 at 8:33 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> For example, if currently the measured 2% of the pages contains more
> than 100% of the previous count of tuples, or with your patch the last
> page contains more than 100% of the previous count of the tuples, that
> new count is ignored, which seems silly considering that the vacuum
> count is supposed to be authoritative.

The 2% thing is conditioned on the new relpages value precisely
matching the existing relpages from pg_class -- which makes it very
targeted. I don't see why scanned_tuples greatly exceeding the
existing reltuples from pg_class is interesting (any more interesting
than the other way around).

We'll always accept scanned_tuples as authoritative when VACUUM
actually scans all pages, no matter what. Currently it isn't possible
for VACUUM to skip pages in a table that is 32 pages or less in size.
So even the new "single page" thing from the patch cannot matter
there.


--
Peter Geoghegan



Re: Remaining case where reltuples can become distorted across multiple VACUUM operations

From: Matthias van de Meent

On Mon, 8 Aug 2022 at 17:49, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Aug 8, 2022 at 8:33 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > For example, if currently the measured 2% of the pages contains more
> > than 100% of the previous count of tuples, or with your patch the last
> > page contains more than 100% of the previous count of the tuples, that
> > new count is ignored, which seems silly considering that the vacuum
> > count is supposed to be authoritative.
>
> The 2% thing is conditioned on the new relpages value precisely
> matching the existing relpages from pg_class -- which makes it very
> targeted. I don't see why scanned_tuples greatly exceeding the
> existing reltuples from pg_class is interesting (any more interesting
> than the other way around).

Because if a subset of the pages of a relation contains more tuples
than your current total expected tuples in the table, you should
update your expectations regardless of which blocks or which number of
blocks you've scanned - the previous stored value is a strictly worse
estimation than your last measurement.

> We'll always accept scanned_tuples as authoritative when VACUUM
> actually scans all pages, no matter what. Currently it isn't possible
> for VACUUM to skip pages in a table that is 32 pages or less in size.
> So even the new "single page" thing from the patch cannot matter
> there.

A 33-block relation whose first 32 pages hold 1 tuple each can still
have a last page with 250 tuples; that page would be ignored in that
scheme, leaving a total tuple count of 33 or so. Sure, this is an
artificial sample, but you can construct similarly wrong vacuum
samples: Two classes of tuples that have distinct update regimes, one
with 32B-tuples and one with MaxFreeSpaceRequestSize-d tuples. When
you start running VACUUM against these separate classes of updated
blocks you'll see that the relation tuple count will also tend to
1*nblocks, due to the disjoint nature of these updates and the
tendency to ignore all updates to densely packed blocks.

With current code, we ignore the high counts of those often-updated
blocks and expect low density in the relation, precisely because we
ignore areas that are extremely dense and updated in VACUUM cycles
that are different from bloated blocks.

Kind regards,

Matthias van de Meent



On Mon, Aug 8, 2022 at 9:17 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Because if a subset of the pages of a relation contains more tuples
> than your current total expected tuples in the table, you should
> update your expectations regardless of which blocks or which number of
> blocks you've scanned - the previous stored value is a strictly worse
> estimation than your last measurement.

The previous stored value could be -1, which represents the idea that
we don't know the tuple density yet. So it doesn't necessarily follow
that the new estimate is strictly better, even in this exact scenario.

> A 33-block relation whose first 32 pages hold 1 tuple each can still
> have a last page with 250 tuples; that page would be ignored in that
> scheme, leaving a total tuple count of 33 or so.

The simple fact is that there is only so much we can do with the
limited information/context that we have. Heuristics are not usually
free of all bias. Often the bias is the whole point -- the goal can be
to make sure that we have the bias that we know we can live with, and
not the opposite bias, which is much worse. Details of which are
usually very domain specific.

I presented my patch with a very simple test case -- a very clear
problem. Can you do the same for this scenario?

I accept that it is possible that we'll keep an old reltuples which is
provably less accurate than doing something with the latest
information from vacuumlazy.c. But the conditions under which this can
happen are *very* narrow. I am not inclined to do anything about it
for that reason.

-- 
Peter Geoghegan



Re: Remaining case where reltuples can become distorted across multiple VACUUM operations

From: Matthias van de Meent

On Mon, 8 Aug 2022 at 18:48, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Aug 8, 2022 at 9:17 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > Because if a subset of the pages of a relation contains more tuples
> > than your current total expected tuples in the table, you should
> > update your expectations regardless of which blocks or which number of
> > blocks you've scanned - the previous stored value is a strictly worse
> > estimation than your last measurement.
>
> The previous stored value could be -1, which represents the idea that
> we don't know the tuple density yet. So it doesn't necessarily follow
> that the new estimate is strictly better, even in this exact scenario.
>
> > A 33-block relation whose first 32 pages hold 1 tuple each can still
> > have a last page with 250 tuples; that page would be ignored in that
> > scheme, leaving a total tuple count of 33 or so.
>
> The simple fact is that there is only so much we can do with the
> limited information/context that we have. Heuristics are not usually
> free of all bias. Often the bias is the whole point -- the goal can be
> to make sure that we have the bias that we know we can live with, and
> not the opposite bias, which is much worse. Details of which are
> usually very domain specific.
>
> I presented my patch with a very simple test case -- a very clear
> problem. Can you do the same for this scenario?

CREATE TABLE tst (id int primary key generated by default as identity,
    payload text) with (fillfactor = 50); -- fillfactor to make pages fill up fast
INSERT INTO tst (payload) select repeat('a', 5000)
    from generate_series(1, 32); -- 32 pages filled with large tuples
INSERT INTO tst (payload) select repeat('a', 4); -- small tuple at last page
vacuum (verbose, freeze) tst; -- 33 tuples on 33 pages, with lots of space left on the last page
INSERT INTO tst (payload) select repeat('a', 4)
    from generate_series(1, 63); -- now, we have 64 tuples on the last page
vacuum verbose tst; -- with your patch it reports only 33 tuples total, while the single
                    -- page that was scanned contains 64 tuples, and the table
                    -- contains 96 tuples.

> I accept that it is possible that we'll keep an old reltuples which is
> provably less accurate than doing something with the latest
> information from vacuumlazy.c. But the conditions under which this can
> happen are *very* narrow. I am not inclined to do anything about it
> for that reason.

I think I understand your reasoning, but I don't agree with the
conclusion. The attached patch 0002 does fix that skew too, at what I
consider negligible cost. 0001 is your patch with a new version
number.

I'm fine with your patch as is, but would appreciate it if known
estimate mistakes were also fixed.

An alternative solution could be to do double-vetting, where we ignore
tuples_scanned if less than 2% of the pages AND less than 2% of the
previously estimated tuples were scanned.
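
Roughly this shape, in other words. This is only a sketch of the
condition I have in mind, reusing values the estimator already has
access to; the exact thresholds are of course debatable.

#include <stdbool.h>
#include <stdio.h>

/* Sketch of the "double-vetting" condition: only keep the old estimate
 * when both the scanned pages and the scanned tuples are a tiny share
 * of what we previously believed about the table. */
static bool
keep_old_estimate(double old_rel_pages, double old_rel_tuples,
                  double total_pages, double scanned_pages,
                  double scanned_tuples)
{
    return old_rel_pages == total_pages &&
        scanned_pages < total_pages * 0.02 &&
        scanned_tuples < old_rel_tuples * 0.02;
}

int
main(void)
{
    /* 200-page table, 2 sparse pages scanned: keep the old estimate. */
    printf("%d\n", keep_old_estimate(200, 50000, 200, 2, 120));
    /* Same 2 pages, but unexpectedly dense: accept the new count. */
    printf("%d\n", keep_old_estimate(200, 50000, 200, 2, 60000));
    return 0;
}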

Kind regards,

Matthias van de Meent

On Thu, Aug 11, 2022 at 1:48 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> I think I understand your reasoning, but I don't agree with the
> conclusion. The attached patch 0002 does fix that skew too, at what I
> consider negligible cost. 0001 is your patch with a new version
> number.

Your patch added allowSystemTableMods to one of the tests. I guess
that this was an oversight?

> I'm fine with your patch as is, but would appreciate it if known
> estimate mistakes would also be fixed.

Why do you think that this particular scenario/example deserves
special attention? As I've acknowledged already, it is true that your
scenario is one in which we provably give a less accurate estimate,
based on already-available information. But other than that, I don't
see any underlying principle that would be violated by my original
patch (any kind of principle, held by anybody). reltuples is just an
estimate.

I was thinking of going your way on this, purely because it didn't
seem like there'd be much harm in it (why not just handle your case
and be done with it?). But I don't think that it's a good idea now.
reltuples is usually derived by ANALYZE using a random sample, so the
idea that tuple density can be derived accurately enough from a random
sample is pretty baked in. You're talking about a case where ignoring
just one page ("sampling" all but one of the pages) *isn't* good
enough. It just doesn't seem like something that needs to be addressed
-- it's quite awkward to do so.

Barring any further objections, I plan on committing the original
version tomorrow.

> An alternative solution could be doing double-vetting, where we ignore
> tuples_scanned if <2% of pages AND <2% of previous estimated tuples
> was scanned.

I'm not sure that I've understood you, but I think that you're talking
about remembering more information (in pg_class), which is surely out
of scope for a bug fix.

-- 
Peter Geoghegan