Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate. - Mailing list pgsql-hackers

From David Gould
Subject Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.
Date
Msg-id 20180302161840.3530921f@engels
In response to Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Thu, 01 Mar 2018 18:49:20 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:


> The sticking point in my mind right now is, if we do that, what to do with
> VACUUM's estimates.  If you believe the argument in the PDF that we'll
> necessarily overshoot reltuples in the face of declining true density,
> then it seems like that argument applies to VACUUM as well.  However,
> VACUUM has the issue that we should *not* believe that it looked at a
> random sample of pages.  Maybe the fact that it looks exactly at the
> changed pages causes it to see something less than the overall density,
> cancelling out the problem, but that seems kinda optimistic.

For what it's worth, I think the current estimate formula for VACUUM is
pretty reasonable. Consider a table T with N rows and P pages, clustered
on a serial key k. Assume reltuples is initially correct.

Then after:

 delete from T where k < 0.2 * (select max(k) from T);
 vacuum T;

Vacuum will touch only the first 20% of the pages thanks to the visibility map;
the sample will have 0 live rows, and scanned pages will be 0.2 * P.

Then according to the current code:

    old_density = old_rel_tuples / old_rel_pages;
    new_density = scanned_tuples / scanned_pages;
    multiplier = (double) scanned_pages / (double) total_pages;
    updated_density = old_density + (new_density - old_density) * multiplier;
    return floor(updated_density * total_pages + 0.5);

the new density will be:

   N/P + (0/(0.2*P) - N/P) * 0.2
 = N/P - N/P * 0.2
 = 0.8 * N/P

The new reltuples estimate will be 0.8 * old_reltuples, which is what we wanted.
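
To make that concrete, here is a minimal standalone sketch of the same
formula (modeled on the snippet above, not the actual vacuum.c code; the
table size is made up purely for illustration):

    #include <math.h>
    #include <stdio.h>

    /* Same calculation as the snippet quoted above, wrapped in a helper. */
    static double
    estimate_reltuples(double old_rel_pages, double old_rel_tuples,
                       double total_pages, double scanned_pages,
                       double scanned_tuples)
    {
        double old_density = old_rel_tuples / old_rel_pages;
        double new_density = scanned_tuples / scanned_pages;
        double multiplier = scanned_pages / total_pages;
        double updated_density = old_density
            + (new_density - old_density) * multiplier;

        return floor(updated_density * total_pages + 0.5);
    }

    int
    main(void)
    {
        /* Hypothetical table: P = 10000 pages, N = 1000000 rows. */
        double P = 10000.0, N = 1000000.0;

        /* Scenario above: first 20% of pages scanned, 0 live rows seen. */
        printf("%.0f\n", estimate_reltuples(P, N, P, 0.2 * P, 0.0));
        /* prints 800000, i.e. 0.8 * N */
        return 0;
    }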


If we evenly distribute the deletes across the table:

  delete from T where random() < 0.2;

Then vacuum will scan all the pages; the sample will have 0.8 * N live rows,
and scanned pages will be 1.0 * P. The new density will be

   N/P + ((0.8*N)/(1.0*P) - N/P) * 1.0
 = N/P + (0.8 * N/P - N/P)
 = N/P - 0.2 * N/P
 = 0.8 * N/P

This again gives new reltuples as 0.8 * old_reltuples, which is again correct.
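
Plugging this second scenario into the same sketch (this reuses the
estimate_reltuples() helper from above, so it is just the extra call, not a
complete program):

        /* Evenly spread deletes: all P pages scanned, 0.8 * N live rows seen. */
        printf("%.0f\n", estimate_reltuples(P, N, P, P, 0.8 * N));
        /* also prints 800000, i.e. 0.8 * N */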

I believe that, given a good initial estimate of reltuples and relpages, and
assuming that the pages vacuum does not scan keep their old density, the
vacuum calculation does the right thing: the formula is just a weighted
average of the old density and the observed density, weighted by the
fraction of pages scanned.

However, for ANALYZE the case is different.

-dg

-- 
David Gould                                   daveg@sonic.net
If simplicity worked, the world would be overrun with insects.

