Thread: BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      15005
Logged by:          David Gould
Email address:      daveg@sonic.net
PostgreSQL version: 10.1
Operating system:   Linux
Description:

ANALYZE can make pg_class.reltuples wildly inaccurate compared to the
actual
row counts for tables that are larger than the default_statistics_target.

Example from one of a clients production instances:

# analyze verbose pg_attribute;
INFO:  analyzing "pg_catalog.pg_attribute"
INFO:  "pg_attribute": scanned 30000 of 24519424 pages, containing 6475 live
rows and 83 dead rows; 6475 rows in sample, 800983035 estimated total
rows.

This is a large complex database -- pg_attribute actually has about five
million rows and needs about one hundred thouand pages. However it has
become extremely bloated and is taking 25 million pages (192GB!), about
250
times too much. This happened despite aggressive autovacuum settings and a
periodic bloat monitoring script. Since pg_class.reltuples was 800
million,
our bloat monitoring script did not detect that this table was bloated and
autovacuum did not think it needed vacuuming.
 
When reltuples is very large compared to the actual row count it causes
problems:

- Bad input to the query planner.
- Prevents autovacuum from processing large bloated tables because
  autovacuum_scale_factor * reltuples is large enough the threshold is
rarely
  reached.
- Decieves bloat checking tools that rely on the relationship of relpages
  to reltuples*average_row_size.

-dg



Re: BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

From
David Gould
Date:
> Bug reference:      15005
> Logged by:          David Gould
> Email address:      daveg@sonic.net

> ANALYZE can make pg_class.reltuples wildly inaccurate compared to the actual
> row counts for tables that are larger than the default_statistics_target.

I've tracked down how this happens and created a reproduction script and a
patch. Attached:

- vac_estimate_reltuples.patch     Patch against master
- README.txt                       Instructions for testing
- reltuples_analyze_bug.sql        Reproduction script
- analyze_counts.awk               Helper for viewing results of test
- test_standard.txt                Test output for unpatched postgresql 10.1
- test_patched.txt                 Test output with patch

Analysis:
---------

Analyze and vacuum calculate the new value for pg_class.reltuples in
vacuum.c:vac_estimate_reltuples():

    old_density = old_rel_tuples / old_rel_pages;
    new_density = scanned_tuples / scanned_pages;
    multiplier = (double) scanned_pages / (double) total_pages;
    updated_density = old_density + (new_density - old_density) * multiplier;
    return floor(updated_density * total_pages + 0.5);

The comments talk about the difference between VACUUM and ANALYZE and explain
that VACUUM probably only scanned changed pages so the density of the scanned
pages is not representative of the rest of the unchanged table. Hence the new
overall density of the table should be adjusted proportionaly to the scanned
pages vs total pages. Which makes sense. However despite the comment noteing
that ANALYZE and VACUUM are different, the code actually does the same
calculation for both.

The problem is that it dilutes the impact of ANALYZE on reltuples for large
tables:

- For a table of 3000000 pages an analyze can only change the reltuples
  value by 1%.
- When combined with changes in relpages due to bloat the new computed
  reltuples can end up far from reality.


Reproducing the reltuples analyze estimate bug.
-----------------------------------------------

The script "reltuples_analyze_bug.sql" creates a table that is large
compared to the analyze sample size and then repeatedly updates about
10% of it followed by an analyze each iteration. The bug is that the
calculation analyze uses to update pg_class.reltuples will tend to
increase each time even though the actual rowcount does not change.

To run:

Given a postgresql 10.x server with >= 1GB of shared buffers:

  createdb test
  psql --no-psqlrc -f reltuples_analyze_bug.sql test > test_standard.out 2>&1
  awk -f analyze_counts.awk test_standard.out

To verify the fix, restart postgres with a patched binary and repeat
the above.

Here are the results with an unpatched server:

After 10 interations of:
  update 10% of rows;
  analyze

reltuples has almost doubled.

               / estimated rows  /    /   pages   /   /sampled rows/
   relname       current  proposed    total scanned    live    dead
reltuples_test  10000001  10000055   153847   3000   195000       0
reltuples_test  10981367   9951346   169231   3000   176410   18590
reltuples_test  11948112  10039979   184615   3000   163150   31850
reltuples_test  12900718  10070666   200000   3000   151060   43940
reltuples_test  13835185   9739305   215384   3000   135655   59345
reltuples_test  14758916   9864947   230768   3000   128245   66755
reltuples_test  15674572  10138631   246153   3000   123565   71435
reltuples_test  16576847   9910944   261537   3000   113685   81315
reltuples_test  17470388  10019961   276922   3000   108550   86450
reltuples_test  18356707  10234607   292306   3000   105040   89960
reltuples_test  19228409   9639927   307690   3000    93990  101010

-- 
David Gould                                   daveg@sonic.net
If simplicity worked, the world would be overrun with insects.

Attachment

Re: BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

From
David Gould
Date:
On Thu, 11 Jan 2018 10:12:16 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > David Gould wrote:  
> >> I've tracked down how this happens and created a reproduction script and a
> >> patch. Attached:  
> 
> > Great work.  Please make sure to register this as a bug fix in the
> > next commitfest.  
> 
> There's already an existing discussion about what seems to be the same
> issue, or at least a closely related one:
> 
> https://www.postgresql.org/message-id/flat/16db4468-edfa-830a-f921-39a50498e77e@2ndquadrant.com

Yes. I reviewed that thread, while I was researching. It is a related issue
but not the same. That one adds n_tup_dead to the reluples estimate after
vacuums. It amplifies the effect of this one which prevents analyze from
fixing reltuples.

-dg

-- 
David Gould                                   daveg@sonic.net
If simplicity worked, the world would be overrun with insects.


Re: BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

From
David Gould
Date:
On Thu, 11 Jan 2018 11:39:22 -0300
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> David Gould wrote:
> 
> > I've tracked down how this happens and created a reproduction script and a
> > patch. Attached:
> > 
> > - vac_estimate_reltuples.patch     Patch against master
> > - README.txt                       Instructions for testing
> > - reltuples_analyze_bug.sql        Reproduction script
> > - analyze_counts.awk               Helper for viewing results of test
> > - test_standard.txt                Test output for unpatched postgresql 10.1
> > - test_patched.txt                 Test output with patch  
> 
> Great work.  Please make sure to register this as a bug fix in the
> next commitfest.
> 

Done last week. See:

https://www.postgresql.org/message-id/flat/20180117164916.3fdcf2e9@engels/

I think this should be considered for back-patching, the bug exists in all
versions back to 9.4 (and probably earlier, but I did not test those).

-dg

-- 
David Gould                                   daveg@sonic.net
If simplicity worked, the world would be overrun with insects.