[PATCH] Never convert n_distinct < 2 values to a ratio when computing stats - Mailing list pgsql-hackers

From Dan McGee
Subject [PATCH] Never convert n_distinct < 2 values to a ratio when computing stats
Date
Msg-id 1332562638-11675-1-git-send-email-dan@archlinux.org
Whole thread Raw
Responses Re: [PATCH] Never convert n_distinct < 2 values to a ratio when computing stats  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
This is a bit of a corner case in all honesty, but if you have a short
table (under 20 rows), the 10% heuristic used that decides whether
distinct values scale with the row count will result in rather odd
values for stadistinct in pg_statistic, such as '-0.2' or '-0.666667',
rather than the expected '2'. Additionally, this can cause only one of
{t, f} to appear in the most common values array.

Does this actually affect query planning in any way? Probably not, but
it is extremely odd to look at pg_stats for these columns, and the
solution seems easy.
---

The only other minor changes included here were to make it clear when we were
comparing float values, so use 0.0 instead of 0.

Example stats output from the database I noticed this on:

archweb=# SELECT c.relname, a.attname, pg_stat_get_live_tuples(c.oid) AS n_live_tup, stadistinct, stanullfrac,
stawidth,stavalues1, stanumbers1 FROM pg_statistic s JOIN pg_class c ON c.oid = s.starelid JOIN pg_attribute a ON c.oid
=a.attrelid AND a.attnum = s.staattnum LEFT JOIN pg_namespace n ON n.oid = c.relnamespace JOIN pg_type t ON t.oid =
a.atttypidWHERE NOT a.attisdropped AND nspname = 'public' AND t.typname = 'bool' ORDER BY stadistinct, n_live_tup;
    relname            |    attname    | n_live_tup | stadistinct | stanullfrac | stawidth | stavalues1 |
stanumbers1     
 

-------------------------------+---------------+------------+-------------+-------------+----------+------------+-----------------------mirrors_mirrorprotocol
      | is_download   |          3 |   -0.666667 |           0 |        1 | {t}        | {0.666667}arches
        | agnostic      |          3 |   -0.666667 |           0 |        1 | {f}        | {0.666667}repos
          | staging       |         10 |        -0.2 |           0 |        1 | {f,t}      | {0.7,0.3}repos
           | testing       |         10 |        -0.2 |           0 |        1 | {f,t}      |
{0.7,0.3}devel_pgpsignature           | valid         |        264 |           1 |           0 |        1 | {t}
|{1}packages_flagrequest          | is_spam       |        415 |           1 |           0 |        1 | {f}        |
{1}donors                       | visible       |        716 |           1 |           0 |        1 | {t}        |
{1}auth_user                    | is_superuser  |         95 |           2 |           0 |        1 | {f,t}      |
{0.957895,0.0421053}user_profiles                | notify        |         95 |           2 |           0 |        1 |
{t,f}     | {0.957895,0.0421053}auth_user                     | is_active     |         95 |           2 |           0
|       1 | {t,f}      | {0.621053,0.378947}auth_user                     | is_staff      |         95 |           2 |
        0 |        1 | {f,t}      | {0.873684,0.126316}releng_iso                    | active        |        158 |
     2 |           0 |        1 | {f,t}      | {0.893333,0.106667}mirrors_mirror                | isos          |
180 |           2 |           0 |        1 | {t,f}      | {0.972678,0.0273224}mirrors_mirror                | active
   |        180 |           2 |           0 |        1 | {t,f}      | {0.672131,0.327869}mirrors_mirror
|public        |        180 |           2 |           0 |        1 | {t,f}      | {0.978142,0.0218579}mirrors_mirrorurl
           | has_ipv6      |        379 |           2 |           0 |        1 | {f,t}      |
{0.709763,0.290237}mirrors_mirrorurl            | has_ipv4      |        379 |           2 |           0 |        1 |
{t}       | {0.997361}packages_flagrequest          | is_legitimate |        415 |           2 |           0 |        1
|{t,f}      | {0.992754,0.00724638}packages_signoffspecification | enabled       |       1130 |           2 |
0|        1 | {t,f}      | {0.977578,0.0224215}packages_signoffspecification | known_bad     |       1130 |           2
|          0 |        1 | {f,t}      | {0.993722,0.00627803}mirrors_mirrorlog             | is_success    |      12715
|          2 |           0 |        1 | {t,f}      | {0.953345,0.0466552}package_depends               | optional
|     28592 |           2 |           0 |        1 | {f,t}      | {0.880322,0.119678}package_files                 |
is_directory |     225084 |           2 |           0 |        1 | {f,t}      | {0.829933,0.170067}
 
(23 rows)

src/backend/commands/analyze.c |   18 +++++++++---------1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 9cd6e67..995ed9d 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2110,7 +2110,7 @@ compute_minimal_stats(VacAttrStatsP stats,         * least 2 instances in the sample.         */
     if (track_cnt < track_max && toowide_cnt == 0 &&
 
-            stats->stadistinct > 0 &&
+            stats->stadistinct > 0.0 &&            track_cnt <= num_mcv)        {            /* Track list includes
allvalues seen, and all will fit */
 
@@ -2122,7 +2122,7 @@ compute_minimal_stats(VacAttrStatsP stats,            double        avgcount,
  mincount;
 
-            if (ndistinct < 0)
+            if (ndistinct < 0.0)                ndistinct = -ndistinct * totalrows;            /* estimate # of
occurrencesin sample of a typical value */            avgcount = (double) samplerows / ndistinct;
 
@@ -2434,12 +2434,12 @@ compute_scalar_stats(VacAttrStatsP stats,        }        /*
-         * If we estimated the number of distinct values at more than 10% of
-         * the total row count (a very arbitrary limit), then assume that
-         * stadistinct should scale with the row count rather than be a fixed
-         * value.
+         * If we estimated the number of distinct values at more than 2 total
+         * values (a boolean) and more than 10% of the total row count (a very
+         * arbitrary limit), then assume that stadistinct should scale with
+         * the row count rather than be a fixed value.         */
-        if (stats->stadistinct > 0.1 * totalrows)
+        if (stats->stadistinct > 2.0 && stats->stadistinct > 0.1 * totalrows)            stats->stadistinct =
-(stats->stadistinct/ totalrows);        /*
 
@@ -2457,7 +2457,7 @@ compute_scalar_stats(VacAttrStatsP stats,         * but we prefer to treat such values as MCVs if
atall possible.)         */        if (track_cnt == ndistinct && toowide_cnt == 0 &&
 
-            stats->stadistinct > 0 &&
+            stats->stadistinct > 0.0 &&            track_cnt <= num_mcv)        {            /* Track list includes
allvalues seen, and all will fit */
 
@@ -2470,7 +2470,7 @@ compute_scalar_stats(VacAttrStatsP stats,                        mincount,
maxmincount;
-            if (ndistinct < 0)
+            if (ndistinct < 0.0)                ndistinct = -ndistinct * totalrows;            /* estimate # of
occurrencesin sample of a typical value */            avgcount = (double) samplerows / ndistinct;
 
-- 
1.7.9.4



pgsql-hackers by date:

Previous
From: Dave Cramer
Date:
Subject: who's familiar with the GSOC application process
Next
From: Constantin Teodorescu
Date:
Subject: PostgreSQL optimisations on Linux machines with more than 24 cores