Thread: [PATCH] Never convert n_distinct < 2 values to a ratio when computing stats
This is a bit of a corner case in all honesty, but if you have a short table (under 20 rows), the 10% heuristic used that decides whether distinct values scale with the row count will result in rather odd values for stadistinct in pg_statistic, such as '-0.2' or '-0.666667', rather than the expected '2'. Additionally, this can cause only one of {t, f} to appear in the most common values array. Does this actually affect query planning in any way? Probably not, but it is extremely odd to look at pg_stats for these columns, and the solution seems easy. --- The only other minor changes included here were to make it clear when we were comparing float values, so use 0.0 instead of 0. Example stats output from the database I noticed this on: archweb=# SELECT c.relname, a.attname, pg_stat_get_live_tuples(c.oid) AS n_live_tup, stadistinct, stanullfrac, stawidth,stavalues1, stanumbers1 FROM pg_statistic s JOIN pg_class c ON c.oid = s.starelid JOIN pg_attribute a ON c.oid =a.attrelid AND a.attnum = s.staattnum LEFT JOIN pg_namespace n ON n.oid = c.relnamespace JOIN pg_type t ON t.oid = a.atttypidWHERE NOT a.attisdropped AND nspname = 'public' AND t.typname = 'bool' ORDER BY stadistinct, n_live_tup; relname | attname | n_live_tup | stadistinct | stanullfrac | stawidth | stavalues1 | stanumbers1 -------------------------------+---------------+------------+-------------+-------------+----------+------------+-----------------------mirrors_mirrorprotocol | is_download | 3 | -0.666667 | 0 | 1 | {t} | {0.666667}arches | agnostic | 3 | -0.666667 | 0 | 1 | {f} | {0.666667}repos | staging | 10 | -0.2 | 0 | 1 | {f,t} | {0.7,0.3}repos | testing | 10 | -0.2 | 0 | 1 | {f,t} | {0.7,0.3}devel_pgpsignature | valid | 264 | 1 | 0 | 1 | {t} |{1}packages_flagrequest | is_spam | 415 | 1 | 0 | 1 | {f} | {1}donors | visible | 716 | 1 | 0 | 1 | {t} | {1}auth_user | is_superuser | 95 | 2 | 0 | 1 | {f,t} | {0.957895,0.0421053}user_profiles | notify | 95 | 2 | 0 | 1 | {t,f} | {0.957895,0.0421053}auth_user | is_active | 95 | 2 | 0 | 1 | {t,f} | {0.621053,0.378947}auth_user | is_staff | 95 | 2 | 0 | 1 | {f,t} | {0.873684,0.126316}releng_iso | active | 158 | 2 | 0 | 1 | {f,t} | {0.893333,0.106667}mirrors_mirror | isos | 180 | 2 | 0 | 1 | {t,f} | {0.972678,0.0273224}mirrors_mirror | active | 180 | 2 | 0 | 1 | {t,f} | {0.672131,0.327869}mirrors_mirror |public | 180 | 2 | 0 | 1 | {t,f} | {0.978142,0.0218579}mirrors_mirrorurl | has_ipv6 | 379 | 2 | 0 | 1 | {f,t} | {0.709763,0.290237}mirrors_mirrorurl | has_ipv4 | 379 | 2 | 0 | 1 | {t} | {0.997361}packages_flagrequest | is_legitimate | 415 | 2 | 0 | 1 |{t,f} | {0.992754,0.00724638}packages_signoffspecification | enabled | 1130 | 2 | 0| 1 | {t,f} | {0.977578,0.0224215}packages_signoffspecification | known_bad | 1130 | 2 | 0 | 1 | {f,t} | {0.993722,0.00627803}mirrors_mirrorlog | is_success | 12715 | 2 | 0 | 1 | {t,f} | {0.953345,0.0466552}package_depends | optional | 28592 | 2 | 0 | 1 | {f,t} | {0.880322,0.119678}package_files | is_directory | 225084 | 2 | 0 | 1 | {f,t} | {0.829933,0.170067} (23 rows) src/backend/commands/analyze.c | 18 +++++++++---------1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index 9cd6e67..995ed9d 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -2110,7 +2110,7 @@ compute_minimal_stats(VacAttrStatsP stats, * least 2 instances in the sample. */ if (track_cnt < track_max && toowide_cnt == 0 && - stats->stadistinct > 0 && + stats->stadistinct > 0.0 && track_cnt <= num_mcv) { /* Track list includes allvalues seen, and all will fit */ @@ -2122,7 +2122,7 @@ compute_minimal_stats(VacAttrStatsP stats, double avgcount, mincount; - if (ndistinct < 0) + if (ndistinct < 0.0) ndistinct = -ndistinct * totalrows; /* estimate # of occurrencesin sample of a typical value */ avgcount = (double) samplerows / ndistinct; @@ -2434,12 +2434,12 @@ compute_scalar_stats(VacAttrStatsP stats, } /* - * If we estimated the number of distinct values at more than 10% of - * the total row count (a very arbitrary limit), then assume that - * stadistinct should scale with the row count rather than be a fixed - * value. + * If we estimated the number of distinct values at more than 2 total + * values (a boolean) and more than 10% of the total row count (a very + * arbitrary limit), then assume that stadistinct should scale with + * the row count rather than be a fixed value. */ - if (stats->stadistinct > 0.1 * totalrows) + if (stats->stadistinct > 2.0 && stats->stadistinct > 0.1 * totalrows) stats->stadistinct = -(stats->stadistinct/ totalrows); /* @@ -2457,7 +2457,7 @@ compute_scalar_stats(VacAttrStatsP stats, * but we prefer to treat such values as MCVs if atall possible.) */ if (track_cnt == ndistinct && toowide_cnt == 0 && - stats->stadistinct > 0 && + stats->stadistinct > 0.0 && track_cnt <= num_mcv) { /* Track list includes allvalues seen, and all will fit */ @@ -2470,7 +2470,7 @@ compute_scalar_stats(VacAttrStatsP stats, mincount, maxmincount; - if (ndistinct < 0) + if (ndistinct < 0.0) ndistinct = -ndistinct * totalrows; /* estimate # of occurrencesin sample of a typical value */ avgcount = (double) samplerows / ndistinct; -- 1.7.9.4
Re: [PATCH] Never convert n_distinct < 2 values to a ratio when computing stats
From
Robert Haas
Date:
On Sat, Mar 24, 2012 at 12:17 AM, Dan McGee <dan@archlinux.org> wrote: > This is a bit of a corner case in all honesty, but if you have a short > table (under 20 rows), the 10% heuristic used that decides whether > distinct values scale with the row count will result in rather odd > values for stadistinct in pg_statistic, such as '-0.2' or '-0.666667', > rather than the expected '2'. Additionally, this can cause only one of > {t, f} to appear in the most common values array. > > Does this actually affect query planning in any way? Probably not, but > it is extremely odd to look at pg_stats for these columns, and the > solution seems easy. But the stats aren't there to be looked at, but rather to guide query planning. If at execution time there are 100 rows in the table, should we still assume that there are only 2 distinct values in the table, or that it's gone up to about 50 distinct values? It's hard to say, but there's no apparent reason to think that the number of distinct values will scale up for a large table but not a small table. The bit about maybe not getting both t and f as MCVs on a Boolean does seem a little worrying, but I'm not sure whether it actually affects query planning in a materially negative way. Can you demonstrate a case where it matters? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > The bit about maybe not getting both t and f as MCVs on a Boolean does > seem a little worrying, but I'm not sure whether it actually affects > query planning in a materially negative way. Can you demonstrate a > case where it matters? If we were trying to force that to happen it would be wrong anyway. Consider a column that contains *only* "t", or at least has so few "f"'s that "f" appears never or only once in the selected sample. (IIRC there is a clamp that prevents selecting anything as an MCV unless it appears at least twice in the sample.) Like Robert, I'm not convinced whether or not this is a reasonable change, but arguing for it on the basis of boolean columns doesn't seem very sound. regards, tom lane