On Tue, Mar 8, 2016 at 3:36 PM, Joel Jacobson <joel@trustly.com> wrote:
Hi Alex,
Thanks for excellent research.
Joel,
Thank you for spending your time to run these :-)
I've ran your queries against Trustly's production database and I can confirm your findings, the results are similar: WITH ... SELECT count(1), min(hist_ratio)::real, avg(hist_ratio)::real, max(hist_ratio)::real, stddev(hist_ratio)::real FROM stats2 WHERE histogram_bounds IS NOT NULL;
-[ RECORD 1 ]---- count | 2814 min | 0.193548 avg | 0.927357 max | 1 stddev | 0.164134
WHERE distinct_hist < num_hist -[ RECORD 1 ]---- count | 624 min | 0.193548 avg | 0.672407 max | 0.990099 stddev | 0.194901
WITH .. SELECT schemaname ||'.'|| tablename ||'.'|| attname || (CASE inherited WHEN TRUE THEN ' (inherited)' ELSE '' END) AS columnname, n_distinct, null_frac, num_mcv, most_common_vals, most_common_freqs, mcv_frac, (mcv_frac / (1 - null_frac))::real AS nonnull_mcv_frac, distinct_hist, num_hist, hist_ratio, histogram_bounds FROM stats2 ORDER BY hist_ratio LIMIT 1;
I don't want to be asking for too much here, but is there a chance you could try the effects of the proposed patch on an offline copy of your database?
Do you envision or maybe have experienced problems with query plans referring to the columns that are near the top of the above hist_ratio report? In other words: what are the practical implications for you with the values being duplicated rather badly throughout the histogram like in the example you shown?