In the process of attempting to understand the data in pg_stats, I created a
(very) simple example :
CREATE TABLE test(id integer);
INSERT INTO test VALUES(1);
INSERT INTO test VALUES(1);
INSERT INTO test VALUES(1);
INSERT INTO test VALUES(1);
INSERT INTO test VALUES(2);
INSERT INTO test VALUES(2);
INSERT INTO test VALUES(2);
INSERT INTO test VALUES(3);
INSERT INTO test VALUES(4);
INSERT INTO test VALUES(5);
ANALYZE test;
SELECT * FROM pg_stats WHERE tablename='test';
tablename test
attname id
null_frac 0
avg_width 4
n_distinct -0.5
most_common_vals {"1","2"}
most_common_vals {"0.4","0.3"}
histogram_bounds {"3","4","5"}
correlation 1
everything looks good except for n_distinct ( its negative - should be 5)
(I wasn't too worried about avg_width )
Using fairly crude tracing (adding elog calls) in
src/backend/commands/analyze.c :
DEBUG: Analyzing test
DEBUG: Analyze : beginning a column
DEBUG: Have 10 total values in relation (totalrows)
DEBUG: Have 10 values in relation (numrows)
DEBUG: Have 10 values in sample (values_cnt)
DEBUG: Have 5 distinct values in sample (ndistinct)
DEBUG: Have 2 multiple values in sample (nmultiple)
DEBUG: calc 5.000000 distinct via Chaudhuri rule
DEBUG: calc -0.500000 distinct via >10 percent rowcount rule
So we had the correct answer before applying the 10 percent rowcount code.
This 10 percent rowcount code being line 1340 or thereabouts :
if (stats->stadistinct > 0.1 * totalrows)
{
stats->stadistinct = -(stats->stadistinct / totalrows);
}
My example is pretty contrived, but I wonder if I have "stumbled" on a bug
here.
regards
Mark