On Mon, 10 Nov 2014 09:39:23 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jehan-Guillaume de Rorthais <ioguix@free.fr> writes:
> > While investigating about a wrong result with my btree bloat estimation
> > query, I found a strange stat deviation between the average size of a value
> > in its table and its average size in one index on the "lower" expression.
>
> ANALYZE doesn't look at the physical index at all. For an expression
> index, it will recompute the index expression at each sampled row, and
> then take statistics on those values. In the case you've got here, it's
> re-running the lower() function and looking at the uncompressed result.
> So that accounts for the discrepancy in average width stats.
Ok, understood.
But in my demo scenario, I used *only* md5 to populate the test table. So
data length is always 32 and the average width length is exact. No possible
deviation, even with lower(). To be quite dense:
postgres@test=# select
length(lower(md5('a'))) = length(md5('b')),
length(md5('c')) = length(md5('d'));
-[ RECORD 1 ]
?column? | t
?column? | t
And here is another test with a static string for all rows:
postgres@test=# create table test as
select '1234567890123456789012'::text as t
from generate_series(1,100);
SELECT 100
postgres@test=# create index ON test (lower(t));
CREATE INDEX
postgres@test=# analyze test;
ANALYZE
postgres@test=# select tablename, attname, avg_width
from pg_stats
where schemaname = 'public';
tablename | attname | avg_width
----------------+---------+-----------
test | t | 23
test_lower_idx | lower | 26
AFAIU, we should not have a discrepancy here.
> > This tiny difference is the source of a very bad estimation with the
> > Btree bloat estimation query when values are around an alignement
> > boundary.
>
> TBH, if that query is relying on ANALYZE width estimates to be accurate
> to the last byte, its reliability is going to be horrid anyway.
Well, I'm aware of that. I don't need an accuracy to the last byte. This query
doesn't even pay attention to the data alignment padding anyway (I measured some
10% deviation in a case because of this).
This request only helps guessing the bloat evolution in Btrees or quickly
discover *big* deviations. In many situations, we can not afford a call to
pgstattuple.avg_leaf_density().
But this statistic difference between two values with the exact same size is
itching me. Sa far, I couldn't find a logical explanation and it just looks
like a wrong statistic.
Regards,
--
Jehan-Guillaume (ioguix) de Rorthais