Hello,
I'm not sure if I should post here or on pgsql-hackers.
While investigating about a wrong result with my btree bloat estimation query,
I found a strange stat deviation between the average size of a value in its
table and its average size in one index on the "lower" expression.
Take the following scenario:
postgres@test=# create table test as
test-# select md5(t::text) AS t
test-# from generate_series(1, 1000000) t;
SELECT 1000000
postgres@test=# create index ON test (lower(t));
CREATE INDEX
postgres@test=# analyze test;
ANALYZE
postgres@test=# select tablename, attname, avg_width
test-# from pg_stats
test-# where schemaname = 'public';
tablename | attname | avg_width
----------------+---------+-----------
test | t | 33
test_lower_idx | lower | 36
Md5 values are always 32 bytes long. In the table, pg_stats reports 33 because
of the text header. In the index, the reported value is 36!
Looking at the page layout documentation and in the index using hexdump, I can
not find any answer about this 3 bytes. PFA the "hexdump -C" output from the
index. For each row, we clearly see a 8 bytes row header followed by a ONE byte
value header (43 'C'), the 32 bytes of the md5 and 7 bytes of padding (00).
A wild guess would be that ANALYZE is considering a text field from an
expression has always a 4 bytes header whatever its actual size (text field
header size is one if the value is < 127 bytes long, 4 in other situations).
This tiny difference is the source of a very bad estimation with the Btree bloat
estimation query when values are around an alignement boundary. As instance,
here is the use case that lead me to this:
tablename | attname | avg_width
--------------------+---------+-----------
customer | email | 23
customer_lower_idx | lower | 26
We have an index on email, and another one on lower(index). The first one is
aligned on 24, the second one on 32. Leading to bloat estimation of 17% for
the first one and -1% for the second one (the estimated index size is bigger
than the real one).
Any information about this from a hacker? Do anyone have an explanation about
this? Is it something that's worth posting on pgsql-hackers?
Regards,