Strange avg value size for index on expression in pg_stats - Mailing list pgsql-general

From Jehan-Guillaume de Rorthais
Subject Strange avg value size for index on expression in pg_stats
Date
Msg-id 20141110115254.2997aaf8@erg
Whole thread Raw
Responses Re: Strange avg value size for index on expression in pg_stats  (Jehan-Guillaume de Rorthais <ioguix@free.fr>)
Re: Strange avg value size for index on expression in pg_stats  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
Hello,

I'm not sure if I should post here or on pgsql-hackers.

While investigating about a wrong result with my btree bloat estimation query,
I found a strange stat deviation between the average size of a value in its
table and its average size in one index on the "lower" expression.

Take the following scenario:

  postgres@test=# create table test as
  test-# select md5(t::text) AS t
  test-# from generate_series(1, 1000000) t;
  SELECT 1000000

  postgres@test=# create index ON test (lower(t));
  CREATE INDEX

  postgres@test=# analyze test;
  ANALYZE

  postgres@test=# select tablename, attname, avg_width
  test-# from pg_stats
  test-# where schemaname = 'public';
     tablename    | attname | avg_width
  ----------------+---------+-----------
   test           | t       |        33
   test_lower_idx | lower   |        36

Md5 values are always 32 bytes long. In the table, pg_stats reports 33 because
of the text header. In the index, the reported value is 36!

Looking at the page layout documentation and in the index using hexdump, I can
not find any answer about this 3 bytes. PFA the "hexdump -C" output from the
index. For each row, we clearly see a 8 bytes row header followed by a ONE byte
value header (43 'C'), the 32 bytes of the md5 and 7 bytes of padding (00).

A wild guess would be that ANALYZE is considering a text field from an
expression has always a 4 bytes header whatever its actual size (text field
header size is one if the value is < 127 bytes long, 4 in other situations).


This tiny difference is the source of a very bad estimation with the Btree bloat
estimation query when values are around an alignement boundary. As instance,
here is the use case that lead me to this:

       tablename      | attname | avg_width
  --------------------+---------+-----------
   customer           | email   |        23
   customer_lower_idx | lower   |        26

We have an index on email, and another one on lower(index). The first one is
aligned on 24, the second one on 32. Leading to bloat estimation of 17% for
the first one and -1% for the second one (the estimated index size is bigger
than the real one).

Any information about this from a hacker? Do anyone have an explanation about
this? Is it something that's worth posting on pgsql-hackers?


Regards,

Attachment

pgsql-general by date:

Previous
From: Ilya Ashchepkov
Date:
Subject: justify_interval: days in year
Next
From: Jehan-Guillaume de Rorthais
Date:
Subject: Re: Strange avg value size for index on expression in pg_stats