Re: Strange avg value size for index on expression in pg_stats - Mailing list pgsql-general

From Jehan-Guillaume de Rorthais
Subject Re: Strange avg value size for index on expression in pg_stats
Date
Msg-id 20141110164111.665de42c@erg
Whole thread Raw
In response to Re: Strange avg value size for index on expression in pg_stats  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
On Mon, 10 Nov 2014 09:39:23 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Jehan-Guillaume de Rorthais <ioguix@free.fr> writes:
> > While investigating about a wrong result with my btree bloat estimation
> > query, I found a strange stat deviation between the average size of a value
> > in its table and its average size in one index on the "lower" expression.
>
> ANALYZE doesn't look at the physical index at all.  For an expression
> index, it will recompute the index expression at each sampled row, and
> then take statistics on those values.  In the case you've got here, it's
> re-running the lower() function and looking at the uncompressed result.
> So that accounts for the discrepancy in average width stats.

Ok, understood.

But in my demo scenario, I used *only* md5 to populate the test table. So
data length is always 32 and the average width length is exact. No possible
deviation, even with lower(). To be quite dense:

  postgres@test=# select
    length(lower(md5('a'))) = length(md5('b')),
    length(md5('c')) = length(md5('d'));
  -[ RECORD 1 ]
  ?column? | t
  ?column? | t

And here is another test with a static string for all rows:

  postgres@test=# create table test as
  select '1234567890123456789012'::text as t
  from generate_series(1,100);
  SELECT 100

  postgres@test=# create index ON test (lower(t));
  CREATE INDEX

  postgres@test=# analyze test;
  ANALYZE

  postgres@test=# select tablename, attname, avg_width
  from pg_stats
  where schemaname = 'public';
     tablename    | attname | avg_width
  ----------------+---------+-----------
   test           | t       |        23
   test_lower_idx | lower   |        26


AFAIU, we should not have a discrepancy here.

> > This tiny difference is the source of a very bad estimation with the
> > Btree bloat estimation query when values are around an alignement
> > boundary.
>
> TBH, if that query is relying on ANALYZE width estimates to be accurate
> to the last byte, its reliability is going to be horrid anyway.

Well, I'm aware of that. I don't need an accuracy to the last byte. This query
doesn't even pay attention to the data alignment padding anyway (I measured some
10% deviation in a case because of this).

This request only helps guessing the bloat evolution in Btrees or quickly
discover *big* deviations. In many situations, we can not afford a call to
pgstattuple.avg_leaf_density().

But this statistic difference between two values with the exact same size is
itching me. Sa far, I couldn't find a logical explanation and it just looks
like a wrong statistic.

Regards,
--
Jehan-Guillaume (ioguix) de Rorthais


pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: Strange avg value size for index on expression in pg_stats
Next
From: Edson Richter
Date:
Subject: Centos 7 - anyone?