Thread: Are statistics gathered on function indexes?

Are statistics gathered on function indexes?

From: "Nick Fankhauser"
Hi-

Can someone tell me how the cost is estimated for retrieving a column based
on a function that is indexed?

The issue I'm wrestling with is a query which works nicely when based on a
plain "LIKE field", but poorly when using "LIKE upper(field)". There is an
index on both.

Since this field has only uppercase strings in it currently, I know that
both indexes are equally selective, but the planner judges that "LIKE field"
will return 1 row, and "LIKE upper(field)" will return 2168 rows. In both
cases, the index is used, but in the next step, joining to another table,
the regular version uses an index on the other table, while the "upper"
version uses a seq scan. I'm guessing the seq scan is chosen because, if we
are going after 2168 rows, an index lookup into the adjoining table is no
longer a good choice.

I'm able to see the stats for the field using pg_stats, but I don't see
anything connected to the function, so I'm guessing that real stats aren't
kept and some sort of default is used. Perhaps I can modify this default.
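For instance (with a made-up table name), something like

    select attname, null_frac, n_distinct
    from pg_stats
    where tablename = 'mytable';

shows a row for the plain field, but nothing for upper(field).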



Also, even with 2168 rows to gather, my experience with cases where
several thousand rows really are returned suggests that the index would
still be a good choice. Is there a way to make the planner favor index
scans a bit more? (Other than the drastic "set enable_seqscan to off".)


Thanks

-Nick



--------------------------------------------------------------------------
Nick Fankhauser  nickf@ontko.com  Phone 1.765.935.4283  Fax 1.765.962.9788
Ray Ontko & Co.     Software Consulting Services     http://www.ontko.com/




Re: Are statistics gathered on function indexes?

From: Tom Lane
"Nick Fankhauser" <nickf@ontko.com> writes:
> [see subject]

Nope, they ain't.  I agree they should be.

> Can someone tell me how the cost is estimated for retrieving a column based
> on a function that is indexed?

It falls back to a default selectivity estimate, which is something
like 1% or 0.5% (depending on which version you are running).
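(For scale: if the table holds roughly 434,000 rows, as reported later
in this thread, a 0.5% default works out to about 2170 rows -- right
around the 2168-row estimate you're seeing.)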

> Also, even with 2168 rows to gather, my experience based on cases where
> several thousand rows really are returned indicates that the index would
> still be a good choice. Is there a way to make the planner favor index scans
> a bit more? (Other than the drastic set enable_seqscan to off.)

I'd suggest reducing random_page_cost; we've seen a number of anecdotal
reports that the default of 4.0 is too high, though nothing systematic
enough to refute the experiments I did to get that number awhile back.
(IMHO anyway.  Others may differ.)
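For example, to try a lower value in one session before touching
postgresql.conf (the 2 here is just a starting point to experiment
with):

    set random_page_cost = 2;

and then re-run EXPLAIN to see whether the plan flips back to the
index scan.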

            regards, tom lane



Re: Are statistics gathered on function indexes?

From: "Nick Fankhauser"
Tom-

Thanks for the info-

Based on your response plus some local issues, we're going to work around
this by simply creating another column containing the results of the
function & then index the column. That gets the results we want without
tweaking something we may regret later.
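Roughly like this (names made up; we'll also add a trigger or
application code to keep the new column in sync on insert/update):

    alter table mytable add column field_upper text;
    update mytable set field_upper = upper(field);
    create index mytable_field_upper on mytable (field_upper);
    analyze mytable;    -- real stats now get gathered for this column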

Stats for function indexes would be nice, so add our vote for it to wherever
such things are tallied to come up with priorities.

Regards,

-Nick





Re: Are statistics gathered on function indexes?

From: Ray Ontko
Tom, et al,

Yes, thanks.

Another thing we noticed is that when LIKE is used with a constant
pattern on an indexed column, such as 'WILLIAMS%', a full table
scan occurs when the constant is 'W%', but the index is used
if the LIKE string is 'WI%' or longer.

It seems to me that the selectivity of the string would vary
with the length of the string, perhaps as a fraction of the
length of the field.  In other words, I would have expected
the selectivity to vary something like this:

         W% -> 0.1
        WI% -> 0.01
       WIL% -> 0.001
      WILL% -> 0.0001
     WILLI% -> 0.00001
    WILLIA% -> 0.000001
   WILLIAM% -> 0.0000001
  WILLIAMS% -> 0.00000001

In other words, if I only give one letter, then I might expect
to get about 1/10 of the table, and a full scan might make sense.
But the cost should continue to decline as I give longer and longer
strings, up to the length of the field.

Would this be a reasonable improvement to the optimizer?

Ray


----------------------------------------------------------------------
Ray Ontko   rayo@ontko.com   Phone 1.765.935.4283   Fax 1.765.962.9788
Ray Ontko & Co.   Software Consulting Services   http://www.ontko.com/



Re: Are statistics gathered on function indexes?

From: Tom Lane
Ray Ontko <rayo@ontko.com> writes:
> In other words, if I only give one letter, then I might expect
> to get about 1/10 of the table, and a full scan might make sense.
> But the cost should continue to decline as I give longer and longer
> strings, up to the length of the field.
> Would this be a reasonable improvement to the optimizer?

It's there already; what did you think was making the difference
between W% and WI% ?

            regards, tom lane



Re: Are statistics gathered on function indexes?

From: Curt Sampson
The vast majority of the cost of using an index is due to reading
the data blocks in the table in a random order, rather than
sequentially, right? Rather than making estimates of the selectivity
of an index when you search for 'W%', why not actually start the
index lookup and count how many you get? If you know you want to
do a table scan if you have more than, say, 500 rows that match
'W%', you'd only have to read a few index pages to determine whether
or not there are more than 500 rows, right?
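For example, something along these lines (made-up names, and assuming
the server accepts LIMIT in a subselect and uses the index on name for
it) bounds the work by hand:

    select count(*)
    from (select name from t where name like 'W%' limit 501) as probe;

If the count comes back as 501, you know there are "more than 500"
matches without having scanned them all.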

Or am I on crack here?

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC




Re: Are statistics gathered on function indexes?

From: Ray Ontko
Tom,

> > In other words, if I only give one letter, then I might expect
> > to get about 1/10 of the table, and a full scan might make sense.
> > But the cost should continue to decline as I give longer and longer
> > strings, up to the length of the field.
> > Would this be a reasonable improvement to the optimizer?
>
> It's there already; what did you think was making the difference
> between W% and WI% ?

Yes, but the cost doesn't continue to decline if I make the LIKE
more and more restrictive by going from WI% to WIL% to WILL%, etc.
The current approach assumes, perhaps correctly, that with only
one letter, you might as well do a full table scan, but with 2
or more letters, you might as well use an index and make a
reasonable guess at the cost.

The limitation of this approach is that the optimizer is weighing a
number of different options when the join includes several tables and
constraints.  My query suffers from a
one-size-fits-all approach.  If the optimizer had a better
guess on cost, it could choose to drive my query using this
index instead of another.

By allowing the cost to decline as the length of the string
increases, we're making the bet that longer strings are more
selective and require fewer random pages to be read, among
other improvements.  Note that this would behave badly in
situations where all the values in the index begin with the same
10 characters and the LIKE string is less than 10 characters
long.  This is already a problem for the current approach, and
I think could only be solved by adding statistics for the
selectivity of increasingly longer strings.

BTW, I think that this discussion probably belongs in a different
list (e.g., the one for hacking the optimizer).  Since I'm not
(yet) planning to jump in to the code, my purpose for raising
the question here is to help me (and others on the list) understand
the capabilities and limitations of indexes and the optimizer so
that we can make better use of what we have currently, and help
identify areas for improvement.

Ray
----------------------------------------------------------------------
Ray Ontko   rayo@ontko.com   Phone 1.765.935.4283   Fax 1.765.962.9788
Ray Ontko & Co.   Software Consulting Services   http://www.ontko.com/



Re: Are statistics gathered on function indexes?

From: Tom Lane
Ray Ontko <rayo@ontko.com> writes:
>> It's there already; what did you think was making the difference
>> between W% and WI% ?

> Yes, but the cost doesn't continue to decline if I make the LIKE
> more and more restrictive by going from WI% to WIL% to WILL%, etc.

Yes it does, if you have a large enough table.  In most scenarios
the selectivity drops off fast enough with larger strings that you
hit the minimum estimate of 1 row pretty quickly; I suppose that's
what's happening with your case.  Here's an example using the 7.2
regression-test database:

-- update stats
regression=# analyze road;
ANALYZE

-- now force planner to think "road" is much larger than it really is,
-- else we can't see the change in estimate beyond WI%
regression=# update pg_class set relpages = relpages * 10000,
regression-# reltuples = reltuples * 10000 where relname = 'road';
UPDATE 1

regression=# explain select * from only road where name like 'W%';
NOTICE:  QUERY PLAN:

Seq Scan on road  (cost=0.00..1444625.00 rows=764903 width=89)

EXPLAIN
regression=# explain select * from only road where name like 'WI%';
NOTICE:  QUERY PLAN:

Index Scan using rix on road  (cost=0.00..25007.80 rows=8406 width=89)

EXPLAIN
regression=# explain select * from only road where name like 'WIJ%';
NOTICE:  QUERY PLAN:

Index Scan using rix on road  (cost=0.00..277.04 rows=92 width=89)

EXPLAIN
regression=# explain select * from only road where name like 'WIJK%';
NOTICE:  QUERY PLAN:

Index Scan using rix on road  (cost=0.00..5.28 rows=1 width=89)

EXPLAIN
regression=# explain select * from only road where name like 'WIJKL%';
NOTICE:  QUERY PLAN:

Index Scan using rix on road  (cost=0.00..5.23 rows=1 width=89)

EXPLAIN
regression=# explain select * from only road where name like 'WIJKLM%';
NOTICE:  QUERY PLAN:

Index Scan using rix on road  (cost=0.00..5.23 rows=1 width=89)

EXPLAIN
regression=#

As you can see, the estimate drops off by about a factor of 90 per
added character.  This is probably too much, but it's not that easy
to determine what the ratio ought to be.  The critical code involved
in this is convert_string_to_scalar in backend/utils/adt/selfuncs.c;
the ratio per character is essentially the same as the character range
that it induces from the available values.  Feel free to propose a
better implementation if you can think of one.
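(Checking the ratios in the plans above: 764903/8406 and 8406/92 are
both about 91, hence the "factor of 90" per character.)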

            regards, tom lane



Re: Are statistics gathered on function indexes?

From: Ray Ontko
Tom, et al,

Hmm.  Something is weird here.  Watch this.

1) first we demonstrate that we drop to 1 row between 1 and 2 letters.
2) then we do an analyze and things work the way you suggest.
3) then we do some more stuff
4) things revert to the old way
5) And even after we re-analyze correctly, things still look broken

Eek.  Do you have an explanation for this behavior?

It appears that "vacuum analyze verbose actor" causes the problem.
It appears that I have to say "vacuum analyze actor" in order to
clear out the ill effects of having said "vacuum analyze verbose actor".
Typing "vacuum verbose analyze actor" doesn't clear things out,
but it doesn't produce the problem.

Ray

P.S. psql (PostgreSQL) 7.2.1

**********
1) first we demonstrate that we drop to 1 row between 1 and 2 letters.
**********

develop=# select count(*) from actor ;
 count
--------
 433902
(1 row)

develop=# explain select * from actor where actor_full_name like 'W%' ;
NOTICE:  QUERY PLAN:

Seq Scan on actor  (cost=0.00..12529.77 rows=3992 width=570)

EXPLAIN
develop=# explain select * from actor where actor_full_name like 'WI%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..6.01 rows=1 width=570)

EXPLAIN

**********
2) then we do an analyze and things work the way you suggest.
**********

develop=# vacuum analyze verbose actor ;
NOTICE:  --Relation actor--
NOTICE:  Pages 7106: Changed 0, Empty 0; Tup 433902: Vac 0, Keep 0, UnUsed 1443.
        Total CPU 0.23s/0.07u sec elapsed 0.30 sec.
NOTICE:  Analyzing actor
VACUUM
develop=# \h vac
Command:     VACUUM
Description: garbage-collect and optionally analyze a database
Syntax:
VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ table ]
VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] ANALYZE [ table [ (column [, ...] ) ] ]

develop=# vacuum verbose analyze actor ;
NOTICE:  --Relation actor--
NOTICE:  Pages 7106: Changed 0, Empty 0; Tup 433902: Vac 0, Keep 0, UnUsed 1443.
        Total CPU 0.26s/0.05u sec elapsed 0.30 sec.
NOTICE:  Analyzing actor
VACUUM
develop=# explain select * from actor where actor_full_name like 'W%' ;
NOTICE:  QUERY PLAN:

Seq Scan on actor  (cost=0.00..12529.77 rows=7468 width=571)

EXPLAIN
develop=# explain select * from actor where actor_full_name like 'WI%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..447.49 rows=112 width=571)

EXPLAIN
develop=# explain select * from actor where actor_full_name like 'WIL%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..9.61 rows=2 width=571)

EXPLAIN

**********
3) then we do some more stuff
**********

develop=# vacuum verbose analyze actor ;
NOTICE:  --Relation actor--
NOTICE:  Pages 7106: Changed 0, Empty 0; Tup 433902: Vac 0, Keep 0, UnUsed 1443.
        Total CPU 0.24s/0.06u sec elapsed 0.30 sec.
NOTICE:  Analyzing actor
VACUUM

develop=# vacuum analyze actor verbose ;
ERROR:  parser: parse error at or near "verbose"
develop=# vacuum verbose analyze actor ;
NOTICE:  --Relation actor--
NOTICE:  Pages 7106: Changed 0, Empty 0; Tup 433902: Vac 0, Keep 0, UnUsed 1443.
        Total CPU 0.22s/0.08u sec elapsed 0.29 sec.
NOTICE:  Analyzing actor
VACUUM
develop=# vacuum analyze verbose actor ;
NOTICE:  --Relation actor--
NOTICE:  Pages 7106: Changed 0, Empty 0; Tup 433902: Vac 0, Keep 0, UnUsed 1443.
        Total CPU 0.24s/0.06u sec elapsed 0.30 sec.
NOTICE:  Analyzing actor
VACUUM

**********
4) Then things revert to the old way.
**********

develop=# explain select * from actor where actor_full_name like 'W%' ;
NOTICE:  QUERY PLAN:

Seq Scan on actor  (cost=0.00..12529.77 rows=6244 width=571)

EXPLAIN
develop=# explain select * from actor where actor_full_name like 'WI%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..6.01 rows=1 width=571)

EXPLAIN
develop=# explain select * from actor where actor_full_name like 'WIL%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..6.01 rows=1 width=571)

EXPLAIN
develop=# explain select * from actor where actor_full_name like 'WILL%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..6.01 rows=1 width=571)

EXPLAIN

**********
5) And even after we re-analyze correctly, things still look broken
**********

develop=# vacuum verbose analyze actor ;
NOTICE:  --Relation actor--
NOTICE:  Pages 7106: Changed 0, Empty 0; Tup 433902: Vac 0, Keep 0, UnUsed 1443.
        Total CPU 0.30s/0.01u sec elapsed 0.30 sec.
NOTICE:  Analyzing actor
VACUUM
develop=# explain select * from actor where actor_full_name like 'W%' ;
NOTICE:  QUERY PLAN:

Seq Scan on actor  (cost=0.00..12529.77 rows=7130 width=571)

EXPLAIN
develop=# explain select * from actor where actor_full_name like 'WI%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..6.01 rows=1 width=571)

EXPLAIN
develop=# explain select * from actor where actor_full_name like 'WIL%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..6.01 rows=1 width=571)

EXPLAIN
develop=# explain select * from actor where actor_full_name like 'WILL%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..6.01 rows=1 width=571)

EXPLAIN



----------------------------------------------------------------------
Ray Ontko   rayo@ontko.com   Phone 1.765.935.4283   Fax 1.765.962.9788
Ray Ontko & Co.   Software Consulting Services   http://www.ontko.com/



Re: Are statistics gathered on function indexes?

From: Ray Ontko
Tom, et al,

> It appears that "vacuum analyze verbose actor" causes the problem.
> It appears that I have to say "vacuum analyze actor" in order to
> clear out the ill effects of having said "vacuum analyze verbose actor".
> Typing "vacuum verbose analyze actor" doesn't clear things out,
> but it doesn't produce the problem.

Here's two more data points:

1) "analyze verbose actor" causes the problem too.  "analyze actor"
   usually clears the problem out.

2) In cases where "analyze actor" doesn't clear out the problem,
   exiting psql and trying again does clear the problem out.

Ray
----------------------------------------------------------------------
Ray Ontko   rayo@ontko.com   Phone 1.765.935.4283   Fax 1.765.962.9788
Ray Ontko & Co.   Software Consulting Services   http://www.ontko.com/



Re: Are statistics gathered on function indexes?

From: Bruce Momjian
Tom Lane wrote:
> As you can see, the estimate drops off by about a factor of 90 per
> added character.  This is probably too much, but it's not that easy
> to determine what the ratio ought to be.  The critical code involved
> in this is convert_string_to_scalar in backend/utils/adt/selfuncs.c;
> the ratio per character is essentially the same as the character range
> that it induces from the available values.  Feel free to propose a
> better implementation if you can think of one.

FYI, 7.3 will drop off quicker than <=7.2.X:

  * Change FIXED_CHAR_SEL to 0.20 from 0.04 to give better selectivity (Bruce)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026



Re: Are statistics gathered on function indexes?

From: Tom Lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> FYI, 7.3 will drop off quicker than <=7.2.X:

Not on this test...

            regards, tom lane



Re: Are statistics gathered on function indexes?

From: Bruce Momjian
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > FYI, 7.3 will drop off quicker than <=7.2.X:
>
> Not on this test...

Are you sure?  I thought FIXED_CHAR_SEL was used for exactly this
purpose.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026



Re: Are statistics gathered on function indexes?

From: Tom Lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Not on this test...

> Are you sure.

Quite; I get the same behavior from either 7.2 or current.

> I thought FIXED_CHAR_SEL was used for exactly this purpose.

That only applies to characters appearing after the first wildcard.
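For example, given a pattern like 'W%M', the leading W is estimated
from the column histogram via the implied range 'W' <= x < 'X', while
the M after the wildcard just multiplies in the fixed per-character
factor.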

            regards, tom lane



Re: Are statistics gathered on function indexes?

From: Tom Lane
Ray Ontko <rayo@ontko.com> writes:
>> It appears that "vacuum analyze verbose actor" causes the problem.
>> It appears that I have to say "vacuum analyze actor" in order to
>> clear out the ill effects of having said "vacuum analyze verbose actor".

I really, really doubt that "verbose" has anything to do with it.

What do you get from
    select * from pg_stats where tablename = 'actor' and
        attname = 'actor_full_name';

Do the results change significantly between the "good" state and the
"bad" state?  How about the results of
    select relpages, reltuples from pg_class where relname = 'actor';

It would seem that one or another of these statistical items is getting
set weirdly by something you are doing, but I have no idea what exactly
is going wrong...

            regards, tom lane



Re: Are statistics gathered on function indexes?

From: Ray Ontko
> What do you get from
>     select * from pg_stats where tablename = 'actor' and
>         attname = 'actor_full_name';
>
> Do the results change significantly between the "good" state and the
> "bad" state?  How about the results of
>     select relpages, reltuples from pg_class where relname = 'actor';

Hmm.

1) here's the "bad" stats.
2) here's the "good" stats.

Note that the information really is different.

3) here's the results of the relpages,reltuples query.

Same whether good or bad stats.

Ray

**********
1) here's the "bad" stats.
**********

develop=# explain select * from actor where actor_full_name like 'WI%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..6.01 rows=1 width=570)

EXPLAIN
develop=# select * from pg_stats where tablename = 'actor' and
develop-#                 attname = 'actor_full_name';
 tablename         | actor
 attname           | actor_full_name
 null_frac         | 0.000333333
 avg_width         | 22
 n_distinct        | 14657
 most_common_vals  | {"INDIANA DEPARTMENT OF REVENUE","AEGIS WOMENS HEALTHCARE","BLOOMINGTON HOSPITAL","MONROE COUNTY TREASURER","PEOPLES STATE BANK","RICHLAND BEAN BLOSSOM CSC","SMITHVILLE TELEPHONE","STATE OF INDIANA","PETTAY, LEE","WOODINGTON COURTS MANAGEMENT"}
 most_common_freqs | {0.0813333,0.00366667,0.003,0.00266667,0.00266667,0.00266667,0.00233333,0.00233333,0.002,0.002}
 histogram_bounds  | {"(ABEL) CONDER, CRYSTAL","BLOOMINGTON HOUSING AUTHORITY","CORBIN, MARK J","FLEETWOOD, JAMES WILBUR","HAZEL, JEFF W","KIDD, PATTY","MEADOW PARK APARTMENTS","PETERSON, CATHY L","SHADLE, MARY","THRASHER, CHRISTOPHER B","ZYNNCO LLC"}
 correlation       | 0.025242
(1 row)


**********
2) here's the "good" stats.
**********

develop=# explain select * from actor where actor_full_name like 'WI%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..6.01 rows=1 width=571)

EXPLAIN
develop=# analyze actor ;
ANALYZE
develop=# explain select * from actor where actor_full_name like 'WI%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..433.52 rows=108 width=571)

EXPLAIN
develop=# select * from pg_stats where tablename = 'actor' and
develop-#                 attname = 'actor_full_name';
 tablename         | actor
 attname           | actor_full_name
 null_frac         | 0
 avg_width         | 22
 n_distinct        | 14541
 most_common_vals  | {"INDIANA DEPARTMENT OF REVENUE","RICHLAND BEAN BLOSSOM CSC","PETTAY, LEE","STATE OF INDIANA","BAKER DDS, DONALD","BLOOMINGTON HOSPITAL","SMITHVILLE TELEPHONE","AEGIS WOMENS HEALTHCARE","BAKER DDS, LISA","BLOOMINGTON ACCOUNTS SERVICE"}
 most_common_freqs | {0.0856667,0.00333333,0.00233333,0.00233333,0.002,0.002,0.002,0.00166667,0.00166667,0.00166667}
 histogram_bounds  | {"(FITZPATRICK) STOUT, LISA","BLOOMINGTON HOUSING AUTHORITY","CONKLIN, TONIA A","EWING, CRAIG","HARTENFELD, KATHLEEN A","KELLEY, KIMBERLEY","MDF BUILDERS","PENNINGTON, ADA M","SCISCOE, R L ETAL","THOMPSON, JEANA J","ZOOK, ALISON"}
 correlation       | 0.0127368
(1 row)

**********
3) here's the results of the relpages,reltuples query
**********

develop=#       select relpages, reltuples from pg_class where relname = 'actor';
 relpages | reltuples
----------+-----------
     7106 |    436871
(1 row)


------------------------------------------------------------------------
Ray Ontko   rayo@ontko.com   Phone 1.765.935.4283   Fax 1.765.962.9788
Ray Ontko & Co.   Software Consulting Services   http://www.ontko.com/



Re: Are statistics gathered on function indexes?

From: Tom Lane
Ray Ontko <rayo@ontko.com> writes:
> 1) here's the "bad" stats.
> 2) here's the "good" stats.
> Note that the information really is different.

Yeah.  It seems that the critical difference is the null_frac; if I take
your "bad" values and change null_frac to 0 then I get
reasonable-looking answers, but with null_frac = 0.000333333 I indeed
get bad answers.  I suspect null_frac is getting double-counted
somewhere, and depending on just how many nulls ANALYZE happens to find,
you might get pushed over the edge from a small selectivity to zero or
negative selectivity.  Will dig further.

            regards, tom lane



Re: Are statistics gathered on function indexes?

From: Ray Ontko
Tom,


Does this help?

develop=# select count(*) from actor where actor_full_name is not null ;
 count
--------
 433809
(1 row)

develop=# select count(*) from actor ;
 count
--------
 433902
(1 row)

Ray
----------------------------------------------------------------------
Ray Ontko   rayo@ontko.com   Phone 1.765.935.4283   Fax 1.765.962.9788
Ray Ontko & Co.   Software Consulting Services   http://www.ontko.com/



Re: Are statistics gathered on function indexes?

From: Tom Lane
Ray Ontko <rayo@ontko.com> writes:
> Does this help?

Well, it says that ANALYZE is not doing anything wrong ;-).

The true null_frac given those numbers is 1.0 - 433809.0/433902.0
or about 0.000214.  In the default ANALYZE sample size (3000 rows)
the statistical expectation would be 0.643 NULLs, so most of the
time ANALYZE is going to find 1 NULL, sometimes no NULLs, and rarely
more than 1 NULL.  Most of the time you'll get null_frac = 1/3000 =
0.000333333, sometimes 0, etc.

The stats themselves look fine to me; I think there must be a
mathematical error in what selfuncs.c is doing with them.
Still looking to find it...

            regards, tom lane



Re: Are statistics gathered on function indexes?

From: Ray Ontko
Tom,


Well, that explains the non-determinism, I guess.  It still seems
odd that analyze would give us different stats when we aren't doing
any updates to the data.  How does analyze pick the 3000 rows it
samples, and should this vary from run to run if we aren't changing
the table?

Ray
----------------------------------------------------------------------
Ray Ontko   rayo@ontko.com   Phone 1.765.935.4283   Fax 1.765.962.9788
Ray Ontko & Co.   Software Consulting Services   http://www.ontko.com/



Re: Are statistics gathered on function indexes?

From: Tom Lane
> The stats themselves look fine to me; I think there must be a
> mathematical error in what selfuncs.c is doing with them.
> Still looking to find it...

Oh, I see it: range selectivity double-excludes NULLs.  See the comment
for clauselist_selectivity in src/backend/optimizer/path/clausesel.c.
The individual estimates for the two component comparison operators
each exclude nulls, and when we merge them together we get the wrong
answer.
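To spell that out with the numbers from this thread: LIKE 'WI%' is
expanded to roughly actor_full_name >= 'WI' AND actor_full_name < 'WJ'.
Each comparison is estimated as frac * (1 - null_frac), and the AND is
merged as s1 + s2 - 1, which works out to f * (1 - null_frac) - null_frac
rather than the correct f * (1 - null_frac), where f is the fraction of
rows actually in the range.  With f around 108/433902 (about 0.00025)
and null_frac = 0.000333333, the merged estimate goes negative and gets
clamped, hence the stubborn rows=1.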

Good catch!  (Though I'm surprised no one noticed this before; with a
larger null population the error would be much more obvious.)

I'm running out of time today but will look into a fix later.

            regards, tom lane



Re: Are statistics gathered on function indexes?

From: Tom Lane
Ray Ontko <rayo@ontko.com> writes:
> Well, that explains the non-determinism, I guess.  It seems
> odd that if we aren't doing any updates to the data that analyze
> would give us different stats.  How does analyze pick the 3000
> rows it gets back, and should this vary from run to run if we
> aren't changing the table?

At random, and yes.  See the documentation ...
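If the run-to-run jitter matters, you can also make ANALYZE sample
more rows for the column by raising its statistics target (the sample
size scales with the target), for instance:

    alter table actor alter column actor_full_name set statistics 100;
    analyze actor;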

            regards, tom lane



Re: Are statistics gathered on function indexes?

From: Tom Lane
Awhile back I said:
> Oh, I see it: range selectivity double-excludes NULLs.  See the comment
> for clauselist_selectivity in src/backend/optimizer/path/clausesel.c.
> The individual estimates for the two component comparison operators
> each exclude nulls, and when we merge them together we get the wrong
> answer.

Just FYI, I have fixed this for 7.3.

            regards, tom lane