Re: Planner estimates cost of 'like' a lot lower than '='?? - Mailing list pgsql-general

From: Tom Lane
Subject: Re: Planner estimates cost of 'like' a lot lower than '='??
Msg-id: 1826.995818421@sss.pgh.pa.us
In response to: Planner estimates cost of 'like' a lot lower than '='??  (Mats Lofkvist <mal@algonet.se>)
Responses: Re: Planner estimates cost of 'like' a lot lower than '='??  (Mats Lofkvist <mal@algonet.se>)
List: pgsql-general
Mats Lofkvist <mal@algonet.se> writes:
> Fiddling around revealed that the problem is that the cost of
> 'like' is severely underestimated, making the database use
> the wrong index.

Not cost --- selectivity.  How many rows actually match the criterion
    WHERE find0.key2 = 'llll'
?  How about
    WHERE find0.value like 't10k__'
?
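You can measure that directly; here's a sketch, assuming find0 is
the table itself and not just an alias in your join:

    SELECT count(*) FROM find0 WHERE key2 = 'llll';
    SELECT count(*) FROM find0 WHERE value LIKE 't10k__';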

It would appear from your timings that the latter is not very selective
at all, whereas the former is quite selective.  However, given the
limitations of the planner's statistical routines, I wouldn't be too
surprised if it makes the opposite guess in 7.1 and before.  Notice
the difference between the estimated row counts and reality in your
simplified test :-(.  The speed differential in your join almost
certainly has nothing to do with the execution time of a single '='
or 'like' operator, and everything to do with the number of rows
coming out of the first-stage index scan.  So if the planner guesses
wrong about which index is more selective for the query, it will choose
a bad plan.
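
A quick way to see what the planner is guessing (a sketch only; your
index names and numbers will differ):

    -- EXPLAIN prints the chosen plan; the rows= figure on each scan
    -- is the planner's selectivity estimate, not the real count.
    EXPLAIN SELECT * FROM find0 WHERE key2 = 'llll';
    EXPLAIN SELECT * FROM find0 WHERE value LIKE 't10k__';

Comparing those rows= estimates against the true counts from the
count(*) queries above will show you where the guess goes wrong.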

How large is your dataset?  Would you be willing to build a trial
installation of current sources, and see if the 7.2-to-be planner
does any better?  We've done some major overhauling of the statistical
code since 7.1, and I'm curious to see its results in the field.
See our CVS server, or the nightly snapshot tarball at
http://www.ca.postgresql.org/ftpsite/dev/

Also: the overly large rows estimate for "where value = 't10k9999'"
is most likely caused by having a single extremely common value in
the column.  (In 7.1 and before, the most common value is the *only*
statistic the planner has, so a very frequent MCV drives it to assume
that there are only a few distinct values in the column.)  Often the
most common value is actually a dummy value, like an empty string.
If you have a lot of dummies, consider whether you can't replace them
with NULL.  7.1's VACUUM ANALYZE does distinguish NULLs from real
values, so this hack can help it derive somewhat less bogus stats.
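For instance (a sketch; the table name and the empty-string dummy are
just taken from your example):

    -- Convert dummy values to NULL so they stop dominating the stats
    -- (the empty string here is only an illustration of a dummy value).
    UPDATE find0 SET value = NULL WHERE value = '';
    -- Re-gather statistics afterwards:
    VACUUM ANALYZE find0;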

            regards, tom lane
