Re: Planner estimates cost of 'like' a lot lower than '='?? - Mailing list pgsql-general

From Mats Lofkvist
Subject Re: Planner estimates cost of 'like' a lot lower than '='??
Date
Msg-id 20010722183643.20949.qmail@kairos.algonet.se
In response to Re: Planner estimates cost of 'like' a lot lower than '='??  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Planner estimates cost of 'like' a lot lower than '='??  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Planner estimates cost of 'like' a lot lower than '='??  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
   Mats Lofkvist <mal@algonet.se> writes:
   > Fiddling around revealed that the problem is that the cost of
   > 'like' is severely underestimated, making the database use
   > the wrong index.

   Not cost --- selectivity.  How many rows actually match the criterion
       WHERE find0.key2 = 'llll'
   ?  How about
       WHERE find0.value like 't10k__'

There are 11004 rows matching key2 = 'llll' and 90 rows matching
value like 't10k__' (all 90 have key2 = 'llll').
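
(For reference, roughly how those figures can be reproduced, assuming
the counts are taken directly against the data table:)

    SELECT count(*) FROM data WHERE key2 = 'llll';          -- 11004
    SELECT count(*) FROM data WHERE value LIKE 't10k__';    -- 90
    SELECT count(*) FROM data WHERE value LIKE 't10k__'
                                AND key2 = 'llll';          -- 90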

   It would appear from your timings that the latter is not very selective
   at all, whereas the former is quite selective.

Did you mean the other way around?

   However, given the
   limitations of the planner's statistical routines, I wouldn't be too
   surprised if it makes the opposite guess in 7.1 and before.  Notice
   the difference between the estimated rows counts and reality in your
   simplified test :-(.  The speed differential in your join almost
   certainly has nothing to do with the execution time of a single '='
   or 'like' operator, and everything to do with the number of rows
   coming out of the first-stage index scan.  So if the planner guesses
   wrong about which index is more selective for the query, it will choose
   a bad plan.

   How large is your dataset?  Would you be willing to build a trial
   installation of current sources, and see if the 7.2-to-be planner
   does any better?  We've done some major overhauling of the statistical
   code since 7.1, and I'm curious to see its results in the field.
   See our CVS server, or the nightly snapshot tarball at
   http://www.ca.postgresql.org/ftpsite/dev/

The 'data' table contains 162135 rows, the 'test2' table contains
118113 rows (the latter is a subset of the data.value column).

(I'm downloading the CVS tree right now. Do I need to do dump/restore
or can I just start it on the current data?)

   Also: the overly large rows estimate for "where value = 't10k9999'"
   is most likely caused by having some one extremely common value in
   the column.  (In 7.1 and before, the most common value is the *only*
   statistic the planner has, and so a common MCV drives it to assume
   that there are only a few distinct values in the column.)  Often the
   most common value is actually a dummy value, like an empty string.
   If you have a lot of dummies, consider whether you can't replace them
   with NULL.  7.1's VACUUM ANALYZE does distinguish NULLs from real
   values, so this hack can help it derive somewhat less bogus stats.

Yes, there are very common values, but none can be considered dummies
(i.e. they can't be replaced by null in a production database).
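
(If they could be, the substitution Tom suggests would look roughly
like this; purely illustrative, assuming the common dummy value
happened to be an empty string:)

    -- Illustration only: turn a dummy value into NULL so that
    -- VACUUM ANALYZE no longer sees it as the most common real value.
    UPDATE data SET value = NULL WHERE value = '';
    VACUUM ANALYZE data;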

data.key0 is an object id, data.key1 and data.key2 are the two parts
of an object member name, and data.longValue is the member value.
data.value is data.longValue truncated, to make it possible to index
a prefix of data.longValue on databases that don't support prefix
indexing explicitly.
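
(A rough sketch of that layout; the column types, the truncation
length and the index definition are guesses, not the real schema:)

    CREATE TABLE data (
        key0      integer,      -- object id
        key1      text,         -- member name, first part
        key2      text,         -- member name, second part
        value     varchar(64),  -- longValue truncated (length is a guess)
        longValue text          -- full member value
    );
    -- An index on the truncated column supports prefix (LIKE 'x%') lookups.
    CREATE INDEX data_value_idx ON data (value);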

When running the tests, data contained ~11k objects, each having about
a dozen members, some with unique values but many with common values.


Maybe I'm mistaken in assuming that the simplified test points at the
problem with the 'real' test, but aren't cost estimates comparable
between two different explains? If they are, I still don't understand
how "where value = 'xxx'" can be estimated to return 600 times more
rows than "where value like 'xxx%'" (this is what happens in my
simplified test).
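
(Spelled out, the comparison is between two explains against the same
table, so the estimated row counts ought to be directly comparable.
The '=' constant is from the quoted text; the exact LIKE pattern used
in the simplified test is only assumed here:)

    EXPLAIN SELECT * FROM data WHERE value = 't10k9999';
    EXPLAIN SELECT * FROM data WHERE value LIKE 't10k%';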


               regards, tom lane

thanks for the reply,
      _
Mats Lofkvist
mal@algonet.se
