Thread: Problem (bug?) with like

Problem (bug?) with like

From
bombadil@wanadoo.es
Date:
 I'd like to report an error that I get when using LIKE on text
 fields.

 I have a table A and a view v_A over A. The column name is a text
 field (person names). Look at these queries:

 1)

     select * from A where name like '%DAVID%'

 It works pretty well and fast. Result:

     DAVID
     DAVID FOO
     DAVID
     .../...

 2)

     select * from v_A where name like '%DA%'

 It works too, with an (obviously) bigger result than the first query.
 Result:

     DAVID
     DANIEL
     DAVID FOO
     .../...

 3)

     select * from v_A where name like '%DAVID%'

 It freezes psql. Why? It seems like a bug, doesn't it? Thanks for any
 info.

                     David

Re: Problem (bug?) with like

From
DaVinci
Date:
On Monday, December 3, bombadil@wanadoo.es wrote:
>  I'd like to report an error that I get when using LIKE on text
>  fields.

 Sorry. Info about my system:

     PostgreSQL 7.1.3
     Debian Sid GNU/Linux
     i386

 Greets.

                     David

Re: Problem (bug?) with like

From
Tom Lane
Date:
bombadil@wanadoo.es writes:
>      select * from v_A where name like '%DAVID%'

>  It freezes psql.

I don't believe that it's really "frozen".  Taking a long time, maybe.

> Why?

You tell us.  What's the EXPLAIN query plan for these three queries?

            regards, tom lane

Re: Problem (bug?) with like

From
bombadil@wanadoo.es
Date:
On Monday, December 3, Tom Lane wrote:
> bombadil@wanadoo.es writes:
> >      select * from v_A where name like '%DAVID%'
> >
> >  It freezes psql.
>
> I don't believe that it's really "frozen".  Taking a long time, maybe.

 Perhaps. But a veeeeeery long time, in any case ;)

 I have been waiting more than 3 minutes and... et voilà, here it is
 :)

> > Why?
>
> You tell us.  What's the EXPLAIN query plan for these three queries?

 Oops. Sorry for my laziness.

 Here are my queries:

--------------------------------------------------------------------------
 1) # explain SELECT * from cliente where nombre like '%DAVID%';

 Result:

NOTICE:  QUERY PLAN:

Seq Scan on cliente  (cost=0.00..16139.44 rows=1 width=131)

--------------------------------------------------------------------------
 2) # explain SELECT * from v_cliente where nombre like '%DA%';

 Result:


NOTICE:  QUERY PLAN:

Merge Join  (cost=54763.50..62874.36 rows=413980 width=183)
  ->  Sort  (cost=16238.44..16238.44 rows=54 width=131)
        ->  Seq Scan on cliente c  (cost=0.00..16236.88 rows=54 width=131)
  ->  Sort  (cost=38525.06..38525.06 rows=20097 width=74)
        ->  Subquery Scan d  (cost=891.91..37088.66 rows=20097 width=74)
              ->  Hash Join  (cost=891.91..37088.66 rows=20097 width=74)
                    ->  Hash Join  (cost=100.89..26377.49 rows=20097 width=58)
                          ->  Merge Join  (cost=78.96..17190.49 rows=20097 width=42)
                                ->  Index Scan using dir_via_ndx on dirección d  (cost=0.00..8951.65 rows=20097 width=26)
                                ->  Sort  (cost=78.96..78.96 rows=176 width=16)
                                      ->  Seq Scan on vía v  (cost=0.00..72.40 rows=176 width=16)
                          ->  Hash  (cost=21.80..21.80 rows=52 width=16)
                                ->  Seq Scan on provincia p  (cost=0.00..21.80 rows=52 width=16)
                    ->  Hash  (cost=786.20..786.20 rows=1928 width=16)
                          ->  Seq Scan on localidad l  (cost=0.00..786.20 rows=1928 width=16)

------------------------------------------------------------------------------
 3) # explain SELECT * from v_cliente where nombre like '%DAVID%';

 Result:

NOTICE:  QUERY PLAN:

Merge Join  (cost=54763.50..62874.36 rows=413980 width=183)
  ->  Sort  (cost=16238.44..16238.44 rows=54 width=131)
        ->  Seq Scan on cliente c  (cost=0.00..16236.88 rows=54 width=131)
  ->  Sort  (cost=38525.06..38525.06 rows=20097 width=74)
        ->  Subquery Scan d  (cost=891.91..37088.66 rows=20097 width=74)
              ->  Hash Join  (cost=891.91..37088.66 rows=20097 width=74)
                    ->  Hash Join  (cost=100.89..26377.49 rows=20097 width=58)
                          ->  Merge Join  (cost=78.96..17190.49 rows=20097 width=42)
                                ->  Index Scan using dir_via_ndx on dirección d  (cost=0.00..8951.65 rows=20097 width=26)
                                ->  Sort  (cost=78.96..78.96 rows=176 width=16)
                                      ->  Seq Scan on vía v  (cost=0.00..72.40 rows=176 width=16)
                          ->  Hash  (cost=21.80..21.80 rows=52 width=16)
                                ->  Seq Scan on provincia p  (cost=0.00..21.80 rows=52 width=16)
                    ->  Hash  (cost=786.20..786.20 rows=1928 width=16)
                          ->  Seq Scan on localidad l  (cost=0.00..786.20 rows=1928 width=16)

--------------------------------------------------------------------------------

 Greets.

                     David


Re: Problem (bug?) with like

From
Tom Lane
Date:
bombadil@wanadoo.es writes:
>  Here are my queries:

You sure you didn't paste in the same result for #2 and #3?  They're
the same plan with the same rows estimates --- but I'd expect the rows
estimates, at least, to change given the more-selective LIKE pattern.

Also, how many rows are there really that match '%DA%' and '%DAVID%'?

I suspect the planner is being overoptimistic about the selectivity of
'%DAVID%', and is choosing a plan that doesn't work well when there are
lots of DAVIDs :-(

            regards, tom lane

Re: Problem (bug?) with like

From
bombadil@wanadoo.es
Date:
On Monday, December 3, Tom Lane wrote:
> >  Here are my queries:
>
> You sure you didn't paste in the same result for #2 and #3?  They're
> the same plan with the same rows estimates --- but I'd expect the rows
> estimates, at least, to change given the more-selective LIKE pattern.

 I don't know for sure whether I sent the wrong data, but I think not.
 Tomorrow I'll get more info.

> Also, how many rows are there really that match '%DA%' and '%DAVID%'?

 Very few, given that incredible difference in time:

     '%DA%'        -> 2 sec.
     '%DAVID%'     -> 3 min.

> I suspect the planner is being overoptimistic about the selectivity of
> '%DAVID%', and is choosing a plan that doesn't work well when there are
> lots of DAVIDs :-(

 I had thought that it only occurs when the search term is present in
 its entirety in some number of rows, but this is only a crude
 hypothesis :?

 Thanks for all.

                             David


Re: Problem (bug?) with like

From
bombadil@wanadoo.es
Date:
On Monday, December 3, Tom Lane wrote:
> You sure you didn't paste in the same result for #2 and #3?  They're
> the same plan with the same rows estimates --- but I'd expect the rows
> estimates, at least, to change given the more-selective LIKE pattern.

 Ahem... You are right. It seems I have a problem with Vim and
 inserting command results :(

 Here are the correct results:

-----------------------------------------------------------------------
1) # explain SELECT * from v_cliente where nombre like '%DA%';

Result:

NOTICE:  QUERY PLAN:

Merge Join  (cost=54763.50..62874.36 rows=413980 width=183)
  ->  Sort  (cost=16238.44..16238.44 rows=54 width=131)
        ->  Seq Scan on cliente c  (cost=0.00..16236.88 rows=54 width=131)
  ->  Sort  (cost=38525.06..38525.06 rows=20097 width=74)
        ->  Subquery Scan d  (cost=891.91..37088.66 rows=20097 width=74)
              ->  Hash Join  (cost=891.91..37088.66 rows=20097 width=74)
                    ->  Hash Join  (cost=100.89..26377.49 rows=20097 width=58)
                          ->  Merge Join  (cost=78.96..17190.49 rows=20097 width=42)
                                ->  Index Scan using dir_via_ndx on dirección d  (cost=0.00..8951.65 rows=20097 width=26)
                                ->  Sort  (cost=78.96..78.96 rows=176 width=16)
                                      ->  Seq Scan on vía v  (cost=0.00..72.40 rows=176 width=16)
                          ->  Hash  (cost=21.80..21.80 rows=52 width=16)
                                ->  Seq Scan on provincia p  (cost=0.00..21.80 rows=52 width=16)
                    ->  Hash  (cost=786.20..786.20 rows=1928 width=16)
                          ->  Seq Scan on localidad l  (cost=0.00..786.20 rows=1928 width=16)

EXPLAIN
---------------------------------------------------------------------------
2) explain SELECT * from v_cliente where nombre like '%DAVID%';

Result:

NOTICE:  QUERY PLAN:

Nested Loop  (cost=891.91..61414.58 rows=7638 width=183)
  ->  Seq Scan on cliente c  (cost=0.00..16236.88 rows=1 width=131)
  ->  Subquery Scan d  (cost=891.91..37088.66 rows=20097 width=74)
        ->  Hash Join  (cost=891.91..37088.66 rows=20097 width=74)
              ->  Hash Join  (cost=100.89..26377.49 rows=20097 width=58)
                    ->  Merge Join  (cost=78.96..17190.49 rows=20097 width=42)
                          ->  Index Scan using dir_via_ndx on dirección d  (cost=0.00..8951.65 rows=20097 width=26)
                          ->  Sort  (cost=78.96..78.96 rows=176 width=16)
                                ->  Seq Scan on vía v  (cost=0.00..72.40 rows=176 width=16)
                    ->  Hash  (cost=21.80..21.80 rows=52 width=16)
                          ->  Seq Scan on provincia p  (cost=0.00..21.80 rows=52 width=16)
              ->  Hash  (cost=786.20..786.20 rows=1928 width=16)
                    ->  Seq Scan on localidad l  (cost=0.00..786.20 rows=1928 width=16)
EXPLAIN
----------------------------------------------------------------------------

> Also, how many rows are there really that match '%DA%' and '%DAVID%'?

 1)   2672 rows    -> 3.59 sec.
 2)   257 rows     -> 364.69 sec.

 v_cliente and cliente have the same number of rows: 38975

 I hope this is enough. Greets.

                             David


Re: Problem (bug?) with like

From
Tom Lane
Date:
bombadil@wanadoo.es writes:
> 1) # explain SELECT * from v_cliente where nombre like '%DA%';

> Merge Join  (cost=54763.50..62874.36 rows=413980 width=183)
>   ->  Sort  (cost=16238.44..16238.44 rows=54 width=131)
>         ->  Seq Scan on cliente c  (cost=0.00..16236.88 rows=54 width=131)
>   ->  Sort  (cost=38525.06..38525.06 rows=20097 width=74)
>         ->  Subquery Scan d  (cost=891.91..37088.66 rows=20097 width=74)
>               ->  Hash Join  (cost=891.91..37088.66 rows=20097 width=74)
>                   ...

> 2) explain SELECT * from v_cliente where nombre like '%DAVID%';

> Nested Loop  (cost=891.91..61414.58 rows=7638 width=183)
>   ->  Seq Scan on cliente c  (cost=0.00..16236.88 rows=1 width=131)
>   ->  Subquery Scan d  (cost=891.91..37088.66 rows=20097 width=74)
>         ->  Hash Join  (cost=891.91..37088.66 rows=20097 width=74)
>             ... [same subplan as above]

The problem here is that the planner is being way too optimistic about
the selectivity of LIKE '%DAVID%' --- notice the estimate that only
one matching row will be found in cliente, rather than 54 as with '%DA%'.
So it chooses a plan that avoids the sort overhead needed for an
efficient merge join with the other tables.  That would be a win if
there were only one matching row, but as soon as there are lots, it's
a big loss, because the subquery to join the other tables gets redone
for every matching row :-(
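
(One quick way to confirm that the plan choice, not LIKE itself, is the
culprit: disable nested loops for the session and re-run the slow
query.  A sketch, using the names from this thread; enable_nestloop is
the planner's standard toggle:

    SET enable_nestloop TO off;
    EXPLAIN SELECT * from v_cliente where nombre like '%DAVID%';
    SET enable_nestloop TO on;   -- restore the default afterwards

If the planner then falls back to the merge-join plan, the slowdown is
purely the cost model's doing.)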

>> Also, how many rows are there really that match '%DA%' and '%DAVID%'?

>  1)   2672 rows    -> 3.59 sec.
>  2)   257 rows     -> 364.69 sec.

I am thinking that the rules for selectivity of LIKE patterns probably
need to be modified.  Presently the code assumes that a long constant
string has probability of occurrence proportional to the product of the
probabilities of the individual letters.  That might be true in a random
world, but people don't search for random strings.  I think we need to
back off the selectivity estimate by some large factor to account for
the fact that the pattern being searched for is probably not random.
Anyone have ideas how to do that?
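
(For concreteness, here is the current model versus one possible
back-off, written as plain SQL arithmetic; the square-root damping
below is only an illustration, not a worked-out proposal:

    SELECT pow(0.04, 5);           -- current model for '%DAVID%': 0.0000001024
    SELECT pow(0.04, sqrt(5.0));   -- damped exponent: ~0.00075

Any such damping trades underestimates on truly random patterns for
better behavior on the dictionary words people actually search for.)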

            regards, tom lane


Re: Problem (bug?) with like

From
Bruce Momjian
Date:
> I am thinking that the rules for selectivity of LIKE patterns probably
> need to be modified.  Presently the code assumes that a long constant
> string has probability of occurrence proportional to the product of the
> probabilities of the individual letters.  That might be true in a random
> world, but people don't search for random strings.  I think we need to
> back off the selectivity estimate by some large factor to account for
> the fact that the pattern being searched for is probably not random.
> Anyone have ideas how to do that?

But what about '%A%' vs. '%AC%'?  Seems the second is reasonably
different from the first, so our optimizer may be fine with that.  Is it
only when the strings get longer that we lose specificity?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: Problem (bug?) with like

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> But what about '%A%' vs. '%AC%'?  Seems the second is reasonably
> different from the first, so our optimizer may be fine with that.  Is it
> only when the strings get longer that we lose specificity?

Yeah, I don't think that the estimates are bad for one or two
characters.  But the estimate gets real small real fast as you
increase the number of match characters in the LIKE pattern.
We need to slow that down some.

            regards, tom lane

Re: Problem (bug?) with like

From
Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > But what about '%A%' vs. '%AC%'?  Seems the second is reasonably
> > different from the first, so our optimizer may be fine with that.  Is it
> > only when the strings get longer that we lose specificity?
>
> Yeah, I don't think that the estimates are bad for one or two
> characters.  But the estimate gets real small real fast as you
> increase the number of match characters in the LIKE pattern.
> We need to slow that down some.

Yeah, maybe a log-base-2 decrease:

     1 char    1x
     2 char    2x
     4 char    3x
     8 char    4x
    16 char    5x
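
That is, n pattern characters would count as only about (1 + log2 n)
random ones.  As a quick sanity check of the arithmetic (a sketch in
plain SQL; pow() and the two-argument log() are standard functions):

    SELECT pow(0.04, 1 + floor(log(2, 16)));   -- 16 chars act like 5: 0.04^5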

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] Problem (bug?) with like

From
Hannu Krosing
Date:

Tom Lane wrote:

>Bruce Momjian <pgman@candle.pha.pa.us> writes:
>
>>But what about '%A%' vs. '%AC%'?  Seems the second is reasonably
>>different from the first, so our optimizer may be fine with that.  Is it
>>only when the strings get longer that we lose specificity?
>>
>
>Yeah, I don't think that the estimates are bad for one or two
>characters.  But the estimate gets real small real fast as you
>increase the number of match characters in the LIKE pattern.
>We need to slow that down some.
>
Could we just assign weights to the first few characters and then
consider only those first few characters when determining the
probability of finding it?

If someone searches for '%New York City%' we have quite good reasons
to believe that there are some of these in there, so we should use
said weights to factor in the fact that one usually searches for
strings that do exist.

Another option would be to gather statistics not only on individual
letters but on bi- or trigraphs. Then as a next step we could
implement proper trigraph indexes ;)
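
For illustration, such trigraph statistics might look something like
this (a purely hypothetical sketch; no such catalog exists):

    -- Hypothetical per-column trigraph frequencies gathered at ANALYZE
    -- time; the selectivity of '%DAVID%' could then be bounded by the
    -- rarest trigraph it contains ('DAV', 'AVI', 'VID').
    CREATE TABLE trigraph_stats (
        starelid  oid,       -- table the statistics describe
        staattnum int2,      -- column number within that table
        trigraph  char(3),   -- three-character substring
        frequency float8     -- fraction of rows containing it
    );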

----------------
Hannu



New planner for like (was: Problem (bug?) with like)

From
David Walter
Date:

On Tue, 4 Dec 2001, Tom Lane wrote:

> bombadil@wanadoo.es writes:
> > 1) # explain SELECT * from v_cliente where nombre like '%DA%';
>
> > Merge Join  (cost=54763.50..62874.36 rows=413980 width=183)
> >   ->  Sort  (cost=16238.44..16238.44 rows=54 width=131)

> The problem here is that the planner is being way too optimistic about
> the selectivity of LIKE '%DAVID%' --- notice the estimate that only
> one matching row will be found in cliente, rather than 54 as with '%DA%'.
> So it chooses a plan that avoids the sort overhead needed for an
> efficient merge join with the other tables.  That would be a win if
> there were only one matching row, but as soon as there are lots, it's
> a big loss, because the subquery to join the other tables gets redone
> for every matching row :-(
>
> >> Also, how many rows are there really that match '%DA%' and '%DAVID%'?
>
> >  1)   2672 rows    -> 3.59 sec.
> >  2)   257 rows     -> 364.69 sec.
>
> I am thinking that the rules for selectivity of LIKE patterns probably
> need to be modified.  Presently the code assumes that a long constant
> string has probability of occurrence proportional to the product of the
> probabilities of the individual letters.  That might be true in a random
> world, but people don't search for random strings.  I think we need to
> back off the selectivity estimate by some large factor to account for
> the fact that the pattern being searched for is probably not random.
> Anyone have ideas how to do that?
>

Is there any statistic being kept by a partial index?

For instance:

    # occurrences of A
    # occurrences of B

A fairly small table/index could track these, couldn't it?

If it were a btree itself, then statistics could be split appropriately
into sub-branches when the # of occurrences exceeds some threshold.

david



Re: Problem (bug?) with like

From
Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > But what about '%A%' vs. '%AC%'?  Seems the second is reasonably
> > different from the first, so our optimizer may be fine with that.  Is it
> > only when the strings get longer that we lose specificity?
>
> Yeah, I don't think that the estimates are bad for one or two
> characters.  But the estimate gets real small real fast as you
> increase the number of match characters in the LIKE pattern.
> We need to slow that down some.

See my earlier email about the 50% idea for LIKE.  Do ordinary string
comparisons also have this problem?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: [HACKERS] Problem (bug?) with like

From
Bruce Momjian
Date:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > But what about '%A%' vs. '%AC%'?  Seems the second is reasonably
> > > different from the first, so our optimizer may be fine with that.  Is it
> > > only when the strings get longer that we lose specificity?
> >
> > Yeah, I don't think that the estimates are bad for one or two
> > characters.  But the estimate gets real small real fast as you
> > increase the number of match characters in the LIKE pattern.
> > We need to slow that down some.
>
> Yeah, maybe a log-base-2 decrease:
>
>      1 char    1x
>      2 char    2x
>      4 char    3x
>      8 char    4x
>     16 char    5x

I did a little research on this.  I think the problem is that ordinary
characters are assumed to randomly appear in a character string, while
in practice, if the string has already been specified like 'DAV', there
are very few additional characters that can follow it and make sense.

Looking at backend/utils/adt/selfuncs.c, I see this:

    #define FIXED_CHAR_SEL    0.04    /* about 1/25 */
...
    sel *= FIXED_CHAR_SEL;

which means every additional character reduces the selectivity by 96%.
This seems much too restrictive to me.  Because of the new optimizer
buckets, we do have good statistics on the leading character, but each
additional character drastically reduces the estimate.  I think perhaps
a number like 0.50 or 50% may be correct.

That would be a table like this:

     1 char     2x
     2 char     4x
     4 char     8x
     8 char    16x
    16 char    32x

which is more restrictive than I initially suggested above but less
restrictive than we have now.

Should we assume additional characters are indeed randomly appearing in
the string?
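
For scale, here are this thread's own numbers under the current 0.04
rule (cliente has 38975 rows; a sketch in plain SQL):

    SELECT 38975 * pow(0.04, 5);   -- ~0.004, clamped to the rows=1 estimate
    SELECT 38975 * pow(0.04, 2);   -- ~62, near the rows=54 '%DA%' estimate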

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: Problem (bug?) with like

From
Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > But what about '%A%' vs. '%AC%'?  Seems the second is reasonably
> > different from the first, so our optimizer may be fine with that.  Is it
> > only when the strings get longer that we lose specificity?
>
> Yeah, I don't think that the estimates are bad for one or two
> characters.  But the estimate gets real small real fast as you
> increase the number of match characters in the LIKE pattern.
> We need to slow that down some.

OK, I think I have the proper value for FIXED_CHAR_SEL.  It is currently
0.04 (about 1/25), which assumes the letters appear at random, though
this is not usually the case.

If we assume our optimizer buckets have given us a reasonable value for
the first character, suppose it is an 'F', then there are only a few
valid characters that can follow it, at least in English: vowels and a
few consonants.  And given that character, only a few characters can
validly follow it in turn.  To my thinking, two such constrained
characters represent the same distribution as one random character,
which makes 0.20 the proper value for FIXED_CHAR_SEL, because
0.20 * 0.20 = 0.04.

Added to TODO:

  * Change FIXED_CHAR_SEL to 0.20 from 0.04 to give better selectivity (Bruce)

If people think there is a better value for this, please chime in.
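
For this thread's table, the change would move the '%DAVID%' estimate
from 1 row to about 12 (the actual count was 257); a quick check in
plain SQL:

    SELECT 38975 * pow(0.20, 5);   -- ~12.5 estimated rows with 0.20
    SELECT 38975 * pow(0.04, 5);   -- ~0.004 (rounds to 1) with 0.04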

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: Problem (bug?) with like

From
Bruce Momjian
Date:
> The problem here is that the planner is being way too optimistic about
> the selectivity of LIKE '%DAVID%' --- notice the estimate that only
> one matching row will be found in cliente, rather than 54 as with '%DA%'.
> So it chooses a plan that avoids the sort overhead needed for an
> efficient merge join with the other tables.  That would be a win if
> there were only one matching row, but as soon as there are lots, it's
> a big loss, because the subquery to join the other tables gets redone
> for every matching row :-(
>
> >> Also, how many rows are there really that match '%DA%' and '%DAVID%'?
>
> >  1)   2672 rows    -> 3.59 sec.
> >  2)   257 rows     -> 364.69 sec.
>
> I am thinking that the rules for selectivity of LIKE patterns probably
> need to be modified.  Presently the code assumes that a long constant
> string has probability of occurrence proportional to the product of the
> probabilities of the individual letters.  That might be true in a random
> world, but people don't search for random strings.  I think we need to
> back off the selectivity estimate by some large factor to account for
> the fact that the pattern being searched for is probably not random.
> Anyone have ideas how to do that?

Let's use the above example with the new FIXED_CHAR_SEL value.

With the new 0.20 value for FIXED_CHAR_SEL, we see for DA and DAVID
above:

    DA     1) 0.20 ^ 2 = .04
    DAVID  2) 0.20 ^ 5 = .00032

If we divide these two, we get:

    .04 / .00032 = 125

while looking at the total counts reported above, we get:

    2672 / 257 = ~10.4

The 0.04 value gives:

    0.04 ^ 2 = .0016
    0.04 ^ 5 = .0000001024
    .0016 / .0000001024 = 15625

Clearly the 0.20 value is roughly 10x too large, while the 0.04 value
is over 1000x too large.  Because this was a somewhat extreme example,
and because some fields hold more random text than DAVID, I think 0.20
is the proper value.
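
The same comparison in a single query (a sketch; estimated-count ratios
versus the observed 2672/257):

    SELECT pow(0.20, 2) / pow(0.20, 5) AS ratio_at_0_20,   -- 125
           pow(0.04, 2) / pow(0.04, 5) AS ratio_at_0_04,   -- 15625
           2672.0 / 257 AS observed_ratio;                 -- ~10.4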

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026