Re: [GENERAL] Problem (bug?) with like - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: [GENERAL] Problem (bug?) with like
Date
Msg-id 200112281827.fBSIRRk28377@candle.pha.pa.us
Whole thread Raw
In response to Re: [GENERAL] Problem (bug?) with like  (Bruce Momjian <pgman@candle.pha.pa.us>)
List pgsql-hackers
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > But what about '%A%' vs. '%AC%'.  Seems the second is reasonably
> > > different from the first the our optimizer may be fine with that.  Is it
> > > only when the strings get longer that we lose specificity?
> >
> > Yeah, I don't think that the estimates are bad for one or two
> > characters.  But the estimate gets real small real fast as you
> > increase the number of match characters in the LIKE pattern.
> > We need to slow that down some.
>
> Yea, maybe a log base 2 decrease:
>
>     1 char    1x
>     2 char    2x
>     4 char    3x
>     8 char    4x
>     16 char 5x

I did a little research on this.  I think the problem is that ordinary
characters are assumed to randomly appear in a character string, while
in practice, if the string has already been specified like 'DAV', there
are very few additional characters that can follow it and make sense.

Looking at backend/utils/adt/selfuncs.c, I see this:

    #define FIXED_CHAR_SEL    0.04    /* about 1/25 */
...
    sel *= FIXED_CHAR_SEL;

which means every additional character reduces the selectivity by 96%.
This seems much too restrictive to me.  Because of the new optimizer
buckets, we do have good statistics on the leading character, but
additional characters drastically reduce selectivity.  I think perhaps a
number like 0.50 or 50% may be correct.

That would be a table like this:

     1 char    2x
     2 char    4x
     4 char    8x
     8 char    16x
     16 char 32x

which is more restrictive than I initially suggested above but less
restrictive than we have now.

Should we assume additional characters are indeed randomly appearing in
the string?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: [GENERAL] Problem (bug?) with like
Next
From: Peter Eisentraut
Date:
Subject: Re: psql misbehaves because of a simple typo