> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > But what about '%A%' vs. '%AC%'. Seems the second is reasonably
> > > different from the first the our optimizer may be fine with that. Is it
> > > only when the strings get longer that we lose specificity?
> >
> > Yeah, I don't think that the estimates are bad for one or two
> > characters. But the estimate gets real small real fast as you
> > increase the number of match characters in the LIKE pattern.
> > We need to slow that down some.
>
> Yea, maybe a log base 2 decrease:
>
> 1 char 1x
> 2 char 2x
> 4 char 3x
> 8 char 4x
> 16 char 5x
I did a little research on this. I think the problem is that ordinary
characters are assumed to randomly appear in a character string, while
in practice, if the string has already been specified like 'DAV', there
are very few additional characters that can follow it and make sense.
Looking at backend/utils/adt/selfuncs.c, I see this:
#define FIXED_CHAR_SEL 0.04 /* about 1/25 */
...
sel *= FIXED_CHAR_SEL;
which means every additional character reduces the selectivity by 96%.
This seems much too restrictive to me. Because of the new optimizer
buckets, we do have good statistics on the leading character, but
additional characters drastically reduce selectivity. I think perhaps a
number like 0.50 or 50% may be correct.
That would be a table like this:
1 char 2x
2 char 4x
4 char 8x
8 char 16x
16 char 32x
which is more restrictive than I initially suggested above but less
restrictive than we have now.
Should we assume additional characters are indeed randomly appearing in
the string?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026