On Tue, 2006-01-10 at 17:21 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > I think its OK to use the MCV, but I have a problem with the current
> > heuristics: they only work for randomly generated strings, since the
> > selectivity goes down geometrically with length.
>
> We could certainly use a less aggressive curve for that. You got a
> specific proposal?
Non-left-anchored LIKE is most likely to be used with unstructured,
variable-length data; otherwise we would probably use SUBSTRING
instead. My proposal is to assume that LIKE is acting on
human-language text data.
I considered this a while back but wrote it off in favour of dynamic
sampling; still, it's worth discussing to see whether we can improve
matters without that.
Here's one of the links I reviewed previously:
http://www.ling.lu.se/persons/Joost/Texts/studling.pdf
Sigurd et al [2004]
This shows that the word-frequency distribution peaks at
3-letter/2-phoneme words, then tails off exponentially after that.
Clearly, when the search string is longer than 3 characters, the
selectivity must also tail off exponentially, since no matching word
can be shorter than the search string itself. The search string might
be a phrase, but it seems reasonable to assume that phrase frequency
also drops off with length. It is difficult to decide what to do at
len=2 or len=3, and I would be open to various thoughts, but would
default to keeping like_selectivity as it is now.
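For reference, here is a simplified sketch of the current shape as I
understand it (a paraphrase for discussion, not the actual
like_selectivity() code in selfuncs.c; wildcards and CHAR_RANGE_SEL
are ignored):

/*
 * Simplified paraphrase of the current heuristic: every fixed
 * character in the pattern multiplies the selectivity by
 * FIXED_CHAR_SEL, so the estimate falls geometrically with length.
 */
#define FIXED_CHAR_SEL 0.20

static double
current_like_sel(int fixed_chars)
{
    double  sel = 1.0;
    int     i;

    for (i = 0; i < fixed_chars; i++)
        sel *= FIXED_CHAR_SEL;
    return sel;
}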
Sigurd et al show that word frequency tails off as 0.7^Len beyond
Len=3, so the per-character selectivity FIXED_CHAR_SEL should not be
more than 0.7, but I see no evidence in the published results for it
being as low as 0.2. For simplicity, where Len > 3, I would make the
tail-off occur with factor 0.5 rather than 0.2.
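To make the proposed shape concrete (the names and the len=3 crossover
here are illustrative only, not actual source):

/*
 * Proposed curve: keep the existing factor for the first three fixed
 * characters, then tail off with 0.5 per additional character rather
 * than 0.2.
 */
#define FIXED_CHAR_SEL      0.20    /* as now, for the first 3 chars */
#define LONG_FIXED_CHAR_SEL 0.50    /* proposed, for chars beyond 3 */

static double
proposed_like_sel(int fixed_chars)
{
    double  sel = 1.0;
    int     i;

    for (i = 0; i < fixed_chars; i++)
        sel *= (i < 3) ? FIXED_CHAR_SEL : LONG_FIXED_CHAR_SEL;
    return sel;
}

For a 10-character pattern that gives 0.2^3 * 0.5^7 = ~6.3e-5 instead
of 0.2^10 = ~1.0e-7, so long patterns are still estimated as rare,
just not absurdly so.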
We could derive a few more changes from those results, but curbing the
aggressive tail-off would be a simple and easy improvement.
Best Regards, Simon Riggs