Re: NOT LIKE much faster than LIKE? - Mailing list pgsql-performance

From Simon Riggs
Subject Re: NOT LIKE much faster than LIKE?
Date
Msg-id 1137026916.21025.585.camel@localhost.localdomain
Whole thread Raw
In response to Re: NOT LIKE much faster than LIKE?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
On Tue, 2006-01-10 at 17:21 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > I think its OK to use the MCV, but I have a problem with the current
> > heuristics: they only work for randomly generated strings, since the
> > selectivity goes down geometrically with length.
>
> We could certainly use a less aggressive curve for that.  You got a
> specific proposal?

non-left anchored LIKE is most likely going to be used with
unstructured, variable length data - else we might use SUBSTRING
instead. My proposal would be to assume that LIKE is acting on human
language text data.

I considered this a while back, but wrote it off in favour of dynamic
sampling - but it's worth discussing this to see whether we can improve
on things without that.

Here's one of the links I reviewed previously:
http://www.ling.lu.se/persons/Joost/Texts/studling.pdf
Sigurd et al [2004]

This shows word frequency distribution peaks at 3 letter/2 phoneme
words, then tails off exponentially after that.

Clearly when search string > 3 then the selectivity must tail off
exponentially also, since we couldn't find words shorter than the search
string itself. The search string might be a phrase, but it seems
reasonable to assume that phrases also drop off in frequency according
to length. It is difficult to decide what to do at len=2 or len=3, and I
would be open to various thoughts, but would default to keeping
like_selectivity as it is now.

Sigurd et al show that word length tails off at 0.7^Len beyond Len=3, so
selectivity FIXED_CHAR_SEL should not be more than 0.7, but I see no
evidence for it being as low as 0.2 (from the published results).  For
simplicity, where Len > 3, I would make the tail off occur with factor
0.5, rather than 0.2.

We could see a few more changes from those results, but curbing the
aggressive tail off would be a simple and easy act.

Best Regards, Simon Riggs


pgsql-performance by date:

Previous
From: Mark Liberman
Date:
Subject: Stable function being evaluated more than once in a single query
Next
From: K C Lau
Date:
Subject: Re: indexes on primary and foreign keys