Re: NOT LIKE much faster than LIKE? - Mailing list pgsql-performance

From Simon Riggs
Subject Re: NOT LIKE much faster than LIKE?
Date
Msg-id 1136930796.21025.486.camel@localhost.localdomain
Whole thread Raw
In response to Re: NOT LIKE much faster than LIKE?  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: NOT LIKE much faster than LIKE?
List pgsql-performance
On Tue, 2006-01-10 at 12:49 -0500, Tom Lane wrote:
> Matteo Beccati <php@beccati.com> writes:
> >> I did just think of something we could improve though.  The pattern
> >> selectivity code doesn't make any use of the statistics about "most
> >> common values".  For a constant pattern, we could actually apply the
> >> pattern test with each common value and derive answers that are exact
> >> for the portion of the population represented by the most-common-values
> >> list.
>
> > This reminds me what I did in a patch which is currently on hold for the
> > next release:
>
> I've applied a patch to make patternsel() compute the exact result for
> the MCV list, and use its former heuristics only for the portion of the
> column population not included in the MCV list.

I think its OK to use the MCV, but I have a problem with the current
heuristics: they only work for randomly generated strings, since the
selectivity goes down geometrically with length. That doesn't match most
languages where one and two syllable words are extremely common and
longer ones much less so. A syllable can be 1-2 chars long, so any
search string of length 1-4 is probably equally likely, rather than
reducing in selectivity based upon length. So I think part of the
problem is the geometrically reducing selectivity itself.

> After finishing that work it occurred to me that we could go a step
> further: if the MCV list accounts for a substantial fraction of the
> population, we could assume that the MCV list is representative of the
> whole population, and extrapolate the pattern's selectivity over the MCV
> list to the whole population instead of using the existing heuristics at
> all.  In a situation like Andreas' example this would win big, although
> you can certainly imagine cases where it would lose too.

I don't think that can be inferred with any confidence, unless a large
proportion of the MCV list were itself selected. Otherwise it might
match only a single MCV that just happens to have a high proportion,
then we assume all others have the same proportion. The calculation is
related to Ndistinct, in some ways.

> Any thoughts about this?  What would be a reasonable threshold for
> "substantial fraction"?  It would probably make sense to have different
> thresholds depending on whether the pattern is left-anchored or not,
> since the range heuristic only works for left-anchored patterns.

I don't think you can do this for a low enough substantial fraction to
make this interesting.

I would favour the idea of dynamic sampling using a block sampling
approach; that was a natural extension of improving ANALYZE also. We can
use that approach for things such as LIKE, but also use it for
multi-column single-table and join selectivity.

Best Regards, Simon Riggs



pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: NOT LIKE much faster than LIKE?
Next
From: Tom Lane
Date:
Subject: Re: NOT LIKE much faster than LIKE?