Re: NOT LIKE much faster than LIKE? - Mailing list pgsql-performance
From | Simon Riggs |
---|---|
Subject | Re: NOT LIKE much faster than LIKE? |
Date | |
Msg-id | 1136930796.21025.486.camel@localhost.localdomain Whole thread Raw |
In response to | Re: NOT LIKE much faster than LIKE? (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: NOT LIKE much faster than LIKE?
|
List | pgsql-performance |
On Tue, 2006-01-10 at 12:49 -0500, Tom Lane wrote: > Matteo Beccati <php@beccati.com> writes: > >> I did just think of something we could improve though. The pattern > >> selectivity code doesn't make any use of the statistics about "most > >> common values". For a constant pattern, we could actually apply the > >> pattern test with each common value and derive answers that are exact > >> for the portion of the population represented by the most-common-values > >> list. > > > This reminds me what I did in a patch which is currently on hold for the > > next release: > > I've applied a patch to make patternsel() compute the exact result for > the MCV list, and use its former heuristics only for the portion of the > column population not included in the MCV list. I think its OK to use the MCV, but I have a problem with the current heuristics: they only work for randomly generated strings, since the selectivity goes down geometrically with length. That doesn't match most languages where one and two syllable words are extremely common and longer ones much less so. A syllable can be 1-2 chars long, so any search string of length 1-4 is probably equally likely, rather than reducing in selectivity based upon length. So I think part of the problem is the geometrically reducing selectivity itself. > After finishing that work it occurred to me that we could go a step > further: if the MCV list accounts for a substantial fraction of the > population, we could assume that the MCV list is representative of the > whole population, and extrapolate the pattern's selectivity over the MCV > list to the whole population instead of using the existing heuristics at > all. In a situation like Andreas' example this would win big, although > you can certainly imagine cases where it would lose too. I don't think that can be inferred with any confidence, unless a large proportion of the MCV list were itself selected. Otherwise it might match only a single MCV that just happens to have a high proportion, then we assume all others have the same proportion. The calculation is related to Ndistinct, in some ways. > Any thoughts about this? What would be a reasonable threshold for > "substantial fraction"? It would probably make sense to have different > thresholds depending on whether the pattern is left-anchored or not, > since the range heuristic only works for left-anchored patterns. I don't think you can do this for a low enough substantial fraction to make this interesting. I would favour the idea of dynamic sampling using a block sampling approach; that was a natural extension of improving ANALYZE also. We can use that approach for things such as LIKE, but also use it for multi-column single-table and join selectivity. Best Regards, Simon Riggs
pgsql-performance by date: