Thread: NOT LIKE much faster than LIKE?
Hello,

I've a performance problem with the plan the planner chooses on a website. See the difference between this: http://klive.cpushare.com/?scheduler=cooperative and this: http://klive.cpushare.com/?scheduler=preemptive (note: there's much less data to show with preemptive, so it's not because of the amount of data to output). The second takes ages to complete and overloads the server at 100% CPU for several seconds.

"cooperative" runs "WHERE kernel_version NOT LIKE '%% PREEMPT %%'", while "preemptive" runs "WHERE kernel_version LIKE '%% PREEMPT %%'". The only difference is a NOT before "LIKE". No other differences at all. The planner apparently makes a big mistake by using the nested loop in the "LIKE" query; it should use the hash join like in the "NOT LIKE" query instead. I guess I can force it somehow (I found some docs about it here: http://www.postgresql.org/docs/8.1/static/runtime-config-query.html ), but it looks like something could be improved in the default mode too, so I'm reporting the problem since it looks like a performance bug to me.

It just makes no sense to me that the planner takes a different decision based on a "not". It can't know if it's more likely or less likely; it's a boolean return, and it's *exactly* the same cost to run it. Making assumptions without user-provided hints looks like a mistake to me. I never told the db that "not like" is more or less likely to return data than "like".

I tried ANALYZE, including VACUUM FULL ANALYZE, and it doesn't make a difference. Perhaps it's analyze that suggests using a different algorithm with "not like" because there's much more data to analyze with "not like" than with "like", but that algorithm works much better even when there's less data to analyze. Indexes don't make any visible difference.

postgres is version 8.1.2, self compiled from the CVS 8.1 branch of yesterday:

andrea@opteron:~> psql -V
psql (PostgreSQL) 8.1.2
contains support for command-line editing
andrea@opteron:~>

The problem is reproducible on the shell; I only need to remove "explain". Of course explain is wrong about the cost: according to explain the first query is cheaper, when in fact it's an order of magnitude more costly.
klive=> explain SELECT class, vendor, device, revision, COUNT(*) as nr FROM pci NATURAL INNER JOIN klive WHERE kernel_version LIKE '%% PREEMPT %%' GROUP BY class, vendor, device, revision;
                                       QUERY PLAN
--------------------------------------------------------------------------------------
 HashAggregate  (cost=1687.82..1687.83 rows=1 width=16)
   ->  Nested Loop  (cost=235.86..1687.81 rows=1 width=16)
         ->  Seq Scan on klive  (cost=0.00..1405.30 rows=1 width=8)
               Filter: ((kernel_version)::text ~~ '%% PREEMPT %%'::text)
         ->  Bitmap Heap Scan on pci  (cost=235.86..282.32 rows=15 width=24)
               Recheck Cond: (pci.klive = "outer".klive)
               ->  Bitmap Index Scan on pci_pkey  (cost=0.00..235.86 rows=15 width=0)
                     Index Cond: (pci.klive = "outer".klive)
(8 rows)

klive=> explain SELECT class, vendor, device, revision, COUNT(*) as nr FROM pci NATURAL INNER JOIN klive WHERE kernel_version NOT LIKE '%% PREEMPT %%' GROUP BY class, vendor, device, revision;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 HashAggregate  (cost=3577.40..3612.00 rows=2768 width=16)
   ->  Hash Join  (cost=1569.96..3231.50 rows=27672 width=16)
         Hash Cond: ("outer".klive = "inner".klive)
         ->  Seq Scan on pci  (cost=0.00..480.73 rows=27673 width=24)
         ->  Hash  (cost=1405.30..1405.30 rows=22263 width=8)
               ->  Seq Scan on klive  (cost=0.00..1405.30 rows=22263 width=8)
                     Filter: ((kernel_version)::text !~~ '%% PREEMPT %%'::text)
(7 rows)

klive=>

Hints welcome, thanks!

PS. All the source code of the website where I'm reproducing the problem is available at the above url under the GPL.
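For reference, the planner-method settings in the runtime-config-query docs linked above can be toggled per session, so a crude workaround sketch (it only sidesteps the misestimate, it does not fix it) might look like this:

    -- Workaround sketch: disable nested loops for this session only, so the
    -- planner considers the hash join for the LIKE query as well
    SET enable_nestloop = off;
    SELECT class, vendor, device, revision, COUNT(*) AS nr
      FROM pci NATURAL INNER JOIN klive
     WHERE kernel_version LIKE '%% PREEMPT %%'
     GROUP BY class, vendor, device, revision;
    RESET enable_nestloop;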
Andrea Arcangeli <andrea@cpushare.com> writes:
> It just makes no sense to me that the planner takes a different
> decision based on a "not".

Why in the world would you think that? In general a NOT will change the selectivity of the WHERE condition tremendously. If the planner weren't sensitive to that, *that* would be a bug. The only case where it's irrelevant is if the selectivity of the base condition is exactly 50%, which is not a very reasonable default guess for LIKE.

It sounds to me like the problem is misestimation of the selectivity of the LIKE pattern --- the planner is going to think that LIKE '%% PREEMPT %%' is fairly selective because of the rather long match text, when in reality it's probably not so selective on your data. But we don't keep any statistics that would allow the actual number of matching rows to be estimated well. You might want to think about changing your data representation so that the pattern match can be replaced by a boolean column, or some such, so that the existing statistics code can make a more reasonable estimate of how many rows are selected.

        regards, tom lane
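The two estimates are tied together as complements: whatever fraction the planner guesses for the LIKE, the NOT LIKE gets one minus that, which is where the radically different row counts in the two plans come from. A toy illustration (the 0.00005 is a made-up selectivity, not the planner's actual number; 22263 is the klive row estimate from the plans above):

    -- Toy numbers: complementary selectivities over the same ~22k-row table
    SELECT round(22263 * 0.00005)       AS est_rows_like,     -- ~1 row: nested loop looks cheap
           round(22263 * (1 - 0.00005)) AS est_rows_not_like; -- ~22262 rows: hash join looks right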
On Mon, Jan 09, 2006 at 09:04:48PM -0500, Tom Lane wrote:
> Andrea Arcangeli <andrea@cpushare.com> writes:
> > It just makes no sense to me that the planner takes a different
> > decision based on a "not".
>
> Why in the world would you think that? In general a NOT will change the
> selectivity of the WHERE condition tremendously. If the planner weren't
> sensitive to that, *that* would be a bug. The only case where it's
> irrelevant is if the selectivity of the base condition is exactly 50%,
> which is not a very reasonable default guess for LIKE.

How do you know that "LIKE" will have a selectivity above 50% in the first place? I think 50% should be the default unless the selectivity is measured at runtime against the data being queried.

If you don't know the data, I think it's a bug that LIKE is assumed to have a selectivity above 50%. You can't know that; only the author of the code can know that, and that's why I talked about hints. It'd be fine to give hints like:

	UNLIKELY string LIKE '%% PREEMPT %%'

or:

	LIKELY string NOT LIKE '%% PREEMPT %%'

Then you could assume that very little data will be returned or a lot of data will be returned. If you don't get hints, NOT LIKE and LIKE should be assumed to have the same selectivity.

> It sounds to me like the problem is misestimation of the selectivity
> of the LIKE pattern --- the planner is going to think that
> LIKE '%% PREEMPT %%' is fairly selective because of the rather long
> match text, when in reality it's probably not so selective on your
> data. But we don't keep any statistics that would allow the actual

True, there's a lot of data that matches %% PREEMPT %% (even if less than the NOT case).

> number of matching rows to be estimated well. You might want to think
> about changing your data representation so that the pattern match can be
> replaced by a boolean column, or some such, so that the existing
> statistics code can make a more reasonable estimate of how many rows are
> selected.

I see. I can certainly fix it by stopping using LIKE. But IMHO this remains a bug: until the number of matching rows can be estimated well, you should not make assumptions about LIKE/NOT LIKE. I think you can change the code in a way that I won't have to touch anything, and this will lead to fewer surprises in the future IMHO.

Thanks!
> 	UNLIKELY string LIKE '%% PREEMPT %%'
>
> or:
>
> 	LIKELY string NOT LIKE '%% PREEMPT %%'

You should be using contrib/tsearch2 for an un-anchored text search, perhaps?
On Tue, Jan 10, 2006 at 10:29:05AM +0800, Christopher Kings-Lynne wrote:
> > 	UNLIKELY string LIKE '%% PREEMPT %%'
> >
> > or:
> >
> > 	LIKELY string NOT LIKE '%% PREEMPT %%'
>
> You should be using contrib/tsearch2 for an un-anchored text search, perhaps?

If I wanted to get the fastest speed possible, then I think parsing it with python and storing true/false in a boolean as suggested before would be better and simpler as well for this specific case.

However, I don't need big performance, I just need decent performance, and it annoys me that there are heuristics that make the LIKE query assume little data will be selected. There's no way to know that until proper stats are recorded on the results of the query. The default should be good enough to use IMHO, and there's no way to know whether NOT LIKE or LIKE will return more data; 50% should be assumed for both if no runtime information is available IMHO.

IIRC gcc, in code like "if (something) {a} else {b}", assumes that a is more likely to be executed than b, but that's because it's forced to choose something. Often one is forced to choose what is more likely between two events, but I don't think the above falls into this case. I guess the heuristic really wanted to speed up the runtime of LIKE, when it actually made it a _lot_ worse. No heuristic is better than a heuristic that falls apart in corner cases like the above "LIKE".
Andrea Arcangeli <andrea@cpushare.com> writes:
> If you don't know the data, I think it's a bug that LIKE is assumed to
> have a selectivity above 50%.

Extrapolating from the observation that the heuristics don't work well on your data to the conclusion that they don't work for anybody is not good logic. Replacing that code with a flat 50% is not going to happen (or if it does, I'll be sure to send the mob of unhappy users waving torches and pitchforks to your door not mine ;-)).

I did just think of something we could improve though. The pattern selectivity code doesn't make any use of the statistics about "most common values". For a constant pattern, we could actually apply the pattern test with each common value and derive answers that are exact for the portion of the population represented by the most-common-values list. If the MCV list covers a large fraction of the population then this would be a big leg up in accuracy. Dunno if that applies to your particular case or not, but it seems worth doing ...

        regards, tom lane
On Tue, 10 Jan 2006, Andrea Arcangeli wrote:
> I see. I can certainly fix it by stopping using LIKE. But IMHO this
> remains a bug: until the number of matching rows can be estimated
> well, you should not make assumptions about LIKE/NOT LIKE. I think you
> can change the code in a way that I won't have to touch anything, and
> this will lead to fewer surprises in the future IMHO.

I doubt it, since I would expect that this would be as large a pessimization for a larger fraction of people than it is an optimization for you.
On Mon, Jan 09, 2006 at 09:54:44PM -0500, Tom Lane wrote:
> Extrapolating from the observation that the heuristics don't work well
> on your data to the conclusion that they don't work for anybody is not
> good logic. Replacing that code with a flat 50% is not going to happen
> (or if it does, I'll be sure to send the mob of unhappy users waving
> torches and pitchforks to your door not mine ;-)).

I'm not convinced, but of course I cannot exclude that some people may be depending on this very heuristic. To me it looks like bug-compatibility; I have a hard time being convinced that such a heuristic isn't going to bite other people like it did with me.

> I did just think of something we could improve though. The pattern
> selectivity code doesn't make any use of the statistics about "most
> common values". For a constant pattern, we could actually apply the
> pattern test with each common value and derive answers that are exact
> for the portion of the population represented by the most-common-values
> list. If the MCV list covers a large fraction of the population then
> this would be a big leg up in accuracy. Dunno if that applies to your
> particular case or not, but it seems worth doing ...

Fixing this with proper stats would be great indeed. What would be the most common values for kernel_version? You can see samples of the kernel_version here: http://klive.cpushare.com/2.6.15/ . That's the string that is being searched against both PREEMPT and SMP.

BTW, I also run a LIKE '%% SMP %%' and a NOT LIKE '%% SMP %%', but that runs fine, probably because, as you said in the first email, PREEMPT is long but SMP is short.

Thanks!
Hi,

> I did just think of something we could improve though. The pattern
> selectivity code doesn't make any use of the statistics about "most
> common values". For a constant pattern, we could actually apply the
> pattern test with each common value and derive answers that are exact
> for the portion of the population represented by the most-common-values
> list. If the MCV list covers a large fraction of the population then
> this would be a big leg up in accuracy. Dunno if that applies to your
> particular case or not, but it seems worth doing ...

This reminds me of what I did in a patch which is currently on hold for the next release:

http://momjian.postgresql.org/cgi-bin/pgpatches_hold
http://candle.pha.pa.us/mhonarc/patches_hold/msg00026.html

The patch was addressing a similar issue when using the ltree <@ and @> operators on an unbalanced tree.

Best regards
--
Matteo Beccati
http://phpadsnew.com
http://phppgads.com
On Tue, Jan 10, 2006 at 10:11:18AM -0500, Greg Stark wrote:
> Andrea Arcangeli <andrea@cpushare.com> writes:
>
> > Fixing this with proper stats would be great indeed. What would be the
> > most common values for kernel_version? You can see samples of the
> > kernel_version here: http://klive.cpushare.com/2.6.15/ . That's the
> > string that is being searched against both PREEMPT and SMP.
>
> Try something like this where attname is the column name and tablename is,
> well, the tablename:
>
> db=> select most_common_vals from pg_stats where tablename = 'region' and attname = 'province';
>  most_common_vals
> ------------------
>  {ON,NB,QC,BC}

Thanks for the info!

klive=> select most_common_vals from pg_stats where tablename = 'klive' and attname = 'kernel_version';
                                                                       most_common_vals
------------------------------------------------------------------------------------------------------------------------------------------------------------
 {"#1 Tue Sep 13 14:56:15 UTC 2005","#1 Fri Aug 19 11:58:59 UTC 2005","#7 SMP Fri Oct 7 15:56:41 CEST 2005","#1 SMP Fri Aug 19 11:58:59 UTC 2005","#2 Thu Sep 22 15:58:44 CEST 2005","#1 Fri Sep 23 15:32:21 GMT 2005","#1 Fri Oct 21 03:46:55 EDT 2005","#1 Sun Sep 4 13:45:32 CEST 2005","#5 PREEMPT Mon Nov 21 17:53:59 EET 2005","#1 Wed Sep 28 19:15:10 EDT 2005"}
(1 row)

klive=> select most_common_freqs from pg_stats where tablename = 'klive' and attname = 'kernel_version';
                                      most_common_freqs
---------------------------------------------------------------------------------------------
 {0.0133333,0.0116667,0.011,0.009,0.00733333,0.00666667,0.00633333,0.006,0.006,0.00566667}
(1 row)

klive=>

There's only one PREEMPT near the end, not sure if it would work?
Andrea Arcangeli <andrea@cpushare.com> writes:

> Fixing this with proper stats would be great indeed. What would be the
> most common values for kernel_version? You can see samples of the
> kernel_version here: http://klive.cpushare.com/2.6.15/ . That's the
> string that is being searched against both PREEMPT and SMP.

Try something like this where attname is the column name and tablename is, well, the tablename:

db=> select most_common_vals from pg_stats where tablename = 'region' and attname = 'province';
 most_common_vals
------------------
 {ON,NB,QC,BC}

Note that there's a second column, most_common_freqs, and to do this would really require doing a weighted average based on the frequencies.

--
greg
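Greg's "weighted average" remark can be turned into a by-hand sanity check of Tom's MCV idea. A sketch for a modern release (unnest() and FILTER do not exist in 8.1, so this is illustration only; the table and column names are the ones from this thread):

    -- What fraction of the population covered by the MCV list matches the pattern?
    SELECT sum(freq) FILTER (WHERE val LIKE '% PREEMPT %') AS mcv_freq_matching,
           sum(freq)                                       AS mcv_freq_total
      FROM (SELECT unnest(most_common_vals::text::text[]) AS val,
                   unnest(most_common_freqs)              AS freq
              FROM pg_stats
             WHERE tablename = 'klive' AND attname = 'kernel_version') mcv;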
Andrea Arcangeli <andrea@cpushare.com> writes:
> There's only one PREEMPT near the end, not sure if it would work?

Not with that data, but maybe if you increased the statistics target for the column to 100 or so, you'd catch enough values to get reasonable results.

        regards, tom lane
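The per-column statistics target Tom mentions can be raised with ALTER TABLE; a sketch using the table and column names from this thread (100 is just the value suggested above; the cap in the 8.1 era was 1000):

    -- Collect a larger MCV list / histogram for kernel_version, then regather stats
    ALTER TABLE klive ALTER COLUMN kernel_version SET STATISTICS 100;
    ANALYZE klive;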
Matteo Beccati <php@beccati.com> writes:
>> I did just think of something we could improve though. The pattern
>> selectivity code doesn't make any use of the statistics about "most
>> common values". For a constant pattern, we could actually apply the
>> pattern test with each common value and derive answers that are exact
>> for the portion of the population represented by the most-common-values
>> list.

> This reminds me of what I did in a patch which is currently on hold for the
> next release:

I've applied a patch to make patternsel() compute the exact result for the MCV list, and use its former heuristics only for the portion of the column population not included in the MCV list.

After finishing that work it occurred to me that we could go a step further: if the MCV list accounts for a substantial fraction of the population, we could assume that the MCV list is representative of the whole population, and extrapolate the pattern's selectivity over the MCV list to the whole population instead of using the existing heuristics at all. In a situation like Andrea's example this would win big, although you can certainly imagine cases where it would lose too.

Any thoughts about this? What would be a reasonable threshold for "substantial fraction"? It would probably make sense to have different thresholds depending on whether the pattern is left-anchored or not, since the range heuristic only works for left-anchored patterns.

        regards, tom lane
On Tue, 2006-01-10 at 12:49 -0500, Tom Lane wrote:
> Matteo Beccati <php@beccati.com> writes:
> >> I did just think of something we could improve though. The pattern
> >> selectivity code doesn't make any use of the statistics about "most
> >> common values". For a constant pattern, we could actually apply the
> >> pattern test with each common value and derive answers that are exact
> >> for the portion of the population represented by the most-common-values
> >> list.
>
> > This reminds me of what I did in a patch which is currently on hold for the
> > next release:
>
> I've applied a patch to make patternsel() compute the exact result for
> the MCV list, and use its former heuristics only for the portion of the
> column population not included in the MCV list.

I think it's OK to use the MCV, but I have a problem with the current heuristics: they only work for randomly generated strings, since the selectivity goes down geometrically with length. That doesn't match most languages, where one- and two-syllable words are extremely common and longer ones much less so. A syllable can be 1-2 chars long, so any search string of length 1-4 is probably equally likely, rather than reducing in selectivity based upon length. So I think part of the problem is the geometrically reducing selectivity itself.

> After finishing that work it occurred to me that we could go a step
> further: if the MCV list accounts for a substantial fraction of the
> population, we could assume that the MCV list is representative of the
> whole population, and extrapolate the pattern's selectivity over the MCV
> list to the whole population instead of using the existing heuristics at
> all. In a situation like Andrea's example this would win big, although
> you can certainly imagine cases where it would lose too.

I don't think that can be inferred with any confidence unless a large proportion of the MCV list were itself selected. Otherwise it might match only a single MCV that just happens to have a high proportion, and then we would assume all others have the same proportion. The calculation is related to Ndistinct, in some ways.

> Any thoughts about this? What would be a reasonable threshold for
> "substantial fraction"? It would probably make sense to have different
> thresholds depending on whether the pattern is left-anchored or not,
> since the range heuristic only works for left-anchored patterns.

I don't think you can do this for a low enough substantial fraction to make this interesting.

I would favour the idea of dynamic sampling using a block sampling approach; that was a natural extension of improving ANALYZE also. We can use that approach for things such as LIKE, but also use it for multi-column single-table and join selectivity.

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> I think it's OK to use the MCV, but I have a problem with the current
> heuristics: they only work for randomly generated strings, since the
> selectivity goes down geometrically with length.

We could certainly use a less aggressive curve for that. You got a specific proposal?

>> After finishing that work it occurred to me that we could go a step
>> further: if the MCV list accounts for a substantial fraction of the
>> population, we could assume that the MCV list is representative of the
>> whole population, and extrapolate the pattern's selectivity over the MCV
>> list to the whole population instead of using the existing heuristics at
>> all. In a situation like Andrea's example this would win big, although
>> you can certainly imagine cases where it would lose too.

> I don't think that can be inferred with any confidence unless a large
> proportion of the MCV list were itself selected. Otherwise it might
> match only a single MCV that just happens to have a high proportion,
> and then we would assume all others have the same proportion.

Well, of course it can't be inferred "with confidence". Sometimes you'll win and sometimes you'll lose. The question is, is this a better heuristic than what we use otherwise? The current estimate for non-anchored patterns is really pretty crummy, and even with a less aggressive length-vs-selectivity curve it's not going to be great. Another possibility is to merge the two estimates somehow.

> I would favour the idea of dynamic sampling using a block sampling
> approach; that was a natural extension of improving ANALYZE also.

One thing at a time please. Obtaining better statistics is one issue, but the one at hand here is what to do given particular statistics.

        regards, tom lane
On Tue, 2006-01-10 at 17:21 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > I think it's OK to use the MCV, but I have a problem with the current
> > heuristics: they only work for randomly generated strings, since the
> > selectivity goes down geometrically with length.
>
> We could certainly use a less aggressive curve for that. You got a
> specific proposal?

I read some research not too long ago that showed a frequency curve of words by syllable length. I'll dig that out tomorrow.

> > I would favour the idea of dynamic sampling using a block sampling
> > approach; that was a natural extension of improving ANALYZE also.
>
> One thing at a time please. Obtaining better statistics is one issue,
> but the one at hand here is what to do given particular statistics.

I meant use the same sampling approach as I was proposing for ANALYZE, but do this at plan time for the query. That way we can apply the function directly to the sampled rows and estimate selectivity. I specifically didn't mention that in the Ndistinct discussion because I didn't want to confuse the subject further, but the underlying block sampling method would be identical, so the code is already almost there... we just need to eval the RestrictInfo against the sampled tuples.

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> I meant use the same sampling approach as I was proposing for ANALYZE,
> but do this at plan time for the query. That way we can apply the
> function directly to the sampled rows and estimate selectivity.

I think this is so unlikely to be a win as to not even be worth spending any time discussing. The extra planning time across all queries will vastly outweigh the occasional improvement in plan choice for some queries.

        regards, tom lane
On Tue, Jan 10, 2006 at 10:46:53AM -0500, Tom Lane wrote:
> Not with that data, but maybe if you increased the statistics target for
> the column to 100 or so, you'd catch enough values to get reasonable
> results.

Sorry, I'm not an expert with postgresql; could you tell me how to increase the statistics target?

In another email you said you applied a patch to CVS. Please let me know if you've anything to test for me, and I'll gladly test it immediately (I've a sandbox so it's ok even if it corrupts the db ;).

Thanks!
On Tue, 2006-01-10 at 22:40 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > I meant use the same sampling approach as I was proposing for ANALYZE,
> > but do this at plan time for the query. That way we can apply the
> > function directly to the sampled rows and estimate selectivity.
>
> I think this is so unlikely to be a win as to not even be worth spending
> any time discussing. The extra planning time across all queries will
> vastly outweigh the occasional improvement in plan choice for some
> queries.

Extra planning time would be bad, so clearly we wouldn't do this when we already have relevant ANALYZE statistics. I would suggest we do this only when all of these are true:

- when accessing more than one table, so the selectivity could affect a join result
- when we have either no ANALYZE statistics, or the ANALYZE statistics are not relevant to estimating selectivity, e.g. LIKE
- when access against the single table in question cannot find an index to use from other RestrictInfo predicates

I imagined that this would also be controlled by a GUC, dynamic_sampling, which would be set to zero by default and give a measure of the sample size to use. (Or just a bool enable_sampling = off (default).)

This is mentioned now because the plan under consideration in this thread would be improved by this action. It also isn't a huge amount of code to get it to work.

Best Regards, Simon Riggs
On Wed, Jan 11, 2006 at 09:07:45AM +0000, Simon Riggs wrote:
> I would suggest we do this only when all of these are true:
>
> - when accessing more than one table, so the selectivity could affect a
>   join result

FWIW my problem only happens if I join: on the main table where the kernel_version string is stored (without joins), everything is always blazing fast. So this requirement certainly sounds fine to me.
On Tue, Jan 10, 2006 at 02:44:47AM +0100, Andrea Arcangeli wrote:
> "cooperative" runs "WHERE kernel_version NOT LIKE '%% PREEMPT %%'", while
> "preemptive" runs "WHERE kernel_version LIKE '%% PREEMPT %%'". The only difference

One thing you could do is change the like to:

WHERE position(' PREEMPT ' in kernel_version) != 0

And then create a functional index on that:

CREATE INDEX indexname ON tablename ( position(' PREEMPT ' in kernel_version) );

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
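Applied to the query from the start of the thread, the suggested rewrite would presumably look something like this sketch:

    -- Same aggregate as before, with position() replacing the non-anchored LIKE
    SELECT class, vendor, device, revision, COUNT(*) AS nr
      FROM pci NATURAL INNER JOIN klive
     WHERE position(' PREEMPT ' in kernel_version) != 0
     GROUP BY class, vendor, device, revision;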
On Wed, Jan 11, 2006 at 12:40:32PM -0600, Jim C. Nasby wrote:
> On Tue, Jan 10, 2006 at 02:44:47AM +0100, Andrea Arcangeli wrote:
> > "cooperative" runs "WHERE kernel_version NOT LIKE '%% PREEMPT %%'", while
> > "preemptive" runs "WHERE kernel_version LIKE '%% PREEMPT %%'". The only difference
>
> One thing you could do is change the like to:
>
> WHERE position(' PREEMPT ' in kernel_version) != 0

That alone fixed it; with this I don't even need the index (yet). Thanks a lot.

> And then create a functional index on that:
>
> CREATE INDEX indexname ON tablename ( position(' PREEMPT ' in kernel_version) );

The index only helps the above query with = 0 and not the one with != 0, but it seems not needed in practice.
On Wed, Jan 11, 2006 at 09:39:47PM +0100, Andrea Arcangeli wrote:
> On Wed, Jan 11, 2006 at 12:40:32PM -0600, Jim C. Nasby wrote:
> > On Tue, Jan 10, 2006 at 02:44:47AM +0100, Andrea Arcangeli wrote:
> > > "cooperative" runs "WHERE kernel_version NOT LIKE '%% PREEMPT %%'", while
> > > "preemptive" runs "WHERE kernel_version LIKE '%% PREEMPT %%'". The only difference
> >
> > One thing you could do is change the like to:
> >
> > WHERE position(' PREEMPT ' in kernel_version) != 0
>
> That alone fixed it; with this I don't even need the index (yet). Thanks
> a lot.

The fix is online already, w/o index: http://klive.cpushare.com/?branch=all&scheduler=preemptive

Of course I'm still fully available to test any fix for the previous LIKE query if there's interest.
On Wed, Jan 11, 2006 at 09:39:47PM +0100, Andrea Arcangeli wrote:
> > CREATE INDEX indexname ON tablename ( position(' PREEMPT ' in kernel_version) );
>
> The index only helps the above query with = 0 and not the one with != 0,
> but it seems not needed in practice.

Hrm. If you need indexing then, you'll probably have to do 2 indexes with a WHERE clause...

CREATE INDEX ... WHERE position(...) = 0;
CREATE INDEX ... WHERE position(...) != 0;

I suspect this is because of a lack of stats for functional indexes.

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
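Fleshed out with the names from this thread, the two partial indexes would be something like the following sketch (the index names and the choice of klive as the indexed column are assumptions; whether the planner actually uses them depends on the rest of the query):

    -- Partial indexes covering the two outcomes of the position() test
    CREATE INDEX klive_preempt_idx     ON klive (klive)
        WHERE position(' PREEMPT ' in kernel_version) != 0;
    CREATE INDEX klive_not_preempt_idx ON klive (klive)
        WHERE position(' PREEMPT ' in kernel_version) = 0;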
"Jim C. Nasby" <jnasby@pervasive.com> writes: > On Wed, Jan 11, 2006 at 09:39:47PM +0100, Andrea Arcangeli wrote: >> The index only helps the above query with = 0 and not the one with != 0, >> but it seems not needed in practice. > I suspect this is because of a lack of stats for functional indexes. No, it's because != isn't an indexable operator. regards, tom lane
On Tue, 2006-01-10 at 17:21 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > I think it's OK to use the MCV, but I have a problem with the current
> > heuristics: they only work for randomly generated strings, since the
> > selectivity goes down geometrically with length.
>
> We could certainly use a less aggressive curve for that. You got a
> specific proposal?

A non-left-anchored LIKE is most likely going to be used with unstructured, variable-length data - else we might use SUBSTRING instead. My proposal would be to assume that LIKE is acting on human-language text data.

I considered this a while back, but wrote it off in favour of dynamic sampling - but it's worth discussing this to see whether we can improve on things without that.

Here's one of the links I reviewed previously: http://www.ling.lu.se/persons/Joost/Texts/studling.pdf Sigurd et al [2004]

This shows that the word frequency distribution peaks at 3-letter/2-phoneme words, then tails off exponentially after that. Clearly when the search string is longer than 3 characters the selectivity must tail off exponentially also, since we couldn't find words shorter than the search string itself. The search string might be a phrase, but it seems reasonable to assume that phrases also drop off in frequency according to length. It is difficult to decide what to do at len=2 or len=3, and I would be open to various thoughts, but would default to keeping like_selectivity as it is now.

Sigurd et al show that word length tails off at 0.7^Len beyond Len=3, so the selectivity FIXED_CHAR_SEL should not be more than 0.7, but I see no evidence for it being as low as 0.2 (from the published results). For simplicity, where Len > 3, I would make the tail-off occur with factor 0.5, rather than 0.2.

We could see a few more changes from those results, but curbing the aggressive tail-off would be a simple and easy act.

Best Regards, Simon Riggs
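To put rough numbers on how the two curves differ for a pattern like ' PREEMPT ' (9 fixed characters): the constants below are assumptions chosen for illustration; the real selfuncs.c formula also accounts for wildcards and clamps the result.

    -- Rough comparison of per-character tail-off factors for a 9-character match text
    SELECT 0.2 ^ 9 AS aggressive_curve,   -- ~0.0000005: "almost nothing matches"
           0.5 ^ 6 AS proposed_tail_off;  -- ~0.016: tail off only beyond Len = 3, factor 0.5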