Thread: LIKE optimization and locale

LIKE optimization and locale

From: Bruce Momjian
When locale is enabled, we have always had a problem using an index
with:
col LIKE 'abc%'

We need to make this:
col LIKE 'abc%' AND col >= 'abc' AND col < 'abd'

but when locale is enabled, we can't be sure what letter is greater than
'c' in this case.

Why don't we just spin through all 255 character values and find the
lowest one that is greater than the comparison target?  It takes only
255 comparisons, which is certainly cheaper than not using an index
with LIKE.
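
Something like this, say (a minimal sketch built on strcoll(); the
helper is hypothetical, not anything in the backend):

    #include <limits.h>
    #include <string.h>

    /* Return the byte value that collates lowest among those that
     * collate strictly greater than 'target' under the current
     * LC_COLLATE setting, or 0 if no byte collates greater. */
    static unsigned char
    next_collating_byte(unsigned char target)
    {
        char    tbuf[2] = {(char) target, '\0'};
        char    cbuf[2] = {'\0', '\0'};
        char    bestbuf[2] = {'\0', '\0'};
        unsigned char best = 0;
        int     c;

        for (c = 1; c <= UCHAR_MAX; c++)
        {
            cbuf[0] = (char) c;
            if (strcoll(cbuf, tbuf) > 0 &&
                (best == 0 || strcoll(cbuf, bestbuf) < 0))
            {
                best = (unsigned char) c;
                bestbuf[0] = (char) c;
            }
        }
        return best;
    }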

It is so simple, I don't know why I didn't think of it before.  If
performance is a problem, we can do it once in the backend, or even
once in the postmaster, and keep the collation ordering in a 256-byte
array.  We may even be able to handle multi-byte this way, though with
256^2 entries we would need to do it at postmaster startup.
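
The once-per-postmaster precomputation could look like this (again
just a sketch; none of these names exist in the source):

    #include <limits.h>
    #include <stdlib.h>
    #include <string.h>

    /* bytes 1..255, sorted by the locale's collation order */
    static unsigned char coll_order[UCHAR_MAX];

    static int
    coll_cmp(const void *a, const void *b)
    {
        char    abuf[2] = {*(const char *) a, '\0'};
        char    bbuf[2] = {*(const char *) b, '\0'};

        return strcoll(abuf, bbuf);
    }

    /* Run once at postmaster startup; afterwards "the lowest byte
     * collating above a given one" is a simple scan of coll_order[],
     * with no further strcoll() calls needed. */
    static void
    build_coll_order(void)
    {
        int     i;

        for (i = 0; i < UCHAR_MAX; i++)
            coll_order[i] = (unsigned char) (i + 1);
        qsort(coll_order, UCHAR_MAX, 1, coll_cmp);
    }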

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: LIKE optimization and locale

From: Peter Eisentraut
Bruce Momjian writes:

> Why don't we just spin through all 255 character values and find the
> lowest one that is greater than the comparison target?

The issue is not that the 255 extended ASCII characters have a different
ordering in various locales (although that's part of it).  The real
problem lies with multi-character collating elements, context-dependent
collation order, multi-pass sorting algorithms, etc.  I'm almost
convinced that no such optimization as we had is possible in the most
general case.
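
For example, Czech treats "ch" as a single collating element that sorts
after "h", so byte order and collation order can disagree wholesale.  A
small demo (assuming a cs_CZ locale is installed; results are
system-dependent):

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        if (setlocale(LC_COLLATE, "cs_CZ") == NULL)
            return 1;
        /* Bytewise, 'h' < 'u' puts "chata" first; in Czech collation
         * the "ch" element sorts after "h", reversing the order. */
        printf("%d\n", strcmp("chata", "cukr"));    /* negative */
        printf("%d\n", strcoll("chata", "cukr"));   /* positive */
        return 0;
    }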

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/



Re: LIKE optimization and locale

From: Tom Lane
Peter Eisentraut <peter_e@gmx.net> writes:
> The real problem lies with multi-character collating elements,
> context-dependent collation order, multi-pass sorting algorithms, etc.
> I'm almost convinced that no such optimization as we had is possible in
> the most general case.

It's not impossible, surely.  The question is whether it's practical,
particularly from a manpower and maintenance standpoint.  We do not
want to build something that needs explicit knowledge of a ton of
locale-specific special cases.

The core problem is: given a string "foo", find a string "fop" that
is greater than any possible extension "foobar" of "foo".  We need
not find the least such string (else it would indeed be a hard
problem), just a reasonably close upper bound.  The algorithm we have
in 7.0.* increments the last byte(s) of "foo" until it finds
something greater than "foo".  That handles collation orders that are
different from numerical order, but it still breaks down in the cases
Peter mentions.
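
In outline the 7.0 logic looks roughly like this (a simplified sketch,
not the actual backend code):

    #include <limits.h>
    #include <stdbool.h>
    #include <string.h>

    /* Try to turn 's' into a string that collates greater than its
     * original contents, bumping the last byte and truncating on
     * overflow.  Returns false if every byte wrapped around. */
    static bool
    make_greater(char *s)
    {
        unsigned char *p = (unsigned char *) s;
        size_t  len = strlen(s);
        char    orig[len + 1];

        strcpy(orig, s);
        while (len > 0)
        {
            while (p[len - 1] < UCHAR_MAX)
            {
                p[len - 1]++;
                if (strcoll(s, orig) > 0)
                    return true;
            }
            /* last byte wrapped; truncate and bump the one before */
            s[--len] = '\0';
        }
        return false;
    }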

One variant I've been wondering about is to test a candidate bound
string against not only "foo", but all single-character extensions of
"foo", ie, "foo\001" through "foo\255".  That would catch situations
like the one most recently complained of, where the last character
of the proposed bound string is just a noise-character in dictionary
order.  But I'm afraid it's still not good enough to catch all cases
... and it doesn't generalize to MULTIBYTE very well anyway.
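The check itself would be cheap enough, something like (sketch only;
the function is made up):

    #include <limits.h>
    #include <stdbool.h>
    #include <string.h>

    /* Does 'bound' collate above 'prefix' and above every
     * single-byte extension of 'prefix'? */
    static bool
    bound_covers_extensions(const char *prefix, const char *bound)
    {
        size_t  len = strlen(prefix);
        char    ext[len + 2];
        int     c;

        if (strcoll(bound, prefix) <= 0)
            return false;
        memcpy(ext, prefix, len);
        ext[len + 1] = '\0';
        for (c = 1; c <= UCHAR_MAX; c++)
        {
            ext[len] = (char) c;
            if (strcoll(bound, ext) <= 0)
                return false;   /* an extension sorts at or above it */
        }
        return true;
    }
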
        regards, tom lane


Re: LIKE optimization and locale

From: Bruce Momjian
> The core problem is: given a string "foo", find a string "fop" that
> is greater than any possible extension "foobar" of "foo".  We need
> not find the least such string (else it would indeed be a hard
> problem), just a reasonably close upper bound.  The algorithm we have
> in 7.0.* increments the last byte(s) of "foo" until it finds
> something greater than "foo".  That handles collation orders that are
> different from numerical order, but it still breaks down in the cases
> Peter mentions.

This byte-incrementing approach seems suboptimal.

> 
> One variant I've been wondering about is to test a candidate bound
> string against not only "foo", but all single-character extensions of
> "foo", ie, "foo\001" through "foo\255".  That would catch situations
> like the one most recently complained of, where the last character
> of the proposed bound string is just a noise-character in dictionary
> order.  But I'm afraid it's still not good enough to catch all cases
> ... and it doesn't generalize to MULTIBYTE very well anyway.

This was my suggestion: test all 255 characters and find the lowest one
that is greater than the target.  But I see that multi-byte would be a
problem.  Oh, well.  I had hoped some postmaster-generated lookup table
could fix this.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026