Thread: Possible solution for LIKE optimization

Possible solution for LIKE optimization

From
Peter Eisentraut
Date:
I have had an idea how the LIKE optimization problem could be solved.
The problem was that given the expression
   A LIKE (P || '%')

where A is a column reference and P is a constant not containing
wildcards, we wanted to find X and Y such that
   X <= A and A <= Y

where X and Y are calculated from P and the bound is usefully tight.

Note that X <= A is really strcoll(X, A) <= 0 and strcoll(X, A) is really
strcmp(strxfrm(X), strxfrm(A)) (the prototype of strxfrm() is different in
C, of course).

Let $<=$ be a non-locale based comparison (i.e., plain strcmp()).

The we can say that
   strxfrm(P) $<=$ strxfrm(A) and   strxfrm(A) $<=$ (strxfrm(P) + 1)

where "+ 1" means adding 1 to the last character and accounting for
overflow (like considering the string a base-256 number).  Basically, this
is the funny-collation-safe equivalent to A LIKE 'foo%' yielding 'foo' <=
A <= 'fop'.

We'd need to implement the strxfrm() function in SQL and the $<=$
operator, both of which are trivial.  The index would have to be in terms
of strxfrm().  There might be other issues, but they could be solved
algorithmically, I suppose.

What do you think?

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



Re: Possible solution for LIKE optimization

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> I have had an idea how the LIKE optimization problem could be solved.

Hmm ... so in a non-ASCII locale, we'd have to look for an index on
strxfrm(A) rather than directly on A.  And the index would need to
use a nonstandard operator set --- ie, *non* locale aware comparison
operators (which might be useful for other purposes anyway).

Interesting thought.  I'm not entirely sure how we'd teach the planner
to do this, but that's probably solvable.

A more significant problem is that I'm still not convinced this gets the
job done, because of the problem of multi-character collation elements.
If "A LIKE 'FOOS%'" should match FOOSS, but SS is treated specially by
the collation rules, does this scheme work?
        regards, tom lane


Re: Possible solution for LIKE optimization

From
Peter Eisentraut
Date:
Tom Lane writes:

> Peter Eisentraut <peter_e@gmx.net> writes:
> > I have had an idea how the LIKE optimization problem could be solved.
>
> Hmm ... so in a non-ASCII locale, we'd have to look for an index on
> strxfrm(A) rather than directly on A.  And the index would need to
> use a nonstandard operator set --- ie, *non* locale aware comparison
> operators (which might be useful for other purposes anyway).

Wait, why isn't that the solution in the first place.  Let's build the
index with an opclass that uses plain strcmp comparison.  Then you can
compute the bounds using the method 'foo' <= 'foo%' <= 'fop'.  We don't
need to trick the locale facilities, we just avoid using them.  LIKE is
defined in terms of character elements, not collation elements, so that's
okay.

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



Re: Possible solution for LIKE optimization

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Wait, why isn't that the solution in the first place.  Let's build the
> index with an opclass that uses plain strcmp comparison.

By George, I think you've got it!  All we need is comparison ops and
an opclass that use strcmp, even when USE_LOCALE is defined.  Then we
document "here's how you make a LIKE-compatible index in non-ASCII
locales", and away we go.
        regards, tom lane


Re: Possible solution for LIKE optimization

From
Hiroshi Inoue
Date:
Tom Lane wrote:
> 
> Peter Eisentraut <peter_e@gmx.net> writes:
> > Wait, why isn't that the solution in the first place.  Let's build the
> > index with an opclass that uses plain strcmp comparison.
> 
> By George, I think you've got it!  All we need is comparison ops and
> an opclass that use strcmp, even when USE_LOCALE is defined.  Then we
> document "here's how you make a LIKE-compatible index in non-ASCII
> locales", and away we go.
> 

Do we have to make 2 indexes for non_ASCII text field ?

regards,
Hiroshi Inoue


Re: Possible solution for LIKE optimization

From
Tom Lane
Date:
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
>> Peter Eisentraut <peter_e@gmx.net> writes:
> Wait, why isn't that the solution in the first place.  Let's build the
> index with an opclass that uses plain strcmp comparison.

> Do we have to make 2 indexes for non_ASCII text field ?

You would if you want to use indexscans for both LIKE and "x < 'FOO'"
(ie, locale-aware comparisons).  Which is not great, but I think we've
finally seen the light: a locale-sorted index is just plain not useful
for LIKE.

This discussion has restored my faith in index opclasses; finally we
have a real-world application of 'em that we can point to ...
        regards, tom lane


RE: Possible solution for LIKE optimization

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> 
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> >> Peter Eisentraut <peter_e@gmx.net> writes:
> > Wait, why isn't that the solution in the first place.  Let's build the
> > index with an opclass that uses plain strcmp comparison.
> 
> > Do we have to make 2 indexes for non_ASCII text field ?
> 
> You would if you want to use indexscans for both LIKE and "x < 'FOO'"
> (ie, locale-aware comparisons). 

And ORDER BY ?

> Which is not great, but I think we've
> finally seen the light: a locale-sorted index is just plain not useful
> for LIKE.

I'm not familiar with non_ASCII locale.
Is 'ss'  always guaranteed to be LIKE 's%'  for example ?

regards,
Hiroshi Inoue


Re: Possible solution for LIKE optimization

From
Tom Lane
Date:
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
>>> Do we have to make 2 indexes for non_ASCII text field ?
>> 
>> You would if you want to use indexscans for both LIKE and "x < 'FOO'"
>> (ie, locale-aware comparisons). 

> And ORDER BY ?

Well, the assumption is that we'd make a second set of string comparison
operators that are defined as not-locale-aware.  Names still to be
chosen, but let's suppose they're $=$, $<$, $<=$, etc.  Then
SELECT ... ORDER BY foo;

would want to use a plain index (defined using locale-aware comparison),
whereas
SELECT ... ORDER BY foo USING $<$;

would want to use a non-locale-aware index.  So you could use such an
index for sorting, as long as you were content to have non-locale-aware
output ordering.

> I'm not familiar with non_ASCII locale.
> Is 'ss'  always guaranteed to be LIKE 's%'  for example ?

I'd assume so, but I'd be interested to hear whether native speakers
of German think that that's appropriate in their locale ...
        regards, tom lane


RE: Possible solution for LIKE optimization

From
Peter Eisentraut
Date:
Hiroshi Inoue writes:

> I'm not familiar with non_ASCII locale.
> Is 'ss'  always guaranteed to be LIKE 's%'  for example ?

Yes.  LIKE doesn't use any collation rules, since it doesn't do any
collating.

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



Re: Possible solution for LIKE optimization

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Hiroshi Inoue writes:
>> I'm not familiar with non_ASCII locale.
>> Is 'ss'  always guaranteed to be LIKE 's%'  for example ?

> Yes.  LIKE doesn't use any collation rules, since it doesn't do any
> collating.

On the other hand, LIKE *is* multibyte aware.  So the hypothetical
non-locale-aware comparison operators would need to be aware of
multibyte character sets even though not aware of locale.  And the
"add one" operator that we postulated for the LIKE index optimization
needs to be able to increment a multibyte character.

This seems doable, but the sort order of such a comparison function
might not be very pleasant, depending on what character set you are
using.
        regards, tom lane


Re: Possible solution for LIKE optimization

From
Peter Eisentraut
Date:
Tom Lane writes:

> On the other hand, LIKE *is* multibyte aware.  So the hypothetical
> non-locale-aware comparison operators would need to be aware of
> multibyte character sets even though not aware of locale.  And the
> "add one" operator that we postulated for the LIKE index optimization
> needs to be able to increment a multibyte character.

Both of these are not hard if you know how strcmp operates.  It would also
be irrelevant whether "add one" to a multibyte character yields another
valid multibyte character.  strcmp doesn't care.

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



Re: Possible solution for LIKE optimization

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Tom Lane writes:
>> On the other hand, LIKE *is* multibyte aware.  So the hypothetical
>> non-locale-aware comparison operators would need to be aware of
>> multibyte character sets even though not aware of locale.  And the
>> "add one" operator that we postulated for the LIKE index optimization
>> needs to be able to increment a multibyte character.

> Both of these are not hard if you know how strcmp operates.  It would also
> be irrelevant whether "add one" to a multibyte character yields another
> valid multibyte character.  strcmp doesn't care.

I imagine we can come up with a definition that makes LIKE optimization
work.  I was wondering more about Hiroshi's implied question: would such
an index have any use for sorting?  Or would the induced sort order (on
multibyte character sets) look too weird to be of any use?
        regards, tom lane


Re: Possible solution for LIKE optimization

From
Giles Lean
Date:
[ I realise the discussion has left strxfrm(), but for the archives if nothing else ... ]

Peter Eisentraut <peter_e@gmx.net> wrote:

> We'd need to implement the strxfrm() function in SQL and the $<=$
> operator, both of which are trivial.  The index would have to be in terms
> of strxfrm().  There might be other issues, but they could be solved
> algorithmically, I suppose.

Implementations of strxfrm() that I've looked at have had result data
that is three or four times larger than then input string -- quite a
penalty in some situations.

Regards,

Giles




Re: Possible solution for LIKE optimization

From
Tom Lane
Date:
Giles Lean <giles@nemeton.com.au> writes:
> Implementations of strxfrm() that I've looked at have had result data
> that is three or four times larger than then input string -- quite a
> penalty in some situations.

Especially so given that we don't have TOAST for indexes, so the indexed
value can't exceed about 2700 bytes (for btree and an 8K block size).
You are allowed to compress first, so that's not a hard limit, but it
could still be a problem.

I like the non-locale-aware-opclass idea much better than the original.
        regards, tom lane


Re: Possible solution for LIKE optimization

From
Hiroshi Inoue
Date:
Peter Eisentraut wrote:
> 
> Hiroshi Inoue writes:
> 
> > I'm not familiar with non_ASCII locale.
> > Is 'ss'  always guaranteed to be LIKE 's%'  for example ?
> 
> Yes.  LIKE doesn't use any collation rules, since it doesn't do any
> collating.
> 

Hmm I see the description like the following in SQL99 though I
don't understand the meaning.

i) If <escape character> is not specified, then the collating  sequence used for the <like predicate> is determined by
Table3, ‘‘Collating sequence usage for comparisons’’, taking <character  match value> as comparand 1 (one) and
<characterpattern> as  comparand 2.
 

regards,
Hiroshi Inoue


Re: Possible solution for LIKE optimization

From
Oleg Bartunov
Date:
On Mon, 6 Aug 2001, Tom Lane wrote:

> Giles Lean <giles@nemeton.com.au> writes:
> > Implementations of strxfrm() that I've looked at have had result data
> > that is three or four times larger than then input string -- quite a
> > penalty in some situations.
>
> Especially so given that we don't have TOAST for indexes, so the indexed
> value can't exceed about 2700 bytes (for btree and an 8K block size).
> You are allowed to compress first, so that's not a hard limit, but it
> could still be a problem.
>
> I like the non-locale-aware-opclass idea much better than the original.

Does this means that if implemented we could create indexes for
different columns with/without locale support ? It's pain currently
that I had to enable locale support while actually I need
it only for several columns. Gnu sort on Linux 10 times slow if
I use LC_ALL other than C !
Regards,
    Oleg

>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
>
Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83