Re: 7.3 no longer using indexes for LIKE queries - Mailing list pgsql-general

From Matthew Gabeler-Lee
Subject Re: 7.3 no longer using indexes for LIKE queries
Date
Msg-id ABABFB80F35AD311848B0090279918EF010B9B67@ZYCOSNT2.hq.zycos.com
Whole thread Raw
In response to 7.3 no longer using indexes for LIKE queries  (Matthew Gabeler-Lee <mgabelerlee@zycos.com>)
Responses Re: 7.3 no longer using indexes for LIKE queries
Re: 7.3 no longer using indexes for LIKE queries
List pgsql-general
Can anyone explain the rationale behind that sort order?  I'm guessing it
has something to do with getting sequences of words sorted 'right', but I
fail to see how that is right.  I'd think when sorting it would make sense
to have all sequences of words that start with the same word(s) together.

$ echo -e 'bobbill\nbobrob\nbob bill\nbob robber' | LC_ALL=en_US sort
bobbill
bob bill    -\
bobrob       |-- Seems these ought to be adjacent?
bob robber  -/

Perhaps I'm just old fashioned.  In any case, I don't define locale
collation orders, so *shrug*.

The pseudocode I gave for finding insane ones could also be extended to find
information for the modified LIKE optimization you suggest.  Find chars
(like space) that are ignored in the collation order, as it does, and report
them.  When picking the lower bound, drop all such chars from the beginning
of the expression, then find the upper bound from the thus shortened lower
bound.  This would work for stuff that sorted like en_US does with spaces,
but something that sorted 'a b' *before* 'ab' gets into the realms of the
impossible to deal with (there was a message in one of the lists talking
about a locale that apparently suffered from this problem, it at least
sorted 'abc\NNN' before 'abc' for some value of NNN which I forget).

In locales whose second-pass sort only moves things forwards, the least the
planner could know these 'danger' characters and only optimize if none of
them are present in the query string.  foo LIKE 'ab%' can be safely
optimized in en_US since it only has chars that sort simply.  Finding which
chars sort simply should be straight forward and only require testing 2^8 or
maybe 2^16 values.

    -Matt


-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

[tgl@g3]$ echo -e  "a\na a\naa\na  a\nab\na b" | LC_ALL=en_US sort
a
aa
a a
a  a
ab
a b

There's no way to use an index ordered like this to look for strings
beginning "a ", because the sorting of spaces depends on what comes
after them.

Making any real dent in the problem will probably require in-depth
analysis of common locale (mis)behaviors.  For example, if space sorting
is the only thing that's funny about en_US, it might make sense for us
to support a modified form of the LIKE optimization that doesn't
consider space as a "safe" prefix character (ie, we could index only
for "a" not "a ", even if the pattern is LIKE 'a %').

I have no idea exactly what sort of compromises would be effective
though.  Any localedef experts out there?

            regards, tom lane

pgsql-general by date:

Previous
From: Joseph Shraibman
Date:
Subject: Re: where did debug_print_query go in 7.3???
Next
From: CSN
Date:
Subject: 7.3 RPMS