Thread: Sigh, LIKE indexing is *still* broken in foreign locales

Sigh, LIKE indexing is *still* broken in foreign locales

From
Tom Lane
Date:
Moucha Václav <MouchaV@Radiomobil.cz> writes:
> 1. Compilation
>    ./configure --enable-locale    # not needed for RPMS precompiled binaries 

> 2. Starting postmaster
>    export LC_CTYPE=cs_CZ
>    export LC_COLLATE=cs_CZ        # this setting is important for the
> bug result
>    postmaster -S -D /home/pgsql/data -o '-Fe'    

> 3. SQL steps
>    create table test (name text);
>    insert into test values ('�');    # the first char is E1 from LATIN 2
> coding
>    insert into test values ('�b');
>    create index test_index on test (name);
>    set cpu_tuple_cost=1;        # force backend to use index
> scanning
>    select * from test where name like '�%';

> BUG: Only 1 line is selected with '�' only instead of both lines.

The problem here is that given the search pattern '\341%', the planner
generates index limit conditionsname >= '\341' AND name < '\342';

Apparently, in CZ locale it is true that '\341' is less than '\342',
but it does not follow from that that all strings starting with '\341'
are less than '\342'.  In fact '\341b' is considered greater than '\342'.

Since '\341' and '\342' are two different accented forms of 'a'
(if I'm looking at the right character set), this is perhaps not so
improbable as all that.  Evidently the collation rule is that different
accent forms sort the same unless the strings would otherwise be
considered equal, in which case an ordering is assigned to them.

So, the rule we thought we had for generating index bounds falls flat,
and we're back to the same old question: given a proposed prefix string,
how can we generate bounds that are certain to be considered <= and >=
all strings starting with that prefix?

I am now thinking that maybe we should search for a string that compares
greater than "fooz" when the prefix is "foo" --- that is, append a 'z'
to the prefix string.  But I wouldn't be surprised if that fails too
in some locales.

I'm also wondering if the left-hand inequality ('foo' <= any string
beginning with 'foo') might fail in some locales ... we haven't seen
it reported but who knows ...
        regards, tom lane


Re: Sigh, LIKE indexing is *still* broken in foreign locales

From
Giles Lean
Date:
On Wed, 07 Jun 2000 22:22:06 -0400  Tom Lane wrote:

> Since '\341' and '\342' are two different accented forms of 'a'
> (if I'm looking at the right character set), this is perhaps not so
> improbable as all that.  Evidently the collation rule is that different
> accent forms sort the same unless the strings would otherwise be
> considered equal, in which case an ordering is assigned to them.

I thought that was common, but while I've worked on
internationalisation issues sometimes I'm no linguist.

> So, the rule we thought we had for generating index bounds falls flat,
> and we're back to the same old question: given a proposed prefix string,
> how can we generate bounds that are certain to be considered <= and >=
> all strings starting with that prefix?

To confess ignorance, why does PostgreSQL need to generate such
bounds?  Complete string comparisons with a locale aware function such
as strcoll() are safe.  Using less than a full string is tricky
indeed, and I'm not sure is possible in general although it might be.

Other problematic cases are likely to include one-to-two collations (�
in German, for example) and two-to-one collations (the reverse, but
I've forgotten my example.  Anyone?)

Then there are wide characters, including some encodings that are
stateful.

Regards,

Giles




Re: Sigh, LIKE indexing is *still* broken in foreign locales

From
"Matthias Urlichs"
Date:
Hi,

Giles Lean:
> > So, the rule we thought we had for generating index bounds falls flat,
> > and we're back to the same old question: given a proposed prefix string,
> > how can we generate bounds that are certain to be considered <= and >=
> > all strings starting with that prefix?
> 
> To confess ignorance, why does PostgreSQL need to generate such
> bounds?

To find the position in the index where it should start scanning.

> Then there are wide characters, including some encodings that are
> stateful.

Personally, I am in the "store everything on the server in Unicode"
camp. Let the parser convert everything to Unicode on the way in, 
and vice versa.

There's no sense, IMHO, in burdening the SQL core with multiple
character encoding schemes.

-- 
Matthias Urlichs  |  noris network GmbH   |   smurf@noris.de  |  ICQ: 20193661
The quote was selected randomly. Really.       |        http://smurf.noris.de/
-- 
I loathe that low vice curiosity.                                       -- Lord Byron


Re: [BUGS] Re: Sigh, LIKE indexing is *still* broken in foreign locales

From
Giles Lean
Date:
On Thu, 8 Jun 2000 08:53:25 +0200  "Matthias Urlichs" wrote:

> To find the position in the index where it should start scanning.

Hmm.  That I guess is faster than locating the prefix given to LIKE in
the index and scanning back as well as forward.

> Personally, I am in the "store everything on the server in Unicode"
> camp. Let the parser convert everything to Unicode on the way in, 
> and vice versa.

That would help the charater set problem, although there are people
that argue that Unicode is not acceptable for all languages.  (I only
note this; I don't have an opinion.)

I don't see that using Unicode helps very much with collation though
-- surely collation must still be locale specific, and the problems
with two-pass algorithms, and two-to-one and one-to-two mappings are
unchanged?

Regards,

Giles



Re: Sigh, LIKE indexing is *still* broken in foreign locales

From
Tom Lane
Date:
Giles Lean <giles@nemeton.com.au> writes:
> On Thu, 8 Jun 2000 08:53:25 +0200  "Matthias Urlichs" wrote:

>> To find the position in the index where it should start scanning.

> Hmm.  That I guess is faster than locating the prefix given to LIKE in
> the index and scanning back as well as forward.

Wouldn't help.  The reason why we need both an upper and lower bound
is to know where to stop scanning as well as where to start.  "Scan
outward from the middle" doesn't tell you when you can stop.

The bounds do not have to be perfectly tight, in the sense of being
the least string >= or largest string <= the desired strings.  It's
OK if we scan a few extra tuples in some cases.  But we have to have
reasonably close bounds or we can't implement LIKE with an index.

>> Personally, I am in the "store everything on the server in Unicode"
>> camp. Let the parser convert everything to Unicode on the way in, 
>> and vice versa.

AFAIK, none of our server-side charset encodings are stateful --- and
I for one will argue that we must never accept any, for precisely the
sort of problem being discussed here.  (If a client wants to use such
a brain-dead encoding, that's not our problem...)

However, the problem at hand has little to do with encodings.  I think
it's more a matter of understanding the possible variations of
context-sensitive collation orders.
        regards, tom lane


Re: Sigh, LIKE indexing is *still* broken in foreign locales

From
Giles Lean
Date:
On Thu, 08 Jun 2000 10:04:11 -0400  Tom Lane wrote:

> The bounds do not have to be perfectly tight, in the sense of being
> the least string >= or largest string <= the desired strings.  It's
> OK if we scan a few extra tuples in some cases.  But we have to have
> reasonably close bounds or we can't implement LIKE with an index.

Determining the bounding (sub-)strings looks like a very hard problem.

I think there is enough information in a POSIX locale to determine
what the rules for constructing such bounds would be ... but there is
no programatic interface to determine the rules a locale uses for
collation.  (I have no idea what non-POSIX systems provide.)

(The localedef program can build a locale.  strcoll() and strxfrm()
can use the collation information.  That's all I see.)

In the absence of a way to do this "right" we need someone to see a
"good enough" hack that happens to work everywhere, or else give up
using indexes for LIKE which I doubt would please anyone.  I suppose
the mismatch comes about because LIKE is about pattern matching and
not collation. :(

Regards,

Giles


Re: Sigh, LIKE indexing is *still* broken in foreign locales

From
Erich Stamberger
Date:
On Wed, 7 Jun 2000, Tom Lane wrote:

> Moucha Václav <MouchaV@Radiomobil.cz> writes:
> > 1. Compilation
> >    ./configure --enable-locale    # not needed for RPMS precompiled binaries 
> 
> > 2. Starting postmaster
> >    export LC_CTYPE=cs_CZ
> >    export LC_COLLATE=cs_CZ        # this setting is important for the
> > bug result
> >    postmaster -S -D /home/pgsql/data -o '-Fe'    
> 
> > 3. SQL steps
> >    create table test (name text);
> >    insert into test values ('á');    # the first char is E1 from LATIN 2
> > coding
> >    insert into test values ('áb');
> >    create index test_index on test (name);
> >    set cpu_tuple_cost=1;        # force backend to use index
> > scanning
> >    select * from test where name like 'á%';
> 
> > BUG: Only 1 line is selected with 'á' only instead of both lines.
> 
> The problem here is that given the search pattern '\341%', the planner
> generates index limit conditions
>     name >= '\341' AND name < '\342';
> 
> Apparently, in CZ locale it is true that '\341' is less than '\342',
> but it does not follow from that that all strings starting with '\341'
> are less than '\342'.  In fact '\341b' is considered greater than '\342'.
> 

Hm. The character that follows 0xe1 in iso-8859-2 order is
"a + circumflex" (Oxe2) which is - as far as I know - not
part of the Czech alphapet. The successors of  0xe1 in
Czech collation order (code points from iso-8859-2)
are 0x042 (capital B) and 0x062 (small B).

=> name >= '0xe1' AND (name < '0x062' OR name < '0x042')
provided comparision is done by strcoll().

Another interresting feature of Czech collation is:

H < "CH" < I

and:

B < C < C + CARON < D .. < H < "CH" < I

So what happens with "WHERE name like 'Czec%`" ?

Regards
Erich




Re: Sigh, LIKE indexing is *still* broken in foreign locales

From
Peter Eisentraut
Date:
Tom Lane writes:

> Evidently the collation rule is that different accent forms sort the
> same unless the strings would otherwise be considered equal, in which
> case an ordering is assigned to them.

Yes, that's fairly common.

> I am now thinking that maybe we should search for a string that compares
> greater than "fooz" when the prefix is "foo" --- that is, append a 'z'
> to the prefix string.  But I wouldn't be surprised if that fails too
> in some locales.

It most definitely will. sv_SE, no_NO, and hr_HR are the early candidates.
And there's also nothing that says that you can only use LIKE on letters,
Latin letters at that.

The only thing you can really do in this direction is to append the very
last character in the complete collation sequence, if there's a way to
find that out. If there isn't, it might be worth hard-coding a few popular
ones.

> I'm also wondering if the left-hand inequality ('foo' <= any string
> beginning with 'foo') might fail in some locales ... we haven't seen
> it reported but who knows ...

I think that's pretty safe. Shorter strings are always "less than" longer
ones.


-- 
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden



RE: Sigh, LIKE indexing is *still* broken in foreign locales

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Giles Lean
> 
> On Thu, 08 Jun 2000 10:04:11 -0400  Tom Lane wrote:
> 
> > The bounds do not have to be perfectly tight, in the sense of being
> > the least string >= or largest string <= the desired strings.  It's
> > OK if we scan a few extra tuples in some cases.  But we have to have
> > reasonably close bounds or we can't implement LIKE with an index.
> 
> Determining the bounding (sub-)strings looks like a very hard problem.
>

[snip] 

> 
> In the absence of a way to do this "right" we need someone to see a
> "good enough" hack that happens to work everywhere, or else give up
> using indexes for LIKE which I doubt would please anyone.  I suppose
> the mismatch comes about because LIKE is about pattern matching and
> not collation. :(
>

Currently optimizer doesn't choose index scan unless the bounds
are sufficiently restrictive.
However we may be able to choose index scan more often if we
could have various type(e.g. LIKE) of qualificactions in index qual.
Though we have to scan an index file entirely,we may be able to
avoid looking up for the heap relation for sufficiently many index
tuples.

This may also give us another possibity.
Index scan may be available for the queries likeselect * from .. where .. LIKE '%abc%';

Comments ?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp


Re: Sigh, LIKE indexing is *still* broken in foreign locales

From
Tom Lane
Date:
Erich Stamberger <eberger@gewi.kfunigraz.ac.at> writes:
> Another interresting feature of Czech collation is:
> H < "CH" < I

Oh my, that *is* interesting (using the word in the spirit of the
ancient Chinese curse, "May you live in interesting times"...)

> So what happens with "WHERE name like 'Czec%`" ?

The wrong thing, without doubt.

Would it help any to strip off one character of the given pattern?
That is, if the pattern is LIKE 'foo%', forget about the last 'o'
and generate bounds like 'fo' <= x <= 'fp' ?
        regards, tom lane


Re: Sigh, LIKE indexing is *still* broken in foreign locales

From
Roland Roberts
Date:
-----BEGIN PGP SIGNED MESSAGE-----

>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
   Tom> Erich Stamberger <eberger@gewi.kfunigraz.ac.at> writes:   >> Another interresting feature of Czech collation
is:H < "CH" < I
 
   >> So what happens with "WHERE name like 'Czec%`" ?
   Tom> The wrong thing, without doubt.

I think this is a conceptual problem: some languages use more than one
character to designate a single "letter" in their orthography.  Even
more "familiar" languagues such as Spanish do this: ch, ll, rr.  Maybe
they've changed since I was taught, but my Spanish dictionaries are
put "ciudad" before "chico" (in fact, the latter is in a separate
section after "C" and before "D").

Sorry if this doesn't help find a solution, though....

roland
- --            PGP Key ID: 66 BC 3B CD
Roland B. Roberts, PhD                    Unix Software Solutions
roberts@panix.com                      76-15 113th Street, Apt 3B
rbroberts@acm.org                          Forest Hills, NY 11375

-----BEGIN PGP SIGNATURE-----
Version: 2.6.3a
Charset: noconv
Comment: Processed by Mailcrypt 3.5.4, an Emacs/PGP interface

iQCVAwUBOUEQFOoW38lmvDvNAQED5wQAnwW1BpbCeghJWXh/gTMezRfDfLq2eSPu
4+H0X6Xjm2Gbegrv1SiWlSCjD1yR/FYIgTQpbMCubXlAtadu5tc4auLGNsOdSNIC
3lT2Bc+En5BxT06dsX33QApU1B4GP73KFxSJu2+ngRMKExTFEojM7qzLlKfGzDeG
onaP+RPbtJc=
=Dlnw
-----END PGP SIGNATURE-----


Re: Sigh, LIKE indexing is *still* broken in foreign locales

From
Giles Lean
Date:
On Fri, 9 Jun 2000 02:57:56 +0200 (CEST)  Peter Eisentraut wrote:

> I think that's pretty safe. Shorter strings are always "less than" longer
> ones.

Nope: many-to-one collation elements break this too.

Regards,

Giles