Sigh, LIKE indexing is *still* broken in foreign locales - Mailing list pgsql-bugs

From Tom Lane
Subject Sigh, LIKE indexing is *still* broken in foreign locales
Date
Msg-id 14045.960430926@sss.pgh.pa.us
Whole thread Raw
In response to LIKE bug  (Moucha Václav <MouchaV@Radiomobil.cz>)
Responses Re: [HACKERS] Sigh, LIKE indexing is *still* broken in foreign locales  (Giles Lean <giles@nemeton.com.au>)
Re: [HACKERS] Sigh, LIKE indexing is *still* broken in foreign locales  (Peter Eisentraut <peter_e@gmx.net>)
List pgsql-bugs
Moucha Václav <MouchaV@Radiomobil.cz> writes:
> 1. Compilation
>    ./configure --enable-locale    # not needed for RPMS precompiled binaries 

> 2. Starting postmaster
>    export LC_CTYPE=cs_CZ
>    export LC_COLLATE=cs_CZ        # this setting is important for the
> bug result
>    postmaster -S -D /home/pgsql/data -o '-Fe'    

> 3. SQL steps
>    create table test (name text);
>    insert into test values ('�');    # the first char is E1 from LATIN 2
> coding
>    insert into test values ('�b');
>    create index test_index on test (name);
>    set cpu_tuple_cost=1;        # force backend to use index
> scanning
>    select * from test where name like '�%';

> BUG: Only 1 line is selected with '�' only instead of both lines.

The problem here is that given the search pattern '\341%', the planner
generates index limit conditionsname >= '\341' AND name < '\342';

Apparently, in CZ locale it is true that '\341' is less than '\342',
but it does not follow from that that all strings starting with '\341'
are less than '\342'.  In fact '\341b' is considered greater than '\342'.

Since '\341' and '\342' are two different accented forms of 'a'
(if I'm looking at the right character set), this is perhaps not so
improbable as all that.  Evidently the collation rule is that different
accent forms sort the same unless the strings would otherwise be
considered equal, in which case an ordering is assigned to them.

So, the rule we thought we had for generating index bounds falls flat,
and we're back to the same old question: given a proposed prefix string,
how can we generate bounds that are certain to be considered <= and >=
all strings starting with that prefix?

I am now thinking that maybe we should search for a string that compares
greater than "fooz" when the prefix is "foo" --- that is, append a 'z'
to the prefix string.  But I wouldn't be surprised if that fails too
in some locales.

I'm also wondering if the left-hand inequality ('foo' <= any string
beginning with 'foo') might fail in some locales ... we haven't seen
it reported but who knows ...
        regards, tom lane


pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: Bug?
Next
From: Giles Lean
Date:
Subject: Re: [HACKERS] Sigh, LIKE indexing is *still* broken in foreign locales