Thread: LIKE bug

LIKE bug

From
Moucha Václav
Date:
============================================================================                       POSTGRESQL BUG
REPORTTEMPLATE 
============================================================================

Your name            :    Vaclav Moucha
Your email address    :    mouchav@radiomobil.cz

System Configuration
--------------------- Architecture (example: Intel Pentium)          :    Intel
Pentium III/350
 Operating System (example: Linux 2.0.26 ELF)     :    Linux 2.2.14
 PostgreSQL version (example: PostgreSQL-6.5.1):    postgresql-7.0.1
 Compiler used (example:  gcc 2.8.0)        :    gcc version
egcs-2.91.66
19990314 (egcs-1.1.2 release)

Please enter a FULL description of your problem:
------------------------------------------------
I will find out some bug with like operator if I use locales.

Steps to involve a bug result:

1. Compilation  ./configure --enable-locale    # not needed for RPMS precompiled binaries

2. Starting postmaster  export LC_CTYPE=cs_CZ  export LC_COLLATE=cs_CZ        # this setting is important for the
bug result  postmaster -S -D /home/pgsql/data -o '-Fe'

3. SQL steps  create table test (name text);  insert into test values ('á');    # the first char is E1 from LATIN 2
coding  insert into test values ('áb');  create index test_index on test (name);  set cpu_tuple_cost=1;        # force
backendto use index 
scanning  select * from test where name like 'á%';

BUG: Only 1 line is selected with 'á' only instead of both lines.


Please describe a way to repeat the problem.   Please try to provide a
concise reproducible example, if at all possible:
----------------------------------------------------------------------



If you know how this problem might be fixed, list the solution below:
---------------------------------------------------------------------


Sigh, LIKE indexing is *still* broken in foreign locales

From
Tom Lane
Date:
Moucha Václav <MouchaV@Radiomobil.cz> writes:
> 1. Compilation
>    ./configure --enable-locale    # not needed for RPMS precompiled binaries 

> 2. Starting postmaster
>    export LC_CTYPE=cs_CZ
>    export LC_COLLATE=cs_CZ        # this setting is important for the
> bug result
>    postmaster -S -D /home/pgsql/data -o '-Fe'    

> 3. SQL steps
>    create table test (name text);
>    insert into test values ('�');    # the first char is E1 from LATIN 2
> coding
>    insert into test values ('�b');
>    create index test_index on test (name);
>    set cpu_tuple_cost=1;        # force backend to use index
> scanning
>    select * from test where name like '�%';

> BUG: Only 1 line is selected with '�' only instead of both lines.

The problem here is that given the search pattern '\341%', the planner
generates index limit conditionsname >= '\341' AND name < '\342';

Apparently, in CZ locale it is true that '\341' is less than '\342',
but it does not follow from that that all strings starting with '\341'
are less than '\342'.  In fact '\341b' is considered greater than '\342'.

Since '\341' and '\342' are two different accented forms of 'a'
(if I'm looking at the right character set), this is perhaps not so
improbable as all that.  Evidently the collation rule is that different
accent forms sort the same unless the strings would otherwise be
considered equal, in which case an ordering is assigned to them.

So, the rule we thought we had for generating index bounds falls flat,
and we're back to the same old question: given a proposed prefix string,
how can we generate bounds that are certain to be considered <= and >=
all strings starting with that prefix?

I am now thinking that maybe we should search for a string that compares
greater than "fooz" when the prefix is "foo" --- that is, append a 'z'
to the prefix string.  But I wouldn't be surprised if that fails too
in some locales.

I'm also wondering if the left-hand inequality ('foo' <= any string
beginning with 'foo') might fail in some locales ... we haven't seen
it reported but who knows ...
        regards, tom lane


Re: [HACKERS] Sigh, LIKE indexing is *still* broken in foreign locales

From
Giles Lean
Date:
On Wed, 07 Jun 2000 22:22:06 -0400  Tom Lane wrote:

> Since '\341' and '\342' are two different accented forms of 'a'
> (if I'm looking at the right character set), this is perhaps not so
> improbable as all that.  Evidently the collation rule is that different
> accent forms sort the same unless the strings would otherwise be
> considered equal, in which case an ordering is assigned to them.

I thought that was common, but while I've worked on
internationalisation issues sometimes I'm no linguist.

> So, the rule we thought we had for generating index bounds falls flat,
> and we're back to the same old question: given a proposed prefix string,
> how can we generate bounds that are certain to be considered <= and >=
> all strings starting with that prefix?

To confess ignorance, why does PostgreSQL need to generate such
bounds?  Complete string comparisons with a locale aware function such
as strcoll() are safe.  Using less than a full string is tricky
indeed, and I'm not sure is possible in general although it might be.

Other problematic cases are likely to include one-to-two collations (�
in German, for example) and two-to-one collations (the reverse, but
I've forgotten my example.  Anyone?)

Then there are wide characters, including some encodings that are
stateful.

Regards,

Giles




Re: [HACKERS] Sigh, LIKE indexing is *still* broken in foreign locales

From
"Matthias Urlichs"
Date:
Hi,

Giles Lean:
> > So, the rule we thought we had for generating index bounds falls flat,
> > and we're back to the same old question: given a proposed prefix string,
> > how can we generate bounds that are certain to be considered <= and >=
> > all strings starting with that prefix?
> 
> To confess ignorance, why does PostgreSQL need to generate such
> bounds?

To find the position in the index where it should start scanning.

> Then there are wide characters, including some encodings that are
> stateful.

Personally, I am in the "store everything on the server in Unicode"
camp. Let the parser convert everything to Unicode on the way in, 
and vice versa.

There's no sense, IMHO, in burdening the SQL core with multiple
character encoding schemes.

-- 
Matthias Urlichs  |  noris network GmbH   |   smurf@noris.de  |  ICQ: 20193661
The quote was selected randomly. Really.       |        http://smurf.noris.de/
-- 
I loathe that low vice curiosity.                                       -- Lord Byron


Re: [HACKERS] Sigh, LIKE indexing is *still* broken in foreign locales

From
Peter Eisentraut
Date:
Tom Lane writes:

> Evidently the collation rule is that different accent forms sort the
> same unless the strings would otherwise be considered equal, in which
> case an ordering is assigned to them.

Yes, that's fairly common.

> I am now thinking that maybe we should search for a string that compares
> greater than "fooz" when the prefix is "foo" --- that is, append a 'z'
> to the prefix string.  But I wouldn't be surprised if that fails too
> in some locales.

It most definitely will. sv_SE, no_NO, and hr_HR are the early candidates.
And there's also nothing that says that you can only use LIKE on letters,
Latin letters at that.

The only thing you can really do in this direction is to append the very
last character in the complete collation sequence, if there's a way to
find that out. If there isn't, it might be worth hard-coding a few popular
ones.

> I'm also wondering if the left-hand inequality ('foo' <= any string
> beginning with 'foo') might fail in some locales ... we haven't seen
> it reported but who knows ...

I think that's pretty safe. Shorter strings are always "less than" longer
ones.


-- 
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden



Re: [HACKERS] Sigh, LIKE indexing is *still* broken in foreign locales

From
Giles Lean
Date:
On Fri, 9 Jun 2000 02:57:56 +0200 (CEST)  Peter Eisentraut wrote:

> I think that's pretty safe. Shorter strings are always "less than" longer
> ones.

Nope: many-to-one collation elements break this too.

Regards,

Giles