Thread: BUG #15892: URGENT: Using an ICU collation in a primary key column breaks ILIKE query

BUG #15892: URGENT: Using an ICU collation in a primary key column breaks ILIKE query

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      15892
Logged by:          James Inform
Email address:      james.inform@pharmapp.de
PostgreSQL version: 11.4
Operating system:   Linux Mint 19 Tara / Mac OS X 10.13.6
Description:

-- This is tested under Linux Mint 19 Tara with uconv v2.1  ICU 60.2
-- and Mac OS X 10.13.6 with uconv v2.1  ICU 58.2
-- PostgreSQL Version 11.4
-- PostgreSQL was built with: ./configure '--prefix=/home/myself/pg114'
'--with-systemd' '--with-icu' '--with-libxml' '--with-libxslt' '--with-perl'
'--with-python' '--disable-rpath' '--with-extra-version= ICU'

-- STEPS to reproduce
-- First you need PostgreSQL 11.4 built with ICU support 

create database testdb;

\c testdb;

-- Just create a simple table with one column
create table icutest(data text not null collate "de-x-icu" primary key);

-- Insert a record with uppercase string
insert into icutest values ('MYTEST');

-- This is not giving a match
select * from icutest where data ilike 'mytest';

-- These two queries give the exspected result
select * from icutest where lower(data) ilike 'mytest';
select * from icutest where data ilike ('mytest' collate "default");

-- If you do the same with a non primary source column, then everything
works like exspected
-- I'm not an Expert, but maybe the index behind the primary key is based on
the wrong collation


    PG Bug reporting form wrote:

> -- Just create a simple table with one column
> create table icutest(data text not null collate "de-x-icu" primary key);
>
> -- Insert a record with uppercase string
> insert into icutest values ('MYTEST');
>
> -- This is not giving a match
> select * from icutest where data ilike 'mytest';

This also happens on v10 and on the master branch.

The bug seems to come from a mistake in like_support.c:


/*
 * Check whether char is a letter (and, hence, subject to case-folding)
 *
 * In multibyte character sets or with ICU, we can't use isalpha, and it does
 * not seem worth trying to convert to wchar_t to use iswalpha.  Instead,
just
 * assume any multibyte char is potentially case-varying.
 */
static int
pattern_char_isalpha(char c, bool is_multibyte,
             pg_locale_t locale, bool locale_is_c)
{
    if (locale_is_c)
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    else if (is_multibyte && IS_HIGHBIT_SET(c))
    return true;
    else if (locale && locale->provider == COLLPROVIDER_ICU)
    return IS_HIGHBIT_SET(c) ? true : false;


With an ICU locale, this returns false for all characters in 'mytest'.

I think this eventually leads the caller to incorrectly believe that it
can optimize the test into an exact match (data='mytest'), given
there are otherwise no wildcards in the pattern.

On fixing the bug, if we make this function returns true for all
characters under an ICU locale, it appears to work, but we're loosing an
opportunity to optimize for some patterns.
If OTOH we wanted to use an ICU call like u_isalpha(), to be closer
to what's done with libc,  we'd need to pass a UChar32 argument,
not a char, and since we're in a char-oriented context, I don't see how
to do that.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



The weird thing is, that this behavior only occures when used on a field field a primary key:

If you do:

create database testdb;

\c testdb;

-- Just create a simple table with one column
create table icutest(data text not null collate "de-x-icu" primary key, data2 text collate "de-x-icu");

-- Insert a record with uppercase string
insert into icutest values ('MYTEST','MYTEST');

-- This is not giving a match
select * from icutest where data ilike 'mytest';

-- BUT THIS GIVES A MATCH:

select * from icutest where data2 ilike 'mytest';

-- So it seems to be especially related to the scenario where a primary key / index exists.


> On July 4, 2019 at 12:36 PM Daniel Verite <daniel@manitou-mail.org> wrote:
>
>
>  PG Bug reporting form wrote:
> > -- Just create a simple table with one column
> > create table icutest(data text not null collate "de-x-icu" primary key);
> >
> > -- Insert a record with uppercase string
> > insert into icutest values ('MYTEST');
> >
> > -- This is not giving a match
> > select * from icutest where data ilike 'mytest';
>
> This also happens on v10 and on the master branch.
>
> The bug seems to come from a mistake in like_support.c:
>
>
> /* * Check whether char is a letter (and, hence, subject to case-folding)
>  * * In multibyte character sets or with ICU, we can't use isalpha, and it does * not seem worth trying to convert to
wchar_tto use iswalpha. Instead, 
> just * assume any multibyte char is potentially case-varying.
>  */
> static int
> pattern_char_isalpha(char c, bool is_multibyte,
>  pg_locale_t locale, bool locale_is_c)
> {
>  if (locale_is_c)
>  return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
>  else if (is_multibyte && IS_HIGHBIT_SET(c))
>  return true;
>  else if (locale && locale->provider == COLLPROVIDER_ICU)
>  return IS_HIGHBIT_SET(c) ? true : false;
>
>
> With an ICU locale, this returns false for all characters in 'mytest'.
>
> I think this eventually leads the caller to incorrectly believe that it
> can optimize the test into an exact match (data='mytest'), given
> there are otherwise no wildcards in the pattern.
>
> On fixing the bug, if we make this function returns true for all
> characters under an ICU locale, it appears to work, but we're loosing an
> opportunity to optimize for some patterns.
> If OTOH we wanted to use an ICU call like u_isalpha(), to be closer
> to what's done with libc, we'd need to pass a UChar32 argument,
> not a char, and since we're in a char-oriented context, I don't see how
> to do that.
>
>
> Best regards,
> --
> Daniel Vérité
> PostgreSQL-powered mailer: http://www.manitou-mail.org
> Twitter: @DanielVerite



    James Inform wrote:

> -- This is not giving a match
> select * from icutest where data ilike 'mytest';
>
> -- BUT THIS GIVES A MATCH:
>
> select * from icutest where data2 ilike 'mytest';
>
> -- So it seems to be especially related to the scenario where a primary key
> / index exists.
>

Yes. It is because of the index that the code checks if the ILIKE
can be evaluated with an index lookup. Otherwise it doesn't.

If you feel like recompiling with a temporary fix against v11.4 to do
your own tests, please try the attached patch.

Here's the result for me:

* Unpatched version:

postgres=# explain analyze select * from icutest where data ilike 'mytest';
                             QUERY PLAN

-------------------------------------------------------------------------------------------------------
---------------------
 Index Only Scan using icutest_pkey on icutest    (cost=0.15..8.17 rows=1
width=32) (actual time=0.013..0
.013 rows=0 loops=1)
   Index Cond: (data = 'mytest'::text)
   Filter: (data ~~* 'mytest'::text)
   Heap Fetches: 0
 Planning Time: 0.593 ms
 Execution Time: 0.035 ms



* Patched version:

postgres=# explain analyze select * from icutest where data ilike 'mytest';
                        QUERY PLAN
---------------------------------------------------------------------------------------------------
 Seq Scan on icutest  (cost=0.00..27.00 rows=1 width=32) (actual
time=0.122..0.123 rows=1 loops=1)
   Filter: (data ~~* 'mytest'::text)
 Planning Time: 0.081 ms
 Execution Time: 0.144 ms


Notice how the patched version ignores the index and correctly finds the
row versus the unpatched version.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

Attachment
On 04.07.2019 19:50, Daniel Verite wrote:
> Yes. It is because of the index that the code checks if the ILIKE
> can be evaluated with an index lookup. Otherwise it doesn't.
> 
> If you feel like recompiling with a temporary fix against v11.4 to do
> your own tests, please try the attached patch.

Also this patch works. I thinks we need to use some ICU function to 
check whether a char is a letter.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment
Arthur Zakirov <a.zakirov@postgrespro.ru> writes:
> On 04.07.2019 19:50, Daniel Verite wrote:
>> Yes. It is because of the index that the code checks if the ILIKE
>> can be evaluated with an index lookup. Otherwise it doesn't.
>> If you feel like recompiling with a temporary fix against v11.4 to do
>> your own tests, please try the attached patch.

> Also this patch works. I thinks we need to use some ICU function to 
> check whether a char is a letter.

This topic seems to have slipped through the cracks, which is too bad
because we surely could have gotten it fixed in time for last week's
releases.  Oh well.

I think Daniel's patch leaves more on the table than it needs to,
and I don't like Arthur's patch because it's assuming more than it
should about whether it's okay to pass a "char" to u_isalpha().
I propose that we do things as attached, instead.

(I also added a regression test case.)

            regards, tom lane

diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c
index 26e0634..c8fec78 100644
--- a/src/backend/utils/adt/like_support.c
+++ b/src/backend/utils/adt/like_support.c
@@ -1437,8 +1437,9 @@ regex_selectivity(const char *patt, int pattlen, bool case_insensitive,
  * Check whether char is a letter (and, hence, subject to case-folding)
  *
  * In multibyte character sets or with ICU, we can't use isalpha, and it does
- * not seem worth trying to convert to wchar_t to use iswalpha.  Instead, just
- * assume any multibyte char is potentially case-varying.
+ * not seem worth trying to convert to wchar_t to use iswalpha or u_isalpha.
+ * Instead, just assume any non-ASCII char is potentially case-varying, and
+ * hard-wire knowledge of which ASCII chars are letters.
  */
 static int
 pattern_char_isalpha(char c, bool is_multibyte,
@@ -1449,7 +1450,8 @@ pattern_char_isalpha(char c, bool is_multibyte,
     else if (is_multibyte && IS_HIGHBIT_SET(c))
         return true;
     else if (locale && locale->provider == COLLPROVIDER_ICU)
-        return IS_HIGHBIT_SET(c) ? true : false;
+        return IS_HIGHBIT_SET(c) ||
+            (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
 #ifdef HAVE_LOCALE_T
     else if (locale && locale->provider == COLLPROVIDER_LIBC)
         return isalpha_l((unsigned char) c, locale->info.lt);
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 51262e0..50d1e6e 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -983,6 +983,38 @@ SELECT relname, pg_get_indexdef(oid) FROM pg_class WHERE relname LIKE 'collate_t
  collate_test1_idx4 | CREATE INDEX collate_test1_idx4 ON collate_tests.collate_test1 USING btree (((b || 'foo'::text))
COLLATE"POSIX") 
 (4 rows)

+set enable_seqscan = off;
+explain (costs off)
+select * from collate_test1 where b ilike 'abc';
+          QUERY PLAN
+-------------------------------
+ Seq Scan on collate_test1
+   Filter: (b ~~* 'abc'::text)
+(2 rows)
+
+select * from collate_test1 where b ilike 'abc';
+ a |  b
+---+-----
+ 1 | abc
+ 4 | ABC
+(2 rows)
+
+explain (costs off)
+select * from collate_test1 where b ilike 'ABC';
+          QUERY PLAN
+-------------------------------
+ Seq Scan on collate_test1
+   Filter: (b ~~* 'ABC'::text)
+(2 rows)
+
+select * from collate_test1 where b ilike 'ABC';
+ a |  b
+---+-----
+ 1 | abc
+ 4 | ABC
+(2 rows)
+
+reset enable_seqscan;
 -- schema manipulation commands
 CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 46999fb..5ed6360 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -341,6 +341,15 @@ CREATE INDEX collate_test1_idx6 ON collate_test1 ((a COLLATE "C")); -- fail

 SELECT relname, pg_get_indexdef(oid) FROM pg_class WHERE relname LIKE 'collate_test%_idx%' ORDER BY 1;

+set enable_seqscan = off;
+explain (costs off)
+select * from collate_test1 where b ilike 'abc';
+select * from collate_test1 where b ilike 'abc';
+explain (costs off)
+select * from collate_test1 where b ilike 'ABC';
+select * from collate_test1 where b ilike 'ABC';
+reset enable_seqscan;
+

 -- schema manipulation commands


On 12.08.2019 00:32, Tom Lane wrote:
> This topic seems to have slipped through the cracks, which is too bad
> because we surely could have gotten it fixed in time for last week's
> releases.  Oh well.

It is true. Also July commitfest had started by that time and it wasn't 
possible to add a new entry.

> I think Daniel's patch leaves more on the table than it needs to,
> and I don't like Arthur's patch because it's assuming more than it
> should about whether it's okay to pass a "char" to u_isalpha().
> I propose that we do things as attached, instead.

Yeah, u_isalpha() wasn't such good idea here. BTW Daniel already 
mentioned it above.

With the patch regression tests pass for me and James test case works 
fine as expected.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Arthur Zakirov <a.zakirov@postgrespro.ru> writes:
> On 12.08.2019 00:32, Tom Lane wrote:
>> I think Daniel's patch leaves more on the table than it needs to,
>> and I don't like Arthur's patch because it's assuming more than it
>> should about whether it's okay to pass a "char" to u_isalpha().
>> I propose that we do things as attached, instead.

> Yeah, u_isalpha() wasn't such good idea here. BTW Daniel already 
> mentioned it above.
> With the patch regression tests pass for me and James test case works 
> fine as expected.

Thanks for double-checking!  Pushed now.

            regards, tom lane