Thread: Bug in index scans with Locale support enabled

Bug in index scans with Locale support enabled

From
Barry Lind
Date:
In researching a problem I have uncovered the following bug in index
scans when Locale support is enabled.

Given a 7.0.3 postgres installation built with Locale support enabled
and a default US RedHat 7.0 Linux installation (meaning that the LANG
environment variable is set to en_US) to enable the US english locale
and

Given the following table and index structure with the following data:

create table test (test_col text);
create index test_index on test (test_col);
insert into test values ('abc.xyz');
insert into test values ('abcxyz');
insert into test values ('abc/xyz');

If you run the query:

select * from test where test_col >= 'abc.';

One would normally expect to only get one record returned, but instead
all records are returned.

The reason for this is that in the en_US locale all non-alphanumeric
characters are ignored when doing string comparisons.  So the data above
gets treated as:
abc.xyz = abcxyz = abc/xyz  (as the non-alphanumeric characters of '.'
and '/' are ignored).  This implys that the above query will then return
all rows as the constant 'abc.' is the same as 'abc' for comparison
purposes and all rows are >= 'abc'.

Note that if you use a different locale for example en_UK, you will get
different results as this locale does not ignore the . and / in the
comparison.

Now the real problem comes in when either the like or regex operators
are used in a sql statement.  Consider the following sql:

select * from text where test_col like 'abc/%';

This query should return one row, the row for 'abc/xyz'.  However if the
above query is executed via an index scan it will return the wrong
number of rows (0 in this case).

Why is this?  Well the query plan created for the above like expression
looks like the following:
select * from text where test_col >= 'abc/' and test_col < 'abc0';

In order to use the index the like has been changed into a '>=' and a
'<' for the constant prefix ('abc/') and the constant prefix with the
last character incremented by one ('/abc0')  (0 is the next character
after / in ASCII).

Given what was shown above about how the en_US locale does comparisons
we know that the non-alphanumeric characters are ignored.  So the query
essentially becomes:
select * from text where test_col >= 'abc' and test_col < 'abc0';
and the data it is comparing against is 'abcxyz' in all cases (once the
.'s an /'s are removed).  Therefore since 'abcxyz' > 'abc0', no rows are
returned.

Over the last couple of months that I have been on the postgres mail
lists there have been a few people who reported that queries of the form
"like '/aaa/bbb/%' don't work.  From the above information I have
determined that such queries don't work if:
a) database is built with Locale support enabled (--enable-locale)
b) the database is running with locale en_US
c) the column the like is being performed on is indexed
d) the query execution plan uses the above index

(Discovering the exact set of circumstances for how to reproduce this
has driven me crazy for a while now).

The current implementation for converting the like into an index scan
doesn't work with Locale support enabled and the en_US locale as shown
above.

thanks,
--Barry

PS. my test case:

drop table test;
create table test (test_col text);
create index test_index on test (test_col);
insert into test values ('abc.xyz');
insert into test values ('abcxyz');
insert into test values ('abc/xyz');
explain select * from test where test_col like 'abc/%';
select * from test where test_col like 'abc/%';


when run against postgres 7.0.3 with locale support enabled (used the
standard RPMs on postgresql.org for RedHat) with LANG=en_US:

barry=# drop table test;
DROP
barry=# create table test (test_col text);
CREATE
barry=# create index test_index on test (test_col);
CREATE
barry=# insert into test values ('abc.xyz');
INSERT 227611 1
barry=# insert into test values ('abcxyz');
INSERT 227612 1
barry=# insert into test values ('abc/xyz');
INSERT 227613 1
barry=# explain select * from test where test_col like 'abc/%';
NOTICE:  QUERY PLAN:

Index Scan using test_index on test  (cost=0.00..8.14 rows=10 width=12)

EXPLAIN
barry=# select * from test where test_col like 'abc/%';
 test_col
----------
(0 rows)

barry=#



when run against postgres 7.0.3 with locale support enabled (used the
standard RPMs on postgresql.org) with LANG=en_UK:

barry=# drop table test;
DROP
barry=# create table test (test_col text);
CREATE
barry=# create index test_index on test (test_col);
CREATE
barry=# insert into test values ('abc.xyz');
INSERT 227628 1
barry=# insert into test values ('abcxyz');
INSERT 227629 1
barry=# insert into test values ('abc/xyz');
INSERT 227630 1
barry=# explain select * from test where test_col like 'abc/%';
NOTICE:  QUERY PLAN:

Index Scan using test_index on test  (cost=0.00..8.14 rows=10 width=12)

EXPLAIN
barry=# select * from test where test_col like 'abc/%';
 test_col
----------
 abc/xyz
(1 row)

barry=#

Note the second query (under en_UK) returns the correct rows, but the
first query (under en_US) returned the wrong number of rows.


PPS.  Another way to work around the problem is to turn off locale
specific collation using the environment variable LC_COLLATE and setting
it to the value C.

Re: [HACKERS] Bug in index scans with Locale support enabled

From
Tom Lane
Date:
Barry Lind <barry@xythos.com> writes:
> Now the real problem comes in when either the like or regex operators
> are used in a sql statement.

Right.  As of 7.1beta1 we are dealing with this by suppressing LIKE/regex
index optimization in all locales other than "C".  That's a pretty crude
answer but it seems the only reliable one :-(.

> Over the last couple of months that I have been on the postgres mail
> lists there have been a few people who reported that queries of the form
> "like '/aaa/bbb/%' don't work.  From the above information I have
> determined that such queries don't work if:
> a) database is built with Locale support enabled (--enable-locale)
> b) the database is running with locale en_US
> c) the column the like is being performed on is indexed
> d) the query execution plan uses the above index

en_US is not the only dangerous locale, unfortunately.

I suspect that there are some non-C locales in which we could still do
the optimization safely.  The trick is to know which ones have collation
rules that are affected by character combinations, multi-pass ordering
rules, etc.  Do you have any info on that?

BTW, thanks for the very clear explanation --- we can point people at
this next time the question comes up, which it does regularly...

            regards, tom lane

SV: Bug in index scans with Locale support enabled

From
"Jarmo Paavilainen"
Date:
Hi,

...
> In researching a problem I have uncovered the following bug in index
> scans when Locale support is enabled.
...
> environment variable is set to en_US) to enable the US english locale
...
> create table test (test_col text);
> create index test_index on test (test_col);
> insert into test values ('abc.xyz');
> insert into test values ('abcxyz');
> insert into test values ('abc/xyz');
>
> If you run the query:
>
> select * from test where test_col >= 'abc.';
>
> One would normally expect to only get one record returned, but instead
> all records are returned.

I would expect all to be returned (maybe not "abc/..."). Because noice
should be sorted first. ie. '.' is less than '0' and 'x' (and maybe '/').

...
> The reason for this is that in the en_US locale all non-alphanumeric
> characters are ignored when doing string comparisons.  So the data above

...or... *I think* they are sorted first. If that is correct in your locale,
I do not know.

...
> Note that if you use a different locale for example en_UK, you will get

Thats odd, I would expect en_UK and en_US to sort the same way (same
charset).

...
> select * from text where test_col like 'abc/%';
>
> This query should return one row, the row for 'abc/xyz'.  However if the
> above query is executed via an index scan it will return the wrong
> number of rows (0 in this case).

ehh index scan? test_col >= 'abc/' or test_col >= 'abc/%' ????
The first one should return all rows but the one with '.', while the second
should return 0 rows. If the first one returns zero rows, then its a bug.

If you meant what the optimizer does with LIKE, well *I think* such
optimazion is asking for trouble (compare strings with anything else than =
and != are, well hard to predict).

...
> "like '/aaa/bbb/%' don't work.  From the above information I have
> determined that such queries don't work if:
> a) database is built with Locale support enabled (--enable-locale)

Actually they should not work without '--enable-locale', or then Im wrong.

> b) the database is running with locale en_US
> c) the column the like is being performed on is indexed

Dangerous LIKE optimation.

...
> The current implementation for converting the like into an index scan
> doesn't work with Locale support enabled and the en_US locale as shown

Hmm. If memory serves its dropped in the later builds (no like optimation).

// Jarmo


Re: [HACKERS] Bug in index scans with Locale support enabled

From
Peter Eisentraut
Date:
Barry Lind writes:

> The reason for this is that in the en_US locale all non-alphanumeric
> characters are ignored when doing string comparisons.  So the data above
> gets treated as:
> abc.xyz = abcxyz = abc/xyz  (as the non-alphanumeric characters of '.'
> and '/' are ignored).  This implys that the above query will then return
> all rows as the constant 'abc.' is the same as 'abc' for comparison
> purposes and all rows are >= 'abc'.
>
> Note that if you use a different locale for example en_UK, you will get
> different results as this locale does not ignore the . and / in the
> comparison.

The reason for that is that en_UK is not a valid locale name and will get
treated as the default "C" locale.  If you use en_GB you will get the same
behaviour as for en_US.

--
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/