Thread: Bug in index scans with Locale support enabled
In researching a problem I have uncovered the following bug in index scans when Locale support is enabled. Given a 7.0.3 postgres installation built with Locale support enabled and a default US RedHat 7.0 Linux installation (meaning that the LANG environment variable is set to en_US) to enable the US english locale and Given the following table and index structure with the following data: create table test (test_col text); create index test_index on test (test_col); insert into test values ('abc.xyz'); insert into test values ('abcxyz'); insert into test values ('abc/xyz'); If you run the query: select * from test where test_col >= 'abc.'; One would normally expect to only get one record returned, but instead all records are returned. The reason for this is that in the en_US locale all non-alphanumeric characters are ignored when doing string comparisons. So the data above gets treated as: abc.xyz = abcxyz = abc/xyz (as the non-alphanumeric characters of '.' and '/' are ignored). This implys that the above query will then return all rows as the constant 'abc.' is the same as 'abc' for comparison purposes and all rows are >= 'abc'. Note that if you use a different locale for example en_UK, you will get different results as this locale does not ignore the . and / in the comparison. Now the real problem comes in when either the like or regex operators are used in a sql statement. Consider the following sql: select * from text where test_col like 'abc/%'; This query should return one row, the row for 'abc/xyz'. However if the above query is executed via an index scan it will return the wrong number of rows (0 in this case). Why is this? Well the query plan created for the above like expression looks like the following: select * from text where test_col >= 'abc/' and test_col < 'abc0'; In order to use the index the like has been changed into a '>=' and a '<' for the constant prefix ('abc/') and the constant prefix with the last character incremented by one ('/abc0') (0 is the next character after / in ASCII). Given what was shown above about how the en_US locale does comparisons we know that the non-alphanumeric characters are ignored. So the query essentially becomes: select * from text where test_col >= 'abc' and test_col < 'abc0'; and the data it is comparing against is 'abcxyz' in all cases (once the .'s an /'s are removed). Therefore since 'abcxyz' > 'abc0', no rows are returned. Over the last couple of months that I have been on the postgres mail lists there have been a few people who reported that queries of the form "like '/aaa/bbb/%' don't work. From the above information I have determined that such queries don't work if: a) database is built with Locale support enabled (--enable-locale) b) the database is running with locale en_US c) the column the like is being performed on is indexed d) the query execution plan uses the above index (Discovering the exact set of circumstances for how to reproduce this has driven me crazy for a while now). The current implementation for converting the like into an index scan doesn't work with Locale support enabled and the en_US locale as shown above. thanks, --Barry PS. my test case: drop table test; create table test (test_col text); create index test_index on test (test_col); insert into test values ('abc.xyz'); insert into test values ('abcxyz'); insert into test values ('abc/xyz'); explain select * from test where test_col like 'abc/%'; select * from test where test_col like 'abc/%'; when run against postgres 7.0.3 with locale support enabled (used the standard RPMs on postgresql.org for RedHat) with LANG=en_US: barry=# drop table test; DROP barry=# create table test (test_col text); CREATE barry=# create index test_index on test (test_col); CREATE barry=# insert into test values ('abc.xyz'); INSERT 227611 1 barry=# insert into test values ('abcxyz'); INSERT 227612 1 barry=# insert into test values ('abc/xyz'); INSERT 227613 1 barry=# explain select * from test where test_col like 'abc/%'; NOTICE: QUERY PLAN: Index Scan using test_index on test (cost=0.00..8.14 rows=10 width=12) EXPLAIN barry=# select * from test where test_col like 'abc/%'; test_col ---------- (0 rows) barry=# when run against postgres 7.0.3 with locale support enabled (used the standard RPMs on postgresql.org) with LANG=en_UK: barry=# drop table test; DROP barry=# create table test (test_col text); CREATE barry=# create index test_index on test (test_col); CREATE barry=# insert into test values ('abc.xyz'); INSERT 227628 1 barry=# insert into test values ('abcxyz'); INSERT 227629 1 barry=# insert into test values ('abc/xyz'); INSERT 227630 1 barry=# explain select * from test where test_col like 'abc/%'; NOTICE: QUERY PLAN: Index Scan using test_index on test (cost=0.00..8.14 rows=10 width=12) EXPLAIN barry=# select * from test where test_col like 'abc/%'; test_col ---------- abc/xyz (1 row) barry=# Note the second query (under en_UK) returns the correct rows, but the first query (under en_US) returned the wrong number of rows. PPS. Another way to work around the problem is to turn off locale specific collation using the environment variable LC_COLLATE and setting it to the value C.
Barry Lind <barry@xythos.com> writes: > Now the real problem comes in when either the like or regex operators > are used in a sql statement. Right. As of 7.1beta1 we are dealing with this by suppressing LIKE/regex index optimization in all locales other than "C". That's a pretty crude answer but it seems the only reliable one :-(. > Over the last couple of months that I have been on the postgres mail > lists there have been a few people who reported that queries of the form > "like '/aaa/bbb/%' don't work. From the above information I have > determined that such queries don't work if: > a) database is built with Locale support enabled (--enable-locale) > b) the database is running with locale en_US > c) the column the like is being performed on is indexed > d) the query execution plan uses the above index en_US is not the only dangerous locale, unfortunately. I suspect that there are some non-C locales in which we could still do the optimization safely. The trick is to know which ones have collation rules that are affected by character combinations, multi-pass ordering rules, etc. Do you have any info on that? BTW, thanks for the very clear explanation --- we can point people at this next time the question comes up, which it does regularly... regards, tom lane
Hi, ... > In researching a problem I have uncovered the following bug in index > scans when Locale support is enabled. ... > environment variable is set to en_US) to enable the US english locale ... > create table test (test_col text); > create index test_index on test (test_col); > insert into test values ('abc.xyz'); > insert into test values ('abcxyz'); > insert into test values ('abc/xyz'); > > If you run the query: > > select * from test where test_col >= 'abc.'; > > One would normally expect to only get one record returned, but instead > all records are returned. I would expect all to be returned (maybe not "abc/..."). Because noice should be sorted first. ie. '.' is less than '0' and 'x' (and maybe '/'). ... > The reason for this is that in the en_US locale all non-alphanumeric > characters are ignored when doing string comparisons. So the data above ...or... *I think* they are sorted first. If that is correct in your locale, I do not know. ... > Note that if you use a different locale for example en_UK, you will get Thats odd, I would expect en_UK and en_US to sort the same way (same charset). ... > select * from text where test_col like 'abc/%'; > > This query should return one row, the row for 'abc/xyz'. However if the > above query is executed via an index scan it will return the wrong > number of rows (0 in this case). ehh index scan? test_col >= 'abc/' or test_col >= 'abc/%' ???? The first one should return all rows but the one with '.', while the second should return 0 rows. If the first one returns zero rows, then its a bug. If you meant what the optimizer does with LIKE, well *I think* such optimazion is asking for trouble (compare strings with anything else than = and != are, well hard to predict). ... > "like '/aaa/bbb/%' don't work. From the above information I have > determined that such queries don't work if: > a) database is built with Locale support enabled (--enable-locale) Actually they should not work without '--enable-locale', or then Im wrong. > b) the database is running with locale en_US > c) the column the like is being performed on is indexed Dangerous LIKE optimation. ... > The current implementation for converting the like into an index scan > doesn't work with Locale support enabled and the en_US locale as shown Hmm. If memory serves its dropped in the later builds (no like optimation). // Jarmo
Barry Lind writes: > The reason for this is that in the en_US locale all non-alphanumeric > characters are ignored when doing string comparisons. So the data above > gets treated as: > abc.xyz = abcxyz = abc/xyz (as the non-alphanumeric characters of '.' > and '/' are ignored). This implys that the above query will then return > all rows as the constant 'abc.' is the same as 'abc' for comparison > purposes and all rows are >= 'abc'. > > Note that if you use a different locale for example en_UK, you will get > different results as this locale does not ignore the . and / in the > comparison. The reason for that is that en_UK is not a valid locale name and will get treated as the default "C" locale. If you use en_GB you will get the same behaviour as for en_US. -- Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/