Thread: BUG #1268: Two different Unicode chars are treated as equal in a query
BUG #1268: Two different Unicode chars are treated as equal in a query
From
"PostgreSQL Bugs List"
Date:
The following bug has been logged online: Bug reference: 1268 Logged by: Kent Tong Email address: kent@cpttm.org.mo PostgreSQL version: 7.4.5 Operating system: RedHat 9 Description: Two different Unicode chars are treated as equal in a query Details: Steps: 1. Create a test database: "createdb -E Unicode -U postgres testdb". 2. Create a test table: "create table testtable (id varchar(100) primary key);". 3. With JDBC, insert a record whose id contains unicode: "insert into testtable values(<a unicode char whose code is 0x4e8c>);". 4. With JDBC, try to retrieve a record whose id contains a different unicde: "select from testtable where id=<a unicode char whose code is 0x4e94>;". It should not find any record but it finds the record created in step 3. Here is the JUnit test case: public class PgSQLTest extends TestCase { private Connection conn; protected void setUp() throws Exception { conn = makeConnection(); } protected void tearDown() throws Exception { conn.close(); } public void testChinese() throws Exception { deleteAll(); insertRow(); PreparedStatement st = conn.prepareStatement("select * from testtable where id=?"); try { st.setString(1, "\u4e94"); ResultSet rs = st.executeQuery(); assertFalse(rs.next()); } finally { st.close(); } } private void insertRow() throws SQLException { PreparedStatement st = conn.prepareStatement("insert into testtable values(?)"); st.setString(1, "\u4e8c"); st.executeUpdate(); st.close(); } private void deleteAll() throws SQLException { PreparedStatement st = conn.prepareStatement("delete from testtable"); st.executeUpdate(); st.close(); } private Connection makeConnection() throws ClassNotFoundException, SQLException { Class.forName("org.postgresql.Driver"); Properties properties = new Properties(); properties.put("user", "postgres"); properties.put("password", ""); return DriverManager.getConnection( "jdbc:postgresql://localhost/testdb", properties); } }
"PostgreSQL Bugs List" <pgsql-bugs@postgresql.org> writes: > Description: Two different Unicode chars are treated as equal in a > query This would be a matter to take up with the maintainer of your locale (which you didn't mention, but in any case it's a locale bug). We just do what strcoll() tells us. Note that it's possible this is a configuration error and not an outright bug. Check to make sure that the locale you initdb'd under is actually designed to work with UTF-8 data. regards, tom lane
Tom Lane wrote: > "PostgreSQL Bugs List" <pgsql-bugs@postgresql.org> writes: > >>Description: Two different Unicode chars are treated as equal in a >>query > > > This would be a matter to take up with the maintainer of your locale > (which you didn't mention, but in any case it's a locale bug). We > just do what strcoll() tells us. Thanks for the quick reply. The system locale is zh_TW.Big5. However, I've tried setting it to "C" but the test case still fails. In order to check if it's a locale bug, I've written a C program: #include <locale.h> #include <stdio.h> #include <string.h> int main() { char *s1 = "\xe4\xba\x8c"; char *s2 = "\xe4\xba\x94"; setlocale(LC_ALL, "en.UTF-8"); //setlocale(LC_ALL, "zh.Big5"); //doesn't make any difference printf("%d\n", strcoll(s1, s2)); return 0; } and compiled it and run it on that computer. It prints -1. It means that strcoll is working. > Note that it's possible this is a configuration error and not an > outright bug. Check to make sure that the locale you initdb'd > under is actually designed to work with UTF-8 data. Does it matter? The encoding provided to initdb is just a default for the databases to be created in the future. When I used createdb, I did specify "-E unicode". -- Kent Tong, Msc, MCSE, SCJP, CCSA, Delphi Certified Manager of IT Dept, CPTTM Authorized training for Borland, Cisco, Microsoft, Oracle, RedFlag & RedHat
Kent Tong <kent@cpttm.org.mo> writes: > Does it matter? The encoding provided to initdb is just > a default for the databases to be created in the future. Yes it does, and you missed the point. I said *locale*, not *encoding*. The LC_COLLATE and LC_CTYPE settings that prevail during initdb are fixed and not alterable without re-initdb. (I agree that this sucks, but that's how it is for now...) Your test program doesn't prove a lot unless you are sure it's executing under the same locale settings as the postmaster is running in. regards, tom lane
Tom Lane wrote: > Yes it does, and you missed the point. I said *locale*, not *encoding*. > The LC_COLLATE and LC_CTYPE settings that prevail during initdb are > fixed and not alterable without re-initdb. (I agree that this sucks, > but that's how it is for now...) You're right. After using: initdb --locale zh_TW.utf8 /var/lib/pgsql/data then it works fine! Thanks again and sorry about any inconvenience. -- Kent Tong, Msc, MCSE, SCJP, CCSA, Delphi Certified Manager of IT Dept, CPTTM Authorized training for Borland, Cisco, Microsoft, Oracle, RedFlag & RedHat