Thread: BUG #1268: Two different Unicode chars are treated as equal in a query

BUG #1268: Two different Unicode chars are treated as equal in a query

From
"PostgreSQL Bugs List"
Date:
The following bug has been logged online:

Bug reference:      1268
Logged by:          Kent Tong

Email address:      kent@cpttm.org.mo

PostgreSQL version: 7.4.5

Operating system:   RedHat 9

Description:        Two different Unicode chars are treated as equal in a
query

Details:

Steps:
1. Create a test database: "createdb -E Unicode -U postgres testdb".
2. Create a test table: "create table testtable (id varchar(100) primary
key);".
3. With JDBC, insert a record whose id contains unicode: "insert into
testtable values(<a unicode char whose code is 0x4e8c>);".
4. With JDBC, try to retrieve a record whose id contains a different unicde:
"select from testtable where id=<a unicode char whose code is 0x4e94>;". It
should not find any record but it finds the record created in step 3.

Here is the JUnit test case:

public class PgSQLTest extends TestCase {
    private Connection conn;
    protected void setUp() throws Exception {
        conn = makeConnection();
    }
    protected void tearDown() throws Exception {
        conn.close();
    }
    public void testChinese() throws Exception {
        deleteAll();
        insertRow();
        PreparedStatement st =
            conn.prepareStatement("select * from testtable where id=?");
        try {
            st.setString(1, "\u4e94");
            ResultSet rs = st.executeQuery();
            assertFalse(rs.next());
        } finally {
            st.close();
        }
    }

    private void insertRow() throws SQLException {
        PreparedStatement st =
            conn.prepareStatement("insert into testtable values(?)");
        st.setString(1, "\u4e8c");
        st.executeUpdate();
        st.close();
    }
    private void deleteAll() throws SQLException {
        PreparedStatement st = conn.prepareStatement("delete from testtable");
        st.executeUpdate();
        st.close();
    }
    private Connection makeConnection()
        throws ClassNotFoundException, SQLException {
        Class.forName("org.postgresql.Driver");
        Properties properties = new Properties();
        properties.put("user", "postgres");
        properties.put("password", "");
        return DriverManager.getConnection(
            "jdbc:postgresql://localhost/testdb",
            properties);
    }
}

Re: BUG #1268: Two different Unicode chars are treated as equal in a query

From
Tom Lane
Date:
"PostgreSQL Bugs List" <pgsql-bugs@postgresql.org> writes:
> Description:        Two different Unicode chars are treated as equal in a
> query

This would be a matter to take up with the maintainer of your locale
(which you didn't mention, but in any case it's a locale bug).  We
just do what strcoll() tells us.

Note that it's possible this is a configuration error and not an
outright bug.  Check to make sure that the locale you initdb'd
under is actually designed to work with UTF-8 data.

            regards, tom lane

Re: BUG #1268: Two different Unicode chars are treated as

From
Kent Tong
Date:
Tom Lane wrote:

> "PostgreSQL Bugs List" <pgsql-bugs@postgresql.org> writes:
>
>>Description:        Two different Unicode chars are treated as equal in a
>>query
>
>
> This would be a matter to take up with the maintainer of your locale
> (which you didn't mention, but in any case it's a locale bug).  We
> just do what strcoll() tells us.

Thanks for the quick reply. The system locale is zh_TW.Big5. However,
I've tried setting it to "C" but the test case still fails.

In order to check if it's a locale bug, I've written a C program:

#include <locale.h>
#include <stdio.h>
#include <string.h>

int main() {
         char *s1 = "\xe4\xba\x8c";
         char *s2 = "\xe4\xba\x94";
         setlocale(LC_ALL, "en.UTF-8");
         //setlocale(LC_ALL, "zh.Big5"); //doesn't make any difference
         printf("%d\n", strcoll(s1, s2));
         return 0;
}

and compiled it and run it on that computer. It prints -1.
It means that strcoll is working.

 > Note that it's possible this is a configuration error and not an
 > outright bug.  Check to make sure that the locale you initdb'd
 > under is actually designed to work with UTF-8 data.

Does it matter? The encoding provided to initdb is just
a default for the databases to be created in the future.
When I used createdb, I did specify "-E unicode".

--
Kent Tong, Msc, MCSE, SCJP, CCSA, Delphi Certified
Manager of IT Dept, CPTTM
Authorized training for Borland, Cisco, Microsoft, Oracle, RedFlag & RedHat

Re: BUG #1268: Two different Unicode chars are treated as

From
Tom Lane
Date:
Kent Tong <kent@cpttm.org.mo> writes:
> Does it matter? The encoding provided to initdb is just
> a default for the databases to be created in the future.

Yes it does, and you missed the point.  I said *locale*, not *encoding*.
The LC_COLLATE and LC_CTYPE settings that prevail during initdb are
fixed and not alterable without re-initdb.  (I agree that this sucks,
but that's how it is for now...)

Your test program doesn't prove a lot unless you are sure it's executing
under the same locale settings as the postmaster is running in.

            regards, tom lane

Re: BUG #1268: Two different Unicode chars are treated as

From
Kent Tong
Date:
Tom Lane wrote:
> Yes it does, and you missed the point.  I said *locale*, not *encoding*.
> The LC_COLLATE and LC_CTYPE settings that prevail during initdb are
> fixed and not alterable without re-initdb.  (I agree that this sucks,
> but that's how it is for now...)

You're right. After using:

   initdb --locale zh_TW.utf8 /var/lib/pgsql/data

then it works fine!

Thanks again and sorry about any inconvenience.

--
Kent Tong, Msc, MCSE, SCJP, CCSA, Delphi Certified
Manager of IT Dept, CPTTM
Authorized training for Borland, Cisco, Microsoft, Oracle, RedFlag & RedHat