BUG #1859: 3-octet private use UTF8 chars reported as identical - Mailing list pgsql-bugs
From | Nathan Culwell-Kanarek |
---|---|
Subject | BUG #1859: 3-octet private use UTF8 chars reported as identical |
Date | |
Msg-id | 20050901232049.3E621F14A7@svr2.postgresql.org Whole thread Raw |
Responses |
Re: BUG #1859: 3-octet private use UTF8 chars reported as identical
|
List | pgsql-bugs |
The following bug has been logged online: Bug reference: 1859 Logged by: Nathan Culwell-Kanarek Email address: nculwell@wisc.edu PostgreSQL version: 8.0.3 Operating system: Fedora Core Linux 4 Description: 3-octet private use UTF8 chars reported as identical Details: ------------------------------------- Bug description: ------------------------------------- Our project is using a specialized Unicode character set that uses some characters from the private use area, displayed with the Aborginal Serif font (the font can be freely downloaded at http://www.languagegeek.com/), to represent the Menominee language. We've run into a problem, which is that the PostgreSQL backend is interpreting 4 of the private use characters as being equivalent. A SELECT query using the equality character indicates that they are equal, and when there is a field with a uniqueness constraint then it cannot have 2 values that differ only by a minimal pair of the private use characters. Note: The private use characters in question are 3 octets long in UTF-8 encoding, which is what we're using. ------------------------------------- Program to reproduce the bug: ------------------------------------- #!env perl # Create a test database. system "createdb -E unicode bugdb"; # Define constants for the private use Unicode # characters. These are defined by explicitly # encoding the UTF-8 octets that compose the # characters we need. From here on the characters # are referred to through these variables. $UTF8_AE = "\xEF\x98\xB6"; # AE $UTF8_ae = "\xEF\x98\xB7"; # ae $UTF8_AE_mac = "\xEF\x98\xB8"; # AE-macron $UTF8_ae_mac = "\xEF\x98\xB9"; # ae-macron sub exec_sql { my $sql = shift; system 'psql -d bugdb -e -c "' . $sql . '"'; } # Make SELECT queries to compare strings that should be # distinct; the expected result for each of these # queries is FALSE. exec_sql("SELECT '$UTF8_AE' = '$UTF8_AE_mac'"); exec_sql("SELECT '$UTF8_AE' = '$UTF8_ae'"); exec_sql("SELECT '$UTF8_AE' = '$UTF8_ae_mac'"); exec_sql("SELECT '$UTF8_ae' = '$UTF8_AE_mac'"); exec_sql("SELECT '$UTF8_ae' = '$UTF8_ae_mac'"); exec_sql("SELECT '$UTF8_AE_mac' = '$UTF8_ae_mac'"); # Make a table with a VARCHAR field as the primary key # (thus the field has UNIQUE and NOT NULL constrants). # Add 2 distinct strings, it should be valid to have # both values in the field in different records. exec_sql("CREATE TABLE mytable (word VARCHAR PRIMARY KEY)"); exec_sql("INSERT INTO mytable (word) VALUES ('n${UTF8_ae}q')"); exec_sql("INSERT INTO mytable (word) VALUES ('n${UTF8_ae_mac}q')"); system "dropdb bugdb"; ------------------------------------- The output from the above program: ------------------------------------- $ perl makebug.pl CREATE DATABASE SELECT 'ï¶' = 'ï¸' ?column? ---------- t (1 row) SELECT 'ï¶' = 'ï·' ?column? ---------- t (1 row) SELECT 'ï¶' = 'ï¹' ?column? ---------- t (1 row) SELECT 'ï·' = 'ï¸' ?column? ---------- t (1 row) SELECT 'ï·' = 'ï¹' ?column? ---------- t (1 row) SELECT 'ï¸' = 'ï¹' ?column? ---------- t (1 row) CREATE TABLE mytable (word VARCHAR PRIMARY KEY) NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "mytable_pkey" for table "mytable" CREATE TABLE INSERT INTO mytable (word) VALUES ('nï·q') INSERT 361753 1 INSERT INTO mytable (word) VALUES ('nï¹q') ERROR: duplicate key violates unique constraint "mytable_pkey" DROP DATABASE ------------------------------- Platform information: ------------------------------- The system is Fedora Core Linux 4 (kernel 2.6.12-1.1398_FC4, glibc 2.3.5-10) running PostgreSQL 8.0.3 on an Intel processor (Pentium 4 I believe) with 512MB of RAM. $ uname -a Linux ling.wisc.edu 2.6.12-1.1398_FC4 #1 Fri Jul 15 00:52:32 EDT 2005 i686 i686 i386 GNU/Linux $ rpm -qa | grep postgresql postgresql-server-8.0.3-1 postgresql-devel-8.0.3-1 postgresql-odbc-08.00.0100-1 postgresql-python-8.0.3-1 postgresql-libs-8.0.3-1 postgresql-tcl-8.0.3-1 postgresql-8.0.3-1 $ postmaster --version postmaster (PostgreSQL) 8.0.3 [nate@ling:~] $ psql --version psql (PostgreSQL) 8.0.3 contains support for command-line editing $ rpm -q glibc glibc-2.3.5-10
pgsql-bugs by date: