BUG #1859: 3-octet private use UTF8 chars reported as identical - Mailing list pgsql-bugs

From Nathan Culwell-Kanarek
Subject BUG #1859: 3-octet private use UTF8 chars reported as identical
Date
Msg-id 20050901232049.3E621F14A7@svr2.postgresql.org
Whole thread Raw
Responses Re: BUG #1859: 3-octet private use UTF8 chars reported as identical  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
The following bug has been logged online:

Bug reference:      1859
Logged by:          Nathan Culwell-Kanarek
Email address:      nculwell@wisc.edu
PostgreSQL version: 8.0.3
Operating system:   Fedora Core Linux 4
Description:        3-octet private use UTF8 chars reported as identical
Details:

-------------------------------------
Bug description:
-------------------------------------

Our project is using a specialized Unicode character set that uses some
characters from the private use area, displayed with the Aborginal Serif
font (the font can be freely downloaded at http://www.languagegeek.com/), to
represent the Menominee language. We've run into a problem, which is that
the PostgreSQL backend is interpreting 4 of the private use characters as
being equivalent. A SELECT query using the equality character indicates that
they are equal, and when there is a field with a uniqueness constraint then
it cannot have 2 values that differ only by a minimal pair of the private
use characters.

Note: The private use characters in question are 3 octets long in UTF-8
encoding, which is what we're using.


-------------------------------------
 Program to reproduce the bug:
-------------------------------------

#!env perl

# Create a test database.

system "createdb -E unicode bugdb";

# Define constants for the private use Unicode
# characters. These are defined by explicitly
# encoding the UTF-8 octets that compose the
# characters we need. From here on the characters
# are referred to through these variables.

$UTF8_AE =     "\xEF\x98\xB6"; # AE
$UTF8_ae =     "\xEF\x98\xB7"; # ae
$UTF8_AE_mac = "\xEF\x98\xB8"; # AE-macron
$UTF8_ae_mac = "\xEF\x98\xB9"; # ae-macron

sub exec_sql {
   my $sql = shift;
   system 'psql -d bugdb -e -c "' . $sql . '"';
}

# Make SELECT queries to compare strings that should be
# distinct; the expected result for each of these
# queries is FALSE.

exec_sql("SELECT '$UTF8_AE' = '$UTF8_AE_mac'");
exec_sql("SELECT '$UTF8_AE' = '$UTF8_ae'");
exec_sql("SELECT '$UTF8_AE' = '$UTF8_ae_mac'");
exec_sql("SELECT '$UTF8_ae' = '$UTF8_AE_mac'");
exec_sql("SELECT '$UTF8_ae' = '$UTF8_ae_mac'");
exec_sql("SELECT '$UTF8_AE_mac' = '$UTF8_ae_mac'");

# Make a table with a VARCHAR field as the primary key
# (thus the field has UNIQUE and NOT NULL constrants).
# Add 2 distinct strings, it should be valid to have
# both values in the field in different records.

exec_sql("CREATE TABLE mytable (word VARCHAR PRIMARY KEY)");
exec_sql("INSERT INTO mytable (word) VALUES ('n${UTF8_ae}q')");
exec_sql("INSERT INTO mytable (word) VALUES ('n${UTF8_ae_mac}q')");

system "dropdb bugdb";


-------------------------------------
The output from the above program:
-------------------------------------

$ perl makebug.pl
CREATE DATABASE
SELECT '' = ''
 ?column?
----------
 t
(1 row)

SELECT '' = ''
 ?column?
----------
 t
(1 row)

SELECT '' = ''
 ?column?
----------
 t
(1 row)

SELECT '' = ''
 ?column?
----------
 t
(1 row)

SELECT '' = ''
 ?column?
----------
 t
(1 row)

SELECT '' = ''
 ?column?
----------
 t
(1 row)

CREATE TABLE mytable (word VARCHAR PRIMARY KEY)
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
"mytable_pkey" for table "mytable"
CREATE TABLE
INSERT INTO mytable (word) VALUES ('nq')
INSERT 361753 1
INSERT INTO mytable (word) VALUES ('nq')
ERROR:  duplicate key violates unique constraint "mytable_pkey"
DROP DATABASE


-------------------------------
 Platform information:
-------------------------------

The system is Fedora Core Linux 4 (kernel 2.6.12-1.1398_FC4, glibc 2.3.5-10)
running PostgreSQL 8.0.3 on an Intel processor

(Pentium 4 I believe) with 512MB of RAM.

$ uname -a
Linux ling.wisc.edu 2.6.12-1.1398_FC4 #1 Fri Jul 15 00:52:32 EDT 2005 i686
i686 i386 GNU/Linux
$ rpm -qa | grep postgresql
postgresql-server-8.0.3-1
postgresql-devel-8.0.3-1
postgresql-odbc-08.00.0100-1
postgresql-python-8.0.3-1
postgresql-libs-8.0.3-1
postgresql-tcl-8.0.3-1
postgresql-8.0.3-1
$ postmaster --version
postmaster (PostgreSQL) 8.0.3
[nate@ling:~]
$ psql --version
psql (PostgreSQL) 8.0.3
contains support for command-line editing
$ rpm -q glibc
glibc-2.3.5-10

pgsql-bugs by date:

Previous
From: Andreas Seltenreich
Date:
Subject: Re: Sorting Problem in UNICODE/german
Next
From: "D.J. Kniep"
Date:
Subject: BUG #1858: setting search path in select doesn't (always) work