The following bug has been logged online:
Bug reference: 1859
Logged by: Nathan Culwell-Kanarek
Email address: nculwell@wisc.edu
PostgreSQL version: 8.0.3
Operating system: Fedora Core Linux 4
Description: 3-octet private use UTF8 chars reported as identical
Details:
-------------------------------------
Bug description:
-------------------------------------
Our project is using a specialized Unicode character set that uses some
characters from the private use area, displayed with the Aborginal Serif
font (the font can be freely downloaded at http://www.languagegeek.com/), to
represent the Menominee language. We've run into a problem, which is that
the PostgreSQL backend is interpreting 4 of the private use characters as
being equivalent. A SELECT query using the equality character indicates that
they are equal, and when there is a field with a uniqueness constraint then
it cannot have 2 values that differ only by a minimal pair of the private
use characters.
Note: The private use characters in question are 3 octets long in UTF-8
encoding, which is what we're using.
-------------------------------------
Program to reproduce the bug:
-------------------------------------
#!env perl
# Create a test database.
system "createdb -E unicode bugdb";
# Define constants for the private use Unicode
# characters. These are defined by explicitly
# encoding the UTF-8 octets that compose the
# characters we need. From here on the characters
# are referred to through these variables.
$UTF8_AE = "\xEF\x98\xB6"; # AE
$UTF8_ae = "\xEF\x98\xB7"; # ae
$UTF8_AE_mac = "\xEF\x98\xB8"; # AE-macron
$UTF8_ae_mac = "\xEF\x98\xB9"; # ae-macron
sub exec_sql {
my $sql = shift;
system 'psql -d bugdb -e -c "' . $sql . '"';
}
# Make SELECT queries to compare strings that should be
# distinct; the expected result for each of these
# queries is FALSE.
exec_sql("SELECT '$UTF8_AE' = '$UTF8_AE_mac'");
exec_sql("SELECT '$UTF8_AE' = '$UTF8_ae'");
exec_sql("SELECT '$UTF8_AE' = '$UTF8_ae_mac'");
exec_sql("SELECT '$UTF8_ae' = '$UTF8_AE_mac'");
exec_sql("SELECT '$UTF8_ae' = '$UTF8_ae_mac'");
exec_sql("SELECT '$UTF8_AE_mac' = '$UTF8_ae_mac'");
# Make a table with a VARCHAR field as the primary key
# (thus the field has UNIQUE and NOT NULL constrants).
# Add 2 distinct strings, it should be valid to have
# both values in the field in different records.
exec_sql("CREATE TABLE mytable (word VARCHAR PRIMARY KEY)");
exec_sql("INSERT INTO mytable (word) VALUES ('n${UTF8_ae}q')");
exec_sql("INSERT INTO mytable (word) VALUES ('n${UTF8_ae_mac}q')");
system "dropdb bugdb";
-------------------------------------
The output from the above program:
-------------------------------------
$ perl makebug.pl
CREATE DATABASE
SELECT 'ï¶' = 'ï¸'
?column?
----------
t
(1 row)
SELECT 'ï¶' = 'ï·'
?column?
----------
t
(1 row)
SELECT 'ï¶' = 'ï¹'
?column?
----------
t
(1 row)
SELECT 'ï·' = 'ï¸'
?column?
----------
t
(1 row)
SELECT 'ï·' = 'ï¹'
?column?
----------
t
(1 row)
SELECT 'ï¸' = 'ï¹'
?column?
----------
t
(1 row)
CREATE TABLE mytable (word VARCHAR PRIMARY KEY)
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index
"mytable_pkey" for table "mytable"
CREATE TABLE
INSERT INTO mytable (word) VALUES ('nï·q')
INSERT 361753 1
INSERT INTO mytable (word) VALUES ('nï¹q')
ERROR: duplicate key violates unique constraint "mytable_pkey"
DROP DATABASE
-------------------------------
Platform information:
-------------------------------
The system is Fedora Core Linux 4 (kernel 2.6.12-1.1398_FC4, glibc 2.3.5-10)
running PostgreSQL 8.0.3 on an Intel processor
(Pentium 4 I believe) with 512MB of RAM.
$ uname -a
Linux ling.wisc.edu 2.6.12-1.1398_FC4 #1 Fri Jul 15 00:52:32 EDT 2005 i686
i686 i386 GNU/Linux
$ rpm -qa | grep postgresql
postgresql-server-8.0.3-1
postgresql-devel-8.0.3-1
postgresql-odbc-08.00.0100-1
postgresql-python-8.0.3-1
postgresql-libs-8.0.3-1
postgresql-tcl-8.0.3-1
postgresql-8.0.3-1
$ postmaster --version
postmaster (PostgreSQL) 8.0.3
[nate@ling:~]
$ psql --version
psql (PostgreSQL) 8.0.3
contains support for command-line editing
$ rpm -q glibc
glibc-2.3.5-10