Thread: BUG #1859: 3-octet private use UTF8 chars reported as identical

BUG #1859: 3-octet private use UTF8 chars reported as identical

From
"Nathan Culwell-Kanarek"
Date:
The following bug has been logged online:

Bug reference:      1859
Logged by:          Nathan Culwell-Kanarek
Email address:      nculwell@wisc.edu
PostgreSQL version: 8.0.3
Operating system:   Fedora Core Linux 4
Description:        3-octet private use UTF8 chars reported as identical
Details:

-------------------------------------
Bug description:
-------------------------------------

Our project is using a specialized Unicode character set that uses some
characters from the private use area, displayed with the Aborginal Serif
font (the font can be freely downloaded at http://www.languagegeek.com/), to
represent the Menominee language. We've run into a problem, which is that
the PostgreSQL backend is interpreting 4 of the private use characters as
being equivalent. A SELECT query using the equality character indicates that
they are equal, and when there is a field with a uniqueness constraint then
it cannot have 2 values that differ only by a minimal pair of the private
use characters.

Note: The private use characters in question are 3 octets long in UTF-8
encoding, which is what we're using.


-------------------------------------
 Program to reproduce the bug:
-------------------------------------

#!env perl

# Create a test database.

system "createdb -E unicode bugdb";

# Define constants for the private use Unicode
# characters. These are defined by explicitly
# encoding the UTF-8 octets that compose the
# characters we need. From here on the characters
# are referred to through these variables.

$UTF8_AE =     "\xEF\x98\xB6"; # AE
$UTF8_ae =     "\xEF\x98\xB7"; # ae
$UTF8_AE_mac = "\xEF\x98\xB8"; # AE-macron
$UTF8_ae_mac = "\xEF\x98\xB9"; # ae-macron

sub exec_sql {
   my $sql = shift;
   system 'psql -d bugdb -e -c "' . $sql . '"';
}

# Make SELECT queries to compare strings that should be
# distinct; the expected result for each of these
# queries is FALSE.

exec_sql("SELECT '$UTF8_AE' = '$UTF8_AE_mac'");
exec_sql("SELECT '$UTF8_AE' = '$UTF8_ae'");
exec_sql("SELECT '$UTF8_AE' = '$UTF8_ae_mac'");
exec_sql("SELECT '$UTF8_ae' = '$UTF8_AE_mac'");
exec_sql("SELECT '$UTF8_ae' = '$UTF8_ae_mac'");
exec_sql("SELECT '$UTF8_AE_mac' = '$UTF8_ae_mac'");

# Make a table with a VARCHAR field as the primary key
# (thus the field has UNIQUE and NOT NULL constrants).
# Add 2 distinct strings, it should be valid to have
# both values in the field in different records.

exec_sql("CREATE TABLE mytable (word VARCHAR PRIMARY KEY)");
exec_sql("INSERT INTO mytable (word) VALUES ('n${UTF8_ae}q')");
exec_sql("INSERT INTO mytable (word) VALUES ('n${UTF8_ae_mac}q')");

system "dropdb bugdb";


-------------------------------------
The output from the above program:
-------------------------------------

$ perl makebug.pl
CREATE DATABASE
SELECT '' = ''
 ?column?
----------
 t
(1 row)

SELECT '' = ''
 ?column?
----------
 t
(1 row)

SELECT '' = ''
 ?column?
----------
 t
(1 row)

SELECT '' = ''
 ?column?
----------
 t
(1 row)

SELECT '' = ''
 ?column?
----------
 t
(1 row)

SELECT '' = ''
 ?column?
----------
 t
(1 row)

CREATE TABLE mytable (word VARCHAR PRIMARY KEY)
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
"mytable_pkey" for table "mytable"
CREATE TABLE
INSERT INTO mytable (word) VALUES ('nq')
INSERT 361753 1
INSERT INTO mytable (word) VALUES ('nq')
ERROR:  duplicate key violates unique constraint "mytable_pkey"
DROP DATABASE


-------------------------------
 Platform information:
-------------------------------

The system is Fedora Core Linux 4 (kernel 2.6.12-1.1398_FC4, glibc 2.3.5-10)
running PostgreSQL 8.0.3 on an Intel processor

(Pentium 4 I believe) with 512MB of RAM.

$ uname -a
Linux ling.wisc.edu 2.6.12-1.1398_FC4 #1 Fri Jul 15 00:52:32 EDT 2005 i686
i686 i386 GNU/Linux
$ rpm -qa | grep postgresql
postgresql-server-8.0.3-1
postgresql-devel-8.0.3-1
postgresql-odbc-08.00.0100-1
postgresql-python-8.0.3-1
postgresql-libs-8.0.3-1
postgresql-tcl-8.0.3-1
postgresql-8.0.3-1
$ postmaster --version
postmaster (PostgreSQL) 8.0.3
[nate@ling:~]
$ psql --version
psql (PostgreSQL) 8.0.3
contains support for command-line editing
$ rpm -q glibc
glibc-2.3.5-10

Re: BUG #1859: 3-octet private use UTF8 chars reported as identical

From
Tom Lane
Date:
"Nathan Culwell-Kanarek" <nculwell@wisc.edu> writes:
> Description:        3-octet private use UTF8 chars reported as identical

> We've run into a problem, which is that
> the PostgreSQL backend is interpreting 4 of the private use characters as
> being equivalent.

Your beef is actually with strcoll(); we just believe whatever that
function tells us when comparing strings.  Check to see that you've
initdb'd in a utf8-based locale --- if not, that might be the source
of the problem.  (IMHO, strcoll ought not claim distinct byte sequences
are equal in any case, but it seems some locale definitions will do
that.)  If no luck, take it up with Red Hat's glibc folk.

            regards, tom lane