Thread: Duplicate primary key when primary key is varchar

Duplicate primary key when primary key is varchar

From
Tính Trương Xuân
Date:

Hello,
I have a very interesting situation with my database schema (attached in this message). But first you have to import this database script into your database:
psql -h localhost -p 5432 -U postgres -f /path/to/the/file.sql your_database

Note: this schema is virus/malware/rootkit free, you can check it for yourself.

In the State table, you will notice that there are two rows with the primary key is AL (it's capital A and capital L). But when you select:
select * from state where code='AL'
There is only one row returned. You may guest that there are some spaces in the primary key? Nope!
You can verify by this query:
select 'a'||code||'b' from state
You will see that there are two rows with the data as aALb. No space in between!

The same thing happened for a row in the Family with the code 1006 (it's one-zero-zero-six)
There are a couple of other rows in other table as well, but not all of them.

You may ask, how did I come up with this schema?
Basically, I am developing an reporting web application, there is an external system exporting data files (in UTF-16 LE). Then I convert those files into UTF-8 and use Hibernate to insert the data.
I know that there is something wrong with the encoding but there is some wrong with PostgreSQL as well.
Here is the script I used to create the database
CREATE DATABASE bug_demo
  WITH OWNER = postgres
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'en_US.utf8'
       LC_CTYPE = 'en_US.utf8'
       CONNECTION LIMIT = -1;
I'm using PostgreSQL 8.4.4 on ArchLinux 32bit kernel 2.6.35.7

Any input will be appreciated

Tinh
Attachment

Re: Duplicate primary key when primary key is varchar

From
Dave Cramer
Date:
2010/10/6 Tính Trương Xuân <tinh.truong@evolus.vn>:
>
> Hello,
> I have a very interesting situation with my database schema (attached in
> this message). But first you have to import this database script into your
> database:
> psql -h localhost -p 5432 -U postgres -f /path/to/the/file.sql your_database
>
> Note: this schema is virus/malware/rootkit free, you can check it for
> yourself.
>
> In the State table, you will notice that there are two rows with the primary
> key is AL (it's capital A and capital L). But when you select:
> select * from state where code='AL'
> There is only one row returned. You may guest that there are some spaces in
> the primary key? Nope!
> You can verify by this query:
> select 'a'||code||'b' from state
> You will see that there are two rows with the data as aALb. No space in
> between!
>
> The same thing happened for a row in the Family with the code 1006 (it's
> one-zero-zero-six)
> There are a couple of other rows in other table as well, but not all of
> them.
>
> You may ask, how did I come up with this schema?
> Basically, I am developing an reporting web application, there is an
> external system exporting data files (in UTF-16 LE). Then I convert those
> files into UTF-8 and use Hibernate to insert the data.
> I know that there is something wrong with the encoding but there is some
> wrong with PostgreSQL as well.
> Here is the script I used to create the database
> CREATE DATABASE bug_demo
>   WITH OWNER = postgres
>        ENCODING = 'UTF8'
>        TABLESPACE = pg_default
>        LC_COLLATE = 'en_US.utf8'
>        LC_CTYPE = 'en_US.utf8'
>        CONNECTION LIMIT = -1;
> I'm using PostgreSQL 8.4.4 on ArchLinux 32bit kernel 2.6.35.7
>
> Any input will be appreciated
>
> Tinh


Tinh,

I looked at the data, did not import it, but both those instances the
copy command AL and 1006 are the first lines and they both have a
space in front of them.

Dave

Re: Duplicate primary key when primary key is varchar

From
Oliver Jowett
Date:
Tính Trương Xuân wrote:

> In the State table, you will notice that there are two rows with the
> primary key is AL (it's capital A and capital L). But when you select:
> select * from state where code='AL'
> There is only one row returned. You may guest that there are some spaces
> in the primary key? Nope!
> You can verify by this query:
> select 'a'||code||'b' from state
> You will see that there are two rows with the data as aALb. No space in
> between!

You do have a space in one of the rows - it's a ZERO WIDTH NO-BREAK
SPACE (U+FEFF), also used as a byte order mark in UTF-16. Here's the
relevant bit courtesy of hexdump -C:

> 00014da0  3a 20 70 6f 73 74 67 72  65 73 0a 2d 2d 0a 0a 43  |: postgres.--..C|
> 00014db0  4f 50 59 20 73 74 61 74  65 20 28 63 6f 64 65 2c  |OPY state (code,|
> 00014dc0  20 6e 61 6d 65 2c 20 73  74 61 74 75 73 29 20 46  | name, status) F|
> 00014dd0  52 4f 4d 20 73 74 64 69  6e 3b 0a ef bb bf 41 4c  |ROM stdin;....AL|
> 00014de0  09 41 4c 41 42 41 4d 41  09 41 43 54 49 56 45 0a  |.ALABAMA.ACTIVE.|
> 00014df0  41 4b 09 41 4c 41 53 4b  41 09 41 43 54 49 56 45  |AK.ALASKA.ACTIVE|
> 00014e00  0a 41 5a 09 41 52 49 5a  4f 4e 41 09 41 43 54 49  |.AZ.ARIZONA.ACTI|
> 00014e10  56 45 0a 41 52 09 41 52  4b 41 4e 53 41 53 09 41  |VE.AR.ARKANSAS.A|

Note the "0a ef bb bf 41 4c" - that's a linefeed, followed by the UTF-8
encoding of U+FEFF, followed by "AL"

"less -U" renders it like this:

> COPY state (code, name, status) FROM stdin;
> <U+FEFF>AL^IALABAMA^IACTIVE
> AK^IALASKA^IACTIVE

I guess that your UTF-16 input starts with a BOM, and you're just
passing it through unchanged. Perhaps you are explicitly claiming to
your parser that the input is UTF-16LE, which shouldn't start with a
BOM, so your parser is interpreting it as a no-break space rather than
filtering it out?

Wikipedia says "For the IANA registered charsets UTF-16BE and UTF-16LE,
a byte order mark must not be used because the names of these character
sets already determine the byte order. If encountered, an initial U+FEFF
must be interpreted as a (deprecated) "zero width no-break space".'

Oliver