Thread: Duplicate primary key when primary key is varchar
Hello,
I have a very interesting situation with my database schema (attached in this message). But first you have to import this database script into your database:
psql -h localhost -p 5432 -U postgres -f /path/to/the/file.sql your_database
Note: this schema is virus/malware/rootkit free, you can check it for yourself.
In the State table, you will notice that there are two rows with the primary key is AL (it's capital A and capital L). But when you select:
select * from state where code='AL'
There is only one row returned. You may guest that there are some spaces in the primary key? Nope!
You can verify by this query:
select 'a'||code||'b' from state
You will see that there are two rows with the data as aALb. No space in between!
The same thing happened for a row in the Family with the code 1006 (it's one-zero-zero-six)
There are a couple of other rows in other table as well, but not all of them.
You may ask, how did I come up with this schema?
Basically, I am developing an reporting web application, there is an external system exporting data files (in UTF-16 LE). Then I convert those files into UTF-8 and use Hibernate to insert the data.
I know that there is something wrong with the encoding but there is some wrong with PostgreSQL as well.
Here is the script I used to create the database
CREATE DATABASE bug_demo
WITH OWNER = postgres
ENCODING = 'UTF8'
TABLESPACE = pg_default
LC_COLLATE = 'en_US.utf8'
LC_CTYPE = 'en_US.utf8'
CONNECTION LIMIT = -1;
I'm using PostgreSQL 8.4.4 on ArchLinux 32bit kernel 2.6.35.7
Any input will be appreciated
Tinh
Attachment
2010/10/6 Tính Trương Xuân <tinh.truong@evolus.vn>: > > Hello, > I have a very interesting situation with my database schema (attached in > this message). But first you have to import this database script into your > database: > psql -h localhost -p 5432 -U postgres -f /path/to/the/file.sql your_database > > Note: this schema is virus/malware/rootkit free, you can check it for > yourself. > > In the State table, you will notice that there are two rows with the primary > key is AL (it's capital A and capital L). But when you select: > select * from state where code='AL' > There is only one row returned. You may guest that there are some spaces in > the primary key? Nope! > You can verify by this query: > select 'a'||code||'b' from state > You will see that there are two rows with the data as aALb. No space in > between! > > The same thing happened for a row in the Family with the code 1006 (it's > one-zero-zero-six) > There are a couple of other rows in other table as well, but not all of > them. > > You may ask, how did I come up with this schema? > Basically, I am developing an reporting web application, there is an > external system exporting data files (in UTF-16 LE). Then I convert those > files into UTF-8 and use Hibernate to insert the data. > I know that there is something wrong with the encoding but there is some > wrong with PostgreSQL as well. > Here is the script I used to create the database > CREATE DATABASE bug_demo > WITH OWNER = postgres > ENCODING = 'UTF8' > TABLESPACE = pg_default > LC_COLLATE = 'en_US.utf8' > LC_CTYPE = 'en_US.utf8' > CONNECTION LIMIT = -1; > I'm using PostgreSQL 8.4.4 on ArchLinux 32bit kernel 2.6.35.7 > > Any input will be appreciated > > Tinh Tinh, I looked at the data, did not import it, but both those instances the copy command AL and 1006 are the first lines and they both have a space in front of them. Dave
Tính Trương Xuân wrote: > In the State table, you will notice that there are two rows with the > primary key is AL (it's capital A and capital L). But when you select: > select * from state where code='AL' > There is only one row returned. You may guest that there are some spaces > in the primary key? Nope! > You can verify by this query: > select 'a'||code||'b' from state > You will see that there are two rows with the data as aALb. No space in > between! You do have a space in one of the rows - it's a ZERO WIDTH NO-BREAK SPACE (U+FEFF), also used as a byte order mark in UTF-16. Here's the relevant bit courtesy of hexdump -C: > 00014da0 3a 20 70 6f 73 74 67 72 65 73 0a 2d 2d 0a 0a 43 |: postgres.--..C| > 00014db0 4f 50 59 20 73 74 61 74 65 20 28 63 6f 64 65 2c |OPY state (code,| > 00014dc0 20 6e 61 6d 65 2c 20 73 74 61 74 75 73 29 20 46 | name, status) F| > 00014dd0 52 4f 4d 20 73 74 64 69 6e 3b 0a ef bb bf 41 4c |ROM stdin;....AL| > 00014de0 09 41 4c 41 42 41 4d 41 09 41 43 54 49 56 45 0a |.ALABAMA.ACTIVE.| > 00014df0 41 4b 09 41 4c 41 53 4b 41 09 41 43 54 49 56 45 |AK.ALASKA.ACTIVE| > 00014e00 0a 41 5a 09 41 52 49 5a 4f 4e 41 09 41 43 54 49 |.AZ.ARIZONA.ACTI| > 00014e10 56 45 0a 41 52 09 41 52 4b 41 4e 53 41 53 09 41 |VE.AR.ARKANSAS.A| Note the "0a ef bb bf 41 4c" - that's a linefeed, followed by the UTF-8 encoding of U+FEFF, followed by "AL" "less -U" renders it like this: > COPY state (code, name, status) FROM stdin; > <U+FEFF>AL^IALABAMA^IACTIVE > AK^IALASKA^IACTIVE I guess that your UTF-16 input starts with a BOM, and you're just passing it through unchanged. Perhaps you are explicitly claiming to your parser that the input is UTF-16LE, which shouldn't start with a BOM, so your parser is interpreting it as a no-break space rather than filtering it out? Wikipedia says "For the IANA registered charsets UTF-16BE and UTF-16LE, a byte order mark must not be used because the names of these character sets already determine the byte order. If encountered, an initial U+FEFF must be interpreted as a (deprecated) "zero width no-break space".' Oliver