dump/restore results in duplicate key violation with 7.4.6. Bug? - Mailing list pgsql-admin

From danda
Subject dump/restore results in duplicate key violation with 7.4.6. Bug?
Date
Msg-id 418BD67D.1010108@osc.co.cr
Whole thread Raw
List pgsql-admin
Hi, I am getting an error when attempting to perform a
pg_dump/pg_restore cycle (data only).

Source database is 7.4.3, Target is 7.4.6.  Source database was
originally using 'SQL_ASCII' as the encoding, but all data is in UTF-8.
Target database uses 'UNICODE'. The table that is giving me problems
contains over 400,000 rows. (It holds category data from the dmoz
project. )  In both the source and target database, there is a unique
constraint on the topic key.  Upon restore I am getting duplicate key
violations on 12 rows.  All of the problem rows contain non latin1 data.

What I've tried:
==========

1) SQL Dump / Restore

source machine:
pg_dump -a -t category -U postgres dbname > cats.dump.sql

target machine:
psql
\i cats.dump.sql

which outputs:
\i /tmp/category.dump.sql
SET
SET
SET
SET
SET
UPDATE 1
SET
psql:/tmp/category.dump.sql:24: ERROR:  duplicate key violates unique
constraint "category_topic_key"
CONTEXT:  COPY category, line 133302: "1227568  503988
Top/Kids_and_Teens/International/Korean/ê²ì     0       0"
SET
UPDATE 1

( I also tried it using inserts instead of copy from, but with similar
results. )



2) Binary (custom) Dump / Restore
source machine:
pg_dump -F c -Z 8 -t category -U postgres dbname > category_dump.custom.gz

target machine:
pg_restore -d dbname -a category_dump.custom.gz

which churns for a while and then gives me this error:

pg_restore: ERROR:  duplicate key violates unique constraint
"category_topic_key"
CONTEXT:  COPY category, line 133302: "1227568  503988
Top/Kids_and_Teens/International/Korean/ê²ì     0       0"
pg_restore: [archiver (db)] error returned by PQendcopy


3) Remove unique constraints.

I then removed the unique constraint in the target database so I could
at least import the data. After that I was able to view exactly which
rows have been duplicated:

select sub.topic, sub.cnt from (select topic, count(*) as cnt  from
category group by topic) sub  where cnt > 1;
                                 topic                                 | cnt
-----------------------------------------------------------------------+-----
 Top/Adult/World/Japanese/ãªã³ã©ã¤ã³ã·ã§ãã/ã°ã㺠                   |   2
 Top/Adult/World/Japanese/ã¨ã³ã¿ã¼ãã¤ã³ã¡ã³ã/ã²ã¼ã /éçºå/ãè¡      |  10
 Top/Adult/World/Korean/ëê±°                                           |   4
 Top/Adult/World/Korean/ì±ì¸_ì¬ì´í¸/ìì                               |   2
 Top/Adult/World/Korean/ìì /ë¬¸í                                      |   2
 Top/Adult/World/Korean/ì±ì¸ì©í                                       |   2
 Top/Adult/World/Korean/ì±ì¸ì¬ì´í¸_X/ë§í                              |   2
 Top/Kids_and_Teens/International/Japanese/ã²ã¼ã                       |   2
 Top/Kids_and_Teens/International/Japanese/ã³ã³ãã¥ã¼ã¿                |   2
 Top/Kids_and_Teens/International/Japanese/ã¨ã³ã¿ã¼ãã¤ã³ã¡ã³ã/ã©ã¸ãª |   4
 Top/Kids_and_Teens/International/Japanese/趣å³ã»ã¹ãã¼ã/ããã¡ã     |   2
 Top/Kids_and_Teens/International/Korean/ìì                           |   5
(12 rows)

Performing this same query on the source database:

select sub.topic, sub.cnt from (select topic, count(*) as cnt  from
category group by topic) sub  where cnt > 1;
 topic | cnt
-------+-----
(0 rows)


4) Attempted to identify category_id of duplicate rows:

select category_id from category where topic =
'Top/Adult/World/Korean/ëê±°';
 category_id
-------------
(0 rows)

I believe this failed due to some sort of encoding or font problem
between xterm and psql and DB or even X clipboard.  Note that the data
does display correctly when viewed in mozilla.  Still it would be nice
to be able to copy/paste psql result string and use it as input and
actually find a match!

5) Manual inspection of one of the rows.

I chose the topic 'Top/Adult/World/Korean/ëê±°' to pursue further.  I
executed the following query and looked for multiple instances of that
string.  There should be 4 according to our duplicates query above.

select category_id, topic from category where topic like
'Top/Adult/World/Korean%';
 category_id |                    topic
-------------+----------------------------------------------
      328048 | Top/Adult/World/Korean/ë¹ëì¤,CD
      381025 | Top/Adult/World/Korean/ë¹ëì¤,CD/ë°±ìCD
      400131 | Top/Adult/World/Korean/ì±ì¸ì¬ì´í¸_X
      400136 | Top/Adult/World/Korean/ì±ì¸ì¬ì´í¸_X/ë§í¬
      400133 | Top/Adult/World/Korean/ì±ì¸ì¬ì´í¸_X/íìì
     5830581 | Top/Adult/World/Korean/미ëì´
     5830906 | Top/Adult/World/Korean/ë¹ì¦ëì¤
      589823 | Top/Adult/World/Korean/ì±ì¸_ì¬ì´í¸/기ê´,ë¨ì²´
      324253 | Top/Adult/World/Korean
      367742 | Top/Adult/World/Korean/ëê±°
      378503 | Top/Adult/World/Korean/ìì
      590650 | Top/Adult/World/Korean/ìì /ë§í
      378504 | Top/Adult/World/Korean/ìì /문í
      590649 | Top/Adult/World/Korean/ìì /ì ëë©ì´ì
     5828700 | Top/Adult/World/Korean/ì±ì¸ì¬ì´í¸_X/ë§í
     5812536 | Top/Adult/World/Korean/ì±í
     5832542 | Top/Adult/World/Korean/ë¹ì¦ëì¤/ì·¨ì,ì±ì©
      364360 | Top/Adult/World/Korean/ê²ì´
      324254 | Top/Adult/World/Korean/ì±ì¸_ì¬ì´í¸
      592044 | Top/Adult/World/Korean/ì±ì¸_ì¬ì´í¸/ìì
     5852704 | Top/Adult/World/Korean/ì±ì¸_ì¬ì´í¸/ë§í
      406487 | Top/Adult/World/Korean/ì±ì¸_ì¬ì´í¸/ì±ì¸ìí
      365940 | Top/Adult/World/Korean/ì±ì¸ì©í
(23 rows)

Yet I only see one row that matches the string exactly.  It is the one
with category_id = 367742.


6) Attempt to import same data back into source database ( 7.4.3 )

Acting on the theory that this is possibly a new problem in 7.4.6, I
tried the following in the source DB, (still with SQL_ASCII encoding)
which worked just fine:

create table category_tmp as select * from category;

alter table category_tmp add constraint category_tmp_topic_key unique
(topic);
NOTICE:  ALTER TABLE / ADD UNIQUE will create implicit index
"category_tmp_topic_key" for table "category_tmp"
ALTER TABLE

pg_dump -a -t category_tmp -U postgres dbname > category_tmp_dump.sql

psql

delete from category_tmp ;
DELETE 461153

\i category_tmp_dump.sql
SET
SET
SET
SET


7) Experiments with encoding

a) Used GNU recode to recode the sql dump file to UTF-8.  Resulting file
was unchanged, meaning data was already UTF-8.

b) Updated the encoding in source database to 'UNICODE' to match target
database.

update pg_database set encoding = 6 where datname = 'dbname';

c) Re-imported the data back into the source database again (as in 6).
Worked fine again.



So at this point I am mostly at a loss.  I would have thought that after
changing the source DB to UNICODE encoding it should exhibit the same
behavior as the target.  I can think of two explanations:

1: initdb does something with the encoding beyond setting
pg_database(encoding).

2: there is a bug in 7.4.6 that does not exist in 7.4.3

I suppose the next step is to create a new DB in 7.4.3 using UNICODE and
attempt to import the data in the same manner.  But right now I need a
break.


Dan Libby



pgsql-admin by date:

Previous
From: Dan Libby
Date:
Subject: Re: dump/restore results in duplicate key violation with 7.4.6.
Next
From: "Claudio Latorraca"
Date:
Subject: compiling postgres 64 bit