Thread: Duplicate rows during pg_dump

Duplicate rows during pg_dump

From
Chaz Yoon
Date:
I am seeing a duplicate, stale copy of the same row when performing a pg_dump or copying a specific table, but not when directly selecting from it. I'm running PostgreSQL 9.3.9 on Amazon RDS, with 9.3.10 client tools.

It's happening on a users table, which has a primary key and enforces a unique email address:

                                        Table "public.users"
      Column     |            Type             |                 Modifiers
  ---------------+-----------------------------+---------------------------------------------------
   id            | integer                     | not null default nextval('users_id_seq'::regclass)
   email         | character varying(255)      | not null default ''::character varying
   last_activity | timestamp without time zone |
  Indexes:
      "users_pkey" PRIMARY KEY, btree (id)
      "users_unique_email" UNIQUE, btree (email)

I first noticed the problem when copying the table to another database. Roughly this:

  % pg_dump --column-inserts -Fp -h remotedb remote_db > users.txt
  % psql test_db < users.txt
  [...]
  ERROR:  could not create unique index "users_pkey"
  DETAIL:  Key (id)=(123) is duplicated.
  [...]
  ERROR:  could not create unique index "users_unique_email"
  DETAIL:  Key (email)=(this_user@xyz.com) is duplicated.

It appears there's some sort of duplicate record for a single user in the database. Checking the pg_dump output, I saw that a single user's record was being exported twice:

  % grep -i this_user@xyz.com users.txt
  INSERT INTO users (id, email, last_activity) VALUES (123, 'this_user@xyz.com', '2015-10-21 10:32:15.997887');
  INSERT INTO users (id, email, last_activity) VALUES (123, 'this_user@xyz.com', '2015-10-02 11:32:58.615743');

The two rows were not identical; the last_activity timestamps differ. Connecting to the source database directly, I tried this:

  remote_db=> select count(1) from users where id = 123;
   count
  -------
       1
  (1 row)

  remote_db=> select count(1) from users where email = 'this_user@xyz.com';
   count
  -------
       1
  (1 row)

To eliminate any risk of it being a weird locking issue, I restored a snapshot of the database into a new RDS instance but I got the same results. I then tried the following:

  remote_db=> create table users_copy_with_indexes (like users including defaults including constraints including indexes including storage including comments);
  CREATE TABLE
  remote_db=> insert into users_copy_with_indexes select * from users;
  ERROR:  duplicate key value violates unique constraint "users_copy_with_indexes_pkey"
  DETAIL:  Key (id)=(123) already exists.

However, when I created a copy without the indexes, I could see the duplicate rows:

  remote_db=> create table users_copy_without_indexes (like users);
  CREATE TABLE
  remote_db=> insert into users_copy_without_indexes select * from users;
  INSERT 0 523342
  remote_db=> select count(1) from users_copy_without_indexes where id = 123;
   count
  -------
       2
  (1 row)
  remote_db=> select count(1) from users_copy_without_indexes where email = 'this_user@xyz.com';
   count
  -------
       2
  (1 row)

Any suggestions for what to look for next? Is it table corruption?

Chaz

Re: Duplicate rows during pg_dump

From
Adrian Klaver
Date:
On 10/24/2015 12:35 PM, Chaz Yoon wrote:
> I am seeing a duplicate, stale copy of the same row when performing a
> pg_dump or copying a specific table, but not when directly selecting
> from it. I'm running PostgreSQL 9.3.9 on Amazon RDS, with 9.3.10 client
> tools.
>
> [...]
>
> However, when I created a copy without the indexes, I could see the
> duplicate rows:
>
>    remote_db=> create table users_copy_without_indexes (like users);
>    CREATE TABLE
>    remote_db=> insert into users_copy_without_indexes select * from users;
>    INSERT 0 523342
>    remote_db=> select count(1) from users_copy_without_indexes where id = 123;
>     count
>    -------
>         2
>    (1 row)
>
> Any suggestions for what to look for next? Is it table corruption?

I would say the smoking gun is that the copy w/o indexes shows both
records and the one with indexes only one. I would DROP/CREATE the
indexes on the original table, with the usual caveat that this does
place a load on the table. Using CONCURRENTLY might help, but I would
first read "Building Indexes Concurrently" here:

http://www.postgresql.org/docs/9.3/interactive/sql-createindex.html#SQL-CREATEINDEX-CONCURRENTLY
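
If the duplicates exist only in the index, a concurrent swap of the email
index could look roughly like this (a sketch; users_unique_email_new is a
placeholder name, and the unique build will fail if duplicate rows really
are present in the heap):

  remote_db=> create unique index concurrently users_unique_email_new on users (email);
  remote_db=> drop index users_unique_email;
  remote_db=> alter index users_unique_email_new rename to users_unique_email;

users_pkey backs the primary key constraint, so it cannot simply be
dropped; that one would need a REINDEX (exclusive lock) or an
ALTER TABLE ... ADD CONSTRAINT ... USING INDEX swap.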
>
> Chaz
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: Duplicate rows during pg_dump

From
Marc Mamin
Date:
>On 10/24/2015 12:35 PM, Chaz Yoon wrote:
>> I am seeing a duplicate, stale copy of the same row when performing a
>> pg_dump or copying a specific table, but not when directly selecting
>> from it. I'm running PostgreSQL 9.3.9 on Amazon RDS, with 9.3.10 client
>> tools.
>>
>> [...]
>>
>> Any suggestions for what to look for next? Is it table corruption?

Most likely the index is corrupt, not the table.
You should check for further duplicates, fix them and, as Adrian writes,
build a new index and then drop the corrupt one.

I've seen this a few times before, and if I recall correctly it was always after some disk got full.
Is AWS running out of space? :)
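
Checking for further duplicates could be done against the index-free copy
from the first message, roughly:

  remote_db=> select id, count(*) from users_copy_without_indexes
              group by id having count(*) > 1;
  remote_db=> select email, count(*) from users_copy_without_indexes
              group by email having count(*) > 1;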

regards,
Marc Mamin

>
>I would say the smoking gun is that the copy w/o indexes shows both
>records and the one with indexes only one. [...]
>

Re: Duplicate rows during pg_dump

From
Jim Nasby
Date:
On 10/24/15 3:15 PM, Marc Mamin wrote:
>>> Any suggestions for what to look for next? Is it table corruption?
> Most likely the index is corrupt, not the table.
> You should check for further duplicates, fix them and, as Adrian writes,
> build a new index and then drop the corrupt one.
>
> I've seen this a few times before, and if I recall correctly it was always after some disk got full.
> Is AWS running out of space? :)

You should report this to the RDS team, because an out of space
condition shouldn't leave multiple values in the index. I suspect
they've made a modification somewhere that is causing this. It could be
a base Postgres bug, but I'd think we'd have caught such a bug by now...
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


Re: Duplicate rows during pg_dump

From
Tom Lane
Date:
Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
> On 10/24/15 3:15 PM, Marc Mamin wrote:
>>>> Any suggestions for what to look for next? Is it table corruption?
>> Most likely the index is corrupt, not the table.
>> You should check for further duplicates, fix them and, as Adrian writes,
>> build a new index and then drop the corrupt one.
>>
>> I've seen this a few times before, and if I recall correctly it was always after some disk got full.
>> Is AWS running out of space? :)

> You should report this to the RDS team, because an out of space
> condition shouldn't leave multiple values in the index. I suspect
> they've made a modification somewhere that is causing this. It could be
> a base Postgres bug, but I'd think we'd have caught such a bug by now...

Notable also is that pg_dump invariably reads tables with a plain "COPY foo"
or "SELECT * FROM foo", which should ignore all indexes and just read the
table contents.  So I doubt that reindexing will fix anything: you almost
certainly do have duplicate rows in the base table.  It's highly likely
that the index is corrupt, which is what would be necessary to get into
such a state ... but you will need to manually remove the dup rows before
rebuilding the unique index will succeed.
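
To locate the duplicates in the base table itself, the corrupt index has
to be bypassed; a sketch (the ctid shown is illustrative only):

  remote_db=> set enable_indexscan = off;
  remote_db=> set enable_bitmapscan = off;
  remote_db=> select ctid, xmin, id, email, last_activity from users where id = 123;
  remote_db=> delete from users where ctid = '(0,2)';  -- the stale physical row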

            regards, tom lane


Re: Duplicate rows during pg_dump

From
Marc Mamin
Date:

> On 10/24/15 3:15 PM, Marc Mamin wrote:
> > I've seen this a few times before, and if I recall correctly it was
> > always after some disk got full.
> > Is AWS running out of space? :)
>
> You should report this to the RDS team, because an out of space
> condition shouldn't leave multiple values in the index. I suspect
> they've made a modification somewhere that is causing this. It could be
> a base Postgres bug, but I'd think we'd have caught such a bug by
> now...

The last time I ran into this was 4-5 years ago...

regards,

Marc Mamin



