Thread: Corruption with duplicate primary key

Corruption with duplicate primary key

From: Alex Adriaanse
We have a Postgres 10 database that we recently upgraded to Postgres 12 using pg_upgrade. We recently discovered that
there are rows in one of the tables that have duplicate primary keys:

record_loader=# \d loader.sync
                              Table "loader.sync"
      Column       |           Type           | Collation | Nullable | Default
-------------------+--------------------------+-----------+----------+---------
 source            | text                     |           | not null |
 natural_key       | text                     |           | not null |
 payload           | jsonb                    |           |          |
 dispatched        | timestamp with time zone |           | not null | now()
 initial_load_id   | text                     |           |          |
 deleted_load_id   | text                     |           |          |
 created_timestamp | timestamp with time zone |           |          | now()
 updated_timestamp | timestamp with time zone |           |          | now()
 deleted_timestamp | timestamp with time zone |           |          |
Indexes:
    "sync_pkey" PRIMARY KEY, btree (source, natural_key)
Publications:
    "debezium"

This table is modified via triggers that fire off when a COPY command inserts many rows into another table.

Here are two example duplicate rows:

# SELECT xmin, xmax, cmin, cmax, source, md5(natural_key) AS natural_key_hash, dispatched, created_timestamp,
updated_timestamp, deleted_timestamp FROM loader.sync WHERE (source, natural_key) = ('ok_lease', '...') ORDER BY
xmin::text::int, cmin::text::int;
-[ RECORD 1 ]-----+---------------------------------
xmin              | 116649
xmax              | 0
cmin              | 5304404
cmax              | 5304404
source            | ok_lease
natural_key_hash  | de3e9a567b90025c3399c4c63c823fe9
dispatched        | 2019-11-24 05:09:36.099686+00
created_timestamp | 2019-11-24 05:09:36.099686+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp |
-[ RECORD 2 ]-----+---------------------------------
xmin              | 116649
xmax              | 118583
cmin              | 5312208
cmax              | 5312208
source            | ok_lease
natural_key_hash  | de3e9a567b90025c3399c4c63c823fe9
dispatched        | 2019-11-10 05:09:24.214964+00
created_timestamp | 2019-05-17 21:24:19.558219+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp | 2019-11-24 05:09:36.099686+00

It appears that the second row was in place originally, then got updated by a trigger (and even deleted later on,
although it doesn't appear that the delete transaction got committed), and then the first row was inserted within the
same transaction that updated the second row.

Another example:
-[ RECORD 1 ]-----+---------------------------------
xmin              | 116649
xmax              | 0
cmin              | 5304403
cmax              | 5304403
source            | ok_lease
natural_key_hash  | 1c8031348701a32cb5fee26839d6b0b4
dispatched        | 2019-11-10 05:09:24.214964+00
created_timestamp | 2019-05-31 06:00:33.765547+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp | 2019-11-24 05:09:36.099686+00
-[ RECORD 2 ]-----+---------------------------------
xmin              | 116649
xmax              | 0
cmin              | 5304404
cmax              | 5304404
source            | ok_lease
natural_key_hash  | 1c8031348701a32cb5fee26839d6b0b4
dispatched        | 2019-11-24 05:09:36.099686+00
created_timestamp | 2019-11-24 05:09:36.099686+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp |

Both examples have in common that the two duplicate rows were touched within the same transaction.

This database runs inside Docker, with the data directory bind-mounted to a reflink-enabled XFS filesystem. The VM is
running Debian's 4.19.16-1~bpo9+1 kernel inside an AWS EC2 instance. We have Debezium streaming data from this database
via pgoutput.

Recreating the primary key confirms that the constraint doesn't (or at least shouldn't) permit these duplicate rows:

record_loader=# BEGIN;
BEGIN
record_loader=# ALTER TABLE loader.sync DROP CONSTRAINT sync_pkey;
ALTER TABLE
record_loader=# ALTER TABLE loader.sync ADD CONSTRAINT sync_pkey PRIMARY KEY (source, natural_key);
ERROR:  could not create unique index "sync_pkey"
DETAIL:  Key (source, natural_key)=(ok_lease, ...) is duplicated.
CONTEXT:  parallel worker

Any ideas on what might cause this behavior?

Thanks,

Alex


Re: Corruption with duplicate primary key

From: Peter Geoghegan
On Thu, Dec 5, 2019 at 1:14 PM Alex Adriaanse <alex@oseberg.io> wrote:
> We have a Postgres 10 database that we recently upgraded to Postgres 12 using pg_upgrade. We recently discovered that
> there are rows in one of the tables that have duplicate primary keys:

What's the timeline here? In other words, does it look like these rows
were updated and/or deleted before, around the same time as, or after
the upgrade?

The symptoms you report seem fairly generic to me, at least at first
blush. They could be caused by storage level inconsistencies that
could have many causes.

> This database runs inside Docker, with the data directory bind-mounted to a reflink-enabled XFS filesystem. The VM is
> running Debian's 4.19.16-1~bpo9+1 kernel inside an AWS EC2 instance. We have Debezium streaming data from this
> database via pgoutput.

That seems suspicious, since reflink support for XFS is rather immature.

How did you invoke pg_upgrade? Did you use the --link (hard link) option?

-- 
Peter Geoghegan



Re: Corruption with duplicate primary key

From: Tomas Vondra
On Thu, Dec 05, 2019 at 09:14:12PM +0000, Alex Adriaanse wrote:
>We have a Postgres 10 database that we recently upgraded to Postgres 12 using pg_upgrade. We recently discovered
>that there are rows in one of the tables that have duplicate primary keys:
>
>[... table definition, duplicate-row examples, and environment details trimmed; see the original message above ...]
>
>Any ideas on what might cause this behavior?
>

Not sure. At first I thought this might be due to collations
changing and breaking the index silently. What collation are you using?
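
For example, you can check the database-level collation with:

    SELECT datname, datcollate, datctype
    FROM pg_database
    WHERE datname = current_database();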

A couple questions:

1) When you do the queries, do they use an index scan or a sequential scan?
Perhaps it does a sequential scan, and if you force an index scan (e.g. by
rewriting the query) it'll only find one of those rows. (See the example
after these questions.)

2) Can you check in backups if this data corruption was present in the
PG10 cluster, before running pg_upgrade? 
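
For question 1, something like this should show which plan the lookup uses
(with the real key value substituted for the placeholder):

    EXPLAIN (COSTS OFF)
    SELECT * FROM loader.sync
    WHERE (source, natural_key) = ('ok_lease', '...');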

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Corruption with duplicate primary key

From: "Finnerty, Jim"
Re: " It appears that the second row was in place originally, then got updated by a trigger (and even deleted later on,
althoughit doesn't appear that the delete transaction got committed), and then the first row was inserted within the
sametransaction that updated the second row."
 

If you have BEFORE triggers, and a BEFORE trigger signaled failure with RETURN NULL, then this is one known (and
documented) issue that I think could cause the behavior you're reporting:


https://www.postgresql-archive.org/BEFORE-triggers-that-return-NULL-can-circumvent-referential-integrity-tt6056390.html#none

It's hard to say if this is the cause or not, but if you have any BEFORE triggers that RETURN NULL, you might want to
review the documentation very carefully.
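
To illustrate the pattern (all identifiers below are invented, not taken from your schema): a BEFORE row trigger that
returns NULL silently cancels the operation for that row rather than raising an error.

    CREATE FUNCTION maybe_skip_row() RETURNS trigger AS $$
    BEGIN
        IF NEW.payload IS NULL THEN
            RETURN NULL;  -- quietly suppresses this row's INSERT; no error is raised
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER maybe_skip
        BEFORE INSERT ON some_table
        FOR EACH ROW EXECUTE FUNCTION maybe_skip_row();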
 

thanks,

    /Jim F



Re: Corruption with duplicate primary key

From: Alex Adriaanse
On Thu, December 5, 2019 at 5:34 PM Peter Geoghegan wrote:
> > We have a Postgres 10 database that we recently upgraded to Postgres 12 using pg_upgrade. We recently discovered
> > that there are rows in one of the tables that have duplicate primary keys:
>
> What's the timeline here? In other words, does it look like these rows
> were updated and/or deleted before, around the same time as, or after
> the upgrade?

The Postgres 12 upgrade was performed on 2019-11-22, so the affected rows were modified after this upgrade (although
some of the rows were originally inserted before then, before they were modified/duplicated).

> > This database runs inside Docker, with the data directory bind-mounted to a reflink-enabled XFS filesystem. The VM
> > is running Debian's 4.19.16-1~bpo9+1 kernel inside an AWS EC2 instance. We have Debezium streaming data from this
> > database via pgoutput.
>
> That seems suspicious, since reflink support for XFS is rather immature.

Good point. Looking at kernel commits since 4.19.16, it appears that there have been a few bug fixes in later kernel
versions that address a few XFS corruption issues. Regardless of whether FS bugs are responsible for this corruption,
I'll plan on upgrading to a newer kernel.

> How did you invoke pg_upgrade? Did you use the --link (hard link) option?

Yes, we first created a backup using "cp -a --reflink=always", ran initdb on the new directory, and then upgraded using
"pg_upgrade -b ... -B ... -d ... -D ... -k".

Alex


Re: Corruption with duplicate primary key

From: Alex Adriaanse
On Thu., December 5, 2019 at 5:45 PM, Tomas Vondra wrote:
> At first I thought maybe this might be due to collations
> changing and breaking the index silently. What collation are you using?

We're using en_US.utf8. We did not make any collation changes to my knowledge.

> 1) When you do the queries, do they use index scan or sequential scan?
> Perhaps it does sequential scan, and if you force index scan (e.g. by
> rewriting the query) it'll only find one of those rows.

By default it used an index scan. When I re-ran the query today (and confirmed that the query used an index only scan)
I did not see any duplicates. If I force a sequential scan using "SET enable_index[only]scan = false" the duplicates
reappear.
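
That shorthand stands for the two actual settings; roughly:

    SET enable_indexscan = off;
    SET enable_indexonlyscan = off;
    SELECT source, md5(natural_key) AS natural_key_hash
    FROM loader.sync
    WHERE (source, natural_key) = ('ok_lease', '...');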

However, using a backup from a week ago I see duplicates both in the query that uses an index only scan and in the
query that uses the sequential scan. So somehow over the past week the index got changed to eliminate duplicates.

> 2) Can you check in backups if this data corruption was present in the
> PG10 cluster, before running pg_upgrade?

Sure. I just checked and did not see any corruption in the PG10 pre-upgrade backup. I also re-upgraded that PG10 backup
to PG12, and right after the upgrade I did not see any corruption either. I checked using both index scans and
sequential scans.

Alex


Re: Corruption with duplicate primary key

From: Alex Adriaanse
On Mon, December 9, 2019 at 11:05 AM Finnerty, Jim wrote:
> If you have BEFORE triggers, and a BEFORE trigger signaled failure with RETURN NULL, then this is one known (and
> documented) issue that I think could cause the behavior you're reporting:
>
> https://www.postgresql-archive.org/BEFORE-triggers-that-return-NULL-can-circumvent-referential-integrity-tt6056390.html#none
>
> It's hard to say if this is the cause or not, but if you have any BEFORE triggers that RETURN NULL, you might want to
> review the documentation very carefully.

We do have a BEFORE INSERT trigger, but it should never return NULL. This trigger INSERTs into a different table using
an ON CONFLICT DO NOTHING clause and then does a RETURN NEW.
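
In rough outline the trigger function looks like this (identifiers simplified and invented for illustration; not our
exact code):

    CREATE FUNCTION loader.record_row() RETURNS trigger AS $$
    BEGIN
        INSERT INTO loader.other_table (source, natural_key)
        VALUES (NEW.source, NEW.natural_key)
        ON CONFLICT DO NOTHING;
        RETURN NEW;  -- always NEW, never NULL
    END;
    $$ LANGUAGE plpgsql;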

Alex


Re: Corruption with duplicate primary key

From: Tomas Vondra
On Wed, Dec 11, 2019 at 11:46:40PM +0000, Alex Adriaanse wrote:
>On Thu., December 5, 2019 at 5:45 PM, Tomas Vondra wrote:
>> At first I thought maybe this might be due to collations changing and
>> breaking the index silently. What collation are you using?
>
>We're using en_US.utf8. We did not make any collation changes to my
>knowledge.
>

Well, the idea was more that glibc got updated and the collations
changed because of that (without PostgreSQL having a chance to even
notice that).

>> 1) When you do the queries, do they use index scan or sequential
>> scan?  Perhaps it does sequential scan, and if you force index scan
>> (e.g. by rewriting the query) it'll only find one of those rows.
>
>By default it used an index scan. When I re-ran the query today (and
>confirmed that the query used an index only scan) I did not see any
>duplicates. If I force a sequential scan using "SET
>enable_index[only]scan = false" the duplicates reappear.
>

Hmmm, that's probably a sign of some sort of index corruption. Clearly,
when a row can't be found through an index, it's invisible to the code
enforcing the unique constraint (which relies on the index).
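
If you still have a copy of the cluster in the corrupted state, contrib/amcheck
(just a suggestion, not something we've tried in this thread) could confirm the
index/heap mismatch:

    CREATE EXTENSION IF NOT EXISTS amcheck;
    -- heapallindexed = true also verifies that every heap tuple
    -- has a corresponding index entry
    SELECT bt_index_check('loader.sync_pkey'::regclass, true);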

>However, using a backup from a week ago I see duplicates in both the
>query that uses an index only scan as well as the query that uses the
>sequential scan. So somehow over the past week the index got changed to
>eliminate duplicates.
>

Hmmm, that's interesting ... and confusing.

The good thing is that this is not an upgrade issue, because there was
no corruption right after the upgrade.

But then apparently the corruption appeared, and then disappeared for
some unknown reason, but only from the index. Puzzling.

>> 2) Can you check in backups if this data corruption was present in
>> the PG10 cluster, before running pg_upgrade?
>
>Sure. I just checked and did not see any corruption in the PG10
>pre-upgrade backup. I also re-upgraded that PG10 backup to PG12, and
>right after the upgrade I did not see any corruption either. I checked
>using both index scans and sequential scans.
>

OK, thanks. That's a valuable piece of information.

How active is the system, and can you do PITR? That is, can you try
restoring it to different points in time by replaying WAL? Then we could
narrow down when the corruption appeared and inspect the WAL from that
period.
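
On PG12 that means restoring a base backup, creating an empty recovery.signal
file in the data directory, and setting something like this in postgresql.conf
(the path and timestamp below are placeholders):

    restore_command = 'cp /path/to/wal_archive/%f %p'
    recovery_target_time = '2019-12-01 00:00:00+00'
    recovery_target_action = 'pause'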


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Corruption with duplicate primary key

From: Alex Adriaanse
On Thu., December 12, 2019 at 5:25 PM, Tomas Vondra wrote:
>On Wed, Dec 11, 2019 at 11:46:40PM +0000, Alex Adriaanse wrote:
>>On Thu., December 5, 2019 at 5:45 PM, Tomas Vondra wrote:
>>> At first I thought maybe this might be due to collations changing and
>>> breaking the index silently. What collation are you using?
>>
>>We're using en_US.utf8. We did not make any collation changes to my
>>knowledge.
>
>Well, the idea was more that glibc got updated and the collations
>changed because of that (without PostgreSQL having a chance to even
>notice that).

Closing the loop on this, I've investigated this some more and it turns out this is exactly what happened. As you
suspected, the issue had nothing to do with pg_upgrade or PG12, but rather with the glibc upgrade that came with Debian
Buster. The postgres:10 and postgres:11 images are based on Debian Stretch, whereas postgres:12 is based on Buster.
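
For anyone who runs into this later: as I understand it, once the glibc upgrade has changed the collation order, the
fix is to rebuild the affected text indexes, e.g.:

    REINDEX INDEX loader.sync_pkey;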

When I kept the database on an older version of Postgres (10 or 11) but switched from the older Docker image to the
postgres:12 or debian:buster(-slim) image, manually installing older Postgres packages inside those images, I saw index
corruption there too.

Thanks for the input!

Alex