Thread: Corruption with duplicate primary key
We have a Postgres 10 database that we recently upgraded to Postgres 12 using pg_upgrade. We recently discovered that there are rows in one of the tables that have duplicate primary keys:

record_loader=# \d loader.sync
                            Table "loader.sync"
      Column       |           Type           | Collation | Nullable | Default
-------------------+--------------------------+-----------+----------+---------
 source            | text                     |           | not null |
 natural_key       | text                     |           | not null |
 payload           | jsonb                    |           |          |
 dispatched        | timestamp with time zone |           | not null | now()
 initial_load_id   | text                     |           |          |
 deleted_load_id   | text                     |           |          |
 created_timestamp | timestamp with time zone |           |          | now()
 updated_timestamp | timestamp with time zone |           |          | now()
 deleted_timestamp | timestamp with time zone |           |          |
Indexes:
    "sync_pkey" PRIMARY KEY, btree (source, natural_key)
Publications:
    "debezium"

This table is modified via triggers that fire when a COPY command inserts many rows into another table.

Here are two example duplicate rows:

# SELECT xmin, xmax, cmin, cmax, source, md5(natural_key) AS natural_key_hash, dispatched, created_timestamp, updated_timestamp, deleted_timestamp FROM loader.sync WHERE (source, natural_key) = ('ok_lease', '...') ORDER BY xmin::text::int, cmin::text::int;
-[ RECORD 1 ]-----+---------------------------------
xmin              | 116649
xmax              | 0
cmin              | 5304404
cmax              | 5304404
source            | ok_lease
natural_key_hash  | de3e9a567b90025c3399c4c63c823fe9
dispatched        | 2019-11-24 05:09:36.099686+00
created_timestamp | 2019-11-24 05:09:36.099686+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp |
-[ RECORD 2 ]-----+---------------------------------
xmin              | 116649
xmax              | 118583
cmin              | 5312208
cmax              | 5312208
source            | ok_lease
natural_key_hash  | de3e9a567b90025c3399c4c63c823fe9
dispatched        | 2019-11-10 05:09:24.214964+00
created_timestamp | 2019-05-17 21:24:19.558219+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp | 2019-11-24 05:09:36.099686+00

It appears that the second row was in place originally, then got updated by a trigger (and even deleted later on, although it doesn't appear that the delete transaction got committed), and then the first row was inserted within the same transaction that updated the second row.

Another example:

-[ RECORD 1 ]-----+---------------------------------
xmin              | 116649
xmax              | 0
cmin              | 5304403
cmax              | 5304403
source            | ok_lease
natural_key_hash  | 1c8031348701a32cb5fee26839d6b0b4
dispatched        | 2019-11-10 05:09:24.214964+00
created_timestamp | 2019-05-31 06:00:33.765547+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp | 2019-11-24 05:09:36.099686+00
-[ RECORD 2 ]-----+---------------------------------
xmin              | 116649
xmax              | 0
cmin              | 5304404
cmax              | 5304404
source            | ok_lease
natural_key_hash  | 1c8031348701a32cb5fee26839d6b0b4
dispatched        | 2019-11-24 05:09:36.099686+00
created_timestamp | 2019-11-24 05:09:36.099686+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp |

Both examples have in common that the two duplicate rows were touched within the same transaction.

This database runs inside Docker, with the data directory bind-mounted to a reflink-enabled XFS filesystem. The VM is running Debian's 4.19.16-1~bpo9+1 kernel inside an AWS EC2 instance. We have Debezium streaming data from this database via pgoutput.
Recreating the primary key confirms that the constraint doesn't (or at least shouldn't) permit these duplicate rows:

record_loader=# BEGIN;
BEGIN
record_loader=# ALTER TABLE loader.sync DROP CONSTRAINT sync_pkey;
ALTER TABLE
record_loader=# ALTER TABLE loader.sync ADD CONSTRAINT sync_pkey PRIMARY KEY (source, natural_key);
ERROR:  could not create unique index "sync_pkey"
DETAIL:  Key (source, natural_key)=(ok_lease, ...) is duplicated.
CONTEXT:  parallel worker

Any ideas on what might cause this behavior?

Thanks,

Alex
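As an aside, the contrib amcheck module offers a non-destructive way to test whether a btree index still agrees with the table, without dropping and recreating the constraint. This is only a sketch against the index named in the report above (the heapallindexed argument requires PostgreSQL 11 or later):

    CREATE EXTENSION IF NOT EXISTS amcheck;

    -- Checks btree invariants; with the second argument set to true it also verifies
    -- that every heap tuple has a matching index entry, and raises an error if not.
    SELECT bt_index_check('loader.sync_pkey', true);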
On Thu, Dec 5, 2019 at 1:14 PM Alex Adriaanse <alex@oseberg.io> wrote:
> We have a Postgres 10 database that we recently upgraded to Postgres 12 using pg_upgrade. We recently discovered that there are rows in one of the tables that have duplicate primary keys:

What's the timeline here? In other words, does it look like these rows were updated and/or deleted before, around the same time as, or after the upgrade?

The symptoms you report seem fairly generic to me, at least at first blush. They could be caused by storage-level inconsistencies that could have many causes.

> This database runs inside Docker, with the data directory bind-mounted to a reflink-enabled XFS filesystem. The VM is running Debian's 4.19.16-1~bpo9+1 kernel inside an AWS EC2 instance. We have Debezium streaming data from this database via pgoutput.

That seems suspicious, since reflink support for XFS is rather immature.

How did you invoke pg_upgrade? Did you use the --link (hard link) option?

--
Peter Geoghegan
On Thu, Dec 05, 2019 at 09:14:12PM +0000, Alex Adriaanse wrote:
> We have a Postgres 10 database that we recently upgraded to Postgres 12 using pg_upgrade. We recently discovered that there are rows in one of the tables that have duplicate primary keys:
>
> [...]
> Recreating the primary key confirms that the constraint doesn't (or at least shouldn't) permit these duplicate rows:
>
> [...]
>
> Any ideas on what might cause this behavior?

Not sure. At first I thought maybe this might be due to collations changing and breaking the index silently. What collation are you using?

A couple questions:

1) When you do the queries, do they use index scan or sequential scan? Perhaps it does sequential scan, and if you force index scan (e.g. by rewriting the query) it'll only find one of those rows.

2) Can you check in backups if this data corruption was present in the PG10 cluster, before running pg_upgrade?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
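For reference, the collation Tomas is asking about can be read out of the system catalogs. Nothing below comes from the thread itself; it is just a sketch of where to look, using the table name from the report:

    -- Database-wide collation and ctype, which btree text indexes depend on by default
    SELECT datname, datcollate, datctype
    FROM pg_database
    WHERE datname = current_database();

    -- Collation attached to each collatable column of the affected table
    -- ('default' means the database-wide collation above)
    SELECT a.attname, c.collname
    FROM pg_attribute a
    JOIN pg_collation c ON c.oid = a.attcollation
    WHERE a.attrelid = 'loader.sync'::regclass
      AND a.attnum > 0
      AND NOT a.attisdropped;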
Re: " It appears that the second row was in place originally, then got updated by a trigger (and even deleted later on, althoughit doesn't appear that the delete transaction got committed), and then the first row was inserted within the sametransaction that updated the second row." If you have BEFORE triggers, and a BEFORE trigger signaled failure with RETURN NULL, then this is one known (and documented)issue that I think could cause the behavior you're reporting: https://www.postgresql-archive.org/BEFORE-triggers-that-return-NULL-can-circumvent-referential-integrity-tt6056390.html#none It's hard to say if this is the cause or not, but if you have any BEFORE triggers that RETURN NULL, you might want to reviewthe documentation very carefully. thanks, /Jim F On 12/5/19, 6:45 PM, "Tomas Vondra" <tomas.vondra@2ndquadrant.com> wrote: On Thu, Dec 05, 2019 at 09:14:12PM +0000, Alex Adriaanse wrote: >We have a Postgres 10 database that we recently upgraded to Postgres 12 using pg_upgrade. We recently discovered thatthere are rows in one of the tables that have duplicate primary keys: > >record_loader=# \d loader.sync > Table "loader.sync" > Column | Type | Collation | Nullable | Default >-------------------+--------------------------+-----------+----------+--------- > source | text | | not null | > natural_key | text | | not null | > payload | jsonb | | | > dispatched | timestamp with time zone | | not null | now() > initial_load_id | text | | | > deleted_load_id | text | | | > created_timestamp | timestamp with time zone | | | now() > updated_timestamp | timestamp with time zone | | | now() > deleted_timestamp | timestamp with time zone | | | >Indexes: > "sync_pkey" PRIMARY KEY, btree (source, natural_key) >Publications: > "debezium" > >This table is modified via triggers that fire off when a COPY command inserts many rows into another table. > >Here are two example duplicate rows: > ># SELECT xmin, xmax, cmin, cmax, source, md5(natural_key) AS natural_key_hash, dispatched, created_timestamp, updated_timestamp,deleted_timestamp FROM loader.sync WHERE (source, natural_key) = ('ok_lease', '...') ORDER BY xmin::text::int,cmin::text::int; >-[ RECORD 1 ]-----+--------------------------------- >xmin | 116649 >xmax | 0 >cmin | 5304404 >cmax | 5304404 >source | ok_lease >natural_key_hash | de3e9a567b90025c3399c4c63c823fe9 >dispatched | 2019-11-24 05:09:36.099686+00 >created_timestamp | 2019-11-24 05:09:36.099686+00 >updated_timestamp | 2019-11-24 05:09:36.099686+00 >deleted_timestamp | >-[ RECORD 2 ]-----+--------------------------------- >xmin | 116649 >xmax | 118583 >cmin | 5312208 >cmax | 5312208 >source | ok_lease >natural_key_hash | de3e9a567b90025c3399c4c63c823fe9 >dispatched | 2019-11-10 05:09:24.214964+00 >created_timestamp | 2019-05-17 21:24:19.558219+00 >updated_timestamp | 2019-11-24 05:09:36.099686+00 >deleted_timestamp | 2019-11-24 05:09:36.099686+00 > >It appears that the second row was in place originally, then got updated by a trigger (and even deleted later on, althoughit doesn't appear that the delete transaction got committed), and then the first row was inserted within the sametransaction that updated the second row. 
On Thu, December 5, 2019 at 5:34 PM Peter Geoghegan wrote:
> > We have a Postgres 10 database that we recently upgraded to Postgres 12 using pg_upgrade. We recently discovered that there are rows in one of the tables that have duplicate primary keys:
>
> What's the timeline here? In other words, does it look like these rows were updated and/or deleted before, around the same time as, or after the upgrade?

The Postgres 12 upgrade was performed on 2019-11-22, so the affected rows were modified after this upgrade (although some of the rows were originally inserted before then, before they were modified/duplicated).

> > This database runs inside Docker, with the data directory bind-mounted to a reflink-enabled XFS filesystem. The VM is running Debian's 4.19.16-1~bpo9+1 kernel inside an AWS EC2 instance. We have Debezium streaming data from this database via pgoutput.
>
> That seems suspicious, since reflink support for XFS is rather immature.

Good point. Looking at kernel commits since 4.19.16 it appears that there have been a few bug fixes in later kernel versions that address a few XFS corruption issues. Regardless of whether FS bugs are responsible for this corruption I'll plan on upgrading to a newer kernel.

> How did you invoke pg_upgrade? Did you use the --link (hard link) option?

Yes, we first created a backup using "cp -a --reflink=always", ran initdb on the new directory, and then upgraded using "pg_upgrade -b ... -B ... -d ... -D -k".

Alex
On Thu., December 5, 2019 at 5:45 PM, Tomas Vondra wrote:
> At first I thought maybe this might be due to collations changing and breaking the index silently. What collation are you using?

We're using en_US.utf8. We did not make any collation changes to my knowledge.

> 1) When you do the queries, do they use index scan or sequential scan? Perhaps it does sequential scan, and if you force index scan (e.g. by rewriting the query) it'll only find one of those rows.

By default it used an index scan. When I re-ran the query today (and confirmed that the query used an index only scan) I did not see any duplicates. If I force a sequential scan using "SET enable_index[only]scan = false" the duplicates reappear.

However, using a backup from a week ago I see duplicates in both the query that uses an index only scan as well as the query that uses the sequential scan. So somehow over the past week the index got changed to eliminate duplicates.

> 2) Can you check in backups if this data corruption was present in the PG10 cluster, before running pg_upgrade?

Sure. I just checked and did not see any corruption in the PG10 pre-upgrade backup. I also re-upgraded that PG10 backup to PG12, and right after the upgrade I did not see any corruption either. I checked using both index scans and sequential scans.

Alex
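A concrete form of the forced-sequential-scan check Alex describes might look like the following. This is just a sketch, not taken from the thread; the table and key columns are the ones from the original report:

    -- Disable index-based plans for this session so the check reads the heap directly
    SET enable_indexscan = off;
    SET enable_indexonlyscan = off;
    SET enable_bitmapscan = off;

    -- Any key that appears more than once in the heap violates the primary key
    SELECT source, md5(natural_key) AS natural_key_hash, count(*)
    FROM loader.sync
    GROUP BY source, md5(natural_key)
    HAVING count(*) > 1;

    RESET enable_indexscan;
    RESET enable_indexonlyscan;
    RESET enable_bitmapscan;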
On Mon, December 9, 2019 at 11:05 AM Finnerty, Jim wrote:
> If you have BEFORE triggers, and a BEFORE trigger signaled failure with RETURN NULL, then this is one known (and documented) issue that I think could cause the behavior you're reporting:
>
> https://www.postgresql-archive.org/BEFORE-triggers-that-return-NULL-can-circumvent-referential-integrity-tt6056390.html#none
>
> It's hard to say if this is the cause or not, but if you have any BEFORE triggers that RETURN NULL, you might want to review the documentation very carefully.

We do have a BEFORE INSERT trigger, but it should never return NULL. This trigger INSERTs into a different table using an ON CONFLICT DO NOTHING clause and then does a RETURN NEW.

Alex
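A minimal sketch of the trigger shape described here, for readers unfamiliar with the pattern: the staging table, trigger, and function names below are hypothetical; only loader.sync and its columns come from the thread, and the actual trigger may differ.

    CREATE OR REPLACE FUNCTION staging_to_sync() RETURNS trigger AS $$
    BEGIN
        -- Copy the incoming row into the other table; ignore it if the key already exists
        INSERT INTO loader.sync (source, natural_key, payload)
        VALUES (NEW.source, NEW.natural_key, NEW.payload)
        ON CONFLICT (source, natural_key) DO NOTHING;

        -- Always return NEW so the original INSERT proceeds (never RETURN NULL)
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER staging_to_sync_trg
        BEFORE INSERT ON loader.staging
        FOR EACH ROW EXECUTE FUNCTION staging_to_sync();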
On Wed, Dec 11, 2019 at 11:46:40PM +0000, Alex Adriaanse wrote:
> On Thu., December 5, 2019 at 5:45 PM, Tomas Vondra wrote:
> > At first I thought maybe this might be due to collations changing and breaking the index silently. What collation are you using?
>
> We're using en_US.utf8. We did not make any collation changes to my knowledge.

Well, the idea was more that glibc got updated and the collations changed because of that (without PostgreSQL having a chance to even notice that).

> > 1) When you do the queries, do they use index scan or sequential scan? Perhaps it does sequential scan, and if you force index scan (e.g. by rewriting the query) it'll only find one of those rows.
>
> By default it used an index scan. When I re-ran the query today (and confirmed that the query used an index only scan) I did not see any duplicates. If I force a sequential scan using "SET enable_index[only]scan = false" the duplicates reappear.

Hmmm, that's probably a sign of some sort of index corruption. Clearly, when a row can't be found through an index, it's invisible to the code enforcing the unique constraint (which relies on the index).

> However, using a backup from a week ago I see duplicates in both the query that uses an index only scan as well as the query that uses the sequential scan. So somehow over the past week the index got changed to eliminate duplicates.

Hmmm, that's interesting ... and confusing. The good thing is that this is not an upgrade issue, because there was no corruption right after the upgrade. But then apparently the corruption appeared, and then disappeared for some unknown reason, but only from the index. Puzzling.

> > 2) Can you check in backups if this data corruption was present in the PG10 cluster, before running pg_upgrade?
>
> Sure. I just checked and did not see any corruption in the PG10 pre-upgrade backup. I also re-upgraded that PG10 backup to PG12, and right after the upgrade I did not see any corruption either. I checked using both index scans and sequential scans.

OK, thanks. That's a valuable piece of information. How active is the system, and can you do PITR? That is, can you try restoring it into different points in time by replaying WAL? Then we could narrow down when the corruption appeared and inspect the WAL from that period.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu., December 12, 2019 at 5:25 PM, Tomas Vondra wrote:
> On Wed, Dec 11, 2019 at 11:46:40PM +0000, Alex Adriaanse wrote:
> > On Thu., December 5, 2019 at 5:45 PM, Tomas Vondra wrote:
> > > At first I thought maybe this might be due to collations changing and breaking the index silently. What collation are you using?
> >
> > We're using en_US.utf8. We did not make any collation changes to my knowledge.
>
> Well, the idea was more that glibc got updated and the collations changed because of that (without PostgreSQL having a chance to even notice that).

Closing the loop on this, I've investigated this some more and it turns out this is exactly what happened. As you suspected, the issue had nothing to do with pg_upgrade or PG12, but rather the glibc upgrade that came with Debian Buster. The postgres:10 and postgres:11 images are based on Debian Stretch, whereas postgres:12 is based on Buster.

When I kept the database on an older version of Postgres (10 or 11) but switched from the older Docker image to the postgres:12 or debian:buster(-slim) image, manually installing older Postgres packages inside those images, I saw index corruption there too.

Thanks for the input!

Alex
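For anyone running into the same problem: the usual remedy after a glibc collation change is to rebuild the text-keyed indexes so they match the new sort order. A sketch using the objects from this thread (contrib's amcheck can help identify which indexes are actually affected, and any genuine duplicate rows have to be cleaned up first, or recreating the unique index will fail as shown earlier):

    -- Rebuild the primary key index under the new glibc collation definitions
    REINDEX INDEX loader.sync_pkey;

    -- Or rebuild every index in the database in one go
    REINDEX DATABASE record_loader;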