Re: Corruption with duplicate primary key - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Corruption with duplicate primary key
Date
Msg-id 20191212232526.zj3dqncjrc7pyckb@development
Whole thread Raw
In response to Re: Corruption with duplicate primary key  (Alex Adriaanse <alex@oseberg.io>)
Responses Re: Corruption with duplicate primary key
List pgsql-hackers
On Wed, Dec 11, 2019 at 11:46:40PM +0000, Alex Adriaanse wrote:
>On Thu., December 5, 2019 at 5:45 PM, Tomas Vondra wrote:
>> At first I thought maybe this might be due to collations changing and
>> breaking the index silently. What collation are you using?
>
>We're using en_US.utf8. We did not make any collation changes to my
>knowledge.
>

Well, the idea was more that glibc got updated and the collations
changed because of that (without PostgreSQL having a chance to even
notice that).

>> 1) When you do the queries, do they use index scan or sequential
>> scan?  Perhaps it does sequential scan, and if you force index scan
>> (e.g. by rewriting the query) it'll only find one of those rows.
>
>By default it used an index scan. When I re-ran the query today (and
>confirmed that the query used an index only scan) I did not see any
>duplicates. If I force a sequential scan using "SET
>enable_index[only]scan = false" the duplicates reappear.
>

Hmmm, that's probably a sign of some sort of index corruption. Clearly,
when a row can't be found through an index, it's invisible to code
enforcing the unique constraint (relying on the index).

>However, using a backup from a week ago I see duplicates in both the
>query that uses an index only scan as well as the query that uses the
>sequential scan. So somehow over the past week the index got changed to
>eliminate duplicates.
>

Hmmm, that's interesting ... and confusing.

The good thing is that this is not an upgrade issue, because there was
no corruption right after the upgrade.

But then apparently the corruption appeared, and then disappeared for
some unknown reason, but only from the index. Puzzling.

>> 2) Can you check in backups if this data corruption was present in
>> the PG10 cluster, before running pg_upgrade?
>
>Sure. I just checked and did not see any corruption in the PG10
>pre-upgrade backup. I also re-upgraded that PG10 backup to PG12, and
>right after the upgrade I did not see any corruption either. I checked
>using both index scans and sequential scans.
>

OK, thanks. That's valuable piece of information.

How active is the system and can you do PITR? That is, can you try
restoring it into different points in time by replaying WAL? Then we
could check narrow-down when the corruption appeared and inspect the WAL
from that period.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: tuplesort test coverage
Next
From: Will Leinweber
Date:
Subject: Errors "failed to construct the join relation" and "failed to buildany 2-way joins"