Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1 - Mailing list pgsql-hackers

From Christophe Pettus
Subject Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date
Msg-id 0E76EE0A-1740-41B5-88FF-54AA98794532@thebuild.com
Responses Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1  (Christophe Pettus <xof@thebuild.com>)
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1  (Andres Freund <andres@2ndquadrant.com>)
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1  (Andres Freund <andres@2ndquadrant.com>)
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
Three times in the last two weeks, we have experienced data corruption on secondary servers using streaming replication on client systems.  The versions involved are 9.0.14, 9.2.5, and 9.3.1.  Each incident was separate; in two cases they were for the same client (9.0.14 and 9.3.1), and in one case for a different client (9.2.5).

The details of each incident are similar, but not identical.

They are as follows:

INCIDENT #1: 9.0.14 -- A new secondary (S1) was initialized using rsync off of an existing, correct primary (P1) for the base backup, and using WAL-E for WAL segment shipping.  Both the primary and secondary were running 9.0.14.  S1 properly connected to the primary once it was caught up on WAL segments, and S1 was then promoted to primary using the trigger file.
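For reference, S1's recovery.conf was along these lines (the connection details and paths below are placeholders, not the actual values):

    # recovery.conf on S1 (illustrative)
    standby_mode = 'on'
    primary_conninfo = 'host=p1.example.com port=5432 user=replication'
    restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'
    trigger_file = '/var/lib/postgresql/9.0/main/failover.trigger'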

No errors in the log files on either system.

After promotion, it was discovered that there was significant data loss on S1.  Rows that were present on P1 were missing on S1, and some rows were duplicated (including duplicates that violated primary key and other unique constraints).  The indexes were corrupt, in that they seemed to think that the duplicates were not duplicated, and that the missing rows were still present.

Because the client's schema included a "last_updated" field, we were able to determine that all of the rows that were either missing or duplicated had been updated on P1 shortly (3-5 minutes) before S1 was promoted.  It's possible, but not confirmed, that there were active queries (including updates) running on P1 at the moment of S1's promotion.
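For reference, duplicates of this kind can be surfaced with a query along these lines, run with sequential scans forced so the corrupt unique index isn't consulted ("t", "id", and "last_updated" here are stand-ins for the client's actual table and columns):

    -- illustrative: list duplicated keys and when they were last touched
    SET enable_indexscan = off;
    SET enable_bitmapscan = off;
    SELECT id, count(*) AS copies, max(last_updated) AS newest
      FROM t
     GROUP BY id
    HAVING count(*) > 1;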

As a note, P1 was created from another system (let's call it P0) using just WAL shipping (no streaming replication), and no data corruption was observed.

P1 and S1 were both AWS instances running Ubuntu 12.04, using EBS (with xfs as the file system) as the data volume.

P1 and S1 have been destroyed at this point.


INCIDENT #2: 9.3.1 -- In order to repair the database, a pg_dump was taken of S1, after having dropped the primary key and unique constraints, and restored into a new 9.3.1 server, P2.  Duplicate rows were purged, and missing rows were added again.  The database, a new primary, was then put back into production, and ran without incident.
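(For reference, duplicates of this kind can be purged with the usual ctid trick, roughly as below; the table and key names are again placeholders.)

    -- keep one copy of each duplicated row; "t" and "id" are placeholders
    DELETE FROM t a
     USING t b
     WHERE a.id = b.id
       AND a.ctid > b.ctid;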

A new secondary, S2, was created off of the primary.  This secondary was created using pg_basebackup with --xlog-method=stream, although the WAL-E archiving was still present.
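The base backup was taken with something along the lines of the following (host, user, and data directory are placeholders):

    pg_basebackup -h p2.example.com -U replication \
        -D /var/lib/postgresql/9.3/main \
        --xlog-method=stream --progress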

S2 attached to P2 without incident and with no errors in the logs, but nearly-identical corruption was discovered (although this time without the duplicated rows, just missing rows).  At this point, it's not clear if there was some clustering in the "last_updated" timestamps for the rows that are missing from S2.  No duplicated rows were observed.

P2 and S2 are both AWS instances running Ubuntu 12.04, using EBS (with xfs as the file system) as the data volume.

No errors in the log files on either system.

P2 and S2 are still operational.


INCIDENT #3: 9.2.5 -- A client was migrating a large database from a 9.2.2 system (P3) to a new 9.2.5 system (S3) using streaming replication.  As I personally didn't do the steps on this one, I don't have quite as much information, but the basics are close to incident #2: when S3 was promoted using the trigger file, no errors were observed and the database came up normally, but rows were missing from S3 that were present on P3.

P3 is running CentOS 6.3 with ext4 as the file system.

S3 is running CentOS 6.4 with ext3 as the file system.

Log shipping in this case was done via rsync.
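I believe that means an archive_command roughly like the following (the destination is a placeholder), though it may have been a periodic rsync of the WAL archive instead:

    archive_command = 'rsync -a %p standby.example.com:/var/lib/postgresql/wal_archive/%f'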

P3 and S3 are still operational.

No errors in the log files on either system.

--

Obviously, we're very concerned that a bug was introduced in the latest minor release.  We're happy to gather data as required to assist in diagnosing this.
--
-- Christophe Pettus  xof@thebuild.com



