Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1 - Mailing list pgsql-hackers

From	Andres Freund
Subject	Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date	November 18, 2013 19:29:11
Msg-id	20131118192857.GB26763@awork2.anarazel.de
In response to	Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1 (Christophe Pettus <xof@thebuild.com>)
Responses	Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
List	pgsql-hackers

Hi,

On 2013-11-18 10:58:26 -0800, Christophe Pettus wrote:
> INCIDENT #1: 9.0.14 -- A new secondary (S1) was initialized using
> rsync off of an existing, correct primary (P1) for the base backup,
> and using WAL-E for WAL segment shipping.  Both the primary and
> secondary were running 9.0.14.  S1 properly connected to the primary
> once it was caught up on WAL segments, and S1 was then promoted as
> a primary using the trigger file.

Could you detail how exactly the base backup was created? Including the
*exact* logic for copying?

> No errors in the log files on either system.

Do you have the log entries for the startup after the base backup?

> Because the client's schema included a "last_updated" field, we were
> able to determine that all of the rows that were either missing or
> duplicated had been updated on P1 shortly (3-5 minutes) before S1 was
> promoted.  It's possible, but not confirmed, that there were active
> queries (including updates) running on P1 at the moment of S1's
> promotion.

Any chance you have log_checkpoints enabled? If so, could you check
whether there was a checkpoint around the time of the base backup?

This server is gone, right? If not, could you do:
SELECT ctid, xmin, xmax, * FROM whatever WHERE duplicate_row?
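
A minimal sketch of what that could look like in practice, assuming a
hypothetical key column "id" on the table (both "whatever" and "id" are
placeholders, not names from the affected schema). Index and bitmap scans
are disabled for the session on the assumption that an index over the key
might itself be inconsistent, so the rows are read straight from the heap:

```sql
-- Placeholder names: table "whatever", key column "id".
-- Avoid index-based plans in case the index disagrees with the heap.
SET enable_indexscan = off;
SET enable_bitmapscan = off;

-- Show the physical location (ctid) and the inserting/deleting
-- transaction ids (xmin/xmax) of every row whose key occurs more than once.
SELECT ctid, xmin, xmax, *
FROM whatever
WHERE id IN (
    SELECT id
    FROM whatever
    GROUP BY id
    HAVING count(*) > 1
)
ORDER BY id, ctid;

-- Restore the planner defaults for the rest of the session.
RESET enable_indexscan;
RESET enable_bitmapscan;
```

Comparing xmin and xmax across the copies of a duplicated key should show
whether one copy is an older row version that ought to have been marked
dead.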

> INCIDENT #2: 9.3.1 -- In order to repair the database, a pg_dump was
> taken of S1, after having dropped the primary and unique constraints,
> and restored into a new 9.3.1 server, P2.  Duplicate rows were purged,
> and missing rows were added again.  The database, a new primary, was
> then put back into production, and ran without incident.
>
> A new secondary, S2 was created off of the primary.  This secondary
> was created using pg_basebackup using --xlog-method=stream, although
> the WAL-E archiving was still present.
>
> S2 attached to P2 without incident and no errors in the logs, but
> nearly-identical corruption was discovered (although this time without
> the duplicated rows, just missing rows).  At this point, it's not
> clear if there was some clustering in the "last_updated" timestamp for
> the rows that are missing from S2.  No duplicated rows were observed.
>
> P2 and S2 are both AWS instances running Ubuntu 12.04, using EBS (with xfs as the file system) as the data volume.
> 
> No errors in the log files on either system.

Could you list the *exact* steps you took to start up the cluster?

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
