Re: Fwd: Data corruption after restarting replica - Mailing list pgsql-general

From Novák, Petr
Subject Re: Fwd: Data corruption after restarting replica
Date
Msg-id CA+eEC0r9d6aX5=t_sKr3B9TThU582xH3E9oTc3McvNJ+Gk+VZg@mail.gmail.com
Whole thread Raw
In response to Fwd: Data corruption after restarting replica  (Novák, Petr <novakp@avast.com>)
List pgsql-general
Hi Tomasi,


sorry for replying so late, but I haven't noticed your email..

>On 16.2.2015 11:44, Novák, Petr wrote:
>> Hello,
>>
>> sorry for posting to second list, but as I've received  no reply
>> there, I'm trying my luck here.
>>
>> Thanks
>> Petr
>>
>>
>> ---------- Forwarded message ----------
>> From: Novák, Petr <novakp(at)avast(dot)com>
>> Date: Tue, Feb 10, 2015 at 12:49 PM
>> Subject: Data corruption after restarting replica
>> To: pgsql-bugs(at)postgresql(dot)org
>>
>>
>> Hi all,
>>
>> we're experiencing data corruption after switching streamed replica
>> to primary.This is not the first time I've encountered this issue,
>> so I'l try to describe it in more detail.
>>
>> For this particular cluster we have 6 servers in two datacenters (3
>> in each). There are two instances running on each server, each with
>> its own port and datadir. On the first two servers in each datacenter
>> one instance is primary and the other is replica for the primary from
>> the other server. Third server holds two offsite replicas from the
>> other datacenter (for DR purposes)
>
>So essentially you have three servers in each datacenter, configured
>like this:
>
>server A
>--------
>instance A1 (primary)
>instance A2 (standby for B1)
>
>server B
>--------
>instance B1 (primary)
>instance B2 (standby for B2)

You've probably mistyped. Correctly is instance B2 (standby for A1)

>
>server C (in the other datacenter)
>----------------------------------
>instance C1 (standby for A1)
>instance C2 (standby for B1)
>
>So if A or B fails, you can swich to the other machine to the replica in
>th same datacenter, and if the whole datacenter fails then you have C in
>the other one.
>
>Correct?
>

Exactly.

>> Each replica was set up by taking pg_basebackup from primary
>> (pg_basebackup -h <hostname> -p 5430 -D /data2/basebackup -P -v -U
>> <user> -x -c fast). Then directories from initdb were replaced with
>> the ones from basebackup (only the configuration files remained) and
>> the replica started and was successfully connected to primary. It
>> was running with no problem keeping up with the primary. We were
>> experiencing some connection problem between the two datacenters,
>> but replication didn't break.
>
>OK, that sounds about right. How long was the replication running before
>the failover? Have you tried running some queries on the standby?
>

The replication was running for several days. No, there were no
queries to the standby that I know of.

>BTW are you running async or sync replication (I guess async)?
>

Async.

>> Then we needed to take one datacenter offline due to hardware
>> maintenance. So I've switched the applications down, verified that
>> no more clients were connected to primary, then shut the primary down
>> and restarted replica without recovery.conf and the application were
>> started using the new db with no problem. Other replica even
>> successfully reconnected to this new primary.
>
>Other replica? If I understand your deployment right, and you've
>switched to the other datacenter, all the related replicas (original
>master + first standby) were in the other datacetenter (now offline). So
>what replica reconnected to the newly promoted primary?
>

It was before the DC outage. Puppet reconfigured the B2 to reconnect to C1

>Restart after removing recovery.conf is a bit cumbersome because of the
>restart (pg_ctl promote does not need that IIRC), but it's a legal way
>to do the promote.
>

Thanks for confirming that.

>> Few hours from the switch lines appeared in the server log (which
>> didn't appear before), indicating a corruption:
>>
>> ERROR: index "account_username_key" contains unexpected zero page at
>> block 1112135 ERROR: right sibling's left-link doesn't match: block
>> 476354 links to 1062443 instead of expected 250322 in index
>> "account_pkey"
>>
>> ..and many more reporting corruption in several other indexes.
>>
>> The issue was resolved by creating new indexes and dropping the
>> affected ones, although there were already some duplicities in the
>> data, that has to be resolved, as some of the indexes were unique.
>
>So the machine was running for a few hours just fine, and then something
>happened and these errors started occuring?
>

I've checked the logs again. It was running ok for 25 minutes and then
one query uncovered the problem when hitting the bad index page. Since
then there were several of these errors (with different indices and
different pages) each hour.

>Can you check whether the table was vacuumed recently (before the errors
>started to appear) or what else interesting happened?
>

There was ANALYZE completed 5 miuted before the first error appeared.

>> This particular case uses Postgres 9.1.14 on both primary and
>> replica. But I've experienced similar behavior on 9.2.9. OS Centos
>> 6.6 in all cases. This may mean, that there can be something wrong
>> with our configuration or the replication setup steps, but I've set
>> up another instance using the same steps with no problem.
>
>How different are the systems? I see you're running 6.6 on both ends,
>but maybe there's some tiny difference? One common issue we see from
>time to time are minor locale differences, affecting the indexes.
>

There are differences at various system packages for sure. Puppet only
ensures several package versions, otherwise the versions are dependent
on when was the system installed or when someone decided to update it
:(.
Do you have any particular packages in mind to check please?

>> Fsync related setting are at their defaults. Data directories are on
>> RAID10 arrays, with BBUs. Filesystem is ext4 mounted with nobarrier
>> option.
>
>Good ;-)
>
>> Database is fairly large ~120GB with several 50mil+ tables, lots of
>> indexes and FK constraints. It is mostly queried,
>> updates/inserts/deletes are only several rows/s.
>
>So it was queried for a few hours? Any idea if it was using the indexes,
>or if it was choosing different execution plans?
>

Well it was queried for 25 minutes, as I've now discovered. But if it
wouldn't be using indexes, we would certainly noticed, as the mostly
queried table has ~170mil rows :)

Petr

>--
>Tomas Vondra                http://www.2ndQuadrant.com/
>PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pgsql-general by date:

Previous
From: Oliver Dizon
Date:
Subject: POSTGRES to ORACLE dump
Next
From: Alexander Shutyaev
Date:
Subject: shared_buffers formula