Re: Corrupted Data ? - Mailing list pgsql-general

From Ioana Danes
Subject Re: Corrupted Data ?
Date
Msg-id CAPg0s+5V1b72xx7yT8N_9jz1nsUdYHHaWG4nchaHqE3eghce6A@mail.gmail.com
Whole thread Raw
In response to Re: Corrupted Data ?  (Adrian Klaver <adrian.klaver@aklaver.com>)
Responses Re: Corrupted Data ?  (Melvin Davidson <melvin6925@gmail.com>)
Re: Corrupted Data ?  (Francisco Olarte <folarte@peoplecall.com>)
List pgsql-general
Hello Everyone,

I have new information on this case.  I also open a post for Bucardo because I am still not sure what triggers this problem.

The problem happened again on the same table but on another field. Few days ago I started a fourth database called drdb that is a PITR slave from db3.

DB1, DB2, DRDB:

-[ RECORD 1 ]-------+--------------------------
gameplayid           | 324455566
systemuserid        | 515151
gameid                 | 41
transactionid         | 666556533
drawid                  | 318220
....

DB3:
-[ RECORD 1 ]-------+--------------------------
gameplayid           | 32445556
systemuserid        | 515151
gameid                 | 41
transactionid         | 666556533
drawid                  | 318216



Here are the facts I know:

August 10 @ 11:10
    - The record was created on db1 and replicated to db2 and db3
August 11 @ 2:30
    - db1, db2 and db3 are in sync (I have a script that compares the data for all 3 dbs every night @ 2:30 am)
August 12 @ 2:30
    - db3 is out of sync because of this field (drawid)
    - drdb (which is PITRed from db3) is in sync with db1 and db2?????


Because drdb (PITR slave from db3) is in sync with db1 and db2 and because the base backup was taken before the record in case was created, I believe that the xlogs are fine and I have a data kind of corruption on db3 on the data file for that table that happened after August 11 @ 2:30 (because the compare script found the dbs in sync)...

Also the index is correct on db3 as the record in case (with drawid = 318216) is retrieved if I filter by drawid = 318220

Any help is greatly appreciated,

Thank you


On Mon, Aug 8, 2016 at 1:25 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 08/08/2016 10:06 AM, Ioana Danes wrote:


On Mon, Aug 8, 2016 at 12:55 PM, Adrian Klaver>         75315811???


    Corrupted index on db3?

yes


    Might want to look in the db3 logs to see if anything pops out.

I checked the logs, no traces of errors or corruption.


    I just do not know enough about Burcardo to be of much help beyond that.



it is trigger based, it saves the ids of the inserted record in a delta
table and then on sync it creates copy commands to the slave. Even if
there is a bug or corruption in that process I don't see how that
corrupts the index on db3...

It seems to do more then that:

https://bucardo.org/wiki/Bucardo/Documentation/Overview

That is why I suggested the post to the Burcardo list. Folks there will have a better idea what goes under the hood.

There is also this from a previous post:

"Only one master is active at one time the other one is in stand by that is a topic for another discussion but in our case that works well."

Have no idea how that interaction plays out.

At this point what I see is:

1) Data is entered on a master and is correct there.

2) Data is replicated to a single standby from one of two possible sources via Bucardo and is no longer correct.

3) Now Bucardo uses Postgres to do its work so it is possible that something in Postgres is at fault. Still the fact that the data is good on the master but not in the standby tends to indicate that the act of replication is the issue.

4) Exactly how that replication is accomplished is not obvious to me.





So it is either replication bug + index corruption on db3 or data
corruption on db3...


In response to Melvin, the query returns no rows:


SELECT n.nspname,
               i.relname,
               i.indexrelname,
               CASE WHEN idx.indisprimary
                         THEN 'pkey'
                          WHEN idx.indisunique
                          THEN 'uidx'
                          ELSE 'idx'
                  END AS type,
                  'INVALID'
  FROM pg_stat_all_indexes i
    JOIN pg_class c     ON (c.oid = i.relid)
    JOIN pg_namespace n ON (n.oid = c.relnamespace)
    JOIN pg_index idx   ON (idx.indexrelid =  i.indexrelid )
WHERE idx.indisvalid = FALSE
ORDER BY 1, 2;

 nspname | relname | indexrelname | type | ?column?
---------+---------+--------------+------+----------
(0 rows)


Thank you for your thoughts,
ioana



        Thanks,
        ioana



            --
            Adrian Klaver
            adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>
        <mailto:adrian.klaver@aklaver.com
        <mailto:adrian.klaver@aklaver.com>>




    --
    Adrian Klaver
    adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>




--
Adrian Klaver
adrian.klaver@aklaver.com

pgsql-general by date:

Previous
From: Chris Travers
Date:
Subject: Re: Postgres Pain Points 2 ruby / node language drivers
Next
From: Melvin Davidson
Date:
Subject: Re: Corrupted Data ?