Thread: Database corruption in cascaded replica: "pg_xact/003A" doesn't exist, reading as zeroes

I believe this is a result of my "broken" procedure for setting up a
cascaded replica. I would love to know where the issue is.

This is a report of a database corruption which was detected after
promoting a replica server to a master server. We suspect the actual
corruption occurred during replication of a two-level cascaded replica,
but it was not detected until after the node was promoted to master,
when pg_dump and pg_restore failed. After re-checking the PostgreSQL
logs on the corrupted node, we found this log line, which was the only
indicator of a possible issue but was overlooked: "2018-07-25 13:14:42
UTC LOG: file "pg_xact/003A" doesn't exist, reading as zeroes".

After extensive investigation, we detected three different cases of data
corruption (a quick way to re-trigger these cases is sketched after this list):
1.) Issue with pg_toast - happens during pg_dump or when a row with
corrupted data is selected:
"pg_dump: Dumping the contents of table failed: PQgetResult() failed.
pg_dump: Error message from server: ERROR: missing chunk number 2 for
toast value 86123528 in pg_toast_41291242"

2.) Issue with uncommitted transactions, occurs when running VACUUM or
pg_dump/pg_restore:
"ERROR: uncommitted xmin 60817551 from before xid cutoff 61487222 needs
to be frozen
SQL state: XX001"

3.) Two duplicated rows in different tables that violated a PK unique
constraint - occurs during pg_restore
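
Not from the original report - just for reference, a quick way to
re-trigger the first two cases from the shell; the database name mydb is
a placeholder:

# force a full read of every table on the promoted master
pg_dump -U postgres mydb > /dev/null   # surfaces the TOAST "missing chunk" error
vacuumdb -U postgres --all             # surfaces the "uncommitted xmin ... needs to be frozen" error
# case 3 only shows up when the resulting dump is restored (pg_restore/psql)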

This is how we fixed the three issues (a rough SQL sketch follows the list):
1.) Using bisection and SELECT, we found the row/column with the
corrupted data. We nulled the corrupt column and then deleted the row
(it was not critical and had no FKs).
2.) We ran VACUUM on all tables to check where this issue occurred and
updated all rows between the reported xmin and the xid cutoff with the
same data - this generated a new xmin on the broken rows, which fixed
the issue.
3.) We removed the duplicated rows in a transaction, disabling all
triggers before and re-enabling them right after.
4.) As a final step, we will do a full dump and restore on the master DB
so that the DB is rebuilt.
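
For reference, a rough SQL sketch of what steps 1-3 amounted to. The
table, column, and id values (broken_table, payload, dup_table, id) are
placeholders rather than the real objects; the xmin range is the one
reported by the VACUUM error above.

-- 1.) bisection: halve the id range until the one row whose SELECT fails is isolated
SELECT id, payload FROM broken_table WHERE id BETWEEN 1 AND 500000;  -- error => corrupt row is in this half
UPDATE broken_table SET payload = NULL WHERE id = 123456;            -- null the corrupt column
DELETE FROM broken_table WHERE id = 123456;                          -- then drop the row

-- 2.) rewrite the rows between the reported xmin and the xid cutoff with the
--     same data; the no-op UPDATE creates new row versions with a fresh xmin
UPDATE broken_table SET payload = payload
 WHERE xmin::text::bigint BETWEEN 60817551 AND 61487222;
VACUUM broken_table;  -- confirm the "uncommitted xmin" error is gone

-- 3.) remove the duplicates in one transaction with all triggers disabled
BEGIN;
ALTER TABLE dup_table DISABLE TRIGGER ALL;
DELETE FROM dup_table
 WHERE ctid IN (SELECT ctid
                  FROM (SELECT ctid,
                               row_number() OVER (PARTITION BY id) AS rn
                          FROM dup_table) s
                 WHERE rn > 1);  -- keeps one physical copy per key
ALTER TABLE dup_table ENABLE TRIGGER ALL;
COMMIT;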

This is the procedure we used to set up the cascaded replica (master
-> replica -> replica):

Notes:
Machines are running CentOS 7, PostgreSQL 10.2
DB-1 = master
DB-2 = replica of DB-1
DB-3 = replica of DB-2

--------------> DB-3 (new cascaded replica)
systemctl disable postgresql-10
systemctl stop postgresql-10
--------------> DB-1 (master)
psql -U postgres
select pg_start_backup('clone',true);
<------------- DB-3 (cascaded replica)
rsync -azv -e --delete DB-2:/isl/pgsql10/data /isl/pgsql10/ --exclude
pg_wal --exclude postgresql.pid
assert rsync finished
vi recovery.conf
     standby_mode = 'on'
     primary_conninfo = 'host=DB-2'
     recovery_target_timeline = 'latest'
-------------> DB-1
select pg_stop_backup();
<------------- DB-3
rsync -azv -e --delete DB-2:/isl/pgsql10/data/pg_wal /isl/pgsql10/data/
systemctl start postgresql-10
systemctl enable postgresql-10
tail log file
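
(Not part of the original procedure: besides tailing the log file,
streaming can also be confirmed from the standard statistics views,
for example:)

# on DB-3: is the WAL receiver connected to and streaming from DB-2?
psql -U postgres -c "SELECT status, received_lsn, conninfo FROM pg_stat_wal_receiver;"
# on DB-2: does DB-3 show up as a streaming client, and how far behind is it?
psql -U postgres -c "SELECT client_addr, state, replay_lag FROM pg_stat_replication;"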

Any comments regarding the cascaded replication procedure or database
corruption detection or resolution are welcome.

Best regards,

Dejan Petrovic
ISL Online



Greetings,

* Dejan Petrovic (dejan.petrovic@islonline.com) wrote:
> I believe this is a result of my "broken" procedure for setting up a
> cascaded replica. I would love to know where the issue is.

[...]

> Notes:
> Machines are running on Centos 7, Postgresql 10.2
> DB-1 = master
> DB-2 = replica of DB-1
> DB-3 = replica of DB-2
>
> --------------> DB-3 (new cascaded replica)
> systemctl disable postgresql-10
> systemctl stop postgresql-10
> --------------> DB-1 (master)
> psql -U postgres
> select pg_start_backup('clone',true);
> <------------- DB-3 (cascaded replica)
> rsync -azv -e --delete DB-2:/isl/pgsql10/data /isl/pgsql10/ --exclude pg_wal
> --exclude postgresql.pid
> assert rsync finished
> vi recovery.conf
>     standby_mode = 'on'
>     primary_conninfo = 'host=DB-2'
>     recovery_target_timeline = 'latest'
> -------------> DB-1
> select pg_stop_backup();
> <------------- DB-3
> rsync -azv -e --delete DB-2:/isl/pgsql10/data/pg_wal /isl/pgsql10/data/
> systemctl start postgresql-10
> systemctl enable postgresql-10

If I'm reading this correctly, you're doing a pg_start_backup() on the
primary, but then making a copy of the DB-2 replica (while the primary
and the DB-2 replica are still running..).

In short, no, that doesn't work.  Never has.  That someone thought they
could create a replica that way is just another good reason to rip out
the broken exclusive-backup mode.  I hope this isn't how your regular
backups are done.

I'd strongly suggest that you use existing tools to do this - in
particular, you should be able to use pg_basebackup against the DB-2
replica to build out your cascaded replica and that'll do all the right
things.  Alternatively, using a backup solution like pgbackrest would
allow you to quickly build a replica from your backups instead of from a
running replica.
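
A minimal sketch of that, assuming a replication role named "replicator"
exists and DB-2 allows replication connections from DB-3 in pg_hba.conf
(paths and names are illustrative, not from the original setup):

# on DB-3, with postgresql-10 stopped and /isl/pgsql10/data emptied first
pg_basebackup -h DB-2 -U replicator -D /isl/pgsql10/data -X stream -R -P
# -X stream : also streams the WAL generated during the copy, keeping it consistent
# -R        : writes a recovery.conf with standby_mode = 'on' and primary_conninfo
#             pointing at DB-2 (PostgreSQL 10 syntax)
systemctl start postgresql-10

Note that -R does not add recovery_target_timeline = 'latest'; that can
still be set by hand in the generated recovery.conf if DB-2 might be
promoted later.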

Thanks!

Stephen

On 5/12/18 4:10 p.m., Dejan Petrovic wrote:
> I believe this is a result of my "broken" procedure for setting up a
> cascaded replica. I would love to know where the issue is.
>
> This is a report of a database corruption which was detected after
> promoting a replica server to a master server. We suspect the actual
> corruption occurred during replication of a two-level cascaded replica,
> however it was not detected until after it was promoted to master server
> when pg_dump and pg_restore failed. After re-checking postgresql logs on
> the corrupted node, we found this log line, which was the only indicator
> of a possible issue, but was overlooked: "2018-07-25 13:14:42 UTC LOG:
> file "pg_xact/003A" doesn't exist, reading as zeroes".
>
> After extensive investigation, we detected three different cases of data corruption:
> 1.) Issue with pg_toast - happens during pg_dump or when row with corrupted data is selected
> "pg_dump: Dumping the contents of table failed: PQgetResult() failed.
> pg_dump: Error message from server: ERROR: missing chunk number 2 for toast value 86123528 in pg_toast_41291242"
>
> 2.) Issue with uncommitted transactions, occurs when running VACUUM or pg_dump/pg_restore:
> "ERROR: uncommitted xmin 60817551 from before xid cutoff 61487222 needs to be frozen
> SQL state: XX001"
>
> 3.) 2 duplicated rows in different tables, violated PK unique constraint, occurs during pg_restore
>
> This is how we fixed the three issues:
> 1.) Using bisection and SELECT, we found the row/column with corrupted data. We nulled the corrupt column and deleted
> the row (it was not critical, had no FKs)
> 2.) We ran VACUUM on all tables to check where this issue occurred and updated all rows between reported xmin and xid
> cutoff with same data - this generated a new xmin on the broken rows, which fixed
> the issue.
> 3.) We removed the duplicated rows in a transaction, disabled all triggers before and enabled them right after
> 4.) Our final step is to do a full dump and restore on master DB so that the DB is rebuilt
>
> This is the procedure we used to replicate the cascaded replica (master -> replica -> replica)
>
> Notes:
> Machines are running on Centos 7, Postgresql 10.2
> DB-1 = master
> DB-2 = replica of DB-1
> DB-3 = replica of DB-2
>
> --------------> DB-3 (new cascaded replica)
> systemctl disable postgresql-10
> systemctl stop postgresql-10
> --------------> DB-1 (master)
> psql -U postgres
> select pg_start_backup('clone',true);
> <------------- DB-3 (cascaded replica)
> rsync -azv -e --delete DB-2:/isl/pgsql10/data /isl/pgsql10/ --exclude pg_wal --exclude postgresql.pid
> assert rsync finished
> vi recovery.conf
>     standby_mode = 'on'
>     primary_conninfo = 'host=DB-2'
>     recovery_target_timeline = 'latest'

In addition to what Stephen wrote, from https://www.postgresql.org/docs/10/continuous-archiving.html
the low-level backup method relies on:
"1. Ensure that WAL archiving is enabled and working."
Have you set this up correctly?

Also, there should at least be a restore_command inside your recovery.conf. Otherwise, how do you know that DB-2 has the
WALs you need? Whereas with your WALs archived, it is guaranteed that the backup
is consistent.
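
For illustration, roughly like this (the archive path below is a
placeholder, not taken from the actual setup):

# postgresql.conf on the primary (DB-1) - archive location is hypothetical
archive_mode = on
archive_command = 'cp %p /wal_archive/%f'

# recovery.conf on the new standby (PostgreSQL 10 syntax)
standby_mode = 'on'
primary_conninfo = 'host=DB-2'
restore_command = 'cp /wal_archive/%f %p'
recovery_target_timeline = 'latest'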

But anyway, it's better to either use pg_basebackup or invest in some higher-level tool, as Stephen said.

> -------------> DB-1
> select pg_stop_backup();
> <------------- DB-3
> rsync -azv -e --delete DB-2:/isl/pgsql10/data/pg_wal /isl/pgsql10/data/
> systemctl start postgresql-10
> systemctl enable postgresql-10
> tail log file
>
> Any comments regarding the cascaded replication procedure or database corruption detection or resolution are
> welcome.
>
> Best regards,
>
> Dejan Petrovic
> ISL Online
>
>


-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt