Re: Database corruption in cascaded replica, "pg_xact/003A" doesn'texist, reading as zeroes" - Mailing list pgsql-general
From | Achilleas Mantzios |
---|---|
Subject | Re: Database corruption in cascaded replica, "pg_xact/003A" doesn'texist, reading as zeroes" |
Date | |
Msg-id | 98f9b849-183c-4ef2-8cf9-d1efdd9d752c@matrix.gatewaynet.com Whole thread Raw |
In response to | Database corruption in cascaded replica, "pg_xact/003A" doesn'texist, reading as zeroes" (Dejan Petrovic <dejan.petrovic@islonline.com>) |
List | pgsql-general |
On 5/12/18 4:10 μ.μ., Dejan Petrovic wrote: > I believe this is a result of my "broken" procedure for setting up a cascaded replica. I would love to know where the issueis. > > This is a report of a database corruption which was detected after promoting a replica server to a master server. We suspectthe actual corruption occurred during replication of a two-level cascaded > replica, however it was not detected until after it was promoted to master server when pg_dump and pg_restore failed. Afterre-checking postgresql logs on the corrupted node, we found this log line, > which was the only indicator of a possible issue, but was overlooked: "2018-07-25 13:14:42 UTCLOG: file "pg_xact/003A"doesn't exist, reading as zeroes". > > After extensive investigation, we detected three different cases of data corruption: > 1.)Issue with pg_toast - happens during pg_dump or when row with corrupted data is selected > "pg_dump: Dumping the contents of table failed: PQgetResult() failed. > pg_dump: Error message from server: ERROR: missing chunk number 2 for toast value 86123528 in pg_toast_41291242" > > 2.) Issue with uncommitted transactions, occurs when running VACUUM or pg_dump/pg_restore: > "ERROR: uncommitted xmin 60817551 from before xid cutoff 61487222 needs to be frozen > SQL state: XX001" > > 3.) 2 duplicated rows in different tables, violated PK unique constraint, occurs during pg_restore > > This is how we fixed the three issues: > 1.) Using bisection and SELECT, we found the row/column with corrupted data. We nulled the corrupt column and deleted therow (it was not critical, had no FKs) > 2.) We ran VACUUM on all tables to check where this issue occured and updated all rows between reported xmin and xid cutoffwith same data - this generated a new xmin on the broken rows, which fixed > the issue. > 3.) We removed the duplicated rows in a transaction, disabled all triggers before and enabled them right after > 4.) Our final step is to do a full dump and restore on master DB so that the DB is rebuilt > > This is the procedure we used to replicate the cascaded replica (master -> replica -> replica) > > Notes: > Machines are running on Centos 7, Postgresql 10.2 > DB-1 = master > DB-2 = replica of DB-1 > DB-3 = replica of DB-2 > > --------------> DB-3 (new cascaded replica) > systemctl disable postgresql-10 > systemctl stop postgresql-10 > --------------> DB-1 (master) > psql -U postgres > select pg_start_backup('clone',true); > <------------- DB-3 (cascaded replica) > rsync -azv -e --delete DB-2:/isl/pgsql10/data /isl/pgsql10/ --exclude pg_wal --exclude postgresql.pid > assert rsync finished > vi recovery.conf > standby_mode = 'on' > primary_conninfo = 'host=DB-2' > recovery_target_timeline = 'latest' In addition to what Stephen wrote you, from : https://www.postgresql.org/docs/10/continuous-archiving.html the low level backup method is relying on : "1. Ensure that WAL archiving is enabled and working." Have you setup this correctly? Also there should at least be a restore_command inside your recovery.conf . Otherwise how do you know that DB-2 has the WALsyou need? Whereas with your WALs archived it is guaranteed that the backup is consistent. But anyway, its better to either use pg_basebackup or invest in some higher level tool like Stephen said. > -------------> DB-1 > select pg_stop_backup(); > <------------- DB-3 > rsync -azv -e --delete DB-2:/isl/pgsql10/data/pg_wal /isl/pgsql10/data/ > systemctl start postgresql-10 > systemctl enable postgresql-10 > tail log file > > Any comments regarding the cascaded replication procedure or database corruption detection or resolution are welcome. > > Best regards, > > Dejan Petrovic > ISL Online > > -- Achilleas Mantzios IT DEV Lead IT DEPT Dynacom Tankers Mgmt
pgsql-general by date: