Thread: pg 8.3 replication causing corruption
Should replication cause corruption on the secondary when stopping/starting the primary? (pg 8.3.12, windows 2008 R2 on both servers)
Everything seems to work OK, but when we trigger the replication server, it doesn’t pass a reindex. (Errors with can’t create unique indexes due to duplicate values.)
I detailed the logs in another post (link below), but fear my verbosity may have scared people off! :)
http://archives.postgresql.org/pgsql-general/2011-10/msg00121.php
Right now, our only solution seems to be to do another pg_start_backup() -> copy -> pg_stop_backup() after each nightly backup.
I'm doing something wrong.... Thanks for any help/pointers!
On 10/10/11 4:12 PM, Bob Hatfield wrote: > Should replication cause corruption on the secondary when > stopping/starting the primary? I wasn't aware 8.3 had any built in replication? what sort of replication add-ons are you using? -- john r pierce N 37, W 122 santa cruz ca mid-left coast
>> >> Should replication cause corruption on the secondary when stopping/starting the primary? > > I wasn't aware 8.3 had any built in replication? what sort of replication add-ons are you using? > Continuous archiving / WAL shipping as described in: http://www.postgresql.org/docs/8.3/interactive/continuous-archiving.html
On Tue, Oct 11, 2011 at 12:12 AM, Bob Hatfield <bobhatfield@gmail.com> wrote: > Should replication cause corruption on the secondary when stopping/starting > the primary? (pg 8.3.12, windows 2008 R2 on both servers) No, it shouldn't. Any duplicate keys would represent a serious error. It sounds like you're using warm standby, but when you say run pg_start_backup() AFTER each nightly backup I admit to being confused. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
>> >> Should replication cause corruption on the secondary when stopping/starting the primary? > > I wasn't aware 8.3 had any built in replication? what sort of replication add-ons are you using? > Continuous archiving / WAL shipping as described in: http://www.postgresql.org/docs/8.3/interactive/continuous-archiving.html
>> Should replication cause corruption on the secondary when stopping/starting >> the primary? (pg 8.3.12, windows 2008 R2 on both servers) > > No, it shouldn't. Any duplicate keys would represent a serious error. > > It sounds like you're using warm standby, but when you say run > pg_start_backup() AFTER each nightly backup I admit to being confused. > Thanks for your response. Perhaps a quick process flow would help clarify: Scenario 1 (no errors): a) create warm standby and let run throughout the day (works great) b) at the end of the day, trigger the secondary and run a reindex for testing (no errors) Scenario 2 (errors): a) create warm standby and let run throughout the day (works great) b) nightly backup: shutdown pg on primary, do a file system copy (for backup later), start pg again on primary c) the next morning, trigger the secondary and run a re-index for testing (ERRORS as described in thread) Side note: the data copied in 2.b is fine and also passes a full re-index. Scenario 3 (work around - not a very good one): a) create warm standby and let run throughout the day (works great) b) nightly backup: shutdown pg on primary, do a file system copy (for backup later), start pg again on primary c) the next morning, re-create the warm standby (this is where I may have confused you with doing a pg_start_backup after nightly backups) Thanks!
On Wed, Oct 12, 2011 at 10:51 PM, Bob Hatfield <bobhatfield@gmail.com> wrote: >>> Should replication cause corruption on the secondary when stopping/starting >>> the primary? (pg 8.3.12, windows 2008 R2 on both servers) >> >> No, it shouldn't. Any duplicate keys would represent a serious error. >> >> It sounds like you're using warm standby, but when you say run >> pg_start_backup() AFTER each nightly backup I admit to being confused. >> > > Thanks for your response. Perhaps a quick process flow would help clarify: > > Scenario 1 (no errors): > a) create warm standby and let run throughout the day (works great) > b) at the end of the day, trigger the secondary and run a reindex for > testing (no errors) > > Scenario 2 (errors): > a) create warm standby and let run throughout the day (works great) > b) nightly backup: shutdown pg on primary, do a file system copy (for > backup later), start pg again on primary > c) the next morning, trigger the secondary and run a re-index for > testing (ERRORS as described in thread) I see no reason to expect errors there. Something about your setup is suspect. Disks perhaps. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
> Something about your setup is suspect. Disks perhaps. Disk: Fusion IOdrive (1.2TB NAND drive) I've read that one should set wal_sync_method=fsync_writethrough for Windows servers. It's currently set to open_datasync, I have no idea what effect that will have other than I've read less performance. Thoughts?
On Wed, Oct 12, 2011 at 5:17 PM, Bob Hatfield <bobhatfield@gmail.com> wrote: >> Something about your setup is suspect. Disks perhaps. > > Disk: Fusion IOdrive (1.2TB NAND drive) > > I've read that one should set wal_sync_method=fsync_writethrough for > Windows servers. It's currently set to open_datasync, I have no idea > what effect that will have other than I've read less performance. > > Thoughts? have you had any power events? hard shutdowns, etc? I wonder if the problem is in the clog files, and not the heap itself. merlin
> have you had any power events? hard shutdowns, etc? I wonder if the problem is in the clog files, and not the heap itself. Nothing unusual for as long as I can tell. Reminder that as long as I don't restart the primary's pg process, everything works fine (secondary's data is intact). It's as if stopping/starting the primary causes a shipped wal file to be corrupt or contain duplicated data then processed by the secondary.
On Thu, Oct 13, 2011 at 4:07 PM, Bob Hatfield <bobhatfield@gmail.com> wrote: >> have you had any power events? hard shutdowns, etc? I wonder if the problem is in the clog files, and not the heap itself. > > Nothing unusual for as long as I can tell. Reminder that as long as I > don't restart the primary's pg process, everything works fine > (secondary's data is intact). > > It's as if stopping/starting the primary causes a shipped wal file to > be corrupt or contain duplicated data then processed by the secondary. My money is on clog/visibility related issues. It's a bit of a bear, but can you pull the xmin/xmax/ctid for the two duplicate records on the standby and the correspondingly non-duplicated record on the master? I'm curious if the heap blocks are identical and if the standby is incorrectly marking a transaction as valid/invalid. From there, We need to: *) figure out the transaction bits in clog on both systems and look them up there. *) also, look for differences in clog generally *) digest the heap block containing the records to see if they are identical *) double check hint bits? merlin
On Thu, Oct 13, 2011 at 4:20 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Thu, Oct 13, 2011 at 4:07 PM, Bob Hatfield <bobhatfield@gmail.com> wrote: >>> have you had any power events? hard shutdowns, etc? I wonder if the problem is in the clog files, and not the heap itself. >> >> Nothing unusual for as long as I can tell. Reminder that as long as I >> don't restart the primary's pg process, everything works fine >> (secondary's data is intact). >> >> It's as if stopping/starting the primary causes a shipped wal file to >> be corrupt or contain duplicated data then processed by the secondary. > > My money is on clog/visibility related issues. It's a bit of a bear, > but can you pull the xmin/xmax/ctid for the two duplicate records on > the standby and the correspondingly non-duplicated record on the > master? I'm curious if the heap blocks are identical and if the > standby is incorrectly marking a transaction as valid/invalid. > > From there, > > We need to: > *) figure out the transaction bits in clog on both systems and look > them up there. > *) also, look for differences in clog generally > *) digest the heap block containing the records to see if they are identical > *) double check hint bits? Any movement on this? There is considerable interest in any known issues resolving reproducible issues with postgres replication. Do you happen to remember if set up the standby when the master was under high load conditions? Any interesting/unexplained messages in the standby logs? merlin
> Any movement on this? There is considerable interest in any known > issues resolving reproducible issues with postgres replication. Do > you happen to remember if set up the standby when the master was under > high load conditions? Any interesting/unexplained messages in the > standby logs? I'm still trying to digest your last post AND I'm going to do some more testing this weekend. I do typically setup the standby while master is under load -- I will try to set it up under light/no load this weekend and test. Some Logs: head of postgresql...log on slave: ----------------------------------------------- 2011-09-27 02:24:54 MST LOG: could not open file "pg_xlog/000000010000059F000000C7" (log file 1439, segment 199): No such file or directory 2011-09-27 02:24:54 MST LOG: redo done at 59F/C6FFDB08 2011-09-27 02:24:54 MST LOG: last completed transaction was at log time 2011-09-26 23:55:29.746-07 2011-09-27 02:24:59 MST LOG: restored log file "000000010000059F000000C6" from archive 2011-09-27 02:25:29 MST LOG: selected new timeline ID: 2 2011-09-27 02:25:59 MST LOG: archive recovery complete 2011-09-27 02:26:03 MST LOG: database system is ready to accept connections 2011-09-27 02:26:03 MST LOG: autovacuum launcher started tail of STANDBY.LOG on slave after trigger: --------------------------------------------------------------- WAL file not present yet. Checking for trigger file... WAL file not present yet. Checking for trigger file... WAL file not present yet. Checking for trigger file...trigger file found Trigger file : P:\PROGRA~1\PostgreSQL\trigger.txt Waiting for WAL file : 000000010000059F000000C6 WAL file path : P:\PROGRA~1\PostgreSQL\archivewal\000000010000059F000000C6 Restoring to... : pg_xlog\RECOVERYXLOG Sleep interval : 5 seconds Max wait interval : 0 forever Command for restore : copy "P:\PROGRA~1\PostgreSQL\archivewal\000000010000059F000000C6" "pg_xlog\RECOVERYXLOG" Keep archive history : 000000000000000000000000 and later running restore : OK Trigger file : P:\PROGRA~1\PostgreSQL\trigger.txt Waiting for WAL file : 00000002.history WAL file path : P:\PROGRA~1\PostgreSQL\archivewal\00000002.history Restoring to... : pg_xlog\RECOVERYHISTORY Sleep interval : 5 seconds Max wait interval : 0 forever Command for restore : copy "P:\PROGRA~1\PostgreSQL\archivewal\00000002.history" "pg_xlog\RECOVERYHISTORY" Keep archive history : 000000000000000000000000 and later running restore :not restored : history file not found Trigger file : P:\PROGRA~1\PostgreSQL\trigger.txt Waiting for WAL file : 00000001.history WAL file path : P:\PROGRA~1\PostgreSQL\archivewal\00000001.history Restoring to... : pg_xlog\RECOVERYHISTORY Sleep interval : 5 seconds Max wait interval : 0 forever Command for restore : copy "P:\PROGRA~1\PostgreSQL\archivewal\00000001.history" "pg_xlog\RECOVERYHISTORY" Keep archive history : 000000000000000000000000 and later running restore :not restored : history file not found Tail of postgresql...log on master before/during shutdown for backups: --------------------------------------------------------------------------------------------- 2011-10-04 01:17:02 MST LOG: unexpected EOF on client connection 2011-10-04 01:17:11 MST LOG: received fast shutdown request 2011-10-04 01:17:11 MST LOG: aborting any active transactions 2011-10-04 01:17:11 MST FATAL: terminating connection due to administrator command 2011-10-04 01:17:11 MST FATAL: terminating connection due to administrator command 2011-10-04 01:17:12 MST FATAL: terminating connection due to administrator command 2011-10-04 01:17:12 MST FATAL: terminating connection due to administrator command 2011-10-04 01:17:12 MST FATAL: terminating connection due to administrator command 2011-10-04 01:17:12 MST FATAL: terminating connection due to administrator command 2011-10-04 01:17:12 MST FATAL: terminating connection due to administrator command 2011-10-04 01:17:12 MST LOG: autovacuum launcher shutting down 2011-10-04 01:17:12 MST LOG: shutting down 2011-10-04 01:17:12 MST LOG: database system is shut down Head of postgresql...log on master starting up after stopping/starting master's pg process: -------------------------------------------------------------------------------------------------------------------------- 2011-10-04 02:19:37 MST LOG: database system was shut down at 2011-10-04 01:17:12 MST 2011-10-04 02:19:37 MST LOG: database system is ready to accept connections 2011-10-04 02:19:37 MST LOG: autovacuum launcher started 2011-10-04 02:19:37 MST LOG: loaded library "$libdir/plugins/plugin_debugger.dll" <end of post>