Re: 'full_page_writes=off' , VACUUM and crashing streaming slaves... - Mailing list pgsql-general
From | Sean Chittenden |
---|---|
Subject | Re: 'full_page_writes=off' , VACUUM and crashing streaming slaves... |
Date | |
Msg-id | 57B4F1DB-D8E1-4195-86C4-F66E61915B84@chittenden.org Whole thread Raw |
In response to | Re: 'full_page_writes=off' , VACUUM and crashing streaming slaves... (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-general |
>> Oct 5 15:00:25 db01 postgres[76648]: [5944-1] javafail@dbcluster 76648 0: FATAL: pipe() failed: Too many open filesin system > > This message must be coming from initSelfPipe(), and after poking around > a bit I think the failure must be occurring while a new backend is > attempting to do "OwnLatch(&MyProc->procLatch)" in InitProcess. The > reason the postmaster treats this as a crash is that the new backend > just armed the dead-man switch (MarkPostmasterChildActive) but it exits > without doing ProcKill which would disarm it. So this is just an > order-of-operations bug in InitProcess: we're assuming that it can't > fail before reaching "on_shmem_exit(ProcKill, 0)", and the latch > additions broke that. (Though looking at it, assuming that the > PGSemaphoreReset call cannot fail seems a tad risky too.) > > So that explains the crashes, but it doesn't (directly) explain why you > had data corruption. I've since been able to repeat this with full_page_writes=on and have had identical corruption, so I don't think this isfull_page_writes related any more. I can also confirm that there was just one crash by the master database because it requiredmanual intervention to bring back up (a backup label was sitting stale in the data dir[1]). The slaves died during the post-crash VACUUM just as earlier. I'm completing the VACUUM now and am seeing multiple warningsfrom VACUUM. WARNING: relation "tbl_a" page 2115352 is uninitialized --- fixing WARNING: relation "tbl_a" page 2115353 is uninitialized --- fixing WARNING: relation "tbl_a" page 2115354 is uninitialized --- fixing WARNING: relation "tbl_a" page 2115355 is uninitialized --- fixing WARNING: relation "tbl_a" page 2115356 is uninitialized --- fixing WARNING: relation "tbl_a" page 2115357 is uninitialized --- fixing WARNING: relation "tbl_a" page 2115358 is uninitialized --- fixing WARNING: relation "tbl_a" page 2115359 is uninitialized --- fixing WARNING: relation "tbl_a" page 2115360 is uninitialized --- fixing On one of the slaves: Oct 11 00:17:36 db02 postgres[66904]: [21-1] @ 66904 0: WARNING: page 120547 of relation base/16387/20196 is uninitialized Oct 11 00:17:36 db02 postgres[66904]: [21-2] @ 66904 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 125016, lastBlockVacuumed0 Oct 11 00:17:36 db02 postgres[66904]: [22-1] @ 66904 0: PANIC: WAL contains references to invalid pages Oct 11 00:17:36 db02 postgres[66904]: [22-2] @ 66904 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 125016, lastBlockVacuumed0 Oct 11 00:17:36 db02 postgres[66897]: [10-1] @ 66897 0: LOG: startup process (PID 66904) was terminated by signal 6: Aborttrap Oct 11 00:17:36 db02 postgres[66897]: [11-1] @ 66897 0: LOG: terminating any other active server processes Hopefully this is helpful information. -sc [1] The backup label was residual from the initial sync and wasn't cleaned up for some reason during a pg_stop_backup().For now I'm chalking this up as a bug in repmgr even though repmgr completed cloning the slave successfully(supposedly). START WAL LOCATION: 9F/36000020 (file 000000010000009F00000036) CHECKPOINT LOCATION: 9F/37E9D6D8 BACKUP METHOD: pg_start_backup BACKUP FROM: master START TIME: 2012-10-06 17:48:10 UTC LABEL: repmgr_standby_clone_1349545601 -- Sean Chittenden sean@chittenden.org
pgsql-general by date: