Re: 'full_page_writes=off' , VACUUM and crashing streaming slaves... - Mailing list pgsql-general

From Sean Chittenden
Subject Re: 'full_page_writes=off' , VACUUM and crashing streaming slaves...
Date
Msg-id 57B4F1DB-D8E1-4195-86C4-F66E61915B84@chittenden.org
Whole thread Raw
In response to Re: 'full_page_writes=off' , VACUUM and crashing streaming slaves...  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
>> Oct  5 15:00:25 db01 postgres[76648]: [5944-1] javafail@dbcluster 76648 0: FATAL:  pipe() failed: Too many open
filesin system 
>
> This message must be coming from initSelfPipe(), and after poking around
> a bit I think the failure must be occurring while a new backend is
> attempting to do "OwnLatch(&MyProc->procLatch)" in InitProcess.  The
> reason the postmaster treats this as a crash is that the new backend
> just armed the dead-man switch (MarkPostmasterChildActive) but it exits
> without doing ProcKill which would disarm it.  So this is just an
> order-of-operations bug in InitProcess: we're assuming that it can't
> fail before reaching "on_shmem_exit(ProcKill, 0)", and the latch
> additions broke that.  (Though looking at it, assuming that the
> PGSemaphoreReset call cannot fail seems a tad risky too.)
>
> So that explains the crashes, but it doesn't (directly) explain why you
> had data corruption.

I've since been able to repeat this with full_page_writes=on and have had identical corruption, so I don't think this
isfull_page_writes related any more. I can also confirm that there was just one crash by the master database because it
requiredmanual intervention to bring back up (a backup label was sitting stale in the data dir[1]). 

The slaves died during the post-crash VACUUM just as earlier. I'm completing the VACUUM now and am seeing multiple
warningsfrom VACUUM. 

WARNING:  relation "tbl_a" page 2115352 is uninitialized --- fixing
WARNING:  relation "tbl_a" page 2115353 is uninitialized --- fixing
WARNING:  relation "tbl_a" page 2115354 is uninitialized --- fixing
WARNING:  relation "tbl_a" page 2115355 is uninitialized --- fixing
WARNING:  relation "tbl_a" page 2115356 is uninitialized --- fixing
WARNING:  relation "tbl_a" page 2115357 is uninitialized --- fixing
WARNING:  relation "tbl_a" page 2115358 is uninitialized --- fixing
WARNING:  relation "tbl_a" page 2115359 is uninitialized --- fixing
WARNING:  relation "tbl_a" page 2115360 is uninitialized --- fixing

On one of the slaves:

Oct 11 00:17:36 db02 postgres[66904]: [21-1] @ 66904 0: WARNING:  page 120547 of relation base/16387/20196 is
uninitialized
Oct 11 00:17:36 db02 postgres[66904]: [21-2] @ 66904 0: CONTEXT:  xlog redo vacuum: rel 1663/16387/20196; blk 125016,
lastBlockVacuumed0 
Oct 11 00:17:36 db02 postgres[66904]: [22-1] @ 66904 0: PANIC:  WAL contains references to invalid pages
Oct 11 00:17:36 db02 postgres[66904]: [22-2] @ 66904 0: CONTEXT:  xlog redo vacuum: rel 1663/16387/20196; blk 125016,
lastBlockVacuumed0 
Oct 11 00:17:36 db02 postgres[66897]: [10-1] @ 66897 0: LOG:  startup process (PID 66904) was terminated by signal 6:
Aborttrap 
Oct 11 00:17:36 db02 postgres[66897]: [11-1] @ 66897 0: LOG:  terminating any other active server processes

Hopefully this is helpful information. -sc



[1] The backup label was residual from the initial sync and wasn't cleaned up for some reason during a
pg_stop_backup().For now I'm chalking this up as a bug in repmgr even though repmgr completed cloning the slave
successfully(supposedly). 

START WAL LOCATION: 9F/36000020 (file 000000010000009F00000036)
CHECKPOINT LOCATION: 9F/37E9D6D8
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2012-10-06 17:48:10 UTC
LABEL: repmgr_standby_clone_1349545601

--
Sean Chittenden
sean@chittenden.org



pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: Can two “SELECT FOR UPDATE” statements on the same table cause a deadlock?
Next
From: "Joshua D. Drake"
Date:
Subject: Re: moving from MySQL to pgsql