Re: Standby stopped working after PANIC: WAL contains references to invalid pages - Mailing list pgsql-general

From Dan Kogan
Subject Re: Standby stopped working after PANIC: WAL contains references to invalid pages
Date
Msg-id 60B572D9298D944580F7D51195DD3080468902FFD1@VMBX125.ihostexchange.net
Whole thread Raw
In response to Re: Standby stopped working after PANIC: WAL contains references to invalid pages  (Lonni J Friedman <netllama@gmail.com>)
Responses Re: Standby stopped working after PANIC: WAL contains references to invalid pages  (Lonni J Friedman <netllama@gmail.com>)
List pgsql-general
Re-seeding the standby with a full base backup does seem to make the error go away.
The standby started, caught up and has been working for about 2 hours.

The file in the error message was an index.  We rebuilt it just in case.
Is there any way to debug the issue at this point?



-----Original Message-----
From: Lonni J Friedman [mailto:netllama@gmail.com] 
Sent: Saturday, June 22, 2013 4:11 PM
To: Dan Kogan
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Standby stopped working after PANIC: WAL contains references to invalid pages

Looks like some kind of data corruption.  Question is whether it came from the master, or was created by the standby.
Ifyou re-seed the standby with a full (base) backup, does the problem go away?
 

On Sat, Jun 22, 2013 at 12:43 PM, Dan Kogan <dan@iqtell.com> wrote:
> Hello,
>
>
>
> Today our standby instance stopped working with this error in the log:
>
>
>
> 2013-06-22 16:27:32 UTC [8367]: [247-1] [] WARNING:  page 158130 of 
> relation
> pg_tblspc/16447/PG_9.2_201204301/16448/39154429 is uninitialized
>
> 2013-06-22 16:27:32 UTC [8367]: [248-1] [] CONTEXT:  xlog redo vacuum: 
> rel 16447/16448/39154429; blk 158134, lastBlockVacuumed 158129
>
> 2013-06-22 16:27:32 UTC [8367]: [249-1] [] PANIC:  WAL contains 
> references to invalid pages
>
> 2013-06-22 16:27:32 UTC [8367]: [250-1] [] CONTEXT:  xlog redo vacuum: 
> rel 16447/16448/39154429; blk 158134, lastBlockVacuumed 158129
>
> 2013-06-22 16:27:32 UTC [8366]: [3-1] [] LOG:  startup process (PID 
> 8367) was terminated by signal 6: Aborted
>
> 2013-06-22 16:27:32 UTC [8366]: [4-1] [] LOG:  terminating any other 
> active server processes
>
>
>
> After re-start the same exact error occurred.
>
>
>
> We thought that maybe we hit this bug - 
>
http://postgresql.1045698.n5.nabble.com/Completely-broken-replica-after-PANIC-WAL-contains-references-to-invalid-pages-td5750072.html.
>
> However, there is nothing in our log about sub-transactions, so it 
> didn't seem the same to us.
>
>
>
> Any advice on how to further debug this so we can avoid this in the 
> future is appreciated.
>
>
>
> Environment:
>
>
>
> AWS, High I/O instance (hi1.4xlarge), 60GB RAM
>
>
>
> Software and settings:
>
>
>
> PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu, compiled by gcc 
> (Ubuntu/Linaro
> 4.5.2-8ubuntu4) 4.5.2, 64-bit
>
>
>
> archive_command          rsync -a %p
> slave:/var/lib/postgresql/replication_load/%f
>
> archive_mode   on
>
> autovacuum_freeze_max_age 1000000000
>
> autovacuum_max_workers        6
>
> checkpoint_completion_target 0.9
>
> checkpoint_segments   128
>
> checkpoint_timeout       30min
>
> default_text_search_config       pg_catalog.english
>
> hot_standby      on
>
> lc_messages      en_US.UTF-8
>
> lc_monetary      en_US.UTF-8
>
> lc_numeric          en_US.UTF-8
>
> lc_time en_US.UTF-8
>
> listen_addresses              *
>
> log_checkpoints               on
>
> log_destination                stderr
>
> log_line_prefix %t [%p]: [%l-1] [%h]
>
> log_min_duration_statement    -1
>
> log_min_error_statement           error
>
> log_min_messages         error
>
> log_timezone    UTC
>
> maintenance_work_mem           1GB
>
> max_connections            1200
>
> max_standby_streaming_delay                90s
>
> max_wal_senders           5
>
> port       5432
>
> random_page_cost        2
>
> seq_page_cost 1
>
> shared_buffers                4GB
>
> ssl           off
>
> ssl_cert_file       /etc/ssl/certs/ssl-cert-snakeoil.pem
>
> ssl_key_file        /etc/ssl/private/ssl-cert-snakeoil.key
>
> synchronous_commit    off
>
> TimeZone            UTC
>
> wal_keep_segments     128
>
> wal_level             hot_standby
>
> work_mem        8MB
>
>
>
> root@ip-10-148-131-236:~# /usr/local/pgsql/bin/pg_controldata
> /usr/local/pgsql/data
>
> pg_control version number:            922
>
> Catalog version number:               201204301
>
> Database system identifier:           5838668587531239413
>
> Database cluster state:               in archive recovery
>
> pg_control last modified:             Sat 22 Jun 2013 06:13:07 PM UTC
>
> Latest checkpoint location:           2250/18CA0790
>
> Prior checkpoint location:            2250/18CA0790
>
> Latest checkpoint's REDO location:    224F/E127B078
>
> Latest checkpoint's TimeLineID:       2
>
> Latest checkpoint's full_page_writes: on
>
> Latest checkpoint's NextXID:          1/2018629527
>
> Latest checkpoint's NextOID:          43086248
>
> Latest checkpoint's NextMultiXactId:  7088726
>
> Latest checkpoint's NextMultiOffset:  20617234
>
> Latest checkpoint's oldestXID:        1690316999
>
> Latest checkpoint's oldestXID's DB:   16448
>
> Latest checkpoint's oldestActiveXID:  2018629527
>
> Time of latest checkpoint:            Sat 22 Jun 2013 03:24:05 PM UTC
>
> Minimum recovery ending location:     2251/5EA631F0
>
> Backup start location:                0/0
>
> Backup end location:                  0/0
>
> End-of-backup record required:        no
>
> Current wal_level setting:            hot_standby
>
> Current max_connections setting:      1200
>
> Current max_prepared_xacts setting:   0
>
> Current max_locks_per_xact setting:   64
>
> Maximum data alignment:               8
>
> Database block size:                  8192
>
> Blocks per segment of large relation: 131072
>
> WAL block size:                       8192
>
> Bytes per WAL segment:                16777216
>
> Maximum length of identifiers:        64
>
> Maximum columns in an index:          32
>
> Maximum size of a TOAST chunk:        1996
>
> Date/time type storage:               64-bit integers
>
> Float4 argument passing:              by value
>
> Float8 argument passing:              by value
>
> root@ip-10-148-131-236:~#
>
>
>
> Thanks again.
>
>
>
> Dan



--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman                                    netllama@gmail.com
LlamaLand                       https://netllama.linux-sxs.org

pgsql-general by date:

Previous
From: Dan Kogan
Date:
Subject: Re: Standby stopped working after PANIC: WAL contains references to invalid pages
Next
From: Lonni J Friedman
Date:
Subject: Re: Standby stopped working after PANIC: WAL contains references to invalid pages