Thread: Standby stopped working after PANIC: WAL contains references to invalid pages

From: Dan Kogan
Date: Saturday, June 22, 2013 12:43 PM

Hello,

Today our standby instance stopped working with this error in the log:

2013-06-22 16:27:32 UTC [8367]: [247-1] [] WARNING:  page 158130 of relation pg_tblspc/16447/PG_9.2_201204301/16448/39154429 is uninitialized
2013-06-22 16:27:32 UTC [8367]: [248-1] [] CONTEXT:  xlog redo vacuum: rel 16447/16448/39154429; blk 158134, lastBlockVacuumed 158129
2013-06-22 16:27:32 UTC [8367]: [249-1] [] PANIC:  WAL contains references to invalid pages
2013-06-22 16:27:32 UTC [8367]: [250-1] [] CONTEXT:  xlog redo vacuum: rel 16447/16448/39154429; blk 158134, lastBlockVacuumed 158129
2013-06-22 16:27:32 UTC [8366]: [3-1] [] LOG:  startup process (PID 8367) was terminated by signal 6: Aborted
2013-06-22 16:27:32 UTC [8366]: [4-1] [] LOG:  terminating any other active server processes

After a restart, the exact same error occurred.

We thought we might have hit this bug:
http://postgresql.1045698.n5.nabble.com/Completely-broken-replica-after-PANIC-WAL-contains-references-to-invalid-pages-td5750072.html
However, there is nothing in our log about sub-transactions, so it did not seem to be the same issue.

Any advice on how to debug this further, so that we can avoid it in the future, would be appreciated.

Environment:

AWS, High I/O instance (hi1.4xlarge), 60GB RAM

Software and settings:

PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2, 64-bit

archive_command               rsync -a %p slave:/var/lib/postgresql/replication_load/%f
archive_mode                  on
autovacuum_freeze_max_age     1000000000
autovacuum_max_workers        6
checkpoint_completion_target  0.9
checkpoint_segments           128
checkpoint_timeout            30min
default_text_search_config    pg_catalog.english
hot_standby                   on
lc_messages                   en_US.UTF-8
lc_monetary                   en_US.UTF-8
lc_numeric                    en_US.UTF-8
lc_time                       en_US.UTF-8
listen_addresses              *
log_checkpoints               on
log_destination               stderr
log_line_prefix               %t [%p]: [%l-1] [%h]
log_min_duration_statement    -1
log_min_error_statement       error
log_min_messages              error
log_timezone                  UTC
maintenance_work_mem          1GB
max_connections               1200
max_standby_streaming_delay   90s
max_wal_senders               5
port                          5432
random_page_cost              2
seq_page_cost                 1
shared_buffers                4GB
ssl                           off
ssl_cert_file                 /etc/ssl/certs/ssl-cert-snakeoil.pem
ssl_key_file                  /etc/ssl/private/ssl-cert-snakeoil.key
synchronous_commit            off
TimeZone                      UTC
wal_keep_segments             128
wal_level                     hot_standby
work_mem                      8MB

root@ip-10-148-131-236:~# /usr/local/pgsql/bin/pg_controldata /usr/local/pgsql/data
pg_control version number:            922
Catalog version number:               201204301
Database system identifier:           5838668587531239413
Database cluster state:               in archive recovery
pg_control last modified:             Sat 22 Jun 2013 06:13:07 PM UTC
Latest checkpoint location:           2250/18CA0790
Prior checkpoint location:            2250/18CA0790
Latest checkpoint's REDO location:    224F/E127B078
Latest checkpoint's TimeLineID:       2
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          1/2018629527
Latest checkpoint's NextOID:          43086248
Latest checkpoint's NextMultiXactId:  7088726
Latest checkpoint's NextMultiOffset:  20617234
Latest checkpoint's oldestXID:        1690316999
Latest checkpoint's oldestXID's DB:   16448
Latest checkpoint's oldestActiveXID:  2018629527
Time of latest checkpoint:            Sat 22 Jun 2013 03:24:05 PM UTC
Minimum recovery ending location:     2251/5EA631F0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
Current wal_level setting:            hot_standby
Current max_connections setting:      1200
Current max_prepared_xacts setting:   0
Current max_locks_per_xact setting:   64
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
root@ip-10-148-131-236:~#

Thanks again.

Dan

Re: Standby stopped working after PANIC: WAL contains references to invalid pages

From: Lonni J Friedman
Date: Saturday, June 22, 2013 4:11 PM
Looks like some kind of data corruption.  Question is whether it came
from the master, or was created by the standby.  If you re-seed the
standby with a full (base) backup, does the problem go away?
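
For reference, a minimal re-seed sketch for 9.2 (assuming a replication
role and matching pg_hba.conf entry already exist; the master host name
below is a placeholder, not from this thread):

    # On the standby: stop it, move the old data directory aside, and take
    # a fresh base backup from the master (9.2 can stream WAL with -X stream).
    pg_ctl -D /usr/local/pgsql/data stop -m fast
    mv /usr/local/pgsql/data /usr/local/pgsql/data.broken
    pg_basebackup -h <master-host> -U replication -D /usr/local/pgsql/data -X stream -P
    # Recreate recovery.conf (primary_conninfo etc.), then start the standby.
    pg_ctl -D /usr/local/pgsql/data start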

On Sat, Jun 22, 2013 at 12:43 PM, Dan Kogan <dan@iqtell.com> wrote:
> [snip]

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman                                    netllama@gmail.com
LlamaLand                       https://netllama.linux-sxs.org


I am in the process of doing that now.  I'll reply again with results once that is done.


-----Original Message-----
From: Lonni J Friedman [mailto:netllama@gmail.com] 
Sent: Saturday, June 22, 2013 4:11 PM
To: Dan Kogan
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Standby stopped working after PANIC: WAL contains references to invalid pages

Looks like some kind of data corruption.  Question is whether it came from the master, or was created by the standby.
If you re-seed the standby with a full (base) backup, does the problem go away?

On Sat, Jun 22, 2013 at 12:43 PM, Dan Kogan <dan@iqtell.com> wrote:
> [snip]

Re-seeding the standby with a full base backup does seem to make the error go away.
The standby started, caught up and has been working for about 2 hours.

The file in the error message was an index.  We rebuilt it just in case.
Is there any way to debug the issue at this point?
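
For reference, the relfilenode in the page path from the WARNING can be
mapped back to a relation name with something like this (run it in the
database whose OID is 16448; <dbname> is a placeholder):

    # 39154429 is the last component of the path in the WARNING message.
    psql -d <dbname> -c \
      "SELECT relname, relkind FROM pg_class WHERE relfilenode = 39154429;"
    # relkind 'i' means an index, which can be rebuilt with:
    #   REINDEX INDEX <index_name>;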



-----Original Message-----
From: Lonni J Friedman [mailto:netllama@gmail.com]
Sent: Saturday, June 22, 2013 4:11 PM
To: Dan Kogan
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Standby stopped working after PANIC: WAL contains references to invalid pages

Looks like some kind of data corruption.  Question is whether it came from the master, or was created by the standby.
If you re-seed the standby with a full (base) backup, does the problem go away?

On Sat, Jun 22, 2013 at 12:43 PM, Dan Kogan <dan@iqtell.com> wrote:
> [snip]

Re: Standby stopped working after PANIC: WAL contains references to invalid pages

From: Lonni J Friedman
Date: Saturday, June 22, 2013 10:09 PM
Assuming that you still have $PGDATA from the broken instance (such
that you can reproduce the crash again), there might be a way to debug
it further.  I'd guess that something like bad RAM or storage could
cause an index to get corrupted in this fashion, but the fact that
you're using AWS makes that less likely.  Someone far more
knowledgeable than I will need to provide guidance on how to debug
this though.
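
One low-level check that might shed light, assuming the broken $PGDATA was
preserved: pull the suspect block out of the relation file and see whether
it really is all zeros.  A rough sketch, using the path from the log above:

    # With 131072 blocks per 1GB segment (per pg_controldata), block 158130
    # lives in segment file ".1", at block 158130 - 131072 = 27058.
    cd <saved-pgdata>/pg_tblspc/16447/PG_9.2_201204301/16448
    dd if=39154429.1 bs=8192 skip=27058 count=1 2>/dev/null | od -A d -t x1 | head
    # A truly uninitialized page shows nothing but zero bytes.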

On Sat, Jun 22, 2013 at 4:17 PM, Dan Kogan <dan@iqtell.com> wrote:
> [snip]

We have backed up $PGDATA, but had to re-initialize the slave.
We also have the WALs from the day this happened.
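
In case it is useful, replaying those WALs against a copy of the backed-up
$PGDATA should re-trigger the PANIC for debugging.  A rough sketch (the
source path and port are placeholders; the restore path is where our
archive_command ships WAL, and tablespace symlinks under pg_tblspc would
also need to point at copies):

    # Start a throwaway instance on a copy of the saved data directory and
    # let it replay the archived WAL from that day (9.2 recovery.conf syntax).
    cp -a /path/to/saved_pgdata /tmp/repro_data
    echo "restore_command = 'cp /var/lib/postgresql/replication_load/%f %p'" \
        > /tmp/repro_data/recovery.conf
    /usr/local/pgsql/bin/pg_ctl -D /tmp/repro_data -o "-p 5433" start
    # Then watch the server log for the same PANIC during redo.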

Thanks,
Dan

-----Original Message-----
From: Lonni J Friedman [mailto:netllama@gmail.com] 
Sent: Saturday, June 22, 2013 10:09 PM
To: Dan Kogan
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Standby stopped working after PANIC: WAL contains references to invalid pages

Assuming that you still have $PGDATA from the broken instance (such that you can reproduce the crash again), there
might be a way to debug it further.  I'd guess that something like bad RAM or storage could cause an index to get
corrupted in this fashion, but the fact that you're using AWS makes that less likely.  Someone far more knowledgeable
than I will need to provide guidance on how to debug this though.

On Sat, Jun 22, 2013 at 4:17 PM, Dan Kogan <dan@iqtell.com> wrote:
> [snip]