Thread: Standby stopped working after PANIC: WAL contains references to invalid pages
Hello,
Today our standby instance stopped working with this error in the log:
2013-06-22 16:27:32 UTC [8367]: [247-1] [] WARNING: page 158130 of relation pg_tblspc/16447/PG_9.2_201204301/16448/39154429 is uninitialized
2013-06-22 16:27:32 UTC [8367]: [248-1] [] CONTEXT: xlog redo vacuum: rel 16447/16448/39154429; blk 158134, lastBlockVacuumed 158129
2013-06-22 16:27:32 UTC [8367]: [249-1] [] PANIC: WAL contains references to invalid pages
2013-06-22 16:27:32 UTC [8367]: [250-1] [] CONTEXT: xlog redo vacuum: rel 16447/16448/39154429; blk 158134, lastBlockVacuumed 158129
2013-06-22 16:27:32 UTC [8366]: [3-1] [] LOG: startup process (PID 8367) was terminated by signal 6: Aborted
2013-06-22 16:27:32 UTC [8366]: [4-1] [] LOG: terminating any other active server processes
After a restart, the exact same error occurred.
We thought we might have hit this bug: http://postgresql.1045698.n5.nabble.com/Completely-broken-replica-after-PANIC-WAL-contains-references-to-invalid-pages-td5750072.html.
However, there was nothing in our log about subtransactions, so it didn't seem to be the same issue.
Any advice on how to further debug this so we can avoid this in the future is appreciated.
Environment:
AWS, High I/O instance (hi1.4xlarge), 60GB RAM
Software and settings:
PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2, 64-bit
archive_command rsync -a %p slave:/var/lib/postgresql/replication_load/%f
archive_mode on
autovacuum_freeze_max_age 1000000000
autovacuum_max_workers 6
checkpoint_completion_target 0.9
checkpoint_segments 128
checkpoint_timeout 30min
default_text_search_config pg_catalog.english
hot_standby on
lc_messages en_US.UTF-8
lc_monetary en_US.UTF-8
lc_numeric en_US.UTF-8
lc_time en_US.UTF-8
listen_addresses *
log_checkpoints on
log_destination stderr
log_line_prefix %t [%p]: [%l-1] [%h]
log_min_duration_statement -1
log_min_error_statement error
log_min_messages error
log_timezone UTC
maintenance_work_mem 1GB
max_connections 1200
max_standby_streaming_delay 90s
max_wal_senders 5
port 5432
random_page_cost 2
seq_page_cost 1
shared_buffers 4GB
ssl off
ssl_cert_file /etc/ssl/certs/ssl-cert-snakeoil.pem
ssl_key_file /etc/ssl/private/ssl-cert-snakeoil.key
synchronous_commit off
TimeZone UTC
wal_keep_segments 128
wal_level hot_standby
work_mem 8MB
root@ip-10-148-131-236:~# /usr/local/pgsql/bin/pg_controldata /usr/local/pgsql/data
pg_control version number: 922
Catalog version number: 201204301
Database system identifier: 5838668587531239413
Database cluster state: in archive recovery
pg_control last modified: Sat 22 Jun 2013 06:13:07 PM UTC
Latest checkpoint location: 2250/18CA0790
Prior checkpoint location: 2250/18CA0790
Latest checkpoint's REDO location: 224F/E127B078
Latest checkpoint's TimeLineID: 2
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 1/2018629527
Latest checkpoint's NextOID: 43086248
Latest checkpoint's NextMultiXactId: 7088726
Latest checkpoint's NextMultiOffset: 20617234
Latest checkpoint's oldestXID: 1690316999
Latest checkpoint's oldestXID's DB: 16448
Latest checkpoint's oldestActiveXID: 2018629527
Time of latest checkpoint: Sat 22 Jun 2013 03:24:05 PM UTC
Minimum recovery ending location: 2251/5EA631F0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: hot_standby
Current max_connections setting: 1200
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
root@ip-10-148-131-236:~#
Thanks again.
Dan
Re: Standby stopped working after PANIC: WAL contains references to invalid pages
Looks like some kind of data corruption. The question is whether it came from the master, or was created by the standby. If you re-seed the standby with a full (base) backup, does the problem go away?

--
L. Friedman    netllama@gmail.com
LlamaLand      https://netllama.linux-sxs.org
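Re-seeding a 9.2 standby might look roughly like the following sketch. The hostname, replication role, and backup path are placeholders (this assumes a replication role already exists on the master); the restore_command path matches the replication_load directory from the archive_command above.

```shell
# On the standby: stop the server and move the broken data directory aside,
# then take a fresh base backup from the master.
# -x includes the WAL needed to make the backup consistent on its own.
pg_basebackup -h master.example.com -U repluser -D /usr/local/pgsql/data -x -P

# Minimal recovery.conf on the standby (values are placeholders):
cat > /usr/local/pgsql/data/recovery.conf <<'EOF'
standby_mode = 'on'
primary_conninfo = 'host=master.example.com user=repluser'
restore_command = 'cp /var/lib/postgresql/replication_load/%f %p'
EOF
```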
Re: Standby stopped working after PANIC: WAL contains references to invalid pages
I am in the process of doing that now. I'll reply again with results once that is done.
Re: Standby stopped working after PANIC: WAL contains references to invalid pages
Re-seeding the standby with a full base backup does seem to make the error go away. The standby started, caught up, and has been working for about 2 hours.

The file in the error message was an index. We rebuilt it just in case. Is there any way to debug the issue at this point?
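For reference, the relfilenode in the error's path (pg_tblspc/16447/PG_9.2_201204301/16448/39154429: tablespace OID / catalog version / database OID / relfilenode) can be mapped back to a relation like this. The database and index names below are placeholders:

```shell
# Find the relation whose on-disk file is 39154429; connect to the
# database whose OID is 16448 (the fourth path component).
psql -d your_db -c "
  SELECT n.nspname, c.relname, c.relkind
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
  WHERE c.relfilenode = 39154429;"

# relkind = 'i' means it is an index; rebuild it on the master:
psql -d your_db -c 'REINDEX INDEX your_index_name;'
```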
Re: Standby stopped working after PANIC: WAL contains references to invalid pages
Assuming that you still have $PGDATA from the broken instance (such that you can reproduce the crash again), there might be a way to debug it further. I'd guess that something like bad RAM or storage could cause an index to get corrupted in this fashion, but the fact that you're using AWS makes that less likely. Someone far more knowledgeable than I will need to provide guidance on how to debug this, though.
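With a saved copy of the broken $PGDATA, one low-risk starting point is to identify the affected relation and inspect the cluster state without starting the server. The database name and saved-copy path below are placeholders:

```shell
# Map the filenode from the error message to a relation name
# (oid2name ships in PostgreSQL's contrib):
oid2name -d your_db -f 39154429

# Read the saved data directory's control file (safe, read-only):
/usr/local/pgsql/bin/pg_controldata /path/to/saved_pgdata
```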
Re: Standby stopped working after PANIC: WAL contains references to invalid pages
We have backed up $PGDATA, but had to re-initialize the slave. We also have the WALs from the day this happened.

Thanks,
Dan