Thread: Standby stopped working after PANIC: WAL contains references to invalid pages

From: Dan Kogan
Date: Saturday, June 22, 2013 12:43 PM

Hello,

Today our standby instance stopped working with this error in the log:

2013-06-22 16:27:32 UTC [8367]: [247-1] [] WARNING:  page 158130 of relation pg_tblspc/16447/PG_9.2_201204301/16448/39154429 is uninitialized
2013-06-22 16:27:32 UTC [8367]: [248-1] [] CONTEXT:  xlog redo vacuum: rel 16447/16448/39154429; blk 158134, lastBlockVacuumed 158129
2013-06-22 16:27:32 UTC [8367]: [249-1] [] PANIC:  WAL contains references to invalid pages
2013-06-22 16:27:32 UTC [8367]: [250-1] [] CONTEXT:  xlog redo vacuum: rel 16447/16448/39154429; blk 158134, lastBlockVacuumed 158129
2013-06-22 16:27:32 UTC [8366]: [3-1] [] LOG:  startup process (PID 8367) was terminated by signal 6: Aborted
2013-06-22 16:27:32 UTC [8366]: [4-1] [] LOG:  terminating any other active server processes

After a restart, the exact same error occurred.

We thought we might have hit this bug:
http://postgresql.1045698.n5.nabble.com/Completely-broken-replica-after-PANIC-WAL-contains-references-to-invalid-pages-td5750072.html
However, there is nothing in our log about sub-transactions, so it did not seem to be the same issue.

Any advice on how to debug this further, so that we can avoid it in the future, would be appreciated.

Environment:

AWS, High I/O instance (hi1.4xlarge), 60GB RAM

Software and settings:

PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2, 64-bit

archive_command               rsync -a %p slave:/var/lib/postgresql/replication_load/%f
archive_mode                  on
autovacuum_freeze_max_age     1000000000
autovacuum_max_workers        6
checkpoint_completion_target  0.9
checkpoint_segments           128
checkpoint_timeout            30min
default_text_search_config    pg_catalog.english
hot_standby                   on
lc_messages                   en_US.UTF-8
lc_monetary                   en_US.UTF-8
lc_numeric                    en_US.UTF-8
lc_time                       en_US.UTF-8
listen_addresses              *
log_checkpoints               on
log_destination               stderr
log_line_prefix               %t [%p]: [%l-1] [%h]
log_min_duration_statement    -1
log_min_error_statement       error
log_min_messages              error
log_timezone                  UTC
maintenance_work_mem          1GB
max_connections               1200
max_standby_streaming_delay   90s
max_wal_senders               5
port                          5432
random_page_cost              2
seq_page_cost                 1
shared_buffers                4GB
ssl                           off
ssl_cert_file                 /etc/ssl/certs/ssl-cert-snakeoil.pem
ssl_key_file                  /etc/ssl/private/ssl-cert-snakeoil.key
synchronous_commit            off
TimeZone                      UTC
wal_keep_segments             128
wal_level                     hot_standby
work_mem                      8MB

root@ip-10-148-131-236:~# /usr/local/pgsql/bin/pg_controldata /usr/local/pgsql/data
pg_control version number:            922
Catalog version number:               201204301
Database system identifier:           5838668587531239413
Database cluster state:               in archive recovery
pg_control last modified:             Sat 22 Jun 2013 06:13:07 PM UTC
Latest checkpoint location:           2250/18CA0790
Prior checkpoint location:            2250/18CA0790
Latest checkpoint's REDO location:    224F/E127B078
Latest checkpoint's TimeLineID:       2
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          1/2018629527
Latest checkpoint's NextOID:          43086248
Latest checkpoint's NextMultiXactId:  7088726
Latest checkpoint's NextMultiOffset:  20617234
Latest checkpoint's oldestXID:        1690316999
Latest checkpoint's oldestXID's DB:   16448
Latest checkpoint's oldestActiveXID:  2018629527
Time of latest checkpoint:            Sat 22 Jun 2013 03:24:05 PM UTC
Minimum recovery ending location:     2251/5EA631F0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
Current wal_level setting:            hot_standby
Current max_connections setting:      1200
Current max_prepared_xacts setting:   0
Current max_locks_per_xact setting:   64
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
root@ip-10-148-131-236:~#

Thanks again.

Dan

Re: Standby stopped working after PANIC: WAL contains references to invalid pages

From: Lonni J Friedman
Date: Saturday, June 22, 2013 4:11 PM
Looks like some kind of data corruption.  Question is whether it came
from the master, or was created by the standby.  If you re-seed the
standby with a full (base) backup, does the problem go away?
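
For reference, a minimal re-seed sketch for 9.2 (assuming a replication
role and matching pg_hba.conf entry already exist; the master host name
below is a placeholder, not from this thread):

    # On the standby: stop it, move the old data directory aside, and take
    # a fresh base backup from the master (9.2 can stream WAL with -X stream).
    pg_ctl -D /usr/local/pgsql/data stop -m fast
    mv /usr/local/pgsql/data /usr/local/pgsql/data.broken
    pg_basebackup -h <master-host> -U replication -D /usr/local/pgsql/data -X stream -P
    # Recreate recovery.conf (primary_conninfo etc.), then start the standby.
    pg_ctl -D /usr/local/pgsql/data start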

On Sat, Jun 22, 2013 at 12:43 PM, Dan Kogan <dan@iqtell.com> wrote:
> [snip]

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman                                    netllama@gmail.com
LlamaLand                       https://netllama.linux-sxs.org


I am in the process of doing that now.  I'll reply again with results once that is done.


-----Original Message-----
From: Lonni J Friedman [mailto:netllama@gmail.com] 
Sent: Saturday, June 22, 2013 4:11 PM
To: Dan Kogan
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Standby stopped working after PANIC: WAL contains references to invalid pages

Looks like some kind of data corruption.  Question is whether it came from the master, or was created by the standby.
If you re-seed the standby with a full (base) backup, does the problem go away?

On Sat, Jun 22, 2013 at 12:43 PM, Dan Kogan <dan@iqtell.com> wrote:
> [snip]

Re-seeding the standby with a full base backup does seem to make the error go away.
The standby started, caught up and has been working for about 2 hours.

The file in the error message was an index.  We rebuilt it just in case.
Is there any way to debug the issue at this point?
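
For reference, the relfilenode in the page path from the WARNING can be
mapped back to a relation name with something like this (run it in the
database whose OID is 16448; <dbname> is a placeholder):

    # 39154429 is the last component of the path in the WARNING message.
    psql -d <dbname> -c \
      "SELECT relname, relkind FROM pg_class WHERE relfilenode = 39154429;"
    # relkind 'i' means an index, which can be rebuilt with:
    #   REINDEX INDEX <index_name>;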



-----Original Message-----
From: Lonni J Friedman [mailto:netllama@gmail.com]
Sent: Saturday, June 22, 2013 4:11 PM
To: Dan Kogan
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Standby stopped working after PANIC: WAL contains references to invalid pages

Looks like some kind of data corruption.  Question is whether it came from the master, or was created by the standby.
If you re-seed the standby with a full (base) backup, does the problem go away?

On Sat, Jun 22, 2013 at 12:43 PM, Dan Kogan <dan@iqtell.com> wrote:
> [snip]

Re: Standby stopped working after PANIC: WAL contains references to invalid pages

From: Lonni J Friedman
Date: Saturday, June 22, 2013 10:09 PM
Assuming that you still have $PGDATA from the broken instance (such
that you can reproduce the crash again), there might be a way to debug
it further.  I'd guess that something like bad RAM or storage could
cause an index to get corrupted in this fashion, but the fact that
you're using AWS makes that less likely.  Someone far more
knowledgeable than I will need to provide guidance on how to debug
this though.
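
One low-level check that might shed light, assuming the broken $PGDATA was
preserved: pull the suspect block out of the relation file and see whether
it really is all zeros.  A rough sketch, using the path from the log above:

    # With 131072 blocks per 1GB segment (per pg_controldata), block 158130
    # lives in segment file ".1", at block 158130 - 131072 = 27058.
    cd <saved-pgdata>/pg_tblspc/16447/PG_9.2_201204301/16448
    dd if=39154429.1 bs=8192 skip=27058 count=1 2>/dev/null | od -A d -t x1 | head
    # A truly uninitialized page shows nothing but zero bytes.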

On Sat, Jun 22, 2013 at 4:17 PM, Dan Kogan <dan@iqtell.com> wrote:
> [snip]

We have backed up $PGDATA, but had to re-initialize the slave.
We also have the WALs from the day this happened.
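
In case it is useful, replaying those WALs against a copy of the backed-up
$PGDATA should re-trigger the PANIC for debugging.  A rough sketch (the
source path and port are placeholders; the restore path is where our
archive_command ships WAL, and tablespace symlinks under pg_tblspc would
also need to point at copies):

    # Start a throwaway instance on a copy of the saved data directory and
    # let it replay the archived WAL from that day (9.2 recovery.conf syntax).
    cp -a /path/to/saved_pgdata /tmp/repro_data
    echo "restore_command = 'cp /var/lib/postgresql/replication_load/%f %p'" \
        > /tmp/repro_data/recovery.conf
    /usr/local/pgsql/bin/pg_ctl -D /tmp/repro_data -o "-p 5433" start
    # Then watch the server log for the same PANIC during redo.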

Thanks,
Dan

-----Original Message-----
From: Lonni J Friedman [mailto:netllama@gmail.com] 
Sent: Saturday, June 22, 2013 10:09 PM
To: Dan Kogan
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Standby stopped working after PANIC: WAL contains references to invalid pages

Assuming that you still have $PGDATA from the broken instance (such that you can reproduce the crash again), there
might be a way to debug it further.  I'd guess that something like bad RAM or storage could cause an index to get
corrupted in this fashion, but the fact that you're using AWS makes that less likely.  Someone far more knowledgeable
than I will need to provide guidance on how to debug this though.

On Sat, Jun 22, 2013 at 4:17 PM, Dan Kogan <dan@iqtell.com> wrote:
> [snip]