BUG #6725: hot-standby slave repeatable crashed after restart (GIN error) - Mailing list pgsql-bugs

From maxim.boguk@gmail.com
Subject BUG #6725: hot-standby slave repeatable crashed after restart (GIN error)
Date
Msg-id E1SoNn4-00086s-Ie@wrigleys.postgresql.org
Whole thread Raw
Responses Re: BUG #6725: hot-standby slave repeatable crashed after restart (GIN error)  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      6725
Logged by:          Maxim Boguk
Email address:      maxim.boguk@gmail.com
PostgreSQL version: 9.0.7
Operating system:   Debian Linux
Description:=20=20=20=20=20=20=20=20

Hi,

On one of my production databases hot-standby server start repeatable crash
after restart.

There is log:

2012-07-10 03:30:20 MSK 12034 postgres@postgres from [local] [vxid:2/0
txid:0] [CHECKPOINT] LOG:  duration: 100.262 ms  statement: checkpoint;
2012-07-10 03:30:24 MSK 12034 postgres@postgres from [local] [vxid:2/0
txid:0] [CHECKPOINT] LOG:  duration: 100.207 ms  statement: checkpoint;
2012-07-10 03:30:41 MSK 24618 @ from  [vxid: txid:0] [] LOG:  received fast
shutdown request
2012-07-10 03:30:41 MSK 24618 @ from  [vxid: txid:0] [] LOG:  aborting any
active transactions
2012-07-10 03:30:41 MSK 25312 @ from  [vxid: txid:0] [] FATAL:  terminating
walreceiver process due to administrator command
2012-07-10 03:30:41 MSK 24646 @ from  [vxid: txid:0] [] LOG:  shutting down
2012-07-10 03:30:41 MSK 24646 @ from  [vxid: txid:0] [] LOG:  database
system is shut down

2012-07-10 03:30:45 MSK 12202 @ from  [vxid: txid:0] [] LOG:  database
system was shut down in recovery at 2012-07-10 03:30:41 MSK
2012-07-10 03:30:45 MSK 12202 @ from  [vxid: txid:0] [] LOG:  entering
standby mode
2012-07-10 03:30:45 MSK 12202 @ from  [vxid:1/0 txid:0] [] LOG:  redo starts
at 127C/51EF6798
2012-07-10 03:31:35 MSK 12202 @ from  [vxid:1/0 txid:0] [] PANIC:  GIN
metapage disappeared
2012-07-10 03:31:35 MSK 12202 @ from  [vxid:1/0 txid:0] [] CONTEXT:  xlog
redo Update metapage, node: 1663/62450/27663577 blkno: 2164
2012-07-10 03:31:35 MSK 12189 @ from  [vxid: txid:0] [] LOG:  startup
process (PID 12202) was terminated by signal 6: Aborted
2012-07-10 03:31:35 MSK 12189 @ from  [vxid: txid:0] [] LOG:  terminating
any other active server processes

2012-07-10 03:33:37 MSK 13071 @ from  [vxid: txid:0] [] LOG:  database
system was interrupted while in recovery at log time 2012-07-10 02:17:21
MSK
2012-07-10 03:33:37 MSK 13071 @ from  [vxid: txid:0] [] HINT:  If this has
occurred more than once some data might be corrupted and you might need to
choose an earlier recovery target.
2012-07-10 03:33:37 MSK 13071 @ from  [vxid: txid:0] [] LOG:  entering
standby mode
2012-07-10 03:33:37 MSK 13071 @ from  [vxid:1/0 txid:0] [] LOG:  redo starts
at 127C/51EF6798
2012-07-10 03:34:15 MSK 13071 @ from  [vxid:1/0 txid:0] [] PANIC:  GIN
metapage disappeared
2012-07-10 03:34:15 MSK 13071 @ from  [vxid:1/0 txid:0] [] CONTEXT:  xlog
redo Update metapage, node: 1663/62450/27663577 blkno: 2164
2012-07-10 03:34:15 MSK 13065 @ from  [vxid: txid:0] [] LOG:  startup
process (PID 13071) was terminated by signal 6: Aborted
2012-07-10 03:34:15 MSK 13065 @ from  [vxid: txid:0] [] LOG:  terminating
any other active server processes

and so on.

I almost sure there are no disk corruption or such errors.

Now some additional information from the master server:
1)1663/62450/27663577  - corresponding to the pg_default tablespace and
wsfilm database. However, no entries with relfilenode=3D27663577 were found=
 in
pg_class.
In the same time:=20=20
ls -la ./9.0/main/base/62450/27663577
-rw------- 1 postgres postgres 0 =D0=98=D1=8E=D0=BB 10 02:29
./9.0/main/base/62450/27663577

In wsfilm db only one GIN index exists:
wsfilm=3D# SELECT * from pg_indexes where indexdef like '%gin%';
-[ RECORD 1 ]---------------------------------------------------
schemaname | public
tablename  | movie
indexname  | movie_fti_idx
tablespace |
indexdef   | CREATE INDEX movie_fti_idx ON movie USING gin (fti)

But it has different relfilenode:
wsfilm=3D# SELECT relfilenode from pg_class where relname=3D'movie_fti_idx';
-[ RECORD 1 ]---------
relfilenode | 27677714


2)Just few minutes before that index was rebuild on the master database:

2012-07-10 02:29:50 MSK 1480 postgres@wsfilm from [local] [vxid:33/3324338
txid:0] [idle] LOG:  statement: CREATE INDEX CONCURRENTLY
pgcompactor_tmp1436 ON movie USING gin (fti);
2012-07-10 02:29:52 MSK 1480 postgres@wsfilm from [local] [vxid:33/3324343
txid:0] [idle] LOG:  statement: BEGIN; DROP INDEX public.movie_fti_idx;
ALTER INDEX public.pgcompactor_tmp1436 RENAME TO movie_fti_idx; END;


My theory that the wal replay trying replay GIN index information on already
dropped index.

pgsql-bugs by date:

Previous
From: Dave Page
Date:
Subject: Re: BUG #6722: Debugger broken?
Next
From: Tom Lane
Date:
Subject: Re: BUG #6725: hot-standby slave repeatable crashed after restart (GIN error)