Thread: BUG #6725: hot-standby slave repeatable crashed after restart (GIN error)

BUG #6725: hot-standby slave repeatable crashed after restart (GIN error)

From
maxim.boguk@gmail.com
Date:
The following bug has been logged on the website:

Bug reference:      6725
Logged by:          Maxim Boguk
Email address:      maxim.boguk@gmail.com
PostgreSQL version: 9.0.7
Operating system:   Debian Linux
Description:=20=20=20=20=20=20=20=20

Hi,

On one of my production databases hot-standby server start repeatable crash
after restart.

There is log:

2012-07-10 03:30:20 MSK 12034 postgres@postgres from [local] [vxid:2/0
txid:0] [CHECKPOINT] LOG:  duration: 100.262 ms  statement: checkpoint;
2012-07-10 03:30:24 MSK 12034 postgres@postgres from [local] [vxid:2/0
txid:0] [CHECKPOINT] LOG:  duration: 100.207 ms  statement: checkpoint;
2012-07-10 03:30:41 MSK 24618 @ from  [vxid: txid:0] [] LOG:  received fast
shutdown request
2012-07-10 03:30:41 MSK 24618 @ from  [vxid: txid:0] [] LOG:  aborting any
active transactions
2012-07-10 03:30:41 MSK 25312 @ from  [vxid: txid:0] [] FATAL:  terminating
walreceiver process due to administrator command
2012-07-10 03:30:41 MSK 24646 @ from  [vxid: txid:0] [] LOG:  shutting down
2012-07-10 03:30:41 MSK 24646 @ from  [vxid: txid:0] [] LOG:  database
system is shut down

2012-07-10 03:30:45 MSK 12202 @ from  [vxid: txid:0] [] LOG:  database
system was shut down in recovery at 2012-07-10 03:30:41 MSK
2012-07-10 03:30:45 MSK 12202 @ from  [vxid: txid:0] [] LOG:  entering
standby mode
2012-07-10 03:30:45 MSK 12202 @ from  [vxid:1/0 txid:0] [] LOG:  redo starts
at 127C/51EF6798
2012-07-10 03:31:35 MSK 12202 @ from  [vxid:1/0 txid:0] [] PANIC:  GIN
metapage disappeared
2012-07-10 03:31:35 MSK 12202 @ from  [vxid:1/0 txid:0] [] CONTEXT:  xlog
redo Update metapage, node: 1663/62450/27663577 blkno: 2164
2012-07-10 03:31:35 MSK 12189 @ from  [vxid: txid:0] [] LOG:  startup
process (PID 12202) was terminated by signal 6: Aborted
2012-07-10 03:31:35 MSK 12189 @ from  [vxid: txid:0] [] LOG:  terminating
any other active server processes

2012-07-10 03:33:37 MSK 13071 @ from  [vxid: txid:0] [] LOG:  database
system was interrupted while in recovery at log time 2012-07-10 02:17:21
MSK
2012-07-10 03:33:37 MSK 13071 @ from  [vxid: txid:0] [] HINT:  If this has
occurred more than once some data might be corrupted and you might need to
choose an earlier recovery target.
2012-07-10 03:33:37 MSK 13071 @ from  [vxid: txid:0] [] LOG:  entering
standby mode
2012-07-10 03:33:37 MSK 13071 @ from  [vxid:1/0 txid:0] [] LOG:  redo starts
at 127C/51EF6798
2012-07-10 03:34:15 MSK 13071 @ from  [vxid:1/0 txid:0] [] PANIC:  GIN
metapage disappeared
2012-07-10 03:34:15 MSK 13071 @ from  [vxid:1/0 txid:0] [] CONTEXT:  xlog
redo Update metapage, node: 1663/62450/27663577 blkno: 2164
2012-07-10 03:34:15 MSK 13065 @ from  [vxid: txid:0] [] LOG:  startup
process (PID 13071) was terminated by signal 6: Aborted
2012-07-10 03:34:15 MSK 13065 @ from  [vxid: txid:0] [] LOG:  terminating
any other active server processes

and so on.

I almost sure there are no disk corruption or such errors.

Now some additional information from the master server:
1)1663/62450/27663577  - corresponding to the pg_default tablespace and
wsfilm database. However, no entries with relfilenode=3D27663577 were found=
 in
pg_class.
In the same time:=20=20
ls -la ./9.0/main/base/62450/27663577
-rw------- 1 postgres postgres 0 =D0=98=D1=8E=D0=BB 10 02:29
./9.0/main/base/62450/27663577

In wsfilm db only one GIN index exists:
wsfilm=3D# SELECT * from pg_indexes where indexdef like '%gin%';
-[ RECORD 1 ]---------------------------------------------------
schemaname | public
tablename  | movie
indexname  | movie_fti_idx
tablespace |
indexdef   | CREATE INDEX movie_fti_idx ON movie USING gin (fti)

But it has different relfilenode:
wsfilm=3D# SELECT relfilenode from pg_class where relname=3D'movie_fti_idx';
-[ RECORD 1 ]---------
relfilenode | 27677714


2)Just few minutes before that index was rebuild on the master database:

2012-07-10 02:29:50 MSK 1480 postgres@wsfilm from [local] [vxid:33/3324338
txid:0] [idle] LOG:  statement: CREATE INDEX CONCURRENTLY
pgcompactor_tmp1436 ON movie USING gin (fti);
2012-07-10 02:29:52 MSK 1480 postgres@wsfilm from [local] [vxid:33/3324343
txid:0] [idle] LOG:  statement: BEGIN; DROP INDEX public.movie_fti_idx;
ALTER INDEX public.pgcompactor_tmp1436 RENAME TO movie_fti_idx; END;


My theory that the wal replay trying replay GIN index information on already
dropped index.
maxim.boguk@gmail.com writes:
> On one of my production databases hot-standby server start repeatable crash
> after restart.

> 2012-07-10 03:31:35 MSK 12202 @ from  [vxid:1/0 txid:0] [] PANIC:  GIN
> metapage disappeared

This is fixed in current minor releases.

http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=268ca4f57ea2dc3bf0dfb0f5c17fbda269b5d462

            regards, tom lane
>
> > 2012-07-10 03:31:35 MSK 12202 @ from  [vxid:1/0 txid:0] [] PANIC:  GIN
> > metapage disappeared
>
> This is fixed in current minor releases.
>
>
> http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=268ca4f57ea2dc3bf0dfb0f5c17fbda269b5d462
>
>                         regards, tom lane
>

Ooops sorry, I should check a release notes from the atest version before
posting new bugs.

Thank you.

Kind Regards,
Maksym