Re: PANIC in GIN code - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: PANIC in GIN code
Date
Msg-id CAMkU=1zUZPhY+Dt6dy3YNqX8384RFk2Rj71bUm2_Nbz9wCG56w@mail.gmail.com
In response to Re: PANIC in GIN code  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: PANIC in GIN code
List pgsql-hackers
On Mon, Jun 29, 2015 at 1:37 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 06/29/2015 01:12 AM, Jeff Janes wrote:
Now I'm getting a different error, with or without checksums.

ERROR:  invalid page in block 0 of relation base/16384/16420
CONTEXT:  automatic vacuum of table "jjanes.public.foo"

16420 is the gin index.  I can't even get the page with pageinspect:

jjanes=# SELECT * FROM get_raw_page('foo_text_array_idx', 0);
ERROR:  invalid page in block 0 of relation base/16384/16420

These are the last few GIN entries from pg_xlogdump:


rmgr: Gin         len (rec/tot):      0/  3893, tx:          0, lsn: 0/77270E90, prev 0/77270E68, desc: VACUUM_PAGE , blkref #0: rel 1663/16384/16420 blk 27 FPW
rmgr: Gin         len (rec/tot):      0/  3013, tx:          0, lsn: 0/77272080, prev 0/77272058, desc: VACUUM_PAGE , blkref #0: rel 1663/16384/16420 blk 6904 FPW
rmgr: Gin         len (rec/tot):      0/  3093, tx:          0, lsn: 0/77272E08, prev 0/77272DE0, desc: VACUUM_PAGE , blkref #0: rel 1663/16384/16420 blk 1257 FPW
rmgr: Gin         len (rec/tot):      8/  4662, tx:  318119897, lsn: 0/77A2CF10, prev 0/77A2CEC8, desc: INSERT_LISTPAGE , blkref #0: rel 1663/16384/16420 blk 22184
rmgr: Gin         len (rec/tot):     88/   134, tx:  318119897, lsn: 0/77A2E188, prev 0/77A2E160, desc: UPDATE_META_PAGE , blkref #0: rel 1663/16384/16420 blk 0

Another piece of info here that might be relevant.  Almost all UPDATE_META_PAGE xlog records other than the last one have two backup blocks.  The last UPDATE_META_PAGE record only has one backup block.

And the metapage is mostly zeros:

head -c 8192 /tmp/data2_invalid_page/base/16384/16420 | od
0000000 000000 000000 161020 073642 000000 000000 000000 000000
0000020 000000 000000 000000 000000 053250 000000 053250 000000
0000040 006140 000000 000001 000000 000001 000000 000000 000000
0000060 031215 000000 000452 000000 000000 000000 000000 000000
0000100 025370 000000 000000 000000 000002 000000 000000 000000
0000120 000000 000000 000000 000000 000000 000000 000000 000000
*
0020000
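
For anyone who wants to read the dump above as a page header rather than as raw octal words, here is a minimal Python sketch. It assumes a little-endian machine and the usual 24-byte page header layout (pd_lsn as two 32-bit halves, then pd_checksum, pd_flags, pd_lower, pd_upper, pd_special and pd_pagesize_version as 16-bit fields, then a 32-bit pd_prune_xid). The helper names are made up for illustration and are not part of PostgreSQL or pageinspect.

```python
import struct

# The first twelve 16-bit words of the od dump above (octal, as printed).
OD_WORDS = ("000000 000000 161020 073642 000000 000000 000000 000000 "
            "000000 000000 000000 000000")

# Assumed header layout: xlogid, xrecoff, checksum, flags, lower, upper,
# special, pagesize_version, prune_xid (little-endian, no padding).
HEADER_FMT = "<IIHHHHHHI"
HEADER_NAMES = ("xlogid", "xrecoff", "pd_checksum", "pd_flags", "pd_lower",
                "pd_upper", "pd_special", "pd_pagesize_version",
                "pd_prune_xid")


def words_to_bytes(octal_words):
    """Turn od's default output (octal 16-bit words) back into raw bytes."""
    values = [int(word, 8) for word in octal_words.split()]
    # Assumption: the dump was taken on a little-endian host.
    return struct.pack("<%dH" % len(values), *values)


def decode_page_header(raw):
    """Decode the first 24 bytes under the assumed layout."""
    size = struct.calcsize(HEADER_FMT)
    return dict(zip(HEADER_NAMES, struct.unpack(HEADER_FMT, raw[:size])))


header = decode_page_header(words_to_bytes(OD_WORDS))
lsn = "%X/%X" % (header["xlogid"], header["xrecoff"])
print("pd_lsn =", lsn)
for name in HEADER_NAMES[2:]:
    print(name, "=", header[name])
```

Under those assumptions the LSN decodes to 0/77A2E210 and every other header field, including pd_upper, is zero. That LSN is just past the last UPDATE_META_PAGE record listed (0/77A2E188 plus 134 bytes, rounded up to a multiple of 8), which would be consistent with redo having stamped the LSN on the page while the rest of the header stayed zeroed.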

Hmm. Looking at ginRedoUpdateMetapage, I think I see the problem: it doesn't initialize the page. It copies the metapage data, but it doesn't touch the page headers. The only way I can see that that would cause trouble is if the index somehow got truncated away or removed in the standby. That could happen in crash recovery, if you drop the index and then crash, but that should be harmless, because crash recovery doesn't try to read the metapage, only update it (by overwriting it), and by the time crash recovery has completed, the index drop is replayed too.

But AFAICS that bug is present in earlier versions too.
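
To make the failure mode described above concrete, here is a toy Python model. It is not the PostgreSQL code. The names (`init_page`, `redo_update_meta`, `looks_valid`) are hypothetical, the header layout is the same assumed 24-byte layout as a regular page header, and the validity check is a simplification: a page whose pd_upper is zero is treated as "new" and is only accepted if it is entirely zero.

```python
import struct

BLCKSZ = 8192
# Assumed header layout: xlogid, xrecoff, checksum, flags, lower, upper,
# special, pagesize_version, prune_xid (little-endian, no padding).
HEADER_FMT = "<IIHHHHHHI"
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 24 bytes


def init_page(page, special_size=8):
    """Zero the page and fill in a plausible header (toy values)."""
    page[:] = bytes(BLCKSZ)
    special = BLCKSZ - special_size
    struct.pack_into(HEADER_FMT, page, 0,
                     0, 0,          # LSN, set later by redo
                     0, 0,          # checksum, flags
                     HEADER_SIZE,   # pd_lower
                     special,       # pd_upper
                     special,       # pd_special
                     BLCKSZ | 4,    # page size plus layout version (toy)
                     0)             # prune xid


def redo_update_meta(page, metadata, lsn, initialize):
    """Toy redo: copy the metadata and stamp the LSN.

    With initialize=False the header is left untouched, which is the
    behaviour described in the message above.
    """
    if initialize:
        init_page(page)
    page[HEADER_SIZE:HEADER_SIZE + len(metadata)] = metadata
    struct.pack_into("<II", page, 0, lsn >> 32, lsn & 0xFFFFFFFF)


def looks_valid(page):
    """Simplified sanity check, assumed for illustration only."""
    fields = struct.unpack_from(HEADER_FMT, page, 0)
    lower, upper, special = fields[4], fields[5], fields[6]
    if upper != 0:
        return lower <= upper <= special <= BLCKSZ
    # pd_upper == 0 means "new" page: only acceptable if all bytes are zero.
    return not any(page)


meta = bytes([1]) * 88          # stand-in for the metapage payload
lsn = 0x77A2E210

zeroed = bytearray(BLCKSZ)      # block recreated as zeros during replay
redo_update_meta(zeroed, meta, lsn, initialize=False)
print("without init:", looks_valid(zeroed))

fresh = bytearray(BLCKSZ)
redo_update_meta(fresh, meta, lsn, initialize=True)
print("with init:   ", looks_valid(fresh))
```

In this model, replaying onto a zeroed block without initialization leaves a page that has data and an LSN but a zero pd_upper, so the check rejects it. Initializing the page first makes the same replay produce a page that passes.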

Yes, I did see this error reported previously, but it was always after the first appearance of the PANIC, so I assumed it was a sequela of that and didn't investigate it further at the time.

 
Can you reproduce this easily? How?

I can reproduce it fairly easily.

I apply the attached patch and compile with --enable-cassert (full list of configure options: '--enable-debug' '--with-libxml' '--with-perl' '--with-python' '--with-ldap' '--with-openssl' '--with-gssapi' '--prefix=/home/jjanes/pgsql/torn_bisect/' '--enable-cassert')

Then edit do.sh to point to the data directory and installation directory you want, and run that.  It calls count.pl from the same directory.  I started getting the errors after about 10 minutes on an 8-core Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz.

sh do.sh >& do_cassert_fix.out2 &

The output is quite a mess, mixing log output from PostgreSQL and from Perl together.  Since I already know what I'm looking for, I use:

tail -f do_cassert_fix.out2 |fgrep ERROR

Cheers,

Jeff



Attachment
