Re: regression test failed when enabling checksum - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: regression test failed when enabling checksum |
Date | |
Msg-id | 20130403093121.GB4682@awork2.anarazel.de |
In response to | regression test failed when enabling checksum (Jeff Janes <jeff.janes@gmail.com>) |
Responses | Re: regression test failed when enabling checksum |
List | pgsql-hackers |
On 2013-04-01 19:51:19 -0700, Jeff Janes wrote:
> On Mon, Apr 1, 2013 at 10:37 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
> > On Tue, Mar 26, 2013 at 4:23 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> >
> >>
> >> Patch attached. Only brief testing done, so I might have missed
> >> something. I will look more closely later.
> >>
> >
> > After applying your patch, I could run the stress test described here:
> >
> > http://archives.postgresql.org/pgsql-hackers/2012-02/msg01227.php
> >
> > But altered to make use of initdb -k, of course.
> >
> > Over 10,000 cycles of crash and recovery, I encountered two cases of
> > checksum failures after recovery, example:
> > ...
> >
> >
> > Unfortunately I already cleaned up the data directory before noticing the
> > problem, so I have nothing to post for forensic analysis. I'll try to
> > reproduce the problem.
> >
> >
> I've reproduced the problem, this time in block 74 of relation
> base/16384/4931589, and a tarball of the data directory is here:
>
> https://docs.google.com/file/d/0Bzqrh1SO9FcELS1majlFcTZsR0k/edit?usp=sharing
>
> (the table is in database jjanes under role jjanes, the binary is commit
> 9ad27c215362df436f8c)
>
> What I would probably really want is the data as it existed after the crash
> but before recovery started, but since the postmaster immediately starts
> recovery after the crash, I don't know of a good way to capture this.
>
> I guess one thing to do would be to extract from the WAL the most recent
> FPW for block 74 of relation base/16384/4931589 (assuming it hasn't been
> recycled already) and see if it matches what is actually in that block of
> that data file, but I don't currently know how to do that.
>
> 11500 SELECT 2013-04-01 12:01:56.926 PDT:WARNING: page verification
> failed, calculated checksum 54570 but expected 34212
> 11500 SELECT 2013-04-01 12:01:56.926 PDT:ERROR: invalid page in block 74
> of relation base/16384/4931589
> 11500 SELECT 2013-04-01 12:01:56.926 PDT:STATEMENT: select sum(count) from
> foo

I just checked and unfortunately your dump doesn't contain all that much
valid WAL:

rmgr: XLOG len (rec/tot): 72/ 104, tx: 0, lsn: 7/AB000028, prev 7/AA000090, bkp: 0000, desc: checkpoint:redo 7/AB000028; tli 1; prev tli 1; fpw true; xid 0/156747297; oid 4939781; multi 1; offset 0; oldest xid 1799in DB 1; oldest multi 1 in DB 1; oldest running xid 0; online
rmgr: XLOG len (rec/tot): 72/ 104, tx: 0, lsn: 7/AB000090, prev 7/AB000028, bkp: 0000, desc: checkpoint:redo 7/AB000090; tli 1; prev tli 1; fpw true; xid 0/156747297; oid 4939781; multi 1; offset 0; oldest xid 1799in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
pg_xlogdump: FATAL: error in WAL record at 7/AB000090: record with zero length at 7/AB0000F8

So just two checkpoint records. Unfortunately I fear that won't be
enough to diagnose the problem, could you reproduce it with a higher
wal_keep_segments?

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
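
For anyone reading along, the LSNs in the pg_xlogdump output above can be checked by hand: each record starts where the previous one ended, and the dump stops at the first position with no valid record. Below is a minimal Python sketch of that arithmetic. It assumes the `X/Y` notation is the high and low 32-bit halves of the position in hex, that records are aligned to 8 bytes, and that no WAL page boundary falls between the records. The helper names are illustrative only and are not part of any PostgreSQL tool.

```python
def parse_lsn(text):
    """Convert an LSN written as 'HI/LO' (both hex) into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(value):
    """Render a 64-bit position back in the 'HI/LO' hex notation."""
    return "%X/%X" % (value >> 32, value & 0xFFFFFFFF)


def next_record_lsn(lsn_text, total_len, align=8):
    """Position right after a record, rounded up to the assumed alignment.

    Assumption: the record does not cross a WAL page boundary, so no
    page header has to be skipped.
    """
    end = parse_lsn(lsn_text) + total_len
    end = (end + align - 1) & ~(align - 1)
    return format_lsn(end)


# (lsn, total length) pairs copied from the two records in the output above.
records = [("7/AB000028", 104), ("7/AB000090", 104)]

for lsn, tot in records:
    print(lsn, "->", next_record_lsn(lsn, tot))
# 7/AB000028 -> 7/AB000090
# 7/AB000090 -> 7/AB0000F8
```

The second result, 7/AB0000F8, is the position at which pg_xlogdump reported the zero-length record, which is consistent with the dump holding only the two checkpoint records.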