Re: regression test failed when enabling checksum - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: regression test failed when enabling checksum
Date
Msg-id CAMkU=1x393o6hvJ8Bp0Pk+P4Ad-DdNedUov4cf-aNKssTsv+xg@mail.gmail.com
In response to Re: regression test failed when enabling checksum  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On Monday, April 1, 2013, Jeff Davis wrote:
On Mon, 2013-04-01 at 10:37 -0700, Jeff Janes wrote:

> Over 10,000 cycles of crash and recovery, I encountered two cases of
> checksum failures after recovery, example:
>
>
> 14264 SELECT 2013-03-28 13:08:38.980 PDT:WARNING:  page verification
> failed, calculated checksum 7017 but expected 1098
> 14264 SELECT 2013-03-28 13:08:38.980 PDT:ERROR:  invalid page in block
> 77 of relation base/16384/2088965
>
> 14264 SELECT 2013-03-28 13:08:38.980 PDT:STATEMENT:  select sum(count)
> from foo

It would be nice to know whether that's an index or a heap page.

It is a heap page for the table jjanes.public.foo.
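
As an illustration only, the block and relation named in the ERROR line can be pulled out mechanically. The sketch below parses that line (with its wrapped halves joined) and computes where block 77 would start in the relation file. It assumes the default 8192-byte block size and that the block lives in the first segment file; the function name and regex are hypothetical helpers, not part of PostgreSQL.

```python
import re

# Matches the ERROR line from the log excerpt above, once its wrapped
# halves are joined back into a single line.
INVALID_PAGE = re.compile(
    r"invalid page in block\s+(?P<block>\d+)\s+of relation\s+"
    r"base/(?P<db_oid>\d+)/(?P<relfilenode>\d+)"
)

BLOCK_SIZE = 8192  # assumption: default BLCKSZ; a custom build may differ


def locate_bad_block(log_text):
    """Return (db_oid, relfilenode, block, byte_offset), or None if no match."""
    match = INVALID_PAGE.search(log_text)
    if match is None:
        return None
    block = int(match.group("block"))
    return (
        int(match.group("db_oid")),
        int(match.group("relfilenode")),
        block,
        block * BLOCK_SIZE,  # offset within the first segment file (assumed)
    )


line = ("14264 SELECT 2013-03-28 13:08:38.980 PDT:ERROR:  invalid page in block "
        "77 of relation base/16384/2088965")
print(locate_bad_block(line))
```

The relfilenode extracted here (2088965) is the value that would be compared against `pg_class.relfilenode`, whose `relkind` column distinguishes a heap from an index.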
 

>
> In both cases, the bad block (77 in this case) is the same block that
> was intentionally partially-written during the "crash".  However, that
> block should have been restored from the WAL FPW, so its fragmented
> nature should not have been present in order to be detected.  Any idea
> what is going on?

Not right now. My primary suspect is what's going on in
visibilitymap_set() and heap_xlog_visible(), which is more complex than
some of the other code. That would require some VACUUM activity, which
isn't in your workload -- do you think autovacuum may kick in sometimes?

Yes. One modification to my test harness that I failed to mention is that it now sleeps for 2 minutes after every 100 rounds of crash/recovery, specifically so that autovacuum has a chance to kick in and run to completion. I made that change to avoid wraparound shutdowns on long-running tests. However, "foo" is truncated at the beginning of every test, so I don't think autovacuum would be relevant to that table: any poisoned fruits of autovacuum would be discarded by the truncation.
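
The pacing described above could look roughly like the following. This is a minimal sketch, not the actual harness: `run_harness` and the `crash_and_recover` callable are hypothetical stand-ins for one round of crash and recovery, and only the two constants (100 rounds, 2 minutes) come from the message.

```python
import time

ROUNDS_PER_PAUSE = 100   # from the message: pause after every 100 rounds
PAUSE_SECONDS = 120      # from the message: sleep for 2 minutes


def run_harness(total_rounds, crash_and_recover, sleep=time.sleep):
    """Run crash/recovery rounds, pausing periodically for autovacuum.

    crash_and_recover is a hypothetical callable standing in for one round
    of the real harness; sleep is injectable so the loop can be dry-run.
    Returns the number of pauses taken.
    """
    pauses = 0
    for round_no in range(1, total_rounds + 1):
        crash_and_recover(round_no)
        if round_no % ROUNDS_PER_PAUSE == 0:
            # Idle window in which autovacuum can start and finish.
            sleep(PAUSE_SECONDS)
            pauses += 1
    return pauses


# Dry run with stand-ins: no server is touched and nothing actually sleeps.
slept = []
rounds_seen = []
pauses = run_harness(250, rounds_seen.append, sleep=slept.append)
print(pauses, slept)
```

Passing `sleep` as a parameter keeps the loop testable without waiting; a real run would leave it at the default `time.sleep`.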

Cheers,

Jeff 
