Thread: regression test failed when enabling checksum


From: Fujii Masao

Hi,

I found that the regression test failed when I created the database
cluster with the checksum and set wal_level to archive. I think that
there are some bugs around checksum feature. Attached is the regression.diff.

Regards,

--
Fujii Masao

Attachment

Re: regression test failed when enabling checksum

From: Jeff Davis

On Tue, 2013-03-26 at 02:50 +0900, Fujii Masao wrote:
> Hi,
>
> I found that the regression test failed when I created the database
> cluster with the checksum and set wal_level to archive. I think that
> there are some bugs around checksum feature. Attached is the regression.diff.

Thank you for the report. This was a significant oversight, but simple
to diagnose and fix.

There were several places that were doing something like:

   PageSetChecksumInplace
   if (use_wal)
      log_newpage
   smgrextend

This is obviously wrong, because log_newpage sets the LSN of the page,
invalidating the checksum. We need to set the checksum after
log_newpage.
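
Presumably the corrected ordering, per the fix described above, becomes:

   if (use_wal)
      log_newpage
   PageSetChecksumInplace
   smgrextend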

Also, I noticed that copy_relation_data was doing smgrread without
validating the checksum (or page header, for that matter), so I also
fixed that.

Patch attached. Only brief testing done, so I might have missed
something. I will look more closely later.

Regards,
    Jeff Davis

Attachment

Re: regression test failed when enabling checksum

From: Simon Riggs

On 25 March 2013 17:50, Fujii Masao <masao.fujii@gmail.com> wrote:

> I found that the regression test failed when I created the database
> cluster with the checksum and set wal_level to archive. I think that
> there are some bugs around checksum feature. Attached is the regression.diff.

Apologies for not responding to your original email, I must have missed that.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: regression test failed when enabling checksum

From: Simon Riggs

On 26 March 2013 23:23, Jeff Davis <pgsql@j-davis.com> wrote:

> Patch attached. Only brief testing done, so I might have missed
> something. I will look more closely later.

Thanks, I'll look at that tomorrow.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: regression test failed when enabling checksum

From: Jeff Janes

On Tue, Mar 26, 2013 at 4:23 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Tue, 2013-03-26 at 02:50 +0900, Fujii Masao wrote:
> > Hi,
> >
> > I found that the regression test failed when I created the database
> > cluster with the checksum and set wal_level to archive. I think that
> > there are some bugs around checksum feature. Attached is the regression.diff.
>
> Thank you for the report. This was a significant oversight, but simple
> to diagnose and fix.
>
> There were several places that were doing something like:
>
>    PageSetChecksumInplace
>    if (use_wal)
>       log_newpage
>    smgrextend
>
> This is obviously wrong, because log_newpage sets the LSN of the page,
> invalidating the checksum. We need to set the checksum after
> log_newpage.
>
> Also, I noticed that copy_relation_data was doing smgrread without
> validating the checksum (or page header, for that matter), so I also
> fixed that.
>
> Patch attached. Only brief testing done, so I might have missed
> something. I will look more closely later.

After applying your patch, I could run the stress test described here:

http://archives.postgresql.org/pgsql-hackers/2012-02/msg01227.php

But altered to make use of initdb -k, of course.

Over 10,000 cycles of crash and recovery, I encountered two cases of checksum failures after recovery, example:

14264 SELECT 2013-03-28 13:08:38.980 PDT:WARNING:  page verification failed, calculated checksum 7017 but expected 1098
14264 SELECT 2013-03-28 13:08:38.980 PDT:ERROR:  invalid page in block 77 of relation base/16384/2088965
14264 SELECT 2013-03-28 13:08:38.980 PDT:STATEMENT:  select sum(count) from foo

In both cases, the bad block (77 in this case) is the same block that was intentionally partially written during the "crash".  However, that block should have been restored from the WAL FPW, so its torn state should no longer have been present to be detected.  Any idea what is going on?

Unfortunately I already cleaned up the data directory before noticing the problem, so I have nothing to post for forensic analysis.  I'll try to reproduce the problem.

Without the initdb -k option, I ran it for 30,000 cycles and found no problems.  I don't think this is because the problem exists but is going undetected, because my test is designed to detect such problems--if the block is fragmented but not overwritten by WAL FPW, that should occasionally lead to detectable inconsistent tuples.
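
The failure mode under discussion can be illustrated with a toy model (a minimal sketch; zlib.crc32 stands in for PostgreSQL's real 16-bit FNV-based page checksum): checksum a page, then simulate a torn write by zeroing the tail, and verification fails.

```python
import zlib

PAGE_SIZE = 8192

def set_checksum(page: bytearray) -> None:
    """Store a checksum in the first 4 bytes, computed over the rest.
    (PostgreSQL's real page checksum is 16-bit and uses its own
    FNV-derived algorithm; CRC32 just stands in for it here.)"""
    page[0:4] = zlib.crc32(bytes(page[4:])).to_bytes(4, "little")

def verify(page) -> bool:
    return int.from_bytes(page[0:4], "little") == zlib.crc32(bytes(page[4:]))

# A fully written page verifies.
page = bytearray((i % 251) for i in range(PAGE_SIZE))
set_checksum(page)
assert verify(page)

# Simulate a torn write: only the first third reaches disk, the rest is zeros.
torn = page[: PAGE_SIZE // 3] + bytearray(PAGE_SIZE - PAGE_SIZE // 3)
assert not verify(torn)

# The surprise in this thread: the torn page should have been overwritten
# by the full-page image from WAL during recovery, so verification should
# never have seen it.
```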

I don't think your patch caused this particular problem; rather, it fixed a problem that had previously prevented me from running my test at all.

Cheers,

Jeff

Re: regression test failed when enabling checksum

From: Jeff Davis

On Mon, 2013-04-01 at 10:37 -0700, Jeff Janes wrote:

> Over 10,000 cycles of crash and recovery, I encountered two cases of
> checksum failures after recovery, example:
> 
> 
> 14264 SELECT 2013-03-28 13:08:38.980 PDT:WARNING:  page verification
> failed, calculated checksum 7017 but expected 1098
> 14264 SELECT 2013-03-28 13:08:38.980 PDT:ERROR:  invalid page in block
> 77 of relation base/16384/2088965
> 
> 14264 SELECT 2013-03-28 13:08:38.980 PDT:STATEMENT:  select sum(count)
> from foo

It would be nice to know whether that's an index or a heap page.

> 
> In both cases, the bad block (77 in this case) is the same block that
> was intentionally partially-written during the "crash".  However, that
> block should have been restored from the WAL FPW, so its fragmented
> nature should not have been present in order to be detected.  Any idea
> what is going on?  

Not right now. My primary suspect is what's going on in
visibilitymap_set() and heap_xlog_visible(), which is more complex than
some of the other code. That would require some VACUUM activity, which
isn't in your workload -- do you think autovacuum may kick in sometimes?

Thank you for testing! I will try to reproduce it, as well.

Regards,
    Jeff Davis




Re: regression test failed when enabling checksum

From: Jeff Janes

On Mon, Apr 1, 2013 at 10:37 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Mar 26, 2013 at 4:23 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > Patch attached. Only brief testing done, so I might have missed
> > something. I will look more closely later.
>
> After applying your patch, I could run the stress test described here:
>
> http://archives.postgresql.org/pgsql-hackers/2012-02/msg01227.php
>
> But altered to make use of initdb -k, of course.
>
> Over 10,000 cycles of crash and recovery, I encountered two cases of checksum failures after recovery, example:
> ...
>
> Unfortunately I already cleaned up the data directory before noticing the problem, so I have nothing to post for forensic analysis.  I'll try to reproduce the problem.


I've reproduced the problem, this time in block 74 of relation base/16384/4931589, and a tarball of the data directory is here:

https://docs.google.com/file/d/0Bzqrh1SO9FcELS1majlFcTZsR0k/edit?usp=sharing
(the table is in database jjanes under role jjanes, the binary is commit 9ad27c215362df436f8c)

What I would probably really want is the data as it existed after the crash but before recovery started, but since the postmaster immediately starts recovery after the crash, I don't know of a good way to capture this.

I guess one thing to do would be to extract from the WAL the most recent FPW for block 74 of relation base/16384/4931589  (assuming it hasn't been recycled already) and see if it matches what is actually in that block of that data file, but I don't currently know how to do that.

11500 SELECT 2013-04-01 12:01:56.926 PDT:WARNING:  page verification failed, calculated checksum 54570 but expected 34212
11500 SELECT 2013-04-01 12:01:56.926 PDT:ERROR:  invalid page in block 74 of relation base/16384/4931589
11500 SELECT 2013-04-01 12:01:56.926 PDT:STATEMENT:  select sum(count) from foo

Cheers,

Jeff

Re: regression test failed when enabling checksum

From: Jeff Janes

On Monday, April 1, 2013, Jeff Davis wrote:
> On Mon, 2013-04-01 at 10:37 -0700, Jeff Janes wrote:
> > Over 10,000 cycles of crash and recovery, I encountered two cases of
> > checksum failures after recovery, example:
> >
> > 14264 SELECT 2013-03-28 13:08:38.980 PDT:WARNING:  page verification
> > failed, calculated checksum 7017 but expected 1098
> > 14264 SELECT 2013-03-28 13:08:38.980 PDT:ERROR:  invalid page in block
> > 77 of relation base/16384/2088965
> > 14264 SELECT 2013-03-28 13:08:38.980 PDT:STATEMENT:  select sum(count)
> > from foo
>
> It would be nice to know whether that's an index or a heap page.
It is a heap page for the table jjanes.public.foo.

> > In both cases, the bad block (77 in this case) is the same block that
> > was intentionally partially-written during the "crash".  However, that
> > block should have been restored from the WAL FPW, so its fragmented
> > nature should not have been present in order to be detected.  Any idea
> > what is going on?
>
> Not right now. My primary suspect is what's going on in
> visibilitymap_set() and heap_xlog_visible(), which is more complex than
> some of the other code. That would require some VACUUM activity, which
> isn't in your workload -- do you think autovacuum may kick in sometimes?

Yes, a modification to my test harness that I failed to mention is that it now sleeps for 2 minutes after every 100 rounds of crash/recovery, specifically so that autovacuum has a chance to kick in and run to completion.  I made that change to avoid wraparound shutdowns on long-running tests.  However, "foo" is truncated at the beginning of every test, so I don't think this would be relevant to that table, as any poisoned fruits of the autovacuum would be discarded with the truncation.

Cheers,

Jeff 

Re: regression test failed when enabling checksum

From: Simon Riggs

On 2 April 2013 02:53, Jeff Davis <pgsql@j-davis.com> wrote:

>> Any idea
>> what is going on?
>
> Not right now.

Since I'm now responsible for the quality of this patch, I need to say
this before someone else does: we have until the end of the week to
fix this conclusively, or I will need to consider whether to revert
the patch in this release. Thanks for your time and trouble in locating
these problems.

It may be that the patch is revealing an underlying bug, so we don't
yet have evidence for the source of the bug.

> My primary suspect is what's going on in
> visibilitymap_set() and heap_xlog_visible(), which is more complex than
> some of the other code. That would require some VACUUM activity, which
> isn't in your workload -- do you think autovacuum may kick in sometimes?

Other candidates might be:

* the page hole is non-empty, so the overwrite by the full-page write
may leave the block in an intermediate state. Tom recently fixed
RestoreBkpBlock() to avoid it zeroing the contents first before
applying the page.

* something to do with access to GetPageLSN()
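
The first bullet refers to backup blocks being stored with the "hole" (the free space between pd_lower and pd_upper) compressed out, so restore must re-zero that region explicitly. A toy sketch of the round trip, using the lower/upper values from the corrupt page in this thread (the real code is C inside RestoreBkpBlock; this only illustrates the invariant):

```python
PAGE_SIZE = 8192

def compress_fpw(page: bytes, lower: int, upper: int) -> bytes:
    """Store the page minus its hole, as WAL backup blocks do."""
    return page[:lower] + page[upper:]

def restore_fpw(bkp: bytes, lower: int, upper: int) -> bytes:
    """Reconstruct the page, zero-filling the hole explicitly.
    If the hole were left with stale bytes instead, the restored
    page would be in an inconsistent intermediate state."""
    return bkp[:lower] + bytes(upper - lower) + bkp[lower:]

lower, upper = 1188, 5952            # values reported for the corrupt page
page = bytes(range(256)) * (PAGE_SIZE // 256)

bkp = compress_fpw(page, lower, upper)
assert len(bkp) == PAGE_SIZE - (upper - lower)

restored = restore_fpw(bkp, lower, upper)
assert len(restored) == PAGE_SIZE
assert restored[:lower] == page[:lower]          # item pointers intact
assert restored[upper:] == page[upper:]          # tuple data intact
assert restored[lower:upper] == bytes(upper - lower)  # hole is all zeros
```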

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: regression test failed when enabling checksum

From: Andres Freund

On 2013-04-01 19:51:19 -0700, Jeff Janes wrote:
> On Mon, Apr 1, 2013 at 10:37 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
> > On Tue, Mar 26, 2013 at 4:23 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> >
> >>
> >> Patch attached. Only brief testing done, so I might have missed
> >> something. I will look more closely later.
> >>
> >
> > After applying your patch, I could run the stress test described here:
> >
> > http://archives.postgresql.org/pgsql-hackers/2012-02/msg01227.php
> >
> > But altered to make use of initdb -k, of course.
> >
> > Over 10,000 cycles of crash and recovery, I encountered two cases of
> > checksum failures after recovery, example:
> > ...
> >
>
>
> > Unfortunately I already cleaned up the data directory before noticing the
> > problem, so I have nothing to post for forensic analysis.  I'll try to
> > reproduce the problem.
> >
> >
> I've reproduced the problem, this time in block 74 of relation
> base/16384/4931589, and a tarball of the data directory is here:
>
> https://docs.google.com/file/d/0Bzqrh1SO9FcELS1majlFcTZsR0k/edit?usp=sharing
>
> (the table is in database jjanes under role jjanes, the binary is commit
> 9ad27c215362df436f8c)
>
> What I would probably really want is the data as it existed after the crash
> but before recovery started, but since the postmaster immediately starts
> recovery after the crash, I don't know of a good way to capture this.
>
> I guess one thing to do would be to extract from the WAL the most recent
> FPW for block 74 of relation base/16384/4931589  (assuming it hasn't been
> recycled already) and see if it matches what is actually in that block of
> that data file, but I don't currently know how to do that.

Since I bragged somewhere else recently that it should be easy to do now that
we have pg_xlogdump, I hacked it up so it dumps all the full-page writes into
the directory specified by --dump-bkp=PATH. It currently overwrites previous
full-page writes to the same page, but that should be trivial to change if you
want by adding %X.%X for the LSN into the path sprintf.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: regression test failed when enabling checksum

From: Andres Freund

On 2013-04-01 19:51:19 -0700, Jeff Janes wrote:
> On Mon, Apr 1, 2013 at 10:37 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> 
> > On Tue, Mar 26, 2013 at 4:23 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> >
> >>
> >> Patch attached. Only brief testing done, so I might have missed
> >> something. I will look more closely later.
> >>
> >
> > After applying your patch, I could run the stress test described here:
> >
> > http://archives.postgresql.org/pgsql-hackers/2012-02/msg01227.php
> >
> > But altered to make use of initdb -k, of course.
> >
> > Over 10,000 cycles of crash and recovery, I encountered two cases of
> > checksum failures after recovery, example:
> > ...
> >
> 
> 
> > Unfortunately I already cleaned up the data directory before noticing the
> > problem, so I have nothing to post for forensic analysis.  I'll try to
> > reproduce the problem.
> >
> >
> I've reproduced the problem, this time in block 74 of relation
> base/16384/4931589, and a tarball of the data directory is here:
> 
> https://docs.google.com/file/d/0Bzqrh1SO9FcELS1majlFcTZsR0k/edit?usp=sharing
> 
> (the table is in database jjanes under role jjanes, the binary is commit
> 9ad27c215362df436f8c)
> 
> What I would probably really want is the data as it existed after the crash
> but before recovery started, but since the postmaster immediately starts
> recovery after the crash, I don't know of a good way to capture this.
> 
> I guess one thing to do would be to extract from the WAL the most recent
> FPW for block 74 of relation base/16384/4931589  (assuming it hasn't been
> recycled already) and see if it matches what is actually in that block of
> that data file, but I don't currently know how to do that.
> 
> 11500 SELECT 2013-04-01 12:01:56.926 PDT:WARNING:  page verification
> failed, calculated checksum 54570 but expected 34212
> 11500 SELECT 2013-04-01 12:01:56.926 PDT:ERROR:  invalid page in block 74
> of relation base/16384/4931589
> 11500 SELECT 2013-04-01 12:01:56.926 PDT:STATEMENT:  select sum(count) from
> foo

I just checked and unfortunately your dump doesn't contain all that much
valid WAL:
rmgr: XLOG        len (rec/tot):     72/   104, tx:          0, lsn: 7/AB000028, prev 7/AA000090, bkp: 0000, desc:
checkpoint: redo 7/AB000028; tli 1; prev tli 1; fpw true; xid 0/156747297; oid 4939781; multi 1; offset 0; oldest xid
1799 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; online

rmgr: XLOG        len (rec/tot):     72/   104, tx:          0, lsn: 7/AB000090, prev 7/AB000028, bkp: 0000, desc:
checkpoint: redo 7/AB000090; tli 1; prev tli 1; fpw true; xid 0/156747297; oid 4939781; multi 1; offset 0; oldest xid
1799 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown

pg_xlogdump: FATAL:  error in WAL record at 7/AB000090: record with zero length at 7/AB0000F8

So just two checkpoint records.

Unfortunately I fear that won't be enough to diagnose the problem;
could you reproduce it with a higher wal_keep_segments?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: regression test failed when enabling checksum

From: Jeff Janes

On Wed, Apr 3, 2013 at 2:31 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I just checked and unfortunately your dump doesn't contain all that much
> valid WAL:
> ...
>
> So just two checkpoint records.
>
> Unfortunately I fear that won't be enough to diagnose the problem,
> could you reproduce it with a higher wal_keep_segments?

I've been trying, but see message "commit dfda6ebaec67 versus wal_keep_segments".


Looking at some of the log files more, I see that vacuum is involved, but in some way I don't understand.  The crash always happens on a test cycle immediately after the sleep that allows the autovac to kick in and finish.  So the sequence of events goes something like this:

...
run the frantic updating of "foo" until crash
recovery
query "foo" and verify the results are consistent with expectations
sleep to allow autovac to do its job.
truncate "foo" and repopulate it.
run the frantic updating of "foo" until crash
recovery
attempt to query "foo" but get the checksum failure.

What the vacuum is doing that corrupts the system in a way that survives the truncate is a mystery to me.

Also, at one point I had the harness itself exit as soon as it detected the problem, but I failed to have it shut down the server.  So the server kept running idle while autovac did its thing, which produced some interesting log output:

WARNING:  relation "foo" page 45 is uninitialized --- fixing
WARNING:  relation "foo" page 46 is uninitialized --- fixing
...
WARNING:  relation "foo" page 72 is uninitialized --- fixing
WARNING:  relation "foo" page 73 is uninitialized --- fixing
WARNING:  page verification failed, calculated checksum 54570 but expected 34212
ERROR:  invalid page in block 74 of relation base/16384/4931589

This happened 3 times.  Every time, the warnings started on page 45, and they continued up until the invalid page was found (which varied, being 74, 86, and 74 again).

I wonder if the bug is in checksums, or if the checksums are doing their job by finding some other bug.  And why did those uninitialized pages trigger warnings when they were autovacced, but not when they were seq scanned in a query?

Cheers,

Jeff

Re: regression test failed when enabling checksum

From: Andres Freund

On 2013-04-03 09:48:54 -0700, Jeff Janes wrote:
> On Wed, Apr 3, 2013 at 2:31 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I just checked and unfortunately your dump doesn't contain all that much
> > valid WAL:
> > ...
> >
> > So just two checkpoint records.
> >
> > Unfortunately I fear that won't be enough to diagnose the problem,
> > could you reproduce it with a higher wal_keep_segments?
>
> I've been trying, but see message "commit dfda6ebaec67 versus
> wal_keep_segments".

Setting up an archive_command could help if its affordable from the
space perspective :/

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: regression test failed when enabling checksum

From: Jeff Davis

On Wed, 2013-04-03 at 09:48 -0700, Jeff Janes wrote:

>  And why did those uninitialized pages trigger warnings when they were
> autovacced, but not when they were seq scanned in a query?
> 
A scan won't trigger that warning. Empty pages are sometimes the
expected result of a crash when the file is extended but the page is not
written yet. So empty pages aren't necessarily an indication of
corruption (though I'd like to fix that eventually, because sometimes
zeroing is corruption).
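
The distinction can be sketched roughly like this (a toy model of what PageIsVerified-style checks do, with names and details simplified): an all-zeros page is classified as "new" and passes silently on a scan, while vacuum's separate pass is what emits the "uninitialized --- fixing" warning.

```python
PAGE_SIZE = 8192

def page_is_new(page: bytes) -> bool:
    # An all-zeros page: never initialized, which is an expected
    # outcome of a crash that extended the file without writing it.
    return all(b == 0 for b in page)

def verify_page(page: bytes, checksum_ok) -> str:
    """Toy verification: new pages are accepted without a checksum check."""
    if page_is_new(page):
        return "ok (new page)"      # a seq scan passes it without warning
    return "ok" if checksum_ok(page) else "checksum failure"

assert verify_page(bytes(PAGE_SIZE), lambda p: False) == "ok (new page)"
assert verify_page(b"\x01" + bytes(PAGE_SIZE - 1), lambda p: False) == "checksum failure"
```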

Regards,
    Jeff Davis





Re: regression test failed when enabling checksum

From: Jeff Davis

On Mon, 2013-04-01 at 19:51 -0700, Jeff Janes wrote:

> I've reproduced the problem, this time in block 74 of relation
> base/16384/4931589, and a tarball of the data directory is here:
> 
> 
> https://docs.google.com/file/d/0Bzqrh1SO9FcELS1majlFcTZsR0k/edit?usp=sharing
> 
> 
> 
> (the table is in database jjanes under role jjanes, the binary is
> commit 9ad27c215362df436f8c)
> 
> 
> What I would probably really want is the data as it existed after the
> crash but before recovery started, but since the postmaster
> immediately starts recovery after the crash, I don't know of a good
> way to capture this.

Can you just turn off the restart_after_crash GUC? I had a chance to
look at this, and seeing the block before and after recovery would be
nice. I didn't see a log file in the data directory, but the cluster
didn't go through recovery when I started it, so I assume it had
already done so.

The block is corrupt as far as I can tell. The first third is written,
and the remainder is all zeros. The header looks like this:

(These numbers are mostly from pageinspect. The checksum value that
comes from pageinspect needed to be cast back to an unsigned short to
match the error message above -- should that be changed in
pageinspect?).
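
The cast in question: pageinspect reports the checksum as a signed smallint, so values above 32767 come out negative, and mapping back to the unsigned value the error message prints is a 16-bit mask (illustrated in Python):

```python
def smallint_to_uint16(x: int) -> int:
    """Map a signed 16-bit value (as pageinspect's smallint column
    shows it) back to the unsigned checksum the error message reports."""
    return x & 0xFFFF

# 34212 doesn't fit in a signed smallint, so it would display as -31324:
assert smallint_to_uint16(-31324) == 34212
assert smallint_to_uint16(1098) == 1098   # small values are unchanged
```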

lsn: 7/252E4080
checksum: 34212
flags: 1
lower: 1188
upper: 5952
special: 8192
pagesize: 8192
version: 4
prune_xid: 156833911
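
For reference, those fields are the start-of-page header. A sketch of the on-disk layout using Python's struct module, with the values from this page (assumptions: little-endian layout, the 8-byte LSN treated as a single field, and the 9.x-era 24-byte PageHeaderData shape):

```python
import struct

# Sketch of PageHeaderData: 8-byte LSN, then uint16 checksum, flags,
# lower, upper, special, pagesize_version, then uint32 prune_xid = 24 bytes.
HEADER = struct.Struct("<Q6HI")
assert HEADER.size == 24

lsn = (0x7 << 32) | 0x252E4080        # "7/252E4080"
hdr = HEADER.pack(lsn, 34212, 1, 1188, 5952, 8192, 8192 | 4, 156833911)

_, checksum, flags, lower, upper, special, psv, prune_xid = HEADER.unpack(hdr)
assert checksum == 34212
assert psv & 0xFF00 == 8192           # page size is stored in the high bits
assert psv & 0x00FF == 4              # layout version in the low bits
# lower=1188 and upper=5952 bound the free-space "hole"; in the corrupt
# page, everything past roughly the first third is zeros instead of data.
```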

So the header looks good, but most of the page data is missing. I tried
with pg_filedump (the 9.2.0 version, which should be fine), and it
agrees with me that the page is corrupt.

Interestingly, that doesn't result in a user-visible error when
ignore_checksum_failure=on. That's because the item pointers appear to
all be either not normal or they point to the zeroed region. Testing the
visibility of a zeroed tuple header is always false, so no problem.

You'd still think this would cause incorrect results, but I think what's
happening is that this is a new page (otherwise it would have been
written with something other than zeroes before). So, the tuples that
are supposed to be there may be uncommitted anyway.

So, the page may be corrupt without checksums as well, but it just
happens to be hidden for the same reason. Can you try to reproduce it
without -k? And on the checkin right before checksums were added?
Without checksums, you'll need to use pg_filedump (or similar) to find
whether an error has happened.

To start speculating about the root cause: something is violating the
WAL before data rule, or not writing a FPI when it's supposed to, or not
properly restoring the FPI during recovery, or something sets the wrong
LSN. This could still be caused by the checksums patch, but it seems a
little less likely now. The reason I say that is because it's a new page
with tuples on it, so that means something in the insert/update path
ended up not writing the FPI before writing the page.

Regards,
    Jeff Davis