Thread: corrupt pages detected by enabling checksums

corrupt pages detected by enabling checksums

From
Jeff Janes
Date:

I've changed the subject from "regression test failed when enabling checksum" because I now know they are totally unrelated.

My test case didn't need to depend on archiving being on, and so with a simple tweak I rendered the two issues orthogonal.


On Wed, Apr 3, 2013 at 12:15 PM, Jeff Davis <pgsql@j-davis.com> wrote:
On Mon, 2013-04-01 at 19:51 -0700, Jeff Janes wrote:

> What I would probably really want is the data as it existed after the
> crash but before recovery started, but since the postmaster
> immediately starts recovery after the crash, I don't know of a good
> way to capture this.

Can you just turn off the restart_after_crash GUC? I had a chance to
look at this, and seeing the block before and after recovery would be
nice. I didn't see a log file in the data directory, but it didn't go
through recovery, so I assume it already did that.

You don't know that the cluster is in the bad state until after it goes through recovery because most crashes recover perfectly fine.  So it would have to make a side-copy of the cluster after the crash, then recover the original and see how things go, then either retain or delete the side-copy.  Unfortunately my testing harness can't do this at the moment, because the perl script storing the consistency info needs to survive over the crash and recovery.   It might take me awhile to figure out how to make it do this.

I have the server log just go to stderr, where it gets mingled together with messages from my testing harness.  I had not uploaded that file.

Here is a new upload. It contains both a data_dir tarball including xlog, and the mingled stderr ("do_new.out")


The other 3 files in it constitute the harness.  It is a quite a mess, with some hard-coded paths.  The just-posted fix for wal_keep_segments will also have to be applied.

 

The block is corrupt as far as I can tell. The first third is written,
and the remainder is all zeros. The header looks like this:

Yes, that part is by my design.  Why it didn't get fixed from a FPI is not by my design, or course.
 

So, the page may be corrupt without checksums as well, but it just
happens to be hidden for the same reason. Can you try to reproduce it
without -k?

No, things run (seemingly) fine without -k.  
 
And on the checkin right before checksums were added?
Without checksums, you'll need to use pg_filedump (or similar) to find
whether an error has happened.

I'll work on it, but it will take awhile to figure out exactly how to use pg_filedump to do that, and how to automate that process and incorporate it into the harness.

In the meantime, I tested the checksum commit itself (96ef3b8ff1c) and the problem occurs there, so if the problem is not the checksums themselves (and I agree it probably isn't) it must have been introduced before that rather than after.
 
Cheers,

Jeff

Re: corrupt pages detected by enabling checksums

From
Andres Freund
Date:
On 2013-04-03 15:57:49 -0700, Jeff Janes wrote:
> I've changed the subject from "regression test failed when enabling
> checksum" because I now know they are totally unrelated.
> 
> My test case didn't need to depend on archiving being on, and so with a
> simple tweak I rendered the two issues orthogonal.
> 
> 
> On Wed, Apr 3, 2013 at 12:15 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> 
> > On Mon, 2013-04-01 at 19:51 -0700, Jeff Janes wrote:
> >
> > > What I would probably really want is the data as it existed after the
> > > crash but before recovery started, but since the postmaster
> > > immediately starts recovery after the crash, I don't know of a good
> > > way to capture this.
> >
> > Can you just turn off the restart_after_crash GUC? I had a chance to
> > look at this, and seeing the block before and after recovery would be
> > nice. I didn't see a log file in the data directory, but it didn't go
> > through recovery, so I assume it already did that.
> >
> 
> You don't know that the cluster is in the bad state until after it goes
> through recovery because most crashes recover perfectly fine.  So it would
> have to make a side-copy of the cluster after the crash, then recover the
> original and see how things go, then either retain or delete the side-copy.
>  Unfortunately my testing harness can't do this at the moment, because the
> perl script storing the consistency info needs to survive over the crash
> and recovery.   It might take me awhile to figure out how to make it do
> this.
> 
> I have the server log just go to stderr, where it gets mingled together
> with messages from my testing harness.  I had not uploaded that file.
> 
> Here is a new upload. It contains both a data_dir tarball including xlog,
> and the mingled stderr ("do_new.out")
> 
> https://drive.google.com/folderview?id=0Bzqrh1SO9FcEQmVzSjlmdWZvUHc&usp=sharing
> 
> The other 3 files in it constitute the harness.  It is a quite a mess, with
> some hard-coded paths.  The just-posted fix for wal_keep_segments will also
> have to be applied.
> 
> 
> 
> >
> > The block is corrupt as far as I can tell. The first third is written,
> > and the remainder is all zeros. The header looks like this:
> >
> 
> Yes, that part is by my design.  Why it didn't get fixed from a FPI is not
> by my design, or course.

There are no FPIs (if you mean a full page image with that) afaics:

Your logs tell us about recovery:
27692  2013-04-03 10:09:15.621 PDT:LOG:  redo starts at 1/31000090
27692  2013-04-03 10:09:15.647 PDT:LOG:  incorrect resource manager data checksum in record at 1/31169A68
27692  2013-04-03 10:09:15.647 PDT:LOG:  redo done at 1/31169A38

And then the error:

27698 SELECT 2013-04-03 10:09:16.688 PDT:WARNING:  page verification failed, calculated checksum 16296 but expected
49517
27698 SELECT 2013-04-03 10:09:16.688 PDT:ERROR:  invalid page in block 90 of relation base/16384/835589


looking at xlog from that time, the first records are:
rmgr: XLOG        len (rec/tot):     72/   104, tx:          0, lsn: 1/31000028, prev 1/30000090, bkp: 0000, desc:
checkpoint:redo 1/31000028; tli 1; prev tli 1; fpw true; xid 0/26254997; oid 835589; multi 1; offset 0; oldest xid 1799
inDB 1; oldest multi 1 in DB 1; oldest running xid 0; online
 
rmgr: XLOG        len (rec/tot):      4/    36, tx:          0, lsn: 1/31000090, prev 1/31000028, bkp: 0000, desc:
nextOid:843781
 
rmgr: Storage     len (rec/tot):     16/    48, tx:          0, lsn: 1/310000B8, prev 1/31000090, bkp: 0000, desc: file
create:base/16384/835589
 
rmgr: Heap        len (rec/tot):     37/  7905, tx:   26254997, lsn: 1/310000E8, prev 1/310000B8, bkp: 1000, desc:
hot_update:rel 1663/16384/12660; tid 0/47 xmax 26254997 ; new tid 0/33 xmax 0
 

So the file was created after the last checkpoint, so no full pages need
to be logged. And we theoretically should be able to recovery all things
like torn pages from the WAL since that should cover everything that
happened to that file.

The bkp block 1 indicated in the 4th record above is the only one in
that period of WAL btw.

Looking a bit more, the last page that's covered by WAL is:
rmgr: Heap        len (rec/tot):     35/    67, tx:   26254998, lsn: 1/311699A8, prev 1/31169960, bkp: 0000, desc:
insert:rel 1663/16384/835589; tid 44/56
 

which is far earlier than the "block 90" the error is found in unless i
misunderstand something. Which is strange, since non-zero content to
those pages shouldn't go to disk until the respective WAL is
flushed. Huh, are we missing something here?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: corrupt pages detected by enabling checksums

From
andres@anarazel.de (Andres Freund)
Date:
On 2013-04-04 01:52:41 +0200, Andres Freund wrote:
> On 2013-04-03 15:57:49 -0700, Jeff Janes wrote:
> > I've changed the subject from "regression test failed when enabling
> > checksum" because I now know they are totally unrelated.
> > 
> > My test case didn't need to depend on archiving being on, and so with a
> > simple tweak I rendered the two issues orthogonal.
> > 
> > 
> > On Wed, Apr 3, 2013 at 12:15 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > 
> > > On Mon, 2013-04-01 at 19:51 -0700, Jeff Janes wrote:
> > >
> > > > What I would probably really want is the data as it existed after the
> > > > crash but before recovery started, but since the postmaster
> > > > immediately starts recovery after the crash, I don't know of a good
> > > > way to capture this.
> > >
> > > Can you just turn off the restart_after_crash GUC? I had a chance to
> > > look at this, and seeing the block before and after recovery would be
> > > nice. I didn't see a log file in the data directory, but it didn't go
> > > through recovery, so I assume it already did that.
> > >
> > 
> > You don't know that the cluster is in the bad state until after it goes
> > through recovery because most crashes recover perfectly fine.  So it would
> > have to make a side-copy of the cluster after the crash, then recover the
> > original and see how things go, then either retain or delete the side-copy.
> >  Unfortunately my testing harness can't do this at the moment, because the
> > perl script storing the consistency info needs to survive over the crash
> > and recovery.   It might take me awhile to figure out how to make it do
> > this.
> > 
> > I have the server log just go to stderr, where it gets mingled together
> > with messages from my testing harness.  I had not uploaded that file.
> > 
> > Here is a new upload. It contains both a data_dir tarball including xlog,
> > and the mingled stderr ("do_new.out")
> > 
> > https://drive.google.com/folderview?id=0Bzqrh1SO9FcEQmVzSjlmdWZvUHc&usp=sharing
> > 
> > The other 3 files in it constitute the harness.  It is a quite a mess, with
> > some hard-coded paths.  The just-posted fix for wal_keep_segments will also
> > have to be applied.
> > 
> > 
> > 
> > >
> > > The block is corrupt as far as I can tell. The first third is written,
> > > and the remainder is all zeros. The header looks like this:
> > >
> > 
> > Yes, that part is by my design.  Why it didn't get fixed from a FPI is not
> > by my design, or course.
> 
> There are no FPIs (if you mean a full page image with that) afaics:
> 
> Your logs tell us about recovery:
> 27692  2013-04-03 10:09:15.621 PDT:LOG:  redo starts at 1/31000090
> 27692  2013-04-03 10:09:15.647 PDT:LOG:  incorrect resource manager data checksum in record at 1/31169A68
> 27692  2013-04-03 10:09:15.647 PDT:LOG:  redo done at 1/31169A38
> 
> And then the error:
> 
> 27698 SELECT 2013-04-03 10:09:16.688 PDT:WARNING:  page verification failed, calculated checksum 16296 but expected
49517
> 27698 SELECT 2013-04-03 10:09:16.688 PDT:ERROR:  invalid page in block 90 of relation base/16384/835589
> 
> 
> looking at xlog from that time, the first records are:
> rmgr: XLOG        len (rec/tot):     72/   104, tx:          0, lsn: 1/31000028, prev 1/30000090, bkp: 0000, desc:
checkpoint:redo 1/31000028; tli 1; prev tli 1; fpw true; xid 0/26254997; oid 835589; multi 1; offset 0; oldest xid 1799
inDB 1; oldest multi 1 in DB 1; oldest running xid 0; online
 
> rmgr: XLOG        len (rec/tot):      4/    36, tx:          0, lsn: 1/31000090, prev 1/31000028, bkp: 0000, desc:
nextOid:843781
 
> rmgr: Storage     len (rec/tot):     16/    48, tx:          0, lsn: 1/310000B8, prev 1/31000090, bkp: 0000, desc:
filecreate: base/16384/835589
 
> rmgr: Heap        len (rec/tot):     37/  7905, tx:   26254997, lsn: 1/310000E8, prev 1/310000B8, bkp: 1000, desc:
hot_update:rel 1663/16384/12660; tid 0/47 xmax 26254997 ; new tid 0/33 xmax 0
 
> 
> So the file was created after the last checkpoint, so no full pages need
> to be logged. And we theoretically should be able to recovery all things
> like torn pages from the WAL since that should cover everything that
> happened to that file.
> 
> The bkp block 1 indicated in the 4th record above is the only one in
> that period of WAL btw.
> 
> Looking a bit more, the last page that's covered by WAL is:
> rmgr: Heap        len (rec/tot):     35/    67, tx:   26254998, lsn: 1/311699A8, prev 1/31169960, bkp: 0000, desc:
insert:rel 1663/16384/835589; tid 44/56
 
> 
> which is far earlier than the "block 90" the error is found in unless i
> misunderstand something. Which is strange, since non-zero content to
> those pages shouldn't go to disk until the respective WAL is
> flushed. Huh, are we missing something here?

Looking at the page lsn's with dd I noticed something peculiar:

page 0:
01 00 00 00 18 c2 00 31 => 1/3100C218
page 1:
01 00 00 00 80 44 01 31 => 1/31014480
page 10:
01 00 00 00 60 ce 05 31 => 1/3105ce60
page 43:
01 00 00 00 58 7a 16 31 => 1/31167a58
page 44:
01 00 00 00 f0 99 16 31 => 1/311699f0
page 45:
00 00 00 00 00 00 00 00 => 0/0
page 90:
01 00 00 00 90 17 1d 32 => 1/321d1790
page 91:
01 00 00 00 38 ef 1b 32 => 1/321bef38

So we have written out pages that are after pages without a LSN that
have an LSN thats *beyond* the point XLOG has successfully been written
to disk (1/31169A38)?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: corrupt pages detected by enabling checksums

From
Jeff Davis
Date:
 On Wed, 2013-04-03 at 15:57 -0700, Jeff Janes wrote:



> You don't know that the cluster is in the bad state until after it
> goes through recovery because most crashes recover perfectly fine.  So
> it would have to make a side-copy of the cluster after the crash, then
> recover the original and see how things go, then either retain or
> delete the side-copy.  Unfortunately my testing harness can't do this
> at the moment, because the perl script storing the consistency info
> needs to survive over the crash and recovery.   It might take
> me awhile to figure out how to make it do this.

Hmm... you're right, that is difficult to integrate into a test.

Another approach is to expand PageHeaderIsValid in a pre-checksums
version to do some extra checks (perhaps the same checks as pg_filedump
does). It might be able to at least detect simple cases, like all zeros
between pd_upper and pd_special (when pd_upper < pd_special). Still not
easy, but maybe more practical than saving the directory like that. We
may even want a GUC for that to aid future debugging (obviously it would
have a performance impact).

> The other 3 files in it constitute the harness.  It is a quite a mess,
> with some hard-coded paths.  The just-posted fix for wal_keep_segments
> will also have to be applied.

I'm still trying to reproduce the problem myself. I keep trying simpler
versions of your test and I don't see the problem. It's a little awkward
to try to run the full version of your test (and I'm jumping between
code inspection and various other ideas).

> I'll work on it, but it will take awhile to figure out exactly how to
> use pg_filedump to do that, and how to automate that process and
> incorporate it into the harness.

It might be more productive to try to expand PageHeaderIsValid as I
suggested above.

> In the meantime, I tested the checksum commit itself (96ef3b8ff1c) and
> the problem occurs there, so if the problem is not the checksums
> themselves (and I agree it probably isn't) it must have been
> introduced before that rather than after.

Thank you for all of your work on this. I have also attached another
test patch that disables most of the complex areas of the checksums
patch (VM bits, hint bits, etc.). If you apply different portions of the
patch (or the whole thing), it will help eliminate possibilities. In
particular, eliminating the calls to PageSetAllVisible+visibilitymap_set
could tell us a lot.

Also, the first data directory you posted had a problem with a page that
appeared to be a new page (otherwise I would have expected either old or
new data at the end). Do you think this might be a problem only for new
pages? Or do you think your test just makes it likely that many writes
will happen on new pages?

I did find one potential problem in the checksums code (aside from my
previous patch involving wal_level=archive): visibilitymap_set is called
sometimes before the heap page is marked dirty. I will examine a little
more closely, but I expect that to require another patch. Not sure if
that could explain this problem or not.

Regards,
    Jeff Davis


Attachment

Re: corrupt pages detected by enabling checksums

From
Andres Freund
Date:
On 2013-04-04 02:28:32 +0200, Andres Freund wrote:
> On 2013-04-04 01:52:41 +0200, Andres Freund wrote:
> > On 2013-04-03 15:57:49 -0700, Jeff Janes wrote:
> > > I've changed the subject from "regression test failed when enabling
> > > checksum" because I now know they are totally unrelated.
> > > 
> > > My test case didn't need to depend on archiving being on, and so with a
> > > simple tweak I rendered the two issues orthogonal.
> > > 
> > > 
> > > On Wed, Apr 3, 2013 at 12:15 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > > 
> > > > On Mon, 2013-04-01 at 19:51 -0700, Jeff Janes wrote:
> > > >
> > > > > What I would probably really want is the data as it existed after the
> > > > > crash but before recovery started, but since the postmaster
> > > > > immediately starts recovery after the crash, I don't know of a good
> > > > > way to capture this.
> > > >
> > > > Can you just turn off the restart_after_crash GUC? I had a chance to
> > > > look at this, and seeing the block before and after recovery would be
> > > > nice. I didn't see a log file in the data directory, but it didn't go
> > > > through recovery, so I assume it already did that.
> > > >
> > > 
> > > You don't know that the cluster is in the bad state until after it goes
> > > through recovery because most crashes recover perfectly fine.  So it would
> > > have to make a side-copy of the cluster after the crash, then recover the
> > > original and see how things go, then either retain or delete the side-copy.
> > >  Unfortunately my testing harness can't do this at the moment, because the
> > > perl script storing the consistency info needs to survive over the crash
> > > and recovery.   It might take me awhile to figure out how to make it do
> > > this.
> > > 
> > > I have the server log just go to stderr, where it gets mingled together
> > > with messages from my testing harness.  I had not uploaded that file.
> > > 
> > > Here is a new upload. It contains both a data_dir tarball including xlog,
> > > and the mingled stderr ("do_new.out")
> > > 
> > > https://drive.google.com/folderview?id=0Bzqrh1SO9FcEQmVzSjlmdWZvUHc&usp=sharing
> > > 
> > > The other 3 files in it constitute the harness.  It is a quite a mess, with
> > > some hard-coded paths.  The just-posted fix for wal_keep_segments will also
> > > have to be applied.
> > > 
> > > 
> > > 
> > > >
> > > > The block is corrupt as far as I can tell. The first third is written,
> > > > and the remainder is all zeros. The header looks like this:
> > > >
> > > 
> > > Yes, that part is by my design.  Why it didn't get fixed from a FPI is not
> > > by my design, or course.
> > 
> > There are no FPIs (if you mean a full page image with that) afaics:
> > 
> > Your logs tell us about recovery:
> > 27692  2013-04-03 10:09:15.621 PDT:LOG:  redo starts at 1/31000090
> > 27692  2013-04-03 10:09:15.647 PDT:LOG:  incorrect resource manager data checksum in record at 1/31169A68
> > 27692  2013-04-03 10:09:15.647 PDT:LOG:  redo done at 1/31169A38

> Looking at the page lsn's with dd I noticed something peculiar:
> 
> page 0:
> 01 00 00 00 18 c2 00 31 => 1/3100C218
> page 1:
> 01 00 00 00 80 44 01 31 => 1/31014480
> page 10:
> 01 00 00 00 60 ce 05 31 => 1/3105ce60
> page 43:
> 01 00 00 00 58 7a 16 31 => 1/31167a58
> page 44:
> 01 00 00 00 f0 99 16 31 => 1/311699f0
> page 45:
> 00 00 00 00 00 00 00 00 => 0/0
> page 90:
> 01 00 00 00 90 17 1d 32 => 1/321d1790
> page 91:
> 01 00 00 00 38 ef 1b 32 => 1/321bef38
> 
> So we have written out pages that are after pages without a LSN that
> have an LSN thats *beyond* the point XLOG has successfully been written
> to disk (1/31169A38)?

By now I am pretty sure the issue is that the end-of-wal is detected too
early. Above the end is detected at 1/31169A38, but the next WAL file
contains valid data:
pg_xlogdump /tmp/tmp/data2/pg_xlog/000000010000000100000032 -n 10
rmgr: Heap2       len (rec/tot):     26/    58, tx:          0, lsn: 1/32000030, prev 1/31FFFFD8, bkp: 0000, desc:
clean:rel 1663/16384/835589; blk 122 remxid 26328926
 
rmgr: Heap        len (rec/tot):     51/    83, tx:   26328964, lsn: 1/32000070, prev 1/32000030, bkp: 0000, desc:
update:rel 1663/16384/835589; tid 122/191 xmax 26328964 ; new tid 129/140 xmax 0
 
rmgr: Heap2       len (rec/tot):     30/    62, tx:          0, lsn: 1/320000C8, prev 1/32000070, bkp: 0000, desc:
clean:rel 1663/16384/835589; blk 63 remxid 26328803
 
rmgr: Btree       len (rec/tot):     34/    66, tx:   26328964, lsn: 1/32000108, prev 1/320000C8, bkp: 0000, desc:
insert:rel 1663/16384/835590; tid 25/14
 

That would explains exactly those symptopms, wouldn't it? Whats causing
that - I am not sure yet.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: corrupt pages detected by enabling checksums

From
Tom Lane
Date:
andres@anarazel.de (Andres Freund) writes:
> Looking at the page lsn's with dd I noticed something peculiar:

> page 0:
> 01 00 00 00 18 c2 00 31 => 1/3100C218
> page 1:
> 01 00 00 00 80 44 01 31 => 1/31014480
> page 10:
> 01 00 00 00 60 ce 05 31 => 1/3105ce60
> page 43:
> 01 00 00 00 58 7a 16 31 => 1/31167a58
> page 44:
> 01 00 00 00 f0 99 16 31 => 1/311699f0
> page 45:
> 00 00 00 00 00 00 00 00 => 0/0
> page 90:
> 01 00 00 00 90 17 1d 32 => 1/321d1790
> page 91:
> 01 00 00 00 38 ef 1b 32 => 1/321bef38

> So we have written out pages that are after pages without a LSN that
> have an LSN thats *beyond* the point XLOG has successfully been written
> to disk (1/31169A38)?

If you're looking into the FPIs, those would contain the page's older
LSN, not the one assigned by the current WAL record.
        regards, tom lane



Re: corrupt pages detected by enabling checksums

From
Andres Freund
Date:
On 2013-04-03 20:45:51 -0400, Tom Lane wrote:
> andres@anarazel.de (Andres Freund) writes:
> > Looking at the page lsn's with dd I noticed something peculiar:
> 
> > page 0:
> > 01 00 00 00 18 c2 00 31 => 1/3100C218
> > page 1:
> > 01 00 00 00 80 44 01 31 => 1/31014480
> > page 10:
> > 01 00 00 00 60 ce 05 31 => 1/3105ce60
> > page 43:
> > 01 00 00 00 58 7a 16 31 => 1/31167a58
> > page 44:
> > 01 00 00 00 f0 99 16 31 => 1/311699f0
> > page 45:
> > 00 00 00 00 00 00 00 00 => 0/0
> > page 90:
> > 01 00 00 00 90 17 1d 32 => 1/321d1790
> > page 91:
> > 01 00 00 00 38 ef 1b 32 => 1/321bef38
> 
> > So we have written out pages that are after pages without a LSN that
> > have an LSN thats *beyond* the point XLOG has successfully been written
> > to disk (1/31169A38)?
> 
> If you're looking into the FPIs, those would contain the page's older
> LSN, not the one assigned by the current WAL record.

Nope, thats from the heap, and the LSNs are *newer* than what startup
recovered to. I am pretty sure by now we are missing out on valid WAL, I
am just not sure why.

Unfortunately we can't easily diagnose what happened at:
27692  2013-04-03 10:09:15.647 PDT:LOG:  incorrect resource manager data checksum in record at 1/31169A68
since the startup process wrote its end of recovery checkpoint there:
rmgr: XLOG        len (rec/tot):     72/   104, tx:          0, lsn: 1/31169A68, prev 1/31169A38, bkp: 0000, desc:
checkpoint:redo 1/31169A68; tli 1; prev tli 1; fpw true; xid 0/26254999; oid 843781; multi 1; offset 0; oldest xid 1799
inDB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
 

Starting from a some blocks in that wal segments later:
pg_xlogdump /tmp/tmp/data2/pg_xlog/000000010000000100000031 -s 1/3116c000 -n 10
first record is after 1/3116C000, at 1/3116D9D8, skipping over 6616 bytes
rmgr: Heap        len (rec/tot):     51/    83, tx:   26254999, lsn: 1/3116D9D8, prev 1/3116BA20, bkp: 0000, desc:
update:rel 1663/16384/835589; tid 38/148 xmax 26254999 ; new tid 44/57 xmax 0
 
rmgr: Btree       len (rec/tot):     34/    66, tx:   26254999, lsn: 1/3116DA30, prev 1/3116D9D8, bkp: 0000, desc:
insert:rel 1663/16384/835590; tid 25/319
 
rmgr: Heap        len (rec/tot):     51/    83, tx:   26255000, lsn: 1/3116DA78, prev 1/3116DA30, bkp: 0000, desc:
update:rel 1663/16384/835589; tid 19/214 xmax 26255000 ; new tid 44/58 xmax 0
 

the records continue again.

Greetings,


Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: corrupt pages detected by enabling checksums

From
Andres Freund
Date:
On 2013-04-04 02:58:43 +0200, Andres Freund wrote:
> On 2013-04-03 20:45:51 -0400, Tom Lane wrote:
> > andres@anarazel.de (Andres Freund) writes:
> > > Looking at the page lsn's with dd I noticed something peculiar:
> >
> > > page 0:
> > > 01 00 00 00 18 c2 00 31 => 1/3100C218
> > > page 1:
> > > 01 00 00 00 80 44 01 31 => 1/31014480
> > > page 10:
> > > 01 00 00 00 60 ce 05 31 => 1/3105ce60
> > > page 43:
> > > 01 00 00 00 58 7a 16 31 => 1/31167a58
> > > page 44:
> > > 01 00 00 00 f0 99 16 31 => 1/311699f0
> > > page 45:
> > > 00 00 00 00 00 00 00 00 => 0/0
> > > page 90:
> > > 01 00 00 00 90 17 1d 32 => 1/321d1790
> > > page 91:
> > > 01 00 00 00 38 ef 1b 32 => 1/321bef38
> >
> > > So we have written out pages that are after pages without a LSN that
> > > have an LSN thats *beyond* the point XLOG has successfully been written
> > > to disk (1/31169A38)?
> >
> > If you're looking into the FPIs, those would contain the page's older
> > LSN, not the one assigned by the current WAL record.
>
> Nope, thats from the heap, and the LSNs are *newer* than what startup
> recovered to. I am pretty sure by now we are missing out on valid WAL, I
> am just not sure why.

Ok, I think I see the bug. And I think its been introduced in the
checkpoints patch.

Remember, as explained in other mails, I am pretty sure the end of wal
was errorneously detected, resulting in the following error:
27692  2013-04-03 10:09:15.647 PDT:LOG:  incorrect resource manager data checksum in record at 1/31169A68

Not that without a *hardware* crash end of wal is normally found first
by the checks for wrong LSNs or wrong record lengths. Not here.

My theory is that a page that was full page written changed while it was
inserted into the WAL which caused the error. A possibly scenario for
that would e.g. two concurrent calls to MarkBufferDirtyHint().
Note:* 2. The caller might have only share-lock instead of exclusive-lock on the*    buffer's content lock.
if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=    (BM_DIRTY | BM_JUST_DIRTIED)){    if (DataChecksumsEnabled()
&&(bufHdr->flags & BM_PERMANENT))    {        MyPgXact->delayChkpt = delayChkpt = true;        lsn =
XLogSaveBufferForHint(buffer);/* PART 1 */    }    }
 
    LockBufHdr(bufHdr);    Assert(bufHdr->refcount > 0);
    if (!(bufHdr->flags & BM_DIRTY))    {        dirtied = true;        /* Means "will be dirtied by this action" */
        /*         * Set the page LSN if we wrote a backup block. We aren't         * supposed to set this when only
holdinga share lock but         * as long as we serialise it somehow we're OK. We choose to         * set LSN while
holdingthe buffer header lock, which causes         * any reader of an LSN who holds only a share lock to also
*obtain a buffer header lock before using PageGetLSN().         * Fortunately, thats not too many places.         *
   * If checksums are enabled, you might think we should reset the         * checksum here. That will happen when the
pageis written         * sometime later in this checkpoint cycle.         */        if (!XLogRecPtrIsInvalid(lsn))
     PageSetLSN(page, lsn); /* PART 2 */    }
 

So XLogSaveBufferForHint(buffer) is called for all buffers which are not
yet dirty. Without any locks. It just does:

XLogSaveBufferForHint(Buffer buffer)
{rdata[0].data = (char *) (&watermark);rdata[0].len = sizeof(int);rdata[0].next = &(rdata[1]);
rdata[1].data = NULL;rdata[1].len = 0;rdata[1].buffer = buffer;rdata[1].buffer_std = true;rdata[1].next = NULL;
return XLogInsert(RM_XLOG_ID, XLOG_HINT, rdata);
}

I think what happens is that that one backend grabs WALInsertLock first,
it computes a full page write for the buffer, including a CRC. And
returns a valid LSN. Then it continues towards the PageSetLSN() after
LockBufHdr().
Allthewhile the second backend notices that WALInsertLock is free again
and continues towards the WALInsertLock protected parts of XLogInsert().

Only that it already has done that part:/* * Calculate CRC of the data, including all the backup blocks * * Note that
therecord header isn't added into the CRC initially since we * don't know the prev-link yet.  Thus, the CRC will
representthe CRC of * the whole record in the order: rdata, then backup blocks, then record * header.
*/INIT_CRC32(rdata_crc);for(rdt = rdata; rdt != NULL; rdt = rdt->next)    COMP_CRC32(rdata_crc, rdt->data, rdt->len);
 

It goes on to write all the data to the WAL after successfully getting
WALInsertLock:while (write_len){    while (rdata->data == NULL)        rdata = rdata->next;
    if (freespace > 0)    {        if (rdata->len > freespace)        {            memcpy(Insert->currpos, rdata->data,
freespace);           rdata->data += freespace;            rdata->len -= freespace;            write_len -= freespace;
     }        else        {            memcpy(Insert->currpos, rdata->data, rdata->len);            freespace -=
rdata->len;           write_len -= rdata->len;            Insert->currpos += rdata->len;            rdata =
rdata->next;           continue;        }    }
 
    /* Use next buffer */    updrqst = AdvanceXLInsertBuffer(false);    curridx = Insert->curridx;    /* Mark page
headerto indicate this record continues on the page */    Insert->currpage->xlp_info |= XLP_FIRST_IS_CONTRECORD;
Insert->currpage->xlp_rem_len= write_len;    freespace = INSERT_FREESPACE(Insert);}
 

If by now the first backend has proceeded to PageSetLSN() we are writing
different data to disk than the one we computed the checksum of
before. Boom.

I think the whole locking interactions in MarkBufferDirtyHint() need to
be thought over pretty carefully.

Does this explanation make sense?

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: corrupt pages detected by enabling checksums

From
Simon Riggs
Date:
On 4 April 2013 02:39, Andres Freund <andres@2ndquadrant.com> wrote:

> Ok, I think I see the bug. And I think its been introduced in the
> checkpoints patch.

Well spotted. (I think you mean checksums patch).

> If by now the first backend has proceeded to PageSetLSN() we are writing
> different data to disk than the one we computed the checksum of
> before. Boom.

Right, so nothing else we were doing was wrong, that's why we couldn't
spot a bug. The problem is that we aren't replaying enough WAL because
the checksum on the WAL record is broke.

> I think the whole locking interactions in MarkBufferDirtyHint() need to
> be thought over pretty carefully.

When we write out a buffer with checksums enabled, we take a copy of
the buffer so that the checksum is consistent, even while other
backends may be writing hints to the same bufer.

I missed out on doing that with XLOG_HINT records, so the WAL CRC can
be incorrect because the data is scanned twice; normally that would be
OK because we have an exclusive lock on the block, but with hints we
only have share lock. So what we need to do is take a copy of the
buffer before we do XLogInsert().

Simple patch to do this attached for discussion. (Not tested).

We might also do this by modifying the WAL record to take the whole
block and bypass the BkpBlock mechanism entirely. But that's more work
and doesn't seem like it would be any cleaner. I figure lets solve the
problem first then discuss which approach is best.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: corrupt pages detected by enabling checksums

From
Andres Freund
Date:
On 2013-04-04 13:30:40 +0100, Simon Riggs wrote:
> On 4 April 2013 02:39, Andres Freund <andres@2ndquadrant.com> wrote:
>
> > Ok, I think I see the bug. And I think its been introduced in the
> > checkpoints patch.
>
> Well spotted. (I think you mean checksums patch).

Heh, yes. I was slightly tired at that point ;)

> > If by now the first backend has proceeded to PageSetLSN() we are writing
> > different data to disk than the one we computed the checksum of
> > before. Boom.
>
> Right, so nothing else we were doing was wrong, that's why we couldn't
> spot a bug. The problem is that we aren't replaying enough WAL because
> the checksum on the WAL record is broke.

Well, there might be other bugs we can't see yet... But lets hope not.

> > I think the whole locking interactions in MarkBufferDirtyHint() need to
> > be thought over pretty carefully.
>
> When we write out a buffer with checksums enabled, we take a copy of
> the buffer so that the checksum is consistent, even while other
> backends may be writing hints to the same bufer.
>
> I missed out on doing that with XLOG_HINT records, so the WAL CRC can
> be incorrect because the data is scanned twice; normally that would be
> OK because we have an exclusive lock on the block, but with hints we
> only have share lock. So what we need to do is take a copy of the
> buffer before we do XLogInsert().
>
> Simple patch to do this attached for discussion. (Not tested).
>
> We might also do this by modifying the WAL record to take the whole
> block and bypass the BkpBlock mechanism entirely. But that's more work
> and doesn't seem like it would be any cleaner. I figure lets solve the
> problem first then discuss which approach is best.

Unfortunately I find that approach unacceptably ugly. I don't think its
ok to make an already *very* complicated function (XLogInsert()) in one
of the most complicated files (xlog.c) even more complicated. It
literally took me years to understand a large percentage of it.
I even think the XLOG_HINT escape hatch should be taken out and be
replaced by something outside of XLogInsert().

Don't think that approach would work with as few lines anyway, since you
need to properly pass it into XLogCheckBuffer() et al which checks the
buffer tags and such - which it needs to properly compute the struct
BkpBlock. Also we iterate over rdata->page for the CRC computation.

I think the route you quickly sketched is more realistic. That would
remove all knowledge obout XLOG_HINT from generic code hich is a very
good thing, I spent like 15minutes yesterday wondering whether the early
return in there might be the cause of the bug...

Something like:

XLogRecPtr
XLogSaveBufferForHint(Buffer buffer)
{   XLogRecPtr recptr = InvalidXLogRecPtr;   XLogRecPtr lsn;   XLogRecData rdata[2];   BkbBlock bkp;
   /*    * make sure no checkpoint can happen while we decide about logging    * something or not, since we don't hold
anylock preventing that    * otherwise.    */   Assert(MyPgXact->delayChkpt);
 
   /* update RedoRecPtr */   GetRedoRecPtr();
   /* setup phony rdata element */   rdata[0].buffer = buffer;   rdata[0].buffer_std = true; /* is that actually ok?
*/
   /* force full pages on, otherwise checksums won't work? */   if (XLogCheckBuffer(rdata, true, &lsn, &bkp))   {
charcopied_buffer[BLCKSZ];
 
       /*        * copy buffer so we don't have to worry about        * concurrent hint bit or lsn updates. We assume
pd_lower/upper       * cannot be changed without an exclusive lock, so the contents        * bkp are not racy.
*/      memcpy(copied_buffer, BufferGetBlock(buffer), BLCKSZ);
 
       rdata[0].data = bkp;       rdata[0].len = sizeof(BkbBlock);       rdata[0].buffer = InvalidBuffer;
rdata[0].buffer_std= false;       rdata[0].next = &(rdata[1]);
 
       rdata[1].data = copied_buffer;       rdata[1].len = BLCKSZ;       rdata[1].buffer = InvalidBuffer;
rdata[1].buffer_std= false;       rdata[1].next = NULL;
 
       recptr = XLogInsert(RM_XLOG_ID, XLOG_HINT, rdata);   }
   return recptr;
}

That should work?

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: corrupt pages detected by enabling checksums

From
Simon Riggs
Date:
On 4 April 2013 15:53, Andres Freund <andres@2ndquadrant.com> wrote:

> Unfortunately I find that approach unacceptably ugly.

Yeh. If we can confirm its a fix we can discuss a cleaner patch and
that is much better.

-- Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: corrupt pages detected by enabling checksums

From
Jeff Davis
Date:
Andres,

Thank you for diagnosing this problem!

On Thu, 2013-04-04 at 16:53 +0200, Andres Freund wrote:
> I think the route you quickly sketched is more realistic. That would
> remove all knowledge obout XLOG_HINT from generic code hich is a very
> good thing, I spent like 15minutes yesterday wondering whether the early
> return in there might be the cause of the bug...

I like this approach. It may have some performance impact though,
because there are a couple extra spinlocks taken, and an extra memcpy.
The code looks good to me except that we should be consistent about the
page hole -- XLogCheckBuffer is calculating it, but then we copy the
entire page. I don't think anything can change the size of the page hole
while we have a shared lock on the buffer, so it seems OK to skip the
page hole during the copy.

Another possible approach is to drop the lock on the buffer and
re-acquire it in exclusive mode after we find out we'll need to do
XLogInsert. It means that MarkBufferDirtyHint may end up blocking for
longer, but would avoid the memcpy. I haven't really thought through the
details.

Regards,Jeff Davis





Re: corrupt pages detected by enabling checksums

From
Andres Freund
Date:
On 2013-04-04 12:59:36 -0700, Jeff Davis wrote:
> Andres,
> 
> Thank you for diagnosing this problem!
> 
> On Thu, 2013-04-04 at 16:53 +0200, Andres Freund wrote:
> > I think the route you quickly sketched is more realistic. That would
> > remove all knowledge obout XLOG_HINT from generic code hich is a very
> > good thing, I spent like 15minutes yesterday wondering whether the early
> > return in there might be the cause of the bug...
> 
> I like this approach. It may have some performance impact though,
> because there are a couple extra spinlocks taken, and an extra memcpy.

I don't think its really slower. Earlier the code took WalInsertLock
everytime, even if we ended up not logging anything. Thats far more
epensive than a single spinlock. And the copy should also only be taken
in the case we need to log. So I think we end up ahead of the current
state.

> The code looks good to me except that we should be consistent about the
> page hole -- XLogCheckBuffer is calculating it, but then we copy the
> entire page. I don't think anything can change the size of the page hole
> while we have a shared lock on the buffer, so it seems OK to skip the
> page hole during the copy.

I don't think it can change either, but I doubt that there's a
performance advantage by not copying the hole. I'd guess the simpler
code ends up faster.

> Another possible approach is to drop the lock on the buffer and
> re-acquire it in exclusive mode after we find out we'll need to do
> XLogInsert. It means that MarkBufferDirtyHint may end up blocking for
> longer, but would avoid the memcpy. I haven't really thought through the
> details.

That sounds like it would be prone to deadlocks. So I would dislike to
go there.

Will write up a patch tomorrow.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: corrupt pages detected by enabling checksums

From
Jeff Janes
Date:
On Thu, Apr 4, 2013 at 5:30 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 4 April 2013 02:39, Andres Freund <andres@2ndquadrant.com> wrote:

> If by now the first backend has proceeded to PageSetLSN() we are writing
> different data to disk than the one we computed the checksum of
> before. Boom.

Right, so nothing else we were doing was wrong, that's why we couldn't
spot a bug. The problem is that we aren't replaying enough WAL because
the checksum on the WAL record is broke.

This brings up a pretty frightening possibility to me, unrelated to data checksums.  If a bit gets twiddled in the WAL file due to a hardware issue or a "cosmic ray", and then a crash happens, automatic recovery will stop early with the failed WAL checksum with an innocuous looking message.  The system will start up but will be invisibly inconsistent, and will proceed to overwrite that portion of the WAL file which contains the old data (real data that would have been necessary to reconstruct, once the corruption is finally realized ) with an end-of-recovery checkpoint record and continue to chew up real data from there.

I don't know a solution here, though, other than trusting your hardware.  Changing timelines upon ordinary crash recovery, not just media recovery, seems excessive but also seems to be exactly what timelines were invented for, right?

Back to the main topic here, Jeff Davis mentioned earlier "You'd still think this would cause incorrect results, but...".  I didn't realize the significance of that until now.  It does produce incorrect query results.  I was just bailing out before detecting them.  Once I specify ignore_checksum_failure=1 my test harness complains bitterly about the data not being consistent with what the Perl program knows it is supposed to be.  


I missed out on doing that with XLOG_HINT records, so the WAL CRC can
be incorrect because the data is scanned twice; normally that would be
OK because we have an exclusive lock on the block, but with hints we
only have share lock. So what we need to do is take a copy of the
buffer before we do XLogInsert().

Simple patch to do this attached for discussion. (Not tested).

We might also do this by modifying the WAL record to take the whole
block and bypass the BkpBlock mechanism entirely. But that's more work
and doesn't seem like it would be any cleaner. I figure lets solve the
problem first then discuss which approach is best.


I've tested your patch it and it seems to do the job.  It has survived far longer than unpatched ever did, with neither checksum warnings, nor complaints of inconsistent data.  (I haven't analyzed the code much, just the results, and leave the discussion of the best approach to others)


 Thanks,

Jeff

Re: corrupt pages detected by enabling checksums

From
Jeff Davis
Date:
On Thu, 2013-04-04 at 14:21 -0700, Jeff Janes wrote:

> This brings up a pretty frightening possibility to me, unrelated to
> data checksums.  If a bit gets twiddled in the WAL file due to a
> hardware issue or a "cosmic ray", and then a crash happens, automatic
> recovery will stop early with the failed WAL checksum with
> an innocuous looking message.  The system will start up but will be
> invisibly inconsistent, and will proceed to overwrite that portion of
> the WAL file which contains the old data (real data that would have
> been necessary to reconstruct, once the corruption is finally realized
> ) with an end-of-recovery checkpoint record and continue to chew up
> real data from there.

I've been worried about that for a while, and I may have even seen
something like this happen before. We could perhaps do some checks, but
in general it seems hard to solve without writing flushing some data to
two different places. For example, you could flush WAL, and then update
an LSN stored somewhere else indicating how far the WAL has been
written. Recovery could complain if it gets an error in the WAL before
that point.

But obviously, that makes WAL flushes expensive (in many cases, about
twice as expensive).

Maybe it's not out of the question to offer that as an option if nobody
has a better idea. Performance-conscious users could place the extra LSN
on an SSD or NVRAM or something; or maybe use commit_delay or async
commits. It would only need to store a few bytes.

Streaming replication mitigates the problem somewhat, by being a second
place to write WAL.

Regards,Jeff Davis





Re: corrupt pages detected by enabling checksums

From
Jeff Davis
Date:
On Thu, 2013-04-04 at 22:39 +0200, Andres Freund wrote:
> I don't think its really slower. Earlier the code took WalInsertLock
> everytime, even if we ended up not logging anything. Thats far more
> epensive than a single spinlock. And the copy should also only be taken
> in the case we need to log. So I think we end up ahead of the current
> state.

Good point.

> > The code looks good to me except that we should be consistent about the
> > page hole -- XLogCheckBuffer is calculating it, but then we copy the
> > entire page. I don't think anything can change the size of the page hole
> > while we have a shared lock on the buffer, so it seems OK to skip the
> > page hole during the copy.
> 
> I don't think it can change either, but I doubt that there's a
> performance advantage by not copying the hole. I'd guess the simpler
> code ends up faster.

I was thinking more about the WAL size, but I don't have a strong
opinion.

Regards,Jeff Davis




Re: corrupt pages detected by enabling checksums

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Thu, 2013-04-04 at 14:21 -0700, Jeff Janes wrote:
>> This brings up a pretty frightening possibility to me, unrelated to
>> data checksums.  If a bit gets twiddled in the WAL file due to a
>> hardware issue or a "cosmic ray", and then a crash happens, automatic
>> recovery will stop early with the failed WAL checksum with
>> an innocuous looking message.  The system will start up but will be
>> invisibly inconsistent, and will proceed to overwrite that portion of
>> the WAL file which contains the old data (real data that would have
>> been necessary to reconstruct, once the corruption is finally realized
>> ) with an end-of-recovery checkpoint record and continue to chew up
>> real data from there.

> I've been worried about that for a while, and I may have even seen
> something like this happen before. We could perhaps do some checks, but
> in general it seems hard to solve without writing flushing some data to
> two different places. For example, you could flush WAL, and then update
> an LSN stored somewhere else indicating how far the WAL has been
> written. Recovery could complain if it gets an error in the WAL before
> that point.

> But obviously, that makes WAL flushes expensive (in many cases, about
> twice as expensive).

> Maybe it's not out of the question to offer that as an option if nobody
> has a better idea. Performance-conscious users could place the extra LSN
> on an SSD or NVRAM or something; or maybe use commit_delay or async
> commits. It would only need to store a few bytes.

At least on traditional rotating media, this is only going to perform
well if you dedicate two drives to the purpose.  At which point you
might as well just say "let's write two copies of WAL".  Or maybe three
copies, so that you can take a vote when they disagree.  While this is
not so unreasonable on its face for ultra-high-reliability requirements,
I can't escape the feeling that we'd just be reinventing software RAID.
There's no reason to think that we can deal with this class of problems
better than the storage system can.

> Streaming replication mitigates the problem somewhat, by being a second
> place to write WAL.

Yeah, if you're going to do this at all it makes more sense for the
redundant copies to be on other machines.  So the questions that that
leads to are how smart is our SR code about dealing with a master that
tries to re-send WAL regions it already sent, and what a slave ought to
do in such a situation if the new data doesn't match.
        regards, tom lane



Re: corrupt pages detected by enabling checksums

From
Jeff Davis
Date:
On Thu, 2013-04-04 at 21:06 -0400, Tom Lane wrote:
> I can't escape the feeling that we'd just be reinventing software RAID.
> There's no reason to think that we can deal with this class of problems
> better than the storage system can.

The goal would be to reliably detect a situation where WAL that has been
flushed successfully was later corrupted; not necessarily to recover
when we find it. Unless it's something like ZFS, I don't think most
software RAID will reliably detect corruption.

A side benefit is that it would also help catch bugs like this in the
future.

I'm not advocating for this particular solution, but I do concur with
Jeff Janes that it's a little bit scary and that we can entertain some
ideas about how to mitigate it.

Regards,Jeff Davis





Re: corrupt pages detected by enabling checksums

From
Florian Pflug
Date:
On Apr4, 2013, at 23:21 , Jeff Janes <jeff.janes@gmail.com> wrote:
> This brings up a pretty frightening possibility to me, unrelated to data checksums.  If a bit gets twiddled in the
WALfile due to a hardware issue or a "cosmic ray", and then a crash happens, automatic recovery will stop early with
thefailed WAL checksum with an innocuous looking message.  The system will start up but will be invisibly inconsistent,
andwill proceed to overwrite that portion of the WAL file which contains the old data (real data that would have been
necessaryto reconstruct, once the corruption is finally realized ) with an end-of-recovery checkpoint record and
continueto chew up real data from there. 

Maybe we could scan forward to check whether a corrupted WAL record is followed by one or more valid ones with sensible
LSNs.If it is, chances are high that we haven't actually hit the end of the WAL. In that case, we could either log a
warning,or (better, probably) abort crash recovery. The user would then need to either restore the broken WAL segment
frombackup, or override the check by e.g. setting recovery_target_record="invalid_record". (The default would be
recovery_target_record="last_record".The name of the GUC tries to be consistent with existing recovery.conf settings,
eventhough it affects crash recovery, not archive recovery.) 

Corruption of fields which we require to scan past the record would cause false negatives, i.e. no trigger an error
eventhough we do abort recovery mid-way through. There's a risk of false positives too, but they require quite specific
orderingsof writes and thus seem rather unlikely. (AFAICS, the OS would have to write some parts of record N followed
bythe whole of record N+1 and then crash to cause a false positive). 

best regards,
Florian Pflug




Re: corrupt pages detected by enabling checksums

From
Andres Freund
Date:
On 2013-04-04 17:39:16 -0700, Jeff Davis wrote:
> On Thu, 2013-04-04 at 22:39 +0200, Andres Freund wrote:
> > I don't think its really slower. Earlier the code took WalInsertLock
> > everytime, even if we ended up not logging anything. Thats far more
> > epensive than a single spinlock. And the copy should also only be taken
> > in the case we need to log. So I think we end up ahead of the current
> > state.
>
> Good point.
>
> > > The code looks good to me except that we should be consistent about the
> > > page hole -- XLogCheckBuffer is calculating it, but then we copy the
> > > entire page. I don't think anything can change the size of the page hole
> > > while we have a shared lock on the buffer, so it seems OK to skip the
> > > page hole during the copy.
> >
> > I don't think it can change either, but I doubt that there's a
> > performance advantage by not copying the hole. I'd guess the simpler
> > code ends up faster.
>
> I was thinking more about the WAL size, but I don't have a strong
> opinion.

I was just a bit dense. No idea what I missed there.

How does the attached version look? I verified that it survives
recovery, but not more.

Jeff, any chance you can run this for a round with your suite?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: corrupt pages detected by enabling checksums

From
Jeff Davis
Date:
On Fri, 2013-04-05 at 15:09 +0200, Andres Freund wrote:
> How does the attached version look? I verified that it survives
> recovery, but not more.

Comments:

* Regarding full page writes, we can: - always write full pages (as in your current patch), regardless of
the current settings - take WALInsertLock briefly to get the current settings from XLogCtl - always calculate the full
pagerecord, but in XLogInsert, if it
 
happens to be a HINT record, and full page writes are not necessary,
then discard it (right now I somewhat favor this option but I haven't
looked at the details)

* typo in "Backup blocks are not used in **xlog xlog** records"

* To get the appropriate setting for buffer_std, we should pass it down
through MarkBufferDirty as an extra flag, I think.

* I'm a bit worried that we'll need a cleanup lock when restoring an FSM
page. What do you think?

* In xlog_redo, it seemed slightly awkward to call XLogRecGetData twice.
Merely a matter of preference but I thought I would mention it.

> Jeff, any chance you can run this for a round with your suite?

Yes. I don't have a rigorous test suite, but I'll do some manual tests
and walk through it with gdb.

Regards,Jeff Davis




Re: corrupt pages detected by enabling checksums

From
Jeff Davis
Date:
On Fri, 2013-04-05 at 10:34 +0200, Florian Pflug wrote:
> Maybe we could scan forward to check whether a corrupted WAL record is
> followed by one or more valid ones with sensible LSNs. If it is,
> chances are high that we haven't actually hit the end of the WAL. In
> that case, we could either log a warning, or (better, probably) abort
> crash recovery.

+1.

> Corruption of fields which we require to scan past the record would
> cause false negatives, i.e. no trigger an error even though we do
> abort recovery mid-way through. There's a risk of false positives too,
> but they require quite specific orderings of writes and thus seem
> rather unlikely. (AFAICS, the OS would have to write some parts of
> record N followed by the whole of record N+1 and then crash to cause a
> false positive).

Does the xlp_pageaddr help solve this?

Also, we'd need to be a little careful when written-but-not-flushed WAL
data makes it to disk, which could cause a false positive and may be a
fairly common case.

Regards,Jeff Davis





Re: corrupt pages detected by enabling checksums

From
Jaime Casanova
Date:
On Fri, Apr 5, 2013 at 8:09 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>
> How does the attached version look? I verified that it survives
> recovery, but not more.
>

I still got errors when executing make installcheck in a just compiled
9.3devel + this_patch, this is when setting wal_level higher than
minimal.
Attached regression.diffs

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566         Cell: +593 987171157

Attachment

Re: corrupt pages detected by enabling checksums

From
Jeff Davis
Date:
On Fri, 2013-04-05 at 19:22 -0500, Jaime Casanova wrote:
> On Fri, Apr 5, 2013 at 8:09 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >
> > How does the attached version look? I verified that it survives
> > recovery, but not more.
> >
> 
> I still got errors when executing make installcheck in a just compiled
> 9.3devel + this_patch, this is when setting wal_level higher than
> minimal.
> Attached regression.diffs

Did you also apply the other patch here:

http://www.postgresql.org/message-id/1364340203.21411.143.camel@sussancws0025

?

I think that is a completely separate issue. Simon, should that patch be
applied or are you waiting on the issue Jeff Janes raised to be
resolved, as well?

Regards,Jeff Davis





Re: corrupt pages detected by enabling checksums

From
Jaime Casanova
Date:
On Fri, Apr 5, 2013 at 7:39 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2013-04-05 at 19:22 -0500, Jaime Casanova wrote:
>> On Fri, Apr 5, 2013 at 8:09 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> >
>> > How does the attached version look? I verified that it survives
>> > recovery, but not more.
>> >
>>
>> I still got errors when executing make installcheck in a just compiled
>> 9.3devel + this_patch, this is when setting wal_level higher than
>> minimal.
>> Attached regression.diffs
>
> Did you also apply the other patch here:
>
> http://www.postgresql.org/message-id/1364340203.21411.143.camel@sussancws0025
>

i missed that one... with both applied, first problems disappear...
will try some more tests later

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566         Cell: +593 987171157



Re: corrupt pages detected by enabling checksums

From
Andres Freund
Date:
On 2013-04-05 16:29:47 -0700, Jeff Davis wrote:
> On Fri, 2013-04-05 at 15:09 +0200, Andres Freund wrote:
> > How does the attached version look? I verified that it survives
> > recovery, but not more.
> 
> Comments:
> 
> * Regarding full page writes, we can:
>   - always write full pages (as in your current patch), regardless of
> the current settings
>   - take WALInsertLock briefly to get the current settings from XLogCtl
>   - always calculate the full page record, but in XLogInsert, if it
> happens to be a HINT record, and full page writes are not necessary,
> then discard it (right now I somewhat favor this option but I haven't
> looked at the details)

I feel pretty strongly that we shouldn't add any such complications to
XLogInsert() itself, its complicated enough already and it should be
made simpler, not more complicated.

I think we can just make up the rule that changing full page writes also
requires SpinLockAcquire(&xlogctl->info_lck);. Then its easy enough. And
it can hardly be a performance bottleneck given how infrequently its
modified.

In retrospect I think making up the rule that full_page_writes changes
imply a checkpoint would have made things easier performance and
codewise.

> * typo in "Backup blocks are not used in **xlog xlog** records"

Thats just me being "funny" after some long days ;). See its an 'xlog'
'xlog record'.

> * To get the appropriate setting for buffer_std, we should pass it down
> through MarkBufferDirty as an extra flag, I think.

Or just declare as part of the api that only std buffers are allowed to
be passed down. After a quick look it looks like all callers use enough
of the standard page layout to make that viable. But that really was
just a quick look.

> * I'm a bit worried that we'll need a cleanup lock when restoring an FSM
> page. What do you think?

I don't yet see why, while recovery is ongoing there shouldn't be anyone
doing anything concurrently to it but startup itself?

> * In xlog_redo, it seemed slightly awkward to call XLogRecGetData twice.
> Merely a matter of preference but I thought I would mention it.

Youre absolutely right, memcpy should have gotten passed 'data', not
XLogRecGetData().

> > Jeff, any chance you can run this for a round with your suite?
> 
> Yes. I don't have a rigorous test suite, but I'll do some manual tests
> and walk through it with gdb.

Heh, in this and only this I was talking to Jeff Janes. Strangely I
never had noticed that you share the same name ;)

Thanks!

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: corrupt pages detected by enabling checksums

From
Jeff Janes
Date:
On Fri, Apr 5, 2013 at 6:09 AM, Andres Freund <andres@2ndquadrant.com> wrote:
 
How does the attached version look? I verified that it survives
recovery, but not more.

Jeff, any chance you can run this for a round with your suite?


I've run it for a while now and have found no problems.

Thanks,

Jeff

Re: corrupt pages detected by enabling checksums

From
Jaime Casanova
Date:
On Sat, Apr 6, 2013 at 1:36 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Fri, Apr 5, 2013 at 6:09 AM, Andres Freund <andres@2ndquadrant.com>
> wrote:
>
>>
>> How does the attached version look? I verified that it survives
>> recovery, but not more.
>>
>> Jeff, any chance you can run this for a round with your suite?
>
>
>
> I've run it for a while now and have found no problems.
>

fwiw, i have run installcheck (serial and parallel) and isolationtest,
also combination of those (one installcheck, one isolationtest) at the
same time while executing vacuum full, reindex database and manual
checkpoint...

i also check that the bgwriter was doing some work.

i did all of this in a master node in a cluster with a standby and a
cascade standby that were later promoted... and i have no problem with
checksums at all, so i would say that the combination of Jeff's and
Andres' patches fixed the problems we have seen until now

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566         Cell: +593 987171157



Re: corrupt pages detected by enabling checksums

From
Simon Riggs
Date:
On 6 April 2013 15:44, Andres Freund <andres@2ndquadrant.com> wrote:
 
> * In xlog_redo, it seemed slightly awkward to call XLogRecGetData twice.
> Merely a matter of preference but I thought I would mention it.

Youre absolutely right, memcpy should have gotten passed 'data', not
XLogRecGetData().

Applied, with this as the only code change.

Thanks everybody for good research and coding and fast testing.

We're in good shape now.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: corrupt pages detected by enabling checksums

From
Jeff Davis
Date:
On Mon, 2013-04-08 at 09:19 +0100, Simon Riggs wrote:

> Applied, with this as the only code change.
>
>
> Thanks everybody for good research and coding and fast testing.
>
>
> We're in good shape now.

Thank you.

I have attached two more patches:

1. I believe that the issue I brought up at the end of this email:

http://www.postgresql.org/message-id/1365035537.7580.380.camel@sussancws0025

is a real issue. In lazy_vacuum_page(), the following sequence can
happen when checksums are on:

   a. PageSetAllVisible
   b. Pass heap page to visibilitymap_set
   c. visibilitymap_set logs the heap page and sets the LSN
   d. MarkBufferDirty

If a checkpoint happens between (c) and (d), then we have a problem. The
fix is easy: just mark the heap buffer dirty first. There's another call
site that looks like a potential problem, but I don't think it is. I
simplified the code there to make it (hopefully) more clearly correct.

2. A cleanup patch to pass the buffer_std flag down through
MarkBufferDirtyHint. This is a matter of preference and purely cosmetic,
so it might not be wanted. The reason I thought it was useful is that a
future caller that sets a hint on a non-standard buffer might easily
miss the assumption that we have a standard buffer.

Regards,
    Jeff Davis

Attachment

Re: corrupt pages detected by enabling checksums

From
Jeff Davis
Date:
On Sat, 2013-04-06 at 16:44 +0200, Andres Freund wrote:
> I think we can just make up the rule that changing full page writes also
> requires SpinLockAcquire(&xlogctl->info_lck);. Then its easy enough. And
> it can hardly be a performance bottleneck given how infrequently its
> modified.

That seems like a good idea to me. As it stands, checksums basically
force full page writes to be on; so we should either fix that or
document it.

> In retrospect I think making up the rule that full_page_writes changes
> imply a checkpoint would have made things easier performance and
> codewise.

I don't even see why we allow changing that while the server is on.
Either the I/O system requires it for safety or not, right?

Regards,Jeff Davis





Re: corrupt pages detected by enabling checksums

From
Robert Haas
Date:
On Sat, Apr 6, 2013 at 10:44 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I feel pretty strongly that we shouldn't add any such complications to
> XLogInsert() itself, its complicated enough already and it should be
> made simpler, not more complicated.

+1, emphatically.  XLogInsert is a really nasty scalability
bottleneck.  We need to move as much logic out of that function as
possible, and particularly out from under WALInsertLock.

...Robert



Re: corrupt pages detected by enabling checksums

From
Simon Riggs
Date:
On 11 April 2013 00:37, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Apr 6, 2013 at 10:44 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I feel pretty strongly that we shouldn't add any such complications to
> XLogInsert() itself, its complicated enough already and it should be
> made simpler, not more complicated.

+1, emphatically.  XLogInsert is a really nasty scalability
bottleneck.  We need to move as much logic out of that function as
possible, and particularly out from under WALInsertLock.

Andres' patch was applied, so not sure what you mean by +1ing a comment made in relation to that patch.

I'm aware that WALInsertLock is a bottleneck and am not going to be making that worse.
 
--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: corrupt pages detected by enabling checksums

From
Simon Riggs
Date:
On 9 April 2013 08:36, Jeff Davis <pgsql@j-davis.com> wrote:

> 1. I believe that the issue I brought up at the end of this email:
>
> http://www.postgresql.org/message-id/1365035537.7580.380.camel@sussancws0025
>
> is a real issue. In lazy_vacuum_page(), the following sequence can
> happen when checksums are on:
>
>    a. PageSetAllVisible
>    b. Pass heap page to visibilitymap_set
>    c. visibilitymap_set logs the heap page and sets the LSN
>    d. MarkBufferDirty
>
> If a checkpoint happens between (c) and (d), then we have a problem. The
> fix is easy: just mark the heap buffer dirty first. There's another call
> site that looks like a potential problem, but I don't think it is. I
> simplified the code there to make it (hopefully) more clearly correct.


Applied

> 2. A cleanup patch to pass the buffer_std flag down through
> MarkBufferDirtyHint. This is a matter of preference and purely cosmetic,
> so it might not be wanted. The reason I thought it was useful is that a
> future caller that sets a hint on a non-standard buffer might easily
> miss the assumption that we have a standard buffer.

Skipped that for now. Do we really need it? Can you set a hint on a
non-standard buffer?

--Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: corrupt pages detected by enabling checksums

From
Robert Haas
Date:
On Tue, Apr 30, 2013 at 6:58 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 9 April 2013 08:36, Jeff Davis <pgsql@j-davis.com> wrote:
>
>> 1. I believe that the issue I brought up at the end of this email:
>>
>> http://www.postgresql.org/message-id/1365035537.7580.380.camel@sussancws0025
>>
>> is a real issue. In lazy_vacuum_page(), the following sequence can
>> happen when checksums are on:
>>
>>    a. PageSetAllVisible
>>    b. Pass heap page to visibilitymap_set
>>    c. visibilitymap_set logs the heap page and sets the LSN
>>    d. MarkBufferDirty
>>
>> If a checkpoint happens between (c) and (d), then we have a problem. The
>> fix is easy: just mark the heap buffer dirty first. There's another call
>> site that looks like a potential problem, but I don't think it is. I
>> simplified the code there to make it (hopefully) more clearly correct.
>
> Applied

Uh, wait a minute.  I think this is completely wrong.  The buffer is
LOCKED for this entire sequence of operations.  For a checkpoint to
"happen", it's got to write every buffer, which it will not be able to
do for so long as the buffer is locked.

The effect of the change to lazy_scan_heap is to force the buffer to
be written even if we're only updating the visibility map page.
That's a bad idea and should be reverted.  The change to
lazy_vacuum_page is harmless, but the comment is wrong.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: corrupt pages detected by enabling checksums

From
Simon Riggs
Date:
On 30 April 2013 13:34, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Apr 30, 2013 at 6:58 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 9 April 2013 08:36, Jeff Davis <pgsql@j-davis.com> wrote:
>>
>>> 1. I believe that the issue I brought up at the end of this email:
>>>
>>> http://www.postgresql.org/message-id/1365035537.7580.380.camel@sussancws0025
>>>
>>> is a real issue. In lazy_vacuum_page(), the following sequence can
>>> happen when checksums are on:
>>>
>>>    a. PageSetAllVisible
>>>    b. Pass heap page to visibilitymap_set
>>>    c. visibilitymap_set logs the heap page and sets the LSN
>>>    d. MarkBufferDirty
>>>
>>> If a checkpoint happens between (c) and (d), then we have a problem. The
>>> fix is easy: just mark the heap buffer dirty first. There's another call
>>> site that looks like a potential problem, but I don't think it is. I
>>> simplified the code there to make it (hopefully) more clearly correct.
>>
>> Applied
>
> Uh, wait a minute.  I think this is completely wrong.

Thanks for your review.

> The buffer is
> LOCKED for this entire sequence of operations.  For a checkpoint to
> "happen", it's got to write every buffer, which it will not be able to
> do for so long as the buffer is locked.

I was following the logic we use for WAL: mark the buffer dirty, then write WAL.

The important thing is for us to signal that the buffer should be
written, which we do by marking it dirty. The actual writing of the
buffer comes later, often minutes later.

> The effect of the change to lazy_scan_heap is to force the buffer to
> be written even if we're only updating the visibility map page.
> That's a bad idea and should be reverted.

OK, agreed. Will wait for this discussion to end before acting though,
so I'm sure exactly what you mean.

> The change to
> lazy_vacuum_page is harmless, but the comment is wrong.

In what way, wrong? What should it say?

--Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: corrupt pages detected by enabling checksums

From
Jeff Davis
Date:
On Tue, 2013-04-30 at 08:34 -0400, Robert Haas wrote:
> Uh, wait a minute.  I think this is completely wrong.  The buffer is
> LOCKED for this entire sequence of operations.  For a checkpoint to
> "happen", it's got to write every buffer, which it will not be able to
> do for so long as the buffer is locked.

I went back and forth on this, so you could be right, but here was my
reasoning:

I was worried because SyncOneBuffer checks whether it needs writing
without taking a content lock, so the exclusive lock doesn't help. That
makes sense, because you don't want a checkpoint to have to get a
content lock on every buffer in the buffer pool. But it also means we
need to follow the rules laid out in transam/README and dirty the pages
before writing WAL.

> The effect of the change to lazy_scan_heap is to force the buffer to
> be written even if we're only updating the visibility map page.
> That's a bad idea and should be reverted.

The only time the VM and the data page are out of sync during vacuum is
after a crash, right? If that's the case, I didn't think it was a big
deal to dirty one extra page (should be extremely rare). Am I missing
something?

The reason I removed that special case was just code
complexity/readability. I tried preserving the previous behavior, and
it's not so bad, but it seemed unnecessarily ugly for the benefit of a
rare case.

Regards,Jeff Davis





Re: corrupt pages detected by enabling checksums

From
Simon Riggs
Date:
On 30 April 2013 22:54, Jeff Davis <pgsql@j-davis.com> wrote:
> On Tue, 2013-04-30 at 08:34 -0400, Robert Haas wrote:
>> Uh, wait a minute.  I think this is completely wrong.  The buffer is
>> LOCKED for this entire sequence of operations.  For a checkpoint to
>> "happen", it's got to write every buffer, which it will not be able to
>> do for so long as the buffer is locked.
>
> I went back and forth on this, so you could be right, but here was my
> reasoning:
>
> I was worried because SyncOneBuffer checks whether it needs writing
> without taking a content lock, so the exclusive lock doesn't help. That
> makes sense, because you don't want a checkpoint to have to get a
> content lock on every buffer in the buffer pool. But it also means we
> need to follow the rules laid out in transam/README and dirty the pages
> before writing WAL.
>
>> The effect of the change to lazy_scan_heap is to force the buffer to
>> be written even if we're only updating the visibility map page.
>> That's a bad idea and should be reverted.
>
> The only time the VM and the data page are out of sync during vacuum is
> after a crash, right? If that's the case, I didn't think it was a big
> deal to dirty one extra page (should be extremely rare). Am I missing
> something?
>
> The reason I removed that special case was just code
> complexity/readability. I tried preserving the previous behavior, and
> it's not so bad, but it seemed unnecessarily ugly for the benefit of a
> rare case.

All of that makes perfect sense to me.

Waiting to hear back from Robert whether he still has an objection.

--Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: corrupt pages detected by enabling checksums

From
Robert Haas
Date:
On Tue, Apr 30, 2013 at 5:54 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Tue, 2013-04-30 at 08:34 -0400, Robert Haas wrote:
>> Uh, wait a minute.  I think this is completely wrong.  The buffer is
>> LOCKED for this entire sequence of operations.  For a checkpoint to
>> "happen", it's got to write every buffer, which it will not be able to
>> do for so long as the buffer is locked.
>
> I went back and forth on this, so you could be right, but here was my
> reasoning:
>
> I was worried because SyncOneBuffer checks whether it needs writing
> without taking a content lock, so the exclusive lock doesn't help. That
> makes sense, because you don't want a checkpoint to have to get a
> content lock on every buffer in the buffer pool. But it also means we
> need to follow the rules laid out in transam/README and dirty the pages
> before writing WAL.

On further review, I think you're right about this.  We'd have a
problem if the following sequence of events were to occur:

1. vacuum writes the WAL record, but does not yet mark the buffer
dirty, and then goes into the tank
2. checkpointer decides where the checkpoint will begin
3. buffer pool is scanned as part of the checkpoint process, observing
target buffer not to be dirty
4. vacuum finally wakes up and marks the buffer dirty
5. crash, after WAL is flushed but before buffer is written

However, on looking at this a little more, shouldn't we be doing
log_heap_clean() *before* we do visibilitymap_set()?

>> The effect of the change to lazy_scan_heap is to force the buffer to
>> be written even if we're only updating the visibility map page.
>> That's a bad idea and should be reverted.
>
> The only time the VM and the data page are out of sync during vacuum is
> after a crash, right? If that's the case, I didn't think it was a big
> deal to dirty one extra page (should be extremely rare). Am I missing
> something?
>
> The reason I removed that special case was just code
> complexity/readability. I tried preserving the previous behavior, and
> it's not so bad, but it seemed unnecessarily ugly for the benefit of a
> rare case.

Well, I don't find it that ugly, but then again

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: corrupt pages detected by enabling checksums

From
Robert Haas
Date:
On Wed, May 1, 2013 at 11:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I was worried because SyncOneBuffer checks whether it needs writing
>> without taking a content lock, so the exclusive lock doesn't help. That
>> makes sense, because you don't want a checkpoint to have to get a
>> content lock on every buffer in the buffer pool. But it also means we
>> need to follow the rules laid out in transam/README and dirty the pages
>> before writing WAL.
>
> On further review, I think you're right about this.  We'd have a
> problem if the following sequence of events were to occur:
>
> 1. vacuum writes the WAL record, but does not yet mark the buffer
> dirty, and then goes into the tank
> 2. checkpointer decides where the checkpoint will begin
> 3. buffer pool is scanned as part of the checkpoint process, observing
> target buffer not to be dirty
> 4. vacuum finally wakes up and marks the buffer dirty
> 5. crash, after WAL is flushed but before buffer is written
>
> However, on looking at this a little more, shouldn't we be doing
> log_heap_clean() *before* we do visibilitymap_set()?

Hit send too soon.

To finish that thought: right now the order of operations is...

1. Perform the cleanup operations on the buffer.
2. Set the visibility map bit.
3. Log the visibility map change.
4. Log the cleanup operations.

It seems to me that it would be better to do (4) as close to (1) as
possible.  Isn't it bad if the operations are replayed in an order
that differs from the order in which they were performed - or if, say,
the first WAL record were replayed but the second were not, for
whatever reason?

>>> The effect of the change to lazy_scan_heap is to force the buffer to
>>> be written even if we're only updating the visibility map page.
>>> That's a bad idea and should be reverted.
>>
>> The only time the VM and the data page are out of sync during vacuum is
>> after a crash, right? If that's the case, I didn't think it was a big
>> deal to dirty one extra page (should be extremely rare). Am I missing
>> something?
>>
>> The reason I removed that special case was just code
>> complexity/readability. I tried preserving the previous behavior, and
>> it's not so bad, but it seemed unnecessarily ugly for the benefit of a
>> rare case.
>
> Well, I don't find it that ugly, but then again

...I've done a fair amount of hacking on this code.  The scenario that
I find problematic here is that you could easily lose a couple of
visibility map pages in a crash.  Each one you lose, with this patch,
potentially involves half a gigabyte of unnecessary disk writes, if
the relation is being vacuumed at the time.  For the few extra lines
of code it takes to protect against that scenario, I think we should
protect against it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: corrupt pages detected by enabling checksums

From
Simon Riggs
Date:
On 1 May 2013 16:33, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 1, 2013 at 11:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I was worried because SyncOneBuffer checks whether it needs writing
>>> without taking a content lock, so the exclusive lock doesn't help. That
>>> makes sense, because you don't want a checkpoint to have to get a
>>> content lock on every buffer in the buffer pool. But it also means we
>>> need to follow the rules laid out in transam/README and dirty the pages
>>> before writing WAL.
>>
>> On further review, I think you're right about this.  We'd have a
>> problem if the following sequence of events were to occur:
>>
>> 1. vacuum writes the WAL record, but does not yet mark the buffer
>> dirty, and then goes into the tank
>> 2. checkpointer decides where the checkpoint will begin
>> 3. buffer pool is scanned as part of the checkpoint process, observing
>> target buffer not to be dirty
>> 4. vacuum finally wakes up and marks the buffer dirty
>> 5. crash, after WAL is flushed but before buffer is written
>>
>> However, on looking at this a little more, shouldn't we be doing
>> log_heap_clean() *before* we do visibilitymap_set()?
>
> Hit send too soon.
>
> To finish that thought: right now the order of operations is...
>
> 1. Perform the cleanup operations on the buffer.
> 2. Set the visibility map bit.
> 3. Log the visibility map change.
> 4. Log the cleanup operations.
>
> It seems to me that it would be better to do (4) as close to (1) as
> possible.  Isn't it bad if the operations are replayed in an order
> that differs from the order in which they were performed - or if, say,
> the first WAL record were replayed but the second were not, for
> whatever reason?

I agree, but that was in the original coding wasn't it?

Why aren't we writing just one WAL record for this action? We use a
single WAL record in other places where we make changes to multiple
blocks with multiple full page writes, e.g. index block split. That
would make the action atomic and we'd just have this...

1. Perform the cleanup operations on the buffer.
2. Set the visibility map bit.
3. Log the cleanup operations and visibility map change.

which can then be replayed with correct sequence, locking etc.
and AFAICS would likely be faster also.

--Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: corrupt pages detected by enabling checksums

From
Robert Haas
Date:
On Wed, May 1, 2013 at 1:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I agree, but that was in the original coding wasn't it?

I believe the problem was introduced by this commit:

commit fdf9e21196a6f58c6021c967dc5776a16190f295
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   Wed Feb 13 17:46:23 2013 +0200
   Update visibility map in the second phase of vacuum.
   There's a high chance that a page becomes all-visible when the second phase   of vacuum removes all the dead tuples
onit, so it makes sense to check for   that. Otherwise the visibility map won't get updated until the next vacuum.
 
   Pavan Deolasee, reviewed by Jeff Janes.

> Why aren't we writing just one WAL record for this action? We use a
> single WAL record in other places where we make changes to multiple
> blocks with multiple full page writes, e.g. index block split. That
> would make the action atomic and we'd just have this...
>
> 1. Perform the cleanup operations on the buffer.
> 2. Set the visibility map bit.
> 3. Log the cleanup operations and visibility map change.
>
> which can then be replayed with correct sequence, locking etc.
> and AFAICS would likely be faster also.

I thought about that, too.  It certainly seems like more than we want
to try to do for 9.3 at this point.  The other complication is that
there's a lot of conditional logic here.  We're definitely going to
emit a cleanup record.  We're going to emit a record to make the page
all-visible only sometimes, because it might not be all-visible yet:
it could have tuples on it that are deleted but not yet dead.  And
then there's additional logic to handle the checksum case.  Plus, the
all-visible marking can happen in other code paths, too, specifically
in phase 1 of vacuum.  So it might be possible to consolidate this,
but off-hand it looks messy to me out of proportion to the benefits.

Now that I'm looking at this, I'm a bit confused by the new logic in
visibilitymap_set().  When checksums are enabled, we set the page LSN,
which is described like this: "we need to protect the heap page from
being torn".  But how does setting the page LSN do that?  Don't we
need to mark the buffer dirty or something like that?  Sorry if this
is a dumb question.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: corrupt pages detected by enabling checksums

From
Jeff Davis
Date:
On Wed, 2013-05-01 at 11:33 -0400, Robert Haas wrote:
> >> The only time the VM and the data page are out of sync during vacuum is
> >> after a crash, right? If that's the case, I didn't think it was a big
> >> deal to dirty one extra page (should be extremely rare). Am I missing
> >> something?
> >>
> >> The reason I removed that special case was just code
> >> complexity/readability. I tried preserving the previous behavior, and
> >> it's not so bad, but it seemed unnecessarily ugly for the benefit of a
> >> rare case.
> >
> > Well, I don't find it that ugly, but then again
>
> ...I've done a fair amount of hacking on this code.  The scenario that
> I find problematic here is that you could easily lose a couple of
> visibility map pages in a crash.  Each one you lose, with this patch,
> potentially involves half a gigabyte of unnecessary disk writes, if
> the relation is being vacuumed at the time.  For the few extra lines
> of code it takes to protect against that scenario, I think we should
> protect against it.

I'm still unclear on how we'd lose so many VM bits at once, because they
are logged. I see how we can lose one if the heap page makes it to disk
before the VM bit's WAL is flushed, but that seems to be bounded to me.

Regardless, you have a reasonable claim that my patch had effects that
were not necessary. I have attached a draft patch to remedy that. Only
rudimentary testing was done.

Regards,
    Jeff Davis


Attachment

Re: corrupt pages detected by enabling checksums

From
Simon Riggs
Date:
On 1 May 2013 19:16, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 1, 2013 at 1:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> I agree, but that was in the original coding wasn't it?
>
> I believe the problem was introduced by this commit:
>
> commit fdf9e21196a6f58c6021c967dc5776a16190f295
> Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date:   Wed Feb 13 17:46:23 2013 +0200
>
>     Update visibility map in the second phase of vacuum.
>
>     There's a high chance that a page becomes all-visible when the second phase
>     of vacuum removes all the dead tuples on it, so it makes sense to check for
>     that. Otherwise the visibility map won't get updated until the next vacuum.
>
>     Pavan Deolasee, reviewed by Jeff Janes.
>
>> Why aren't we writing just one WAL record for this action? We use a
>> single WAL record in other places where we make changes to multiple
>> blocks with multiple full page writes, e.g. index block split. That
>> would make the action atomic and we'd just have this...
>>
>> 1. Perform the cleanup operations on the buffer.
>> 2. Set the visibility map bit.
>> 3. Log the cleanup operations and visibility map change.
>>
>> which can then be replayed with correct sequence, locking etc.
>> and AFAICS would likely be faster also.
>
> I thought about that, too.  It certainly seems like more than we want
> to try to do for 9.3 at this point.  The other complication is that
> there's a lot of conditional logic here.  We're definitely going to
> emit a cleanup record.  We're going to emit a record to make the page
> all-visible only sometimes, because it might not be all-visible yet:
> it could have tuples on it that are deleted but not yet dead.  And
> then there's additional logic to handle the checksum case.  Plus, the
> all-visible marking can happen in other code paths, too, specifically
> in phase 1 of vacuum.  So it might be possible to consolidate this,
> but off-hand it looks messy to me out of proportion to the benefits.

Looks easy. There is no additional logic for checksums, so there's no
third complexity.

So we either have
* cleanup info with vismap setting info
* cleanup info only

which is the same number of WAL records as we have now, just that we
never emit 2 records when one will do.

> Now that I'm looking at this, I'm a bit confused by the new logic in
> visibilitymap_set().  When checksums are enabled, we set the page LSN,
> which is described like this: "we need to protect the heap page from
> being torn".  But how does setting the page LSN do that?

It doesn't

> Don't we
> need to mark the buffer dirty or something like that?

We do.

--Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: corrupt pages detected by enabling checksums

From
Jeff Davis
Date:
On Wed, 2013-05-01 at 14:16 -0400, Robert Haas wrote:
> Now that I'm looking at this, I'm a bit confused by the new logic in
> visibilitymap_set().  When checksums are enabled, we set the page LSN,
> which is described like this: "we need to protect the heap page from
> being torn".  But how does setting the page LSN do that?  Don't we
> need to mark the buffer dirty or something like that?  Sorry if this
> is a dumb question.

The caller is supposed to dirty the heap page when setting
PD_ALL_VISIBLE. I thought I said that explicitly in the comments
somewhere, but now I don't see it.

It is slightly awkward to put the comment about the page being torn just
before setting the LSN. Feel free to move/remove/reword if you can think
of something better.

Because setting PD_ALL_VISIBLE does not write WAL by itself, my design
was to have it piggyback on the VM WAL record if checksums are turned
on. It would be nice to just use MarkBufferDirtyHint, but that does not
guarantee that the page will actually be marked dirty (see comments);
and we really need it to be properly marked dirty (otherwise we risk the
VM bit making it and the heap page not).

I also explored the idea of passing more options to MarkBufferDirty so
that we could force it to do what we want in each case, but decided
against it:

http://www.postgresql.org/message-id/1357798000.5162.38.camel@jdavis-laptop

Regards,Jeff Davis





Re: corrupt pages detected by enabling checksums

From
Jeff Davis
Date:
On Wed, 2013-05-01 at 20:06 +0100, Simon Riggs wrote:
> >> Why aren't we writing just one WAL record for this action?

...

> >
> > I thought about that, too.  It certainly seems like more than we want
> > to try to do for 9.3 at this point.  The other complication is that
> > there's a lot of conditional logic here.

...

> Looks easy. There is no additional logic for checksums, so there's no
> third complexity.
> 
> So we either have
> * cleanup info with vismap setting info
> * cleanup info only
> 
> which is the same number of WAL records as we have now, just that we
> never emit 2 records when one will do.

What if we only set PD_ALL_VISIBLE and the VM bit, and make no other
changes? Right now, that generates no WAL for the heap page (unless
checksums are enabled, of course), which means no full page images for
the heap pages.

I think we have to special-case that somehow, and I believe that's the
conditional logic that Robert is referring to.

That being said, there may be a simpler way to achieve the same result,
and that may be what you have in mind.

Regards,Jeff Davis





Re: corrupt pages detected by enabling checksums

From
Simon Riggs
Date:
On 1 May 2013 20:40, Jeff Davis <pgsql@j-davis.com> wrote:

>> Looks easy. There is no additional logic for checksums, so there's no
>> third complexity.
>>
>> So we either have
>> * cleanup info with vismap setting info
>> * cleanup info only
>>
>> which is the same number of WAL records as we have now, just that we
>> never emit 2 records when one will do.
>
> What if we only set PD_ALL_VISIBLE and the VM bit, and make no other
> changes? Right now, that generates no WAL for the heap page (unless
> checksums are enabled, of course), which means no full page images for
> the heap pages.

The only place I see that code path is when we clear up a heap page
that is empty.

Is that what you meant?

Empty pages are pretty rare, so I guess I meant that we should just
get rid of the vismap update when we see that. Since we're adding the
block straught back into the fsm, it won't stay all visible for very
long.



Bottom line is: is there a problem there, or not?
If there's not, I'm happy. If there is, then do we have another plan
than making this into one WAL record, so it is atomic?
And presumably y'all want me to fix it.

--Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: corrupt pages detected by enabling checksums

From
Jeff Davis
Date:
On Fri, 2013-05-03 at 19:52 +0100, Simon Riggs wrote:
> On 1 May 2013 20:40, Jeff Davis <pgsql@j-davis.com> wrote:
> 
> >> Looks easy. There is no additional logic for checksums, so there's no
> >> third complexity.
> >>
> >> So we either have
> >> * cleanup info with vismap setting info
> >> * cleanup info only
> >>
> >> which is the same number of WAL records as we have now, just that we
> >> never emit 2 records when one will do.
> >
> > What if we only set PD_ALL_VISIBLE and the VM bit, and make no other
> > changes? Right now, that generates no WAL for the heap page (unless
> > checksums are enabled, of course), which means no full page images for
> > the heap pages.
> 
> The only place I see that code path is when we clear up a heap page
> that is empty.

Or if there are no other changes to make to the heap page. For example,
if you do a fresh data load, then set the hint bits, and then do a
VACCUM. The only changes needed during VACUUM are setting PD_ALL_VISIBLE
and the VM bit.

Right now, that does not cause a wal record to be written for the heap
page, so there are no full-page images for the heap page.

At this point, I don't think more changes are required. Robert seemed
concerned about dirtying extra pages after a crash, but I didn't
entirely understand his reasoning. I am inclined toward simplifying this
part of the code as you suggest, but maybe that's better left for 9.4
(which would give me a chance to reignite the discussion about whether
PD_ALL_VISIBLE is needed at all) unless there is an actual problem.

Regards,Jeff Davis





Re: corrupt pages detected by enabling checksums

From
Simon Riggs
Date:
On 3 May 2013 21:53, Jeff Davis <pgsql@j-davis.com> wrote:

> At this point, I don't think more changes are required.

After detailed further analysis, I agree, no further changes are required.

I think the code in that area needs considerable refactoring to
improve things. I've looked for an easy way to avoid calling
PageSetLSN() twice, but I don't see one, which is the thing I thought
was a problem. I don't much like the nested critical sections either.
But overall, we do follow the requirements for WAL laid out in the
README, in that we dirty the buffer first, insert WAL, then set LSN,
all within a critical section and with buffer locking. So I don't see
any place where this will break with the current coding.

In the future I would hope to see that code simplified so that we do
just one WAL record per block, rather than the 3 (or more?) records
that can get written now: freeze, clean and visibility.

--Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: corrupt pages detected by enabling checksums

From
Robert Haas
Date:
On Wed, May 1, 2013 at 3:04 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> Regardless, you have a reasonable claim that my patch had effects that
> were not necessary. I have attached a draft patch to remedy that. Only
> rudimentary testing was done.

This looks reasonable to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: corrupt pages detected by enabling checksums

From
Jeff Davis
Date:
On Mon, 2013-05-06 at 15:31 -0400, Robert Haas wrote:
> On Wed, May 1, 2013 at 3:04 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > Regardless, you have a reasonable claim that my patch had effects that
> > were not necessary. I have attached a draft patch to remedy that. Only
> > rudimentary testing was done.
> 
> This looks reasonable to me.

Can you please explain the scenario that loses many VM bits at once
during a crash, and results in a bunch of already-all-visible heap pages
being dirtied for no reason?

Regards,Jeff Davis





Re: corrupt pages detected by enabling checksums

From
Robert Haas
Date:
On Mon, May 6, 2013 at 5:04 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Mon, 2013-05-06 at 15:31 -0400, Robert Haas wrote:
>> On Wed, May 1, 2013 at 3:04 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>> > Regardless, you have a reasonable claim that my patch had effects that
>> > were not necessary. I have attached a draft patch to remedy that. Only
>> > rudimentary testing was done.
>>
>> This looks reasonable to me.
>
> Can you please explain the scenario that loses many VM bits at once
> during a crash, and results in a bunch of already-all-visible heap pages
> being dirtied for no reason?

Hmm.  Rereading your last email, I see your point: since we now have
HEAP_XLOG_VISIBLE, this is much less of an issue than it would have
been before.  I'm still not convinced that simplifying that code is a
good idea, but maybe it doesn't really hurt us much in practice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: corrupt pages detected by enabling checksums

From
Jeff Davis
Date:
On Tue, 2013-05-07 at 13:20 -0400, Robert Haas wrote:
> Hmm.  Rereading your last email, I see your point: since we now have
> HEAP_XLOG_VISIBLE, this is much less of an issue than it would have
> been before.  I'm still not convinced that simplifying that code is a
> good idea, but maybe it doesn't really hurt us much in practice.

Given that there's not a big impact one way or another, I don't mind
whether this particular patch is committed or not. Whichever you think
is more understandable and safer at this late hour. Also, to be clear,
the fact that I posted a patch was not meant to advocate either way;
merely to present the options.

Not sure exactly what you mean about the code simplification, but I
agree that anything more substantial than this patch should be left for
9.4.

Regards,Jeff Davis





Re: corrupt pages detected by enabling checksums

From
Jim Nasby
Date:
On 4/5/13 6:39 PM, Jeff Davis wrote:
> On Fri, 2013-04-05 at 10:34 +0200, Florian Pflug wrote:
>> Maybe we could scan forward to check whether a corrupted WAL record is
>> followed by one or more valid ones with sensible LSNs. If it is,
>> chances are high that we haven't actually hit the end of the WAL. In
>> that case, we could either log a warning, or (better, probably) abort
>> crash recovery.
>
> +1.
>
>> Corruption of fields which we require to scan past the record would
>> cause false negatives, i.e. no trigger an error even though we do
>> abort recovery mid-way through. There's a risk of false positives too,
>> but they require quite specific orderings of writes and thus seem
>> rather unlikely. (AFAICS, the OS would have to write some parts of
>> record N followed by the whole of record N+1 and then crash to cause a
>> false positive).
>
> Does the xlp_pageaddr help solve this?
>
> Also, we'd need to be a little careful when written-but-not-flushed WAL
> data makes it to disk, which could cause a false positive and may be a
> fairly common case.

Apologies if this is a stupid question, but is this mostly an issue due to torn pages? IOW, if we had a way to ensure
wenever see torn pages, would that mean an invalid CRC on a WAL page indicated there really was corruption on that
page?

Maybe it's worth putting (yet more) thought into the torn page issue... :/
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: corrupt pages detected by enabling checksums

From
Jeff Davis
Date:
On Wed, 2013-05-08 at 17:56 -0500, Jim Nasby wrote:
> Apologies if this is a stupid question, but is this mostly an issue
> due to torn pages? IOW, if we had a way to ensure we never see torn
> pages, would that mean an invalid CRC on a WAL page indicated there
> really was corruption on that page?
> 
> Maybe it's worth putting (yet more) thought into the torn page
> issue... :/

Sort of. For data, a page is the logically-atomic unit that is expected
to be intact. For WAL, a record is the logically-atomic unit that is
expected to be intact.

So it might be better to say that the issue for the WAL is "torn
records". A record might be larger than a page (it can hold up to three
full-page images in one record), but is often much smaller.

We use a CRC to validate that the WAL record is fully intact. The
concern is that, if it fails the CRC check, we *assume* that it's
because it wasn't completely flushed yet (i.e. a "torn record"). Based
on that assumption, neither that record nor any later record contains
committed transactions, so we can safely consider that the end of the
WAL (as of the crash) and bring the system up.

The problem is that the assumption is not always true: a CRC failure
could also indicate real corruption of WAL records that have been
previously flushed successfully, and may contain committed transactions.
That can mean we bring the system up way too early, corrupting the
database.

Unfortunately, it seems that doing any kind of validation to determine
that we have a valid end-of-the-WAL inherently requires some kind of
separate durable write somewhere. It would be a tiny amount of data (an
LSN and maybe some extra crosscheck information), so I could imagine
that would be just fine given the right hardware; but if we just write
to disk that would be pretty bad. Ideas welcome.

Regards,Jeff Davis





Re: corrupt pages detected by enabling checksums

From
Jim Nasby
Date:
On 5/8/13 7:34 PM, Jeff Davis wrote:
> On Wed, 2013-05-08 at 17:56 -0500, Jim Nasby wrote:
>> Apologies if this is a stupid question, but is this mostly an issue
>> due to torn pages? IOW, if we had a way to ensure we never see torn
>> pages, would that mean an invalid CRC on a WAL page indicated there
>> really was corruption on that page?
>>
>> Maybe it's worth putting (yet more) thought into the torn page
>> issue... :/
>
> Sort of. For data, a page is the logically-atomic unit that is expected
> to be intact. For WAL, a record is the logically-atomic unit that is
> expected to be intact.
>
> So it might be better to say that the issue for the WAL is "torn
> records". A record might be larger than a page (it can hold up to three
> full-page images in one record), but is often much smaller.
>
> We use a CRC to validate that the WAL record is fully intact. The
> concern is that, if it fails the CRC check, we *assume* that it's
> because it wasn't completely flushed yet (i.e. a "torn record"). Based
> on that assumption, neither that record nor any later record contains
> committed transactions, so we can safely consider that the end of the
> WAL (as of the crash) and bring the system up.
>
> The problem is that the assumption is not always true: a CRC failure
> could also indicate real corruption of WAL records that have been
> previously flushed successfully, and may contain committed transactions.
> That can mean we bring the system up way too early, corrupting the
> database.
>
> Unfortunately, it seems that doing any kind of validation to determine
> that we have a valid end-of-the-WAL inherently requires some kind of
> separate durable write somewhere. It would be a tiny amount of data (an
> LSN and maybe some extra crosscheck information), so I could imagine
> that would be just fine given the right hardware; but if we just write
> to disk that would be pretty bad. Ideas welcome.

What about moving some critical data from the beginning of the WAL record to the end? That would make it easier to
detectthat we don't have a complete record. It wouldn't necessarily replace the CRC though, so maybe that's not good
enough.

Actually, what if we actually *duplicated* some of the same WAL header info at the end of the record? Given a
reasonableamount of data that would damn-near ensure that a torn record was detected, because the odds of having the
exactsame sequence of random bytes would be so low. Potentially even just duplicating the LSN would suffice.
 

On the separate write idea, if that could be controlled by a GUC I think it'd be worth doing. Anyone that needs to
worryabout this corner case probably has hardware that would support that.
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: corrupt pages detected by enabling checksums

From
Simon Riggs
Date:
On 9 May 2013 20:28, Jim Nasby <jim@nasby.net> wrote:

>> Unfortunately, it seems that doing any kind of validation to determine
>> that we have a valid end-of-the-WAL inherently requires some kind of
>> separate durable write somewhere. It would be a tiny amount of data (an
>> LSN and maybe some extra crosscheck information), so I could imagine
>> that would be just fine given the right hardware; but if we just write
>> to disk that would be pretty bad. Ideas welcome.

Not so sure.

If the WAL record length is intact, and it probably is, then we can
test whether the next WAL record is valid also.

If the current WAL record is corrupt and the next WAL record is
corrupt, then we have a problem.

If the current WAL record is corrupt and the next WAL record is in
every way valid, we can potentially continue. But we need to keep
track of accumulated errors to avoid getting into a worse situation.
Obviously, we would need to treat the next WAL record with complete
scepticism, but I have seen cases where only a single WAL record was
corrupt.

--Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: corrupt pages detected by enabling checksums

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> If the current WAL record is corrupt and the next WAL record is in
> every way valid, we can potentially continue.

That seems like a seriously bad idea.
        regards, tom lane



Re: corrupt pages detected by enabling checksums

From
Simon Riggs
Date:
On 9 May 2013 22:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> If the current WAL record is corrupt and the next WAL record is in
>> every way valid, we can potentially continue.
>
> That seems like a seriously bad idea.

I agree. But if you knew that were true, is stopping a better idea?

Any corrupt WAL record needs to halt recovery. What I'm thinking is to
check whether it might be possible to continue, and if all looks good,
offer the user the option to continue.

"Continuing could be dangerous and might cause deeper corruption of
your database. We suggest you take a backup of the server before
continuing"

--Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: corrupt pages detected by enabling checksums

From
Greg Stark
Date:
On Thu, May 9, 2013 at 10:45 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 9 May 2013 22:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>> If the current WAL record is corrupt and the next WAL record is in
>>> every way valid, we can potentially continue.
>>
>> That seems like a seriously bad idea.
>
> I agree. But if you knew that were true, is stopping a better idea?

Having one corrupt record followed by a valid record is not an
abnormal situation. It could easily be the correct end of WAL.

I think it's not possible to protect 100% against this without giving
up the checksum optimization which implies doing two fsyncs per commit
instead of 1.

However it is possible to reduce the window. Every time the
transaction log is synced a different file can be updated with the a
known minimum transaction log recovery point. Even if it's not synced
consistently on every transaction commit or wal sync it would serve as
a low water mark. Recovering to that point is not sufficient but is
necessary for a consistent recovery. That file could be synced lazily,
say, every 10s or something like that and would guarantee that any wal
corruption would be caught except for the last 10s of wal traffic for
example.

If you're only interested in database consistency and not lost commits
then that file could be synced on buffer xlog flushes (making a
painful case even more painful). Off the top of my head that would be
sufficient to guarantee that a corrupt xlog that would create an
inconsistent database would not be missed. I may be missing cases
involving checkpoints or the like though.


-- 
greg



Re: corrupt pages detected by enabling checksums

From
Jeff Davis
Date:
On Thu, 2013-05-09 at 14:28 -0500, Jim Nasby wrote:
> What about moving some critical data from the beginning of the WAL
> record to the end? That would make it easier to detect that we don't
> have a complete record. It wouldn't necessarily replace the CRC
> though, so maybe that's not good enough.
> 
> Actually, what if we actually *duplicated* some of the same WAL header
> info at the end of the record? Given a reasonable amount of data that
> would damn-near ensure that a torn record was detected, because the
> odds of having the exact same sequence of random bytes would be so
> low. Potentially even just duplicating the LSN would suffice.

I think both of these ideas have some false positives and false
negatives.

If the corruption happens at the record boundary, and wipes out the
special information at the end of the record, then you might think it
was not fully flushed, and we're in the same position as today.

If the WAL record is large, and somehow the beginning and the end get
written to disk but not the middle, then it will look like corruption;
but really the WAL was just not completely flushed. This seems pretty
unlikely, but not impossible.

That being said, I like the idea of introducing some extra checks if a
perfect solution is not possible.

> On the separate write idea, if that could be controlled by a GUC I
> think it'd be worth doing. Anyone that needs to worry about this
> corner case probably has hardware that would support that.

It sounds pretty easy to do that naively. I'm just worried that the
performance will be so bad for so many users that it's not a very
reasonable choice.

Today, it would probably make more sense to just use sync rep. If the
master's WAL is corrupt, and it starts up too early, then that should be
obvious when you try to reconnect streaming replication. I haven't tried
it, but I'm assuming that it gives a useful error message.

Regards,Jeff Davis





Re: corrupt pages detected by enabling checksums

From
Jeff Davis
Date:
On Thu, 2013-05-09 at 23:13 +0100, Greg Stark wrote:
> However it is possible to reduce the window...

Sounds reasonable.

It's fairly limited though -- the window is already a checkpoint
(typically 5-30 minutes), and we'd bring that down an order of magnitude
(10s). I speculate that, if it got corrupted within 30 minutes, it
probably got corrupted at the time of being written (as happened in Jeff
Janes's case, due to a bug).

So, the question is: if the WAL is corrupted on write, does reducing the
window significantly increase the chances that the wal writer will hang
around long enough before a crash to flush this other file?

On the other hand, checkpoint hides any corrupt WAL records by not
replaying them, whereas your scheme would identify that there is a
problem.

I don't think this would have helped Jeff Janes's case because I think
the crashes were happening too quickly. But that is artificial, so it
may help in real cases.

I just had a thought: we don't necessarily need to flush the auxiliary
file each time; merely writing it to the kernel buffers would help a
lot. Maybe an extra write() of the auxiliary file during a WAL flush
isn't so bad; and combined with periodic fsync()s of the auxiliary file,
should offer a lot of coverage against problems.

Regards,Jeff Davis





Re: corrupt pages detected by enabling checksums

From
Simon Riggs
Date:
On 9 May 2013 23:13, Greg Stark <stark@mit.edu> wrote:
> On Thu, May 9, 2013 at 10:45 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 9 May 2013 22:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>>> If the current WAL record is corrupt and the next WAL record is in
>>>> every way valid, we can potentially continue.
>>>
>>> That seems like a seriously bad idea.
>>
>> I agree. But if you knew that were true, is stopping a better idea?
>
> Having one corrupt record followed by a valid record is not an
> abnormal situation. It could easily be the correct end of WAL.

I disagree, that *is* an abnormal situation and would not be the
"correct end-of-WAL".

Each WAL record contains a "prev" pointer to the last WAL record. So
for the next record to be valid the prev pointer would need to be
exactly correct.


> However it is possible to reduce the window. Every time the
> transaction log is synced a different file can be updated with the a
> known minimum transaction log recovery point. Even if it's not synced
> consistently on every transaction commit or wal sync it would serve as
> a low water mark. Recovering to that point is not sufficient but is
> necessary for a consistent recovery. That file could be synced lazily,
> say, every 10s or something like that and would guarantee that any wal
> corruption would be caught except for the last 10s of wal traffic for
> example.

I think it would be easy enough to have the WALwriter update the
minRecoveryPoint once per cycle, after it has flushed WAL.

Given the importance of pg_control and its small size, it seems like
it would be a good idea to take a backup copy of it every checkpoint
to make sure we have that data safe. And have pg_resetxlog keep a copy
also every time that is run.

--Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: corrupt pages detected by enabling checksums

From
Greg Stark
Date:
On Fri, May 10, 2013 at 7:44 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Having one corrupt record followed by a valid record is not an
>> abnormal situation. It could easily be the correct end of WAL.
>
> I disagree, that *is* an abnormal situation and would not be the
> "correct end-of-WAL".
>
> Each WAL record contains a "prev" pointer to the last WAL record. So
> for the next record to be valid the prev pointer would need to be
> exactly correct.

Well then you're wrong. The OS makes no guarantee that blocks are
written out in order. When the system crashes any random subset of the
blocks written but not synced might have been written to disk and
others not. There could be megabytes of correct WAL written with just
one block in the middle of the first record not written. If no xlog
sync had occurred (or one was in progress but not completed) then
that's the correct end of WAL.


-- 
greg



Re: corrupt pages detected by enabling checksums

From
Amit Kapila
Date:
On Friday, May 10, 2013 6:09 PM Greg Stark wrote:
> On Fri, May 10, 2013 at 7:44 AM, Simon Riggs <simon@2ndquadrant.com>
> wrote:
> >> Having one corrupt record followed by a valid record is not an
> >> abnormal situation. It could easily be the correct end of WAL.
> >
> > I disagree, that *is* an abnormal situation and would not be the
> > "correct end-of-WAL".
> >
> > Each WAL record contains a "prev" pointer to the last WAL record. So
> > for the next record to be valid the prev pointer would need to be
> > exactly correct.
> 
> Well then you're wrong. The OS makes no guarantee that blocks are
> written out in order. When the system crashes any random subset of the
> blocks written but not synced might have been written to disk and
> others not. There could be megabytes of correct WAL written with just
> one block in the middle of the first record not written. If no xlog
> sync had occurred (or one was in progress but not completed) then
> that's the correct end of WAL.

In the case where one block is missing, how can it even reach to next record
to check "prev" pointer.
I think it can be possible when one of the record is corrupt and following
are okay which I think is the 
case in which it can proceed with warning as suggested by Simon.

With Regards,
Amit Kapila.




Re: corrupt pages detected by enabling checksums

From
Greg Stark
Date:
On Fri, May 10, 2013 at 5:31 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
> In the case where one block is missing, how can it even reach to next record
> to check "prev" pointer.
> I think it can be possible when one of the record is corrupt and following
> are okay which I think is the
> case in which it can proceed with warning as suggested by Simon.

A single WAL record can be over 24kB. The checksum covers the entire
WAL record and if it reports corruption it can be because a chunk in
the middle wasn't flushed to disk before the system crashed. The
beginning of the WAL record with the length and checksum and the
entire following record with its prev pointer might have been flushed
but the missing block in the middle of this record means it can't be
replayed. This would be a normal situation in case of a system crash.

If you replayed the following record but not this record you would
have an inconsistent database. The following record could be an insert
for a child table with a foreign key reference to a tuple that was
inserted by the skipped record for example. Resulting in a database
that is logically inconsistent.

Or it could be an index insert for that tuple would would result in a
physically inconsistent database with index pointers that point to
incorrect tuples. Index scans would return tuples that didn't match
the index  or would miss tuples that should be returned.

-- 
greg



Re: corrupt pages detected by enabling checksums

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> A single WAL record can be over 24kB.

<pedantic>
Actually, WAL records can run to megabytes.  Consider for example a
commit record for a transaction that dropped thousands of tables ---
there'll be info about each such table in the commit record, to cue
replay to remove those files.
</pedantic>

> If you replayed the following record but not this record you would
> have an inconsistent database. ...
> Or it could be an index insert for that tuple would would result in a
> physically inconsistent database with index pointers that point to
> incorrect tuples. Index scans would return tuples that didn't match
> the index  or would miss tuples that should be returned.

Skipping actions such as index page splits would lead to even more fun.

Even in simple cases such as successive inserts and deletions in the
same heap page, failing to replay some of the actions is going to be
disastrous.  The *best case* scenario for that is that WAL replay
PANICs when it notices that the action it's trying to replay is
inconsistent with the current state of the page, eg it's trying to
insert at a TID that already exists.

IMO we can't proceed past a broken WAL record.  The actually useful
suggestion upthread was that we try to notice whether there seem
to be valid WAL records past the broken one, so that we could warn
the DBA that some commits might have been lost.  I don't think we
can do much in the way of automatic data recovery, but we could give
the DBA a chance to do forensics rather than blindly starting up (and
promptly overwriting all the evidence).
        regards, tom lane



Re: corrupt pages detected by enabling checksums

From
Simon Riggs
Date:
On 10 May 2013 13:39, Greg Stark <stark@mit.edu> wrote:
> On Fri, May 10, 2013 at 7:44 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> Having one corrupt record followed by a valid record is not an
>>> abnormal situation. It could easily be the correct end of WAL.
>>
>> I disagree, that *is* an abnormal situation and would not be the
>> "correct end-of-WAL".
>>
>> Each WAL record contains a "prev" pointer to the last WAL record. So
>> for the next record to be valid the prev pointer would need to be
>> exactly correct.
>
> Well then you're wrong. The OS makes no guarantee that blocks are
> written out in order. When the system crashes any random subset of the
> blocks written but not synced might have been written to disk and
> others not. There could be megabytes of correct WAL written with just
> one block in the middle of the first record not written. If no xlog
> sync had occurred (or one was in progress but not completed) then
> that's the correct end of WAL.

I agree that the correct end of WAL is where the last sync occurred.
We don't write() WAL except with an immediate sync(), so the chances
of what you say happening are very low to impossible. The timing
window between the write and the sync is negligible and yet I/O would
need to occur in that window and also be out of order from the order
of the write, which is unlikely because an I/O elevator would either
not touch the order of writes at all, or would want to maintain
sequential order to avoid head movement, which is what we want. I
guess we should add here "...with disks, maybe not with SSDs".

In any case, what is more important is that your idea to make an
occasional write of the minRecoveryPoint and then use that to cross
check against current LSN would allow us to at least confirm that we
have a single corrupt record and report that situation accurately to
the user. That idea will cover 95+% of such problems anyway, since
what we care about is long sequences of WAL records, not just the last
few written at crash, which the above discussion focuses on.

--Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: corrupt pages detected by enabling checksums

From
Simon Riggs
Date:
On 10 May 2013 18:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@mit.edu> writes:
>> A single WAL record can be over 24kB.
>
> <pedantic>
> Actually, WAL records can run to megabytes.  Consider for example a
> commit record for a transaction that dropped thousands of tables ---
> there'll be info about each such table in the commit record, to cue
> replay to remove those files.
> </pedantic>
>
>> If you replayed the following record but not this record you would
>> have an inconsistent database. ...
>> Or it could be an index insert for that tuple would would result in a
>> physically inconsistent database with index pointers that point to
>> incorrect tuples. Index scans would return tuples that didn't match
>> the index  or would miss tuples that should be returned.
>
> Skipping actions such as index page splits would lead to even more fun.
>
> Even in simple cases such as successive inserts and deletions in the
> same heap page, failing to replay some of the actions is going to be
> disastrous.  The *best case* scenario for that is that WAL replay
> PANICs when it notices that the action it's trying to replay is
> inconsistent with the current state of the page, eg it's trying to
> insert at a TID that already exists.

Agreed

> IMO we can't proceed past a broken WAL record.  The actually useful
> suggestion upthread was that we try to notice whether there seem
> to be valid WAL records past the broken one, so that we could warn
> the DBA that some commits might have been lost.  I don't think we
> can do much in the way of automatic data recovery, but we could give
> the DBA a chance to do forensics rather than blindly starting up (and
> promptly overwriting all the evidence).

The usual answer is switchover to the standby, hoping that the WAL was
not corrupted before it got sent there also.

If all servers go down people will want a "carry on regardless" option
as well, since the fault can be investigated on a standby. Even with
all of the above caveats,  lying on our backs with our feet in the air
'cos we lost a few blocks will not impress anyone. Doing that will
likely be a medium-high risk thing, but that will still be better than
the certainty of a down server. It would need to be a manually iniated
option.

--Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: corrupt pages detected by enabling checksums

From
Jeff Janes
Date:
On Fri, May 10, 2013 at 9:54 AM, Greg Stark <stark@mit.edu> wrote:
On Fri, May 10, 2013 at 5:31 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
> In the case where one block is missing, how can it even reach to next record
> to check "prev" pointer.
> I think it can be possible when one of the record is corrupt and following
> are okay which I think is the
> case in which it can proceed with warning as suggested by Simon.

A single WAL record can be over 24kB. The checksum covers the entire
WAL record and if it reports corruption it can be because a chunk in
the middle wasn't flushed to disk before the system crashed. The
beginning of the WAL record with the length and checksum and the
entire following record with its prev pointer might have been flushed
but the missing block in the middle of this record means it can't be
replayed. This would be a normal situation in case of a system crash.

If you replayed the following record but not this record you would
have an inconsistent database.

I don't think we would ever want to *skip* the record and play the next one.  But if it looks like the next record is valid, we might not want to automatically open the database in a possibly inconsistent state and in the process overwrite the only existing copy of those WAL records which would be necessary to make it consistent.  Instead, could we present the DBA with an explicit choice to either open the database, or try to reconstruct the corrupted record via forensic inspection so that it can be played through (I have no idea how likely it is that such an attempt would succeed), or to copy the database for later inspection and then open it.

But based on your description, perhaps refusing to automatically restart and forcing an explicit decision would happen a lot more often, during normal crashes with no corruption, than I was thinking it would.

Of course the paranoid DBA could turn off restart_after_crash and do a manual investigation on every crash, but in that case the database would refuse to restart even in the case where it perfectly clear that all the following WAL belongs to the recycled file and not the current file.  They would also have to turn off any startup scripts in init.d, to make sure a rebooting server doesn't do recovery automatically and destroy evidence that way.


Cheers,

Jeff

Re: corrupt pages detected by enabling checksums

From
Robert Haas
Date:
On Fri, May 10, 2013 at 2:06 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> But based on your description, perhaps refusing to automatically restart and
> forcing an explicit decision would happen a lot more often, during normal
> crashes with no corruption, than I was thinking it would.

I bet it would.  But I think Greg Stark's idea of periodically
flushing the current minimum recovery point to disk has a lot to
recommend it.  That would be massively more reliable than what we have
now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: corrupt pages detected by enabling checksums

From
Jeff Davis
Date:
On Fri, 2013-05-10 at 18:32 +0100, Simon Riggs wrote:
> We don't write() WAL except with an immediate sync(), so the chances
> of what you say happening are very low to impossible.

Are you sure? An XLogwrtRqst contains a write and a flush pointer, so I
assume they can be different.

I agree that it sounds unlikely that blocks 100 and 102 would be
written, but not 101. But perhaps that's more likely in systems like ZFS
where the physical blocks might be in very different places.

Regards,Jeff Davis





Re: corrupt pages detected by enabling checksums

From
Greg Smith
Date:
On 5/10/13 1:32 PM, Simon Riggs wrote:
> The timing
> window between the write and the sync is negligible and yet I/O would
> need to occur in that window and also be out of order from the order
> of the write, which is unlikely because an I/O elevator would either
> not touch the order of writes at all, or would want to maintain
> sequential order to avoid head movement, which is what we want. I
> guess we should add here "...with disks, maybe not with SSDs".

It's not really safe to make any assumptions about I/O elevators. 
Reordering gets done from the perspective of the last item written. 
When the previous write was at the logical end of the disk, it can just 
as easily re-order a queue of writes in the complete reverse order they 
were issued in.

The only way you can ever get a useful guarantee is when an fsync 
returns completion.  Writes can effectively go out in a completely 
random order until that point.  All you can rely on is throwing up a 
stop sign that says "tell me when all of them are done".  In between 
those, you have no idea of the ordering.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: corrupt pages detected by enabling checksums

From
Amit kapila
Date:
On Friday, May 10, 2013 10:24 PM Greg Stark wrote:
On Fri, May 10, 2013 at 5:31 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
>> In the case where one block is missing, how can it even reach to next record
>> to check "prev" pointer.
>> I think it can be possible when one of the record is corrupt and following
>> are okay which I think is the
>> case in which it can proceed with warning as suggested by Simon.

>A single WAL record can be over 24kB. The checksum covers the entire
>WAL record and if it reports corruption it can be because a chunk in
>the middle wasn't flushed to disk before the system crashed. The
>beginning of the WAL record with the length and checksum and the
>entire following record with its prev pointer might have been flushed
>but the missing block in the middle of this record means it can't be
>replayed. This would be a normal situation in case of a system crash.

The only point I wanted to say was it can be only "one such record",
length can be large or small.


>If you replayed the following record but not this record you would
>have an inconsistent database. The following record could be an insert
>for a child table with a foreign key reference to a tuple that was
>inserted by the skipped record for example. Resulting in a database
>that is logically inconsistent.

The corrupt record can be such that it can lead to inconsistency in database or
it could be a commit record of transaction which has performed only select operation.
It will be difficult or not possible to find such information during recovery,
but informing DBA/user at such occasion can be useful and with an optional way for him to
continue (although I am not sure how in such a case DBA can decide, may be need some other information as well).
The reason why it can be useful to allow DBA/user intervention in such cases is that, in case
of ignoring data after one corrupt record, it can still take time for DBA/user to find
which of the operations he needs to perform again.

With Regards,
Amit Kapila.


Re: corrupt pages detected by enabling checksums

From
Simon Riggs
Date:
On 10 May 2013 23:41, Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2013-05-10 at 18:32 +0100, Simon Riggs wrote:
>> We don't write() WAL except with an immediate sync(), so the chances
>> of what you say happening are very low to impossible.
>
> Are you sure? An XLogwrtRqst contains a write and a flush pointer, so I
> assume they can be different.

I'm answering this just to complete the discussion.

Yes, the write and flush pointer can be different. The write/flush are
two actions; we do one first, then the other, very quickly, inside
XLogWrite().

But we cannot rely on that, since there are some times when we don't
do that, such as wal_buffer overflow and when background walwriter
writes are at the end of the ring buffer. Not common, but they do
exist and when they exist they write many blocks. Which is counter to
what I had said earlier.

This part of the proposal isn't necessary for us to get a good answer
95% of the time, so it is dropped.

As mentioned on other post, we can write
UpdateMinRecoveryPoint(LogwrtResult.Flush, false) every time we do
XLogBackgroundFlush() and some other rework to make that happen
correctly. I'll post a separate patch.

--Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: corrupt pages detected by enabling checksums

From
Jim Nasby
Date:
On 5/9/13 5:18 PM, Jeff Davis wrote:
> On Thu, 2013-05-09 at 14:28 -0500, Jim Nasby wrote:
>> What about moving some critical data from the beginning of the WAL
>> record to the end? That would make it easier to detect that we don't
>> have a complete record. It wouldn't necessarily replace the CRC
>> though, so maybe that's not good enough.
>>
>> Actually, what if we actually *duplicated* some of the same WAL header
>> info at the end of the record? Given a reasonable amount of data that
>> would damn-near ensure that a torn record was detected, because the
>> odds of having the exact same sequence of random bytes would be so
>> low. Potentially even just duplicating the LSN would suffice.
>
> I think both of these ideas have some false positives and false
> negatives.
>
> If the corruption happens at the record boundary, and wipes out the
> special information at the end of the record, then you might think it
> was not fully flushed, and we're in the same position as today.
>
> If the WAL record is large, and somehow the beginning and the end get
> written to disk but not the middle, then it will look like corruption;
> but really the WAL was just not completely flushed. This seems pretty
> unlikely, but not impossible.
>
> That being said, I like the idea of introducing some extra checks if a
> perfect solution is not possible.

Yeah, I don't think a perfect solution is possible, short of attempting to tie directly into the filesystem (ie: on a
journalingFS have some way to essentially treat the FS journal as WAL).
 

One additional step we might be able to take would be to scan forward looking for a record that would tell us when an
fsyncmust have occurred (heck, maybe we should add an fsync WAL record...). If we find a corrupt WAL record followed by
anfsync we know that we've now lost data. That closes some of the holes. Actually, that might handle all the holes...
 

>> On the separate write idea, if that could be controlled by a GUC I
>> think it'd be worth doing. Anyone that needs to worry about this
>> corner case probably has hardware that would support that.
>
> It sounds pretty easy to do that naively. I'm just worried that the
> performance will be so bad for so many users that it's not a very
> reasonable choice.
>
> Today, it would probably make more sense to just use sync rep. If the
> master's WAL is corrupt, and it starts up too early, then that should be
> obvious when you try to reconnect streaming replication. I haven't tried
> it, but I'm assuming that it gives a useful error message.

I wonder if there are DW environments that are too large to keep a SR copy but would be able to afford the double-write
overhead.

BTW, isn't performance what killed the double-buffer idea?
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: corrupt pages detected by enabling checksums

From
Jim Nasby
Date:
On 5/10/13 1:06 PM, Jeff Janes wrote:
> Of course the paranoid DBA could turn off restart_after_crash and do a manual investigation on every crash, but in
thatcase the database would refuse to restart even in the case where it perfectly clear that all the following WAL
belongsto the recycled file and not the current file.
 

Perhaps we should also allow for zeroing out WAL files before reuse (or just disable reuse). I know there's a
performancehit there, but the reuse idea happened before we had bgWriter. Theoretically the overhead creating a new
filewould always fall to bgWriter and therefore not be a big deal.
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: corrupt pages detected by enabling checksums

From
Jon Nelson
Date:
On Sun, May 12, 2013 at 3:46 PM, Jim Nasby <jim@nasby.net> wrote:
> On 5/10/13 1:06 PM, Jeff Janes wrote:
>>
>> Of course the paranoid DBA could turn off restart_after_crash and do a
>> manual investigation on every crash, but in that case the database would
>> refuse to restart even in the case where it perfectly clear that all the
>> following WAL belongs to the recycled file and not the current file.
>
>
> Perhaps we should also allow for zeroing out WAL files before reuse (or just
> disable reuse). I know there's a performance hit there, but the reuse idea
> happened before we had bgWriter. Theoretically the overhead creating a new
> file would always fall to bgWriter and therefore not be a big deal.

For filesystems like btrfs, re-using a WAL file is suboptimal to
simply creating a new one and removing the old one when it's no longer
required. Using fallocate (or posix_fallocate) (I have a patch for
that!) to create a new one is - by my tests - 28 times faster than the
currently-used method.


--
Jon



Re: corrupt pages detected by enabling checksums

From
"ktm@rice.edu"
Date:
On Sun, May 12, 2013 at 03:46:00PM -0500, Jim Nasby wrote:
> On 5/10/13 1:06 PM, Jeff Janes wrote:
> >Of course the paranoid DBA could turn off restart_after_crash and do a manual investigation on every crash, but in
thatcase the database would refuse to restart even in the case where it perfectly clear that all the following WAL
belongsto the recycled file and not the current file.
 
> 
> Perhaps we should also allow for zeroing out WAL files before reuse (or just disable reuse). I know there's a
performancehit there, but the reuse idea happened before we had bgWriter. Theoretically the overhead creating a new
filewould always fall to bgWriter and therefore not be a big deal.
 
> -- 
> Jim C. Nasby, Data Architect                       jim@nasby.net
> 512.569.9461 (cell)                         http://jim.nasby.net
> 

Unless something has changed dramtically, creating new files is a LOT more
overhead than reusing existing files. My two cents.

Regards,
Ken



Re: corrupt pages detected by enabling checksums

From
"ktm@rice.edu"
Date:
On Sun, May 12, 2013 at 07:41:26PM -0500, Jon Nelson wrote:
> On Sun, May 12, 2013 at 3:46 PM, Jim Nasby <jim@nasby.net> wrote:
> > On 5/10/13 1:06 PM, Jeff Janes wrote:
> >>
> >> Of course the paranoid DBA could turn off restart_after_crash and do a
> >> manual investigation on every crash, but in that case the database would
> >> refuse to restart even in the case where it perfectly clear that all the
> >> following WAL belongs to the recycled file and not the current file.
> >
> >
> > Perhaps we should also allow for zeroing out WAL files before reuse (or just
> > disable reuse). I know there's a performance hit there, but the reuse idea
> > happened before we had bgWriter. Theoretically the overhead creating a new
> > file would always fall to bgWriter and therefore not be a big deal.
> 
> For filesystems like btrfs, re-using a WAL file is suboptimal to
> simply creating a new one and removing the old one when it's no longer
> required. Using fallocate (or posix_fallocate) (I have a patch for
> that!) to create a new one is - by my tests - 28 times faster than the
> currently-used method.
> 
> 
> --
> Jon
> 

What about for less cutting (bleeding) edge filesystems?

Ken



Re: corrupt pages detected by enabling checksums

From
Jon Nelson
Date:
On Mon, May 13, 2013 at 7:49 AM, ktm@rice.edu <ktm@rice.edu> wrote:
> On Sun, May 12, 2013 at 07:41:26PM -0500, Jon Nelson wrote:
>> On Sun, May 12, 2013 at 3:46 PM, Jim Nasby <jim@nasby.net> wrote:
>> > On 5/10/13 1:06 PM, Jeff Janes wrote:
>> >>
>> >> Of course the paranoid DBA could turn off restart_after_crash and do a
>> >> manual investigation on every crash, but in that case the database would
>> >> refuse to restart even in the case where it perfectly clear that all the
>> >> following WAL belongs to the recycled file and not the current file.
>> >
>> >
>> > Perhaps we should also allow for zeroing out WAL files before reuse (or just
>> > disable reuse). I know there's a performance hit there, but the reuse idea
>> > happened before we had bgWriter. Theoretically the overhead creating a new
>> > file would always fall to bgWriter and therefore not be a big deal.
>>
>> For filesystems like btrfs, re-using a WAL file is suboptimal to
>> simply creating a new one and removing the old one when it's no longer
>> required. Using fallocate (or posix_fallocate) (I have a patch for
>> that!) to create a new one is - by my tests - 28 times faster than the
>> currently-used method.
>>

> What about for less cutting (bleeding) edge filesystems?

The patch would be applicable for any filesystem which implements the
fallocate/posix_fallocate calls in an efficient fashion. xfs and ext4
would both work, if I recall properly. I'm certain there are others,
and the technique would probably work on other operating systems like
the *BSDs, etc.. Additionally, it's improbable that there would be a
performance hit for other filesystems versus simply writing zeroes,
since that's the approach that is taken anyway as a fallback.  Another
win is reduction in fragmentation. I would love to be able to turn off
WAL recycling to perform more useful testing.

--
Jon



Re: corrupt pages detected by enabling checksums

From
Andres Freund
Date:
On 2013-05-12 19:41:26 -0500, Jon Nelson wrote:
> On Sun, May 12, 2013 at 3:46 PM, Jim Nasby <jim@nasby.net> wrote:
> > On 5/10/13 1:06 PM, Jeff Janes wrote:
> >>
> >> Of course the paranoid DBA could turn off restart_after_crash and do a
> >> manual investigation on every crash, but in that case the database would
> >> refuse to restart even in the case where it perfectly clear that all the
> >> following WAL belongs to the recycled file and not the current file.
> >
> >
> > Perhaps we should also allow for zeroing out WAL files before reuse (or just
> > disable reuse). I know there's a performance hit there, but the reuse idea
> > happened before we had bgWriter. Theoretically the overhead creating a new
> > file would always fall to bgWriter and therefore not be a big deal.
> 
> For filesystems like btrfs, re-using a WAL file is suboptimal to
> simply creating a new one and removing the old one when it's no longer
> required. Using fallocate (or posix_fallocate) (I have a patch for
> that!) to create a new one is - by my tests - 28 times faster than the
> currently-used method.

I don't think the comparison between just fallocate()ing and what we
currently do is fair. fallocate() doesn't guarantee that the file is the
same size after a crash, so you would still need an fsync() or we
couldn't use fdatasync() anymore. And I'd guess the benefits aren't all
that big anymore in that case?
That said, using posix_fallocate seems like a good idea in lots of
places inside pg, its just not all that easy to do in some of the
places.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: corrupt pages detected by enabling checksums

From
Jon Nelson
Date:
On Mon, May 13, 2013 at 8:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-05-12 19:41:26 -0500, Jon Nelson wrote:
>> On Sun, May 12, 2013 at 3:46 PM, Jim Nasby <jim@nasby.net> wrote:
>> > On 5/10/13 1:06 PM, Jeff Janes wrote:
>> >>
>> >> Of course the paranoid DBA could turn off restart_after_crash and do a
>> >> manual investigation on every crash, but in that case the database would
>> >> refuse to restart even in the case where it perfectly clear that all the
>> >> following WAL belongs to the recycled file and not the current file.
>> >
>> >
>> > Perhaps we should also allow for zeroing out WAL files before reuse (or just
>> > disable reuse). I know there's a performance hit there, but the reuse idea
>> > happened before we had bgWriter. Theoretically the overhead creating a new
>> > file would always fall to bgWriter and therefore not be a big deal.
>>
>> For filesystems like btrfs, re-using a WAL file is suboptimal to
>> simply creating a new one and removing the old one when it's no longer
>> required. Using fallocate (or posix_fallocate) (I have a patch for
>> that!) to create a new one is - by my tests - 28 times faster than the
>> currently-used method.
>
> I don't think the comparison between just fallocate()ing and what we
> currently do is fair. fallocate() doesn't guarantee that the file is the
> same size after a crash, so you would still need an fsync() or we
> couldn't use fdatasync() anymore. And I'd guess the benefits aren't all
> that big anymore in that case?

fallocate (16MB) + fsync is still almost certainly faster than
write+write+write... + fsync.
The test I performed at the time did exactly that .. posix_fallocate + pg_fsync.

> That said, using posix_fallocate seems like a good idea in lots of
> places inside pg, its just not all that easy to do in some of the
> places.

I should not derail this thread any further. Perhaps, if interested
parties would like to discuss the use of fallocate/posix_fallocate, a
new thread might be more appropriate?


--
Jon



Re: corrupt pages detected by enabling checksums

From
Andres Freund
Date:
On 2013-05-13 08:45:41 -0500, Jon Nelson wrote:
> On Mon, May 13, 2013 at 8:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-05-12 19:41:26 -0500, Jon Nelson wrote:
> >> On Sun, May 12, 2013 at 3:46 PM, Jim Nasby <jim@nasby.net> wrote:
> >> > On 5/10/13 1:06 PM, Jeff Janes wrote:
> >> >>
> >> >> Of course the paranoid DBA could turn off restart_after_crash and do a
> >> >> manual investigation on every crash, but in that case the database would
> >> >> refuse to restart even in the case where it perfectly clear that all the
> >> >> following WAL belongs to the recycled file and not the current file.
> >> >
> >> >
> >> > Perhaps we should also allow for zeroing out WAL files before reuse (or just
> >> > disable reuse). I know there's a performance hit there, but the reuse idea
> >> > happened before we had bgWriter. Theoretically the overhead creating a new
> >> > file would always fall to bgWriter and therefore not be a big deal.
> >>
> >> For filesystems like btrfs, re-using a WAL file is suboptimal to
> >> simply creating a new one and removing the old one when it's no longer
> >> required. Using fallocate (or posix_fallocate) (I have a patch for
> >> that!) to create a new one is - by my tests - 28 times faster than the
> >> currently-used method.
> >
> > I don't think the comparison between just fallocate()ing and what we
> > currently do is fair. fallocate() doesn't guarantee that the file is the
> > same size after a crash, so you would still need an fsync() or we
> > couldn't use fdatasync() anymore. And I'd guess the benefits aren't all
> > that big anymore in that case?
> 
> fallocate (16MB) + fsync is still almost certainly faster than
> write+write+write... + fsync.
> The test I performed at the time did exactly that .. posix_fallocate + pg_fsync.
Sure, the initial file creation will be faster. But are the actual
individual wal writes (small, frequently fdatasync()ed) still faster?
That's the critical path currently.
Whether it is pretty much depends on how the filesystem manages
allocated but not initialized blocks...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: corrupt pages detected by enabling checksums

From
Greg Stark
Date:
On Mon, May 13, 2013 at 2:49 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Sure, the initial file creation will be faster. But are the actual
> individual wal writes (small, frequently fdatasync()ed) still faster?
> That's the critical path currently.
> Whether it is pretty much depends on how the filesystem manages
> allocated but not initialized blocks...

In ext4 aIui it doesn't actually pick target blocks. It just adjusts
the accounting so it knows that many blocks will be needed for this
file and guarantees they'll be available. If you read from them it
knows to provide 0s. So in theory the performance in the critical path
would be worse but I think by an insignificant amount.

The reason Postgres pre-allocates the blocks is not for the
performance optimization. It's for safety. To guarantee -- as best as
possible -- that it won't get a write error when the time comes to
write to it. Especially to guarantee that the disk won't suddenly turn
out to be full.

It seems possible that some file systems would not protect you against
media errors nearly as well using it. It might take time to respond to
a media error and in a poorly written filesystem it might even be
reported to the application even though there's no need. But media
errors can occur any time, even after the initial write so I don't
think this should be a blocker.  I think posix_fallocate is good
enough for us and I would support using it.

-- 
greg



Re: corrupt pages detected by enabling checksums

From
Andres Freund
Date:
On 2013-05-13 16:03:11 +0100, Greg Stark wrote:
> On Mon, May 13, 2013 at 2:49 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Sure, the initial file creation will be faster. But are the actual
> > individual wal writes (small, frequently fdatasync()ed) still faster?
> > That's the critical path currently.
> > Whether it is pretty much depends on how the filesystem manages
> > allocated but not initialized blocks...
> 
> In ext4 aIui it doesn't actually pick target blocks. It just adjusts
> the accounting so it knows that many blocks will be needed for this
> file and guarantees they'll be available. If you read from them it
> knows to provide 0s. So in theory the performance in the critical path
> would be worse but I think by an insignificant amount.
> 
> The reason Postgres pre-allocates the blocks is not for the
> performance optimization. It's for safety. To guarantee -- as best as
> possible -- that it won't get a write error when the time comes to
> write to it. Especially to guarantee that the disk won't suddenly turn
> out to be full.

posix_fallocate() guarantees that. And if you fsync() the file
afterwards its even supposed to still have enough space after the crash.

"After a successful call to posix_fallocate(), subsequent writes to
bytes in the specified range are guaranteed not to fail because of lack
of disk space."


> It seems possible that some file systems would not protect you against
> media errors nearly as well using it.

True. The same probably is true for modifications of existing files for
those fancy COW filesystems...

> I think posix_fallocate is good enough for us and I would support
> using it.

Me too, although this isn't the place where I'd be really excited to see
a patch implementing it properly ;)

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: corrupt pages detected by enabling checksums

From
Simon Riggs
Date:
On 13 May 2013 14:45, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:

> I should not derail this thread any further. Perhaps, if interested
> parties would like to discuss the use of fallocate/posix_fallocate, a
> new thread might be more appropriate?

Sounds like a good idea. Always nice to see a fresh take on earlier ideas.

--Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services