Thread: corrupt pages detected by enabling checksums
I've changed the subject from "regression test failed when enabling checksum" because I now know they are totally unrelated.
My test case didn't need to depend on archiving being on, and so with a simple tweak I rendered the two issues orthogonal.
On Wed, Apr 3, 2013 at 12:15 PM, Jeff Davis <pgsql@j-davis.com> wrote:
On Mon, 2013-04-01 at 19:51 -0700, Jeff Janes wrote:
> What I would probably really want is the data as it existed after the
> crash but before recovery started, but since the postmaster
> immediately starts recovery after the crash, I don't know of a good
> way to capture this.

Can you just turn off the restart_after_crash GUC? I had a chance to
look at this, and seeing the block before and after recovery would be
nice. I didn't see a log file in the data directory, but it didn't go
through recovery, so I assume it already did that.
You don't know that the cluster is in the bad state until after it goes through recovery because most crashes recover perfectly fine. So it would have to make a side-copy of the cluster after the crash, then recover the original and see how things go, then either retain or delete the side-copy. Unfortunately my testing harness can't do this at the moment, because the perl script storing the consistency info needs to survive over the crash and recovery. It might take me awhile to figure out how to make it do this.
I have the server log just go to stderr, where it gets mingled together with messages from my testing harness. I had not uploaded that file.
Here is a new upload. It contains both a data_dir tarball including xlog, and the mingled stderr ("do_new.out"):

https://drive.google.com/folderview?id=0Bzqrh1SO9FcEQmVzSjlmdWZvUHc&usp=sharing

The other 3 files in it constitute the harness. It is quite a mess, with some hard-coded paths. The just-posted fix for wal_keep_segments will also have to be applied.
> The block is corrupt as far as I can tell. The first third is written,
> and the remainder is all zeros. The header looks like this:
Yes, that part is by my design. Why it didn't get fixed from a FPI is not by my design, of course.
> So, the page may be corrupt without checksums as well, but it just
> happens to be hidden for the same reason. Can you try to reproduce it
> without -k?
No, things run (seemingly) fine without -k.
> And on the checkin right before checksums were added?
> Without checksums, you'll need to use pg_filedump (or similar) to find
> whether an error has happened.
I'll work on it, but it will take awhile to figure out exactly how to use pg_filedump to do that, and how to automate that process and incorporate it into the harness.
In the meantime, I tested the checksum commit itself (96ef3b8ff1c) and the problem occurs there, so if the problem is not the checksums themselves (and I agree it probably isn't) it must have been introduced before that rather than after.
Cheers,
Jeff
On 2013-04-03 15:57:49 -0700, Jeff Janes wrote:
> Yes, that part is by my design. Why it didn't get fixed from a FPI is not
> by my design, of course.
There are no FPIs (if you mean a full page image with that) afaics.

Your logs tell us about recovery:

27692 2013-04-03 10:09:15.621 PDT:LOG:  redo starts at 1/31000090
27692 2013-04-03 10:09:15.647 PDT:LOG:  incorrect resource manager data checksum in record at 1/31169A68
27692 2013-04-03 10:09:15.647 PDT:LOG:  redo done at 1/31169A38

And then the error:

27698 SELECT 2013-04-03 10:09:16.688 PDT:WARNING:  page verification failed, calculated checksum 16296 but expected 49517
27698 SELECT 2013-04-03 10:09:16.688 PDT:ERROR:  invalid page in block 90 of relation base/16384/835589

Looking at xlog from that time, the first records are:

rmgr: XLOG    len (rec/tot):  72/   104, tx:          0, lsn: 1/31000028, prev 1/30000090, bkp: 0000, desc: checkpoint: redo 1/31000028; tli 1; prev tli 1; fpw true; xid 0/26254997; oid 835589; multi 1; offset 0; oldest xid 1799 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; online
rmgr: XLOG    len (rec/tot):   4/    36, tx:          0, lsn: 1/31000090, prev 1/31000028, bkp: 0000, desc: nextOid: 843781
rmgr: Storage len (rec/tot):  16/    48, tx:          0, lsn: 1/310000B8, prev 1/31000090, bkp: 0000, desc: file create: base/16384/835589
rmgr: Heap    len (rec/tot):  37/  7905, tx:   26254997, lsn: 1/310000E8, prev 1/310000B8, bkp: 1000, desc: hot_update: rel 1663/16384/12660; tid 0/47 xmax 26254997 ; new tid 0/33 xmax 0

So the file was created after the last checkpoint, so no full pages need to be logged. And we theoretically should be able to recover all things like torn pages from the WAL, since that should cover everything that happened to that file.

The bkp block 1 indicated in the 4th record above is the only one in that period of WAL btw.

Looking a bit more, the last page that's covered by WAL is:

rmgr: Heap    len (rec/tot):  35/    67, tx:   26254998, lsn: 1/311699A8, prev 1/31169960, bkp: 0000, desc: insert: rel 1663/16384/835589; tid 44/56

which is far earlier than the "block 90" the error is found in, unless I misunderstand something. Which is strange, since non-zero content for those pages shouldn't go to disk until the respective WAL is flushed. Huh, are we missing something here?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-04-04 01:52:41 +0200, Andres Freund wrote:
> Looking a bit more, the last page that's covered by WAL is:
>
> rmgr: Heap    len (rec/tot):  35/    67, tx:   26254998, lsn: 1/311699A8, prev 1/31169960, bkp: 0000, desc: insert: rel 1663/16384/835589; tid 44/56
>
> which is far earlier than the "block 90" the error is found in, unless I
> misunderstand something.

Looking at the page LSNs with dd I noticed something peculiar:

page 0:  01 00 00 00 18 c2 00 31 => 1/3100C218
page 1:  01 00 00 00 80 44 01 31 => 1/31014480
page 10: 01 00 00 00 60 ce 05 31 => 1/3105ce60
page 43: 01 00 00 00 58 7a 16 31 => 1/31167a58
page 44: 01 00 00 00 f0 99 16 31 => 1/311699f0
page 45: 00 00 00 00 00 00 00 00 => 0/0
page 90: 01 00 00 00 90 17 1d 32 => 1/321d1790
page 91: 01 00 00 00 38 ef 1b 32 => 1/321bef38

So we have written out pages that come after pages without an LSN, and that have an LSN that is *beyond* the point XLOG has successfully been written to disk (1/31169A38)?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
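For anyone wanting to repeat this kind of inspection, the conversion from those raw bytes to an LSN is mechanical. Below is a small standalone sketch (not from any patch in this thread, and assuming a little-endian host like the one these dumps came from) that decodes the first 8 bytes of a page header into the xlogid/xrecoff form used above:

#include <stdint.h>
#include <stdio.h>

/*
 * Decode the pd_lsn field from the first 8 bytes of a heap page on a
 * little-endian host: a 32-bit log file number ("xlogid") followed by a
 * 32-bit byte offset ("xrecoff"), printed as xlogid/xrecoff.
 */
static void
print_page_lsn(const unsigned char *page)
{
	uint32_t xlogid  = (uint32_t) page[0] | ((uint32_t) page[1] << 8) |
					   ((uint32_t) page[2] << 16) | ((uint32_t) page[3] << 24);
	uint32_t xrecoff = (uint32_t) page[4] | ((uint32_t) page[5] << 8) |
					   ((uint32_t) page[6] << 16) | ((uint32_t) page[7] << 24);

	printf("%X/%08X\n", xlogid, xrecoff);
}

int
main(void)
{
	/* the bytes of page 0 from the dump above */
	unsigned char page0[8] = {0x01, 0x00, 0x00, 0x00, 0x18, 0xc2, 0x00, 0x31};

	print_page_lsn(page0);		/* prints 1/3100C218 */
	return 0;
}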
On Wed, 2013-04-03 at 15:57 -0700, Jeff Janes wrote:
> You don't know that the cluster is in the bad state until after it
> goes through recovery because most crashes recover perfectly fine. So
> it would have to make a side-copy of the cluster after the crash, then
> recover the original and see how things go, then either retain or
> delete the side-copy. Unfortunately my testing harness can't do this
> at the moment, because the perl script storing the consistency info
> needs to survive over the crash and recovery. It might take
> me awhile to figure out how to make it do this.

Hmm... you're right, that is difficult to integrate into a test.

Another approach is to expand PageHeaderIsValid in a pre-checksums version to do some extra checks (perhaps the same checks as pg_filedump does). It might be able to at least detect simple cases, like all zeros between pd_upper and pd_special (when pd_upper < pd_special). Still not easy, but maybe more practical than saving the directory like that. We may even want a GUC for that to aid future debugging (obviously it would have a performance impact).

> The other 3 files in it constitute the harness. It is quite a mess,
> with some hard-coded paths. The just-posted fix for wal_keep_segments
> will also have to be applied.

I'm still trying to reproduce the problem myself. I keep trying simpler versions of your test and I don't see the problem. It's a little awkward to try to run the full version of your test (and I'm jumping between code inspection and various other ideas).

> I'll work on it, but it will take awhile to figure out exactly how to
> use pg_filedump to do that, and how to automate that process and
> incorporate it into the harness.

It might be more productive to try to expand PageHeaderIsValid as I suggested above.

> In the meantime, I tested the checksum commit itself (96ef3b8ff1c) and
> the problem occurs there, so if the problem is not the checksums
> themselves (and I agree it probably isn't) it must have been
> introduced before that rather than after.

Thank you for all of your work on this.

I have also attached another test patch that disables most of the complex areas of the checksums patch (VM bits, hint bits, etc.). If you apply different portions of the patch (or the whole thing), it will help eliminate possibilities. In particular, eliminating the calls to PageSetAllVisible+visibilitymap_set could tell us a lot.

Also, the first data directory you posted had a problem with a page that appeared to be a new page (otherwise I would have expected either old or new data at the end). Do you think this might be a problem only for new pages? Or do you think your test just makes it likely that many writes will happen on new pages?

I did find one potential problem in the checksums code (aside from my previous patch involving wal_level=archive): visibilitymap_set is called sometimes before the heap page is marked dirty. I will examine a little more closely, but I expect that to require another patch. Not sure if that could explain this problem or not.

Regards,
Jeff Davis
Attachment
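As a rough illustration of the kind of extra check being proposed (purely a sketch with hard-coded 9.3-era page header offsets, not the real PageHeaderIsValid), one could test whether the tuple area between pd_upper and pd_special is unexpectedly all zeros:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192
#define PAGE_HEADER_SIZE 24		/* size of the 9.3 page header */

/*
 * Sketch of an extra sanity check: report a page whose tuple area
 * (between pd_upper and pd_special) is entirely zero even though the
 * header says line pointers exist.  pd_lower/pd_upper/pd_special are read
 * from their 9.3 offsets (12/14/16); illustrative only.
 */
static bool
tuple_area_is_all_zero(const unsigned char *page)
{
	uint16_t	lower, upper, special;

	memcpy(&lower, page + 12, sizeof(lower));
	memcpy(&upper, page + 14, sizeof(upper));
	memcpy(&special, page + 16, sizeof(special));

	if (upper >= special || special > BLCKSZ)
		return false;			/* header bogus, or nothing to check */
	if (lower <= PAGE_HEADER_SIZE)
		return false;			/* no line pointers: page may simply be new */

	for (uint16_t i = upper; i < special; i++)
		if (page[i] != 0)
			return false;

	return true;				/* suspicious: item data is missing */
}

int
main(void)
{
	unsigned char page[BLCKSZ] = {0};
	uint16_t	lower = 28, upper = 4096, special = BLCKSZ;

	/* fabricate a header claiming one line pointer but no tuple data */
	memcpy(page + 12, &lower, sizeof(lower));
	memcpy(page + 14, &upper, sizeof(upper));
	memcpy(page + 16, &special, sizeof(special));

	printf("suspicious: %d\n", tuple_area_is_all_zero(page));
	return 0;
}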
On 2013-04-04 02:28:32 +0200, Andres Freund wrote:
> So we have written out pages that come after pages without an LSN, and
> that have an LSN that is *beyond* the point XLOG has successfully been
> written to disk (1/31169A38)?

By now I am pretty sure the issue is that the end-of-wal is detected too early. Above, the end is detected at 1/31169A38, but the next WAL file contains valid data:

pg_xlogdump /tmp/tmp/data2/pg_xlog/000000010000000100000032 -n 10
rmgr: Heap2   len (rec/tot):  26/    58, tx:          0, lsn: 1/32000030, prev 1/31FFFFD8, bkp: 0000, desc: clean: rel 1663/16384/835589; blk 122 remxid 26328926
rmgr: Heap    len (rec/tot):  51/    83, tx:   26328964, lsn: 1/32000070, prev 1/32000030, bkp: 0000, desc: update: rel 1663/16384/835589; tid 122/191 xmax 26328964 ; new tid 129/140 xmax 0
rmgr: Heap2   len (rec/tot):  30/    62, tx:          0, lsn: 1/320000C8, prev 1/32000070, bkp: 0000, desc: clean: rel 1663/16384/835589; blk 63 remxid 26328803
rmgr: Btree   len (rec/tot):  34/    66, tx:   26328964, lsn: 1/32000108, prev 1/320000C8, bkp: 0000, desc: insert: rel 1663/16384/835590; tid 25/14

That would explain exactly those symptoms, wouldn't it? What's causing that - I am not sure yet.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
andres@anarazel.de (Andres Freund) writes:
> So we have written out pages that come after pages without an LSN, and
> that have an LSN that is *beyond* the point XLOG has successfully been
> written to disk (1/31169A38)?

If you're looking into the FPIs, those would contain the page's older LSN, not the one assigned by the current WAL record.

			regards, tom lane
On 2013-04-03 20:45:51 -0400, Tom Lane wrote:
> If you're looking into the FPIs, those would contain the page's older
> LSN, not the one assigned by the current WAL record.

Nope, that's from the heap, and the LSNs are *newer* than what startup recovered to. I am pretty sure by now we are missing out on valid WAL, I am just not sure why.

Unfortunately we can't easily diagnose what happened at:

27692 2013-04-03 10:09:15.647 PDT:LOG:  incorrect resource manager data checksum in record at 1/31169A68

since the startup process wrote its end-of-recovery checkpoint there:

rmgr: XLOG    len (rec/tot):  72/   104, tx:          0, lsn: 1/31169A68, prev 1/31169A38, bkp: 0000, desc: checkpoint: redo 1/31169A68; tli 1; prev tli 1; fpw true; xid 0/26254999; oid 843781; multi 1; offset 0; oldest xid 1799 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown

Starting from some blocks in that WAL segment later:

pg_xlogdump /tmp/tmp/data2/pg_xlog/000000010000000100000031 -s 1/3116c000 -n 10
first record is after 1/3116C000, at 1/3116D9D8, skipping over 6616 bytes
rmgr: Heap    len (rec/tot):  51/    83, tx:   26254999, lsn: 1/3116D9D8, prev 1/3116BA20, bkp: 0000, desc: update: rel 1663/16384/835589; tid 38/148 xmax 26254999 ; new tid 44/57 xmax 0
rmgr: Btree   len (rec/tot):  34/    66, tx:   26254999, lsn: 1/3116DA30, prev 1/3116D9D8, bkp: 0000, desc: insert: rel 1663/16384/835590; tid 25/319
rmgr: Heap    len (rec/tot):  51/    83, tx:   26255000, lsn: 1/3116DA78, prev 1/3116DA30, bkp: 0000, desc: update: rel 1663/16384/835589; tid 19/214 xmax 26255000 ; new tid 44/58 xmax 0

the records continue again.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-04-04 02:58:43 +0200, Andres Freund wrote:
> Nope, that's from the heap, and the LSNs are *newer* than what startup
> recovered to. I am pretty sure by now we are missing out on valid WAL, I
> am just not sure why.

Ok, I think I see the bug. And I think it's been introduced in the checkpoints patch.

Remember, as explained in other mails, I am pretty sure the end of WAL was erroneously detected, resulting in the following error:

27692 2013-04-03 10:09:15.647 PDT:LOG:  incorrect resource manager data checksum in record at 1/31169A68

Note that without a *hardware* crash, end of WAL is normally found first by the checks for wrong LSNs or wrong record lengths. Not here.

My theory is that a page that was full-page written changed while it was inserted into the WAL, which caused the error. A possible scenario for that would be e.g. two concurrent calls to MarkBufferDirtyHint(). Note:

 * 2. The caller might have only share-lock instead of exclusive-lock on the
 *	  buffer's content lock.

	if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
		(BM_DIRTY | BM_JUST_DIRTIED))
	{
		if (DataChecksumsEnabled() && (bufHdr->flags & BM_PERMANENT))
		{
			MyPgXact->delayChkpt = delayChkpt = true;
			lsn = XLogSaveBufferForHint(buffer);	/* PART 1 */
		}
	}

	LockBufHdr(bufHdr);
	Assert(bufHdr->refcount > 0);
	if (!(bufHdr->flags & BM_DIRTY))
	{
		dirtied = true;		/* Means "will be dirtied by this action" */

		/*
		 * Set the page LSN if we wrote a backup block. We aren't
		 * supposed to set this when only holding a share lock but
		 * as long as we serialise it somehow we're OK. We choose to
		 * set LSN while holding the buffer header lock, which causes
		 * any reader of an LSN who holds only a share lock to also
		 * obtain a buffer header lock before using PageGetLSN().
		 * Fortunately, that's not too many places.
		 *
		 * If checksums are enabled, you might think we should reset the
		 * checksum here. That will happen when the page is written
		 * sometime later in this checkpoint cycle.
		 */
		if (!XLogRecPtrIsInvalid(lsn))
			PageSetLSN(page, lsn);	/* PART 2 */
	}

So XLogSaveBufferForHint(buffer) is called for all buffers which are not yet dirty. Without any locks. It just does:

XLogSaveBufferForHint(Buffer buffer)
{
	rdata[0].data = (char *) (&watermark);
	rdata[0].len = sizeof(int);
	rdata[0].next = &(rdata[1]);

	rdata[1].data = NULL;
	rdata[1].len = 0;
	rdata[1].buffer = buffer;
	rdata[1].buffer_std = true;
	rdata[1].next = NULL;

	return XLogInsert(RM_XLOG_ID, XLOG_HINT, rdata);
}

I think what happens is that one backend grabs WALInsertLock first; it computes a full-page write for the buffer, including a CRC, and returns a valid LSN. Then it continues towards the PageSetLSN() after LockBufHdr(). All the while, the second backend notices that WALInsertLock is free again and continues towards the WALInsertLock-protected parts of XLogInsert(). Only that it already has done that part:

	/*
	 * Calculate CRC of the data, including all the backup blocks
	 *
	 * Note that the record header isn't added into the CRC initially since we
	 * don't know the prev-link yet.  Thus, the CRC will represent the CRC of
	 * the whole record in the order: rdata, then backup blocks, then record
	 * header.
	 */
	INIT_CRC32(rdata_crc);
	for (rdt = rdata; rdt != NULL; rdt = rdt->next)
		COMP_CRC32(rdata_crc, rdt->data, rdt->len);

It goes on to write all the data to the WAL after successfully getting WALInsertLock:

	while (write_len)
	{
		while (rdata->data == NULL)
			rdata = rdata->next;

		if (freespace > 0)
		{
			if (rdata->len > freespace)
			{
				memcpy(Insert->currpos, rdata->data, freespace);
				rdata->data += freespace;
				rdata->len -= freespace;
				write_len -= freespace;
			}
			else
			{
				memcpy(Insert->currpos, rdata->data, rdata->len);
				freespace -= rdata->len;
				write_len -= rdata->len;
				Insert->currpos += rdata->len;
				rdata = rdata->next;
				continue;
			}
		}

		/* Use next buffer */
		updrqst = AdvanceXLInsertBuffer(false);
		curridx = Insert->curridx;
		/* Mark page header to indicate this record continues on the page */
		Insert->currpage->xlp_info |= XLP_FIRST_IS_CONTRECORD;
		Insert->currpage->xlp_rem_len = write_len;
		freespace = INSERT_FREESPACE(Insert);
	}

If by now the first backend has proceeded to PageSetLSN(), we are writing different data to disk than the one we computed the checksum of before. Boom.

I think the whole locking interactions in MarkBufferDirtyHint() need to be thought over pretty carefully.

Does this explanation make sense?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
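To make the failure mode concrete outside of PostgreSQL, here is a toy, single-process sketch (made-up checksum and buffer, nothing from xlog.c): the checksum is computed first, the buffer changes afterwards, and the changed bytes are what get "written", so verification at replay time fails just like the bogus record CRC in the log above.

#include <stdint.h>
#include <stdio.h>

/*
 * Toy illustration of the race described above: a checksum is computed
 * over a buffer, the buffer is modified afterwards (standing in for the
 * concurrent PageSetLSN()), and the modified bytes are what actually get
 * written out.  The reader's verification then fails.
 */
static uint32_t
toy_checksum(const unsigned char *buf, unsigned len)
{
	uint32_t	sum = 0;

	for (unsigned i = 0; i < len; i++)
		sum = sum * 31 + buf[i];
	return sum;
}

int
main(void)
{
	unsigned char page[64] = {0};
	uint32_t	crc_computed = toy_checksum(page, sizeof(page));	/* "XLogInsert" CRC */

	page[0] = 0x18;				/* concurrent update after the CRC was taken */

	uint32_t	crc_on_replay = toy_checksum(page, sizeof(page));	/* what recovery sees */

	printf("record CRC matches: %s\n",
		   crc_computed == crc_on_replay ? "yes" : "no");
	return 0;
}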
On 4 April 2013 02:39, Andres Freund <andres@2ndquadrant.com> wrote:

> Ok, I think I see the bug. And I think its been introduced in the
> checkpoints patch.

Well spotted. (I think you mean checksums patch).

> If by now the first backend has proceeded to PageSetLSN() we are writing
> different data to disk than the one we computed the checksum of
> before. Boom.

Right, so nothing else we were doing was wrong, that's why we couldn't spot a bug. The problem is that we aren't replaying enough WAL because the checksum on the WAL record is broken.

> I think the whole locking interactions in MarkBufferDirtyHint() need to
> be thought over pretty carefully.

When we write out a buffer with checksums enabled, we take a copy of the buffer so that the checksum is consistent, even while other backends may be writing hints to the same buffer.

I missed out on doing that with XLOG_HINT records, so the WAL CRC can be incorrect because the data is scanned twice; normally that would be OK because we have an exclusive lock on the block, but with hints we only have share lock. So what we need to do is take a copy of the buffer before we do XLogInsert().

Simple patch to do this attached for discussion. (Not tested).

We might also do this by modifying the WAL record to take the whole block and bypass the BkpBlock mechanism entirely. But that's more work and doesn't seem like it would be any cleaner. I figure let's solve the problem first, then discuss which approach is best.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2013-04-04 13:30:40 +0100, Simon Riggs wrote:
> On 4 April 2013 02:39, Andres Freund <andres@2ndquadrant.com> wrote:
> > Ok, I think I see the bug. And I think its been introduced in the
> > checkpoints patch.
>
> Well spotted. (I think you mean checksums patch).

Heh, yes. I was slightly tired at that point ;)

> > If by now the first backend has proceeded to PageSetLSN() we are writing
> > different data to disk than the one we computed the checksum of
> > before. Boom.
>
> Right, so nothing else we were doing was wrong, that's why we couldn't
> spot a bug. The problem is that we aren't replaying enough WAL because
> the checksum on the WAL record is broken.

Well, there might be other bugs we can't see yet... But let's hope not.

> > I think the whole locking interactions in MarkBufferDirtyHint() need to
> > be thought over pretty carefully.
>
> I missed out on doing that with XLOG_HINT records, so the WAL CRC can
> be incorrect because the data is scanned twice; normally that would be
> OK because we have an exclusive lock on the block, but with hints we
> only have share lock. So what we need to do is take a copy of the
> buffer before we do XLogInsert().
>
> Simple patch to do this attached for discussion. (Not tested).
>
> We might also do this by modifying the WAL record to take the whole
> block and bypass the BkpBlock mechanism entirely. But that's more work
> and doesn't seem like it would be any cleaner. I figure let's solve the
> problem first, then discuss which approach is best.

Unfortunately I find that approach unacceptably ugly. I don't think it's ok to make an already *very* complicated function (XLogInsert()) in one of the most complicated files (xlog.c) even more complicated. It literally took me years to understand a large percentage of it. I even think the XLOG_HINT escape hatch should be taken out and be replaced by something outside of XLogInsert().

I don't think that approach would work with as few lines anyway, since you need to properly pass it into XLogCheckBuffer() et al, which checks the buffer tags and such - which it needs to properly compute the struct BkpBlock. Also we iterate over rdata->page for the CRC computation.

I think the route you quickly sketched is more realistic. That would remove all knowledge about XLOG_HINT from generic code, which is a very good thing; I spent like 15 minutes yesterday wondering whether the early return in there might be the cause of the bug...

Something like:

XLogRecPtr
XLogSaveBufferForHint(Buffer buffer)
{
	XLogRecPtr	recptr = InvalidXLogRecPtr;
	XLogRecPtr	lsn;
	XLogRecData rdata[2];
	BkpBlock	bkp;

	/*
	 * make sure no checkpoint can happen while we decide about logging
	 * something or not, since we don't hold any lock preventing that
	 * otherwise.
	 */
	Assert(MyPgXact->delayChkpt);

	/* update RedoRecPtr */
	GetRedoRecPtr();

	/* setup phony rdata element */
	rdata[0].buffer = buffer;
	rdata[0].buffer_std = true; /* is that actually ok? */

	/* force full pages on, otherwise checksums won't work? */
	if (XLogCheckBuffer(rdata, true, &lsn, &bkp))
	{
		char		copied_buffer[BLCKSZ];

		/*
		 * copy buffer so we don't have to worry about concurrent hint bit
		 * or lsn updates. We assume pd_lower/upper cannot be changed
		 * without an exclusive lock, so the contents of bkp are not racy.
		 */
		memcpy(copied_buffer, BufferGetBlock(buffer), BLCKSZ);

		rdata[0].data = (char *) &bkp;
		rdata[0].len = sizeof(BkpBlock);
		rdata[0].buffer = InvalidBuffer;
		rdata[0].buffer_std = false;
		rdata[0].next = &(rdata[1]);

		rdata[1].data = copied_buffer;
		rdata[1].len = BLCKSZ;
		rdata[1].buffer = InvalidBuffer;
		rdata[1].buffer_std = false;
		rdata[1].next = NULL;

		recptr = XLogInsert(RM_XLOG_ID, XLOG_HINT, rdata);
	}

	return recptr;
}

That should work?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 4 April 2013 15:53, Andres Freund <andres@2ndquadrant.com> wrote:

> Unfortunately I find that approach unacceptably ugly.

Yeh. If we can confirm it's a fix we can discuss a cleaner patch and that is much better.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres,

Thank you for diagnosing this problem!

On Thu, 2013-04-04 at 16:53 +0200, Andres Freund wrote:
> I think the route you quickly sketched is more realistic. That would
> remove all knowledge about XLOG_HINT from generic code, which is a very
> good thing; I spent like 15 minutes yesterday wondering whether the early
> return in there might be the cause of the bug...

I like this approach. It may have some performance impact though, because there are a couple extra spinlocks taken, and an extra memcpy.

The code looks good to me except that we should be consistent about the page hole -- XLogCheckBuffer is calculating it, but then we copy the entire page. I don't think anything can change the size of the page hole while we have a shared lock on the buffer, so it seems OK to skip the page hole during the copy.

Another possible approach is to drop the lock on the buffer and re-acquire it in exclusive mode after we find out we'll need to do XLogInsert. It means that MarkBufferDirtyHint may end up blocking for longer, but would avoid the memcpy. I haven't really thought through the details.

Regards,
Jeff Davis
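A minimal sketch of what "skip the page hole during the copy" could look like, assuming a standard page layout and valid pd_lower/pd_upper values (illustrative only, not code from any of the attached patches):

#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

/*
 * Copy a standard page while skipping the hole between pd_lower and
 * pd_upper, zero-filling it in the destination so the result is still a
 * complete BLCKSZ image.  'lower' and 'upper' are the page's
 * pd_lower/pd_upper, assumed not to change while at least a shared
 * content lock is held.
 */
static void
copy_page_skipping_hole(char *dst, const char *src,
						uint16_t lower, uint16_t upper)
{
	memcpy(dst, src, lower);							/* header + line pointers */
	memset(dst + lower, 0, (size_t) (upper - lower));	/* the hole */
	memcpy(dst + upper, src + upper, BLCKSZ - upper);	/* tuples + special space */
}

Whether the smaller amount of copied (and potentially logged) data is worth the extra bookkeeping is exactly the trade-off debated in the follow-up messages.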
On 2013-04-04 12:59:36 -0700, Jeff Davis wrote:
> I like this approach. It may have some performance impact though,
> because there are a couple extra spinlocks taken, and an extra memcpy.

I don't think it's really slower. Earlier the code took WALInsertLock every time, even if we ended up not logging anything. That's far more expensive than a single spinlock. And the copy should also only be taken in the case we need to log. So I think we end up ahead of the current state.

> The code looks good to me except that we should be consistent about the
> page hole -- XLogCheckBuffer is calculating it, but then we copy the
> entire page. I don't think anything can change the size of the page hole
> while we have a shared lock on the buffer, so it seems OK to skip the
> page hole during the copy.

I don't think it can change either, but I doubt that there's a performance advantage by not copying the hole. I'd guess the simpler code ends up faster.

> Another possible approach is to drop the lock on the buffer and
> re-acquire it in exclusive mode after we find out we'll need to do
> XLogInsert. It means that MarkBufferDirtyHint may end up blocking for
> longer, but would avoid the memcpy. I haven't really thought through the
> details.

That sounds like it would be prone to deadlocks. So I would dislike to go there.

Will write up a patch tomorrow.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Apr 4, 2013 at 5:30 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> If by now the first backend has proceeded to PageSetLSN() we are writing
> different data to disk than the one we computed the checksum of
> before. Boom.

Right, so nothing else we were doing was wrong, that's why we couldn't
spot a bug. The problem is that we aren't replaying enough WAL because
the checksum on the WAL record is broken.
This brings up a pretty frightening possibility to me, unrelated to data checksums. If a bit gets twiddled in the WAL file due to a hardware issue or a "cosmic ray", and then a crash happens, automatic recovery will stop early at the failed WAL checksum with an innocuous-looking message. The system will start up but will be invisibly inconsistent, and will proceed to overwrite that portion of the WAL file which contains the old data (real data that would have been necessary to reconstruct, once the corruption is finally realized) with an end-of-recovery checkpoint record and continue to chew up real data from there.
I don't know a solution here, though, other than trusting your hardware. Changing timelines upon ordinary crash recovery, not just media recovery, seems excessive but also seems to be exactly what timelines were invented for, right?
Back to the main topic here, Jeff Davis mentioned earlier "You'd still think this would cause incorrect results, but...". I didn't realize the significance of that until now. It does produce incorrect query results. I was just bailing out before detecting them. Once I specify ignore_checksum_failure=1 my test harness complains bitterly about the data not being consistent with what the Perl program knows it is supposed to be.
> I missed out on doing that with XLOG_HINT records, so the WAL CRC can
> be incorrect because the data is scanned twice; normally that would be
> OK because we have an exclusive lock on the block, but with hints we
> only have share lock. So what we need to do is take a copy of the
> buffer before we do XLogInsert().
>
> Simple patch to do this attached for discussion. (Not tested).
>
> We might also do this by modifying the WAL record to take the whole
> block and bypass the BkpBlock mechanism entirely. But that's more work
> and doesn't seem like it would be any cleaner. I figure let's solve the
> problem first, then discuss which approach is best.
I've tested your patch and it seems to do the job. It has survived far longer than unpatched ever did, with neither checksum warnings nor complaints of inconsistent data. (I haven't analyzed the code much, just the results, and leave the discussion of the best approach to others.)
Thanks,
Jeff
On Thu, 2013-04-04 at 14:21 -0700, Jeff Janes wrote:
> This brings up a pretty frightening possibility to me, unrelated to
> data checksums. If a bit gets twiddled in the WAL file due to a
> hardware issue or a "cosmic ray", and then a crash happens, automatic
> recovery will stop early at the failed WAL checksum with
> an innocuous-looking message. The system will start up but will be
> invisibly inconsistent, and will proceed to overwrite that portion of
> the WAL file which contains the old data (real data that would have
> been necessary to reconstruct, once the corruption is finally realized)
> with an end-of-recovery checkpoint record and continue to chew up
> real data from there.

I've been worried about that for a while, and I may have even seen something like this happen before.

We could perhaps do some checks, but in general it seems hard to solve without flushing some data to two different places. For example, you could flush WAL, and then update an LSN stored somewhere else indicating how far the WAL has been written. Recovery could complain if it gets an error in the WAL before that point.

But obviously, that makes WAL flushes expensive (in many cases, about twice as expensive).

Maybe it's not out of the question to offer that as an option if nobody has a better idea. Performance-conscious users could place the extra LSN on an SSD or NVRAM or something; or maybe use commit_delay or async commits. It would only need to store a few bytes.

Streaming replication mitigates the problem somewhat, by being a second place to write WAL.

Regards,
Jeff Davis
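A toy sketch of the "store the flushed LSN somewhere else" idea (the file name and layout are made up for the example; this is not proposed PostgreSQL code): after each WAL flush, the flushed LSN is persisted to a small side file, which recovery could consult before accepting a WAL error as a normal end of WAL.

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * After WAL has been flushed up to 'flushed_lsn', persist that LSN in a
 * small side file (which could live on a separate, fast device).  Crash
 * recovery could then refuse to treat a WAL error before this point as a
 * normal end of WAL.  The path "pg_wal_highwater" is invented for this
 * illustration.
 */
static int
record_flushed_lsn(uint64_t flushed_lsn)
{
	int			fd = open("pg_wal_highwater", O_WRONLY | O_CREAT, 0600);

	if (fd < 0)
		return -1;

	if (pwrite(fd, &flushed_lsn, sizeof(flushed_lsn), 0) != sizeof(flushed_lsn) ||
		fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}
	return close(fd);
}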
On Thu, 2013-04-04 at 22:39 +0200, Andres Freund wrote:
> I don't think it's really slower. Earlier the code took WALInsertLock
> every time, even if we ended up not logging anything. That's far more
> expensive than a single spinlock. And the copy should also only be taken
> in the case we need to log. So I think we end up ahead of the current
> state.

Good point.

> > The code looks good to me except that we should be consistent about the
> > page hole -- XLogCheckBuffer is calculating it, but then we copy the
> > entire page. I don't think anything can change the size of the page hole
> > while we have a shared lock on the buffer, so it seems OK to skip the
> > page hole during the copy.
>
> I don't think it can change either, but I doubt that there's a
> performance advantage by not copying the hole. I'd guess the simpler
> code ends up faster.

I was thinking more about the WAL size, but I don't have a strong opinion.

Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> On Thu, 2013-04-04 at 14:21 -0700, Jeff Janes wrote:
>> This brings up a pretty frightening possibility to me, unrelated to
>> data checksums. If a bit gets twiddled in the WAL file due to a
>> hardware issue or a "cosmic ray", and then a crash happens, automatic
>> recovery will stop early at the failed WAL checksum with
>> an innocuous-looking message. The system will start up but will be
>> invisibly inconsistent, and will proceed to overwrite that portion of
>> the WAL file which contains the old data (real data that would have
>> been necessary to reconstruct, once the corruption is finally realized)
>> with an end-of-recovery checkpoint record and continue to chew up
>> real data from there.

> I've been worried about that for a while, and I may have even seen
> something like this happen before. We could perhaps do some checks, but
> in general it seems hard to solve without flushing some data to
> two different places. For example, you could flush WAL, and then update
> an LSN stored somewhere else indicating how far the WAL has been
> written. Recovery could complain if it gets an error in the WAL before
> that point.

> But obviously, that makes WAL flushes expensive (in many cases, about
> twice as expensive).

> Maybe it's not out of the question to offer that as an option if nobody
> has a better idea. Performance-conscious users could place the extra LSN
> on an SSD or NVRAM or something; or maybe use commit_delay or async
> commits. It would only need to store a few bytes.

At least on traditional rotating media, this is only going to perform well if you dedicate two drives to the purpose. At which point you might as well just say "let's write two copies of WAL". Or maybe three copies, so that you can take a vote when they disagree. While this is not so unreasonable on its face for ultra-high-reliability requirements, I can't escape the feeling that we'd just be reinventing software RAID. There's no reason to think that we can deal with this class of problems better than the storage system can.

> Streaming replication mitigates the problem somewhat, by being a second
> place to write WAL.

Yeah, if you're going to do this at all it makes more sense for the redundant copies to be on other machines. So the questions that that leads to are how smart is our SR code about dealing with a master that tries to re-send WAL regions it already sent, and what a slave ought to do in such a situation if the new data doesn't match.

			regards, tom lane
On Thu, 2013-04-04 at 21:06 -0400, Tom Lane wrote:
> I can't escape the feeling that we'd just be reinventing software RAID.
> There's no reason to think that we can deal with this class of problems
> better than the storage system can.

The goal would be to reliably detect a situation where WAL that has been flushed successfully was later corrupted; not necessarily to recover when we find it. Unless it's something like ZFS, I don't think most software RAID will reliably detect corruption. A side benefit is that it would also help catch bugs like this in the future.

I'm not advocating for this particular solution, but I do concur with Jeff Janes that it's a little bit scary and that we can entertain some ideas about how to mitigate it.

Regards,
Jeff Davis
On Apr 4, 2013, at 23:21, Jeff Janes <jeff.janes@gmail.com> wrote:
> This brings up a pretty frightening possibility to me, unrelated to data checksums. If a bit gets twiddled in the WAL file due to a hardware issue or a "cosmic ray", and then a crash happens, automatic recovery will stop early at the failed WAL checksum with an innocuous-looking message. The system will start up but will be invisibly inconsistent, and will proceed to overwrite that portion of the WAL file which contains the old data (real data that would have been necessary to reconstruct, once the corruption is finally realized) with an end-of-recovery checkpoint record and continue to chew up real data from there.

Maybe we could scan forward to check whether a corrupted WAL record is followed by one or more valid ones with sensible LSNs. If it is, chances are high that we haven't actually hit the end of the WAL. In that case, we could either log a warning, or (better, probably) abort crash recovery. The user would then need to either restore the broken WAL segment from backup, or override the check by e.g. setting recovery_target_record="invalid_record". (The default would be recovery_target_record="last_record". The name of the GUC tries to be consistent with existing recovery.conf settings, even though it affects crash recovery, not archive recovery.)

Corruption of fields which we require to scan past the record would cause false negatives, i.e. not trigger an error even though we do abort recovery mid-way through. There's a risk of false positives too, but they require quite specific orderings of writes and thus seem rather unlikely. (AFAICS, the OS would have to write some parts of record N followed by the whole of record N+1 and then crash to cause a false positive.)

best regards,
Florian Pflug
On 2013-04-04 17:39:16 -0700, Jeff Davis wrote:
> On Thu, 2013-04-04 at 22:39 +0200, Andres Freund wrote:
> > I don't think it can change either, but I doubt that there's a
> > performance advantage by not copying the hole. I'd guess the simpler
> > code ends up faster.
>
> I was thinking more about the WAL size, but I don't have a strong
> opinion.

I was just a bit dense. No idea what I missed there.

How does the attached version look? I verified that it survives recovery, but not more.

Jeff, any chance you can run this for a round with your suite?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Fri, 2013-04-05 at 15:09 +0200, Andres Freund wrote:
> How does the attached version look? I verified that it survives
> recovery, but not more.

Comments:

* Regarding full page writes, we can:
  - always write full pages (as in your current patch), regardless of the current settings
  - take WALInsertLock briefly to get the current settings from XLogCtl
  - always calculate the full page record, but in XLogInsert, if it happens to be a HINT record, and full page writes are not necessary, then discard it (right now I somewhat favor this option but I haven't looked at the details)

* typo in "Backup blocks are not used in **xlog xlog** records"

* To get the appropriate setting for buffer_std, we should pass it down through MarkBufferDirty as an extra flag, I think.

* I'm a bit worried that we'll need a cleanup lock when restoring an FSM page. What do you think?

* In xlog_redo, it seemed slightly awkward to call XLogRecGetData twice. Merely a matter of preference but I thought I would mention it.

> Jeff, any chance you can run this for a round with your suite?

Yes. I don't have a rigorous test suite, but I'll do some manual tests and walk through it with gdb.

Regards,
Jeff Davis
On Fri, 2013-04-05 at 10:34 +0200, Florian Pflug wrote:
> Maybe we could scan forward to check whether a corrupted WAL record is
> followed by one or more valid ones with sensible LSNs. If it is,
> chances are high that we haven't actually hit the end of the WAL. In
> that case, we could either log a warning, or (better, probably) abort
> crash recovery.

+1.

> Corruption of fields which we require to scan past the record would
> cause false negatives, i.e. not trigger an error even though we do
> abort recovery mid-way through. There's a risk of false positives too,
> but they require quite specific orderings of writes and thus seem
> rather unlikely. (AFAICS, the OS would have to write some parts of
> record N followed by the whole of record N+1 and then crash to cause a
> false positive.)

Does the xlp_pageaddr help solve this?

Also, we'd need to be a little careful when written-but-not-flushed WAL data makes it to disk, which could cause a false positive and may be a fairly common case.

Regards,
Jeff Davis
On Fri, Apr 5, 2013 at 8:09 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>
> How does the attached version look? I verified that it survives
> recovery, but not more.
>

I still got errors when executing make installcheck in a just compiled 9.3devel + this_patch; this is when setting wal_level higher than minimal.

Attached regression.diffs

--
Jaime Casanova                      www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566         Cell: +593 987171157
Attachment
On Fri, 2013-04-05 at 19:22 -0500, Jaime Casanova wrote:
> I still got errors when executing make installcheck in a just compiled
> 9.3devel + this_patch; this is when setting wal_level higher than
> minimal.
> Attached regression.diffs

Did you also apply the other patch here:

http://www.postgresql.org/message-id/1364340203.21411.143.camel@sussancws0025

?

I think that is a completely separate issue. Simon, should that patch be applied or are you waiting on the issue Jeff Janes raised to be resolved, as well?

Regards,
Jeff Davis
On Fri, Apr 5, 2013 at 7:39 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> Did you also apply the other patch here:
>
> http://www.postgresql.org/message-id/1364340203.21411.143.camel@sussancws0025
>

i missed that one... with both applied, the first problems disappear... will try some more tests later

--
Jaime Casanova                      www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566         Cell: +593 987171157
On 2013-04-05 16:29:47 -0700, Jeff Davis wrote:
> Comments:
>
> * Regarding full page writes, we can:
>   - always write full pages (as in your current patch), regardless of the current settings
>   - take WALInsertLock briefly to get the current settings from XLogCtl
>   - always calculate the full page record, but in XLogInsert, if it happens to be a HINT record, and full page writes are not necessary, then discard it (right now I somewhat favor this option but I haven't looked at the details)

I feel pretty strongly that we shouldn't add any such complications to XLogInsert() itself, it's complicated enough already and it should be made simpler, not more complicated.

I think we can just make up the rule that changing full page writes also requires SpinLockAcquire(&xlogctl->info_lck). Then it's easy enough. And it can hardly be a performance bottleneck given how infrequently it's modified.

In retrospect I think making up the rule that full_page_writes changes imply a checkpoint would have made things easier performance- and code-wise.

> * typo in "Backup blocks are not used in **xlog xlog** records"

That's just me being "funny" after some long days ;). See, it's an 'xlog' 'xlog record'.

> * To get the appropriate setting for buffer_std, we should pass it down
> through MarkBufferDirty as an extra flag, I think.

Or just declare as part of the API that only std buffers are allowed to be passed down. After a quick look it looks like all callers use enough of the standard page layout to make that viable. But that really was just a quick look.

> * I'm a bit worried that we'll need a cleanup lock when restoring an FSM
> page. What do you think?

I don't yet see why; while recovery is ongoing there shouldn't be anyone doing anything concurrently to it but startup itself?

> * In xlog_redo, it seemed slightly awkward to call XLogRecGetData twice.
> Merely a matter of preference but I thought I would mention it.

You're absolutely right, memcpy should have gotten passed 'data', not XLogRecGetData().

> > Jeff, any chance you can run this for a round with your suite?
>
> Yes. I don't have a rigorous test suite, but I'll do some manual tests
> and walk through it with gdb.

Heh, in this and only this I was talking to Jeff Janes. Strangely I never had noticed that you share the same name ;)

Thanks!

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Apr 5, 2013 at 6:09 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> How does the attached version look? I verified that it survives
> recovery, but not more.
>
> Jeff, any chance you can run this for a round with your suite?
I've run it for a while now and have found no problems.
Thanks,
Jeff
On Sat, Apr 6, 2013 at 1:36 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Fri, Apr 5, 2013 at 6:09 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>>
>> How does the attached version look? I verified that it survives
>> recovery, but not more.
>>
>> Jeff, any chance you can run this for a round with your suite?
>
> I've run it for a while now and have found no problems.

fwiw, i have run installcheck (serial and parallel) and isolationtest, also combinations of those (one installcheck, one isolationtest) at the same time while executing vacuum full, reindex database and manual checkpoint... i also checked that the bgwriter was doing some work.

i did all of this in a master node in a cluster with a standby and a cascade standby that were later promoted... and i have no problem with checksums at all, so i would say that the combination of Jeff's and Andres' patches fixed the problems we have seen until now

--
Jaime Casanova                      www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566         Cell: +593 987171157
On 6 April 2013 15:44, Andres Freund <andres@2ndquadrant.com> wrote:
> * In xlog_redo, it seemed slightly awkward to call XLogRecGetData twice.
> Merely a matter of preference but I thought I would mention it.

You're absolutely right, memcpy should have been passed 'data', not
XLogRecGetData().
Applied, with this as the only code change.
Thanks everybody for good research and coding and fast testing.
We're in good shape now.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, 2013-04-08 at 09:19 +0100, Simon Riggs wrote:
> Applied, with this as the only code change.
>
> Thanks everybody for good research and coding and fast testing.
>
> We're in good shape now.

Thank you. I have attached two more patches:

1. I believe that the issue I brought up at the end of this email:

   http://www.postgresql.org/message-id/1365035537.7580.380.camel@sussancws0025

   is a real issue. In lazy_vacuum_page(), the following sequence can happen
   when checksums are on:

   a. PageSetAllVisible
   b. Pass heap page to visibilitymap_set
   c. visibilitymap_set logs the heap page and sets the LSN
   d. MarkBufferDirty

   If a checkpoint happens between (c) and (d), then we have a problem. The
   fix is easy: just mark the heap buffer dirty first. There's another call
   site that looks like a potential problem, but I don't think it is. I
   simplified the code there to make it (hopefully) more clearly correct.

2. A cleanup patch to pass the buffer_std flag down through
   MarkBufferDirtyHint. This is a matter of preference and purely cosmetic,
   so it might not be wanted. The reason I thought it was useful is that a
   future caller that sets a hint on a non-standard buffer might easily miss
   the assumption that we have a standard buffer.

Regards,
Jeff Davis
Attachment
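For concreteness, here is a minimal sketch of the corrected ordering described
in point 1 above, following the transam/README rule that a buffer must be
marked dirty before WAL covering it is emitted. This is illustrative only, not
the committed patch; the variable names (heap_buffer, vmbuffer, onerel, and so
on) and the exact visibilitymap_set() call shape are assumptions.

    START_CRIT_SECTION();

    /* (a) set the flag on the heap page */
    PageSetAllVisible(page);

    /*
     * (b) mark the heap buffer dirty *before* any WAL is written for it, so
     * a checkpoint that scans the buffer pool between the WAL insert and the
     * dirtying cannot skip this page.
     */
    MarkBufferDirty(heap_buffer);

    /*
     * (c)/(d) now visibilitymap_set() may log the change; with checksums
     * enabled it also stamps the heap page's LSN to guard against torn pages.
     */
    visibilitymap_set(onerel, blkno, heap_buffer, InvalidXLogRecPtr,
                      vmbuffer, visibility_cutoff_xid);

    END_CRIT_SECTION();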
On Sat, 2013-04-06 at 16:44 +0200, Andres Freund wrote:
> I think we can just make up the rule that changing full_page_writes also
> requires SpinLockAcquire(&xlogctl->info_lck). Then it's easy enough. And
> it can hardly be a performance bottleneck given how infrequently it's
> modified.

That seems like a good idea to me. As it stands, checksums basically force
full page writes to be on; so we should either fix that or document it.

> In retrospect I think making up the rule that full_page_writes changes
> imply a checkpoint would have made things easier performance- and
> code-wise.

I don't even see why we allow changing that while the server is running.
Either the I/O system requires it for safety or not, right?

Regards,
Jeff Davis
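A rough sketch of the locking rule under discussion; the shared field name
below is an assumption made for illustration, not necessarily the real XLogCtl
member:

    /* writer side: anyone changing the shared full-page-writes setting */
    SpinLockAcquire(&xlogctl->info_lck);
    xlogctl->sharedFullPageWrites = newval;    /* hypothetical field name */
    SpinLockRelease(&xlogctl->info_lck);

    /* reader side: take the same spinlock briefly instead of WALInsertLock */
    SpinLockAcquire(&xlogctl->info_lck);
    doPageWrites = xlogctl->sharedFullPageWrites;
    SpinLockRelease(&xlogctl->info_lck);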
On Sat, Apr 6, 2013 at 10:44 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I feel pretty strongly that we shouldn't add any such complications to
> XLogInsert() itself; it's complicated enough already and it should be
> made simpler, not more complicated.

+1, emphatically. XLogInsert is a really nasty scalability bottleneck. We
need to move as much logic out of that function as possible, and particularly
out from under WALInsertLock.

...Robert
On 11 April 2013 00:37, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Apr 6, 2013 at 10:44 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I feel pretty strongly that we shouldn't add any such complications to
>> XLogInsert() itself; it's complicated enough already and it should be
>> made simpler, not more complicated.
>
> +1, emphatically. XLogInsert is a really nasty scalability
> bottleneck. We need to move as much logic out of that function as
> possible, and particularly out from under WALInsertLock.
Andres' patch was applied, so not sure what you mean by +1ing a comment made in relation to that patch.
I'm aware that WALInsertLock is a bottleneck and am not going to be making that worse.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 9 April 2013 08:36, Jeff Davis <pgsql@j-davis.com> wrote: > 1. I believe that the issue I brought up at the end of this email: > > http://www.postgresql.org/message-id/1365035537.7580.380.camel@sussancws0025 > > is a real issue. In lazy_vacuum_page(), the following sequence can > happen when checksums are on: > > a. PageSetAllVisible > b. Pass heap page to visibilitymap_set > c. visibilitymap_set logs the heap page and sets the LSN > d. MarkBufferDirty > > If a checkpoint happens between (c) and (d), then we have a problem. The > fix is easy: just mark the heap buffer dirty first. There's another call > site that looks like a potential problem, but I don't think it is. I > simplified the code there to make it (hopefully) more clearly correct. Applied > 2. A cleanup patch to pass the buffer_std flag down through > MarkBufferDirtyHint. This is a matter of preference and purely cosmetic, > so it might not be wanted. The reason I thought it was useful is that a > future caller that sets a hint on a non-standard buffer might easily > miss the assumption that we have a standard buffer. Skipped that for now. Do we really need it? Can you set a hint on a non-standard buffer? --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Apr 30, 2013 at 6:58 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 9 April 2013 08:36, Jeff Davis <pgsql@j-davis.com> wrote: > >> 1. I believe that the issue I brought up at the end of this email: >> >> http://www.postgresql.org/message-id/1365035537.7580.380.camel@sussancws0025 >> >> is a real issue. In lazy_vacuum_page(), the following sequence can >> happen when checksums are on: >> >> a. PageSetAllVisible >> b. Pass heap page to visibilitymap_set >> c. visibilitymap_set logs the heap page and sets the LSN >> d. MarkBufferDirty >> >> If a checkpoint happens between (c) and (d), then we have a problem. The >> fix is easy: just mark the heap buffer dirty first. There's another call >> site that looks like a potential problem, but I don't think it is. I >> simplified the code there to make it (hopefully) more clearly correct. > > Applied Uh, wait a minute. I think this is completely wrong. The buffer is LOCKED for this entire sequence of operations. For a checkpoint to "happen", it's got to write every buffer, which it will not be able to do for so long as the buffer is locked. The effect of the change to lazy_scan_heap is to force the buffer to be written even if we're only updating the visibility map page. That's a bad idea and should be reverted. The change to lazy_vacuum_page is harmless, but the comment is wrong. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 30 April 2013 13:34, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Apr 30, 2013 at 6:58 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 9 April 2013 08:36, Jeff Davis <pgsql@j-davis.com> wrote: >> >>> 1. I believe that the issue I brought up at the end of this email: >>> >>> http://www.postgresql.org/message-id/1365035537.7580.380.camel@sussancws0025 >>> >>> is a real issue. In lazy_vacuum_page(), the following sequence can >>> happen when checksums are on: >>> >>> a. PageSetAllVisible >>> b. Pass heap page to visibilitymap_set >>> c. visibilitymap_set logs the heap page and sets the LSN >>> d. MarkBufferDirty >>> >>> If a checkpoint happens between (c) and (d), then we have a problem. The >>> fix is easy: just mark the heap buffer dirty first. There's another call >>> site that looks like a potential problem, but I don't think it is. I >>> simplified the code there to make it (hopefully) more clearly correct. >> >> Applied > > Uh, wait a minute. I think this is completely wrong. Thanks for your review. > The buffer is > LOCKED for this entire sequence of operations. For a checkpoint to > "happen", it's got to write every buffer, which it will not be able to > do for so long as the buffer is locked. I was following the logic we use for WAL: mark the buffer dirty, then write WAL. The important thing is for us to signal that the buffer should be written, which we do by marking it dirty. The actual writing of the buffer comes later, often minutes later. > The effect of the change to lazy_scan_heap is to force the buffer to > be written even if we're only updating the visibility map page. > That's a bad idea and should be reverted. OK, agreed. Will wait for this discussion to end before acting though, so I'm sure exactly what you mean. > The change to > lazy_vacuum_page is harmless, but the comment is wrong. In what way, wrong? What should it say? --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, 2013-04-30 at 08:34 -0400, Robert Haas wrote: > Uh, wait a minute. I think this is completely wrong. The buffer is > LOCKED for this entire sequence of operations. For a checkpoint to > "happen", it's got to write every buffer, which it will not be able to > do for so long as the buffer is locked. I went back and forth on this, so you could be right, but here was my reasoning: I was worried because SyncOneBuffer checks whether it needs writing without taking a content lock, so the exclusive lock doesn't help. That makes sense, because you don't want a checkpoint to have to get a content lock on every buffer in the buffer pool. But it also means we need to follow the rules laid out in transam/README and dirty the pages before writing WAL. > The effect of the change to lazy_scan_heap is to force the buffer to > be written even if we're only updating the visibility map page. > That's a bad idea and should be reverted. The only time the VM and the data page are out of sync during vacuum is after a crash, right? If that's the case, I didn't think it was a big deal to dirty one extra page (should be extremely rare). Am I missing something? The reason I removed that special case was just code complexity/readability. I tried preserving the previous behavior, and it's not so bad, but it seemed unnecessarily ugly for the benefit of a rare case. Regards,Jeff Davis
On 30 April 2013 22:54, Jeff Davis <pgsql@j-davis.com> wrote: > On Tue, 2013-04-30 at 08:34 -0400, Robert Haas wrote: >> Uh, wait a minute. I think this is completely wrong. The buffer is >> LOCKED for this entire sequence of operations. For a checkpoint to >> "happen", it's got to write every buffer, which it will not be able to >> do for so long as the buffer is locked. > > I went back and forth on this, so you could be right, but here was my > reasoning: > > I was worried because SyncOneBuffer checks whether it needs writing > without taking a content lock, so the exclusive lock doesn't help. That > makes sense, because you don't want a checkpoint to have to get a > content lock on every buffer in the buffer pool. But it also means we > need to follow the rules laid out in transam/README and dirty the pages > before writing WAL. > >> The effect of the change to lazy_scan_heap is to force the buffer to >> be written even if we're only updating the visibility map page. >> That's a bad idea and should be reverted. > > The only time the VM and the data page are out of sync during vacuum is > after a crash, right? If that's the case, I didn't think it was a big > deal to dirty one extra page (should be extremely rare). Am I missing > something? > > The reason I removed that special case was just code > complexity/readability. I tried preserving the previous behavior, and > it's not so bad, but it seemed unnecessarily ugly for the benefit of a > rare case. All of that makes perfect sense to me. Waiting to hear back from Robert whether he still has an objection. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Apr 30, 2013 at 5:54 PM, Jeff Davis <pgsql@j-davis.com> wrote: > On Tue, 2013-04-30 at 08:34 -0400, Robert Haas wrote: >> Uh, wait a minute. I think this is completely wrong. The buffer is >> LOCKED for this entire sequence of operations. For a checkpoint to >> "happen", it's got to write every buffer, which it will not be able to >> do for so long as the buffer is locked. > > I went back and forth on this, so you could be right, but here was my > reasoning: > > I was worried because SyncOneBuffer checks whether it needs writing > without taking a content lock, so the exclusive lock doesn't help. That > makes sense, because you don't want a checkpoint to have to get a > content lock on every buffer in the buffer pool. But it also means we > need to follow the rules laid out in transam/README and dirty the pages > before writing WAL. On further review, I think you're right about this. We'd have a problem if the following sequence of events were to occur: 1. vacuum writes the WAL record, but does not yet mark the buffer dirty, and then goes into the tank 2. checkpointer decides where the checkpoint will begin 3. buffer pool is scanned as part of the checkpoint process, observing target buffer not to be dirty 4. vacuum finally wakes up and marks the buffer dirty 5. crash, after WAL is flushed but before buffer is written However, on looking at this a little more, shouldn't we be doing log_heap_clean() *before* we do visibilitymap_set()? >> The effect of the change to lazy_scan_heap is to force the buffer to >> be written even if we're only updating the visibility map page. >> That's a bad idea and should be reverted. > > The only time the VM and the data page are out of sync during vacuum is > after a crash, right? If that's the case, I didn't think it was a big > deal to dirty one extra page (should be extremely rare). Am I missing > something? > > The reason I removed that special case was just code > complexity/readability. I tried preserving the previous behavior, and > it's not so bad, but it seemed unnecessarily ugly for the benefit of a > rare case. Well, I don't find it that ugly, but then again -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, May 1, 2013 at 11:29 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I was worried because SyncOneBuffer checks whether it needs writing >> without taking a content lock, so the exclusive lock doesn't help. That >> makes sense, because you don't want a checkpoint to have to get a >> content lock on every buffer in the buffer pool. But it also means we >> need to follow the rules laid out in transam/README and dirty the pages >> before writing WAL. > > On further review, I think you're right about this. We'd have a > problem if the following sequence of events were to occur: > > 1. vacuum writes the WAL record, but does not yet mark the buffer > dirty, and then goes into the tank > 2. checkpointer decides where the checkpoint will begin > 3. buffer pool is scanned as part of the checkpoint process, observing > target buffer not to be dirty > 4. vacuum finally wakes up and marks the buffer dirty > 5. crash, after WAL is flushed but before buffer is written > > However, on looking at this a little more, shouldn't we be doing > log_heap_clean() *before* we do visibilitymap_set()? Hit send too soon. To finish that thought: right now the order of operations is... 1. Perform the cleanup operations on the buffer. 2. Set the visibility map bit. 3. Log the visibility map change. 4. Log the cleanup operations. It seems to me that it would be better to do (4) as close to (1) as possible. Isn't it bad if the operations are replayed in an order that differs from the order in which they were performed - or if, say, the first WAL record were replayed but the second were not, for whatever reason? >>> The effect of the change to lazy_scan_heap is to force the buffer to >>> be written even if we're only updating the visibility map page. >>> That's a bad idea and should be reverted. >> >> The only time the VM and the data page are out of sync during vacuum is >> after a crash, right? If that's the case, I didn't think it was a big >> deal to dirty one extra page (should be extremely rare). Am I missing >> something? >> >> The reason I removed that special case was just code >> complexity/readability. I tried preserving the previous behavior, and >> it's not so bad, but it seemed unnecessarily ugly for the benefit of a >> rare case. > > Well, I don't find it that ugly, but then again ...I've done a fair amount of hacking on this code. The scenario that I find problematic here is that you could easily lose a couple of visibility map pages in a crash. Each one you lose, with this patch, potentially involves half a gigabyte of unnecessary disk writes, if the relation is being vacuumed at the time. For the few extra lines of code it takes to protect against that scenario, I think we should protect against it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1 May 2013 16:33, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, May 1, 2013 at 11:29 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> I was worried because SyncOneBuffer checks whether it needs writing >>> without taking a content lock, so the exclusive lock doesn't help. That >>> makes sense, because you don't want a checkpoint to have to get a >>> content lock on every buffer in the buffer pool. But it also means we >>> need to follow the rules laid out in transam/README and dirty the pages >>> before writing WAL. >> >> On further review, I think you're right about this. We'd have a >> problem if the following sequence of events were to occur: >> >> 1. vacuum writes the WAL record, but does not yet mark the buffer >> dirty, and then goes into the tank >> 2. checkpointer decides where the checkpoint will begin >> 3. buffer pool is scanned as part of the checkpoint process, observing >> target buffer not to be dirty >> 4. vacuum finally wakes up and marks the buffer dirty >> 5. crash, after WAL is flushed but before buffer is written >> >> However, on looking at this a little more, shouldn't we be doing >> log_heap_clean() *before* we do visibilitymap_set()? > > Hit send too soon. > > To finish that thought: right now the order of operations is... > > 1. Perform the cleanup operations on the buffer. > 2. Set the visibility map bit. > 3. Log the visibility map change. > 4. Log the cleanup operations. > > It seems to me that it would be better to do (4) as close to (1) as > possible. Isn't it bad if the operations are replayed in an order > that differs from the order in which they were performed - or if, say, > the first WAL record were replayed but the second were not, for > whatever reason? I agree, but that was in the original coding wasn't it? Why aren't we writing just one WAL record for this action? We use a single WAL record in other places where we make changes to multiple blocks with multiple full page writes, e.g. index block split. That would make the action atomic and we'd just have this... 1. Perform the cleanup operations on the buffer. 2. Set the visibility map bit. 3. Log the cleanup operations and visibility map change. which can then be replayed with correct sequence, locking etc. and AFAICS would likely be faster also. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, May 1, 2013 at 1:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I agree, but that was in the original coding wasn't it?

I believe the problem was introduced by this commit:

commit fdf9e21196a6f58c6021c967dc5776a16190f295
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   Wed Feb 13 17:46:23 2013 +0200

    Update visibility map in the second phase of vacuum.

    There's a high chance that a page becomes all-visible when the second phase
    of vacuum removes all the dead tuples on it, so it makes sense to check for
    that. Otherwise the visibility map won't get updated until the next vacuum.

    Pavan Deolasee, reviewed by Jeff Janes.

> Why aren't we writing just one WAL record for this action? We use a
> single WAL record in other places where we make changes to multiple
> blocks with multiple full page writes, e.g. index block split. That
> would make the action atomic and we'd just have this...
>
> 1. Perform the cleanup operations on the buffer.
> 2. Set the visibility map bit.
> 3. Log the cleanup operations and visibility map change.
>
> which can then be replayed with correct sequence, locking etc.
> and AFAICS would likely be faster also.

I thought about that, too. It certainly seems like more than we want to try
to do for 9.3 at this point. The other complication is that there's a lot of
conditional logic here. We're definitely going to emit a cleanup record. We're
going to emit a record to make the page all-visible only sometimes, because
it might not be all-visible yet: it could have tuples on it that are deleted
but not yet dead. And then there's additional logic to handle the checksum
case. Plus, the all-visible marking can happen in other code paths, too,
specifically in phase 1 of vacuum. So it might be possible to consolidate
this, but off-hand it looks messy to me out of proportion to the benefits.

Now that I'm looking at this, I'm a bit confused by the new logic in
visibilitymap_set(). When checksums are enabled, we set the page LSN, which
is described like this: "we need to protect the heap page from being torn".
But how does setting the page LSN do that? Don't we need to mark the buffer
dirty or something like that? Sorry if this is a dumb question.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, 2013-05-01 at 11:33 -0400, Robert Haas wrote: > >> The only time the VM and the data page are out of sync during vacuum is > >> after a crash, right? If that's the case, I didn't think it was a big > >> deal to dirty one extra page (should be extremely rare). Am I missing > >> something? > >> > >> The reason I removed that special case was just code > >> complexity/readability. I tried preserving the previous behavior, and > >> it's not so bad, but it seemed unnecessarily ugly for the benefit of a > >> rare case. > > > > Well, I don't find it that ugly, but then again > > ...I've done a fair amount of hacking on this code. The scenario that > I find problematic here is that you could easily lose a couple of > visibility map pages in a crash. Each one you lose, with this patch, > potentially involves half a gigabyte of unnecessary disk writes, if > the relation is being vacuumed at the time. For the few extra lines > of code it takes to protect against that scenario, I think we should > protect against it. I'm still unclear on how we'd lose so many VM bits at once, because they are logged. I see how we can lose one if the heap page makes it to disk before the VM bit's WAL is flushed, but that seems to be bounded to me. Regardless, you have a reasonable claim that my patch had effects that were not necessary. I have attached a draft patch to remedy that. Only rudimentary testing was done. Regards, Jeff Davis
Attachment
On 1 May 2013 19:16, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, May 1, 2013 at 1:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> I agree, but that was in the original coding wasn't it? > > I believe the problem was introduced by this commit: > > commit fdf9e21196a6f58c6021c967dc5776a16190f295 > Author: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Wed Feb 13 17:46:23 2013 +0200 > > Update visibility map in the second phase of vacuum. > > There's a high chance that a page becomes all-visible when the second phase > of vacuum removes all the dead tuples on it, so it makes sense to check for > that. Otherwise the visibility map won't get updated until the next vacuum. > > Pavan Deolasee, reviewed by Jeff Janes. > >> Why aren't we writing just one WAL record for this action? We use a >> single WAL record in other places where we make changes to multiple >> blocks with multiple full page writes, e.g. index block split. That >> would make the action atomic and we'd just have this... >> >> 1. Perform the cleanup operations on the buffer. >> 2. Set the visibility map bit. >> 3. Log the cleanup operations and visibility map change. >> >> which can then be replayed with correct sequence, locking etc. >> and AFAICS would likely be faster also. > > I thought about that, too. It certainly seems like more than we want > to try to do for 9.3 at this point. The other complication is that > there's a lot of conditional logic here. We're definitely going to > emit a cleanup record. We're going to emit a record to make the page > all-visible only sometimes, because it might not be all-visible yet: > it could have tuples on it that are deleted but not yet dead. And > then there's additional logic to handle the checksum case. Plus, the > all-visible marking can happen in other code paths, too, specifically > in phase 1 of vacuum. So it might be possible to consolidate this, > but off-hand it looks messy to me out of proportion to the benefits. Looks easy. There is no additional logic for checksums, so there's no third complexity. So we either have * cleanup info with vismap setting info * cleanup info only which is the same number of WAL records as we have now, just that we never emit 2 records when one will do. > Now that I'm looking at this, I'm a bit confused by the new logic in > visibilitymap_set(). When checksums are enabled, we set the page LSN, > which is described like this: "we need to protect the heap page from > being torn". But how does setting the page LSN do that? It doesn't > Don't we > need to mark the buffer dirty or something like that? We do. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, 2013-05-01 at 14:16 -0400, Robert Haas wrote: > Now that I'm looking at this, I'm a bit confused by the new logic in > visibilitymap_set(). When checksums are enabled, we set the page LSN, > which is described like this: "we need to protect the heap page from > being torn". But how does setting the page LSN do that? Don't we > need to mark the buffer dirty or something like that? Sorry if this > is a dumb question. The caller is supposed to dirty the heap page when setting PD_ALL_VISIBLE. I thought I said that explicitly in the comments somewhere, but now I don't see it. It is slightly awkward to put the comment about the page being torn just before setting the LSN. Feel free to move/remove/reword if you can think of something better. Because setting PD_ALL_VISIBLE does not write WAL by itself, my design was to have it piggyback on the VM WAL record if checksums are turned on. It would be nice to just use MarkBufferDirtyHint, but that does not guarantee that the page will actually be marked dirty (see comments); and we really need it to be properly marked dirty (otherwise we risk the VM bit making it and the heap page not). I also explored the idea of passing more options to MarkBufferDirty so that we could force it to do what we want in each case, but decided against it: http://www.postgresql.org/message-id/1357798000.5162.38.camel@jdavis-laptop Regards,Jeff Davis
On Wed, 2013-05-01 at 20:06 +0100, Simon Riggs wrote: > >> Why aren't we writing just one WAL record for this action? ... > > > > I thought about that, too. It certainly seems like more than we want > > to try to do for 9.3 at this point. The other complication is that > > there's a lot of conditional logic here. ... > Looks easy. There is no additional logic for checksums, so there's no > third complexity. > > So we either have > * cleanup info with vismap setting info > * cleanup info only > > which is the same number of WAL records as we have now, just that we > never emit 2 records when one will do. What if we only set PD_ALL_VISIBLE and the VM bit, and make no other changes? Right now, that generates no WAL for the heap page (unless checksums are enabled, of course), which means no full page images for the heap pages. I think we have to special-case that somehow, and I believe that's the conditional logic that Robert is referring to. That being said, there may be a simpler way to achieve the same result, and that may be what you have in mind. Regards,Jeff Davis
On 1 May 2013 20:40, Jeff Davis <pgsql@j-davis.com> wrote:
>> Looks easy. There is no additional logic for checksums, so there's no
>> third complexity.
>>
>> So we either have
>> * cleanup info with vismap setting info
>> * cleanup info only
>>
>> which is the same number of WAL records as we have now, just that we
>> never emit 2 records when one will do.
>
> What if we only set PD_ALL_VISIBLE and the VM bit, and make no other
> changes? Right now, that generates no WAL for the heap page (unless
> checksums are enabled, of course), which means no full page images for
> the heap pages.

The only place I see that code path is when we clear up a heap page that is
empty. Is that what you meant? Empty pages are pretty rare, so I guess I
meant that we should just get rid of the vismap update when we see that.
Since we're adding the block straight back into the FSM, it won't stay
all-visible for very long.

Bottom line is: is there a problem there, or not? If there's not, I'm happy.
If there is, then do we have another plan than making this into one WAL
record, so it is atomic? And presumably y'all want me to fix it.

--
Simon Riggs                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, 2013-05-03 at 19:52 +0100, Simon Riggs wrote:
> On 1 May 2013 20:40, Jeff Davis <pgsql@j-davis.com> wrote:
>
>>> Looks easy. There is no additional logic for checksums, so there's no
>>> third complexity.
>>>
>>> So we either have
>>> * cleanup info with vismap setting info
>>> * cleanup info only
>>>
>>> which is the same number of WAL records as we have now, just that we
>>> never emit 2 records when one will do.
>>
>> What if we only set PD_ALL_VISIBLE and the VM bit, and make no other
>> changes? Right now, that generates no WAL for the heap page (unless
>> checksums are enabled, of course), which means no full page images for
>> the heap pages.
>
> The only place I see that code path is when we clear up a heap page
> that is empty.

Or if there are no other changes to make to the heap page. For example, if
you do a fresh data load, then set the hint bits, and then do a VACUUM. The
only changes needed during VACUUM are setting PD_ALL_VISIBLE and the VM bit.
Right now, that does not cause a WAL record to be written for the heap page,
so there are no full-page images for the heap page.

At this point, I don't think more changes are required. Robert seemed
concerned about dirtying extra pages after a crash, but I didn't entirely
understand his reasoning. I am inclined toward simplifying this part of the
code as you suggest, but maybe that's better left for 9.4 (which would give
me a chance to reignite the discussion about whether PD_ALL_VISIBLE is needed
at all) unless there is an actual problem.

Regards,
Jeff Davis
On 3 May 2013 21:53, Jeff Davis <pgsql@j-davis.com> wrote: > At this point, I don't think more changes are required. After detailed further analysis, I agree, no further changes are required. I think the code in that area needs considerable refactoring to improve things. I've looked for an easy way to avoid calling PageSetLSN() twice, but I don't see one, which is the thing I thought was a problem. I don't much like the nested critical sections either. But overall, we do follow the requirements for WAL laid out in the README, in that we dirty the buffer first, insert WAL, then set LSN, all within a critical section and with buffer locking. So I don't see any place where this will break with the current coding. In the future I would hope to see that code simplified so that we do just one WAL record per block, rather than the 3 (or more?) records that can get written now: freeze, clean and visibility. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, May 1, 2013 at 3:04 PM, Jeff Davis <pgsql@j-davis.com> wrote: > Regardless, you have a reasonable claim that my patch had effects that > were not necessary. I have attached a draft patch to remedy that. Only > rudimentary testing was done. This looks reasonable to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, 2013-05-06 at 15:31 -0400, Robert Haas wrote: > On Wed, May 1, 2013 at 3:04 PM, Jeff Davis <pgsql@j-davis.com> wrote: > > Regardless, you have a reasonable claim that my patch had effects that > > were not necessary. I have attached a draft patch to remedy that. Only > > rudimentary testing was done. > > This looks reasonable to me. Can you please explain the scenario that loses many VM bits at once during a crash, and results in a bunch of already-all-visible heap pages being dirtied for no reason? Regards,Jeff Davis
On Mon, May 6, 2013 at 5:04 PM, Jeff Davis <pgsql@j-davis.com> wrote: > On Mon, 2013-05-06 at 15:31 -0400, Robert Haas wrote: >> On Wed, May 1, 2013 at 3:04 PM, Jeff Davis <pgsql@j-davis.com> wrote: >> > Regardless, you have a reasonable claim that my patch had effects that >> > were not necessary. I have attached a draft patch to remedy that. Only >> > rudimentary testing was done. >> >> This looks reasonable to me. > > Can you please explain the scenario that loses many VM bits at once > during a crash, and results in a bunch of already-all-visible heap pages > being dirtied for no reason? Hmm. Rereading your last email, I see your point: since we now have HEAP_XLOG_VISIBLE, this is much less of an issue than it would have been before. I'm still not convinced that simplifying that code is a good idea, but maybe it doesn't really hurt us much in practice. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, 2013-05-07 at 13:20 -0400, Robert Haas wrote: > Hmm. Rereading your last email, I see your point: since we now have > HEAP_XLOG_VISIBLE, this is much less of an issue than it would have > been before. I'm still not convinced that simplifying that code is a > good idea, but maybe it doesn't really hurt us much in practice. Given that there's not a big impact one way or another, I don't mind whether this particular patch is committed or not. Whichever you think is more understandable and safer at this late hour. Also, to be clear, the fact that I posted a patch was not meant to advocate either way; merely to present the options. Not sure exactly what you mean about the code simplification, but I agree that anything more substantial than this patch should be left for 9.4. Regards,Jeff Davis
On 4/5/13 6:39 PM, Jeff Davis wrote:
> On Fri, 2013-04-05 at 10:34 +0200, Florian Pflug wrote:
>> Maybe we could scan forward to check whether a corrupted WAL record is
>> followed by one or more valid ones with sensible LSNs. If it is,
>> chances are high that we haven't actually hit the end of the WAL. In
>> that case, we could either log a warning, or (better, probably) abort
>> crash recovery.
>
> +1.
>
>> Corruption of fields which we require to scan past the record would
>> cause false negatives, i.e. not trigger an error even though we do
>> abort recovery mid-way through. There's a risk of false positives too,
>> but they require quite specific orderings of writes and thus seem
>> rather unlikely. (AFAICS, the OS would have to write some parts of
>> record N followed by the whole of record N+1 and then crash to cause a
>> false positive.)
>
> Does the xlp_pageaddr help solve this?
>
> Also, we'd need to be a little careful when written-but-not-flushed WAL
> data makes it to disk, which could cause a false positive and may be a
> fairly common case.

Apologies if this is a stupid question, but is this mostly an issue due to
torn pages? IOW, if we had a way to ensure we never see torn pages, would
that mean an invalid CRC on a WAL page indicated there really was corruption
on that page?

Maybe it's worth putting (yet more) thought into the torn page issue... :/

--
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net
On Wed, 2013-05-08 at 17:56 -0500, Jim Nasby wrote: > Apologies if this is a stupid question, but is this mostly an issue > due to torn pages? IOW, if we had a way to ensure we never see torn > pages, would that mean an invalid CRC on a WAL page indicated there > really was corruption on that page? > > Maybe it's worth putting (yet more) thought into the torn page > issue... :/ Sort of. For data, a page is the logically-atomic unit that is expected to be intact. For WAL, a record is the logically-atomic unit that is expected to be intact. So it might be better to say that the issue for the WAL is "torn records". A record might be larger than a page (it can hold up to three full-page images in one record), but is often much smaller. We use a CRC to validate that the WAL record is fully intact. The concern is that, if it fails the CRC check, we *assume* that it's because it wasn't completely flushed yet (i.e. a "torn record"). Based on that assumption, neither that record nor any later record contains committed transactions, so we can safely consider that the end of the WAL (as of the crash) and bring the system up. The problem is that the assumption is not always true: a CRC failure could also indicate real corruption of WAL records that have been previously flushed successfully, and may contain committed transactions. That can mean we bring the system up way too early, corrupting the database. Unfortunately, it seems that doing any kind of validation to determine that we have a valid end-of-the-WAL inherently requires some kind of separate durable write somewhere. It would be a tiny amount of data (an LSN and maybe some extra crosscheck information), so I could imagine that would be just fine given the right hardware; but if we just write to disk that would be pretty bad. Ideas welcome. Regards,Jeff Davis
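To spell out the assumption Jeff describes, here is a pseudocode sketch of the
end-of-WAL decision; all names are illustrative stand-ins, not the actual xlog
reader code:

    #include <stdbool.h>

    /* hypothetical stand-ins for the real WAL reader machinery */
    typedef struct WalRecordSketch WalRecordSketch;
    extern WalRecordSketch *read_next_record(void);
    extern bool record_crc_matches(const WalRecordSketch *rec);
    extern void replay(const WalRecordSketch *rec);

    static void
    replay_until_apparent_end_of_wal(void)
    {
        WalRecordSketch *rec;

        /*
         * The first record that fails its CRC is assumed to be a torn final
         * record, so replay stops here and the system comes up.  The worry
         * in this thread is that the same symptom can come from corruption
         * of a record flushed long ago, silently discarding later commits.
         */
        while ((rec = read_next_record()) != NULL && record_crc_matches(rec))
            replay(rec);
    }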
On 5/8/13 7:34 PM, Jeff Davis wrote:
> On Wed, 2013-05-08 at 17:56 -0500, Jim Nasby wrote:
>> Apologies if this is a stupid question, but is this mostly an issue
>> due to torn pages? IOW, if we had a way to ensure we never see torn
>> pages, would that mean an invalid CRC on a WAL page indicated there
>> really was corruption on that page?
>>
>> Maybe it's worth putting (yet more) thought into the torn page
>> issue... :/
>
> Sort of. For data, a page is the logically-atomic unit that is expected
> to be intact. For WAL, a record is the logically-atomic unit that is
> expected to be intact.
>
> So it might be better to say that the issue for the WAL is "torn
> records". A record might be larger than a page (it can hold up to three
> full-page images in one record), but is often much smaller.
>
> We use a CRC to validate that the WAL record is fully intact. The
> concern is that, if it fails the CRC check, we *assume* that it's
> because it wasn't completely flushed yet (i.e. a "torn record"). Based
> on that assumption, neither that record nor any later record contains
> committed transactions, so we can safely consider that the end of the
> WAL (as of the crash) and bring the system up.
>
> The problem is that the assumption is not always true: a CRC failure
> could also indicate real corruption of WAL records that have been
> previously flushed successfully, and may contain committed transactions.
> That can mean we bring the system up way too early, corrupting the
> database.
>
> Unfortunately, it seems that doing any kind of validation to determine
> that we have a valid end-of-the-WAL inherently requires some kind of
> separate durable write somewhere. It would be a tiny amount of data (an
> LSN and maybe some extra crosscheck information), so I could imagine
> that would be just fine given the right hardware; but if we just write
> to disk that would be pretty bad. Ideas welcome.

What about moving some critical data from the beginning of the WAL record to
the end? That would make it easier to detect that we don't have a complete
record. It wouldn't necessarily replace the CRC though, so maybe that's not
good enough.

Actually, what if we actually *duplicated* some of the same WAL header info
at the end of the record? Given a reasonable amount of data that would
damn-near ensure that a torn record was detected, because the odds of having
the exact same sequence of random bytes would be so low. Potentially even
just duplicating the LSN would suffice.

On the separate write idea, if that could be controlled by a GUC I think it'd
be worth doing. Anyone that needs to worry about this corner case probably
has hardware that would support that.

--
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net
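A sketch of the trailer idea; the record layout below is invented for
illustration and is not the real XLogRecord format:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /*
     * Invented layout: a 4-byte total length, then the record's LSN in the
     * header, then the payload, then the same LSN repeated as a trailer.  A
     * torn write that loses part of the payload is very unlikely to leave a
     * matching trailer behind, giving a cheap extra end-of-record sanity
     * check alongside the CRC.
     */
    static bool
    record_trailer_matches(const uint8_t *rec, uint32_t tot_len)
    {
        uint64_t    header_lsn;
        uint64_t    trailer_lsn;

        if (tot_len < sizeof(uint32_t) + 2 * sizeof(uint64_t))
            return false;       /* too short to carry a trailer */

        memcpy(&header_lsn, rec + sizeof(uint32_t), sizeof(header_lsn));
        memcpy(&trailer_lsn, rec + tot_len - sizeof(trailer_lsn),
               sizeof(trailer_lsn));
        return header_lsn == trailer_lsn;
    }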
On 9 May 2013 20:28, Jim Nasby <jim@nasby.net> wrote: >> Unfortunately, it seems that doing any kind of validation to determine >> that we have a valid end-of-the-WAL inherently requires some kind of >> separate durable write somewhere. It would be a tiny amount of data (an >> LSN and maybe some extra crosscheck information), so I could imagine >> that would be just fine given the right hardware; but if we just write >> to disk that would be pretty bad. Ideas welcome. Not so sure. If the WAL record length is intact, and it probably is, then we can test whether the next WAL record is valid also. If the current WAL record is corrupt and the next WAL record is corrupt, then we have a problem. If the current WAL record is corrupt and the next WAL record is in every way valid, we can potentially continue. But we need to keep track of accumulated errors to avoid getting into a worse situation. Obviously, we would need to treat the next WAL record with complete scepticism, but I have seen cases where only a single WAL record was corrupt. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes: > If the current WAL record is corrupt and the next WAL record is in > every way valid, we can potentially continue. That seems like a seriously bad idea. regards, tom lane
On 9 May 2013 22:39, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> If the current WAL record is corrupt and the next WAL record is in >> every way valid, we can potentially continue. > > That seems like a seriously bad idea. I agree. But if you knew that were true, is stopping a better idea? Any corrupt WAL record needs to halt recovery. What I'm thinking is to check whether it might be possible to continue, and if all looks good, offer the user the option to continue. "Continuing could be dangerous and might cause deeper corruption of your database. We suggest you take a backup of the server before continuing" --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, May 9, 2013 at 10:45 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 9 May 2013 22:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>> If the current WAL record is corrupt and the next WAL record is in
>>> every way valid, we can potentially continue.
>>
>> That seems like a seriously bad idea.
>
> I agree. But if you knew that were true, is stopping a better idea?

Having one corrupt record followed by a valid record is not an abnormal
situation. It could easily be the correct end of WAL. I think it's not
possible to protect 100% against this without giving up the checksum
optimization, which implies doing two fsyncs per commit instead of one.

However it is possible to reduce the window. Every time the transaction log
is synced, a different file can be updated with a known minimum transaction
log recovery point. Even if it's not synced consistently on every transaction
commit or WAL sync, it would serve as a low water mark. Recovering to that
point is not sufficient but is necessary for a consistent recovery. That file
could be synced lazily, say, every 10s or something like that, and would
guarantee that any WAL corruption would be caught except for the last 10s of
WAL traffic, for example.

If you're only interested in database consistency and not lost commits then
that file could be synced on buffer xlog flushes (making a painful case even
more painful). Off the top of my head that would be sufficient to guarantee
that a corrupt xlog that would create an inconsistent database would not be
missed. I may be missing cases involving checkpoints or the like though.

--
greg
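A sketch of that low-water-mark idea; the file name, function name, and the
10-second interval are assumptions made purely for illustration:

    #include <fcntl.h>
    #include <stdint.h>
    #include <time.h>
    #include <unistd.h>

    typedef uint64_t SketchLSN;     /* stand-in for XLogRecPtr */

    /*
     * After a WAL flush, record the flushed-up-to LSN in a small side file.
     * The file is rewritten on every call but only fsync'd every ~10
     * seconds, so it lags the true flush point by at most that window.  On
     * crash recovery, reaching at least this LSN is necessary (though not
     * sufficient) for the WAL to be considered complete.
     */
    static void
    update_wal_low_water_mark(SketchLSN flushed_upto)
    {
        static time_t last_sync = 0;
        int     fd = open("wal_low_water_mark", O_WRONLY | O_CREAT, 0600);

        if (fd < 0)
            return;             /* best effort in this sketch */

        if (write(fd, &flushed_upto, sizeof(flushed_upto)) == sizeof(flushed_upto) &&
            time(NULL) - last_sync >= 10)
        {
            fsync(fd);
            last_sync = time(NULL);
        }
        close(fd);
    }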
On Thu, 2013-05-09 at 14:28 -0500, Jim Nasby wrote: > What about moving some critical data from the beginning of the WAL > record to the end? That would make it easier to detect that we don't > have a complete record. It wouldn't necessarily replace the CRC > though, so maybe that's not good enough. > > Actually, what if we actually *duplicated* some of the same WAL header > info at the end of the record? Given a reasonable amount of data that > would damn-near ensure that a torn record was detected, because the > odds of having the exact same sequence of random bytes would be so > low. Potentially even just duplicating the LSN would suffice. I think both of these ideas have some false positives and false negatives. If the corruption happens at the record boundary, and wipes out the special information at the end of the record, then you might think it was not fully flushed, and we're in the same position as today. If the WAL record is large, and somehow the beginning and the end get written to disk but not the middle, then it will look like corruption; but really the WAL was just not completely flushed. This seems pretty unlikely, but not impossible. That being said, I like the idea of introducing some extra checks if a perfect solution is not possible. > On the separate write idea, if that could be controlled by a GUC I > think it'd be worth doing. Anyone that needs to worry about this > corner case probably has hardware that would support that. It sounds pretty easy to do that naively. I'm just worried that the performance will be so bad for so many users that it's not a very reasonable choice. Today, it would probably make more sense to just use sync rep. If the master's WAL is corrupt, and it starts up too early, then that should be obvious when you try to reconnect streaming replication. I haven't tried it, but I'm assuming that it gives a useful error message. Regards,Jeff Davis
On Thu, 2013-05-09 at 23:13 +0100, Greg Stark wrote: > However it is possible to reduce the window... Sounds reasonable. It's fairly limited though -- the window is already a checkpoint (typically 5-30 minutes), and we'd bring that down an order of magnitude (10s). I speculate that, if it got corrupted within 30 minutes, it probably got corrupted at the time of being written (as happened in Jeff Janes's case, due to a bug). So, the question is: if the WAL is corrupted on write, does reducing the window significantly increase the chances that the wal writer will hang around long enough before a crash to flush this other file? On the other hand, checkpoint hides any corrupt WAL records by not replaying them, whereas your scheme would identify that there is a problem. I don't think this would have helped Jeff Janes's case because I think the crashes were happening too quickly. But that is artificial, so it may help in real cases. I just had a thought: we don't necessarily need to flush the auxiliary file each time; merely writing it to the kernel buffers would help a lot. Maybe an extra write() of the auxiliary file during a WAL flush isn't so bad; and combined with periodic fsync()s of the auxiliary file, should offer a lot of coverage against problems. Regards,Jeff Davis
On 9 May 2013 23:13, Greg Stark <stark@mit.edu> wrote: > On Thu, May 9, 2013 at 10:45 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 9 May 2013 22:39, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Simon Riggs <simon@2ndQuadrant.com> writes: >>>> If the current WAL record is corrupt and the next WAL record is in >>>> every way valid, we can potentially continue. >>> >>> That seems like a seriously bad idea. >> >> I agree. But if you knew that were true, is stopping a better idea? > > Having one corrupt record followed by a valid record is not an > abnormal situation. It could easily be the correct end of WAL. I disagree, that *is* an abnormal situation and would not be the "correct end-of-WAL". Each WAL record contains a "prev" pointer to the last WAL record. So for the next record to be valid the prev pointer would need to be exactly correct. > However it is possible to reduce the window. Every time the > transaction log is synced a different file can be updated with the a > known minimum transaction log recovery point. Even if it's not synced > consistently on every transaction commit or wal sync it would serve as > a low water mark. Recovering to that point is not sufficient but is > necessary for a consistent recovery. That file could be synced lazily, > say, every 10s or something like that and would guarantee that any wal > corruption would be caught except for the last 10s of wal traffic for > example. I think it would be easy enough to have the WALwriter update the minRecoveryPoint once per cycle, after it has flushed WAL. Given the importance of pg_control and its small size, it seems like it would be a good idea to take a backup copy of it every checkpoint to make sure we have that data safe. And have pg_resetxlog keep a copy also every time that is run. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, May 10, 2013 at 7:44 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> Having one corrupt record followed by a valid record is not an >> abnormal situation. It could easily be the correct end of WAL. > > I disagree, that *is* an abnormal situation and would not be the > "correct end-of-WAL". > > Each WAL record contains a "prev" pointer to the last WAL record. So > for the next record to be valid the prev pointer would need to be > exactly correct. Well then you're wrong. The OS makes no guarantee that blocks are written out in order. When the system crashes any random subset of the blocks written but not synced might have been written to disk and others not. There could be megabytes of correct WAL written with just one block in the middle of the first record not written. If no xlog sync had occurred (or one was in progress but not completed) then that's the correct end of WAL. -- greg
On Friday, May 10, 2013 6:09 PM Greg Stark wrote: > On Fri, May 10, 2013 at 7:44 AM, Simon Riggs <simon@2ndquadrant.com> > wrote: > >> Having one corrupt record followed by a valid record is not an > >> abnormal situation. It could easily be the correct end of WAL. > > > > I disagree, that *is* an abnormal situation and would not be the > > "correct end-of-WAL". > > > > Each WAL record contains a "prev" pointer to the last WAL record. So > > for the next record to be valid the prev pointer would need to be > > exactly correct. > > Well then you're wrong. The OS makes no guarantee that blocks are > written out in order. When the system crashes any random subset of the > blocks written but not synced might have been written to disk and > others not. There could be megabytes of correct WAL written with just > one block in the middle of the first record not written. If no xlog > sync had occurred (or one was in progress but not completed) then > that's the correct end of WAL. In the case where one block is missing, how can it even reach to next record to check "prev" pointer. I think it can be possible when one of the record is corrupt and following are okay which I think is the case in which it can proceed with warning as suggested by Simon. With Regards, Amit Kapila.
On Fri, May 10, 2013 at 5:31 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
> In the case where one block is missing, how can it even reach to next record
> to check "prev" pointer.
> I think it can be possible when one of the record is corrupt and following
> are okay which I think is the
> case in which it can proceed with warning as suggested by Simon.

A single WAL record can be over 24kB. The checksum covers the entire WAL
record, and if it reports corruption it can be because a chunk in the middle
wasn't flushed to disk before the system crashed. The beginning of the WAL
record with the length and checksum and the entire following record with its
prev pointer might have been flushed, but the missing block in the middle of
this record means it can't be replayed. This would be a normal situation in
case of a system crash.

If you replayed the following record but not this record you would have an
inconsistent database. The following record could be an insert for a child
table with a foreign key reference to a tuple that was inserted by the
skipped record, for example, resulting in a database that is logically
inconsistent. Or it could be an index insert for that tuple, which would
result in a physically inconsistent database with index pointers that point
to incorrect tuples. Index scans would return tuples that didn't match the
index or would miss tuples that should be returned.

--
greg
Greg Stark <stark@mit.edu> writes:
> A single WAL record can be over 24kB.

<pedantic>
Actually, WAL records can run to megabytes. Consider for example a
commit record for a transaction that dropped thousands of tables ---
there'll be info about each such table in the commit record, to cue
replay to remove those files.
</pedantic>

> If you replayed the following record but not this record you would
> have an inconsistent database. ...
> Or it could be an index insert for that tuple, which would result in a
> physically inconsistent database with index pointers that point to
> incorrect tuples. Index scans would return tuples that didn't match
> the index or would miss tuples that should be returned.

Skipping actions such as index page splits would lead to even more fun.

Even in simple cases such as successive inserts and deletions in the same
heap page, failing to replay some of the actions is going to be disastrous.
The *best case* scenario for that is that WAL replay PANICs when it notices
that the action it's trying to replay is inconsistent with the current state
of the page, e.g. it's trying to insert at a TID that already exists.

IMO we can't proceed past a broken WAL record. The actually useful suggestion
upthread was that we try to notice whether there seem to be valid WAL records
past the broken one, so that we could warn the DBA that some commits might
have been lost. I don't think we can do much in the way of automatic data
recovery, but we could give the DBA a chance to do forensics rather than
blindly starting up (and promptly overwriting all the evidence).

                        regards, tom lane
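A sketch of the look-ahead check being suggested; every helper below is
hypothetical, standing in for xlog-reader machinery that can locate and
validate individual records:

    #include <stdbool.h>
    #include <stdint.h>

    /* hypothetical helpers */
    extern uint64_t next_possible_record(uint64_t lsn);
    extern bool     record_readable_at(uint64_t lsn);
    extern bool     record_crc_matches_at(uint64_t lsn);
    extern bool     prev_pointer_links_back(uint64_t lsn);

    /*
     * After an apparent end of WAL (a record that fails its CRC), look ahead
     * for records that still validate and chain back correctly.  If any are
     * found, the "torn final record" explanation is suspect, so the caller
     * should warn the DBA and refuse to start up automatically, preserving
     * the evidence for forensics.
     */
    static bool
    wal_looks_suspicious_beyond(uint64_t failed_at)
    {
        uint64_t    ptr = next_possible_record(failed_at);
        int         valid_beyond = 0;

        while (record_readable_at(ptr))
        {
            if (record_crc_matches_at(ptr) && prev_pointer_links_back(ptr))
                valid_beyond++;
            ptr = next_possible_record(ptr);
        }
        return valid_beyond > 0;
    }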
On 10 May 2013 13:39, Greg Stark <stark@mit.edu> wrote: > On Fri, May 10, 2013 at 7:44 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> Having one corrupt record followed by a valid record is not an >>> abnormal situation. It could easily be the correct end of WAL. >> >> I disagree, that *is* an abnormal situation and would not be the >> "correct end-of-WAL". >> >> Each WAL record contains a "prev" pointer to the last WAL record. So >> for the next record to be valid the prev pointer would need to be >> exactly correct. > > Well then you're wrong. The OS makes no guarantee that blocks are > written out in order. When the system crashes any random subset of the > blocks written but not synced might have been written to disk and > others not. There could be megabytes of correct WAL written with just > one block in the middle of the first record not written. If no xlog > sync had occurred (or one was in progress but not completed) then > that's the correct end of WAL. I agree that the correct end of WAL is where the last sync occurred. We don't write() WAL except with an immediate sync(), so the chances of what you say happening are very low to impossible. The timing window between the write and the sync is negligible and yet I/O would need to occur in that window and also be out of order from the order of the write, which is unlikely because an I/O elevator would either not touch the order of writes at all, or would want to maintain sequential order to avoid head movement, which is what we want. I guess we should add here "...with disks, maybe not with SSDs". In any case, what is more important is that your idea to make an occasional write of the minRecoveryPoint and then use that to cross check against current LSN would allow us to at least confirm that we have a single corrupt record and report that situation accurately to the user. That idea will cover 95+% of such problems anyway, since what we care about is long sequences of WAL records, not just the last few written at crash, which the above discussion focuses on. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 10 May 2013 18:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@mit.edu> writes:
>> A single WAL record can be over 24kB.
>
> <pedantic>
> Actually, WAL records can run to megabytes. Consider for example a
> commit record for a transaction that dropped thousands of tables ---
> there'll be info about each such table in the commit record, to cue
> replay to remove those files.
> </pedantic>
>
>> If you replayed the following record but not this record you would
>> have an inconsistent database. ...
>> Or it could be an index insert for that tuple, which would result in a
>> physically inconsistent database with index pointers that point to
>> incorrect tuples. Index scans would return tuples that didn't match
>> the index or would miss tuples that should be returned.
>
> Skipping actions such as index page splits would lead to even more fun.
>
> Even in simple cases such as successive inserts and deletions in the
> same heap page, failing to replay some of the actions is going to be
> disastrous. The *best case* scenario for that is that WAL replay
> PANICs when it notices that the action it's trying to replay is
> inconsistent with the current state of the page, eg it's trying to
> insert at a TID that already exists.

Agreed.

> IMO we can't proceed past a broken WAL record. The actually useful
> suggestion upthread was that we try to notice whether there seem
> to be valid WAL records past the broken one, so that we could warn
> the DBA that some commits might have been lost. I don't think we
> can do much in the way of automatic data recovery, but we could give
> the DBA a chance to do forensics rather than blindly starting up (and
> promptly overwriting all the evidence).

The usual answer is switchover to the standby, hoping that the WAL was not
corrupted before it got sent there also. If all servers go down people will
want a "carry on regardless" option as well, since the fault can be
investigated on a standby.

Even with all of the above caveats, lying on our backs with our feet in the
air 'cos we lost a few blocks will not impress anyone. Doing that will likely
be a medium-high risk thing, but that will still be better than the certainty
of a down server. It would need to be a manually initiated option.

--
Simon Riggs                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, May 10, 2013 at 9:54 AM, Greg Stark <stark@mit.edu> wrote:
> On Fri, May 10, 2013 at 5:31 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
>> In the case where one block is missing, how can it even reach to next record
>> to check "prev" pointer.
>> I think it can be possible when one of the record is corrupt and following
>> are okay which I think is the
>> case in which it can proceed with warning as suggested by Simon.
>
> A single WAL record can be over 24kB. The checksum covers the entire
> WAL record and if it reports corruption it can be because a chunk in
> the middle wasn't flushed to disk before the system crashed. The
> beginning of the WAL record with the length and checksum and the
> entire following record with its prev pointer might have been flushed
> but the missing block in the middle of this record means it can't be
> replayed. This would be a normal situation in case of a system crash.
> If you replayed the following record but not this record you would
> have an inconsistent database.
I don't think we would ever want to *skip* the record and play the next one. But if it looks like the next record is valid, we might not want to automatically open the database in a possibly inconsistent state and in the process overwrite the only existing copy of those WAL records which would be necessary to make it consistent. Instead, could we present the DBA with an explicit choice: open the database anyway, try to reconstruct the corrupted record via forensic inspection so that it can be played through (I have no idea how likely such an attempt is to succeed), or copy the database for later inspection and then open it?
But based on your description, perhaps refusing to automatically restart and forcing an explicit decision would happen a lot more often, during normal crashes with no corruption, than I was thinking it would.
Of course the paranoid DBA could turn off restart_after_crash and do a manual investigation on every crash, but in that case the database would refuse to restart even in the case where it is perfectly clear that all the following WAL belongs to the recycled file and not the current file. They would also have to turn off any startup scripts in init.d, to make sure a rebooting server doesn't do recovery automatically and destroy evidence that way.
Cheers,
Jeff
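For readers following the on-disk layout the quoted explanation relies on, a rough sketch of the relevant header fields is below. The real XLogRecord in xlog.h carries more fields (xid, rmgr id, info bits) and a different exact layout, so treat this only as an illustration of why a record can span blocks and still look intact at both ends:

    #include <stdint.h>

    /* Illustrative only; not the actual XLogRecord definition. */
    typedef struct WalRecordHeaderSketch
    {
        uint32_t    xl_tot_len;   /* total record length; can exceed one disk block */
        uint64_t    xl_prev;      /* LSN of the previous record (the "prev" pointer) */
        uint32_t    xl_crc;       /* CRC covering the whole record */
    } WalRecordHeaderSketch;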
On Fri, May 10, 2013 at 2:06 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > But based on your description, perhaps refusing to automatically restart and > forcing an explicit decision would happen a lot more often, during normal > crashes with no corruption, than I was thinking it would. I bet it would. But I think Greg Stark's idea of periodically flushing the current minimum recovery point to disk has a lot to recommend it. That would be massively more reliable than what we have now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, 2013-05-10 at 18:32 +0100, Simon Riggs wrote: > We don't write() WAL except with an immediate sync(), so the chances > of what you say happening are very low to impossible. Are you sure? An XLogwrtRqst contains a write and a flush pointer, so I assume they can be different. I agree that it sounds unlikely that blocks 100 and 102 would be written, but not 101. But perhaps that's more likely in systems like ZFS where the physical blocks might be in very different places. Regards, Jeff Davis
On 5/10/13 1:32 PM, Simon Riggs wrote: > The timing > window between the write and the sync is negligible and yet I/O would > need to occur in that window and also be out of order from the order > of the write, which is unlikely because an I/O elevator would either > not touch the order of writes at all, or would want to maintain > sequential order to avoid head movement, which is what we want. I > guess we should add here "...with disks, maybe not with SSDs". It's not really safe to make any assumptions about I/O elevators. Reordering gets done from the perspective of the last item written. When the previous write was at the logical end of the disk, it can just as easily re-order a queue of writes in the complete reverse order they were issued in. The only way you can ever get a useful guarantee is when an fsync returns completion. Writes can effectively go out in a completely random order until that point. All you can rely on is throwing up a stop sign that says "tell me when all of them are done". In between those, you have no idea of the ordering. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Friday, May 10, 2013 10:24 PM Greg Stark wrote: > On Fri, May 10, 2013 at 5:31 PM, Amit Kapila <amit.kapila@huawei.com> wrote: >> In the case where one block is missing, how can it even reach to next record >> to check "prev" pointer. >> I think it can be possible when one of the record is corrupt and following >> are okay which I think is the >> case in which it can proceed with warning as suggested by Simon. > A single WAL record can be over 24kB. The checksum covers the entire > WAL record and if it reports corruption it can be because a chunk in > the middle wasn't flushed to disk before the system crashed. The > beginning of the WAL record with the length and checksum and the > entire following record with its prev pointer might have been flushed > but the missing block in the middle of this record means it can't be > replayed. This would be a normal situation in case of a system crash. The only point I wanted to make was that there can be only one such record; its length can be large or small. > If you replayed the following record but not this record you would > have an inconsistent database. The following record could be an insert > for a child table with a foreign key reference to a tuple that was > inserted by the skipped record for example. Resulting in a database > that is logically inconsistent. The corrupt record might be one that leads to inconsistency in the database, or it could be the commit record of a transaction that performed only select operations. It will be difficult or impossible to determine that during recovery, but informing the DBA/user on such an occasion can be useful, along with an optional way to continue (although I am not sure how the DBA could decide in such a case; maybe some other information would be needed as well). The reason DBA/user intervention can be useful in such cases is that, if everything after one corrupt record is ignored, it can still take time for the DBA/user to work out which operations need to be performed again. With Regards, Amit Kapila.
On 10 May 2013 23:41, Jeff Davis <pgsql@j-davis.com> wrote: > On Fri, 2013-05-10 at 18:32 +0100, Simon Riggs wrote: >> We don't write() WAL except with an immediate sync(), so the chances >> of what you say happening are very low to impossible. > > Are you sure? An XLogwrtRqst contains a write and a flush pointer, so I > assume they can be different. I'm answering this just to complete the discussion. Yes, the write and flush pointer can be different. The write/flush are two actions; we do one first, then the other, very quickly, inside XLogWrite(). But we cannot rely on that, since there are some times when we don't do that, such as wal_buffer overflow and when background walwriter writes are at the end of the ring buffer. Not common, but they do exist and when they exist they write many blocks. Which is counter to what I had said earlier. This part of the proposal isn't necessary for us to get a good answer 95% of the time, so it is dropped. As mentioned on other post, we can write UpdateMinRecoveryPoint(LogwrtResult.Flush, false) every time we do XLogBackgroundFlush() and some other rework to make that happen correctly. I'll post a separate patch. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
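As a rough illustration of what the proposal amounts to (on each background flush, durably record the flushed-up-to LSN), here is a simplified standalone sketch; in PostgreSQL itself this would be UpdateMinRecoveryPoint(LogwrtResult.Flush, false) updating pg_control, not a separate file:

    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    /*
     * Simplified stand-in: durably record how far WAL has been flushed, so
     * that after a crash we can tell a genuine end of WAL from a broken
     * record in the middle of already-flushed WAL.  Returns 0 on success.
     */
    static int
    persist_flush_pointer(const char *path, uint64_t flushed_lsn)
    {
        int fd = open(path, O_WRONLY | O_CREAT, 0600);

        if (fd < 0)
            return -1;
        if (pwrite(fd, &flushed_lsn, sizeof(flushed_lsn), 0) != (ssize_t) sizeof(flushed_lsn) ||
            fsync(fd) != 0)
        {
            close(fd);
            return -1;
        }
        return close(fd);
    }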
On 5/9/13 5:18 PM, Jeff Davis wrote: > On Thu, 2013-05-09 at 14:28 -0500, Jim Nasby wrote: >> What about moving some critical data from the beginning of the WAL >> record to the end? That would make it easier to detect that we don't >> have a complete record. It wouldn't necessarily replace the CRC >> though, so maybe that's not good enough. >> >> Actually, what if we actually *duplicated* some of the same WAL header >> info at the end of the record? Given a reasonable amount of data that >> would damn-near ensure that a torn record was detected, because the >> odds of having the exact same sequence of random bytes would be so >> low. Potentially even just duplicating the LSN would suffice. > > I think both of these ideas have some false positives and false > negatives. > > If the corruption happens at the record boundary, and wipes out the > special information at the end of the record, then you might think it > was not fully flushed, and we're in the same position as today. > > If the WAL record is large, and somehow the beginning and the end get > written to disk but not the middle, then it will look like corruption; > but really the WAL was just not completely flushed. This seems pretty > unlikely, but not impossible. > > That being said, I like the idea of introducing some extra checks if a > perfect solution is not possible. Yeah, I don't think a perfect solution is possible, short of attempting to tie directly into the filesystem (ie: on a journaling FS have some way to essentially treat the FS journal as WAL). One additional step we might be able to take would be to scan forward looking for a record that would tell us when an fsync must have occurred (heck, maybe we should add an fsync WAL record...). If we find a corrupt WAL record followed by an fsync we know that we've now lost data. That closes some of the holes. Actually, that might handle all the holes... >> On the separate write idea, if that could be controlled by a GUC I >> think it'd be worth doing. Anyone that needs to worry about this >> corner case probably has hardware that would support that. > > It sounds pretty easy to do that naively. I'm just worried that the > performance will be so bad for so many users that it's not a very > reasonable choice. > > Today, it would probably make more sense to just use sync rep. If the > master's WAL is corrupt, and it starts up too early, then that should be > obvious when you try to reconnect streaming replication. I haven't tried > it, but I'm assuming that it gives a useful error message. I wonder if there are DW environments that are too large to keep a SR copy but would be able to afford the double-write overhead. BTW, isn't performance what killed the double-buffer idea? -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
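A minimal sketch of the "duplicate some header info at the end of the record" idea discussed above, with invented names and layout (the real WAL record format is different); the point is only that a torn write which loses the tail, or the middle plus the tail, leaves the trailing copy disagreeing with the header:

    #include <stdint.h>

    /* On-disk layout for the sketch: [RecordHeader][payload bytes][RecordTrailer] */
    typedef struct RecordHeader
    {
        uint64_t    lsn;        /* where this record starts */
        uint32_t    len;        /* payload length */
    } RecordHeader;

    typedef struct RecordTrailer
    {
        uint64_t    lsn_copy;   /* duplicate of the header LSN, written last */
    } RecordTrailer;

    /* A record is only trusted if the trailing copy matches the header. */
    static int
    record_looks_complete(const RecordHeader *hdr, const RecordTrailer *tlr)
    {
        return hdr->lsn == tlr->lsn_copy;
    }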
On 5/10/13 1:06 PM, Jeff Janes wrote: > Of course the paranoid DBA could turn off restart_after_crash and do a manual investigation on every crash, but in that case the database would refuse to restart even in the case where it perfectly clear that all the following WAL belongs to the recycled file and not the current file. Perhaps we should also allow for zeroing out WAL files before reuse (or just disable reuse). I know there's a performance hit there, but the reuse idea happened before we had bgWriter. Theoretically the overhead creating a new file would always fall to bgWriter and therefore not be a big deal. -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Sun, May 12, 2013 at 3:46 PM, Jim Nasby <jim@nasby.net> wrote: > On 5/10/13 1:06 PM, Jeff Janes wrote: >> >> Of course the paranoid DBA could turn off restart_after_crash and do a >> manual investigation on every crash, but in that case the database would >> refuse to restart even in the case where it perfectly clear that all the >> following WAL belongs to the recycled file and not the current file. > > > Perhaps we should also allow for zeroing out WAL files before reuse (or just > disable reuse). I know there's a performance hit there, but the reuse idea > happened before we had bgWriter. Theoretically the overhead creating a new > file would always fall to bgWriter and therefore not be a big deal. For filesystems like btrfs, re-using a WAL file is suboptimal to simply creating a new one and removing the old one when it's no longer required. Using fallocate (or posix_fallocate) (I have a patch for that!) to create a new one is - by my tests - 28 times faster than the currently-used method. -- Jon
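For reference, a minimal sketch of the posix_fallocate approach being described, assuming the default 16MB segment size; PostgreSQL's actual segment-creation code does considerably more than this:

    #include <fcntl.h>
    #include <unistd.h>

    #define WAL_SEG_SIZE (16 * 1024 * 1024)     /* assumes the default 16MB segment */

    /* Allocate a new segment without writing 16MB of zeroes.  Returns 0 on success. */
    static int
    allocate_segment(const char *path)
    {
        int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
        int rc;

        if (fd < 0)
            return -1;
        rc = posix_fallocate(fd, 0, WAL_SEG_SIZE);  /* returns an errno value, not -1 */
        if (rc == 0)
            rc = fsync(fd);         /* make the allocation itself durable */
        close(fd);
        return rc;
    }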
On Sun, May 12, 2013 at 03:46:00PM -0500, Jim Nasby wrote: > On 5/10/13 1:06 PM, Jeff Janes wrote: > >Of course the paranoid DBA could turn off restart_after_crash and do a manual investigation on every crash, but in that case the database would refuse to restart even in the case where it perfectly clear that all the following WAL belongs to the recycled file and not the current file. > > Perhaps we should also allow for zeroing out WAL files before reuse (or just disable reuse). I know there's a performance hit there, but the reuse idea happened before we had bgWriter. Theoretically the overhead creating a new file would always fall to bgWriter and therefore not be a big deal. > -- > Jim C. Nasby, Data Architect jim@nasby.net > 512.569.9461 (cell) http://jim.nasby.net > Unless something has changed dramatically, creating new files is a LOT more overhead than reusing existing files. My two cents. Regards, Ken
On Sun, May 12, 2013 at 07:41:26PM -0500, Jon Nelson wrote: > On Sun, May 12, 2013 at 3:46 PM, Jim Nasby <jim@nasby.net> wrote: > > On 5/10/13 1:06 PM, Jeff Janes wrote: > >> > >> Of course the paranoid DBA could turn off restart_after_crash and do a > >> manual investigation on every crash, but in that case the database would > >> refuse to restart even in the case where it perfectly clear that all the > >> following WAL belongs to the recycled file and not the current file. > > > > > > Perhaps we should also allow for zeroing out WAL files before reuse (or just > > disable reuse). I know there's a performance hit there, but the reuse idea > > happened before we had bgWriter. Theoretically the overhead creating a new > > file would always fall to bgWriter and therefore not be a big deal. > > For filesystems like btrfs, re-using a WAL file is suboptimal to > simply creating a new one and removing the old one when it's no longer > required. Using fallocate (or posix_fallocate) (I have a patch for > that!) to create a new one is - by my tests - 28 times faster than the > currently-used method. > > > -- > Jon > What about for less cutting (bleeding) edge filesystems? Ken
On Mon, May 13, 2013 at 7:49 AM, ktm@rice.edu <ktm@rice.edu> wrote: > On Sun, May 12, 2013 at 07:41:26PM -0500, Jon Nelson wrote: >> On Sun, May 12, 2013 at 3:46 PM, Jim Nasby <jim@nasby.net> wrote: >> > On 5/10/13 1:06 PM, Jeff Janes wrote: >> >> >> >> Of course the paranoid DBA could turn off restart_after_crash and do a >> >> manual investigation on every crash, but in that case the database would >> >> refuse to restart even in the case where it perfectly clear that all the >> >> following WAL belongs to the recycled file and not the current file. >> > >> > >> > Perhaps we should also allow for zeroing out WAL files before reuse (or just >> > disable reuse). I know there's a performance hit there, but the reuse idea >> > happened before we had bgWriter. Theoretically the overhead creating a new >> > file would always fall to bgWriter and therefore not be a big deal. >> >> For filesystems like btrfs, re-using a WAL file is suboptimal to >> simply creating a new one and removing the old one when it's no longer >> required. Using fallocate (or posix_fallocate) (I have a patch for >> that!) to create a new one is - by my tests - 28 times faster than the >> currently-used method. >> > What about for less cutting (bleeding) edge filesystems? The patch would be applicable for any filesystem which implements the fallocate/posix_fallocate calls in an efficient fashion. xfs and ext4 would both work, if I recall properly. I'm certain there are others, and the technique would probably work on other operating systems like the *BSDs, etc.. Additionally, it's improbable that there would be a performance hit for other filesystems versus simply writing zeroes, since that's the approach that is taken anyway as a fallback. Another win is reduction in fragmentation. I would love to be able to turn off WAL recycling to perform more useful testing. -- Jon
On 2013-05-12 19:41:26 -0500, Jon Nelson wrote: > On Sun, May 12, 2013 at 3:46 PM, Jim Nasby <jim@nasby.net> wrote: > > On 5/10/13 1:06 PM, Jeff Janes wrote: > >> > >> Of course the paranoid DBA could turn off restart_after_crash and do a > >> manual investigation on every crash, but in that case the database would > >> refuse to restart even in the case where it perfectly clear that all the > >> following WAL belongs to the recycled file and not the current file. > > > > > > Perhaps we should also allow for zeroing out WAL files before reuse (or just > > disable reuse). I know there's a performance hit there, but the reuse idea > > happened before we had bgWriter. Theoretically the overhead creating a new > > file would always fall to bgWriter and therefore not be a big deal. > > For filesystems like btrfs, re-using a WAL file is suboptimal to > simply creating a new one and removing the old one when it's no longer > required. Using fallocate (or posix_fallocate) (I have a patch for > that!) to create a new one is - by my tests - 28 times faster than the > currently-used method. I don't think the comparison between just fallocate()ing and what we currently do is fair. fallocate() doesn't guarantee that the file is the same size after a crash, so you would still need an fsync() or we couldn't use fdatasync() anymore. And I'd guess the benefits aren't all that big anymore in that case? That said, using posix_fallocate seems like a good idea in lots of places inside pg, its just not all that easy to do in some of the places. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, May 13, 2013 at 8:32 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-05-12 19:41:26 -0500, Jon Nelson wrote: >> On Sun, May 12, 2013 at 3:46 PM, Jim Nasby <jim@nasby.net> wrote: >> > On 5/10/13 1:06 PM, Jeff Janes wrote: >> >> >> >> Of course the paranoid DBA could turn off restart_after_crash and do a >> >> manual investigation on every crash, but in that case the database would >> >> refuse to restart even in the case where it perfectly clear that all the >> >> following WAL belongs to the recycled file and not the current file. >> > >> > >> > Perhaps we should also allow for zeroing out WAL files before reuse (or just >> > disable reuse). I know there's a performance hit there, but the reuse idea >> > happened before we had bgWriter. Theoretically the overhead creating a new >> > file would always fall to bgWriter and therefore not be a big deal. >> >> For filesystems like btrfs, re-using a WAL file is suboptimal to >> simply creating a new one and removing the old one when it's no longer >> required. Using fallocate (or posix_fallocate) (I have a patch for >> that!) to create a new one is - by my tests - 28 times faster than the >> currently-used method. > > I don't think the comparison between just fallocate()ing and what we > currently do is fair. fallocate() doesn't guarantee that the file is the > same size after a crash, so you would still need an fsync() or we > couldn't use fdatasync() anymore. And I'd guess the benefits aren't all > that big anymore in that case? fallocate (16MB) + fsync is still almost certainly faster than write+write+write... + fsync. The test I performed at the time did exactly that .. posix_fallocate + pg_fsync. > That said, using posix_fallocate seems like a good idea in lots of > places inside pg, its just not all that easy to do in some of the > places. I should not derail this thread any further. Perhaps, if interested parties would like to discuss the use of fallocate/posix_fallocate, a new thread might be more appropriate? -- Jon
On 2013-05-13 08:45:41 -0500, Jon Nelson wrote: > On Mon, May 13, 2013 at 8:32 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2013-05-12 19:41:26 -0500, Jon Nelson wrote: > >> On Sun, May 12, 2013 at 3:46 PM, Jim Nasby <jim@nasby.net> wrote: > >> > On 5/10/13 1:06 PM, Jeff Janes wrote: > >> >> > >> >> Of course the paranoid DBA could turn off restart_after_crash and do a > >> >> manual investigation on every crash, but in that case the database would > >> >> refuse to restart even in the case where it perfectly clear that all the > >> >> following WAL belongs to the recycled file and not the current file. > >> > > >> > > >> > Perhaps we should also allow for zeroing out WAL files before reuse (or just > >> > disable reuse). I know there's a performance hit there, but the reuse idea > >> > happened before we had bgWriter. Theoretically the overhead creating a new > >> > file would always fall to bgWriter and therefore not be a big deal. > >> > >> For filesystems like btrfs, re-using a WAL file is suboptimal to > >> simply creating a new one and removing the old one when it's no longer > >> required. Using fallocate (or posix_fallocate) (I have a patch for > >> that!) to create a new one is - by my tests - 28 times faster than the > >> currently-used method. > > > > I don't think the comparison between just fallocate()ing and what we > > currently do is fair. fallocate() doesn't guarantee that the file is the > > same size after a crash, so you would still need an fsync() or we > > couldn't use fdatasync() anymore. And I'd guess the benefits aren't all > > that big anymore in that case? > > fallocate (16MB) + fsync is still almost certainly faster than > write+write+write... + fsync. > The test I performed at the time did exactly that .. posix_fallocate + pg_fsync. Sure, the initial file creation will be faster. But are the actual individual wal writes (small, frequently fdatasync()ed) still faster? That's the critical path currently. Whether it is pretty much depends on how the filesystem manages allocated but not initialized blocks... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, May 13, 2013 at 2:49 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Sure, the initial file creation will be faster. But are the actual > individual wal writes (small, frequently fdatasync()ed) still faster? > That's the critical path currently. > Whether it is pretty much depends on how the filesystem manages > allocated but not initialized blocks... In ext4, AIUI, it doesn't actually pick target blocks. It just adjusts the accounting so it knows that many blocks will be needed for this file and guarantees they'll be available. If you read from them it knows to provide 0s. So in theory the performance in the critical path would be worse but I think by an insignificant amount. The reason Postgres pre-allocates the blocks is not a performance optimization. It's for safety. To guarantee -- as best as possible -- that it won't get a write error when the time comes to write to it. Especially to guarantee that the disk won't suddenly turn out to be full. It seems possible that some file systems would not protect you against media errors nearly as well using it. It might take time to respond to a media error and in a poorly written filesystem it might even be reported to the application even though there's no need. But media errors can occur any time, even after the initial write so I don't think this should be a blocker. I think posix_fallocate is good enough for us and I would support using it. -- greg
On 2013-05-13 16:03:11 +0100, Greg Stark wrote: > On Mon, May 13, 2013 at 2:49 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > Sure, the initial file creation will be faster. But are the actual > > individual wal writes (small, frequently fdatasync()ed) still faster? > > That's the critical path currently. > > Whether it is pretty much depends on how the filesystem manages > > allocated but not initialized blocks... > > In ext4 aIui it doesn't actually pick target blocks. It just adjusts > the accounting so it knows that many blocks will be needed for this > file and guarantees they'll be available. If you read from them it > knows to provide 0s. So in theory the performance in the critical path > would be worse but I think by an insignificant amount. > > The reason Postgres pre-allocates the blocks is not for the > performance optimization. It's for safety. To guarantee -- as best as > possible -- that it won't get a write error when the time comes to > write to it. Especially to guarantee that the disk won't suddenly turn > out to be full. posix_fallocate() guarantees that. And if you fsync() the file afterwards its even supposed to still have enough space after the crash. "After a successful call to posix_fallocate(), subsequent writes to bytes in the specified range are guaranteed not to fail because of lack of disk space." > It seems possible that some file systems would not protect you against > media errors nearly as well using it. True. The same probably is true for modifications of existing files for those fancy COW filesystems... > I think posix_fallocate is good enough for us and I would support > using it. Me too, although this isn't the place where I'd be really excited to see a patch implementing it properly ;) Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 13 May 2013 14:45, Jon Nelson <jnelson+pgsql@jamponi.net> wrote: > I should not derail this thread any further. Perhaps, if interested > parties would like to discuss the use of fallocate/posix_fallocate, a > new thread might be more appropriate? Sounds like a good idea. Always nice to see a fresh take on earlier ideas. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services