Thread: Logical to physical page mapping
I just had this thought a few minutes ago, discussed it briefly with RhodiumToad on #postgresql and wanted to put it out here for discussion. Feel free to rip it apart. It probably is a bit "al dente" at this point and needs more cooking.

The reason why we need full_page_writes is that we need to guard against torn pages or partial writes. So what if smgr would manage a mapping between logical page numbers and their physical location in the relation?

At the moment where we today require a full page write into WAL, we would mark the buffer as "needs relocation". The smgr would then write this page into another physical location whenever it is time to write it (via the background writer, hopefully). After that page is flushed, it would update the page location pointer, or whatever we want to call it. A thus freed physical page location can be reused once the location pointer has been flushed to disk. This is a critical ordering of writes: first the page at the new location, second the pointer to the current location. Doing so would make write(2) appear atomic to us, which is exactly what we need for crash recovery.

In addition to that, vacuum would now be able to tell smgr "hey, this page is completely empty". Instead of doing the second "empty page for truncate" scan, smgr could slowly migrate pages on first touch after a checkpoint towards the head of the file, into these empty pages. This way it would free pages at the end, and smgr would then be completely at liberty to truncate them off whenever it sees fit. No extra scan required, just a little more bookkeeping. This would not only be the case for heap pages, but for empty index pages as well. Shrinking/truncating indexes is something we are completely unable to do today.

Whenever the buffer manager is asked for such a page that doesn't exist physically any more, it would just initialize an empty one of that kind (heap/index) in a buffer and mark it "needs relocation".
It would get recreated physically on eviction/checkpoint without freeing any previously occupied space.

Comments?

Jan

--
Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
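The relocation protocol Jan describes hinges on one write ordering: the page goes to a fresh physical slot first, and the old slot stays valid until the location pointer has been durably updated. A minimal sketch of that ordering, with all names (`MappedRelation`, `write_page`, etc.) invented for illustration and fsyncs reduced to comments:

```python
# Hypothetical sketch of the proposed smgr relocation protocol: write the
# page to a NEW physical slot first, then flip the logical->physical pointer.
# A crash between the two steps leaves the pointer naming the old, intact copy.

class MappedRelation:
    def __init__(self, page_size=8192):
        self.page_size = page_size
        self.physical = {}   # physical slot -> page bytes
        self.pointer = {}    # logical page number -> physical slot
        self.free = []       # slots reusable after the pointer flush
        self.next_slot = 0

    def _alloc(self):
        if self.free:
            return self.free.pop()
        slot = self.next_slot
        self.next_slot += 1
        return slot

    def write_page(self, logical, data):
        assert len(data) == self.page_size
        new = self._alloc()
        self.physical[new] = data      # step 1: page at the new location
        # (fsync of the data file would go here)
        old = self.pointer.get(logical)
        self.pointer[logical] = new    # step 2: pointer to the current location
        # (fsync of the pointer map would go here)
        if old is not None:
            self.free.append(old)      # old slot reusable only after step 2

    def read_page(self, logical):
        return self.physical[self.pointer[logical]]
```

Even if the step-1 write is torn by a crash, recovery sees the pointer still referencing the previous, complete copy of the page, which is what makes write(2) appear atomic in this scheme.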
Jan Wieck <JanWieck@Yahoo.com> writes:
> The reason why we need full_page_writes is that we need to guard against
> torn pages or partial writes. So what if smgr would manage a mapping
> between logical page numbers and their physical location in the relation?

> At the moment where we today require a full page write into WAL, we
> would mark the buffer as "needs relocation". The smgr would then write
> this page into another physical location whenever it is time to write it
> (via the background writer, hopefully). After that page is flushed, it
> would update the page location pointer, or whatever we want to call it.
> A thus free'd physical page location can be reused, once the location
> pointer has been flushed to disk. This is a critical ordering of writes.
> First the page at the new location, second the pointer to the current
> location. Doing so would make write(2) appear atomic to us, which is
> exactly what we need for crash recovery.

I think you're just moving the atomic-write problem from the data pages to wherever you keep these pointers.

			regards, tom lane
On 27.10.2012 16:43, Tom Lane wrote:
> Jan Wieck<JanWieck@Yahoo.com> writes:
>> The reason why we need full_page_writes is that we need to guard against
>> torn pages or partial writes. So what if smgr would manage a mapping
>> between logical page numbers and their physical location in the relation?
>
>> At the moment where we today require a full page write into WAL, we
>> would mark the buffer as "needs relocation". The smgr would then write
>> this page into another physical location whenever it is time to write it
>> (via the background writer, hopefully). After that page is flushed, it
>> would update the page location pointer, or whatever we want to call it.
>> A thus free'd physical page location can be reused, once the location
>> pointer has been flushed to disk. This is a critical ordering of writes.
>> First the page at the new location, second the pointer to the current
>> location. Doing so would make write(2) appear atomic to us, which is
>> exactly what we need for crash recovery.

Hmm, aka copy-on-write.

> I think you're just moving the atomic-write problem from the data pages
> to wherever you keep these pointers.

If the pointers are stored as simple 4-byte integers, you probably could assume that they're atomic, and won't be torn.

There are a lot of practical problems in adding another level of indirection to every page access, though. It'll surely add some overhead to every access, even if the data never changes. And it's not at all clear to me that it would perform better than full_page_writes. You're writing and flushing out roughly the same amount of data AFAICS.

What exactly is the problem with full_page_writes that we're trying to solve?

- Heikki
On Sat, Oct 27, 2012 at 3:41 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>> I think you're just moving the atomic-write problem from the data pages
>> to wherever you keep these pointers.
>
> If the pointers are stored as simple 4-byte integers, you probably could
> assume that they're atomic, and won't be torn.

That could backfire.
Claudio Freire <klaussfreire@gmail.com> writes:
> On Sat, Oct 27, 2012 at 3:41 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>>> I think you're just moving the atomic-write problem from the data pages
>>> to wherever you keep these pointers.

>> If the pointers are stored as simple 4-byte integers, you probably could
>> assume that they're atomic, and won't be torn.

> That could backfire.

Yeah, the potential loss in resiliency is scary. Assume for the sake of argument that we're storing these indirection pointers in 8K pages, 2000 or so to the page. If you get a read failure on a regular heap page, you've lost 8K worth of data. If you get a read failure on an indirection page, you've lost 16MB worth. (Though perhaps the pointers could be reconstructed, given additional overhead data on each regular heap page; but that wouldn't be very cheap.)

Also, the write traffic on a pointer page is 2000X as much as it is on an average heap page, offering 2000X the opportunities for the disk hardware to drop a bit, or for SSD storage to wear down, etc. (Perhaps it's not that bad for typical access patterns, but then again perhaps it's worse, given the already noted strict requirements for write ordering.) So it seems like there'd be a rather nasty magnification of consequences for hardware errors.

You've also got something like a 2000X concentration of access contention for pointer pages compared to regular pages.

			regards, tom lane
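Tom's "2000 or so" and "16MB" figures follow directly from the page and pointer sizes; a quick arithmetic check of the blast radius of losing one indirection page:

```python
# Back-of-envelope check of the resiliency argument: how much logical data
# is referenced by a single 8K page full of 4-byte location pointers?

PAGE = 8192           # standard PostgreSQL block size in bytes
PTR = 4               # proposed location pointer size in bytes

ptrs_per_page = PAGE // PTR          # pointers packed into one page
blast_radius = ptrs_per_page * PAGE  # heap bytes referenced by that page

assert ptrs_per_page == 2048                    # Tom's "2000 or so to the page"
assert blast_radius == 16 * 1024 * 1024         # losing it loses 16MB of data
```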
On 28/10/12 07:41, Heikki Linnakangas wrote:
> On 27.10.2012 16:43, Tom Lane wrote:
>> Jan Wieck<JanWieck@Yahoo.com> writes:
>>> The reason why we need full_page_writes is that we need to guard
>>> against torn pages or partial writes. So what if smgr would manage a
>>> mapping between logical page numbers and their physical location in
>>> the relation?
>>>
>>> At the moment where we today require a full page write into WAL, we
>>> would mark the buffer as "needs relocation". The smgr would then write
>>> this page into another physical location whenever it is time to
>>> write it (via the background writer, hopefully). After that page is
>>> flushed, it would update the page location pointer, or whatever we
>>> want to call it. A thus free'd physical page location can be reused,
>>> once the location pointer has been flushed to disk. This is a critical
>>> ordering of writes. First the page at the new location, second the
>>> pointer to the current location. Doing so would make write(2) appear
>>> atomic to us, which is exactly what we need for crash recovery.
>
> Hmm, aka copy-on-write.
>
>> I think you're just moving the atomic-write problem from the data pages
>> to wherever you keep these pointers.
>
> If the pointers are stored as simple 4-byte integers, you probably
> could assume that they're atomic, and won't be torn.
>
> There's a lot of practical problems in adding another level of
> indirection to every page access, though. It'll surely add some
> overhead to every access, even if the data never changes. And it's not
> at all clear to me that it would perform better than full_page_writes.
> You're writing and flushing out roughly the same amount of data AFAICS.
>
> What exactly is the problem with full_page_writes that we're trying to
> solve?
>
> - Heikki

Would a 4 byte pointer be adequate for a 64 bit machine with well over 4GB used by Postgres?

Cheers,
Gavin
On Sat, Oct 27, 2012 at 8:41 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> If the pointers are stored as simple 4-byte integers, you probably could
> assume that they're atomic, and won't be torn.

Actually I think any fixed-size data structure would be fine. We're worried about storage on disk here, not in memory. As long as the pointers don't move around on the page, then if part of a page is written and part not, the only pointer possibly corrupted would be the one spanning the old and new partial pages. That could be avoided or detected easily.

The problem with torn pages on heap pages comes about only because tuples are variable sized. As a result the page has line pointers that need to be in sync with other parts of the page, and vacuum needs to be able to move tuples around on the page.

(Hm. I wonder if we could actually get away with only doing full page writes on vacuum, if we could detect torn pages from vacuum reliably and avoid trying to read any data from them until they get restored by the full page write.)

--
greg
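Greg's "could be avoided easily" is even stronger than it sounds for this layout: assuming 512-byte sectors and naturally aligned 4-byte pointers (both assumptions, not anything the thread nails down), no pointer can ever straddle a sector boundary, so a torn write at sector granularity never leaves a half-written pointer at all:

```python
# With fixed-size, naturally aligned pointers, a write torn at sector
# boundaries cannot split any single pointer: every 4-byte slot lies
# entirely within one 512-byte sector (since 4 divides 512 evenly).

PAGE, SECTOR, PTR = 8192, 512, 4   # assumed page, sector, pointer sizes

for i in range(PAGE // PTR):
    first_byte = i * PTR
    last_byte = i * PTR + PTR - 1
    # both ends of the pointer fall in the same sector
    assert first_byte // SECTOR == last_byte // SECTOR
```

This is exactly why the torn-page hazard is specific to variable-sized tuple data, where line pointers and tuple bodies in different sectors must stay consistent with each other.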
On 10/27/2012 2:41 PM, Heikki Linnakangas wrote:
> And it's not at all clear to me that it would perform better than
> full_page_writes. You're writing and flushing out roughly the same
> amount of data AFAICS.

I think this assumption is wrong. full_page_writes=on means we write the full page content to WAL on first change after a checkpoint. A change after a checkpoint logically means that the same page is dirty now and must also be written at the latest during the next checkpoint, which means 16K written minimum for every page changed after a checkpoint.

> What exactly is the problem with full_page_writes that we're trying to
> solve?

Full page writes are meant to guard against torn pages. That is the case when an 8K page is written by the underlying OS/filesystem/HW in smaller chunks (for example 512 byte sectors), and in the case of a crash some of these chunks make it while others don't. Without full_page_writes, crash recovery can work if all 8K made it, or nothing made it (aka atomic writes). But it will fail otherwise.

The amount of WAL generated with full_page_writes=on is quite substantial. For pgbench, for example, the ratio is about 20:1. Meaning with full_page_writes you write 20x the amount you do without.

Jan
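The 20:1 figure is plausible from first principles when most pages are touched for the first time in a checkpoint cycle, as pgbench's uniform distribution guarantees. A rough model, where the ~400-byte plain record size is an assumed average chosen to illustrate the claim, not a measured PostgreSQL constant:

```python
# Rough model of WAL amplification from full_page_writes: on first touch
# after a checkpoint, an update logs a full 8K page image instead of a
# small logical record.

fpi_bytes = 8192          # full page image, one per page per checkpoint cycle
plain_record_bytes = 400  # ASSUMED average update record size, for illustration

# worst case: every update hits a page not yet logged this cycle (pgbench's
# uniform distribution approaches this when checkpoints are frequent)
amplification = fpi_bytes / plain_record_bytes
assert round(amplification) == 20   # consistent with the observed ~20:1 ratio

# total write cost per touched page per cycle: 8K FPI in WAL + the 8K page
# itself written out by the next checkpoint
assert fpi_bytes + 8192 == 16 * 1024  # the "16K written minimum" above
```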
On 28 October 2012 22:35, Jan Wieck <JanWieck@yahoo.com> wrote:
> The amount of WAL generated with full_page_writes=on is quite substantial.
> For pgbench for example the ratio 20:1. Meaning with full_page_writes you
> write 20x the amount you do without.

Sure, but as always, pgbench pessimises everything by having writes follow a uniform distribution, which is completely unrepresentative of the natural world. This will literally maximise the number of pristine full page images that need to be included. The actual frequency of checkpoints is a crucial factor here too, and you didn't mention anything about that.

--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On 10/28/2012 10:50 PM, Peter Geoghegan wrote:
> On 28 October 2012 22:35, Jan Wieck <JanWieck@yahoo.com> wrote:
>> The amount of WAL generated with full_page_writes=on is quite substantial.
>> For pgbench for example the ratio 20:1. Meaning with full_page_writes you
>> write 20x the amount you do without.
>
> Sure, but as always, pgbench pessimises everything by having writes
> follow a uniform distribution, which is completely unrepresentative of
> the natural world. This will literally maximise the number of pristine
> full page images that need to be included. The actual frequency of
> checkpoints is a crucial factor here too, and you didn't mention
> anything about that.

Higher buffer cache hit rates certainly reduce that ratio.

Well, I guess it was just one of those random thoughts that can't work in the end or aren't worth the work anyway.

Jan
On Sat, Oct 27, 2012 at 1:01 AM, Jan Wieck <JanWieck@yahoo.com> wrote:
> The reason why we need full_page_writes is that we need to guard against
> torn pages or partial writes. So what if smgr would manage a mapping
> between logical page numbers and their physical location in the relation?

This sounds a lot like http://en.wikipedia.org/wiki/Shadow_paging

According to my copy of Gray and Reuter, shadow paging is in fact a workable way of providing atomicity and durability, but as of its writing (1992) shadow paging had been essentially abandoned because it didn't have very good performance characteristics. One of the big problems is that you lose locality of reference - e.g. there's nothing at all sequential about a sequential scan if, below the mapping layer, the blocks are scattered about the disk, which is a likely outcome if they are frequently updated, or in the long run even if they are only occasionally updated.

It's occurred to me before to think that this might work if we did it not at the block level, but at some higher level, with say 64MB segments. That wouldn't impinge too much on sequential access, but it would allow vacuum to clip out an entire 64MB segment anywhere in the relation if it happened to be empty, or perhaps to rewrite a 64MB segment of a relation without rewriting the whole thing. But it wouldn't do anything about torn pages.

Another idea that's been previously proposed (and which is used by MySQL, and previously proposed by VMware for inclusion in PostgreSQL) for torn-page avoidance is that of a double-write buffer - i.e. instead of including full page images in WAL, write them to the double-write buffer; if we crash, start by restoring all the pages from the double-write buffer; then, replay WAL. This avoids passing the full-page images through the WAL stream sent from master to slave, because the slave can have its own double-write buffer.
This would probably also allow slaves to perform restart-points at arbitrary locations, independent of where the master performs checkpoints.

In the patch as proposed, the double-write buffer was kept very small, in the hopes of keeping it within the presumed BBWC, so that very-frequent fsyncs would all reference the same pages and therefore all be absorbed by the cache. This delivers terrible performance without a BBWC, though, because the fsyncs are so frequent. Alternatively, you could imagine a large double-write buffer which only gets flushed once per checkpoint cycle or so - i.e. basically what we have now, but just separating the FPW traffic from the main WAL stream.

Indeed, you could extend that a bit further: why throw out the double-write buffer just because there's been a checkpoint cycle? In a workload like pgbench, it seems likely that the same pages will be written over and over again. You could have a checkpoint whose purpose is only to minimize the recovery time in cases where no pages are torn. You could then also have a less frequent "super-checkpoint" cycle and retain WAL back to the last "super-checkpoint". In the hopefully-unlikely event that we detect a torn page (through a checksum failure, presumably) then we hunt backwards through WAL (something our current infrastructure doesn't really support) and find the last FPI for that torn page and then begin selective replay from that point, scanning through all of the WAL since the last super-checkpoint and replaying all and only records pertaining to that page. But when no pages are torn, then you only need to recover from the last "normal" checkpoint.

I have heard reports (on this mailing list, I think) that Oracle does something like this, but I haven't tried to verify for myself whether that is in fact the case.
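The core of the double-write idea is that the buffer is made durable before the in-place write, so recovery can always overwrite a possibly-torn page with a known-good copy before WAL replay begins. A toy model of just that mechanism, with all class and method names invented for this sketch (the real proposals involve files and fsyncs, elided here):

```python
# Toy double-write buffer: pages are persisted to a side buffer first, then
# written in place. After a crash, every page in the buffer is restored
# before WAL replay, so a torn in-place write can never be observed.

class DoubleWriteStore:
    PAGE = 8192

    def __init__(self):
        self.dwbuf = {}   # pageno -> bytes; written and flushed FIRST
        self.data = {}    # pageno -> bytes; the in-place write, may tear

    def flush_page(self, pageno, page):
        assert len(page) == self.PAGE
        self.dwbuf[pageno] = page   # step 1: durable copy in the buffer
        self.data[pageno] = page    # step 2: write the page in place

    def simulate_torn_write(self, pageno):
        # model a crash mid-write: first half new, second half zeroed
        page = self.data[pageno]
        self.data[pageno] = page[:4096] + b"\x00" * 4096

    def recover(self):
        # crash recovery, before WAL replay: restore from the buffer
        for pageno, page in self.dwbuf.items():
            self.data[pageno] = page
```

Because the slave can maintain its own such buffer, the full page images no longer need to travel in the WAL stream at all, which is the bandwidth win Robert points out.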
Yet another idea we've tossed around is to make only vacuum records include FPWs, and have the more common heap insert/update/delete operations include enough information that they can still be applied correctly even if the page has been "torn" by the previous replay of such a record. This would involve modifying the recovery algorithm so that, until consistency is reached, we replay all records, regardless of LSN, which would cost some extra I/O, but maybe not too much to live with? It would also require that, for example, a heap-insert record mention the line pointer index used for the insertion; currently, we count on the previous state of the page to tell us that. For checkpoint cycles of reasonable length, the cost of storing the line pointer in every WAL record seems like it'll be less than the cost of needing to write an FPI for the page once per checkpoint cycle, but this isn't certain to be the case for all workloads.

OK, I'll stop babbling now...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
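The reason the record must carry its own line pointer index is that replay then no longer depends on the prior state of the page: applying the record to a torn page and to an intact page produces the same slot contents. A minimal sketch of that idempotence property, with the record layout and function names invented for illustration:

```python
# Sketch of a "self-contained" heap-insert record: because the record names
# the exact line pointer slot, replaying it is correct regardless of what
# state the (possibly torn) page was left in.

def replay_insert(page_slots, record):
    """Apply an insert record to a page's line pointer array (as a dict)."""
    result = dict(page_slots)
    result[record["slot"]] = record["tuple"]   # no dependence on prior state
    return result
```

Contrast with today's records, where the slot is derived from the page's existing line pointer count, so replaying against a torn or stale page version could place the tuple in the wrong slot; that dependence is what forces the replay-all-records-regardless-of-LSN change Robert describes.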
On 10/29/2012 12:05 PM, Robert Haas wrote:
> OK, I'll stop babbling now...

Not perceived as babbling here. Thanks for that nice round-up of options and ideas around the torn page problem.

Regards

Markus Wanner
On Mon, Oct 29, 2012 at 07:05:39AM -0400, Robert Haas wrote:
> Yet another idea we've tossed around is to make only vacuum records
> include FPWs, and have the more common heap insert/update/delete
> operations include enough information that they can still be applied
> correctly even if the page has been "torn" by the previous replay of
> such a record. This would involve modifying the recovery algorithm so
> that, until consistency is reached, we replay all records, regardless
> of LSN, which would cost some extra I/O, but maybe not too much to
> live with? It would also require that, for example, a heap-insert
> record mention the line pointer index used for the insertion;
> currently, we count on the previous state of the page to tell us that.
> For checkpoint cycles of reasonable length, the cost of storing the
> line pointer in every WAL record seems like it'll be less than the
> cost needing to write an FPI for the page once per checkpoint cycle,
> but this isn't certain to be the case for all workloads.

This last idea has the most promise for me. Vacuum is far less common than row modification writes.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +