Re: Enabling Checksums - Mailing list pgsql-hackers
From | Jeff Davis |
---|---|
Subject | Re: Enabling Checksums |
Date | |
Msg-id | 1359325730.7413.33.camel@jdavis-laptop |
In response to | Re: Enabling Checksums (Robert Haas <robertmhaas@gmail.com>) |
Responses | Re: Enabling Checksums |
List | pgsql-hackers |
On Sat, 2013-01-26 at 23:23 -0500, Robert Haas wrote:
> > If we were to try to defer writing the WAL until the page was being
> > written, the most it would possibly save is the small XLOG_HINT WAL
> > record; it would not save any FPIs.
>
> How is the XLOG_HINT_WAL record kept small and why does it not itself
> require an FPI?

There's a maximum of one FPI per page per cycle, and we need the FPI for
any modified page in this design regardless.

So, deferring the XLOG_HINT WAL record doesn't change the total number
of FPIs emitted. The only savings would be on the trivial XLOG_HINT WAL
record itself, because we might notice that it's not necessary in the
case where some other WAL action happened to the page.

> > At first glance, it seems sound as long as the WAL FPI makes it to disk
> > before the data. But to meet that requirement, it seems like we'd need
> > to write an FPI and then immediately flush WAL before cleaning a page,
> > and that doesn't seem like a win. Do you (or Simon) see an opportunity
> > here that I'm missing?
>
> I am not sure that isn't a win. After all, we can need to flush WAL
> before flushing a buffer anyway, so this is just adding another case -

Right, but if we get the WAL record in earlier, there is a greater
chance that it goes out with some unrelated WAL flush, and we don't need
to flush the WAL to clean the buffer at all. Separating WAL insertions
from WAL flushes seems like a fairly important goal, so I'm a little
skeptical of a proposal to narrow that gap so drastically.

It's hard to analyze without a specific proposal on the table. But if
cleaning pages requires a WAL record followed immediately by a flush, it
seems like that would increase the number of actual WAL flushes we need
to do by a lot.

> and the payoff is that the initial access to a page, setting hint
> bits, is quickly followed by a write operation, we avoid the need for
> any extra WAL to cover the hint bit change. I bet that's common,
> because if updating you'll usually need to look at the tuples on the
> page and decide whether they are visible to your scan before, say,
> updating one of them

That's a good point, I'm just not sure how to avoid that problem without
a lot of complexity or a big cost. It seems like we want to defer the
XLOG_HINT WAL record for a short time; but not wait so long that we need
to clean the buffer or miss a chance to piggyback on another WAL flush.

> > By the way, the approach I took was to add the heap buffer to the WAL
> > chain of the XLOG_HEAP2_VISIBLE wal record when doing log_heap_visible.
> > It seemed simpler to understand than trying to add a bunch of options to
> > MarkBufferDirty.
>
> Unless I am mistaken, that's going to heavy penalize the case where
> the user vacuums an insert-only table. It will emit much more WAL
> than currently.

Yes, that's true, but I think that's pretty fundamental to this
checksums design (and of course it only applies if checksums are
enabled). We need to make sure an FPI is written and the LSN bumped
before we write a page.

That's why I was pushing a little on various proposals to either remove
or mitigate the impact of hint bits (load path, remove PD_ALL_VISIBLE,
cut down on the less-important hint bits, etc.). Maybe those aren't
viable, but that's why I spent time on them.

There are some other options, but I cringe a little bit thinking about
them. One is to simply exclude the PD_ALL_VISIBLE bit from the checksum
calculation, so that a torn page doesn't cause a problem (though
obviously that one bit would be vulnerable to corruption). Another is to
use a double-write buffer, but that didn't seem to go very far. Or, we
could abandon the whole thing and tell people to use ZFS/btrfs/NAS/SAN.

Regards,
	Jeff Davis
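
The FPI-counting argument in the message above can be illustrated with a toy model. The sketch below is not PostgreSQL code: the `Page` and `Wal` classes, the event names, and the `run` helper are all hypothetical, and it assumes a single page within one checkpoint cycle, with the first WAL record touching the page in that cycle carrying the FPI. Under those assumptions it compares logging XLOG_HINT eagerly against deferring it until the page is written.

```python
from dataclasses import dataclass


@dataclass
class Page:
    fpi_logged: bool = False    # FPI already emitted for this page in this cycle?
    hint_pending: bool = False  # deferred policy: hint set but not yet WAL-logged


@dataclass
class Wal:
    fpis: int = 0
    hint_records: int = 0
    other_records: int = 0


def _attach_fpi_if_first(page: Page, wal: Wal) -> None:
    # Assumption: at most one FPI per page per checkpoint cycle.
    if not page.fpi_logged:
        page.fpi_logged = True
        wal.fpis += 1


def run(events: list[str], deferred: bool) -> Wal:
    """Replay hypothetical events ("hint", "update", "write") on one page."""
    page, wal = Page(), Wal()
    for ev in events:
        if ev == "hint":
            if deferred:
                page.hint_pending = True       # postpone the XLOG_HINT record
            else:
                wal.hint_records += 1          # log it immediately
                _attach_fpi_if_first(page, wal)
        elif ev == "update":
            wal.other_records += 1             # some other WAL action on the page
            _attach_fpi_if_first(page, wal)
            page.hint_pending = False          # the hint is now covered
        elif ev == "write":
            if page.hint_pending:              # nothing else covered the hint
                wal.hint_records += 1
                _attach_fpi_if_first(page, wal)
                page.hint_pending = False
        else:
            raise ValueError(f"unknown event: {ev}")
    return wal


if __name__ == "__main__":
    for events in (["hint", "update", "write"], ["hint", "write"]):
        print(events, "eager:", run(events, False), "deferred:", run(events, True))
```

In this model both policies emit exactly one FPI for either event sequence. Deferring saves only the small hint record, and only when another WAL-logged change reaches the page before it is written. The model says nothing about WAL flush timing, which is the cost the message is concerned with.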