Re: crash-safe visibility map, take three - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: crash-safe visibility map, take three |
Date | |
Msg-id | AANLkTi=DOvFWWZFNxJObeAWiEu9dyAcTxLUQCgA1fNdt@mail.gmail.com |
In response to | Re: crash-safe visibility map, take three (Jeff Davis <pgsql@j-davis.com>) |
Responses |
Re: crash-safe visibility map, take three
Re: crash-safe visibility map, take three |
List | pgsql-hackers |
On Wed, Dec 1, 2010 at 5:24 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2010-12-01 at 15:59 -0500, Robert Haas wrote:
>> As for CRCs, there's a pretty direct chain of inference here:
>>
>> 1. CRCs are hard (really impossible) because we have hint bits.
>
> I would disagree with "impossible". If we don't set hint bits during
> reading; and when we do set them, we log them (including full page
> writes); then we can do CRCs.
>
> Those things have costs, but we might be willing to pay them if we had a
> bulk loading strategy that avoids or mitigates the costs.
>
> The reason we can't do CRCs now is because hint bits violate the
> WAL-before-data rule; not because of hint bits themselves. We're talking
> about adding another feature that breaks the rule, in a more complex way
> than hint bits.
>
> I just wanted to step back for a second and consider the problem from a
> different angle before we committed to that.

Well, let's think about what we'd need to do to make CRCs work reliably. There are two problems.

1. Currently, hint bits are not vulnerable to the torn-page problem, because the hint bit change is to a single byte, and neither of the two possible values for the affected byte invalidates the contents of the block. Thus, they do not need to be WAL-logged - we're happy if they all make it to disk, but if some or none of them make it to disk, that's OK. If we CRC the entire page, then torn pages are never acceptable, so every action that modifies the page must be WAL-logged.

2. Currently, we allow hint bits on a page to be updated while holding a shared-content lock; we also allow the page to be written while holding only a shared-content lock. This makes it a bit nondeterministic whether the hint bit update is included in the write, but we don't care. If we were to compute a CRC and write that into the page before writing it out to the OS, it would be unacceptable for the page contents to change thereafter in any way.
So, to make CRCs work, we'd need to (a) WAL-log every hint bit update and (b) change either hint bit updates or page write-outs to require an exclusive content lock rather than a shared one. The first would result in an increase in I/O, while the second would result in a reduction in concurrency.

Thinking about it a bit, I wonder if we couldn't mitigate (b) quite a bit by adding a new level for buffer content locks, share-exclusive. This would conflict with itself and with exclusive but not with share locks, and would be required to set hint bits or write the buffer. When setting hint bits with only a share lock, we'd attempt to do a non-blocking upgrade to share-exclusive. If that failed - because someone else already held a share-exclusive lock - we'd just skip the hint bit update.

I have no idea what to do about (a), though. *thinks some more* Or maybe I do. One other thing I've been thinking about with regard to hint bit updates is that we might choose to mark buffers that are hint-bit-updated as "untidy" rather than "dirty". The background writer could treat these pages as dirty, but checkpoints and backends doing desperation-buffer-reclamation could treat them as clean. This would allow hint bit updates to trickle out to disk in the background, without letting them bottleneck anything on the critical path.

Maybe we could do this - if CRCs are enabled and we are the background writer cleaning scan, write dirty buffers in the usual way and write untidy buffers to a "double-write buffer" (to borrow a page from InnoDB) along with the current LSN. At the conclusion of the scan, fsync() the double-write buffer and then write the buffers a second time in the normal fashion if their mappings haven't changed and they are still untidy. On redo, when you reach an LSN recorded in the double-write buffer, restore the FPI. In general, a double-write buffer is inferior to our existing FPI system, because you end up needing to fsync both the double-write buffer and the WAL stream.
But it might be OK in this case, if it's all happening as background work.

--

With respect to your concerns about this method, after some thought, I think #2 isn't an issue at all, because I don't believe we can risk having our update to HEAP_XMIN_FROZEN stomped on by someone else trying to set HEAP_XMIN_COMMITTED, so I think that when making a page all-visible we'll need an exclusive (or share-exclusive) content lock anyway.

As to #1, I think we could restore the WAL-before-data rules if we kept a bit somewhere in the buffer descriptor indicating whether a given buffer has had an FPI since the last checkpoint. Then, perhaps, WAL records that are torn-page-safe could bump the LSN without emitting an FPI. The next WAL record to come along would be able to determine that one was still needed.

Of course, to make CRCs work with this, you still need to emit FPIs or use a double-write buffer. That sucks, and I don't know what to do about it. Since our current hint-bit updates are not WAL-logged, a CRC implementation over it could try to get by with chucking untidy buffers (either all the time or just sometimes) without actually writing them. But these updates WILL be WAL-logged, so you can't just refuse to write them after the fact. Hmm...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
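For reference, the share-exclusive lock level proposed in this message can be modeled roughly as below. This is a single-threaded sketch of the semantics only, not a working lock and not PostgreSQL's LWLock API; ContentLockMode, BufferContentLock, and the function names are all hypothetical. It encodes the conflict rules as stated (share-exclusive conflicts with itself and with exclusive, but not with share) and the "try a non-blocking upgrade, else skip the hint bit" behavior.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock levels; only the middle one is new in the proposal. */
typedef enum {
    LOCK_SHARE = 0,
    LOCK_SHARE_EXCLUSIVE,
    LOCK_EXCLUSIVE
} ContentLockMode;

/* conflict_matrix[held][wanted] is true if the two modes cannot coexist. */
static const bool conflict_matrix[3][3] = {
    /* held SHARE           */ { false, false, true },
    /* held SHARE_EXCLUSIVE */ { false, true,  true },
    /* held EXCLUSIVE       */ { true,  true,  true },
};

bool lock_modes_conflict(ContentLockMode held, ContentLockMode wanted)
{
    return conflict_matrix[held][wanted];
}

/* Toy lock state; a real implementation would need atomics and wait queues. */
typedef struct {
    int  share_holders;          /* number of share-lock holders */
    bool share_exclusive_held;   /* at most one holder at a time */
    bool exclusive_held;
} BufferContentLock;

/*
 * Non-blocking upgrade from share to share-exclusive. The caller must
 * already hold a share lock. Returns false instead of waiting if another
 * backend already holds share-exclusive.
 */
bool try_upgrade_to_share_exclusive(BufferContentLock *lock)
{
    assert(lock->share_holders > 0);
    if (lock->share_exclusive_held)
        return false;
    lock->share_exclusive_held = true;
    return true;
}

/*
 * Set a hint bit only if the upgrade succeeds; otherwise skip the update,
 * which is safe because hint bits are only an optimization.
 */
bool set_hint_bit_if_possible(BufferContentLock *lock,
                              unsigned char *infomask_byte,
                              unsigned char hint)
{
    if (!try_upgrade_to_share_exclusive(lock))
        return false;
    *infomask_byte |= hint;
    lock->share_exclusive_held = false;   /* drop back to plain share */
    return true;
}
```

Since a writer of the buffer would also need share-exclusive under this scheme, the page contents cannot change between computing the CRC and handing the page to the OS, while ordinary readers holding share locks are not blocked.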