
From: Robert Haas
Subject: Re: 9.3: summary of corruption detection / checksums / CRCs discussion
Date:
Msg-id: CA+TgmoY9pB27f22aWfYrPXKYaoCfLNojDu5aTkWu-xrYyMgymA@mail.gmail.com
In response to: 9.3: summary of corruption detection / checksums / CRCs discussion (Jeff Davis <pgsql@j-davis.com>)
List: pgsql-hackers
On Sat, Apr 21, 2012 at 5:40 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> * In addition to detecting random garbage, we also need to be able to
> detect zeroing of pages. Right now, a zero page is not considered
> corrupt, so that's a problem. We'll need to WAL table extension
> operations, and we'll need to mitigate the performance impact of doing
> so. I think we can do that by extending larger tables by many pages
> (say, 16 at a time) so we can amortize the cost of WAL and avoid
> contention.

I think that extending tables in larger chunks is probably a very good
idea for performance reasons completely apart from checksums.
However, I don't think that WAL-logging relation extension will allow
you to treat a zero page as an error condition unless you not only
WAL-log the operation but also *flush WAL* before performing the
actual table-extension operation.  Otherwise, we might crash after the
actual extension and before the WAL record hits the disk, and now
we're back to having a zero'd page in the file.  And the impact of
writing *and flushing* WAL for every extension seems likely to be more
than we're willing to pay.  If we extended 16 pages at a time, that
would mean waiting for 8 WAL fsyncs per MB of relation extension.  On my
MacBook Pro, which is admittedly a pessimal case for fsyncs, that
would work out to an extra half second of elapsed time per MB written,
so pgbench -i -s 100 would probably take about an extra 640 seconds.
If that's a correct analysis, that sounds pretty bad, because right
now it's taking 143 seconds.
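
Working backwards through that estimate, for anyone who wants to check
the arithmetic (the per-fsync cost is implied by those figures rather
than separately measured):

    16 pages * 8 kB          = 128 kB extended per WAL flush
    1 MB / 128 kB            = 8 WAL fsyncs per MB of extension
    0.5 s / 8 fsyncs         = ~60 ms per fsync on this laptop
    640 s / (0.5 s per MB)   = ~1280 MB, roughly the data volume
                               written by pgbench -i -s 100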

> * Should we try to use existing space in header? It's very appealing to
> be able to avoid the upgrade work by using existing space in the header.
> There was a surprising amount of discussion about which space to use:
> pd_pagesize_version or pd_tli. There was also some discussion of using a
> few bits from various places.

It still seems to me that pd_tli is the obvious choice, since there is
no code anywhere in the system that relies on that field having any
particular value, so we can pretty much whack it around at will
without breaking anything.  AFAICS, the only reason to bump the page
format is if we want a longer checksum - 32 bits, say, instead of 16.
But I am inclined to think that 16 ought to be enough to detect the
vast majority of cases of corruption.
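
To make that a bit more concrete, here is a minimal sketch of what
reusing the 16-bit pd_tli slot might look like.  The accessor macros
and the particular way of folding a CRC down to 16 bits are made up
for illustration; choosing the actual algorithm is a separate question.

    /* Sketch only: treat the existing 16-bit pd_tli slot as the checksum. */
    #include "postgres.h"
    #include "storage/block.h"
    #include "storage/bufpage.h"
    #include "utils/pg_crc.h"

    #define PageGetChecksum(page)      (((PageHeader) (page))->pd_tli)
    #define PageSetChecksum(page, sum) (((PageHeader) (page))->pd_tli = (sum))

    static uint16
    PageCalcChecksum16(Page page, BlockNumber blkno)
    {
        pg_crc32    crc;
        uint16      save = PageGetChecksum(page);

        PageSetChecksum(page, 0);       /* don't include the field itself */
        INIT_CRC32(crc);
        COMP_CRC32(crc, (char *) page, BLCKSZ);
        COMP_CRC32(crc, (char *) &blkno, sizeof(blkno)); /* catch misdirected writes */
        FIN_CRC32(crc);
        PageSetChecksum(page, save);

        return (uint16) (crc ^ (crc >> 16));    /* fold 32 bits into 16 */
    }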

> * Table-level, or system level? Table-level would be appealing if there
> turns out to be a significant performance impact. But there are
> challenges during recovery, because no relcache is available. It seems
> like a relatively minor problem, because pages would indicate whether
> they have a checksum or not, but there are some details to be worked
> out.

I think the root of the issue here is that it's hard to turn checksums
on and off *online*.  If checksums were an initdb-time option, the
design here would be pretty simple: store the flag in the control
file.  And it wouldn't even be hard to allow the flag to be flipped
while the cluster is shut down: just write a utility to checksum and
rewrite all the blocks, fsync everything, and then flip the flag in
the control file and fsync that; also, force the user to rebuild all
their standbys.  This might not be very convenient, but it would be
comparatively simple to implement.

However, I've been assuming that we do want to handle turning
checksums on and off without shutting down the cluster, and that
definitely makes things more complicated.  In theory it could work
more or less the same way as in the off-line case: you launch some
command or function that visits every data block in the cluster and
rewrites them all, xlogging as it goes.  However, I'm a little
concerned that it won't be very usable in that form for some of the
same reasons that the off-line solution might not be very usable: you
essentially force the user to do a very large data-rewriting operation
all at once.  We already know that kicking off a big vacuum
in the background can impact the performance of the whole cluster, and
that's just one table; this would be the whole cluster all at once.
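
The core of that pass would presumably be a loop along these lines.  The
buffer-manager and WAL calls are existing ones, but the function driving
them is hypothetical, and I'm glossing over error handling, lock levels,
and any vacuum-style cost throttling:

    #include "postgres.h"
    #include "access/heapam.h"      /* log_newpage */
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /* Sketch: rewrite every block of one relation, WAL-logging a full-page
     * image so the checksummed contents also reach any standbys. */
    static void
    checksum_rewrite_relation(Relation rel)     /* hypothetical */
    {
        BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
        BlockNumber blkno;

        for (blkno = 0; blkno < nblocks; blkno++)
        {
            Buffer      buf = ReadBuffer(rel, blkno);

            LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
            START_CRIT_SECTION();
            MarkBufferDirty(buf);
            /* FPI; log_newpage also sets the page LSN for us */
            log_newpage(&rel->rd_node, MAIN_FORKNUM, blkno,
                        BufferGetPage(buf));
            END_CRIT_SECTION();
            UnlockReleaseBuffer(buf);
        }
    }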

Moreover, if it's based on the existing vacuum code, which seems
generally desirable, it seems like it's going to have to be something
you run in one database at a time until you've hit them all.  In that
case, you're going to need some way to indicate which tables or
databases have been processed already and which haven't yet ... and if
you have that, then it's not a very big step to think that maybe we
ought to just control it on a per-table or per-database level to begin
with, since that would also have some of the collateral benefits you
mention.  The alternative is to have a new command, or a new form of
vacuum, that processes every table in every database regardless of
which DB you're connected to at the moment.  That's a lot of new,
fairly special-purpose code and it doesn't seem very convenient from a
management perspective, but it is probably simpler otherwise.

I'm not sure what the right decision is here.  I like the idea of
being able to activate and deactivate the feature without shutting
down the cluster, and I really like the idea of being able to do it
incrementally, one table at a time, instead of all at once.  But it's
surely more complex.  If you store the flag on a table level, then
it's not available during recovery (unless we invent some new trick to
make that work, like storing it in some other format that is simple
enough to be interpreted by the recovery code); if you store it on a
page level, then you can miss corruption if that corruption has the
side effect of clearing the bit.  It's not dissimilar to your
complaint upthread about empty pages: if you see an empty page (a page
that says it hasn't been checksummed), how do you know whether it's
really supposed to be empty (have no checksum) or whether it ended up
empty (unchecksummed) because of exactly the sort of error we're
trying to detect?

> We don't want torn pages to falsely indicate a checksum failure. Many
> page writes are already protected from this with full-page images in the
> WAL; but hint bit updates (including the index dead tuple markers) are
> not.
>
> * Just pay the price -- WAL all hint bit updates, including FPIs.

Simon's patch contains a pretty good idea on this point: we don't
actually need to WAL log all hint bit updates.  In the worst case, we
need to log an FPI once per checkpoint cycle, and then only if the
page is hint-bit dirtied before it's dirtied by a fully WAL-logged
operation.  Moreover, if we WAL-log an FPI when the page is
hint-bit-dirtied, and a fully WAL-logged operation comes along later
in the same checkpoint cycle, we can skip the FPI for the second
operation, which probably recovers most of the cost of the previous
FPI.
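
In pseudo-code, the rule works out to something like the sketch below.
The wrapper name and the FPI-emitting helper are assumptions (the helper
would be new code, presumably close to what Simon's patch does); the
other calls are existing ones:

    #include "postgres.h"
    #include "access/xlog.h"
    #include "storage/bufmgr.h"

    extern XLogRecPtr XLogSaveBufferForHint(Buffer buffer); /* assumed new function */

    /* Sketch: for a hint-bit-only change, emit a full-page image only if
     * the page has not been WAL-logged since the last checkpoint's redo
     * pointer; otherwise a torn write could break the checksum with no
     * FPI available to repair it at recovery time. */
    static void
    SetHintBitsWithFPI(Buffer buffer)           /* hypothetical wrapper */
    {
        Page        page = BufferGetPage(buffer);
        XLogRecPtr  redo = GetRedoRecPtr();

        if (XLByteLT(PageGetLSN(page), redo))
        {
            /* first hint-bit dirtying of this page in this checkpoint cycle */
            XLogRecPtr  lsn = XLogSaveBufferForHint(buffer);    /* assumed helper */

            if (!XLogRecPtrIsInvalid(lsn))
                PageSetLSN(page, lsn);
        }

        /* then mark the buffer dirty, as hint-bit setters do today */
        SetBufferCommitInfoNeedsSave(buffer);
    }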

The real problem here is: what happens when we're dirtying a large
number of pages only for hint bits?  In that case, we're potentially
generating a huge volume of WAL.  We can probably optimize the
bulk-loading case, but there are going to be other cases that stink.
Apart from bulk-loading, pgbench doesn't take much of a hit because
it's basically not going to touch a pgbench_accounts page again until
it comes back to update another row on the same page, so the overhead
shouldn't be that bad.  But it might be much worse on some other
workload where there are lots of reads mixed in with the writes: now
you've got a good chance of kicking out a bunch of extra FPIs.  There
are some things we could do about that - e.g. make sure to set all
possible hint bits before evicting an already-dirty buffer, so that
the chances of needing to dirty it again the next time we visit are
minimized.  However, I don't think all of that is work that has to be
done up front.  I suspect we will need to do some mitigation of the
bulk-loading case before committing even an initial version of
checksums, but I think beyond that we could probably postpone most of
the other optimization in this area to a future patch.

> * Double-Write buffer -- this attacks the problem most directly. Don't
> make any changes to the way hint bits are done; instead, push all page
> writes through a double-write buffer. There are numerous performance
> implications here, some of which may improve performance and some which
> may hurt performance. It's hard to say, at the end, whether this will be
> a good solution for everyone (particularly those without battery-backed
> caches), but it seems like an accepted approach that can be very good
> for the people who need performance the most.

I'm pretty skeptical of this approach.  It may work out well with
specialized hardware, but none of the hardware I have is going to cope
with that volume of fsyncs anything like gracefully.  Unless of course
we make the buffer really big, but that hasn't seemed to be a popular
approach.

> * Some way of caching CLOG information or making the access faster.
> IIRC, there were some vague ideas about mmapping() the CLOG, or caching
> a very small representation of the CLOG.

I'm interested in pursuing this in general, independent of checksums.
Not sure how far I will get, but reducing CLOG contention is something
I do plan to spend more time on.

> * Something else -- There are a few other lines of thought here. For
> instance, can we use WAL for hint bits without a FPI, and still protect
> against torn pages causing CRC failures? This is related to a comment
> during the 2011 developer meeting, where someone brought up the idea of
> idempotent WAL changes, and how that could help us avoid FPIs. It seems
> possible after reading the discussions, but not clear enough on the
> challenges to summarize here.

There are some possibilities here, but overall I have my doubts about
backing up checksums behind another complex project with uncertain
prospects.

> If we do use WAL for hint bit updates, that has an impact on Hot
> Standby, because HS can't write WAL. So, it would seem that HS could not
> set hint bits.

Yeah.  The good news is that WAL-logging hint bits on the master would
propagate those hint bits to the slave, which would reduce the need to
set them on the slave.  But there's definitely some potential for
performance regression here; I'm just not sure how much.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

