Re: Enabling Checksums - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Enabling Checksums
Msg-id 1362412800.26602.32.camel@jdavis
In response to Re: Enabling Checksums  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: Enabling Checksums  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Re: Enabling Checksums  (Jim Nasby <jim@nasby.net>)
List pgsql-hackers
On Mon, 2013-03-04 at 10:36 +0200, Heikki Linnakangas wrote:
> On 04.03.2013 09:11, Simon Riggs wrote:
> > Are there objectors?
> 
> FWIW, I still think that checksumming belongs in the filesystem, not 
> PostgreSQL.

Doing checksums in the filesystem has some downsides. One is that you
need to use a copy-on-write filesystem like btrfs or zfs, which (by
design) will fragment the heap on random writes. If we're going to start
pushing people toward those systems, we will probably need to spend some
effort to mitigate this problem (aside: my patch to remove
PD_ALL_VISIBLE might get some new wind behind it).
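To make the fragmentation point concrete, here is a toy model, not a measurement of btrfs or zfs. It assumes a file whose blocks start out physically contiguous, and a simplified copy-on-write allocator that always places a rewritten block at the next free physical address. The function names and the allocator policy are illustrative assumptions only; real filesystems have smarter allocators and defragmentation.

```python
import random


def count_extents(layout):
    """Count physically contiguous runs seen when reading in logical order."""
    extents = 1
    for prev, cur in zip(layout, layout[1:]):
        if cur != prev + 1:
            extents += 1
    return extents


def simulate(num_blocks, num_writes, copy_on_write, seed=0):
    """Apply random single-block overwrites and return the extent count.

    layout[i] is the physical address of logical block i. The file starts
    out fully contiguous (one extent).
    """
    rng = random.Random(seed)
    layout = list(range(num_blocks))
    next_free = num_blocks  # hypothetical allocator: append-only free space
    for _ in range(num_writes):
        block = rng.randrange(num_blocks)
        if copy_on_write:
            # COW never overwrites in place: the new version of the block
            # lands somewhere else, away from its logical neighbours.
            layout[block] = next_free
            next_free += 1
        # Overwrite-in-place: the physical address does not change.
    return count_extents(layout)


in_place_extents = simulate(1000, 200, copy_on_write=False)
cow_extents = simulate(1000, 200, copy_on_write=True)
print("in-place extents:", in_place_extents)
print("copy-on-write extents:", cow_extents)
```

In this model the in-place file stays a single extent no matter how many random writes it receives, while the copy-on-write file splits into many, which is why a sequential scan of the heap stops being sequential on disk.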

There are also other issues, like what fraction of our users can freely
move to btrfs, and when. If it doesn't happen to be already there, you
need root to get it there, which has never been a requirement before.

I don't fundamentally disagree. We probably need to perform reasonably
well on btrfs in COW mode[1] regardless, because a lot of people will be
using it a few years from now. But there are a lot of unknowns here, and
I'm concerned about tying checksums to a series of things that will be
resolved a few years from now, if ever.

[1] Interestingly, you can turn off COW mode on btrfs, but you lose
checksums if you do.

>  If you go ahead with this anyway, at the very least I'd like 
> to see some sort of a comparison with e.g. btrfs. How do performance, 
> error-detection rate, and behavior on error compare? Any other metrics 
> that are relevant here?

I suspect it will be hard to get an apples-to-apples comparison here
because of the heap fragmentation, which means that a sequential scan is
not so sequential. That may be acceptable for some workloads but not for
others, so it would get tricky to compare. And any performance numbers
from an experimental filesystem are somewhat suspect anyway.

Also, it's a little more challenging to test corruption on a filesystem,
because you need to find the location of the file you want to corrupt,
and corrupt it out from underneath the filesystem.
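For contrast, a rough sketch of why the same test is simple when the checksum lives inside the page itself: the test can damage the file through ordinary file I/O and then re-verify every page. This is only an illustration. It uses CRC-32 from zlib as a stand-in checksum and a made-up page layout (4-byte checksum followed by the body); neither is the algorithm or page format of the checksum patch. The 8192-byte page size matches PostgreSQL's default block size.

```python
import os
import struct
import tempfile
import zlib

PAGE_SIZE = 8192  # PostgreSQL's default block size
CRC_BYTES = 4     # hypothetical layout: checksum stored at the page start


def make_page(payload):
    """Build one page: CRC-32 of the zero-padded body, then the body."""
    body = payload.ljust(PAGE_SIZE - CRC_BYTES, b"\0")
    return struct.pack("<I", zlib.crc32(body)) + body


def page_is_valid(page):
    """Recompute the checksum and compare it with the stored one."""
    (stored,) = struct.unpack("<I", page[:CRC_BYTES])
    return stored == zlib.crc32(page[CRC_BYTES:])


def corrupt_byte(path, page_no, offset):
    """Invert one byte in place, simulating on-disk corruption."""
    with open(path, "r+b") as f:
        f.seek(page_no * PAGE_SIZE + offset)
        original = f.read(1)
        f.seek(-1, os.SEEK_CUR)
        f.write(bytes([original[0] ^ 0xFF]))


def find_bad_pages(path):
    """Return the page numbers whose checksum does not verify."""
    bad = []
    with open(path, "rb") as f:
        page_no = 0
        while True:
            page = f.read(PAGE_SIZE)
            if len(page) < PAGE_SIZE:
                break
            if not page_is_valid(page):
                bad.append(page_no)
            page_no += 1
    return bad


def demo():
    """Write four pages, corrupt page 2, and report bad pages before/after."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            for i in range(4):
                f.write(make_page(b"page %d" % i))
        before = find_bad_pages(path)
        corrupt_byte(path, page_no=2, offset=100)
        after = find_bad_pages(path)
        return before, after
    finally:
        os.remove(path)


before, after = demo()
print("bad pages before:", before)
print("bad pages after:", after)
```

On a checksumming filesystem the equivalent write would go through the filesystem, which would simply store a fresh, valid checksum for the modified block. That is why the corruption has to be introduced beneath it, for example on the underlying block device.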

Greg may have more comments on this matter.

Regards,
	Jeff Davis



