Re: Enabling Checksums - Mailing list pgsql-hackers

From Greg Smith
Subject Re: Enabling Checksums
Date
Msg-id 51352F75.8010201@2ndQuadrant.com
In response to Re: Enabling Checksums  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
On 3/4/13 3:13 PM, Heikki Linnakangas wrote:
> This PostgreSQL patch hasn't seen any production use, either. In fact,
> I'd consider btrfs to be more mature than this patch. Unless you think
> that there will be some major changes to the worse in performance in
> btrfs, it's perfectly valid and useful to compare the two.

I think my last message came out with a bit more hostile attitude about 
this than I intended; sorry about that.  My problem with this idea comes 
from looking at the history of how Linux filesystems have failed to work 
properly in the past.  The best example I can point at is the one I
documented at 
http://www.postgresql.org/message-id/4B512D0D.4030909@2ndquadrant.com 
along with this handy pgbench chart: 
http://www.phoronix.com/scan.php?page=article&item=ubuntu_lucid_alpha2&num=3

pgbench TPS dropped from 1102 to about 110 after a kernel bug fix. 
It was 10X as fast in some kernel versions only because fsync wasn't 
actually flushing data to disk.  Kernel filesystem issues have regularly 
resulted in data not being written to disk when it should have been, 
inflating benchmark results accordingly.  Fake writes due to "lying 
drives", write barriers that only work on server-class hardware, write 
barriers that don't work on md volumes, and then this one; it's a 
recurring pattern.  It's not the fault of the kernel developers; it's a 
hard problem, and drive manufacturers aren't making it easy for them.
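
These bugs typically get caught by timing commits or fsync calls and 
noticing rates that are physically impossible for the hardware.  Here's 
a minimal sketch of that kind of probe, in the spirit of (but much 
cruder than) contrib/pg_test_fsync; the file name and loop count are 
arbitrary.  On a single 7200rpm drive with working cache flushes you 
can't sustain much more than ~120 syncs/second, so seeing thousands 
here means something in the stack isn't really writing to disk:

/* fsync_probe.c: time repeated write+fsync cycles as a sanity check
 * on whether fsync actually reaches stable storage. */
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

int
main(void)
{
    char        buf[8192];
    struct timeval start, stop;
    double      elapsed;
    int         i, loops = 1000;
    int         fd = open("fsync_probe.tmp", O_CREAT | O_WRONLY, 0600);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    memset(buf, 'x', sizeof(buf));
    gettimeofday(&start, NULL);
    for (i = 0; i < loops; i++)
    {
        /* rewrite the same 8K block and force it out every time */
        if (pwrite(fd, buf, sizeof(buf), 0) != sizeof(buf) || fsync(fd) != 0)
        {
            perror("write/fsync");
            return 1;
        }
    }
    gettimeofday(&stop, NULL);
    elapsed = (stop.tv_sec - start.tv_sec) +
              (stop.tv_usec - start.tv_usec) / 1e6;
    printf("%d syncs in %.2f s = %.0f syncs/sec\n",
           loops, elapsed, loops / elapsed);
    close(fd);
    unlink("fsync_probe.tmp");
    return 0;
}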

My concern, then, is that if the comparison target is btrfs performance, 
how do we know it's working reliably?  The track record says that bugs 
in this area usually inflate results, compared with a correct 
implementation.  You are certainly right that this checksum code is less 
mature than btrfs; it's just over a year old after all.  I feel quite 
good that it's not benchmarking faster than it really is, especially 
when I can directly measure how the write volume is increasing in the 
worst result.
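
One straightforward way to make that kind of write volume measurement 
is to sample the OS write counters around a pgbench run.  A hypothetical 
little helper along those lines is below (the device name defaults to 
sda; field 10 of each /proc/diskstats line is sectors written, 512 bytes 
each).  Run it before and after the test and subtract the two numbers:

/* diskwrites.c: print total sectors written to one block device,
 * as reported by /proc/diskstats. */
#include <stdio.h>
#include <string.h>

int
main(int argc, char **argv)
{
    const char *dev = (argc > 1) ? argv[1] : "sda";
    FILE       *f = fopen("/proc/diskstats", "r");
    char        line[256];

    if (f == NULL)
    {
        perror("/proc/diskstats");
        return 1;
    }
    while (fgets(line, sizeof(line), f) != NULL)
    {
        unsigned int majordev, minordev;
        char        name[32];
        unsigned long long rd, rdmrg, rdsect, rdms;
        unsigned long long wr, wrmrg, wrsect, wrms;

        /* match the requested device and report its write counter */
        if (sscanf(line, "%u %u %31s %llu %llu %llu %llu %llu %llu %llu %llu",
                   &majordev, &minordev, name,
                   &rd, &rdmrg, &rdsect, &rdms,
                   &wr, &wrmrg, &wrsect, &wrms) == 11 &&
            strcmp(name, dev) == 0)
            printf("%s: %llu MB written since boot\n",
                   dev, wrsect * 512 / (1024 * 1024));
    }
    fclose(f);
    return 0;
}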

I can't say whether btrfs is currently slower or faster than it will 
eventually be, because of its bugs; I can't tell you the right way to 
tune btrfs for PostgreSQL; and I haven't even had anyone ask that 
question yet.
Right now, the main thing I know about testing performance on Linux 
kernels new enough to support btrfs is that they're just generally slow 
running PostgreSQL.  See the multiple confirmed regression issues at 
http://www.postgresql.org/message-id/60B572D9298D944580F7D51195DD30804357FA4ABF@VMBX125.ihostexchange.net 
for example.  That new kernel mess needs to get sorted out one day too. 
Why does database performance suck on kernel 3.2?  I don't know yet, but 
it doesn't help me get excited about assuming btrfs results will be 
useful.

ZFS was supposed to save everyone from worrying about corruption 
issues.  That didn't work out, I think due to the commercial agenda 
behind its development.  Now we have btrfs coming in some number of 
years, a project still tied more closely to Oracle than I would like. 
I'm not too optimistic about that one either.  It doesn't help that the 
original
project lead, Chris Mason, has left there and is working at 
FusionIO--and that company's filesystem plans don't include 
checksumming, either.  (See 
http://www.fusionio.com/blog/under-the-hood-of-the-iomemory-sdk/ for a 
quick intro to what they're doing right now, which includes bypassing 
the Linux filesystem layer with their own flash-optimized but 
POSIX-compliant directFS.)

There is an optimistic future path I can envision where btrfs matures 
quickly and in a way that performs well for PostgreSQL.  Maybe we'll end 
up there, and if that happens everyone can look back and say this was a 
stupid idea.  But there are a lot of other outcomes I see as possible 
here, and in all the rest of them having some checksumming capabilities 
available is a win.

One of the areas where PostgreSQL has a solid reputation is being 
trusted to run as reliably as possible.  All of the deployment trends 
I'm seeing have people moving toward less reliable hardware: VMs, cloud 
systems, regular drives instead of hardware RAID, etc.  A lot of people badly
want to leave behind the era of the giant database server, and have a 
lot of replicas running on smaller/cheaper systems instead.  There's a 
useful advocacy win for the project if lower-grade hardware can be used 
to hit a target reliability level, with software picking up some of the 
error detection job instead.  Yes, it costs something in terms of future 
maintenance on the codebase, as new features almost invariably do.  If I 
didn't see being able to make noise about the improved reliability of 
PostgreSQL as valuable enough to consider it anyway, I wouldn't even be 
working on this thing.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


