Re: data_checksums enabled by default (was: Move --data-checksums to common options in initdb --help) - Mailing list pgsql-hackers
From | Stephen Frost |
---|---|
Subject | Re: data_checksums enabled by default (was: Move --data-checksums to common options in initdb --help) |
Date | |
Msg-id | 20210107211433.GS27507@tamriel.snowman.net |
In response to | Re: data_checksums enabled by default (was: Move --data-checksums to common options in initdb --help) (Peter Geoghegan <pg@bowt.ie>) |
Responses |
Re: data_checksums enabled by default (was: Move --data-checksums to common options in initdb --help)
Re: data_checksums enabled by default (was: Move --data-checksums to common options in initdb --help) |
List | pgsql-hackers |
Greetings,

* Peter Geoghegan (pg@bowt.ie) wrote:
> On Wed, Jan 6, 2021 at 12:30 PM Stephen Frost <sfrost@snowman.net> wrote:
> > As already mentioned, it's also, at least today, far
> > simpler to disable checksums than to enable them, which is something
> > else to consider when thinking about what the default should be.
>
> That is a valid concern. I just don't think that it's good enough on
> its own, given the overwhelming downside of enabling checksums given
> the WAL architecture that we have today.

I expected there'd be some disagreement on this, but I do continue to feel that it's sensible to enable checksums by default. I also don't think there's anything particularly wrong with such a difference of opinion, though it likely means that we're going to continue with the status quo, where, certainly, very many deployments enable checksums even though the upstream default is to have them disabled. This certainly isn't the only place where that's done, though we've been working to improve that situation with things like trying to get rid of 'trust' being used in our default pg_hba.conf.

> > That the major cloud providers all have checksums enabled (at least by
> > default, though I wonder if they would even let you turn them off..),
> > even when we don't have them on by default, strikes me as pretty telling
> > that this is something that we should have on by default.
>
> Please provide supporting evidence. I know that EBS itself uses
> checksums at the block device level, so I'm sure that RDS "uses
> checksums" in some sense. But does RDS use --data-checksums during
> initdb?

Short answer is 'yes', as mentioned down-thread, and having checksums was a prerequisite to deploying PG in RDS (or so folks very involved in RDS have told me previously; I'll also note that 9.3 was the first version deployed as part of RDS).
I don't think there's any question that they're using --data-checksums and that it is, in fact, the actual original PG checksum code (or at least was at 9.3, though I've further heard comments that they actively try to minimize the delta between RDS and PG).

> > Certainly there's a different risk profile between the two and there may
> > be times when someone is fine with running without fsync, or fine
> > running without checksums, but those are, in my view, exceptions made
> > once you understand exactly what risk you're willing to accept, and not
> > what the default or typical deployment should be.
>
> Okay, I'll bite. Here is the important difference: Enabling checksums
> doesn't actually make data corruption less likely, it just makes it
> easier to detect. Whereas disabling fsync will reliably produce
> corruption before too long in almost any installation. It may
> occasionally be appropriate to disable fsync in a very controlled
> environment, but it's rare, and not much faster than disabling
> synchronous commits in any case. It barely ever happens.

I agree that it doesn't happen very often. I'd say that it's also very infrequent for users who are aware that data checksums are available, and not enabled by default, to deploy non-checksummed systems.

> We added page-level checksums in 9.3. Can you imagine a counterfactual
> history in which Postgres had page checksums since the 1990s, but only
> added the fsync feature in 9.3? Please answer this non-rhetorical
> question.

Nope; the risk from not having fsync was clearly understood, and still is, to be a larger risk than not having checksums. That doesn't mean there's no risk to not having checksums, or that we simply shouldn't consider checksums to be worthwhile, or that we shouldn't have them on by default. I outlined them together in that they're both there to address the risk that "something doesn't go right", but, as I said previously and again above, the level of risk between the two isn't the same.
That doesn't mean we shouldn't recognize that checksums *do* address a risk and consider enabling them by default, even with the performance impact that they have today.

Much of this line of discussion seems to be, incorrectly, focused on my mere mention of viewing fsync and checksums as mechanisms for addressing certain risks, but that doesn't seem to be a terribly fruitful direction to be going in. I'm not suggesting that we should go turn off fsync by default simply because we don't have checksums on by default, which seems to be the implication.

I do think that fsync addresses a large share of the risks we face (random system reboots, storage being disconnected from the server, etc.), and I feel that checksums address certain other risks (latent bit flips at various levels, from the physical medium through whatever path is taken to get from the physical medium to the kernel and then to PG; random blocks being swapped from other applications; people deciding to gzip their data directory, which I saw last week; etc.). All of those risks amount to sufficient justification for both to be enabled by default, while still allowing them to be disabled in environments where the administrator has considered the risks from each and decided that they're willing to accept them for the benefit of performance.

Thanks,

Stephen