Re: [HACKERS] Checksums by default? - Mailing list pgsql-hackers

From David Steele
Subject Re: [HACKERS] Checksums by default?
Date
Msg-id b8f21c38-3b28-a50f-d997-a1c113136a6d@pgmasters.net
In response to Re: [HACKERS] Checksums by default?  (Stephen Frost <sfrost@snowman.net>)
Responses Re: [HACKERS] Checksums by default?
List pgsql-hackers
On 1/25/17 10:38 PM, Stephen Frost wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
>> On Wed, Jan 25, 2017 at 7:37 PM, Andres Freund <andres@anarazel.de> wrote:
>>> On 2017-01-25 19:30:08 -0500, Stephen Frost wrote:
>>>> * Peter Geoghegan (pg@heroku.com) wrote:
>>>>> On Wed, Jan 25, 2017 at 3:30 PM, Stephen Frost <sfrost@snowman.net> wrote:
>>>>>> As it is, there are backup solutions which *do* check the checksum when
>>>>>> backing up PG.  This is no longer, thankfully, some hypothetical thing,
>>>>>> but something which really exists and will hopefully keep users from
>>>>>> losing data.
>>>>>
>>>>> Wouldn't that have issues with torn pages?
>>>>
>>>> No, why would it?  The page has either been written out by PG to the OS,
>>>> in which case the backup s/w will see the new page, or it hasn't been.
>>>
>>> Uh. Writes aren't atomic on that granularity.  That means you very well
>>> *can* see a torn page (in linux you can e.g. on 4KB os page boundaries
>>> of a 8KB postgres page). Just read a page while it's being written out.
>>
>> Yeah.  This is also why backups force full page writes on even if
>> they're turned off in general.
>
> I've got a question into David about this, I know we chatted about the
> risk at one point, I just don't recall what we ended up doing (I can
> imagine a few different possible things- re-read the page, which isn't a
> guarantee but reduces the chances a fair bit, or check the LSN, or
> perhaps the plan was to just check if it's in the WAL, as I mentioned)
> or if we ended up concluding it wasn't a risk for some, perhaps
> incorrect, reason and need to revisit it.

The solution was to simply ignore the checksums of any pages with an LSN
>= the LSN returned by pg_start_backup().  This means that hot blocks
may never be checked during backup, but if they are active then any
problems should be caught directly by PostgreSQL.
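
As an illustration only (this is not the backup tool's actual code), the rule
above can be sketched in Python. The sketch assumes the standard page header,
which begins with pd_lsn stored as two 32-bit halves, a little-endian host,
and the default 8 kB block size. The helper names page_lsn, parse_lsn and
should_verify are hypothetical.

```python
import struct

PAGE_SIZE = 8192  # assumed default PostgreSQL block size


def page_lsn(page: bytes) -> int:
    """Return the page LSN as a 64-bit integer.

    Assumes pd_lsn sits at offset 0 as two uint32 values (high half, then
    low half) and that the file was written on a little-endian host.
    """
    xlogid, xrecoff = struct.unpack_from("<II", page, 0)
    return (xlogid << 32) | xrecoff


def parse_lsn(text: str) -> int:
    """Convert the 'X/Y' hex form returned by pg_start_backup() to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def should_verify(page: bytes, backup_start_lsn: int) -> bool:
    """Verify the checksum only for pages last modified before the backup began.

    Pages with LSN >= the backup start LSN may be torn reads, so their
    checksums are ignored.
    """
    return page_lsn(page) < backup_start_lsn
```

A caller would compute parse_lsn() once from the pg_start_backup() result and
then run the real checksum routine only on pages where should_verify() returns
True.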

This technique assumes that blocks can be consistently read in the order
they were written.  If the second 4k (or 512 byte, etc.) block of the
fwrite is visible before the first 4k block then there would be a false
positive.  I have a hard time imagining any sane buffering system
working this way, but I can't discount it.
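
To make the ordering assumption concrete, here is a small simulation (a sketch
with made-up LSNs and page contents, not real PostgreSQL pages). It builds an
old and a new version of a page and tears them at the 4k boundary in both
orders. The helpers make_page, page_lsn and is_skipped are hypothetical.

```python
import struct

PAGE_SIZE = 8192
HALF = PAGE_SIZE // 2  # the 4k boundary discussed above


def make_page(lsn: int, fill: int) -> bytes:
    """Build a fake page: an 8-byte LSN followed by filler bytes."""
    header = struct.pack("<II", lsn >> 32, lsn & 0xFFFFFFFF)
    return header + bytes([fill]) * (PAGE_SIZE - len(header))


def page_lsn(page: bytes) -> int:
    hi, lo = struct.unpack_from("<II", page, 0)
    return (hi << 32) | lo


def is_skipped(page: bytes, backup_start_lsn: int) -> bool:
    """True when the checksum would be ignored under the LSN rule."""
    return page_lsn(page) >= backup_start_lsn


backup_start_lsn = 0x2000           # illustrative value
old = make_page(0x1000, 0xAA)       # written before the backup started
new = make_page(0x3000, 0xBB)       # being written during the backup

# First 4k block visible before the second: the new LSN is seen, so the
# page is skipped and the tear is harmless.
in_order_tear = new[:HALF] + old[HALF:]

# Second 4k block visible before the first: the old LSN is seen, so the
# mixed page would be checksummed and would likely be reported as bad.
out_of_order_tear = old[:HALF] + new[HALF:]
```

Only the second case produces a false positive, which is why the technique
depends on blocks becoming visible in the order they were written.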

It's definitely possible for pages on disk to have this characteristic
(i.e., the first block is not written first) but that should be fixed
during recovery before it is possible to take a backup.

Note that reports of page checksum errors are informational only and do
not have any effect on the backup process.  Even so, we would definitely
prefer to avoid false positives.  If anybody can poke a hole in this
solution then I would like to hear it.

--
-David
david@pgmasters.net

