Thread: [HACKERS] Online enabling of page level checksums

[HACKERS] Online enabling of page level checksums

From
Magnus Hagander
Date:
So, that post I made about checksums certainly spurred a lot of discussion :) One summary is that life would be a lot easier if we could turn checksums on (and off) without re-initdbing.  I'm breaking out this question into this thread to talk about it separately.


I've been toying with a branch to work on this, but haven't had a time to get it even to compiling state. But instead of waiting until I have some code to show, let me outline the idea I had.

My general idea is this:

Take one bit in the checksum version field and make it mean "in progress". That means chat checksums can now be "on", "off", or "in progress".

When checksums are "in progress", PostgreSQL will compute and write checksums whenever it writes out a buffer, but it will *not* verify checksums on read.

This state would be set by calling a function (or an external command with the system shut down if need be - I can accept a restart for this, but I'd rather avoid it if possible).

This function would also launch a background worker. This worker would enumerate the entire database block by block. Read a block, verify if the checksum is set and correct. If it is, ignore it (because any further updates will keep it in state ok when we're in state "in progress"). If not then mark it as dirty and write it out through regular means, which will include computing and writing the checksum since we're "in progress". With something similar to vacuum cost delay to control how quickly it writes.

Yes, this means the entire db will end up in the transaction log since everything is rewritten. That's not great, but for a lot of people that will be a trade they're willing to make since it's a one-time thing. Yes, this background process might take days or weeks - that's OK as long as it happens online.

Once the background worker is done, it flips the checksum state to "on", and the system starts verifying checksums as well.

If the system is interrupted before the background worker is done, it starts over from the beginning. Previously touched blocks will be read and verified, but not written (because their checksum is already correct). This will take time, but not re-generate the WAL.

I think the actual functions and background worker could go in an extension that's installed and loaded only by those who need it. But the core functionality of being able to have "checksum in progress" would have to be in the core codebase.

So, is there something obviously missing in this plan? Or just the code to do it :)

--

Re: [HACKERS] Online enabling of page level checksums

From
Jim Nasby
Date:
On 1/22/17 5:13 AM, Magnus Hagander wrote:
> If the system is interrupted before the background worker is done, it
> starts over from the beginning. Previously touched blocks will be read
> and verified, but not written (because their checksum is already
> correct). This will take time, but not re-generate the WAL.

Another option would be to store a watermark of the largest block ID 
that had been verified for each relation/fork. That would be used to 
determine if checksums could be trusted, and more usefully would allow 
you to start where you left off if the verification process got interrupted.

> I think the actual functions and background worker could go in an
> extension that's installed and loaded only by those who need it. But the
> core functionality of being able to have "checksum in progress" would
> have to be in the core codebase.

If it was in contrib I think that'd be fine.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Online enabling of page level checksums

From
David Christensen
Date:
So as mentioned on IRC, I have a patch that I am working to rebase on HEAD with the following design.  It is very
similarto what you have proposed, so maybe we can use my development notes as a jumping-off point for
discussion/refinement.

* Incremental Checksums

PostgreSQL users should have a way up upgrading their cluster to use data checksums without having to do a costly
pg_dump/pg_restore;in particular, checksums should be able to be enabled/disabled at will, with the database enforcing
thelogic of whether the pages considered for a given database are valid. 

Considered approaches for this are having additional flags to pg_upgrade to set up the new cluster to use checksums
wherethey did not before (or optionally turning these off).  This approach is a nice tool to have, but in order to be
ableto support this process in a manner which has the database online while the database is going throught the initial
checksumprocess. 

In order to support the idea of incremental checksums, this design adds the following things:

** pg_control:

Keep "data_checksum_version", but have it indicate *only* the algorithm version for checksums. i.e., it's no longer
usedfor the data_checksum enabled/disabled state. 

Add "data_checksum_state", an enum with multiple states: "disabled", "enabling", "enforcing" (perhaps "revalidating"
too;something to indicate that we are reprocessing a database that purports to have been completely checksummed
already)

An explanation of the states as well as the behavior of the checksums for each.

- disabled => not in a checksum cycle; no read validation, no checksums written.  This is the current behavior for
Postgres*without* checksums. 

- enabling => in a checksum cycle; no read validation, write checksums.  Any page that gets written to disk will be a
validchecksum.  This is required when transitioning a cluster which has never had checksums, as the page reads would
normallyfail since they are uninitialized. 

- enforcing => not in a checksum cycle; read validation, write checksums.  This is the current behavior of Postgres
*with*checksums. 
(caveat: I'm not certain the following state is needed (and the current version of this patch doesn't have it)):

- revalidating => in a checksum cycle; read validation, write checksums.  The difference between this and "enabling" is
thatwe care if page reads fail, since by definition they should have been validly checksummed, as we should verify
this.

Add "data_checksum_cycle", a counter that gets incremented with every checksum cycle change.  This is used as a flag to
verifywhen new checksum actions take place, for instance if we wanted to upgrade/change the checksum algorithm, or if
wejust want to support periodic checksum validation. 

This variable will be compared against new values in the system tables to keep track of which relations still need to
bechecksummed in the cluster. 

** pg_database:

Add a field "datlastchecksum" which will be the last checksum cycle which has completed for all relations in that
database.

** pg_class:

Add a field "rellastchecksum" which stores the last successful checksum cycle for each relation.

** The checksum bgworker:

When the enabling event is initiated, we will iterate over all databases, checking for all databases where the
"datlastchecksum"field is < the current checksum cycle.  For each of these, we will spawn a bgworker to connect to
thesedbs and iterate over pg_class looking for "rellastchecksum < data_checksum_cycle".  If it finds none (i.e., every
recordhas rellastchecksum == data_checksum_cycle) then it marks the containing database as up-to-date by updating
"datlastchecksum= data_checksum_cycle".  We can presumably skip over temporary and unlogged relations here. 

For any relation that it finds in the database which is not checksummed, it starts an actual worker to handle the
checksumprocess for this table.  Since the state of the cluster is already either "enforcing" or "revalidating", any
blockwrites will get checksums added automatically, so the only thing the bgworker needs to do is load each block in
therelation and explicitly mark as dirty (unless that's not required for FlushBuffer() to do its thing).  After every
blockin the relation is visited this way and checksummed, its pg_class record will have "rellastchecksum" updated. 

(XXX: how to handle databases where connections are disabled, like "template1"?)

When all database have "datlastchecksum" == data_checksum_cycle, we initiate checksumming of any global cluster heap
files. When the global cluster tables heap files have been checksummed, then we consider the checksum cycle complete,
changepg_control's "data_checksum_state" to "enforcing" and consider things fully up-to-date. 


** Function API:

Interface to the functionality will be via the following Utility functions:
 - pg_enable_checksums(void) => turn checksums on for a cluster.  Will error if the state is anything but "disabled".
Ifthis is the first time this cluster has run this, this will initialize ControlFile->data_checksum_version to the
preferredbuilt-in algorithm (since there's only one currently, we just set it to 1).  This increments the
ControlFile->data_checksum_cyclevariable, then sets the state to "enabling", which means that the next time the
bgworkerchecks if there is anything to do it will see that state,  scan all the databases' "datlastchecksum" fields,
andstart kicking off the bgworker processes to handle the checksumming of the actual relation files. 
 - pg_disable_checksums(void) => turn checksums off for a cluster.  Sets the state to "disabled", which means bg_worker
willnot do anything. 
 - pg_request_checksum_cycle(void) => if checksums are "enabled", increment the data_checksum_cycle counter and set the
stateto "enabling".  (Alterantely, if we use the "revalidate" state here we could ensure that existing checksums are
validatedon read to alert us of any blocks with problems.  This could also be made to be "smart" i.e., interrupt
existingrunning checksum cycle to kick off another one (not sure of the use case), effectively call
pg_enable_checksums()if the cluster has not been explictly enabled before, etc; depends on how pedantic we want to be. 

** Design notes/implications:

When the system is in one of the modes which write checksums (currently everything but "disabled") any new
relations/databaseswill have their "rellastchecksum"/"datlastchecksum" counters prepopulated with the current value of
"data_checksum_counter",as we know that any space used for these relations will be checksummed, and hence valid.  By
pre-settingthis, we remove the need for the checksum bgworker to explicitly visit these new relations and force
checksumswhich will already be valid. 

With checksums on, we know any full-heap-modifying operations will be properly checksummed, we may be able to pre-set
rellastchecksumfor other operations such as ALTER TABLEs which trigger a full rewrite *without* having to explicitly
havethe checksum bgworker run on this.  I suspect there are a number of other places which may lend themselves to
optimizationlike this to avoid having to process relations explicitly.  (Say, if we somehow were able to force a
checksumoperation on any full SeqScan and update the state after the fact, we'd avoid paying this penalty another
time.)

** pg_upgrade:

With this additional complexity, we need to consider pg_upgrade, both now and in future versions.  For one thing, we
needto transfer settings from pg_control, plus make sure that pg_upgrade accepts deviances in any of the
data_checksum-relatedsettings in pg_control. 

4 scenarios to consider if/what to allow:

*** non-checksummed -> non-checksummed

exactly as it stands now

*** checksummed -> non-checksummed

pretty trivial; since the system tables will be non-checksummed, just equivalent to resetting the checksum_cycle and
pg_controlfields; user data files will be copied or linked into place with the checksums, but since it is disbled they
willbe ignored. 

*** non-checksummed -> checksummed

For the major version this patch makes it into, this will likely be the primary use case; add an --enable-checksums
optionto `pg_upgrade` to initially set the new cluster state to the checksums_enabling state, pre-init the system
databaseswith the correct state and checksum cycle flag, or have checksums be the *default* option.  Either way, this
needsto be set in the initial system initialization. 

*** checksummed -> checksummed

The potentially tricky case (but likely to be more common going forward as incremental checksums are supported).

Since we may have had checksum cycles in process in the old cluster or otherwise had the checksum counter we need to do
thefollowing: 

- need to propagate data_checksum_state, data_checksum_cycle, and data_checksum_version.  If we wanted to support a
differentCRC algorithm, we could pre-set the data_checksum_version to a different version here, increment
data_checksum_cycle,and set data_checksum_state to either "enabling" or "revalidating", depending on the original state
fromthe old cluster.  (i.e., if we were in the middle of an initial checksum cycle (state == "enabling"). 

- new cluster's system tables may need to have the "rellastchecksum" and "datlastchecksum" settings saved from the
previoussystem, if that's easy, to avoid a fresh checksum run if there is no need. 

** Handling checksums on a standby:

How to handle checksums on a standby is a bit trickier since checksums are inherently a local cluster state and not WAL
loggedbut we are storing state in the system tables for each database we need to make sure that the replicas reflect
truthfulstate for the checksums for the cluster. 

In order to manage this discrepency, we WAL log a few additional pieces of information; specifically:

- new events to capture/propogate any of the pg_control fields, such as: checksum version data, checksum cycle
increases,enabling/disabling actions 

- checksum background worker block ranges.  (XXX: we could decide that we're okay with relations being all or nothing
here,rendering this point moot.) 

Some notes on the block ranges: This would effectively be a series of records containing (datid, relid, start block,
endblock) for explicit checksum ranges, generated by the checksum bgworker as it checksums individual relations.  This
couldbe broken up into a series of blocks so rather than having the granularity be by relation we could have these
recordsget generated periodicaly (say in groups of 10K blocks or whatever, number to be determined) to allow standby
checksumrecalculation to be incremental so as not to delay replay unnecessarily as checksums are being created. 

Since the block range WAL records will be replayed before any of the pg_class/pg_database catalog records are replay,
we'llbe guaranteed to have the checksums calculated on the standby by the time it appears valid due to system state. 

We may also be able to use the WAL records to speed up the processing of existing heap files if they are interrupted
forsome reason, this remains to be seen. 

** Testing changes:

We need to add separate initdb checksum regression test which are outside of the normal pg_regress framework.

--
David Christensen
End Point Corporation
david@endpoint.com
785-727-1171






Re: [HACKERS] Online enabling of page level checksums

From
Simon Riggs
Date:
On 23 January 2017 at 16:32, David Christensen <david@endpoint.com> wrote:

> ** Handling checksums on a standby:
>
> How to handle checksums on a standby is a bit trickier since checksums are inherently a local cluster state and not
WALlogged but we are storing state in the system tables for each database we need to make sure that the replicas
reflecttruthful state for the checksums for the cluster. 

Not WAL logged? If the objective of this feature is robustness, it
will need to be WAL logged.

Relation options aren't accessed by the startup process during
recovery, so that part won't work in recovery. Perhaps disable
checksumming until the everything is enabled rather than relation by
relation.

If y'all serious about this then you're pretty late in the cycle for
such a huge/critical patch. So lets think of ways of reducing the
complexity...

Seems most sensible to add the "enable checksums for table" function
into VACUUM because then it will happen automatically via autovacuum,
or you could force the issue using something like
vacuumdb --jobs 4 --enable-checksums
That gives you vacuum_delay and a background worker for no effort

Hope that helps

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Online enabling of page level checksums

From
David Christensen
Date:
> On Jan 23, 2017, at 10:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On 23 January 2017 at 16:32, David Christensen <david@endpoint.com> wrote:
>
>> ** Handling checksums on a standby:
>>
>> How to handle checksums on a standby is a bit trickier since checksums are inherently a local cluster state and not
WALlogged but we are storing state in the system tables for each database we need to make sure that the replicas
reflecttruthful state for the checksums for the cluster. 
>
> Not WAL logged? If the objective of this feature is robustness, it
> will need to be WAL logged.
>
> Relation options aren't accessed by the startup process during
> recovery, so that part won't work in recovery. Perhaps disable
> checksumming until the everything is enabled rather than relation by
> relation.
>
> If y'all serious about this then you're pretty late in the cycle for
> such a huge/critical patch. So lets think of ways of reducing the
> complexity...
>
> Seems most sensible to add the "enable checksums for table" function
> into VACUUM because then it will happen automatically via autovacuum,
> or you could force the issue using something like
> vacuumdb --jobs 4 --enable-checksums
> That gives you vacuum_delay and a background worker for no effort

I’m not sure that this will work as-is, unless we’re forcing VACUUM to ignore the visibility map.  I had originally
consideredhaving this sit on top of VACUUm though, we just need a way to iterate over all relations and read every
page.

--
David Christensen
End Point Corporation
david@endpoint.com
785-727-1171






Re: [HACKERS] Online enabling of page level checksums

From
Greg Stark
Date:


On Jan 22, 2017 11:13 AM, "Magnus Hagander" <magnus@hagander.net> wrote:

Yes, this means the entire db will end up in the transaction log since everything is rewritten. That's not great, but for a lot of people that will be a trade they're willing to make since it's a one-time thing. Yes, this background process might take days or weeks - that's OK as long as it happens online.

I'm not sure that's actually necessary. You could just log a wal record saying "checksum this block" and if it gets replayed then recalculate the checksum on that block again. This record could be exempt from the usual rules for having a fpw.

There's no danger of torn pages from the checksum alone. The danger would be if some other operation does dirty that page then your need to know that the page is in this weird in between state where it's dirty but not yet had a fpw written. 

I'm not sure whether it's worth the infrastructure to have such a state just for this or not. On the other hand it sounds like something that would be useful.


Re: [HACKERS] Online enabling of page level checksums

From
David Christensen
Date:
> On Jan 23, 2017, at 10:59 AM, David Christensen <david@endpoint.com> wrote:
>
>>
>> On Jan 23, 2017, at 10:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> On 23 January 2017 at 16:32, David Christensen <david@endpoint.com> wrote:
>>
>>> ** Handling checksums on a standby:
>>>
>>> How to handle checksums on a standby is a bit trickier since checksums are inherently a local cluster state and not
WALlogged but we are storing state in the system tables for each database we need to make sure that the replicas
reflecttruthful state for the checksums for the cluster. 
>>
>> Not WAL logged? If the objective of this feature is robustness, it
>> will need to be WAL logged.
>>
>> Relation options aren't accessed by the startup process during
>> recovery, so that part won't work in recovery. Perhaps disable
>> checksumming until the everything is enabled rather than relation by
>> relation.
>>
>> If y'all serious about this then you're pretty late in the cycle for
>> such a huge/critical patch. So lets think of ways of reducing the
>> complexity...
>>
>> Seems most sensible to add the "enable checksums for table" function
>> into VACUUM because then it will happen automatically via autovacuum,
>> or you could force the issue using something like
>> vacuumdb --jobs 4 --enable-checksums
>> That gives you vacuum_delay and a background worker for no effort
>
> I’m not sure that this will work as-is, unless we’re forcing VACUUM to ignore the visibility map.  I had originally
consideredhaving this sit on top of VACUUm though, we just need a way to iterate over all relations and read every
page.

Another issue with this (that I think would still exist with the bgworker approach) is how to handle databases with
datallowconn= 0.  `vacuumdb`, at least, explicitly filters out these rows when iterating over databases to connect to,
sowhile we could enable them for all databases, we can’t enable for the cluster without verifying that these disallowed
dbsare already checksummed. 
--
David Christensen
End Point Corporation
david@endpoint.com
785-727-1171






Re: [HACKERS] Online enabling of page level checksums

From
Jim Nasby
Date:
On 1/23/17 1:11 PM, David Christensen wrote:
>> I’m not sure that this will work as-is, unless we’re forcing VACUUM to ignore the visibility map.  I had originally
consideredhaving this sit on top of VACUUm though, we just need a way to iterate over all relations and read every
page.
> Another issue with this (that I think would still exist with the bgworker approach) is how to handle databases with
datallowconn= 0.  `vacuumdb`, at least, explicitly filters out these rows when iterating over databases to connect to,
sowhile we could enable them for all databases, we can’t enable for the cluster without verifying that these disallowed
dbsare already checksummed.
 

For a first pass, I think it's acceptable for autovac and vac to notice 
if a relation needs checksums computed and ignore the VM in that case 
(make sure it's ignoring all frozen bits too).

Likewise, for right now I think it's OK to force users that are enabling 
this to manually connect to datallowcon=false and run vacuum.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Online enabling of page level checksums

From
Tom Lane
Date:
Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
> For a first pass, I think it's acceptable for autovac and vac to notice 
> if a relation needs checksums computed and ignore the VM in that case 
> (make sure it's ignoring all frozen bits too).

> Likewise, for right now I think it's OK to force users that are enabling 
> this to manually connect to datallowcon=false and run vacuum.

I think it's a *complete* mistake to overload vacuum with this
responsibility.  Build a separate tool with a separate scheduling policy.

As one reason why not: vacuum doesn't generally attempt to scan indexes
sequentially at all.  Some of the index AMs might happen to do that, but
touching every page of an index is nowhere in vacuum's charter.  Nor is
there a good reason for index AMs to be involved in the job of placing
checksums, but they'd all have to know about this if you insist on going
through vacuum.
        regards, tom lane



Re: [HACKERS] Online enabling of page level checksums

From
Jim Nasby
Date:
On 1/23/17 12:03 PM, Greg Stark wrote:
> On Jan 22, 2017 11:13 AM, "Magnus Hagander" <magnus@hagander.net
> <mailto:magnus@hagander.net>> wrote:
>
>
>     Yes, this means the entire db will end up in the transaction log
>     since everything is rewritten. That's not great, but for a lot of
>     people that will be a trade they're willing to make since it's a
>     one-time thing. Yes, this background process might take days or
>     weeks - that's OK as long as it happens online.
>
>
> I'm not sure that's actually necessary. You could just log a wal record
> saying "checksum this block" and if it gets replayed then recalculate
> the checksum on that block again. This record could be exempt from the
> usual rules for having a fpw.
>
> There's no danger of torn pages from the checksum alone. The danger
> would be if some other operation does dirty that page then your need to
> know that the page is in this weird in between state where it's dirty
> but not yet had a fpw written.
>
> I'm not sure whether it's worth the infrastructure to have such a state
> just for this or not. On the other hand it sounds like something that
> would be useful.

I'm a bit concerned about how much fancy we're trying to put into a 
first pass at this.

I think it's reasonable to require a fairly significant amount of effort 
on the part of an admin to enable checksums. For that matter, I think 
it'd be a significant improvement if there was NO way to enable 
checksums, only disable them. That would at least allow us to enable 
them by default in initdb and provide DBAs an easy way to disable them 
if desired. That would get us a lot more data about whether checksums 
help with corruption than we're going to get otherwise.

To put it another way, I think it's entirely reasonable *from a 
technical standpoint* to enable by default in 10, with only the ability 
to dynamically disable. Given the concerns that keep popping up about 
dynamically enabling, I'm not at all sure that we could get that into 10.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Online enabling of page level checksums

From
David Steele
Date:
On 1/23/17 9:45 PM, Jim Nasby wrote:

> To put it another way, I think it's entirely reasonable *from a
> technical standpoint* to enable by default in 10, with only the ability
> to dynamically disable. Given the concerns that keep popping up about
> dynamically enabling, I'm not at all sure that we could get that into 10.

+1.  I think this is a reasonable suggestion and realistic to complete
for inclusion in 10.

-- 
-David
david@pgmasters.net