Re: Offline enabling/disabling of data checksums - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: Offline enabling/disabling of data checksums
Msg-id alpine.DEB.2.21.1903210745310.3843@lancre
In response to Re: Offline enabling/disabling of data checksums  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Offline enabling/disabling of data checksums
List pgsql-hackers
Bonjour Michaël,

> On Wed, Mar 20, 2019 at 05:46:32PM +0100, Fabien COELHO wrote:
>> I think that the motivation/risks should appear before the solution. "As xyz
>> ..., ...", or at least the logical link should be outlined there.
>>
>> It is not clear to me whether the following sentences, which seem specific
>> to "pg_rewind", are linked to the previous advice, which seems rather to
>> refer to streaming replication?
>
> Do you have a better idea of formulation?

I can try, but I must admit that I'm fuzzy about the actual issue. Is
there a problem with streaming replication when checksum settings are
inconsistent, or not?

You seem to suggest that the issue is more about how some commands or 
backup tools operate on a cluster.

I'll reread the thread carefully and will make a proposal.

> Imagine for example a primary-standby with checksums disabled: [...]

Yep, that's cool.

>> Shouldn't disabling in reverse order be safe? The checksums are not checked
>> afterwards?
>
> I don't quite understand your comment about the ordering.  If all the 
> standbys are destroyed first, then enabling/disabling checksums happens 
> at a single place.

Sure. I was suggesting that disabling on replicated clusters is possibly
safer, but I do not know the details of replication & checksumming with
enough precision to be sure about it.

>> After the reboot, some data files are not fully updated with their
>> checksums, although the control file says that they are. It should then
>> fail after a restart when a no-checksum page is loaded?
>>
>> What am I missing?
>
> Please note that we do that in other tools as well, and we live fine
> with that: pg_basebackup and pg_rewind, just to name two.

The fact that other commands are exposed to the same potential risk is not
a very good argument for not fixing it.

> I am not saying that it is not a problem in some cases, but I am saying 
> that this is not a problem that this patch should solve.

As solving the issue involves exchanging two lines and setting one boolean
parameter to true, I do not see why it should not be done. Fixing the
issue takes much less time than writing about it...

And if other commands can be improved, that is fine with me.

> If we were to do something about that, it could make sense to make
> fsync_pgdata() smarter so that the control file is flushed last there, or
> define flush strategies there.

ISTM that this would not work: The control file update can only be done
*after* the fsync, so that it describes the cluster's actual status;
otherwise it is just a question of luck whether the cluster is corrupt
after a crash while fsyncing. The enforced order of operations, with a
barrier in between, is the important thing here.
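
To make the ordering argument concrete, here is a minimal Python sketch of
what I have in mind. It is not the patch's code: the function names, the
"control" file name, its "checksums=on" content and the use of zlib.crc32 as
a checksum are all hypothetical stand-ins, and the directory fsync assumes a
POSIX system.

```python
import os
import zlib


def write_durably(path, data):
    """Write data to path and fsync the file before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())


def fsync_dir(dirpath):
    """Flush directory metadata (POSIX only; an assumption of this sketch)."""
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def enable_checksums(datadir, pages):
    """Rewrite pages with checksums, then flip the control flag last.

    Returns the list of steps in the order they were made durable.
    """
    steps = []
    # Step 1: rewrite every data file with a checksum appended.
    # zlib.crc32 is a stand-in, not the checksum PostgreSQL uses.
    for name, payload in sorted(pages.items()):
        crc = zlib.crc32(payload).to_bytes(4, "big")
        write_durably(os.path.join(datadir, name), payload + crc)
        steps.append(name)
    # Barrier: everything above must be on disk before going further.
    fsync_dir(datadir)
    steps.append("barrier")
    # Step 2: only now may the control file claim checksums are enabled.
    write_durably(os.path.join(datadir, "control"), b"checksums=on\n")
    fsync_dir(datadir)
    steps.append("control")
    return steps
```

In this sketch, a crash anywhere before the barrier leaves the control file
untouched, so the cluster still claims to have no checksums and the partially
rewritten files are never verified. Swapping the two steps would let the
control file claim checksums that some files do not yet carry.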

-- 
Fabien.
