Thread: Online checksums patch - once again

Online checksums patch - once again

From
Magnus Hagander
Date:
OK, let's try this again :)

This is work mainly based on the first version of the online checksums patch, built on top of Andres' WIP patchset for global barriers (https://www.postgresql.org/message-id/20181030051643.elbxjww5jjgnjaxg%40alap3.anarazel.de)


I'm including both those as attachments here to hopefully trick the cfbot into being able to build things :)

Other than that, we believe that the list of objections from the first round should be covered by now, but it's been quite some time and many emails, so it's possible we missed some. If you find any, please point them out!

Documentation needs another go-over, in particular based on changes made since then, but the basic principles of how it works should not have changed.

//Magnus & Daniel

Attachment

Re: Online checksums patch - once again

From
Alvaro Herrera
Date:
On 2019-Aug-26, Magnus Hagander wrote:

> OK, let's try this again :)
> 
> This is work mainly based in the first version of the online checksums
> patch, but based on top of Andres WIP patchset for global barriers (
> https://www.postgresql.org/message-id/20181030051643.elbxjww5jjgnjaxg%40alap3.anarazel.de
> )
> 
> Andres patch has been enhanced with wait events per
> https://www.postgresql.org/message-id/CABUevEwy4LUFqePC5YzanwtzyDDpYvgrj6R5WNznwrO5ouVg1w%40mail.gmail.com
> .

Travis says your SGML doesn't compile (maybe you just forgot to "git
add" and edit allfiles.sgml?):

/usr/bin/xmllint --path . --noout --valid postgres.sgml
reference.sgml:287: parser error : Entity 'pgVerifyChecksums' not defined
   &pgVerifyChecksums;
                      ^
reference.sgml:295: parser error : chunk is not well balanced
postgres.sgml:231: parser error : Failure to process entity reference
 &reference;
            ^
postgres.sgml:231: parser error : Entity 'reference' not defined
 &reference;
            ^

Other than bots, this patch doesn't seem to have attracted any reviewers
this time around.  Perhaps you need to bribe someone?  (Maybe "how sad
your committer SSH key stopped working" would do?)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online checksums patch - once again

From
Magnus Hagander
Date:


On Thu, Sep 26, 2019 at 9:48 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2019-Aug-26, Magnus Hagander wrote:

> OK, let's try this again :)
>
> This is work mainly based in the first version of the online checksums
> patch, but based on top of Andres WIP patchset for global barriers (
> https://www.postgresql.org/message-id/20181030051643.elbxjww5jjgnjaxg%40alap3.anarazel.de
> )
>
> Andres patch has been enhanced with wait events per
> https://www.postgresql.org/message-id/CABUevEwy4LUFqePC5YzanwtzyDDpYvgrj6R5WNznwrO5ouVg1w%40mail.gmail.com
> .

Travis says your SGML doesn't compile (maybe you just forgot to "git
add" and edit allfiles.sgml?):

Nope, even easier -- the reference pgVerifyChecksums was renamed to pgChecksums and for some reason we missed that in the merge.

I've rebased again on top of today's master, but that was the only change I had to make.
 

Other than bots, this patch doesn't seem to have attracted any reviewers
this time around.  Perhaps you need to bribe someone?  (Maybe "how sad
your committer SSH key stopped working" would do?)

Hmm. I don't think that's a bribe, that's a threat. However, maybe it will work.
 
--
Attachment

Re: Online checksums patch - once again

From
Tomas Vondra
Date:
On Mon, Sep 30, 2019 at 01:03:20PM +0200, Magnus Hagander wrote:
>On Thu, Sep 26, 2019 at 9:48 PM Alvaro Herrera <alvherre@2ndquadrant.com>
>wrote:
>
>> On 2019-Aug-26, Magnus Hagander wrote:
>>
>> > OK, let's try this again :)
>> >
>> > This is work mainly based in the first version of the online checksums
>> > patch, but based on top of Andres WIP patchset for global barriers (
>> >
>> https://www.postgresql.org/message-id/20181030051643.elbxjww5jjgnjaxg%40alap3.anarazel.de
>> > )
>> >
>> > Andres patch has been enhanced with wait events per
>> >
>> https://www.postgresql.org/message-id/CABUevEwy4LUFqePC5YzanwtzyDDpYvgrj6R5WNznwrO5ouVg1w%40mail.gmail.com
>> > .
>>
>> Travis says your SGML doesn't compile (maybe you just forgot to "git
>> add" and edit allfiles.sgml?):
>>
>
>Nope, even easier -- the reference pgVerifyChecksums was renamed to
>pgChecksums and for some reason we missed that in the merge.
>
>I've rebased again on top of todays master, but that was the only change I
>had to make.
>
>
>Other than bots, this patch doesn't seem to have attracted any reviewers
>> this time around.  Perhaps you need to bribe someone?  (Maybe "how sad
>> your committer SSH key stopped working" would do?)
>>
>
>Hmm. I don't think that's a bribe, that's a threat. However, maybe it will
>work.
>

IMHO the patch is ready to go - I think the global barrier solves the
issue in the previous version, and that's the only problem I'm aware of.
So +1 from me to go ahead and push it.

And now please uncomment my commit SSH key again, please ;-)


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online checksums patch - once again

From
Bruce Momjian
Date:
On Mon, Sep 30, 2019 at 02:49:44PM +0200, Tomas Vondra wrote:
> On Mon, Sep 30, 2019 at 01:03:20PM +0200, Magnus Hagander wrote:
> > Other than bots, this patch doesn't seem to have attracted any reviewers
> > > this time around.  Perhaps you need to bribe someone?  (Maybe "how sad
> > > your committer SSH key stopped working" would do?)
> > > 
> > 
> > Hmm. I don't think that's a bribe, that's a threat. However, maybe it will
> > work.
> > 
> 
> IMHO the patch is ready to go - I think the global barrier solves the
> issue in the previous version, and that's the only problem I'm aware of.
> So +1 from me to go ahead and push it.
> 
> And now please uncomment my commit SSH key again, please ;-)

For adding cluster-level encryption to Postgres, the plan is to create a
standby that has encryption enabled, then switch over to it.  Is that a
method we support now for adding checksums to Postgres?  Do we need the
ability to do it in-place too?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: Online checksums patch - once again

From
Magnus Hagander
Date:


On Mon, Sep 30, 2019 at 4:53 PM Bruce Momjian <bruce@momjian.us> wrote:
On Mon, Sep 30, 2019 at 02:49:44PM +0200, Tomas Vondra wrote:
> On Mon, Sep 30, 2019 at 01:03:20PM +0200, Magnus Hagander wrote:
> > Other than bots, this patch doesn't seem to have attracted any reviewers
> > > this time around.  Perhaps you need to bribe someone?  (Maybe "how sad
> > > your committer SSH key stopped working" would do?)
> > >
> >
> > Hmm. I don't think that's a bribe, that's a threat. However, maybe it will
> > work.
> >
>
> IMHO the patch is ready to go - I think the global barrier solves the
> issue in the previous version, and that's the only problem I'm aware of.
> So +1 from me to go ahead and push it.
>
> And now please uncomment my commit SSH key again, please ;-)

For adding cluster-level encryption to Postgres, the plan is to create a
standby that has encryption enabled, then switchover to it.  Is that a
method we support now for adding checksums to Postgres?  Do we need the
ability to do it in-place too?

I definitely think we need the ability to do it in-place as well, yes. 

--

Re: Online checksums patch - once again

From
Bruce Momjian
Date:
On Mon, Sep 30, 2019 at 04:57:41PM +0200, Magnus Hagander wrote:
> 
> 
> On Mon, Sep 30, 2019 at 4:53 PM Bruce Momjian <bruce@momjian.us> wrote:
> 
>     On Mon, Sep 30, 2019 at 02:49:44PM +0200, Tomas Vondra wrote:
>     > On Mon, Sep 30, 2019 at 01:03:20PM +0200, Magnus Hagander wrote:
>     > > Other than bots, this patch doesn't seem to have attracted any
>     reviewers
>     > > > this time around.  Perhaps you need to bribe someone?  (Maybe "how
>     sad
>     > > > your committer SSH key stopped working" would do?)
>     > > >
>     > >
>     > > Hmm. I don't think that's a bribe, that's a threat. However, maybe it
>     will
>     > > work.
>     > >
>     >
>     > IMHO the patch is ready to go - I think the global barrier solves the
>     > issue in the previous version, and that's the only problem I'm aware of.
>     > So +1 from me to go ahead and push it.
>     >
>     > And now please uncomment my commit SSH key again, please ;-)
> 
>     For adding cluster-level encryption to Postgres, the plan is to create a
>     standby that has encryption enabled, then switchover to it.  Is that a
>     method we support now for adding checksums to Postgres?  Do we need the
>     ability to do it in-place too?
> 
> 
> I definitely think we need the ability to do it in-place as well, yes. 

OK, just had to ask.  I think for encryption, for the first version, we
will do just replica-only changes.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: Online checksums patch - once again

From
Magnus Hagander
Date:


On Mon, Sep 30, 2019 at 2:49 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Mon, Sep 30, 2019 at 01:03:20PM +0200, Magnus Hagander wrote:
>On Thu, Sep 26, 2019 at 9:48 PM Alvaro Herrera <alvherre@2ndquadrant.com>
>wrote:
>
>> On 2019-Aug-26, Magnus Hagander wrote:
>>
>> > OK, let's try this again :)
>> >
>> > This is work mainly based in the first version of the online checksums
>> > patch, but based on top of Andres WIP patchset for global barriers (
>> >
>> https://www.postgresql.org/message-id/20181030051643.elbxjww5jjgnjaxg%40alap3.anarazel.de
>> > )
>> >
>> > Andres patch has been enhanced with wait events per
>> >
>> https://www.postgresql.org/message-id/CABUevEwy4LUFqePC5YzanwtzyDDpYvgrj6R5WNznwrO5ouVg1w%40mail.gmail.com
>> > .
>>
>> Travis says your SGML doesn't compile (maybe you just forgot to "git
>> add" and edit allfiles.sgml?):
>>
>
>Nope, even easier -- the reference pgVerifyChecksums was renamed to
>pgChecksums and for some reason we missed that in the merge.
>
>I've rebased again on top of todays master, but that was the only change I
>had to make.
>
>
>Other than bots, this patch doesn't seem to have attracted any reviewers
>> this time around.  Perhaps you need to bribe someone?  (Maybe "how sad
>> your committer SSH key stopped working" would do?)
>>
>
>Hmm. I don't think that's a bribe, that's a threat. However, maybe it will
>work.
>

IMHO the patch is ready to go - I think the global barrier solves the
issue in the previous version, and that's the only problem I'm aware of.
So +1 from me to go ahead and push it.

Not to devalue your review, but I'd really appreciate a review from one of the people who spotted the issue initially.

Especially -- Andres, any chance I can bribe you to take another look?


And now please uncomment my commit SSH key again, please ;-)

I'll consider it...
 
--

Re: Online checksums patch - once again

From
Andres Freund
Date:
Hi,

On 2019-09-30 16:59:00 +0200, Magnus Hagander wrote:
> On Mon, Sep 30, 2019 at 2:49 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
> wrote:
> > IMHO the patch is ready to go - I think the global barrier solves the
> > issue in the previous version, and that's the only problem I'm aware of.
> > So +1 from me to go ahead and push it.

I don't think the global barrier part is necessarily ready. I wrote it
on a flight back from a conference, to allow Magnus to make some
progress. And I don't think it has received meaningful review so far.


> Especially -- Andres, any chance I can bribe you to take another look?

I'll try to take a look.

Greetings,

Andres Freund



Re: Online checksums patch - once again

From
Magnus Hagander
Date:
On Mon, Sep 30, 2019 at 6:11 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2019-09-30 16:59:00 +0200, Magnus Hagander wrote:
> On Mon, Sep 30, 2019 at 2:49 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
> wrote:
> > IMHO the patch is ready to go - I think the global barrier solves the
> > issue in the previous version, and that's the only problem I'm aware of.
> > So +1 from me to go ahead and push it.

I don't think the global barrier part is necessarily ready. I wrote it
on a flight back from a conference, to allow Magnus to make some
progress. And I don't think it has received meaningful review so far.

I don't believe it has, no. I wouldn't trust my own level of review at least :)


> Especially -- Andres, any chance I can bribe you to take another look?

I'll try to take a look.

Much appreciated! 

--

Re: Online checksums patch - once again

From
Michael Paquier
Date:
On Wed, Oct 02, 2019 at 08:59:27PM +0200, Magnus Hagander wrote:
> Much appreciated!

The latest patch does not apply, could you send a rebase?  Moved it to
next CF, waiting on author.
--
Michael

Attachment

Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 1 Dec 2019, at 03:32, Michael Paquier <michael@paquier.xyz> wrote:

> The latest patch does not apply, could you send a rebase?  Moved it to
> next CF, waiting on author.

Attached is a rebased v14 patchset on top of master.  The Global Barriers patch
is left as a prerequisite, but it will obviously be dropped, or be
significantly changed, once the work Robert is doing with ProcSignalBarrier
lands.

cheers ./daniel


Attachment

Re: Online checksums patch - once again

From
Robert Haas
Date:
On Tue, Dec 3, 2019 at 6:41 PM Daniel Gustafsson <daniel@yesql.se> wrote:
> Attached is a rebased v14 patchset on top of master.  The Global Barriers patch
> is left as a prerequisite, but it will obviously be dropped, or be
> significantly changed, once the work Robert is doing with ProcSignalBarrier
> lands.

Any chance you and/or Magnus could offer opinions on some of those
patches? I am reluctant to start committing things with nobody having
replied.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 5 Dec 2019, at 16:13, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> On Tue, Dec 3, 2019 at 6:41 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>> Attached is a rebased v14 patchset on top of master.  The Global Barriers patch
>> is left as a prerequisite, but it will obviously be dropped, or be
>> significantly changed, once the work Robert is doing with ProcSignalBarrier
>> lands.
>
> Any chance you and/or Magnus could offer opinions on some of those
> patches? I am reluctant to start committing things with nobody having
> replied.

I am currently reviewing your latest patchset, but need a bit more time.

cheers ./daniel


Re: Online checksums patch - once again

From
Robert Haas
Date:
On Thu, Dec 5, 2019 at 10:28 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> I am currently reviewing your latest patchset, but need a bit more time.

Oh, great, thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 5 Dec 2019, at 16:13, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Dec 3, 2019 at 6:41 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>> Attached is a rebased v14 patchset on top of master.  The Global Barriers patch
>> is left as a prerequisite, but it will obviously be dropped, or be
>> significantly changed, once the work Robert is doing with ProcSignalBarrier
>> lands.
>
> Any chance you and/or Magnus could offer opinions on some of those
> patches? I am reluctant to start committing things with nobody having
> replied.

Attached is a v15 of the online checksums patchset (minus 0005), rebased on top
of your v3 ProcSignalBarrier patch rather than Andres' PoC GlobalBarrier patch.
It does take the perhaps controversial approach of replacing the SAMPLE
barrier with the CHECKSUM barrier.  The cfbot will be angry since this email
doesn't contain the procsignalbarrier patch, but it sounded like that would go
in shortly so I opted for that.
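
Concretely, the barrier part of the change boils down to something like this
(simplified sketch, not the exact patch hunk):

    typedef enum
    {
        PROCSIGNAL_BARRIER_CHECKSUM   /* replaces the SAMPLE barrier from the
                                       * ProcSignalBarrier patch */
    } ProcSignalBarrierType;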

This version also contains touchups to the documentation part, as well as a
pgindent run.

If reviewers think this version is nearing completion, then a v16 should
address the comment below, but as this version switches its underlying
infrastructure it still seemed useful for testing.

+   /*
+    * Force a checkpoint to get everything out to disk. XXX: this should
+    * probably not be an IMMEDIATE checkpoint, but leave it there for now for
+    * testing.
+    */
+   RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT | CHECKPOINT_IMMEDIATE);
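
(For a final version that would presumably just become

    RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT);

i.e. dropping CHECKPOINT_IMMEDIATE so the checkpoint is spread out.)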

cheers ./daniel


Attachment

Re: Online checksums patch - once again

From
Sergei Kornilov
Date:
Hello

> Attached is a v15 of the online checksums patchset (minus 0005), rebased on top
> of your v3 ProcSignalBarrier patch rather than Andres' PoC GlobalBarrier patch.
> It does take the, perhaps, controversial approach of replacing the SAMPLE
> barrier with the CHECKSUM barrier. The cfbot will be angry since this email
> doesn't contain the procsignalbarrier patch, but it sounded like that would go
> in shortly so opted for that.

ProcSignalBarrier was committed, so the online checksums patchset has no other pending dependencies and should apply
cleanly on master, right?  In that case the patchset needs another rebase; it does not apply...
 

regards, Sergei



Re: Online checksums patch - once again

From
Robert Haas
Date:
On Mon, Dec 16, 2019 at 10:16 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> If reviewers think this version is nearing completion, then a v16 should
> address the comment below, but as this version switches its underlying
> infrastructure it still seemed useful for testing.

I think this patch still needs a lot of work.

- doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);
+ doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites ||
DataChecksumsInProgress());

This will have a small performance cost in a pretty hot code path. Not
sure that it's enough to worry about, though.

-DataChecksumsEnabled(void)
+DataChecksumsNeedWrite(void)
 {
  Assert(ControlFile != NULL);
  return (ControlFile->data_checksum_version > 0);
 }

This seems troubling, because data_checksum_version can now change,
but you're still accessing it without a lock. This complaint applies
likewise to a bunch of related functions in xlog.c as well.

+ elog(ERROR, "Checksums not in inprogress mode");

Questionable capitalization and punctuation.

+void
+SetDataChecksumsOff(void)
+{
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+
+ ControlFile->data_checksum_version = 0;
+ UpdateControlFile();
+ LWLockRelease(ControlFileLock);
+ WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM));
+
+ XlogChecksums(0);
+}

This looks racy. Suppose that checksums are on.  Other backends will
see that checksums are disabled as soon as
ControlFile->data_checksum_version = 0 happens, and they will feel
free to write blocks without checksums. Now we crash, and those blocks
exist on disk even though the on-disk state still otherwise shows
checksums fully enabled. It's a little better if we stop reading
data_checksum_version without a lock, because now nobody else can see
the updated state until we've actually updated the control file. But
even then, isn't it strange that writes of non-checksummed stuff could
appear or be written to disk before XlogChecksums(0) happens? If
that's safe, it seems like it deserves some kind of comment.

+ /*
+ * If we reach this point with checksums in inprogress state, we notify
+ * the user that they need to manually restart the process to enable
+ * checksums. This is because we cannot launch a dynamic background worker
+ * directly from here, it has to be launched from a regular backend.
+ */
+ if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_VERSION)
+ ereport(WARNING,
+ (errmsg("checksum state is \"inprogress\" with no worker"),
+ errhint("Either disable or enable checksums by calling the
pg_disable_data_checksums() or pg_enable_data_checksums()
functions.")));

This seems pretty half-baked.

+ (errmsg("could not start checksumhelper: has been canceled")));
+ (errmsg("could not start checksumhelper: already running")));
+ (errmsg("failed to start checksum helper launcher")));

These seem rough around the edges. Using an internal term like
'checksumhelper' in a user-facing error message doesn't seem great.
Generally primary error messages are phrased as a single utterance
where we can, rather than colon-separated fragments like this. The
third message calls it 'checksum helper launcher' whereas the other
two call it 'checksumhelper'. It also isn't very helpful; I don't
think most people like a message saying that something failed with no
explanation given.

+ elog(DEBUG1, "Checksumhelper skipping relation %d as it no longer
exists", relationId);

Here's another way to spell 'checksumhelper', and this time it refers
to the worker rather than the launcher. Also, relation IDs are OIDs,
so need to be printed with %u, and usually we try to print names if
possible. Also, this message, like a lot of messages in this patch,
begins with a capital letter and does not end with a period. That is
neither the style for primary messages nor the style for detail
messages. As these are primary messages, the primary message style
should be used. That style is no capital and no period.
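
So, something more along the lines of this (sketch only):

    elog(DEBUG1, "skipping relation with OID %u since it no longer exists",
         relationId);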

+ if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
+ {
+ ereport(LOG,
+ (errmsg("failed to start worker for checksumhelper in \"%s\"",
+ db->dbname)));
+ return FAILED;
+ }

I don't think having constants with names like
SUCCESSFUL/ABORTED/FAILED is a very good idea. Too much chance of name
collisions. I suggest adding a prefix.
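
Something like the following, for example (sketch; pick whatever prefix fits):

    typedef enum ChecksumHelperResult
    {
        CHECKSUMHELPER_SUCCESSFUL,
        CHECKSUMHELPER_ABORTED,
        CHECKSUMHELPER_FAILED
    } ChecksumHelperResult;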

Also, the retry logic here doesn't look particularly robust.
RegisterDynamicBackgroundWorker will fail if no slots are available;
if that happens twice for the same database, once on first attempting
it and again when retrying it, the whole process fails, all state is
lost, and all work has to be redone. That seems neither particularly
unlikely nor pleasant.

+ if (DatabaseList == NIL || list_length(DatabaseList) == 0)

I don't think that the second half of this test serves any purpose.

+ snprintf(activity, sizeof(activity), "Waiting for current
transactions to finish (waiting for %d)", waitforxid);

%u here too.

+ if (pgc->relpersistence == 't')

Use the relevant constant.
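
i.e. presumably:

    if (pgc->relpersistence == RELPERSISTENCE_TEMP)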

+ /*
+ * Wait for all temp tables that existed when we started to go away. This
+ * is necessary since we cannot "reach" them to enable checksums. Any temp
+ * tables created after we started will already have checksums in them
+ * (due to the inprogress state), so those are safe.
+ */

This does not seem very nice. It just leaves a worker running more or
less forever. It's essentially waiting for temp-table using sessions
to go away, but that could take a really long time.

+ WAIT_EVENT_PG_SLEEP);

You need to invent a new wait event and add docs for it.

+ if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_CHECKSUM))
+ {
+ /*
+ * By virtue of getting here (i.e. interrupts being processed), we
+ * know that this backend won't have any in-progress writes (which
+ * might have missed the checksum change).
+ */
+ }

I don't believe this. I already wrote some about this over here:

http://postgr.es/m/CA+TgmobORrsgSUydZ3MsSw9L5MBUGz7jRK+973uPZgiyCQ81ag@mail.gmail.com

As a general point, I think that the idea of the ProcSignalBarrier
mechanism is that every backend has some private state that needs to
be updated, and when it absorbs the barrier you know that it's updated
that state, and so when everybody's absorbed the barrier you know that
all the state has been updated. Here, however, there actually is no
backend-private state. The only state that everyone's consulting is
the shared state stored in ControlFile->data_checksum_version. So what
does absorbing the barrier prove? Only that we've reached a
CHECK_FOR_INTERRUPTS(). But that is a useful guarantee only if we
never check for interrupts between the time we examine
ControlFile->data_checksum_version and the time we use it, and I see
no particular reason to believe that should always be true, and I
suspect it isn't, and even if it happens to be true today I think it
could get broken in the future pretty easily. There are no particular
restrictions documented in terms of where DataChecksumsNeedWrite(),
XLogHintBitIsNeeded(), etc. can be checked or what can be done between
checking the value and using it. The issue doesn't arise for today's
DataChecksumsEnabled() because the value can't ever change, but with
this patch things can change, and to me what the patch does about that
doesn't really look adequate.
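
To make the distinction concrete, the kind of arrangement the barrier
machinery is designed around would look roughly like this (sketch, all names
invented here):

    /*
     * Backend-private cache, only updated at startup and when absorbing the
     * barrier, never read from shared memory in the hot paths.
     */
    static bool LocalDataChecksumsNeedWrite = false;

    bool
    DataChecksumsNeedWrite(void)
    {
        return LocalDataChecksumsNeedWrite;
    }

    static void
    AbsorbChecksumsBarrier(void)
    {
        LWLockAcquire(ControlFileLock, LW_SHARED);
        LocalDataChecksumsNeedWrite = (ControlFile->data_checksum_version > 0);
        LWLockRelease(ControlFileLock);
    }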

I'm sort of out of time for right now, but I think this patch needs a
lot more work on the concurrency end of things. It seems to me that it
probably mostly works in practice, but that the whole concurrency
mechanism is not very solid and probably has a lot of rare cases where
it can just misbehave if you get unlucky. I'll try to spend some more
time thinking about this next week. I also think that
the mechanism for getting from 'inprogress' to 'on' seems fragile and
under-engineered. It still bothers me that there's no mechanism for
persisting the progress that we've made in enabling checksums; but
even apart from that, I think that there just hasn't been enough
thought given here to error/problem cases.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 3 Jan 2020, at 23:07, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Dec 16, 2019 at 10:16 AM Daniel Gustafsson <daniel@yesql.se> wrote:
>> If reviewers think this version is nearing completion, then a v16 should
>> address the comment below, but as this version switches its underlying
>> infrastructure it still seemed useful for testing.
>
> I think this patch still needs a lot of work.

Thanks a lot for your thorough review, much appreciated!  Also, sorry for being
slow to respond.  Below are fixes and responses to most of the feedback, but I
need a bit more time to think about the concurrency aspects that you brought
up.  However, in the spirit of showing work early/often I opted for still
sending the partial response, to perhaps be able to at least close some of the
raised issues in the meantime.

> - doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);
> + doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites ||
> DataChecksumsInProgress());
>
> This will have a small performance cost in a pretty hot code path. Not
> sure that it's enough to worry about, though.

Not sure either, and/or how clever compilers are about inlining this.  As a
test, I've switched this over to be a static inline function, as its only
consumer is in xlog.c.  Now, as mentioned later in this review, reading the
version unlocked has issues, so do consider this a WIP test, not a final
suggestion.
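
Roughly like so (still reading the field unlocked, so strictly a sketch of the
WIP test, not a proposal):

    static inline bool
    DataChecksumsInProgress(void)
    {
        Assert(ControlFile != NULL);
        return (ControlFile->data_checksum_version ==
                PG_DATA_CHECKSUM_INPROGRESS_VERSION);
    }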

> -DataChecksumsEnabled(void)
> +DataChecksumsNeedWrite(void)
> {
>  Assert(ControlFile != NULL);
>  return (ControlFile->data_checksum_version > 0);
> }
>
> This seems troubling, because data_checksum_version can now change,
> but you're still accessing it without a lock. This complain applies
> likewise to a bunch of related functions in xlog.c as well.

Right, let me do some more thinking on this before addressing it in the next
version of the patch.  Simply wrapping it in a SHARED lock still has TOCTOU
problems, so
a bit more work/thinking is needed.

> + elog(ERROR, "Checksums not in inprogress mode");
>
> Questionable capitalization and punctuation.

Fixed capitalization, but elogs shouldn't end with a period so left that.

> +void
> +SetDataChecksumsOff(void)
> +{
> + LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
> +
> + ControlFile->data_checksum_version = 0;
> + UpdateControlFile();
> + LWLockRelease(ControlFileLock);
> + WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM));
> +
> + XlogChecksums(0);
> +}
>
> This looks racey. Suppose that checksums are on.  Other backends will
> see that checksums are disabled as soon as
> ControlFile->data_checksum_version = 0 happens, and they will feel
> free to write blocks without checksums. Now we crash, and those blocks
> exist on disk even though the on-disk state still otherwise shows
> checksums fully enabled. It's a little better if we stop reading
> data_checksum_version without a lock, because now nobody else can see
> the updated state until we've actually updated the control file. But
> even then, isn't it strange that writes of non-checksummed stuff could
> appear or be written to disk before XlogChecksums(0) happens? If
> that's safe, it seems like it deserves some kind of comment.

As mentioned above, I would like to address this in the next version.  I'm
working on it, just need a little more time and wanted to share progress on the
other bits.

> + /*
> + * If we reach this point with checksums in inprogress state, we notify
> + * the user that they need to manually restart the process to enable
> + * checksums. This is because we cannot launch a dynamic background worker
> + * directly from here, it has to be launched from a regular backend.
> + */
> + if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_VERSION)
> + ereport(WARNING,
> + (errmsg("checksum state is \"inprogress\" with no worker"),
> + errhint("Either disable or enable checksums by calling the
> pg_disable_data_checksums() or pg_enable_data_checksums()
> functions.")));
>
> This seems pretty half-baked.

I don't disagree with that.  However, given that enabling checksums is a pretty
intensive operation it seems somewhat unfriendly to automatically restart.  As
a DBA I wouldn't want that to kick off without manual intervention, but there
is also the risk of this being missed due to assumptions that it would restart.
Any ideas on how to treat this?

If/when we can restart the processing where it left off, without the need to go
over all data again, things might be different wrt the default action.

> + (errmsg("could not start checksumhelper: has been canceled")));
> + (errmsg("could not start checksumhelper: already running")));
> + (errmsg("failed to start checksum helper launcher")));
>
> These seem rough around the edges. Using an internal term like
> 'checksumhelper' in a user-facing error message doesn't seem great.
> Generally primary error messages are phrased as a single utterance
> where we can, rather than colon-separated fragments like this. The
> third message calls it 'checksum helper launcher' whereas the other
> two call it 'checksumhelper'. It also isn't very helpful; I don't
> think most people like a message saying that something failed with no
> explanation given.
>
> + elog(DEBUG1, "Checksumhelper skipping relation %d as it no longer
> exists", relationId);
>
> Here's another way to spell 'checksumhelper', and this time it refers
> to the worker rather than the launcher. Also, relation IDs are OIDs,
> so need to be printed with %u, and usually we try to print names if
> possible. Also, this message, like a lot of messages in this patch,
> begins with a capital letter and does not end with a period. That is
> neither the style for primary messages nor the style for detail
> messages. As these are primary messages, the primary message style
> should be used. That style is no capital and no period.

I've removed checksumhelper from all user facing strings, and only kept them in
the DEBUG strings (which to some extent probably will be removed before a final
version of the patch, so didn't spend too much time on those just now).  The
bgworker name is still checksumhelper launcher and checksumhelper worker, but
that could perhaps do with a better name too.

> + if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
> + {
> + ereport(LOG,
> + (errmsg("failed to start worker for checksumhelper in \"%s\"",
> + db->dbname)));
> + return FAILED;
> + }
>
> I don't think having constants with names like
> SUCCESSFUL/ABORTED/FAILED is a very good idea. Too much chance of name
> collisions. I suggest adding a prefix.

Fixed.

> Also, the retry logic here doesn't look particularly robust.
> RegisterDynamicBackgroundWorker will fail if no slots are available;
> if that happens twice for the same database, once on first attempting
> it and again when retrying it, the whole process fails, all state is
> lost, and all work has to be redone. That seems neither particularly
> unlikely nor pleasant.

Agreed, this was a brick or two shy of a load.  I've rewritten this logic to
cope better with the conditions around startup/shutdown of bgworkers.  I think
there should be some form of backoff mechanism as well in case there
temporarily aren't any slots, to avoid running through all the databases in
short order only to run up the retry counter.  Something like if X databases in
succession fail on no slot being available, back off a little before trying X+1
to allow for operations that consume the slot(s) to finish.  Or something.  It
wont help for systems which are permanently starved with a too low
max_worker_processes, but nothing sort of will.  For the latter, I've added a
note to the documentation.

> + if (DatabaseList == NIL || list_length(DatabaseList) == 0)
>
> I don't think that the second half of this test serves any purpose.

True, but I think the code is clearer if the second half is the one we keep, so
I went with that.

> + snprintf(activity, sizeof(activity), "Waiting for current
> transactions to finish (waiting for %d)", waitforxid);
>
> %u here too.

Fixed.

> + if (pgc->relpersistence == 't')
>
> Use the relevant constant.

Fixed.

> + /*
> + * Wait for all temp tables that existed when we started to go away. This
> + * is necessary since we cannot "reach" them to enable checksums. Any temp
> + * tables created after we started will already have checksums in them
> + * (due to the inprogress state), so those are safe.
> + */
>
> This does not seem very nice. It just leaves a worker running more or
> less forever. It's essentially waiting for temp-table using sessions
> to go away, but that could take a really long time.

It can, but is there a realistic alternative?  I can't think of one but if you
have ideas I'd love for this requirement to go away, or be made less blocking.

At the same time, enabling checksums is hardly the kind of operation one does
casually in a busy database, but probably a more planned operation.  This
requirement is mentioned in the documentation such that a DBA can plan for when
to start the processing.

> + WAIT_EVENT_PG_SLEEP);
>
> You need to invent a new wait event and add docs for it.

Done.  I failed to figure out a (IMO) good name though, and welcome suggestions
that are more descriptive.  CHECKSUM_ENABLE_STARTCONDITION was what I settled on
but I'm not too excited about it.

> + if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_CHECKSUM))
> + {
> + /*
> + * By virtue of getting here (i.e. interrupts being processed), we
> + * know that this backend won't have any in-progress writes (which
> + * might have missed the checksum change).
> + */
> + }
>
> I don't believe this. I already wrote some about this over here:
>
> http://postgr.es/m/CA+TgmobORrsgSUydZ3MsSw9L5MBUGz7jRK+973uPZgiyCQ81ag@mail.gmail.com
>
> As a general point, I think that the idea of the ProcSignalBarrier
> mechanism is that every backend has some private state that needs to
> be updated, and when it absorbs the barrier you know that it's updated
> that state, and so when everybody's absorbed the barrier you know that
> all the state has been updated. Here, however, there actually is no
> backend-private state. The only state that everyone's consulting is
> the shared state stored in ControlFile->data_checksum_version. So what
> does absorbing the barrier prove? Only that we've reached a
> CHECK_FOR_INTERRUPTS(). But that is a useful guarantee only if we
> never check for interrupts between the time we examine
> ControlFile->data_checksum_version and the time we use it, and I see
> no particular reason to believe that should always be true, and I
> suspect it isn't, and even if it happens to be true today I think it
> could get broken in the future pretty easily. There are no particular
> restrictions documented in terms of where DataChecksumsNeedWrite(),
> XLogHintBitIsNeeded(), etc. can be checked or what can be done between
> checking the value and using it. The issue doesn't arise for today's
> DataChecksumsEnabled() because the value can't ever change, but with
> this patch things can change, and to me what the patch does about that
> doesn't really look adequate.

I don't disagree with this, but I need to do a bit more thinking before
presenting a suggested fix for this concurrency issue.

> I'm sort of out of time for right now, but I think this patch needs a
> lot more work on the concurrency end of things. It seems to me that it
> probably mostly works in practice, but that the whole concurrency
> mechanism is not very solid and probably has a lot of rare cases where
> it can just misbehave if you get unlucky. I'll try to spend some more
> time thinking about this next week. I also think that the fact that
> the mechanism for getting from 'inprogress' to 'on' seems fragile and
> under-engineered. It still bothers me that there's no mechanism for
> persisting the progress that we've made in enabling checksums; but
> even apart from that, I think that there just hasn't been enough
> thought given here to error/problem cases.

Thanks again for reviewing (and working on the infrastructure required for this
patch to begin with)!  Regarding persisting the progress: that would be a
really neat feature but I don't have any suggestion on how to do that safely
for real use-cases.

Attached is a v16 rebased on top of current master which addresses the above
commented points, and which I am basing the concurrency work on.

cheers ./daniel




Attachment

Re: Online checksums patch - once again

From
Robert Haas
Date:
On Sat, Jan 18, 2020 at 6:18 PM Daniel Gustafsson <daniel@yesql.se> wrote:
> Thanks again for reviewing (and working on the infrastructure required for this
> patch to begin with)!  Regarding the persisting the progress; that would be a
> really neat feature but I don't have any suggestion on how to do that safely
> for real use-cases.

Leaving to one side the question of how much work is involved, could
we do something conceptually similar to relfrozenxid/datfrozenxid,
i.e. use catalog state to keep track of which objects have been
handled and which not?

Very rough sketch:

* set a flag indicating that checksums must be computed for all page writes
* use barriers and other magic to make sure everyone has gotten the
memo from the previous step
* use new catalog fields pg_class.relhaschecksums and
pg_database.dathaschecksums to track whether checksums are enabled
* keep launching workers for databases where !pg_database.dathaschecksums
until none remain (see the sketch below)
* mark checksums as fully enabled
* party
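
A minimal sketch of the launching step could look like this (the
dathaschecksums column and the launch_checksum_worker() helper are of course
hypothetical):

    Relation    rel = table_open(DatabaseRelationId, AccessShareLock);
    TableScanDesc scan = table_beginscan_catalog(rel, 0, NULL);
    HeapTuple   tup;

    while ((tup = heap_getnext(scan, ForwardScanDirection)) != NULL)
    {
        Form_pg_database db = (Form_pg_database) GETSTRUCT(tup);

        if (!db->dathaschecksums)            /* hypothetical new column */
            launch_checksum_worker(db->oid); /* hypothetical helper */
    }
    table_endscan(scan);
    table_close(rel, AccessShareLock);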

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Online checksums patch - once again

From
Magnus Hagander
Date:
 On Mon, Jan 20, 2020 at 12:14 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Jan 18, 2020 at 6:18 PM Daniel Gustafsson <daniel@yesql.se> wrote:
> > Thanks again for reviewing (and working on the infrastructure required for this
> > patch to begin with)!  Regarding the persisting the progress; that would be a
> > really neat feature but I don't have any suggestion on how to do that safely
> > for real use-cases.
>
> Leaving to one side the question of how much work is involved, could
> we do something conceptually similar to relfrozenxid/datfrozenxid,
> i.e. use catalog state to keep track of which objects have been
> handled and which not?
>
> Very rough sketch:
>
> * set a flag indicating that checksums must be computed for all page writes
> * use barriers and other magic to make sure everyone has gotten the
> memo from the previous step
> * use new catalog fields pg_class.relhaschecksums and
> pg_database.dathaschecksums to track whether checksums are enabled
> * keep launching workers for databases where !pg_class.dathaschecksums
> until none remain
> * mark checksums as fully enabled
> * party

We did discuss this back when we started work on this (I can't
remember if it was just me and Daniel and someone else or on a list --
but that's not important atm).

The reasoning that led us to *not* doing that is that it's a one-off
operation. That, along with the fact that we hope to at some point be
able to change the default to checksums on (and it wouldn't be
necessary for the transition on->off as that is very fast), means it
would become an increasingly rare one-off operation. And by adding these
flags to the catalogs, everybody is paying the overhead for this
one-off rare operation. Another option would be to add the flag at the
pg_database level, which would decrease the overhead, but our guess was
that this would also decrease the usefulness enough in most cases to
make it not worth it (most people with big databases don't have many big
databases in the same cluster -- it's usually just one or two, so in
the end the results would be more or less the same as we have now, as
it would have to keep re-doing the big ones).

Unless we actually want to support running systems more or less
permanently with some tables with checksums and other tables without
checksums. But that's going to have an effect on the validation of
checksums that would generate a huge overhead (since each buffer check
would have to look up the pg_class entry).

FYI, Daniel is working on an update that will include this -- so we
can see what the actual outcome of it is in terms of complexity as
well. Should hopefully be ready soon.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: Online checksums patch - once again

From
Robert Haas
Date:
On Wed, Jan 22, 2020 at 2:50 PM Magnus Hagander <magnus@hagander.net> wrote:
> The reasoning that led us to *not* doing that is that it's a one-off
> operation. That along with the fact that we hope to at some point be
> able to change the default to checksums on (and it wouldn't be
> necessary for the transition on->off as that is very fast), it would
> become an increasingly rare one-off operation. And by adding these
> flags to the catalogs, everybody is paying the overhead for this
> one-off rare operation. Another option would be to add the flag on the
> pg_database level which would decrease the overhead, but our guess was
> that this would also decrease the usefulness in most cases to make it
> not worth it (most people with big databases don't have many big
> databases in the same cluster -- it's usually just one or two, so in
> the end the results would be more or less the same as we have now as
> it would have to keep re-doing the big ones)

I understand, but the point for me is that the patch does not seem
robust as written. Nobody's going to be happy if there are reasonably
high-probability scenarios where it turns checksums part way on and
then just stops. Now, that can probably be improved to some degree
without adding catalog flags, but I bet it can be improved more and
for less effort if we do add catalog flags. Maybe being able to
survive a cluster restart without losing track of progress is not a
hard requirement for this feature, but it certainly seems nice. And I
would venture to say that continuing to run without giving up if there
happen to be no background workers available for a while IS a hard
requirement, because that can easily happen due to normal use of
parallel query. We do not normally commit features if, without any
error occurring, they might just give up part way through the
operation.

I think the argument about adding catalog flags adding overhead is
pretty much bogus. Fixed-width fields in catalogs are pretty cheap.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Online checksums patch - once again

From
Magnus Hagander
Date:
On Wed, Jan 22, 2020 at 12:20 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jan 22, 2020 at 2:50 PM Magnus Hagander <magnus@hagander.net> wrote:
> > The reasoning that led us to *not* doing that is that it's a one-off
> > operation. That along with the fact that we hope to at some point be
> > able to change the default to checksums on (and it wouldn't be
> > necessary for the transition on->off as that is very fast), it would
> > become an increasingly rare one-off operation. And by adding these
> > flags to the catalogs, everybody is paying the overhead for this
> > one-off rare operation. Another option would be to add the flag on the
> > pg_database level which would decrease the overhead, but our guess was
> > that this would also decrease the usefulness in most cases to make it
> > not worth it (most people with big databases don't have many big
> > databases in the same cluster -- it's usually just one or two, so in
> > the end the results would be more or less the same as we have now as
> > it would have to keep re-doing the big ones)
>
> I understand, but the point for me is that the patch does not seem
> robust as written. Nobody's going to be happy if there are reasonably
> high-probability scenarios where it turns checksums part way on and
> then just stops. Now, that can probably be improved to some degree
> without adding catalog flags, but I bet it can be improved more and
> for less effort if we do add catalog flags. Maybe being able to
> survive a cluster restart without losing track of progress is not a
> hard requirement for this feature, but it certainly seems nice. And I

It's certainly nice, but that is of course a cost/benefit tradeoff
calculation. Our thinking on that was that the cost was higher than
the benefit -- which may of course be wrong, and in that case it's
better to have it changed.


> would venture to say that continuing to run without giving up if there
> happen to be no background workers available for a while IS a hard
> requirement, because that can easily happen due to normal use of
> parallel query. We do not normally commit features if, without any
> error occurring, they might just give up part way through the
> operation.

That part I agree with, but I don't think that in itself requires
per-relation level tracking.


> I think the argument about adding catalog flags adding overhead is
> pretty much bogus. Fixed-width fields in catalogs are pretty cheap.

If that's the general view, then yeah our "cost calculations" were
off. I guess I may have been colored by the cost of adding statistics
counters, and had that influence the thinking. Incorrect judgement on
that cost certainly contributed to the decision back then.

But as noted, work is being done on adding it, so let's see what that
ends up looking like in reality.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: Online checksums patch - once again

From
Robert Haas
Date:
On Wed, Jan 22, 2020 at 3:28 PM Magnus Hagander <magnus@hagander.net> wrote:
> > I think the argument about adding catalog flags adding overhead is
> > pretty much bogus. Fixed-width fields in catalogs are pretty cheap.
>
> If that's the general view, then yeah our "cost calculations" were
> off. I guess I may have been colored by the cost of adding statistics
> counters, and had that influence the thinking. Incorrect judgement on
> that cost certainly contributed to the decision back then.

For either statistics or for pg_class, the amount of data that we have
to manage is proportional to the number of relations (which could be
big) multiplied by the data stored for each relation. But the
difference is that the stats file has to be rewritten, at least on a
per-database basis, very frequently, while pg_class goes through
shared-buffers and so doesn't provoke the same stupid
write-the-whole-darn-thing behavior. That is a pretty key difference,
IMHO.

Now, it would be nice to fix the stats system, but until we do, here we are.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 22 Jan 2020, at 23:07, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jan 22, 2020 at 3:28 PM Magnus Hagander <magnus@hagander.net> wrote:
>>> I think the argument about adding catalog flags adding overhead is
>>> pretty much bogus. Fixed-width fields in catalogs are pretty cheap.
>>
>> If that's the general view, then yeah our "cost calculations" were
>> off. I guess I may have been colored by the cost of adding statistics
>> counters, and had that influence the thinking. Incorrect judgement on
>> that cost certainly contributed to the decision back then.
>
> For either statistics or for pg_class, the amount of data that we have
> to manage is proportional to the number of relations (which could be
> big) multiplied by the data stored for each relation. But the
> difference is that the stats file has to be rewritten, at least on a
> per-database basis, very frequently, while pg_class goes through
> shared-buffers and so doesn't provoke the same stupid
> write-the-whole-darn-thing behavior. That is a pretty key difference,
> IMHO.

I think the cost is less about performance and more about carrying around an
attribute which won't be terribly interesting during the cluster lifetime,
except for the transition.  But it's, as you say, probably a manageable expense.

A bigger question is how to handle the offline capabilities.  pg_checksums can
enable or disable checksums in an offline cluster, which will put the cluster
in a state where the pg_control file and the catalog don't match at startup.
One strategy could be to always trust the pg_control file and alter the catalog
accordingly, but that still leaves a window of inconsistent cluster state.

cheers ./daniel


Re: Online checksums patch - once again

From
Robert Haas
Date:
On Thu, Jan 23, 2020 at 6:19 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> A bigger question is how to handle the offline capabilities.  pg_checksums can
> enable or disable checksums in an offline cluster, which will put the cluster
> in a state where the pg_control file and the catalog don't match at startup.
> One strategy could be to always trust the pg_control file and alter the catalog
> accordingly, but that still leaves a window of inconsistent cluster state.

I suggest that we define things so that the catalog state is only
meaningful during a state transition. That is, suppose the cluster
state is either "on", "enabling", or "off". When it's "on", checksums
are written and verified. When it is "off", checksums are not written
and not verified. When it's "enabling", checksums are written but not
verified. Also, when and only when the state is "enabling", the
background workers that try to rewrite relations to add checksums run,
and those workers look at the catalog state to figure out what to do.
Once the state changes to "on", those workers don't run any more, and
so the catalog state does not make any difference.
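
In other words, the cluster-wide state is just something like (sketch):

    typedef enum ChecksumState
    {
        CHECKSUMS_OFF,          /* not written, not verified */
        CHECKSUMS_ENABLING,     /* written but not verified; workers running */
        CHECKSUMS_ON            /* written and verified */
    } ChecksumState;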

A tricky problem is handling the case where the state is switched
from "enabling" to "on" and then back to "off" and then to "enabling"
again. You don't want to confuse the state from the previous round of
enabling with the state for the current round of enabling. Suppose in
addition to storing the cluster-wide state of on/off/enabling, we also
store an "enable counter" which is incremented every time the state
goes from "off" to "enabling". Then, for each database and relation,
we store a counter that indicates the value of the enable counter at
the time we last scanned/rewrote that relation to set checksums. Now,
you're covered. And, to save space, it can probably be a 32-bit
counter, since 4 billion disable/reenable cycles ought to be enough
for anybody.
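
In a worker the per-relation test would then boil down to something like this
(field and function names invented):

    if (classForm->relchecksumcycle == ControlFile->checksum_enable_cycle)
        continue;           /* already handled during this enabling cycle */

    /* otherwise scan/rewrite the relation and record the current cycle */
    process_relation_for_checksums(relid);
    SetRelationChecksumCycle(relid, ControlFile->checksum_enable_cycle);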

It would not be strictly necessary to store this in pg_class. Another
thing that could be done is to store it in a separate system table
that could even be truncated when enabling is not in progress - though
it would be unwise to assume that it's always truncated at the
beginning of an enabling cycle, since it would be hard to guarantee
that the previous enabling cycle didn't fail when trying to truncate.
So you'd probably still end up with something like the counter
approach. I am inclined to think that inventing a whole new catalog
for this is over-engineering, but someone might think differently.
Note that creating a table while enabling is in progress needs to set
the enabling counter for the new table to the new value of the
enabling counter, not the old one, because the new table starts empty
and won't end up with any pages that don't have valid checksums.
Similarly, TRUNCATE, CLUSTER, VACUUM FULL, and rewriting variants of
ALTER TABLE can set the new value for the enabling counter as a side
effect. That's probably easier and more efficient if it's just a value
in pg_class than if they have to go poking around in another catalog.
So I am tentatively inclined to think that just putting it in pg_class
makes more sense.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Online checksums patch - once again

From
Andres Freund
Date:
Hi,

On 2020-01-23 12:23:09 -0500, Robert Haas wrote:
> On Thu, Jan 23, 2020 at 6:19 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> > A bigger question is how to handle the offline capabilities.  pg_checksums can
> > enable or disable checksums in an offline cluster, which will put the cluster
> > in a state where the pg_control file and the catalog don't match at startup.
> > One strategy could be to always trust the pg_control file and alter the catalog
> > accordingly, but that still leaves a window of inconsistent cluster state.
> 
> I suggest that we define things so that the catalog state is only
> meaningful during a state transition. That is, suppose the cluster
> state is either "on", "enabling", or "off". When it's "on", checksums
> are written and verified. When it is "off", checksums are not written
> and not verified. When it's "enabling", checksums are written but not
> verified. Also, when and only when the state is "enabling", the
> background workers that try to rewrite relations to add checksums run,
> and those workers look at the catalog state to figure out what to do.
> Once the state changes to "on", those workers don't run any more, and
> so the catalog state does not make any difference.
> 
> A tricky problem is to handling the case where the state is switched
> from "enabling" to "on" and then back to "off" and then to "enabling"
> again. You don't want to confuse the state from the previous round of
> enabling with the state for the current round of enabling. Suppose in
> addition to storing the cluster-wide state of on/off/enabling, we also
> store an "enable counter" which is incremented every time the state
> goes from "off" to "enabling". Then, for each database and relation,
> we store a counter that indicates the value of the enable counter at
> the time we last scanned/rewrote that relation to set checksums. Now,
> you're covered. And, to save space, it can probably be a 32-bit
> counter, since 4 billion disable/reenable cycles ought to be enough
> for anybody.
> 
> It would not be strictly necessary to store this in pg_class. Another
> thing that could be done is to store it in a separate system table
> that could even be truncated when enabling is not in progress - though
> it would be unwise to assume that it's always truncated at the
> beginning of an enabling cycle, since it would be hard to guarantee
> that the previous enabling cycle didn't fail when trying to truncate.
> So you'd probably still end up with something like the counter
> approach. I am inclined to think that inventing a whole new catalog
> for this is over-engineering, but someone might think differently.
> Note that creating a table while enabling is in progress needs to set
> the enabling counter for the new table to the new value of the
> enabling counter, not the old one, because the new table starts empty
> and won't end up with any pages that don't have valid checksums.
> Similarly, TRUNCATE, CLUSTER, VACUUM FULL, and rewriting variants of
> ALTER TABLE can set the new value for the enabling counter as a side
> effect. That's probably easier and more efficient if it's just value
> in pg_class than if they have to go poking around in another catalog.
> So I am tentatively inclined to think that just putting it in pg_class
> makes more sense.

I'm somewhat inclined to think that it's worth first making this robust
without catalog state - even though I find restartability
important, especially since we don't have convenient ways to keep
cross-database state that we can reset without again needing background
workers. I also wonder if it's not worthwhile to design the feature in a
way that, *in the future*, checksums could be separately set on the
standby/primary - without needing to ship the whole database through
WAL.

Oh, if all relation types had a metapage with a common header, this
would be so much easier...

It'd also be a lot easier if we could map from relfilenode back to a
relation oid, without needing catalog access. That'd allow us to acquire
locks on the relation for a filenode, without needing to be connected to
a database. Again, with a metapage, that'd be quite doable.


Probably not worth doing just for this, but I'm wondering about solving
the metapage issue by just adding a metadata relation fork. Sucks to
increase the number of files further, but I don't really see a path
towards having a metapage for everything, given pg_upgrade compat
requirements. Such a metadata fork, in contrast, could easily be filled
by pg_upgrade.  That metadata file could perhaps replace init forks too.

For debuggability, storing some information about the relation in that
metadata fork would be great. Being able to identify the type of
relation etc. from there, and perhaps even the relname at creation, would
certainly be helpful for cases where the database doesn't start up anymore.

With a bit of care, we could allow AMs to store additional information
in there, by having an offset pointer for AM information in the common
header.

E.g. for tables it'd be feasible to have the column types in there
(since it's associated with a relfilenode, rather than a relation, there's
no problem with rewrites), allowing data to be interpreted correctly without
catalog access when shit has hit the fan.
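
As a purely hypothetical sketch of what a common metadata-fork header
could carry - none of these fields exist anywhere today:

#include <stdint.h>

/*
 * Hypothetical per-relfilenode metadata fork header.  A fixed common
 * part identifies the relation for forensics, and am_info_offset points
 * at an AM-specific area later in the page (0 if the AM stores nothing).
 */
typedef struct MetaForkHeader
{
    uint32_t    magic;                  /* identifies a metadata page */
    uint32_t    layout_version;         /* for pg_upgrade compatibility */
    uint32_t    relation_oid;           /* owning relation at creation */
    char        relkind;                /* relation type at creation */
    char        relname_at_create[64];  /* best-effort name for debugging */
    uint16_t    am_info_offset;         /* offset to AM-specific data */
} MetaForkHeader;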

Greetings,

Andres Freund



Re: Online checksums patch - once again

From
David Steele
Date:
On 1/18/20 6:18 PM, Daniel Gustafsson wrote:
> 
> Attached is a v16 rebased on top of current master which addresses the above
> commented points, and which I am basing the concurrency work on.

This patch no longer applies cleanly: 
http://cfbot.cputube.org/patch_27_2260.log

The CF entry has been updated to Waiting on Author.

Regards,

-- 
-David
david@pgmasters.net



Re: Online checksums patch - once again

From
David Steele
Date:
On 4/1/20 11:30 AM, David Steele wrote:
> On 1/18/20 6:18 PM, Daniel Gustafsson wrote:
>>
>> Attached is a v16 rebased on top of current master which addresses the 
>> above
>> commented points, and which I am basing the concurrency work on.
> 
> This patch no longer applies cleanly: 
> http://cfbot.cputube.org/patch_27_2260.log
> 
> The CF entry has been updated to Waiting on Author.
> 
> Regards,

There has been review on this patch but no updates in some time. As far 
as I can see there's debate on how to mark relations as fully 
checksummed and/or how to resume.

I'm marking this patch Returned with Feedback. Please feel free to 
resubmit when it is again ready for review.

Regards,
-- 
-David
david@pgmasters.net



Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 23 Jan 2020, at 18:23, Robert Haas <robertmhaas@gmail.com> wrote:

> ..That's probably easier and more efficient if it's just value
> in pg_class than if they have to go poking around in another catalog.
> So I am tentatively inclined to think that just putting it in pg_class
> makes more sense.

..which is what I did, but more on that later.

Attached is a new version of the online checksums patch which, I hope, addresses
most of the concerns raised in previous reviews.  There has been a fair amount
of fiddling done, so below is a summary of what has been done.

Error handling and synchronization around pg_control have been overhauled, as
has the absorption of the process barriers.  The comment there suggested
that the absorbing function shouldn't reside with the procsignal code; input is
gladly received on where it makes the most sense, since I can see merit in quite
a few places.

The checksumhelper is renamed datachecksumsworker, since the checksumhelper name
has been in use since c12e43a2e0d45a6b59f2.  I think there is room for better
casing here and there on this.

Restartability is implemented by keeping state in pg_class.  I opted for a bool
which is cleared as the first step of enabling checksums, since I think it offers
fewer synchronization corner cases.  The field is only useful during
processing, and is not guaranteed to reflect reality outside of processing.
The current name I came up with does not convey that; better suggestions are
more than welcome.  For now, the process must be restarted manually by running
pg_enable_data_checksums() again, which I sort of like, but that might just be
Stockholm syndrome from having enabled/disabled checksums locally a gazillion
times.

Testing has been extended to cover the basics but also restartability.  Testing a
resumed restart is a tad tricky while still avoiding timing-related tests, so
I've (possibly ab-)used an interactive psql session to act as a blocker to keep
processing from finishing.  I did extend poll_query_until to take a non-default
timeout to make sure these tests finish in a reasonable time during hacking,
but that's left out of this version.

There are a few TODO markers left where I'd appreciate input from reviewers,
for example what to do when disabling already disabled checksums (is it a LOG,
NOTICE, ERROR or a silent return?).

This is an executive summary of hacking done, and I have most likely forgotten
to mention something important, but I hope it covers most things.  Few, if any,
changes are made to the interface of this; the changes are contained under the
hood.  I will stick this patch in the upcoming commitfest.

cheers ./daniel


Attachment

Re: Online checksums patch - once again

From
Robert Haas
Date:
On Mon, Jun 22, 2020 at 8:27 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> Restartability is implemented by keeping state in pg_class.  I opted for a bool
> which is cleared as the first step of checksum enable, since it offers fewer
> synchronization cornercases I think.

Unless you take AccessExclusiveLock on the table, this probably needs
to be three-valued. Or maybe I am misunderstanding the design...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 22 Jun 2020, at 18:29, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jun 22, 2020 at 8:27 AM Daniel Gustafsson <daniel@yesql.se> wrote:
>> Restartability is implemented by keeping state in pg_class.  I opted for a bool
>> which is cleared as the first step of checksum enable, since it offers fewer
>> synchronization cornercases I think.
>
> Unless you take AccessExclusiveLock on the table, this probably needs
> to be three-valued. Or maybe I am misunderstanding the design...

Sorry for being a bit thick, but can you elaborate on which case you're thinking about?
CREATE TABLE sets the attribute according to the value of data_checksums, and
before enabling checksums (and before changing data_checksums to inprogress)
the bgworker will update all relhaschecksums from true (if any) to false.  Once
the state is set to inprogress all new relations will set relhaschecksums to
true.
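
As a rough sketch of the resulting resume behaviour (the structures
below are illustrative only, not the patch's actual data structures):

#include <stdbool.h>

/* Illustrative per-relation entry as seen by the worker. */
typedef struct RelEntry
{
    unsigned int    oid;
    bool            relhaschecksums;    /* set once the relation is done */
} RelEntry;

/*
 * Illustrative resume loop: in inprogress state the worker skips any
 * relation already flagged, whether it was flagged by a previous
 * (interrupted) run or because it was created after the state change.
 */
static void
process_database(RelEntry *rels, int nrels)
{
    for (int i = 0; i < nrels; i++)
    {
        if (rels[i].relhaschecksums)
            continue;

        /* ... rewrite every block of the relation with a checksum ... */
        rels[i].relhaschecksums = true;
    }
}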

The attached v19 fixes a few doc issues I had missed.

cheers ./daniel


Attachment

Re: Online checksums patch - once again

From
Robert Haas
Date:
On Thu, Jun 25, 2020 at 5:43 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> Sorry being a bit thick, can you elaborate which case you're thinking about?
> CREATE TABLE sets the attribute according to the value of data_checksums, and
> before enabling checksums (and before changing data_checksums to inprogress)
> the bgworker will update all relhaschecksums from true (if any) to false.  Once
> the state is set to inprogress all new relations will set relhaschecksums to
> true.

Oh, I think I was the one who was confused. I guess relhaschecksums
only really has meaning when we're in the process of enabling
checksums? So if we're in that state, then the Boolean tells us
whether a particular relation is done, and otherwise it doesn't
matter?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 26 Jun 2020, at 14:12, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jun 25, 2020 at 5:43 AM Daniel Gustafsson <daniel@yesql.se> wrote:
>> Sorry being a bit thick, can you elaborate which case you're thinking about?
>> CREATE TABLE sets the attribute according to the value of data_checksums, and
>> before enabling checksums (and before changing data_checksums to inprogress)
>> the bgworker will update all relhaschecksums from true (if any) to false.  Once
>> the state is set to inprogress all new relations will set relhaschecksums to
>> true.
>
> Oh, I think I was the one who was confused. I guess relhaschecksums
> only really has meaning when we're in the process of enabling
> checksums? So if we're in that state, then the Boolean tells us
> whether a particular relation is done, and otherwise it doesn't
> matter?

That is correct (which is why the name is terrible since it doesn't convey
that at all).

cheers ./daniel


Re: Online checksums patch - once again

From
Justin Pryzby
Date:
On Thu, Jun 25, 2020 at 11:43:00AM +0200, Daniel Gustafsson wrote:
> The attached v19 fixes a few doc issues I had missed.

+   They can also be enabled or disabled at a later timne, either as an offline
=> time

+     * awaiting shutdown, but we can continue turning off checksums anyway
=> a waiting

+     * We are starting a checksumming process scratch, and need to start by
=> FROM scratch

+     * to inprogress new relations will set relhaschecksums in pg_class so it
=> inprogress COMMA

+         * Relation no longer exist. We don't consider this an error since
=> exists

+     * so when the cluster comes back up processing will habe to be resumed.
=> have

+             "completed one pass over all databases for checksum enabling, %i databases processed",
=> I think it will be confusing for this to be hardcoded to "one".  It'll say "one" over
and over.

+         * still exist.
=> exists

In many places, you refer to "datachecksumsworker" (sums) but in nine places
you refer to datachecksumworker (sum).

+ProcessSingleRelationFork(Relation reln, ForkNumber forkNum, BufferAccessStrategy strategy)
+{
+    BlockNumber numblocks = RelationGetNumberOfBlocksInFork(reln, forkNum);

=> I think looping over numblocks is safe since new blocks are intended to be
written with a checksum, right?  Maybe it's good to say that here.

+    BlockNumber b;

blknum will be easier to grep for

+            (errmsg("background worker \"datachecksumsworker\" starting for database oid %d",
=> Should be %u or similar (several of these)

Some questions:

It looks like you rewrite every page, even if it already has correct checksum,
to handle replicas.  I wonder if it's possible/reasonable/good to skip pages
with correct checksum when wal_level=minimal ?

It looks like it's not possible to change the checksum delay while a checksum
worker is already running.  That may be important to allow: 1) decreased delay
during slow periods; 2) increased delay if the process is significantly done,
but needs to be throttled to avoid disrupting production environment.

Have you collaborated with Julien about this one?  His patch adds new GUCs:
https://www.postgresql.org/message-id/20200714090808.GA20780@nol
checksum_cost_delay
checksum_cost_page
checksum_cost_limit

Maybe you'd say that Julien's pg_check_relation() should accept parameters
instead of adding GUCs.  I think you should be in agreement on that.  It'd be
silly if the verification function added three GUCs and allowed adjusting
throttle midcourse, but the checksum writer process didn't use them.

If you used something like that, I guess you'd also want to distinguish
checksum_cost_page_read vs write.  Possibly, the GUCs part should be a
preliminary shared patch 0001 that you both used.

-- 
Justin



Re: Online checksums patch - once again

From
Robert Haas
Date:
On Mon, Jun 22, 2020 at 8:27 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> Attached is a new version of the online checksums patch which, I hope, address
> most of the concerns raised in previous reviews.  There has been a fair amount
> of fiddling done, so below is a summary of what has been done.

Here are a bunch of comments based on a partial read-through of this
patch. The most serious concerns, around synchronization, are down
toward the bottom. Sorry this is a bit eclectic as a review, but I
wrote things down as I read through the patch more or less in the
order I ran across them.

Regarding disable_data_checksums(), I disagree with ereport(LOG, ...)
here. If you want to indicate to the caller whether or not a state
change occurred, you could consider returning a Boolean instead of
void. If you want to do it with log messages, I vote for NOTICE, not
LOG. Maybe NOTICE is better, because enable_data_checksums() seems to
want to convey more information than you can represent in a Boolean,
but then it should use NOTICE consistently, not a mix of NOTICE and
LOG.

Formatting needs work for project style: typically no braces around
single statements, "ereport(WHATEVER," should always have a line break
at that point.

+ * cluster, which was not initialized with checksums, this worker will ensure

"which was not initialized with checksums" => "that does not running
with checksums enabled"?

+ * turned on. In the case of disabling checksums, the state transition is
+ * recorded in the catalog and controlfile, no changes are performed
+ * on the data pages or in the catalog.

Comma splice. Either write "controlfile; no" or "controlfile, and no".

My spell-checker complains that controfile, clusterwide, inprogress,
and endstate are not words. I think you should think about inserting
spaces or, in the case of cluster-wide, a dash, unless they are being
used as literals, in which case perhaps those instances should be
quoted. "havent" needs an apostrophe.

+ * DataChecksumsWorker will compile a list of databases which exists at the

which exist

+ * For each database, all relations which have storage are read and every data
+ * page is marked dirty to force a write with the checksum, this will generate

Comma splice. Split into two sentences.

+ * In case checksums have been enabled and later disabled, when re-enabling
+ * pg_class.relhaschecksums will be reset to false before entering inprogress
+ * mode to ensure that all relations are re-processed.

"If checksums are enabled, then disabled, and then re-enabled, every
relation's pg_class.relhaschecksums field will be reset to false
before entering the in-progress mode."

+ * Disabling checksums is done as an immediate operation as it only updates

s/done as //

+ * to pg_class.relhaschecksums is performed as it only tracks state during

is performed -> are necessary

+ * Access to other members can be done without a lock, as while they are
+ * in shared memory, they are never concurrently accessed. When a worker
+ * is running, the launcher is only waiting for that worker to finish.

The way this is written, it sounds like you're saying that concurrent
access might be possible when this structure isn't in shared memory.
But since it's called DatachecksumsWorkerShmemStruct that's not likely
a correct conclusion, so I think it needs rephrasing.

+ if (DatachecksumsWorkerShmem->launcher_started &&
!DatachecksumsWorkerShmem->abort)
+ started = true;

Why not started = a && b instead of started = false; if (a && b) started = true?

+ {
+ LWLockRelease(DatachecksumsWorkerLock);
+ ereport(ERROR,
+ (errmsg("data checksums worker has been aborted")));
+ }

Errors always release LWLocks, so this seems unnecessary. Also, the
error message looks confusing from a user perspective. What does it
mean if I ask you to make me a cheeseburger and you tell me the
cheeseburger has been eaten? I'm asking for a *new* cheeseburger (or
in this case, a new worker).

I wonder why this thing is inventing a brand new way of aborting a
worker, anyway. Why not just keep track of the PID and send it SIGINT
and have it use CHECK_FOR_INTERRUPTS()? That's already sprinkled all
over the code, so it's likely to work better than some brand-new
mechanism that will probably have checks in a lot fewer places.
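
A minimal sketch of that conventional pattern, simplified and not taken
from the patch:

#include "postgres.h"
#include "miscadmin.h"
#include "storage/block.h"

/*
 * Sketch of a worker loop relying on the ordinary interrupt machinery:
 * the launcher remembers the worker's PID and signals it with SIGINT,
 * and the worker reacts at its next CHECK_FOR_INTERRUPTS() call, so no
 * separate abort flag is needed.
 */
static void
checksum_block_range(BlockNumber nblocks)
{
    for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
    {
        CHECK_FOR_INTERRUPTS();

        /* ... read block blkno, mark it dirty, WAL-log it ... */
    }
}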

+ vacuum_delay_point();

Huh? Why?

+ elog(DEBUG2,
+ "background worker \"datachecksumsworker\" starting to process relation %u",
+ relationId);

This and similar messages seem like they refer needlessly to
internals, e.g. this could be "adding checksums to relation with OID
%u" without needing to reference background workers or
datachecksumworker. It would be even better if we could find a way to
report relation names.

+ * so when the cluster comes back up processing will habe to be resumed.

habe -> have

+ ereport(FATAL,
+ (errmsg("cannot enable checksums without the postmaster process"),
+ errhint("Restart the database and restart the checksumming process
by calling pg_enable_data_checksums().")));

I understand the motivation for this design and it may be the best we
can do, but honestly it kinda sucks. It would be nice if the system
itself figured out whether or not the worker should be running and, if
yes, ran it. Like, if we're in this state when we exit recovery (or
decide to skip recovery), just register the worker then automatically.
Now that could still fail for lack of slots, so I guess to make this
really robust we'd need a way for the registration to get retried,
e.g. autovacuum could try to reregister it periodically, and we could
just blow off the case where autovacuum=off. I don't know. I'd rather
avoid burdening users with an implementation detail if we can get
there, or at least minimize what they need to worry about.

+ snprintf(activity, sizeof(activity) - 1,
+ "Waiting for worker in database %s (pid %d)", db->dbname, pid);
+ pgstat_report_activity(STATE_RUNNING, activity);

So we only know how to run one such worker at a time?

Maybe WaitForAllTransactionsToFinish should advertise something in
pg_stat_activity.

I think you should try to give all of the functions header comments,
or at least all the bigger ones.

+ else if (result == DATACHECKSUMSWORKER_ABORTED)
+ /* Abort flag set, so exit the whole process */
+ return false;

I'd put braces here. And also, why bail out like this instead of
retrying periodically until we succeed?

+ * True if all data pages of the relation have data checksums.

Not fully accurate, right?

+ /*
+ * Force a checkpoint to get everything out to disk. TODO: we probably
+ * don't want to use a CHECKPOINT_IMMEDIATE here but it's very convenient
+ * for testing until the patch is fully baked, as it may otherwise make
+ * tests take a lot longer.
+ */
+ RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT | CHECKPOINT_IMMEDIATE);

Do we need to verify that the checkpoint succeeded before we can
declare victory and officially change state?

+ PROCSIGNAL_BARRIER_CHECKSUM_OFF = 0,
+ PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_ON,
+ PROCSIGNAL_BARRIER_CHECKSUM_ON

I don't think it's a good idea to have three separate types of barrier
here. I think you should just have a barrier indicating that the state
has changed, and then backends need to reread the state from shared
memory when they absorb the barrier.
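
A sketch of what that could look like; the shared struct and lock below
are placeholders rather than names from the patch:

#include "postgres.h"
#include "storage/lwlock.h"

/* Placeholder shared state; the patch keeps this information elsewhere. */
typedef struct ChecksumShmemState
{
    LWLock      lock;
    int         version;        /* off / inprogress / on */
} ChecksumShmemState;

static ChecksumShmemState *SharedChecksumState;  /* set up at shmem init */
static int  LocalDataChecksumVersion;            /* backend-local cache */

/*
 * Hypothetical absorption of a single "checksum state changed" barrier:
 * instead of one barrier per target state, the backend simply rereads
 * the authoritative value from shared memory.
 */
static void
AbsorbChecksumBarrier(void)
{
    LWLockAcquire(&SharedChecksumState->lock, LW_SHARED);
    LocalDataChecksumVersion = SharedChecksumState->version;
    LWLockRelease(&SharedChecksumState->lock);
}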

But the bigger problem here, and the thing that makes me intensely
doubtful that the synchronization in this patch is actually correct,
is that I can't find any machinery in the patch guarding against
TOCTTOU issues, nor any comments explaining why I shouldn't be afraid
of them. Suppose you got rid of the barriers and just changed all the
places that check LocalDataChecksumVersion to read from a shared
memory value directly instead. Would that be equivalent to what you've
got here, or would it break something? If you can't clearly explain
why that would be broken as compared with what you have, then either
the barriers aren't really necessary (which I doubt) or the
synchronization isn't really right (which I suspect to be true).

In the case of the ALTER SYSTEM READ ONLY patch, this was by far the
hardest part to get right, and I'm still not positive that it's
completely correct, but the basic thing we figured out there is that
you are in big trouble if the system goes read-only AFTER you've
decided to write a WAL record. That is, this is bugged:

if (WALIsProhibited())
   ereport(ERROR, errmsg("i'm sorry i can't do that"));
...
CHECK_FOR_INTERRUPTS();
...
START_CRIT_SECTION();
XLogBeginInsert();

If the CHECK_FOR_INTERRUPTS() absorbs a state change, then the
XLogBeginInsert() is going to hit an elog(ERROR) which, because we're
in a critical section, will be promoted to PANIC, which is bad. To
avoid that, the patch introduces a whole hairy system to make sure
that there can never be a CFI after we check whether it's OK to insert
WAL and before we actually do it. That stuff is designed in such a way
that it will make assertion fail even if you're not actually *trying*
to make the system read-only.

So the comparable problem here would be if we decide that we don't
need to set checksums on a page when modifying it, and then we absorb
a barrier that flips the state to in-progress, and then we actually
perform the page modification. Now you have a race condition: the page
was modified without checksums after we'd acknowledged to the process
pushing out the barrier that all of our future page modifications
would set checksums. So, maybe that's not possible here. For instance,
if we never examine the checksum-enabled state outside of a critical
section, then we're fine, because we can't absorb a barrier without
processing an interrupt, and we don't process interrupts in critical
sections. But if that's the case, then it seems to me that it would be
good to insert some cross-checks. Like, suppose we only ever access
the local variable that contains this state through a static inline
function that also asserts that InterruptHoldoffCount > 0 ||
CritSectionCount > 0. Then, if there is a place where we don't
actually follow that rule (only rely on that value within critical
sections) we're pretty likely to trip an assert just running the
regression tests. It's not foolproof, not only because the regression
tests are incomplete but because in theory we could fetch the value in
a crit section and then keep it around and rely on it some more after
we've processed interrupts again, but that seems like a less-likely
thing for somebody to do.
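
A minimal sketch of such a guarded accessor, assuming a backend-local
cached flag (the names are illustrative):

#include "postgres.h"
#include "miscadmin.h"

static bool LocalDataChecksumsEnabled;  /* illustrative backend-local cache */

/*
 * Guarded accessor: assert that interrupts are held off or that we are
 * inside a critical section, so that a procsignal barrier cannot be
 * absorbed between checking the state and acting on it.
 */
static inline bool
DataChecksumsNeedWriteGuarded(void)
{
    Assert(InterruptHoldoffCount > 0 || CritSectionCount > 0);
    return LocalDataChecksumsEnabled;
}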

If, on the other hand, there are stretches of code that fetch this
value outside of a crit section and without interrupts held, then we
need some other kind of mechanism here to make it safe. We have to
make sure that not only does the present code not permit race
conditions of the type described above, but that future modifications
are quite unlikely to introduce any. I might be missing something, but
I don't see any kind of checks like this in the patch now. I think
there should be, and the rationale behind them should be written up,
too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 29 Jul 2020, at 19:58, Robert Haas <robertmhaas@gmail.com> wrote:

> Here are a bunch of comments based on a partial read-through of this
> patch.

Thanks a lot, Robert and Justin, for the reviews!  With the commitfest wrap-up
imminent and being on vacation, I will have a hard time responding properly
before the end of the CF, so I'm moving it to the next CF for now.

cheers ./daniel



Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 28 Jul 2020, at 04:33, Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Thu, Jun 25, 2020 at 11:43:00AM +0200, Daniel Gustafsson wrote:
>> The attached v19 fixes a few doc issues I had missed.
>
> +   They can also be enabled or disabled at a later timne, either as an offline
> => time

Fixed.

> +     * awaiting shutdown, but we can continue turning off checksums anyway
> => a waiting

This was intentional, as it refers to the impending requested shutdown and not
one which is blocked waiting.  I've reworded the comment to make this clearer.

> +     * We are starting a checksumming process scratch, and need to start by
> => FROM scratch

Fixed.

> +     * to inprogress new relations will set relhaschecksums in pg_class so it
> => inprogress COMMA

Fixed.

> +         * Relation no longer exist. We don't consider this an error since
> => exists

Fixed.

> +     * so when the cluster comes back up processing will habe to be resumed.
> => have

Fixed.

> +             "completed one pass over all databases for checksum enabling, %i databases processed",
> => I think this will be confusing to be hardcoded "one".  It'll say "one" over
> and over.

Good point; I've reworded this based on the number of processed databases to
indicate what will happen next.  This call should probably be removed in case
we merge and ship this feature, but it's handy for this part of the patch
process.

> +         * still exist.
> => exists

Fixed.

> In many places, you refer to "datachecksumsworker" (sums) but in nine places
> you refer to datachecksumworker (sum).

Good catch, they should all be using "sums". Fixed.

> +ProcessSingleRelationFork(Relation reln, ForkNumber forkNum, BufferAccessStrategy strategy)
> +{
> +    BlockNumber numblocks = RelationGetNumberOfBlocksInFork(reln, forkNum);
>
> => I think looping over numblocks is safe since new blocks are intended to be
> written with checksum, right?  Maybe it's good to say that here.

Fixed.

> +    BlockNumber b;
>
> blknum will be easier to grep for

Fixed.

> +            (errmsg("background worker \"datachecksumsworker\" starting for database oid %d",
> => Should be %u or similar (several of these)

Fixed.  As per Robert's review downthread, these will be reworded in a future
version.

> It looks like you rewrite every page, even if it already has correct checksum,
> to handle replicas.  I wonder if it's possible/reasonable/good to skip pages
> with correct checksum when wal_level=minimal ?

That would AFAICT be possible, but I'm not sure it's worth adding that before
the patch is deemed safe in its simpler form.  I've added a comment to record
this as a potential future optimization.

> It looks like it's not possible to change the checksum delay while a checksum
> worker is already running.  That may be important to allow: 1) decreased delay
> during slow periods; 2) increased delay if the process is significantly done,
> but needs to be throttled to avoid disrupting production environment.
>
> Have you collaborated with Julien about this one?  His patch adds new GUCs:
> https://www.postgresql.org/message-id/20200714090808.GA20780@nol
> checksum_cost_delay
> checksum_cost_page
> checksum_cost_limit

I honestly hadn't thought about that, but I very much agree that any controls
introduced should work the same for both of these patches.

> Maybe you'd say that Julien's pg_check_relation() should accept parameters
> instead of adding GUCs.  I think you should be in agreement on that.  It'd be
> silly if the verification function added three GUCs and allowed adjusting
> throttle midcourse, but the checksum writer process didn't use them.

Agreed.  I'm not a fan of using yet more GUCs for controlling this, but I don't
have a good argument against it.  It's in line with the cost-based vacuum delays, so
I guess it's the most appropriate interface.

> If you used something like that, I guess you'd also want to distinguish
> checksum_cost_page_read vs write.  Possibly, the GUCs part should be a
> preliminary shared patch 0001 that you both used.

+1.

Thanks for the review!  I will attach a v20 to Robert's email with these changes
included as well.

cheers ./daniel




Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 29 Jul 2020, at 19:58, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jun 22, 2020 at 8:27 AM Daniel Gustafsson <daniel@yesql.se> wrote:
>> Attached is a new version of the online checksums patch which, I hope, address
>> most of the concerns raised in previous reviews.  There has been a fair amount
>> of fiddling done, so below is a summary of what has been done.
>
> Here are a bunch of comments based on a partial read-through of this
> patch. The most serious concerns, around synchronization, are down
> toward at the bottom. Sorry this is a bit eclectic as a review, but I
> wrote things down as I read through the patch more or less in the
> order I ran across them.

No need to apologize, and many thanks for the review!  This is a partial response,
since I need to spend a bit more time to properly respond to the
synchronization questions, but I also wanted to submit a new version which
applies in the CF patch tester.  Anything not addressed here will be in a
follow-up version.

The attached v20 contains fixes from this review as well as Justin's review
upthread.

> Regarding disable_data_checksums(), I disagree with ereport(LOG, ...)
> here. If you want to indicate to the caller whether or not a state
> change occurred, you could consider returning a Boolean instead of
> void. If you want to do it with log messages, I vote for NOTICE, not
> LOG. Maybe NOTICE is better, because enable_data_checksums() seems to
> want to convey more information that you can represent in a Boolean,
> but then it should use NOTICE consistently, not a mix of NOTICE and
> LOG.

I agree with this; I've moved to returning a bool rather than ereporting NOTICE
(or LOG).

> Formatting needs work for project style: typically no braces around
> single statements, "ereport(WHATEVER," should always have a line break
> at that point.

I think I've fixed all these instances, and the attached patch has been run
through pgindent as well.

> + * cluster, which was not initialized with checksums, this worker will ensure
>
> "which was not initialized with checksums" => "that does not running
> with checksums enabled"?

Fixed, but it's "run" rather than "running", right?

> + * turned on. In the case of disabling checksums, the state transition is
> + * recorded in the catalog and controlfile, no changes are performed
> + * on the data pages or in the catalog.
>
> Comma splice. Either write "controlfile; no" or "controlfile, and no".

Fixed.

> My spell-checker complains that controfile, clusterwide, inprogress,
> and endstate are not words. I think you should think about inserting
> spaces or, in the case of cluster-wide, a dash, unless they are being
> used as literals, in which case perhaps those instances should be
> quoted. "havent" needs an apostrophe.

This is me writing Swedish in English. Fixed.

> + * DataChecksumsWorker will compile a list of databases which exists at the
>
> which exist

Fixed.

> + * For each database, all relations which have storage are read and every data
> + * page is marked dirty to force a write with the checksum, this will generate
>
> Comma splice. Split into two sentences.

Fixed.

> + * In case checksums have been enabled and later disabled, when re-enabling
> + * pg_class.relhaschecksums will be reset to false before entering inprogress
> + * mode to ensure that all relations are re-processed.
>
> "If checksums are enabled, then disabled, and then re-enabled, every
> relation's pg_class.relhaschecksums field will be reset to false
> before entering the in-progress mode."

Replaced with your version, thanks.

> + * Disabling checksums is done as an immediate operation as it only updates
>
> s/done as //

Fixed.

> + * to pg_class.relhaschecksums is performed as it only tracks state during
>
> is performed -> are necessary

Fixed.

> + * Access to other members can be done without a lock, as while they are
> + * in shared memory, they are never concurrently accessed. When a worker
> + * is running, the launcher is only waiting for that worker to finish.
>
> The way this is written, it sounds like you're saying that concurrent
> access might be possible when this structure isn't in shared memory.
> But since it's called DatachecksumsWorkerShmemStruct that's not likely
> a correct conclusion, so I think it needs rephrasing.

Right, that was a pretty poorly worded comment.  Rewritten and expanded upon.

> + if (DatachecksumsWorkerShmem->launcher_started &&
> !DatachecksumsWorkerShmem->abort)
> + started = true;
>
> Why not started = a && b instead of started = false; if (a && b) started = true?

I don't have strong feelings about either format, so I changed it according to your suggestion.

> + {
> + LWLockRelease(DatachecksumsWorkerLock);
> + ereport(ERROR,
> + (errmsg("data checksums worker has been aborted")));
> + }
>
> Errors always release LWLocks, so this seems unnecessary. Also, the
> error message looks confusing from a user perspective. What does it
> mean if I ask you to make me a cheeseburger and you tell me the
> cheeseburger has been eaten? I'm asking for a *new* cheeseburger (or
> in this case, a new worker).

Now you made me hungry for a green chili cheeseburger..

This case covers when the user disables a running datachecksumsworker and
then enables it again before the worker has finished the current page and thus
observed the abort request.

If the worker learns to distinguish between a user abort request and an
internal cancellation (due to WL_POSTMASTER_DEATH), this window could be handled
by clearing the user request and keeping going.  It would be the same thing as
killing the worker and restarting, except fewer moving parts.  Either way, I
agree that it's a confusing error path, and one which should be addressed.

> I wonder why this thing is inventing a brand new way of aborting a
> worker, anyway. Why not just keep track of the PID and send it SIGINT
> and have it use CHECK_FOR_INTERRUPTS()? That's already sprinkled all
> over the code, so it's likely to work better than some brand-new
> mechanism that will probably have checks in a lot fewer places.

I'm not convinced that the current coding is less responsive, and signalling
for launcher/worker isn't entirely straightforward, but I agree that it's
better to stick to established patterns.  Will rewrite to use pqsignal/SIGINT
and will address the previous paragraph in that as well.

> + vacuum_delay_point();
>
> Huh? Why?

The datachecksumsworker is using the same machinery for throttling as the
cost-based vacuum delay, hence this call.  Do you object to using that same
machinery, or the implementation and/or documentation of it?
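
For reference, the usage is roughly the following; this is a simplified
sketch rather than the patch's exact loop:

#include "postgres.h"
#include "commands/vacuum.h"
#include "storage/block.h"

/*
 * Simplified sketch: vacuum_delay_point() sleeps once the cost
 * accumulated by the vacuum-cost machinery exceeds the configured
 * limit, so calling it once per block throttles the checksum rewrite
 * the same way cost-based vacuum delay throttles vacuum.
 */
static void
throttled_block_loop(BlockNumber nblocks)
{
    for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
    {
        vacuum_delay_point();

        /* ... read, dirty and WAL-log block blkno ... */
    }
}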

> + elog(DEBUG2,
> + "background worker \"datachecksumsworker\" starting to process relation %u",
> + relationId);
>
> This and similar messages seem likely they refer needlessly to
> internals, e.g. this could be "adding checksums to relation with OID
> %u" without needing to reference background workers or
> datachecksumworker.

I have reworded these as well as removed a few that seemed a bit uninteresting.

> It would be even better if we could find a way to
> report relation names.

True, but doesn't really seem worth the overhead for a debug log.

> + * so when the cluster comes back up processing will habe to be resumed.
>
> habe -> have

Fixed (also noted by Justin upthread).

> + ereport(FATAL,
> + (errmsg("cannot enable checksums without the postmaster process"),
> + errhint("Restart the database and restart the checksumming process
> by calling pg_enable_data_checksums().")));
>
> I understand the motivation for this design and it may be the best we
> can do, but honestly it kinda sucks. It would be nice if the system
> itself figured out whether or not the worker should be running and, if
> yes, ran it. Like, if we're in this state when we exit recovery (or
> decide to skip recovery), just register the worker then automatically.
> Now that could still fail for lack of slots, so I guess to make this
> really robust we'd need a way for the registration to get retried,
> e.g. autovacuum could try to reregister it periodically, and we could
> just blow off the case where autovacuum=off. I don't know. I'd rather
> avoid burdening users with an implementation detail if we can get
> there, or at least minimize what they need to worry about.

I don't disagree with you, but I don't see how an automatic restart could be
made safe/good enough to be worth the complexity, since it's nigh impossible to
make it always Just Work.  Exposing implementation details to users is clearly
not a good design choice, if it can be avoided.

> + snprintf(activity, sizeof(activity) - 1,
> + "Waiting for worker in database %s (pid %d)", db->dbname, pid);
> + pgstat_report_activity(STATE_RUNNING, activity);
>
> So we only know how to run one such worker at a time?

Correct, there is currently one worker at a time.

> Maybe WaitForAllTransactionsToFinish should advertise something in
> pg_stat_activity.

It currently does this:

  snprintf(activity,
       sizeof(activity),
       "Waiting for current transactions to finish (waiting for %u)",
       waitforxid);
  pgstat_report_activity(STATE_RUNNING, activity);

Did you have anything else in mind?

> I think you should try to give all of the functions header comments,
> or at least all the bigger ones.

I've done a first pass over the patch.

> + else if (result == DATACHECKSUMSWORKER_ABORTED)
> + /* Abort flag set, so exit the whole process */
> + return false;
>
> I'd put braces here.

Fixed.

> And also, why bail out like this instead of
> retrying periodically until we succeed?

Currently, the abort request can come from the user disabling data checksums
during processing, the postmaster dying, or SIGINT.  None of these cases qualifies
for retrying in the current design.

> + * True if all data pages of the relation have data checksums.
>
> Not fully accurate, right?

Ugh..  that's a leftover from previous hacking that I've since ripped out.
Sorry about that, it's been removed.

> + /*
> + * Force a checkpoint to get everything out to disk. TODO: we probably
> + * don't want to use a CHECKPOINT_IMMEDIATE here but it's very convenient
> + * for testing until the patch is fully baked, as it may otherwise make
> + * tests take a lot longer.
> + */
> + RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT | CHECKPOINT_IMMEDIATE);
>
> Do we need to verify that the checkpoint succeeded before we can
> declare victory and officially change state?

I don't think we need a verification step here.  With CHECKPOINT_WAIT we are
blocking until the checkpoint has completed.  If it fails we won't enable data
checksums until the process has been restarted and a subsequent checkpoint
succeeded.

> + PROCSIGNAL_BARRIER_CHECKSUM_OFF = 0,
> + PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_ON,
> + PROCSIGNAL_BARRIER_CHECKSUM_ON
>
> I don't think it's a good idea to have three separate types of barrier
> here. I think you should just have a barrier indicating that the state
> has changed, and then backends need to reread the state from shared
> memory when they absorb the barrier.

I'm not sure I follow why; my understanding of the infrastructure was to make
it fine-grained like this.  But you have clearly spent more time thinking about
this functionality so I'm curious to learn.  Can you elaborate on your
thinking?

> But the bigger problem here, and the thing that makes me intensely
> doubtful that the synchronization in this patch is actually correct,
> is that I can't find any machinery in the patch guarding against
> TOCTTOU issues, nor any comments explaining why I shouldn't be afraid
> of them.

<snip>

I unfortunately haven't had time to read the READ ONLY patch so I can't comment
on how these two patches do things in relation to each other.

The main synchronization mechanisms are the use of the inprogress mode where
data checksums are written but not verified, and by waiting for all
pre-existing non-compatible processes (transactions, temp tables) to disappear
before enabling.

That being handwavily said, I've started to write down a matrix with classes of
possible synchronization bugs and how the patch handles them in order to
properly respond.

cheers ./daniel




Attachment

Re: Online checksums patch - once again

From
Michael Paquier
Date:
On Wed, Sep 02, 2020 at 02:22:25PM +0200, Daniel Gustafsson wrote:
> I unfortunately haven't had time to read the READ ONLY patch so I can't comment
> on how these two patches do things in relation to each other.
>
> The main synchronization mechanisms are the use of the inprogress mode where
> data checksums are written but not verified, and by waiting for all
> pre-existing non-compatible processes (transactions, temp tables) to disappear
> before enabling.

The CF bot is complaining on this one with a TAP test failure:
https://travis-ci.org/github/postgresql-cfbot/postgresql/builds/724717901

t/003_standby_checksum.pl .. 1/10
#   Failed test 'ensure checksums are on or in progress on standby_1'
#   at t/003_standby_checksum.pl line 59.
#     'off'
#         ~~
#     'ARRAY(0x1d38c10)'
# Looks like you failed 1 test of 10.
t/003_standby_checksum.pl .. Dubious, test returned 1 (wstat 256,
0x100)
Failed 1/10 subtests

Daniel, could you look at that?
--
Michael

Attachment

Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 2 Sep 2020, at 14:22, Daniel Gustafsson <daniel@yesql.se> wrote:

> The main synchronization mechanisms are the use of the inprogress mode where
> data checksums are written but not verified, and by waiting for all
> pre-existing non-compatible processes (transactions, temp tables) to disappear
> before enabling.
>
> That being handwavily said, I've started to write down a matrix with classes of
> possible synchronization bugs and how the patch handles them in order to
> properly respond.

Having spent some more time on this, I believe I have a better answer (and
patch version) to give.

First, a thank you for asking insightful questions.  While working through the
cases I realized that the previous version has a problematic window: when
disabling checksums, if backend A absorbs the data checksums "off" barrier (or
starts after the controlfile has been changed) and writes a page without a
checksum, while backend B has yet to absorb the barrier and is still in "on"
and reads that very same page.  The solution IMO is to introduce an inprogress
state for disabling as well where all backends keep writing checksums but not
validating them until no backend is in "on" state anymore.  Once all backends
are in "inprogress-off", they can stop writing checksums and transition to
"off".  I even had this state in a previous unsubmitted version, embarrassingly
so to the point of the function prototype being there as a leftover in v19.

Now, synchronization happens on two levels in this patch for both the enable
and disable case: (i) between backends when transitioning between states, and
(ii) inside the worker when synchronizing the current backends.

For (i) it's using "inprogress" states to ensure a safe transition to "on" or
"off".  The states are themselves transitioned to via procsignalbarriers.  Both
enabling and disabling follow the same logic, with the only difference being
the order in which operations are switched on/off during inprogress.  For (ii)
the workers are waiting for incompatible concurrent processing to end, such as
temporary tables etc. (this only affects enabling data checksums).

I've tried to write down the synchronization steps in datachecksumsworker.c to
document the code, and for ease of discussion I've pasted that part of the diff
below as well:

 * Synchronization and Correctness
 * -------------------------------
 * The processes involved in enabling, or disabling, data checksums in an
 * online cluster must be properly synchronized with the normal backends
 * serving concurrent queries to ensure correctness. Correctness is defined
 * as the following:
 *
 *        - Backends SHALL NOT violate local datachecksum state
 *        - Data checksums SHALL NOT be considered enabled cluster-wide until all
 *          currently connected backends have the local state "enabled"
 *
 * There are two levels of synchronization required for enabling data checksums
 * in an online cluster: (i) changing state in the active backends ("on",
 * "off", "inprogress-on" and "inprogress-off"), and (ii) ensuring no
 * incompatible objects and processes are left in a database when workers end.
 * The former deals with cluster-wide agreement on data checksum state and the
 * latter with ensuring that any concurrent activity cannot break the data
 * checksum contract during processing.
 *
 * Synchronizing the state change is done with procsignal barriers, where the
 * backend updating the global state in the controlfile will wait for all other
 * backends to absorb the barrier before WAL logging. Barrier absorption will
 * happen during interrupt processing, which means that connected backends will
 * change state at different times.
 *
 *   When Enabling Data Checksums
 *     ----------------------------
 *     A process which fails to observe data checksums being enabled can induce
 *     two types of errors: failing to write the checksum when modifying the page
 *     and failing to validate the data checksum on the page when reading it.
 *
 *   When the DataChecksumsWorker has finished writing checksums on all pages
 *   and enables data checksums cluster-wide, there are the following sets of backends:
 *
 *   Bg: Backend updating the global state and emitting the procsignalbarrier
 *   Bd: Backends in "off" state
 *   Be: Backends in "on" state
 *   Bi: Backends in "inprogress-on" state
 *
 *   Backends transition from the Bd state to Be like so: Bd -> Bi -> Be
 *
 *   Backends in Bi and Be will write checksums when modifying a page, but only
 *   backends in Be will verify the checksum during reading. The Bg backend is
 *   blocked waiting for all backends in Bi to process interrupts and move to
 *   Be. Any backend starting will observe the global state being "on" and will
 *   thus automatically belong to Be.  Checksums are enabled cluster-wide when
 *   Bi is an empty set. All sets are compatible while still operating based on
 *   their local state.
 *
 *     When Disabling Data Checksums
 *     -----------------------------
 *     A process which fails to observe data checksums being disabled can induce
 *     two types of errors: writing the checksum when modifying the page and
 *     validating a data checksum which is no longer correct due to modifications
 *     to the page.
 *
 *   Bg: Backend updating the global state and emitting the procsignalbarrier
 *   Bd: Backends in "off" state
 *   Be: Backends in "on" state
 *   Bi: Backends in "inprogress-off" state
 *
 *   Backends transition from the Be state to Bd like so: Be -> Bi -> Bd
 *
 *   The goal is to transition all backends to Bd, making the other sets empty.
 *   Backends in Bi write data checksums, but don't validate them, such that
 *   backends still in Be can continue to validate pages until the barrier has
 *   been absorbed and they have moved to Bi. Once all backends are in Bi, the
 *   barrier to transition to "off" can be raised and all backends can safely
 *   stop writing data checksums as no backend is enforcing data checksum
 *   validation.

I hope this clarifies the reasoning behind the implementation.

This has been implemented in the attached v21 patch.  "inprogress" was in need
of a new name before, and with "inprogress-{on|off}" the need is even bigger.
Suggestions for better names are highly appreciated; I'm drawing blanks here.

There are some minor fixes and documentation touchups in this version as well,
but not the SIGINT handling since I wanted to focus on one thing at a time.

cheers ./daniel


Attachment

Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 7 Sep 2020, at 09:17, Michael Paquier <michael@paquier.xyz> wrote:

> Daniel, could you look at that?

I believe this boils down to a timing issue; I've included a fix in the v21
patch attached to a previous mail upthread.

cheers ./daniel



Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 19 Sep 2020, at 04:18, Justin Pryzby <pryzby@telsasoft.com> wrote:

Thanks for reviewing!

> + * changed to "inprogress-off", the barrier for mvoving to "off" can be
> moving

Fixed.

> + * When disabling checksums, data_checksums will be set of "inprogress-off"
> set *to*

Fixed.

> +    get_namespace_name(RelationGetNamespace(reln)), RelationGetRelationName(reln),
>
> I think this palloc()s a new copy of the namespace every 100 blocks.
> Better do it outside the loop.

Good point, fixed.

> +    {"inprogress-on", DATA_CHECKSUMS_INPROGRESS_ON, true},
> +    {"inprogress-off", DATA_CHECKSUMS_INPROGRESS_OFF, true},
>
> enabling / disabling ?

Perhaps, but it doesn't match the grammatical tense of the others, though?

> +typedef enum ChecksumType
> +{
> +    DATA_CHECKSUMS_OFF = 0,
> +    DATA_CHECKSUMS_ON,
> +    DATA_CHECKSUMS_INPROGRESS_ON,
> +    DATA_CHECKSUMS_INPROGRESS_OFF
> +}            ChecksumType;
>
> Should this be an bitmask, maybe
> DATA_CHECKSUMS_WRITE = 1
> DATA_CHECKSUMS_VERIFY = 2,

That's an option; I'm not sure it would improve readability though.  Anyone else
have opinions on that?

> It occured to me that you could rephrase this patch as "Make checksum state
> per-relation rather than cluster granularity".  That's currently an
> implementation detail, but could be exposed as a feature.

That's not entirely correct.  The patch tracks checksum status *during
inprogress-on* with per-relation granularity, but as it stands it doesn't
support per-relation during state "on" in any way.

A per-relation checksum mode where every relation at any point can enable or
disable checksums would require a very different synchronization mechanism from
the all-or-nothing one (which while simpler, IMO is complicated enough).  My
hope is that this patch brings solid infrastructure for anyone interested in
pursuing per-relation checksums, but IMHO we should focus on getting
per-cluster rock-solid first.

> That could be a
> preliminary 0001 patch.  Half the existing patch would be 0002 "Allow
> online enabling checksums for a given relation/database/cluster".  You might
> save some of the existing effort of synchronize the cluster-wide checksum
> state, since it doesn't need to be synchronized.

I don't follow; how would a finer-grained resolution remove the need for
synchronization?

> The "data_checksums" GUC
> might be removed, or changed to an enum: on/off/per_relation.  Moving from
> "per_relation" to "on" would be an optional metadata-only change, allowed only
> when all rels in all dbs are checksumed.

How would you guarantee that such a state change isn't happening concurrently
with a user doing ALTER TABLE ..  checksums=off;?  It would still require
synchronization along the lines of what this patch does unless I'm missing
something.

> I'm not sure if you'd even care about
> temp tables, since "relhaschecksum" would be authoritative, rather than a
> secondary bit only used during processing.
>
> XLogHintBitIsNeeded() and DataChecksumsEnabled() would need to check
> relhaschecksum, which tentatively seems possible.

While possible, that's a pretty hot codepath so any additional checking will
need proper benchmarking.

relhaschecksum isn't guaranteed to be correct at any point other than during
checksum enabling; it's only used for tracking progress in case of a cluster
restart during processing.  To better convey this, I asked for suggestions for a
better name upthread, since relhaschecksum carries the risk of overpromising and
underdelivering.  Perhaps relchecksumprocessed would be better?

A real per-relation relhaschecksum in pg_class would also need to solve how to
keep it accurate for offline enable/disable via pg_checksums.

> I'm not sure if it's possible, but maybe pg_checksums would be able to skip
> rels which had already been checksummed "online" (with an option to force
> reprocessing).

pg_checksums can't read the catalog state of the relations, so it has no
knowledge of where online processing left off.  That should probably be made
clearer in the docs though, so I've added a note on that.

> Maybe some people would want (no) checksums on specific tables, and that could
> eventually be implemented as 0003: "ALTER TABLE SET checksums=".  I'm thinking
> of that being used immediately after an CREATE, but I suppose ON would cause
> the backend to rewrite the table with checksums synchronously (not in the BGW),
> either with AEL or by calling ProcessSingleRelationByOid().

Holding an AEL while changing state would be easier as it skips the need for
the complex synchronization, but is that really what users would expect?
Especially if cluster-wide enabling is done transparently to the user.

More importantly though, what is the use-case for per-relation that we'd be
looking at solving?  Discussing the implementation without framing it in an
actual use-case runs the risk of performing successful surgery where the
patient still dies.  Performing ETL ingestion without the overhead of writing
checksums?  Ephemeral data?  Maybe unlogged tables can handle some situation
where checksums would be appealing?

While working on the patch I realized that the relationlist saved the relkind but the
code wasn't actually using it, so I've gone ahead and removed it, with a lot
fewer palloc calls as a result.  The attached v22 fixes that and the above.

cheers ./daniel


Attachment

Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 24 Sep 2020, at 06:27, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Sep 23, 2020 at 02:34:36PM +0200, Daniel Gustafsson wrote:
>> While in the patch I realized that the relationlist saved the relkind but the
>> code wasn't actually using it, so I've gone ahead and removed that with a lot
>> fewer palloc calls as a result. The attached v22 fixes that and the above.
>
> Some of the TAP tests are blowing up here, as the CF bot is telling:
> t/003_standby_checksum.pl .. 1/11 # Looks like you planned 11 tests but ran 4.
> # Looks like your test exited with 29 just after 4.
> t/003_standby_checksum.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)

Interesting; I've been unable to trigger a fault, and the latest Travis build
was green.  I'll continue to try and see if I can shake something loose.

cheers ./daniel




Re: Online checksums patch - once again

From
Heikki Linnakangas
Date:
I looked at patch v22, and I can see two main issues:

1. The one that Robert talked about earlier: A backend checks the local 
"checksums" state. If it's 'off', it writes a page without checksums. 
How do you guarantee that the local state doesn't change in between? The 
implicit assumption seems to be that there MUST NOT be any 
CHECK_FOR_INTERRUPTS() calls between DataChecksumsNeedWrite() and the 
write (or read and DataChecksumsNeedVerify()).

In most code, the DataChecksumsNeedWrite() call is very close to writing 
out the page, often in the same critical section. But this is an 
undocumented assumption.

The code in sendFile() in basebackup.c seems suspicious in that regard. 
It calls DataChecksumsNeedVerify() once before starting to read the 
file. Isn't it possible for the checksums flag to change while it's 
reading the file and sending it to the client? I hope there are 
CHECK_FOR_INTERRUPTS() calls buried somewhere in the loop, because it 
could take minutes to send the whole file.

I would feel better if the state transition of the "checksums" flag 
could only happen in a few safe places, or there were some other 
safeguards for this. I think that's what Andres was trying to say 
earlier in the thread on ProcSignalBarriers. I'm not sure what the 
interface to that should be. It could be something like 
HOLD/RESUME_INTERRUPTS(), where normally all procsignals are handled on 
CHECK_FOR_INTERRUPTS(), but you could "hold off" some if needed. Or 
something else. Or maybe we can just use HOLD/RESUME_INTERRUPTS() for 
this. It's more coarse-grained than necessary, but probably doesn't 
matter in practice.

At minimum, there needs to be comments in DataChecksumsNeedWrite() and 
DataChecksumsNeedVerify(), instructing how to use them safely. Namely, 
you must ensure that there are no interrupts between the 
DataChecksumsNeedWrite() and writing out the page, or between reading 
the page and the DataChecksumsNeedVerify() call. You can achieve that 
with HOLD_INTERRUPTS() or a critical section, or simply ensuring that 
there is no substantial code in between that could call 
CHECK_FOR_INTERRUPTS(). And sendFile() in basebackup.c needs to be fixed.

Perhaps you could have "Assert(InteruptHoldOffCount > 0)" in 
DataChecksumsNeedWrite() and DataChecksumsNeedVerify()? There could be 
other ways that callers could avoid the TOCTOU issue, but it would 
probably catch most of the unsafe call patterns, and you could always 
wrap the DataChecksumsNeedWrite/verify() call in a dummy 
HOLD_INTERRUPTS() block to work around the assertion if you know what 
you're doing.


2. The signaling between enable_data_checksums() and the launcher 
process looks funny to me. The general idea seems to be that 
enable_data_checksums() just starts the launcher process, and the 
launcher process figures out what it needs to do and makes all the 
changes to the global state. But then there's this violation of the 
idea: enable_data_checksums() checks DataChecksumsOnInProgress(), and 
tells the launcher process whether it should continue a previously 
crashed operation or start from scratch. I think it would be much 
cleaner if the launcher process figured that out itself, and 
enable_data_checksums() would just tell the launcher what the target 
state is.

enable_data_checksums() and disable_data_checksums() seem prone to race 
conditions. If you call enable_data_checksums() in two backends 
concurrently, depending on the timing, there are two possible outcomes:

a) One call returns true, and launches the background process. The other 
call returns false.

b) Both calls return true, but one of them emits a "NOTICE: data 
checksums worker is already running".

In disable_data_checksums() imagine what happens if another backend calls 
enable_data_checksums() in between the 
ShutdownDatachecksumsWorkerIfRunning() and SetDataChecksumsOff() calls.


>         /*
>          * Mark the buffer as dirty and force a full page write.  We have to
>          * re-write the page to WAL even if the checksum hasn't changed,
>          * because if there is a replica it might have a slightly different
>          * version of the page with an invalid checksum, caused by unlogged
>          * changes (e.g. hintbits) on the master happening while checksums
>          * were off. This can happen if there was a valid checksum on the page
>          * at one point in the past, so only when checksums are first on, then
>          * off, and then turned on again. Iff wal_level is set to "minimal",
>          * this could be avoided iff the checksum is calculated to be correct.
>          */
>         START_CRIT_SECTION();
>         MarkBufferDirty(buf);
>         log_newpage_buffer(buf, false);
>         END_CRIT_SECTION();

It's really unfortunate that we have to dirty the page even if the 
checksum already happens to match. Could we only do the 
log_newpage_buffer() call and skip MarkBufferDirty() in that case?

Could we get away with a more lightweight WAL record that doesn't 
contain the full-page image, but just the block number? On replay, the 
redo routine would read the page from disk.
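
Just to sketch the idea (nothing like this exists in the patch; the struct and
field layout are invented):

/* Hypothetical payload for a "re-checksum this block" record */
typedef struct xl_checksum_block
{
    RelFileNode node;       /* which relation */
    ForkNumber  forknum;    /* which fork */
    BlockNumber blkno;      /* block to read back and checksum on replay */
} xl_checksum_block;

Whether replaying such a record is actually safe is of course the question.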

- Heikki



Re: Online checksums patch - once again

From
Heikki Linnakangas
Date:
Replying to an older message in this thread:

>> + /*
>> + * If we reach this point with checksums in inprogress state, we notify
>> + * the user that they need to manually restart the process to enable
>> + * checksums. This is because we cannot launch a dynamic background worker
>> + * directly from here, it has to be launched from a regular backend.
>> + */
>> + if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_VERSION)
>> + ereport(WARNING,
>> + (errmsg("checksum state is \"inprogress\" with no worker"),
>> + errhint("Either disable or enable checksums by calling the
>> pg_disable_data_checksums() or pg_enable_data_checksums()
>> functions.")));
>> 
>> This seems pretty half-baked.
> 
> I don't disagree with that.  However, given that enabling checksums is a pretty
> intensive operation it seems somewhat unfriendly to automatically restart.  As
> a DBA I wouldn't want that to kick off without manual intervention, but there
> is also the risk of this being missed due to assumptions that it would restart.
> Any ideas on how to treat this?
> 
> If/when we can restart the processing where it left off, without the need to go
> over all data again, things might be different wrt the default action.

The later patch versions do support restarting, so I think we should 
revisit this issue. I would expect the checksums worker to be 
automatically started at postmaster startup. Can we make that happen?

- Heikki



Re: Online checksums patch - once again

From
Heikki Linnakangas
Date:
On 05/10/2020 17:25, Álvaro Herrera wrote:
> On 2020-Oct-05, Heikki Linnakangas wrote:
> 
>> The code in sendFile() in basebackup.c seems suspicious in that regard. It
>> calls DataChecksumsNeedVerify() once before starting to read the file. Isn't
>> it possible for the checksums flag to change while it's reading the file and
>> sending it to the client? I hope there are CHECK_FOR_INTERRUPTS() calls
>> buried somewhere in the loop, because it could take minutes to send the
>> whole file.
>>
>> I would feel better if the state transition of the "checksums" flag could
>> only happen in a few safe places, or there were some other safeguards for
>> this. I think that's what Andres was trying to say earlier in the thread on
>> ProcSignalBarriers. I'm not sure what the interface to that should be. It
>> could be something like HOLD/RESUME_INTERRUPTS(), where normally all
>> procsignals are handled on CHECK_FOR_INTERRUPTS(), but you could "hold off"
>> some if needed. Or something else. Or maybe we can just use
>> HOLD/RESUME_INTERRUPTS() for this. It's more coarse-grained than necessary,
>> but probably doesn't matter in practice.
> 
> I hope you're not suggesting that interrupts would be held for the whole
> transmission of a file, which you say could take minutes.  If we do have
> an interrupt holdoff, then it has to be pretty short; users (and
> systemd) despair if service shutdown is delayed more than a few seconds.

I'm not suggesting that, sorry I wasn't clear. That would indeed be 
horrible.

sendFile() needs a different solution, but all the other places where 
DataChecksumsNeedWrite/Verify() is called need to be inspected to make 
sure that they hold interrupts, or ensure some other way that an 
interrupt doesn't change the local checksums flag between the 
DataChecksumsNeedWrite/Verify() call and the read/write.

I think sendFile() needs to re-check the local checksums state before 
each read. It also needs to ensure that an interrupt doesn't occur and 
change the local checksums state between read and the 
DataChecksumsNeedVerify() calls, but that's a very short period if you 
call DataChecksumsNeedVerify() again for each block.
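
Roughly like this, as an illustration only (the helper and variable names are
made up, this is not the actual basebackup.c code):

/*
 * Sketch: verify one BLCKSZ-sized page at a time, re-checking the local
 * checksum state for every page.
 */
static int
verify_file_pages(char *buf, int nblocks)
{
    int         checksum_failures = 0;
    BlockNumber blkno;

    for (blkno = 0; blkno < (BlockNumber) nblocks; blkno++)
    {
        char   *page = buf + (Size) blkno * BLCKSZ;

        CHECK_FOR_INTERRUPTS();     /* a checksum barrier may be absorbed here */

        /*
         * Hold interrupts so the local flag cannot change between the
         * check and the verification of this page.
         */
        HOLD_INTERRUPTS();
        if (DataChecksumsNeedVerify() &&
            pg_checksum_page(page, blkno) != ((PageHeader) page)->pd_checksum)
            checksum_failures++;
        RESUME_INTERRUPTS();
    }

    return checksum_failures;
}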

- Heikki



Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 5 Oct 2020, at 14:14, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Replying to an older message in this thread:
>
>>> + /*
>>> + * If we reach this point with checksums in inprogress state, we notify
>>> + * the user that they need to manually restart the process to enable
>>> + * checksums. This is because we cannot launch a dynamic background worker
>>> + * directly from here, it has to be launched from a regular backend.
>>> + */
>>> + if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_VERSION)
>>> + ereport(WARNING,
>>> + (errmsg("checksum state is \"inprogress\" with no worker"),
>>> + errhint("Either disable or enable checksums by calling the
>>> pg_disable_data_checksums() or pg_enable_data_checksums()
>>> functions.")));
>>> This seems pretty half-baked.
>> I don't disagree with that.  However, given that enabling checksums is a pretty
>> intensive operation it seems somewhat unfriendly to automatically restart.  As
>> a DBA I wouldn't want that to kick off without manual intervention, but there
>> is also the risk of this being missed due to assumptions that it would restart.
>> Any ideas on how to treat this?
>> If/when we can restart the processing where it left off, without the need to go
>> over all data again, things might be different wrt the default action.
>
> The later patch versions do support restarting, so I think we should revisit this issue.

Agreed, now it makes sense to restart automatically.

> I would expect the checksums worker to be automatically started at postmaster startup. Can we make that happen?

A dynamic background worker has to be registered from a regular backend, so
it's not entirely clear to me where in startup processing that would take
place.  Do you have any good suggestions?

cheers ./daniel


Re: Online checksums patch - once again

From
Heikki Linnakangas
Date:
On 12/11/2020 15:17, Daniel Gustafsson wrote:
>> On 5 Oct 2020, at 14:14, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I would expect the checksums worker to be automatically started at postmaster startup. Can we make that happen?
> 
> A dynamic background worker has to be registered from a regular backend, so
> it's not entirely clear to me where in startup processing that would take
> place.  Do you have any good suggestions?

Could you launch it from the startup process, in StartupXLOG()?
Does it have to be dynamic?

- Heikki



Re: Online checksums patch - once again

From
Magnus Hagander
Date:


On Fri, Nov 13, 2020 at 12:22 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 12/11/2020 15:17, Daniel Gustafsson wrote:
>> On 5 Oct 2020, at 14:14, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I would expect the checksums worker to be automatically started at postmaster startup. Can we make that happen?
>
> A dynamic background worker has to be registered from a regular backend, so
> it's not entirely clear to me where in startup processing that would take
> place.  Do you have any good suggestions?

Could you launch it from the startup process, in StartupXLOG()?
Does it have to be dynamic?


If it's not dynamic, you can't start it from a regular backend, can you?  So then you'd need a restart for it to happen?

As for launching it from the startup process I don't know, that might be a viable path. The code specifically explains why it's not possible to launch it from the postmaster, but I don't see anything that would make it impossible from the startup process.
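
For reference, the two registration paths look roughly like this (the entry
point name is an invented placeholder and the flags are just illustrative):

BackgroundWorker        worker;
BackgroundWorkerHandle *handle;

memset(&worker, 0, sizeof(worker));
worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
worker.bgw_restart_time = BGW_NEVER_RESTART;
snprintf(worker.bgw_library_name, BGW_MAXLEN, "postgres");
snprintf(worker.bgw_function_name, BGW_MAXLEN, "DatachecksumsLauncherMain");
snprintf(worker.bgw_name, BGW_MAXLEN, "datachecksums launcher");

/* Static registration: only possible while the postmaster is starting up. */
RegisterBackgroundWorker(&worker);

/* Dynamic registration: what a regular backend uses to start it on demand. */
RegisterDynamicBackgroundWorker(&worker, &handle);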

--

Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 5 Oct 2020, at 13:36, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> I looked at patch v22, and I can see two main issues:

Thanks for reviewing!

> 1. The one that Robert talked about earlier: A backend checks the local "checksums" state. If it's 'off', it writes a
> page without checksums. How do you guarantee that the local state doesn't change in between? The implicit assumption
> seems to be that there MUST NOT be any CHECK_FOR_INTERRUPTS() calls between DataChecksumsNeedWrite() and the write (or
> read and DataChecksumsNeedVerify()).
>
> In most code, the DataChecksumsNeedWrite() call is very close to writing out the page, often in the same critical
> section. But this is an undocumented assumption.

I've extended the documentation on this.

> The code in sendFile() in basebackup.c seems suspicious in that regard. It calls DataChecksumsNeedVerify() once
> before starting to read the file. Isn't it possible for the checksums flag to change while it's reading the file and
> sending it to the client? I hope there are CHECK_FOR_INTERRUPTS() calls buried somewhere in the loop, because it could
> take minutes to send the whole file.

Agreed, fixed.

> I would feel better if the state transition of the "checksums" flag could only happen in a few safe places, or there
weresome other safeguards for this. I think that's what Andres was trying to say earlier in the thread on
ProcSignalBarriers.I'm not sure what the interface to that should be. It could be something like
HOLD/RESUME_INTERRUPTS(),where normally all procsignals are handled on CHECK_FOR_INTERRUPTS(), but you could "hold off"
someif needed. Or something else. Or maybe we can just use HOLD/RESUME_INTERRUPTS() for this. It's more coarse-grained
thannecessary, but probably doesn't matter in practice. 
>
> At minimum, there needs to be comments in DataChecksumsNeedWrite() and DataChecksumsNeedVerify(), instructing how to
> use them safely. Namely, you must ensure that there are no interrupts between the DataChecksumsNeedWrite() and writing
> out the page, or between reading the page and the DataChecksumsNeedVerify() call. You can achieve that with
> HOLD_INTERRUPTS() or a critical section, or simply ensuring that there is no substantial code in between that could call
> CHECK_FOR_INTERRUPTS(). And sendFile() in basebackup.c needs to be fixed.
>
> Perhaps you could have "Assert(InteruptHoldOffCount > 0)" in DataChecksumsNeedWrite() and DataChecksumsNeedVerify()?
Therecould be other ways that callers could avoid the TOCTOU issue, but it would probably catch most of the unsafe call
patterns,and you could always wrap the DataChecksumsNeedWrite/verify() call in a dummy HOLD_INTERRUPTS() block to work
aroundthe assertion if you know what you're doing. 

The attached holds off interrupt processing for the NeedWrite and NeedVerify
cases, and holds them for what I hope is the right duration for the respective
callsites.

One thing I realized in doing this is that the pg_stat_database checksums
statistics are set to NULL when checksums are disabled.  That makes perfect
sense when checksum state is static, but not when it can be turned on/off.  For
now I've made it so that stats are set to zero instead, and will continue
showing stats even if checksums get disabled.  Not sure what the best option
would be here.

> 2. The signaling between enable_data_checksums() and the launcher process looks funny to me. The general idea seems
> to be that enable_data_checksums() just starts the launcher process, and the launcher process figures out what it needs
> to do and makes all the changes to the global state. But then there's this violation of the idea:
> enable_data_checksums() checks DataChecksumsOnInProgress(), and tells the launcher process whether it should continue a
> previously crashed operation or start from scratch. I think it would be much cleaner if the launcher process figured
> that out itself, and enable_data_checksums() would just tell the launcher what the target state is.
>
> enable_data_checksums() and disable_data_checksums() seem prone to race conditions. If you call
> enable_data_checksums() in two backends concurrently, depending on the timing, there are two possible outcomes:
>
> a) One call returns true, and launches the background process. The other call returns false.
>
> b) Both calls return true, but one of them emits a "NOTICE: data checksums worker is already running".
>
> In disable_data_checksums() imagine what happens if another backend calls enable_data_checksums() in between the
> ShutdownDatachecksumsWorkerIfRunning() and SetDataChecksumsOff() calls.

I've reworked this in the attached such that the enable_ and disable_ functions
merely call into the launcher with the desired outcome, and the launcher is
responsible for figuring out the rest.  The datachecksumworker is now the sole
place which initiates a state transfer.
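
Schematically the launcher side now amounts to something like this (a
simplified sketch rather than the actual code; the resume helper is a made-up
placeholder):

/* In the launcher: the SQL functions only handed us the target state. */
if (enable_checksums)
{
    if (DataChecksumsOnInProgress())
        ResumeChecksumsFromCatalogState();  /* hypothetical: reuse pg_class flags */
    else
        SetDataChecksumsOnInProgress();     /* start from scratch */

    /* ... process all databases, then flip the cluster-wide state to "on" ... */
}
else
    SetDataChecksumsOff();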

>>         /*
>>          * Mark the buffer as dirty and force a full page write.  We have to
>>          * re-write the page to WAL even if the checksum hasn't changed,
>>          * because if there is a replica it might have a slightly different
>>          * version of the page with an invalid checksum, caused by unlogged
>>          * changes (e.g. hintbits) on the master happening while checksums
>>          * were off. This can happen if there was a valid checksum on the page
>>          * at one point in the past, so only when checksums are first on, then
>>          * off, and then turned on again. Iff wal_level is set to "minimal",
>>          * this could be avoided iff the checksum is calculated to be correct.
>>          */
>>         START_CRIT_SECTION();
>>         MarkBufferDirty(buf);
>>         log_newpage_buffer(buf, false);
>>         END_CRIT_SECTION();
>
> It's really unfortunate that we have to dirty the page even if the checksum already happens to match. Could we only
> do the log_newpage_buffer() call and skip MarkBufferDirty() in that case?

I think we can, but I've intentionally stayed away from such optimizations
until the basic version of the patch was deemed safe and approaching done.
It's complicated enough as it is IMO.

> Could we get away with a more lightweight WAL record that doesn't contain the full-page image, but just the block
> number? On replay, the redo routine would read the page from disk.

Quite possibly, but I'm not sure how to reason about such a change to ensure
its safety. I would love any ideas you'd have.

The attached fixes the above plus a few other small things found while hacking
on this version.

cheers ./daniel


Attachment

Re: Online checksums patch - once again

From
Heikki Linnakangas
Date:
On 17/11/2020 10:56, Daniel Gustafsson wrote:
>> On 5 Oct 2020, at 13:36, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> 2. The signaling between enable_data_checksums() and the launcher process looks funny to me. The general idea seems
>> to be that enable_data_checksums() just starts the launcher process, and the launcher process figures out what it needs
>> to do and makes all the changes to the global state. But then there's this violation of the idea:
>> enable_data_checksums() checks DataChecksumsOnInProgress(), and tells the launcher process whether it should continue a
>> previously crashed operation or start from scratch. I think it would be much cleaner if the launcher process figured
>> that out itself, and enable_data_checksums() would just tell the launcher what the target state is.
 
>>
>> enable_data_checksums() and disable_data_checksums() seem prone to race conditions. If you call
>> enable_data_checksums() in two backends concurrently, depending on the timing, there are two possible outcomes:
 
>>
>> a) One call returns true, and launches the background process. The other call returns false.
>>
>> b) Both calls return true, but one of them emits a "NOTICE: data checksums worker is already running".
>>
>> In disable_data_checksums() imagine what happens if another backend calls enable_data_checksums() in between the
>> ShutdownDatachecksumsWorkerIfRunning() and SetDataChecksumsOff() calls.
 
> 
> I've reworked this in the attached such that the enable_ and disable_ functions
> merely call into the launcher with the desired outcome, and the launcher is
> responsible for figuring out the rest.  The datachecksumworker is now the sole
> place which initiates a state transfer.

Well, you still fill the DatachecksumsWorkerShmem->operations array in 
the backend process that launches the datachecksumworker, not in the 
worker process. I find that still a bit surprising, but I believe it works.

>>>         /*
>>>          * Mark the buffer as dirty and force a full page write.  We have to
>>>          * re-write the page to WAL even if the checksum hasn't changed,
>>>          * because if there is a replica it might have a slightly different
>>>          * version of the page with an invalid checksum, caused by unlogged
>>>          * changes (e.g. hintbits) on the master happening while checksums
>>>          * were off. This can happen if there was a valid checksum on the page
>>>          * at one point in the past, so only when checksums are first on, then
>>>          * off, and then turned on again. Iff wal_level is set to "minimal",
>>>          * this could be avoided iff the checksum is calculated to be correct.
>>>          */
>>>         START_CRIT_SECTION();
>>>         MarkBufferDirty(buf);
>>>         log_newpage_buffer(buf, false);
>>>         END_CRIT_SECTION();
>>
>> It's really unfortunate that we have to dirty the page even if the checksum already happens to match. Could we only
>> do the log_newpage_buffer() call and skip MarkBufferDirty() in that case?
 
> 
> I think we can, but I've intentionally stayed away from such optimizations
> until the basic version of the patch was deemed safe and approaching done.
> It's complicated enough as it is IMO.

Fair enough.

> The attached fixes the above plus a few other small things found while hacking
> on this version.

I haven't read through the whole patch, but a few random comments below, 
in no particular order:

pg_enable/disable_data_checksums() should perform a superuser-check. I 
don't think we want to expose them to any users.
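
Something like the usual idiom, with everything except the check stripped out
(the launcher call and its cost arguments are simplified placeholders here):

Datum
pg_enable_data_checksums(PG_FUNCTION_ARGS)
{
    if (!superuser())
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 errmsg("must be superuser to enable data checksums")));

    /* then hand the desired target state to the launcher */
    StartDatachecksumsWorkerLauncher(true, 0, 0);

    PG_RETURN_VOID();
}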

> +/*
> + * Local state for Controlfile data_checksum_version. After initialization,
> + * this is only updated when absorbing a procsignal barrier during interrupt
> + * processing. Thus, it can be read by backends without the need for a lock.
> + * Possible values are the checksum versions defined in storage/bufpage.h and
> + * zero for when checksums are disabled.
> + */
> +static uint32 LocalDataChecksumVersion = 0;

The comment is a bit confusing: "Thus, it can be read by backends 
without the need for a lock". Since it's a variable in backend-private 
memory, it can only be read by the same backend, not "backends". And 
that's also why you don't need a lock, not because it's updated during 
interrupt processing. I understand how this works, but maybe rephrase 
the comment a bit.

> +/*
> + * DataChecksumsOnInProgress
> + *             Returns whether data checksums are being enabled
> + *
> + * Most callsites shouldn't need to worry about the "inprogress" states, since
> + * they should check the requirement for verification or writing. Some low-
> + * level callsites dealing with page writes need however to know. It's also
> + * used to check for aborted checksum processing which need to be restarted.
>   */
>  bool
> -DataChecksumsEnabled(void)
> +DataChecksumsOnInProgress(void)
> +{
> +       return (LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
> +}

s/need/needs/. The whole paragraph feels a bit awkwardly worded in 
general. I'd suggest something like: "Most operations don't need to 
worry about the "inprogress" states, and should use 
DataChecksumsNeedVerify() or DataChecksumsNeedWrite()". 
DataChecksumsOffInProgress() is called from 
StartDatachecksumsWorkerLauncher(), which I wouldn't call a "low-level 
callsite".

> @@ -1079,7 +1091,7 @@ XLogInsertRecord(XLogRecData *rdata,
>          Assert(RedoRecPtr < Insert->RedoRecPtr);
>          RedoRecPtr = Insert->RedoRecPtr;
>      }
> -    doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);
> +    doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites || DataChecksumsOnInProgress());
>  
>      if (doPageWrites &&
>          (!prevDoPageWrites ||

The comment above this deserves to be updated. Also, this is a very hot 
codepath; will this slow down WAL-logging, when full-page writes are 
disabled? Could we inline DataChecksumsOnInProgress() or set 
Insert->forcePageWrites when checksums are being computed or something?

In StartupXLOG():
> +    /*
> +     * If data checksums were being disabled when the cluster was shutdown, we
> +     * know that we have a state where all backends have stopped validating
> +     * checksums and we can move to off instead.
> +     */
> +    if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION)
> +    {
> +        LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
> +        ControlFile->data_checksum_version = 0;
> +        LWLockRelease(ControlFileLock);
> +    }
> +

Should this be WAL-logged, like in SetDataChecksumsOff()?

In SetDataChecksumsOff():
> +    ControlFile->data_checksum_version = 0;
> +    UpdateControlFile();
> +    LWLockRelease(ControlFileLock);
> +
> +    XlogChecksums(0);
> +    WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_OFF));
> +}

What happens if you crash between UpdateControlFile() and XlogChecksum()?

- Heikki



Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 23 Nov 2020, at 18:36, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 17/11/2020 10:56, Daniel Gustafsson wrote:
>>> On 5 Oct 2020, at 13:36, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> 2. The signaling between enable_data_checksums() and the launcher process looks funny to me. The general idea seems
>>> to be that enable_data_checksums() just starts the launcher process, and the launcher process figures out what it needs
>>> to do and makes all the changes to the global state. But then there's this violation of the idea:
>>> enable_data_checksums() checks DataChecksumsOnInProgress(), and tells the launcher process whether it should continue a
>>> previously crashed operation or start from scratch. I think it would be much cleaner if the launcher process figured
>>> that out itself, and enable_data_checksums() would just tell the launcher what the target state is.
>>>
>>> enable_data_checksums() and disable_data_checksums() seem prone to race conditions. If you call
>>> enable_data_checksums() in two backends concurrently, depending on the timing, there are two possible outcomes:
>>>
>>> a) One call returns true, and launches the background process. The other call returns false.
>>>
>>> b) Both calls return true, but one of them emits a "NOTICE: data checksums worker is already running".
>>>
>>> In disable_data_checksums() imagine what happens if another backend calls enable_data_checksums() in between the
>>> ShutdownDatachecksumsWorkerIfRunning() and SetDataChecksumsOff() calls.
>> I've reworked this in the attached such that the enable_ and disable_ functions
>> merely call into the launcher with the desired outcome, and the launcher is
>> responsible for figuring out the rest.  The datachecksumworker is now the sole
>> place which initiates a state transfer.
>
> Well, you still fill the DatachecksumsWorkerShmem->operations array in the backend process that launches the
> datachecksumworker, not in the worker process. I find that still a bit surprising, but I believe it works.

I'm open to changing it in case there are strong opinions, it just seemed the
most natural to me.

> pg_enable/disable_data_checksums() should perform a superuser-check. I don't think we want to expose them to any
users.

Fixed.

>> +/*
>> + * Local state for Controlfile data_checksum_version. After initialization,
>> + * this is only updated when absorbing a procsignal barrier during interrupt
>> + * processing. Thus, it can be read by backends without the need for a lock.
>> + * Possible values are the checksum versions defined in storage/bufpage.h and
>> + * zero for when checksums are disabled.
>> + */
>> +static uint32 LocalDataChecksumVersion = 0;
>
> The comment is a bit confusing: "Thus, it can be read by backends without the need for a lock". Since it's a variable
inbackend-private memory, it can only be read by the same backend, not "backends". And that's also why you don't need a
lock,not because it's updated during interrupt processing. I understand how this works, but maybe rephrase the comment
abit. 

Fixed.

>> +/*
>> + * DataChecksumsOnInProgress
>> + *             Returns whether data checksums are being enabled
>> + *
>> + * Most callsites shouldn't need to worry about the "inprogress" states, since
>> + * they should check the requirement for verification or writing. Some low-
>> + * level callsites dealing with page writes need however to know. It's also
>> + * used to check for aborted checksum processing which need to be restarted.
>>  */
>> bool
>> -DataChecksumsEnabled(void)
>> +DataChecksumsOnInProgress(void)
>> +{
>> +       return (LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
>> +}
>
> s/need/needs/. The whole paragraph feels a bit awkwardly worded in general. I'd suggest something like: "Most
operationsdon't need to worry about the "inprogress" states, and should use DataChecksumsNeedVerify() or
DataChecksumsNeedWrite()".DataChecksumsOffInProgress() is called from StartDatachecksumsWorkerLauncher(), which I
wouldn'tcall a "low-level callsite". 

Fixed, and a few related surrounding comments expanded and tweaked.

>> @@ -1079,7 +1091,7 @@ XLogInsertRecord(XLogRecData *rdata,
>>         Assert(RedoRecPtr < Insert->RedoRecPtr);
>>         RedoRecPtr = Insert->RedoRecPtr;
>>     }
>> -    doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);
>> +    doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites || DataChecksumsOnInProgress());
>>      if (doPageWrites &&
>>         (!prevDoPageWrites ||
>
> The comment above this deserves to be updated.

Fixed.

> Also, this is a very hot codepath; will this slow down WAL-logging, when full-page writes are disabled? Could we
> inline DataChecksumsOnInProgress() or set Insert->forcePageWrites when checksums are being computed or something?

Wouldn't setting forcePageWrites risk causing other side effects that we don't
want here?  I've changed DataChecksumsOnInProgress to inline as a start.
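
For clarity, a minimal sketch of what that means, assuming the backend-local
flag is exposed through a header (in the patch it is file-local to xlog.c, so
this is a simplification):

/* e.g. in a header, so the test can be inlined into XLogInsertRecord() */
extern PGDLLIMPORT uint32 LocalDataChecksumVersion;

static inline bool
DataChecksumsOnInProgress(void)
{
    return LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION;
}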

> In StartupXLOG():
>> +    /*
>> +     * If data checksums were being disabled when the cluster was shutdown, we
>> +     * know that we have a state where all backends have stopped validating
>> +     * checksums and we can move to off instead.
>> +     */
>> +    if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION)
>> +    {
>> +        LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
>> +        ControlFile->data_checksum_version = 0;
>> +        LWLockRelease(ControlFileLock);
>> +    }
>> +
>
> Should this be WAL-logged, like in SetDataChecksumsOff()?

My initial thinking was that we wouldn't need to, as all nodes would process
the controlfile equally.  The more I think about it the more I think we need to
have an XLOG record of it. Added.

It would be good to stress this in a TAP test, but I haven't been able to write
one yet.

> In SetDataChecksumsOff():
>> +    ControlFile->data_checksum_version = 0;
>> +    UpdateControlFile();
>> +    LWLockRelease(ControlFileLock);
>> +
>> +    XlogChecksums(0);
>> +    WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_OFF));
>> +}
>
> What happens if you crash between UpdateControlFile() and XlogChecksum()?

Good point, that would not get the cluster to a consistent state.  The
XlogChecksum should be performed before the controlfile is updated.

The attached patch contains these fixes as well as a rebase on top of todays
Git HEAD.

cheers ./daniel


Attachment

Re: Online checksums patch - once again

From
Heikki Linnakangas
Date:
On 25/11/2020 15:20, Daniel Gustafsson wrote:
>> On 23 Nov 2020, at 18:36, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> What happens if you crash between UpdateControlFile() and XlogChecksum()?
> 
> Good point, that would not get the cluster to a consistent state.  The
> XlogChecksum should be performed before the controlfile is updated.
> 
> +void
> +SetDataChecksumsOff(void)
> +{
> +    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
> +
> +    if (ControlFile->data_checksum_version == 0)
> +    {
> +        LWLockRelease(ControlFileLock);
> +        return;
> +    }
> +
> +    if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_VERSION)
> +    {
> +        ControlFile->data_checksum_version = PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION;
> +        UpdateControlFile();
> +        LWLockRelease(ControlFileLock);
> +
> +        /*
> +         * Update local state in all backends to ensure that any backend in
> +         * "on" state is changed to "inprogress-off".
> +         */
> +        XlogChecksums(PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION);
> +        WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_OFF));
> +
> +        /*
> +         * At this point we know that no backends are verifying data checksums
> +         * during reading. Next, we can safely move to state "off" to also
> +         * stop writing checksums.
> +         */
> +    }
> +
> +    XlogChecksums(0);
> +
> +    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
> +    ControlFile->data_checksum_version = 0;
> +    UpdateControlFile();
> +    LWLockRelease(ControlFileLock);
> +
> +    WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_OFF));
> +}

The lwlocking doesn't look right here. If 
ControlFile->data_checksum_version != PG_DATA_CHECKSUM_VERSION, 
LWLockAcquire is called twice without a LWLockRelease in between.

What if a checkpoint, and a crash, happens just after the WAL record has 
been written, but before the control file is updated? That's a 
ridiculously tight window for a whole checkpoint cycle to happen, but in 
principle I think that would spell trouble. I think you could set 
delayChkpt to prevent the checkpoint from happening in that window, 
similar to how we avoid this problem with the clog updates at commit. 
Also, I think this should be in a critical section; we don't want the 
process to error out in between for any reason, and if it does happen, 
it's panic time.
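
Roughly what I mean, as a sketch (barrier emission and waiting are left out
for brevity; delayChkpt is the existing boolean in PGPROC):

MyProc->delayChkpt = true;          /* keep any checkpoint out of the window */
START_CRIT_SECTION();

XlogChecksums(0);                   /* WAL record first ... */

LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->data_checksum_version = 0;
UpdateControlFile();                /* ... then the control file */
LWLockRelease(ControlFileLock);

END_CRIT_SECTION();
MyProc->delayChkpt = false;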

- Heikki



Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 25 Nov 2020, at 14:33, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> The lwlocking doesn't look right here. If ControlFile->data_checksum_version != PG_DATA_CHECKSUM_VERSION,
> LWLockAcquire is called twice without a LWLockRelease in between.

Right, fixed.

> What if a checkpoint, and a crash, happens just after the WAL record has been written, but before the control file is
> updated? That's a ridiculously tight window for a whole checkpoint cycle to happen, but in principle I think that would
> spell trouble. I think you could set delayChkpt to prevent the checkpoint from happening in that window, similar to how
> we avoid this problem with the clog updates at commit. Also, I think this should be in a critical section; we don't want
> the process to error out in between for any reason, and if it does happen, it's panic time.

Good points.  The attached patch performs the state changes inside a critical
section with checkpoints delayed, emitting the barrier inside the critical
section but waiting for it outside, to keep the critical section as short as
possible.

I've also done some tweaks to the tests to make them more robust as well as
comment updates and general tidying up here and there.

cheers ./daniel





Attachment

Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 3 Dec 2020, at 10:37, Daniel Gustafsson <daniel@yesql.se> wrote:

> I've also done some tweaks to the tests to make them more robust as well as
> comment updates and general tidying up here and there.

Attached is a rebase of the patch on top of current HEAD.

cheers ./daniel


Attachment

Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 5 Jan 2021, at 21:29, Michael Banck <michael.banck@credativ.de> wrote:
> On Tue, Jan 05, 2021 at 12:18:07AM +0100, Daniel Gustafsson wrote:

>> Attached is a rebase of the patch on top of current HEAD.
>
> Some more comments/questions:

Thanks for reviewing, much appreciated!

>> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
>> index 53e997cd55..06c001f8ff 100644
>> --- a/src/backend/access/heap/heapam.c
>> +++ b/src/backend/access/heap/heapam.c
>> @@ -7284,7 +7284,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
>>  * and dirtied.
>>  *
>>  * If checksums are enabled, we also generate a full-page image of
>> - * heap_buffer, if necessary.
>> + * heap_buffer.
>
> That sounds like it has nothing to do with online (de)activation of
> checksums?

Right, it's a fix which is independent of this patch and could be broken out into a
separate docs patch along with the suggestions in your previous mail.

>>  */
>> XLogRecPtr
>> log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
>> @@ -7305,11 +7305,13 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
>>     XLogRegisterBuffer(0, vm_buffer, 0);
>>
>>     flags = REGBUF_STANDARD;
>> +    HOLD_INTERRUPTS();
>>     if (!XLogHintBitIsNeeded())
>>         flags |= REGBUF_NO_IMAGE;
>>     XLogRegisterBuffer(1, heap_buffer, flags);
>>
>>     recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_VISIBLE);
>> +    RESUME_INTERRUPTS();
>
> This could maybe do with a comment on why the HOLD/RESUME_INTERRUPTS()
> is required here, similar as is done in bufpage.c further down.

Fixed.

>> diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
>> index 92cc7ea073..fa074c6046 100644
>> --- a/src/backend/access/rmgrdesc/xlogdesc.c
>> +++ b/src/backend/access/rmgrdesc/xlogdesc.c
>> @@ -18,6 +18,7 @@
>> #include "access/xlog.h"
>> #include "access/xlog_internal.h"
>> #include "catalog/pg_control.h"
>> +#include "storage/bufpage.h"
>> #include "utils/guc.h"
>> #include "utils/timestamp.h"
>>
>> @@ -140,6 +141,20 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
>>                          xlrec.ThisTimeLineID, xlrec.PrevTimeLineID,
>>                          timestamptz_to_str(xlrec.end_time));
>>     }
>> +    else if (info == XLOG_CHECKSUMS)
>> +    {
>> +        xl_checksum_state xlrec;
>> +
>> +        memcpy(&xlrec, rec, sizeof(xl_checksum_state));
>> +        if (xlrec.new_checksumtype == PG_DATA_CHECKSUM_VERSION)
>> +            appendStringInfo(buf, "on");
>> +        else if (xlrec.new_checksumtype == PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION)
>> +            appendStringInfo(buf, "inprogress-off");
>> +        else if (xlrec.new_checksumtype == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)
>> +            appendStringInfo(buf, "inprogress-on");
>> +        else
>> +            appendStringInfo(buf, "off");
>
> We probably discussed this earlier, but what was the conclusion?
> PG_DATA_CHECKSUM_VERSION = 1 sounds like somebody thought it a good idea
> to not just have a bool here but they probably rather thought about
> different checkumming/hashing algorithms/implentations than about
> internal state of the checksumming machinery.

Commit 443951748ce4c94b001877c7cf88b0ee969c79e7 explicitly moved from a bool to
be able to handle changes to the checksum field.  I don't recall it being
discussed in the context of this patch.

> If we decide on v2 of data page checksums, how would that look like?
>
> PG_DATA_CHECKSUM_VERSION_V2 = 4?

Something like that yes.

> If we think we're done with versions, did we consider removing the
> VERSION here, because it is really confusing in
> PG_DATA_CHECKSUM_INPROGRESS_ON/OFF_VERSION, like
> PG_DATA_CHECKSUM_STATE_ON/INPROGRESS_ON/OFF? Or would removing "version"
> also from pg_controldata be a backwards-incompatible change we don't
> want to do?

We could rename the _INPROGRESS states to not have the _VERSION suffix, but
then we'd end up in a discussion around what a checksum version is, and we'd be
assigning something not named _VERSION to a version field.  I think the
current state of the patch is the least surprising.

> Sorry again, I think we discussed it earlier, but maybe at least some
> comments about what VERSION is supposed to be/mean in bufpage.h would be
> in order:

That I don't disagree with, but it can be done separately from this work.  I
didn't chase down the thread which led to 443951748ce4c94b but I assume the
discussion there would be useful for documenting this.

>> -    doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);
>> +    doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites || DataChecksumsOnInProgress());
>
> Indent, but I guess this will be indented by pg_indent after the fact
> anyway?  Or is this how it's supposed to look?

The patch has been through pgindent so I think it's by design.

>> + * Are checksums enabled, or in the process of being enabled, for data pages?
>
> The second "," looks odd and could be ommitted?.

I think that was correct, but not being a native speaker I might be wrong.
Either way I've reworded the comment to make it clearer since it on the whole
seemed a bit cobbled together.

>> + * to keep the critical section short. This is in order to protect against
>> + * TOCTOU situations around checksum validation.
>
> I had to google "TOCTOU" and this acronym doesn't appear elsewhere in
> the source tree, so I suggest to spell it out at least here (there's one
> more occurance of it in this patch)

Fair enough, both places fixed.  The comment in RelationBuildLocalRelation was
also reworded a bit to try and make it a tad clearer.

>> + * DataChecksumsOnInProgress
>> + *        Returns whether data checksums are being enabled
>> + *
>> + * Most operations don't need to worry about the "inprogress" states, and
>> + * should use DataChecksumsNeedVerify() or DataChecksumsNeedWrite(). The
>> + * "inprogress" state for enabling checksums is used when the checksum worker
>> + * is setting checksums on all pages, it can thus be used to check for aborted
>> + * checksum processing which need to be restarted.
>
> The first "inprogress" looks legit as it talks about both states, but
> the second one should be "inprogress-on" I think and ...

The sentence was as intended, but I agree that spelling out the state name is
better.  Fixed.

>> + * DataChecksumsOffInProgress
>> + *        Returns whether data checksums are being disabled
>> + *
>> + * The "inprogress" state for disabling checksums is used for when the worker
>> + * resets the catalog state. Operations should use DataChecksumsNeedVerify()
>> + * or DataChecksumsNeedWrite() for deciding whether to read/write checksums.
>
> ... "inprogress-off" here.

Same as the above, but also fixed (with some additional wordsmithing).

>> +void
>> +AbsorbChecksumsOnInProgressBarrier(void)
>> +{
>> +       Assert(LocalDataChecksumVersion == 0 || LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
>> +       LocalDataChecksumVersion = PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION;
>> +}
>
> Bikeshed alert: maybe also those Absorb*Barrier functions could be lumped
> together after the SetDataChecksums*() functions. If not, a function
> comment would be in order.

I moved all the Absorb* functions into a single place since they very similar.

>> +void
>> +SetDataChecksumsOnInProgress(void)

> This function doesn't have the customary function header comment nor any
> comments in the body and it looks like it's doing some pretty important
> stuff, so I think some comments would be in order, e.g.
> explaining that "data_checksum_version != 0" means we've already got
> checksums enabled or are in the process of enabling/disabling them.

That was indeed sloppy, fixed.

>> +/*
>> + * SetDataChecksumsOn
>> + *        Enables data checksums cluster-wide
>> + *
>> + * Enabling data checksums is performed using two barriers, the first one
>> + * sets the checksums state to "inprogress-on" and the second one to "on".
>> + * During "inprogress-on", checksums are written but not verified. When all
>> + * existing pages are guaranteed to have checksums, and all new pages will be
>> + * initiated with checksums, the state can be changed to "on".
>
> This should proabably go above SetDataChecksumsOnInProgress() because
> even though I've just reviewed this, I looked at the following function
> body and wondered where the second barrier went...

I kept the ordering but reworded the comment to make it clearer.

>> +     * inprogress-off state during which backends continue to write checksums
>
> The inprogress-off is in " " everywhere else.

Fixed.

>> +    else
>> +    {
>> +        /*
>> +         * Ending up here implies that the checksums state is "inprogress-on"
>> +         * and we can transition directly to "off" from there.
>
> Can you explain that a bit more? Is "inprogress-on" a typo for
> "inprogress-off", or do you really mean that we can just switch off
> checksums during "inprogress-on"? If so, the rationale should be
> explained a bit more.

The reason why we need "inprogress-off", where checksums are written but not
verified, is to ensure that backends still in the "on" state have checksums
updated to not incur verification failures on concurrent writes.

If the state is "inprogress-on" then checksums are being written but not
verified, so we can transition directly to "off" as there are no backends
verifying data checksums, so there is no need to keep writing them.

>> @@ -7929,6 +8226,32 @@ StartupXLOG(void)
>>      */
>>     CompleteCommitTsInitialization();
>>
>> +    /*
>> +     * If we reach this point with checksums in progress state (either being
>> +     * enabled or being disabled), we notify the user that they need to
>> +     * manually restart the process to enable checksums. This is because we
>
> I think this could rephrased to "If we reach this point with checksums
> being enabled, we notify..." because the disable case is different and
> handled in the following block.

Correct, that sentence was a leftover from a previous version of this codepath,
fixed.

>> @@ -965,10 +965,13 @@ InsertPgClassTuple(Relation pg_class_desc,
>>     /* relpartbound is set by updating this tuple, if necessary */
>>     nulls[Anum_pg_class_relpartbound - 1] = true;
>>
>> +    HOLD_INTERRUPTS();
>> +    values[Anum_pg_class_relhaschecksums - 1] = BoolGetDatum(DataChecksumsNeedWrite());
>>     tup = heap_form_tuple(RelationGetDescr(pg_class_desc), values, nulls);
>>
>>     /* finally insert the new tuple, update the indexes, and clean up */
>>     CatalogTupleInsert(pg_class_desc, tup);
>> +    RESUME_INTERRUPTS();
>>
>>     heap_freetuple(tup);
>
> Maybe add a comment here why we now HOLD/RESUME_INTERRUPTS.

Fixed.

>> + * a lot of WAL as the entire database is read and written. Once all datapages
>
> It's "data page" with a space in between everywhere else.

Fixed.

>> + *        - Backends SHALL NOT violate local datachecksum state
>> + *        - Data checksums SHALL NOT be considered enabled cluster-wide until all
>
> Linewrap.

Not sure what you mean here, sentence is too long to fit on one line.

>> + *   When Enabling Data Checksums
>> + *     ----------------------------
>
> There's something off with the indentation of either the title or the
> line separator here.

There was a mix of tab and space, fixed to consistently use spaces for
indentation here.

>> + *   When the DataChecksumsWorker has finished writing checksums on all pages
>> + *   and enable data checksums cluster-wide, there are three sets of backends:
>
> "enables"

Fixed.

>> + *   Bg: Backend updating the global state and emitting the procsignalbarrier
>> + *   Bd: Backends on "off" state
>
> s/on/in/
>
> Also, given that "When the DataChecksumsWorker has finished writing
> checksums on all pages and enable[s] data checksums cluster-wide",
> shouldn't that mean that all other backends are either in "on" or
> "inprogress-on" state, because the Bd -> Bi transition happened during a
> previous barrier? Maybe that should be first explained?

Right, I've reworded this and wordsmithed a bit to make it a bit less
convoluted.

>> + *   Be: Backends in "on" state
>> + *   Bi: Backends in "inprogress-on" state
>> + *
>> + *   Backends transition from the Bd state to Be like so: Bd -> Bi -> Be
>> + *
>> + *   Backends in Bi and Be will write checksums when modifying a page, but only
>> + *   backends in Be will verify the checksum during reading. The Bg backend is
>> + *   blocked waiting for all backends in Bi to process interrupts and move to
>> + *   Be. Any backend starting will observe the global state being "on" and will
>
> "Any backend starting while Bg is waiting for the barrier" right?

Correct, fixed.

>> + *   All sets are compatible while still operating based on
>> + *   their local state.
>
> Whoa, you lost me there.

What I meant was that Bi and Be are compatible as they both satisfy the
requirement of each other (both write checksums), so they can concurrently
exist in a cluster without false negatives occurring. Reworded.

>> + *     When Disabling Data Checksums
>> + *     -----------------------------
>> + *     A process which fails to observe data checksums being disabled can induce
>> + *     two types of errors: writing the checksum when modifying the page and
>
> Can you rephrase what you mean with "being disabled"? If you mean we're
> in the "inprogress-off" state, then why is that an error? Do you mean
> "writing *no* checksum" because AIUI we should still write checksums at
> this point? Or are you talking about a different state?

This was referring to the "off" state, but it wasn't terribly clear.  I've
reworded that part.

>> + *     validating a data checksum which is no longer correct due to modifications
>> + *     to the page.
>> + *
>> + *   Bg: Backend updating the global state and emitting the procsignalbarrier
>> + *   Bd: Backands in "off" state
>
> s/Backands/Backends/

Fixed.

>> + *   Be: Backends in "on" state
>> + *   Bi: Backends in "inprogress-off" state
>
> I suggest using a different symbol here for "inprogress-off" in order
> not to confuse the two (different right?) Bi.

Good point, fixed.

>> + *   Backends transition from the Be state to Bd like so: Be -> Bi -> Bd
>> + *
>> + *   The goal is to transition all backends to Bd making the others empty sets.
>> + *   Backends in Bi writes data checksums, but don't validate them, such that
>
> s/writes/write/

Fixed.

>> + *   backends still in Be can continue to validate pages until the barrier has
>> + *   been absorbed such that they are in Bi. Once all backends are in Bi, the
>> + *   barrier to transition to "off" can be raised and all backends can safely
>> + *   stop writing data checksums as no backend is enforcing data checksum
>> + *   validation.
>
> ... "anymore" maybe.

Makes sense, fixed.

>> + * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
>
> Copyright year bump, there are two other occurrences.

Fixed.

>> +    bool        target;
>
> This "target" bool isn't documented very well (at least here). AIUI,
> it's true if we enable checksums and false if we disable them?

Correct, and I agree that the name is quite poor.  I renamed to
"enable_checksums" and added a comment.

>> +void
>> +StartDatachecksumsWorkerLauncher(bool enable_checksums, int cost_delay, int cost_limit)
>
> After reading through the function I find this 'bool enable_checksums'
> a bit confusing, I think something like 'int operation' and then
> comparing it to either ENABLE or DISABLE or whatever would make the code
> more readable, but it's a minor nitpick.

Maybe, I personally find enable_checksums to be self-explanatory but I'm
clearly biased.  With the small change of renaming "target", do you still think
it should be changed?

>> +         * running launcher.
>> +         */
>> +
>
> extra newline

Fixed.

>> +            if (DatachecksumsWorkerShmem->target)
>
> Would "if (DatachecksumsWorkerShmem->target == enable_checksums)" maybe
> be better, or does that change the meaning?

It would change its meaning as it would always be true.  Since renaming target
to enable_checksums I think this condition is quite clear though.

>> +    /*
>> +     * The launcher is currently not running, so we need to query the system
>> +     * data checksum state to determine how to proceed based on the requested
>> +     * target state.
>> +     */
>> +    else
>> +    {
>
> That's a slightly weird comment placement, maybe put it below the '{' so
> that that the '} else {' is kept intact?

This style is used in a number of places in the patch, and it's used around the
tree as well.  If it's a common concern/complaint then I'm happy to change them
all.

>> +        memset(DatachecksumsWorkerShmem->operations, 0, sizeof(DatachecksumsWorkerShmem->operations));
>> +        DatachecksumsWorkerShmem->target = enable_checksums;
>
> This one is especially confusing as it looks like we're unconditionally
> enabling checksums but instead we're just setting the target based on
> the bool (see above). It might be our standard notation though.

After the rename of the variable to enable_checksums it reads pretty clearly that
this operation is caching the passed-in value.  What do you think about the
codepath now?

>> +    /*
>> +     * If the postmaster crashed we cannot end up with a processed database so
>> +     * we have no alternative other than exiting. When enabling checksums we
>> +     * won't at this time have changed the pg_control version to enabled so
>> +     * when the cluster comes back up processing will have to be resumed. When
>> +     * disabling, the pg_control version will be set to off before this so
>> +     * when the cluster comes up checksums will be off as expected. In the
>> +     * latter case we might have stale relhaschecksums flags in pg_class which
>> +     * need to be handled in some way. TODO
>
> Any idea on how? Or is that for a future feature and can just be documented
> for now? In any case, one can just run pg_disable_checksums() again as
> mentioned elsewhere so maybe just rephrase the comment to say the admin
> needs to do that?

I've reworded the comment to be a nice to have rather than a need to have,
since stale flags have no practical implication.  Regarding how to address it
I'm not really sure what the cleanest way would be, and I've deliberately held
off on it since it's mostly cosmetic and this patch is complicated enough as
it is.

>> +    /*
>> +     * Set up local cache of Controldata values.
>> +     */
>> +    InitLocalControldata();
>
> This just sets LocalDataChecksumVersion for now, is it expected to cache
> other ControlData values in the future?  Maybe clarifying the current
> state in the comment would be in order.

I'm a bit hesitant to write in the comment that it's only the data checksums
state right now, since such a comment is almost guaranteed to be missed and
become stale when/if InitLocalControldata gains more capabilities.  Instead
I've added a function comment in xlog.c on InitLocalControldata.

>> -static bool data_checksums;
>> +static int    data_checksums_tmp;
>
> Why the _tmp, is that required for enum GUCs?

That is a missed leftover from an earlier version, removed.

Attached is a rebase with the above fixes, thanks for review!

cheers ./daniel


Attachment

Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 5 Jan 2021, at 18:19, Michael Banck <michael.banck@credativ.de> wrote:

>> diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
>> index 385ac25150..e3b0048806 100644
>> --- a/doc/src/sgml/ref/initdb.sgml
>> +++ b/doc/src/sgml/ref/initdb.sgml
>> @@ -219,6 +219,7 @@ PostgreSQL documentation
>>         failures will be reported in the
>>         <link linkend="monitoring-pg-stat-database-view">
>>         <structname>pg_stat_database</structname></link> view.
>> +        See <xref linkend="checksums" /> for details.
>>        </para>
>>       </listitem>
>>      </varlistentry>
>
>> diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
>> index f4bc147b10..5dcfcdd2ff 100644
>> --- a/doc/src/sgml/wal.sgml
>> +++ b/doc/src/sgml/wal.sgml
>> @@ -230,6 +230,103 @@
>>   </para>
>>  </sect1>
>>
>> + <sect1 id="checksums">
>> +  <title>Data Checksums</title>
>> +  <indexterm>
>> +   <primary>checksums</primary>
>> +  </indexterm>
>> +
>> +  <para>
>> +   Data pages are not checksum protected by default, but this can optionally be
>> +   enabled for a cluster.  When enabled, each data page will be assigned a
>> +   checksum that is updated when the page is written and verified every time
>> +   the page is read. Only data pages are protected by checksums, internal data
>> +   structures and temporary files are not.
>> +  </para>
>> +
>> +  <para>
>> +   Checksums are normally enabled when the cluster is initialized using <link
>> +   linkend="app-initdb-data-checksums"><application>initdb</application></link>.
>> +   They can also be enabled or disabled at a later time, either as an offline
>> +   operation or in a running cluster. In all cases, checksums are enabled or
>> +   disabled at the full cluster level, and cannot be specified individually for
>> +   databases or tables.
>> +  </para>
>> +
>> +  <para>
>> +   The current state of checksums in the cluster can be verified by viewing the
>> +   value of the read-only configuration variable <xref
>> +   linkend="guc-data-checksums" /> by issuing the command <command>SHOW
>> +   data_checksums</command>.
>> +  </para>
>> +
>> +  <para>
>> +   When attempting to recover from corrupt data it may be necessary to bypass
>> +   the checksum protection in order to recover data. To do this, temporarily
>> +   set the configuration parameter <xref linkend="guc-ignore-checksum-failure" />.
>> +  </para>
>
> I think the above is rather informative about checksums in general and
> not specific to online activation of checksums, so could pretty much be
> committed verbatim right now, except for the "either as an offline
> operation or in a running cluster" bit which would have to be rewritten.

That might indeed be useful regardless of the state of this patch.  Robert and
Heikki: being committers who have both reviewed recent versions of the patch,
would you prefer these sections be broken out into a separate patch?

cheers ./daniel


Re: Online checksums patch - once again

From
Magnus Hagander
Date:
On Thu, Jan 7, 2021 at 3:05 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>
> > On 5 Jan 2021, at 18:19, Michael Banck <michael.banck@credativ.de> wrote:
>
> >> diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
> >> index 385ac25150..e3b0048806 100644
> >> --- a/doc/src/sgml/ref/initdb.sgml
> >> +++ b/doc/src/sgml/ref/initdb.sgml
> >> @@ -219,6 +219,7 @@ PostgreSQL documentation
> >>         failures will be reported in the
> >>         <link linkend="monitoring-pg-stat-database-view">
> >>         <structname>pg_stat_database</structname></link> view.
> >> +        See <xref linkend="checksums" /> for details.
> >>        </para>
> >>       </listitem>
> >>      </varlistentry>
> >
> >> diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
> >> index f4bc147b10..5dcfcdd2ff 100644
> >> --- a/doc/src/sgml/wal.sgml
> >> +++ b/doc/src/sgml/wal.sgml
> >> @@ -230,6 +230,103 @@
> >>   </para>
> >>  </sect1>
> >>
> >> + <sect1 id="checksums">
> >> +  <title>Data Checksums</title>
> >> +  <indexterm>
> >> +   <primary>checksums</primary>
> >> +  </indexterm>
> >> +
> >> +  <para>
> >> +   Data pages are not checksum protected by default, but this can optionally be
> >> +   enabled for a cluster.  When enabled, each data page will be assigned a
> >> +   checksum that is updated when the page is written and verified every time
> >> +   the page is read. Only data pages are protected by checksums, internal data
> >> +   structures and temporary files are not.
> >> +  </para>
> >> +
> >> +  <para>
> >> +   Checksums are normally enabled when the cluster is initialized using <link
> >> +   linkend="app-initdb-data-checksums"><application>initdb</application></link>.
> >> +   They can also be enabled or disabled at a later time, either as an offline
> >> +   operation or in a running cluster. In all cases, checksums are enabled or
> >> +   disabled at the full cluster level, and cannot be specified individually for
> >> +   databases or tables.
> >> +  </para>
> >> +
> >> +  <para>
> >> +   The current state of checksums in the cluster can be verified by viewing the
> >> +   value of the read-only configuration variable <xref
> >> +   linkend="guc-data-checksums" /> by issuing the command <command>SHOW
> >> +   data_checksums</command>.
> >> +  </para>
> >> +
> >> +  <para>
> >> +   When attempting to recover from corrupt data it may be necessary to bypass
> >> +   the checksum protection in order to recover data. To do this, temporarily
> >> +   set the configuration parameter <xref linkend="guc-ignore-checksum-failure" />.
> >> +  </para>
> >
> > I think the above is rather informative about checksums in general and
> > not specific to online activation of checksums, so could pretty much be
> > committed verbatim right now, except for the "either as an offline
> > operation or in a running cluster" bit which would have to be rewritten.
>
> That might indeed be useful regardless of the state of this patch.  Robert and
> Heikki: being committers who have both reviewed recent versions of the patch,
> would you prefer these sections be broken out into a separate patch?

I think it would be ;)

Obviously in that case adapted so that it has to be changed offline,
and the patch would have to change that to say it can be changed
online.


-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 7 Jan 2021, at 16:25, Magnus Hagander <magnus@hagander.net> wrote:

> I think it would be ;)
>
> Obviously in that case adapted so that it has to be changed offline,
> and the patch would have to change that to say it can be changed
> online.

Attached is a v28 which has the docs portion separated out into 0001 with 0002
changing the docs in 0001 to mention the online operation.

cheers ./daniel


Attachment

Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 12 Jan 2021, at 00:07, Daniel Gustafsson <daniel@yesql.se> wrote:
>
>> On 7 Jan 2021, at 16:25, Magnus Hagander <magnus@hagander.net> wrote:
>
>> I think it would be ;)
>>
>> Obviously in that case adapted so that it has to be changed offline,
>> and the patch would have to change that to say it can be changed
>> online.
>
> Attached is a v28 which has the docs portion separated out into 0001 with 0002
> changing the docs in 0001 to mention the online operation.

Attached is v29 which adds a small test for running pg_checksums on a cluster
turned off during a checksum enable operation (as well as some minor word-
smithing in test comments).

cheers ./daniel


Attachment

Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
The attached v30 adds the proposed optimizations in this thread as previously
asked for, as well as some small cleanups to the procsignal calling codepath
(replacing single call functions with just calling the function) and some
function comments which were missing.

cheers ./daniel


Attachment

Re: Online checksums patch - once again

From
Magnus Hagander
Date:
On Fri, Jan 15, 2021 at 11:33 AM Daniel Gustafsson <daniel@yesql.se> wrote:
>
> The attached v30 adds the proposed optimizations in this thread as previously
> asked for, as well as some small cleanups to the procsignal calling codepath
> (replacing single call functions with just calling the function) and some
> function comments which were missing.

I've applied the docs patch.

I made a tiny change so the reference to "data page checksums" was
changed to "data checksums". Of course, after doing that I realize
that we use both terms in different places, but the docs side mostly
talked about "data checksums". I changed the one reference that was
also in wal.sgml, but left the rest alone.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: Online checksums patch - once again

From
Michael Banck
Date:
Hi,

On Fri, Jan 15, 2021 at 11:32:56AM +0100, Daniel Gustafsson wrote:
> The attached v30 adds the proposed optimizations in this thread as previously
> asked for, as well as some small cleanups to the procsignal calling codepath
> (replacing single call functions with just calling the function) and some
> function comments which were missing.

0002 (now 0001 I guess) needs a rebase due to a3ed4d1e.


Michael

-- 
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to the
following provisions: https://www.credativ.de/datenschutz



Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 18 Jan 2021, at 19:14, Michael Banck <michael.banck@credativ.de> wrote:
>
> Hi,
>
> On Fri, Jan 15, 2021 at 11:32:56AM +0100, Daniel Gustafsson wrote:
>> The attached v30 adds the proposed optimizations in this thread as previously
>> asked for, as well as some small cleanups to the procsignal calling codepath
>> (replacing single call functions with just calling the function) and some
>> function comments which were missing.
>
> 0002 (now 0001 I guess) needs a rebase due to a3ed4d1e.

Correct, rebase attached.

cheers ./daniel


Attachment

Re: Online checksums patch - once again

From
Heikki Linnakangas
Date:
I read through the latest patch, 
v31-0001-Support-checksum-enable-disable-in-a-running-clu.patch. Some 
comments below:

On 19/01/2021 14:32, Daniel Gustafsson wrote:
> +       /*
> +        * Hold interrupts for the duration of xlogging to avoid the state of data
> +        * checksums changing during the processing which would later the premise
> +        * for xlogging hint bits.
> +        */

Sentence sense does not make.

> @@ -904,6 +916,7 @@ static void SetLatestXTime(TimestampTz xtime);
>  static void SetCurrentChunkStartTime(TimestampTz xtime);
>  static void CheckRequiredParameterValues(void);
>  static void XLogReportParameters(void);
> +static void XlogChecksums(ChecksumType new_type);
>  static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI,
>                                                                 TimeLineID prevTLI);
>  static void LocalSetXLogInsertAllowed(void);

Spelling: make it "XLogChecksums" for consistency.

> /*
>  * DataChecksumsNeedWrite
>  *        Returns whether data checksums must be written or not
>  *
>  * Returns true iff data checksums are enabled or are in the process of being
>  * enabled. In case data checksums are currently being enabled we must write
>  * the checksum even though it's not verified during this stage. Interrupts
>  * need to be held off by the caller to ensure that the returned state is
>  * valid for the duration of the intended processing.
>  */
> bool
> DataChecksumsNeedWrite(void)
> {
>     Assert(InterruptHoldoffCount > 0);
>     return (LocalDataChecksumVersion == PG_DATA_CHECKSUM_VERSION ||
>             LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION ||
>             LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION);
> }

Can you be more precise on the "duration of the intended processing"? It 
means, until you have actually written out the page, or something like 
that, right? The similar explanation in DataChecksumsNeedVerify() is 
easier to understand. The two functions are similar, it would be good to 
phrase the comments similarly, so that you can quickly see the 
difference between the two.

> /*
>  * SetDataChecksumsOnInProgress
>  *        Sets the data checksum state to "inprogress-on" to enable checksums
>  *
>  * In order to start the process of enabling data checksums in a running
>  * cluster the data_checksum_version state must be changed to "inprogress-on".
>  * This state requires data checksums to be written but not verified. The state
>  * transition is performed in a critical section in order to provide crash
>  * safety, and checkpoints are held off. When the emitted procsignalbarrier
>  * has been absorbed by all backends we know that the cluster has started to
>  * enable data checksums.
>  */

The two "in order" are superfluous, it's more concise to say just "To 
start the process ..." (I was made aware of that by Jürgen Purtz's 
recent patches that removed a couple of "in order"s from the docs as 
unnecessary).

It's a bit confusing to talk about the critical section and 
procsignalbarrier here. Does the caller need to wait for the 
procsignalbarrier? No, that's just explaining what the function does 
internally. Maybe move that explanation inside the function, and say 
here something like "This function blocks until all processes have 
acknowledged the state change" or something like that.

> /*
>  * SetDataChecksumsOn
>  *        Enables data checksums cluster-wide
>  *
>  * Enabling data checksums is performed using two barriers, the first one
>  * sets the checksums state to "inprogress-on" (which is performed by
>  * SetDataChecksumsOnInProgress()) and the second one to "on" (performed here).
>  * During "inprogress-on", checksums are written but not verified. When all
>  * existing pages are guaranteed to have checksums, and all new pages will be
>  * initiated with checksums, the state can be changed to "on".
>  */

Perhaps the explanation for how these SetDataChecksumsOn() and 
SetDataChecksumsOnInProgress() functions work together should be moved 
to one place. For example, explain both functions here at 
SetDataChecksumsOn(), and just have a "see SetDataChecksumsOn()" in the 
other function, or no comment at all if they're kept next to each other. 
As it stands, you have to read both comments and piece together the 
big picture in your head. Maybe add a "see datachecksumsworker.c" here, 
since there's a longer explanation of the overall mechanism there.

> +/*
> + * Disables checksums for the cluster, unless already disabled.
> + *
> + * Has immediate effect - the checksums are set to off right away.
> + */
> +Datum
> +disable_data_checksums(PG_FUNCTION_ARGS)
> +{
> +       if (!superuser())
> +               ereport(ERROR,
> +                               (errmsg("must be superuser")));
> +
> +       StartDatachecksumsWorkerLauncher(false, 0, 0);
> +
> +       PG_RETURN_BOOL(true);
> +}

The documentation says "Returns <literal>false</literal> in case data 
checksums are disabled already", but the function always returns true. 
The enable_data_checksum() function also returns a constant 'true'; why 
not make it void?

The "has immediate effect" comment seems wrong, given that it actually 
launches a worker process.

> +       /*
> +        * If the launcher is already started, the only operation we can perform
> +        * is to cancel it iff the user requested for checksums to be disabled.
> +        * That doesn't however mean that all other cases yield an error, as some
> +        * might be perfectly benevolent.
> +        */

This comment is a bit hard to understand. Maybe something like

"If the launcher is already started, we cannot launch a new one. But if 
the user requested for checksums to be disabled, we can cancel it."

> +       if (DatachecksumsWorkerShmem->launcher_started)
> +       {
> +               if (DatachecksumsWorkerShmem->abort)
> +               {
> +                       ereport(NOTICE,
> +                                       (errmsg("data checksum processing is concurrently being aborted, please
retry")));
> +
> +                       LWLockRelease(DatachecksumsWorkerLock);
> +                       return;
> +               }

If it's being aborted, and the user requested to disable checksum, can 
we leave out the NOTICE? I guess not, because the worker process won't 
check the 'abort' flag, if it's already finished processing all data pages.

> +               /*
> +                * If the launcher is started data checksums cannot be on or off, but
> +                * it may be in an inprogress state. Since the state transition may
> +                * not have happened yet (in case of rapidly initiated checksum enable
> +                * calls for example) we inspect the target state of the currently
> +                * running launcher.
> +                */

This comment contradicts itself. If the state transition has not 
happened yet, then contrary to the first sentence, data checksums *are* 
currently on or off.

> +               if (enable_checksums)
> +               {
> +                       /*
> +                        * If we are asked to enable checksums when they are already being
> +                        * enabled, there is nothing to do so exit.
> +                        */
> +                       if (DatachecksumsWorkerShmem->enable_checksums)
> +                       {
> +                               LWLockRelease(DatachecksumsWorkerLock);
> +                               return;
> +                       }
> +
> +                       /*
> +                        * Disabling checksums is likely to be a very quick operation in
> +                        * many cases so trying to abort it to save the checksums would
> +                        * run the risk of race conditions.
> +                        */
> +                       else
> +                       {
> +                               ereport(NOTICE,
> +                                               (errmsg("data checksums are concurrently being disabled, please
retry")));
> +
> +                               LWLockRelease(DatachecksumsWorkerLock);
> +                               return;
> +                       }
> +
> +                       /* This should be unreachable */
> +                       Assert(false);
> +               }
> +               else if (!enable_checksums)
> +               {
> +                       /*
> +                        * Data checksums are already being disabled, exit silently.
> +                        */
> +                       if (DataChecksumsOffInProgress())
> +                       {
> +                               LWLockRelease(DatachecksumsWorkerLock);
> +                               return;
> +                       }
> +
> +                       DatachecksumsWorkerShmem->abort = true;
> +                       LWLockRelease(DatachecksumsWorkerLock);
> +                       return;
> +               }

The Assert seems unnecessary. The "if (!enable_checkums)" also seems 
unnecessary, could be just "else".

> +       /*
> +        * The launcher is currently not running, so we need to query the system
> +        * data checksum state to determine how to proceed based on the requested
> +        * target state.
> +        */

"query the system" makes me think "checking the catalogs" or similar, 
but this just looks at a few variables in shared memory.

> +/*
> + * ShutdownDatachecksumsWorkerIfRunning
> + *             Request shutdown of the datachecksumsworker
> + *
> + * This does not turn off processing immediately, it signals the checksum
> + * process to end when done with the current block.
> + */
> +void
> +ShutdownDatachecksumsWorkerIfRunning(void)
> +{
> +       LWLockAcquire(DatachecksumsWorkerLock, LW_EXCLUSIVE);
> +
> +       /* If the launcher isn't started, there is nothing to shut down */
> +       if (DatachecksumsWorkerShmem->launcher_started)
> +               DatachecksumsWorkerShmem->abort = true;
> +
> +       LWLockRelease(DatachecksumsWorkerLock);
> +}

This function is unused.

> +/*
> + * launcher_cancel_handler
> + *
> + * Internal routine for reacting to SIGINT and flagging the worker to abort.
> + * The worker won't be interrupted immediately but will check for abort flag
> + * between each block in a relation.
> + */
> +static void
> +launcher_cancel_handler(SIGNAL_ARGS)
> +{
> +       LWLockAcquire(DatachecksumsWorkerLock, LW_EXCLUSIVE);
> +       DatachecksumsWorkerShmem->abort = true;
> +       LWLockRelease(DatachecksumsWorkerLock);
> +}

Acquiring an lwlock in a signal handler is not safe.

Since this is a signal handler, it might still get called after the 
process has finished processing, and has already set 
DatachecksumsWorkerShmem->launcher_started = false. That seems like an 
unexpected state.

> +/*
> + * WaitForAllTransactionsToFinish
> + *             Blocks awaiting all current transactions to finish
> + *
> + * Returns when all transactions which are active at the call of the function
> + * have ended, or if the postmaster dies while waiting. If the postmaster dies
> + * the abort flag will be set to indicate that the caller of this shouldn't
> + * proceed.
> + */

The caller of this function doesn't check the 'abort' flag AFAICS. I 
think you could just use WL_EXIT_ON_PM_DEATH to error out on 
postmaster death.
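
I.e. something like this in the wait loop (untested sketch, and the wait
event name is just a placeholder):

(void) WaitLatch(MyLatch,
                 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                 1000L,
                 WAIT_EVENT_PG_SLEEP);
ResetLatch(MyLatch);
CHECK_FOR_INTERRUPTS();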

> +       while (true)
> +       {
> +               int                     processed_databases = 0;
> +
> +               /*
> +                * Get a list of all databases to process. This may include databases
> +                * that were created during our runtime.
> +                *
> +                * Since a database can be created as a copy of any other database
> +                * (which may not have existed in our last run), we have to repeat
> +                * this loop until no new databases show up in the list. Since we wait
> +                * for all pre-existing transactions finish, this way we can be
> +                * certain that there are no databases left without checksums.
> +                */
> +               DatabaseList = BuildDatabaseList();

I think it's unnecessary to wait out the transactions on the first 
iteration of this loop. There must be at least one database, so there 
will always be at least two iterations.

I think it would be more clear to move the 
WaitForAllTransactionsToFinish() from BuildDatabaseList() to the callers.

> @@ -3567,6 +3571,27 @@ RelationBuildLocalRelation(const char *relname,
>                 relkind == RELKIND_MATVIEW)
>                 RelationInitTableAccessMethod(rel);
>  
> +       /*
> +        * Set the data checksum state. Since the data checksum state can change at
> +        * any time, the fetched value might be out of date by the time the
> +        * relation is built.  DataChecksumsNeedWrite returns true when data
> +        * checksums are: enabled; are in the process of being enabled (state:
> +        * "inprogress-on"); are in the process of being disabled (state:
> +        * "inprogress-off"). Since relhaschecksums is only used to track progress
> +        * when data checksums are being enabled, and going from disabled to
> +        * enabled will clear relhaschecksums before starting, it is safe to use
> +        * this value for a concurrent state transition to off.
> +        *
> +        * If DataChecksumsNeedWrite returns false, and is concurrently changed to
> +        * true then that implies that checksums are being enabled. Worst case,
> +        * this will lead to the relation being processed for checksums even though
> +        * each page written will have them already.  Performing this last shortens
> +        * the window, but doesn't avoid it.
> +        */
> +       HOLD_INTERRUPTS();
> +       rel->rd_rel->relhaschecksums = DataChecksumsNeedWrite();
> +       RESUME_INTERRUPTS();
> +
>         /*
>          * Okay to insert into the relcache hash table.
>          *

I grepped for relhaschecksums, and concluded that the value in the 
relcache isn't actually used for anything. Not so! In 
heap_create_with_catalog(), the actual pg_class row is constructed from 
the relcache entry, so the value set in RelationBuildLocalRelation() 
finds its way to pg_class. Perhaps it would be more clear to pass 
relhaschecksums directly as an argument to AddNewRelationTuple(). That 
way, the value in the relcache would be truly never used.

- Heikki



Re: Online checksums patch - once again

From
Heikki Linnakangas
Date:
On 22/01/2021 13:55, Heikki Linnakangas wrote:
> I read through the latest patch,
> v31-0001-Support-checksum-enable-disable-in-a-running-clu.patch. Some
> comments below:

One more thing:

In SetRelationNumChecks(), you should use SearchSysCacheCopy1() to get a 
modifiable copy of the tuple. Otherwise you modify the tuple in the 
relcache as a side effect. Maybe that's harmless in this case, as the 
'relhaschecksums' value in the relcache isn't used for anything, but 
let's be tidy.
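
For reference, the usual shape of that pattern is roughly this (sketch, not
the exact patch code; it assumes pg_class is already open as 'rel' and
'relid' is the relation being updated):

HeapTuple       tuple;
Form_pg_class   relform;

tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
if (!HeapTupleIsValid(tuple))
    elog(ERROR, "cache lookup failed for relation %u", relid);
relform = (Form_pg_class) GETSTRUCT(tuple);
relform->relhaschecksums = true;    /* modify the copy, not the cached tuple */
CatalogTupleUpdate(rel, &tuple->t_self, tuple);
heap_freetuple(tuple);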

- Heikki



Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 22 Jan 2021, at 12:55, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> I read through the latest patch, v31-0001-Support-checksum-enable-disable-in-a-running-clu.patch. Some comments
> below:

Thanks for reviewing!  Attached is a v32 which address most of your comments,
see below.

> On 19/01/2021 14:32, Daniel Gustafsson wrote:
>> +       /*
>> +        * Hold interrupts for the duration of xlogging to avoid the state of data
>> +        * checksums changing during the processing which would later the premise
>> +        * for xlogging hint bits.
>> +        */
>
> Sentence sense does not make.

Indeed it doesn't, the "later" is a misspelled "alter".  Essentially, the idea
with the comment is to explain that interrupts should be held during logging so
that the decision to log hintbits is valid for the duration of the logging.
Does changing to "alter" suffice?  It's fixed like so in the attached, but
perhaps there are better wordings for this.
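
The shape I'm after is roughly the following (just a sketch of the pattern,
not the actual hunk from the patch):

HOLD_INTERRUPTS();
if (DataChecksumsNeedWrite())
{
    /* the decision to WAL-log the hint bit stays valid while we log it */
    lsn = XLogSaveBufferForHint(buffer, buffer_std);
}
RESUME_INTERRUPTS();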

>> @@ -904,6 +916,7 @@ static void SetLatestXTime(TimestampTz xtime);
>> static void SetCurrentChunkStartTime(TimestampTz xtime);
>> static void CheckRequiredParameterValues(void);
>> static void XLogReportParameters(void);
>> +static void XlogChecksums(ChecksumType new_type);
>> static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI,
>>                                                                TimeLineID prevTLI);
>> static void LocalSetXLogInsertAllowed(void);
>
> Spelling: make it "XLogChecksums" for consistency.

Good point, fixed.

>> /*
>> * DataChecksumsNeedWrite
>> *        Returns whether data checksums must be written or not
>> *
>> * Returns true iff data checksums are enabled or are in the process of being
>> * enabled. In case data checksums are currently being enabled we must write
>> * the checksum even though it's not verified during this stage. Interrupts
>> * need to be held off by the caller to ensure that the returned state is
>> * valid for the duration of the intended processing.
>> */
>> bool
>> DataChecksumsNeedWrite(void)
>> {
>>     Assert(InterruptHoldoffCount > 0);
>>     return (LocalDataChecksumVersion == PG_DATA_CHECKSUM_VERSION ||
>>             LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION ||
>>             LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION);
>> }
>
> Can you be more precise on the "duration of the intended processing"? It means, until you have actually written out
> the page, or something like that, right? The similar explanation in DataChecksumsNeedVerify() is easier to understand.
> The two functions are similar, it would be good to phrase the comments similarly, so that you can quickly see the
> difference between the two.

I've reworded this to (hopefully) be more clear, and to be more like the
comment for DataChecksumsNeedVerify.

>> /*
>> * SetDataChecksumsOnInProgress
>> *        Sets the data checksum state to "inprogress-on" to enable checksums
>> *
>> * In order to start the process of enabling data checksums in a running
>> * cluster the data_checksum_version state must be changed to "inprogress-on".
>> * This state requires data checksums to be written but not verified. The state
>> * transition is performed in a critical section in order to provide crash
>> * safety, and checkpoints are held off. When the emitted procsignalbarrier
>> * has been absorbed by all backends we know that the cluster has started to
>> * enable data checksums.
>> */
>
> The two "in order" are superfluous, it's more concise to say just "To start the process ..." (I was made aware of
> that by Jürgen Purtz's recent patches that removed a couple of "in order"s from the docs as unnecessary).

Agreed, that's better.

> It's a bit confusing to talk about the critical section and procsignalbarrier here. Does the caller need to wait for
> the procsignalbarrier? No, that's just explaining what the function does internally. Maybe move that explanation inside
> the function, and say here something like "This function blocks until all processes have acknowledged the state change"
> or something like that.

Fixed.

>> /*
>> * SetDataChecksumsOn
>> *        Enables data checksums cluster-wide
>> *
>> * Enabling data checksums is performed using two barriers, the first one
>> * sets the checksums state to "inprogress-on" (which is performed by
>> * SetDataChecksumsOnInProgress()) and the second one to "on" (performed here).
>> * During "inprogress-on", checksums are written but not verified. When all
>> * existing pages are guaranteed to have checksums, and all new pages will be
>> * initiated with checksums, the state can be changed to "on".
>> */
>
> Perhaps the explanation for how these SetDataChecksumsOn() and SetDataChecksumsOnInProgress() functions work together
> should be moved to one place. For example, explain both functions here at SetDataChecksumsOn(), and just have a "see
> SetDataChecksumsOn()" in the other function, or no comment at all if they're kept next to each other. As it stands, you
> have to read both comments and piece together the big picture in your head. Maybe add a "see datachecksumsworker.c"
> here, since there's a longer explanation of the overall mechanism there.

Fixed.  I did keep both comments since the function are too long to fit the
comment and the code on screen, but refer to one from the other.

>> +/*
>> + * Disables checksums for the cluster, unless already disabled.
>> + *
>> + * Has immediate effect - the checksums are set to off right away.
>> + */
>> +Datum
>> +disable_data_checksums(PG_FUNCTION_ARGS)
>> +{
>> +       if (!superuser())
>> +               ereport(ERROR,
>> +                               (errmsg("must be superuser")));
>> +
>> +       StartDatachecksumsWorkerLauncher(false, 0, 0);
>> +
>> +       PG_RETURN_BOOL(true);
>> +}
>
> The documentation says "Returns <literal>false</literal> in case data checksums are disabled already", but the
> function always returns true.

Fixed.

> The enable_data_checksum() function also returns a constant 'true'; why not make it void?

These functions were void until v20 of this patch when Robert correctly pointed
out that using ereport to communicate status was a poor choice (there were
NOTICEs IIRC).  v23 however rolled back all status checking in response to a
review by you, leaving the functions to only call the worker.  The return value
was however not changed back to void at that point which would've been the
reasonable choice.  Fixed.

> The "has immediate effect" comment seems wrong, given that it actually launches a worker process.

Fixed.

>> +       /*
>> +        * If the launcher is already started, the only operation we can perform
>> +        * is to cancel it iff the user requested for checksums to be disabled.
>> +        * That doesn't however mean that all other cases yield an error, as some
>> +        * might be perfectly benevolent.
>> +        */
>
> This comment is a bit hard to understand. Maybe something like
>
> "If the launcher is already started, we cannot launch a new one. But if the user requested for checksums to be
> disabled, we can cancel it."

Agreed, thats better. Fixed.

>> +       if (DatachecksumsWorkerShmem->launcher_started)
>> +       {
>> +               if (DatachecksumsWorkerShmem->abort)
>> +               {
>> +                       ereport(NOTICE,
>> +                                       (errmsg("data checksum processing is concurrently being aborted, please
retry")));
>> +
>> +                       LWLockRelease(DatachecksumsWorkerLock);
>> +                       return;
>> +               }
>
> If it's being aborted, and the user requested to disable checksum, can we leave out the NOTICE? I guess not, because
> the worker process won't check the 'abort' flag, if it's already finished processing all data pages.

Correct, since we might otherwise silently not disable checksums on a request
to do so.

>> +               /*
>> +                * If the launcher is started data checksums cannot be on or off, but
>> +                * it may be in an inprogress state. Since the state transition may
>> +                * not have happened yet (in case of rapidly initiated checksum enable
>> +                * calls for example) we inspect the target state of the currently
>> +                * running launcher.
>> +                */
>
> This comment contradicts itself. If the state transition has not happened yet, then contrary to the first sentence,
> data checksums *are* currently on or off.

Yes, indeed.  I've reworded this comment, but I'm afraid I'm not sure it's
much of an improvement.

>> +               if (enable_checksums)
>> +               {
>> +                       /*
>> +                        * If we are asked to enable checksums when they are already being
>> +                        * enabled, there is nothing to do so exit.
>> +                        */
>> +                       if (DatachecksumsWorkerShmem->enable_checksums)
>> +                       {
>> +                               LWLockRelease(DatachecksumsWorkerLock);
>> +                               return;
>> +                       }
>> +
>> +                       /*
>> +                        * Disabling checksums is likely to be a very quick operation in
>> +                        * many cases so trying to abort it to save the checksums would
>> +                        * run the risk of race conditions.
>> +                        */
>> +                       else
>> +                       {
>> +                               ereport(NOTICE,
>> +                                               (errmsg("data checksums are concurrently being disabled, please
retry")));
>> +
>> +                               LWLockRelease(DatachecksumsWorkerLock);
>> +                               return;
>> +                       }
>> +
>> +                       /* This should be unreachable */
>> +                       Assert(false);
>> +               }
>> +               else if (!enable_checksums)
>> +               {
>> +                       /*
>> +                        * Data checksums are already being disabled, exit silently.
>> +                        */
>> +                       if (DataChecksumsOffInProgress())
>> +                       {
>> +                               LWLockRelease(DatachecksumsWorkerLock);
>> +                               return;
>> +                       }
>> +
>> +                       DatachecksumsWorkerShmem->abort = true;
>> +                       LWLockRelease(DatachecksumsWorkerLock);
>> +                       return;
>> +               }
>
> The Assert seems unnecessary. The "if (!enable_checkums)" also seems unnecessary, could be just "else".

Fixed.

>> +       /*
>> +        * The launcher is currently not running, so we need to query the system
>> +        * data checksum state to determine how to proceed based on the requested
>> +        * target state.
>> +        */
>
> "query the system" makes me think "checking the catalogs" or similar, but this just looks at a few variables in
> shared memory.

Reworded.

>> +/*
>> + * ShutdownDatachecksumsWorkerIfRunning
>> + *             Request shutdown of the datachecksumsworker
>> + *
>
> This function is unused.

Removed.

>> +/*
>> + * launcher_cancel_handler
>> + *
>> + * Internal routine for reacting to SIGINT and flagging the worker to abort.
>> + * The worker won't be interrupted immediately but will check for abort flag
>> + * between each block in a relation.
>> + */
>> +static void
>> +launcher_cancel_handler(SIGNAL_ARGS)
>> +{
>> +       LWLockAcquire(DatachecksumsWorkerLock, LW_EXCLUSIVE);
>> +       DatachecksumsWorkerShmem->abort = true;
>> +       LWLockRelease(DatachecksumsWorkerLock);
>> +}
>
> Acquiring an lwlock in signal handler is not safe.
>
> Since this is a signal handler, it might still get called after the process has finished processing, and has already
> set DatachecksumsWorkerShmem->launcher_started = false. That seems like an unexpected state.

Good point, I've fixed this to use a proper signal handler flag.  While doing
so I realized there was a race window when checksums were disabled while the
worker was running, but after it had processed the final buffer.  Since the
abort flag was only checked during buffer processing it could be missed leading
to the worker ending with checksums enabled when they should be disabled.  This
is likely to be much more common in the TAP tests than in real production use,
which is good since it was easy to trigger.  The attached fixes this by
re-checking the abort flag and I am no longer able to trigger the window during
tests.
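
The handler now follows the usual flag-only pattern, roughly like below
(sketch, names not verbatim from the patch); the flag is then translated
into the shared-memory abort state from the main loop, where taking
DatachecksumsWorkerLock is safe:

static volatile sig_atomic_t abort_requested = false;

static void
launcher_cancel_handler(SIGNAL_ARGS)
{
    int         save_errno = errno;

    abort_requested = true;
    SetLatch(MyLatch);
    errno = save_errno;
}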

>> +/*
>> + * WaitForAllTransactionsToFinish
>> + *             Blocks awaiting all current transactions to finish
>> + *
>> + * Returns when all transactions which are active at the call of the function
>> + * have ended, or if the postmaster dies while waiting. If the postmaster dies
>> + * the abort flag will be set to indicate that the caller of this shouldn't
>> + * proceed.
>> + */
>
> The caller of this function doesn't check the 'abort' flag AFAICS. I think you could just use WL_EXIT_ON_PM_DEATH
> to error out on postmaster death.

Good point, that simplifies the code.  It's erroring out in the same way as
other checks for postmaster death.

>> +       while (true)
>> +       {
>> +               int                     processed_databases = 0;
>> +
>> +               /*
>> +                * Get a list of all databases to process. This may include databases
>> +                * that were created during our runtime.
>> +                *
>> +                * Since a database can be created as a copy of any other database
>> +                * (which may not have existed in our last run), we have to repeat
>> +                * this loop until no new databases show up in the list. Since we wait
>> +                * for all pre-existing transactions finish, this way we can be
>> +                * certain that there are no databases left without checksums.
>> +                */
>> +               DatabaseList = BuildDatabaseList();
>
> I think it's unnecessary to wait out the transactions on the first iteration of this loop. There must be at least one
> database, so there will always be at least two iterations.
>
> I think it would be more clear to move the WaitForAllTransactionsToFinish() from BuildDatabaseList() to the callers.

I've tried this in the attached.  It does increase the window between waiting
for transactions to finish and grabbing the list, but that might be negligible?

>> @@ -3567,6 +3571,27 @@ RelationBuildLocalRelation(const char *relname,
>>                relkind == RELKIND_MATVIEW)
>>                RelationInitTableAccessMethod(rel);
>> +       /*
>> +        * Set the data checksum state. Since the data checksum state can change at
>> +        * any time, the fetched value might be out of date by the time the
>> +        * relation is built.  DataChecksumsNeedWrite returns true when data
>> +        * checksums are: enabled; are in the process of being enabled (state:
>> +        * "inprogress-on"); are in the process of being disabled (state:
>> +        * "inprogress-off"). Since relhaschecksums is only used to track progress
>> +        * when data checksums are being enabled, and going from disabled to
>> +        * enabled will clear relhaschecksums before starting, it is safe to use
>> +        * this value for a concurrent state transition to off.
>> +        *
>> +        * If DataChecksumsNeedWrite returns false, and is concurrently changed to
>> +        * true then that implies that checksums are being enabled. Worst case,
>> +        * this will lead to the relation being processed for checksums even though
>> +        * each page written will have them already.  Performing this last shortens
>> +        * the window, but doesn't avoid it.
>> +        */
>> +       HOLD_INTERRUPTS();
>> +       rel->rd_rel->relhaschecksums = DataChecksumsNeedWrite();
>> +       RESUME_INTERRUPTS();
>> +
>>        /*
>>         * Okay to insert into the relcache hash table.
>>         *
>
> I grepped for relhaschecksums, and concluded that the value in the relcache isn't actually used for anything. Not so!
> In heap_create_with_catalog(), the actual pg_class row is constructed from the relcache entry, so the value set in
> RelationBuildLocalRelation() finds its way to pg_class. Perhaps it would be more clear to pass relhaschecksums directly
> as an argument to AddNewRelationTuple(). That way, the value in the relcache would be truly never used.

I might be thick (or undercaffeinated) but I'm not sure I follow.
AddNewRelationTuple calls InsertPgClassTuple which in turn avoids the relcache
entry.

--
Daniel Gustafsson        https://vmware.com/


Attachment

Re: Online checksums patch - once again

From
Heikki Linnakangas
Date:
On 22/01/2021 14:21, Heikki Linnakangas wrote:
> On 22/01/2021 13:55, Heikki Linnakangas wrote:
>> I read through the latest patch,
>> v31-0001-Support-checksum-enable-disable-in-a-running-clu.patch. Some
>> comments below:
> 
> One more thing:
> 
> In SetRelationNumChecks(), you should use SearchSysCacheCopy1() to get a
> modifiable copy of the tuple. Otherwise you modify the tuple in the
> relcache as a side effect. Maybe that's harmless in this case, as the
> 'relhaschecksums' value in the relcache isn't used for anything, but
> let's be tidy.

Sorry, I meant SetRelHasChecksums. There is no SetRelationNumChecks 
function, I don't know where I got that from.

- Heikki



Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 26 Jan 2021, at 23:37, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 22/01/2021 14:21, Heikki Linnakangas wrote:
>> On 22/01/2021 13:55, Heikki Linnakangas wrote:
>>> I read through the latest patch,
>>> v31-0001-Support-checksum-enable-disable-in-a-running-clu.patch. Some
>>> comments below:
>> One more thing:
>> In SetRelationNumChecks(), you should use SearchSysCacheCopy1() to get a
>> modifiable copy of the tuple. Otherwise you modify the tuple in the
>> relcache as a side effect. Maybe that's harmless in this case, as the
>> 'relhaschecksums' value in the relcache isn't used for anything, but
>> let's be tidy.
>
> Sorry, I meant SetRelHasChecksums. There is no SetRelationNumChecks function, I don't know where I got that from.

Ah, that makes more sense, you had me confused there for a bit =) Fixed in
attached v33 which also have been through another pgindent and pgperltidy run.

--
Daniel Gustafsson        https://vmware.com/


Attachment

Re: Online checksums patch - once again

From
Heikki Linnakangas
Date:
Revisiting an issue we discussed earlier:

On 25/11/2020 15:20, Daniel Gustafsson wrote:
> On 23 Nov 2020, at 18:36, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> On 17/11/2020 10:56, Daniel Gustafsson wrote:
>>> I've reworked this in the attached such that the enable_ and
>>> disable_ functions merely call into the launcher with the desired
>>> outcome, and the launcher is responsible for figuring out the
>>> rest.  The datachecksumworker is now the sole place which
>>> initiates a state transfer.
>> 
>> Well, you still fill the DatachecksumsWorkerShmem->operations array
>> in the backend process that launches the datacheckumworker, not in
>> the worker process. I find that still a bit surprising, but I
>> believe it works.
> 
> I'm open to changing it in case there are strong opinions, it just
> seemed the most natural to me.

This kept bothering me, so I spent a while hacking this to my liking. 
The attached patch moves the code to fill in 'operations' from the 
backend to the launcher, so that the pg_enable/disable_checksums() call 
now truly just stores the desired state, checksum on or off, in shared 
memory, and launches the launcher process. The launcher process figures 
out how to get to the desired state. This removes the couple of corner 
cases that previously emitted a NOTICE about the processing being 
concurrently disabled or aborted. What do you think? I haven't done much 
testing, so if you adopt this approach, please check if I broke 
something in the process.

This changes the way the abort works. If the 
pg_enable/disable_checksums() is called, while the launcher is already 
busy changing the state, pg_enable/disable_checksums() will just set the 
new desired state in shared memory anyway. The launcher process will 
notice that the target state changed some time later, and restart from 
scratch.
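
In pseudocode, the launcher's main loop now looks roughly like this
(simplified; the Enable/Disable function names below are just placeholders
for the actual processing code):

for (;;)
{
    bool        enable;

    LWLockAcquire(DatachecksumsWorkerLock, LW_SHARED);
    enable = DatachecksumsWorkerShmem->enable_checksums;
    LWLockRelease(DatachecksumsWorkerLock);

    /* work towards the requested target state */
    if (enable)
        EnableChecksumsInAllDatabases();
    else
        DisableChecksums();

    /* if the desired state changed underneath us, start over from scratch */
    LWLockAcquire(DatachecksumsWorkerLock, LW_SHARED);
    if (DatachecksumsWorkerShmem->enable_checksums == enable)
    {
        LWLockRelease(DatachecksumsWorkerLock);
        break;
    }
    LWLockRelease(DatachecksumsWorkerLock);
}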

A couple of other issues came up while doing that:

- AbortProcessing() has two callers, one in code that runs in the 
launcher process, and another one in code that runs in the worker 
process. Is it really safe to use from the worker process? Calling 
ProcessAllDatabases() in the worker seems sketchy. (This is moot in this 
new patch version, as I removed AbortProcessing() altogether)

- Is it possible that the worker keeps running after the launcher has 
already exited, e.g. because of an ERROR or SIGTERM? If you then quickly 
call pg_enable_checksums() again, can you end up with two workers 
running at the same time? Is that bad?

On 26/01/2021 23:00, Daniel Gustafsson wrote:
>> On 22 Jan 2021, at 12:55, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> @@ -3567,6 +3571,27 @@ RelationBuildLocalRelation(const char *relname,
>>>                 relkind == RELKIND_MATVIEW)
>>>                 RelationInitTableAccessMethod(rel);
>>> +       /*
>>> +        * Set the data checksum state. Since the data checksum state can change at
>>> +        * any time, the fetched value might be out of date by the time the
>>> +        * relation is built.  DataChecksumsNeedWrite returns true when data
>>> +        * checksums are: enabled; are in the process of being enabled (state:
>>> +        * "inprogress-on"); are in the process of being disabled (state:
>>> +        * "inprogress-off"). Since relhaschecksums is only used to track progress
>>> +        * when data checksums are being enabled, and going from disabled to
>>> +        * enabled will clear relhaschecksums before starting, it is safe to use
>>> +        * this value for a concurrent state transition to off.
>>> +        *
>>> +        * If DataChecksumsNeedWrite returns false, and is concurrently changed to
>>> +        * true then that implies that checksums are being enabled. Worst case,
>>> +        * this will lead to the relation being processed for checksums even though
>>> +        * each page written will have them already.  Performing this last shortens
>>> +        * the window, but doesn't avoid it.
>>> +        */
>>> +       HOLD_INTERRUPTS();
>>> +       rel->rd_rel->relhaschecksums = DataChecksumsNeedWrite();
>>> +       RESUME_INTERRUPTS();
>>> +
>>>         /*
>>>          * Okay to insert into the relcache hash table.
>>>          *
>>
>> I grepped for relhashcheckums, and concluded that the value in the
>> relcache isn't actually used for anything. Not so! In
>> heap_create_with_catalog(), the actual pg_class row is constructed
>> from the relcache entry, so the value set in
>> RelationBuildLocalRelation() finds its way to pg_class. Perhaps it
>> would be more clear to pass relhachecksums directly as an argument
>> to AddNewRelationTuple(). That way, the value in the relcache would
>> be truly never used.
> 
> I might be thick (or undercaffeinated) but I'm not sure I follow. 
> AddNewRelationTuple calls InsertPgClassTuple which in turn avoids the
> relcache entry.

Ah, you're right, I misread AddNewRelationTuple. That means that the 
relhaschecksums field in the relcache is never used? That's a clear 
rule. The changes to formrdesc() and RelationBuildLocalRelation() seem 
unnecessary then, we can always initialize relhaschecksums to false in 
the relcache.

- Heikki

Attachment

Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 27 Jan 2021, at 16:37, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Revisiting an issue we discussed earlier:
>
> On 25/11/2020 15:20, Daniel Gustafsson wrote:
>> On 23 Nov 2020, at 18:36, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>>> On 17/11/2020 10:56, Daniel Gustafsson wrote:
>>>> I've reworked this in the attached such that the enable_ and
>>>> disable_ functions merely call into the launcher with the desired
>>>> outcome, and the launcher is responsible for figuring out the
>>>> rest.  The datachecksumworker is now the sole place which
>>>> initiates a state transfer.
>>> Well, you still fill the DatachecksumsWorkerShmem->operations array
>>> in the backend process that launches the datacheckumworker, not in
>>> the worker process. I find that still a bit surprising, but I
>>> believe it works.
>> I'm open to changing it in case there are strong opinions, it just
>> seemed the most natural to me.
>
> This kept bothering me, so I spent a while hacking this to my liking. The attached patch moves the code to fill in
> 'operations' from the backend to the launcher, so that the pg_enable/disable_checksums() call now truly just stores the
> desired state, checksum on or off, in shared memory, and launches the launcher process. The launcher process figures out
> how to get to the desired state. This removes the couple of corner cases that previously emitted a NOTICE about the
> processing being concurrently disabled or aborted. What do you think?

I like it, it does avoid a few edge cases, while some are moved around (more on
that below), yielding easier-to-read code.  This wasn't really how I thought you
wanted it when we talked about this, so I'm very glad to see it in code since I
now get what you were saying.  Thanks!

> I haven't done much testing, so if you adopt this approach, please check if I broke something in the process.

I think you broke restarts: the enable_checksums variable wasn't populated so
the restart seemed to fail to set the right operations.  Also, the false return
can now mean an actual failure as well as an abort case for a restart.  I've
done a quick change to look for an updated state, but that doesn't cover the
case when processing fails *and* a new state has been set.  Maybe the best
solution is to change the return type for processing to an int such that the three
cases (failure, abort, success) can be returned?
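
Something along these lines is what I have in mind (sketch only, the names
are made up):

typedef enum DatachecksumsWorkerResult
{
    DATACHECKSUMSWORKER_FAILED,
    DATACHECKSUMSWORKER_ABORTED,
    DATACHECKSUMSWORKER_SUCCESSFUL
} DatachecksumsWorkerResult;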

Also, this fails an assertion for me when checking the current state since
interrupts aren't held off.  I have a feeling it should've done that in my
version too as I look at it?  The attached holds off interrupts to pass that
(we clearly don't want a new state while we decide on operations anyways).

The tests in src/test/checksum were reliably failing on these for me but pass
with the attached 0002 applied.

> This changes the way the abort works. If the pg_enable/disable_checksums() is called, while the launcher is already
> busy changing the state, pg_enable/disable_checksums() will just set the new desired state in shared memory anyway. The
> launcher process will notice that the target state changed some time later, and restart from scratch.

The "notice period" is the same as before though, unless I'm missing something.

> A couple of other issues came up while doing that:
>
> - AbortProcessing() has two callers, one in code that runs in the launcher process, and another one in code that runs
> in the worker process. Is it really safe to use from the worker process? Calling ProcessAllDatabases() in the worker
> seems sketchy. (This is moot in this new patch version, as I removed AbortProcessing() altogether)

I think it was safe, but I agree that this version is a lot cleaner so it is as
you say moot.

> - Is it possible that the worker keeps running after the launcher has already exited, e.g. because of an ERROR or
> SIGTERM? If you then quickly call pg_enable_checksums() again, can you end up with two workers running at the same time?
> Is that bad?

True, I think there is a window for when that could happen, but worst case
should be that a database is checksummed twice?

> On 26/01/2021 23:00, Daniel Gustafsson wrote:
>>> On 22 Jan 2021, at 12:55, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>>> @@ -3567,6 +3571,27 @@ RelationBuildLocalRelation(const char *relname,
>>>>                relkind == RELKIND_MATVIEW)
>>>>                RelationInitTableAccessMethod(rel);
>>>> +       /*
>>>> +        * Set the data checksum state. Since the data checksum state can change at
>>>> +        * any time, the fetched value might be out of date by the time the
>>>> +        * relation is built.  DataChecksumsNeedWrite returns true when data
>>>> +        * checksums are: enabled; are in the process of being enabled (state:
>>>> +        * "inprogress-on"); are in the process of being disabled (state:
>>>> +        * "inprogress-off"). Since relhaschecksums is only used to track progress
>>>> +        * when data checksums are being enabled, and going from disabled to
>>>> +        * enabled will clear relhaschecksums before starting, it is safe to use
>>>> +        * this value for a concurrent state transition to off.
>>>> +        *
>>>> +        * If DataChecksumsNeedWrite returns false, and is concurrently changed to
>>>> +        * true then that implies that checksums are being enabled. Worst case,
>>>> +        * this will lead to the relation being processed for checksums even though
>>>> +        * each page written will have them already.  Performing this last shortens
>>>> +        * the window, but doesn't avoid it.
>>>> +        */
>>>> +       HOLD_INTERRUPTS();
>>>> +       rel->rd_rel->relhaschecksums = DataChecksumsNeedWrite();
>>>> +       RESUME_INTERRUPTS();
>>>> +
>>>>        /*
>>>>         * Okay to insert into the relcache hash table.
>>>>         *
>>>
>>> I grepped for relhaschecksums, and concluded that the value in the
>>> relcache isn't actually used for anything. Not so! In
>>> heap_create_with_catalog(), the actual pg_class row is constructed
>>> from the relcache entry, so the value set in
>>> RelationBuildLocalRelation() finds its way to pg_class. Perhaps it
>>> would be more clear to pass relhaschecksums directly as an argument
>>> to AddNewRelationTuple(). That way, the value in the relcache would
>>> be truly never used.
>> I might be thick (or undercaffeinated) but I'm not sure I follow. AddNewRelationTuple calls InsertPgClassTuple which
>> in turn avoids the
>> relcache entry.
>
> Ah, you're right, I misread AddNewRelationTuple. That means that the relhaschecksums field in the relcache is never
> used? That's a clear rule. The changes to formrdesc() and RelationBuildLocalRelation() seem unnecessary then, we can
> always initialize relhaschecksums to false in the relcache.

They probably are, but should they be kept as a just-in-case for future hackery
which may assume the relcache is at all points correct?

I've attached my changes as 0002 here on top of your patch, what do you think
about those?  There are some comment fixups as well, some stemming from your
patch and some from much earlier versions that I just hadn't seen until now.

--
Daniel Gustafsson        https://vmware.com/


Attachment

Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
The previous v35 had a tiny conflict in pg_class.h, the attached v36 (which is
a squash of the 2 commits in v35) fixes that.  No other changes are introduced
in this version.

--
Daniel Gustafsson        https://vmware.com/


Attachment

Re: Online checksums patch - once again

From
Heikki Linnakangas
Date:
(I may have said this before, but) My overall high-level impression of 
this patch is that it's really complex for a feature that you use maybe 
once in the lifetime of a cluster. I'm happy to review but I'm not 
planning to commit this myself. I don't object if some other committer 
picks this up (Magnus?).

Now to the latest patch version:

On 03/02/2021 18:15, Daniel Gustafsson wrote:
> The previous v35 had a tiny conflict in pg_class.h, the attached v36 (which is
> a squash of the 2 commits in v35) fixes that.  No other changes are introduced
> in this version.


>     /*
>      * Check to see if my copy of RedoRecPtr is out of date. If so, may have
>      * to go back and have the caller recompute everything. This can only
>      * happen just after a checkpoint, so it's better to be slow in this case
>      * and fast otherwise.
>      *
>      * Also check to see if fullPageWrites or forcePageWrites was just turned
>      * on, or if we are in the process of enabling checksums in the cluster;
>      * if we weren't already doing full-page writes then go back and recompute.
>      *
>      * If we aren't doing full-page writes then RedoRecPtr doesn't actually
>      * affect the contents of the XLOG record, so we'll update our local copy
>      * but not force a recomputation.  (If doPageWrites was just turned off,
>      * we could recompute the record without full pages, but we choose not to
>      * bother.)
>      */
>     if (RedoRecPtr != Insert->RedoRecPtr)
>     {
>         Assert(RedoRecPtr < Insert->RedoRecPtr);
>         RedoRecPtr = Insert->RedoRecPtr;
>     }
>     doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites || DataChecksumsOnInProgress());

Why does this use DataChecksumsOnInProgress() instead of 
DataChecksumsNeedWrite()? If checksums are enabled, you always need 
full-page writes, don't you? If not, then why is it needed in the 
inprogress-on state?

We also set doPageWrites in InitXLOGAccess(). That should match the 
condition above (although it doesn't matter for correctness).
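
To illustrate, the condition I would expect is simply (a sketch only, assuming
DataChecksumsNeedWrite() is the right predicate and the surrounding variables
are as in the snippet quoted above):

    doPageWrites = (Insert->fullPageWrites ||
                    Insert->forcePageWrites ||
                    DataChecksumsNeedWrite());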

> /*
>  * DataChecksumsNeedVerify
>  *        Returns whether data checksums must be verified or not
>  *
>  * Data checksums are only verified if they are fully enabled in the cluster.
>  * During the "inprogress-on" and "inprogress-off" states they are only
>  * updated, not verified (see datachecksumsworker.c for a longer discussion).
>  *
>  * This function is intended for callsites which have read data and are about
> * to perform checksum validation based on the result of this. To avoid the
> * risk of the checksum state changing between reading and performing the
>  * validation (or not), interrupts must be held off. This implies that calling
>  * this function must be performed as close to the validation call as possible
>  * to keep the critical section short. This is in order to protect against
>  * time of check/time of use situations around data checksum validation.
>  */
> bool
> DataChecksumsNeedVerify(void)
> {
>     Assert(InterruptHoldoffCount > 0);
>     return (LocalDataChecksumVersion == PG_DATA_CHECKSUM_VERSION);
> }

What exactly is the intended call pattern here? Something like this?

smgrread() a data page
HOLD_INTERRUPTS();
if (DataChecksumsNeedVerify())
{
   if (pg_checksum_page((char *) page, blkno) != expected)
     elog(ERROR, "bad checksum");
}
RESUME_INTERRUPTS();

That seems to be what the code currently does. What good does holding 
interrupts do here? If checksums were not fully enabled at the 
smgrread() call, the page might have incorrect checksums, and if the 
state transitions from inprogress-on to on between the smgrread() call 
and the DataChecksumsNeedVerify() call, you'll get an error. I think you 
need to hold the interrupts *before* the smgrread() call.
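
In other words, something like this (a sketch only; reln, forknum and page
stand in for whatever the caller has at hand):

    HOLD_INTERRUPTS();
    smgrread(reln, forknum, blkno, (char *) page);
    if (DataChecksumsNeedVerify())
    {
        /* read and verify happen under the same checksum state */
        if (pg_checksum_page((char *) page, blkno) != ((PageHeader) page)->pd_checksum)
            elog(ERROR, "bad checksum in block %u", blkno);
    }
    RESUME_INTERRUPTS();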

> /*
>  * Set checksum for a page in private memory.
>  *
>  * This must only be used when we know that no other process can be modifying
>  * the page buffer.
>  */
> void
> PageSetChecksumInplace(Page page, BlockNumber blkno)
> {
>     HOLD_INTERRUPTS();
>     /* If we don't need a checksum, just return */
>     if (PageIsNew(page) || !DataChecksumsNeedWrite())
>     {
>         RESUME_INTERRUPTS();
>         return;
>     }
> 
>     ((PageHeader) page)->pd_checksum = pg_checksum_page((char *) page, blkno);
>     RESUME_INTERRUPTS();
> }

The checksums state might change just after this call, before the caller 
has actually performed the smgrwrite() or smgrextend() call. The caller 
needs to hold interrupts across this function and the 
smgrwrite/smgrextend() call. It is a bad idea to HOLD_INTERRUPTS() here, 
because that just masks bugs where the caller isn't holding the 
interrupts. Same in PageSetChecksumCopy().
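
I.e. the caller side would then need to look roughly like this (a sketch only,
with placeholder variables):

    HOLD_INTERRUPTS();
    PageSetChecksumInplace(page, blocknum);
    /* the write happens under the same checksum state as the calculation */
    smgrwrite(reln, forknum, blocknum, (char *) page, false);
    RESUME_INTERRUPTS();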

- Heikki



Re: Online checksums patch - once again

From
Michael Paquier
Date:
On Tue, Feb 09, 2021 at 10:54:50AM +0200, Heikki Linnakangas wrote:
> (I may have said this before, but) My overall high-level impression of this
> patch is that it's really complex for a feature that you use maybe once in
> the lifetime of a cluster. I'm happy to review but I'm not planning to
> commit this myself. I don't object if some other committer picks this up
> (Magnus?).

I was just looking at the latest patch set as a matter of curiosity,
and I share that feeling.  I think that this is a lot of
complication in-core for what would be a one-time operation,
particularly knowing that there are other ways to do it already with
the offline checksum tool, even if that is more costly:
- Involve logical replication after initializing the new instance with
--data-checksums, or in an upgrade scenario with pg_upgrade.
- Involve physical replication: stop the standby cleanly, enable
checksums on it and do a switchover.

Another thing we could do is to improve pg_checksums with a parallel
mode.  The main design question would be how to distribute the I/O,
and that would mean balancing at least across tablespaces.
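
For instance (a hypothetical sketch only, not actual pg_checksums code;
scan_tablespace() and add_to_worker() are invented names), the relation
segment files could be dealt out round-robin per tablespace so that every
worker gets a share of each tablespace's I/O:

    typedef struct WorkerQueue
    {
        int    nfiles;
        char **files;
    } WorkerQueue;

    static void
    distribute_files(char **tablespace_dirs, int nts,
                     WorkerQueue *workers, int nworkers)
    {
        int    next = 0;

        for (int i = 0; i < nts; i++)
        {
            int     nfiles;
            char  **files = scan_tablespace(tablespace_dirs[i], &nfiles);

            /* spread this tablespace's files across all workers */
            for (int j = 0; j < nfiles; j++)
                add_to_worker(&workers[next++ % nworkers], files[j]);
        }
    }
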
--
Michael

Attachment
Hi,

On Wednesday, 10.02.2021 at 15:06 +0900, Michael Paquier wrote:
> On Tue, Feb 09, 2021 at 10:54:50AM +0200, Heikki Linnakangas wrote:
> > (I may have said this before, but) My overall high-level impression of this
> > patch is that it's really complex for a feature that you use maybe once in
> > the lifetime of a cluster. I'm happy to review but I'm not planning to
> > commit this myself. I don't object if some other committer picks this up
> > (Magnus?).
> 
> I was just looking at the latest patch set as a matter of curiosity,
> and I have a shared feeling.  

I think this still would be a useful feature; not least for the online
deactivation - having to shut down the instance is sometimes just not an
option in production, even for just a few seconds.

However, there is also the shoot-the-whole-database-into-WAL issue (at least,
that is what happens, AIUI) which has not been discussed that much
either; the patch allows throttling, but I think the impact on actual
production workloads is not very clear yet.

> I think that this is a lot of complication in-core for what would be a
> one-time operation, particularly knowing that there are other ways to
> do it already with the offline checksum tool, even if that is more
> costly: 
> - Involve logical replication after initializing the new instance with
> --data-checksums, or in an upgrade scenario with pg_upgrade.

Logical replication is still somewhat impractical for such a (possibly)
routine task, and I don't understand your pg_upgrade scenario; can you
expand on that a bit?

> - Involve physical replication: stop the standby cleanly, enable
> checksums on it and do a switchover.

I would like to focus on this, so I changed the subject in order not to
derail the online activation patch thread.

If this is something we support, then we should document it.

I have to admit that this possibility escaped me when we first committed
offline (de)activation; it was brought to my attention via 
https://twitter.com/samokhvalov/status/1281312586219188224 and the
following discussion.

So if we think this (to recap: shut down the standby, run pg_checksums
on it, start it up again, wait until it is back in sync, then
switchover) is a safe way to activate checksums on a streaming
replication setup, then we should document it I think. However, I have
only seen sorta hand-waving on this so far and no deeper analysis of
what could possibly go wrong (but doesn't).

Has anybody done some further work/tests on this and/or has something written
up to contribute to the documentation? Or do we think this is not
appropriate to document? I think once we agree this is safe, it is not
more complicated than the rsync-the-standby-after-pg_upgrade recipe we
did document.

> Another thing we could do is to improve pg_checksums with a parallel
> mode.  The main design question would be how to distribute the I/O,
> and that would mean balancing at least across tablespaces.

Right. I thought about this a while ago, but didn't have time to work on
it so far.


Michael

-- 
Michael Banck
Project Manager / Senior Consultant, credativ GmbH




Re: Online checksums patch - once again

From
Magnus Hagander
Date:
On Tue, Feb 9, 2021 at 9:54 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> (I may have said this before, but) My overall high-level impression of
> this patch is that it's really complex for a feature that you use maybe
> once in the lifetime of a cluster. I'm happy to review but I'm not
> planning to commit this myself. I don't object if some other committer
> picks this up (Magnus?).

A fairly large amount of this complexity comes out of the fact that it
now supports restarting and tracks checksums on a per-table basis. We
skipped this in the original patch for exactly this reason (that's not
to say there isn't a fair amount of complexity even without it, but it
did substantially increase both the size and the complexity of the
patch), but in the review of that i was specifically asked for having
that added. I personally don't think it's worth that complexity but at
the time that seemed to be a pretty strong argument. So I'm not
entirely sure how to move forward with that...

is your impression that it would still be too complicated, even without that?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: Online checksums patch - once again

From
Bruce Momjian
Date:
On Wed, Feb 10, 2021 at 03:25:58PM +0100, Magnus Hagander wrote:
> On Tue, Feb 9, 2021 at 9:54 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > (I may have said this before, but) My overall high-level impression of
> > this patch is that it's really complex for a feature that you use maybe
> > once in the lifetime of a cluster. I'm happy to review but I'm not
> > planning to commit this myself. I don't object if some other committer
> > picks this up (Magnus?).
> 
> A fairly large amount of this complexity comes out of the fact that it
> now supports restarting and tracks checksums on a per-table basis. We
> skipped this in the original patch for exactly this reason (that's not
> to say there isn't a fair amount of complexity even without it, but it
> did substantially increase both the size and the complexity of the
> patch), but in the review of that i was specifically asked for having
> that added. I personally don't think it's worth that complexity but at
> the time that seemed to be a pretty strong argument. So I'm not
> entirely sure how to move forward with that...
> 
> is your impression that it would still be too complicated, even without that?

I was wondering why this feature has stalled for so long --- now I know.
This does highlight the risk of implementing too many additions to a
feature. I am working against this dynamic in the cluster file
encryption feature I am working on.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Online checksums patch - once again

From
Heikki Linnakangas
Date:
On 10/02/2021 16:25, Magnus Hagander wrote:
> On Tue, Feb 9, 2021 at 9:54 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> (I may have said this before, but) My overall high-level impression of
>> this patch is that it's really complex for a feature that you use maybe
>> once in the lifetime of a cluster. I'm happy to review but I'm not
>> planning to commit this myself. I don't object if some other committer
>> picks this up (Magnus?).
> 
> A fairly large amount of this complexity comes out of the fact that it
> now supports restarting and tracks checksums on a per-table basis. We
> skipped this in the original patch for exactly this reason (that's not
> to say there isn't a fair amount of complexity even without it, but it
> did substantially increase both the size and the complexity of the
> patch), but in the review of that i was specifically asked for having
> that added. I personally don't think it's worth that complexity but at
> the time that seemed to be a pretty strong argument. So I'm not
> entirely sure how to move forward with that...
> 
> is your impression that it would still be too complicated, even without that?

I'm not sure. It would certainly be a lot better.

Wrt. restartability, I'm also not very happy with the way that works - 
or rather doesn't :-) - in this patch. After shutting down and 
restarting the cluster, you have to manually call 
pg_enable_data_checksums() again to restart the checksumming process.

- Heikki



Re: Online checksums patch - once again

From
Bruce Momjian
Date:
On Wed, Feb 10, 2021 at 01:26:18PM -0500, Bruce Momjian wrote:
> On Wed, Feb 10, 2021 at 03:25:58PM +0100, Magnus Hagander wrote:
> > A fairly large amount of this complexity comes out of the fact that it
> > now supports restarting and tracks checksums on a per-table basis. We
> > skipped this in the original patch for exactly this reason (that's not
> > to say there isn't a fair amount of complexity even without it, but it
> > did substantially increase both the size and the complexity of the
> > patch), but in the review of that i was specifically asked for having
> > that added. I personally don't think it's worth that complexity but at
> > the time that seemed to be a pretty strong argument. So I'm not
> > entirely sure how to move forward with that...
> > 
> > is your impression that it would still be too complicated, even without that?
> 
> I was wondering why this feature has stalled for so long --- now I know.
> This does highlight the risk of implementing too many additions to a
> feature. I am working against this dynamic in the cluster file
> encryption feature I am working on.

Oh, I think another reason this patchset has had problems is related to
something I mentioned in 2018:

    https://www.postgresql.org/message-id/20180801163613.GA2267@momjian.us

    This patchset is weird because it is perhaps our first case of trying to
    change the state of the server while it is running.  We just don't have
    an established protocol for how to orchestrate that, so we are limping
    along toward a solution.  Forcing a restart is probably part of that
    primitive orchestration.  We will probably have similar challenges if we
    ever allowed Postgres to change its data format on the fly.  These
    challenges are one reason pg_upgrade only modifies the new cluster,
    never the old one.

I don't think anyone has done anything wrong --- rather, it is what we
are _trying_ to do that is complex.  Adding restartability to this just
added to the challenge.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 9 Feb 2021, at 09:54, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> (I may have said this before, but) My overall high-level impression of this patch is that it's really complex for a
> feature that you use maybe once in the lifetime of a cluster.

The frequency of using a feature seems like a suboptimal metric of the value
and usefulness of it.  I don't disagree that this patch is quite invasive and
complicated though; that's a side-effect of performing global state changes in
a distributed system which is hard to get around.

As was discussed downthread, some of the complexity stems from restartability
and the catalog state persistence.  The attached version does away with this to
see how big the difference is.  Personally I don't think it's enough to move
the needle but YMMV.

> I'm happy to review but I'm not planning to commit this myself.

Regardless, thanks for all review and thoughtful comments!

>>     doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites || DataChecksumsOnInProgress());
>
> Why does this use DataChecksumsOnInProgress() instead of DataChecksumsNeedWrite()? If checksums are enabled, you
> always need full-page writes, don't you? If not, then why is it needed in the inprogress-on state?

Correct, that's a thinko, it should be DataChecksumsNeedWrite.

> We also set doPageWrites in InitXLOGAccess(). That should match the condition above (although it doesn't matter for
> correctness).

Good point, fixed.

>> I think you need to hold the interrupts *before* the smgrread() call.

Fixed.

>> /*
>> * Set checksum for a page in private memory.
>> *
>> * This must only be used when we know that no other process can be modifying
>> * the page buffer.
>> */
>> void
>> PageSetChecksumInplace(Page page, BlockNumber blkno)
>> {
>>     HOLD_INTERRUPTS();
>>     /* If we don't need a checksum, just return */
>>     if (PageIsNew(page) || !DataChecksumsNeedWrite())
>>     {
>>         RESUME_INTERRUPTS();
>>         return;
>>     }
>>     ((PageHeader) page)->pd_checksum = pg_checksum_page((char *) page, blkno);
>>     RESUME_INTERRUPTS();
>> }
>
> The checksums state might change just after this call, before the caller has actually performed the smgrwrite() or
> smgrextend() call. The caller needs to hold interrupts across this function and the smgrwrite/smgrextend() call. It is a
> bad idea to HOLD_INTERRUPTS() here, because that just masks bugs where the caller isn't holding the interrupts. Same in
> PageSetChecksumCopy().

Looking at the cases which could happen here given the state change below:

PageSetChecksumInPlace
  B <---- state change
smgrwrite

The possible state transitions which can happen, and the consequences of them
are:

1) B = (off -> inprogress-on): We will write without a checksum.  This relation
will then be processed by the DatachecksumsWorker.  Readers will not verify
checksums.

2) B = (inprogress-on -> on): We will write with a checksum.  Both those states
write checksums.  Readers will verify the checksum.

3) B = (on -> inprogress-off): We will write with a checksum.  Both these
states write checksums.  Readers will not verify checksums.

4) B = (inprogress-off -> off): We will write with a checksum which is wasteful
and strictly speaking incorrect.  Readers will not verify checksums.

The above assume there is only a single state transition between the call to
PageSetChecksumInPlace and smgrwrite/extend, which of course isn't guaranteed
to be the case (albeit a lot more likely than two).

The (off -> inprogress-on -> on) and (inprogress-on -> on -> inprogress-off)
transitions cannot happen here since the DatachecksumsWorker will wait for
ongoing transactions before a transition to on can be initiated.

11) B = (on -> inprogress-off -> off): the checksum will be written when it
shouldn't be, but readers won't verify it.

22) B = (inprogress-off -> off -> inprogress-on): checksums are written without
being verified in both these states.

So these corner cases can happen but ending up with an incorrect verification is
likely hard, which is probably why they haven't shaken out during testing.  I've
updated the patch to hold interrupts across the smgrwrite/extend call.

--
Daniel Gustafsson        https://vmware.com/


Attachment

Re: Online checksums patch - once again

From
Daniel Gustafsson
Date:
> On 11 Feb 2021, at 14:10, Bruce Momjian <bruce@momjian.us> wrote:
>
> On Wed, Feb 10, 2021 at 01:26:18PM -0500, Bruce Momjian wrote:
>> On Wed, Feb 10, 2021 at 03:25:58PM +0100, Magnus Hagander wrote:
>>> A fairly large amount of this complexity comes out of the fact that it
>>> now supports restarting and tracks checksums on a per-table basis. We
>>> skipped this in the original patch for exactly this reason (that's not
>>> to say there isn't a fair amount of complexity even without it, but it
>>> did substantially increase both the size and the complexity of the
>>> patch), but in the review of that i was specifically asked for having
>>> that added. I personally don't think it's worth that complexity but at
>>> the time that seemed to be a pretty strong argument. So I'm not
>>> entirely sure how to move forward with that...
>>>
>>> is your impression that it would still be too complicated, even without that?
>>
>> I was wondering why this feature has stalled for so long --- now I know.
>> This does highlight the risk of implementing too many additions to a
>> feature. I am working against this dynamic in the cluster file
>> encryption feature I am working on.
>
> Oh, I think another reason this patchset has had problems is related to
> something I mentioned in 2018:
>
>     https://www.postgresql.org/message-id/20180801163613.GA2267@momjian.us
>
>     This patchset is weird because it is perhaps our first case of trying to
>     change the state of the server while it is running.  We just don't have
>     an established protocol for how to orchestrate that, so we are limping
>     along toward a solution.  Forcing a restart is probably part of that
>     primitive orchestration.  We will probably have similar challenges if we
>     ever allowed Postgres to change its data format on the fly.  These
>     challenges are one reason pg_upgrade only modifies the new cluster,
>     never the old one.
>
> I don't think anyone has done anything wrong --- rather, it is what we
> are _trying_ to do that is complex.

Global state changes in a cluster are complicated, and that is unlikely to ever
change.  By writing patches that attempt such state changes we can see which
pieces of infrastructure we're lacking to reduce complexity.  A good example is
the ProcSignalBarrier work that Andres and Robert did, inspired in part by this
patch IIUC.  The more we do, the more we learn.

--
Daniel Gustafsson        https://vmware.com/