Re: Online checksums patch - once again - Mailing list pgsql-hackers

From Daniel Gustafsson
Subject Re: Online checksums patch - once again
Date
Msg-id D9F775AB-B78E-492D-A5C4-DA8644984D57@yesql.se
Whole thread Raw
In response to Re: Online checksums patch - once again  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: Online checksums patch - once again
List pgsql-hackers
> On 22 Jan 2021, at 12:55, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> I read through the latest patch, v31-0001-Support-checksum-enable-disable-in-a-running-clu.patch. Some comments
below:

Thanks for reviewing!  Attached is a v32 which address most of your comments,
see below.

> On 19/01/2021 14:32, Daniel Gustafsson wrote:
>> +       /*
>> +        * Hold interrupts for the duration of xlogging to avoid the state of data
>> +        * checksums changing during the processing which would later the premise
>> +        * for xlogging hint bits.
>> +        */
>
> Sentence sense does not make.

Indeed it doesn't, the "later" is a misspelled "alter".  Essentially, the idea
with the comment is to explain that interrupts should be held during logging so
that the decision to log hintbits is valid for the duration of the logging.
Does changing to "alter" suffice?  It's fixed like so in the attached, but
perhaps there are better wordings for this.

>> @@ -904,6 +916,7 @@ static void SetLatestXTime(TimestampTz xtime);
>> static void SetCurrentChunkStartTime(TimestampTz xtime);
>> static void CheckRequiredParameterValues(void);
>> static void XLogReportParameters(void);
>> +static void XlogChecksums(ChecksumType new_type);
>> static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI,
>>                                                                TimeLineID prevTLI);
>> static void LocalSetXLogInsertAllowed(void);
>
> Spelling: make it "XLogChecksums" for consistency.

Good point, fixed.

>> /*
>> * DataChecksumsNeedWrite
>> *        Returns whether data checksums must be written or not
>> *
>> * Returns true iff data checksums are enabled or are in the process of being
>> * enabled. In case data checksums are currently being enabled we must write
>> * the checksum even though it's not verified during this stage. Interrupts
>> * need to be held off by the caller to ensure that the returned state is
>> * valid for the duration of the intended processing.
>> */
>> bool
>> DataChecksumsNeedWrite(void)
>> {
>>     Assert(InterruptHoldoffCount > 0);
>>     return (LocalDataChecksumVersion == PG_DATA_CHECKSUM_VERSION ||
>>             LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION ||
>>             LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION);
>> }
>
> Can you be more precise on the "duration of the intended processing"? It means, until you have actually written out
thepage, or something like that, right? The similar explanation in DataChecksumsNeedVerify() is easier to understand.
Thetwo functions are similar, it would be good to phrase the comments similarly, so that you can quickly see the
differencebetween the two. 

I've reworded this to (hopefully) be more clear, and to be more like the
comment for DataChecksumsNeedVerify.

>> /*
>> * SetDataChecksumsOnInProgress
>> *        Sets the data checksum state to "inprogress-on" to enable checksums
>> *
>> * In order to start the process of enabling data checksums in a running
>> * cluster the data_checksum_version state must be changed to "inprogress-on".
>> * This state requires data checksums to be written but not verified. The state
>> * transition is performed in a critical section in order to provide crash
>> * safety, and checkpoints are held off. When the emitted procsignalbarrier
>> * has been absorbed by all backends we know that the cluster has started to
>> * enable data checksums.
>> */
>
> The two "in order" are superfluous, it's more concise to say just "To start the process ..." (I was made aware of
thatby Jürgen Purtz's recent patches that removed a couple of "in order"s from the docs as unnecessary). 

Agreed, that's better.

> It's a bit confusing to talk about the critical section and procsignalbarrier here. Does the caller need to wait for
theprocsignalbarrier? No, that's just explaining what the function does internally. Maybe move that explanation inside
thefunction, and say here something like "This function blocks until all processes have acknowledged the state change"
orsomething like that. 

Fixed.

>> /*
>> * SetDataChecksumsOn
>> *        Enables data checksums cluster-wide
>> *
>> * Enabling data checksums is performed using two barriers, the first one
>> * sets the checksums state to "inprogress-on" (which is performed by
>> * SetDataChecksumsOnInProgress()) and the second one to "on" (performed here).
>> * During "inprogress-on", checksums are written but not verified. When all
>> * existing pages are guaranteed to have checksums, and all new pages will be
>> * initiated with checksums, the state can be changed to "on".
>> */
>
> Perhaps the explanation for how these SetDataChecksumsOn() and SetDataChecksumsOnInProgress() functions work together
shouldbe moved to one place. For example, explain both functions here at SetDataChecksumsOn(), and just have a "see
SetDataChecksumsOn()"in the other function, or no comment at all if they're kept next to each other. As it stands, you
haveto read both comments and piece together the the big picture in your head. Maybe add a "see datachecksumsworker.c"
here,since there's a longer explanation of the overall mechanism there. 

Fixed.  I did keep both comments since the function are too long to fit the
comment and the code on screen, but refer to one from the other.

>> +/*
>> + * Disables checksums for the cluster, unless already disabled.
>> + *
>> + * Has immediate effect - the checksums are set to off right away.
>> + */
>> +Datum
>> +disable_data_checksums(PG_FUNCTION_ARGS)
>> +{
>> +       if (!superuser())
>> +               ereport(ERROR,
>> +                               (errmsg("must be superuser")));
>> +
>> +       StartDatachecksumsWorkerLauncher(false, 0, 0);
>> +
>> +       PG_RETURN_BOOL(true);
>> +}
>
> The documentation says "Returns <literal>false</literal> in case data checksums are disabled already", but the
functionalways returns true. 

Fixed.

> The enable_data_checksum() function also returns a constant 'true'; why not make it void?

These functions were void until v20 of this patch when Robert correctly pointed
out that using ereport to communicate status was a poor choice (there were
NOTICEs IIRC).  v23 however rolled back all status checking in response to a
review by you, leaving the functions to only call the worker.  The returnvalue
was however not changed back to void at that point which would've been the
reasonable choice.  Fixed.

> The "has immediate effect" comment seems wrong, given that it actually launches a worker process.

Fixed.

>> +       /*
>> +        * If the launcher is already started, the only operation we can perform
>> +        * is to cancel it iff the user requested for checksums to be disabled.
>> +        * That doesn't however mean that all other cases yield an error, as some
>> +        * might be perfectly benevolent.
>> +        */
>
> This comment is a bit hard to understand. Maybe something like
>
> "If the launcher is already started, we cannot launch a new one. But if the user requested for checksums to be
disabled,we can cancel it." 

Agreed, thats better. Fixed.

>> +       if (DatachecksumsWorkerShmem->launcher_started)
>> +       {
>> +               if (DatachecksumsWorkerShmem->abort)
>> +               {
>> +                       ereport(NOTICE,
>> +                                       (errmsg("data checksum processing is concurrently being aborted, please
retry")));
>> +
>> +                       LWLockRelease(DatachecksumsWorkerLock);
>> +                       return;
>> +               }
>
> If it's being aborted, and the user requested to disable checksum, can we leave out the NOTICE? I guess not, because
theworker process won't check the 'abort' flag, if it's already finished processing all data pages. 

Correct, since we might otherwise silently not disable checksums on a request
to do so.

>> +               /*
>> +                * If the launcher is started data checksums cannot be on or off, but
>> +                * it may be in an inprogress state. Since the state transition may
>> +                * not have happened yet (in case of rapidly initiated checksum enable
>> +                * calls for example) we inspect the target state of the currently
>> +                * running launcher.
>> +                */
>
> This comment contradicts itself. If the state transition has not happened yet, then contrary to the first sentence,
datachecksums *are* currently on or off. 

Yes, indeed.  I've reworded this comment but I'm not sure it's much of an
improvement I'm afraid.

>> +               if (enable_checksums)
>> +               {
>> +                       /*
>> +                        * If we are asked to enable checksums when they are already being
>> +                        * enabled, there is nothing to do so exit.
>> +                        */
>> +                       if (DatachecksumsWorkerShmem->enable_checksums)
>> +                       {
>> +                               LWLockRelease(DatachecksumsWorkerLock);
>> +                               return;
>> +                       }
>> +
>> +                       /*
>> +                        * Disabling checksums is likely to be a very quick operation in
>> +                        * many cases so trying to abort it to save the checksums would
>> +                        * run the risk of race conditions.
>> +                        */
>> +                       else
>> +                       {
>> +                               ereport(NOTICE,
>> +                                               (errmsg("data checksums are concurrently being disabled, please
retry")));
>> +
>> +                               LWLockRelease(DatachecksumsWorkerLock);
>> +                               return;
>> +                       }
>> +
>> +                       /* This should be unreachable */
>> +                       Assert(false);
>> +               }
>> +               else if (!enable_checksums)
>> +               {
>> +                       /*
>> +                        * Data checksums are already being disabled, exit silently.
>> +                        */
>> +                       if (DataChecksumsOffInProgress())
>> +                       {
>> +                               LWLockRelease(DatachecksumsWorkerLock);
>> +                               return;
>> +                       }
>> +
>> +                       DatachecksumsWorkerShmem->abort = true;
>> +                       LWLockRelease(DatachecksumsWorkerLock);
>> +                       return;
>> +               }
>
> The Assert seems unnecessary. The "if (!enable_checkums)" also seems unnecessary, could be just "else".

Fixed.

>> +       /*
>> +        * The launcher is currently not running, so we need to query the system
>> +        * data checksum state to determine how to proceed based on the requested
>> +        * target state.
>> +        */
>
> "query the system" makes me think "checking the catalogs" or similar, but this just looks at a few variables in
sharedmemory. 

Reworded.

>> +/*
>> + * ShutdownDatachecksumsWorkerIfRunning
>> + *             Request shutdown of the datachecksumsworker
>> + *
>
> This function is unused.

Removed.

>> +/*
>> + * launcher_cancel_handler
>> + *
>> + * Internal routine for reacting to SIGINT and flagging the worker to abort.
>> + * The worker won't be interrupted immediately but will check for abort flag
>> + * between each block in a relation.
>> + */
>> +static void
>> +launcher_cancel_handler(SIGNAL_ARGS)
>> +{
>> +       LWLockAcquire(DatachecksumsWorkerLock, LW_EXCLUSIVE);
>> +       DatachecksumsWorkerShmem->abort = true;
>> +       LWLockRelease(DatachecksumsWorkerLock);
>> +}
>
> Acquiring an lwlock in signal handler is not safe.
>
> Since this is a signal handler, it might still get called after the process has finished processing, and has already
setDatachecksumsWorkerShmem->launcher_started = false. That seems like an unexpected state. 

Good point, I've fixed this to use a proper signal handler flag.  While doing
so I realized there was a race window when checksums were disabled while the
worker was running, but after it had processed the final buffer.  Since the
abort flag was only checked during buffer processing it could be missed leading
to the worker ending with checksums enabled when they should be disabled.  This
is likely to be much more common in the TAP tests than in real production use,
which is good since it was easy to trigger.  The attached fixes this by
re-checking the abort flag and I am no longer able to trigger the window during
tests.

>> +/*
>> + * WaitForAllTransactionsToFinish
>> + *             Blocks awaiting all current transactions to finish
>> + *
>> + * Returns when all transactions which are active at the call of the function
>> + * have ended, or if the postmaster dies while waiting. If the postmaster dies
>> + * the abort flag will be set to indicate that the caller of this shouldn't
>> + * proceed.
>> + */
>
> The caller of this function doesn't check the 'abort' flag AFAICS. I think you could just use use WL_EXIT_ON_PM_DEATH
toerror out on postmaster death. 

Good point, that simplifies the code.  It's erroring out in the same way as
other checks for postmaster death.

>> +       while (true)
>> +       {
>> +               int                     processed_databases = 0;
>> +
>> +               /*
>> +                * Get a list of all databases to process. This may include databases
>> +                * that were created during our runtime.
>> +                *
>> +                * Since a database can be created as a copy of any other database
>> +                * (which may not have existed in our last run), we have to repeat
>> +                * this loop until no new databases show up in the list. Since we wait
>> +                * for all pre-existing transactions finish, this way we can be
>> +                * certain that there are no databases left without checksums.
>> +                */
>> +               DatabaseList = BuildDatabaseList();
>
> I think it's unnecessary to wait out the transactions on the first iteration of this loop. There must be at least one
database,so there will always be at least two iterations. 
>
> I think it would be more clear to move the WaitForAllTransactionsToFinish() from BuildDatabaseList() to the callers.

I've tried this in the attached.  It does increase the window between waiting
for transactions to finish and grabbing the list, but that might be negligable?

>> @@ -3567,6 +3571,27 @@ RelationBuildLocalRelation(const char *relname,
>>                relkind == RELKIND_MATVIEW)
>>                RelationInitTableAccessMethod(rel);
>> +       /*
>> +        * Set the data checksum state. Since the data checksum state can change at
>> +        * any time, the fetched value might be out of date by the time the
>> +        * relation is built.  DataChecksumsNeedWrite returns true when data
>> +        * checksums are: enabled; are in the process of being enabled (state:
>> +        * "inprogress-on"); are in the process of being disabled (state:
>> +        * "inprogress-off"). Since relhaschecksums is only used to track progress
>> +        * when data checksums are being enabled, and going from disabled to
>> +        * enabled will clear relhaschecksums before starting, it is safe to use
>> +        * this value for a concurrent state transition to off.
>> +        *
>> +        * If DataChecksumsNeedWrite returns false, and is concurrently changed to
>> +        * true then that implies that checksums are being enabled. Worst case,
>> +        * this will lead to the relation being processed for checksums even though
>> +        * each page written will have them already.  Performing this last shortens
>> +        * the window, but doesn't avoid it.
>> +        */
>> +       HOLD_INTERRUPTS();
>> +       rel->rd_rel->relhaschecksums = DataChecksumsNeedWrite();
>> +       RESUME_INTERRUPTS();
>> +
>>        /*
>>         * Okay to insert into the relcache hash table.
>>         *
>
> I grepped for relhashcheckums, and concluded that the value in the relcache isn't actually used for anything. Not so!
Inheap_create_with_catalog(), the actual pg_class row is constructed from the relcache entry, so the value set in
RelationBuildLocalRelation()finds its way to pg_class. Perhaps it would be more clear to pass relhachecksums directly
asan argument to AddNewRelationTuple(). That way, the value in the relcache would be truly never used. 

I might be thick (or undercaffeinated) but I'm not sure I follow.
AddNewRelationTuple calls InsertPgClassTuple which in turn avoids the relcache
entry.

--
Daniel Gustafsson        https://vmware.com/


Attachment

pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: shared tempfile was not removed on statement_timeout
Next
From: Magnus Hagander
Date:
Subject: Re: mkid reference