Re: Changing the state of data checksums in a running cluster - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Changing the state of data checksums in a running cluster
Date
Msg-id 300822bd-daeb-49e5-8af7-d7ce3af65adc@iki.fi
Whole thread Raw
In response to Re: Changing the state of data checksums in a running cluster  (Daniel Gustafsson <daniel@yesql.se>)
Responses Re: Changing the state of data checksums in a running cluster
List pgsql-hackers
Here's a quick, non-exhaustive review of v20260316 of the patch. I 
haven't followed the thread, so forgive me if some of these things have 
already been discussed.

I don't see any major issues, just a bunch of small things:

> +/*
> + * Checksum version 0 is used for when data checksums are disabled (OFF).
> + * PG_DATA_CHECKSUM_VERSION defines that data checksums are enabled in the
> + * cluster and PG_DATA_CHECKSUM_INPROGRESS_{ON|OFF}_VERSION defines that data
> + * checksums are either currently being enabled or disabled.
> + */
> +typedef enum ChecksumType
> +{
> +    PG_DATA_CHECKSUM_OFF = 0,
> +    PG_DATA_CHECKSUM_VERSION,
> +    PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION,
> +    PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION
> +} ChecksumType;

Naming: This is called "VERSION", but also "Type", and I'm a little 
confused. I think the field was called "version" because we envisioned 
that we might want to use a different checksum algorithm in the future, 
which could then be indicated by a different checksum version number. If 
we do that in the future, how will these values look like?

> @@ -831,9 +852,10 @@ XLogInsertRecord(XLogRecData *rdata,
>           * only happen just after a checkpoint, so it's better to be slow in
>           * this case and fast otherwise.
>           *
> -         * Also check to see if fullPageWrites was just turned on or there's a
> -         * running backup (which forces full-page writes); if we weren't
> -         * already doing full-page writes then go back and recompute.
> +         * Also check to see if fullPageWrites was just turned on, there's a
> +         * running backup or if checksums are enabled (all of which forces
> +         * full-page writes); if we weren't already doing full-page writes
> +         * then go back and recompute.
>           *
>           * If we aren't doing full-page writes then RedoRecPtr doesn't
>           * actually affect the contents of the XLOG record, so we'll update

It's true that if checksums are enabled, we must do full-page writes. 
But that's already included in the "fullpageWrites" flag. We're not 
specifically checking whether checksums are enabled here, so I find this 
comment change more confusing than helpful.

> void
> SetDataChecksumsOnInProgress(void)
> {
>     uint64        barrier;
> 
>     Assert(ControlFile != NULL);
> 
>     /*
>      * The state transition is performed in a critical section with
>      * checkpoints held off to provide crash safety.
>      */
>     START_CRIT_SECTION();
>     MyProc->delayChkptFlags |= DELAY_CHKPT_START;
> 
>     XLogChecksums(PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
> 
>     SpinLockAcquire(&XLogCtl->info_lck);
>     XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION;
>     SpinLockRelease(&XLogCtl->info_lck);
> 
>     barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_ON);
> 
>     MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
>     END_CRIT_SECTION();
> 
>     /*
>      * Update the controlfile before waiting since if we have an immediate
>      * shutdown while waiting we want to come back up with checksums enabled.
>      */
>     LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
>     ControlFile->data_checksum_version = PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION;
>     UpdateControlFile();
>     LWLockRelease(ControlFileLock);
> 
>     /*
>      * Await state change in all backends to ensure that all backends are in
>      * "inprogress-on". Once done we know that all backends are writing data
>      * checksums.
>      */
>     WaitForProcSignalBarrier(barrier);
> }

If a checkpoint starts, finishes, and you then crash, all between the 
END_CRIT_SECTION() and LWLockAcquire here, are you left with with wrong 
'data_checksum_version' in the control file? (Highly theoretical, the 
window is minuscule, but still)

> +/*
> + * InitLocalControlData
> + *
> + * Set up backend local caches of controldata variables which may change at
> + * any point during runtime and thus require special cased locking. So far
> + * this only applies to data_checksum_version, but it's intended to be general
> + * purpose enough to handle future cases.
> + */
> +void
> +InitLocalDataChecksumVersion(void)
> +{
> +    SpinLockAcquire(&XLogCtl->info_lck);
> +    SetLocalDataChecksumVersion(XLogCtl->data_checksum_version);
> +    SpinLockRelease(&XLogCtl->info_lck);
> +}

Function name in the comment doesn't match.

> +        ereport(WARNING,
> +                errmsg("data checksums state has been set to off"),
> +                errhint("If checksums were being enabled during shutdown then processing must be manually
restarted."));

This is a little unclear. If we reach this, we know that the checkum 
calculation was interrupted, no need to caveat it with "if checksums 
were being enabled", right? Maybe something like:

WARNING: enabling data checksums was interrupted
HINT: The checksum processing must be manually restarted

I feel we're missing a term for the process of enabling checkums. 
"checksum enablement"? "initial checksum calculation"? In some comments 
it's called "checksumming".

>  * The DataChecksumsWorker will compile a list of databases which exist at the
>  * start of checksumming, and once all are processed will regenerate the list
>  * and start over processing any new entries. Once there are no new entries on
>  * the list, processing will end.  All databases MUST BE successfully processed
>  * in order for data checksums to be enabled, the only exception are databases
>  * which are dropped before having been processed.

Why does the list need to be regenerated? Any newly created databases 
would already have checksums enabled, right? Is this because you might 
use database that hadn't been fully processed yet as the template? Would 
be good to mention if that's the reason. Could CREATE DATABASE calculate 
the checksums when it copies the files?

>  * means that connected backends will change state at different times.  If
>  * waiting for a barrier is done during startup, for example during replay, it
>  * is important to realize that any locks held by the startup process might
>  * cause deadlocks if backends end up waiting for those locks while startup
>  * is waiting for a procsignalbarrier.

Don't we absorb procsignalbarriers while waiting on a lock? Or does this 
mean LWlocks?

>  * Backends transition Bd -> Bi via a procsignalbarrier which is emitted by the
>  * DataChecksumsLauncher.  When all backends have acknowledged the barrier then
>  * Bd will be empty and the next phase can begin: calculating and writing data
>  * checksums with DataChecksumsWorkers.  When the DataChecksumsWorker processes
>  * have finished writing checksums on all pages, data checksums are enabled
>  * cluster-wide via another procsignalbarrier. There are four sets of backends
>  * where Bd shall be an empty set:
>  *
>  * Bg: Backend updating the global state and emitting the procsignalbarrier
>  * Bd: Backends in "off" state
>  * Be: Backends in "on" state
>  * Bi: Backends in "inprogress-on" state
>  *
>  * Backends in Bi and Be will write checksums when modifying a page, but only
>  * backends in Be will verify the checksum during reading. The Bg backend is
>  * blocked waiting for all backends in Bi to process interrupts and move to
>  * Be. Any backend starting while Bg is waiting on the procsignalbarrier will
>  * observe the global state being "on" and will thus automatically belong to
>  * Be.  Checksums are enabled cluster-wide when Bi is an empty set. Bi and Be
>  * are compatible sets while still operating based on their local state as
>  * both write data checksums.

Isn't "Bg" always the data checksums launcher process? In the previous 
paragraph before this, you didn't list the data checksums launcher as a 
separate "set".

> /*
>  * Configuration of conditions which must match when absorbing a procsignal
>  * barrier during data checksum enable/disable operations.  A single function
>  * is used for absorbing all barriers, and the set of conditions to use is
>  * looked up in the checksum_barriers struct.  The struct member for the target
>  * state defines which state the backend must currently be in, and which it
>  * must not be in.
>  *
>  * The reason for this explicit checking is to ensure that processing cannot
>  * be started such that it breaks the assumptions of the state machine.  See
>  * datachecksumsworker.c for a lengthy discussion on these states.

This is "datachecksumworker.c", aka "datachecksum_state.c", so this "See 
datachecksumworker.c" refers to itself.

>  *
>  * MAX_BARRIER_CONDITIONS must match largest number of sets in barrier_eq and
>  * barrier_ne in the below checksum_barriers definition.
>  */
> #define MAX_BARRIER_CONDITIONS 2
> typedef struct ChecksumBarrierCondition
> {
>     /* The target state of the barrier */
>     int            target;
>     /* A set of states in which at least one MUST match the current state */
>     int            barrier_eq[MAX_BARRIER_CONDITIONS];
>     /* The number of elements in the barrier_eq set */
>     int            barrier_eq_sz;
>     /* A set of states which all MUST NOT match the current state */
>     int            barrier_ne[MAX_BARRIER_CONDITIONS];
>     /* The number of elements in the barrier_ne set */
>     int            barrier_ne_sz;
> } ChecksumBarrierCondition;

I don't understand the logic here. Why do you need both 'barrier_eq' and 
'barrier_ne'? Wouldn't it be enough to just list the allowed states, and 
everything else is not allowed?

> static const ChecksumBarrierCondition checksum_barriers[4] =
> {
>     {PG_DATA_CHECKSUM_OFF, {PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION, PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION}, 2,
{PG_DATA_CHECKSUM_VERSION},1},
 
>     {PG_DATA_CHECKSUM_VERSION, {PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION}, 1, {0}, 0},
>     {PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION, {0}, 0, {PG_DATA_CHECKSUM_VERSION}, 1},
>     {PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION, {PG_DATA_CHECKSUM_VERSION}, 1, {0}, 0},
> };

This is pretty jarring to read. Maybe use C99 designated initializers, 
or at least split over multiple lines. Or a plain array of of allowed 
transitions, (current, target) -> bool. Or a function with switch-case 
statement.

>     /*
>      * If the postmaster crashed we cannot end up with a processed database so
>      * we have no alternative other than exiting. When enabling checksums we
>      * won't at this time have changed the pg_control version to enabled so
>      * when the cluster comes back up processing will have to be restarted.
>      * When disabling, the pg_control version will be set to off before this
>      * so when the cluster comes up checksums will be off as expected.
>      */
>     if (status == BGWH_POSTMASTER_DIED)
>         ereport(FATAL,
>                 errcode(ERRCODE_ADMIN_SHUTDOWN),
>                 errmsg("cannot enable data checksums without the postmaster process"),
>                 errhint("Restart the database and restart data checksum processing by calling
pg_enable_data_checksums()."));

Do we reach here when disabling checksums? The comment suggests so, but 
in that case the hint would be wrong.

> /*
>  * launcher_exit
>  *
>  * Internal routine for cleaning up state when the launcher process exits. We
>  * need to clean up the abort flag to ensure that processing started again if
>  * it was previously aborted (note: started again, *not* restarted from where
>  * it left off).
>  */
> static void
> launcher_exit(int code, Datum arg)
> {
>     if (launcher_running)
>     {
>         LWLockAcquire(DataChecksumsWorkerLock, LW_EXCLUSIVE);
>         launcher_running = false;
>         DataChecksumsWorkerShmem->launcher_running = false;
> 
>         /*
>          * TODO: how to really handle the worker still running when the
>          * launcher exits?
>          */
>         if (DataChecksumsWorkerShmem->worker_running)
>             ereport(LOG,
>                     errmsg("data checksums launcher exiting while worker is still running"));
>         LWLockRelease(DataChecksumsWorkerLock);
>     }
> }

What "abort flag"?

>     /*
>      * Get a list of all databases to process. This may include databases that
>      * were created during our runtime.  Since a database can be created as a
>      * copy of any other database (which may not have existed in our last
>      * run), we have to repeat this loop until no new databases show up in the
>      * list.  Here the initial list for the loop processing is generated after
>      * waiting for all existing transactions to finish to ensure that we can
>      * see any database which was created even if the transaction in which it
>      * was created started before checksums were being processed.
>      */
>     WaitForAllTransactionsToFinish();
>     DatabaseList = BuildDatabaseList();

What if a database is dropped and another one is created with the same 
database OID?

>         /*
>          * This is the only place where we check if we are asked to abort, the
>          * abortion will bubble up from here. It's safe to check this without
>          * a lock, because if we miss it being set, we will try again soon.
>          */
>         Assert(operation == ENABLE_DATACHECKSUMS);
>         LWLockAcquire(DataChecksumsWorkerLock, LW_SHARED);
>         if (DataChecksumsWorkerShmem->launch_operation == DISABLE_DATACHECKSUMS)
>             abort_requested = true;
>         LWLockRelease(DataChecksumsWorkerLock);

The comment justifies that it's safe to do this without a lock, but it 
grabs the lock anyway.

> @@ -107,7 +107,8 @@ PageIsVerified(PageData *page, BlockNumber blkno, int flags, bool *checksum_fail
>       */
>      if (!PageIsNew(page))
>      {
> -        if (DataChecksumsEnabled())
> +        HOLD_INTERRUPTS();
> +        if (DataChecksumsNeedVerify())
>          {
>              checksum = pg_checksum_page(page, blkno);
>  
> @@ -118,6 +119,7 @@ PageIsVerified(PageData *page, BlockNumber blkno, int flags, bool *checksum_fail
>                      *checksum_failure_p = true;
>              }
>          }
> +        RESUME_INTERRUPTS();
>  
>          /*
>           * The following checks don't prove the header is correct, only that

This is to ensure that we don't absorb a barrier while calculating the 
checksum, which could transition us from "on" to "off" ? This is 
probably not really needed because there are no CHECK_FOR_INTERRUPTS() 
calls in pg_checksum_page(), but better safe than sorry I guess. Another 
approach would be to check DataChecksumsNeedVerify() again if the 
checksum didn't match. In any case, a comment would be in order.

s/datachecksumsworker.c/datachecksum_state.c/

DataChecksumsOnInProgress() and DataChecksumsOffInProgress() are unused.

Attached are a few more trivial fixes, in patch format.

- Heikki

Attachment

pgsql-hackers by date:

Previous
From: Xuneng Zhou
Date:
Subject: Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery
Next
From: Marco Nenciarini
Date:
Subject: Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery