Re: Changing the state of data checksums in a running cluster - Mailing list pgsql-hackers
| From | Heikki Linnakangas |
|---|---|
| Subject | Re: Changing the state of data checksums in a running cluster |
| Msg-id | 300822bd-daeb-49e5-8af7-d7ce3af65adc@iki.fi |
| In response to | Re: Changing the state of data checksums in a running cluster (Daniel Gustafsson <daniel@yesql.se>) |
| List | pgsql-hackers |
Here's a quick, non-exhaustive review of v20260316 of the patch. I
haven't followed the thread, so forgive me if some of these things have
already been discussed.
I don't see any major issues, just a bunch of small things:
> +/*
> + * Checksum version 0 is used for when data checksums are disabled (OFF).
> + * PG_DATA_CHECKSUM_VERSION defines that data checksums are enabled in the
> + * cluster and PG_DATA_CHECKSUM_INPROGRESS_{ON|OFF}_VERSION defines that data
> + * checksums are either currently being enabled or disabled.
> + */
> +typedef enum ChecksumType
> +{
> + PG_DATA_CHECKSUM_OFF = 0,
> + PG_DATA_CHECKSUM_VERSION,
> + PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION,
> + PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION
> +} ChecksumType;
Naming: This is called "VERSION", but also "Type", and I'm a little
confused. I think the field was called "version" because we envisioned
that we might want to use a different checksum algorithm in the future,
which could then be indicated by a different checksum version number. If
we do that in the future, what will these values look like?
> @@ -831,9 +852,10 @@ XLogInsertRecord(XLogRecData *rdata,
> * only happen just after a checkpoint, so it's better to be slow in
> * this case and fast otherwise.
> *
> - * Also check to see if fullPageWrites was just turned on or there's a
> - * running backup (which forces full-page writes); if we weren't
> - * already doing full-page writes then go back and recompute.
> + * Also check to see if fullPageWrites was just turned on, there's a
> + * running backup or if checksums are enabled (all of which forces
> + * full-page writes); if we weren't already doing full-page writes
> + * then go back and recompute.
> *
> * If we aren't doing full-page writes then RedoRecPtr doesn't
> * actually affect the contents of the XLOG record, so we'll update
It's true that if checksums are enabled, we must do full-page writes.
But that's already included in the "fullPageWrites" flag. We're not
specifically checking whether checksums are enabled here, so I find this
comment change more confusing than helpful.
> void
> SetDataChecksumsOnInProgress(void)
> {
> uint64 barrier;
>
> Assert(ControlFile != NULL);
>
> /*
> * The state transition is performed in a critical section with
> * checkpoints held off to provide crash safety.
> */
> START_CRIT_SECTION();
> MyProc->delayChkptFlags |= DELAY_CHKPT_START;
>
> XLogChecksums(PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
>
> SpinLockAcquire(&XLogCtl->info_lck);
> XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION;
> SpinLockRelease(&XLogCtl->info_lck);
>
> barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_ON);
>
> MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
> END_CRIT_SECTION();
>
> /*
> * Update the controlfile before waiting since if we have an immediate
> * shutdown while waiting we want to come back up with checksums enabled.
> */
> LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
> ControlFile->data_checksum_version = PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION;
> UpdateControlFile();
> LWLockRelease(ControlFileLock);
>
> /*
> * Await state change in all backends to ensure that all backends are in
> * "inprogress-on". Once done we know that all backends are writing data
> * checksums.
> */
> WaitForProcSignalBarrier(barrier);
> }
If a checkpoint starts, finishes, and you then crash, all between the
END_CRIT_SECTION() and LWLockAcquire here, are you left with the wrong
'data_checksum_version' in the control file? (Highly theoretical, the
window is minuscule, but still)
> +/*
> + * InitLocalControlData
> + *
> + * Set up backend local caches of controldata variables which may change at
> + * any point during runtime and thus require special cased locking. So far
> + * this only applies to data_checksum_version, but it's intended to be general
> + * purpose enough to handle future cases.
> + */
> +void
> +InitLocalDataChecksumVersion(void)
> +{
> + SpinLockAcquire(&XLogCtl->info_lck);
> + SetLocalDataChecksumVersion(XLogCtl->data_checksum_version);
> + SpinLockRelease(&XLogCtl->info_lck);
> +}
Function name in the comment doesn't match.
> + ereport(WARNING,
> + errmsg("data checksums state has been set to off"),
> + errhint("If checksums were being enabled during shutdown then processing must be manually restarted."));
This is a little unclear. If we reach this, we know that the checksum
calculation was interrupted, no need to caveat it with "if checksums
were being enabled", right? Maybe something like:
WARNING: enabling data checksums was interrupted
HINT: The checksum processing must be manually restarted
I feel we're missing a term for the process of enabling checksums.
"checksum enablement"? "initial checksum calculation"? In some comments
it's called "checksumming".
> * The DataChecksumsWorker will compile a list of databases which exist at the
> * start of checksumming, and once all are processed will regenerate the list
> * and start over processing any new entries. Once there are no new entries on
> * the list, processing will end. All databases MUST BE successfully processed
> * in order for data checksums to be enabled, the only exception are databases
> * which are dropped before having been processed.
Why does the list need to be regenerated? Any newly created databases
would already have checksums enabled, right? Is this because you might
use database that hadn't been fully processed yet as the template? Would
be good to mention if that's the reason. Could CREATE DATABASE calculate
the checksums when it copies the files?
> * means that connected backends will change state at different times. If
> * waiting for a barrier is done during startup, for example during replay, it
> * is important to realize that any locks held by the startup process might
> * cause deadlocks if backends end up waiting for those locks while startup
> * is waiting for a procsignalbarrier.
Don't we absorb procsignalbarriers while waiting on a lock? Or does this
mean LWLocks?
> * Backends transition Bd -> Bi via a procsignalbarrier which is emitted by the
> * DataChecksumsLauncher. When all backends have acknowledged the barrier then
> * Bd will be empty and the next phase can begin: calculating and writing data
> * checksums with DataChecksumsWorkers. When the DataChecksumsWorker processes
> * have finished writing checksums on all pages, data checksums are enabled
> * cluster-wide via another procsignalbarrier. There are four sets of backends
> * where Bd shall be an empty set:
> *
> * Bg: Backend updating the global state and emitting the procsignalbarrier
> * Bd: Backends in "off" state
> * Be: Backends in "on" state
> * Bi: Backends in "inprogress-on" state
> *
> * Backends in Bi and Be will write checksums when modifying a page, but only
> * backends in Be will verify the checksum during reading. The Bg backend is
> * blocked waiting for all backends in Bi to process interrupts and move to
> * Be. Any backend starting while Bg is waiting on the procsignalbarrier will
> * observe the global state being "on" and will thus automatically belong to
> * Be. Checksums are enabled cluster-wide when Bi is an empty set. Bi and Be
> * are compatible sets while still operating based on their local state as
> * both write data checksums.
Isn't "Bg" always the data checksums launcher process? In the previous
paragraph before this, you didn't list the data checksums launcher as a
separate "set".
> /*
> * Configuration of conditions which must match when absorbing a procsignal
> * barrier during data checksum enable/disable operations. A single function
> * is used for absorbing all barriers, and the set of conditions to use is
> * looked up in the checksum_barriers struct. The struct member for the target
> * state defines which state the backend must currently be in, and which it
> * must not be in.
> *
> * The reason for this explicit checking is to ensure that processing cannot
> * be started such that it breaks the assumptions of the state machine. See
> * datachecksumsworker.c for a lengthy discussion on these states.
This is "datachecksumsworker.c", aka "datachecksum_state.c", so this "See
datachecksumsworker.c" refers to itself.
> *
> * MAX_BARRIER_CONDITIONS must match largest number of sets in barrier_eq and
> * barrier_ne in the below checksum_barriers definition.
> */
> #define MAX_BARRIER_CONDITIONS 2
> typedef struct ChecksumBarrierCondition
> {
> /* The target state of the barrier */
> int target;
> /* A set of states in which at least one MUST match the current state */
> int barrier_eq[MAX_BARRIER_CONDITIONS];
> /* The number of elements in the barrier_eq set */
> int barrier_eq_sz;
> /* A set of states which all MUST NOT match the current state */
> int barrier_ne[MAX_BARRIER_CONDITIONS];
> /* The number of elements in the barrier_ne set */
> int barrier_ne_sz;
> } ChecksumBarrierCondition;
I don't understand the logic here. Why do you need both 'barrier_eq' and
'barrier_ne'? Wouldn't it be enough to just list the allowed states, and
everything else is not allowed?
> static const ChecksumBarrierCondition checksum_barriers[4] =
> {
> {PG_DATA_CHECKSUM_OFF, {PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION, PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION}, 2, {PG_DATA_CHECKSUM_VERSION}, 1},
> {PG_DATA_CHECKSUM_VERSION, {PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION}, 1, {0}, 0},
> {PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION, {0}, 0, {PG_DATA_CHECKSUM_VERSION}, 1},
> {PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION, {PG_DATA_CHECKSUM_VERSION}, 1, {0}, 0},
> };
This is pretty jarring to read. Maybe use C99 designated initializers,
or at least split it over multiple lines. Or a plain array of allowed
transitions, (current, target) -> bool. Or a function with a switch-case
statement.
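For example, the whole table could collapse into one switch on the
target state. This is just a hypothetical sketch, not code from the
patch: `checksum_transition_allowed` is an invented name, and the enum
is reproduced from the hunk quoted earlier:

```c
#include <stdbool.h>

typedef enum ChecksumType
{
	PG_DATA_CHECKSUM_OFF = 0,
	PG_DATA_CHECKSUM_VERSION,
	PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION,
	PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION
} ChecksumType;

/*
 * Same conditions as the checksum_barriers table above, expressed as
 * (current, target) -> allowed.
 */
static bool
checksum_transition_allowed(ChecksumType current, ChecksumType target)
{
	switch (target)
	{
		case PG_DATA_CHECKSUM_OFF:
			/* may turn off from either in-progress state, never from "on" */
			return current == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION ||
				   current == PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION;
		case PG_DATA_CHECKSUM_VERSION:
			/* "on" is only reached from "inprogress-on" */
			return current == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION;
		case PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION:
			/* enabling may start from any state except already-"on" */
			return current != PG_DATA_CHECKSUM_VERSION;
		case PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION:
			/* disabling starts only from "on" */
			return current == PG_DATA_CHECKSUM_VERSION;
	}
	return false;
}
```

That would also make the eq/ne question above moot, since each case
states the allowed predecessors directly.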
> /*
> * If the postmaster crashed we cannot end up with a processed database so
> * we have no alternative other than exiting. When enabling checksums we
> * won't at this time have changed the pg_control version to enabled so
> * when the cluster comes back up processing will have to be restarted.
> * When disabling, the pg_control version will be set to off before this
> * so when the cluster comes up checksums will be off as expected.
> */
> if (status == BGWH_POSTMASTER_DIED)
> ereport(FATAL,
> errcode(ERRCODE_ADMIN_SHUTDOWN),
> errmsg("cannot enable data checksums without the postmaster process"),
> errhint("Restart the database and restart data checksum processing by calling pg_enable_data_checksums()."));
Do we reach here when disabling checksums? The comment suggests so, but
in that case the hint would be wrong.
> /*
> * launcher_exit
> *
> * Internal routine for cleaning up state when the launcher process exits. We
> * need to clean up the abort flag to ensure that processing started again if
> * it was previously aborted (note: started again, *not* restarted from where
> * it left off).
> */
> static void
> launcher_exit(int code, Datum arg)
> {
> if (launcher_running)
> {
> LWLockAcquire(DataChecksumsWorkerLock, LW_EXCLUSIVE);
> launcher_running = false;
> DataChecksumsWorkerShmem->launcher_running = false;
>
> /*
> * TODO: how to really handle the worker still running when the
> * launcher exits?
> */
> if (DataChecksumsWorkerShmem->worker_running)
> ereport(LOG,
> errmsg("data checksums launcher exiting while worker is still running"));
> LWLockRelease(DataChecksumsWorkerLock);
> }
> }
What "abort flag"?
> /*
> * Get a list of all databases to process. This may include databases that
> * were created during our runtime. Since a database can be created as a
> * copy of any other database (which may not have existed in our last
> * run), we have to repeat this loop until no new databases show up in the
> * list. Here the initial list for the loop processing is generated after
> * waiting for all existing transactions to finish to ensure that we can
> * see any database which was created even if the transaction in which it
> * was created started before checksums were being processed.
> */
> WaitForAllTransactionsToFinish();
> DatabaseList = BuildDatabaseList();
What if a database is dropped and another one is created with the same
database OID?
> /*
> * This is the only place where we check if we are asked to abort, the
> * abortion will bubble up from here. It's safe to check this without
> * a lock, because if we miss it being set, we will try again soon.
> */
> Assert(operation == ENABLE_DATACHECKSUMS);
> LWLockAcquire(DataChecksumsWorkerLock, LW_SHARED);
> if (DataChecksumsWorkerShmem->launch_operation == DISABLE_DATACHECKSUMS)
> abort_requested = true;
> LWLockRelease(DataChecksumsWorkerLock);
The comment justifies that it's safe to do this without a lock, but it
grabs the lock anyway.
> @@ -107,7 +107,8 @@ PageIsVerified(PageData *page, BlockNumber blkno, int flags, bool *checksum_fail
> */
> if (!PageIsNew(page))
> {
> - if (DataChecksumsEnabled())
> + HOLD_INTERRUPTS();
> + if (DataChecksumsNeedVerify())
> {
> checksum = pg_checksum_page(page, blkno);
>
> @@ -118,6 +119,7 @@ PageIsVerified(PageData *page, BlockNumber blkno, int flags, bool *checksum_fail
> *checksum_failure_p = true;
> }
> }
> + RESUME_INTERRUPTS();
>
> /*
> * The following checks don't prove the header is correct, only that
This is to ensure that we don't absorb a barrier while calculating the
checksum, which could transition us from "on" to "off" ? This is
probably not really needed because there are no CHECK_FOR_INTERRUPTS()
calls in pg_checksum_page(), but better safe than sorry I guess. Another
approach would be to check DataChecksumsNeedVerify() again if the
checksum didn't match. In any case, a comment would be in order.
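The re-check idea could look roughly like this. A minimal self-contained
sketch, with trivial stand-ins (my inventions, not the real backend
APIs) for `DataChecksumsNeedVerify` and `pg_checksum_page`, just to show
the control flow:

```c
#include <stdbool.h>
#include <stdint.h>

/* stand-in for the backend-local checksum state */
static bool local_checksums_on = true;

static bool
DataChecksumsNeedVerify(void)
{
	return local_checksums_on;
}

/* placeholder "checksum"; the real one is pg_checksum_page() */
static uint16_t
fake_checksum_page(uint16_t page_payload)
{
	return (uint16_t) (page_payload ^ 0x5AA5);
}

/*
 * A mismatch is only reported as a failure if verification is *still*
 * required afterwards, so a barrier that turned checksums off
 * mid-calculation cannot produce a spurious failure.
 */
static bool
page_checksum_failed(uint16_t stored, uint16_t payload)
{
	if (!DataChecksumsNeedVerify())
		return false;
	if (fake_checksum_page(payload) == stored)
		return false;
	return DataChecksumsNeedVerify();
}
```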
s/datachecksumsworker.c/datachecksum_state.c/
DataChecksumsOnInProgress() and DataChecksumsOffInProgress() are unused.
Attached are a few more trivial fixes, in patch format.
- Heikki