
From Robert Haas
Subject Re: Online checksums patch - once again
Msg-id CA+TgmoZV+gvfQd6O6KXW=yYVC2nUafs6cgFP1ZC_+w2qyM6POg@mail.gmail.com
In response to Re: Online checksums patch - once again  (Daniel Gustafsson <daniel@yesql.se>)
On Mon, Jun 22, 2020 at 8:27 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> Attached is a new version of the online checksums patch which, I hope, address
> most of the concerns raised in previous reviews.  There has been a fair amount
> of fiddling done, so below is a summary of what has been done.

Here are a bunch of comments based on a partial read-through of this
patch. The most serious concerns, around synchronization, are down
toward the bottom. Sorry this is a bit eclectic as a review, but I
wrote things down as I read through the patch more or less in the
order I ran across them.

Regarding disable_data_checksums(), I disagree with ereport(LOG, ...)
here. If you want to indicate to the caller whether or not a state
change occurred, you could consider returning a Boolean instead of
void. If you want to do it with log messages, I vote for NOTICE, not
LOG. Maybe NOTICE is better, because enable_data_checksums() seems to
want to convey more information than you can represent in a Boolean,
but then it should use NOTICE consistently, not a mix of NOTICE and
LOG.
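
As a rough sketch of the boolean approach (SetDataChecksumsOff() here
is a made-up helper, not something from the patch):

Datum
disable_data_checksums(PG_FUNCTION_ARGS)
{
    bool        changed;

    /* hypothetical helper that flips the state and reports whether it did */
    changed = SetDataChecksumsOff();

    if (!changed)
        ereport(NOTICE,
                (errmsg("data checksums are already disabled")));

    PG_RETURN_BOOL(changed);
}

with the SQL-level return type changed to match, of course.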

Formatting needs work for project style: typically there are no braces
around single statements, and "ereport(WHATEVER," should always be
followed by a line break at that point.
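
That is, roughly (the condition and message here are just for
illustration):

if (launcher_running)
    ereport(NOTICE,
            (errmsg("data checksums worker is already running")));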

+ * cluster, which was not initialized with checksums, this worker will ensure

"which was not initialized with checksums" => "that does not running
with checksums enabled"?

+ * turned on. In the case of disabling checksums, the state transition is
+ * recorded in the catalog and controlfile, no changes are performed
+ * on the data pages or in the catalog.

Comma splice. Either write "controlfile; no" or "controlfile, and no".

My spell-checker complains that controlfile, clusterwide, inprogress,
and endstate are not words. I think you should think about inserting
spaces or, in the case of cluster-wide, a dash, unless they are being
used as literals, in which case perhaps those instances should be
quoted. "havent" needs an apostrophe.

+ * DataChecksumsWorker will compile a list of databases which exists at the

which exist

+ * For each database, all relations which have storage are read and every data
+ * page is marked dirty to force a write with the checksum, this will generate

Comma splice. Split into two sentences.

+ * In case checksums have been enabled and later disabled, when re-enabling
+ * pg_class.relhaschecksums will be reset to false before entering inprogress
+ * mode to ensure that all relations are re-processed.

"If checksums are enabled, then disabled, and then re-enabled, every
relation's pg_class.relhaschecksums field will be reset to false
before entering the in-progress mode."

+ * Disabling checksums is done as an immediate operation as it only updates

s/done as //

+ * to pg_class.relhaschecksums is performed as it only tracks state during

is performed -> are necessary

+ * Access to other members can be done without a lock, as while they are
+ * in shared memory, they are never concurrently accessed. When a worker
+ * is running, the launcher is only waiting for that worker to finish.

The way this is written, it sounds like you're saying that concurrent
access might be possible when this structure isn't in shared memory.
But since it's called DatachecksumsWorkerShmemStruct, that's not likely
a correct conclusion, so I think it needs rephrasing.

+ if (DatachecksumsWorkerShmem->launcher_started &&
+     !DatachecksumsWorkerShmem->abort)
+ started = true;

Why not started = a && b instead of started = false; if (a && b) started = true?
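
i.e. just:

started = DatachecksumsWorkerShmem->launcher_started &&
    !DatachecksumsWorkerShmem->abort;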

+ {
+ LWLockRelease(DatachecksumsWorkerLock);
+ ereport(ERROR,
+ (errmsg("data checksums worker has been aborted")));
+ }

Errors always release LWLocks, so this seems unnecessary. Also, the
error message looks confusing from a user perspective. What does it
mean if I ask you to make me a cheeseburger and you tell me the
cheeseburger has been eaten? I'm asking for a *new* cheeseburger (or
in this case, a new worker).

I wonder why this thing is inventing a brand new way of aborting a
worker, anyway. Why not just keep track of the PID and send it SIGINT
and have it use CHECK_FOR_INTERRUPTS()? That's already sprinkled all
over the code, so it's likely to work better than some brand-new
mechanism that will probably have checks in a lot fewer places.
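
Roughly like this; the worker_pid field and the exact signal-handler
wiring are assumptions on my part, not something the patch already has:

/* launcher side: remember the worker's PID when it starts, then to abort: */
if (DatachecksumsWorkerShmem->worker_pid != InvalidPid)
    kill(DatachecksumsWorkerShmem->worker_pid, SIGINT);

/* worker side: the per-block loop just leans on the existing machinery */
for (blkno = 0; blkno < nblocks; blkno++)
{
    CHECK_FOR_INTERRUPTS();     /* a pending cancel errors out here */
    /* ... read the block, mark it dirty so it's rewritten with a checksum ... */
}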

+ vacuum_delay_point();

Huh? Why?

+ elog(DEBUG2,
+ "background worker \"datachecksumsworker\" starting to process relation %u",
+ relationId);

This and similar messages seem to refer needlessly to internals;
e.g. this could be "adding checksums to relation with OID %u" without
needing to reference background workers or datachecksumsworker. It
would be even better if we could find a way to report relation names.
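
For example, something like this (get_rel_name() can return NULL for a
dropped relation, so a real version needs the fallback):

char       *relname = get_rel_name(relationId);

elog(DEBUG2, "adding checksums to relation \"%s\" with OID %u",
     relname ? relname : "(unknown)", relationId);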

+ * so when the cluster comes back up processing will habe to be resumed.

habe -> have

+ ereport(FATAL,
+ (errmsg("cannot enable checksums without the postmaster process"),
+ errhint("Restart the database and restart the checksumming process by calling pg_enable_data_checksums().")));

I understand the motivation for this design and it may be the best we
can do, but honestly it kinda sucks. It would be nice if the system
itself figured out whether or not the worker should be running and, if
yes, ran it. Like, if we're in this state when we exit recovery (or
decide to skip recovery), just register the worker then automatically.
Now that could still fail for lack of slots, so I guess to make this
really robust we'd need a way for the registration to get retried,
e.g. autovacuum could try to reregister it periodically, and we could
just blow off the case where autovacuum=off. I don't know. I'd rather
avoid burdening users with an implementation detail if we can get
there, or at least minimize what they need to worry about.

+ snprintf(activity, sizeof(activity) - 1,
+ "Waiting for worker in database %s (pid %d)", db->dbname, pid);
+ pgstat_report_activity(STATE_RUNNING, activity);

So we only know how to run one such worker at a time?

Maybe WaitForAllTransactionsToFinish should advertise something in
pg_stat_activity.
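
Even something as bare-bones as this would help anyone wondering what a
stalled launcher is doing (the message text is just an example):

pgstat_report_activity(STATE_RUNNING,
                       "Waiting for current transactions to finish");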

I think you should try to give all of the functions header comments,
or at least all the bigger ones.

+ else if (result == DATACHECKSUMSWORKER_ABORTED)
+ /* Abort flag set, so exit the whole process */
+ return false;

I'd put braces here. And also, why bail out like this instead of
retrying periodically until we succeed?
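
So at a minimum:

else if (result == DATACHECKSUMSWORKER_ABORTED)
{
    /* Abort flag set, so exit the whole process */
    return false;
}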

+ * True if all data pages of the relation have data checksums.

Not fully accurate, right?

+ /*
+ * Force a checkpoint to get everything out to disk. TODO: we probably
+ * don't want to use a CHECKPOINT_IMMEDIATE here but it's very convenient
+ * for testing until the patch is fully baked, as it may otherwise make
+ * tests take a lot longer.
+ */
+ RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT | CHECKPOINT_IMMEDIATE);

Do we need to verify that the checkpoint succeeded before we can
declare victory and officially change state?

+ PROCSIGNAL_BARRIER_CHECKSUM_OFF = 0,
+ PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_ON,
+ PROCSIGNAL_BARRIER_CHECKSUM_ON

I don't think it's a good idea to have three separate types of barrier
here. I think you should just have a barrier indicating that the state
has changed, and then backends need to reread the state from shared
memory when they absorb the barrier.
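
Something along these lines; the enum value, function, and helper names
are all invented for the example:

PROCSIGNAL_BARRIER_CHECKSUM,    /* one kind, meaning "the state changed" */

static void
AbsorbChecksumBarrier(void)
{
    /*
     * Reread the authoritative state from shared memory; the helper here
     * stands in for however that value is actually fetched.
     */
    LocalDataChecksumVersion = GetSharedDataChecksumVersion();
}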

But the bigger problem here, and the thing that makes me intensely
doubtful that the synchronization in this patch is actually correct,
is that I can't find any machinery in the patch guarding against
TOCTTOU issues, nor any comments explaining why I shouldn't be afraid
of them. Suppose you got rid of the barriers and just changed all the
places that check LocalDataChecksumVersion to read from a shared
memory value directly instead. Would that be equivalent to what you've
got here, or would it break something? If you can't clearly explain
why that would be broken as compared with what you have, then either
the barriers aren't really necessary (which I doubt) or the
synchronization isn't really right (which I suspect to be true).

In the case of the ALTER SYSTEM READ ONLY patch, this was by far the
hardest part to get right, and I'm still not positive that it's
completely correct, but the basic thing we figured out there is that
you are in big trouble if the system goes read-only AFTER you've
decided to write a WAL record. That is, this is bugged:

if (WALIsProhibited())
   ereport(ERROR, errmsg("i'm sorry i can't do that"));
...
CHECK_FOR_INTERRUPTS();
...
START_CRIT_SECTION();
XLogBeginInsert();

If the CHECK_FOR_INTERRUPTS() absorbs a state change, then the
XLogBeginInsert() is going to hit an elog(ERROR) which, because we're
in a critical section, will be promoted to PANIC, which is bad. To
avoid that, the patch introduces a whole hairy system to make sure
that there can never be a CFI after we check whether it's OK to insert
WAL and before we actually do it. That stuff is designed in such a way
that it will make an assertion fail even if you're not actually *trying*
to make the system read-only.

So the comparable problem here would be if we decide that we don't
need to set checksums on a page when modifying it, and then we absorb
a barrier that flips the state to in-progress, and then we actually
perform the page modification. Now you have a race condition: the page
was modified without checksums after we'd acknowledged to the process
pushing out the barrier that all of our future page modifications
would set checksums. So, maybe that's not possible here. For instance,
if we never examine the checksum-enabled state outside of a critical
section, then we're fine, because we can't absorb a barrier without
processing an interrupt, and we don't process interrupts in critical
sections. But if that's the case, then it seems to me that it would be
good to insert some cross-checks. Like, suppose we only ever access
the local variable that contains this state through a static inline
function that also asserts that InterruptHoldoffCount > 0 ||
CritSectionCount > 0. Then, if there is a place where we don't
actually follow that rule (only rely on that value within critical
sections) we're pretty likely to trip an assert just running the
regression tests. It's not foolproof, not only because the regression
tests are incomplete but because in theory we could fetch the value in
a crit section and then keep it around and rely on it some more after
we've processed interrupts again, but that seems like a less-likely
thing for somebody to do.
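
In other words, something like this, where the name and exact return
convention don't really matter and the Assert does the work:

static inline bool
DataChecksumsNeedWrite(void)
{
    /* only trust this value with interrupts held off or in a critical section */
    Assert(InterruptHoldoffCount > 0 || CritSectionCount > 0);

    return LocalDataChecksumVersion > 0;
}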

If, on the other hand, there are stretches of code that fetch this
value outside of a crit section and without interrupts held, then we
need some other kind of mechanism here to make it safe. We have to
make sure that not only does the present code not permit race
conditions of the type described above, but that future modifications
are quite unlikely to introduce any. I might be missing something, but
I don't see any kind of checks like this in the patch now. I think
there should be, and the rationale behind them should be written up,
too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


