Re: Online checksums patch - once again - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: Online checksums patch - once again
Msg-id: CA+TgmoZV+gvfQd6O6KXW=yYVC2nUafs6cgFP1ZC_+w2qyM6POg@mail.gmail.com
In response to: Re: Online checksums patch - once again (Daniel Gustafsson <daniel@yesql.se>)
Responses: Re: Online checksums patch - once again
           Re: Online checksums patch - once again
List: pgsql-hackers
On Mon, Jun 22, 2020 at 8:27 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> Attached is a new version of the online checksums patch which, I hope, address
> most of the concerns raised in previous reviews. There has been a fair amount
> of fiddling done, so below is a summary of what has been done.

Here are a bunch of comments based on a partial read-through of this patch. The most serious concerns, around synchronization, are down toward the bottom. Sorry this is a bit eclectic as a review, but I wrote things down as I read through the patch more or less in the order I ran across them.

Regarding disable_data_checksums(), I disagree with ereport(LOG, ...) here. If you want to indicate to the caller whether or not a state change occurred, you could consider returning a Boolean instead of void. If you want to do it with log messages, I vote for NOTICE, not LOG. Maybe NOTICE is better, because enable_data_checksums() seems to want to convey more information than you can represent in a Boolean, but then it should use NOTICE consistently, not a mix of NOTICE and LOG.

Formatting needs work for project style: typically no braces around single statements, and "ereport(WHATEVER," should always have a line break at that point.

+ * cluster, which was not initialized with checksums, this worker will ensure

"which was not initialized with checksums" => "that is not running with checksums enabled"?

+ * turned on. In the case of disabling checksums, the state transition is
+ * recorded in the catalog and controlfile, no changes are performed
+ * on the data pages or in the catalog.

Comma splice. Either write "controlfile; no" or "controlfile, and no".

My spell-checker complains that controlfile, clusterwide, inprogress, and endstate are not words. I think you should think about inserting spaces or, in the case of cluster-wide, a dash, unless they are being used as literals, in which case perhaps those instances should be quoted. "havent" needs an apostrophe.

+ * DataChecksumsWorker will compile a list of databases which exists at the

which exist

+ * For each database, all relations which have storage are read and every data
+ * page is marked dirty to force a write with the checksum, this will generate

Comma splice. Split into two sentences.

+ * In case checksums have been enabled and later disabled, when re-enabling
+ * pg_class.relhaschecksums will be reset to false before entering inprogress
+ * mode to ensure that all relations are re-processed.

"If checksums are enabled, then disabled, and then re-enabled, every relation's pg_class.relhaschecksums field will be reset to false before entering the in-progress mode."

+ * Disabling checksums is done as an immediate operation as it only updates

s/done as //

+ * to pg_class.relhaschecksums is performed as it only tracks state during

is performed -> are necessary

+ * Access to other members can be done without a lock, as while they are
+ * in shared memory, they are never concurrently accessed. When a worker
+ * is running, the launcher is only waiting for that worker to finish.

The way this is written, it sounds like you're saying that concurrent access might be possible when this structure isn't in shared memory. But since it's called DatachecksumsWorkerShmemStruct that's not likely a correct conclusion, so I think it needs rephrasing.

+   if (DatachecksumsWorkerShmem->launcher_started && !DatachecksumsWorkerShmem->abort)
+       started = true;

Why not started = a && b instead of started = false; if (a && b) started = true?
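That is, just something like (untested, but you get the idea):

    started = DatachecksumsWorkerShmem->launcher_started &&
        !DatachecksumsWorkerShmem->abort;

and then started doesn't need the separate false initialization at all, since it's assigned unconditionally.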
+   {
+       LWLockRelease(DatachecksumsWorkerLock);
+       ereport(ERROR,
+               (errmsg("data checksums worker has been aborted")));
+   }

Errors always release LWLocks, so this seems unnecessary. Also, the error message looks confusing from a user perspective. What does it mean if I ask you to make me a cheeseburger and you tell me the cheeseburger has been eaten? I'm asking for a *new* cheeseburger (or in this case, a new worker). I wonder why this thing is inventing a brand new way of aborting a worker, anyway. Why not just keep track of the PID and send it SIGINT and have it use CHECK_FOR_INTERRUPTS()? That's already sprinkled all over the code, so it's likely to work better than some brand-new mechanism that will probably have checks in a lot fewer places.

+   vacuum_delay_point();

Huh? Why?

+   elog(DEBUG2,
+        "background worker \"datachecksumsworker\" starting to process relation %u",
+        relationId);

This and similar messages seem like they refer needlessly to internals, e.g. this could be "adding checksums to relation with OID %u" without needing to reference background workers or datachecksumsworker. It would be even better if we could find a way to report relation names.

+ * so when the cluster comes back up processing will habe to be resumed.

habe -> have

+   ereport(FATAL,
+           (errmsg("cannot enable checksums without the postmaster process"),
+            errhint("Restart the database and restart the checksumming process by calling pg_enable_data_checksums().")));

I understand the motivation for this design and it may be the best we can do, but honestly it kinda sucks. It would be nice if the system itself figured out whether or not the worker should be running and, if yes, ran it. Like, if we're in this state when we exit recovery (or decide to skip recovery), just register the worker then automatically. Now that could still fail for lack of slots, so I guess to make this really robust we'd need a way for the registration to get retried, e.g. autovacuum could try to reregister it periodically, and we could just blow off the case where autovacuum=off. I don't know. I'd rather avoid burdening users with an implementation detail if we can get there, or at least minimize what they need to worry about.

+   snprintf(activity, sizeof(activity) - 1,
+            "Waiting for worker in database %s (pid %d)", db->dbname, pid);
+   pgstat_report_activity(STATE_RUNNING, activity);

So we only know how to run one such worker at a time? Maybe WaitForAllTransactionsToFinish should advertise something in pg_stat_activity.

I think you should try to give all of the functions header comments, or at least all the bigger ones.

+   else if (result == DATACHECKSUMSWORKER_ABORTED)
+       /* Abort flag set, so exit the whole process */
+       return false;

I'd put braces here. And also, why bail out like this instead of retrying periodically until we succeed?

+ * True if all data pages of the relation have data checksums.

Not fully accurate, right?

+   /*
+    * Force a checkpoint to get everything out to disk. TODO: we probably
+    * don't want to use a CHECKPOINT_IMMEDIATE here but it's very convenient
+    * for testing until the patch is fully baked, as it may otherwise make
+    * tests take a lot longer.
+    */
+   RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT | CHECKPOINT_IMMEDIATE);

Do we need to verify that the checkpoint succeeded before we can declare victory and officially change state?
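Circling back to the abort-mechanism point above, roughly what I have in mind is the usual pattern (untested sketch; worker_pid, nblocks and the loop body are invented names, and it assumes the worker's signal handlers are set up the normal way):

    /* launcher side: remember the worker's PID and cancel it the usual way */
    if (worker_pid != 0)
        kill(worker_pid, SIGINT);

    /* worker side: the per-block loop it needs anyway */
    for (blkno = 0; blkno < nblocks; blkno++)
    {
        CHECK_FOR_INTERRUPTS();     /* notices the pending cancel */

        /* ... read the buffer, mark it dirty, write it back with a checksum ... */
    }

That way cancellation gets checked everywhere CHECK_FOR_INTERRUPTS() already appears, rather than only at the places that remember to test a bespoke abort flag.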
+   PROCSIGNAL_BARRIER_CHECKSUM_OFF = 0,
+   PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_ON,
+   PROCSIGNAL_BARRIER_CHECKSUM_ON

I don't think it's a good idea to have three separate types of barrier here. I think you should just have a barrier indicating that the state has changed, and then backends need to reread the state from shared memory when they absorb the barrier.

But the bigger problem here, and the thing that makes me intensely doubtful that the synchronization in this patch is actually correct, is that I can't find any machinery in the patch guarding against TOCTTOU issues, nor any comments explaining why I shouldn't be afraid of them. Suppose you got rid of the barriers and just changed all the places that check LocalDataChecksumVersion to read from a shared memory value directly instead. Would that be equivalent to what you've got here, or would it break something? If you can't clearly explain why that would be broken as compared with what you have, then either the barriers aren't really necessary (which I doubt) or the synchronization isn't really right (which I suspect to be true).

In the case of the ALTER SYSTEM READ ONLY patch, this was by far the hardest part to get right, and I'm still not positive that it's completely correct, but the basic thing we figured out there is that you are in big trouble if the system goes read-only AFTER you've decided to write a WAL record. That is, this is bugged:

    if (WALIsProhibited())
        ereport(ERROR, errmsg("i'm sorry i can't do that"));
    ...
    CHECK_FOR_INTERRUPTS();
    ...
    START_CRIT_SECTION();
    XLogBeginInsert();

If the CHECK_FOR_INTERRUPTS() absorbs a state change, then the XLogBeginInsert() is going to hit an elog(ERROR) which, because we're in a critical section, will be promoted to PANIC, which is bad. To avoid that, the patch introduces a whole hairy system to make sure that there can never be a CFI after we check whether it's OK to insert WAL and before we actually do it. That stuff is designed in such a way that it will make assertions fail even if you're not actually *trying* to make the system read-only.

So the comparable problem here would be if we decide that we don't need to set checksums on a page when modifying it, and then we absorb a barrier that flips the state to in-progress, and then we actually perform the page modification. Now you have a race condition: the page was modified without checksums after we'd acknowledged to the process pushing out the barrier that all of our future page modifications would set checksums.

So, maybe that's not possible here. For instance, if we never examine the checksum-enabled state outside of a critical section, then we're fine, because we can't absorb a barrier without processing an interrupt, and we don't process interrupts in critical sections. But if that's the case, then it seems to me that it would be good to insert some cross-checks. Like, suppose we only ever access the local variable that contains this state through a static inline function that also asserts that InterruptHoldoffCount > 0 || CritSectionCount > 0. Then, if there is a place where we don't actually follow that rule (only rely on that value within critical sections) we're pretty likely to trip an assert just running the regression tests. It's not foolproof, not only because the regression tests are incomplete but because in theory we could fetch the value in a crit section and then keep it around and rely on it some more after we've processed interrupts again, but that seems like a less-likely thing for somebody to do.
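Concretely, the sort of cross-check I'm imagining is something like this (untested; the function name is invented, and LocalDataChecksumVersion stands in for whatever the patch's local-state variable ends up being called):

    #include "miscadmin.h"      /* InterruptHoldoffCount, CritSectionCount */

    static inline uint32
    GetLocalDataChecksumVersion(void)
    {
        /*
         * Only consult this while interrupts are held off or we are inside
         * a critical section, i.e. at a point where a procsignal barrier
         * cannot be absorbed underneath us.
         */
        Assert(InterruptHoldoffCount > 0 || CritSectionCount > 0);

        return LocalDataChecksumVersion;
    }

If every reader goes through an accessor like that, a code path that consults the value at an unsafe point should trip the assertion pretty quickly just from running the regression tests in an assert-enabled build.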
If, on the other hand, there are stretches of code that fetch this value outside of a crit section and without interrupts held, then we need some other kind of mechanism here to make it safe. We have to make sure that not only does the present code not permit race conditions of the type described above, but that future modifications are quite unlikely to introduce any. I might be missing something, but I don't see any kind of checks like this in the patch now. I think there should be, and the rationale behind them should be written up, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company