Thread: Changing the state of data checksums in a running cluster
After some off-list discussion about the desirability of this feature, where
several hackers opined that it's something that we should have, I've decided to
rebase this patch and submit it one more time.  There are several (long)
threads covering the history of this patch [0][1], related work stemming from
this [2] as well as earlier attempts and discussions [3][4].  Below I try to
respond to a summary of points raised in those threads.

The mechanics of the patch haven't changed since the last posted version; it
has mainly been polished slightly.  A high-level overview of the processing is:
It uses a launcher/worker model where the launcher spawns a worker per database
which traverses all pages and dirties them in order to calculate and set the
checksum on them.  During this inprogress state all backends calculate and
write checksums but don't verify them on read.  Once all pages have been
checksummed the state of the cluster switches over to "on", synchronized across
all backends with a procsignalbarrier.  At this point checksums are verified
and processing is equal to checksums having been enabled at initdb.  When a
user disables checksums the cluster enters a state where all backends still
write checksums until all backends have acknowledged that they have stopped
verifying checksums (again using a procsignalbarrier).  At this point the
cluster switches to "off" and checksums are neither written nor verified.  In
case the cluster is restarted, voluntarily or via a crash, processing will have
to be restarted (more on that further down).

The user-facing controls for this are two SQL-level functions, for enabling and
disabling.  The existing data_checksums GUC remains but is expanded with more
possible states (with on/off retained).


Complaints against earlier versions
===================================
Seasoned hackers might remember that this patch has been on -hackers before.
There has been a lot of review, and AFAICT all specific comments have been
addressed.  There are however a few larger, more generic complaints:

* Restartability - the initial version of the patch did not support stateful
restarts; a shutdown (or crash) before checksums were enabled would result in a
need to start over from the beginning.  This was deemed the safe orchestration
method.  The lack of this feature was seen as a serious drawback, so it was
added.  Subsequent review instead found the patch to be too complicated with
too large a featureset.  I think there is merit to both of these arguments:
being able to restart is a great feature, and being able to reason about the
correctness of a smaller patch is also great.  As of this submission I have
removed the ability to restart to keep the scope of the patch small (which is
where the previous version was, which received no review after the removal).
The way I prefer to frame this is to first add scaffolding and infrastructure
(this patch) and leave refinements and add-on features (restartability, but
also others like parallel workers, optimizing rare cases, etc.) for follow-up
patches.

* Complexity - it was brought up that this is a very complex patch for a niche
feature, and there is a lot of truth to that.  It is inherently complex to
change a pg_control-level state of a running cluster.  There might be ways to
make the current patch less complex while not sacrificing stability, and if so
that would be great.
A lot of the complexity came from being able to restart processing, and that's
now removed in this version, but it's clearly not close to a one-line diff even
without it.

Other complaints were addressed, in part by the invention of procsignalbarriers
which make this synchronization possible.  In re-reading the threads I might
have missed something which is still left open, and if so I do apologize for
that.


Open TODO items:
================
* Immediate checkpoints - the code is currently using CHECKPOINT_IMMEDIATE in
order to be able to run the tests in a timely manner.  This is overly
aggressive, and dialling it back while still being able to run fast tests is a
TODO.  Not sure what the best option is there.

* Monitoring - an insightful off-list reviewer asked how the current progress
of the operation is monitored.  So far I've been using pg_stat_activity, but I
don't disagree that it's not a very sharp tool for this.  Maybe we need a
specific function or view or something?  There clearly needs to be a way for a
user to query the state and progress of a transition.

* Throttling - right now the patch uses the vacuum access strategy, with the
same cost options as vacuum, in order to implement throttling.  This is in part
due to the patch starting out modelled around autovacuum as a worker, but it
may not be the right match for throttling checksums.

* Naming - the in-between states when data checksums are enabled or disabled
are called inprogress-on and inprogress-off.  The reason for this is simply
that early on there were only three states: inprogress, on and off, and the
process of disabling wasn't labeled with a state.  When this transition state
was added it seemed like a good idea to tack the end-goal onto the transition.
These state names make the code easily greppable but might not be the most
obvious choices for anything user facing.  Are "Enabling" and "Disabling"
better terms to use (across the board or just user facing) or should we stick
to the current?

There are ways in which this processing can be optimized to achieve better
performance, but in order to keep goalposts in sight and patch size down they
are left as future work.

--
Daniel Gustafsson

[0] https://www.postgresql.org/message-id/flat/CABUevExz9hUUOLnJVr2kpw9Cx%3Do4MCr1SVKwbupzuxP7ckNutA%40mail.gmail.com
[1] https://www.postgresql.org/message-id/flat/CABUevEwE3urLtwxxqdgd5O2oQz9J717ZzMbh%2BziCSa5YLLU_BA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/20181030051643.elbxjww5jjgnjaxg%40alap3.anarazel.de
[3] https://www.postgresql.org/message-id/flat/FF393672-5608-46D6-9224-6620EC532693%40endpoint.com
[4] https://www.postgresql.org/message-id/flat/CABUevEx8KWhZE_XkZQpzEkZypZmBp3GbM9W90JLp%3D-7OJWBbcg%40mail.gmail.com
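A minimal sketch of the user-facing flow described in the mail above, using the
two SQL functions and the expanded data_checksums GUC.  The exact state strings
are illustrative, based on the inprogress-on/inprogress-off naming discussed in
the Naming item; only the function names and the GUC itself come from the
patch:

    -- kick off background checksumming; backends start writing checksums
    SELECT pg_enable_data_checksums();

    -- while the worker traverses pages the GUC reports the transitional state
    SHOW data_checksums;    -- e.g. "inprogress-on" (illustrative)

    -- once every page has been checksummed the cluster flips to "on"
    SHOW data_checksums;    -- "on"

    -- disabling goes through the corresponding transitional state before "off"
    SELECT pg_disable_data_checksums();
    SHOW data_checksums;    -- e.g. "inprogress-off", then "off" (illustrative)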
Hi Daniel,

Thanks for rebasing the patch and submitting it again!

On 7/3/24 08:41, Daniel Gustafsson wrote:
> After some off-list discussion about the desirability of this feature, where
> several hackers opined that it's something that we should have, I've decided to
> rebase this patch and submit it one more time.  There are several (long)
> threads covering the history of this patch [0][1], related work stemming from
> this [2] as well as earlier attempts and discussions [3][4].  Below I try to
> respond to a summary of points raised in those threads.
>
> The mechanics of the patch haven't changed since the last posted version; it
> has mainly been polished slightly.  A high-level overview of the processing is:
> It uses a launcher/worker model where the launcher spawns a worker per database
> which traverses all pages and dirties them in order to calculate and set the
> checksum on them.  During this inprogress state all backends calculate and
> write checksums but don't verify them on read.  Once all pages have been
> checksummed the state of the cluster switches over to "on", synchronized across
> all backends with a procsignalbarrier.  At this point checksums are verified
> and processing is equal to checksums having been enabled at initdb.  When a
> user disables checksums the cluster enters a state where all backends still
> write checksums until all backends have acknowledged that they have stopped
> verifying checksums (again using a procsignalbarrier).  At this point the
> cluster switches to "off" and checksums are neither written nor verified.  In
> case the cluster is restarted, voluntarily or via a crash, processing will have
> to be restarted (more on that further down).
>
> The user-facing controls for this are two SQL-level functions, for enabling and
> disabling.  The existing data_checksums GUC remains but is expanded with more
> possible states (with on/off retained).
>
>
> Complaints against earlier versions
> ===================================
> Seasoned hackers might remember that this patch has been on -hackers before.
> There has been a lot of review, and AFAICT all specific comments have been
> addressed.  There are however a few larger, more generic complaints:
>
> * Restartability - the initial version of the patch did not support stateful
> restarts; a shutdown (or crash) before checksums were enabled would result in a
> need to start over from the beginning.  This was deemed the safe orchestration
> method.  The lack of this feature was seen as a serious drawback, so it was
> added.  Subsequent review instead found the patch to be too complicated with
> too large a featureset.  I think there is merit to both of these arguments:
> being able to restart is a great feature, and being able to reason about the
> correctness of a smaller patch is also great.  As of this submission I have
> removed the ability to restart to keep the scope of the patch small (which is
> where the previous version was, which received no review after the removal).
> The way I prefer to frame this is to first add scaffolding and infrastructure
> (this patch) and leave refinements and add-on features (restartability, but
> also others like parallel workers, optimizing rare cases, etc.) for follow-up
> patches.
>

I 100% support this approach.  Sure, I'd like to have a restartable tool, but
clearly that didn't go particularly well, and we still have nothing to enable
checksums online.
That doesn't seem to benefit anyone - to me it seems reasonable to get the
non-restartable tool in, and then maybe later someone can improve this to make
it restartable.  Thanks to the earlier work we know it's doable, even if it was
too complex.

This way it's at least possible to enable checksums online with some additional
care (e.g. to make sure no one restarts the cluster etc.).  I'd bet that for
the vast majority of systems this will work just fine.  Huge systems with some
occasional / forced restarts may not be able to make this work - but then
again, that's no worse than now.

> * Complexity - it was brought up that this is a very complex patch for a niche
> feature, and there is a lot of truth to that.  It is inherently complex to
> change a pg_control-level state of a running cluster.  There might be ways to
> make the current patch less complex while not sacrificing stability, and if so
> that would be great.  A lot of the complexity came from being able to restart
> processing, and that's now removed in this version, but it's clearly not close
> to a one-line diff even without it.
>

I'd push back on this a little bit - the patch looks like this:

  50 files changed, 2691 insertions(+), 48 deletions(-)

and if we ignore the docs / perl tests, then the two parts that stand out are

  src/backend/access/transam/xlog.c             |  455 +++++-
  src/backend/postmaster/datachecksumsworker.c  | 1353 +++++++++++++++++

I don't think the worker code is exceptionally complex.  Yes, it's not trivial,
but a lot of the 1353 inserted lines are comments (which is good) or generic
infrastructure to start the worker etc.

> Other complaints were addressed, in part by the invention of procsignalbarriers
> which make this synchronization possible.  In re-reading the threads I might
> have missed something which is still left open, and if so I do apologize for
> that.
>
>
> Open TODO items:
> ================
> * Immediate checkpoints - the code is currently using CHECKPOINT_IMMEDIATE in
> order to be able to run the tests in a timely manner.  This is overly
> aggressive, and dialling it back while still being able to run fast tests is a
> TODO.  Not sure what the best option is there.
>

Why not add a parameter to pg_enable_data_checksums(), specifying whether to do
an immediate checkpoint or wait for the next one?  AFAIK that's what we do in
pg_backup_start, for example.

> * Monitoring - an insightful off-list reviewer asked how the current progress
> of the operation is monitored.  So far I've been using pg_stat_activity, but I
> don't disagree that it's not a very sharp tool for this.  Maybe we need a
> specific function or view or something?  There clearly needs to be a way for a
> user to query the state and progress of a transition.
>

Yeah, I think a view like pg_stat_progress_checksums would work.

> * Throttling - right now the patch uses the vacuum access strategy, with the
> same cost options as vacuum, in order to implement throttling.  This is in part
> due to the patch starting out modelled around autovacuum as a worker, but it
> may not be the right match for throttling checksums.
>

IMHO it's reasonable to reuse the vacuum throttling.  Even if it's not perfect,
it does not seem great to invent something new and end up with two different
ways to throttle stuff.

> * Naming - the in-between states when data checksums are enabled or disabled
> are called inprogress-on and inprogress-off.
> The reason for this is simply
> that early on there were only three states: inprogress, on and off, and the
> process of disabling wasn't labeled with a state.  When this transition state
> was added it seemed like a good idea to tack the end-goal onto the transition.
> These state names make the code easily greppable but might not be the most
> obvious choices for anything user facing.  Are "Enabling" and "Disabling"
> better terms to use (across the board or just user facing) or should we stick
> to the current?
>

I think the naming is fine.  In the worst case we can rename that later, it
seems more like a detail.

> There are ways in which this processing can be optimized to achieve better
> performance, but in order to keep goalposts in sight and patch size down they
> are left as future work.
>

+1


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
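On the immediate-checkpoint question above, the idea is analogous to the
existing pg_backup_start(label, fast) interface.  A rough sketch of what that
could look like; the boolean parameter on pg_enable_data_checksums() is purely
hypothetical and not something the posted patch provides:

    -- existing API: the caller chooses an immediate or spread checkpoint
    SELECT pg_backup_start(label => 'base', fast => false);

    -- hypothetical equivalent for enabling checksums (not in the patch)
    SELECT pg_enable_data_checksums(fast => false);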
On Wed, Jul  3, 2024 at 01:20:10PM +0200, Tomas Vondra wrote:
> > * Restartability - the initial version of the patch did not support stateful
> > restarts; a shutdown (or crash) before checksums were enabled would result in
> > a need to start over from the beginning.  This was deemed the safe
> > orchestration method.  The lack of this feature was seen as a serious
> > drawback, so it was added.  Subsequent review instead found the patch to be
> > too complicated with too large a featureset.  I think there is merit to both
> > of these arguments: being able to restart is a great feature, and being able
> > to reason about the correctness of a smaller patch is also great.  As of this
> > submission I have removed the ability to restart to keep the scope of the
> > patch small (which is where the previous version was, which received no
> > review after the removal).  The way I prefer to frame this is to first add
> > scaffolding and infrastructure (this patch) and leave refinements and add-on
> > features (restartability, but also others like parallel workers, optimizing
> > rare cases, etc.) for follow-up patches.
> >
>
> I 100% support this approach.

Yes, I was very disappointed when restartability sunk the patch, and I saw this
as another case where saying "yes" to every feature improvement can lead to
failure.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.
Hi,

On Mon, Sep 30, 2024 at 11:21:30PM +0200, Daniel Gustafsson wrote:
> > Yeah, I think a view like pg_stat_progress_checksums would work.
>
> Added in the attached version.  It probably needs some polish (the docs for
> sure do) but it's at least a start.

Just a nitpick, but we call it data_checksums about everywhere, but the new
view is called pg_stat_progress_datachecksums - I think
pg_stat_progress_data_checksums would look better even if it gets quite long.


Michael
> On 1 Oct 2024, at 00:43, Michael Banck <mbanck@gmx.net> wrote:
>
> Hi,
>
> On Mon, Sep 30, 2024 at 11:21:30PM +0200, Daniel Gustafsson wrote:
>>> Yeah, I think a view like pg_stat_progress_checksums would work.
>>
>> Added in the attached version.  It probably needs some polish (the docs for
>> sure do) but it's at least a start.
>
> Just a nitpick, but we call it data_checksums about everywhere, but the
> new view is called pg_stat_progress_datachecksums - I think
> pg_stat_progress_data_checksums would look better even if it gets quite
> long.

That's a fair point, I'll make sure to switch for the next version of the patch.

--
Daniel Gustafsson
Hi,

I did a quick review today.  First a couple of minor comments:

1) monitoring.sgml typos:

   number of database -> number of databases
   calcuated -> calculated

2) unnecessary newline in heapam.c (no other changes)

3) unnecessary ListCell in DataChecksumsWorkerMain() on line 1345, shadowing an
earlier variable

4) unnecessary comment change in bufmgr.c (no other changes)

5) unnecessary include and newline in bulk_write.c (no other changes)

6) unnecessary newline in pgstat.c (no other changes)

7) controldata.c - maybe this

   if (oldctrl->data_checksum_version == 2)

should use PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION instead of the magic
constant?  For "off" we use "0" which seems somewhat acceptable, but for other
values it's less obvious what the meaning is.

8) xlog_internal.h - xl_checksum_state should be added to typedefs

9) system_functions.sql - Isn't it weird that this only creates the new
pg_enable_data_checksums function, but not pg_disable_data_checksums?  It also
means it doesn't revoke EXECUTE from public on it, which I guess it probably
should?  Or why should this be different for the two functions?

Also the error messages seem to differ:

   test=> select pg_enable_data_checksums();
   ERROR:  permission denied for function pg_enable_data_checksums
   test=> select pg_disable_data_checksums();
   ERROR:  must be superuser

Probably harmless, but seems a bit strange.


But there also seems to be a more serious problem with recovery.  I did a
simple script that does a loop of

* start a cluster
* initialize a small pgbench database (scale 1 - 100)
* run "long" pgbench
* call pg_enable_data_checksums(), wait for it to complete
* stop the cluster with "-m immediate"
* start the cluster

And this unfortunately hits this assert:

   bool
   AbsorbChecksumsOnBarrier(void)
   {
       Assert(LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
       LocalDataChecksumVersion = PG_DATA_CHECKSUM_VERSION;
       return true;
   }

Based on our short discussion about this, the controlfile gets updated right
after pg_enable_data_checksums() completes.  The immediate stop however forces
a recovery since the last checkpoint, which means we see the XLOG_CHECKSUMS WAL
message again, and set the barrier.  And then we exit recovery, try to start
the checkpointer, and it trips over this, because the control file already has
the "on" value :-(

I'm not sure what's the best way to fix this.  Would it be possible to remember
that we saw XLOG_CHECKSUMS during recovery, and make the assert a noop in that
case?  Or not set the barrier when exiting recovery.  I'm not sure the relaxed
assert would remain meaningful, though.  What would it check on standbys, for
example?

Maybe a better way would be to wait for a checkpoint before updating the
controlfile, similar to what we do at the beginning?  Possibly even with the
same "fast=true/false" logic.  That would prevent us from seeing the
XLOG_CHECKSUMS WAL record with the updated flag.  It would extend the "window"
where a crash would mean we have to redo the checksums, but I don't think that
matters much.  For small databases who cares, and for large databases it should
not be a meaningful difference (setting the checksums already ran over multiple
checkpoints, so one checkpoint is not a big difference).


regards

--
Tomas Vondra
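A rough sketch of a test loop along the lines described above.  All paths,
scales, and durations are made up, and the wait condition assumes the
data_checksums GUC reports "on" once the worker finishes, per the description
earlier in the thread:

    #!/bin/sh
    # Illustrative loop: enable checksums, crash the cluster, restart it.
    PGDATA=/tmp/cksum_test
    export PGDATA

    for i in 1 2 3 4 5; do
        initdb -D "$PGDATA" > /dev/null
        pg_ctl -D "$PGDATA" -l "$PGDATA/logfile" start
        createdb test
        pgbench -i -s 50 test                   # small pgbench database
        pgbench -T 60 -c 4 test &               # "long" pgbench in the background
        psql test -c "SELECT pg_enable_data_checksums();"
        # wait for the worker to finish (assumes the GUC flips to "on")
        until [ "$(psql -XAtc 'SHOW data_checksums' test)" = "on" ]; do
            sleep 1
        done
        wait                                    # let pgbench finish
        pg_ctl -D "$PGDATA" stop -m immediate   # forces crash recovery on next start
        pg_ctl -D "$PGDATA" -l "$PGDATA/logfile" start
        pg_ctl -D "$PGDATA" stop
        rm -rf "$PGDATA"
    done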
Tomas Vondra <tomas@vondra.me> writes:

> 3) unnecessary ListCell in DataChecksumsWorkerMain() on line 1345,
> shadowing earlier variable

All the ListCell variables can be eliminated by using the foreach_ptr and
foreach_oid macros instead of plain foreach.

- ilmari
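For anyone not familiar with those macros (from nodes/pg_list.h), the
difference looks roughly like this; the dboids variable is just an
illustrative List of database OIDs, not code from the patch:

    /* plain foreach: the ListCell is visible and can shadow an outer variable */
    ListCell   *lc;

    foreach(lc, dboids)
    {
        Oid         dboid = lfirst_oid(lc);

        /* ... process the database ... */
    }

    /* foreach_oid: the loop variable is declared by the macro, no ListCell */
    foreach_oid(dboid, dboids)
    {
        /* ... process the database ... */
    }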
> On 7 Oct 2024, at 20:42, Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> wrote:
>
> Tomas Vondra <tomas@vondra.me> writes:
>
>> 3) unnecessary ListCell in DataChecksumsWorkerMain() on line 1345,
>> shadowing earlier variable
>
> All the ListCell variables can be eliminated by using the foreach_ptr
> and foreach_oid macros instead of plain foreach.

Fair point, done in the v4 attached upthread.

--
Daniel Gustafsson
On 10/8/24 22:38, Daniel Gustafsson wrote:
>
>> 7) controldata.c - maybe this
>>
>>    if (oldctrl->data_checksum_version == 2)
>>
>> should use PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION instead of the magic
>> constant?  For "off" we use "0" which seems somewhat acceptable, but for
>> other values it's less obvious what the meaning is.
>
> It doesn't seem clean to include storage/bufpage.h in pg_upgrade, I wonder if
> we should move (or mirror) the checksum versions to storage/checksum_impl.h to
> make them available to frontend and backend tools?
>

+1 to have that in checksum_impl.h

>
>> But there also seems to be a more serious problem with recovery.  I did a
>> simple script that does a loop of
>>
>> * start a cluster
>> * initialize a small pgbench database (scale 1 - 100)
>> * run "long" pgbench
>> * call pg_enable_data_checksums(), wait for it to complete
>> * stop the cluster with "-m immediate"
>> * start the cluster
>>
>> And this unfortunately hits this assert:
>>
>>    bool
>>    AbsorbChecksumsOnBarrier(void)
>>    {
>>        Assert(LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
>>        LocalDataChecksumVersion = PG_DATA_CHECKSUM_VERSION;
>>        return true;
>>    }
>>
>> Based on our short discussion about this, the controlfile gets updated
>> right after pg_enable_data_checksums() completes.  The immediate stop
>> however forces a recovery since the last checkpoint, which means we see
>> the XLOG_CHECKSUMS WAL message again, and set the barrier.  And then we
>> exit recovery, try to start the checkpointer, and it trips over this,
>> because the control file already has the "on" value :-(
>>
>> I'm not sure what's the best way to fix this.  Would it be possible to
>> remember that we saw XLOG_CHECKSUMS during recovery, and make the assert a
>> noop in that case?  Or not set the barrier when exiting recovery.  I'm not
>> sure the relaxed assert would remain meaningful, though.  What would it
>> check on standbys, for example?
>>
>> Maybe a better way would be to wait for a checkpoint before updating the
>> controlfile, similar to what we do at the beginning?  Possibly even with
>> the same "fast=true/false" logic.  That would prevent us from seeing the
>> XLOG_CHECKSUMS WAL record with the updated flag.  It would extend the
>> "window" where a crash would mean we have to redo the checksums, but I
>> don't think that matters much.  For small databases who cares, and for
>> large databases it should not be a meaningful difference (setting the
>> checksums already ran over multiple checkpoints, so one checkpoint is
>> not a big difference).
>
> The more I think about it the more I think that updating the control file is
> the wrong thing to do for this patch, it should only change the state in
> memory and let the checkpoints update the controlfile.  The attached fixes
> that and I can no longer reproduce the assertion failure you hit.
>

I think leaving the update of the controlfile to the checkpointer is correct,
and probably the only way to make this correct (without race conditions).  We
need to do that automatically with the checkpoint (which updates the redo LSN,
guaranteeing we won't see the XLOG_CHECKSUMS record again).

I ran the tests with this new patch, and I haven't reproduced the crashes.
I'll let it run a bit longer, and improve it to test some more stuff, but it
looks good.


regards

--
Tomas Vondra
Hi,

I spent a bit more time doing some testing on the last version of the patch
from [1].  And I ran into this assert in PostmasterStateMachine() when stopping
the cluster:

   /* All types should be included in targetMask or remainMask */
   Assert((remainMask.mask | targetMask.mask) == BTYPE_MASK_ALL.mask);

At first I was puzzled, as this happens on every shutdown, but that's because
these checks were introduced by a78af0427015 a week ago.  So it's more a matter
of rebasing.

However, I also noticed the progress monitoring does not really work.  I get
stuff like this:

   + psql -x test -c 'select * from pg_stat_progress_data_checksums'
   -[ RECORD 1 ]---------------------+---------
   pid                               | 56811
   datid                             | 0
   datname                           |
   phase                             | enabling
   databases_total                   | 4
   relations_total                   |
   databases_processed               | 0
   relations_processed               | 0
   databases_current                 | 16384
   relation_current                  | 0
   relation_current_blocks           | 0
   relation_current_blocks_processed | 0

But I've never seen any of the "processed" fields be non-zero (and
relations_total is even NULL), and the same thing applies to the relation_
fields.  Also, what is the datid/datname about?  It's empty, not mentioned in
the sgml docs, and we already have databases_current ...

The message [2] from 10/08 says:

> I did remove parts of the progress reporting for now since it can't be
> used from the dynamic backgroundworker it seems.  I need to regroup
> and figure out a better way there, but I wanted to address your above
> find sooner rather than wait for that.

And I guess that would explain why some of the fields are not updated.  But
then the later patch versions seem to imply there are no outstanding issues /
missing stuff.


regards

[1] https://www.postgresql.org/message-id/CA226DE1-DC9A-4675-A83C-32270C473F0B%40yesql.se
[2] https://www.postgresql.org/message-id/DD25705F-E75F-4DCA-B49A-5578F4F55D94%40yesql.se

--
Tomas Vondra