Re: Online checksums patch - once again - Mailing list pgsql-hackers
From | Daniel Gustafsson |
---|---|
Subject | Re: Online checksums patch - once again |
Date | |
Msg-id | 76117EC5-A4B8-43DC-B4A2-EF6AAF70F149@yesql.se Whole thread Raw |
In response to | Re: Online checksums patch - once again (Daniel Gustafsson <daniel@yesql.se>) |
List | pgsql-hackers |
> On 2 Sep 2020, at 14:22, Daniel Gustafsson <daniel@yesql.se> wrote: > The main synchronization mechanisms are the use of the inprogress mode where > data checksums are written but not verified, and by waiting for all > pre-existing non-compatible processes (transactions, temp tables) to disappear > before enabling. > > That being handwavily said, I've started to write down a matrix with classes of > possible synchronization bugs and how the patch handles them in order to > properly respond. Having spent some more time on this, I believe I have a better answer (and patch version) to give. First, a thank you for asking insightful questions. While working through the cases I realized that the previous version has a problematic window: when disabling checksums, if backend A absorbs the data checksums "off" barrier (or starts after the controlfile has been changed) and writes a page without a checksum, while backend B has yet to absorb the barrier and is still in "on" and reads that very same page. The solution IMO is to introduce an inprogress state for disabling as well where all backends keep writing checksums but not validating them until no backend is in "on" state anymore. Once all backends are in "inprogress-off", they can stop writing checksums and transition to "off". I even had this state in a previous unsubmitted version, embarrassingly so to the point of the function prototype being there as a leftover in v19. Now, synchronization happens on two levels in this patch for both the enable and disable case : (i) between backends when transitioning between states and (ii) inside the worker when synchronizing the current backends. For (i) it's using "inprogress" states to ensure a safe transition to "on" or "off". The states are themselves transitioned to via procsignalbarriers. Both enabling and disabling follow the same logic, with the only difference being the order in which operations are switched on/off during inprogress. For (ii) the workers are waiting for incompatible concurrent processing to end, such as temporary tables etc (only affects enabling data checksums). I've tried to write down the synchronization steps in datachecksumsworker.c to document the code, and for ease of discussion I've pasted that part of the diff below as well: * Synchronization and Correctness * ------------------------------- * The processes involved in enabling, or disabling, data checksums in an * online cluster must be properly synchronized with the normal backends * serving concurrent queries to ensure correctness. Correctness is defined * as the following: * * - Backends SHALL NOT violate local datachecksum state * - Data checksums SHALL NOT be considered enabled cluster-wide until all * currently connected backends have the local state "enabled" * * There are two levels of synchronization required for enabling data checksums * in an online cluster: (i) changing state in the active backends ("on", * "off", "inprogress-on" and "inprogress-off"), and (ii) ensuring no * incompatible objects and processes are left in a database when workers end. * The former deals with cluster-wide agreement on data checksum state and the * latter with ensuring that any concurrent activity cannot break the data * checksum contract during processing. * * Synchronizing the state change is done with procsignal barriers, where the * backend updating the global state in the controlfile will wait for all other * backends to absorb the barrier before WAL logging. Barrier absorption will * happen during interrupt processing, which means that connected backends will * change state at different times. * * When Enabling Data Checksums * ---------------------------- * A process which fails to observe data checksums being enabled can induce * two types of errors: failing to write the checksum when modifying the page * and failing to validate the data checksum on the page when reading it. * * When the DataChecksumsWorker has finished writing checksums on all pages * and enable data checksums cluster-wide, there are three sets of backends: * * Bg: Backend updating the global state and emitting the procsignalbarrier * Bd: Backends on "off" state * Be: Backends in "on" state * Bi: Backends in "inprogress-on" state * * Backends transition from the Bd state to Be like so: Bd -> Bi -> Be * * Backends in Bi and Be will write checksums when modifying a page, but only * backends in Be will verify the checksum during reading. The Bg backend is * blocked waiting for all backends in Bi to process interrupts and move to * Be. Any backend starting will observe the global state being "on" and will * thus automatically belong to Be. Checksums are enabled cluster-wide when * Bi is an empty set. All sets are compatible while still operating based on * their local state. * * When Disabling Data Checksums * ----------------------------- * A process which fails to observe data checksums being disabled can induce * two types of errors: writing the checksum when modifying the page and * validating a data checksum which is no longer correct due to modifications * to the page. * * Bg: Backend updating the global state and emitting the procsignalbarrier * Bd: Backands in "off" state * Be: Backends in "on" state * Bi: Backends in "inprogress-off" state * * Backends transition from the Be state to Bd like so: Be -> Bi -> Bd * * The goal is to transition all backends to Bd making the others empty sets. * Backends in Bi writes data checksums, but don't validate them, such that * backends still in Be can continue to validate pages until the barrier has * been absorbed such that they are in Bi. Once all backends are in Bi, the * barrier to transition to "off" can be raised and all backends can safely * stop writing data checksums as no backend is enforcing data checksum * validation. I hope this clarifies the reasoning behind the implementation. This has been implemented in the attached v21 patch. "inprogress" was in need of a new name before, and with "inprogress-{on|off}" the need is even bigger. Suggestions for better names are highly appreciated, I'm drawing blanks here. There are some minor fixes and documentation touchups in this version as well, but not the SIGINT handling since I wanted to focus on one thing at a time. cheers ./daniel
Attachment
pgsql-hackers by date: