Thread: Offline enabling/disabling of data checksums
Hi,

the attached patch adds offline enabling/disabling of checksums to pg_verify_checksums. It is based on independent work that both Michael (Paquier) and I did earlier this year and takes changes from both, see https://github.com/credativ/pg_checksums and https://github.com/michaelpq/pg_plugins/tree/master/pg_checksums

It adds an (now mandatory) --action parameter that takes either verify, enable or disable as argument.

This is basically meant as a stop-gap measure in case online activation of checksums won't make it for v12, but maybe it is independently useful?

Things I have not done so far:

1. Rename pg_verify_checksums to e.g. pg_checksums, as it will no longer only verify checksums.

2. Rename the scan_* functions (Michael renamed them to operate_file and operate_directory, but I am not sure it is worth it).

3. Once that patch is in, there would be a way to disable checksums, so there'd be a case to also change the initdb default to enabled, but that requires further discussion (and maybe another round of benchmarks).


Michael

--
Michael Banck
Projektleiter / Senior Berater, credativ GmbH
Attachment
On Fri, Dec 21, 2018 at 09:16:16PM +0100, Michael Banck wrote:
> It adds an (now mandatory) --action parameter that takes either verify,
> enable or disable as argument.

There are two discussion points which deserve attention here:

1) Do we want to rename pg_verify_checksums to something else, like pg_checksums? I like a lot if we would do a simple renaming of the tool, which should be the first step taken.

2) Which kind of interface do we want to use? When I did my own flavor of pg_checksums, I used an --action switch able to use the following values:
- enable
- disable
- verify
The switch cannot be specified twice (perhaps we could enforce the last value as other binaries do in the tree, not sure if that's adapted here). A second type of interface is to use one switch per action. For both interfaces, if no action is specified, the tool fails. Vote is open.

> This is basically meant as a stop-gap measure in case online activation
> of checksums won't make it for v12, but maybe it is independently
> useful?

I think that this is independently useful; I got this stuff part of an upgrade workflow where the user is ready to accept some extra one-time offline time so as checksums are enabled.

> Things I have not done so far:
>
> 1. Rename pg_verify_checksums to e.g. pg_checksums as it will no longer
> only verify checksums.

Check. That sounds right to me.

> 2. Rename the scan_* functions (Michael renamed them to operate_file and
> operate_directory but I am not sure it is worth it.

The renaming makes sense, as "scan" implies only reading, while enabling checksums causes a write.

> 3. Once that patch is in, there would be a way to disable checksums so
> there'd be a case to also change the initdb default to enabled, but that
> required further discussion (and maybe another round of benchmarks).

Perhaps, but that's unrelated to this thread. I am not sure that all users would be ready to pay the extra cost of checksums enabled by default.
--
Michael
Attachment
Hi,

I have added it to the commitfest now: https://commitfest.postgresql.org/21/1944/

On Sat, Dec 22, 2018 at 08:28:34AM +0900, Michael Paquier wrote:
> On Fri, Dec 21, 2018 at 09:16:16PM +0100, Michael Banck wrote:
> > It adds an (now mandatory) --action parameter that takes either verify,
> > enable or disable as argument.
>
> There are two discussion points which deserve attention here:
> 1) Do we want to rename pg_verify_checksums to something else, like
> pg_checksums. I like a lot if we would do a simple renaming of the
> tool, which should be the first step taken.

I am for it, but don't mind whether it happens before or afterwards; your call.

> 2) Which kind of interface do we want to use? When I did my own
> flavor of pg_checksums, I used an --action switch able to use the
> following values:
> - enable
> - disable
> - verify
> The switch cannot be specified twice (perhaps we could enforce the
> last value as other binaries do in the tree, not sure if that's
> adapted here). A second type of interface is to use one switch per
> action. For both interfaces, if no action is specified, the tool
> fails. Vote is open.

Even though my fork has the separate switches, I like the --action one. On the other hand, it is a bit more typing, as you always have to spell out the action (is there precedent for also accepting incomplete option arguments like 'v', 'e', 'd'?).

> > This is basically meant as a stop-gap measure in case online activation
> > of checksums won't make it for v12, but maybe it is independently
> > useful?
>
> I think that this is independently useful, I got this stuff part of an
> upgrade workflow where the user is ready to accept some extra one-time
> offline time so as checksums are enabled.

OK; we have also used that at clients - if the instance has a size of less than a few dozen GBs, enabling checksums during a routine minor upgrade restart does not delay things much.

> > 2. Rename the scan_* functions (Michael renamed them to operate_file and
> > operate_directory but I am not sure it is worth it.
>
> The renaming makes sense, as scan implies only reading while enabling
> checksums causes a write.

Ok, will do in the next version.


Michael

--
Michael Banck
Projektleiter / Senior Berater, credativ GmbH
On Sat, Dec 22, 2018 at 02:42:55PM +0100, Michael Banck wrote:
> On Sat, Dec 22, 2018 at 08:28:34AM +0900, Michael Paquier wrote:
>> There are two discussion points which deserve attention here:
>> 1) Do we want to rename pg_verify_checksums to something else, like
>> pg_checksums. I like a lot if we would do a simple renaming of the
>> tool, which should be the first step taken.
>
> I am for it, but don't mind whether it's before or afterwards, your
> call.

Doing the renaming afterwards would be a bit weird logically, as we would end up with a point in time in the tree where pg_verify_checksums is able to do something other than just verify checksums.

> Even though my fork has the separate switches, I like the --action one.
> On the other hand, it is a bit more typing as you always have to spell
> out the action (is there precedent for also accepting incomplete option
> arguments like 'v', 'e', 'd'?).

Yes, there is a bit of that in psql, for example for formats. I am not sure that we should take this road for a checksumming tool though. If a new option were added which takes the same first letter, we would have incompatibility issues. That's unlikely to happen, but it still feels uneasy.
--
Michael
Attachment
On Fri, Dec 21, 2018 at 6:28 PM Michael Paquier <michael@paquier.xyz> wrote:
> 2) Which kind of interface do we want to use? When I did my own
> flavor of pg_checksums, I used an --action switch able to use the
> following values:
> - enable
> - disable
> - verify
> The switch cannot be specified twice (perhaps we could enforce the
> last value as other binaries do in the tree, not sure if that's
> adapted here). A second type of interface is to use one switch per
> action. For both interfaces, if no action is specified, the tool
> fails. Vote is open.

I vote for separate switches. Using the same switch with an argument seems like it adds typing for no real gain.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hallo Michael,

> It adds an (now mandatory) --action parameter that takes either verify,
> enable or disable as argument.

I'd rather have explicit switches for verify, enable & disable, and verify would be the default if none is provided.

> This is basically meant as a stop-gap measure in case online activation
> of checksums won't make it for v12, but maybe it is independently
> useful?

I would say yes.

> 1. Rename pg_verify_checksums to e.g. pg_checksums as it will no longer
> only verify checksums.

I'd agree to rename the tool as "pg_checksums".

> 2. Rename the scan_* functions (Michael renamed them to operate_file and
> operate_directory but I am not sure it is worth it.

Hmmm. The file is indeed scanned, and "operate" is kind of very fuzzy.

> 3. Once that patch is in, there would be a way to disable checksums so
> there'd be a case to also change the initdb default to enabled, but that
> required further discussion (and maybe another round of benchmarks).

My 0.02€ is that data safety should come first, thus checksums should be enabled by default.

About the patch: applies, compiles, "make check" ok.

There is no documentation.

In "scan_file", I would open RW only for enable, but keep RO for verify.

Also, the full page is rewritten... would it make sense to only overwrite the checksum part itself?

It seems that the control file is unlinked and then rewritten. If the rewriting fails, or the command is interrupted, the user has a problem. Could the control file be simply opened RW? Otherwise, I would suggest renaming it (e.g. add .tmp), writing the new one, then unlinking the old one, so that recovering the old state in case of a problem is possible.

For enable/disable, while the command is running, it should mark the cluster as opened to prevent an unwanted database start. I do not see where this is done.

--
Fabien.
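As an illustration of the "write a new copy, then swap it in" idea for the control file, here is a minimal, hypothetical sketch using plain POSIX calls. It is not code from the patch: the helper name is invented, error handling is abbreviated, and the directory fsync needed for full durability is omitted.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: replace global/pg_control by writing the new
 * contents to a temporary file, syncing it, then renaming it into place,
 * so an interrupted run never leaves a truncated or missing control file.
 */
static void
write_controlfile_atomically(const char *path, const char *buf, size_t len)
{
	char		tmppath[1024];
	int			fd;

	snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);

	fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0 ||
		write(fd, buf, len) != (ssize_t) len ||
		fsync(fd) != 0 ||
		close(fd) != 0)
	{
		fprintf(stderr, "could not write \"%s\": %s\n",
				tmppath, strerror(errno));
		exit(1);
	}

	/* rename() is atomic, so a crash leaves either the old or the new file */
	if (rename(tmppath, path) != 0)
	{
		fprintf(stderr, "could not rename \"%s\" to \"%s\": %s\n",
				tmppath, path, strerror(errno));
		exit(1);
	}
}

This is only one variant of the recovery-friendly replacement Fabien describes; simply opening the existing file read-write, as discussed later in the thread, is another.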
On Wed, Dec 26, 2018 at 07:43:17PM +0100, Fabien COELHO wrote:
>> It adds an (now mandatory) --action parameter that takes either verify,
>> enable or disable as argument.
>
> I'd rather have explicit switches for verify, enable & disable, and verify
> would be the default if none is provided.

Okay, noted for the separate switches. But I don't agree with the point of assuming that --verify should be enforced if no switches are defined. That feels like a trap for newcomers of this tool..

> For enable/disable, while the command is running, it should mark the cluster
> as opened to prevent an unwanted database start. I do not see where this is
> done.

You have pretty much the same class of problems if you attempt to start a cluster on which pg_rewind or the existing pg_verify_checksums is run after these have scanned the control file to make sure that they work on a cleanly-stopped instance. In short, this is a deal between code simplicity and trying to have the tool outsmart users in the way users use the tool. I tend to prefer keeping the code simple and not worry about cases where the users mess up with their application, as there are many more things which could go wrong.
--
Michael
Attachment
Bonjour Michaël,

>>> It adds an (now mandatory) --action parameter that takes either verify,
>>> enable or disable as argument.
>>
>> I'd rather have explicit switches for verify, enable & disable, and verify
>> would be the default if none is provided.
>
> Okay, noted for the separate switches. But I don't agree with the
> point of assuming that --verify should be enforced if no switches are
> defined. That feels like a trap for newcomers of this tool..

Hmmm. It does something safe and useful, especially if it also works online (patch pending), and the initial tool only does checking. However, I'd be okay with no default.

>> For enable/disable, while the command is running, it should mark the cluster
>> as opened to prevent an unwanted database start. I do not see where this is
>> done.
>
> You have pretty much the same class of problems if you attempt to
> start a cluster on which pg_rewind or the existing pg_verify_checksums
> is run after these have scanned the control file to make sure that
> they work on a cleanly-stopped instance. In short, this is a deal
> between code simplicity and trying to have the tool outsmart users in
> the way users use the tool. I tend to prefer keeping the code simple
> and not worry about cases where the users mess up with their
> application, as there are many more things which could go wrong.

Hmmm. I do not buy the comparison. A verify that fails is not a big problem, you can run it again. If pg_rewind fails, you can probably run it again as well; the source is probably still consistent even if it has changed, and too bad for the target side, but it was scheduled to be overwritten anyway.

However, a tool which overwrites files behind the back of a running server is a recipe for data loss, so I think it should take much more care, i.e. put the server into some specific safe state.

About code simplicity: probably there is, or there should be, a change-the-state function somewhere, because quite a few tools could use it?

--
Fabien.
On Thu, Dec 27, 2018 at 2:15 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Wed, Dec 26, 2018 at 07:43:17PM +0100, Fabien COELHO wrote:
> >> It adds an (now mandatory) --action parameter that takes either verify,
> >> enable or disable as argument.
> >
> > I'd rather have explicit switches for verify, enable & disable, and verify
> > would be the default if none is provided.
>
> Okay, noted for the separate switches. But I don't agree with the
> point of assuming that --verify should be enforced if no switches are
> defined. That feels like a trap for newcomers of this tool..
Defaulting to the choice that makes no actual changes to the data surely is the safe choice, and not a trap :)
That said, this would probably be our first tool where you switch it between readonly and rewrite mode with just a switch, wouldn't it? All other tools are either read-only or read/write at the *tool* level, not the switch level.
That in itself would be an argument for making it a separate tool. But it's not a very strong one, I think; I prefer the single-tool-renamed approach as well.
There's plenty of precedent for the "separate switches and a default behaviour if none is specified" approach in other tools though, and I don't think that's generally considered a trap.
So count me in the camp for separate switches and default to verify. If one didn't mean that, it's only a quick Ctrl-C away with no damage done.
> > For enable/disable, while the command is running, it should mark the cluster
> > as opened to prevent an unwanted database start. I do not see where this is
> > done.
>
> You have pretty much the same class of problems if you attempt to
> start a cluster on which pg_rewind or the existing pg_verify_checksums
> is run after these have scanned the control file to make sure that
> they work on a cleanly-stopped instance. In short, this is a deal
> between code simplicity and trying to have the tool outsmart users in
> the way users use the tool. I tend to prefer keeping the code simple
> and not worry about cases where the users mess up with their
> application, as there are many more things which could go wrong.
I think it comes down to what the outcome is. If we're going to end up with a corrupt database (e.g. one where checksums aren't set everywhere but they are marked as such in pg_control), then it's not acceptable. If the only outcome is that the tool gives an error and is fine when re-run, then it's a different story.
On Sat, Dec 22, 2018 at 12:28 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, Dec 21, 2018 at 09:16:16PM +0100, Michael Banck wrote:
>
> I think that this is independently useful, I got this stuff part of an
> upgrade workflow where the user is ready to accept some extra one-time
> offline time so as checksums are enabled.
Very much so, IMHO.
> > Things I have not done so far:
> >
> > 1. Rename pg_verify_checksums to e.g. pg_checksums as it will no longer
> > only verify checksums.
>
> Check. That sounds right to me.
Should we double-check with packagers that this won't cause a problem? Though the fact that it's done in a major release should make it perfectly fine I think -- and it's a smaller change than when we did all those xlog->wal changes...
> > 3. Once that patch is in, there would be a way to disable checksums so
> > there'd be a case to also change the initdb default to enabled, but that
> > required further discussion (and maybe another round of benchmarks).
>
> Perhaps, that's unrelated to this thread though. I am not sure that
> all users would be ready to pay the extra cost of checksums enabled by
> default.
I'd be a strong +1 for changing the default once we have a painless way to turn them off.
It remains super-cheap to turn them off (stop the database, run one command, start it again). So those people who aren't willing to pay the overhead of checksums can very cheaply get away from it.
It's a lot more expensive to turn them on once your database has grown to some size (definitely in offline mode, but also in an online mode when we get that one in).
Plus, the majority of people *should* want them on :) We don't run with say synchronous_commit=off by default either to make it easier on those that don't want to pay the overhead of full data safety :P (I know it's not a direct match, but you get the idea)
>> For enable/disable, while the command is running, it should mark the
>> cluster as opened to prevent an unwanted database start. I do not see
>> where this is done.
>>
>> You have pretty much the same class of problems if you attempt to
>> start a cluster on which pg_rewind or the existing pg_verify_checksums
>> is run after these have scanned the control file to make sure that
>> they work on a cleanly-stopped instance. [...]
>
> I think it comes down to what the outcome is. If we're going to end up with
> a corrupt database (e.g. one where checksums aren't set everywhere but they
> are marked as such in pg_control) then it's not acceptable. If the only
> outcome is the tool gives an error that's not an error and if re-run it's
> fine, then it's a different story.

ISTM that such an outcome is indeed a risk, as a starting postgres could update already checksummed pages without putting a checksum. It could be even worse, although with a (very) low probability, with updates overwritten on a race condition between the processes. In any case, no error would be reported before much later, with invalid checksums or inconsistent data, or undetected forgotten committed data.

--
Fabien.
On 12/27/18 11:43 AM, Magnus Hagander wrote:
> On Sat, Dec 22, 2018 at 12:28 AM Michael Paquier <michael@paquier.xyz> wrote:
>> On Fri, Dec 21, 2018 at 09:16:16PM +0100, Michael Banck wrote:
>>
>> I think that this is independently useful, I got this stuff part of an
>> upgrade workflow where the user is ready to accept some extra one-time
>> offline time so as checksums are enabled.
>
> Very much so, IMHO.
>
>>> Things I have not done so far:
>>>
>>> 1. Rename pg_verify_checksums to e.g. pg_checksums as it will no longer
>>> only verify checksums.
>>
>> Check. That sounds right to me.
>
> Should we double-check with packagers that this won't cause a problem?
> Though the fact that it's done in a major release should make it
> perfectly fine I think -- and it's a smaller change than when we did all
> those xlog->wal changes...

I think it makes little sense to not rename the tool now. I'm pretty sure we'd end up doing that sooner or later anyway, and we'll just live with a misnamed tool until then.

>>> 3. Once that patch is in, there would be a way to disable checksums so
>>> there'd be a case to also change the initdb default to enabled, but that
>>> required further discussion (and maybe another round of benchmarks).
>>
>> Perhaps, that's unrelated to this thread though. I am not sure that
>> all users would be ready to pay the extra cost of checksums enabled by
>> default.
>
> I'd be a strong +1 for changing the default once we have a painless way
> to turn them off.
>
> It remains super-cheap to turn them off (stop the database, run one
> command, start it again). So those people who aren't willing to pay the
> overhead of checksums can very cheaply get away from it.
>
> It's a lot more expensive to turn them on once your database has grown
> to some size (definitely in offline mode, but also in an online mode
> when we get that one in).
>
> Plus, the majority of people *should* want them on :) We don't run with
> say synchronous_commit=off by default either to make it easier on those
> that don't want to pay the overhead of full data safety :P (I know it's
> not a direct match, but you get the idea)

I don't know, TBH. I agree making the on/off change cheaper moves us closer to 'on' by default, because they may disable it if needed. But it's not the whole story.

If we enable checksums by default, 99% of users will have them enabled. That means more people will actually observe data corruption cases that went unnoticed so far. What shall we do with that? We don't have very good answers to that (tooling, docs) and I'd say "disable checksums" is not a particularly amazing response in this case :-(

FWIW I don't know what to do about that. We certainly can't prevent the data corruption, but maybe we could help with fixing it (although that's bound to be low-level work).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 12/27/18 11:39 AM, Magnus Hagander wrote:
> On Thu, Dec 27, 2018 at 2:15 AM Michael Paquier <michael@paquier.xyz> wrote:
>> On Wed, Dec 26, 2018 at 07:43:17PM +0100, Fabien COELHO wrote:
>>>> It adds an (now mandatory) --action parameter that takes either verify,
>>>> enable or disable as argument.
>>>
>>> I'd rather have explicit switches for verify, enable & disable, and verify
>>> would be the default if none is provided.
>>
>> Okay, noted for the separate switches. But I don't agree with the
>> point of assuming that --verify should be enforced if no switches are
>> defined. That feels like a trap for newcomers of this tool..
>
> Defaulting to the choice that makes no actual changes to the data surely
> is the safe choice, and not a trap :)
>
> That said, this would probably be our first tool where you switch it
> between readonly and rewrite mode with just a switch, wouldn't it? All
> other tools are either read-only or read/write at the *tool* level, not
> the switch level.

Eh? Isn't pg_rewind "modify by default", with a --dry-run switch to run in read-only mode? So I'm not sure what you mean by "tool level" here.

FWIW I'd prefer sticking to the same approach for this tool too, i.e. have a "dry-run" switch that makes it read-only. IMHO that's a pretty common pattern.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Dec 27, 2018 at 3:54 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 12/27/18 11:39 AM, Magnus Hagander wrote:
>> On Thu, Dec 27, 2018 at 2:15 AM Michael Paquier <michael@paquier.xyz> wrote:
>>> On Wed, Dec 26, 2018 at 07:43:17PM +0100, Fabien COELHO wrote:
>>>>> It adds an (now mandatory) --action parameter that takes either verify,
>>>>> enable or disable as argument.
>>>>
>>>> I'd rather have explicit switches for verify, enable & disable, and verify
>>>> would be the default if none is provided.
>>>
>>> Okay, noted for the separate switches. But I don't agree with the
>>> point of assuming that --verify should be enforced if no switches are
>>> defined. That feels like a trap for newcomers of this tool..
>>
>> Defaulting to the choice that makes no actual changes to the data surely
>> is the safe choice, and not a trap :)
>>
>> That said, this would probably be our first tool where you switch it
>> between readonly and rewrite mode with just a switch, wouldn't it? All
>> other tools are either read-only or read/write at the *tool* level, not
>> the switch level.
>
> Eh? Isn't pg_rewind "modify by default", with a --dry-run switch to run in
> read-only mode? So I'm not sure what you mean by "tool level" here.
>
> FWIW I'd prefer sticking to the same approach for this tool too, i.e.
> have a "dry-run" switch that makes it read-only. IMHO that's a pretty
> common pattern.
That's a different thing.
pg_rewind in dry-run mode does the same thing, except it doesn't actually do it, it just pretends.
Verifying checksums is not the same as "turn on checksums except don't actually do it" or "turn off checksums except don't actually do it".
On Thu, Dec 27, 2018 at 03:46:48PM +0100, Tomas Vondra wrote:
> On 12/27/18 11:43 AM, Magnus Hagander wrote:
>> Should we double-check with packagers that this won't cause a problem?
>> Though the fact that it's done in a major release should make it
>> perfectly fine I think -- and it's a smaller change than when we did all
>> those xlog->wal changes...
>
> I think it makes little sense to not rename the tool now. I'm pretty
> sure we'd end up doing that sooner or later anyway, and we'll just live
> with a misnamed tool until then.

Do you think that a thread Would on -packagers be more adapted then?

> I don't know, TBH. I agree making the on/off change cheaper moves
> us closer to 'on' by default, because they may disable it if needed. But
> it's not the whole story.
>
> If we enable checksums by default, 99% of users will have them enabled.
> That means more people will actually observe data corruption cases that
> went unnoticed so far. What shall we do with that? We don't have very
> good answers to that (tooling, docs) and I'd say "disable checksums" is
> not a particularly amazing response in this case :-(

Enabling data checksums by default is still a couple of steps ahead, without a way to control them better..

> FWIW I don't know what to do about that. We certainly can't prevent the
> data corruption, but maybe we could help with fixing it (although that's
> bound to be low-level work).

Yes, data checksums are extremely useful to tell people when the problem is *not* from Postgres, which can be really hard in a large organization. Knowing about the corrupted page is also useful, as you can look at its contents and its bytes before it gets zero'ed, to spot patterns which can help other teams in charge of a lower level of the application layer.
--
Michael
Attachment
On 12/28/18 12:25 AM, Michael Paquier wrote:
> On Thu, Dec 27, 2018 at 03:46:48PM +0100, Tomas Vondra wrote:
>> On 12/27/18 11:43 AM, Magnus Hagander wrote:
>>> Should we double-check with packagers that this won't cause a problem?
>>> Though the fact that it's done in a major release should make it
>>> perfectly fine I think -- and it's a smaller change than when we did all
>>> those xlog->wal changes...
>>
>> I think it makes little sense to not rename the tool now. I'm pretty
>> sure we'd end up doing that sooner or later anyway, and we'll just live
>> with a misnamed tool until then.
>
> Do you think that a thread Would on -packagers be more adapted then?

I'm sorry, but I'm not sure I understand the question. Of course, asking over at -packagers won't hurt, but my guess is the response will be it's not a big deal from the packaging perspective.

>> I don't know, TBH. I agree making the on/off change cheaper moves
>> us closer to 'on' by default, because they may disable it if needed. But
>> it's not the whole story.
>>
>> If we enable checksums by default, 99% of users will have them enabled.
>> That means more people will actually observe data corruption cases that
>> went unnoticed so far. What shall we do with that? We don't have very
>> good answers to that (tooling, docs) and I'd say "disable checksums" is
>> not a particularly amazing response in this case :-(
>
> Enabling data checksums by default is still a couple of steps ahead,
> without a way to control them better..

What do you mean by "control" here? Dealing with checksum failures, or some additional capabilities?

>> FWIW I don't know what to do about that. We certainly can't prevent the
>> data corruption, but maybe we could help with fixing it (although that's
>> bound to be low-level work).
>
> Yes, data checksums are extremely useful to tell people when the
> problem is *not* from Postgres, which can be really hard in a large
> organization. Knowing about the corrupted page is also useful as you
> can look at its contents and look at its bytes before it gets zero'ed
> to spot patterns which can help other teams in charge of a lower level
> of the application layer.

I'm not sure data checksums are particularly great evidence. For example, with the recent fsync issues, we might have ended up with partial writes (and thus invalid checksums). The OS might have even told us about the failure, but we've gracefully ignored it. So I'm afraid data checksums are not a particularly great proof that it's not our fault.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Dec 28, 2018 at 01:14:05AM +0100, Tomas Vondra wrote:
> I'm sorry, but I'm not sure I understand the question. Of course, asking
> over at -packagers won't hurt, but my guess is the response will be it's
> not a big deal from the packaging perspective.

(The previous email had an extra "Would"... Sorry.) Let's ask those folks then.

> What do you mean by "control" here? Dealing with checksum failures, or
> some additional capabilities?

What I am referring to here is the possibility to enable, disable and check checksums for an online cluster. I am not sure what kind of tooling able to do surgery at the page level would make sense. Once a checksum is corrupted, a user knows about a problem, which mainly needs a human lookup.

> I'm not sure data checksums are particularly great evidence. For example,
> with the recent fsync issues, we might have ended up with partial writes
> (and thus invalid checksums). The OS might have even told us about the
> failure, but we've gracefully ignored it. So I'm afraid data checksums
> are not a particularly great proof that it's not our fault.

Sure, they are not a solution to all problems. Still, they give hints before the problem spreads, and sometimes, by looking at a corrupted page yourself, you can see whether the data fetched from disk comes from Postgres or not (say, by inspecting the page header with pageinspect, etc.).
--
Michael
Attachment
On Fri, Dec 28, 2018 at 1:14 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 12/28/18 12:25 AM, Michael Paquier wrote:
> > On Thu, Dec 27, 2018 at 03:46:48PM +0100, Tomas Vondra wrote:
> >> On 12/27/18 11:43 AM, Magnus Hagander wrote:
> >>> Should we double-check with packagers that this won't cause a problem?
> >>> Though the fact that it's done in a major release should make it
> >>> perfectly fine I think -- and it's a smaller change than when we did all
> >>> those xlog->wal changes...
> >>
> >> I think it makes little sense to not rename the tool now. I'm pretty
> >> sure we'd end up doing that sooner or later anyway, and we'll just live
> >> with a misnamed tool until then.
> >
> > Do you think that a thread Would on -packagers be more adapted then?
>
> I'm sorry, but I'm not sure I understand the question. Of course, asking
> over at -packagers won't hurt, but my guess is the response will be it's
> not a big deal from the packaging perspective.
I think a heads-up in the way of "planning to change it, now's the time to yell" is the reasonable thing.
> >> I don't know, TBH. I agree making the on/off change cheaper moves
> >> us closer to 'on' by default, because they may disable it if needed. But
> >> it's not the whole story.
> >>
> >> If we enable checksums by default, 99% of users will have them enabled.
> >> That means more people will actually observe data corruption cases that
> >> went unnoticed so far. What shall we do with that? We don't have very
> >> good answers to that (tooling, docs) and I'd say "disable checksums" is
> >> not a particularly amazing response in this case :-(
> >
> > Enabling data checksums by default is still a couple of steps ahead,
> > without a way to control them better..
>
> What do you mean by "control" here? Dealing with checksum failures, or
> some additional capabilities?
>
> >> FWIW I don't know what to do about that. We certainly can't prevent the
> >> data corruption, but maybe we could help with fixing it (although that's
> >> bound to be low-level work).
> >
> > Yes, data checksums are extremely useful to tell people when the
> > problem is *not* from Postgres, which can be really hard in a large
> > organization. Knowing about the corrupted page is also useful as you
> > can look at its contents and look at its bytes before it gets zero'ed
> > to spot patterns which can help other teams in charge of a lower level
> > of the application layer.
>
> I'm not sure data checksums are particularly great evidence. For example,
> with the recent fsync issues, we might have ended up with partial writes
> (and thus invalid checksums). The OS might have even told us about the
> failure, but we've gracefully ignored it. So I'm afraid data checksums
> are not a particularly great proof that it's not our fault.
They are great evidence that your data is corrupt. You *want* to know that your data is corrupt. Even if our best recommendation is "go restore your backups", you still want to know. Otherwise you are sitting around on data that's corrupt and you don't know about it.
There are certainly many things we can do to improve the experience. But not telling people their data is corrupt when it is, isn't one of them.
>>> [...]
>>
>> I'm not sure data checksums are particularly great evidence. For example,
>> with the recent fsync issues, we might have ended up with partial writes
>> (and thus invalid checksums). The OS might have even told us about the
>> failure, but we've gracefully ignored it. So I'm afraid data checksums
>> are not a particularly great proof that it's not our fault.
>
> They are great evidence that your data is corrupt. You *want* to know
> that your data is corrupt. Even if our best recommendation is "go restore
> your backups", you still want to know. Otherwise you are sitting around on
> data that's corrupt and you don't know about it.
>
> There are certainly many things we can do to improve the experience. But
> not telling people their data is corrupt when it is, isn't one of them.

Yep, anyone should want to know if their database is corrupt, compared to ignoring the fact. One reason not to enable it could be if the implementation is not trusted, i.e. if false positives (a corrupt page detected while the data are okay and there was only an issue with computing or storing the checksum) can occur.

There is also the performance impact. I did some quick-and-dirty pgbench simple-update single-thread performance tests to compare with vs. without checksums. Enabling checksums in these tests seems to induce a 1.4% performance penalty, although I'm only moderately confident about that given the standard deviation. At least it is an indication, and it seems to me that it is consistent with other figures previously reported on the list.

--
Fabien.
On Fri, Dec 28, 2018 at 10:12:24AM +0100, Magnus Hagander wrote:
> I think a heads-up in the way of "planning to change it, now's the time
> to yell" is the reasonable thing.

And done.
--
Michael
Attachment
Hi,

On Friday, 2018-12-28 at 10:12 +0100, Magnus Hagander wrote:
> On Fri, Dec 28, 2018 at 1:14 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > On 12/28/18 12:25 AM, Michael Paquier wrote:
> > > On Thu, Dec 27, 2018 at 03:46:48PM +0100, Tomas Vondra wrote:
> > >> On 12/27/18 11:43 AM, Magnus Hagander wrote:
> > >>> Should we double-check with packagers that this won't cause a problem?
> > >>> Though the fact that it's done in a major release should make it
> > >>> perfectly fine I think -- and it's a smaller change than when we did all
> > >>> those xlog->wal changes...
> > >>
> > >> I think it makes little sense to not rename the tool now. I'm pretty
> > >> sure we'd end up doing that sooner or later anyway, and we'll just live
> > >> with a misnamed tool until then.
> > >
> > > Do you think that a thread Would on -packagers be more adapted then?
> >
> > I'm sorry, but I'm not sure I understand the question. Of course, asking
> > over at -packagers won't hurt, but my guess is the response will be it's
> > not a big deal from the packaging perspective.
>
> I think a heads-up in the way of "planning to change it, now's the
> time to yell" is the reasonable thing.

Renaming applications shouldn't be a problem unless they have to be moved from one binary package to another. I assume all packagers ship all client and server binaries in one package, respectively (and not e.g. a dedicated postgresql-11-pg_test_fsync package), so this should only be a matter of updating package metadata.

In any case, it should be identical to the xlog->wal rename.


Michael

--
Michael Banck
Projektleiter / Senior Berater, credativ GmbH
On Sat, Dec 29, 2018 at 11:55:43AM +0100, Michael Banck wrote:
> Renaming applications shouldn't be a problem unless they have to be
> moved from one binary package to another. I assume all packagers ship
> all client and server binaries in one package, respectively (and not e.g.
> a dedicated postgresql-11-pg_test_fsync package), so this should only be
> a matter of updating package metadata.
>
> In any case, it should be identical to the xlog->wal rename.

I have poked -packagers on the matter and I am seeing no complaints, so let's move forward with this stuff. From the consensus I am seeing on the thread, we have been discussing the following points:

1) Rename pg_verify_checksums to pg_checksums.
2) Have separate switches for each action, aka --verify, --enable and --disable, or a unified --action switch which can take different values.
3) Do we want to imply --verify by default if no switch is specified?

About 2), folks who have expressed an opinion are:
- Multiple switches: Robert, Fabien, Magnus
- Single --action switch: Michael B, Michael P

About 3), aka --verify implied if no action is specified:
- In favor: Fabien C, Magnus
- Against: Michael P

If I missed what someone said, please feel free to add your votes here.
--
Michael
Attachment
Hi,

On Tuesday, 2019-01-01 at 11:38 +0900, Michael Paquier wrote:
> On Sat, Dec 29, 2018 at 11:55:43AM +0100, Michael Banck wrote:
> > Renaming applications shouldn't be a problem unless they have to be
> > moved from one binary package to another. I assume all packagers ship
> > all client and server binaries in one package, respectively (and not
> > e.g. a dedicated postgresql-11-pg_test_fsync package), so this should
> > only be a matter of updating package metadata.
> >
> > In any case, it should be identical to the xlog->wal rename.
>
> I have poked -packagers on the matter and I am seeing no complaints, so
> let's move forward with this stuff. From the consensus I am seeing on
> the thread, we have been discussing the following points:
> 1) Rename pg_verify_checksums to pg_checksums.
> 2) Have separate switches for each action, aka --verify, --enable and
> --disable, or a unified --action switch which can take different
> values.
> 3) Do we want to imply --verify by default if no switch is specified?
>
> About 2), folks who have expressed an opinion are:
> - Multiple switches: Robert, Fabien, Magnus
> - Single --action switch: Michael B, Michael P

I implemented the multiple-switches approach in my branch first anyway and don't mind a lot either way; I think the consensus goes towards multiple switches.

> About 3), aka --verify implied if no action is specified:
> - In favor: Fabien C, Magnus
> - Against: Michael P

I think I'm in favor as well.

I wonder whether we (or packagers) could then just ship a pg_verify_checksums -> pg_checksums symlink for compatibility if we/they want, as the behaviour would stay the same?


Michael

--
Michael Banck
Projektleiter / Senior Berater, credativ GmbH
On Tue, Jan 01, 2019 at 11:42:49AM +0100, Michael Banck wrote:
> On Tuesday, 2019-01-01 at 11:38 +0900, Michael Paquier wrote:
>> About 3), aka --verify implied if no action is specified:
>> - In favor: Fabien C, Magnus
>> - Against: Michael P
>
> I think I'm in favor as well.

Okay, that looks to be the direction to take then.

> I wonder whether we (or packagers) could then just ship a
> pg_verify_checksums -> pg_checksums symlink for compatibility if we/they
> want, as the behaviour would stay the same?

In the v10 development cycle this was discarded for the switch from pg_xlogdump to pg_waldump, and I don't think it's worth bothering with in the build this time either.
--
Michael
Attachment
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> On 12/27/18 11:43 AM, Magnus Hagander wrote:
> > Plus, the majority of people *should* want them on :) We don't run with
> > say synchronous_commit=off by default either to make it easier on those
> > that don't want to pay the overhead of full data safety :P (I know it's
> > not a direct match, but you get the idea)

+1 to having them on by default, we should have done that a long time ago.

> I don't know, TBH. I agree making the on/off change cheaper moves
> us closer to 'on' by default, because they may disable it if needed. But
> it's not the whole story.
>
> If we enable checksums by default, 99% of users will have them enabled.

Yes, and they'll then be able to catch data corruption much earlier. Today, 99% of our users don't have them enabled and have no clue whether their data has been corrupted on disk or not. That's not good.

> That means more people will actually observe data corruption cases that
> went unnoticed so far. What shall we do with that? We don't have very
> good answers to that (tooling, docs) and I'd say "disable checksums" is
> not a particularly amazing response in this case :-(

Now that we've got a number of tools available which will check the checksums in a running system and throw up warnings when corruption is found (pg_basebackup, pgBackRest and I think other backup tools, pg_checksums...), users will see corruption and have the option to restore from a backup before those backups expire and they're left with a corrupt database and backups which also have that corruption.

This ongoing call for specific tooling to do "something" about checksums is certainly good, but it's not right to say that we don't have existing documentation - we do, quite a bit of it, and it's all under the heading of "Backup and Recovery".

> FWIW I don't know what to do about that. We certainly can't prevent the
> data corruption, but maybe we could help with fixing it (although that's
> bound to be low-level work).

There's been some effort to try and automagically correct corrupted pages, but it's certainly not something I'm ready to trust beyond a "well, this is what it might have been" review. The answer today is to find a backup which isn't corrupt and restore from it on a known-good system. If adding explicit documentation to that effect would reduce your level of concern when it comes to enabling checksums by default, then I'm happy to do that.

Thanks!

Stephen
Attachment
Hi,

On Thursday, 2018-12-27 at 12:26 +0100, Fabien COELHO wrote:
> > > For enable/disable, while the command is running, it should mark the
> > > cluster as opened to prevent an unwanted database start. I do not see
> > > where this is done.
> > >
> > > You have pretty much the same class of problems if you attempt to
> > > start a cluster on which pg_rewind or the existing pg_verify_checksums
> > > is run after these have scanned the control file to make sure that
> > > they work on a cleanly-stopped instance. [...]
> >
> > I think it comes down to what the outcome is. If we're going to end up with
> > a corrupt database (e.g. one where checksums aren't set everywhere but they
> > are marked as such in pg_control) then it's not acceptable. If the only
> > outcome is the tool gives an error that's not an error and if re-run it's
> > fine, then it's a different story.
>
> ISTM that such an outcome is indeed a risk, as a starting postgres could
> update already checksummed pages without putting a checksum. It could be
> even worse, although with a (very) low probability, with updates
> overwritten on a race condition between the processes. In any case, no
> error would be reported before much later, with invalid checksums or
> inconsistent data, or undetected forgotten committed data.

One difference between pg_rewind and pg_checksums is that the latter potentially runs for a longer time (or rather a non-trivial amount of time, compared to pg_rewind), so the likelihood of another DBA saying "oh, that DB is down, let me start it again" might be much higher.

The question is how to do this reliably in an acceptable way? Just faking a postmaster.pid sounds pretty hackish to me; do you have any suggestions here?

The alternative would be to document that it needs to be made sure that the database is not started up while checksums are being enabled, which leaves room for pilot error.


Michael

--
Michael Banck
Projektleiter / Senior Berater, credativ GmbH
Hi,

On Wednesday, 2018-12-26 at 19:43 +0100, Fabien COELHO wrote:
> > It adds an (now mandatory) --action parameter that takes either verify,
> > enable or disable as argument.
>
> I'd rather have explicit switches for verify, enable & disable, and verify
> would be the default if none is provided.

I changed that to the switches -c/--verify (-c for check, as -v is taken; should it be --check as well? I personally like verify better), -d/--disable and -e/--enable.

> About the patch: applies, compiles, "make check" ok.
>
> There is no documentation.

Yeah, I'll write that once the CLI is settled.

> In "scan_file", I would open RW only for enable, but keep RO for verify.

OK, I've changed that.

> Also, the full page is rewritten... would it make sense to only overwrite
> the checksum part itself?

So just writing the page header? I find that a bit scary, and I don't expect much speedup as the OS would write the whole block anyway, I guess? I haven't touched that yet.

> It seems that the control file is unlinked and then rewritten. If the
> rewriting fails, or the command is interrupted, the user has a problem.
>
> Could the control file be simply opened RW? Else, I would suggest to
> rename (eg add .tmp), write the new one, then unlink the old one, so that
> recovering the old state in case of problem is possible.

I have mostly taken the pg_rewind code here; if there were a function that allowed safe offline changes of the control file, I'd be happy to use it, but I don't think it should be this patch that invents it.

In any case, I have removed the unlink() now (not sure where that came from), and changed it to open(O_WRONLY), same as in Michael's code and pg_rewind.

V2 attached.


Michael

--
Michael Banck
Projektleiter / Senior Berater, credativ GmbH
Attachment
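To make the "full page is rewritten" discussion easier to follow, here is a heavily simplified sketch of roughly what the per-page loop in scan_file() does when enabling checksums; it is not the patch's actual code. pg_checksum_page(), PageHeader, BLCKSZ and RELSEG_SIZE are PostgreSQL's, but the structure is abbreviated: error reporting, buffer alignment and the handling of new (all-zero) pages are omitted.

#include "postgres_fe.h"
#include "storage/bufpage.h"
#include "storage/checksum.h"

#include <fcntl.h>
#include <unistd.h>

/* Simplified sketch of the per-page verify/enable loop. */
static void
scan_file(const char *fn, BlockNumber segmentno, bool enable)
{
	char		buf[BLCKSZ];	/* alignment concerns ignored in this sketch */
	PageHeader	header = (PageHeader) buf;
	BlockNumber blockno;
	int			f;

	/* open read-only for verify, read-write when enabling checksums */
	f = open(fn, enable ? O_RDWR : O_RDONLY, 0);
	if (f < 0)
		exit(1);				/* real code reports the error */

	for (blockno = 0;; blockno++)
	{
		uint16		csum;

		if (read(f, buf, BLCKSZ) != BLCKSZ)
			break;				/* EOF (short reads ignored here) */

		csum = pg_checksum_page(buf, blockno + segmentno * RELSEG_SIZE);

		if (enable)
		{
			header->pd_checksum = csum;

			/* seek back and rewrite the whole page in place */
			if (lseek(f, -(off_t) BLCKSZ, SEEK_CUR) < 0 ||
				write(f, buf, BLCKSZ) != BLCKSZ)
				exit(1);
		}
		else if (csum != header->pd_checksum)
			fprintf(stderr, "checksum verification failed in \"%s\", block %u\n",
					fn, blockno);
	}

	close(f);
}

The verify-only path never writes, which is why the review above suggests opening the file read-only in that case; the enable path rewrites whole pages, which is the behavior questioned and then defended later in the thread.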
> One difference between pg_rewind and pg_checksums is that the latter
> potentially runs for a longer time (or rather a non-trivial amount of
> time, compared to pg_rewind), so the likelihood of another DBA saying
> "oh, that DB is down, let me start it again" might be much higher.
>
> The question is how to do this reliably in an acceptable way? Just
> faking a postmaster.pid sounds pretty hackish to me; do you have any
> suggestions here?

Adding a new state to ControlFileData which would prevent it from starting?

--
Fabien.
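As a purely illustrative sketch of that idea (not something the patch does): the existing DBState enum in src/include/catalog/pg_control.h could in principle grow an extra value that the startup sequence refuses to start from. The added value and the startup check below are invented for illustration only.

/*
 * Existing values from src/include/catalog/pg_control.h, plus one
 * hypothetical addition at the end.
 */
typedef enum DBState
{
	DB_STARTUP = 0,
	DB_SHUTDOWNED,
	DB_SHUTDOWNED_IN_RECOVERY,
	DB_SHUTDOWNING,
	DB_IN_CRASH_RECOVERY,
	DB_IN_ARCHIVE_RECOVERY,
	DB_IN_PRODUCTION,
	DB_CHECKSUM_CHANGE_IN_PROGRESS	/* hypothetical new state */
} DBState;

/* Hypothetically, early in StartupXLOG() the server could then refuse to
 * start:
 *
 *     if (ControlFile->state == DB_CHECKSUM_CHANGE_IN_PROGRESS)
 *         ereport(FATAL,
 *                 (errmsg("checksums are currently being enabled or disabled"),
 *                  errhint("Complete or revert the offline checksum operation first.")));
 */

The downside, discussed immediately below, is that an interrupted or killed pg_checksums run would leave the cluster in that state until it is cleared again (e.g. with some --force option).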
On Tuesday, 2019-01-08 at 15:09 +0100, Fabien COELHO wrote:
> > The question is how to do this reliably in an acceptable way? Just
> > faking a postmaster.pid sounds pretty hackish to me; do you have any
> > suggestions here?
>
> Adding a new state to ControlFileData which would prevent it from
> starting?

But then you have to make sure the control flag gets cleared in case pg_verify_checksums crashes somehow or gets SIGKILL'ed ...

Setting the checksum flag is done after all blocks have been finished, so there is no problem there. But we would need to set this new flag before and reset it afterwards, so in between strange things can happen (as the various calls to exit() within the error handling illustrate).

Bernd.
On Tuesday, 2019-01-08 at 15:39 +0100, Bernd Helmle wrote:
> On Tuesday, 2019-01-08 at 15:09 +0100, Fabien COELHO wrote:
> > > The question is how to do this reliably in an acceptable way? Just
> > > faking a postmaster.pid sounds pretty hackish to me; do you have any
> > > suggestions here?
> >
> > Adding a new state to ControlFileData which would prevent it from
> > starting?
>
> But then you have to make sure the control flag gets cleared in case
> pg_verify_checksums crashes somehow or gets SIGKILL'ed ...
>
> Setting the checksum flag is done after all blocks have been finished, so
> there is no problem there. But we would need to set this new flag before
> and reset it afterwards, so in between strange things can happen (as the
> various calls to exit() within the error handling illustrate).

It seems writing a note like "pg_checksums is running" into postmaster.pid would work, and would give a hopefully useful hint to somebody trying to start Postgres while pg_checksums is running:

postgres@kohn:~$ echo "pg_checksums running with pid 1231, cluster disabled" > data/postmaster.pid
postgres@kohn:~$ pg_ctl -D data -l logfile start
pg_ctl: invalid data in PID file "data/postmaster.pid"
postgres@kohn:~$ echo $?
1
postgres@kohn:~$

If the DBA then just deletes postmaster.pid and starts over, well, then I call that pilot error; though we could in theory change pg_ctl (or whatever checks postmaster.pid) to emit an even more useful error message if it encounters a "cluster is locked" keyword in it.

Not sure whether everybody likes that (or whether it is future-proof, for that matter), but I like it better than adding a new field to the control file, for the reasons Bernd outlined above.


Michael

--
Michael Banck
Projektleiter / Senior Berater, credativ GmbH
>> Adding a new state to ControlFileData which would prevent it from
>> starting?
>
> But then you have to make sure the control flag gets cleared in case
> pg_verify_checksums crashes somehow or gets SIGKILL'ed ...

The usual approach is a restart with some --force option?

> Setting the checksum flag is done after all blocks have been finished, so
> there is no problem there.

There is also a problem if the db is started while checksums are being enabled.

> But we would need to set this new flag before and reset it afterwards, so
> in between strange things can happen (as the various calls to exit()
> within the error handling illustrate).

Sure, there is some need for a backup plan in case it fails and the control file is left in the wrong state.

--
Fabien.
>> Setting the checksum flag is done after all blocks have been finished, so
>> there is no problem there. But we would need to set this new flag before
>> and reset it afterwards, so in between strange things can happen (as the
>> various calls to exit() within the error handling illustrate).
>
> It seems writing a note like "pg_checksums is running" into postmaster.pid
> would work, and would give a hopefully useful hint to somebody trying to
> start Postgres while pg_checksums is running:
>
> postgres@kohn:~$ echo "pg_checksums running with pid 1231, cluster disabled" > data/postmaster.pid
> postgres@kohn:~$ pg_ctl -D data -l logfile start
> pg_ctl: invalid data in PID file "data/postmaster.pid"
> postgres@kohn:~$ echo $?
> 1

Looks ok, but I'm unsure how portable it is. What if the server is started with "postmaster" directly?

> If the DBA then just deletes postmaster.pid and starts over, well, then I
> call that pilot error; though we could in theory change pg_ctl (or
> whatever checks postmaster.pid) to emit an even more useful error message
> if it encounters a "cluster is locked" keyword in it.
>
> Not sure whether everybody likes that (or whether it is future-proof, for
> that matter), but I like it better than adding a new field to the control
> file, for the reasons Bernd outlined above.

ISTM that the point of the control file is exactly to tell what the current status of the cluster is, so that is where this information really belongs? AFAICS all commands take care of the status in some way to avoid accidents.

--
Fabien.
On Tuesday, 2019-01-08 at 16:17 +0100, Fabien COELHO wrote:
> > > Adding a new state to ControlFileData which would prevent it from
> > > starting?
> >
> > But then you have to make sure the control flag gets cleared in case
> > pg_verify_checksums crashes somehow or gets SIGKILL'ed ...
>
> The usual approach is a restart with some --force option?
>
> > Setting the checksum flag is done after all blocks have been finished,
> > so there is no problem there.
>
> There is also a problem if the db is started while checksums are being
> enabled.

What I mean is that interrupting pg_verify_checksums shouldn't leave pg_control in a state where starting the cluster no longer works without further interaction.

Bernd.
>>> But then you have to make sure the control flag gets cleared in case
>>> pg_verify_checksums crashes somehow or gets SIGKILL'ed ...
>>
>> The usual approach is a restart with some --force option?
>>
>>> Setting the checksum flag is done after all blocks have been finished,
>>> so there is no problem there.
>>
>> There is also a problem if the db is started while checksums are being
>> enabled.
>
> What I mean is that interrupting pg_verify_checksums shouldn't leave
> pg_control in a state where starting the cluster no longer works without
> further interaction.

Yep, I understood that, and I agree that a way out is needed, hence the --force option suggestion.

--
Fabien.
On Tue, Jan 08, 2019 at 01:03:25PM +0100, Michael Banck wrote:
> I changed that to the switches -c/--verify (-c for check as -v is taken,
> should it be --check as well? I personally like verify better),
> -d/--disable and -e/--enable.

Indeed we could use --check; "pg_checksums --check" looks repetitive, but it makes the short option more consistent with the rest.

+       printf(_("  -A, --action   action to take on the cluster, can be set as\n"));
+       printf(_("                 \"verify\", \"enable\" and \"disable\"\n"));

This is not reflected yet in the --help portion.

>> Also, the full page is rewritten... would it make sense to only overwrite
>> the checksum part itself?
>
> So just writing the page header? I find that a bit scary and don't
> expect much speedup as the OS would write the whole block anyway I
> guess? I haven't touched that yet.

The OS would write blocks of 4kB out of the 8kB, as that's the usual page size, no? So this could save a lot of I/O.

> I have mostly taken the pg_rewind code here; if there was a function
> that allowed for safe offline changes of the control file, I'd be happy
> to use it but I don't think it should be this patch to invent that.
>
> In any case, I have removed the unlink() now (not sure where that came
> from), and changed it to open(O_WRONLY) same as in Michael's code and
> pg_rewind.

My own stuff in pg_checksums.c does not have an unlink(), anyway...

I think that there is room for improvement for both pg_rewind and pg_checksums here. What about refactoring updateControlFile() and moving it to controldata_utils.c? This centralizes the CRC check, static assertions, file open and writes into a single place. The backend has a similar flavor with UpdateControlFile(). By combining both we need some extra "#ifdef FRONTEND" for BasicOpenFile and the wait events, which generates some noise, but still both share a lot. The backend also includes an fsync() for the control file which happens when the file is written, but pg_checksums and pg_rewind just do it in one go at the end, so we would need an extra flag to decide whether the fsync should happen or not. pg_rewind has partially the right interface by passing the ControlFileData contents as an argument.

> V2 attached.

+/* Filename components */
+#define PG_TEMP_FILES_DIR "pgsql_tmp"
+#define PG_TEMP_FILE_PREFIX "pgsql_tmp"

This may look strange, but these are needed because pg_checksums calls some of the sync-related routines which are defined in fd.c. Amen.

+       if (fsync(fd) != 0)
+       {
+               fprintf(stderr, _("%s: fsync error: %s\n"), progname, strerror(errno));
+               exit(1);
+       }

No need for that, as fsync_pgdata() gets called at the end.
--
Michael
Attachment
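To make the refactoring idea concrete, a shared routine in src/common/controldata_utils.c could look roughly like the sketch below. This is a simplified, frontend-flavored version only; the real function would still need the FRONTEND/backend split for BasicOpenFile and wait events mentioned above, and the name, signature and error messages here are assumptions rather than the actual patch.

    #include <fcntl.h>
    #include <unistd.h>
    /* plus postgres_fe.h, catalog/pg_control.h and access/xlog_internal.h for
     * ControlFileData, PG_CONTROL_FILE_SIZE and XLOG_CONTROL_FILE */

    void
    update_controlfile(const char *DataDir, ControlFileData *ControlFile,
                       bool do_sync)
    {
        int         fd;
        char        buffer[PG_CONTROL_FILE_SIZE];
        char        path[MAXPGPATH];

        /* recompute the CRC over everything that precedes the crc field */
        INIT_CRC32C(ControlFile->crc);
        COMP_CRC32C(ControlFile->crc, (char *) ControlFile,
                    offsetof(ControlFileData, crc));
        FIN_CRC32C(ControlFile->crc);

        /* pg_control is always written out as a full-size, zero-padded block */
        memset(buffer, 0, PG_CONTROL_FILE_SIZE);
        memcpy(buffer, ControlFile, sizeof(ControlFileData));

        snprintf(path, sizeof(path), "%s/%s", DataDir, XLOG_CONTROL_FILE);

        if ((fd = open(path, O_WRONLY | PG_BINARY, 0)) < 0 ||
            write(fd, buffer, PG_CONTROL_FILE_SIZE) != PG_CONTROL_FILE_SIZE)
        {
            fprintf(stderr, "could not write \"%s\": %s\n", path, strerror(errno));
            exit(EXIT_FAILURE);
        }

        /* callers that fsync the whole data directory later can pass false here */
        if (do_sync && fsync(fd) != 0)
        {
            fprintf(stderr, "could not fsync \"%s\": %s\n", path, strerror(errno));
            exit(EXIT_FAILURE);
        }

        close(fd);
    }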
> I changed that to the switches -c/--verify (-c for check as -v is taken, > should it be --check as well? I personally like verify better), > -d/--disable and -e/--enable. I agree that checking the checksum sounds repetitive, but I think that for consistency --check should be provided. About the patch: applies, compiles, global & local "make check" are ok. There is still no documentation. I think that there is a consensus about renaming the command. The --help string documents --action which does not exists anymore. The code in "updateControlFile" seems to allow to create the file (O_CREAT). I do not think that it should, it should only apply to an existing file. ISTM that some generalized version of this function should be in "src/common/controldata_utils.c" instead of duplicating it from command to command (as suggested by Michaël as well). In "scan_file" verbose output, ISTM that the checksum is more computed than enabled on the file. It is really enabled at the cluster level in the end. Maybe there could be only one open call with a ?: for RO vs RW. Non root check: as files are only manipulated RW, ISTM that there is no reason why the ownership would be changed, so I do not think that this constraint is useful. There is kind of a copy paste for enabling/disabling, I'd consider skipping the scan when not necessary and merge both branches. >> Also, the full page is rewritten... would it make sense to only overwrite >> the checksum part itself? > > So just writing the page header? I find that a bit scary and don't > expect much speedup as the OS would write the whole block anyway I > guess? I haven't touched that yet. Possibly the OS would write its block size, which is not necessary the same as postgres page size? >> It seems that the control file is unlinked and then rewritten. If the >> rewritting fails, or the command is interrupted, the user has a problem. >> >> Could the control file be simply opened RW? Else, I would suggest to >> rename (eg add .tmp), write the new one, then unlink the old one, so that >> recovering the old state in case of problem is possible. > > I have mostly taken the pg_rewind code here; if there was a function > that allowed for safe offline changes of the control file, I'd be happy > to use it but I don't think it should be this patch to invent that. It is reinventing it somehow by duplicating the stuff anyway. I'd suggest a separate preparatory patch to do the cleanup. -- Fabien.
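For what it's worth, the merged enable/disable flow suggested above could look roughly like this; the action names are taken loosely from the patch under discussion and update_controlfile() stands in for whatever helper ends up rewriting pg_control:

    /* Only checking and enabling need to walk the data directory at all */
    if (action == PG_ACTION_CHECK || action == PG_ACTION_ENABLE)
    {
        scan_directory(DataDir, "base");
        scan_directory(DataDir, "global");
        scan_directory(DataDir, "pg_tblspc");
    }

    /* Enabling and disabling share the control file update */
    if (action == PG_ACTION_ENABLE || action == PG_ACTION_DISABLE)
    {
        ControlFile->data_checksum_version =
            (action == PG_ACTION_ENABLE) ? PG_DATA_CHECKSUM_VERSION : 0;
        update_controlfile(DataDir, ControlFile);
    }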
Hi, On 2019-01-09 07:07:17 +0100, Fabien COELHO wrote: > There is still no documentation. Michael, are you planning to address this? It'd also be useful to state when you just don't agree with things / don't plan to address them. Given the docs piece hasn't been addressed, and seems uncontroversial, I'm marking this patch as returned with feedback. Please resubmit once ready. > > > Also, the full page is rewritten... would it make sense to only overwrite > > > the checksum part itself? > > > > So just writing the page header? I find that a bit scary and don't > > expect much speedup as the OS would write the whole block anyway I > > guess? I haven't touched that yet. > > Possibly the OS would write its block size, which is not necessary the same > as postgres page size? I think it'd be a bad idea to write more granular. Very commonly that'll turn a write operation into a read-modify-write (although caching will often prevent that from being a problem here), and it'll be bad for flash translation layers. Greetings, Andres Freund
Hi, sorry for letting this slack. First off, thanks for the review! Am Mittwoch, den 09.01.2019, 07:07 +0100 schrieb Fabien COELHO: > > I changed that to the switches -c/--verify (-c for check as -v is taken, > > should it be --check as well? I personally like verify better), > > -d/--disable and -e/--enable. > > I agree that checking the checksum sounds repetitive, but I think that for > consistency --check should be provided. Ok then. The enum is currently called PG_ACTION_VERIFY, I changed that to PG_ACTION_CHECK as well. > About the patch: applies, compiles, global & local "make check" are ok. > > There is still no documentation. I've added that now, though I did that blindly and have not checked the output yet. > I think that there is a consensus about renaming the command. I think so as well, but doing that right now will make the patch difficult to review, so I'd prefer to leave it to the committer to do that. I can submit a patch with the directory/file rename if that is preferred. > The --help string documents --action which does not exists anymore. Fixed that. > The code in "updateControlFile" seems to allow to create the file > (O_CREAT). I do not think that it should, it should only apply to an > existing file. Removed that. > ISTM that some generalized version of this function should be in > "src/common/controldata_utils.c" instead of duplicating it from command to > command (as suggested by Michaël as well). Haven't done that yet. > In "scan_file" verbose output, ISTM that the checksum is more computed > than enabled on the file. It is really enabled at the cluster level in the > end. It's certainly not just computed but also written. It's true that it will be only meaningful if the control file is updated accordingly at the end, but I don't think that message is very incorrect, so left it as-is for now. > Maybe there could be only one open call with a ?: for RO vs RW. Done that. > Non root check: as files are only manipulated RW, ISTM that there is no > reason why the ownership would be changed, so I do not think that this > constraint is useful. Now that we no longer unlink() pg_control, I believe you are right and I have removed it. ` > There is kind of a copy paste for enabling/disabling, I'd consider > skipping the scan when not necessary and merge both branches. Done so. > > > Also, the full page is rewritten... would it make sense to only overwrite > > > the checksum part itself? > > > > So just writing the page header? I find that a bit scary and don't > > expect much speedup as the OS would write the whole block anyway I > > guess? I haven't touched that yet. > > Possibly the OS would write its block size, which is not necessary the > same as postgres page size? I haven't changed that yet, I think Andres was also of the opinion that this is not necessary? > > > It seems that the control file is unlinked and then rewritten. If the > > > rewritting fails, or the command is interrupted, the user has a problem. > > > > > > Could the control file be simply opened RW? I've done that now. New patch attached. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment
On Sun, Feb 17, 2019 at 07:31:38PM +0100, Michael Banck wrote: > New patch attached. - * src/bin/pg_verify_checksums/pg_verify_checksums.c + * src/bin/pg_checksums/pg_checksums.c That's lacking a rename, or this comment is incorrect. +#if PG_VERSION_NUM >= 100000 + StaticAssertStmt(sizeof(ControlFileData) <= PG_CONTROL_MAX_SAFE_SIZE, + "pg_control is too large for atomic disk writes"); +#endif This is compiled with only one version of the control file data, so you don't need that. Any reason why we don't refactor updateControlFile() into controldata_utils.c? This duplicates the code, at the exception of some details. -- Michael
Attachment
Hi, Am Dienstag, den 19.02.2019, 14:02 +0900 schrieb Michael Paquier: > On Sun, Feb 17, 2019 at 07:31:38PM +0100, Michael Banck wrote: > > New patch attached. > > - * src/bin/pg_verify_checksums/pg_verify_checksums.c > + * src/bin/pg_checksums/pg_checksums.c > That's lacking a rename, or this comment is incorrect. Right, I started the rename, but then backed off pending further discussion whether I should submit that or whether the committer will just do it. I've backed those 4 in-file renames out for now. > +#if PG_VERSION_NUM >= 100000 > + StaticAssertStmt(sizeof(ControlFileData) <= PG_CONTROL_MAX_SAFE_SIZE, > + "pg_control is too large for atomic disk writes"); > +#endif > This is compiled with only one version of the control file data, so > you don't need that. Oops, yeah. > Any reason why we don't refactor updateControlFile() into > controldata_utils.c? This duplicates the code, at the exception of > some details. Ok, I've done that now, and migrated pg_rewind as well, do you know of any other programs that might benefit here? This could/should probably be committed separately beforehand. New patch attached. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment
Hallo Michael, >> - * src/bin/pg_verify_checksums/pg_verify_checksums.c >> + * src/bin/pg_checksums/pg_checksums.c >> That's lacking a rename, or this comment is incorrect. > > Right, I started the rename, but then backed off pending further > discussion whether I should submit that or whether the committer will > just do it. ISTM that there is an all-clear for renaming. The renaming implies quite a few changes (eg in the documentation, makefiles, tests) which warrants a review, so it should be a patch. Also, ISTM that the renaming only makes sense when adding the enable/disable feature, so I'd say that it belongs to this patch. Opinions? About v4: Patch applies cleanly, compiles, global & local "make check" are ok. Doc: "enable, disable or verifies" -> "checks, enables or disables"? Spurious space: "> ." -> ">.". As checksums are now checked, the doc could use "check" instead of "verify", especially if there is a rename and the "verify" word disappears. I'd be less terse when documenting actions, eg: "Disable checksums" -> "Disable checksums on cluster." Doc should state that checking is the default action, eg "Check checksums on cluster. This is the default action." Help string could say that -c is the default action. There is a spurious help line remaining from the previous "--action" implementation. open: I'm positively unsure about ?: priority over |, and probably not the only one, so I'd add parentheses around the former. I'm at odds with the "postmaster.pid" check, which would not prevent an issue if a cluster is started with "postmaster". I still think that the enabling-in-progress should be stored in the cluster state. ISTM that the cluster read/update cycle should somehow lock the control file being modified. However other commands do not seem to do anything about it. I do not think that enabling if already enabled or disabling if already disabled should exit(1), I think it is a no-op and should simply exit(0). About tests: I'd run a check on a disabled cluster to check that the command fails because checksums are disabled. -- Fabien.
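To spell out the precedence worry: '|' binds tighter than '?:' in C. Assuming the open call has roughly the shape below (do_write stands for the enable-mode test), the unparenthesized and parenthesized versions do not mean the same thing:

    /* Without parentheses, PG_BINARY only lands in the read-only branch: */
    fd = open(fn, do_write ? O_RDWR : O_RDONLY | PG_BINARY, 0);
    /* parsed as:  do_write ? O_RDWR : (O_RDONLY | PG_BINARY)              */

    /* With parentheses, both branches get PG_BINARY, which is presumably wanted: */
    fd = open(fn, (do_write ? O_RDWR : O_RDONLY) | PG_BINARY, 0);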
On Wed, Feb 27, 2019 at 07:59:31AM +0100, Fabien COELHO wrote: > The renaming implies quite a few changes (eg in the documentation, > makefiles, tests) which warrants a review, so it should be a patch. Also, > ISTM that the renaming only make sense when adding the enable/disable > feature, so I'd say that it belongs to this patch. Opinions? I would think that the rename should happen first, but it is possible to make git diffs less noisy as well for files copied, so merging things is technically doable. > About tests: I'd run a check on a disabled cluster to check that the command > fails because disabled. While I look at that... If you could split the refactoring into a separate, first, patch as well.. -- Michael
Attachment
On Wed, Feb 27, 2019 at 07:59:31AM +0100, Fabien COELHO wrote: > Hallo Michael, Okay, let's move on with these patches! > The renaming implies quite a few changes (eg in the documentation, > makefiles, tests) which warrants a review, so it should be a patch. Also, > ISTM that the renaming only make sense when adding the enable/disable > feature, so I'd say that it belongs to this patch. Opinions? I have worked on the last v4 sent by Michael B, finishing with the attached after review and addressed the last points raised by Fabien. The thing is that I have been rather unhappy with a couple of things in what was proposed, so I have finished by modifying quite a couple of areas. The patch set is now splitted as I think is suited for commit, with the refactoring and renaming being separated from the actual feature: - 0001 if a patch to refactor the routine for the control file update. I have made it backend-aware, and we ought to be careful with error handling, use of fds and such, something that v4 was not very careful about. - 0002 renames pg_verify_checksums to pg_checksums with a straight-forward switch. Docs as well as all references to pg_verify_checksums are updated. - 0003 adds the new options --check, --enable and --disable, with --check being the default as discussed. - 0004 adds a -N/--no-sync which I think is nice for consistency with other tools. That's also useful for the tests, and was not discussed until now on this thread. > Patch applies cleanly, compiles, global & local "make check" are ok. > > Doc: "enable, disable or verifies" -> "checks, enables or disables"? > Spurious space: "> ." -> ">.". > > As checksum are now checked, the doc could use "check" instead of "verify", > especially if there is a rename and the "verify" word disappears. That makes sense. I have fixed these, and simplified the docs a bit to have a more simple page. > I'd be less terse when documenting actions, eg: "Disable checksums" -> > "Disable checksums on cluster." The former is fine in my opinion. > Doc should state that checking is the default action, eg "Check checksums on > cluster. This is the default action." Check. > Help string could say that -c is the default action. There is a spurious > help line remaining from the previous "--action" implementation. Yeah, we should. Added. > open: I'm positively unsure about ?: priority over |, and probably not the > only one, so I'd add parentheses around the former. Yes, I agree that the current code is hard to decrypt. So reworked with a separate variable in scan_file, and added extra parenthesis in the part which updates the control file. > I'm at odds with the "postmaster.pid" check, which would not prevent an > issue if a cluster is started with "postmaster". I still think that the > enabling-in-progress should be stored in the cluster state. > > ISTM that the cluster read/update cycle should lock somehow the control file > being modified. However other commands do not seem to do something about it. I am still not on board for adding more complexity in this area, at least not for this stuff and for the purpose of this thread, because this can happen at various degrees for various configurations for ages and not only for checksums. Also, I think that we still have to see users complain about that. Here are some scenarios where this can happen: - A base backup partially written. pg_basebackup limits this risk but it could still be possible to see a case where partially-written data folder. And base backups are around for many years. 
- pg_rewind, and the tool is in the tree since 9.5, the tool is actually available on github since 9.3. > I do not think that enabling if already enabled or disabling or already > disable should exit(1), I think it is a no-op and should simply exit(0). We already issue exit(1) when attempting to verify checksums on a cluster where they are disabled. So I agree with Michael B's point of Issuing an error in such cases. I think also that this makes handling for operators easier. > About tests: I'd run a check on a disabled cluster to check that the command > fails because disabled. Makes sense. Added. We need a test also for the case of successive runs with --enable. Here are also some notes from my side. - There was no need to complicate the synopsis of the docs. - usage() included still references to --action and indentation was a bit too long at the top. - There were no tests for disabling checksums, so I have added some. - We should check that the combination of --enable and -r fails. - Tests use only long options, that's better for readability. - Improved comments in tests. - Better to check for "data_checksum_version > 0" if --enable is used. That's more portable long-term if more checksum versions are added. - The check on postmaster.pid is actually not necessary as we already know that the cluster has been shutdown cleanly per the state of the control file. I agree that there could be a small race condition here, and we could discuss that in a different thread if need be as such things could be improved for other frontend tools as well. For now I am taking the most simple approach. (Still need to indent the patches before commit, but that's a nit.) -- Michael
Attachment
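As an illustration of the "data_checksum_version > 0" point, the guard for --enable would presumably end up looking like this (a sketch, not the literal patch text):

    if (action == PG_ACTION_ENABLE &&
        ControlFile->data_checksum_version > 0)
    {
        fprintf(stderr, _("%s: data checksums are already enabled in cluster\n"),
                progname);
        exit(1);
    }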
Hi Michael, Am Montag, den 11.03.2019, 13:53 +0900 schrieb Michael Paquier: > On Wed, Feb 27, 2019 at 07:59:31AM +0100, Fabien COELHO wrote: > > Hallo Michael, > > Okay, let's move on with these patches! Wow cool. I was going to go back to these and split them up similar to how you did it now that the online verification patch seems to be done/stuck for v12, but great that you beat me to it. I had a quick look over the patch and your changes and it LGTM. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Hi, Am Montag, den 11.03.2019, 11:11 +0100 schrieb Michael Banck: > I had a quick look over the patch and your changes and it LGTM. One thing: you (Michael) should be co-author for patch #3 as I took some of your code from https://github.com/michaelpq/pg_plugins/tree/master/pg _checksums Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: not tested Documentation: tested, passed Hello I reviewed the latest patchset. I have one big question: Is pg_checksums safe for cross-version operations? Even with the update_controlfile call? Currently I am able to enable checksums in a pg11 cluster with pg_checksums compiled on HEAD. Is this expected? I didn't notice any version-specific check in the code. And a few small notes: > <command>pg_checksums</command> checks, enables or disables data checksums maybe <application> is better than <command>? > + printf(_("%s enables, disables or verifies data checksums in a PostgreSQL\n"), progname); > + printf(_("database cluster.\n\n")); I doubt this is good line formatting for translation purposes. > + printf(_(" -c, --check check data checksums. This is the default\n")); > + printf(_(" mode if nothing is specified.\n")); same. For example pg_basebackup uses a different multiline style: > printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n" > " (in kB/s, or use suffix \"k\" or \"M\")\n")); > +command_fails(['pg_checksums', '--enable', '-r', '1234', '-D', $pgdata], > + "fails when relfilnodes are requested and action is --disable"); the action is "--enable" here ;-) > if (badblocks > 0) > return 1; Small question: why return 1 instead of exit(1)? > <refentry id="pgchecksums"> > <indexterm zone="pgchecksums"> How about using "app-pgchecksums", similar to other applications? regards, Sergei
On Mon, Mar 11, 2019 at 02:11:11PM +0000, Sergei Kornilov wrote: > I review latest patchset. Thanks, I have committed the refactoring of src/common/ as a first step. > I have one big question: Is pg_checksums > safe for cross-versions operations? Even with update_controlfile > call? Currently i am able to enable checksums in pg11 cluster with > pg_checksums compiled on HEAD. Is this expected? I didn't notice any > version-specific check in code. This depends on the version of the control file, and it happens that we don't check for it, so that's a good catch from your side. Not doing the check is a bad idea as ControlFileData should be compatible between the binary and the data read. I am attaching a fresh 0001 which should be back-patched down to v11 as a bug fix. An advantage of that, which is similar to pg_rewind, is that if the control file version does not change in a major version, then the tool can be used. And the data folder layer is unlikely going to change.. >> <command>pg_checksums</command> checks, enables or disables data checksums > > maybe better is <application>, not <command>? Fixed, as part of the renaming patch. >> + printf(_("%s enables, disables or verifies data checksums in a PostgreSQL\n"), progname); >> + printf(_("database cluster.\n\n")); > > I doubt this is good line formatting for translation purposes. > >> + printf(_(" -c, --check check data checksums. This is the default\n")); >> + printf(_(" mode if nothing is specified.\n")); > > same. For example pg_basebackup uses different multiline style: Oh, good points. I forgot about that point of view. >> +command_fails(['pg_checksums', '--enable', '-r', '1234', '-D', $pgdata], >> + "fails when relfilnodes are requested and action is --disable"); > > action is "--enable" here ;-) s/relfilnodes/relfilenodes/ while on it. >> if (badblocks > 0)gi >> return 1; > > Small question: why return 1 instead of exit(1)? OK, let's fix that on the way as part of the renaming. >> <refentry id="pgchecksums"> >> <indexterm zone="pgchecksums"> > > How about use "app-pgchecksums" similar to other applications? Yes, I was wondering about that one when doing the renaming, but did not bother much for consistency. Anyway switched, you are right. Attached is an updated patch set, minus the refactoring for src/common/. -- Michael
Attachment
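The version check being added boils down to comparing the version number stored in pg_control with the one the binary was compiled against, roughly as below (controlpath is an illustrative variable name, not taken from the patch):

    if (ControlFile->pg_control_version != PG_CONTROL_VERSION)
    {
        fprintf(stderr, _("%s: pg_control version number in file \"%s\" is %u, but this program expects %u\n"),
                progname, controlpath,
                ControlFile->pg_control_version, PG_CONTROL_VERSION);
        exit(1);
    }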
On Mon, Mar 11, 2019 at 11:19:50AM +0100, Michael Banck wrote: > One thing: you (Michael) should be co-author for patch #3 as I took some > of your code from https://github.com/michaelpq/pg_plugins/tree/master/pg_checksums OK, thanks for the notice. I was not sure as we actually developed the same fork. -- Michael
Attachment
Hi > Not doing the check is a bad idea as ControlFileData should be compatible > between the binary and the data read. I am attaching a fresh 0001 > which should be back-patched down to v11 as a bug fix. Looks fine. Do we need to add a few words to the documentation? >>> if (badblocks > 0) >>> return 1; >> >> Small question: why return 1 instead of exit(1)? > > OK, let's fix that on the way as part of the renaming. Was it not changed?.. I have no new notes after reading the updated patchset. regards, Sergei
Hi, Am Montag, den 11.03.2019, 14:11 +0000 schrieb Sergei Kornilov: > > if (badblocks > 0) > > return 1; > > Small question: why return 1 instead of exit(1)? I have a feeling it is project policy to return 0 from main(), and exit(1) if a program aborts with an error. In the above case, the program finishes more-or-less as intended (no abort), but due to errors found on the way, does not return with 0. I don't mind either way and probably exit(1) makes more sense, but I wanted to explain why it is like that. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Hello Thank you for explaining. I thought so. PS: I am not sure for now about the patch status in the CF app. I did not change the status. regards, Sergei
On Tue, Mar 12, 2019 at 11:13:46AM +0100, Michael Banck wrote: > I have a feeling it is project policy to return 0 from main(), and > exit(1) if a program aborts with an error. Yes, it does not matter much in practice, but other tools just don't do that. Note that changing it can be actually annoying for a backpatch if we don't have the --enable/--disable part, because git is actually smart enough to detect the file renaming across branches as far as I tried, but as we are refactoring this code anyway for --enable and --disable let's just do it, that's cleaner. -- Michael
Attachment
Bonjour Michaël, Here is a partial review: > - 0001 if a patch to refactor the routine for the control file > update. I have made it backend-aware, and we ought to be careful with > error handling, use of fds and such, something that v4 was not very > careful about. This refactoring patch is ok for me: applies, compiles, check is ok. However, Am I right in thinking that the change should propagate to other tools which manipulate the control file, eg pg_resetwal, postmaster… So that there would be only one shared API to update the control file? > - 0002 renames pg_verify_checksums to pg_checksums with a > straight-forward switch. Docs as well as all references to > pg_verify_checksums are updated. Looks ok to me. Applies, compiles, checks are ok. Doc build is ok. I'm wondering whether there should be something done so that the inter-release documentation navigation works? Should the section keep the former name? Is it managed by hand somewhere else? Maybe it would require to keep the refsect1 id, or to duplicate it, not sure. In "doc/src/sgml/ref/allfiles.sgml" there seems to be a habit to align on the SYSTEM keyword, which is not fellowed by the patch. > - 0003 adds the new options --check, --enable and --disable, with > --check being the default as discussed. Looks like the patch I already reviewed, but I'll look at it in details later. "If enabling or disabling checksums, the exit status is nonzero if the operation failed." However: + if (ControlFile->data_checksum_version == 0 && + action == PG_ACTION_DISABLE) + { + fprintf(stderr, _("%s: data checksums are already disabled in cluster.\n"), progname); + exit(1); + } This seem contradictory to me: you want to disable checksum, and they are already disabled, so nothing is needed. How does that qualifies as a "failed" operation? Further review will come later. > - 0004 adds a -N/--no-sync which I think is nice for consistency with > other tools. That's also useful for the tests, and was not discussed > until now on this thread. Indeed. I do not immediately see the use case where no syncing would be a good idea. I can see why it would be a bad idea. So I'm not sure of the concept. -- Fabien.
On Tue, Mar 12, 2019 at 10:08:19PM +0100, Fabien COELHO wrote: > This refactoring patch is ok for me: applies, compiles, check is ok. > However, Am I right in thinking that the change should propagate to other > tools which manipulate the control file, eg pg_resetwal, postmaster… So that > there would be only one shared API to update the control file? Yes, that would be nice, for now I have focused. For pg_resetwal yes we could do it easily. Would you like to send a patch? > I'm wondering whether there should be something done so that the > inter-release documentation navigation works? Should the section keep the > former name? Is it managed by hand somewhere else? Maybe it would require to > keep the refsect1 id, or to duplicate it, not sure. When it came to the renaming of pg_receivexlog to pg_receivewal, we did not actually do anything in the core code, and let the magic happen on pgsql-www. I have also pinged pgsql-packagers about the renaming and it is not really an issue on their side. So I have committed the renaming to pg_checksums as well. So now remains only the new options. > In "doc/src/sgml/ref/allfiles.sgml" there seems to be a habit to align on > the SYSTEM keyword, which is not fellowed by the patch. Sure. I sent an updated patch to actually fix that, and also address a couple of other side things I noticed on the way like the top refentry in the docs or the header format at the top of pg_checksums.c as we are on tweaking the area. > This seem contradictory to me: you want to disable checksum, and they are > already disabled, so nothing is needed. How does that qualifies as a > "failed" operation? If the operation is automated, then a proper reaction can be done if multiple attempts are done. Of course, I am fine to tune things one way or the other depending on the opinion of the crowd here. From the opinions gathered, I can see that (Michael * 2) prefer failing with exit(1), while (Fabien * 1) would like to just do exit(0). > Further review will come later. Thanks, Fabien! > Indeed. I do not immediately see the use case where no syncing would be a > good idea. I can see why it would be a bad idea. So I'm not sure of the > concept. To leverage the buildfarm effort I think this one is worth it. Or we finish to fsync the data folder a couple of times, which would make the small-ish buildfarm machines suffer more than they need. I am going to send a rebased patch set of the remaining things at the top of the discussion as well. -- Michael
Attachment
On Tue, Mar 12, 2019 at 09:44:03PM +0900, Michael Paquier wrote: > Yes, it does not matter much in practice, but other tools just don't > do that. Note that changing it can be actually annoying for a > backpatch if we don't have the --enable/--disable part, because git is > actually smart enough to detect the file renaming across branches as > far as I tried, but as we are refactoring this code anyway for > --enable and --disable let's just do it, that's cleaner. Okay, please find attached a rebased patch set. I have committed 0001 which adds version checks for the control file, and the renaming part 0002. What remains now is the addition of the new options, and --no-sync. The "return 1" stuff has been switched to exit(1) while on it, and is part of 0003. Now the set of patches is: - 0001, add --enable and --disable. I have tweaked a bit the patch so as "action" is replaced by "mode" which is more consistent with other tools like pg_ctl. pg_indent was also complaining about one of the new enum structures. - 0002, add --no-sync. Thanks, -- Michael
Attachment
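The action-to-mode rename presumably leaves the tool with an enum along these lines (a sketch of the shape, not a quote of the committed code):

    typedef enum
    {
        PG_MODE_CHECK,
        PG_MODE_DISABLE,
        PG_MODE_ENABLE
    } PgChecksumMode;

    static PgChecksumMode mode = PG_MODE_CHECK;    /* --check is the default */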
Bonjour Michaël, > Yes, that would be nice, for now I have focused. For pg_resetwal yes > we could do it easily. Would you like to send a patch? I probably can do that before next Monday. I'll prioritize reviewing the latest instance of this patch, though. >> This seem contradictory to me: you want to disable checksum, and they are >> already disabled, so nothing is needed. How does that qualifies as a >> "failed" operation? > > If the operation is automated, then a proper reaction can be done if > multiple attempts are done. Of course, I am fine to tune things one > way or the other depending on the opinion of the crowd here. From the > opinions gathered, I can see that (Michael * 2) prefer failing with > exit(1), while (Fabien * 1) would like to just do exit(0). Yep, that sums it up:-). >> Indeed. I do not immediately see the use case where no syncing would be a >> good idea. I can see why it would be a bad idea. So I'm not sure of the >> concept. > > To leverage the buildfarm effort I think this one is worth it. Or we > finish to fsync the data folder a couple of times, which would make > the small-ish buildfarm machines suffer more than they need. Ok for the particular use-case, provided that the documentation is very clear about the risks, which is the case, so fine with me wrt to the feature. -- Fabien.
On Wed, Mar 13, 2019 at 07:18:32AM +0100, Fabien COELHO wrote: > I probably can do that before next Monday. I'll prioritize reviewing the > latest instance of this patch, though. Thanks. The core code of the feature has not really changed with the last reviews, except for the tweaks in the variable names and I think that it's in a rather committable shape. -- Michael
Attachment
Michaël-san, > Now the set of patches is: > - 0001, add --enable and --disable. I have tweaked a bit the patch so > as "action" is replaced by "mode" which is more consistent with other > tools like pg_ctl. pg_indent was also complaining about one of the > new enum structures. Patch applies cleanly, compiles, various make check ok, doc build ok. I'm still at odds with the exit(1) behavior when there is nothing to do. If this behavior is kept, I think that the documentation needs to be improved because "failed" does not describe a no-op-was-needed to me. """ If enabling or disabling checksums, the exit status is nonzero if the operation failed. """ Maybe: "... if the operation failed or the requested setting is already active." would at least describe clearly the implemented behavior. + printf(_(" -c, --check check data checksums\n")); + printf(_(" This is the default mode if nothing is specified.\n")); I'm not sure of the punctuation logic on the help line: the first sentence does not end with a ".". I could not find an instance of this style in other help on pg commands. I'd suggest "check data checksums (default)" would work around and be more in line with other commands help. I see a significant locking issue, which I discussed on other threads without convincing anyone. I could do the following things: I slowed down pg_checksums by adding a 0.1s sleep when scanning a new file, then started a "pg_checksums --enable" on a stopped cluster, then started the cluster while the enabling was in progress, then connected and updated data. Hmmm. Then I stopped while the slow enabling was still in progress. Then I could also run a fast pg_checksums --enable in parallel, overtaking the first one... then when the fast one finished, I started the cluster again. When the slow one finished, it overwrote the control file, I had a running cluster with a control file which did not say so, so I could disable the checksum. Hmmm again. Ok, I could not generate a inconsistent state because on stopping the cluster the cluster file is overwritten with the initial state from the point of view of postmater, but it does not look good. I do not think it is a good thing that two commands can write to the data directory at the same time, really. About fsync-ing: ISTM that it is possible that the control file is written to disk while data are still not written, so a failure in between would leave the cluster with an inconsistent state. I think that it should fsync the data *then* update the control file and fsync again on that one. > - 0002, add --no-sync. Patch applies cleanly, compiles, various make checks are ok, doc build ok. Fine with me. -- Fabien.
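Written out, the ordering argued for above would be as follows; the helper names and signatures are taken loosely from the patch set and are not guaranteed to match it exactly:

    /* 1. flush every data file that was rewritten */
    fsync_pgdata(DataDir, progname, PG_VERSION_NUM);

    /* 2. only then flip the flag and overwrite pg_control ... */
    ControlFile->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
    update_controlfile(DataDir, progname, ControlFile, true);

    /* 3. ... with the final argument (do_sync) set to true, so that pg_control
     *    itself is durable before the tool reports success */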
On Wed, Mar 13, 2019 at 10:08:33AM +0100, Fabien COELHO wrote: > I'm not sure of the punctuation logic on the help line: the first sentence > does not end with a ".". I could not find an instance of this style in other > help on pg commands. I'd suggest "check data checksums (default)" would work > around and be more in line with other commands help. Good idea, let's do that. > I slowed down pg_checksums by adding a 0.1s sleep when scanning a new file, > then started a "pg_checksums --enable" on a stopped cluster, then started > the cluster while the enabling was in progress, then connected and updated > data. Well, yes, don't do that. You can get into the same class of problems while running pg_rewind, pg_basebackup or even pg_resetwal once the initial control file check is done for each one of these tools. > I do not think it is a good thing that two commands can write to the data > directory at the same time, really. We don't prevent either a pg_resetwal and a pg_basebackup to run in parallel. That would be... Interesting. > About fsync-ing: ISTM that it is possible that the control file is written > to disk while data are still not written, so a failure in between would > leave the cluster with an inconsistent state. I think that it should fsync > the data *then* update the control file and fsync again on that one. If --disable is used, the control file gets updated at the end without doing anything else. If the host crashes, it could be possible that the control file has checksums enabled or disabled. If the state is disabled, then well it succeeded. If the state is enabled, then the control file is still correct, because all the other blocks still have checksums set. if --enable is used, we fsync the whole data directory after writing all the blocks and updating the control file at the end. The case you are referring to here is in fsync_pgdata(), not pg_checksums actually, because you could reach the same state after a simple initdb. It could be possible to reach a state where the control file has checksums enabled and some blocks are not correctly synced, still you would notice rather quickly if the server is in an incorrect state at the follow-up startup. -- Michael
Attachment
>> I do not think it is a good thing that two commands can write to the data >> directory at the same time, really. > > We don't prevent either a pg_resetwal and a pg_basebackup to run in > parallel. That would be... Interesting. Yep, I'm trying again to suggest that this kind of thing should be prevented. It seems that I'm pretty unconvincing. >> About fsync-ing: ISTM that it is possible that the control file is written >> to disk while data are still not written, so a failure in between would >> leave the cluster with an inconsistent state. I think that it should fsync >> the data *then* update the control file and fsync again on that one. > > if --enable is used, we fsync the whole data directory after writing > all the blocks and updating the control file at the end. [...] > It could be possible to reach a state where the control file has > checksums enabled and some blocks are not correctly synced, still you > would notice rather quickly if the server is in an incorrect state at > the follow-up startup. Yep. That is the issue I think is preventable by fsyncing updated data *then* writing & syncing the control file, and that should be done by pg_checksums. -- Fabien.
On Wed, Mar 13, 2019 at 10:44:03AM +0100, Fabien COELHO wrote: > Yep. That is the issue I think is preventable by fsyncing updated data > *then* writing & syncing the control file, and that should be done by > pg_checksums. Well, pg_rewind works similarly: control file gets updated and then the whole data directory gets flushed. In my opinion, the take here is that we log something after the sync of the whole data folder is done, so as in the event of a crash an operator can make sure that everything has happened. -- Michael
Attachment
Am Mittwoch, den 13.03.2019, 18:31 +0900 schrieb Michael Paquier: > On Wed, Mar 13, 2019 at 10:08:33AM +0100, Fabien COELHO wrote: > > I'm not sure of the punctuation logic on the help line: the first sentence > > does not end with a ".". I could not find an instance of this style in other > > help on pg commands. I'd suggest "check data checksums (default)" would work > > around and be more in line with other commands help. > > Good idea, let's do that. > > > I slowed down pg_checksums by adding a 0.1s sleep when scanning a new file, > > then started a "pg_checksums --enable" on a stopped cluster, then started > > the cluster while the enabling was in progress, then connected and updated > > data. > > Well, yes, don't do that. You can get into the same class of problems > while running pg_rewind, pg_basebackup or even pg_resetwal once the > initial control file check is done for each one of these tools. > > > I do not think it is a good thing that two commands can write to the data > > directory at the same time, really. > > We don't prevent either a pg_resetwal and a pg_basebackup to run in > parallel. That would be... Interesting. But does pg_basebackup actually change the primary's data directory? I don't think so, so that does not seem to be a problem. pg_rewind and pg_resetwal are (TTBOMK) pretty quick operations, while pg_checksums can potentially run for hours, so I see the point of taking extra care here. On the other hand, two pg_checksums running in parallel also seem not much of a problem as the cluster is offline anyway. What is much more of a footgun is one DBA starting pg_checksums --enable on a 1TB cluster, then going for lunch, and then the other DBA wondering why the DB is down and starting the instance again. We read the control file on pg_checksums' startup, so once pg_checksums finishs it'll write the old checkpoint LSN into pg_control (along with the updated checksum version). This is pilot error, but I think we should try to guard against it. I propose we re-read the control file for the enable case after we finished operating on all files and (i) check the instance is still offline and (ii) update the checksums version from there. That should be a small but worthwhile change that could be done anyway. Another option would be to add a new feature which reliably blocks an instance from starting up due to maintenance - either a new control file field, some message in postmaster.pid (like "pg_checksums maintenance in progress") that would prevent pg_ctl or postgres/postmaster from starting up like 'FATAL: bogus data in lock file "postmaster.pid": "pg_checksums in progress' or some other trigger file. > > About fsync-ing: ISTM that it is possible that the control file is written > > to disk while data are still not written, so a failure in between would > > leave the cluster with an inconsistent state. I think that it should fsync > > the data *then* update the control file and fsync again on that one. > > if --enable is used, we fsync the whole data directory after writing > all the blocks and updating the control file at the end. The case you > are referring to here is in fsync_pgdata(), not pg_checksums actually, > because you could reach the same state after a simple initdb. But in the initdb case you don't have any valuable data in the instance yet. 
> It > could be possible to reach a state where the control file has > checksums enabled and some blocks are not correctly synced, still you > would notice rather quickly if the server is in an incorrect state at > the follow-up startup. Would you? I think I'm with Fabien on this one and it seems worthwhile to run fsync_pgdata() before and after update_controlfile() - the second one should be really quick anyway. Also, I suggest to maybe add a notice in verbose mode that we are syncing the data directory - otherwise the user might wonder what's going on at 100% done, though I haven't seen a large delay in my tests so far. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
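A sketch of the proposed guard, assuming the existing get_controlfile() helper and the DBState values from pg_control.h (the exact message and variable names are made up):

    bool        crc_ok;
    ControlFileData *refreshed;

    /* re-read pg_control once all blocks have been rewritten */
    refreshed = get_controlfile(DataDir, progname, &crc_ok);
    if (!crc_ok)
    {
        fprintf(stderr, _("%s: pg_control CRC value is incorrect\n"), progname);
        exit(1);
    }

    if (refreshed->state != DB_SHUTDOWNED &&
        refreshed->state != DB_SHUTDOWNED_IN_RECOVERY)
    {
        fprintf(stderr, _("%s: cluster was started while checksums were being enabled, aborting\n"),
                progname);
        exit(1);
    }

    /* flip the flag on the freshly read copy, not on the one read at startup;
     * comparing the checkpoint timestamps of the two copies would additionally
     * catch a start-and-stop cycle in between */
    refreshed->data_checksum_version = PG_DATA_CHECKSUM_VERSION;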
Hello, >> Yep. That is the issue I think is preventable by fsyncing updated data >> *then* writing & syncing the control file, and that should be done by >> pg_checksums. > > Well, pg_rewind works similarly: control file gets updated and then > the whole data directory gets flushed. So it is basically prone to the same potential issue? > In my opinion, the take here is that we log something after the sync of > the whole data folder is done, so as in the event of a crash an operator > can make sure that everything has happened. I do not understand. I'm basically only suggesting to reorder 3 lines and add an fsync so that this potential problem goes away, see attached poc (which does not compile because pg_fsync is in the backend only, however it works with fsync but on linux, I'm unsure of the portability, probably pg_fsync should be moved to port or something). -- Fabien.
Attachment
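The "simple fsync on the control file only" variant can be done with plain POSIX calls from the frontend, since pg_fsync() is backend-only, along these lines (controlpath is illustrative):

    int         fd;

    fd = open(controlpath, O_WRONLY | PG_BINARY, 0);
    if (fd < 0 || fsync(fd) != 0)
    {
        fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
                progname, controlpath, strerror(errno));
        exit(1);
    }
    close(fd);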
On Wed, Mar 13, 2019 at 11:41 AM Michael Banck <michael.banck@credativ.de> wrote:
Am Mittwoch, den 13.03.2019, 18:31 +0900 schrieb Michael Paquier:
> On Wed, Mar 13, 2019 at 10:08:33AM +0100, Fabien COELHO wrote:
> > I'm not sure of the punctuation logic on the help line: the first sentence
> > does not end with a ".". I could not find an instance of this style in other
> > help on pg commands. I'd suggest "check data checksums (default)" would work
> > around and be more in line with other commands help.
>
> Good idea, let's do that.
>
> > I slowed down pg_checksums by adding a 0.1s sleep when scanning a new file,
> > then started a "pg_checksums --enable" on a stopped cluster, then started
> > the cluster while the enabling was in progress, then connected and updated
> > data.
>
> Well, yes, don't do that. You can get into the same class of problems
> while running pg_rewind, pg_basebackup or even pg_resetwal once the
> initial control file check is done for each one of these tools.
>
> > I do not think it is a good thing that two commands can write to the data
> > directory at the same time, really.
>
> We don't prevent either a pg_resetwal and a pg_basebackup to run in
> parallel. That would be... Interesting.
But does pg_basebackup actually change the primary's data directory? I
don't think so, so that does not seem to be a problem.
pg_rewind and pg_resetwal are (TTBOMK) pretty quick operations, while
pg_checksums can potentially run for hours, so I see the point of taking
extra care here.
On the other hand, two pg_checksums running in parallel also seem not
much of a problem as the cluster is offline anyway.
What is much more of a footgun is one DBA starting pg_checksums --enable
on a 1TB cluster, then going for lunch, and then the other DBA wondering
why the DB is down and starting the instance again.
We read the control file on pg_checksums' startup, so once pg_checksums
finishs it'll write the old checkpoint LSN into pg_control (along with
the updated checksum version). This is pilot error, but I think we
should try to guard against it.
I propose we re-read the control file for the enable case after we
finished operating on all files and (i) check the instance is still
offline and (ii) update the checksums version from there. That should be
a small but worthwhile change that could be done anyway.
In (i) you need to also check that it's not offline *again*. Somebody could start *and* stop the database while pg_checksums is running. But checking the time field should hopefully be enough for that?
Another option would be to add a new feature which reliably blocks an
instance from starting up due to maintenance - either a new control file
field, some message in postmaster.pid (like "pg_checksums maintenance in
progress") that would prevent pg_ctl or postgres/postmaster from
starting up like 'FATAL: bogus data in lock file "postmaster.pid":
"pg_checksums in progress' or some other trigger file.
Instead of overloading yet another thing on postmaster.pid, it might be better to just have a separate file that if it exists, blocks startup with a message defined as the content of that file?
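Purely as an illustration of that idea (the file name and behaviour below are invented, this is not an existing mechanism): the postmaster could refuse to start whenever such a marker file is present and report its content.

    char        path[MAXPGPATH];
    char        reason[1024];
    FILE       *f;

    snprintf(path, sizeof(path), "%s/%s", DataDir, "maintenance_in_progress");
    if ((f = fopen(path, "r")) != NULL)
    {
        if (fgets(reason, sizeof(reason), f) == NULL)
            reason[0] = '\0';
        fclose(f);
        ereport(FATAL,
                (errmsg("data directory is locked for maintenance: %s", reason)));
    }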
> It
> could be possible to reach a state where the control file has
> checksums enabled and some blocks are not correctly synced, still you
> would notice rather quickly if the server is in an incorrect state at
> the follow-up startup.
Would you? I think I'm with Fabien on this one and it seems worthwhile
to run fsync_pgdata() before and after update_controlfile() - the second
one should be really quick anyway.
Also, I suggest to maybe add a notice in verbose mode that we are
syncing the data directory - otherwise the user might wonder what's
going on at 100% done, though I haven't seen a large delay in my tests
so far.
Seems like a good idea -- there certainly could be a substantial delay there depending on data size and underlying storage.
Hi One new question from me: how about replication? Case: primary+replica, we shut down the primary and enable checksums, and the replica "started streaming WAL from primary" without any issue. I have a master with checksums, but a replica without. Or a cluster with checksums, then disable checksums on the primary, but the standby thinks we have checksums. Also, do we support ./configure --with-blocksize= (not equal to 8)? make check on HEAD fails for me. If we support this, I think we need to recheck BLCKSZ between the compiled pg_checksums and the one used in PGDATA. regards, Sergei
Hallo Michael, > I propose we re-read the control file for the enable case after we > finished operating on all files and (i) check the instance is still > offline and (ii) update the checksums version from there. That should be > a small but worthwhile change that could be done anyway. That looks like a simple but mostly effective guard. > Another option would be to add a new feature which reliably blocks an > instance from starting up due to maintenance - either a new control file > field, some message in postmaster.pid (like "pg_checksums maintenance in > progress") that would prevent pg_ctl or postgres/postmaster from > starting up like 'FATAL: bogus data in lock file "postmaster.pid": > "pg_checksums in progress' or some other trigger file. I think that a clear cluster-locking can-be-overridden-if-needed shared-between-commands mechanism would be a good thing (tm), although it requires some work. My initial suggestion was to update the control file with an appropriate state, eg some general "admin command in progress", but I understood that it was rejected, and for another of your patches it seems that the "postmaster.pid" file is the right approach. Fine with me, the point is that it should be effective and consistent across all relevant commands. A good point about the "postmaster.pid" trick, when it does not contain the postmaster pid, is that overriding is as simple as "rm postmaster.pid". >> It could be possible to reach a state where the control file has >> checksums enabled and some blocks are not correctly synced, still you >> would notice rather quickly if the server is in an incorrect state at >> the follow-up startup. > > Would you? I think I'm with Fabien on this one and it seems worthwhile > to run fsync_pgdata() before and after update_controlfile() - the second > one should be really quick anyway. Note that fsync_pgdata is kind of heavy, it recurses everywhere. I think that a simple fsync on the control file only is enough. > Also, I suggest to maybe add a notice in verbose mode that we are > syncing the data directory - otherwise the user might wonder what's > going on at 100% done, though I haven't seen a large delay in my tests > so far. I agree, as it might not be cheap. -- Fabien.
Hi, Am Mittwoch, den 13.03.2019, 11:47 +0100 schrieb Magnus Hagander: > On Wed, Mar 13, 2019 at 11:41 AM Michael Banck <michael.banck@credativ.de> wrote: > > I propose we re-read the control file for the enable case after we > > finished operating on all files and (i) check the instance is still > > offline and (ii) update the checksums version from there. That should be > > a small but worthwhile change that could be done anyway. > > In (i) you need to also check that is' not offline *again*. Somebody > could start *and* stop the database while pg_checksums is running. But > that should hopefully be enough to check the time field? Good point. > > Also, I suggest to maybe add a notice in verbose mode that we are > > syncing the data directory - otherwise the user might wonder what's > > going on at 100% done, though I haven't seen a large delay in my tests > > so far. > > Seems like a good idea -- there certainly could be a substantial delay > there depending on data size and underlying storage. The attached patch should do the above, on top of Michael's last patchset. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment
On Wed, Mar 13, 2019 at 11:54 AM Sergei Kornilov <sk@zsrv.org> wrote:
Hi
One new question from me: how about replication?
Case: primary+replica, we shut down primary and enable checksum, and "started streaming WAL from primary" without any issue. I have master with checksums, but replica without.
Or cluster with checksums, then disable checksums on primary, but standby think we have checksums.
Enabling or disabling the checksums offline on the master quite clearly requires a rebuild of the standby, there is no other way (this is one of the reasons for the online enabling in that patch, so I still hope we can get that done -- but not for this version).
You would have the same with PITR backups for example. And especially if you have some tool that does block or segment level differential.
Of course, we have to make sure that this actually fails.
I wonder if we should bring out the big hammer and actually change the system id in pg_control when checksums are enabled/disabled by this tool? That should make it clear to any tool that it's changed.
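For reference, the "big hammer" could reuse the way initdb derives a system identifier (current time combined with the pid) and store a fresh one whenever the checksum state is changed; whether that is desirable is exactly the open question. A sketch:

    /* needs <sys/time.h> and <unistd.h> */
    struct timeval tv;
    uint64      sysidentifier;

    gettimeofday(&tv, NULL);
    sysidentifier = ((uint64) tv.tv_sec) << 32;
    sysidentifier |= ((uint64) tv.tv_usec) << 12;
    sysidentifier |= getpid() & 0xFFF;

    ControlFile->system_identifier = sysidentifier;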
Also we support ./configure --with-blocksize=(not equals 8)? make check on HEAD fails for me. If we support this - i think we need recheck BLCKSZ between compiled pg_checksum and used in PGDATA
You mean if the backend and pg_checksums is built with different blocksize? Yeah, that sounds like something which is a cheap check and should be done.
Hi, Am Mittwoch, den 13.03.2019, 12:24 +0100 schrieb Magnus Hagander: > > Also we support ./configure --with-blocksize=(not equals 8)? make > > check on HEAD fails for me. If we support this - i think we need > > recheck BLCKSZ between compiled pg_checksum and used in PGDATA > > You mean if the backend and pg_checksums is built with different > blocksize? Yeah, that sounds like something which is a cheap check and > should be done. I've been doing that in my pg_checksums fork for a while (as it further removed from the Postgres binaries) but yeah we should check for that as well in pg_checksums, see attached patch. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment
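Without seeing the attachment, such a check presumably compares the block size recorded in pg_control with the compiled-in one, along these lines:

    if (ControlFile->blcksz != BLCKSZ)
    {
        fprintf(stderr, _("%s: database cluster was initialized with block size %u, but %s was compiled with block size %u\n"),
                progname, ControlFile->blcksz, progname, BLCKSZ);
        exit(1);
    }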
Hi >> One new question from me: how about replication? >> Case: primary+replica, we shut down the primary and enable checksums, and the replica "started streaming WAL from primary" without any issue. I have a master with checksums, but a replica without. >> Or a cluster with checksums, then disable checksums on the primary, but the standby thinks we have checksums. > > Enabling or disabling the checksums offline on the master quite clearly requires a rebuild of the standby, there is no other way (this is one of the reasons for the online enabling in that patch, so I still hope we can get that done -- but not for this version). I mean this should be at least documented. Change the system id... Maybe that is reasonable. >> Also, do we support ./configure --with-blocksize= (not equal to 8)? make check on HEAD fails for me. If we support this, I think we need to recheck BLCKSZ between the compiled pg_checksums and the one used in PGDATA. > > You mean if the backend and pg_checksums is built with different blocksize? Yeah, that sounds like something which is a cheap check and should be done. Yep regards, Sergei
Hi, >> > Also we support ./configure --with-blocksize=(not equals 8)? make >> > check on HEAD fails for me. If we support this - i think we need >> > recheck BLCKSZ between compiled pg_checksum and used in PGDATA >> >> You mean if the backend and pg_checksums is built with different >> blocksize? Yeah, that sounds like something which is a cheap check and >> should be done. > > I've been doing that in my pg_checksums fork for a while (as it further > removed from the Postgres binaries) but yeah we should check for that as > well in pg_checksums, see attached patch. Seems good. And I think we need backpath this check to pg11. similar to cross-version compatibility checks regards, Sergei
On Wed, Mar 13, 2019 at 12:40 PM Sergei Kornilov <sk@zsrv.org> wrote:
Hi
>> One new question from me: how about replication?
>> Case: primary+replica, we shut down primary and enable checksum, and "started streaming WAL from primary" without any issue. I have master with checksums, but replica without.
>> Or cluster with checksums, then disable checksums on primary, but standby think we have checksums.
>
> Enabling or disabling the checksums offline on the master quite clearly requires a rebuild of the standby, there is no other way (this is one of the reasons for the online enabling in that patch, so I still hope we can get that done -- but not for this version).
I mean this should be at least documented.
Change system id... Maybe it is reasonable
I think this is dangerous enough that it needs to be enforced and not documented.
Most people who care about checksums are also going to be having either replication or backup...
>> Also we support ./configure --with-blocksize=(not equals 8)? make check on HEAD fails for me. If we support this - i think we need recheck BLCKSZ between compiled pg_checksum and used in PGDATA
>
> You mean if the backend and pg_checksums is built with different blocksize? Yeah, that sounds like something which is a cheap check and should be done.
Yep
This one I could live with being only a documented problem rather than enforced, but it also seems very simple to enforce.
On Wed, Mar 13, 2019 at 12:09:24PM +0100, Michael Banck wrote: > The attached patch should do the above, on top of Michael's last > patchset. What you are doing here looks like a good defense in itself. -- Michael
Attachment
On Wed, Mar 13, 2019 at 02:43:39PM +0300, Sergei Kornilov wrote:
> Seems good. And I think we need to backpatch this check to pg11, similar
> to the cross-version compatibility checks.

Good point raised, a backpatch looks appropriate. It would be nice to
get to something more dynamic, but pg_checksum_block() uses BLCKSZ
directly :(
--
Michael
Attachment
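For context, a rough sketch of the per-page work that enabling checksums implies, loosely modelled on the scan_file() loop of pg_verify_checksums (variable names are illustrative and error handling is omitted); this is where the hard-coded BLCKSZ dependency comes from:

    char        buf[BLCKSZ];
    PageHeader  header = (PageHeader) buf;

    for (blockno = 0; blockno < blocks_in_segment; blockno++)
    {
        if (read(fd, buf, BLCKSZ) != BLCKSZ)
            break;                  /* short read or I/O error */

        if (PageIsNew(buf))
            continue;               /* new pages carry no checksum */

        /* when enabling: recompute the checksum and write the page back */
        header->pd_checksum = pg_checksum_page(buf, blockno + segmentno * RELSEG_SIZE);
        if (lseek(fd, -(off_t) BLCKSZ, SEEK_CUR) < 0 ||
            write(fd, buf, BLCKSZ) != BLCKSZ)
            break;                  /* I/O error */
    }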
Hi, Am Mittwoch, den 13.03.2019, 12:24 +0100 schrieb Magnus Hagander: > On Wed, Mar 13, 2019 at 11:54 AM Sergei Kornilov <sk@zsrv.org> wrote: > > One new question from me: how about replication? > > Case: primary+replica, we shut down primary and enable checksum, and > > "started streaming WAL from primary" without any issue. I have > > master with checksums, but replica without. > > Or cluster with checksums, then disable checksums on primary, but > > standby think we have checksums. > > Enabling or disabling the checksums offline on the master quite > clearly requires a rebuild of the standby, there is no other way What about shutting down both and running pg_checksums --enable on the standby as well? Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Hi, Am Mittwoch, den 13.03.2019, 12:43 +0100 schrieb Magnus Hagander: > I think this is dangerous enough that it needs to be enforced and not > documented. Changing the cluster ID might have some other side-effects, I think there are several cloud-native 3rd party solutions that use the cluster ID as some kind of unique identifier for an instance. It might not be an issue in practise, but then again, it might break other stuff down the road. Another possibility would be to extend the replication protocol's IDENTIFY_SYSTEM command to also report the checksum version so that the standby can check against the local control file on startup. But I am not sure we can easily extend IDENTIFY_SYSTEM this way nor whether we should for this rather corner-casey thing? Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
On Wed, Mar 13, 2019 at 4:46 PM Michael Banck <michael.banck@credativ.de> wrote:
Hi,
Am Mittwoch, den 13.03.2019, 12:24 +0100 schrieb Magnus Hagander:
> On Wed, Mar 13, 2019 at 11:54 AM Sergei Kornilov <sk@zsrv.org> wrote:
> > One new question from me: how about replication?
> > Case: primary+replica, we shut down primary and enable checksum, and
> > "started streaming WAL from primary" without any issue. I have
> > master with checksums, but replica without.
> > Or cluster with checksums, then disable checksums on primary, but
> > standby think we have checksums.
>
> Enabling or disabling the checksums offline on the master quite
> clearly requires a rebuild of the standby, there is no other way
What about shutting down both and running pg_checksums --enable on the
standby as well?
That sounds pretty fragile to me. But if we can prove that the user has done things in the right order, sure. But how can we do that in an offline process? what if the user just quickly restarted the primary note after the standby had been shut down? We'll need to somehow validate it across the nodes..
On Wed, Mar 13, 2019 at 4:51 PM Michael Banck <michael.banck@credativ.de> wrote:
Hi,
Am Mittwoch, den 13.03.2019, 12:43 +0100 schrieb Magnus Hagander:
> I think this is dangerous enough that it needs to be enforced and not
> documented.
Changing the cluster ID might have some other side-effects, I think
there are several cloud-native 3rd party solutions that use the cluster
ID as some kind of unique identifier for an instance. It might not be an
issue in practise, but then again, it might break other stuff down the
road.
Well, whatever we do they have to update, right? If we're not changing it, then we're basically saying that it's (systemid, checksums) that is the identifier of the cluster, not just systemid. They'd have to go around and check each node individually for the configuration and not just use systemid anyway, so what's the actual win?
Another possibility would be to extend the replication protocol's
IDENTIFY_SYSTEM command to also report the checksum version so that the
standby can check against the local control file on startup. But I am
not sure we can easily extend IDENTIFY_SYSTEM this way nor whether we
should for this rather corner-casey thing?
We could, but is it really a win in those scenarios? Vs just making the systemid different? With systemid being different it's obvious that something needs to be done. If it's not then at the best, if we check it in the standby startup, the standby won't start. But people can still end up with things like unusuable/corrupt backups for example.
On Wed, Mar 13, 2019 at 02:43:39PM +0300, Sergei Kornilov wrote:
> Seems good. And I think we need to backpatch this check to pg11, similar
> to the cross-version compatibility checks.

+		fprintf(stderr, _("%s: data directory block size %d is different to compiled-in block size %d.\n"),
+				progname, ControlFile->blcksz, BLCKSZ);

The error message looks grammatically a bit weird to me. What about the
following? Say: "database block size of %u is different from supported
block size of %u." Better ideas are welcome. Please note that those
integers are unsigned, by the way.
--
Michael
Attachment
On Wed, Mar 13, 2019 at 12:24:21PM +0100, Magnus Hagander wrote:
> Enabling or disabling the checksums offline on the master quite clearly
> requires a rebuild of the standby, there is no other way (this is one of
> the reasons for the online enabling in that patch, so I still hope we can
> get that done -- but not for this version).

I am curious to understand why this would require a rebuild of the
standby. Technically FPWs don't update the checksum of a page when it
is WAL-logged, so even if a primary and a standby don't agree on the
checksum configuration, it is the timing where pages are flushed in
the local instance which counts for checksum correctness.

> You mean if the backend and pg_checksums is built with different blocksize?
> Yeah, that sounds like something which is a cheap check and should be done.

Yes, we should check for that; checksum calculation uses BLCKSZ with a
hardcoded value, so a mismatch would cause computation failures. It
could be possible to not have this restriction if we made the block
size an argument of the checksum calculation though.
--
Michael
Attachment
On Wed, Mar 13, 2019 at 08:56:33PM +0900, Michael Paquier wrote:
> On Wed, Mar 13, 2019 at 12:09:24PM +0100, Michael Banck wrote:
> > The attached patch should do the above, on top of Michael's last
> > patchset.
>
> What you are doing here looks like a good defense in itself.

More thoughts on that, and here is a short summary of the thread.

+	/* Check if control file has changed */
+	if (controlfile_last_updated != ControlFile->time)
+	{
+		fprintf(stderr, _("%s: control file has changed since startup\n"), progname);
+		exit(1);
+	}

Actually, under the conditions discussed on this thread that Postgres
is started in parallel of pg_checksums, imagine the following scenario:
- pg_checksums starts to enable checksums, it reads a block to
calculate its checksum, then calculates it.
- Postgres has been started in parallel, writes the same block.
- pg_checksums finishes the block calculation, writes back the block
it has just read.
- Postgres stops, some data is lost.
- At the end of pg_checksums, we complain that the control file has
been updated since the start of pg_checksums.

I think that we should be way more noisy about this error message and
document properly that Postgres should not be started while checksums
are enabled. Basically, I guess that it should mention that there is
a risk of corruption because of this parallel operation.

Hence, based on what I could read on this thread, we'd like to have
the following things added to the core patch set:
1) When enabling checksums, fsync the data folder. Then update the
control file, and finally fsync the control file (+ flush of the
parent folder to make the whole durable). This way a host crash never
actually means that we finish in an inconsistent state.
2) When checksums are disabled, update the control file, then fsync
it + its parent folder.
3) Add tracking of the control file data, and complain loudly before
trying to update the file if there are any inconsistencies found.
4) Document with a big-fat-red warning that postgres should not be
worked on while the tool is enabling or disabling checksums.

There is a part discussed about standbys and primaries which do not
have the same checksum settings, but I commented on that in [1]. There
is a secondary bug fix to prevent the tool from running if the data
folder has been initialized with a block size different from what
pg_checksums has been compiled with in [2]. The patch looks good,
still the error message could be better per my lookup.

[1]: https://www.postgresql.org/message-id/20190314002342.GC3493@paquier.xyz
[2]: https://www.postgresql.org/message-id/20190313224742.GA3493@paquier.xyz

Am I missing something?
--
Michael
Attachment
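A sketch of the ordering described in points 1) and 2), using the frontend fsync helpers from file_utils.h; the control-file helper shown here is the one that gets introduced later in this thread, so its exact signature is an assumption:

    /* enable: first make all the rewritten relation files durable ... */
    fsync_pgdata(DataDir, progname, PG_VERSION_NUM);

    /* ... then flip the flag and flush the control file last */
    ControlFile->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
    update_controlfile(DataDir, progname, ControlFile, true);

    /* disable: only the control file needs to be changed and flushed */
    ControlFile->data_checksum_version = 0;
    update_controlfile(DataDir, progname, ControlFile, true);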
Hi, Am Mittwoch, den 13.03.2019, 17:54 +0100 schrieb Magnus Hagander: > On Wed, Mar 13, 2019 at 4:51 PM Michael Banck <michael.banck@credativ.de> wrote: > > Am Mittwoch, den 13.03.2019, 12:43 +0100 schrieb Magnus Hagander: > > > I think this is dangerous enough that it needs to be enforced and not > > > documented. > > > > Changing the cluster ID might have some other side-effects, I think > > there are several cloud-native 3rd party solutions that use the cluster > > ID as some kind of unique identifier for an instance. It might not be an > > issue in practise, but then again, it might break other stuff down the > > road. > > Well, whatever we do they have to update, right? Yeah, but I am saying their orchestrators might get confused about where the old instance went and what this new thing with a totally different systemid is and lose the connection between the two. Maybe that is a feature and not a bug. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
On Thu, Mar 14, 2019 at 1:23 AM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Mar 13, 2019 at 12:24:21PM +0100, Magnus Hagander wrote:
> Enabling or disabling the checksums offline on the master quite clearly
> requires a rebuild of the standby, there is no other way (this is one of
> the reasons for the online enabling in that patch, so I still hope we can
> get that done -- but not for this version).
I am curious to understand why this would require a rebuild of the
standby. Technically FPWs don't update the checksum of a page when it
is WAL-logged, so even if a primary and a standby don't agree on the
checksum configuration, it is the timing where pages are flushed in
the local instance which counts for checksum correctness.
Are you suggesting we should support running with a master with checksums on and a standby with checksums off in the same cluster? That seems.. Very fragile.
On Thu, Mar 14, 2019 at 5:39 AM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Mar 13, 2019 at 08:56:33PM +0900, Michael Paquier wrote:
> On Wed, Mar 13, 2019 at 12:09:24PM +0100, Michael Banck wrote:
> > The attached patch should do the above, on top of Michael's last
> > patchset.
>
> What you are doing here looks like a good defense in itself.
More thoughts on that, and here is a short summary of the thread.
+ /* Check if control file has changed */
+ if (controlfile_last_updated != ControlFile->time)
+ {
+ fprintf(stderr, _("%s: control file has changed since startup\n"), progname);
+ exit(1);
+ }
Actually, under the conditions discussed on this thread that Postgres
is started in parallel of pg_checksums, imagine the following
scenario:
- pg_checksums starts to enable checksums, it reads a block to
calculate its checksum, then calculates it.
- Postgres has been started in parallel, writes the same block.
- pg_checksums finishes the block calculation, writes back the block
it has just read.
- Postgres stops, some data is lost.
- At the end of pg_checksums, we complain that the control file has
been updated since the start of pg_checksums.
I think that we should be way more noisy about this error message and
document properly that Postgres should not be started while checksums
are enabled. Basically, I guess that it should mention that there is
a risk of corruption because of this parallel operation.
Hence, based on what I could read on this thread, we'd like to have
the following things added to the core patch set:
1) When enabling checksums, fsync the data folder. Then update the
control file, and finally fsync the control file (+ flush of the
parent folder to make the whole durable). This way a host crash never
actually means that we finish in an inconsistent state.
2) When checksums are disabled, update the control file, then fsync
it + its parent folder.
3) Add tracking of the control file data, and complain loudly before
trying to update the file if there are any inconsistencies found.
4) Document with a big-fat-red warning that postgres should not be
worked on while the tool is enabling or disabling checksums.
Given that the failure is data corruption, I don't think big fat warning is enough. We should really make it impossible to start up the postmaster by mistake during the checksum generation. People don't read the documentation until it's too late. And it might not even be under their control - some automated tool might go in and try to start postgres, and boom, corruption.
One big-hammer method could be similar to what pg_upgrade does -- temporarily rename away the controlfile so postgresql can't start, and when done, put it back.
Re: Magnus Hagander 2019-03-14 <CABUevEx7QZLOjWDvwTdm1VM+mjsDm7=ZmB8qck7nDmcHEY5O5g@mail.gmail.com>
> Are you suggesting we should support running with a master with checksums
> on and a standby with checksums off in the same cluster? That seems.. Very
> fragile.

The case "shut down master and standby, run pg_checksums on both, and
start them again" should be supported. That seems safe to do, and a
real-world use case.

Changing the system id to a random number would complicate this.

(Horrible idea: maybe just adding 1 (= checksum version) to the system
id would work?)

Christoph
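A tiny sketch of that idea, purely illustrative (whether the system identifier should be touched at all is exactly what is being debated here):

    /* same deterministic transformation applied on primary and standby */
    ControlFile->system_identifier += PG_DATA_CHECKSUM_VERSION;    /* i.e. +1 when enabling */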
On Thu, Mar 14, 2019 at 3:28 PM Christoph Berg <myon@debian.org> wrote:
Re: Magnus Hagander 2019-03-14 <CABUevEx7QZLOjWDvwTdm1VM+mjsDm7=ZmB8qck7nDmcHEY5O5g@mail.gmail.com>
> Are you suggesting we should support running with a master with checksums
> on and a standby with checksums off in the same cluster? That seems.. Very
> fragile.
The case "shut down master and standby, run pg_checksums on both, and
start them again" should be supported. That seems safe to do, and a
real-world use case.
I can agree with that, if we can declare it safe. You might need some way to ensure it was shut down cleanly on both sides, I'm guessing.
Changing the system id to a random number would complicate this.
(Horrible idea: maybe just adding 1 (= checksum version) to the system
id would work?)
Or any other way of changing the systemid in a predictable way would also work, right? As long as it's done the same on both sides. And that way it would look different to any system that *doesn't* know what it means, which is probably a good thing.
Hi,

Am Donnerstag, den 14.03.2019, 15:26 +0100 schrieb Magnus Hagander:
> Given that the failure is data corruption, I don't think big fat
> warning is enough. We should really make it impossible to start up the
> postmaster by mistake during the checksum generation. People don't
> read the documentation until it's too late. And it might not even be
> under their control - some automated tool might go in and try to start
> postgres, and boom, corruption.

I guess you're right.

> One big-hammer method could be similar to what pg_upgrade does --
> temporarily rename away the controlfile so postgresql can't start, and
> when done, put it back.

That sounds like a good solution to me. I've made a PoC patch for that,
see attached.

The only question is whether pg_checksums should try to move pg_control
back (i) on failure (ii) when interrupted?

Michael
Attachment
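A rough sketch of the pg_upgrade-style approach (the macro names follow the patch discussed later in this thread; path buffers and error handling are abbreviated):

    #define CONTROL_FILE_PATH       "global/pg_control"
    #define CONTROL_FILE_PATH_TEMP  CONTROL_FILE_PATH ".pg_checksums_in_progress"

    snprintf(controlfile_path, sizeof(controlfile_path), "%s/%s",
             DataDir, CONTROL_FILE_PATH);
    snprintf(controlfile_path_temp, sizeof(controlfile_path_temp), "%s/%s",
             DataDir, CONTROL_FILE_PATH_TEMP);

    /* hide pg_control so that a postmaster cannot be started in parallel */
    if (rename(controlfile_path, controlfile_path_temp) != 0)
    {
        fprintf(stderr, _("%s: could not rename \"%s\" to \"%s\": %s\n"),
                progname, controlfile_path, controlfile_path_temp, strerror(errno));
        exit(1);
    }

    /* ... rewrite all blocks with checksums ... */

    /* put the control file back in place before updating it */
    if (rename(controlfile_path_temp, controlfile_path) != 0)
    {
        fprintf(stderr, _("%s: could not rename \"%s\" to \"%s\": %s\n"),
                progname, controlfile_path_temp, controlfile_path, strerror(errno));
        exit(1);
    }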
Hi,

Am Donnerstag, den 14.03.2019, 15:32 +0100 schrieb Magnus Hagander:
> On Thu, Mar 14, 2019 at 3:28 PM Christoph Berg <myon@debian.org> wrote:
> > Re: Magnus Hagander 2019-03-14 <CABUevEx7QZLOjWDvwTdm1VM+mjsDm7=ZmB8qck7nDmcHEY5O5g@mail.gmail.com>
> > > Are you suggesting we should support running with a master with checksums
> > > on and a standby with checksums off in the same cluster? That seems.. Very
> > > fragile.
> >
> > The case "shut down master and standby, run pg_checksums on both, and
> > start them again" should be supported. That seems safe to do, and a
> > real-world use case.
>
> I can agree with that, if we can declare it safe. You might need some
> way to ensure it was shut down cleanly on both sides, I'm guessing.
>
> > Changing the system id to a random number would complicate this.
> >
> > (Horrible idea: maybe just adding 1 (= checksum version) to the system
> > id would work?)
>
> Or any other way of changing the systemid in a predictable way would
> also work, right? As long as it's done the same on both sides. And
> that way it would look different to any system that *doesn't* know
> what it means, which is probably a good thing.

If we change the system identifier, we'll have to reset the WAL as well
or otherwise we'll get "PANIC: could not locate a valid checkpoint
record" on startup. So even if we do it predictably on both primary and
standby I guess the standby would need to be re-cloned?

So I think an option that skips that for people who know what they are
doing with the streaming replication setup would be required, should we
decide to bump the system identifier.

Michael
On Thu, Mar 14, 2019 at 03:23:59PM +0100, Magnus Hagander wrote:
> Are you suggesting we should support running with a master with checksums
> on and a standby with checksums off in the same cluster? That seems.. Very
> fragile.

Well, saying that it is supported is too big a term for that. What I
am saying is that the problems you are pointing out are not as bad as
you seem to mean they are, as long as an operator does not copy on-disk
pages from one node to the other one. Knowing that checksums apply
only to pages flushed on disk on a local node, everything going
through WAL for availability is actually able to work fine:
- PITR
- archive recovery.
- streaming replication.

Reading the code I understand that. I have as well done some tests
with a primary/standby configuration to convince myself, using pgbench
on both nodes (read-write for the primary, read-only on the standby),
with a checkpoint (or restart point) triggered on each node every 20s.
If one node has checksums enabled and the other checksums disabled,
then I am not seeing any inconsistency.

However, anything which does a physical copy of pages could get things
easily messed up if one node has checksums disabled and the other
enabled. One such tool is pg_rewind. If the promoted standby has
checksums disabled (becoming the source), and the old master to rewind
has checksums enabled, then the rewind could likely copy pages which
do not have their checksums set correctly, resulting in incorrect
checksums on the old master.

So yes, it is easy to mess up things, however this does not apply to
all configurations. The suggestion from Christoph to enable checksums
on both nodes separately would work, and personally I find the
suggestion to update the system ID after enabling or disabling
checksums an over-engineered design because of the reasons in the
first part of this email (it is technically doable to enable checksums
with a minimum downtime and a failover), so my recommendation would be
to document that when enabling checksums on one instance in a cluster,
it should be applied to all instances, as it could cause problems with
any tools performing a physical copy of relation files or blocks.
--
Michael
Attachment
On Thu, Mar 14, 2019 at 04:26:20PM +0100, Michael Banck wrote:
> Am Donnerstag, den 14.03.2019, 15:26 +0100 schrieb Magnus Hagander:
> > One big-hammer method could be similar to what pg_upgrade does --
> > temporarily rename away the controlfile so postgresql can't start, and
> > when done, put it back.
>
> That sounds like a good solution to me. I've made a PoC patch for that,
> see attached.

Indeed. I did not know this trick from pg_upgrade. We could just use
the same.

> The only question is whether pg_checksums should try to move pg_control
> back (i) on failure (ii) when interrupted?

Yes, we should have a callback on SIGINT and SIGTERM here which just
moves the control file back in place if the temporary one exists. I
have been able to grab some time to incorporate the feedback gathered
on this thread, and please find attached a new version of the patch to
add --enable/--disable. The main changes are:
- When enabling checksums, fsync the data directory first, and at the
end update/flush the control file and its parent folder, as Fabien has
reported.
- When disabling checksums, only work on the control file, as Fabien
has also reported.
- Rename the control file when beginning the enabling operation, with
a callback to rename the file back if the operation is interrupted.

Does this make sense?
--
Michael
Attachment
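A sketch of what such a cleanup callback could look like on the frontend side (names are illustrative; pqsignal() is what the other src/bin tools already use, and signal-safety niceties are glossed over):

    static void
    cleanup_on_signal(SIGNAL_ARGS)
    {
        /* if the temporary control file still exists, move it back in place */
        if (access(controlfile_path_temp, F_OK) == 0)
            (void) rename(controlfile_path_temp, controlfile_path);
        exit(1);
    }

    /* in main(), before starting to rewrite any blocks */
    pqsignal(SIGINT, cleanup_on_signal);
    pqsignal(SIGTERM, cleanup_on_signal);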
On Fri, Mar 15, 2019 at 11:50:27AM +0900, Michael Paquier wrote: > - Rename the control file when beginning the enabling operation, with > a callback to rename the file back if the operation is interrupted. > > Does this make sense? Just before I forget... Please note that this handles interruptions but not failures, based on the assumption that on failures we can know that the system was working on its checksums thanks to the temporary control file so that's useful for debugging in my opinion. -- Michael
Attachment
Hi,

Am Freitag, den 15.03.2019, 11:50 +0900 schrieb Michael Paquier:
> On Thu, Mar 14, 2019 at 04:26:20PM +0100, Michael Banck wrote:
> > Am Donnerstag, den 14.03.2019, 15:26 +0100 schrieb Magnus Hagander:
> > > One big-hammer method could be similar to what pg_upgrade does --
> > > temporarily rename away the controlfile so postgresql can't start, and
> > > when done, put it back.
> >
> > That sounds like a good solution to me. I've made a PoC patch for that,
> > see attached.
>
> Indeed. I did not know this trick from pg_upgrade. We could just use
> the same.
>
> > The only question is whether pg_checksums should try to move pg_control
> > back (i) on failure (ii) when interrupted?
>
> Yes, we should have a callback on SIGINT and SIGTERM here which just
> moves the control file back in place if the temporary one exists. I
> have been able to grab some time to incorporate the feedback gathered
> on this thread, and please find attached a new version of the patch to
> add --enable/--disable.

Thanks! One thing stood out to me while quickly looking over it:

+	/*
+	 * Flush the control file and its parent path to make the change
+	 * durable.
+	 */
+	if (fsync_fname(controlfile_path, false, progname) != 0 ||
+		fsync_parent_path(controlfile_path, progname) != 0)
+	{
+		/* errors are already logged on failure */
+		exit(1);
+	}

ISTM this would not run fsync_parent_path() unless the first fsync fails
which is not the intended use. I guess we need two ifs here?

Michael
On Fri, Mar 15, 2019 at 09:04:51AM +0100, Michael Banck wrote: > ISTM this would not run fsync_parent_path() unless the first fsync fails > which is not the intended use. I guess we need two ifs here? Yes, let's do that. Let's see if others have input to offer about the patch. This thread is gathering attention, which is good. -- Michael
Attachment
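The split into two separate checks agreed on here would look roughly like this (same frontend helpers from file_utils.h as in the quoted hunk):

    /*
     * Flush the control file and its parent path to make the change
     * durable; the helpers already report errors themselves.
     */
    if (fsync_fname(controlfile_path, false, progname) != 0)
        exit(1);
    if (fsync_parent_path(controlfile_path, progname) != 0)
        exit(1);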
On Fri, Mar 15, 2019 at 1:49 AM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Mar 14, 2019 at 03:23:59PM +0100, Magnus Hagander wrote:
> Are you suggesting we should support running with a master with checksums
> on and a standby with checksums off in the same cluster? That seems.. Very
> fragile.
Well, saying that it is supported is too big a term for that. What I
am saying is that the problems you are pointing out are not as bad as
you seem to mean they are as long as an operator does not copy on-disk
pages from one node to the other one. Knowing that checksums apply
only to pages flushed on disk on a local node, everything going
through WAL for availability is actually able to work fine:
- PITR
- archive recovery.
- streaming replication.
Reading the code I understand that. I have as well done some tests
with a primary/standby configuration to convince myself, using pgbench
on both nodes (read-write for the primary, read-only on the standby),
with checkpoint (or restart point) triggered on each node every 20s.
If one node has checksum enabled and the other checksum disabled, then
I am not seeing any inconsistency.
However, anything which does a physical copy of pages could get things
easily messed up if one node has checksum disabled and the other
enabled. One such tool is pg_rewind. If the promoted standby has
checksums disabled (becoming the source), and the old master to rewind
has checksums enabled, then the rewind could likely copy pages which
have not their checksums set correctly, resulting in incorrect
checksums on the old master.
So yes, it is easy to mess up things, however this does not apply to
all configurations. The suggestion from Christoph to enable checksums
on both nodes separately would work, and personally I find the
suggestion to update the system ID after enabling or disabling
checksums an over-engineered design because of the reasons in the
first part of this email (it is technically doable to enable checksums
with a minimum downtime and a failover), so my recommendation would be
to document that when enabling checksums on one instance in a cluster,
it should be applied to all instances as it could cause problems with
any tools performing a physical copy of relation files or blocks.
As I said, that's a big hammer. I'm all for having a better solution. But I don't think it's acceptable not to have *any* defense against it, given how bad corruption it can lead to.
On Thu, Mar 14, 2019 at 4:54 PM Michael Banck <michael.banck@credativ.de> wrote:
Hi,
Am Donnerstag, den 14.03.2019, 15:32 +0100 schrieb Magnus Hagander:
> On Thu, Mar 14, 2019 at 3:28 PM Christoph Berg <myon@debian.org> wrote:
> > Re: Magnus Hagander 2019-03-14 <CABUevEx7QZLOjWDvwTdm1VM+mjsDm7=ZmB8qck7nDmcHEY5O5g@mail.gmail.com>
> > > Are you suggesting we should support running with a master with checksums
> > > on and a standby with checksums off in the same cluster? That seems.. Very
> > > fragile.
> >
> > The case "shut down master and standby, run pg_checksums on both, and
> > start them again" should be supported. That seems safe to do, and a
> > real-world use case.
>
> I can agree with that, if we can declare it safe. You might need some
> way to ensure it was shut down cleanly on both sides, I'm guessing.
>
> > Changing the system id to a random number would complicate this.
> >
> > (Horrible idea: maybe just adding 1 (= checksum version) to the system
> > id would work?)
>
> Or any other way of changing the systemid in a predictable way would
> also work, right? As long as it's done the same on both sides. And
> that way it would look different to any system that *doesn't* know
> what it means, which is probably a good thing.
If we change the system identifier, we'll have to reset the WAL as well
or otherwise we'll get "PANIC: could not locate a valid checkpoint
record" on startup. So even if we do it predictably on both primary and
standby I guess the standby would need to be re-cloned?
So I think an option that skips that for people who know what they are
doing with the streaming replication setup would be required, should we
decide to bump the system identifier.
Ugh. I did not think of that one. But yes, the main idea there would be that if you turn on checksums on the primary then you have to re-clone all standbys. That's what happens if we change the system identifier -- that's why it's the "big hammer method".
But yeah, an option to avoid it could be one way to deal with it. If we could find some safer way to handle it that'd be better, but otherwise changing the sysid by default and having an option to turn it off could be one way to deal with it.
On Thu, Mar 14, 2019 at 4:26 PM Michael Banck <michael.banck@credativ.de> wrote:
Hi,
Am Donnerstag, den 14.03.2019, 15:26 +0100 schrieb Magnus Hagander:
> Given that the failure is data corruption, I don't think big fat
> warning is enough. We should really make it impossible to start up the
> postmaster by mistake during the checksum generation. People don't
> read the documentation until it's too late. And it might not even be
> under their control - some automated tool might go in and try to start
> postgres, and boom, corruption.
I guess you're right.
> One big-hammer method could be similar to what pg_upgrade does --
> temporarily rename away the controlfile so postgresql can't start, and
> when done, put it back.
That sounds like a good solution to me. I've made PoC patch for that,
see attached.
The downside with this method is we can't get a nice error message during the attempted startup. But it should at least be safe, which is the most important part. And at least it's clear what's happening once you list the files and see the name of the temporary one.
//Magnus
On Fri, Mar 15, 2019 at 09:52:11AM +0100, Magnus Hagander wrote:
> As I said, that's a big hammer. I'm all for having a better solution. But I
> don't think it's acceptable not to have *any* defense against it, given how
> bad corruption it can lead to.

Hm... It looks like my arguments are not convincing enough. I am not
really convinced that there is any need to make that the default, nor
does it make much sense to embed that stuff directly into pg_checksums,
because that's actually just doing an extra step which is equivalent to
calling pg_resetwal, and we know that this tool has the awesome
reputation of causing more harm than anything else. At least I would
like to have an option which allows to *not* update the system
identifier, so that the cases I mentioned would be supported, because
then it becomes possible to enable checksums on a primary with only a
failover, as long as page copies are not directly involved and all
operations go through WAL. And that would be quite nice.
--
Michael
Attachment
Hi,

Am Mittwoch, den 13.03.2019, 12:24 +0100 schrieb Magnus Hagander:
> On Wed, Mar 13, 2019 at 11:54 AM Sergei Kornilov <sk@zsrv.org> wrote:
> > One new question from me: how about replication?
> > Case: primary+replica, we shut down primary and enable checksum, and
> > "started streaming WAL from primary" without any issue. I have
> > master with checksums, but replica without.
> > Or cluster with checksums, then disable checksums on primary, but
> > standby think we have checksums.
>
> Enabling or disabling the checksums offline on the master quite
> clearly requires a rebuild of the standby, there is no other way (this
> is one of the reasons for the online enabling in that patch, so I
> still hope we can get that done -- but not for this version).
>
> You would have the same with PITR backups for example.

I'd like to get back to PITR. I thought about this a bit and actually I
think PITR might be fine, in the sense that if you enabled or disabled
checksums on the cluster after the last basebackup and then do PITR
with the available WAL beyond that, you would get a working cluster,
just with the checksum state the cluster had at the time of the
basebackup. I think that would be entirely acceptable, so long as
nothing else breaks? I made some quick tests and did see no errors,
but maybe I am missing something?

> And especially if you have some tool that does block or segment level
> differential.

This might be the case, but I am not sure whether PostgreSQL core must
worry about it? Obviously the documentation must be made explicit about
these kinds of cases.

Michael
Hi,

Am Freitag, den 15.03.2019, 11:50 +0900 schrieb Michael Paquier:
> I have been able to grab some time to incorporate the feedback gathered
> on this thread, and please find attached a new version of the patch to
> add --enable/--disable.

Some more feedback:

1. There's a typo in line 578 which makes it fail to compile:

|src/bin/pg_checksums/pg_checksums.c:578:4: error: ‘y’ undeclared (first use in this function)
|  }y

2. Should the pqsignal() stuff only be setup in PG_MODE_ENABLE? Same
with the controlfile_path?

3. There's (I think) leftover debug output in the following places:

|+	printf(_("Renaming \"%s\" to \"%s\"\n"), controlfile_path,
|+		   controlfile_path_temp);

|+	printf(_("Renaming \"%s\" to \"%s\"\n"), controlfile_path_temp,
|+		   controlfile_path);

|+	printf(_("Syncing data folder\n"));

(that one is debatable, we are mentioning this only in verbose mode in
pg_basebackup but pg_checksums is more chatty anyway, so probably fine).

|+	printf(_("Updating control file\n"));

Besides the syncing message (which is user-relevant because they might
wonder what is taking so long), the others seem to be implementation
details we don't need to tell the user about.

Michael
On Fri, Mar 15, 2019 at 12:54:01PM +0100, Michael Banck wrote: > 1. There's a typo in line 578 which makes it fail to compile: > > |src/bin/pg_checksums/pg_checksums.c:578:4: error: ‘y’ undeclared (first use in this function) > | }y I am wondering where you got this one. My local branch does not have it, and the patch I sent does not seem to have it either. > 2. Should the pqsignal() stuff only be setup in PG_MODE_ENABLE? Same > with the controlfile_path? PG_MODE_DISABLE needs controlfile_path as well. We could make the cleanup only available when using --enable, the code just looked more simple in its current shape. I think it's just more simple to set everything unconditionally. This code may become more complicated in the future. > 3. There's (I think) leftover debug output in the following places: > > |+ printf(_("Renaming \"%s\" to \"%s\"\n"), controlfile_path, > |+ controlfile_path_temp); > > |+ printf(_("Renaming \"%s\" to \"%s\"\n"), controlfile_path_temp, > |+ controlfile_path); > > |+ printf(_("Syncing data folder\n")); > > (that one is debatable, we are mentioning this only in verbose mode in > pg_basebackup but pg_checksums is more chatty anyway, so probably > fine). This is wanted. Many folks have been complaning on this thread about crashes and such, surely we want logs about what happens :) > |+ printf(_("Updating control file\n")); > > Besides to the syncing message (which is user-relevant cause they might > wonder what is taking so long), the others seem to be implementation > details we don't need to tell the user about. Perhaps having them under --verbose makes more sense? -- Michael
Attachment
Hi, Am Freitag, den 15.03.2019, 21:23 +0900 schrieb Michael Paquier: > On Fri, Mar 15, 2019 at 12:54:01PM +0100, Michael Banck wrote: > > 1. There's a typo in line 578 which makes it fail to compile: > > > > > src/bin/pg_checksums/pg_checksums.c:578:4: error: ‘y’ undeclared (first use in this function) > > > }y > > I am wondering where you got this one. My local branch does not have > it, and the patch I sent does not seem to have it either. Mea culpa, I must've fat fingered something in the editor before applying your patch, sorry. I should've double-checked. > > 2. Should the pqsignal() stuff only be setup in PG_MODE_ENABLE? Same > > with the controlfile_path? > > PG_MODE_DISABLE needs controlfile_path as well. We could make the > cleanup only available when using --enable, the code just looked more > simple in its current shape. I think it's just more simple to set > everything unconditionally. This code may become more complicated in > the future. Ok. > > 3. There's (I think) leftover debug output in the following places: > > > > > + printf(_("Renaming \"%s\" to \"%s\"\n"), controlfile_path, > > > + controlfile_path_temp); > > > + printf(_("Renaming \"%s\" to \"%s\"\n"), controlfile_path_temp, > > > + controlfile_path); > > > + printf(_("Syncing data folder\n")); > > > > (that one is debatable, we are mentioning this only in verbose mode in > > pg_basebackup but pg_checksums is more chatty anyway, so probably > > fine). > > This is wanted. Many folks have been complaning on this thread about > crashes and such, surely we want logs about what happens :) > > > > + printf(_("Updating control file\n")); > > > > Besides to the syncing message (which is user-relevant cause they might > > wonder what is taking so long), the others seem to be implementation > > details we don't need to tell the user about. > > Perhaps having them under --verbose makes more sense? Well if we think it is essential in order to tell the user what happened in the case of an error, it shouldn't be verbose I guess. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Bonjour Michaël-san,

> Yes, that would be nice, for now I have focused. For pg_resetwal yes
> we could do it easily. Would you like to send a patch?

Here is a proposal for "pg_resetwal".

The implementation basically removes a lot of copy-paste and calls the
new update_controlfile function instead. I like removing useless code:-)

The resetwal implementation was doing a rm/create cycle, which was
leaving a small window for losing the controlfile. Not neat.

I do not see the value of *not* fsyncing the control file when writing
it, as it is by definition very precious, so I added a fsync. The
server-side branch uses the backend-available "pg_fsync", which
complies with server settings there and can do nothing if fsync is
disabled.

Maybe the two changes could be committed separately.

--
Fabien.
Attachment
On Sun, Mar 17, 2019 at 10:01:20AM +0100, Fabien COELHO wrote:
> The implementation basically removes a lot of copy-paste and calls the new
> update_controlfile function instead. I like removing useless code:-)

Yes, I spent something like 10 minutes looking at that code yesterday
and I agree that removing the control file to recreate it is not really
necessary, even if the window between its removal and recreation is
short.

> I do not see the value of *not* fsyncing the control file when writing it,
> as it is by definition very precious, so I added a fsync. The server-side
> branch uses the backend-available "pg_fsync", which complies with server
> settings there and can do nothing if fsync is disabled.

The issue here is that trying to embed directly the fsync routines
from file_utils.c into pg_resetwal.c messes up the inclusions because
pg_resetwal.c includes backend-side includes, which themselves touch
fd.h :(

In short your approach avoids some extra mess with the include
dependencies.

> Maybe the two changes could be committed separately.

I was thinking about this one, and for pg_rewind we don't care about
the fsync of the control file because the full data folder gets
fsync'd afterwards, and in the event of a crash in the middle of a
rewind the target data folder is surely not something to use, but we
do care for pg_checksums, and we do for pg_resetwal. Even if there is
the argument that usually callers of update_controlfile() would care a
lot about the control file and fsync it, I think that we need some
control on whether we do the fsync or not, because many tools have a
--no-sync and that should be fully respected.

So while your patch is on a good track, I would suggest to do the
following things to complete it:
- Add an extra argument bits16 to update_controlfile to pass a set of
optional flags, with NOSYNC being the only and current value. The
default is to flush the file.
- Move the wait event calls WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE and
WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE to controldata_utils.c.
- And then delete UpdateControlFile() in xlog.c, and use
update_controlfile() instead to remove even more code. The version in
xlog.c uses BasicOpenFile(), so we should use that as well in
update_controlfile() instead of OpenTransientFile(). As any errors
result in a PANIC we don't care about leaking fds.
--
Michael
Attachment
Michaël-san, > The issue here is that trying to embed directly the fsync routines > from file_utils.c into pg_resetwal.c messes up the inclusions because > pg_resetwal.c includes backend-side includes, which themselves touch > fd.h :( > > In short your approach avoids some extra mess with the include > dependencies. . I could remove the two "catalog/" includes from pg_resetwal, I assume that you meant these ones. >> Maybe the two changes could be committed separately. > > I was thinking about this one, and for pg_rewind we don't care about > the fsync of the control file because the full data folder gets > fsync'd afterwards and in the event of a crash in the middle of a > rewind the target data folder is surely not something to use, but we > do for pg_checksums, and we do for pg_resetwal. Even if there is the > argument that usually callers of update_controlfile() would care a > lot about the control file and fsync it, I think that we need some > control on if we do the fsync or not because many tools have a > --no-sync and that should be fully respected. > So while your patch is on a good track, I would suggest to do the > following things to complete it: > - Add an extra argument bits16 to update_controlfile to pass a set of > optional flags, with NOSYNC being the only and current value. The > default is to flush the file. Hmmm. I just did that, but what about just a boolean? What other options could be required? Maybe some locking/checking? > - Move the wait event calls WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE and > WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE to controldata_utils.c. Done. > - And then delete UpdateControlFile() in xlog.c, and use > update_controlfile() instead to remove even more code. I was keeping that one for another patch because it touches the backend code, but it makes sense to do that in one go for consistency. I kept the initial no-parameter function which calls the new one with 4 parameters, though, because it looks more homogeneous this way in the backend code. This is debatable. > The version in xlog.c uses BasicOpenFile(), so we should use also that > in update_controlfile() instead of OpenTransientFile(). As any errors > result in a PANIC we don't care about leaking fds. Done. Attached is an update. -- Fabien.
Attachment
On Sun, Mar 17, 2019 at 12:44:39PM +0100, Fabien COELHO wrote: > I could remove the two "catalog/" includes from pg_resetwal, I assume that > you meant these ones. Not exactly. What I meant is that if you try to call directly fsync_fname and fsync_parent_path from file_utils.h, then you get into trouble because of xlog.h.. Sure you can remove also the ones you removed. > Hmmm. I just did that, but what about just a boolean? What other options > could be required? Maybe some locking/checking? It is already expected from the caller to properly take ControlFileLock. Note I tend to worry too much about the extensibility of published APIs these days as well, so perhaps just a boolean would be fine, please let me reconsider that after some sleep, and it is not like the contents of this routine are going to become much complicated either, except potentially to control the flags on open(). :p > I kept the initial no-parameter function which calls the new one with 4 > parameters, though, because it looks more homogeneous this way in the > backend code. This is debatable. True, this actually makes back-patching a bit easier, and there are 13 calls of UpdateControlFile(). > Attached is an update. Thanks, I'll take a look at that tomorrow. You have one error at the end of update_controlfile(), where close() could issue a frontend-like error for the backend, calling exit() on the way. That's not good. (No need to send a new patch, I'll fix it myself.) -- Michael
Attachment
> You have one error at the end of update_controlfile(), where close() > could issue a frontend-like error for the backend, calling exit() on the > way. That's not good. (No need to send a new patch, I'll fix it > myself.) Indeed. I meant to merge the "if (close(fd))", but ended merging the error generation as well. -- Fabien
On Sun, Mar 17, 2019 at 12:44:39PM +0100, Fabien COELHO wrote:
> I kept the initial no-parameter function which calls the new one with 4
> parameters, though, because it looks more homogeneous this way in the
> backend code. This is debatable.

From a compatibility point of view, your position actually makes
sense, at least to me and after sleeping on it, as UpdateControlFile
is not static, and there are interactions with the other local
routines to read and write the control file because of the variable
ControlFile at the top of xlog.c. So I have kept the original
interface, which is now only a wrapper of the new routine.

> Attached is an update.

Thanks, I have committed the patch after fixing a couple of things.
After considering the interface, I have switched to a single boolean
as I could not actually imagine what kind of fancy features this could
be extended with. If I am wrong, let's adjust it later on.

Here are my notes about the fixes:
- pg_resetwal got broken because the path to the control file was
incorrect. Running the tests of pg_upgrade or the TAP tests of
pg_resetwal showed the failure.
- The previously-mentioned problem with close() in the new routine is
fixed.
- Header comments at the top of update_controlfile were a bit messed
up (s/Not/Note/, missed an "up" as well).
- pg_rewind was issuing a flush of the control file even if --no-sync
was used.
- Nit: incorrect header order in controldata_utils.c. I have kept the
backend-only includes grouped though.
--
Michael
Attachment
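Putting the pieces of this sub-thread together, the resulting interface looks roughly like the following; this is an approximation reconstructed from the discussion above, not a verbatim copy of the committed code:

    /* src/include/common/controldata_utils.h (approximate) */
    extern void update_controlfile(const char *DataDir, const char *progname,
                                   ControlFileData *ControlFile, bool do_sync);

    /* backend: the historical zero-argument wrapper in xlog.c remains */
    void
    UpdateControlFile(void)
    {
        update_controlfile(DataDir, NULL, ControlFile, true);
    }

    /* frontend tools with a --no-sync switch simply pass it through */
    update_controlfile(DataDir, progname, ControlFile, do_sync);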
Bonjour Michaël, > Here are my notes about the fixes: Thanks for the fixes. > - pg_resetwal got broken because the path to the control file was > incorrect. Running tests of pg_upgrade or the TAP tests of > pg_resetwal showed the failure. Hmmm… I thought I had done that with "make check-world":-( > - pg_rewind was issuing a flush of the control file even if --no-sync > was used. Indeed, I missed this one. > - Nit: incorrect header order in controldata_utils.c. I have kept the > backend-only includes grouped though. I'll pay attention to that the next time. Thanks for the push. -- Fabien.
On Fri, Mar 15, 2019 at 01:37:27PM +0100, Michael Banck wrote: > Am Freitag, den 15.03.2019, 21:23 +0900 schrieb Michael Paquier: >> Perhaps having them under --verbose makes more sense? > > Well if we think it is essential in order to tell the user what happened > in the case of an error, it shouldn't be verbose I guess. I would still keep them to be honest. I don't know, if others find the tool too chatty we could always rework that part and tune it. Please find attached an updated patch set, I have rebased that stuff on top of my recent commits to refactor the control file updates. While reviewing, I have found a problem in the docs (forgot a <para> markup previously), and there was a problem with the parent path fsync causing an interruption to not return the correct error code, and actually we should just use durable_rename() in this case (if --no-sync gets in then pg_mv_file() should be used of course). I have also been thinking about what we could add in the documentation, so this version adds a draft to describe the cases where enabling checksums can lead to corruption when involving multiple nodes in a cluster and tools doing physical copy of relation blocks. I have not done the --no-sync part yet on purpose, as that will most likely conflict based on the feedback received for this version.. -- Michael
Attachment
Bonjour Michaël,

> Please find attached an updated patch set, I have rebased that stuff
> on top of my recent commits to refactor the control file updates.

Patch applies cleanly, compiles, make check-world seems ok, doc build
ok. It would help if the patch included a version number. I assume
that this is v7.

Doc looks ok.

Moving the controlfile looks like an effective way to prevent any
concurrent start, as the fs operation is probably atomic, and
especially if external tools use the same trick. However this is not
the case yet, eg "pg_resetwal" uses a "postmaster.pid" hack instead.
Probably the method could be unified, possibly with some functions in
"controlfile_utils.c".

However, I think that there still is a race condition because of the
order in which it is implemented:

  pg_checksums reads control file
  pg_checksums checks control file contents...
  ** cluster may be started and the control file updated
  pg_checksums moves the (updated) control file
  pg_checksums proceeds on a running cluster
  pg_checksums moves back the control file
  pg_checksums updates the control file contents, overriding updates

I think that the correct way to handle this for enable/disable is:

  pg_checksums moves the control file
  pg_checksums reads, checks, proceeds, updates
  pg_checksums moves back the control file

This probably means extending the update_controlfile function a little
bit to allow a suffix. No big deal.

Ok, this might not work, because of the following, less likely, race
condition:

  postmaster opens control file RW
  pg_checksums moves control file, postmaster open file handle follows
  ...

So ISTM that we really need some locking to have something clean.

Why not always use "pgrename" instead of the strange pg_mv_file macro?

Help line about --check not simplified as suggested in a prior review,
although you said you would take it into account.

Tests look ok.

--
Fabien.
On Tue, Mar 19, 2019 at 11:48:25AM +0100, Fabien COELHO wrote:
> Moving the controlfile looks like an effective way to prevent any concurrent
> start, as the fs operation is probably atomic, and especially if external
> tools use the same trick. However this is not the case yet, eg
> "pg_resetwal" uses a "postmaster.pid" hack instead.

pg_upgrade does so. Note that pg_resetwal does not check either that
the PID in the file is actually running.

> Probably the method could be unified, possibly with some functions
> in "controlfile_utils.c".

Hm. postmaster.pid is just here to make sure that the instance is not
started at all, while we require the instance to be stopped cleanly
with other tools, so in my opinion it's not really consistent to
combine both.

> Ok, this might not work, because of the following, less likely, race
> condition:
>
>   postmaster opens control file RW
>   pg_checksums moves control file, postmaster open file handle follows
>   ...
>
> So ISTM that we really need some locking to have something clean.

We are talking about complicating a method which is already fine for a
window where the whole operation works, as it could take hours to
enable checksums, versus a couple of instructions. I am not sure that
it is worth complicating the code more.

> Help line about --check not simplified as suggested in a prior review,
> although you said you would take it into account.

Oops, it looks like this got lost because of the successive rebases. I
am sure to have updated it at some point.. Anyway, thanks for pointing
it out, I got that fixed on my local branch.
--
Michael
Attachment
Hi, Am Dienstag, den 19.03.2019, 11:48 +0100 schrieb Fabien COELHO: > Moving the controlfile looks like an effective way to prevent any > concurrent start, as the fs operation is probably atomic and especially if > external tools uses the same trick. However this is not the case yet, eg > "pg_resetwal" uses a "postmaster.pid" hack instead. Probably the method > could be unified, possibly with some functions in "controlfile_utils.c". > > However, I think that there still is a race condition because of the order > in which it is implemented: > > pg_checksums reads control file > pg_checksums checks control file contents... > ** cluster may be started and the control file updated > pg_checksums moves the (updated) control file > pg_checksums proceeds on a running cluster > pg_checksums moves back the control file > pg_checksums updates the control file contents, overriding updates > > I think that the correct way to handle this for enable/disable is: > > pg_checksums moves the control file > pg_checksums reads, checks, proceeds, updates > pg_checksums moves back the control file > > This probably means extending a little bit the update_controlfile function > to allow a suffix. No big deal. > > Ok, this might not work, because of the following, less likely, race > condition: > > postmaster opens control file RW > pg_checksums moves control file, posmater open file handle follows > ... > > So ISTM that we really need some locking to have something clean. I think starting the postmaster during offline maintenance is already quite some pilot error. As pg_checksums can potentially run for hours though, I agree it is important to disable the cluster in the meantime. There's really not a lot going on between pg_checksums reading the control file and moving it away - what you propose above sounds a bit like overengineering to me. If anything, we could include the postmaster.pid check from pg_resetwal after we have renamed the control file to make absolutely sure that the cluster is offline. Once the control file is gone and there is no postmaster.pid, it surely cannot be pg_checksums' problem anymore if a postmaster is started regardless of maintenance. I leave that to Michael to decide whether he thinks the above is warranted. I think the more important open issue is what to do about PITR and streaming replication, see my replies to Magnus elsewhere in the thread. > Why not always use "pgrename" instead of the strange pg_mv_file macro? pg_ugprade does it the same way, possibly both could be converted to pgrename, dunno. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
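The pg_resetwal-style check referred to here boils down to the following (simplified sketch; pg_resetwal runs it after chdir()ing into the data directory, so the relative path works):

    if ((fd = open("postmaster.pid", O_RDONLY, 0)) < 0)
    {
        if (errno != ENOENT)
        {
            fprintf(stderr, _("%s: could not open file \"%s\" for reading: %s\n"),
                    progname, "postmaster.pid", strerror(errno));
            exit(1);
        }
        /* no lock file: the cluster is not running */
    }
    else
    {
        close(fd);
        fprintf(stderr, _("%s: lock file \"%s\" exists\n"
                          "Is a server running?  If not, delete the lock file and try again.\n"),
                progname, "postmaster.pid");
        exit(1);
    }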
>> Ok, this might not work, because of the following, less likely, race >> condition: >> >> postmaster opens control file RW >> pg_checksums moves control file, postmaster's open file handle follows >> ... >> >> So ISTM that we really need some locking to have something clean. > > We are talking about complicating a method which is already fine for a > window where the whole operation works, as it could take hours to > enable checksums, versus a couple of instructions. I am not sure that > it is worth complicating the code more. Hmmm. Possibly. The point is that whatever mechanism we pick only needs to be implemented once. The whole point of pg is to have ACID transactional properties, but it does not achieve that on the controlfile, which I find paradoxical:-) Now if there is a race condition opportunity, ISTM that it should be as short as possible. Renaming before manipulating seems safer if other commands proceed like that as well. Probably if pg always renamed *THEN* opened before doing anything, in all commands, it could be safe, provided that the renaming is atomic, which I think is the case. That would avoid locking, at the price of a small probability of having a controlfile in a controlfile.command-that-failed-at-the-wrong-time state. Maybe it is okay. Maybe there is a need to be able to force the state back to something to recover from such an unlikely event, but probably it already exists (eg the postmaster could be dead without releasing the controlfile state). -- Fabien.
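For illustration, a minimal, self-contained sketch in plain C of the rename-first ordering discussed above. The temporary name matches the one from the posted patch, but the control file read/update, error recovery and the rest of the tool are elided, so this is only a sketch of the idea being debated, not code from the patch:

#include <stdio.h>
#include <stdlib.h>

#define CONTROL_FILE_PATH      "global/pg_control"
#define CONTROL_FILE_PATH_TEMP CONTROL_FILE_PATH ".pg_checksums_in_progress"

int
main(void)
{
	/*
	 * Take the cluster "offline" first: rename() is atomic on POSIX file
	 * systems, so from this point on a concurrently started postmaster
	 * would simply not find pg_control.  Assumes the current directory
	 * is the data directory.
	 */
	if (rename(CONTROL_FILE_PATH, CONTROL_FILE_PATH_TEMP) != 0)
	{
		perror("rename pg_control");
		exit(1);
	}

	/*
	 * ... read and check the renamed control file, rewrite the relation
	 * files with checksums, fsync everything ...
	 */

	/* Only once the work is durable, put the control file back. */
	if (rename(CONTROL_FILE_PATH_TEMP, CONTROL_FILE_PATH) != 0)
	{
		perror("rename pg_control back");
		exit(1);
	}
	return 0;
}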
Hi, On 2019-03-18 17:13:01 +0900, Michael Paquier wrote: > +/* > + * Locations of persistent and temporary control files. The control > + * file gets renamed into a temporary location when enabling checksums > + * to prevent a parallel startup of Postgres. > + */ > +#define CONTROL_FILE_PATH "global/pg_control" > +#define CONTROL_FILE_PATH_TEMP CONTROL_FILE_PATH ".pg_checksums_in_progress" I think this should be outright rejected. Again, you're making the control file into something it isn't. And there's no buy-in for this as far as I can tell outside of Fabien and you. For crying out loud, if the server crashes during this YOU'VE CORRUPTED THE CLUSTER. - Andres
Hi, Am Dienstag, den 19.03.2019, 08:36 -0700 schrieb Andres Freund: > On 2019-03-18 17:13:01 +0900, Michael Paquier wrote: > > +/* > > + * Locations of persistent and temporary control files. The control > > + * file gets renamed into a temporary location when enabling checksums > > + * to prevent a parallel startup of Postgres. > > + */ > > +#define CONTROL_FILE_PATH "global/pg_control" > > +#define CONTROL_FILE_PATH_TEMP CONTROL_FILE_PATH ".pg_checksums_in_progress" > > I think this should be outright rejected. Again, you're making the > control file into something it isn't. And there's no buyin for this as > far as I can tell outside of Fabien and you. For crying out loud, if the > server crashes during this YOU'VE CORRUPTED THE CLUSTER. The cluster is supposed to be offline during this. This is just an additional precaution so that nobody starts it during the operation - similar to how pg_upgrade disables the old data directory. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
On 2019-03-19 16:55:12 +0100, Michael Banck wrote: > Hi, > > Am Dienstag, den 19.03.2019, 08:36 -0700 schrieb Andres Freund: > > On 2019-03-18 17:13:01 +0900, Michael Paquier wrote: > > > +/* > > > + * Locations of persistent and temporary control files. The control > > > + * file gets renamed into a temporary location when enabling checksums > > > + * to prevent a parallel startup of Postgres. > > > + */ > > > +#define CONTROL_FILE_PATH "global/pg_control" > > > +#define CONTROL_FILE_PATH_TEMP CONTROL_FILE_PATH ".pg_checksums_in_progress" > > > > I think this should be outright rejected. Again, you're making the > > control file into something it isn't. And there's no buyin for this as > > far as I can tell outside of Fabien and you. For crying out loud, if the > > server crashes during this YOU'VE CORRUPTED THE CLUSTER. > > The cluster is supposed to be offline during this. This is just an > additional precaution so that nobody starts it during the operation - > similar to how pg_upgrade disables the old data directory. I don't see how that matters. Afterwards the cluster needs low level surgery to be recovered. That's a) undocumented b) likely to be done wrongly. This is completely unacceptable *AND UNNECESSARY*. Greetings, Andres Freund
Hi, Am Dienstag, den 19.03.2019, 09:00 -0700 schrieb Andres Freund: > On 2019-03-19 16:55:12 +0100, Michael Banck wrote: > > Am Dienstag, den 19.03.2019, 08:36 -0700 schrieb Andres Freund: > > > On 2019-03-18 17:13:01 +0900, Michael Paquier wrote: > > > > +/* > > > > + * Locations of persistent and temporary control files. The control > > > > + * file gets renamed into a temporary location when enabling checksums > > > > + * to prevent a parallel startup of Postgres. > > > > + */ > > > > +#define CONTROL_FILE_PATH "global/pg_control" > > > > +#define CONTROL_FILE_PATH_TEMP CONTROL_FILE_PATH ".pg_checksums_in_progress" > > > > > > I think this should be outright rejected. Again, you're making the > > > control file into something it isn't. And there's no buyin for this as > > > far as I can tell outside of Fabien and you. For crying out loud, if the > > > server crashes during this YOU'VE CORRUPTED THE CLUSTER. > > > > The cluster is supposed to be offline during this. This is just an > > additional precaution so that nobody starts it during the operation - > > similar to how pg_upgrade disables the old data directory. > > I don't see how that matters. Afterwards the cluster needs low level > surgery to be recovered. That's a) undocumented b) likely to be done > wrongly. This is completely unacceptable *AND UNNECESSARY*. Can you explain why low level surgery is needed and how that would look like? If pg_checksums successfully enables checksums, it will move back the control file and update the checksum version - the cluster is ready to be started again unless I am missing something? If pg_checksums is interrupted by the admin, it will move back the control file and the cluster is ready to be started again as well. If pg_checksums aborts with a failure, the admin will have to move back the control file before starting up the instance again, but I don't think that counts? If pg_checksums crashes due to I/O failures or other causes I can see how possibly the block it was currently writing might need low level surgery, but in that case we are in the domain of forensics already I guess and that still does not pertain to the control file? Sorry for being obtuse, I don't get it. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Hi, On 2019-03-19 17:08:17 +0100, Michael Banck wrote: > Am Dienstag, den 19.03.2019, 09:00 -0700 schrieb Andres Freund: > > On 2019-03-19 16:55:12 +0100, Michael Banck wrote: > > > Am Dienstag, den 19.03.2019, 08:36 -0700 schrieb Andres Freund: > > > > On 2019-03-18 17:13:01 +0900, Michael Paquier wrote: > > > > > +/* > > > > > + * Locations of persistent and temporary control files. The control > > > > > + * file gets renamed into a temporary location when enabling checksums > > > > > + * to prevent a parallel startup of Postgres. > > > > > + */ > > > > > +#define CONTROL_FILE_PATH "global/pg_control" > > > > > +#define CONTROL_FILE_PATH_TEMP CONTROL_FILE_PATH ".pg_checksums_in_progress" > > > > > > > > I think this should be outright rejected. Again, you're making the > > > > control file into something it isn't. And there's no buy-in for this as > > > > far as I can tell outside of Fabien and you. For crying out loud, if the > > > > server crashes during this YOU'VE CORRUPTED THE CLUSTER. > > > > > > The cluster is supposed to be offline during this. This is just an > > > additional precaution so that nobody starts it during the operation - > > > similar to how pg_upgrade disables the old data directory. > > > > I don't see how that matters. Afterwards the cluster needs low level > > surgery to be recovered. That's a) undocumented b) likely to be done > > wrongly. This is completely unacceptable *AND UNNECESSARY*. > > Can you explain why low level surgery is needed and how that would look > like? > > If pg_checksums successfully enables checksums, it will move back the > control file and update the checksum version - the cluster is ready to > be started again unless I am missing something? > > If pg_checksums is interrupted by the admin, it will move back the > control file and the cluster is ready to be started again as well. > > If pg_checksums aborts with a failure, the admin will have to move back > the control file before starting up the instance again, but I don't > think that counts? That absolutely counts. Even a short period would imo be unacceptable, but this will take *hours* in many clusters. It's completely possible that the machine crashes while the enabling is in progress. And after restarting postgres or even pg_checksums, you'll just get a message that there's no control file. How on earth is a normal user supposed to recover from that? Now, you could have a check for the control file under the temporary name, and emit a hint about renaming, but that has its own dangers (like people renaming it just to start postgres). And you're basically adding it because Fabien doesn't like postmaster.pid and wants to invent another lockout mechanism in this thread. Greetings, Andres Freund
Hi, Am Dienstag, den 19.03.2019, 09:13 -0700 schrieb Andres Freund: > On 2019-03-19 17:08:17 +0100, Michael Banck wrote: > > Am Dienstag, den 19.03.2019, 09:00 -0700 schrieb Andres Freund: > > > On 2019-03-19 16:55:12 +0100, Michael Banck wrote: > > > > Am Dienstag, den 19.03.2019, 08:36 -0700 schrieb Andres Freund: > > > > > On 2019-03-18 17:13:01 +0900, Michael Paquier wrote: > > > > > > +/* > > > > > > + * Locations of persistent and temporary control files. The control > > > > > > + * file gets renamed into a temporary location when enabling checksums > > > > > > + * to prevent a parallel startup of Postgres. > > > > > > + */ > > > > > > +#define CONTROL_FILE_PATH "global/pg_control" > > > > > > +#define CONTROL_FILE_PATH_TEMP CONTROL_FILE_PATH ".pg_checksums_in_progress" > > > > > > > > > > I think this should be outright rejected. Again, you're making the > > > > > control file into something it isn't. And there's no buyin for this as > > > > > far as I can tell outside of Fabien and you. For crying out loud, if the > > > > > server crashes during this YOU'VE CORRUPTED THE CLUSTER. > > > > > > > > The cluster is supposed to be offline during this. This is just an > > > > additional precaution so that nobody starts it during the operation - > > > > similar to how pg_upgrade disables the old data directory. > > > > > > I don't see how that matters. Afterwards the cluster needs low level > > > surgery to be recovered. That's a) undocumented b) likely to be done > > > wrongly. This is completely unacceptable *AND UNNECESSARY*. > > > > Can you explain why low level surgery is needed and how that would look > > like? > > > > If pg_checksums successfully enables checksums, it will move back the > > control file and update the checksum version - the cluster is ready to > > be started again unless I am missing something? > > > > If pg_checksums is interrupted by the admin, it will move back the > > control file and the cluster is ready to be started again as well. > > > > If pg_checksums aborts with a failure, the admin will have to move back > > the control file before starting up the instance again, but I don't > > think that counts? > > That absolutely counts. Even a short period would imo be unacceptable, > but this will take *hours* in many clusters. It's completely possible > that the machine crashes while the enabling is in progress. > > And after restarting postgres or even pg_checksums, you'll just get a > message that there's no control file. How on earth is a normal user > supposed to recover from that? Now, you could have a check for the > control file under the temporary name, and emit a hint about renaming, > but that has its own angers (like people renaming it just to start > postgres). Ok, thanks for explaining. I guess if we check for the temporary name in postmaster during startup if pg_control isn't there then a more generally useful name like "pg_control.maintenance" should be picked. We could then spit out a nice error message or hint like "the cluster has been disabled for maintenance. In order to start it up anyway, rename pg_control.maintenance to pg_control" or so. In any case, this would be more of a operational or availability issue and not a data-loss issue, as I feared from your previous mails. > And you're basically adding it because Fabien doesn't like > postmaster.pid and wants to invent another lockout mechanism in this > thread. 
I think the hazard of another DBA (or some automated configuration management or HA tool for that matter) looking at postmaster.pid, deciding that it is not a legit file from a running instance, deleting it and then starting up Postgres while pg_checksums is still at work is worse than the above scenario, but maybe if we make the content of postmaster.pid clear enough (like "maintenance in progress"?) it would be enough of a hint? Or do you have concrete suggestions on how this should work? I had the feeling (ab)using postmaster.pid for this would fly less than using the same scheme as pg_upgrade does, but I'm fine doing it either way. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Hi, On 2019-03-19 17:30:16 +0100, Michael Banck wrote: > Am Dienstag, den 19.03.2019, 09:13 -0700 schrieb Andres Freund: > > On 2019-03-19 17:08:17 +0100, Michael Banck wrote: > > > Am Dienstag, den 19.03.2019, 09:00 -0700 schrieb Andres Freund: > > > > On 2019-03-19 16:55:12 +0100, Michael Banck wrote: > > > > > Am Dienstag, den 19.03.2019, 08:36 -0700 schrieb Andres Freund: > > > > > > On 2019-03-18 17:13:01 +0900, Michael Paquier wrote: > > > > > > > +/* > > > > > > > + * Locations of persistent and temporary control files. The control > > > > > > > + * file gets renamed into a temporary location when enabling checksums > > > > > > > + * to prevent a parallel startup of Postgres. > > > > > > > + */ > > > > > > > +#define CONTROL_FILE_PATH "global/pg_control" > > > > > > > +#define CONTROL_FILE_PATH_TEMP CONTROL_FILE_PATH ".pg_checksums_in_progress" > > > > > > > > > > > > I think this should be outright rejected. Again, you're making the > > > > > > control file into something it isn't. And there's no buyin for this as > > > > > > far as I can tell outside of Fabien and you. For crying out loud, if the > > > > > > server crashes during this YOU'VE CORRUPTED THE CLUSTER. > > > > > > > > > > The cluster is supposed to be offline during this. This is just an > > > > > additional precaution so that nobody starts it during the operation - > > > > > similar to how pg_upgrade disables the old data directory. > > > > > > > > I don't see how that matters. Afterwards the cluster needs low level > > > > surgery to be recovered. That's a) undocumented b) likely to be done > > > > wrongly. This is completely unacceptable *AND UNNECESSARY*. > > > > > > Can you explain why low level surgery is needed and how that would look > > > like? > > > > > > If pg_checksums successfully enables checksums, it will move back the > > > control file and update the checksum version - the cluster is ready to > > > be started again unless I am missing something? > > > > > > If pg_checksums is interrupted by the admin, it will move back the > > > control file and the cluster is ready to be started again as well. > > > > > > If pg_checksums aborts with a failure, the admin will have to move back > > > the control file before starting up the instance again, but I don't > > > think that counts? > > > > That absolutely counts. Even a short period would imo be unacceptable, > > but this will take *hours* in many clusters. It's completely possible > > that the machine crashes while the enabling is in progress. > > > > And after restarting postgres or even pg_checksums, you'll just get a > > message that there's no control file. How on earth is a normal user > > supposed to recover from that? Now, you could have a check for the > > control file under the temporary name, and emit a hint about renaming, > > but that has its own angers (like people renaming it just to start > > postgres). > > Ok, thanks for explaining. > > I guess if we check for the temporary name in postmaster during startup > if pg_control isn't there then a more generally useful name like > "pg_control.maintenance" should be picked. We could then spit out a nice > error message or hint like "the cluster has been disabled for > maintenance. In order to start it up anyway, rename > pg_control.maintenance to pg_control" or so. To be very clear: I am going to try to stop any patch with this mechanism from going into the tree. I think it's an absurdly bad idea. There'd need to be significant support from a number of other committers to convince me otherwise. 
> In any case, this would be more of a operational or availability issue > and not a data-loss issue, as I feared from your previous mails. It's just about indistinguishable for normal users. > > And you're basically adding it because Fabien doesn't like > > postmaster.pid and wants to invent another lockout mechanism in this > > thread. > > I think the hazard of another DBA (or some automated configuration > management or HA tool for that matter) looking at postmaster.pid, > deciding that it is not a legit file from a running instance, deleting > it and then starting up Postgres while pg_checksums is still at work is > worse than the above scenario, but maybe if we make the content of > postmaster.pid clear enough (like "maintenance in progress"?) it would > be enough of a hint? Err, how would such a tool decide to do that safely? And if it did so, how would it not cause problems in postgres' normal operation, given that postmaster.pid is crucial to prevent two postgres instances running at the same time? > Or do you have concrete suggestions on how this should work? Create a postmaster.pid with the pid of pg_checksums. That'll make a postgres start check whether that pid is still alive. There'd probably need to be a tiny change to CreateLockFile() to prevent it from checking whether any shared memory is connected. FWIW, it'd probably actually be good if pg_checksums (and some other tools) did most if not all the checks in CreateLockFile(). I'm not sure it needs to be this patch's responsibility to come up with a scheme here at all however. pg_rewind, pg_resetwal, pg_upgrade all don't really have a lockout mechanism, and it hasn't caused a ton of problems. I think it'd be good to invent something better, but it can't be some half-assed approach that'll lead people to think their database is gone. > I had the feeling (ab)using postmaster.pid for this would fly less than > using the same scheme as pg_upgrade does, but I'm fine doing it either > way. I don't think pg_upgrade is a valid comparison, given that the goal there is to permanently disable the cluster. It also emits a hint about it. And only does so at the end of a run. Greetings, Andres Freund
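To make the suggestion above concrete, here is a minimal, self-contained sketch in plain C of how a tool like pg_checksums could hold a postmaster.pid-style lock while it runs. The real postmaster.pid has a richer multi-line format and the CreateLockFile() tweak Andres mentions is not shown, so this is an illustration of the idea, not a proposed patch:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Create postmaster.pid containing our own PID, so that a postgres
 * started by mistake finds an existing lock file owned by a live
 * process and refuses to start.  O_EXCL makes the creation race-free.
 * Assumes the current directory is the data directory.
 */
static void
create_lock_file(void)
{
	char	buf[64];
	int		fd = open("postmaster.pid", O_WRONLY | O_CREAT | O_EXCL, 0600);

	if (fd < 0)
	{
		fprintf(stderr, "could not create postmaster.pid (is the cluster running?)\n");
		exit(1);
	}
	snprintf(buf, sizeof(buf), "%d\n", (int) getpid());
	if (write(fd, buf, strlen(buf)) != (ssize_t) strlen(buf))
	{
		fprintf(stderr, "could not write postmaster.pid\n");
		exit(1);
	}
	close(fd);
}

int
main(void)
{
	create_lock_file();
	/* ... enable or disable checksums here ... */
	unlink("postmaster.pid");	/* release the lock when done */
	return 0;
}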
On Tue, Mar 19, 2019 at 09:47:17AM -0700, Andres Freund wrote: > I'm not sure it needs to be this patch's responsibility to come up with > a scheme here at all however. pg_rewind, pg_resetwal, pg_upgrade all > don't really have a lockout mechanism, and it hasn't caused a ton of > problems. I think it'd be good to invent something better, but it can't > be some half-assed approach that'll lead people to think their database > is gone. Amen. Take it as you wish, but that's actually what I was mentioning upthread one week ago where I argued that it is a problem, but not a problem of this patch and that this problem concerns other tools: https://www.postgresql.org/message-id/20190313093150.GE2988@paquier.xyz And then, my position got overruled by everybody else on this thread. So I am happy to see somebody chiming in and saying the same thing. Honestly, I think that what I sent last week, with a patch in its simplest form, would be enough: https://www.postgresql.org/message-id/20190313021621.GP13812@paquier.xyz In short, you keep the main feature with: - No tweaks with postmaster.pid. - Rely just on the control file indicating that the instance was shut down cleanly. - No tweaks with the system ID. - No renaming of the control file. -- Michael
Attachment
On Wed, Mar 20, 2019 at 08:09:07AM +0900, Michael Paquier wrote: > In short, you keep the main feature with: > - No tweaks with postmaster.pid. > - Rely just on the control file indicating an instance shutdown > cleanly. > - No tweaks with the system ID. > - No renaming of the control file. FWIW, the simplest version is just like the attached. -- Michael
Attachment
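For reference, the "simplest form" amounts to refusing to do anything unless pg_control reports a clean shutdown. A rough sketch of such a check using the frontend helpers from the tree; the exact get_controlfile() signature has varied across branches, so the call below is an assumption rather than an excerpt from the attached patch:

#include "postgres_fe.h"
#include "catalog/pg_control.h"
#include "common/controldata_utils.h"

static void
check_clean_shutdown(const char *DataDir, const char *progname)
{
	bool		crc_ok;
	ControlFileData *ControlFile;

	/* assumed three-argument signature, see note above */
	ControlFile = get_controlfile(DataDir, progname, &crc_ok);
	if (!crc_ok)
	{
		fprintf(stderr, "%s: pg_control CRC value is incorrect\n", progname);
		exit(1);
	}

	/* only operate on a cluster that was shut down cleanly */
	if (ControlFile->state != DB_SHUTDOWNED &&
		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
	{
		fprintf(stderr, "%s: cluster must be shut down cleanly\n", progname);
		exit(1);
	}
}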
Hallo Andres, > And you're basically adding it because Fabien doesn't like > postmaster.pid and wants to invent another lockout mechanism in this > thread. I did not suggest to rename the control file, but as it is already done by another command it did not look like a bad idea in itself, or at least an already used bad idea:-) I'd be okay with anything that works consistently across all commands that may touch a cluster and are mutually exclusive (postmaster, pg_rewind, pg_resetwal, pg_upgrade, pg_checksums…), without underlying race conditions. It could be locking, a control file state, a special file (which one? what is the procedure to create/remove it safely and avoid potential race conditions?), possibly "postmaster.pid", whatever really. I'll admit that I'm moderately enthusiastic about "postmaster.pid" because it no longer does what the file name says, but if it really works and is used consistently by all commands, why not. In case of unexpected problems, the file will probably have to be removed/fixed by hand. I also think that the implemented mechanism should be made available in "control_utils.c", not duplicated in every command. -- Fabien.
Hi, On 2019-03-20 07:55:39 +0100, Fabien COELHO wrote: > > And you're basically adding it because Fabien doesn't like > > postmaster.pid and wants to invent another lockout mechanism in this > > thread. > > I did not suggest to rename the control file, but as it is already done by > another command it did not look like a bad idea in itself, or at least an > already used bad idea:-) pg_upgrade in link mode intentionally wants to *permanently* disable a cluster. And it explicitly writes a log message about it. That's not a case to draw inference from for this case. > I'd be okay with anything that works consistently across all commands that > may touch a cluster and are mutually exclusive (postmaster, pg_rewind, > pg_resetwal, pg_upgrade, pg_checksums…), without underlying race conditions. > It could be locking, a control file state, a special file (which one? what > is the procedure to create/remove it safely and avoid potential race > conditions?), possibly "postmaster.pid", whatever really. > > I'll admit that I'm moderately enthusiastic about "postmaster.pid" because > it no longer does what the file name says, but if it really works and > is used consistently by all commands, why not. In case of unexpected > problems, the file will probably have to be removed/fixed by hand. I also > think that the implemented mechanism should be made available in > "control_utils.c", not duplicated in every command. That's just a separate feature. Greetings, Andres Freund
Hallo Andres, >> [...] > > pg_upgrade in link mode intentionally wants to *permanently* disable a > cluster. And it explicitly writes a log message about it. That's not a > case to draw inference from for this case. Ok. My light knowledge of pg_upgrade's inner workings does not extend to this level of precision. >> I'd be okay with anything that works consistently across all commands >> [...] >> >> I'll admit that I'm moderately enthusiastic about "postmaster.pid" because >> it no longer does what the file name says, but if it really works and >> is used consistently by all commands, why not. In case of unexpected >> problems, the file will probably have to be removed/fixed by hand. I also >> think that the implemented mechanism should be made available in >> "control_utils.c", not duplicated in every command. > > That's just a separate feature. Possibly, although I'm not sure what in the above is a "separate feature"; separate from the "pg_checksums --enable" implementation, I assume. Is it the fact that there could (should, IMO) be some mechanism to ensure that mutually exclusive direct cluster-modification commands are not run concurrently? As "pg_checksums -e" is a potentially long-running command, the likelihood of self-inflicted wounds is raised significantly: I could do absurd things on an enable-checksum-in-progress cluster on a previous version of the patch. Thus as a reviewer I'm suggesting to fix the issue. Or is it the fact that fixing things after some critical errors would possibly involve some manual intervention at some point? Or is it something else? -- Fabien.
Michaël-san, >> In short, you keep the main feature with: >> - No tweaks with postmaster.pid. >> - Rely just on the control file indicating an instance shutdown >> cleanly. >> - No tweaks with the system ID. >> - No renaming of the control file. Hmmm… so nothing:-) I think that this feature is useful, in complement to a possible online-enabling server-managed version. About patch v8 part 1: applies cleanly, compiles, global & local make check ok, doc build ok. I think that a clear warning not to run any cluster command in parallel, under pain of possible cluster corruption, and possibly other caveats about replication, should appear in the documentation. I also think that some mechanism should be used to prevent such occurence, whether in this patch or another. About "If enabling or disabling checksums, the exit status is nonzero if the operation failed." I must admit that enabling/disabling an already enabled/disabled cluster is still not really a failure for me, because the end status is that the cluster is in the state required by the user. I still think that the control file update should be made *after* the data directory has been synced, otherwise the control file could be updated while some data files are not. It just means exchanging two lines. About patch v8 part 2: applies cleanly, compiles, global & local make check ok. The added -N option is not documented. If the conditional sync is moved before the file update, the file update should pass do_sync to update_controlfile. -- Fabien.
On Wed, Mar 20, 2019 at 10:38:36AM +0100, Fabien COELHO wrote: > Hmmm… so nothing:-) The core of the feature is still here, fortunately. > I think that a clear warning not to run any cluster command in parallel, > under pain of possible cluster corruption, and possibly other caveats about > replication, should appear in the documentation. I still have the following extra documentation in my notes: + <refsect1> + <title>Notes</title> + <para> + When disabling or enabling checksums in a cluster of multiple instances, + it is recommended to stop all the instances of the cluster before doing + the switch to all the instances consistently. When using a cluster with + tools which perform direct copies of relation file blocks (for example + <xref linkend="app-pgrewind"/>), enabling or disabling checksums can + lead to page corruptions in the shape of incorrect checksums if the + operation is not done consistently across all nodes. Destroying all + the standbys in a cluster first, enabling or disabling checksums on + the primary and finally recreate the cluster nodes from scratch is + also safe. + </para> + </refsect1> This sounds kind of enough for me but.. > I also think that some mechanism should be used to prevent such occurence, > whether in this patch or another. I won't counter-argue on that. > I still think that the control file update should be made *after* the data > directory has been synced, otherwise the control file could be updated while > some data files are not. It just means exchanging two lines. The parent path of the control file needs also to be flushed to make the change durable, just doing things the same way pg_rewind keeps the code simple in my opinion. > If the conditional sync is moved before the file update, the file update > should pass do_sync to update_controlfile. For the durability of the operation, the current order is intentional. -- Michael
Attachment
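As an aside on "the parent path of the control file needs also to be flushed": making a freshly written or renamed file durable on POSIX systems means fsyncing the directory that contains it as well as the file itself, which is what the tree's fsync helpers (for example fsync_parent_path()) take care of. A small self-contained illustration in plain C, run from the data directory; it is not PostgreSQL code:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* fsync one path; works for regular files and for directories */
static void
fsync_path(const char *path)
{
	int		fd = open(path, O_RDONLY);

	if (fd < 0 || fsync(fd) < 0)
	{
		perror(path);
		exit(1);
	}
	close(fd);
}

int
main(void)
{
	/* after writing or renaming global/pg_control ... */
	fsync_path("global/pg_control");	/* flush the file itself */
	fsync_path("global");				/* ... and its parent directory */
	return 0;
}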
Michaël-san, >> I think that a clear warning not to run any cluster command in parallel, >> under pain of possible cluster corruption, and possibly other caveats about >> replication, should appear in the documentation. > > I still have the following extra documentation in my notes: Ok, it should have been included in the patch. > + <refsect1> > + <title>Notes</title> > + <para> > + When disabling or enabling checksums in a cluster of multiple instances, ISTM that a "postgres cluster" was like an Oracle instance: See https://www.postgresql.org/docs/devel/creating-cluster.html So the vocabulary used above seems misleading. I'm not sure how to name an Oracle cluster in postgres lingo, though. > + it is recommended to stop all the instances of the cluster before doing > + the switch to all the instances consistently. I think that the motivation/risks should appear before the solution. "As xyz ..., ...", or there at least the logical link should be outlined. It is not clear for me whether the following sentences, which seems specific to "pg_rewind", are linked to the previous advice, which seems rather to refer to streaming replication? > + When using a cluster with > + tools which perform direct copies of relation file blocks (for example > + <xref linkend="app-pgrewind"/>), enabling or disabling checksums can > + lead to page corruptions in the shape of incorrect checksums if the > + operation is not done consistently across all nodes. Destroying all > + the standbys in a cluster first, enabling or disabling checksums on > + the primary and finally recreate the cluster nodes from scratch is > + also safe. > + </para> > + </refsect1> Should not disabling in reverse order be safe? the checksum are not checked afterwards? > This sounds kind of enough for me but.. ISTM that the risks could be stated more clearly. >> I also think that some mechanism should be used to prevent such occurence, >> whether in this patch or another. > > I won't counter-argue on that. This answer is ambiguous. >> I still think that the control file update should be made *after* the data >> directory has been synced, otherwise the control file could be updated while >> some data files are not. It just means exchanging two lines. > > The parent path of the control file needs also to be flushed to make > the change durable, just doing things the same way pg_rewind keeps the > code simple in my opinion. I do not understand. The issue I'm refering to is, when enabling: - data files are updated in fs cache - control file is updated in fs cache * fsync is called - updated control file gets written - data files are being written... but reboot occurs while fsyncing is still in progress After the reboot, some data files are not fully updated with their checksums, although the controlfiles tells that they are. It should then fail after a restart when a no-checksum page is loaded? What am I missing? Also, I do not see how exchanging two lines makes the code less simple. >> If the conditional sync is moved before the file update, the file update >> should pass do_sync to update_controlfile. > > For the durability of the operation, the current order is intentional. See my point above: I think that this order can lead to cluster corruption. If not, please be kind enough to try to explain in more details why I'm wrong. -- Fabien.
On Wed, Mar 20, 2019 at 05:46:32PM +0100, Fabien COELHO wrote: > I think that the motivation/risks should appear before the solution. "As xyz > ..., ...", or there at least the logical link should be outlined. > > It is not clear for me whether the following sentences, which seems specific > to "pg_rewind", are linked to the previous advice, which seems rather to > refer to streaming replication? Do you have a better idea of formulation? If you have a failover which makes use of pg_rewind, or use some backup tool which does incremental copies of physical blocks like pg_rman, then you have a risk of messing up the instances in a cluster, which is the risk I am trying to outline here. It is actually fine to do the following if everything is WAL-based, as checksums are only computed once a shared buffer is flushed on a single instance. Imagine for example a primary-standby with checksums disabled: - Shut down the standby cleanly, enable checksums. - Plug the standby back into the cluster, let it replay up to the latest point. - Shut down the primary cleanly. - Promote the standby. - Enable checksums on the previous primary. - Add recovery parameters and reattach the previous primary to the cluster. - All nodes have checksums enabled, without rebuilding any of them. Per what I could see on this thread, folks tend to point out that we should *not* include such recommendations in the docs, as it is too easy to make a pilot error. > > + When using a cluster with > > + tools which perform direct copies of relation file blocks (for example > > + <xref linkend="app-pgrewind"/>), enabling or disabling checksums can > > + lead to page corruptions in the shape of incorrect checksums if the > > + operation is not done consistently across all nodes. Destroying all > > + the standbys in a cluster first, enabling or disabling checksums on > > + the primary and finally recreate the cluster nodes from scratch is > > + also safe. > > + </para> > > + </refsect1> > > Should not disabling in reverse order be safe? the checksum are not checked > afterwards? I don't quite understand your comment about the ordering. If all the standbys are destroyed first, then enabling/disabling checksums happens at a single place. > After the reboot, some data files are not fully updated with their > checksums, although the controlfiles tells that they are. It should then > fail after a restart when a no-checksum page is loaded? > > What am I missing? Please note that we do that in other tools as well and we live fine with it in pg_basebackup and pg_rewind, just to name two. I am not saying that it is not a problem in some cases, but I am saying that this is not a problem that this patch should solve. If we were to do something about that, it could make sense to make fsync_pgdata() smarter so that the control file is flushed last there, or define flush strategies there. -- Michael
Attachment
On Thu, Mar 21, 2019 at 07:59:24AM +0900, Michael Paquier wrote: > Please note that we do that in other tools as well and we live fine > with that as pg_basebackup, pg_rewind just to name two. I am not > saying that it is not a problem in some cases, but I am saying that > this is not a problem that this patch should solve. If we were to do > something about that, it could make sense to make fsync_pgdata() > smarter so as the control file is flushed last there, or define flush > strategies there. Anyway, as this stuff is very useful for upgrade scenarios a-la-pg_upgrade, for backup validation and as it does not produce false positives, I would really like to get something committed for v12 in its simplest form... Are there any recommendations that people would like to add to the documentation? -- Michael
Attachment
Bonjour Michaël, > On Wed, Mar 20, 2019 at 05:46:32PM +0100, Fabien COELHO wrote: >> I think that the motivation/risks should appear before the solution. "As xyz >> ..., ...", or there at least the logical link should be outlined. >> >> It is not clear for me whether the following sentences, which seems specific >> to "pg_rewind", are linked to the previous advice, which seems rather to >> refer to streaming replication? > > Do you have a better idea of formulation? I can try, but I must admit that I'm fuzzy about the actual issue. Is there a problem with streaming replication and inconsistent checksum settings, or not? You seem to suggest that the issue is more about how some commands or backup tools operate on a cluster. I'll reread the thread carefully and will make a proposal. > Imagine for example a primary-standby with checksums disabled: [...] Yep, that's cool. >> Should not disabling in reverse order be safe? the checksum are not checked >> afterwards? > > I don't quite understand your comment about the ordering. If all the > standbys are destroyed first, then enabling/disabling checksums happens > at a single place. Sure. I was suggesting that disabling on replicated clusters is possibly safer, but I do not know the details of replication & checksumming with enough precision to be that sure about it. >> After the reboot, some data files are not fully updated with their >> checksums, although the controlfiles tells that they are. It should then >> fail after a restart when a no-checksum page is loaded? >> >> What am I missing? > > Please note that we do that in other tools as well and we live fine > with that as pg_basebackup, pg_rewind just to name two. The fact that other commands are exposed to the same potential risk is not a very good argument not to fix it. > I am not saying that it is not a problem in some cases, but I am saying > that this is not a problem that this patch should solve. As solving the issue involves exchanging two lines and turning one boolean parameter to true, I do not see why it should not be done. Fixing the issue takes much less time than writing about it... And if other commands can be improved, fine with me. > If we were to do something about that, it could make sense to make > fsync_pgdata() smarter so that the control file is flushed last there, or > define flush strategies there. ISTM that this would not work: The control file update can only be done *after* the fsync to describe the cluster's actual status, otherwise it is just a question of luck whether the cluster is corrupt on a crash while fsyncing. The enforced order of operation, with a barrier in between, is the important thing here. -- Fabien.
> Anyway, as this stuff is very useful for upgrade scenarios > a-la-pg_upgrade, for backup validation and as it does not produce false > positives, I would really like to get something committed for v12 in its > simplest form... Fine with me, the detailed doc is not a showstopper and can be improved later on. > Are there any recommendations that people would like to add to the > documentation? For me, I just want at least a clear warning on potential risks. -- Fabien.
On Thu, Mar 21, 2019 at 08:17:32AM +0100, Fabien COELHO wrote: > I can try, but I must admit that I'm fuzzy about the actual issue. Is there > a problem with streaming replication and inconsistent checksum settings, or > not? > > You seem to suggest that the issue is more about how some commands or backup > tools operate on a cluster. Yes. That's what I am writing about. Imagine for example this case with pg_rewind: - primary has checksums enabled. - standby has checksums disabled. - a hard crash of the primary happens, there is a failover to the standby which gets promoted. - The primary's host is restarted, it is started and stopped once cleanly to have a clear point in its past timeline where WAL forked, thanks to the generation of at least the shutdown checkpoint generated by the clean stop. - pg_rewind is run, copying some pages from the promoted standby, which don't have checksums, to the primary with checksums enabled, and causing some pages to have an incorrect checksum. There is another tool I know of which is called pg_rman, which is a backup tool able to take incremental backups in the shape of a delta of relation blocks. Then imagine the following: - One instance of Postgres runs, has checksums disabled. - pg_rman takes a full backup of it. - Checksums are enabled on this instance. - An incremental backup from the previous full backup point is taken. If I recall correctly pg_rman takes a copy of the new control file as well, which tracks checksums as being enabled. - A crash happens, the data folder is dead. - Rollback to the previous backup is done, and we restore up to a point after the incremental backup. - And you finish with a cluster which has checksums enabled, but as the initial full backup had checksums disabled, not all the pages may be in a correct state. So I think that it makes sense to advise caution in the documentation, but being too careful in the tool also discards many possibilities (see the example of the clean failover where it is possible to enable checksums with no actual downtime). And this part has a lot of value. > ISTM that this would not work: The control file update can only be done > *after* the fsync to describe the cluster's actual status, otherwise it is > just a question of luck whether the cluster is corrupt on a crash while > fsyncing. The enforced order of operation, with a barrier in between, is the > important thing here. Done the switch for this case. For pg_rewind actually I think that this is an area where its logic could be improved a bit. So first the data folder is synced, and then the control file is updated. It took less time to change the code than to write this paragraph, including the code compilation and one run of the TAP tests, confirmed. I have added in the docs a warning about a host crash while doing the operation, with a recommendation to check the state of the checksums on the data folder should it happen, and the previous portion of the docs about clusters. Your suggestion sounds appropriate. I would be tempted to add a bigger warning in pg_rewind or pg_basebackup about that, but that's a different story for another time. Does that look fine to you? -- Michael
Attachment
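The ordering settled on here is the usual write-barrier pattern: make the bulk changes durable first, and only then update and flush the file that claims the work is done. A generic, self-contained illustration in plain C with made-up file names; it is not the pg_checksums code:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* write a buffer to a file and fsync it before returning */
static void
write_and_sync(const char *path, const char *data)
{
	int		fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (fd < 0 || write(fd, data, strlen(data)) < 0 || fsync(fd) < 0)
	{
		perror(path);
		exit(1);
	}
	close(fd);
}

int
main(void)
{
	/* 1. rewrite the payload (think: relation files with checksums) ... */
	write_and_sync("datafile", "payload with checksums\n");

	/*
	 * 2. ... and only once that is durable, flip the flag which claims the
	 * work is done (think: pg_control's data checksum version).  If we
	 * crash before this point, the flag still says "not done", which stays
	 * consistent with the on-disk state.
	 */
	write_and_sync("controlfile", "checksums enabled\n");
	return 0;
}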
Hi, Am Freitag, den 22.03.2019, 09:27 +0900 schrieb Michael Paquier: > I have added in the docs a warning about a host crash while doing the > operation, with a recommendation to check the state of the checksums > on the data folder should it happen, and the previous portion of the > docs about clusters. Your suggestion sounds adapted. I would be > tempted to add a bigger warning in pg_rewind or pg_basebackup about > that, but that's a different story for another time. > > Does that look fine to you? Don't we need a big warning that the cluster must not be started during operation of pg_checksums as well, now that we don't disallow it? Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
On Fri, Mar 22, 2019 at 09:13:43AM +0100, Michael Banck wrote: > Don't we need a big warning that the cluster must not be started during > operation of pg_checksums as well, now that we don't disallow it? The same applies to pg_rewind and pg_basebackup, so I would classify that as a pilot error. How would you formulate that in the docs if you add it? -- Michael
Attachment
Hi, Am Freitag, den 22.03.2019, 17:37 +0900 schrieb Michael Paquier: > On Fri, Mar 22, 2019 at 09:13:43AM +0100, Michael Banck wrote: > > Don't we need a big warning that the cluster must not be started during > > operation of pg_checksums as well, now that we don't disallow it? > > The same applies to pg_rewind and pg_basebackup, so I would classify > that as a pilot error. How would it apply to pg_basebackup? The cluster is running while the base backup is taken and I believe the control file is written at the end so you can't start another instance off the backup directory until the base backup has finished. It would apply to pg_rewind, but pg_rewind's runtime is not scaling with cluster size, does it? pg_checksums will run for hours on large clusters so the window of errors is much larger and I don't think you can easily compare the two. > How would you formulate that in the docs if you add it. (I would try to make sure you can't start the cluster but that seems off the table for now) How about this: + <refsect1> + <title>Notes</title> + <para> + When enabling checksums in a cluster, the operation can potentially take a + long time if the data directory is large. During this operation, the + cluster or other programs that write to the data directory must not be + started or else data-loss will occur. + </para> + + <para> + When disabling or enabling checksums in a cluster of multiple instances, [...] Also, the following is not very clear to me: + If the event of a crash of the operating system while enabling or s/If/In/ + disabling checksums, the data folder may have checksums in an inconsistent + state, in which case it is recommended to check the state of checksums + in the data folder. How is the user supposed to check the state of checksums? Do you mean that if the user intended to enable checksums and the box dies in between, they should check whether checksums are actually enabled and re-run if not? Because it could also mean running pg_checksums --check on the cluster, which wouldn't work in that case as the control file has not been updated yet. Maybe it could be formulated like "If pg_checksums is aborted or killed in its operation while enabling or disabling checksums, the cluster will have the same state with respect of checksums as before the operation and pg_checksums needs to be restarted."? Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
On Fri, Mar 22, 2019 at 10:04:02AM +0100, Michael Banck wrote: > How about this: > > + <refsect1> > + <title>Notes</title> > + <para> > + When enabling checksums in a cluster, the operation can potentially take a > + long time if the data directory is large. During this operation, the > + cluster or other programs that write to the data directory must not be > + started or else data-loss will occur. > + </para> Sounds fine to me. Will add an extra paragraph on that. > Maybe it could be formulated like "If pg_checksums is aborted or killed > in its operation while enabling or disabling checksums, the cluster > will have the same state with respect of checksums as before the > operation and pg_checksums needs to be restarted."? We could use that as well. With the current patch, and per the suggestions from Fabien, this holds true as the control file is updated and flushed last. -- Michael
Attachment
Bonjour Michaël, > Does that look fine to you? Mostly. Patch v9 part 1 applies cleanly, compiles, global and local check ok, doc build ok. On write(), the error message is not translatable whereas it is for all others. I agree that a BIG STRONG warning is needed about not starting the cluster, on pain of possible data corruption. I still think that preventing this is desirable, preferably before v12. Otherwise, my remaining non-showstopper (committer's opinion matters more) issues: Doc: A postgres cluster is like an Oracle instance. I'd use "replicated setup" instead of "cluster", and "cluster" instead of "instance". I'll try to improve the text, possibly over the weekend. Enabling/disabling an already enabled/disabled cluster should not be a failure for me, because the cluster is left in the prescribed state. Patch v9 part 2 applies cleanly, compiles, global and local check ok, doc build ok. Ok for me. -- Fabien.
> Done the switch for this case. For pg_rewind actually I think that this > is an area where its logic could be improved a bit. So first the data > folder is synced, and then the control file is updated. Attached is a quick patch about "pg_rewind", so that the control file is updated after everything else is committed to disk. -- Fabien.
Attachment
Re: Fabien COELHO 2019-03-22 <alpine.DEB.2.21.1903221514390.2198@lancre> > Attached is a quick patch about "pg_rewind", so that the control file is > updated after everything else is committed to disk. > update_controlfile(datadir_target, progname, &ControlFile_new, do_sync); > > - pg_log(PG_PROGRESS, "syncing target data directory\n"); > - syncTargetDirectory(); Doesn't the control file still need syncing? Christoph
Hello Christoph, >> - pg_log(PG_PROGRESS, "syncing target data directory\n"); >> - syncTargetDirectory(); > > Doesn't the control file still need syncing? Indeed it does, and it is done in update_controlfile if the last argument is true. Basically the latest version of update_controlfile always fsyncs the control file, unless explicitly told not to do so. The options to do that are really there only to speed up the regression tests. -- Fabien.
On Fri, Mar 22, 2019 at 02:59:31PM +0100, Fabien COELHO wrote: > On write(), the error message is not translatable whereas it is for all > others. Fixed. > I agree that a BIG STRONG warning is needed about not to start the cluster > under pain of possible data corruption. I still think that preventing this > is desirable, preferably before v12. For now the docs mention that in a paragraph as Michael Banck has suggested. Not sure that this deserves a warning portion. > Otherwise, my remaining non showstopper (committer's opinion matters more) > issues: > > Doc: A postgres cluster is like an Oracle instance. I'd use "replicated > setup" instead of "cluster", and "cluster" instead of "instance. I'll try to > improve the text, possibly over the week-end. Right. I have reworded that using your suggestions. And committed the main part. I'll look after the --no-sync part in a bit. -- Michael
Attachment
On Fri, Mar 22, 2019 at 07:02:36PM +0100, Fabien COELHO wrote: > Indeed it does, and it is done in update_controlfile if the last argument is > true. Basically update_controlfile latest version always fsync the control > file, unless explicitely told not to do so. The options to do that are > really there only to speed up non regression tests. For the control file, it would not really matter much, and the cost would be really coming from syncing the data directory, still for correctness it is better to have a full all-or-nothing switch. Small buildfarm machines also like the --no-sync flavors a lot. -- Michael
Attachment
On Sat, Mar 23, 2019 at 08:16:07AM +0900, Michael Paquier wrote: > And committed the main part. I'll look after the --no-sync part in a > bit. --no-sync is committed as well now. -- Michael
Attachment
On Fri, Mar 22, 2019 at 03:18:26PM +0100, Fabien COELHO wrote: > Attached is a quick patch about "pg_rewind", so that the control file is > updated after everything else is committed to disk. Could you start a new thread about that please? This one has already been used for too many things. -- Michael
Attachment
Bonjour Michaël, Here is an attempt at improving the Notes. Mostly it is a reordering from more important (cluster corruption) to less important (if interrupted a restart is needed), some reordering from problem to solutions instead of solution/problem/solution, some sentence simplification. -- Fabien.
Attachment
On Sat, Mar 23, 2019 at 02:14:02PM +0100, Fabien COELHO wrote: > Here is an attempt at improving the Notes. > > Mostly it is a reordering from more important (cluster corruption) to less > important (if interrupted a restart is needed), some reordering from problem > to solutions instead of solution/problem/solution, some sentence > simplification. So, the ordering of the notes for each paragraph is as follows: 1) Replication issues when mixing different checksum setups across nodes. 2) Consistency of the operations if killed. 3) Don't start Postgres while the operation runs. Your proposal is to switch the order of the paragraphs to 3), 1) and then 2). Do others have any opinion? I am fine with the current order of things, still it may make sense to tweak the docs. In the paragraph related to replication, the second statement is switched to be first so that the docs warn first, and then give recommendations. This part makes sense. I am not sure that "checksum status" is a correct term. It seems to me that "same configuration for data checksums as before the tool ran" or something like that would be more correct. -- Michael
Attachment
Bonjour Michaël, >> Here is an attempt at improving the Notes. [...] > > So, the ordering of the notes for each paragraph is as follows: 1) > Replication issues when mixing different checksum setups across nodes. > 2) Consistency of the operations if killed. 3) Don't start Postgres > while the operation runs. > > Your proposal is to switch the order of the paragraphs to 3), 1) and > then 2). Yes. I suggest to emphasize cluster corruption risks by putting them first. > Do others have any opinion? I am fine with the current > order of things, still it may make sense to tweaks the docs. > > In the paragraph related to replication, the second statement is > switched to be first so as the docs warn first, and then give > recommendations. Yep. > This part makes sense. Yep! > I am not sure that "checksum status" is a correct term. It seems to > me that "same configuration for data checksums as before the tool ran" > or something like that would be more correct. Possibly, I cannot say. -- Fabien.
On Tue, Mar 26, 2019 at 01:41:38PM +0100, Fabien COELHO wrote: >> I am not sure that "checksum status" is a correct term. It seems to >> me that "same configuration for data checksums as before the tool ran" >> or something like that would be more correct. > > Possibly, I cannot say. I have put more thought into this part, and committed the reorganization mostly as you suggested. -- Michael