Thread: WIP checksums patch
This is just a rebased version of the patch by Simon here: http://archives.postgresql.org/message-id/CA +U5nMKw_GBs6qQ_Y8-RjGL1V7MVW2HWBHartB8LoJhnPfxL8g@mail.gmail.com

There are other things I'd like to do, like:
* include page# in checksum, and perhaps relfilenode or object OID (but those two might cause difficulties)
* use TLI field instead of page version, if that is the consensus (though it seems some may still disagree with that)
* we might want to make it slightly easier for external utilities, like for backup/replication, to verify the pages
* optionally protect temporary files and local buffers
* come up with a robust way to determine that all pages in the database are protected
* protect uninitialized pages

But not all of these will make the first patch. This is just a starting point to reignite the discussion. I welcome any feedback. Regards, Jeff Davis
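[Editorial note: the first item on the list, folding the block number into the checksum, can be illustrated with a toy sketch. This is Python purely for illustration; the patch itself is C and uses a different algorithm, so the CRC seeding and sizes here are stand-ins. The point it shows: a page that is intact but written to the wrong location fails verification.]

```python
import zlib

BLKSZ = 8192

def page_checksum(page: bytes, blkno: int) -> int:
    # Seed the CRC with the block number so identical page contents
    # yield different checksums at different locations.
    seed = zlib.crc32(blkno.to_bytes(4, "little"))
    return zlib.crc32(page, seed)

page = b"\x00" * BLKSZ
stored = page_checksum(page, blkno=7)   # stamped when written to block 7

assert page_checksum(page, blkno=7) == stored  # read back from block 7: OK
assert page_checksum(page, blkno=8) != stored  # page transposed to block 8: caught
```

Including relfilenode or object OID would extend the seed the same way, which is where the difficulties Jeff mentions come in (those values can change under the page).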
On Fri, 2012-09-14 at 17:58 -0700, Jeff Davis wrote: > This is just a rebased version of the patch by Simon here: > > http://archives.postgresql.org/message-id/CA > +U5nMKw_GBs6qQ_Y8-RjGL1V7MVW2HWBHartB8LoJhnPfxL8g@mail.gmail.com Now with attachment. Regards, Jeff Davis
On Fri, 2012-09-14 at 17:58 -0700, Jeff Davis wrote: > This is just a rebased version of the patch by Simon here: I just noticed the following note in the docs for this patch: The default is <literal>off</> for backwards compatibility and to allow upgrade. The recommended setting is <literal>on</> though this should not be enabled until upgrade is successfully complete with full set of new backups. I don't understand what that means -- if they have the page_checksums GUC available, then surely upgrade is complete, right? And what is the backwards-compatibility issue? Also, it looks out of date, because the default in guc.c is set to true. I think we should probably default to true, because it's safer and it can always be disabled at runtime, but I don't have a strong opinion about that. Regards, Jeff Davis
On Fri, 2012-09-14 at 17:58 -0700, Jeff Davis wrote: > * we might want to make it slightly easier for external utilities, like > for backup/replication, to verify the pages Ideally, PageVerificationInfoOK should be available to external utilities, so that someone might script a background job to verify pages. Or, perhaps we should just include a new utility, pg_verify_checksums? Either way, where is a good place to put the function so that it's shared between the server and the utility? In src/port somewhere? Regards,Jeff Davis
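[Editorial note: the scanning half of such a utility is the easy part; the open question above is where the shared verification code should live. A shape-only sketch in Python — `verify_page` is a stand-in for PageVerificationInfoOK, which today is backend-only and would need to move somewhere linkable by frontend code (src/port was the suggestion):]

```python
import io

BLKSZ = 8192

def scan_relation_file(f, verify_page):
    """Read a relation file block by block; return the block numbers
    that fail verification. verify_page(page_bytes, blkno) -> bool is
    the checksum check (stand-in for PageVerificationInfoOK)."""
    bad = []
    blkno = 0
    while True:
        page = f.read(BLKSZ)
        if not page:
            break
        if len(page) != BLKSZ or not verify_page(page, blkno):
            bad.append(blkno)
        blkno += 1
    return bad

# Tiny demo with an in-memory "file" and a fake verifier that
# rejects any block whose first byte is nonzero.
data = bytes(BLKSZ) + b"\x01" + bytes(BLKSZ - 1)
bad = scan_relation_file(io.BytesIO(data), lambda p, n: p[0] == 0)
assert bad == [1]
```

A background verification job would just be this loop over every file in the data directory, which is why having the verification function outside the backend matters.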
On Sun, Sep 30, 2012 at 07:09:20PM -0700, Jeff Davis wrote: > On Fri, 2012-09-14 at 17:58 -0700, Jeff Davis wrote: > > This is just a rebased version of the patch by Simon here: > > I just noticed the following note in the docs for this patch: > > The default is <literal>off</> for backwards compatibility and > to allow upgrade. The recommended setting is <literal>on</> though > this should not be enabled until upgrade is successfully complete > with full set of new backups. > > I don't understand what that means -- if they have the page_checksums > GUC available, then surely upgrade is complete, right? And what is the > backwards-compatibility issue? > > Also, it looks out of date, because the default in guc.c is set to true. > I think we should probably default to true, because it's safer and it > can always be disabled at runtime, but I don't have a strong opinion > about that. I think this needs to clearly state "pg_upgrade", not a dump/restore upgrade, which would be fine. It would be interesting to have pg_upgrade change this setting, or tell the user to change it. I am not sure enough people are using pg_upgrade to change a default value. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Mon, 2012-10-01 at 10:43 -0400, Bruce Momjian wrote: > > The default is <literal>off</> for backwards compatibility and > > to allow upgrade. The recommended setting is <literal>on</> though > > this should not be enabled until upgrade is successfully complete > > with full set of new backups. > > > > I don't understand what that means -- if they have the page_checksums > > GUC available, then surely upgrade is complete, right? And what is the > > backwards-compatibility issue? > I think this need to clearly state "pg_upgrade", not a dump/restore > upgrade, which would be fine. It would be interesting to have > pg_upgrade change this setting, or tell the user to change it. I am not > sure enough people are using pg_upgrade to change a default value. I still don't understand why pg_upgrade and page_checksums don't mix. Regards, Jeff Davis
On Mon, Oct 1, 2012 at 09:25:43AM -0700, Jeff Davis wrote: > On Mon, 2012-10-01 at 10:43 -0400, Bruce Momjian wrote: > > > The default is <literal>off</> for backwards compatibility and > > > to allow upgrade. The recommended setting is <literal>on</> though > > > this should not be enabled until upgrade is successfully complete > > > with full set of new backups. > > > > > > I don't understand what that means -- if they have the page_checksums > > > GUC available, then surely upgrade is complete, right? And what is the > > > backwards-compatibility issue? > > > I think this need to clearly state "pg_upgrade", not a dump/restore > > upgrade, which would be fine. It would be interesting to have > > pg_upgrade change this setting, or tell the user to change it. I am not > > sure enough people are using pg_upgrade to change a default value. > > I still don't understand why pg_upgrade and page_checksums don't mix. The heap/index files are copied unmodified from the old cluster, so there are no checksums on the pages. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Mon, 2012-10-01 at 12:35 -0400, Bruce Momjian wrote: > The heap/index files are copied unmodified from the old cluster, so > there are no checksums on the pages. That's fine though, the patch still reads the old format the same way, and the pages are treated as though they have no checksum. How is that a reason for defaulting the GUC to off? I'm missing something here. Are we worried about users who turn the GUC on and then expect all of their old data pages to magically be protected? Regards, Jeff Davis
On Fri, 2012-09-14 at 17:58 -0700, Jeff Davis wrote: > This is just a rebased version of the patch by Simon here: > > http://archives.postgresql.org/message-id/CA > +U5nMKw_GBs6qQ_Y8-RjGL1V7MVW2HWBHartB8LoJhnPfxL8g@mail.gmail.com Another thing I noticed about the design of this patch: It looks like the checksum will usually be wrong in a backup block in WAL, because it writes the backup block before calculating the checksum (which isn't done until the time the block is written out). I think that's OK, because it's still protected by the WAL CRC, and there's no expectation that the checksum is correct in shared buffers, and the correct checksum should be set on the next checkpoint. Just an observation. Regards, Jeff Davis
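[Editorial note: a toy model of the write path described above. Python for illustration only; the additive checksum and the 2-byte slot at the front of the page are invented for the sketch. It shows why a stale checksum in a WAL full-page image is harmless: the checksum is stamped only at write-out, and only the on-disk copy is ever verified.]

```python
BLKSZ = 8192

def checksum(data) -> int:
    # Trivial additive checksum, a stand-in for the patch's real algorithm.
    return sum(data) & 0xFFFF

def stamp(page: bytearray) -> bytes:
    # Done only at write-out time, as in the patch: checksum everything
    # except the checksum's own (invented) 2-byte slot, then store it.
    page[0:2] = checksum(page[2:]).to_bytes(2, "little")
    return bytes(page)

def verify(page: bytes) -> bool:
    return int.from_bytes(page[0:2], "little") == checksum(page[2:])

buf = bytearray(BLKSZ)
buf[100] = 1                   # modify the buffer; slot still holds the old value
wal_copy = bytes(buf)          # backup block logged before write-out
disk_copy = stamp(buf)         # checksum stamped at write-out

assert verify(disk_copy)       # the on-disk page verifies
assert not verify(wal_copy)    # the WAL image's checksum is stale
```

The stale `wal_copy` is still protected by the WAL record's own CRC, and recovery rewrites (and re-stamps) the page before it ever reaches disk again, which is the reasoning in the mail above.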
On Mon, Oct 1, 2012 at 10:04:09AM -0700, Jeff Davis wrote: > On Mon, 2012-10-01 at 12:35 -0400, Bruce Momjian wrote: > > The heap/index files are copied unmodified from the old cluster, so > > there are no checksums on the pages. > > That's fine though, the patch still reads the old format the same way, > and the pages are treated as though they have no checksum. How is that a > reason for defaulting the GUC to off? No idea then. > I'm missing something here. Are we worried about users who turn the GUC > on and then expect all of their old data pages to magically be > protected? Perhaps we don't allow this to be turned per page, but rather per cluster, and per-cluster would require the entire cluster to be rewritten. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 1 October 2012 18:04, Jeff Davis <pgsql@j-davis.com> wrote: > On Mon, 2012-10-01 at 12:35 -0400, Bruce Momjian wrote: >> The heap/index files are copied unmodified from the old cluster, so >> there are no checksums on the pages. > > That's fine though, the patch still reads the old format the same way, > and the pages are treated as though they have no checksum. > How is that a > reason for defaulting the GUC to off? It's not. > Are we worried about users who turn the GUC > on and then expect all of their old data pages to magically be > protected? Yes, but as you say, that is a separate consideration. You are missing large parts of the previous thread, giving various opinions on what the UI should look like for enabling checksums. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
> I think that's OK, because it's still protected by the WAL CRC, and > there's no expectation that the checksum is correct in shared buffers, > and the correct checksum should be set on the next checkpoint. Just an > observation. We'd need to document that emphatically. Otherwise folks running on ZFS and/or FusionIO with atomic writes (and, in the future, BTRFS) will assume that they can turn "full_page_writes" off and checksums on, and clearly that won't work with the current code. I think that's an acceptable limitation, I just think we need to document it carefully, and maybe throw a warning if people start up in that configuration. > Perhaps we don't allow this to be turned per page, but rather per > cluster, and per-cluster would require the entire cluster to be > rewritten. We discussed this last year, and options which require a total rewrite of the database in order to turn on the option were rejected as impractical for users. People did say it was desirable to have a manual option which says "rewrite this entire table with checksums". However, that's not required for the initial patch. For that matter, it will become desirable for the user to turn on checksums only for specific tables (again, a future feature). -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Mon, 2012-10-01 at 18:14 +0100, Simon Riggs wrote: > You are missing large parts of the previous thread, giving various > opinions on what the UI should look like for enabling checksums. I read through all of the discussion that I could find. There was quite a lot, so perhaps I have forgotten pieces of it. But that one section in the docs does look out of date and/or confusing to me. I remember there was discussion about a way to ensure that checksums are set cluster-wide with some kind of special command (perhaps related to VACUUM) and a magic file to let recovery know whether checksums are set everywhere or not. That doesn't necessarily conflict with the GUC though (the GUC could be a way to write checksums lazily, while this new command could be a way to write checksums eagerly). If some consensus was reached on the exact user interface, can you please send me a link? Regards, Jeff Davis
On Monday, October 01, 2012 11:11 PM Jeff Davis wrote: > On Mon, 2012-10-01 at 18:14 +0100, Simon Riggs wrote: > > You are missing large parts of the previous thread, giving various > > opinions on what the UI should look like for enabling checksums. > > I read through all of the discussion that I could find. There was quite > a lot, so perhaps I have forgotten pieces of it. > > But that one section in the docs does look out of date and/or confusing > to me. > > I remember there was discussion about a way to ensure that checksums are > set cluster-wide with some kind of special command (perhaps related to > VACUUM) and a magic file to let recovery know whether checksums are set > everywhere or not. That doesn't necessarily conflict with the GUC though > (the GUC could be a way to write checksums lazily, while this new > command could be a way to write checksums eagerly). > > If some consensus was reached on the exact user interface, can you > please send me a link? AFAICT complete consensus has not been reached, but one of the discussions can be found at the link below: http://archives.postgresql.org/pgsql-hackers/2012-03/msg00279.php Here Robert has given suggestions, and there is further discussion based on those points. According to me, the main points where more work is required for this patch, as per previous discussions, are as follows:

1. The performance impact of WAL-logging hint bits needs to be verified for scenarios other than pg_bench, such as bulk data load (which I feel there is some way to optimize, but I don't know if that's part of this patch).
2. How to put the information in the page header. I think the general direction is to use pd_tli. Whether the fact that a page has a checksum should be stored per page, or maintained at the table or database level, is not decided.
3. The user interface for checksum options.
4. There is still no consensus about locking the buffer.
5. Anything else which I have missed?
Apart from the above, one concern raised by many members is that there should be page format upgrade infrastructure first, and only then should we add checksums (http://archives.postgresql.org/pgsql-hackers/2012-02/msg01517.php). The major point for upgrade is that it should be an online upgrade, and the problem for which I think there is no clear solution yet is how to ensure that a database will never have more than two page formats. If the general consensus is that we should do it without having upgrade, then I think we can pursue discussion about the main points listed above. With Regards, Amit Kapila.
On 10/1/12 12:22 PM, Josh Berkus wrote: >> >Perhaps we don't allow this to be turned per page, but rather per >> >cluster, and per-cluster would require the entire cluster to be >> >rewritten. > We dicussed this last year, and options which require a total rewrite of > the database in order to turn on the option were rejected as impractical > for users. For whatever it's worth... we (and presumably others) still use londiste (or Slony) as our upgrade path, so we could tolerate a cluster-wide setting. We'd just set it when building new clusters via londiste and forget about it. So I'd rather see this get in at a cluster level than not make it at all while we wait for something better. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
> On 10/1/12 12:22 PM, Josh Berkus wrote: >>> >Perhaps we don't allow this to be turned per page, but rather per >>> >cluster, and per-cluster would require the entire cluster to be >>> >rewritten. >> We dicussed this last year, and options which require a total rewrite of >> the database in order to turn on the option were rejected as impractical >> for users. > > For whatever it's worth... we (and presumably others) still use londiste > (or Slony) as our upgrade path, so we could tolerate a cluster-wide > setting. We'd just set it when building new clusters via londiste and > forget about it. > > So I'd rather see this get in at a cluster level than not make it at all > while we wait for something better. Some still do dump/restore between major releases, so it is also applicable (and preferable) there. I do know a lot of things had been discussed last year; an API I could easily imagine would work:

ALTER TABLE <tablename> SET ENABLE_CHECKSUMMING=YES (would make the server checksum all writes to the table and its indexes going forward; verification of checksums would still happen, but only on already-checksummed pages).

ALTER TABLE <tablename> SET VERIFY_CHECKSUM=YES (would cause the server to scan the table and indexes, and rewrite the parts that hadn't been updated with a checksum already).

Perhaps the names are not well-chosen, but it is based on the assumption that checksumming overhead is measurable and it is thereby desirable to be able to dis/enable it on a per-table level. Then I could let my system go 6-12 months between the above two commands and the amount of rewriting would hopefully not require a rewrite of the entire 2.5TB of PGDATA. Jesper -- Jesper Krogh
On Mon, Oct 29, 2012 at 4:31 PM, Jim Nasby <jim@nasby.net> wrote: > For whatever it's worth... we (and presumably others) still use londiste (or > Slony) as our upgrade path, so we could tolerate a cluster-wide setting. > We'd just set it when building new clusters via londiste and forget about > it. > > So I'd rather see this get in at a cluster level than not make it at all > while we wait for something better. Yeah. I definitely think that we could shed an enormous amount of complexity by deciding that this is, for now, an option that can only be selected at initdb time. That would remove approximately 85% of everything I've ever disliked about this patch - without, I think, precluding the possibility of improving things later. It also occurred to me that another way to reduce the scope of this change would be to have a first version that does CRCs only for SLRU pages. That would be useful for verifying the integrity of some of our most critical data (pg_clog) and be a useful building block toward a more complete implementation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Nov 5, 2012 at 12:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Yeah. I definitely think that we could shed an enormous amount of
> complexity by deciding that this is, for now, an option that can only
> be selected at initdb time. That would remove approximately 85% of
> everything I've ever disliked about this patch - without, I think,
> precluding the possibility of improving things later.
I see one thing to be concerned about, there...
I imagine it would not be a totally happy thing if the only way to switch it on/off was to use Slony or Londiste to replicate into a database with the opposite setting. (e.g. - This implies that built-in replication may only replicate into a database with the identical checksum configuration.)
It's not outrageous for it to be a pretty heavyweight operation to switch polarities, but there's such a thing as too heavy.
--
When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
On Thu, Nov 8, 2012 at 9:17 PM, Christopher Browne <cbbrowne@gmail.com> wrote: > I see one thing to be concerned about, there... > > I imagine it would not be a totally happy thing if the only way to switch it > on/off was to use Slony or Londiste to replicate into a database with the > opposite setting. (e.g. - This implies that built-in replication may only > replicate into a database with the identical checksum configuration.) Sure, I agree. I don't think it should stay that way forever, but removing the burden of dealing with this issue from the initial commit would likely allow that commit to happen this release cycle, perhaps even in the next CommitFest. And then we'd have half a loaf, which is better than none, and we could deal with the issues of switching it on and off as a further enhancement. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, 2012-11-05 at 12:19 -0500, Robert Haas wrote: > Yeah. I definitely think that we could shed an enormous amount of > complexity by deciding that this is, for now, an option that can only > be selected at initdb time. That would remove approximately 85% of > everything I've ever disliked about this patch - without, I think, > precluding the possibility of improving things later. That's certainly true, but it introduces one large problem: upgrading would not work, which (in the past few releases) we've treated as a major showstopper for many features. If there is really no other good way to do it, then that might be reasonable. But it seems within grasp to at least offer an offline way to set checksums. > It also occurred to me that another way to reduce the scope of this > change would be to have a first version that does CRCs only for SLRU > pages. That would be useful for verifying the integrity of some of > our most critical data (pg_clog) and be a useful building block toward > a more complete implementation. That also breaks upgrade, right? Regards, Jeff Davis
On Fri, 2012-11-09 at 10:18 -0500, Robert Haas wrote: > Sure, I agree. I don't think it should stay that way forever, but > removing the burden of dealing with this issue from the initial commit > would likely allow that commit to happen this release cycle, perhaps > even in the next CommitFest. And then we'd have half a loaf, which is > better than none, and we could deal with the issues of switching it on > and off as a further enhancement. Just after sending the last email, I realized that it can be separated into separate commits fairly naturally, I think. So, I agree with you that we should focus on an initdb setting for the next commitfest and try for at least an offline migration tool (if not online) later. Regards, Jeff Davis
On Mon, 2012-10-01 at 10:22 -0700, Josh Berkus wrote: > > I think that's OK, because it's still protected by the WAL CRC, and > > there's no expectation that the checksum is correct in shared buffers, > > and the correct checksum should be set on the next checkpoint. Just an > > observation. > > We'd need to document that emphatically. Otherwise folks running on ZFS > and/or FusionIO with atomic writes (and, in the future, BTRFS) will > assume that they can turn "full_page_writes" off and checksums on, and > clearly that won't work with the current code. I think that's an > acceptable limitation, I just think we need to document it carefully, > and maybe throw a warning if people start up in that configuration. What situation are you concerned about here? I think that COW filesystems should still be safe with full_page_writes off, right? The checksum is calculated before every write, and the COW filesystems do atomic writes, so the checksums should always be fine. What am I missing? Regards, Jeff Davis
On 11/9/12 6:14 PM, Jeff Davis wrote: > On Mon, 2012-11-05 at 12:19 -0500, Robert Haas wrote: >> Yeah. I definitely think that we could shed an enormous amount of >> complexity by deciding that this is, for now, an option that can only >> be selected at initdb time. That would remove approximately 85% of >> everything I've ever disliked about this patch - without, I think, >> precluding the possibility of improving things later. > > That's certainly true, but it introduces one large problem: upgrading > would not work, which (in the past few releases) we've treated as a > major showstopper for many features. If you have pages that were written with one datetime storage format, and you create a cluster using the other one, that will fail. If checksums are done as an initdb time option, I see "checksummed" as being a block change on that level, and the precedent for not supporting it defensible. pg_upgrade will need to check for a mismatch--new cluster has checksums turned on, old one doesn't--and abort out if that happens. That can be lumped in with the other pg_controldata tests easily enough. What I think this argues for, though, is being precise about naming what goes into pg_controldata. Let's say the initial commit target is an initdb time switch, but later finer grained ones are expected to be available. I think the output and source code labels on the initdb controldata part should refer to this as something like "checksums available" then. The word "enabled" could be misleading when there's finer grained enabling coming later. We don't want people to run pg_controldata, see "checksums: enabled/on", and later discover they're not fully operational in that cluster yet. Saying "checksums: available" suggests there's somewhere else that should be checked, to tell if they're available on a given database/table or not. 
The sort of text I'm thinking of for the manual and/or warning is: "You can't use pg_upgrade to migrate from a cluster where checksums are not available to one where they are. This limitation may be removed by a future version." -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
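[Editorial note: the pg_upgrade check Greg describes amounts to one more field comparison in the existing pg_controldata tests. A sketch in Python for brevity — pg_upgrade itself is C, and the "data checksums" field label here is invented; whatever label ends up in the pg_controldata output is what would actually be compared:]

```python
def check_checksum_match(old_controldata: dict, new_controldata: dict) -> None:
    # Modeled on pg_upgrade's existing control-data comparisons: abort
    # when the two clusters disagree on an initdb-time setting.
    old = old_controldata.get("data checksums", "off")
    new = new_controldata.get("data checksums", "off")
    if old != new:
        raise SystemExit(
            "old and new clusters differ on data checksums "
            "(old: %s, new: %s); cannot upgrade in place" % (old, new)
        )

# Matching settings pass silently; a mismatch aborts the upgrade.
check_checksum_match({"data checksums": "on"}, {"data checksums": "on"})
try:
    check_checksum_match({"data checksums": "off"}, {"data checksums": "on"})
    raise AssertionError("expected the mismatch to abort")
except SystemExit:
    pass
```

This mirrors the warning text above: the mismatch is detected and reported before any files are transferred, rather than discovered later as verification failures.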
On Mon, Nov 12, 2012 at 12:39:17AM -0500, Greg Smith wrote: > On 11/9/12 6:14 PM, Jeff Davis wrote: > >On Mon, 2012-11-05 at 12:19 -0500, Robert Haas wrote: > >>Yeah. I definitely think that we could shed an enormous amount of > >>complexity by deciding that this is, for now, an option that can only > >>be selected at initdb time. That would remove approximately 85% of > >>everything I've ever disliked about this patch - without, I think, > >>precluding the possibility of improving things later. > > > >That's certainly true, but it introduces one large problem: upgrading > >would not work, which (in the past few releases) we've treated as a > >major showstopper for many features. > > If you have pages that were written with one datetime storage > format, and you create a cluster using the other one, that will > fail. If checksums are done as an initdb time option, I see > "checksummed" as being a block change on that level, and the > precedent for not supporting it defensible. pg_upgrade will need to > check for a mismatch--new cluster has checksums turned on, old one > doesn't--and abort out if that happens. That can be lumped in with > the other pg_controldata tests easily enough. Yes, pg_upgrade can do that easily. > What I think this argues for, though, is being precise about naming > what goes into pg_controldata. Let's say the initial commit target > is an initdb time switch, but later finer grained ones are expected > to be available. I think the output and source code labels on the > initdb controldata part should refer to this as something like > "checksums available" then. The word "enabled" could be misleading > when there's finer grained enabling coming later. We don't want > people to run pg_controldata, see "checksums: enabled/on", and > later discover they're not fully operational in that cluster yet. 
> Saying "checksums: available" suggests there's somewhere else that > should be checked, to tell if they're available on a given > database/table or not. > > The sort of text I'm thinking of for the manual and/or warning is: > > "You can't use pg_upgrade to migrate from a cluster where checksums > are not available to one where they are. This limitation may be > removed by a future version." "available" sounds like it is compiled in, but in this case, it means it is active. I think we are just going to need to rename it as we make it finer grained. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Tue, Oct 9, 2012 at 1:51 AM, Amit Kapila <amit.kapila@huawei.com> wrote: > On Monday, October 01, 2012 11:11 PM Jeff Davis wrote: >> On Mon, 2012-10-01 at 18:14 +0100, Simon Riggs wrote: >> > You are missing large parts of the previous thread, giving various >> > opinions on what the UI should look like for enabling checksums. >> >> I read through all of the discussion that I could find. There was quite >> a lot, so perhaps I have forgotten pieces of it. >> >> But that one section in the docs does look out of date and/or confusing >> to me. >> >> I remember there was discussion about a way to ensure that checksums are >> set cluster-wide with some kind of special command (perhaps related to >> VACUUM) and a magic file to let recovery know whether checksums are set >> everywhere or not. That doesn't necessarily conflict with the GUC though >> (the GUC could be a way to write checksums lazily, while this new >> command could be a way to write checksums eagerly). >> >> If some consensus was reached on the exact user interface, can you >> please send me a link? > > AFAICT complete consensus has not been reached but one of the discussions can be found on below link: > http://archives.postgresql.org/pgsql-hackers/2012-03/msg00279.php > Here Robert has given suggestions and then further there is more discussion based on that points. > > According to me, the main points where more work for this patch is required as per previous discussions is as follows: > > 1. Performance impact of WAL log for hint-bits needs to be verified for scenario's other than pg_bench (Such as bulk dataload (which I > feel there is some way to optimize, but I don't know if that’s part of this patch)). Atri has a patch posted which (if it passes muster) would eliminate the i/o impact of WAL logging hint bits following a bulk load or any scenario where many pages worth of tuples were sequentially written out with the same XID. Atri's patch was written with the checksum patch in mind. 
http://archives.postgresql.org/message-id/CAOeZVid7zd9V-9WsEtf9wRCd0H4yw_1OgCLm7%3DCdvGPCxZcJJw%40mail.gmail.com merlin