Thread: backup manifests
In the lengthy thread on block-level incremental backup,[1] both Vignesh C[2] and Stephen Frost[3] have suggested storing a manifest as part of each backup, something that could be useful not only for incremental backups but also for full backups. I initially didn't think this was necessary,[4] but my design turned out to be broken: I proposed to detect new blocks using only the LSN, which ignores the fact that CREATE DATABASE and ALTER TABLE .. SET TABLESPACE do physical copies without bumping page LSNs. I knew that, but somehow forgot about it; fortunately, some of my colleagues caught the mistake in testing.[5] Because of this problem, for an LSN-based approach to work, we'll need to send not only an LSN but also a list of files (and file sizes) that exist in the previous full backup, so some kind of backup manifest now seems like a good idea to me.[6] That whole approach might still be dead on arrival if it's possible to add new blocks with old LSNs to existing files,[7] but there seems to be room to hope that there are no such cases.[8]

So, let's suppose we invent a backup manifest. What should it contain? I imagine that it would consist of a list of files, the lengths of those files, and a checksum for each file. I think you should have a choice of what kind of checksum to use, because algorithms that used to seem like good choices (e.g. MD5) no longer do, and this trend can probably be expected to continue. Even if we initially support only one kind of checksum -- presumably SHA-something, since we already have code for that for SCRAM -- I think that it would also be a good idea to allow for future changes. And maybe it's best to just allow a choice of SHA-224, SHA-256, SHA-384, and SHA-512 right out of the gate, so that we can avoid bikeshedding over which one is secure enough. I guess we'll still have to argue about the default.
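To make that concrete, here is a rough Python sketch of how a per-file manifest entry with a selectable checksum algorithm might be computed. The function and field names here are hypothetical illustrations, not a proposed on-disk format:

```python
import hashlib

# Hypothetical sketch only: compute one manifest entry (path, length,
# checksum) with a caller-chosen algorithm, so the manifest format is
# not wedded to any particular hash.
SUPPORTED_ALGORITHMS = {"sha224", "sha256", "sha384", "sha512"}

def manifest_entry(path, algorithm="sha256"):
    if algorithm not in SUPPORTED_ALGORITHMS:
        raise ValueError("unsupported checksum algorithm: %s" % algorithm)
    h = hashlib.new(algorithm)
    size = 0
    with open(path, "rb") as f:
        # Read in chunks so large relation files need not fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            size += len(chunk)
            h.update(chunk)
    # One manifest line: file name, length, algorithm, checksum.
    return {"path": path, "size": size,
            "algorithm": algorithm, "checksum": h.hexdigest()}
```

Recording the algorithm alongside each checksum is what leaves room for future changes: a reader can reject or skip algorithms it does not know rather than misinterpreting the digest.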
I also think that it should be possible to build a manifest with no checksums, so that one need not pay the overhead of computing checksums if one does not wish. Of course, such a manifest is of much less utility for checking backup integrity, but you can still check that you've got the right files, which is noticeably better than nothing. The manifest should probably also contain a checksum of its own contents so that the integrity of the manifest itself can be verified. And maybe a few other bits of metadata, but I'm not sure exactly what. Ideas?

Once we invent the concept of a backup manifest, what do we need to do with them? I think we'd want three things initially:

(1) When taking a backup, have the option (perhaps enabled by default) to include a backup manifest.

(2) Given an existing backup that has not got a manifest, construct one.

(3) Cross-check a manifest against a backup and complain about extra files, missing files, size differences, or checksum mismatches.

One thing I'm not quite sure about is where to store the backup manifest. If you take a base backup in tar format, you get base.tar, pg_wal.tar (unless -Xnone), and an additional tar file per tablespace. Does the backup manifest go into base.tar? Get written into a separate file outside of any tar archive? Something else? And what about a plain-format backup? I suppose then we should just write the manifest into the top level of the main data directory, but perhaps someone has another idea.
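As a rough illustration of what the cross-check in (3) might involve, here is a Python sketch. All names are hypothetical, and it assumes a simple path-to-size/checksum mapping with SHA-256 throughout; the real manifest format was still under discussion:

```python
import hashlib
import os

def verify_backup(manifest, backup_dir):
    """Cross-check a manifest (a dict of relative path -> {"size",
    "checksum"}) against a backup directory; return a list of problem
    strings covering extra files, missing files, size differences, and
    checksum mismatches."""
    problems = []
    seen = set()
    for rel, entry in manifest.items():
        full = os.path.join(backup_dir, rel)
        if not os.path.exists(full):
            problems.append("missing file: " + rel)
            continue
        seen.add(rel)
        actual_size = os.path.getsize(full)
        if actual_size != entry["size"]:
            # A size mismatch makes the checksum comparison moot.
            problems.append("size mismatch: %s (%d != %d)"
                            % (rel, actual_size, entry["size"]))
            continue
        with open(full, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest != entry["checksum"]:
            problems.append("checksum mismatch: " + rel)
    # Anything on disk that the manifest does not mention is "extra".
    for root, _, files in os.walk(backup_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), backup_dir)
            if rel not in seen and rel not in manifest:
                problems.append("extra file: " + rel)
    return problems
```

A manifest built without checksums would simply skip the digest comparison here, which is why it still catches extra, missing, and truncated files.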
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] https://www.postgresql.org/message-id/flat/CA%2BTgmoYxQLL%3DmVyN90HZgH0X_EUrw%2BaZ0xsXJk7XV3-3LygTvA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CALDaNm310fUZ72nM2n%3DcD0eSHKRAoJPuCyvvR0dhTEZ9Oytyzg%40mail.gmail.com
[3] https://www.postgresql.org/message-id/20190916143817.GA6962%40tamriel.snowman.net
[4] https://www.postgresql.org/message-id/CA%2BTgmoaj-zw4Mou4YBcJSkHmQM%2BJA-dAVJnRP8zSASP1S4ZVgw%40mail.gmail.com
[5] https://www.postgresql.org/message-id/CAM2%2B6%3DXfJX%3DKXvpTgDvgd1rQjya_Am27j4UvJtL3nA%2BJMCTGVQ%40mail.gmail.com
[6] https://www.postgresql.org/message-id/CA%2BTgmoYg9i8TZhyjf8MqCyU8unUVuW%2B03FeBF1LGDu_-eOONag%40mail.gmail.com
[7] https://www.postgresql.org/message-id/CA%2BTgmoYT9xODgEB6y6j93hFHqobVcdiRCRCp0dHh%2BfFzZALn%3Dw%40mail.gmail.com and nearby messages
[8] https://www.postgresql.org/message-id/20190916173933.GE6962%40tamriel.snowman.net
Hi Robert,

On 9/18/19 1:48 PM, Robert Haas wrote:
> That whole approach might still be dead on arrival if it's possible
> to add new blocks with old LSNs to existing files,[7] but there seems
> to be room to hope that there are no such cases.[8]

I sure hope there are no such cases, but we should be open to the idea just in case.

> So, let's suppose we invent a backup manifest. What should it contain?
> I imagine that it would consist of a list of files, and the lengths of
> those files, and a checksum for each file.

These are essential. Also consider adding the timestamp.

You have justifiable concerns about using timestamps for deltas and I get that. However, there are a number of methods that can be employed to make it *much* safer. I won't go into that here since it is an entire thread in itself. Suffice it to say we can detect many anomalies in the timestamps and require a checksum backup when we see them.

I'm really interested in scanning the WAL for changed files, but that method is very complex and getting it right might be harder than ensuring FS checksums are reliable. Still worth trying, though, since the benefits are enormous. We are planning to use timestamp + size + WAL data to do incrementals if we get there.

Consider adding a reference to each file that specifies where the file can be found if it is not in this backup. As I understand the pg_basebackup proposal, it would only be implementing differential backups, i.e. an incremental that is *only* based on the last full backup, so the reference can be inferred in this case. However, if each backup is labeled and the user selects the wrong full backup on restore, then a differential restore with references against the wrong full backup would result in a hard error rather than corruption.

> I think you should have a choice of what kind of checksums to use,
> because algorithms that used to seem like good choices (e.g. MD5) no
> longer do; this trend can probably be expected to continue.
> Even if we initially support only one kind of checksum -- presumably
> SHA-something since we have code for that already for SCRAM -- I think
> that it would also be a good idea to allow for future changes. And
> maybe it's best to just allow a choice of SHA-224, SHA-256, SHA-384,
> and SHA-512 right out of the gate, so that we can avoid bikeshedding
> over which one is secure enough. I guess we'll still have to argue
> about the default.

Based on my original calculations (which sadly I don't have anymore), the combination of SHA1, size, and file name is *extremely* unlikely to generate a collision. As in, unlikely to happen before the end of the universe kind of unlikely. Though, I guess it depends on your expectations for the lifetime of the universe.

These checksums don't need to be cryptographically secure in the sense of making it infeasible to recover the original data from the checksum; they just need to have a suitably low collision rate. These days I would choose something with more bits because the computation time is similar, though the larger size requires more storage.

> I also think that it should be possible to build a manifest with no
> checksums, so that one need not pay the overhead of computing
> checksums if one does not wish.

Our benchmarks have indicated that checksums only account for about 1% of total CPU time when gzip -6 compression is used. Without compression the percentage may be higher, of course, but in that case we find network latency is the primary bottleneck. For S3 backups we do a SHA1 hash for our manifest, a SHA256 hash for auth v4, and a good old-fashioned MD5 checksum for each upload part. This is barely noticeable when compression is enabled.

> Of course, such a manifest is of much less utility for checking backup
> integrity, but you can still check that you've got the right files,
> which is noticeably better than nothing.

Absolutely -- and yet.
There was a time when we made checksums optional, but we eventually gave up on that once we profiled and realized how low the cost was versus the benefit.

> The manifest should probably also contain a checksum of its own
> contents so that the integrity of the manifest itself can be verified.

This is a good idea. Amazingly, we've never seen a manifest checksum error in the field, but it's only a matter of time.

> And maybe a few other bits of metadata, but I'm not sure exactly what.
> Ideas?

A backup label for sure. You can also use this as the directory/tar name to save the user coming up with one. We use YYYYMMDDHH24MMSSF for full backups and YYYYMMDDHH24MMSSF_YYYYMMDDHH24MMSS(D|I) for incrementals, and have logic to prevent two backups from having the same label. Collisions are unlikely outside of testing, but the check is still a good idea.

Knowing the start/stop time of the backup is useful in all kinds of ways, especially monitoring and time-targeted PITR. Start/stop LSN is also good. I know this is also in backup_label, but having it all in one place is nice.

We include the version/sysid of the cluster to avoid mixups. It's a great extra check on top of references to be sure everything is kosher.

A manifest version is good in case we change the format later.

I'd recommend JSON for the format since it is so ubiquitous and easily handles escaping, which can be a gotcha in a home-grown format. We currently have a format that is a combination of Windows INI and JSON (for human-readability, in theory) and we have become painfully aware of escaping issues. Really, why would you drop files with '=' in their name in PGDATA? And yet it happens.

> Once we invent the concept of a backup manifest, what do we need to do
> with them? I think we'd want three things initially:
>
> (1) When taking a backup, have the option (perhaps enabled by default)
> to include a backup manifest.

Manifests are cheap to build so I wouldn't make it an option.
> (2) Given an existing backup that has not got a manifest, construct one.

Might be too late to be trusted, and we'd have to write extra code for it. I'd leave this for a project down the road, if at all.

> (3) Cross-check a manifest against a backup and complain about extra
> files, missing files, size differences, or checksum mismatches.

Verification is the best part of the manifest. Plus, you can do verification pretty cheaply on restore. We also restore pg_control last so clusters that have a restore error won't start.

> One thing I'm not quite sure about is where to store the backup
> manifest. If you take a base backup in tar format, you get base.tar,
> pg_wal.tar (unless -Xnone), and an additional tar file per tablespace.
> Does the backup manifest go into base.tar? Get written into a separate
> file outside of any tar archive? Something else? And what about a
> plain-format backup? I suppose then we should just write the manifest
> into the top level of the main data directory, but perhaps someone has
> another idea.

We do:

  [backup_label]/
    backup.manifest
    pg_data/
    pg_tblspc/

In general, having the manifest easily accessible is ideal.

Regards,
-- 
-David
david@pgmasters.net
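For what it's worth, the "end of the universe" collision estimate above is easy to sanity-check with the standard birthday-bound approximation, p ≈ n² / 2^(b+1) for n files and a b-bit hash (a back-of-the-envelope sketch, not a rigorous analysis):

```python
# Birthday-bound approximation of hash-collision probability:
# for n items and a b-bit hash, p is roughly n**2 / 2**(b + 1).
# This upper-bounds the chance that ANY two of n files share a digest.
def collision_probability(n_files, hash_bits):
    return (n_files ** 2) / 2.0 ** (hash_bits + 1)

# Even a billion files against 160-bit SHA-1 gives a probability on the
# order of 1e-31 -- far below any plausible undetected-disk-error rate.
p_sha1 = collision_probability(10 ** 9, 160)
```

Each extra hash bit halves the bound, which is why moving from SHA-1 to SHA-256 buys so much margin for so little extra computation.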
On Wed, Sep 18, 2019 at 9:11 PM David Steele <david@pgmasters.net> wrote:
> Also consider adding the timestamp.

Sounds reasonable, even if only for the benefit of humans who might look at the file. We can decide later whether to use it for anything else (and third-party tools could make different decisions from core). I assume we're talking about file mtime here, not file ctime or file atime or the time the manifest was generated, but let me know if I'm wrong.

> Consider adding a reference to each file that specifies where the file
> can be found if it is not in this backup. As I understand the
> pg_basebackup proposal, it would only be implementing differential
> backups, i.e. an incremental that is *only* based on the last full
> backup. So, the reference can be inferred in this case. However, if
> the user selects the wrong full backup on restore, and we have labeled
> each backup, then a differential restore with references against the
> wrong full backup would result in a hard error rather than corruption.

I intend that we should be able to support incremental backups based either on a previous full backup or on a previous incremental backup. I am not aware of a technical reason why we need to identify the specific backup that must be used. If incremental backup B is taken based on a pre-existing backup A, then I think that B can be restored using either A or *any other backup taken after A and before B*. In the normal case there probably wouldn't be any such backup, but AFAICS the start-LSNs are a sufficient cross-check that the chosen base backup is legal.

> Based on my original calculations (which sadly I don't have anymore),
> the combination of SHA1, size, and file name is *extremely* unlikely to
> generate a collision. As in, unlikely to happen before the end of the
> universe kind of unlikely. Though, I guess it depends on your
> expectations for the lifetime of the universe.
Somebody once said that we should be prepared for it to end at any time, or not, and that the time at which it actually was due to end would not be disclosed in advance. This is probably good life advice which I ought to take more often than I do, but I think we can finesse the issue for purposes of this discussion. What I'd say is: if the probability of getting a collision is demonstrably many orders of magnitude less than the probability of the disk writing the block incorrectly, then I think we're probably reasonably OK. Somebody might differ, which is perhaps a mild point in favor of LSN-based approaches, but as a practical matter, if a bad block is a billion times more likely to be the result of a disk error than of a checksum collision, then it's a negligible risk.

> > And maybe a few other bits of metadata, but I'm not sure
> > exactly what. Ideas?
>
> A backup label for sure. You can also use this as the directory/tar
> name to save the user coming up with one. We use YYYYMMDDHH24MMSSF for
> full backups and YYYYMMDDHH24MMSSF_YYYYMMDDHH24MMSS(D|I) for
> incrementals and have logic to prevent two backups from having the same
> label. This is unlikely outside of testing but still a good idea.
>
> Knowing the start/stop time of the backup is useful in all kinds of
> ways, especially monitoring and time-targeted PITR. Start/stop LSN is
> also good. I know this is also in backup_label but having it all in one
> place is nice.
>
> We include the version/sysid of the cluster to avoid mixups. It's a
> great extra check on top of references to be sure everything is kosher.

I don't think it's a good idea to duplicate the information that's already in the backup_label. Storing two copies of the same information is just an invitation to having to worry about what happens if they don't agree.

> A manifest version is good in case we change the format later.

Yeah.
> I'd recommend JSON for the format since it is so ubiquitous and easily
> handles escaping which can be gotchas in a home-grown format. We
> currently have a format that is a combination of Windows INI and JSON
> (for human-readability in theory) and we have become painfully aware of
> escaping issues. Really, why would you drop files with '=' in their
> name in PGDATA? And yet it happens.

I am not crazy about JSON because it requires that I get a JSON parser into src/common, which I could do, but given the possibly-imminent end of the universe, I'm not sure it's the greatest use of time. You're right that if we pick an ad-hoc format, we've got to worry about escaping, which isn't lovely.

> > (1) When taking a backup, have the option (perhaps enabled by default)
> > to include a backup manifest.
>
> Manifests are cheap to build so I wouldn't make it an option.

Huh. That's an interesting idea. Thanks.

> > (3) Cross-check a manifest against a backup and complain about extra
> > files, missing files, size differences, or checksum mismatches.
>
> Verification is the best part of the manifest. Plus, you can do
> verification pretty cheaply on restore. We also restore pg_control last
> so clusters that have a restore error won't start.

There's no "restore" operation here, really. A backup taken by pg_basebackup can be "restored" by copying the whole thing, but it can also be used just where it is. If we were going to build something into some in-core tool to copy backups around, this would be a smart way to implement said tool, but I'm not planning on that myself.

> > One thing I'm not quite sure about is where to store the backup
> > manifest. If you take a base backup in tar format, you get base.tar,
> > pg_wal.tar (unless -Xnone), and an additional tar file per tablespace.
> > Does the backup manifest go into base.tar? Get written into a separate
> > file outside of any tar archive? Something else? And what about a
> > plain-format backup?
> > I suppose then we should just write the manifest into the top level
> > of the main data directory, but perhaps someone has another idea.
>
> We do:
>
>   [backup_label]/
>     backup.manifest
>     pg_data/
>     pg_tblspc/
>
> In general, having the manifest easily accessible is ideal.

That's a fine choice for a tool, but I'm talking about something that is part of the actual backup format supported by PostgreSQL, not what a tool might wrap around it. The choice is whether, for a tar-format backup, the manifest goes inside a tar file or in a separate file. To put that another way, a patch adding backup manifests does not get to redesign where pg_basebackup puts anything else; it only gets to decide where to put the manifest.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Sep 19, 2019 at 9:51 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I intend that we should be able to support incremental backups based
> either on a previous full backup or based on a previous incremental
> backup. I am not aware of a technical reason why we need to identify
> the specific backup that must be used. If incremental backup B is
> taken based on a pre-existing backup A, then I think that B can be
> restored using either A or *any other backup taken after A and before
> B*. In the normal case, there probably wouldn't be any such backup,
> but AFAICS the start-LSNs are a sufficient cross-check that the chosen
> base backup is legal.

Scratch that: there can be overlapping backups, so you have to cross-check both start and stop LSNs.

> > > (3) Cross-check a manifest against a backup and complain about extra
> > > files, missing files, size differences, or checksum mismatches.
> >
> > Verification is the best part of the manifest. Plus, you can do
> > verification pretty cheaply on restore. We also restore pg_control last
> > so clusters that have a restore error won't start.
>
> There's no "restore" operation here, really. A backup taken by
> pg_basebackup can be "restored" by copying the whole thing, but it can
> also be used just where it is. If we were going to build something
> into some in-core tool to copy backups around, this would be a smart
> way to implement said tool, but I'm not planning on that myself.

Scratch that: incremental backups need a restore tool, so we can use this technique there. And it can work for full backups too, because why not?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
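One possible reading of that start/stop-LSN cross-check, expressed as a predicate. This is a sketch only: the field names are hypothetical, and the exact rule was still being discussed in the thread:

```python
def usable_as_base(incr, candidate):
    """Sketch with hypothetical field names: suppose an incremental
    backup records the threshold LSN it was taken against (the start
    LSN of its reference backup) plus its own start LSN.  A candidate
    base backup is then acceptable if it started no earlier than that
    threshold and finished before the incremental began -- checking
    both ends guards against overlapping backups."""
    return (candidate["start_lsn"] >= incr["threshold_lsn"]
            and candidate["stop_lsn"] <= incr["start_lsn"])
```

The point of the second conjunct is exactly the "scratch that" above: with only the start-LSN check, a backup overlapping the incremental could slip through as a base.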
Hi Robert,

On 9/19/19 9:51 AM, Robert Haas wrote:
> On Wed, Sep 18, 2019 at 9:11 PM David Steele <david@pgmasters.net> wrote:
>> Also consider adding the timestamp.
>
> Sounds reasonable, even if only for the benefit of humans who might
> look at the file. We can decide later whether to use it for anything
> else (and third-party tools could make different decisions from core).
> I assume we're talking about file mtime here, not file ctime or file
> atime or the time the manifest was generated, but let me know if I'm
> wrong.

In my experience only mtime is useful.

>> Based on my original calculations (which sadly I don't have anymore),
>> the combination of SHA1, size, and file name is *extremely* unlikely to
>> generate a collision. As in, unlikely to happen before the end of the
>> universe kind of unlikely. Though, I guess it depends on your
>> expectations for the lifetime of the universe.
>
> What I'd say is: if the probability of getting a collision is
> demonstrably many orders of magnitude less than the probability of the
> disk writing the block incorrectly, then I think we're probably
> reasonably OK. Somebody might differ, which is perhaps a mild point in
> favor of LSN-based approaches, but as a practical matter, if a bad
> block is a billion times more likely to be the result of a disk error
> than a checksum mismatch, then it's a negligible risk.

Agreed.

>> We include the version/sysid of the cluster to avoid mixups. It's a
>> great extra check on top of references to be sure everything is kosher.
>
> I don't think it's a good idea to duplicate the information that's
> already in the backup_label. Storing two copies of the same
> information is just an invitation to having to worry about what
> happens if they don't agree.

OK, but now we have backup_label, tablespace_map, XXXXXXXXXXXXXXXXXXXXXXXX.XXXXXXXX.backup (in the WAL), and now perhaps a backup.manifest file. I feel like we may be drowning in backup info files.
>> I'd recommend JSON for the format since it is so ubiquitous and easily
>> handles escaping which can be gotchas in a home-grown format. We
>> currently have a format that is a combination of Windows INI and JSON
>> (for human-readability in theory) and we have become painfully aware of
>> escaping issues. Really, why would you drop files with '=' in their
>> name in PGDATA? And yet it happens.
>
> I am not crazy about JSON because it requires that I get a json parser
> into src/common, which I could do, but given the possibly-imminent end
> of the universe, I'm not sure it's the greatest use of time. You're
> right that if we pick an ad-hoc format, we've got to worry about
> escaping, which isn't lovely.

My experience is that JSON is simple to implement and has already dealt with escaping and data structure considerations. A home-grown solution will be at least as complex but have the disadvantage of being non-standard.

>>> One thing I'm not quite sure about is where to store the backup
>>> manifest. If you take a base backup in tar format, you get base.tar,
>>> pg_wal.tar (unless -Xnone), and an additional tar file per tablespace.
>>> Does the backup manifest go into base.tar? Get written into a separate
>>> file outside of any tar archive? Something else? And what about a
>>> plain-format backup? I suppose then we should just write the manifest
>>> into the top level of the main data directory, but perhaps someone has
>>> another idea.
>>
>> We do:
>>
>>   [backup_label]/
>>     backup.manifest
>>     pg_data/
>>     pg_tblspc/
>>
>> In general, having the manifest easily accessible is ideal.
>
> That's a fine choice for a tool, but I'm talking about something
> that is part of the actual backup format supported by PostgreSQL, not
> what a tool might wrap around it. The choice is whether, for a
> tar-format backup, the manifest goes inside a tar file or as a
> separate file.
> To put that another way, a patch adding backup manifests does not get
> to redesign where pg_basebackup puts anything else; it only gets to
> decide where to put the manifest.

Fair enough. The point is to make the manifest easily accessible. I'd keep it in the data directory for plain-format backups and write it as a separate file for tar-format backups. The advantage here is that we can pick a file name that becomes reserved, which a tool can't do.

Regards,
-- 
-David
david@pgmasters.net
On 9/19/19 11:00 AM, Robert Haas wrote:
> On Thu, Sep 19, 2019 at 9:51 AM Robert Haas <robertmhaas@gmail.com> wrote:
>> I intend that we should be able to support incremental backups based
>> either on a previous full backup or based on a previous incremental
>> backup. I am not aware of a technical reason why we need to identify
>> the specific backup that must be used. If incremental backup B is
>> taken based on a pre-existing backup A, then I think that B can be
>> restored using either A or *any other backup taken after A and before
>> B*. In the normal case, there probably wouldn't be any such backup,
>> but AFAICS the start-LSNs are a sufficient cross-check that the chosen
>> base backup is legal.
>
> Scratch that: there can be overlapping backups, so you have to
> cross-check both start and stop LSNs.

Overall we have found it's much simpler to label each backup and cross-check that against the pg version and system id. The start LSN is pretty unique, but backup labels work really well and are more widely understood.

>>>> (3) Cross-check a manifest against a backup and complain about extra
>>>> files, missing files, size differences, or checksum mismatches.
>>>
>>> Verification is the best part of the manifest. Plus, you can do
>>> verification pretty cheaply on restore. We also restore pg_control last
>>> so clusters that have a restore error won't start.
>>
>> There's no "restore" operation here, really. A backup taken by
>> pg_basebackup can be "restored" by copying the whole thing, but it can
>> also be used just where it is. If we were going to build something
>> into some in-core tool to copy backups around, this would be a smart
>> way to implement said tool, but I'm not planning on that myself.
>
> Scratch that: incremental backups need a restore tool, so we can use
> this technique there. And it can work for full backups too, because
> why not?

Agreed, once we have a restore tool, use it for everything.

-- 
-David
david@pgmasters.net
On Thu, Sep 19, 2019 at 11:10:46PM -0400, David Steele wrote:
> On 9/19/19 11:00 AM, Robert Haas wrote:
>> On Thu, Sep 19, 2019 at 9:51 AM Robert Haas <robertmhaas@gmail.com> wrote:
>>> I intend that we should be able to support incremental backups based
>>> either on a previous full backup or based on a previous incremental
>>> backup. I am not aware of a technical reason why we need to identify
>>> the specific backup that must be used. If incremental backup B is
>>> taken based on a pre-existing backup A, then I think that B can be
>>> restored using either A or *any other backup taken after A and before
>>> B*. In the normal case, there probably wouldn't be any such backup,
>>> but AFAICS the start-LSNs are a sufficient cross-check that the chosen
>>> base backup is legal.
>>
>> Scratch that: there can be overlapping backups, so you have to
>> cross-check both start and stop LSNs.
>
> Overall we have found it's much simpler to label each backup and
> cross-check that against the pg version and system id. Start LSN is
> pretty unique, but backup labels work really well and are more widely
> understood.

Warning: the start LSN could be the same for multiple backups when taken from a standby.
-- 
Michael
On Thu, Sep 19, 2019 at 11:10 PM David Steele <david@pgmasters.net> wrote:
> Overall we have found it's much simpler to label each backup and
> cross-check that against the pg version and system id. Start LSN is
> pretty unique, but backup labels work really well and are more widely
> understood.

I see your point, but part of my point is that uniqueness is not a technical requirement. However, it may be a requirement for user comprehension.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Sep 19, 2019 at 11:06 PM David Steele <david@pgmasters.net> wrote:
> > I am not crazy about JSON because it requires that I get a json parser
> > into src/common, which I could do, but given the possibly-imminent end
> > of the universe, I'm not sure it's the greatest use of time. You're
> > right that if we pick an ad-hoc format, we've got to worry about
> > escaping, which isn't lovely.
>
> My experience is that JSON is simple to implement and has already dealt
> with escaping and data structure considerations. A home-grown solution
> will be at least as complex but have the disadvantage of being
> non-standard.

I think that's fair, and I just spent a little while investigating how difficult it would be to disentangle the JSON parser from the backend. It has dependencies on the following bits of backend-only functionality:

- check_stack_depth(). No problem, I think. Just skip it for frontend code.

- pg_mblen() / GetDatabaseEncoding(). Not sure what to do about this. Some of our infrastructure for dealing with encoding is available in both the frontend and the backend, but this part is backend-only.

- elog() / ereport(). Kind of a pain. We could just kill the program if an error occurs, but that seems a bit ham-fisted. Refactoring the code so that the error is returned rather than thrown might be the way to go, but it's not simple, because you're not just passing a string:

    ereport(ERROR,
            (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
             errmsg("invalid input syntax for type %s", "json"),
             errdetail("Character with value 0x%02x must be escaped.",
                       (unsigned char) *s),
             report_json_context(lex)));

- appendStringInfo et al. I don't think it would be that hard to move this to src/common, but I'm also not sure it really solves the problem, because StringInfo has a 1GB limit, and there's no rule at all that a backup manifest has got to be less than 1GB.

https://www.pgcon.org/2013/schedule/events/595.en.html

This gets at another problem that I just started to think about.
If the file is just a series of lines, you can parse it one line at a time and do something with that line, then move on. If it's a JSON blob, you have to parse the whole file and get back a potentially giant data structure, and then operate on that data structure. At least, I think you do. There's probably some way to create a callback structure that lets you presuppose that the toplevel data structure is an array (or object) and get back each element of that array (or key/value pair) as it's parsed, but that sounds pretty annoying to get working. Or we could just decide that you have to have enough memory to hold the parsed version of the entire manifest file in memory all at once, and if you don't, maybe you should drop some tables or buy more RAM. That still leaves you with bypassing the 1GB size limit on StringInfo, maybe by having a "huge" option, or perhaps by memory-mapping the file and then making the StringInfo point directly into the mapped region.

Perhaps I'm overthinking this and maybe you have a simpler idea in mind about how it can be made to work, but I find all this complexity pretty unappealing. Here's a competing proposal: let's decide that lines consist of tab-separated fields. If a field contains a \t, \r, or \n, put a " at the beginning, a " at the end, and double any " that appears in the middle. This is easy to generate and easy to parse. It lets us completely ignore encoding considerations. Incremental parsing is straightforward. Quoting will rarely be needed, because there's very little reason to create a file inside a PostgreSQL data directory that contains a tab or a newline, but if you do, it'll still work. The lack of quoting is nice for humans reading the manifest, and nice in terms of keeping the manifest succinct; in contrast, note that using JSON doubles every backslash.

I hear you saying that this is going to end up being just as complex in the end, but I don't think I believe it.
It sounds to me like the difference between spending a couple of hours figuring this out and spending a couple of months trying to figure it out and maybe not actually getting anywhere.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
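The tab-separated quoting rule proposed above is indeed easy to generate and parse incrementally. Here is a Python sketch (hypothetical names; it also quotes fields containing a bare double quote, a small extension of the rule as stated, so that parsing stays unambiguous):

```python
def escape_field(field):
    # Proposed rule: if a field contains a tab, CR, or LF, wrap it in
    # double quotes and double any embedded quote.  We also quote
    # fields containing a bare '"' (an extension of the stated rule).
    if any(c in field for c in "\t\r\n\""):
        return '"' + field.replace('"', '""') + '"'
    return field

def format_line(fields):
    return "\t".join(escape_field(f) for f in fields) + "\n"

def parse_line(line):
    # One line at a time, one pass, no lookahead beyond a single
    # character: quoted tabs are literal, doubled quotes collapse.
    line = line.rstrip("\n")
    fields, buf, i, in_quotes = [], [], 0, False
    while i < len(line):
        c = line[i]
        if in_quotes:
            if c == '"':
                if i + 1 < len(line) and line[i + 1] == '"':
                    buf.append('"')   # doubled quote -> literal quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                buf.append(c)          # tabs/newlines are literal here
        elif c == '"' and not buf:
            in_quotes = True           # opening quote at field start
        elif c == "\t":
            fields.append("".join(buf))
            buf = []
        else:
            buf.append(c)
        i += 1
    fields.append("".join(buf))
    return fields
```

Note that a parser cannot simply split on tabs, since a quoted field may contain one; the character-at-a-time scan above is still a few dozen lines, which is the sense in which this stays a couple-of-hours problem.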
On Thu, Sep 19, 2019 at 11:06 PM David Steele <david@pgmasters.net> wrote:
> > I don't think it's a good idea to duplicate the information that's
> > already in the backup_label. Storing two copies of the same
> > information is just an invitation to having to worry about what
> > happens if they don't agree.
>
> OK, but now we have backup_label, tablespace_map,
> XXXXXXXXXXXXXXXXXXXXXXXX.XXXXXXXX.backup (in the WAL) and now perhaps a
> backup.manifest file. I feel like we may be drowning in backup info files.

I agree! I'm not sure what to do about it, though. The information that is present in the tablespace_map file could have been stored in the backup_label file, I think, and that would have made sense, because both files are serving a very similar purpose: they tell the server that it needs to do some non-standard stuff when it starts up, and they give it instructions for what those things are. And, as a secondary purpose, humans or third-party tools can read them and use that information for whatever purpose they wish.

The proposed backup_manifest file is a little different. I don't think that anyone is proposing that the server should read that file: it is there solely for the purpose of helping our own tools or third-party tools or human beings who are, uh, acting like tools.[1] We're also proposing to put it in a different place: the backup_label goes into one of the tar files, but the backup_manifest would sit outside of any tar file.

If we were designing this from scratch, maybe we'd roll all of this into one file that serves as backup manifest, tablespace map, backup label, and backup history file, but then again, maybe separating the instructions-to-the-server part from the backup-integrity-checking part makes sense. At any rate, even if we knew for sure that's the direction we wanted to go, getting there from here looks a bit rough. If we just add a backup manifest, people who don't care can mostly ignore it and then should be mostly fine.
If we start trying to create the one backup information system to rule them all, we're going to break people's tools. Maybe that's worth doing someday, but the paint isn't even dry on removing recovery.conf yet. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [1] There are a surprising number of installations where, in effect, the DBA is the backup-and-restore tool, performing all the steps by hand and hoping not to mess any of them up. The fact that nearly every PostgreSQL company offers tools to make this easier does not seem to have done a whole lot to diminish the number of people using ad-hoc solutions.
On Fri, Sep 20, 2019 at 9:46 AM Robert Haas <robertmhaas@gmail.com> wrote: > - appendStringInfo et. al. I don't think it would be that hard to move > this to src/common, but I'm also not sure it really solves the > problem, because StringInfo has a 1GB limit, and there's no rule at > all that a backup manifest has got to be less than 1GB. Hmm. That's actually going to be a problem on the server side, no matter what we do on the client side. We have to send the manifest after we send everything else, so that we know what we sent. But if we sent a lot of files, the manifest might be really huge. I had been thinking that we would generate the manifest on the server and send it to the client after everything else, but maybe this is an argument for generating the manifest on the client side and writing it incrementally. That would require the client to peek at the contents of every tar file it receives all the time, which it currently doesn't need to do, but it does peek inside them a little bit, so maybe it's OK. Another alternative would be to have the server spill the manifest in progress to a temp file and then stream it from there to the client. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 9/20/19 9:46 AM, Robert Haas wrote: > On Thu, Sep 19, 2019 at 11:06 PM David Steele <david@pgmasters.net> wrote: > >> My experience is that JSON is simple to implement and has already dealt >> with escaping and data structure considerations. A home-grown solution >> will be at least as complex but have the disadvantage of being non-standard. > > I think that's fair and just spent a little while investigating how > difficult it would be to disentangle the JSON parser from the backend. > It has dependencies on the following bits of backend-only > functionality: > - elog() / ereport(). Kind of a pain. We could just kill the program > if an error occurs, but that seems a bit ham-fisted. Refactoring the > code so that the error is returned rather than thrown might be the way > to go, but it's not simple, because you're not just passing a string. Seems to me we are overdue for elog()/ereport() compatible error-handling in the front end. Plus mem contexts. It sucks to make that a prereq for this project but the longer we kick that can down the road... > https://www.pgcon.org/2013/schedule/events/595.en.html This talk was good fun. The largest number of tables we've seen is a few hundred thousand, but that still adds up to more than a million files to backup. > This gets at another problem that I just started to think about. If > the file is just a series of lines, you can parse it one line and a > time and do something with that line, then move on. If it's a JSON > blob, you have to parse the whole file and get a potentially giant > data structure back, and then operate on that data structure. At > least, I think you do. JSON can definitely be parsed incrementally, but for practical reasons certain structures work better than others. 
> There's probably some way to create a callback > structure that lets you presuppose that the toplevel data structure is > an array (or object) and get back each element of that array (or > key/value pair) as it's parsed, but that sounds pretty annoying to get > working. And that's how we do it. It's annoying and yeah it's complicated but it is very fast and memory-efficient. > Or we could just decide that you have to have enough memory > to hold the parsed version of the entire manifest file in memory all > at once, and if you don't, maybe you should drop some tables or buy > more RAM. I assume you meant "un-parsed" here? > That still leaves you with bypassing the 1GB size limit on > StringInfo, maybe by having a "huge" option, or perhaps by > memory-mapping the file and then making the StringInfo point directly > into the mapped region. Perhaps I'm overthinking this and maybe you > have a simpler idea in mind about how it can be made to work, but I > find all this complexity pretty unappealing. Our String object has the same 1GB limit. Partly because it works and saves a bit of memory per object, but also because if we find ourselves exceeding that limit we know we've probably made a design error. Parsing in stream means that you only need to store the final in-memory representation of the manifest which can be much more compact. Yeah, it's complicated, but the memory and time savings are worth it. Note that our Perl implementation took the naive approach and has worked pretty well for six years, but can choke on really large manifests with out of memory errors. Overall, I'd say getting the format right is more important than having the perfect initial implementation. > Here's a competing proposal: let's decide that lines consist of > tab-separated fields. If a field contains a \t, \r, or \n, put a " at > the beginning, a " at the end, and double any " that appears in the > middle. This is easy to generate and easy to parse. 
It lets us > completely ignore encoding considerations. Incremental parsing is > straightforward. Quoting will rarely be needed because there's very > little reason to create a file inside a PostgreSQL data directory that > contains a tab or a newline, but if you do it'll still work. The lack > of quoting is nice for humans reading the manifest, and nice in terms > of keeping the manifest succinct; in contrast, note that using JSON > doubles every backslash. There's other information you'll want to store that is not strictly file info so you need a way to denote that. It gets complicated quickly. > I hear you saying that this is going to end up being just as complex > in the end, but I don't think I believe it. It sounds to me like the > difference between spending a couple of hours figuring this out and > spending a couple of months trying to figure it out and maybe not > actually getting anywhere. Maybe the initial implementation will be easier but I am confident we'll pay for it down the road. Also, don't we want users to be able to read this file? Do we really want them to need to cook up a custom parser in Perl, Go, Python, etc.? -- -David david@pgmasters.net
On 9/20/19 10:59 AM, Robert Haas wrote: > On Fri, Sep 20, 2019 at 9:46 AM Robert Haas <robertmhaas@gmail.com> wrote: >> - appendStringInfo et. al. I don't think it would be that hard to move >> this to src/common, but I'm also not sure it really solves the >> problem, because StringInfo has a 1GB limit, and there's no rule at >> all that a backup manifest has got to be less than 1GB. > > Hmm. That's actually going to be a problem on the server side, no > matter what we do on the client side. We have to send the manifest > after we send everything else, so that we know what we sent. But if we > sent a lot of files, the manifest might be really huge. I had been > thinking that we would generate the manifest on the server and send it > to the client after everything else, but maybe this is an argument for > generating the manifest on the client side and writing it > incrementally. That would require the client to peek at the contents > of every tar file it receives all the time, which it currently doesn't > need to do, but it does peek inside them a little bit, so maybe it's > OK. > > Another alternative would be to have the server spill the manifest in > progress to a temp file and then stream it from there to the client. This seems reasonable to me. We keep an in-memory representation which is just an array of structs and is fairly compact -- 1 million files uses ~150MB of memory. We just format and stream this to storage when saving. Saving is easier than loading, of course. -- -David david@pgmasters.net
On 9/20/19 9:46 AM, Robert Haas wrote: > least, I think you do. There's probably some way to create a callback > structure that lets you presuppose that the toplevel data structure is > an array (or object) and get back each element of that array (or > key/value pair) as it's parsed, If a JSON parser does find its way into src/common, it probably wants to have such an incremental mode available, similar to [2] offered in the "Jackson" library for Java. The Jackson developer has propounded a thesis[1] that such a parsing library ought to offer "Three -- and Only Three" different styles of API corresponding to three ways of organizing the code using the library ([2], [3], [4], which also resemble the different APIs supplied in Java for XML processing). Regards, -Chap [1] http://www.cowtowncoder.com/blog/archives/2009/01/entry_132.html [2] http://www.cowtowncoder.com/blog/archives/2009/01/entry_137.html [3] http://www.cowtowncoder.com/blog/archives/2009/01/entry_153.html [4] http://www.cowtowncoder.com/blog/archives/2009/01/entry_152.html
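[Editor's illustration: the "incremental mode" being discussed — presuppose the top level is an array and hand back each element as it is decoded — can be sketched with Python's stdlib `json` module. The server-side code under discussion would be C, so this only shows the control flow; note it still keeps the raw text in memory, which is exactly the large-manifest concern raised above.]

```python
import json

def iter_array_elements(text):
    # Assume the top-level JSON value is an array; yield each element as it
    # is decoded instead of materializing the whole array at once.
    # (The raw text is still fully in memory here; a real incremental
    # parser would consume a byte stream instead.)
    dec = json.JSONDecoder()
    i = text.index('[') + 1
    while True:
        # Skip whitespace and commas between elements.
        while i < len(text) and text[i] in ' \t\r\n,':
            i += 1
        if text[i] == ']':
            return
        element, i = dec.raw_decode(text, i)
        yield element
```

Consumers then process one entry at a time (`for entry in iter_array_elements(manifest_text): ...`), keeping only the compact per-file representation rather than the whole parse tree.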
On Fri, Sep 20, 2019 at 11:09 AM David Steele <david@pgmasters.net> wrote: > Seems to me we are overdue for elog()/ereport() compatible > error-handling in the front end. Plus mem contexts. > > It sucks to make that a prereq for this project but the longer we kick > that can down the road... There are no doubt many patches that would benefit from having more backend infrastructure exposed in frontend contexts, and I think we're slowly moving in that direction, but I generally do not believe in burdening feature patches with major infrastructure improvements. Sometimes it's necessary, as in the case of parallel query, which required upgrading a whole lot of backend infrastructure in order to have any chance of doing something useful. In most cases, however, there's a way of getting the patch done that dodges the problem. For example, I think there's a pretty good argument that Heikki's design for relation forks was a bad one. It's proven to scale poorly and create performance problems and extra complexity in quite a few places. It would likely have been better, from a strictly theoretical point of view, to insist on a design where the FSM and VM pages got stored inside the relation itself, and the heap was responsible for figuring out how various pages were being used. When BRIN came along, we insisted on precisely that design, because it was clear that further straining the relation fork system was not a good plan. However, if we'd insisted on that when Heikki did the original work, it might have delayed the arrival of the free space map for one or more releases, and we got big benefits out of having that done sooner. There's nothing stopping someone from writing a patch to get rid of relation forks and allow a heap AM to have multiple relfilenodes (with the extra ones used for the FSM and VM) or with multiplexing all the data inside of a single file. 
Nobody has, though, because it's hard, and the problems with the status quo are not so bad as to justify the amount of development effort that would be required to fix it. At some point, that problem is probably going to work its way to the top of somebody's priority list, but it's already been about 10 years since that all happened and everyone has so far dodged dealing with the problem, which in turn has enabled them to work on other things that are perhaps more important. I think the same principle applies here. It's reasonable to ask the author of a feature patch to fix issues that are closely related to the feature in question, or even problems that are not new but would be greatly exacerbated by the addition of the feature. It's not reasonable to stack up a list of infrastructure upgrades that somebody has to do as a condition of having a feature patch accepted that does not necessarily require those upgrades. I am not convinced that JSON is actually a better format for a backup manifest (more on that below), but even if I were, I believe that getting a backup manifest functionality into PostgreSQL 13, and perhaps incremental backup on top of that, is valuable enough to justify making some compromises to make that happen. And I don't mean "compromises" as in "let's commit something that doesn't work very well;" rather, I mean making design choices that are aimed at making the project something that is feasible and can be completed in reasonable time, rather than not. And saying, well, the backup manifest format *has* to be JSON because everything else suxxor is not that. We don't have a single other example of a file that we read and write in JSON format. Extension control files use a custom format. Backup labels and backup history files and timeline history files and tablespace map files use custom formats. postgresql.conf, pg_hba.conf, and pg_ident.conf use custom formats. postmaster.opts and postmaster.pid use custom formats. 
If JSON is better and easier, at least one of the various people who coded those things up would have chosen to use it, but none of them did, and nobody's made a serious attempt to convert them to use it. That might be because we lack the infrastructure for dealing with JSON and building it is more work than anybody's willing to do, or it might be because JSON is not actually better for these kinds of use cases, but either way, it's hard to see why this particular patch should be burdened with a requirement that none of the previous ones had to satisfy. Personally, I'd be intensely unhappy if a motion to convert postgresql.conf or pg_hba.conf to JSON format gathered enough steam to be adopted. It would be darn useful, because you could specify complex values for options instead of being limited to scalars, but it would also make the configuration files a lot harder for human beings to read and grep and the quality of error reporting would probably decline significantly. Also, appending a setting to the file, something which is currently quite simple, would get a lot harder. Ad-hoc file formats can be problematic, but they can also have real advantages in terms of readability, brevity, and fitness for purpose. > This talk was good fun. The largest number of tables we've seen is a > few hundred thousand, but that still adds up to more than a million > files to backup. A quick survey of some of my colleagues turned up a few examples of people with 2-4 million files to backup, so similar kind of ballpark. Probably not big enough for the manifest to hit the 1GB mark, but getting close. > > Or we could just decide that you have to have enough memory > > to hold the parsed version of the entire manifest file in memory all > > at once, and if you don't, maybe you should drop some tables or buy > > more RAM. > > I assume you meant "un-parsed" here? 
I don't think I meant that, although it seems like you might need to store either all the parsed data or all the unparsed data or even both, depending on exactly what you are trying to do.

> > I hear you saying that this is going to end up being just as complex
> > in the end, but I don't think I believe it. It sounds to me like the
> > difference between spending a couple of hours figuring this out and
> > spending a couple of months trying to figure it out and maybe not
> > actually getting anywhere.
>
> Maybe the initial implementation will be easier but I am confident we'll
> pay for it down the road. Also, don't we want users to be able to read
> this file? Do we really want them to need to cook up a custom parser in
> Perl, Go, Python, etc.?

Well, I haven't heard anybody complain that they can't read a backup_label file because it's too hard to cook up a parser. And I think the reason is pretty clear: such files are not hard to parse. Similarly for a pg_hba.conf file. This case is a little more complicated than those, but AFAICS, not enormously so. Actually, it seems like a combination of those two cases: it has some fixed metadata fields that can be represented with one line per field, like a backup_label, and then a bunch of entries for files that are somewhat like entries in a pg_hba.conf file, in that they can be represented by a line per record with a certain number of fields on each line.

I attach here a couple of patches. The first one does some refactoring of relevant code in pg_basebackup, and the second one adds checksum manifests using a format that I pulled out of my ear. It probably needs some adjustment but I don't think it's crazy. Each file gets a line that looks like this:

File $FILENAME $FILESIZE $FILEMTIME $FILECHECKSUM

Right now, the file checksums are computed using SHA-256 but it could be changed to anything else for which we've got code. On my system, shasum -a256 $FILE produces the same answer that shows up here.
At the bottom of the manifest there's a checksum of the manifest itself, which looks like this:

Manifest-Checksum 385fe156a8c6306db40937d59f46027cc079350ecf5221027d71367675c5f781

That's a SHA-256 checksum of the file contents excluding the final line. It can be verified by feeding all the file contents except the last line to shasum -a256. I can't help but observe that if the file were defined to be a JSONB blob, it's not very clear how you would include a checksum of the blob contents in the blob itself, but with a format based on a bunch of lines of data, it's super-easy to generate and super-easy to write tools that verify it.

This is just a prototype so I haven't written a verification tool, and there's a bunch of testing and documentation and so forth that would need to be done aside from whatever we've got to hammer out in terms of design issues and file formats. But I think it's cool, and perhaps some discussion of how it could be evolved will get us closer to a resolution everybody can at least live with. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
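[Editor's illustration of how simple generation and verification of the prototype format is. The exact field layout and the treatment of the newline before the Manifest-Checksum line are assumptions based on the description above, not taken from the patch itself.]

```python
import hashlib

def write_manifest(entries):
    # entries: iterable of (name, size, mtime_str, sha256_hex) tuples.
    # One "File ..." line per file, then a trailing Manifest-Checksum line
    # that is the SHA-256 of every byte above it.
    body = ''.join(f'File {name} {size} {mtime} {csum}\n'
                   for name, size, mtime, csum in entries)
    digest = hashlib.sha256(body.encode()).hexdigest()
    return body + f'Manifest-Checksum {digest}\n'

def verify_manifest(text):
    # Split off the final line, which must be the Manifest-Checksum entry,
    # and recompute the digest over everything that precedes it.
    body, _, last = text.rstrip('\n').rpartition('\n')
    body += '\n'
    label, _, claimed = last.partition(' ')
    return (label == 'Manifest-Checksum'
            and hashlib.sha256(body.encode()).hexdigest() == claimed)
```

The verifier is the programmatic equivalent of feeding all but the last line to `shasum -a256` and comparing the result, so any tampering with a File entry (or with the manifest itself) is detected.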
Attachment
On 9/20/19 2:55 PM, Robert Haas wrote: > On Fri, Sep 20, 2019 at 11:09 AM David Steele <david@pgmasters.net> wrote: >> >> It sucks to make that a prereq for this project but the longer we kick >> that can down the road... > > There are no doubt many patches that would benefit from having more > backend infrastructure exposed in frontend contexts, and I think we're > slowly moving in that direction, but I generally do not believe in > burdening feature patches with major infrastructure improvements. The hardest part about technical debt is knowing when to incur it. It is never a cut-and-dried choice. >> This talk was good fun. The largest number of tables we've seen is a >> few hundred thousand, but that still adds up to more than a million >> files to backup. > > A quick survey of some of my colleagues turned up a few examples of > people with 2-4 million files to backup, so similar kind of ballpark. > Probably not big enough for the manifest to hit the 1GB mark, but > getting close. I have so many doubts about clusters with this many tables, but we do support it, so... >>> I hear you saying that this is going to end up being just as complex >>> in the end, but I don't think I believe it. It sounds to me like the >>> difference between spending a couple of hours figuring this out and >>> spending a couple of months trying to figure it out and maybe not >>> actually getting anywhere. >> >> Maybe the initial implementation will be easier but I am confident we'll >> pay for it down the road. Also, don't we want users to be able to read >> this file? Do we really want them to need to cook up a custom parser in >> Perl, Go, Python, etc.? > > Well, I haven't heard anybody complain that they can't read a > backup_label file because it's too hard to cook up a parser. And I > think the reason is pretty clear: such files are not hard to parse. > Similarly for a pg_hba.conf file. This case is a little more > complicated than those, but AFAICS, not enormously so. 
Actually, it > seems like a combination of those two cases: it has some fixed > metadata fields that can be represented with one line per field, like > a backup_label, and then a bunch of entries for files that are > somewhat like entries in a pg_hba.conf file, in that they can be > represented by a line per record with a certain number of fields on > each line. Yeah, they are not hard to parse, but *everyone* has to cook up code for it. A bit of a bummer, that. > I attach here a couple of patches. The first one does some > refactoring of relevant code in pg_basebackup, and the second one adds > checksum manifests using a format that I pulled out of my ear. It > probably needs some adjustment but I don't think it's crazy. Each > file gets a line that looks like this: > > File $FILENAME $FILESIZE $FILEMTIME $FILECHECKSUM We also include page checksum validation failures in the file record. Not critical for the first pass, perhaps, but something to keep in mind. > Right now, the file checksums are computed using SHA-256 but it could > be changed to anything else for which we've got code. On my system, > shasum -a256 $FILE produces the same answer that shows up here. At > the bottom of the manifest there's a checksum of the manifest itself, > which looks like this: > > Manifest-Checksum > 385fe156a8c6306db40937d59f46027cc079350ecf5221027d71367675c5f781 > > That's a SHA-256 checksum of the file contents excluding the final > line. It can be verified by feeding all the file contents except the > last line to shasum -a256. I can't help but observe that if the file > were defined to be a JSONB blob, it's not very clear how you would > include a checksum of the blob contents in the blob itself, but with a > format based on a bunch of lines of data, it's super-easy to generate > and super-easy to write tools that verify it. 
You can do this in JSON pretty easily by handling the terminating brace/bracket:

{ <some json contents>*, "checksum":<sha256> }

But of course a linefeed-delimited file is even easier.

> This is just a prototype so I haven't written a verification tool, and
> there's a bunch of testing and documentation and so forth that would
> need to be done aside from whatever we've got to hammer out in terms
> of design issues and file formats. But I think it's cool, and perhaps
> some discussion of how it could be evolved will get us closer to a
> resolution everybody can at least live with.

I had a quick look and it seems pretty reasonable. I'll need to generate a manifest to see if I can spot any obvious gotchas. -- -David david@pgmasters.net
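[Editor's illustration: one way to realize David's terminating-brace idea. This is my own construction, not code from the thread, and it assumes a non-empty top-level object in which "checksum" appears only as that final member.]

```python
import hashlib
import json

def add_json_checksum(doc):
    # Serialize the (assumed non-empty) object, drop the terminating brace,
    # and splice in a final "checksum" member whose digest covers every
    # byte that precedes it.
    body = json.dumps(doc)
    prefix = body[:-1] + ', '
    digest = hashlib.sha256(prefix.encode()).hexdigest()
    return prefix + '"checksum": "' + digest + '"}'

def verify_json_checksum(text):
    claimed = json.loads(text)['checksum']
    # Recompute over everything preceding the checksum member itself;
    # assumes "checksum" does not appear as a key elsewhere in the document.
    cut = text.rindex('"checksum"')
    return hashlib.sha256(text[:cut].encode()).hexdigest() == claimed
```

So it is doable, but the verifier has to operate on the raw bytes rather than the parsed object, which supports the observation that a line-oriented trailer is the simpler scheme.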
On Sat, Sep 21, 2019 at 12:25 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
Some comments:

Manifest file will be in plain text format even if compression is specified; should we compress it? Maybe this is intended, just raised the point to make sure that it is intended.

+static void
+ReceiveBackupManifestChunk(size_t r, char *copybuf, void *callback_data)
+{
+ WriteManifestState *state = callback_data;
+
+ if (fwrite(copybuf, r, 1, state->file) != 1)
+ {
+ pg_log_error("could not write to file \"%s\": %m", state->filename);
+ exit(1);
+ }
+}

WALfile.done file gets added but WAL file information is not included in the manifest file; should we include the WAL file also?

@@ -599,16 +618,20 @@ perform_base_backup(basebackup_options *opt)
 (errcode_for_file_access(),
 errmsg("could not stat file \"%s\": %m", pathbuf)));
- sendFile(pathbuf, pathbuf, &statbuf, false, InvalidOid);
+ sendFile(pathbuf, pathbuf, &statbuf, false, InvalidOid, manifest,
+ NULL);
 /* unconditionally mark file as archived */
 StatusFilePath(pathbuf, fname, ".done");
- sendFileWithContent(pathbuf, "");
+ sendFileWithContent(pathbuf, "", manifest);

Should we add an option to make manifest generation configurable to reduce overhead during backup?

Manifest file does not include directory information; should we include it?

There is one warning:

In file included from ../../../src/include/fe_utils/string_utils.h:20:0,
from pg_basebackup.c:34:
pg_basebackup.c: In function ‘ReceiveTarFile’:
../../../src/interfaces/libpq/pqexpbuffer.h:60:9: warning: the comparison will always evaluate as ‘false’ for the address of ‘buf’ will never be NULL [-Waddress]
((str) == NULL || (str)->maxlen == 0)
^
pg_basebackup.c:1203:7: note: in expansion of macro ‘PQExpBufferBroken’
if (PQExpBufferBroken(&buf))

pg_gmtime can fail in case of malloc failure:

+ /*
+ * Convert time to a string. Since it's not clear what time zone to use
+ * and since time zone definitions can change, possibly causing confusion,
+ * use GMT always.
+ */
+ pg_strftime(timebuf, sizeof(timebuf), "%Y-%m-%d %H:%M:%S %Z",
+ pg_gmtime(&mtime));

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
An entry for each directory is not added to the manifest, so it might be
difficult for the client to get to know about the directories. Would it
be good to add an entry for each directory too? Maybe something like:
Dir <dirname> <mtime>
Also, the patches do not apply on the latest HEAD.
On Wed, Sep 25, 2019 at 6:17 PM vignesh C <vignesh21@gmail.com> wrote:
On Sat, Sep 21, 2019 at 12:25 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
Some comments:
Manifest file will be in plain text format even if compression is
specified, should we compress it?
Maybe this is intended; just raised the point to make sure that it is intended.
+static void
+ReceiveBackupManifestChunk(size_t r, char *copybuf, void *callback_data)
+{
+ WriteManifestState *state = callback_data;
+
+ if (fwrite(copybuf, r, 1, state->file) != 1)
+ {
+ pg_log_error("could not write to file \"%s\": %m", state->filename);
+ exit(1);
+ }
+}
WALfile.done file gets added but wal file information is not included
in the manifest file, should we include WAL file also?
@@ -599,16 +618,20 @@ perform_base_backup(basebackup_options *opt)
(errcode_for_file_access(),
errmsg("could not stat file \"%s\": %m", pathbuf)));
- sendFile(pathbuf, pathbuf, &statbuf, false, InvalidOid);
+ sendFile(pathbuf, pathbuf, &statbuf, false, InvalidOid, manifest,
+ NULL);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
- sendFileWithContent(pathbuf, "");
+ sendFileWithContent(pathbuf, "", manifest);
Should we add an option to make manifest generation configurable to
reduce overhead during backup?
Manifest file does not include directory information, should we include it?
There is one warning:
In file included from ../../../src/include/fe_utils/string_utils.h:20:0,
from pg_basebackup.c:34:
pg_basebackup.c: In function ‘ReceiveTarFile’:
../../../src/interfaces/libpq/pqexpbuffer.h:60:9: warning: the
comparison will always evaluate as ‘false’ for the address of ‘buf’
will never be NULL [-Waddress]
((str) == NULL || (str)->maxlen == 0)
^
pg_basebackup.c:1203:7: note: in expansion of macro ‘PQExpBufferBroken’
if (PQExpBufferBroken(&buf))
Yes, I too observed this warning.
pg_gmtime can fail in case of malloc failure:
+ /*
+ * Convert time to a string. Since it's not clear what time zone to use
+ * and since time zone definitions can change, possibly causing confusion,
+ * use GMT always.
+ */
+ pg_strftime(timebuf, sizeof(timebuf), "%Y-%m-%d %H:%M:%S %Z",
+ pg_gmtime(&mtime));
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
--
Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
On Wed, Sep 25, 2019 at 6:17 PM vignesh C <vignesh21@gmail.com> wrote:
On Sat, Sep 21, 2019 at 12:25 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
Some comments:
Manifest file will be in plain text format even if compression is
specified, should we compress it?
Maybe this is intended; just raised the point to make sure that it is intended.
+static void
+ReceiveBackupManifestChunk(size_t r, char *copybuf, void *callback_data)
+{
+ WriteManifestState *state = callback_data;
+
+ if (fwrite(copybuf, r, 1, state->file) != 1)
+ {
+ pg_log_error("could not write to file \"%s\": %m", state->filename);
+ exit(1);
+ }
+}
WALfile.done file gets added but wal file information is not included
in the manifest file, should we include WAL file also?
@@ -599,16 +618,20 @@ perform_base_backup(basebackup_options *opt)
(errcode_for_file_access(),
errmsg("could not stat file \"%s\": %m", pathbuf)));
- sendFile(pathbuf, pathbuf, &statbuf, false, InvalidOid);
+ sendFile(pathbuf, pathbuf, &statbuf, false, InvalidOid, manifest,
+ NULL);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
- sendFileWithContent(pathbuf, "");
+ sendFileWithContent(pathbuf, "", manifest);
Should we add an option to make manifest generation configurable to
reduce overhead during backup?
Manifest file does not include directory information, should we include it?
There is one warning:
In file included from ../../../src/include/fe_utils/string_utils.h:20:0,
from pg_basebackup.c:34:
pg_basebackup.c: In function ‘ReceiveTarFile’:
../../../src/interfaces/libpq/pqexpbuffer.h:60:9: warning: the
comparison will always evaluate as ‘false’ for the address of ‘buf’
will never be NULL [-Waddress]
((str) == NULL || (str)->maxlen == 0)
^
pg_basebackup.c:1203:7: note: in expansion of macro ‘PQExpBufferBroken’
if (PQExpBufferBroken(&buf))
I also observed this warning. Please find attached a patch to fix the same.
pg_gmtime can fail in case of malloc failure:
+ /*
+ * Convert time to a string. Since it's not clear what time zone to use
+ * and since time zone definitions can change, possibly causing confusion,
+ * use GMT always.
+ */
+ pg_strftime(timebuf, sizeof(timebuf), "%Y-%m-%d %H:%M:%S %Z",
+ pg_gmtime(&mtime));
Fixed that in the attached patch.
Rushabh Lathia
Attachment
On Mon, Sep 30, 2019 at 5:31 AM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote: > Entry for directory is not added in manifest. So it might be difficult > at client to get to know about the directories. Will it be good to add > an entry for each directory too? May be like: > Dir <dirname> <mtime> Well, what kind of corruption would this allow us to detect that we can't detect as things stand? I think the only case is an empty directory. If it's not empty, we'd have some entries for the files in that directory, and those files won't be able to exist unless the directory does. But, how would we end up backing up an empty directory, anyway? I don't really *mind* adding directories into the manifest, but I'm not sure how much it helps. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
My colleague Suraj did testing and noticed the performance impact of the
checksums. On further testing, he found that SHA in particular has a
larger performance impact.
Please find below statistics:
no of tables | without checksum | SHA-256 checksum | % performance overhead with SHA-256 | MD5 checksum | % performance overhead with MD5 | CRC checksum | % performance overhead with CRC
10 (100 MB in each table) | real 0m10.957s user 0m0.367s sys 0m2.275s | real 0m16.816s user 0m0.210s sys 0m2.067s | 53% | real 0m11.895s user 0m0.174s sys 0m1.725s | 8% | real 0m11.136s user 0m0.365s sys 0m2.298s | 2%
20 (100 MB in each table) | real 0m20.610s user 0m0.484s sys 0m3.198s | real 0m31.745s user 0m0.569s sys 0m4.089s | 54% | real 0m22.717s user 0m0.638s sys 0m4.026s | 10% | real 0m21.075s user 0m0.538s sys 0m3.417s | 2%
50 (100 MB in each table) | real 0m49.143s user 0m1.646s sys 0m8.499s | real 1m13.683s user 0m1.305s sys 0m10.541s | 50% | real 0m51.856s user 0m0.932s sys 0m7.702s | 6% | real 0m49.689s user 0m1.028s sys 0m6.921s | 1%
100 (100 MB in each table) | real 1m34.308s user 0m2.265s sys 0m14.717s | real 2m22.403s user 0m2.613s sys 0m20.776s | 51% | real 1m41.524s user 0m2.158s sys 0m15.949s | 8% | real 1m35.045s user 0m2.061s sys 0m16.308s | 1%
100 (1 GB in each table) | real 17m18.336s user 0m20.222s sys 3m12.960s | real 24m45.942s user 0m26.911s sys 3m33.501s | 43% | real 17m41.670s user 0m26.506s sys 3m18.402s | 2% | real 17m22.296s user 0m26.811s sys 3m56.653s (sometimes this test completes in the same time as without checksums) | approx. 0.5%
Considering the above results, I modified Robert's earlier patch and added
a "manifest_with_checksums" option to pg_basebackup. With the new patch,
checksums are disabled by default and are only enabled when the
"manifest_with_checksums" option is provided. I have also rebased the whole patch set.
Regards,
Rushabh Lathia
On Tue, Oct 1, 2019 at 5:43 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Sep 30, 2019 at 5:31 AM Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:
> Entry for directory is not added in manifest. So it might be difficult
> at client to get to know about the directories. Will it be good to add
> an entry for each directory too? May be like:
> Dir <dirname> <mtime>
Well, what kind of corruption would this allow us to detect that we
can't detect as things stand? I think the only case is an empty
directory. If it's not empty, we'd have some entries for the files in
that directory, and those files won't be able to exist unless the
directory does. But, how would we end up backing up an empty
directory, anyway?
I don't really *mind* adding directories into the manifest, but I'm
not sure how much it helps.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Rushabh Lathia
Attachment
On 11/19/19 5:00 AM, Rushabh Lathia wrote:
> My colleague Suraj did testing and noticed the performance impact
> with the checksums. On further testing, he found that specifically with
> sha its more of performance impact.

I admit I haven't been following along closely, but why do we need a
cryptographic checksum here instead of, say, a CRC? Do we think that
somehow the checksum might be forged? Use of cryptographic hashes as
general purpose checksums has become far too common IMNSHO.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/19/19 5:00 AM, Rushabh Lathia wrote:
> My colleague Suraj did testing and noticed the performance impact
> with the checksums. On further testing, he found that specifically with
> sha its more of performance impact.

We have found that SHA1 adds about 3% overhead when the backup is also
compressed (gzip -6), which is what most people want to do. This
percentage goes down even more if the backup is being transferred over a
network or to an object store such as S3.

We judged that the lower collision rate of SHA1 justified the additional
expense.

That said, making SHA256 optional seems reasonable. We decided not to
make our SHA1 checksums optional to reduce the test matrix and because
parallelism largely addressed performance concerns.

Regards,
--
-David
david@pgmasters.net
On Tue, Nov 19, 2019 at 7:19 PM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
On 11/19/19 5:00 AM, Rushabh Lathia wrote:
>
>
> My colleague Suraj did testing and noticed the performance impact
> with the checksums. On further testing, he found that specifically with
> sha its more of performance impact.
>
>
I admit I haven't been following along closely, but why do we need a
cryptographic checksum here instead of, say, a CRC? Do we think that
somehow the checksum might be forged? Use of cryptographic hashes as
general purpose checksums has become far too common IMNSHO.
Yeah, maybe. I was thinking of giving the user an option to choose the
checksum algorithm (SHA256, CRC, MD5, etc.), so that they are free to
choose what suits their environment.
If we decide to do that, then we need to store the checksum algorithm
information in the manifest file.
Thoughts?
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Rushabh Lathia
Hi,
Now that we generate a backup manifest file with each backup, we have an opportunity to validate a given backup.
Let's say we have taken a backup and, after a few days, we want to check whether that backup is corruption-free, without restarting the server.
Please find attached a POC patch for this, based on Rushabh's latest backup manifest patch. With this functionality, we add a new option to pg_basebackup, something like --verify-backup.
So, the syntax would be:
./bin/pg_basebackup --verify-backup -D <backup_directory_path>
Basically, we read the backup_manifest file line by line from the given directory path and build a hash table, then scan the directory and compare each file against its hash table entry.
Thoughts/suggestions?
--
Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.
Attachment
On Tue, Nov 19, 2019 at 3:30 PM Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
Review comments on 0004:
1.
I don't think we need the o_manifest_with_checksums variable;
manifest_with_checksums can be used instead.
2.
We need to document this new option for pg_basebackup and basebackup.
3.
Also, instead of keeping manifest_with_checksums as a global variable, we
should pass it to the required functions. Patch 0002 already modifies the
signatures of all relevant functions anyway, so we just need to add one more
bool parameter there.
4.
Why do we need "File" at the start of each entry when we are adding only files?
I wonder if we also need tablespace and directory markers, so that we would
have "Tablespace" and "Dir" at the start as well.
5.
If I don't provide the manifest-with-checksums option, I still see that a
checksum is calculated for the backup_manifest file itself. Is that
intentional, or an oversight? I think we should omit that too if the option
is not provided.
6.
Is it possible to get only the backup manifest from the server? A client like
pg_basebackup could then use it to decide which files to fetch.
Thanks
--
Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
On Wed, Nov 20, 2019 at 11:05 AM Suraj Kharage <suraj.kharage@enterprisedb.com> wrote:
Hi,
Since now we are generating the backup manifest file with each backup, it provides us an option to validate the given backup.
Let's say, we have taken a backup and after a few days, we want to check whether that backup is validated or corruption-free without restarting the server.
Please find attached POC patch for same which will be based on the latest backup manifest patch from Rushabh. With this functionality, we add new option to pg_basebackup, something like --verify-backup.
So, the syntax would be:
./bin/pg_basebackup --verify-backup -D <backup_directory_path>
Basically, we read the backup_manifest file line by line from the given directory path and build the hash table, then scan the directory and compare each file with the hash entry.
Thoughts/suggestions?
I like the idea of verifying the backup once we have a backup_manifest.
Periodically verifying an already-taken backup becomes easy with this
simple tool.
I have reviewed this patch and here are my comments:
1.
@@ -30,7 +30,9 @@
#include "common/file_perm.h"
#include "common/file_utils.h"
#include "common/logging.h"
+#include "common/sha2.h"
#include "common/string.h"
+#include "fe_utils/simple_list.h"
#include "fe_utils/recovery_gen.h"
#include "fe_utils/string_utils.h"
#include "getopt_long.h"
@@ -38,12 +40,19 @@
#include "pgtar.h"
#include "pgtime.h"
#include "pqexpbuffer.h"
+#include "pgrhash.h"
#include "receivelog.h"
#include "replication/basebackup.h"
#include "streamutil.h"
Please keep the new #include lines in sorted order.
2.
Can the hash-related files be renamed to backuphash.c and backuphash.h?
3.
Indentation needs adjustment in various places.
4.
+ char buf[1000000]; // 1MB chunk
It would be good to use a multiple of the block/page size (or at least a
power of two).
5.
+typedef struct pgrhash_entry
+{
+ struct pgrhash_entry *next; /* link to next entry in same bucket */
+ DataDirectoryFileInfo *record;
+} pgrhash_entry;
+
+struct pgrhash
+{
+ unsigned nbuckets; /* number of buckets */
+ pgrhash_entry **bucket; /* pointer to hash entries */
+};
+
+typedef struct pgrhash pgrhash;
These two can be moved to the .h file instead of being redefined there.
6.
+/*
+ * TODO: this function is not necessary, can be removed.
+ * Test whether the given row number is match for the supplied keys.
+ */
+static bool
+pgrhash_compare(char *bt_filename, char *filename)
Yeah, it can be removed by doing strcmp() at the required places rather than
doing it in a separate function.
7.
mdate is not compared anywhere. I understand that it can't be compared
between the file in the backup directory and its entry in the manifest,
since the manifest entry records the mtime of the file on the server,
whereas the same file in the backup will have a different mtime. But adding
a few comments there would be good.
8.
+ char mdate[24];
should be mtime instead?
Thanks
--
Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Thank you Jeevan for reviewing the patch.
On Thu, Nov 21, 2019 at 2:33 PM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
Review comments on 0004:
1.
I don't think we need o_manifest_with_checksums variable,
manifest_with_checksums can be used instead.
Yes, done in the latest version of the patch.
2.
We need to document this new option for pg_basebackup and basebackup.
Done, attaching documentation patch with the mail.
3.
Also, instead of keeping manifest_with_checksums as a global variable, we
should pass that to the required function. Patch 0002 already modified the
signature of all relevant functions anyways. So just need to add one more bool
variable there.
Yes, earlier I implemented it that way, but then I found that we already
have a checksum-related global variable, noverify_checksums, so keeping it
global seemed like the cleaner implementation, rather than modifying the
function signatures to pass a variable that is effectively global for the
operation.
4.
Why we need a "File" at the start of each entry as we are adding files only?
I wonder if we also need to provide a tablespace name and directory marker so
that we have "Tablespace" and "Dir" at the start.
Sorry, I am not quite sure about this; maybe Robert is the right person
to answer it.
5.
If I don't provide manifest-with-checksums option then too I see that checksum
is calculated for backup_manifest file itself. Is that intentional or missed?
I think we should omit that too if this option is not provided.
Oops, yeah; corrected this in the latest version of the patch.
6.
Is it possible to get only a backup manifest from the server? A client like
pg_basebackup can then use that to fetch files reading that.
Currently we don't have any option to get just the manifest file from the
server. I am not sure why we would need that at this point.
Regards,
Rushabh Lathia
Attachment
On Tue, Nov 19, 2019 at 8:49 AM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
> I admit I haven't been following along closely, but why do we need a
> cryptographic checksum here instead of, say, a CRC? Do we think that
> somehow the checksum might be forged? Use of cryptographic hashes as
> general purpose checksums has become far too common IMNSHO.

I tend to agree with you. I suspect if we just use CRC, some people are
going to complain that they want something "stronger" because that will
make them feel better about error detection rates or obscure threat
models or whatever other things a SHA-based approach might be able to
catch that CRC would not catch. However, I suspect that for normal use
cases, CRC would be totally adequate, and the fact that the performance
overhead is almost none vs. a whole lot - at least in this test setup,
other results might vary depending on what you test - makes it look
pretty appealing.

My gut reaction is to make CRC the default, but have an option that you
can use to either turn it off entirely (if even 1-2% is too much for
you) or opt in to SHA-something if you want it. I don't think we should
offer an option for MD5, because MD5 is a dirty word these days and will
cause problems for users who have to worry about FIPS 140-2 compliance.
Phrased more positively, if you want a cryptographic hash at all, you
should probably use one that isn't widely viewed as too weak.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 11/22/19 10:58 AM, Robert Haas wrote:
> On Tue, Nov 19, 2019 at 8:49 AM Andrew Dunstan
> <andrew.dunstan@2ndquadrant.com> wrote:
>> I admit I haven't been following along closely, but why do we need a
>> cryptographic checksum here instead of, say, a CRC? Do we think that
>> somehow the checksum might be forged? Use of cryptographic hashes as
>> general purpose checksums has become far too common IMNSHO.
>
> I tend to agree with you. I suspect if we just use CRC, some people
> are going to complain that they want something "stronger" because that
> will make them feel better about error detection rates or obscure
> threat models or whatever other things a SHA-based approach might be
> able to catch that CRC would not catch.

Well, the maximum amount of data that can be protected with a 32-bit CRC
is 512MB according to all the sources I found (NIST, Wikipedia, etc). I
presume that's what we are talking about since I can't find any 64-bit
CRC code in core or this patch. So, that's half of what we need with the
default relation segment size (I've seen larger in the field).

> I don't think we
> should offer an option for MD5, because MD5 is a dirty word these days
> and will cause problems for users who have to worry about FIPS 140-2
> compliance.

+1.

> Phrased more positively, if you want a cryptographic hash
> at all, you should probably use one that isn't widely viewed as too
> weak.

Sure. There's another advantage to picking an algorithm with lower
collision rates, though.

CRCs are fine for catching transmission errors (as caveated above) but
not as great for comparing two files for equality. With strong hashes
you can confidently compare local files against the path, size, and hash
stored in the manifest and save yourself a round-trip to the remote
storage to grab the file if it has not changed locally. This is the
basic premise of what we call delta restore which can speed up restores
by orders of magnitude.

Delta restore is the main advantage that made us decide to require SHA1
checksums. In most cases, restore speed is more important than backup
speed.

Regards,
--
-David
david@pgmasters.net
On Tue, Nov 19, 2019 at 4:34 PM David Steele <david@pgmasters.net> wrote:
> On 11/19/19 5:00 AM, Rushabh Lathia wrote:
>> My colleague Suraj did testing and noticed the performance impact
>> with the checksums. On further testing, he found that specifically with
>> sha its more of performance impact.
>
> We have found that SHA1 adds about 3% overhead when the backup is also
> compressed (gzip -6), which is what most people want to do. This
> percentage goes down even more if the backup is being transferred over a
> network or to an object store such as S3.

I don't really understand why your tests and Suraj's tests are showing
such different results, or how compression plays into it. I tried
running shasum -a$N lineitem-big.csv on my laptop, where that file
contains ~70MB of random-looking data whose source I no longer remember.
Here are the results by algorithm: SHA1, ~25 seconds; SHA224 or SHA256,
~52 seconds; SHA384 and SHA512, ~39 seconds. Aside from the interesting
discovery that the algorithms with more bits actually run faster on this
machine, this seems to show that there's only about a ~2x difference
between the SHA1 that you used and the SHA256 that I (pretty much
arbitrarily) used. But Rushabh and Suraj are reporting 43-54% overhead,
and even if you divide that by two it's a lot more than 3%.

One possible explanation is that the compression is really slow, and so
it makes the checksum overhead a smaller percentage of the total. Like,
if you've already slowed down the backup by 8x, then 24% overhead turns
into 3% overhead! But I assume that's not the real explanation here.
Another explanation is that your tests were I/O-bound rather than
CPU-bound, maybe because you tested with a much larger database or a
much smaller amount of I/O bandwidth. If you had CPU cycles to burn,
then neither compression nor checksums will cost much in terms of
overall runtime. But that's a little hard to swallow, too, because I
don't think the testing mentioned above was done using any sort of
exotic test configuration, so why would yours be so different? Another
possibility is that Suraj and Rushabh messed up the tests, or
alternatively that you did. Or, it could be that your checksum
implementation is way faster than the one PG uses, and so the impact was
much less. I don't know, but I'm having a hard time understanding the
divergent results. Any ideas?

> We judged that the lower collision rate of SHA1 justified the additional
> expense.
>
> That said, making SHA256 optional seems reasonable. We decided not to
> make our SHA1 checksums optional to reduce the test matrix and because
> parallelism largely addressed performance concerns.

Just to be clear, I really don't have any objection to using SHA1
instead of SHA256, or anything else for that matter. I picked the one to
use out of a hat for the purpose of having a POC quickly; I didn't have
any intention to insist on that as the final selection. It seems likely
that anything we pick here will eventually be considered obsolete, so I
think we need to allow for configurability, but I don't have a horse in
the game as far as an initial selection goes.

Except - and this gets back to the previous point - I don't want to slow
down backups by 40% by default. I wouldn't mind slowing them down 3% by
default, but 40% is too much overhead. I think we've got to either get
the overhead of using SHA way down or not use SHA by default.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Nov 22, 2019 at 1:10 PM David Steele <david@pgmasters.net> wrote:
> Well, the maximum amount of data that can be protected with a 32-bit CRC
> is 512MB according to all the sources I found (NIST, Wikipedia, etc). I
> presume that's what we are talking about since I can't find any 64-bit
> CRC code in core or this patch.

Could you give a more precise citation for this? I can't find a
reference to that in the Wikipedia article off-hand and I don't know
where to look in NIST. I apologize if I'm being dense here, but I don't
see why there should be any limit on the amount of data that can be
protected. The important thing is that if the original file F is altered
to F', we hope that CHECKSUM(F) != CHECKSUM(F'). The probability of
that, assuming that the alteration is random rather than malicious and
that the checksum function is equally likely to produce every possible
output, is just 1-2^-${CHECKSUM_BITS}, regardless of the length of the
message (except that there might be some special cases for very short
messages, which don't matter here).

This analysis by me seems to match
https://en.wikipedia.org/wiki/Cyclic_redundancy_check, which says:
"Typically an n-bit CRC applied to a data block of arbitrary length will
detect any single error burst not longer than n bits, and the fraction
of all longer error bursts that it will detect is (1 − 2^−n)." Notice
the phrase "a data block of arbitrary length" and the formula "1 - 2^-n".

> > Phrased more positively, if you want a cryptographic hash
> > at all, you should probably use one that isn't widely viewed as too
> > weak.
>
> Sure. There's another advantage to picking an algorithm with lower
> collision rates, though.
>
> CRCs are fine for catching transmission errors (as caveated above) but
> not as great for comparing two files for equality. With strong hashes
> you can confidently compare local files against the path, size, and hash
> stored in the manifest and save yourself a round-trip to the remote
> storage to grab the file if it has not changed locally.

I agree in part. I think there are two reasons why a cryptographically
strong hash is desirable for delta restore. First, since the checksums
are longer, the probability of a false match happening randomly is
lower, which is important. Even if the above analysis is correct and the
chance of a false match is just 2^-32 with a 32-bit CRC, if you back up
ten million files every day, you'll likely get a false match within a
few years or less, and once is too often. Second, unlike what I supposed
above, the contents of a PostgreSQL data file are not chosen at random,
unlike transmission errors, which probably are more or less random. It
seems somewhat possible that there is an adversary who is trying to
choose the data that gets stored in some particular record so as to
create a false checksum match. A CRC is a lot easier to fool than a
cryptographic hash, so I think that using a CRC of *any* length for this
kind of use case would be extremely dangerous no matter the probability
of an accidental match.

> This is the basic premise of what we call delta restore which can speed
> up restores by orders of magnitude.
>
> Delta restore is the main advantage that made us decide to require SHA1
> checksums. In most cases, restore speed is more important than backup
> speed.

I see your point, but it's not the whole story. We've encountered a
bunch of cases where the time it took to complete a backup exceeded the
user's desired backup interval, which is obviously very bad, or even
more commonly where it exceeded the length of the user's "low-usage"
period when they could tolerate the extra overhead imposed by the
backup. A few percentage points is probably not a big deal, but a user
who has an 8-hour window to get the backup done overnight will not be
happy if it's taking 6 hours now and we tack 40%-50% on to that. So I
think that we either have to disable backup checksums by default, or
figure out a way to get the overhead down to something a lot smaller
than what current tests are showing -- which we could possibly do
without changing the algorithm if we can somehow make it a lot cheaper,
but otherwise I think the choice is between disabling the functionality
altogether by default and adopting a less-expensive algorithm. Maybe
someday when delta restore is in core and widely used and CPUs are
faster, it'll make sense to revise the default, and that's cool, but I
can't see imposing a big overhead by default to enable a feature core
doesn't have yet...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 11/22/19 1:24 PM, Robert Haas wrote:
> On Tue, Nov 19, 2019 at 4:34 PM David Steele <david@pgmasters.net> wrote:
>> On 11/19/19 5:00 AM, Rushabh Lathia wrote:
>>> My colleague Suraj did testing and noticed the performance impact with the checksums. On further testing, he found that specifically with sha it's more of a performance impact.
>>
>> We have found that SHA1 adds about 3% overhead when the backup is also compressed (gzip -6), which is what most people want to do. This percentage goes down even more if the backup is being transferred over a network or to an object store such as S3.
>
> I don't really understand why your tests and Suraj's tests are showing such different results, or how compression plays into it. I tried running shasum -a$N lineitem-big.csv on my laptop, where that file contains ~70MB of random-looking data whose source I no longer remember. Here are the results by algorithm: SHA1, ~25 seconds; SHA224 or SHA256, ~52 seconds; SHA384 and SHA512, ~39 seconds. Aside from the interesting discovery that the algorithms with more bits actually run faster on this machine, this seems to show that there's only about a ~2x difference between the SHA1 that you used and that I (pretty much arbitrarily) used. But Rushabh and Suraj are reporting 43-54% overhead, and even if you divide that by two it's a lot more than 3%.
>
> One possible explanation is that the compression is really slow, and so it makes the checksum overhead a smaller percentage of the total. Like, if you've already slowed down the backup by 8x, then 24% overhead turns into 3% overhead! But I assume that's not the real explanation here.

That's the real explanation here. Hash calculations run at the same speed, they just become a smaller portion of the *total* time once compression (gzip -6) is added. With something like lz4, hashing will obviously be a big percentage of the total.
Also consider how much extra latency you get from copying over a network. My 3% did not include that, but realistically most backups are running over a network (hopefully).

>> That said, making SHA256 optional seems reasonable. We decided not to make our SHA1 checksums optional to reduce the test matrix and because parallelism largely addressed performance concerns.
>
> Just to be clear, I really don't have any objection to using SHA1 instead of SHA256, or anything else for that matter. I picked the one to use out of a hat for the purpose of having a POC quickly; I didn't have any intention to insist on that as the final selection. It seems likely that anything we pick here will eventually be considered obsolete, so I think we need to allow for configurability, but I don't have a horse in the game as far as an initial selection goes.

We decided that SHA1 was good enough and there was no need to go up to SHA256. What we were interested in was collision rates and what the chance of getting a false positive was based on the combination of path, size, and hash. With SHA1 the chance of a collision was literally astronomically low (as in the universe would probably end before it happened, depending on whether you are an expand-forever or contract proponent).

> Except - and this gets back to the previous point - I don't want to slow down backups by 40% by default. I wouldn't mind slowing them down 3% by default, but 40% is too much overhead. I think we've gotta either get the overhead of using SHA way down or not use SHA by default.

Maybe -- my take is that the measurements, an uncompressed backup to the local filesystem, are not a very realistic use case. However, I'm still fine with leaving the user the option of checksums or no. I just wanted to point out that CRCs have their limits, so maybe that's not a great option unless it is properly caveated and perhaps not the default.

Regards,
--
-David
david@pgmasters.net
On 11/22/19 2:01 PM, Robert Haas wrote:
> On Fri, Nov 22, 2019 at 1:10 PM David Steele <david@pgmasters.net> wrote:
>> Well, the maximum amount of data that can be protected with a 32-bit CRC is 512MB according to all the sources I found (NIST, Wikipedia, etc). I presume that's what we are talking about since I can't find any 64-bit CRC code in core or this patch.
>
> Could you give a more precise citation for this?

See:
https://www.nist.gov/system/files/documents/2017/04/26/lrdc_systems_part2_032713.pdf
Search for "The maximum block size"

https://en.wikipedia.org/wiki/Cyclic_redundancy_check
"The design of the CRC polynomial depends on the maximum total length of the block to be protected (data + CRC bits)", which I took to mean there are limits.

Here's another interesting bit from:
https://en.wikipedia.org/wiki/Mathematics_of_cyclic_redundancy_checks
"Because a CRC is based on division, no polynomial can detect errors consisting of a string of zeroes prepended to the data, or of missing leading zeroes" -- but it appears to matter what CRC you are using. There's a variation that works in this case and hopefully we are using that one.

This paper talks about appropriate block lengths vs CRC length:
http://users.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf
but it is concerned with network transmission and small block lengths.

> "Typically an n-bit CRC applied to a data block of arbitrary length will detect any single error burst not longer than n bits, and the fraction of all longer error bursts that it will detect is (1 − 2^−n)."

I'm not sure how encouraging I find this -- a four-byte error is not a lot and 2^32 is only 4 billion. We have individual users who have backed up more than 4 billion files over the last few years.

>> This is the basic premise of what we call delta restore which can speed up restores by orders of magnitude.
>>
>> Delta restore is the main advantage that made us decide to require SHA1 checksums.
>> In most cases, restore speed is more important than backup speed.
>
> I see your point, but it's not the whole story. We've encountered a bunch of cases where the time it took to complete a backup exceeded the user's desired backup interval, which is obviously very bad, or even more commonly where it exceeded the length of the user's "low-usage" period when they could tolerate the extra overhead imposed by the backup. A few percentage points is probably not a big deal, but a user who has an 8-hour window to get the backup done overnight will not be happy if it's taking 6 hours now and we tack 40%-50% on to that. So I think that we either have to disable backup checksums by default, or figure out a way to get the overhead down to something a lot smaller than what current tests are showing -- which we could possibly do without changing the algorithm if we can somehow make it a lot cheaper, but otherwise I think the choice is between disabling the functionality altogether by default and adopting a less-expensive algorithm. Maybe someday when delta restore is in core and widely used and CPUs are faster, it'll make sense to revise the default, and that's cool, but I can't see imposing a big overhead by default to enable a feature core doesn't have yet...

OK, I'll buy that. But I *don't* think CRCs should be allowed for deltas (when we have them) and I *do* think we should caveat their effectiveness (assuming we can agree on them). In general the answer to faster backups should be more cores/faster network/faster disk, not compromising backup integrity. I understand we'll need to wait until we have parallelism in pg_basebackup to justify that answer.

Regards,
--
-David
david@pgmasters.net
Moin Robert,

On 2019-11-22 20:01, Robert Haas wrote:
> On Fri, Nov 22, 2019 at 1:10 PM David Steele <david@pgmasters.net> wrote:
>> Well, the maximum amount of data that can be protected with a 32-bit CRC is 512MB according to all the sources I found (NIST, Wikipedia, etc). I presume that's what we are talking about since I can't find any 64-bit CRC code in core or this patch.
>
> Could you give a more precise citation for this? I can't find a reference to that in the Wikipedia article off-hand and I don't know where to look in NIST. I apologize if I'm being dense here, but I don't see why there should be any limit on the amount of data that can be protected. The important thing is that if the original file F is altered to F', we hope that CHECKSUM(F) != CHECKSUM(F'). The probability of that, assuming that the alteration is random rather than malicious and that the checksum function is equally likely to produce every possible output, is just 1-2^-${CHECKSUM_BITS}, regardless of the length of the message (except that there might be some special cases for very short messages, which don't matter here).
>
> This analysis by me seems to match https://en.wikipedia.org/wiki/Cyclic_redundancy_check, which says:
>
> "Typically an n-bit CRC applied to a data block of arbitrary length will detect any single error burst not longer than n bits, and the fraction of all longer error bursts that it will detect is (1 − 2^−n)."
>
> Notice the phrase "a data block of arbitrary length" and the formula "1 - 2^-n".

It is related to the number of states, and the birthday problem factors in it, too:

https://en.wikipedia.org/wiki/Birthday_problem

If you have a 32-bit checksum or hash, it can represent only 2**32-1 states at most (or less, if the algorithm isn't really good). Each byte is 8 bits, so 2**32 / 8 is 512 MByte. If you process your data bit by bit, each new bit would add a new state (consider: missing bit == 0, added bit == 1).
If each new state is represented by a different checksum, all possible 2**32 values are exhausted after processing 512 MByte; after that you get one of the former states again -- aka a collision. There is no way around it with so few bits, no matter what algorithm you choose.

>>> Phrased more positively, if you want a cryptographic hash at all, you should probably use one that isn't widely viewed as too weak.
>>
>> Sure. There's another advantage to picking an algorithm with lower collision rates, though.
>>
>> CRCs are fine for catching transmission errors (as caveated above) but not as great for comparing two files for equality. With strong hashes you can confidently compare local files against the path, size, and hash stored in the manifest and save yourself a round-trip to the remote storage to grab the file if it has not changed locally.
>
> I agree in part. I think there are two reasons why a cryptographically strong hash is desirable for delta restore. First, since the checksums are longer, the probability of a false match happening randomly is lower, which is important. Even if the above analysis is correct and the chance of a false match is just 2^-32 with a 32-bit CRC, if you back up ten million files every day, you'll likely get a false match within a few years or less, and once is too often. Second, unlike what I supposed above, the contents of a PostgreSQL data file are not chosen at random, unlike transmission errors, which probably are more or less random. It seems somewhat possible that there is an adversary who is trying to choose the data that gets stored in some particular record so as to create a false checksum match. A CRC is a lot easier to fool than a cryptographic hash, so I think that using a CRC of *any* length for this kind of use case would be extremely dangerous no matter the probability of an accidental match.

Agreed. See above.
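The state-counting point above can be shown empirically: by the birthday bound, a collision among random inputs to a 32-bit checksum is expected after only about sqrt(2^32), i.e. on the order of 2^16 attempts. A Python sketch using zlib's CRC-32 (purely illustrative, not related to any patch in this thread):

```python
import os
import zlib

# Sketch: find two distinct random inputs with the same CRC-32. By the
# birthday bound a collision is expected after roughly sqrt(2**32),
# i.e. on the order of 2**16 (~80k) attempts -- a concrete illustration
# of how quickly a 32-bit checksum space is exhausted.
seen = {}
attempts = 0
while True:
    attempts += 1
    data = os.urandom(16)
    crc = zlib.crc32(data)
    if crc in seen and seen[crc] != data:
        a, b = seen[crc], data
        break
    seen[crc] = data

print(f"CRC-32 collision after {attempts} random inputs")
```

On a typical run this terminates in well under a second, after tens of thousands of inputs.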
However, if you choose a hash, please do not go below SHA-256. Both MD5 and SHA-1 have already had collision attacks, and these are only bound to get worse.

https://www.mscs.dal.ca/~selinger/md5collision/
https://shattered.io/

It might even be a wise idea to encode the hash algorithm used into the manifest file, so it can be changed later. The hash length alone might not be enough to decide which algorithm was used.

>> This is the basic premise of what we call delta restore which can speed up restores by orders of magnitude.
>>
>> Delta restore is the main advantage that made us decide to require SHA1 checksums. In most cases, restore speed is more important than backup speed.
>
> I see your point, but it's not the whole story. We've encountered a bunch of cases where the time it took to complete a backup exceeded the user's desired backup interval, which is obviously very bad, or even more commonly where it exceeded the length of the user's "low-usage" period when they could tolerate the extra overhead imposed by the backup. A few percentage points is probably not a big deal, but a user who has an 8-hour window to get the backup done overnight will not be happy if it's taking 6 hours now and we tack 40%-50% on to that. So I think that we either have to disable backup checksums by default, or figure out a way to get the overhead down to something a lot smaller than what current tests are showing -- which we could possibly do without changing the algorithm if we can somehow make it a lot cheaper, but otherwise I think the choice is between disabling the functionality altogether by default and adopting a less-expensive algorithm. Maybe someday when delta restore is in core and widely used and CPUs are faster, it'll make sense to revise the default, and that's cool, but I can't see imposing a big overhead by default to enable a feature core doesn't have yet...
Modern algorithms are amazingly fast on modern hardware; some are even implemented in hardware nowadays:

https://software.intel.com/en-us/articles/intel-sha-extensions

Quote from: https://neosmart.net/blog/2017/will-amds-ryzen-finally-bring-sha-extensions-to-intels-cpus/

"Despite the extremely limited availability of SHA extension support in modern desktop and mobile processors, crypto libraries have already upstreamed support to great effect. Botan’s SHA extension patches show a significant 3x to 5x performance boost when taking advantage of the hardware extensions, and the Linux kernel itself shipped with hardware SHA support with version 4.4, bringing a very respectable 3.6x performance upgrade over the already hardware-assisted SSE3-enabled code."

If you need to load the data from disk and shove it over a network, the hashing will certainly be very little overhead; it might even be completely invisible, since it can run in parallel with all the other things. Sure, there is the thing called zero-copy networking, but if you have to compress the data before sending it to the network, you have to put it through the CPU anyway. And if you have more than one core, the second one can do the hashing in parallel with the first one doing the compression.

To get a feeling one can use:

openssl speed md5 sha1 sha256 sha512

On my really-not-fast desktop CPU (i5-4690T CPU @ 2.50GHz) it says:

The 'numbers' are in 1000s of bytes per second processed.
type          16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5         122638.55k   277023.96k   487725.57k   630806.19k   683892.74k   688553.98k
sha1        127226.45k   313891.52k   632510.55k   865753.43k   960995.33k   977215.19k
sha256       77611.02k   173368.15k   325460.99k   412633.43k   447022.92k   448020.48k
sha512       51164.77k   205189.87k   361345.79k   543883.26k   638372.52k   645933.74k

Or in other words, it can hash nearly 931 MByte/s with SHA-1 and about 427 MByte/s with SHA-256 (if I haven't miscalculated something).
You'd need a pretty fast disk (aka M.2 SSD) and network (aka > 1 Gbit) to top these speeds, and then you'd use a real CPU for your server, not some poor Intel powersaving surfing thingy-majingy :)

Best regards,

Tels
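For comparison, numbers in the spirit of `openssl speed` above can be reproduced from Python's hashlib, which calls into the local OpenSSL; absolute figures depend entirely on the machine and build, so treat this as a measurement harness rather than a claim about relative speeds:

```python
import hashlib
import time

# Sketch: rough single-core throughput for the algorithms under
# discussion. hashlib delegates to the local OpenSSL, so results will
# differ per CPU and per build.
payload = b"\x5a" * (8 * 1024 * 1024)  # 8 MiB buffer

results = {}
for name in ("md5", "sha1", "sha256", "sha512"):
    start = time.perf_counter()
    hashlib.new(name, payload).hexdigest()
    elapsed = time.perf_counter() - start
    results[name] = len(payload) / elapsed / (1024 * 1024)  # MiB/s

for name, mib_s in sorted(results.items()):
    print(f"{name:8s} ~{mib_s:10.1f} MiB/s")
```

A single 8 MiB buffer keeps the run fast; for serious measurements one would hash a larger payload several times and take the best run.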
On 11/22/19 5:15 PM, Tels wrote:
> On 2019-11-22 20:01, Robert Haas wrote:
>> On Fri, Nov 22, 2019 at 1:10 PM David Steele <david@pgmasters.net> wrote:
>
>>>> Phrased more positively, if you want a cryptographic hash at all, you should probably use one that isn't widely viewed as too weak.
>>>
>>> Sure. There's another advantage to picking an algorithm with lower collision rates, though.
>>>
>>> CRCs are fine for catching transmission errors (as caveated above) but not as great for comparing two files for equality. With strong hashes you can confidently compare local files against the path, size, and hash stored in the manifest and save yourself a round-trip to the remote storage to grab the file if it has not changed locally.
>>
>> I agree in part. I think there are two reasons why a cryptographically strong hash is desirable for delta restore. First, since the checksums are longer, the probability of a false match happening randomly is lower, which is important. Even if the above analysis is correct and the chance of a false match is just 2^-32 with a 32-bit CRC, if you back up ten million files every day, you'll likely get a false match within a few years or less, and once is too often. Second, unlike what I supposed above, the contents of a PostgreSQL data file are not chosen at random, unlike transmission errors, which probably are more or less random. It seems somewhat possible that there is an adversary who is trying to choose the data that gets stored in some particular record so as to create a false checksum match. A CRC is a lot easier to fool than a cryptographic hash, so I think that using a CRC of *any* length for this kind of use case would be extremely dangerous no matter the probability of an accidental match.
>
> Agreed. See above.
>
> However, if you choose a hash, please do not go below SHA-256.
> Both MD5 and SHA-1 have already had collision attacks, and these are only bound to get worse.

I don't think collision attacks are a big consideration in the general case. The manifest is generally stored with the backup files, so if a file is modified it is then trivial to modify the manifest as well.

Of course, you could store the manifest separately or even just know the hash of the manifest and store that separately. In that case SHA-256 might be useful and it would be good to have the option, which I believe is the plan.

I do wonder if you could construct a successful collision attack (even in MD5) that would also result in a valid relation file. Probably, at least eventually.

Regards,
--
-David
david@pgmasters.net
Moin,

On 2019-11-22 23:30, David Steele wrote:
> On 11/22/19 5:15 PM, Tels wrote:
>> On 2019-11-22 20:01, Robert Haas wrote:
>>> On Fri, Nov 22, 2019 at 1:10 PM David Steele <david@pgmasters.net> wrote:
>>
>>>>> Phrased more positively, if you want a cryptographic hash at all, you should probably use one that isn't widely viewed as too weak.
>>>>
>>>> Sure. There's another advantage to picking an algorithm with lower collision rates, though.
>>>>
>>>> CRCs are fine for catching transmission errors (as caveated above) but not as great for comparing two files for equality. With strong hashes you can confidently compare local files against the path, size, and hash stored in the manifest and save yourself a round-trip to the remote storage to grab the file if it has not changed locally.
>>>
>>> I agree in part. I think there are two reasons why a cryptographically strong hash is desirable for delta restore. First, since the checksums are longer, the probability of a false match happening randomly is lower, which is important. Even if the above analysis is correct and the chance of a false match is just 2^-32 with a 32-bit CRC, if you back up ten million files every day, you'll likely get a false match within a few years or less, and once is too often. Second, unlike what I supposed above, the contents of a PostgreSQL data file are not chosen at random, unlike transmission errors, which probably are more or less random. It seems somewhat possible that there is an adversary who is trying to choose the data that gets stored in some particular record so as to create a false checksum match. A CRC is a lot easier to fool than a cryptographic hash, so I think that using a CRC of *any* length for this kind of use case would be extremely dangerous no matter the probability of an accidental match.
>>
>> Agreed. See above.
>> However, if you choose a hash, please do not go below SHA-256. Both MD5 and SHA-1 have already had collision attacks, and these are only bound to get worse.
>
> I don't think collision attacks are a big consideration in the general case. The manifest is generally stored with the backup files so if a file is modified it is then trivial to modify the manifest as well.

That is true. However, a simple way around this is to sign the manifest with a public key (GPG or similar). And if the manifest contains strong, hard-to-forge hashes, we get a much more secure backup, where (almost) nobody else can alter the manifest, nor can they mount easy collision attacks against the single files. Without the strong hashes it would be pointless to sign the manifest.

> Of course, you could store the manifest separately or even just know the hash of the manifest and store that separately. In that case SHA-256 might be useful and it would be good to have the option, which I believe is the plan.
>
> I do wonder if you could construct a successful collision attack (even in MD5) that would also result in a valid relation file. Probably, at least eventually.

With MD5, certainly. One way is to have two different 512-bit blocks that hash to the same MD5. It is trivial to re-use one already existing from the known examples. Here is one, where the researchers constructed 12 PDFs that all have the same MD5 hash:

https://www.win.tue.nl/hashclash/Nostradamus/

If you insert one of these blocks into a relation and dump it, you could swap it (probably?) out on disk for the other block. I'm not sure this is of practical usage as an attack, tho. It would, however, cast doubt on the integrity of the backup and prove that MD5 is useless. OTOH, finding a full collision with MD5 should also be in reach with today's hardware.
It is hard to find exact numbers, but this:

https://www.win.tue.nl/hashclash/SingleBlock/

gives the following numbers for 2008/2009:

"Finding the birthday bits took 47 hours (expected was 3 days) on the cluster of 215 Playstation 3 game consoles at LACAL, EPFL. This is roughly equivalent to 400,000 hours on a single PC core. The single near-collision block construction took 18 hours and 20 minutes on a single PC core."

Today one can probably compute it on a single GPU in mere hours. And you can rent massive amounts of them in the cloud for real cheap. Here are a few, now a bit dated, references:

https://blog.codinghorror.com/speed-hashing/
http://codahale.com/how-to-safely-store-a-password/

Best regards,

Tels
On 11/23/19 3:13 AM, Tels wrote:
>
> Without the strong hashes it would be pointless to sign the manifest.

I guess I must have missed where we are planning to add a cryptographic signature.

cheers

andrew

--
Andrew Dunstan
https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/23/19 4:34 PM, Andrew Dunstan wrote:
> On 11/23/19 3:13 AM, Tels wrote:
>>
>> Without the strong hashes it would be pointless to sign the manifest.
>
> I guess I must have missed where we are planning to add a cryptographic signature.

I don't think we were planning to, but the user could do so if they wished.

--
-David
david@pgmasters.net
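For what it's worth, a user can already do something along these lines outside of core. A sketch using a keyed HMAC rather than a public-key signature -- the manifest content and the key handling below are invented for illustration:

```python
import hashlib
import hmac

# Sketch: sealing a manifest with HMAC-SHA256 under an operator-held
# key. File name, manifest format, and key management here are all
# hypothetical; a real deployment might prefer GPG signatures.

def seal_manifest(manifest_bytes: bytes, key: bytes) -> str:
    """Return a hex MAC over the manifest. Keep the key (or the seal)
    away from the backup location, so whoever can rewrite the backup
    files cannot also rewrite the seal."""
    return hmac.new(key, manifest_bytes, hashlib.sha256).hexdigest()

def verify_manifest(manifest_bytes: bytes, key: bytes, seal: str) -> bool:
    return hmac.compare_digest(seal_manifest(manifest_bytes, key), seal)

# Usage with an in-memory example manifest:
key = b"operator-held secret"
manifest = b"base/16384/16385\t8192\t0ab1...\n"
seal = seal_manifest(manifest, key)
ok = verify_manifest(manifest, key, seal)
tampered = verify_manifest(manifest + b"x", key, seal)
```

As Tels notes, this only buys anything if the manifest itself contains strong per-file hashes; sealing a manifest of weak checksums leaves the files themselves forgeable.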
Hi Jeevan,
I have incorporated all the comments in the attached patch. Please review and let me know your thoughts.
On Thu, Nov 21, 2019 at 2:51 PM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
On Wed, Nov 20, 2019 at 11:05 AM Suraj Kharage <suraj.kharage@enterprisedb.com> wrote:
Hi,
Since now we are generating the backup manifest file with each backup, it provides us an option to validate the given backup.
Let's say we have taken a backup and, after a few days, we want to check whether that backup is valid and corruption-free without restarting the server.
Please find attached a POC patch for the same, which is based on the latest backup manifest patch from Rushabh. With this functionality, we add a new option to pg_basebackup, something like --verify-backup.
So, the syntax would be:
./bin/pg_basebackup --verify-backup -D <backup_directory_path>
Basically, we read the backup_manifest file line by line from the given directory path and build a hash table, then scan the directory and compare each file with its hash entry.
Thoughts/suggestions?
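In outline, that cross-check could look something like the following Python sketch. The tab-separated "path, size, sha256" manifest format is hypothetical (the real manifest format is still being settled), and the real tool is C; this only makes the algorithm concrete:

```python
import hashlib
import os

# Sketch of the proposed cross-check: parse a manifest, build a lookup
# table, walk the backup directory, and report extra files, missing
# files, size differences, and checksum mismatches.

def verify_backup(backup_dir, manifest_lines):
    manifest = {}
    for line in manifest_lines:
        path, size, digest = line.rstrip("\n").split("\t")
        manifest[path] = (int(size), digest)

    problems = []
    on_disk = set()
    for root, _dirs, files in os.walk(backup_dir):
        for name in files:
            full = os.path.join(root, name)
            rel = os.path.relpath(full, backup_dir)
            on_disk.add(rel)
            if rel not in manifest:
                problems.append(("extra", rel))
                continue
            size, digest = manifest[rel]
            if os.path.getsize(full) != size:
                problems.append(("size", rel))
                continue
            with open(full, "rb") as f:  # real code would read in chunks
                if hashlib.sha256(f.read()).hexdigest() != digest:
                    problems.append(("checksum", rel))
    problems.extend(("missing", rel) for rel in manifest
                    if rel not in on_disk)
    return problems
```

The size check comes before the checksum so that a truncated file is reported cheaply, without hashing it.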
I like the idea of verifying the backup once we have backup_manifest with us.
Periodically verifying the already taken backup with this simple tool becomes
easy now.
I have reviewed this patch and here are my comments:
1.
@@ -30,7 +30,9 @@
#include "common/file_perm.h"
#include "common/file_utils.h"
#include "common/logging.h"
+#include "common/sha2.h"
#include "common/string.h"
+#include "fe_utils/simple_list.h"
#include "fe_utils/recovery_gen.h"
#include "fe_utils/string_utils.h"
#include "getopt_long.h"
@@ -38,12 +40,19 @@
#include "pgtar.h"
#include "pgtime.h"
#include "pqexpbuffer.h"
+#include "pgrhash.h"
#include "receivelog.h"
#include "replication/basebackup.h"
#include "streamutil.h"
Please add the new include files in sorted order.
2.
Can hash related file names be renamed to backuphash.c and backuphash.h?
3.
Need indentation adjustments at various places.
4.
+ char buf[1000000]; // 1MB chunk
It would be good if we used a multiple of the block/page size (or at least a power-of-2 number).
5.
+typedef struct pgrhash_entry
+{
+ struct pgrhash_entry *next; /* link to next entry in same bucket */
+ DataDirectoryFileInfo *record;
+} pgrhash_entry;
+
+struct pgrhash
+{
+ unsigned nbuckets; /* number of buckets */
+ pgrhash_entry **bucket; /* pointer to hash entries */
+};
+
+typedef struct pgrhash pgrhash;
These two can be moved to the .h file instead of being redefined there.
6.
+/*
+ * TODO: this function is not necessary, can be removed.
+ * Test whether the given row number is match for the supplied keys.
+ */
+static bool
+pgrhash_compare(char *bt_filename, char *filename)
Yeah, it can be removed by doing strcmp() at the required places rather than
doing it in a separate function.
7.
mdate is not compared anywhere. I understand that the mtime of a file in the backup directory can't be compared with its manifest entry, since the manifest records the mtime of the file on the server, whereas the same file in the backup will have a different mtime. But adding a few comments there would be good.
8.
+ char mdate[24];
should be mtime instead?
Thanks
--Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
--
Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.
On 2019-11-24 15:38, David Steele wrote: > On 11/23/19 4:34 PM, Andrew Dunstan wrote: >> >> On 11/23/19 3:13 AM, Tels wrote: >>> >>> Without the strong hashes it would be pointless to sign the manifest. >>> >> >> I guess I must have missed where we are planning to add a >> cryptographic >> signature. > > I don't think we were planning to, but the user could do so if they > wished. That was what I meant. Best regards, Tels
On Fri, Nov 22, 2019 at 2:29 PM David Steele <david@pgmasters.net> wrote:
> See:
> https://www.nist.gov/system/files/documents/2017/04/26/lrdc_systems_part2_032713.pdf
> Search for "The maximum block size"

Hmm, so it says: "The maximum block size that can be protected by a 32-bit CRC is 512MB." My problem is that (1) it doesn't back this up with a citation or any kind of logical explanation and (2) it's not very clear what "protected" means. Tels replies downthread to explain that the internal state of the 32-bit CRC calculation is also limited to 32 bits, and changes once per bit, so that after processing 512MB = 2^29 bytes = 2^32 bits of data, you're guaranteed to start repeating internal states. Perhaps this is also what the NIST folks had in mind, though it's hard to know.

This link provides some more details:

https://community.arm.com/developer/tools-software/tools/f/keil-forum/17467/crc-for-256-byte-data

Not everyone on the thread agrees with everybody else, but it seems like there are size limits below which a CRC-n is guaranteed to detect all 1-bit and 2-bit errors, and above which this is no longer guaranteed. They put the limit *lower* than what NIST supposes, namely 2^(n-1)-1 bits, which would be 256MB, not 512MB, if I'm doing the math correctly. However, they also say that above that value, you are still likely to detect most errors. Absent an intelligent adversary, the chance of a random collision when corruption is present is still about 1 in 4 billion (2^-32).

To me, guaranteed detection of 1-bit and 2-bit errors (and the other kinds of specific things CRC is designed to catch) doesn't seem like a principal design consideration. It's nice if we can get it and I'm not against it, but these are algorithms that are designed to be used when data undergoes a digital-to-analog-to-digital conversion, where for example it's possible that the conversion back to digital loses sync and reads 9 bits or 7 bits rather than 8 bits.
And that's not really what we're doing here: we all know that bits get flipped sometimes, but nobody uses scp to copy a 1GB file and ends up with a file that is 1GB +/- a few bits. Some lower-level part of the communication stack is handling that part of the work; you're going to get exactly 1GB. So it seems to me that here, as with XLOG, we're not relying on the specific CRC properties that were intended to be used to catch and in some cases repair bit flips caused by wrinkles in an A-to-D conversion, but just on its general tendency to probably not match if any bits got flipped. And those properties hold regardless of input length.

That being said, having done some reading on this, I am a little concerned that we're getting further and further from the design center of the CRC algorithm. Like relation segment files, XLOG records are not packets subject to bit insertions, but at least they're small, and relation files are not. Using a 40-year-old algorithm that was intended to be used for things like making sure the modem hadn't lost framing in the last second to verify 1GB files feels, in some nebulous way, like we might be stretching. That being said, I'm not sure what we think the reasonable alternatives are. Users aren't going to be better off if we say that, because CRC-32C might not do a great job detecting errors, we're not going to check for errors at all. If we go the other way and say we're going to use some variant of SHA, they will be better off, but at the price of what looks like a *significant* hit in terms of backup time.

>> "Typically an n-bit CRC applied to a data block of arbitrary length will detect any single error burst not longer than n bits, and the fraction of all longer error bursts that it will detect is (1 − 2^−n)."
>
> I'm not sure how encouraging I find this -- a four-byte error is not a lot and 2^32 is only 4 billion. We have individual users who have backed up more than 4 billion files over the last few years.
I agree that people have a lot more than 4 billion files backed up, but I'm not sure it matters very much given the use case I'm trying to enable. There's a lot of difference between delta restore and backup integrity checking. For backup integrity checking, my goal is that, on those occasions when a file gets corrupted, the chances that we notice it has been corrupted are high. For that purpose, a 32-bit checksum is probably sufficient. If a file gets corrupted, we have about a 1-in-4-billion chance of being unable to detect it. If 4 billion files get corrupted, we'll miss, on average, one of those corruption events. That's sad, but so is the fact that you had *4 billion corrupted files*. This is not the total number of files backed up; this is the number of those that got corrupted. I don't really know how common it is to copy a file and end up with a corrupt copy, but if you say it's one-in-a-million, which I suspect is far too high, then you'd have to back up something like 4 quadrillion files before you missed a corruption event, and that's a *very* big number.

Now delta restore is a whole different kettle of fish. The birthday problem is huge here. If you've got a 32-bit checksum for file A, and you go and look it up in a database of checksums, and that database has even 1 billion things in it, you've got a pretty decent shot of latching onto a file that is not actually the same as file A. The problem goes away almost entirely if you only compare against previous versions of that file from that database cluster. You've probably only got tens or maybe at the very outside hundreds or thousands of backups of that particular file, and a collision is unlikely even with only a 32-bit checksum -- though even there maybe you'd like to use something larger just to be on the safe side. But if you're going to compare to other files from the same cluster, or even worse any file from any cluster, 32 bits is *woefully* inadequate.
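The contrast between the two use cases can be put in numbers with the standard birthday approximation (the file counts here are illustrative, not measurements):

```python
import math

# Sketch: birthday approximation behind the two use cases.

def p_collision(n_items, bits):
    """Probability that at least two of n_items uniformly random
    values collide in a space of 2**bits."""
    return 1.0 - math.exp(-n_items * (n_items - 1) / (2.0 * 2.0 ** bits))

# Integrity checking compares each file only against its own original,
# so what matters is the per-file 2**-32 miss chance, not collisions
# across the whole fleet.
per_file_miss = 2.0 ** -32

# Delta restore against a database of a billion checksums: a 32-bit
# space is effectively certain to collide, while 160 bits (SHA-1)
# leaves the probability negligible.
p32 = p_collision(10 ** 9, 32)
p160 = p_collision(10 ** 9, 160)
print(f"32-bit: {p32:.6f}   160-bit: {p160:.3e}")
```

With a billion entries the 32-bit collision probability saturates at essentially 1, while the 160-bit probability underflows to zero in double precision, which is the asymmetry described above.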
TBH even using SHA for such use cases feels a little scary to me. It's probably good enough -- 2^160 for SHA-1 is a *lot* bigger than 2^32, and 2^512 for SHA-512 is enormous. But I'd want to spend time thinking very carefully about the math before designing such a system.

> OK, I'll buy that. But I *don't* think CRCs should be allowed for
> deltas (when we have them) and I *do* think we should caveat their
> effectiveness (assuming we can agree on them).

Sounds good.

> In general the answer to faster backups should be more cores/faster
> network/faster disk, not compromising backup integrity. I understand
> we'll need to wait until we have parallelism in pg_basebackup to justify
> that answer.

I would like to dispute that characterization of what we're talking about here. If we added a 1-bit checksum (parity bit) it would be *strictly better* than what we're doing right now, which is nothing. That's not a serious proposal because it's obvious we can do a lot better for trivial additional cost, but deciding that we're going to use a weaker kind of checksum to avoid adding too much overhead is not wimping out, because it's still going to be strong enough to catch the overwhelming majority of problems that go undetected today. Even an *8-bit* checksum would give us a >99% chance of catching a corrupted file, which would be noticeably better than the 0% chance we have today. Even a manifest with no checksums at all that just checked the presence and size of files would catch tons of operator error, e.g.

- wait, that database had tablespaces?
- were those logs in pg_clog anything important?
- oh, I wasn't supposed to start postgres on the copy of the database stored in the backup directory?

So I don't think we're talking about whether to compromise backup integrity. I think we're talking about this: if we're going to make backup integrity better than it is today, how much better should we try to make it, and what are the trade-offs there?
The straw man here is that we could make the database infinitely secure if we put it in a concrete bunker and sunk it to the bottom of the ocean, with the small price that we'd no longer be able to access it either. Somewhere between that extreme and the other extreme of setting the authentication method to 0.0.0.0/0 trust there's a happy medium where security is tolerably good but ease of access isn't crippled, and the same thing applies here. We could (probably) be the first database on the planet to store a 1024-bit encrypted checksum of every 8kB block, but that seems like it's going too far in the "concrete bunker" direction. IMHO, at least, we should be aiming for something that has a high probability of catching real problems and a low probability of being super-annoying. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Nov 22, 2019 at 2:02 PM David Steele <david@pgmasters.net> wrote:
> > Except - and this gets back to the previous point - I don't want to
> > slow down backups by 40% by default. I wouldn't mind slowing them down
> > 3% by default, but 40% is too much overhead. I think we've gotta
> > either get the overhead of using SHA way down or not use SHA by default.
>
> Maybe -- my take is that the measurements, an uncompressed backup to the
> local filesystem, are not a very realistic use case.

Well, compression is a feature we don't have yet, in core. So for people who are only using core tools, an uncompressed backup is a very realistic use case, because it's the only kind they can get. Granted the situation is different if you are using pgbackrest. I don't have enough experience to know how often people back up to local filesystems vs. remote filesystems mounted locally vs. overtly over-the-network. I sometimes get the impression that users choose their backup tools and procedures with, as Tom would say, the aid of a dart board, but that's probably the cynic in me talking. Or maybe a reflection of the fact that I usually end up talking to the users for whom things have gone really, really badly wrong, rather than the ones for whom things went as planned.

> However, I'm still fine with leaving the user the option of checksums or
> no. I just wanted to point out that CRCs have their limits so maybe
> that's not a great option unless it is properly caveated and perhaps not
> the default.

I think the default is the sticking point here. To me, it looks like CRC is a better default than nothing at all because it should still catch a high percentage of issues that would otherwise be missed, and a better default than SHA because it's so cheap to compute. However, I'm certainly willing to consider other theories.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Nov 22, 2019 at 5:15 PM Tels <nospam-pg-abuse@bloodgate.com> wrote: > It is related to the number of states... Thanks for this explanation. See my reply to David where I also discuss this point. > However, if you choose a hash, please do not go below SHA-256. Both MD5 > and SHA-1 already had collision attacks, and these only got to be bound > to be worse. > > https://www.mscs.dal.ca/~selinger/md5collision/ > https://shattered.io/ Yikes, that second link, about SHA-1, is depressing. Now, it's not likely that an attacker has access to your backup repository and can spend 6500 years of CPU time to engineer a Trojan file there (maybe more, because the files are probably bigger than the PDFs they used in that case) and then induce you to restore and rely upon that backup. However, it's entirely likely that somebody is going to eventually ban SHA-1 as the attacks get better, which is going to be a problem for us whether the underlying exposures are problems or not. > It might even be a wise idea to encode the used Hash-Algorithm into the > manifest file, so it can be changed later. The hash length might be not > enough to decide which algorithm is the one used. I agree. Let's write SHA256:bc1c3a57369acd0d2183a927fb2e07acbbb1c97f317bbc3b39d93ec65b754af5 or similar rather than just the hash. That way even if the entire SHA family gets cracked, we can easily substitute in something else that hasn't been cracked yet. (It is unclear to me why anyone supposes that *any* popular hash function won't eventually be cracked. For a K-bit hash function, there are 2^K possible outputs, where K is probably in the hundreds. But there are 2^{2^33} possible 1GB files. So for every possible output value, there are 2^{2^33-K} inputs that produce that value, which is a very very big number. 
The probability that any given input produces a certain output is very low, but the number of possible inputs that produce a given output is very high; so assuming that nobody's ever going to figure out how to construct them seems optimistic.)

> To get a feeling one can use:
>
> openssl speed md5 sha1 sha256 sha512
>
> On my really-not-fast desktop CPU (i5-4690T CPU @ 2.50GHz) it says:
>
> The 'numbers' are in 1000s of bytes per second processed.
> type        16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> md5       122638.55k   277023.96k   487725.57k   630806.19k   683892.74k   688553.98k
> sha1      127226.45k   313891.52k   632510.55k   865753.43k   960995.33k   977215.19k
> sha256     77611.02k   173368.15k   325460.99k   412633.43k   447022.92k   448020.48k
> sha512     51164.77k   205189.87k   361345.79k   543883.26k   638372.52k   645933.74k
>
> Or in other words, it can hash nearly 931 MByte/s with SHA-1 and about
> 427 MByte/s with SHA-256 (if I haven't miscalculated something). You'd
> need a pretty fast disk (aka M.2 SSD) and network (aka > 1 Gbit) to top
> these speeds and then you'd use a real CPU for your server, not some
> poor Intel powersaving surfing thingy-majingy :)

I mean, how fast it is in theory doesn't matter nearly as much as what happens when you benchmark the proposed implementation, and the results we have so far don't support the theory that this is so cheap as to be negligible.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
As per the discussion on the thread, here is a patch which:
a) Makes the checksum for the manifest file optional.
b) Allows the user to choose a particular algorithm.
Currently, the WIP patch supports the SHA256 and CRC checksum
algorithms. The patch also changes the manifest file format to prepend
the name of the algorithm used to each checksum, so that the validator
can easily tell which algorithm to use.
Ex:
./db/bin/pg_basebackup -D bksha/ --manifest-with-checksums=SHA256
$ cat bksha/backup_manifest | more
PostgreSQL-Backup-Manifest-Version 1
File backup_label 226 2019-12-04 17:46:46 GMT SHA256:7cf53d1b9facca908678ab70d93a9e7460cd35cedf7891de948dcf858f8a281a
File pg_xact/0000 8192 2019-12-04 17:46:46 GMT SHA256:8d2b6cb1dc1a6e8cee763b52d75e73571fddce06eb573861d44082c7d8c03c26
./db/bin/pg_basebackup -D bkcrc/ --manifest-with-checksums=CRC
PostgreSQL-Backup-Manifest-Version 1
File backup_label 226 2019-12-04 17:58:40 GMT CRC:343138313931333134
File pg_xact/0000 8192 2019-12-04 17:46:46 GMT CRC:363538343433333133
Pending TODOs:
- Documentation update
- Code cleanup
- Testing.
I will continue working on the patch; meanwhile, feel free to provide
thoughts/input.
Thanks,
--
Rushabh Lathia
Attachment
On Wed, Dec 4, 2019 at 1:01 PM Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> As per the discussion on the thread, here is the patch which
>
> a) Make checksum for manifest file optional.
> b) Allow user to choose a particular algorithm.
>
> Currently with the WIP patch SHA256 and CRC checksum algorithm
> supported. Patch also changed the manifest file format to append
> the used algorithm name before the checksum, this way it will be
> easy to validator to know which algorithm to used.
>
> Ex:
> ./db/bin/pg_basebackup -D bksha/ --manifest-with-checksums=SHA256
>
> $ cat bksha/backup_manifest | more
> PostgreSQL-Backup-Manifest-Version 1
> File backup_label 226 2019-12-04 17:46:46 GMT SHA256:7cf53d1b9facca908678ab70d93a9e7460cd35cedf7891de948dcf858f8a281a
> File pg_xact/0000 8192 2019-12-04 17:46:46 GMT SHA256:8d2b6cb1dc1a6e8cee763b52d75e73571fddce06eb573861d44082c7d8c03c26
>
> ./db/bin/pg_basebackup -D bkcrc/ --manifest-with-checksums=CRC
> PostgreSQL-Backup-Manifest-Version 1
> File backup_label 226 2019-12-04 17:58:40 GMT CRC:343138313931333134
> File pg_xact/0000 8192 2019-12-04 17:46:46 GMT CRC:363538343433333133
>
> Pending TODOs:
> - Documentation update
> - Code cleanup
> - Testing.
>
> I will further continue to work on the patch and meanwhile feel free to provide
> thoughts/inputs.

+ initilize_manifest_checksum(&cCtx);

Spelling.

-

Spurious.

+ case MC_CRC:
+ INIT_CRC32C(cCtx->crc_ctx);

Suggest that we do CRC -> CRC32C throughout the patch. Someone might conceivably want some other CRC variant, most likely 64-bit, in the future.

+final_manifest_checksum(ChecksumCtx *cCtx, char *checksumbuf)

finalize

printf(_(" --manifest-with-checksums\n"
- " do calculate checksums for manifest files\n"));
+ " calculate checksums for manifest files using provided algorithm\n"));

Switch name is wrong. Suggest --manifest-checksums. Help usually shows that an argument is expected, e.g.
--manifest-checksums=ALGORITHM or --manifest-checksums=sha256|crc32c|none This seems to apply over some earlier version of the patch. A consolidated patch, or the whole stack, would be better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Dec 5, 2019 at 12:17 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Dec 4, 2019 at 1:01 PM Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> As per the discussion on the thread, here is the patch which
>
> a) Make checksum for manifest file optional.
> b) Allow user to choose a particular algorithm.
>
> Currently with the WIP patch SHA256 and CRC checksum algorithm
> supported. Patch also changed the manifest file format to append
> the used algorithm name before the checksum, this way it will be
> easy to validator to know which algorithm to used.
>
> Ex:
> ./db/bin/pg_basebackup -D bksha/ --manifest-with-checksums=SHA256
>
> $ cat bksha/backup_manifest | more
> PostgreSQL-Backup-Manifest-Version 1
> File backup_label 226 2019-12-04 17:46:46 GMT SHA256:7cf53d1b9facca908678ab70d93a9e7460cd35cedf7891de948dcf858f8a281a
> File pg_xact/0000 8192 2019-12-04 17:46:46 GMT SHA256:8d2b6cb1dc1a6e8cee763b52d75e73571fddce06eb573861d44082c7d8c03c26
>
> ./db/bin/pg_basebackup -D bkcrc/ --manifest-with-checksums=CRC
> PostgreSQL-Backup-Manifest-Version 1
> File backup_label 226 2019-12-04 17:58:40 GMT CRC:343138313931333134
> File pg_xact/0000 8192 2019-12-04 17:46:46 GMT CRC:363538343433333133
>
> Pending TODOs:
> - Documentation update
> - Code cleanup
> - Testing.
>
> I will further continue to work on the patch and meanwhile feel free to provide
> thoughts/inputs.
+ initilize_manifest_checksum(&cCtx);
Spelling.
Fixed.
-
Spurious.
+ case MC_CRC:
+ INIT_CRC32C(cCtx->crc_ctx);
Suggest that we do CRC -> CRC32C throughout the patch. Someone might
conceivably want some other CRC variant, mostly likely 64-bit, in the
future.
Makes sense, done.
+final_manifest_checksum(ChecksumCtx *cCtx, char *checksumbuf)
finalize
Done.
printf(_(" --manifest-with-checksums\n"
- " do calculate checksums for manifest files\n"));
+ " calculate checksums for manifest files
using provided algorithm\n"));
Switch name is wrong. Suggest --manifest-checksums.
Help usually shows that an argument is expected, e.g.
--manifest-checksums=ALGORITHM or
--manifest-checksums=sha256|crc32c|none
Fixed.
This seems to apply over some earlier version of the patch. A
consolidated patch, or the whole stack, would be better.
Here is the whole stack of patches.
Rushabh Lathia
Attachment
- 0001-Reduce-code-duplication-and-eliminate-weird-macro-tr.patch
- 0004-Documentation-for-backup-manifest-and-manifest-check.patch
- 0003-Make-checksum-optional-in-pg_basebackup-review-comme.patch
- 0002-POC-of-backup-manifest-with-file-names-sizes-timesta.patch
- 0005-Allow-user-to-choose-a-checksum-algorithm-for-manife.patch
On Thu, Dec 5, 2019 at 11:22 AM Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> Here is the whole stack of patches.

Please include proper attribution and, where somebody's written them, commit messages in each patch in the stack. For example, I see that your 0001 is mostly the same as my 0001 from upthread, but now it says:

From a3e075d5edb5031ea358e049f8cb07031fc480a3 Mon Sep 17 00:00:00 2001
From: Rushabh Lathia <rushabh.lathia@enterprisedb.com>
Date: Wed, 13 Nov 2019 15:19:22 +0530
Subject: [PATCH 1/5] Reduce code duplication and eliminate weird macro tricks.

...with no indication of who the original author was.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Dec 5, 2019 at 11:22 AM Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> Here is the whole stack of patches.

I committed 0001, as that's just refactoring and I think (hope) it's uncontroversial. I think 0002-0005 need to be squashed together (crediting all authors properly and in the appropriate order) as it's quite hard to understand right now, and that Suraj's patch to validate the backup should be included in the patch stack. It needs documentation. Also, we need, either in that patch or a separate one, TAP tests that exercise this feature. Things we should try to check:

- Plain format backups can be verified against the manifest.
- Tar format backups can be verified against the manifest after untarring (this might be a problem; not sure there's any guarantee that we have a working "tar" command available).
- Verification succeeds for all available checksum algorithms and also for no checksum algorithm (should still check which files are present, and sizes).
- If we tamper with a backup by removing a file, adding a file, or changing the size of a file, the modification is detected even without checksums.
- If we tamper with a backup by changing the contents of a file but not the size, the modification is detected if checksums are used.
- Everything above still works if there is a user-defined tablespace that contains a table.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Dec 6, 2019 at 1:44 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Dec 5, 2019 at 11:22 AM Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> Here is the whole stack of patches.
I committed 0001, as that's just refactoring and I think (hope) it's
uncontroversial. I think 0002-0005 need to be squashed together
(crediting all authors properly and in the appropriate order) as it's
quite hard to understand right now,
Please find attached a single patch; I have tried to credit all the
authors.
There is one review comment from Jeevan Chalke which is still pending:
4.
Why we need a "File" at the start of each entry as we are adding files only?
I wonder if we also need to provide a tablespace name and directory marker so
that we have "Tablespace" and "Dir" at the start.
Sorry, I am not quite sure about this; maybe Robert is the right
person to answer it.
Attachment
On Fri, Dec 6, 2019 at 1:35 AM Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> There is one review comment from Jeevan Chalke, which still pending
> to address is:
>
>> 4.
>> Why we need a "File" at the start of each entry as we are adding files only?
>> I wonder if we also need to provide a tablespace name and directory marker so
>> that we have "Tablespace" and "Dir" at the start.
>
> Sorry, I am not quite sure about this, may be Robert is right person
> to answer this.

I did it that way for extensibility. Notice that the first and last line of the manifest begin with other words, so someone parsing the manifest can identify the line type by looking just at the first word. Someone might in the future find some need to add other kinds of lines that don't exist today. "Tablespace" and "Dir" are, in fact, pretty good examples of things that someone might want to add in the future. I don't really see a clear need for either one today, although maybe somebody else will, but I think we should leave ourselves room to add such things in the future.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Dec 6, 2019 at 12:05 PM Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> Please find attached single patch and I tried to add the credit to all
> the authors.
I had a look over the patch and here are my few review comments:
1.
+ if (pg_strcasecmp(manifest_checksum_algo, "SHA256") == 0)
+ manifest_checksums = MC_SHA256;
+ else if (pg_strcasecmp(manifest_checksum_algo, "CRC32C") == 0)
+ manifest_checksums = MC_CRC32C;
+ else if (pg_strcasecmp(manifest_checksum_algo, "NONE") == 0)
+ manifest_checksums = MC_NONE;
+ else
+ ereport(ERROR,
Is NONE a valid input? I think the default is "NONE" anyway, so there is
no need to accept it as an input. It would be better to simply error out
if the input is neither "SHA256" nor "CRC32C".
I believe you have done it this way because pg_basebackup always passes
the MANIFEST_CHECKSUMS '%s' string, which will contain "NONE" if no user
input is given. But I think passing that conditionally would be better,
like we do with maxrate_clause, for example.
Well, this is just what I think; feel free to ignore it, as I don't see
any correctness issue here.
2.
+ if (manifest_checksums != MC_NONE)
+ {
+ checksumbuflen = finalize_manifest_checksum(cCtx, checksumbuf);
+ switch (manifest_checksums)
+ {
+ case MC_NONE:
+ break;
+ }
Since the switch is within the "if (manifest_checksums != MC_NONE)"
condition, I don't think we need a case for MC_NONE here. Rather, we can
use a default case to error out.
3.
+ if (manifest_checksums != MC_NONE)
+ {
+ initialize_manifest_checksum(&cCtx);
+ update_manifest_checksum(&cCtx, content, len);
+ }
@@ -1384,6 +1641,9 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf
int segmentno = 0;
char *segmentpath;
bool verify_checksum = false;
+ ChecksumCtx cCtx;
+
+ initialize_manifest_checksum(&cCtx);
I see that in a few cases you are calling initialize/update_manifest_checksum()
conditionally and in other places unconditionally. It seems that calling
them unconditionally will not cause any issues, as the switch cases inside
them return without doing anything when manifest_checksums is MC_NONE.
4.
The initialize/update/finalize_manifest_checksum() functions may be needed
by the validation patch as well, and thus I think they should not depend
on a global variable. Also, it would be good to keep them in a file that
is accessible to frontend-only code. Well, you can ignore these comments
on the grounds that this refactoring can be done by the patch adding
validation support; I have no issue with that. Since both patches are
dependent and posted on the same email chain, I thought I would note the
observation.
5.
+ switch (manifest_checksums)
+ {
+ case MC_SHA256:
+ checksumlabel = "SHA256:";
+ break;
+ case MC_CRC32C:
+ checksumlabel = "CRC32C:";
+ break;
+ case MC_NONE:
+ break;
+ }
This code in AddFileToManifest() is executed for every file for which we
add an entry. However, the checksumlabel is going to remain the same
throughout. Can it be set just once and then used as is?
6.
Can we avoid declaring manifest_checksums as a global variable?
I think that would require passing it to every function, and thus changing
the signatures of various functions. Currently, we pass "StringInfo
manifest" to all the required functions; would it be better to pass a
struct variable instead? The struct could have members like "StringInfo
manifest", the checksum type (manifest_checksums), the checksum label,
etc.
Thanks
--
Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Thanks Jeevan for reviewing the patch and offline discussion.
On Mon, Dec 9, 2019 at 11:15 AM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
I had a look over the patch and here are my few review comments:
1.
+ if (pg_strcasecmp(manifest_checksum_algo, "SHA256") == 0)
+ manifest_checksums = MC_SHA256;
+ else if (pg_strcasecmp(manifest_checksum_algo, "CRC32C") == 0)
+ manifest_checksums = MC_CRC32C;
+ else if (pg_strcasecmp(manifest_checksum_algo, "NONE") == 0)
+ manifest_checksums = MC_NONE;
+ else
+ ereport(ERROR,
Is NONE is a valid input? I think the default is "NONE" only and thus no need
of this as an input. It will be better if we simply error out if input is
neither "SHA256" nor "CRC32C".
I believe you have done this way as from pg_basebackup you are always passing
MANIFEST_CHECKSUMS '%s' string which will have "NONE" if no user input is
given. But I think passing that conditional will be better like we have
maxrate_clause for example.
Well, this is what I think, feel free to ignore as I don't see any correctness
issue over here.
I would still keep NONE, as it looks cleaner among the given checksum
options.
2.
+ if (manifest_checksums != MC_NONE)
+ {
+ checksumbuflen = finalize_manifest_checksum(cCtx, checksumbuf);
+ switch (manifest_checksums)
+ {
+ case MC_NONE:
+ break;
+ }
Since switch case is within "if (manifest_checksums != MC_NONE)" condition,
I don't think we need a case for MC_NONE here. Rather we can use a default
case to error out.
Yeah, with the new patch we no longer have this part of the code.
3.
+ if (manifest_checksums != MC_NONE)
+ {
+ initialize_manifest_checksum(&cCtx);
+ update_manifest_checksum(&cCtx, content, len);
+ }
@@ -1384,6 +1641,9 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf
int segmentno = 0;
char *segmentpath;
bool verify_checksum = false;
+ ChecksumCtx cCtx;
+
+ initialize_manifest_checksum(&cCtx);
I see that in a few cases you are calling initialize/update_manifest_checksum()
conditional and at some other places call is unconditional. It seems like
calling unconditional will not have any issues as switch cases inside them
return doing nothing when manifest_checksums is MC_NONE.
Fixed.
4.
initialize/update/finalize_manifest_checksum() functions may be needed by the
validation patch as well. And thus I think these functions should not depend
on a global variable as such. Also, it will be good if we keep them in a file
that is accessible to frontend-only code. Well, you can ignore these comments
with the argument saying that this refactoring can be done by the patch adding
validation support. I have no issues. Since both the patches are dependent and
posted on the same email chain, thought of putting that observation.
Makes sense; I changed those APIs so that they don't have to access
the global.
5.
+ switch (manifest_checksums)
+ {
+ case MC_SHA256:
+ checksumlabel = "SHA256:";
+ break;
+ case MC_CRC32C:
+ checksumlabel = "CRC32C:";
+ break;
+ case MC_NONE:
+ break;
+ }
This code in AddFileToManifest() is executed for every file for which we are
adding an entry. However, the checksumlabel will be going to remain the same
throughout. Can it be set just once and then used as is?
Yeah, with the attached patch we no longer have this part of the code.
6.
Can we avoid manifest_checksums from declaring it as a global variable?
I think for that, we need to pass that to every function and thus need to
change the function signature of various functions. Currently, we pass
"StringInfo manifest" to all the required function, will it better to pass
the struct variable instead? A struct may have members like,
"StringInfo manifest" in it, checksum type (manifest_checksums),
checksum label, etc.
I agree. Earlier I was not sure about this because it would require
exposing a data structure. But that is what I tried in the attached patch:
I introduced a new data structure, defined in basebackup.h, and passed it
through the functions so that individual members don't have to be passed
separately. I also removed the global manifest_checksum and added it to
the newly introduced structure.
Attaching the patch, which needs to be applied on top of the earlier 0001 patch.
Thanks,
Rushabh Lathia
Attachment
On Mon, Dec 9, 2019 at 2:52 PM Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
Thanks Jeevan for reviewing the patch and offline discussion.

On Mon, Dec 9, 2019 at 11:15 AM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
On Fri, Dec 6, 2019 at 12:05 PM Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
On Fri, Dec 6, 2019 at 1:44 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Dec 5, 2019 at 11:22 AM Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> Here is the whole stack of patches.
I committed 0001, as that's just refactoring and I think (hope) it's
uncontroversial. I think 0002-0005 need to be squashed together
(crediting all authors properly and in the appropriate order) as it's
quite hard to understand right now.

Please find attached a single patch, and I tried to add the credit to all
the authors.
I had a look over the patch and here are my few review comments:
1.
+ if (pg_strcasecmp(manifest_checksum_algo, "SHA256") == 0)
+ manifest_checksums = MC_SHA256;
+ else if (pg_strcasecmp(manifest_checksum_algo, "CRC32C") == 0)
+ manifest_checksums = MC_CRC32C;
+ else if (pg_strcasecmp(manifest_checksum_algo, "NONE") == 0)
+ manifest_checksums = MC_NONE;
+ else
+ ereport(ERROR,
Is NONE a valid input? I think the default is "NONE" only and thus no need
of this as an input. It will be better if we simply error out if input is
neither "SHA256" nor "CRC32C".
I believe you have done this way as from pg_basebackup you are always passing
MANIFEST_CHECKSUMS '%s' string which will have "NONE" if no user input is
given. But I think passing that conditional will be better like we have
maxrate_clause for example.
Well, this is what I think, feel free to ignore as I don't see any correctness
issue over here.

I would still keep this NONE, as it looks cleaner in the set of options
offered for the checksums.
2.
+ if (manifest_checksums != MC_NONE)
+ {
+ checksumbuflen = finalize_manifest_checksum(cCtx, checksumbuf);
+ switch (manifest_checksums)
+ {
+ case MC_NONE:
+ break;
+ }
Since switch case is within "if (manifest_checksums != MC_NONE)" condition,
I don't think we need a case for MC_NONE here. Rather we can use a default
case to error out.

Yeah, with the new patch we don't have this part of the code.
3.
+ if (manifest_checksums != MC_NONE)
+ {
+ initialize_manifest_checksum(&cCtx);
+ update_manifest_checksum(&cCtx, content, len);
+ }
@@ -1384,6 +1641,9 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf
int segmentno = 0;
char *segmentpath;
bool verify_checksum = false;
+ ChecksumCtx cCtx;
+
+ initialize_manifest_checksum(&cCtx);
I see that in a few cases you are calling initialize/update_manifest_checksum()
conditional and at some other places call is unconditional. It seems like
calling unconditional will not have any issues as switch cases inside them
return doing nothing when manifest_checksums is MC_NONE.

Fixed.
4.
initialize/update/finalize_manifest_checksum() functions may be needed by the
validation patch as well. And thus I think these functions should not depend
on a global variable as such. Also, it will be good if we keep them in a file
that is accessible to frontend-only code. Well, you can ignore these comments
with the argument saying that this refactoring can be done by the patch adding
validation support. I have no issues. Since both the patches are dependent and
posted on the same email chain, thought of putting that observation.

Makes sense; I changed those APIs so that they don't have to access the global.
5.
+ switch (manifest_checksums)
+ {
+ case MC_SHA256:
+ checksumlabel = "SHA256:";
+ break;
+ case MC_CRC32C:
+ checksumlabel = "CRC32C:";
+ break;
+ case MC_NONE:
+ break;
+ }
This code in AddFileToManifest() is executed for every file for which we are
adding an entry. However, the checksumlabel will be going to remain the same
throughout. Can it be set just once and then used as is?

Yeah, with the attached patch we no longer have this part of the code.
6.
Can we avoid manifest_checksums from declaring it as a global variable?
I think for that, we need to pass that to every function and thus need to
change the function signature of various functions. Currently, we pass
"StringInfo manifest" to all the required functions; would it be better to pass
the struct variable instead? A struct may have members like,
"StringInfo manifest" in it, checksum type (manifest_checksums),
checksum label, etc.I agree. Earlier I was not sure about this because that require data structureto expose. But in the given attached patch that's what I tried, introduced newdata structure and defined in basebackup.h and passed the same through thefunction so that doesn't require to pass an individual members. Also removedglobal manifest_checksum and added the same in the newly introduced structure.Attaching the patch, which need to apply on the top of earlier 0001 patch.
Attaching another version of the 0002 patch, as my colleague Jeevan Chalke
pointed out a few indentation problems in the 0002 patch which I sent earlier.
Fixed the same in the latest patch.
--
Rushabh Lathia
Attachment
On Tue, Dec 10, 2019 at 3:29 PM Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
Attaching another version of the 0002 patch, as my colleague Jeevan Chalke pointed out a few indentation problems in the 0002 patch which I sent earlier. Fixed the same in the latest patch.
I had a look over the new patch and see no issues. Looks good to me.
Thanks for quickly fixing the review comments posted earlier.
However, here are the minor comments:
1.
@@ -122,6 +133,7 @@ static long long int total_checksum_failures;
/* Do not verify checksums. */
static bool noverify_checksums = false;
+
/*
* The contents of these directories are removed or recreated during server
* start so they are not included in backups. The directories themselves are
Please remove this unnecessary change.
Need to run pgindent on the patch.
Thanks
Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Hi,
Please find attached patch for backup validator implementation (0004 patch). This patch is based
on Rushabh's latest patch for backup manifest.
There are some functions required at client side as well, so I have moved those functions
and some data structure at common place so that they can be accessible for both. (0003 patch).
My colleague Rajkumar Raghuwanshi has prepared the WIP patch (0005) for tap test cases which
is also attached. As of now, test cases related to the tablespace and tar backup format are missing,
will continue work on same and submit the complete patch.
With this mail, I have attached the complete patch stack for backup manifest and backup
validate implementation.
Please let me know your thoughts on the same.
On Fri, Dec 6, 2019 at 1:44 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Dec 5, 2019 at 11:22 AM Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> Here is the whole stack of patches.
I committed 0001, as that's just refactoring and I think (hope) it's
uncontroversial. I think 0002-0005 need to be squashed together
(crediting all authors properly and in the appropriate order) as it's
quite hard to understand right now, and that Suraj's patch to validate
the backup should be included in the patch stack. It needs
documentation. Also, we need, either in that patch or a separate, TAP
tests that exercise this feature. Things we should try to check:
- Plain format backups can be verified against the manifest.
- Tar format backups can be verified against the manifest after
untarring (this might be a problem; not sure there's any guarantee
that we have a working "tar" command available).
- Verification succeeds for all available checksums algorithms and
also for no checksum algorithm (should still check which files are
present, and sizes).
- If we tamper with a backup by removing a file, adding a file, or
changing the size of a file, the modification is detected even without
checksums.
- If we tamper with a backup by changing the contents of a file but
not the size, the modification is detected if checksums are used.
- Everything above still works if there is user-defined tablespace
that contains a table.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.
Attachment
On Tue, Dec 10, 2019 at 6:40 AM Suraj Kharage <suraj.kharage@enterprisedb.com> wrote:
> Please find attached patch for backup validator implementation (0004 patch). This patch is based
> on Rushabh's latest patch for backup manifest.
>
> There are some functions required at client side as well, so I have moved those functions
> and some data structure at common place so that they can be accessible for both. (0003 patch).
>
> My colleague Rajkumar Raghuwanshi has prepared the WIP patch (0005) for tap test cases which
> is also attached. As of now, test cases related to the tablespace and tar backup format are missing,
> will continue work on same and submit the complete patch.
>
> With this mail, I have attached the complete patch stack for backup manifest and backup
> validate implementation.
>
> Please let me know your thoughts on the same.

Well, for the second time on this thread, please don't take a bunch of somebody else's code and post it in a patch that doesn't attribute that person as one of the authors. For the second time on this thread, the person is me, but don't borrow *anyone's* code without proper attribution. It's really important!

On a related note, it's a very good idea to use git format-patch and git rebase -i to maintain patch stacks like this. Rushabh seems to have done that, but the files you're posting look like raw 'git diff' output. Notice that this gives him a way to include authorship information and a tentative commit message in each patch, but you don't have any of that.

Also on a related note, part of the process of adapting existing code to a new purpose is adapting the comments. You haven't done that:

+ * Search a result-set hash table for a row matching a given filename.
...
+ * Insert a row into a result-set hash table, provided no such row is already
...
+ * Most of the values
+ * that we're hashing are short integers formatted as text, so there
+ * shouldn't be much room for pathological input.
I think that what we should actually do here is try to use simplehash. Right now, it won't work for frontend code, but I posted some patches to try to address that issue:

https://www.postgresql.org/message-id/CA%2BTgmob8oyh02NrZW%3DxCScB%2B5GyJ-jVowE3%2BTWTUmPF%3DFsGWTA%40mail.gmail.com

That would have a few advantages. One, we wouldn't need to know the number of elements in advance, because simplehash can grow dynamically. Two, we could use the iteration interfaces to walk the hash table. Your solution to that is pgrhash_seq_search, but that's actually not well-designed, because it's not a generic iterator function but something that knows specifically about the 'touch' flag. I incidentally suggest renaming 'touch' to 'matched'; 'touch' is not bad, but I think 'matched' will be a little more recognizable.

Please run pgindent. If needed, first add locally defined types to typedefs.list, so that things indent properly.

It's not a crazy idea to try to share some data structures and code between the frontend and the backend here, but I think src/common/backup.c and src/include/common/backup.h are far too generic names given what the code is actually doing. It's mostly about checksums, not backup, and I think it should be named accordingly. I suggest removing "manifestinfo" and renaming the rest to just talk about checksums rather than manifests. That would make it logical to reuse this for any other future code that needs a configurable checksum type. Also, how about adding a function like:

extern bool parse_checksum_algorithm(char *name, ChecksumAlgorithm *algo);

...which would return true and set *algo if name is recognized, and return false otherwise. That code could be used on both the client and server sides of this patch, and by any future patches that want to reuse this scaffolding.

The file header for backup.h has the wrong filename (string.h). The header format looks somewhat atypical compared to what we normally do, too.
It's arguable, but I tend to think that it would be better to hex-encode the CRC rather than printing it as an integer. Maybe hex_encode() is another thing that could be moved into the new src/common file.

As I said before about Rushabh's patch set, it's very confusing that we have so many patches here stacked up. Like, you have 0002 moving stuff, and then 0003 moving it again. That's super-confusing. Please try to structure the patch set so as to make it as easy to review as possible.

Regarding the test case patch, error checks are important! Don't do things like this:

+open my $modify_file_sha256, '>>', "$tempdir/backup_verify/postgresql.conf";
+print $modify_file_sha256 "port = 5555\n";
+close $modify_file_sha256;

If the open fails, then it and the print and the close are going to silently do nothing. That's bad. I don't know exactly what the customary error-checking is for things like this in TAP tests, but I hope it's not like this, because this has a pretty fair chance of looking like it's testing something that it isn't. Let's figure out what the best practice in this area is and adhere to it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thanks, Robert for the review.
On Wed, Dec 11, 2019 at 1:10 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Dec 10, 2019 at 6:40 AM Suraj Kharage
<suraj.kharage@enterprisedb.com> wrote:
> Please find attached patch for backup validator implementation (0004 patch). This patch is based
> on Rushabh's latest patch for backup manifest.
>
> There are some functions required at client side as well, so I have moved those functions
> and some data structure at common place so that they can be accessible for both. (0003 patch).
>
> My colleague Rajkumar Raghuwanshi has prepared the WIP patch (0005) for tap test cases which
> is also attached. As of now, test cases related to the tablespace and tar backup format are missing,
> will continue work on same and submit the complete patch.
>
> With this mail, I have attached the complete patch stack for backup manifest and backup
> validate implementation.
>
> Please let me know your thoughts on the same.
Well, for the second time on this thread, please don't take a bunch of
somebody else's code and post it in a patch that doesn't attribute
that person as one of the authors. For the second time on this thread,
the person is me, but don't borrow *anyone's* code without proper
attribution. It's really important!
On a related note, it's a very good idea to use git format-patch and
git rebase -i to maintain patch stacks like this. Rushabh seems to
have done that, but the files you're posting look like raw 'git diff'
output. Notice that this gives him a way to include authorship
information and a tentative commit message in each patch, but you
don't have any of that.
Sorry, I have corrected this in the attached v2 patch set.
Also on a related note, part of the process of adapting existing code
to a new purpose is adapting the comments. You haven't done that:
+ * Search a result-set hash table for a row matching a given filename.
...
+ * Insert a row into a result-set hash table, provided no such row is already
...
+ * Most of the values
+ * that we're hashing are short integers formatted as text, so there
+ * shouldn't be much room for pathological input.
Corrected in v2 patch.
I think that what we should actually do here is try to use simplehash.
Right now, it won't work for frontend code, but I posted some patches
to try to address that issue:
https://www.postgresql.org/message-id/CA%2BTgmob8oyh02NrZW%3DxCScB%2B5GyJ-jVowE3%2BTWTUmPF%3DFsGWTA%40mail.gmail.com
That would have a few advantages. One, we wouldn't need to know the
number of elements in advance, because simplehash can grow
dynamically. Two, we could use the iteration interfaces to walk the
hash table. Your solution to that is pgrhash_seq_search, but that's
actually not well-designed, because it's not a generic iterator
function but something that knows specifically about the 'touch' flag.
I incidentally suggest renaming 'touch' to 'matched'; 'touch' is not
bad, but I think 'matched' will be a little more recognizable.
Thanks for the suggestion. Will try to implement the same and update accordingly.
I am assuming that I need to build the patch based on the changes that you proposed on the mentioned thread.
Please run pgindent. If needed, first add locally defined types to
typedefs.list, so that things indent properly.
It's not a crazy idea to try to share some data structures and code
between the frontend and the backend here, but I think
src/common/backup.c and src/include/common/backup.h are far too
generic names given what the code is actually doing. It's mostly about
checksums, not backup, and I think it should be named accordingly. I
suggest removing "manifestinfo" and renaming the rest to just talk
about checksums rather than manifests. That would make it logical to
reuse this for any other future code that needs a configurable
checksum type. Also, how about adding a function like:
extern bool parse_checksum_algorithm(char *name, ChecksumAlgorithm *algo);
...which would return true and set *algo if name is recognized, and
return false otherwise. That code could be used on both the client and
server sides of this patch, and by any future patches that want to
reuse this scaffolding.
Corrected the filename and implemented the function as suggested.
The file header for backup.h has the wrong filename (string.h). The
header format looks somewhat atypical compared to what we normally do,
too.
My bad, corrected the header format as well.
It's arguable, but I tend to think that it would be better to
hex-encode the CRC rather than printing it as an integer. Maybe
hex_encode() is another thing that could be moved into the new
src/common file.
We are already encoding the CRC checksum as well. Please let me know if I misunderstood anything.
Moved hex_encode into src/common.
As I said before about Rushabh's patch set, it's very confusing that
we have so many patches here stacked up. Like, you have 0002 moving
stuff, and then 0003 moving it again. That's super-confusing. Please
try to structure the patch set so as to make it as easy to review as
possible.
Sorry for the confusion. I have squashed 0001 to 0003 patches in one patch.
Regarding the test case patch, error checks are important! Don't do
things like this:
+open my $modify_file_sha256, '>>', "$tempdir/backup_verify/postgresql.conf";
+print $modify_file_sha256 "port = 5555\n";
+close $modify_file_sha256;
If the open fails, then it and the print and the close are going to
silently do nothing. That's bad. I don't know exactly what the
customary error-checking is for things like this in TAP tests, but I
hope it's not like this, because this has a pretty fair chance of
looking like it's testing something that it isn't. Let's figure out
what the best practice in this area is and adhere to it.
Rajkumar has fixed this, please find attached 0003 patch for same.
Please find attached v2 set patches.
TODO: will implement the simplehash as suggested.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.
Attachment
Hi,
I think that what we should actually do here is try to use simplehash.
Right now, it won't work for frontend code, but I posted some patches
to try to address that issue:
https://www.postgresql.org/message-id/CA%2BTgmob8oyh02NrZW%3DxCScB%2B5GyJ-jVowE3%2BTWTUmPF%3DFsGWTA%40mail.gmail.com
That would have a few advantages. One, we wouldn't need to know the
number of elements in advance, because simplehash can grow
dynamically. Two, we could use the iteration interfaces to walk the
hash table. Your solution to that is pgrhash_seq_search, but that's
actually not well-designed, because it's not a generic iterator
function but something that knows specifically about the 'touch' flag.
I incidentally suggest renaming 'touch' to 'matched'; 'touch' is not
bad, but I think 'matched' will be a little more recognizable.

Thanks for the suggestion. Will try to implement the same and update accordingly. I am assuming that I need to build the patch based on the changes that you proposed on the mentioned thread.
I have implemented the simplehash in backup validator patch as Robert suggested. Please find attached 0002 patch for the same.
kindly review and let me know your thoughts.
Also attached the remaining patches. 0001 and 0003 are same as v2, only patch version is bumped.
--
Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.
Attachment
On Tue, Dec 17, 2019 at 12:54 AM Suraj Kharage <suraj.kharage@enterprisedb.com> wrote:
> I have implemented the simplehash in backup validator patch as Robert suggested. Please find attached 0002 patch for the same.
>
> kindly review and let me know your thoughts.

+#define CHECKSUM_LENGTH 256

This seems wrong. Not all checksums are the same length, and none of the ones we're using are 256 bytes in length, and if we've got to have a constant someplace for the maximum checksum length, it should probably be in the new header file, not here. But I don't think we should need this in the first place; see comments below about how to revise the parsing of the manifest file.

+ char filetype[10];

A mysterious 10-byte field with no comments explaining what it means... and the same magic number 10 appears in at least one other place in the patch.

+typedef struct manifesthash_hash *hashtab;

This declares a new *type* called hashtab, not a variable called hashtab. The new type is not used anywhere, but later, you have several variables of the same type that have this name. Just remove this: it's wrong and unused.

+static enum ChecksumAlgorithm checksum_type = MC_NONE;

Remove "enum". Not needed, because you have a typedef for it in the header, and not per style.

+static manifesthash_hash *create_manifest_hash(char manifest_path[MAXPGPATH]);

Whitespace is wrong. The whole patch needs a visit from pgindent with a properly-updated typedefs.list. Also, you will struggle to find anywhere else in the code base where we pass a character array as a function argument. I don't know why this isn't just char *.

+ if(verify_backup)

Whitespace wrong here, too.

+ * Read the backup_manifest file and generate the hash table, then scan data
+ * directroy and verify each file. Finally, iterate on hash table to find
+ * out missing files.

You've got a word spelled wrong here, but the bigger problem is that this comment doesn't actually describe what this function is trying to do.
Instead, it describes how it does it. If it's necessary to explain what steps the function takes in order to accomplish some goal, you should comment individual bits of code in the function. The header comment is a high-level overview, not a description of the algorithm.

It's also pretty unhelpful, here and elsewhere, to refer to "the hash table" as if there were only one, and as if the reader were supposed to know something about it when you haven't told them anything about it.

+ if (!entry->matched)
+ {
+ pg_log_info("missing file: %s", entry->filename);
+ }
+

The braces here are not project style. We usually omit braces when only a single line of code is present.

I think some work needs to be done to standardize and improve the messages that get produced here. You have:

1. missing file: %s
2. duplicate file present: %s
3. size changed for file: %s, original size: %d, current size: %zu
4. checksum difference for file: %s
5. extra file found: %s

I suggest:

1. file \"%s\" is present in manifest but missing from the backup
2. file \"%s\" has multiple manifest entries
(this one should probably be pg_log_error(), not pg_log_info(), as it represents a corrupt-manifest problem)
3. file \"%s\" has size %lu in manifest but size %lu in backup
4. file \"%s\" has checksum %s in manifest but checksum %s in backup
5. file \"%s\" is present in backup but not in manifest

Your patch actually doesn't compile on my system, because for the third message above, it uses %zu to print the size. But %zu is for size_t, not off_t. I went looking for other places in the code where we print off_t; based on that, I think the right thing to do is to print it using %lu and write (unsigned long) st.st_size.

+ char file_checksum[256];
+ char header[1024];

More arbitrary constants.

+ if (!file)
+ {
+ pg_log_error("could not open backup_manifest");

That's bad error reporting. See e.g. readfile() in initdb.c.
+ if (fscanf(file, "%1023[^\n]\n", header) != 1)
+ {
+ pg_log_error("error while reading the header from backup_manifest");

That's also bad error reporting. It is only a slight step up from "ERROR: error". And we have another magic number (1023).

+ appendPQExpBufferStr(manifest, header);
+ appendPQExpBufferStr(manifest, "\n");
...
+ appendPQExpBuffer(manifest, "File\t%s\t%d\t%s\t%s\n", filename,
+ filesize, mtime, checksum_with_type);

This whole thing seems completely crazy to me. Basically, you're trying to use fscanf() to parse the file. But then, because fscanf() doesn't give you the original bytes back, you're trying to reassemble the data that you parsed to recover the original line, so that you can stuff it in the buffer and eventually checksum it. However, that's highly error-prone. You're basically duplicating server code, and thus risking getting out of sync with the server code, to work around a problem that is entirely self-inflicted, namely, deciding to use fscanf(). What I would recommend is:

1. Use open(), read(), close() rather than the fopen() family of functions. As we have discovered elsewhere, fread() doesn't promise to set errno, so we can't necessarily get reliable error-reporting out of it.

2. Before you start reading the file, create a buffer that's large enough to hold the whole thing, by using fstat() to figure out how big the file is. Read the whole file into that buffer. If you're not able to read the whole file -- i.e. open() or read() or close() fail -- then just error out and exit.

3. Now advance through the file line by line. Write a function that knows how to search forward for the next \r or \n but with checks to make sure it can't run off the end of the buffer, and use that to locate the end of each line so that you can walk forward. As you walk forward line by line, add the line you just processed to the checksum. That way, you only need a single pass over the data. Also, you can modify it in place. More on that below.

4.
As you examine each line, start by examining the first word. You'll need a function that finds the first word by searching forward for a tab character, but not beyond the end of the line. The first word of the first line should be PostgreSQL-Backup-Manifest-Version and the second word should be 1. Then on each subsequent line check whether the first word is File or Manifest-Checksum or something else, erroring out in the last case. If it's Manifest-Checksum, verify that this is the last line of the file and that the checksum matches. If it's File, break the line into fields so you can add it to the hash table. You'll want a pointer to the filename and a pointer to the checksum, and you'll want to parse the size as an integer. Instead of allocating new memory for those fields, just overwrite the character that follows the field with a \0. There must be one - either \t or \n - so you shouldn't run off the end of the buffer.

If you do this, a bunch of the fixed-size buffers you have right now go away. You don't need the variable filetype[10] any more, or checksum_with_type[CHECKSUM_LENGTH], or checksum[CHECKSUM_LENGTH], or the character arrays inside DataDirectoryFileInfo. Instead you can just have pointers into the buffer that contains the file. And you don't need this code to back up using fseek() and reread the lines, either.

Also read this article:

https://stackoverflow.com/questions/2430303/disadvantages-of-scanf

Note that the very first point in the article talks about the problem of overrunning the buffer, which you certainly have in the current code right here:

+ if (fscanf(file, "%s\t%s\t%d\t%23[^\t] %s\n", filetype, filename,

filetype is declared as char[10], but %s could read arbitrarily much data.

+ filename = (char*) pg_malloc(MAXPGPATH);

pg_malloc returns void *, so no cast is required.
+ if (strcmp(checksum_with_type, "-") == 0)
+ {
+ checksum_type = MC_NONE;
+ }
+ else
+ {
+ if (strncmp(checksum_with_type, "SHA256", 6) == 0)

Use parse_checksum_algorithm. Right now you've invented a "common" function with 1 caller, but I explicitly suggested previously that you put it in common so that you could reuse it.

+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0 ||
+ strcmp(de->d_name, "pg_wal") == 0)
+ continue;

Ignoring pg_wal at the top level might be OK, but this will ignore a pg_wal entry anywhere in the directory tree.

+ /* Skip backup manifest file. */
+ if (strcmp(de->d_name, "backup_manifest") == 0)
+ return;

Same problem.

+ filename = createPQExpBuffer();
+ if (!filename)
+ {
+ pg_log_error("out of memory");
+ exit(1);
+ }
+
+ appendPQExpBuffer(filename, "%s%s", relative_path, de->d_name);

Just use char filename[MAXPGPATH] and snprintf here, as you do elsewhere. It will be simpler and save memory.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thank you for review comments.
On Thu, Dec 19, 2019 at 2:54 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Dec 17, 2019 at 12:54 AM Suraj Kharage
<suraj.kharage@enterprisedb.com> wrote:
> I have implemented the simplehash in backup validator patch as Robert suggested. Please find attached 0002 patch for the same.
>
> kindly review and let me know your thoughts.
+#define CHECKSUM_LENGTH 256
This seems wrong. Not all checksums are the same length, and none of
the ones we're using are 256 bytes in length, and if we've got to have
a constant someplace for the maximum checksum length, it should
probably be in the new header file, not here. But I don't think we
should need this in the first place; see comments below about how to
revise the parsing of the manifest file.
I agree. Removed.
+ char filetype[10];
A mysterious 10-byte field with no comments explaining what it
means... and the same magic number 10 appears in at least one other
place in the patch.
With the current logic, we don't need this anymore.
I have removed filetype from the structure, as we are not doing any comparison on it anywhere.
+typedef struct manifesthash_hash *hashtab;
This declares a new *type* called hashtab, not a variable called
hashtab. The new type is not used anywhere, but later, you have
several variables of the same type that have this name. Just remove
this: it's wrong and unused.
corrected.
+static enum ChecksumAlgorithm checksum_type = MC_NONE;
Remove "enum". Not needed, because you have a typedef for it in the
header, and not per style.
corrected.
+static manifesthash_hash *create_manifest_hash(char manifest_path[MAXPGPATH]);
Whitespace is wrong. The whole patch needs a visit from pgindent with
a properly-updated typedefs.list.
Also, you will struggle to find anywhere else in the code base where
we pass a character array as a function argument. I don't know why this
isn't just char *.
Corrected.
+ if(verify_backup)
Whitespace wrong here, too.
Fixed
It's also pretty unhelpful, here and elsewhere, to refer to "the hash
table" as if there were only one, and as if the reader were supposed
to know something about it when you haven't told them anything about
it.
+ if (!entry->matched)
+ {
+ pg_log_info("missing file: %s", entry->filename);
+ }
+
The braces here are not project style. We usually omit braces when
only a single line of code is present.
fixed
I think some work needs to be done to standardize and improve the
messages that get produced here. You have:
1. missing file: %s
2. duplicate file present: %s
3. size changed for file: %s, original size: %d, current size: %zu
4. checksum difference for file: %s
5. extra file found: %s
I suggest:
1. file \"%s\" is present in manifest but missing from the backup
2. file \"%s\" has multiple manifest entries
(this one should probably be pg_log_error(), not pg_log_info(), as it
represents a corrupt-manifest problem)
3. file \"%s\" has size %lu in manifest but size %lu in backup
4. file \"%s\" has checksum %s in manifest but checksum %s in backup
5. file \"%s\" is present in backup but not in manifest
Corrected.
Your patch actually doesn't compile on my system, because for the
third message above, it uses %zu to print the size. But %zu is for
size_t, not off_t. I went looking for other places in the code where
we print off_t; based on that, I think the right thing to do is to
print it using %lu and write (unsigned long) st.st_size.
Corrected.
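The suggested cast, as a tiny standalone sketch (format_size is a made-up helper for illustration; the real code would do this inside a pg_log_* call):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Format an off_t by widening it to unsigned long, the portable
 * idiom suggested in the review for printing file sizes.
 */
static void
format_size(char *dst, size_t dstlen, off_t size)
{
	snprintf(dst, dstlen, "size %lu", (unsigned long) size);
}
```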
+ char file_checksum[256];
+ char header[1024];
More arbitrary constants.
+ if (!file)
+ {
+ pg_log_error("could not open backup_manifest");
That's bad error reporting. See e.g. readfile() in initdb.c.
Corrected.
+ if (fscanf(file, "%1023[^\n]\n", header) != 1)
+ {
+ pg_log_error("error while reading the header from backup_manifest");
That's also bad error reporting. It is only a slight step up from
"ERROR: error".
And we have another magic number (1023).
With the current logic, we don't need this anymore.
+ appendPQExpBufferStr(manifest, header);
+ appendPQExpBufferStr(manifest, "\n");
...
+ appendPQExpBuffer(manifest, "File\t%s\t%d\t%s\t%s\n", filename,
+ filesize, mtime, checksum_with_type);
This whole thing seems completely crazy to me. Basically, you're
trying to use fscanf() to parse the file. But then, because fscanf()
doesn't give you the original bytes back, you're trying to reassemble
the data that you parsed to recover the original line, so that you can
stuff it in the buffer and eventually checksum it. However, that's
highly error-prone. You're basically duplicating server code, and thus
risking getting out of sync with the server code, to work around a
problem that is entirely self-inflicted, namely, deciding to use
fscanf().
What I would recommend is:
1. Use open(), read(), close() rather than the fopen() family of
functions. As we have discovered elsewhere, fread() doesn't promise to
set errno, so we can't necessarily get reliable error-reporting out of
it.
2. Before you start reading the file, create a buffer that's large
enough to hold the whole thing, by using fstat() to figure out how big
the file is. Read the whole file into that buffer. If you're not able
to read the whole file -- i.e. open() or read() or close() fail --
then just error out and exit.
3. Now advance through the file line by line. Write a function that
knows how to search forward for the next \r or \n but with checks to
make sure it can't run off the end of the buffer, and use that to
locate the end of each line so that you can walk forward. As you walk
forward line by line, add the line you just processed to the checksum.
That way, you only need a single pass over the data. Also, you can
modify it in place. More on that below.
4. As you examine each line, start by examining the first word. You'll
need a function that finds the first word by searching forward for a
tab character, but not beyond the end of the line. The first word of
the first line should be PostgreSQL-Backup-Manifest-Version and the
second word should be 1. Then on each subsequent line check whether
the first word is File or Manifest-Checksum or something else,
erroring out in the last case. If it's Manifest-Checksum, verify that
this is the last line of the file and that the checksum matches. If
it's File, break the line into fields so you can add it to the hash
table. You'll want a pointer to the filename and a pointer to the
checksum, and you'll want to parse the size as an integer. Instead of
allocating new memory for those fields, just overwrite the character
that follows the field with a \0. There must be one - either \t or \n
- so you shouldn't run off the end of the buffer.
If you do this, a bunch of the fixed-size buffers you have right now
go away. You don't need the variable filetype[10] any more, or
checksum_with_type[CHECKSUM_LENGTH], or checksum[CHECKSUM_LENGTH], or
the character arrays inside DataDirectoryFileInfo. Instead you can
just have pointers into the buffer that contains the file. And you
don't need this code to back up using fseek() and reread the lines,
either.
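As a rough illustration of steps 3 and 4, the bounds-checked line and word scanning might look like this (a sketch with made-up helper names, not the patch's actual code):

```c
#include <assert.h>
#include <string.h>

/*
 * Find the end of the current line, never scanning past 'end'
 * (one past the last byte of the manifest buffer).  Returns a
 * pointer to the terminating newline, or 'end' if there is none.
 */
static char *
find_line_end(char *p, char *end)
{
	while (p < end && *p != '\n' && *p != '\r')
		p++;
	return p;
}

/*
 * Split off the first tab-separated word of a line in place by
 * overwriting the delimiter with '\0'.  Returns the start of the
 * remainder of the line, or line_end if there is no delimiter.
 */
static char *
split_word(char *p, char *line_end)
{
	while (p < line_end && *p != '\t')
		p++;
	if (p < line_end)
		*p++ = '\0';
	return p;
}
```

With helpers like these, the caller can walk the buffer one line at a time, dispatch on the first word (File, Manifest-Checksum, the version header), and keep pointers into the buffer instead of copying fields out.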
Thanks for the suggestion. I tried to mimic your approach in the attached v4-0002 patch.
Please let me know your thoughts on the same.
Also read this article:
https://stackoverflow.com/questions/2430303/disadvantages-of-scanf
Note that the very first point in the article talks about the problem
of overrunning the buffer, which you certainly have in the current
code right here:
+ if (fscanf(file, "%s\t%s\t%d\t%23[^\t] %s\n", filetype, filename,
filetype is declared as char[10], but %s could read arbitrarily much data.
With the revised logic, we don't use this anymore.
+ filename = (char*) pg_malloc(MAXPGPATH);
pg_malloc returns void *, so no cast is required.
fixed.
+ if (strcmp(checksum_with_type, "-") == 0)
+ {
+ checksum_type = MC_NONE;
+ }
+ else
+ {
+ if (strncmp(checksum_with_type, "SHA256", 6) == 0)
Use parse_checksum_algorithm. Right now you've invented a "common"
function with 1 caller, but I explicitly suggested previously that you
put it in common so that you could reuse it.
While parsing the record, we get <checktype>:<checksum> as a single string for the checksum.
parse_checksum_algorithm uses pg_strcasecmp(), so we need to pass it exactly the type string.
With the current logic, we can't add a '\0' in the middle of the line unless we parse it completely.
So we may need to allocate another small buffer, copy only the checksum type into it, and pass that to
parse_checksum_algorithm. I can't think of any other solution apart from this. I might be missing something
here, so please correct me if I am wrong.
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0 ||
+ strcmp(de->d_name, "pg_wal") == 0)
+ continue;
Ignoring pg_wal at the top level might be OK, but this will ignore a
pg_wal entry anywhere in the directory tree.
+ /* Skip backup manifest file. */
+ if (strcmp(de->d_name, "backup_manifest") == 0)
+ return;
Same problem.
You are right. Added extra check for this.
+ filename = createPQExpBuffer();
+ if (!filename)
+ {
+ pg_log_error("out of memory");
+ exit(1);
+ }
+
+ appendPQExpBuffer(filename, "%s%s", relative_path, de->d_name);
Just use char filename[MAXPGPATH] and snprintf here, as you do
elsewhere. It will be simpler and save memory.
Fixed.
The TAP test case patch needs some modification; I will do that and submit it.
--
Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.
Attachment
Fixed some typos in attached v5-0002 patch. Please consider this patch for review.
--
Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.
Attachment
On Fri, Dec 20, 2019 at 9:14 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 20, 2019 at 8:24 AM Suraj Kharage
<suraj.kharage@enterprisedb.com> wrote:
> Thank you for review comments.
Thanks for the new version.
+ <term><option>--verify-backup </option></term>
Whitespace.
+struct manifesthash_hash *hashtab;
Uh, I had it in mind that you would nuke this line completely, not
just remove "typedef" from it. You shouldn't need a global variable
here.
+ if (buf == NULL)
pg_malloc seems to have an internal check such that it never returns
NULL. I don't see anything like this test in other callers.
The order of operations in create_manifest_hash() seems unusual:
+ fd = open(manifest_path, O_RDONLY, 0);
+ if (fstat(fd, &stat))
+ buf = pg_malloc(stat.st_size);
+ hashtab = manifesthash_create(1024, NULL);
...
+ entry = manifesthash_insert(hashtab, filename, &found);
...
+ close(fd);
I would have expected open-fstat-read-close to be consecutive, and the
manifesthash stuff all done afterwards. In fact, it seems like reading
the file could be a separate function.
+ if (strncmp(checksum, "SHA256", 6) == 0)
This isn't really right; it would give a false match if we had a
checksum algorithm with a name like SHA2560 or SHA256C or
SHA256ExceptWayBetter. The right thing to do is find the colon first,
and then probably overwrite it with '\0' so that you have a string
that you can pass to parse_checksum_algorithm().
+ /*
+ * we don't have checksum type in the header, so need to
+ * read through the first file enttry to find the checksum
+ * type for the manifest file and initilize the checksum
+ * for the manifest file itself.
+ */
This seems to be proceeding on the assumption that the checksum type
for the manifest itself will always be the same as the checksum type
for the first file in the manifest. I don't think that's the right
approach. I think the manifest should always have a SHA256 checksum,
regardless of what type of checksum is used for the individual files
within the manifest. Since the volume of data in the manifest is
presumably very small compared to the size of the database cluster
itself, I don't think there should be any performance problem there.
Agreed that performance won't be a problem, but that will be a bit confusing
to the user: at the start the user provides the manifest checksum algorithm
(assume the user chose CRC32C), and at the end the user will find a SHA256
checksum string in the backup_manifest file.
Does this also mean that, irrespective of whether the user provided a checksum
option or not, we will always generate a checksum for the backup_manifest file?
+ filesize = atol(size);
Using strtol() would allow for some error checking.
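For instance, a sketch of the error checking strtol() enables (parse_size is a made-up helper name; atol() would silently return 0 for all of these bad inputs):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a decimal file size, returning -1 on any syntax or range
 * error instead of silently returning 0 the way atol() would.
 */
static long
parse_size(const char *s)
{
	char	   *endptr;
	long		val;

	errno = 0;
	val = strtol(s, &endptr, 10);
	if (errno != 0 || endptr == s || *endptr != '\0' || val < 0)
		return -1;
	return val;
}
```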
+ * Increase the checksum by its lable length so that we can
+ checksum = checksum + checksum_lable_length;
Spelling.
+ pg_log_error("invalid record found in \"%s\"", manifest_path);
Error message needs work.
+VerifyBackup(void)
+create_manifest_hash(char *manifest_path)
+nextLine(char *buf)
Your function names should be consistent with the surrounding style,
and with each other, as far as possible. Three different conventions
within the same patch and source file seems over the top.
Also keep in mind that you're not writing code in a vacuum. There's a
whole file of code here, and around that, a whole project.
scan_data_directory() is a good example of a function whose name is
clearly too generic. It's not a general-purpose function for scanning
the data directory; it's specifically a support function for verifying
a backup. Yet, the name gives no hint of this.
+verify_file(struct dirent *de, char fn[MAXPGPATH], struct stat st,
+ char relative_path[MAXPGPATH], manifesthash_hash *hashtab)
I think I commented on the use of char[] parameters in my previous review.
+ /* Skip backup manifest file. */
+ if (strcmp(de->d_name, "backup_manifest") == 0)
+ return;
Still looks like this will be skipped at any level of the directory
hierarchy, not just the top. And why are we skipping backup_manifest
here but pg_wal in scan_data_directory? That's a rhetorical question,
because I know the answer: verify_file() is only getting called for
files, so you can't use it to skip directories. But that's not a good
excuse for putting closely-related checks in different parts of the
code. It's just going to result in the checks being inconsistent and
each one having its own bugs that have to be fixed separately from the
other one, as here. Please try to reorganize this code so that it can
be done in a consistent way.
I think this is related to the way you're traversing the directory
tree, which somehow looks a bit awkward to me. At the top of
scan_data_directory(), you've got code that uses basedir and
subdirpath to construct path and relative_path. I was initially
surprised to see that this was the job of this function, rather than
the caller, but then I thought: well, as long as it makes life easy
for the caller, it's probably fine. However, I notice that the only
non-trivial caller is the scan_data_directory() itself, and it has to
go and construct newsubdirpath from subdirpath and the directory name.
It seems to me that this would get easier if you defined
scan_data_directory() -- or whatever we end up calling it -- to take
two pathname-related arguments:
- basepath, which would be $PGDATA and would never change as we
recurse down, so same as what you're now calling basedir
- pathsuffix, which would be an empty string at the top level and at
each recursive level we'd add a slash and then de->d_name.
So at the top of the function we wouldn't need an if statement,
because you could just do:
snprintf(path, MAXPGPATH, "%s%s", basedir, pathsuffix);
And when you recurse you wouldn't need an if statement either, because
you could just do:
snprintf(newpathsuffix, MAXPGPATH, "%s/%s", pathsuffix, de->d_name);
What I'd suggest is constructing newpathsuffix right after rejecting
"." and ".." entries, and then you can reject both pg_wal and
backup_manifest, at the top-level only, using symmetric and elegant
code:
if (strcmp(newpathsuffix, "/pg_wal") == 0 || strcmp(newpathsuffix,
"/backup_manifest") == 0)
continue;
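The path handling described above could be sketched like this (a simplified standalone sketch; MAXPGPATH is redefined locally for self-containment, and the helper names are made up):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAXPGPATH 1024			/* stands in for the PostgreSQL constant */

/*
 * Build the path suffix for a directory entry: the parent's suffix
 * plus a slash and the entry name.  An empty suffix means the top
 * level of the backup directory.
 */
static void
make_path_suffix(char *dst, const char *pathsuffix, const char *name)
{
	snprintf(dst, MAXPGPATH, "%s/%s", pathsuffix, name);
}

/*
 * Skip pg_wal and backup_manifest, but only at the top level: the
 * suffix carries the full relative location, so "/pg_wal" matches a
 * top-level entry only, not a pg_wal directory nested deeper.
 */
static bool
skip_entry(const char *newpathsuffix)
{
	return strcmp(newpathsuffix, "/pg_wal") == 0 ||
		strcmp(newpathsuffix, "/backup_manifest") == 0;
}
```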
+ record = manifesthash_lookup(hashtab, filename);;
+ if (record)
+ {
...long block...
+ }
+ else
+ pg_log_info("file \"%s\" is present in backup but not in manifest",
+ filename);
Try to structure the code in such a way that you minimize unnecessary
indentation. For example, in this case, you could instead write:
if (record == NULL)
{
pg_log_info(...)
return;
}
and the result would be that everything inside that long if-block is
now at the top level of the function and indented one level less. And
I think if you look at this function you'll see a way that you can
save a *second* level of indentation for much of that code. Please
check the rest of the patch for similar cases, too.
+static char *
+nextLine(char *buf)
+{
+ while (*buf != '\0' && *buf != '\n')
+ buf = buf + 1;
+
+ return buf + 1;
+}
I'm pretty sure that my previous review mentioned the importance of
protecting against buffer overruns here.
+static char *
+nextWord(char *line)
+{
+ while (*line != '\0' && *line != '\t' && *line != '\n')
+ line = line + 1;
+
+ return line + 1;
+}
Same problem here.
In both cases, ++ is more idiomatic.
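For illustration, bounds-checked variants of those two helpers might look like this (a sketch; 'end' points one past the last byte of the manifest buffer):

```c
#include <assert.h>

/*
 * Bounds-checked versions of the quoted helpers: both take the end of
 * the buffer and never advance past it, so a manifest that lacks a
 * trailing newline cannot cause a read beyond the allocation.
 */
static char *
next_line(char *buf, char *end)
{
	while (buf < end && *buf != '\n')
		buf++;
	return (buf < end) ? buf + 1 : end;
}

static char *
next_word(char *line, char *end)
{
	while (line < end && *line != '\t' && *line != '\n')
		line++;
	return (line < end) ? line + 1 : end;
}
```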
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Rushabh Lathia
On Sun, Dec 22, 2019 at 8:32 PM Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> Agreed that performance won't be a problem, but that will be a bit confusing
> to the user: at the start the user provides the manifest checksum algorithm
> (assume the user chose CRC32C), and at the end the user will find a SHA256
> checksum string in the backup_manifest file.
I don't think that's particularly confusing. The documentation should
say that this is the algorithm to be used for checksumming the files
which are backed up. The algorithm to be used for the manifest itself
is another matter. To me, it seems far MORE confusing if the algorithm
used for the manifest itself is magically inferred from the algorithm
used for one of the File lines therein.
> Does this also mean that, irrespective of whether the user provided a checksum
> option or not, we will always generate a checksum for the backup_manifest file?
Yes, that is what I am proposing.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thank you for review comments.
On Fri, Dec 20, 2019 at 9:14 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 20, 2019 at 8:24 AM Suraj Kharage
<suraj.kharage@enterprisedb.com> wrote:
> Thank you for review comments.
Thanks for the new version.
+ <term><option>--verify-backup </option></term>
Whitespace.
Corrected.
+struct manifesthash_hash *hashtab;
Uh, I had it in mind that you would nuke this line completely, not
just remove "typedef" from it. You shouldn't need a global variable
here.
Removed.
+ if (buf == NULL)
pg_malloc seems to have an internal check such that it never returns
NULL. I don't see anything like this test in other callers.
Yeah, removed this check
The order of operations in create_manifest_hash() seems unusual:
+ fd = open(manifest_path, O_RDONLY, 0);
+ if (fstat(fd, &stat))
+ buf = pg_malloc(stat.st_size);
+ hashtab = manifesthash_create(1024, NULL);
...
+ entry = manifesthash_insert(hashtab, filename, &found);
...
+ close(fd);
I would have expected open-fstat-read-close to be consecutive, and the
manifesthash stuff all done afterwards. In fact, it seems like reading
the file could be a separate function.
Yes, created a new function which reads the file and returns the buffer.
+ if (strncmp(checksum, "SHA256", 6) == 0)
This isn't really right; it would give a false match if we had a
checksum algorithm with a name like SHA2560 or SHA256C or
SHA256ExceptWayBetter. The right thing to do is find the colon first,
and then probably overwrite it with '\0' so that you have a string
that you can pass to parse_checksum_algorithm().
Corrected this check. The suggestion below allows us to put '\0' inside the line,
and since SHA256 is used to generate the backup manifest checksum, we can feed
that line early to the checksum machinery.
+ /*
+ * we don't have checksum type in the header, so need to
+ * read through the first file enttry to find the checksum
+ * type for the manifest file and initilize the checksum
+ * for the manifest file itself.
+ */
This seems to be proceeding on the assumption that the checksum type
for the manifest itself will always be the same as the checksum type
for the first file in the manifest. I don't think that's the right
approach. I think the manifest should always have a SHA256 checksum,
regardless of what type of checksum is used for the individual files
within the manifest. Since the volume of data in the manifest is
presumably very small compared to the size of the database cluster
itself, I don't think there should be any performance problem there.
Made the change in the backup manifest as well as in the backup validator patch. Thanks to Rushabh Lathia for the offline discussion and help.
To examine the first word of each line, I am using the below check:

if (strncmp(line, "File", 4) == 0)
{
..
}
else if (strncmp(line, "Manifest-Checksum", 17) == 0)
{
..
}
else
error

strncmp might not be right here, but we cannot put '\0' in between the line (to find the first word)
before we recognize the line type.
All the lines except the last one (where we have the manifest checksum) are fed to the checksum machinery to calculate the manifest checksum,
so update_checksum() should be called after recognizing the type, i.e. if it is a File type record. Do you see any issues with this?
+ filesize = atol(size);
Using strtol() would allow for some error checking.
corrected.
+ * Increase the checksum by its lable length so that we can
+ checksum = checksum + checksum_lable_length;
Spelling.
corrected.
+ pg_log_error("invalid record found in \"%s\"", manifest_path);
Error message needs work.
+VerifyBackup(void)
+create_manifest_hash(char *manifest_path)
+nextLine(char *buf)
Your function names should be consistent with the surrounding style,
and with each other, as far as possible. Three different conventions
within the same patch and source file seems over the top.
Also keep in mind that you're not writing code in a vacuum. There's a
whole file of code here, and around that, a whole project.
scan_data_directory() is a good example of a function whose name is
clearly too generic. It's not a general-purpose function for scanning
the data directory; it's specifically a support function for verifying
a backup. Yet, the name gives no hint of this.
+verify_file(struct dirent *de, char fn[MAXPGPATH], struct stat st,
+ char relative_path[MAXPGPATH], manifesthash_hash *hashtab)
I think I commented on the use of char[] parameters in my previous review.
+ /* Skip backup manifest file. */
+ if (strcmp(de->d_name, "backup_manifest") == 0)
+ return;
Still looks like this will be skipped at any level of the directory
hierarchy, not just the top. And why are we skipping backup_manifest
here but pg_wal in scan_data_directory? That's a rhetorical question,
because I know the answer: verify_file() is only getting called for
files, so you can't use it to skip directories. But that's not a good
excuse for putting closely-related checks in different parts of the
code. It's just going to result in the checks being inconsistent and
each one having its own bugs that have to be fixed separately from the
other one, as here. Please try to reorganize this code so that it can
be done in a consistent way.
I think this is related to the way you're traversing the directory
tree, which somehow looks a bit awkward to me. At the top of
scan_data_directory(), you've got code that uses basedir and
subdirpath to construct path and relative_path. I was initially
surprised to see that this was the job of this function, rather than
the caller, but then I thought: well, as long as it makes life easy
for the caller, it's probably fine. However, I notice that the only
non-trivial caller is the scan_data_directory() itself, and it has to
go and construct newsubdirpath from subdirpath and the directory name.
It seems to me that this would get easier if you defined
scan_data_directory() -- or whatever we end up calling it -- to take
two pathname-related arguments:
- basepath, which would be $PGDATA and would never change as we
recurse down, so same as what you're now calling basedir
- pathsuffix, which would be an empty string at the top level and at
each recursive level we'd add a slash and then de->d_name.
So at the top of the function we wouldn't need an if statement,
because you could just do:
snprintf(path, MAXPGPATH, "%s%s", basedir, pathsuffix);
And when you recurse you wouldn't need an if statement either, because
you could just do:
snprintf(newpathsuffix, MAXPGPATH, "%s/%s", pathsuffix, de->d_name);
What I'd suggest is constructing newpathsuffix right after rejecting
"." and ".." entries, and then you can reject both pg_wal and
backup_manifest, at the top-level only, using symmetric and elegant
code:
if (strcmp(newpathsuffix, "/pg_wal") == 0 || strcmp(newpathsuffix,
"/backup_manifest") == 0)
continue;
Thanks for the suggestion. Corrected as per the above inputs.
+ record = manifesthash_lookup(hashtab, filename);;
+ if (record)
+ {
...long block...
+ }
+ else
+ pg_log_info("file \"%s\" is present in backup but not in manifest",
+ filename);
Try to structure the code in such a way that you minimize unnecessary
indentation. For example, in this case, you could instead write:
if (record == NULL)
{
pg_log_info(...)
return;
}
and the result would be that everything inside that long if-block is
now at the top level of the function and indented one level less. And
I think if you look at this function you'll see a way that you can
save a *second* level of indentation for much of that code. Please
check the rest of the patch for similar cases, too.
Make sense. corrected.
+static char *
+nextLine(char *buf)
+{
+ while (*buf != '\0' && *buf != '\n')
+ buf = buf + 1;
+
+ return buf + 1;
+}
I'm pretty sure that my previous review mentioned the importance of
protecting against buffer overruns here.
+static char *
+nextWord(char *line)
+{
+ while (*line != '\0' && *line != '\t' && *line != '\n')
+ line = line + 1;
+
+ return line + 1;
+}
Same problem here.
In both cases, ++ is more idiomatic.
I have added a check for EOF, but not sure whether that would be right here.
Do we need to check the length of buffer as well?
Rajkaumar has changed the TAP test case patch as per the revised error messages.
Please find attached the patch stack incorporating the above comments.
--
Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.
On Tue, Dec 24, 2019 at 5:42 AM Suraj Kharage <suraj.kharage@enterprisedb.com> wrote:
> Made the change in the backup manifest as well as in the backup validator patch. Thanks to Rushabh Lathia for the offline discussion and help.
>
> To examine the first word of each line, I am using the below check:
> if (strncmp(line, "File", 4) == 0)
> {
> ..
> }
> else if (strncmp(line, "Manifest-Checksum", 17) == 0)
> {
> ..
> }
> else
> error
>
> strncmp might not be right here, but we cannot put '\0' in between the line (to find the first word)
> before we recognize the line type.
> All the lines except the last one (where we have the manifest checksum) are fed to the checksum machinery to calculate the manifest checksum,
> so update_checksum() should be called after recognizing the type, i.e. if it is a File type record. Do you see any issues with this?

I see the problem, but I don't think your solution is right, because
the first test would pass if the line said FiletMignon rather than
just File, which we certainly don't want. You've got to write the test
so that you're checking against the whole first word, not just some
prefix of it. There are several possible ways to accomplish that, but
this isn't one of them.

>> + pg_log_error("invalid record found in \"%s\"", manifest_path);
>>
>> Error message needs work.

Looks better now, but you have a message that says "invalid checksums
type \"%s\" found in \"%s\"". This is wrong because checksums would
need to be singular in this context (checksum). Also, I think it could
be better phrased as "manifest file \"%s\" specifies unknown checksum
algorithm \"%s\" at line %d".

>> Your function names should be consistent with the surrounding style,
>> and with each other, as far as possible. Three different conventions
>> within the same patch and source file seems over the top.

This appears to be fixed.

>> Also keep in mind that you're not writing code in a vacuum. There's a
>> whole file of code here, and around that, a whole project.
>> scan_data_directory() is a good example of a function whose name is
>> clearly too generic. It's not a general-purpose function for scanning
>> the data directory; it's specifically a support function for verifying
>> a backup. Yet, the name gives no hint of this.

But this appears not to be fixed.

>> if (strcmp(newpathsuffix, "/pg_wal") == 0 || strcmp(newpathsuffix,
>> "/backup_manifest") == 0)
>> continue;
>
> Thanks for the suggestion. Corrected as per the above inputs.

You need a comment here, like "Ignore the possible presence of a
backup_manifest file and/or a pg_wal directory in the backup being
verified." and then maybe another sentence explaining why that's the
right thing to do.

+ * The forth parameter to VerifyFile() will pass the relative path
+ * of file to match exactly with the filename present in manifest.

I don't know what this comment is trying to tell me, which might be
something you want to try to fix. However, I'm pretty sure it's
supposed to say "fourth" not "forth".

>> and the result would be that everything inside that long if-block is
>> now at the top level of the function and indented one level less. And
>> I think if you look at this function you'll see a way that you can
>> save a *second* level of indentation for much of that code. Please
>> check the rest of the patch for similar cases, too.
>
> Make sense. corrected.

I don't agree. A large chunk of VerifyFile() is still subject to a
quite unnecessary level of indentation.

> I have added a check for EOF, but not sure whether that would be right here.
> Do we need to check the length of buffer as well?

That's really, really not right. EOF is not a character that can
appear in the buffer. It's chosen on purpose to be a value that never
matches any actual character when both the character and the EOF value
are regarded as values of type 'int'. That guarantee doesn't apply
here though because you're dealing with values of type 'char'. So what
this code is doing is searching for an impossible value using
incorrect logic, which has very little to do with the actual need
here, which is to avoid running off the end of the buffer.

To see what the problem is, try creating a file with no terminating
newline, like this:

echo -n this file has no terminating newline >> some-file

I doubt it will be very hard to make this patch crash horribly. Even
if you can't, it seems pretty clear that the logic isn't right. I
don't really know what the \0 tests in NextLine() and NextWord() think
they're doing either. If there's a \0 in the buffer before you add
one, it was in the original input data, and pretending like that marks
a word or line boundary seems like a fairly arbitrary choice.

What I suggest is: (1) Allocate one byte more than the file size for
the buffer that's going to hold the file, so that if you write a \0
just after the last byte of the file, you don't overrun the allocated
buffer. (2) Compute char *endptr = buf + len. (3) Pass endptr to
NextLine and NextWord and write the loop condition something like
while (*buf != '\n' && buf < endptr).

Other notes:

- The error handling in ReadFileIntoBuffer() does not seem to consider
the case of a short read. If you look through the source tree, you can
find examples of how we normally handle that.

- Putting string_hash_sdbm() into encode.c seems like a surprising
choice. What does this have to do with encoding anything? And why is
it going into src/common at all if it's only intended for frontend
use?

- It seems like whether or not any problems were found while verifying
the manifest ought to affect the exit status of pg_basebackup. I'm not
exactly sure what exit codes ought to be used, but you could look for
similar precedents. Document this, too.

- As much as possible let's have errors in the manifest file report
the line number, and let's also try to make them more specific, e.g.
instead of "invalid manifest record found in \"%s\"", perhaps
"manifest file \"%s\" contains invalid keyword \"%s\" at line %d".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Moin,

sorry for the very late reply. There was a discussion about the
specific format of the backup manifests, and maybe that was already
discussed and I just overlooked it:

1) Why invent your own format, and not just use a machine-readable
format that already exists? It doesn't have to be full-blown XML, or
even JSON; something simple like YAML would already be better. That
way not everyone has to write their own parser. Or maybe it is already
YAML and just the different keywords were under discussion?

2) It would be very wise to add a version number to the format. That
will make an extension later much easier and avoids the "we need to
add X, but that breaks compatibility with all software out there"
situations that often arise a few years down the line.

Best regards, and a happy New Year 2020

Tels
On Tue, Dec 31, 2019 at 01:30:01PM +0100, Tels wrote:
> Moin,
>
> sorry for the very late reply. There was a discussion about the specific
> format of the backup manifests, and maybe that was already discussed and I
> just overlooked it:
>
> 1) Why invent your own format, and not just use a machine-readable format
> that already exists? It doesn't have to be full blown XML, or even JSON,
> something simple as YAML would already be better. That way not everyone has
> to write their own parser. Or maybe it is already YAML and just the
> different keywords were under discussion?

YAML is extremely fragile and error-prone. It's also a superset of
JSON, so I don't understand what you mean by "as simple as."

-1 from me on YAML

That said, I agree that there's no reason to come up with a bespoke
format and parser when JSON is already available in every PostgreSQL
installation. Imposing a structure atop that which includes a version
number, as you suggest, seems pretty straightforward, and should be
done.

Would it make sense to include some kind of capability description in
the format along with the version number?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On 12/31/19 10:43 AM, David Fetter wrote:
> On Tue, Dec 31, 2019 at 01:30:01PM +0100, Tels wrote:
>> 1) Why invent your own format, and not just use a machine-readable format
>> that already exists? It doesn't have to be full blown XML, or even JSON,
>> something simple as YAML would already be better. That way not everyone has
>> to write their own parser. Or maybe it is already YAML and just the
>> different keywords were under discussion?
>
> YAML is extremely fragile and error-prone. It's also a superset of
> JSON, so I don't understand what you mean by "as simple as."
>
> -1 from me on YAML

-1 from me as well. YAML is easy to write but definitely non-trivial
to read.

> That said, I agree that there's no reason to come up with a bespoke
> format and parser when JSON is already available in every PostgreSQL
> installation. Imposing a structure atop that includes a version
> number, as you suggest, seems pretty straightforward, and should be
> done.

+1. I continue to support a format that would be easily readable
without writing a lot of code.

--
-David
david@pgmasters.net
On Tue, Dec 31, 2019 at 9:16 PM David Steele <david@pgmasters.net> wrote:
>> That said, I agree that there's no reason to come up with a bespoke
>> format and parser when JSON is already available in every PostgreSQL
>> installation. Imposing a structure atop that includes a version
>> number, as you suggest, seems pretty straightforward, and should be
>> done.
>
> +1. I continue to support a format that would be easily readable
> without writing a lot of code.

So, if someone can suggest to me how I could read JSON from a tool in
src/bin without writing a lot of code, I'm all ears. So far that's
been asserted but not been demonstrated to be possible. Getting the
JSON parser that we have in the backend to work from frontend doesn't
look all that straightforward, for reasons that I talked about in
http://postgr.es/m/CA+TgmobZrNYR-ATtfZiZ_k-W7tSPgvmYZmyiqumQig4R4fkzHw@mail.gmail.com

As to the suggestion that a version number be included, that's been
there in every version of the patch I've posted.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jan 01, 2020 at 01:43:40PM -0500, Robert Haas wrote:
> So, if someone can suggest to me how I could read JSON from a tool in
> src/bin without writing a lot of code, I'm all ears. So far that's
> been asserted but not been demonstrated to be possible. Getting the
> JSON parser that we have in the backend to work from frontend doesn't
> look all that straightforward, for reasons that I talked about in
> http://postgr.es/m/CA+TgmobZrNYR-ATtfZiZ_k-W7tSPgvmYZmyiqumQig4R4fkzHw@mail.gmail.com

Maybe I'm missing something obvious, but wouldn't combining
pg_read_file() with a cast to JSONB fix this, as below?

shackle@[local]:5413/postgres(13devel)(892328) # SELECT jsonb_pretty(j::jsonb) FROM pg_read_file('/home/shackle/advanced_comparison.json') AS t(j);
            jsonb_pretty
════════════════════════════════════
 [
     {
         "message": "hello world!",
         "severity": "[DEBUG]"
     },
     {
         "message": "boz",
         "severity": "[INFO]"
     },
     {
         "message": "foo",
         "severity": "[DEBUG]"
     },
     {
         "message": "null",
         "severity": "null"
     }
 ]
(1 row)

Time: 3.050 ms

> As to the suggestion that a version number be included, that's been
> there in every version of the patch I've posted.

and thanks for that!

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
David Fetter <david@fetter.org> writes:
> On Wed, Jan 01, 2020 at 01:43:40PM -0500, Robert Haas wrote:
>> So, if someone can suggest to me how I could read JSON from a tool in
>> src/bin without writing a lot of code, I'm all ears.
>
> Maybe I'm missing something obvious, but wouldn't combining
> pg_read_file() with a cast to JSONB fix this, as below?

Only if you're prepared to restrict the use of the tool to superusers
(or at least people with whatever privilege that function requires).

Admittedly, you can probably feed the data to the backend without use
of an intermediate file; but it still requires a working backend
connection, which might be a bit of a leap for backup-related tools.
I'm sure Robert was envisioning doing this processing inside the tool.

			regards, tom lane
On Wed, Jan 1, 2020 at 7:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Fetter <david@fetter.org> writes:
>> Maybe I'm missing something obvious, but wouldn't combining
>> pg_read_file() with a cast to JSONB fix this, as below?
>
> Only if you're prepared to restrict the use of the tool to superusers
> (or at least people with whatever privilege that function requires).
>
> Admittedly, you can probably feed the data to the backend without
> use of an intermediate file; but it still requires a working backend
> connection, which might be a bit of a leap for backup-related tools.
> I'm sure Robert was envisioning doing this processing inside the tool.

Yeah, exactly. I don't think verifying a backup should require a
running server, let alone a running server on the same machine where
the backup is stored and for which you have superuser privileges.

AFAICS, the only options to make that work with JSON are (1) introduce
a new hand-coded JSON parser designed for frontend operation, (2) add
a dependency on an external JSON parser that we can use from frontend
code, or (3) adapt the existing JSON parser used in the backend so
that it can also be used in the frontend.

I'd be willing to do (1) -- it wouldn't be the first time I've written
a JSON parser for PostgreSQL -- but I think it will take an order of
magnitude more code than using a file with tab-separated columns as
I've proposed, and I assume that there will be complaints about having
two JSON parsers in core. I'd also be willing to do (2) if that's the
consensus, but I'd vote against such an approach if somebody else
proposed it because (a) I'm not aware of a widely-available library
upon which we could depend and (b) introducing such a dependency for a
minor feature like this seems fairly unpalatable to me, and it'd
probably still be more code than just using a tab-separated file. I'd
be willing to do (3) if somebody could explain to me how to solve the
problems with porting that code to work on the frontend side, but the
only suggestion so far as to how to do that is to port memory
contexts, elog/ereport, and presumably encoding handling to work on
the frontend side. That seems to me to be an unreasonably large lift,
especially given that we have lots of other files that use ad-hoc
formats already, and if somebody ever gets around to converting all of
those to JSON, they can certainly convert this one at the same time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> AFAICS, the only options to make that work with JSON are (1) introduce
> a new hand-coded JSON parser designed for frontend operation, (2) add
> a dependency on an external JSON parser that we can use from frontend
> code, or (3) adapt the existing JSON parser used in the backend so
> that it can also be used in the frontend.
> ... I'd
> be willing to do (3) if somebody could explain to me how to solve the
> problems with porting that code to work on the frontend side, but the
> only suggestion so far as to how to do that is to port memory
> contexts, elog/report, and presumably encoding handling to work on the
> frontend side. That seems to me to be an unreasonably large lift,

Yeah, agreed. The only consideration that'd make that a remotely sane
idea is that if somebody did the work, there would be other uses for
it. (One that comes to mind immediately is cleaning up ecpg's
miserably-maintained fork of the backend datetime code.) But there's
no denying that it would be a large amount of work (if it's even
feasible), and nobody has stepped up to volunteer. It's not reasonable
to hold up this particular feature waiting for that to happen.

If a tab-delimited file can handle this requirement, that seems like a
sane choice to me.

			regards, tom lane
On Wed, Jan 01, 2020 at 08:57:11PM -0500, Robert Haas wrote:
> Yeah, exactly. I don't think verifying a backup should require a
> running server, let alone a running server on the same machine where
> the backup is stored and for which you have superuser privileges.

Thanks for clarifying the context.

> AFAICS, the only options to make that work with JSON are (1) introduce
> a new hand-coded JSON parser designed for frontend operation, (2) add
> a dependency on an external JSON parser that we can use from frontend
> code, or (3) adapt the existing JSON parser used in the backend so
> that it can also be used in the frontend.
>
> I'd be willing to do (1) -- it wouldn't be the first time I've written
> a JSON parser for PostgreSQL -- but I think it will take an order of
> magnitude more code than using a file with tab-separated columns as
> I've proposed, and I assume that there will be complaints about having
> two JSON parsers in core. I'd also be willing to do (2) if that's the
> consensus, but I'd vote against such an approach if somebody else
> proposed it because (a) I'm not aware of a widely-available library
> upon which we could depend

I believe jq has an excellent one that's available under a suitable
license.

Making jq a dependency seems like a separate discussion, though. At
the moment, we don't use git tools like submodule/subtree, and
deciding which (or whether) seems like a gigantic discussion all on
its own.

> (b) introducing such a dependency for a minor feature like this
> seems fairly unpalatable to me, and it'd probably still be more code
> than just using a tab-separated file. I'd be willing to do (3) if
> somebody could explain to me how to solve the problems with porting
> that code to work on the frontend side, but the only suggestion so
> far as to how to do that is to port memory contexts, elog/ereport,
> and presumably encoding handling to work on the frontend side.

This port has come up several times recently in different contexts.
How big a chunk of work would it be? Just so we're clear, I'm not
suggesting that this port should gate this feature.

> That seems to me to be an unreasonably large lift, especially given
> that we have lots of other files that use ad-hoc formats already,
> and if somebody ever gets around to converting all of those to JSON,
> they can certainly convert this one at the same time.

Would that require some kind of file converter program, or just a
really loud notice in the release notes?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Thu, Jan 2, 2020 at 1:03 PM David Fetter <david@fetter.org> wrote:
> I believe jq has an excellent one that's available under a suitable
> license.
>
> Making jq a dependency seems like a separate discussion, though. At
> the moment, we don't use git tools like submodule/subtree, and
> deciding which (or whether) seems like a gigantic discussion all on
> its own.

Yep. And it doesn't seem worth it for a relatively small feature like
this. If we already had it, it might be worth using for a relatively
small feature like this, but that's a different issue.

> This port has come up several times recently in different contexts.
> How big a chunk of work would it be? Just so we're clear, I'm not
> suggesting that this port should gate this feature.

I don't really know. It's more of a research project than a coding
project, at least initially, I think. For instance, psql has its own
non-local-transfer-of-control mechanism using sigsetjmp(). If you
wanted to introduce elog/ereport on the frontend, would you make psql
use it? Or just let psql continue to do what it does now and introduce
the new mechanism as an option for code going forward? Or try to make
the two mechanisms work together somehow? Will you start using the
same error codes that we use in the backend on the frontend side, and
if so, what will they do, given that what the backend does is just
embed them in a protocol message that any particular client may or may
not display? Similarly, should frontend errors support reporting a
hint, detail, statement, or query? Will it be confusing if backend and
frontend errors are too similar? If you make memory contexts available
in the frontend, what if any code will you adapt to use them? There's
a lot of stuff in src/bin. If you want the encoding machinery on the
frontend, what will you use in place of the backend's idea of the
"database encoding"? What will you do about dependencies on Datum in
frontend code?

Somebody would need to study all this stuff, come up with a tentative
set of decisions, write patches, get it all working, and then quite
possibly have the choices they made get second-guessed by other people
who have different ideas. If you come up with a really good, clean
proposal that doesn't provoke any major disagreements, you might be
able to get this done in a couple of months. If you can't come up with
something people like, or if you're the only one who thinks what you
come up with is good, it might take years.

It seems to me that in a perfect world a lot of the code we have in
the backend that is usefully reusable in other contexts would be
structured so that it doesn't have random dependencies on backend-only
machinery like memory contexts and elog/ereport. For example, if you
write a function that returns an error message rather than throwing an
error, then you can arrange to call that from either frontend or
backend code and the caller can do whatever it wishes with that error
text. However, once you've written your code so that an error gets
thrown six layers down in the call stack, it's really hard to
rearrange that so that the error is returned, and if you are
populating not only the primary error message but error code, detail,
hint, etc. it's almost impractical to think that you can rearrange
things that way anyway. And generally you want to be populating those
things, as a best practice for backend code. So while in theory I kind
of like the idea of adapting the JSON parser we've already got to just
not depend so heavily on a backend environment, it's not really very
clear how to actually make that happen. At least not to me.

>> That seems to me to be an unreasonably large lift, especially given
>> that we have lots of other files that use ad-hoc formats already,
>> and if somebody ever gets around to converting all of those to JSON,
>> they can certainly convert this one at the same time.
>
> Would that require some kind of file converter program, or just a
> really loud notice in the release notes?

Maybe neither. I don't see why it wouldn't be possible to be
backward-compatible just by keeping the old code around and having it
parse as far as the version number. Then it could decide to continue
on with the old code or call the new code, depending.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thank you for review comments.
On Mon, Dec 30, 2019 at 11:53 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Dec 24, 2019 at 5:42 AM Suraj Kharage
<suraj.kharage@enterprisedb.com> wrote:
> To examine the first word of each line, I am using below check:
> if (strncmp(line, "File", 4) == 0)
> {
> ..
> }
> else if (strncmp(line, "Manifest-Checksum", 17) == 0)
> {
> ..
> }
> else
> error
>
> strncmp might be not right here, but we can not put '\0' in between the line (to find out first word)
> before we recognize the line type.
> All the lines except the last one (where we have the manifest checksum) are fed to the checksum machinery to calculate the manifest checksum.
> so update_checksum() should be called after recognizing the type, i.e: if it is a File type record. Do you see any issues with this?
I see the problem, but I don't think your solution is right, because
the first test would pass if the line said FiletMignon rather than
just File, which we certainly don't want. You've got to write the test
so that you're checking against the whole first word, not just some
prefix of it. There are several possible ways to accomplish that, but
this isn't one of them.
Yeah. Fixed in the attached patch.
>> + pg_log_error("invalid record found in \"%s\"", manifest_path);
>>
>> Error message needs work.
Looks better now, but you have a message that says "invalid checksums
type \"%s\" found in \"%s\"". This is wrong because checksums would
need to be singular in this context (checksum). Also, I think it could
be better phrased as "manifest file \"%s\" specifies unknown checksum
algorithm \"%s\" at line %d".
Corrected.
>> Your function names should be consistent with the surrounding style,
>> and with each other, as far as possible. Three different conventions
>> within the same patch and source file seems over the top.
This appears to be fixed.
>> Also keep in mind that you're not writing code in a vacuum. There's a
>> whole file of code here, and around that, a whole project.
>> scan_data_directory() is a good example of a function whose name is
>> clearly too generic. It's not a general-purpose function for scanning
>> the data directory; it's specifically a support function for verifying
>> a backup. Yet, the name gives no hint of this.
But this appears not to be fixed.
I have changed this function name to "VerifyDir", similar to sendDir and sendFile in basebackup.c.
>> if (strcmp(newpathsuffix, "/pg_wal") == 0 || strcmp(newpathsuffix,
>> "/backup_manifest") == 0)
>> continue;
>
> Thanks for the suggestion. Corrected as per the above inputs.
You need a comment here, like "Ignore the possible presence of a
backup_manifest file and/or a pg_wal directory in the backup being
verified." and then maybe another sentence explaining why that's the
right thing to do.
Corrected.
+ * The forth parameter to VerifyFile() will pass the relative path
+ * of file to match exactly with the filename present in manifest.
I don't know what this comment is trying to tell me, which might be
something you want to try to fix. However, I'm pretty sure it's
supposed to say "fourth" not "forth".
I have changed the fourth parameter of VerifyFile(), so my comment over there is no more valid.
>> and the result would be that everything inside that long if-block is
>> now at the top level of the function and indented one level less. And
>> I think if you look at this function you'll see a way that you can
>> save a *second* level of indentation for much of that code. Please
>> check the rest of the patch for similar cases, too.
>
> Make sense. corrected.
I don't agree. A large chunk of VerifyFile() is still subject to a
quite unnecessary level of indentation.
Yeah, corrected.
> I have added a check for EOF, but not sure whether that would be right here.
> Do we need to check the length of buffer as well?
That's really, really not right. EOF is not a character that can
appear in the buffer. It's chosen on purpose to be a value that never
matches any actual character when both the character and the EOF value
are regarded as values of type 'int'. That guarantee doesn't apply
here though because you're dealing with values of type 'char'. So what
this code is doing is searching for an impossible value using
incorrect logic, which has very little to do with the actual need
here, which is to avoid running off the end of the buffer. To see what
the problem is, try creating a file with no terminating newline, like
this:
echo -n this file has no terminating newline >> some-file
I doubt it will be very hard to make this patch crash horribly. Even
if you can't, it seems pretty clear that the logic isn't right.
I don't really know what the \0 tests in NextLine() and NextWord()
think they're doing either. If there's a \0 in the buffer before you
add one, it was in the original input data, and pretending like that
marks a word or line boundary seems like a fairly arbitrary choice.
What I suggest is:
(1) Allocate one byte more than the file size for the buffer that's
going to hold the file, so that if you write a \0 just after the last
byte of the file, you don't overrun the allocated buffer.
(2) Compute char *endptr = buf + len.
(3) Pass endptr to NextLine and NextWord and write the loop condition
something like while (*buf != '\n' && buf < endptr).
Thanks for the suggestion. Corrected as per above suggestion.
Other notes:
- The error handling in ReadFileIntoBuffer() does not seem to consider
the case of a short read. If you look through the source tree, you can
find examples of how we normally handle that.
yeah, corrected.
- Putting string_hash_sdbm() into encode.c seems like a surprising
choice. What does this have to do with encoding anything? And why is
it going into src/common at all if it's only intended for frontend
use?
I thought this function could be used in the backend as well (we use a similar one in simplehash), so I kept it in src/common.
After your comment, I have moved this to pg_basebackup.c.
I think this could be kept in a common place, but not in "src/common/encode.c". Thoughts?
- It seems like whether or not any problems were found while verifying
the manifest ought to affect the exit status of pg_basebackup. I'm not
exactly sure what exit codes ought to be used, but you could look for
similar precedents. Document this, too.
I might not be getting this completely right, but as per my observation, if any error occurs, pg_basebackup terminates with exit(1),
whereas in the normal case (without an error) the main function returns 0. The "help" and "version" options terminate normally with exit(0).
So in our case, exit(0) would be appropriate. Please correct me if I misunderstood anything.
- As much as possible let's have errors in the manifest file report
the line number, and let's also try to make them more specific, e.g.
instead of "invalid manifest record found in \"%s\"", perhaps
"manifest file \"%s\" contains invalid keyword \"%s\" at line %d".
Yeah, added line numbers at all possible places.
I have also fixed a few comments given by Jeevan Chalke offlist.
Please find attached v7 patches and let me know your comments.
--
Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.
Greetings, * Tom Lane (tgl@sss.pgh.pa.us) wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > AFAICS, the only options to make that work with JSON are (1) introduce > > a new hand-coded JSON parser designed for frontend operation, (2) add > > a dependency on an external JSON parser that we can use from frontend > > code, or (3) adapt the existing JSON parser used in the backend so > > that it can also be used in the frontend. > > ... I'd > > be willing to do (3) if somebody could explain to me how to solve the > > problems with porting that code to work on the frontend side, but the > > only suggestion so far as to how to do that is to port memory > > contexts, elog/report, and presumably encoding handling to work on the > > frontend side. That seems to me to be an unreasonably large lift, > > Yeah, agreed. The only consideration that'd make that a remotely > sane idea is that if somebody did the work, there would be other > uses for it. (One that comes to mind immediately is cleaning up > ecpg's miserably-maintained fork of the backend datetime code.) > > But there's no denying that it would be a large amount of work > (if it's even feasible), and nobody has stepped up to volunteer. > It's not reasonable to hold up this particular feature waiting > for that to happen. Sure, it'd be work, and for "adding a simple backup manifest", maybe too much to be worth considering ... but that's not what is going on here, is it? Are we really *just* going to add a backup manifest to pg_basebackup and call it done? That's not what I understood the goal here to be but rather to start doing a lot of other things with pg_basebackup beyond just having a manifest and if you think just a bit farther down the path, I think you start to realize that you're going to need this base set of capabilities to get to a point where pg_basebackup (or whatever it ends up being called) is able to have the kind of capabilities that exist in other PG backup software already. 
I'm sure I don't need to say where to find it, but I can point you to a pretty good example of a similar effort, and we didn't start with "build a manifest into a custom format" as the first thing implemented, but rather a great deal of work was first put into building out things like logging, memory management/contexts, error handling/try-catch, having a string type, a variant type, etc. In some ways, it's kind of impressive what we've got in our front-ends tools even though we don't have these things, really, and certainly not all in one nice library that they all use... but at the same time, I think that lack has also held those tools back, pg_basebackup among them. Anyway, off my high horse, I'll just say I agree w/ David and David wrt using JSON for this over hacking together yet another format. We didn't do that as thoroughly as we should have (we've got a JSON parser and all that, and use JSON quite a bit, but the actual manifest format is a mix of ini-style and JSON, because it's got more in it than just a list of files, something that I suspect will also end up being true of this down the road and for good reasons, and we started with the ini format and discovered it sucked and then started embedding JSON in it...), and we've come to realize that was a bad idea, and intend to fix it in our next manifest major version bump. Would be unfortunate to see PG making that same mistake. Thanks, Stephen
On Fri, Jan 3, 2020 at 11:44 AM Stephen Frost <sfrost@snowman.net> wrote: > Sure, it'd be work, and for "adding a simple backup manifest", maybe too > much to be worth considering ... but that's not what is going on here, > is it? Are we really *just* going to add a backup manifest to > pg_basebackup and call it done? That's not what I understood the goal > here to be but rather to start doing a lot of other things with > pg_basebackup beyond just having a manifest and if you think just a bit > farther down the path, I think you start to realize that you're going to > need this base set of capabilities to get to a point where pg_basebackup > (or whatever it ends up being called) is able to have the kind of > capabilities that exist in other PG backup software already. I have no development plans for pg_basebackup that require extending the format of the manifest file in any significant way, and am not aware that anyone else has such plans either. If you are aware of something I'm not, or if anyone else is, it would be helpful to know about it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Fri, Jan 3, 2020 at 11:44 AM Stephen Frost <sfrost@snowman.net> wrote: > > Sure, it'd be work, and for "adding a simple backup manifest", maybe too > > much to be worth considering ... but that's not what is going on here, > > is it? Are we really *just* going to add a backup manifest to > > pg_basebackup and call it done? That's not what I understood the goal > > here to be but rather to start doing a lot of other things with > > pg_basebackup beyond just having a manifest and if you think just a bit > > farther down the path, I think you start to realize that you're going to > > need this base set of capabilities to get to a point where pg_basebackup > > (or whatever it ends up being called) is able to have the kind of > > capabilities that exist in other PG backup software already. > > I have no development plans for pg_basebackup that require extending > the format of the manifest file in any significant way, and am not > aware that anyone else has such plans either. If you are aware of > something I'm not, or if anyone else is, it would be helpful to know > about it. You're certainly intending to do *something* with the manifest, and while I appreciate that you feel you've come up with a complete use-case that this simple manifest will be sufficient for, I frankly doubt that'll actually be the case. Not long ago it wasn't completely clear that a manifest at *all* was even going to be necessary for the specific use-case you had in mind (I'll admit I wasn't 100% sure myself at the time either), but now that we're down the road of having one, I can't agree with the blanket assumption that we're never going to want to extend it, or even that it won't be necessary to add to it before this particular use-case is fully addressed. And the same goes for the other things that were discussed up-thread regarding memory context and error handling and such. 
I'm happy to outline the other things that one *might* want to include in a manifest, if that would be helpful, but I'll also say that I'm not planning to hack on adding that to pg_basebackup in the next month or two. Once we've actually got a manifest, if it's in an extendable format, I could certainly see people wanting to do more with it though. Thanks, Stephen
On Fri, Jan 3, 2020 at 12:01 PM Stephen Frost <sfrost@snowman.net> wrote: > You're certainly intending to do *something* with the manifest, and > while I appreciate that you feel you've come up with a complete use-case > that this simple manifest will be sufficient for, I frankly doubt > that'll actually be the case. Not long ago it wasn't completely clear > that a manifest at *all* was even going to be necessary for the specific > use-case you had in mind (I'll admit I wasn't 100% sure myself at the > time either), but now that we're down the road of having one, I can't > agree with the blanket assumption that we're never going to want to > extend it, or even that it won't be necessary to add to it before this > particular use-case is fully addressed. > > And the same goes for the other things that were discussed up-thread > regarding memory context and error handling and such. Well, I don't know how to make you happy here. It looks to me like insisting on a JSON-format manifest will likely mean that this doesn't get into PG13 or PG14 or probably PG15, because a port of all that machinery to work in frontend code will be neither simple nor quick. If you want this to happen for this release, you've got to be willing to settle for something that can be implemented in the time we have. I'm not sure whether what you and David are arguing boils down to thinking that I'm wrong when I say that doing that is hard, or whether you know it's hard but you just don't care because you'd rather see the feature go nowhere than use a format other than JSON. I don't see much difference between the latter position and a desire to block the feature permanently. And if it's the former then you have yet to make any suggestions for how to get it done with reasonable effort. 
> I'm happy to outline the other things that one *might* want to include > in a manifest, if that would be helpful, but I'll also say that I'm not > planning to hack on adding that to pg_basebackup in the next month or > two. Once we've actually got a manifest, if it's in an extendable > format, I could certainly see people wanting to do more with it though. Well, as I say, it's got a version number, so somebody can always come along with something better. I really think this is a red herring, though. If somebody wants to track additional data about a backup, there's no rule that they have to include it in the backup manifest. A backup management solution might want to track things like who initiated the backup, or for what purpose it was taken, or the IP address of the machine where it was taken, or the backup system's own identifier, but any of that stuff could (and probably should) be stored in a file managed by that tool rather than in the server's own manifest. As to the per-file information, I believe that David and I discussed that and the list of fields that I had seemed relatively OK, and I believe I added at least one (mtime) per his suggestion. Of course, it's a tab-separated file; more fields could easily be added at the end, separated by tabs. Or, you could modify the file so that after each "File" line you had another line with supplementary information about that file, beginning with some other word. Or, you could convert the whole file to JSON for v2 of the manifest, if, contrary to my belief, that's a fairly simple thing to do. There are probably other approaches as well. This file format has already had considerably more thought about forward-compatibility than pg_hba.conf, which has been retrofitted multiple times without breaking the world. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Fri, Jan 3, 2020 at 12:01 PM Stephen Frost <sfrost@snowman.net> wrote: > > You're certainly intending to do *something* with the manifest, and > > while I appreciate that you feel you've come up with a complete use-case > > that this simple manifest will be sufficient for, I frankly doubt > > that'll actually be the case. Not long ago it wasn't completely clear > > that a manifest at *all* was even going to be necessary for the specific > > use-case you had in mind (I'll admit I wasn't 100% sure myself at the > > time either), but now that we're down the road of having one, I can't > > agree with the blanket assumption that we're never going to want to > > extend it, or even that it won't be necessary to add to it before this > > particular use-case is fully addressed. > > > > And the same goes for the other things that were discussed up-thread > > regarding memory context and error handling and such. > > Well, I don't know how to make you happy here. I suppose I should admit that, first off, I don't feel you're required to make me happy, and I don't think it's necessary to make me happy to get this feature into PG. Since you expressed that interest though, I'll go out on a limb and say that what would make me *really* happy would be to think about where the project should be taking pg_basebackup, what we should be working on *today* to address the concerns we hear about from our users, and to consider the best way to implement solutions to what they're actively asking for a core backup solution to be providing. 
I get that maybe that isn't how the world works and that sometimes we have people who write our paychecks wanting us to work on something else, and yes, I'm sure there are some users who are asking for this specific thing but I certainly don't think it's a common ask of pg_basebackup or what users feel is missing from the backup options we offer in core; we had users on this list specifically saying they *wouldn't* use this feature (referring to the differential backup stuff, of course), in fact, because of the things which are missing, which is pretty darn rare. That's what would make *me* happy. Even some comments about how to *get* there while also working towards these features would be likely to make me happy. Instead, I feel like we're being told that we need this feature badly in v13 and we're going to cut bait and do whatever is necessary to get us there. > It looks to me like > insisting on a JSON-format manifest will likely mean that this doesn't > get into PG13 or PG14 or probably PG15, because a port of all that > machinery to work in frontend code will be neither simple nor quick. I certainly understand that these things take time, sometimes quite a bit of it as the past 2 years have shown in this other little side project, and that was hacking without having to go through the much larger effort involved in getting things into PG core. That doesn't mean that kind of effort isn't worthwhile or that, because something is a bunch of work, we shouldn't spend the time on it. I do feel what you're after here is a multi-year project, and I've said before that I don't agree that this is a feature (the differential backup with pg_basebackup thing) that makes any sense going into PG at this time, but I'm also not trying to block this feature, just to share the experience that we've gotten from working in this area for quite a while and hopefully help guide the effort in PG away from pitfalls and in a good direction long-term. 
> If you want this to happen for this release, you've got to be willing > to settle for something that can be implemented in the time we have. I'm not sure what you're expecting here, but for my part, at least, I'm not going to be terribly upset if this feature doesn't make this release because there's an agreement and understanding that the current direction isn't a good long-term solution. Nor am I going to be terribly upset about the time that's been spent on this particular approach given that there's been no shortage of people commenting that they'd rather see an extensible format, like JSON, and has been for quite some time. All that said- one thing we've done is to consider that *we* are the ones who are writing the JSON, while also being the ones to read it- we don't need the parsing side to understand and deal with *any* JSON that might exist out there, just whatever it is the server creates/created. It may be possible to use that to simplify the parser, or perhaps at least to accept that if it ends up being given something else that it might not perform as well with it. I'm not sure how helpful that will be to you, but I recall David finding it a helpful thought. > I'm not sure whether what you and David are arguing boils down to > thinking that I'm wrong when I say that doing that is hard, or whether > you know it's hard but you just don't care because you'd rather see > the feature go nowhere than use a format other than JSON. I don't see > much difference between the latter position and a desire to block the > feature permanently. And if it's the former then you have yet to make > any suggestions for how to get it done with reasonable effort. There seems to be a great deal of daylight between the two positions you're proposing I might have (as I don't speak for David..). I *do* think there's a lot of work that would need to be done here to make this a good solution. I'm *not* completely against other formats besides JSON. 
Even more so though, I am *not* arguing that this feature should go 'nowhere', whether it uses JSON or not. What I don't care for is having a hand-hacked inflexible format that's going to require everyone down the road to implement their own parser for it and bespoke code for every version of the custom format that there ends up being, *including* PG core, to be clear. Whatever utility is going to be utilizing this manifest, it's going to need to support older versions, just like pg_dump deals with older versions of custom format dumps (though we still get people complaining about not being able to use older tools with newer dumps- it'd be awful nice if we could use JSON, or something, and then just *add* things that wouldn't break older tools, except for the rare case where we don't have a choice..). Not to mention the debugging grief and such, since we can't just use a tool like jq to check out what's going on. As to the reference to pg_hba.conf- I don't think the packagers would necessarily agree that there's been little grief around that, but even so, a given pg_hba.conf is only going to be used with a given major version and, sure, it might have to be updated to that newer major version's format if we change the format and someone copies the old version to the new version, but that's during a major version upgrade of the server, and at least newer tools don't have to deal with the older pg_hba.conf version. Also, pg_hba.conf doesn't seem like a terribly good example in any case- the last time the actual structure of that file was changed in a breaking way was in 2002 when the 'user' column was added, and the example pg_hba.conf from that commit works just fine with PG12, it seems, based on some quick tests. There have been other backwards-incompatible changes, of course, the last being 6 years ago, I think, when 'krb5' was removed. 
I suppose there is some chance that you might have a PG12-configured pg_hba.conf and you try copying that back to a PG11 or PG10 server and it doesn't work, but that strikes me as far less of an issue than trying to read a PG12 backup with a PG11 tool, which we know people do because they complain on the lists about it with pg_dump/pg_restore. Thanks, Stephen
On Fri, Jan 3, 2020 at 2:35 PM Stephen Frost <sfrost@snowman.net> wrote: > > Well, I don't know how to make you happy here. > > I suppose I should admit that, first off, I don't feel you're required > to make me happy, and I don't think it's necessary to make me happy to > get this feature into PG. Fair enough. That is gracious of you, but I would like to try to make you happy if it is possible to do so. > Since you expressed that interest though, I'll go out on a limb and say > that what would make me *really* happy would be to think about where the > project should be taking pg_basebackup, what we should be working on > *today* to address the concerns we hear about from our users, and to > consider the best way to implement solutions to what they're actively > asking for a core backup solution to be providing. I get that maybe > that isn't how the world works and that sometimes we have people who > write our paychecks wanting us to work on something else, and yes, I'm > sure there are some users who are asking for this specific thing but I > certainly don't think it's a common ask of pg_basebackup or what users > feel is missing from the backup options we offer in core; we had users > on this list specifically saying they *wouldn't* use this feature > (referring to the differential backup stuff, of course), in fact, > because of the things which are missing, which is pretty darn rare. Well, I mean, what you seem to be suggesting here is that somebody is driving me with a stick to do something that I don't really like but have to do because otherwise I won't be able to make rent, but that's actually not the case. I genuinely believe that this is a good design, and it's driven by me, not some shadowy conglomerate of EnterpriseDB executives who are out to make PostgreSQL suck. If I'm wrong and the design sucks, that's again not the fault of shadowy EnterpriseDB executives; it's my fault. 
Incidentally, my boss is not very shadowy anyhow; he's a super-nice guy, and a major reason why I work here. :-) I don't think the issue here is that I haven't thought about what users want, but that not everybody wants the same thing, and it seems like the people with whom I interact want somewhat different things than those with whom you interact. EnterpriseDB has an existing tool that does parallel and block-level incremental backup, and I started out with the goal of providing those same capabilities in core. They are quite popular with EnterpriseDB customers, and I'd like to make them more widely available and, as far as I can, improve on them. From our previous discussion and from a (brief) look at pgbackrest, I gather that the interests of your customers are somewhat different. Apparently, block-level incremental backup isn't quite as important to your customers, perhaps because you've already got file-level incremental backup, but various other things like encryption and backup verification are extremely important, and you've got a set of ideas about what would be valuable in the future which I'm sure is based on real input from your customers. I hope you pursue those ideas, and I hope you do it in core rather than in a separate piece of software, but that's up to you. Meanwhile, I think that if I have somewhat different ideas about what I'd like to pursue, that ought to be just fine. And I don't think it is unreasonable to hope that you'll acknowledge my goals as legitimate even if you have different ones. I want to point out that my idea about how to do all of this has shifted by a considerable amount based on the input that you and David have provided. My original design didn't involve a backup manifest, but now it does. That turned out to be necessary, but it was also something you suggested, and something where I asked and took advice on what ought to go into it. 
Likewise, you suggested that the process of taking the backup should involve giving the client more control rather than trying to do everything on the server side, and that is now the design which I plan to pursue. You suggested that because it would be more advantageous for out-of-core backup tools, such as pgbackrest, and I acknowledge that as a benefit and I think we're headed in that direction. I am not doing a single thing which, to my knowledge, blocks anything that you might want to do with pg_basebackup in the future. I have accepted as much of your input as I believe that I can without killing the project off completely. To go further, I'd have to either accept years of delay or abandon my priorities entirely and pursue yours. > That's what would make *me* happy. Even some comments about how to > *get* there while also working towards these features would be likely > to make me happy. Instead, I feel like we're being told that we need > this feature badly in v13 and we're going to cut bait and do whatever > is necessary to get us there. This seems like a really unfair accusation given how much work I've put into trying to satisfy you and David. If this patch, the parallel full backup patch, and the incremental backup patch were all to get committed to v13, an outcome which seems pretty unlikely to me at this point, then you would have a very significant number of things that you have requested in the course of the various discussions, and AFAICS the only thing you'd have that you don't want is the need to parse the manifest file using while (<>) { @a = split /\t/, $_ } rather than $a = parse_json(join '', <>). You would, for example, have the ability to request an individual file from the server rather than a complete tarball. Maybe the command that requests a file would lack an encryption option, something which IIUC you would like to have, but that certainly does not leave you worse off. 
It is easier to add an encryption option to a command which you already have than it is to invent a whole new command -- or really several whole new commands, since such a command is not really usable unless you also have facilities to start and stop a backup through the replication protocol. All that being said, I continue to maintain that insisting on JSON is not a reasonable request. It is not easy to parse JSON, or a subset of JSON. The amount of code required to write even a stripped-down JSON parser is far more than the amount required to split a file on tabs, and the existing code we have for the backend cannot be easily (or even with moderate effort) adapted to work in the frontend. On the other hand, the code that pgbackrest would need to parse the manifest file format I've proposed could have easily been written in less time than you've spent arguing about it. Heck, if it helps, I'll offer to write that patch myself (I could be a pgbackrest contributor!). I don't want this effort to suck because something gets rushed through too quickly, but I also don't want it to get derailed because of what I view as a relatively minor detail. It is not always right to take the easier road, but it is also not always wrong. I have no illusions that what is being proposed here is perfect, but lots of features started out imperfect and get better over time -- RLS and parallel query come to mind, among others -- and we often learn from the experience of shipping which parts of the feature are most in need of improvement. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
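[Editorial sketch: the two parsing approaches Robert contrasts -- splitting an ad-hoc tab-delimited manifest versus parsing JSON -- transliterated from the Perl one-liners into Python. The column layout and field names are hypothetical, since the actual manifest format was still under discussion in this thread.]

```python
import json

# Hypothetical manifest content in each style; treat the fields
# (path, size, checksum) as illustrative only.
manifest_tsv = "base/1/1259\t8192\tc0ffee\nglobal/pg_control\t8192\tabad1d\n"
manifest_json = '{"files": [{"path": "base/1/1259", "size": 8192}]}'

# Ad-hoc format: one split per line, almost no parsing code -- but any
# tab or newline inside a file name needs some extra escaping convention.
rows = [line.split("\t") for line in manifest_tsv.splitlines()]

# JSON format: needs a real parser on the frontend side, but string
# escaping and nesting are already solved by the format itself.
files = json.loads(manifest_json)["files"]
```

Either way, a consumer that "really" validates the manifest still has to check the parsed keys and values afterward; the dispute above is only about how much code the first parsing step costs.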
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Fri, Jan 3, 2020 at 2:35 PM Stephen Frost <sfrost@snowman.net> wrote: > > > Well, I don't know how to make you happy here. > > > > I suppose I should admit that, first off, I don't feel you're required > > to make me happy, and I don't think it's necessary to make me happy to > > get this feature into PG. > > Fair enough. That is gracious of you, but I would like to try to make > you happy if it is possible to do so. I certainly appreciate that, but I don't know that it is possible to do so while approaching this in the order that you are, which I tried to point out previously. > > Since you expressed that interest though, I'll go out on a limb and say > > that what would make me *really* happy would be to think about where the > > project should be taking pg_basebackup, what we should be working on > > *today* to address the concerns we hear about from our users, and to > > consider the best way to implement solutions to what they're actively > > asking for a core backup solution to be providing. I get that maybe > > that isn't how the world works and that sometimes we have people who > > write our paychecks wanting us to work on something else, and yes, I'm > > sure there are some users who are asking for this specific thing but I > > certainly don't think it's a common ask of pg_basebackup or what users > > feel is missing from the backup options we offer in core; we had users > > on this list specifically saying they *wouldn't* use this feature > > (referring to the differential backup stuff, of course), in fact, > > because of the things which are missing, which is pretty darn rare. > > Well, I mean, what you seem to be suggesting here is that somebody is > driving me with a stick to do something that I don't really like but > have to do because otherwise I won't be able to make rent, but that's > actually not the case. 
I genuinely believe that this is a good design, > and it's driven by me, not some shadowy conglomerate of EnterpriseDB > executives who are out to make PostgreSQL suck. If I'm wrong and the > design sucks, that's again not the fault of shadowy EnterpriseDB > executives; it's my fault. Incidentally, my boss is not very shadowy > anyhow; he's a super-nice guy, and a major reason why I work here. :-) Then I just have to disagree, really vehemently, that having a block-level incremental backup solution without solid dependency handling between incremental and full backups, solid WAL management and archiving, expiration handling for incremental/full backups and WAL, and the manifest that this thread has been about, is a good design. Ultimately, what this calls for is some kind of 'repository' which you've stressed you don't think is a good idea for pg_basebackup to ever deal with and I just can't disagree more with that. I could perhaps agree that it isn't appropriate for the specific tool "pg_basebackup" to work with a repo because of the goal of that particular tool, but in that case, I don't think pg_basebackup should be the tool to provide a block-level incremental backup solution, it should continue to be a tool to provide a simple and easy way to take a one-time, complete, snapshot of a running PG system over the replication protocol- and adding support for parallel backups, or encrypted backups, or similar things would be completely in-line and appropriate for such a tool, and I'm not against those features being added to pg_basebackup even in advance of anything like support for a repo or dependency handling. > I don't think the issue here is that I haven't thought about what > users want, but that not everybody wants the same thing, and it > seems like the people with whom I interact want somewhat different > things than those with whom you interact. 
EnterpriseDB has an existing > tool that does parallel and block-level incremental backup, and I > started out with the goal of providing those same capabilities in > core. They are quite popular with EnterpriseDB customers, and I'd like > to make them more widely available and, as far as I can, improve on > them. From our previous discussion and from a (brief) look at > pgbackrest, I gather that the interests of your customers are somewhat > different. Apparently, block-level incremental backup isn't quite as > important to your customers, perhaps because you've already got > file-level incremental backup, but various other things like > encryption and backup verification are extremely important, and you've > got a set of ideas about what would be valuable in the future which > I'm sure is based on real input from your customers. I hope you pursue > those ideas, and I hope you do it in core rather than in a separate > piece of software, but that's up to you. Meanwhile, I think that if I > have somewhat different ideas about what I'd like to pursue, that > ought to be just fine. And I don't think it is unreasonable to hope > that you'll acknowledge my goals as legitimate even if you have > different ones. I'm all for block-level incremental backup, in general (though I've got concerns about it from a correctness standpoint.. I certainly think it's going to be difficult to get right and probably finicky, but hopefully your experience with BART has let you identify where the dragons lie and it'll be interesting to see what that code looks like and if the approach used can be leveraged in other tools), but I am concerned about how we're getting there. > I want to point out that my idea about how to do all of this has > shifted by a considerable amount based on the input that you and David > have provided. My original design didn't involve a backup manifest, > but now it does. 
That turned out to be necessary, but it was also > something you suggested, and something where I asked and took advice > on what ought to go into it. Likewise, you suggested that the process > of taking the backup should involve giving the client more control > rather than trying to do everything on the server side, and that is > now the design which I plan to pursue. You suggested that because it > would be more advantageous for out-of-core backup tools, such as > pgbackrest, and I acknowledge that as a benefit and I think we're > headed in that direction. I am not doing a single thing which, to my > knowledge, blocks anything that you might want to do with > pg_basebackup in the future. I have accepted as much of your input as > I believe that I can without killing the project off completely. To go > further, I'd have to either accept years of delay or abandon my > priorities entirely and pursue yours. While I'm hopeful that the parallel backup pieces will be useful to out-of-core backup tools, I've been increasingly less confident that it'll end up being very useful to pgbackrest, as much as I would like it to be. Perhaps after it's in place we might be able to work on it to make it useful, but we'd need to push all the features like encryption and options for compression and such into the backend, in a way that works for pgbackrest, to be able to leverage it, and I'm not sure that would get much support or that it could be done in a way that doesn't end up causing problems for pg_basebackup, which clearly wouldn't be acceptable. Further, if we can't leverage the PG backup protocol that you're building here, it seems pretty darn unlikely we'd have much use for the manifest that's built as part of that. 
I'm probably going to lose what credibility I have in criticizing what you're doing with pg_basebackup here, but I started off saying you don't have to make me happy and this is part of why- I really don't think there's much that you're doing with pg_basebackup that is ultimately going to impact what plans I have for the future, for pretty much anything. I haven't got any real specific plans around pg_basebackup, though, point-in-fact, if you put in a bunch of code that shows how to get PG and pg_basebackup to do block-level incremental backups in a safe and trusted way, that would actually be *really* useful to the pgbackrest project because we could then lift that logic out of pg_basebackup and leverage it. If I wanted to be entirely selfish, I'd be pushing you to get block-level incremental backup into pg_basebackup as quickly as possible so that we could have such an example of "how to do it in a way that, if it breaks, the PG community will figure out what went wrong and fix it". If you look at other things we've done, such as not backing up unlogged tables, that's exactly the approach we've used: introduce the feature into pg_basebackup *first*, make sure the community agrees that it's a valid approach and will deal with any issues with it (and will take pains to avoid *breaking* it in future versions..), and only *then* introduce it into pgbackrest by using the same approach. Those other features were well in-line with what makes sense for pg_basebackup too though. 
We haven't done that though, and I haven't been pushing in that direction, not because I think it's a bad feature or that I want to block something going into pg_basebackup or whatever, but because I think it's actually going to cause more problems for users than it solves because some users will want to use it (though not all, as we've seen on this list, as there's at least some users out there who are as scared of the idea of having *just* this in pg_basebackup without the other things I talk about above as I am) and then they're going to try and hack together all those other things they need around WAL management and archiving and expiration and they're likely to get it wrong- perhaps in obvious ways, perhaps in relatively subtle ways, but either way, they'll end up with backups that aren't valid that they only discover when they're in an emergency. Again, perhaps selfish me would say "oh good, then they'll call me and pay me lots to fix it for them", but it certainly wouldn't look good for the community- even if all of the documentation and everything we put out there says that the way they were doing it had this subtle issue or whatever (considering our docs still promote a really bad, imv anyway, archive command kinda makes this likely, if you ask me anyway..), and it wouldn't be good for the user. > > That's what would make *me* happy. Even some comments about how to > > *get* there while also working towards these features would be likely > > to make me happy. Instead, I feel like we're being told that we need > > this feature badly in v13 and we're going to cut bait and do whatever > > is necessary to get us there. > > This seems like a really unfair accusation given how much work I've > put into trying to satisfy you and David. 
If this patch, the parallel > full backup patch, and the incremental backup patch were all to get > committed to v13, an outcome which seems pretty unlikely to me at this > point, then you would have a very significant number of things that > you have requested in the course of the various discussions, and > AFAICS the only thing you'd have that you don't want is the need to > parse the manifest file using while (<>) { @a = split /\t/, $_ } rather > than $a = parse_json(join '', <>). You would, for example, have the > ability to request an individual file from the server rather than a > complete tarball. Maybe the command that requests a file would lack an > encryption option, something which IIUC you would like to have, but > that certainly does not leave you worse off. It is easier to add an > encryption option to a command which you already have than it is to > invent a whole new command -- or really several whole new commands, > since such a command is not really usable unless you also have > facilities to start and stop a backup through the replication > protocol. No, the manifest format is definitely not the only issue that I have with this- but as it relates to the thread about building a manifest, my complaint really is isolated to the format and just forward thinking about how the format you're advocating for will mean custom code for who knows how many different tools. While I appreciate the offer to write all the bespoke code for every version of the manifest for pgbackrest, I'm really not thrilled about the idea of having to have that extra code and having to then maintain it. Yes, when you compare the single format of the manifest and the code required for it against a JSON parser, if we only ever have this one format then it'd win in terms of code, but I don't believe it'll end up being one format, instead we're going to end up with multiple formats, each of which will have some additional code for dealing with parsing it, and that's going to add up. 
That's also going to, as I said before, make it almost certain that we can't use older tools with newer backups. These are issues that we've thought about and worried about over the years of pgbackrest and with that experience we've come down on the side that a JSON-based format would be an altogether better design. That's why we're advocating for it, not because it requires more code or so that it delays the efforts here, but because we've been there, we've used other formats, we've dealt with user complaints when we do break things, this is all history for us that's helped us learn- for PG, it looks like the future with a static format, and I get that the future is hard to predict and pg_basebackup isn't pgbackrest and yeah, I could be completely wrong because I don't actually have a crystal ball, but this starting point sure looks really familiar. Thanks, Stephen
Hi Robert, On 1/7/20 6:33 PM, Stephen Frost wrote: > These are issues that we've thought > about and worried about over the years of pgbackrest and with that > experience we've come down on the side that a JSON-based format would be > an altogether better design. That's why we're advocating for it, not > because it requires more code or so that it delays the efforts here, but > because we've been there, we've used other formats, we've dealt with > user complaints when we do break things, this is all history for us > that's helped us learn- for PG, it looks like the future with a static > format, and I get that the future is hard to predict and pg_basebackup > isn't pgbackrest and yeah, I could be completely wrong because I don't > actually have a crystal ball, but this starting point sure looks really > familiar. For example, have you considered what will happen if you have a file in the cluster with a tab in the name? This is perfectly valid in Posix filesystems, at least. You may already be escaping tabs but the simple code snippet you provided earlier isn't going to work so well either way. It gets complicated quickly. I know users should not be creating weird files in PGDATA, but it's amazing how often this sort of thing pops up. We currently have an open issue because = in file names breaks our file format. Tab is surely less common but it's amazing what users will do. Another fun one is 03849840 which fixes the handling of \ characters in the code which checksums the manifest. The file is not fully JSON but the checksums are and that was initially missed in the C migration. The bug never got released but it easily could have been. In short, using a quick-and-dirty homegrown format seemed great at first but has caused many headaches. Because we don't change the repo format across releases we are kind of stuck with past sins until we create a new repo format and write update/compatibility code. 
Users are understandably concerned if new versions of the software won't work with their repo, some of which contain years of backups (really). This doesn't even get into the work everyone else will need to do to read a custom format. I do appreciate your offer of contributing parser code to pgBackRest, but honestly I'd rather it were not necessary. Though of course I'd still love to see a contribution of some sort from you! Hard experience tells me that using a standard format where all these issues have been worked out is the way to go. There are a few MIT-licensed JSON projects that are implemented in a single file. cJSON is very capable while JSMN is very minimal. Is it possible that one of those (or something like it) would be acceptable? It looks like the one requirement we have is that the JSON can be streamed rather than just building up one big blob? Even with that requirement there are a few tricks that can be used. JSON nests rather nicely after all so the individual file records can be transmitted independently of the overall file format. Your first question may be why didn't pgBackRest use one of those parsers? The answer is that JSON parsing/rendering is pretty trivial. Memory management and a (datum-like) type system are the hard parts and pgBackRest already had those. Would it be acceptable to bring in JSON code with a compatible license to use in libcommon? If so I'm willing to help adapt that code for use in Postgres. It's possible that the pgBackRest code could be adapted similarly, but it might make more sense to start from one of these general purpose parsers. Thoughts? -- -David david@pgmasters.net
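[Editorial sketch: David's observation that JSON nests nicely, so individual file records can be transmitted independently of the overall format, can be illustrated with newline-delimited records -- one self-contained JSON object per file -- so neither side buffers the whole manifest. The field names are invented for illustration. Note that a tab embedded in a Posix file name, the case raised above, needs no special handling because JSON string escaping already covers it.]

```python
import io
import json

# Hypothetical per-file manifest records; the second file name contains
# a literal tab, which JSON serializes safely as \t inside the string.
records = [
    {"path": "base/1/1259", "size": 8192},
    {"path": "weird\tname", "size": 0},
]

# Sender side: emit one self-contained JSON object per line.
wire = "".join(json.dumps(r) + "\n" for r in records)

# Receiver side: parse each record independently as it arrives, never
# holding the whole manifest in memory at once.
parsed = [json.loads(line) for line in io.StringIO(wire)]
```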
On Thu, Jan 9, 2020 at 8:19 PM David Steele <david@pgmasters.net> wrote: > For example, have you considered what will happen if you have a file in > the cluster with a tab in the name? This is perfectly valid in Posix > filesystems, at least. Yeah, there's code for that in the patch I posted. I don't think the validator patch deals with it, but that's fixable. > You may already be escaping tabs but the simple > code snippet you provided earlier isn't going to work so well either > way. It gets complicated quickly. Sure, but obviously neither of those code snippets were intended to be used straight out of the box. Even after you parse the manifest as JSON, you would still - if you really want to validate it - check that you have the keys and values you expect, that the individual field values are sensible, etc. I still stand by my earlier contention that, as things stand today, you can parse an ad-hoc format in less code than a JSON format. If we had a JSON parser available on the front end, I think it'd be roughly comparable, but maybe the JSON format would come out a bit ahead. Not sure. > There are a few MIT-licensed JSON projects that are implemented in a > single file. cJSON is very capable while JSMN is very minimal. Is is > possible that one of those (or something like it) would be acceptable? > It looks like the one requirement we have is that the JSON can be > streamed rather than just building up one big blob? Even with that > requirement there are a few tricks that can be used. JSON nests rather > nicely after all so the individual file records can be transmitted > independently of the overall file format. I haven't really looked at these. I would have expected that including a second JSON parser in core would provoke significant opposition. Generally, people dislike having more than one piece of code to do the same thing. I would also expect that depending on an external package would provoke significant opposition. 
If we suck the code into core, then we have to keep it up to date with the upstream, which is a significant maintenance burden - look at all the time Tom has spent on snowball, regex, and time zone code over the years. If we don't suck the code into core but depend on it, then every developer needs to have that package installed on their operating system, and every packager has to make sure that it is being built for their OS so that PostgreSQL can depend on it. Perhaps JSON is so popular today that imposing such a requirement would provoke only a groundswell of support, but based on past precedent I would assume that if I committed a patch of this sort the chances that I'd have to revert it would be about 99.9%. Optional dependencies for optional features are usually pretty well-tolerated when they're clearly necessary: e.g. you can't really do JIT without depending on something like LLVM, but the bar for a mandatory dependency has historically been quite high. > Would it be acceptable to bring in JSON code with a compatible license > to use in libcommon? If so I'm willing to help adapt that code for use > in Postgres. It's possible that the pgBackRest code could be adapted > similarly, but it might make more sense to start from one of these > general purpose parsers. For the reasons above, I expect this approach would be rejected, by Tom and by others. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > ... I would also expect that depending on an external package > would provoke significant opposition. If we suck the code into core, > then we have to keep it up to date with the upstream, which is a > significant maintenance burden - look at all the time Tom has spent on > snowball, regex, and time zone code over the years. Also worth noting is that we have a seriously bad track record about choosing external packages to depend on. The regex code has no upstream maintainer anymore (well, the Tcl guys seem to think that *we* are upstream for that now), and snowball is next door to moribund. With C not being a particularly hip language to develop in anymore, it wouldn't surprise me in the least for any C-code JSON parser we might pick to go dead pretty soon. Between that problem and the likelihood that we'd need to make significant code changes anyway to meet our own coding style etc expectations, I think really we'd have to assume that we're going to fork and maintain our own copy of any code we pick. Now, if it's a small enough chunk of code (and really, how complex is JSON parsing anyway) maybe that doesn't matter. But I tend to agree with Robert's position that it's a big ask for this patch to introduce a frontend JSON parser. regards, tom lane
On Tue, Jan 14, 2020 at 12:53:04PM -0500, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > ... I would also expect that depending on an external package > > would provoke significant opposition. If we suck the code into core, > > then we have to keep it up to date with the upstream, which is a > > significant maintenance burden - look at all the time Tom has spent on > > snowball, regex, and time zone code over the years. > > Also worth noting is that we have a seriously bad track record about > choosing external packages to depend on. The regex code has no upstream > maintainer anymore (well, the Tcl guys seem to think that *we* are > upstream for that now), and snowball is next door to moribund. > With C not being a particularly hip language to develop in anymore, > it wouldn't surprise me in the least for any C-code JSON parser > we might pick to go dead pretty soon. Given jq's extreme popularity and compatible license, I'd nominate that. Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Greetings, * David Fetter (david@fetter.org) wrote: > On Tue, Jan 14, 2020 at 12:53:04PM -0500, Tom Lane wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > > > ... I would also expect that depending on an external package > > > would provoke significant opposition. If we suck the code into core, > > > then we have to keep it up to date with the upstream, which is a > > > significant maintenance burden - look at all the time Tom has spent on > > > snowball, regex, and time zone code over the years. > > > > Also worth noting is that we have a seriously bad track record about > > choosing external packages to depend on. The regex code has no upstream > > maintainer anymore (well, the Tcl guys seem to think that *we* are > > upstream for that now), and snowball is next door to moribund. > > With C not being a particularly hip language to develop in anymore, > > it wouldn't surprise me in the least for any C-code JSON parser > > we might pick to go dead pretty soon. > > Given jq's extreme popularity and compatible license, I'd nominate that. I don't think that really changes Tom's concerns here about having an "upstream" for this. For my part, I don't really agree with the whole "we don't want two different JSON parsers" when we've got two of a bunch of stuff between the frontend and the backend, particularly since I don't really think it'll end up being *that* much code. My thought, which I had expressed to David (though he obviously didn't entirely agree with me since he suggested the other options), was to adapt the pgBackRest JSON parser, which isn't really all that much code. 
Frustratingly, that code has got some internal pgBackRest dependency on things like the memory context system (which looks, unsurprisingly, an awful lot like what is in PG backend), the error handling and logging systems (which are different from PG because they're quite intentionally segregated from each other- something PG would benefit from, imv..), and Variadics (known in the PG backend as Datums, and quite similar to them..). Even so, David's offered to adjust the code to use the frontend's memory management (*cough* malloc()..), and error handling/logging, and he had some idea for Variadics (or maybe just pulling the backend's Datum system in..? He could answer better), and basically write a frontend JSON parser for PG without too much code, no external dependencies, and to make sure it answers this requirement, and I've agreed that he can spend some time on that instead of pgBackRest to get us through this, if everyone else is agreeable to the idea. Obviously this isn't intended to box anyone in- if there turns out even after the code's been written to be some fatal issue with using it, so be it, but we're offering to help. Thanks, Stephen
On Tue, Jan 14, 2020 at 03:35:40PM -0500, Stephen Frost wrote: > Greetings, > > * David Fetter (david@fetter.org) wrote: > > On Tue, Jan 14, 2020 at 12:53:04PM -0500, Tom Lane wrote: > > > Robert Haas <robertmhaas@gmail.com> writes: > > > > ... I would also expect that depending on an external package > > > > would provoke significant opposition. If we suck the code into core, > > > > then we have to keep it up to date with the upstream, which is a > > > > significant maintenance burden - look at all the time Tom has spent on > > > > snowball, regex, and time zone code over the years. > > > > > > Also worth noting is that we have a seriously bad track record about > > > choosing external packages to depend on. The regex code has no upstream > > > maintainer anymore (well, the Tcl guys seem to think that *we* are > > > upstream for that now), and snowball is next door to moribund. > > > With C not being a particularly hip language to develop in anymore, > > > it wouldn't surprise me in the least for any C-code JSON parser > > > we might pick to go dead pretty soon. > > > > Given jq's extreme popularity and compatible license, I'd nominate that. > > I don't think that really changes Tom's concerns here about having an > "upstream" for this. > > For my part, I don't really agree with the whole "we don't want two > different JSON parsers" when we've got two of a bunch of stuff between > the frontend and the backend, particularly since I don't really think > it'll end up being *that* much code. > > My thought, which I had expressed to David (though he obviously didn't > entirely agree with me since he suggested the other options), was to > adapt the pgBackRest JSON parser, which isn't really all that much code. 
> > Frustratingly, that code has got some internal pgBackRest dependency on > things like the memory context system (which looks, unsurprisingly, an > awful lot like what is in PG backend), the error handling and logging > systems (which are different from PG because they're quite intentionally > segregated from each other- something PG would benefit from, imv..), and > Variadics (known in the PG backend as Datums, and quite similar to > them..). It might be more fun to put in that infrastructure and have it gate the manifest feature than to have two vastly different parsers to contend with. I get that putting off the backup manifests isn't an awesome prospect, but neither is rushing them in and getting them wrong in ways we'll still be regretting a decade hence. Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Hi Stephen, On 1/14/20 1:35 PM, Stephen Frost wrote: > > My thought, which I had expressed to David (though he obviously didn't > entirely agree with me since he suggested the other options), was to > adapt the pgBackRest JSON parser, which isn't really all that much code. It's not that I didn't agree, it's just that the pgBackRest code does use mem contexts, the type system, etc. After looking at some other solutions with similar amounts of code I thought they might be more acceptable. At least it seemed like a good idea to throw it out there. > Even so, David's offered to adjust the code to use the frontend's memory > management (*cough* malloc()..), and error handling/logging, and he had > some idea for Variadics (or maybe just pulling the backend's Datum > system in..? He could answer better), and basically write a frontend > JSON parser for PG without too much code, no external dependencies, and > to make sure it answers this requirement, and I've agreed that he can > spend some time on that instead of pgBackRest to get us through this, if > everyone else is agreeable to the idea. To keep it simple I think we are left with callbacks or a somewhat static "what's the next datum" kind of approach. I think the latter could get us through a release or two while we make improvements. > Obviously this isn't intended > to box anyone in- if there turns out even after the code's been written > to be some fatal issue with using it, so be it, but we're offering to > help. I'm happy to work up a prototype unless the consensus is that we absolutely don't want a second JSON parser in core. Regards, -- -David david@pgmasters.net
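[Editorial sketch: the "what's the next datum" approach David describes is essentially a pull (event-stream) interface, in contrast to callbacks. A rough sketch of that interface shape follows; a pre-parsed document stands in here for the real incremental tokenizer a frontend C implementation would need, so this shows only the consumer-facing contract, not the parsing itself.]

```python
import json

def events(doc):
    """Yield (event, value) pairs for doc in document order -- the
    pull-style interface: the caller asks for the next item instead of
    registering callbacks."""
    if isinstance(doc, dict):
        yield ("object_begin", None)
        for key, value in doc.items():
            yield ("key", key)
            yield from events(value)
        yield ("object_end", None)
    elif isinstance(doc, list):
        yield ("array_begin", None)
        for value in doc:
            yield from events(value)
        yield ("array_end", None)
    else:
        yield ("scalar", doc)

# Hypothetical manifest fragment; the consumer walks it one event at a
# time, which is the "what's the next datum" usage pattern.
manifest = json.loads('{"files": [{"path": "global/pg_control"}]}')
ev = list(events(manifest))
```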
David Steele <david@pgmasters.net> writes: > I'm happy to work up a prototype unless the consensus is that we > absolutely don't want a second JSON parser in core. How much code are we talking about? If the answer is "a few hundred lines", it's a lot easier to swallow than if it's "a few thousand". regards, tom lane
Hi Tom,

On 1/14/20 9:47 PM, Tom Lane wrote:
> David Steele <david@pgmasters.net> writes:
>> I'm happy to work up a prototype unless the consensus is that we
>> absolutely don't want a second JSON parser in core.
>
> How much code are we talking about? If the answer is "a few hundred
> lines", it's a lot easier to swallow than if it's "a few thousand".

It's currently about a thousand lines but we have a lot of functions to convert to/from specific types. I imagine the line count would be similar using one of the approaches I discussed above.

Current source attached for reference.

Regards,
--
-David
david@pgmasters.net
Attachment
On Tue, Jan 14, 2020 at 12:53:04PM -0500, Tom Lane wrote:
> Also worth noting is that we have a seriously bad track record about
> choosing external packages to depend on. The regex code has no upstream
> maintainer anymore (well, the Tcl guys seem to think that *we* are
> upstream for that now), and snowball is next door to moribund.
> With C not being a particularly hip language to develop in anymore,
> it wouldn't surprise me in the least for any C-code JSON parser
> we might pick to go dead pretty soon.
>
> Between that problem and the likelihood that we'd need to make
> significant code changes anyway to meet our own coding style etc
> expectations, I think really we'd have to assume that we're going
> to fork and maintain our own copy of any code we pick.
>
> Now, if it's a small enough chunk of code (and really, how complex
> is JSON parsing anyway) maybe that doesn't matter. But I tend to
> agree with Robert's position that it's a big ask for this patch
> to introduce a frontend JSON parser.

I know we have talked about our experience in maintaining external code:

* TCL regex
* Snowball
* Timezone handling

However, the regex code is complex, and the Snowball and timezone code is improved as they add new languages and time zones. I don't see JSON parsing as complex or likely to change much, so it might be acceptable to include it in our frontend code.

As far as using tab-delimited data goes, I know this usage was compared to postgresql.conf and pg_hba.conf, which don't change much. However, those files are not usually machine-written, and do not contain user data, while the backup file might contain user-specified paths if they are not just relative to the PGDATA directory, and that would make escaping a requirement.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
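Bruce's point about escaping can be made concrete: a tablespace path may itself contain a tab or newline, so a tab-delimited manifest can't just split on tabs. A hedged sketch (the escaping scheme and function names here are illustrative, not anything the patch specifies):

```python
# Why a tab-delimited manifest needs escaping: a user-specified path may
# contain a tab. Backslash-escaping keeps the record structure intact.
# Scheme is illustrative only.
def escape_field(s):
    return s.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")

def unescape_field(s):
    out, i = [], 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            out.append({"\\": "\\", "t": "\t", "n": "\n"}[s[i + 1]])
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

path = "weird\tdir/name"            # a path containing a literal tab
line = "\t".join([escape_field(path), "8192"])
# Without escaping this record would naively split into three fields;
# with escaping it still has exactly two.
assert line.count("\t") == 1
assert unescape_field(line.split("\t")[0]) == path
```

JSON sidesteps this because string escaping is already part of the format, which is part of why it keeps coming up in this thread.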
On Fri, Jan 3, 2020 at 6:11 PM Suraj Kharage <suraj.kharage@enterprisedb.com> wrote:
> Thank you for review comments.

Here's a new patch set for this feature.

0001 adds checksum helper functions, similar to what Suraj had incorporated into my original patch but separated out into a separate patch and with some different aesthetic decisions. I also decided to support all of the SHA variants that PG knows about as options and added a function to parse a checksum algorithm name, along the lines I suggested previously.

0002 teaches the server to generate a backup manifest using the format I originally proposed. This is similar to the patch I posted previously, but it spools the manifest to disk as it's being generated, so that we don't run the server out of memory or fail when hitting the 1GB allocation limit.

0003 adds a new utility, pg_validatebackup, to validate a backup against a manifest. Suraj tried to incorporate this into pg_basebackup, which I initially thought might be OK but eventually decided wasn't good, partly because this really wants to take some command-line options entirely unrelated to the options accepted by pg_basebackup. I tried to improve the error checking and the order in which various things are done, too. This is basically a complete rewrite as compared with Suraj's version.

0004 modifies the server to generate a backup manifest in JSON format rather than my originally proposed format. This allows for some comparison of the code doing it one way vs. the other. Assuming we stick with JSON, I will squash this with 0002 at some point.

0005 is very much a work-in-progress and proof-of-concept to modify the backup validator to understand the JSON format. It doesn't validate the manifest checksum at this point; it just prints it out. The error handling needs work. It has other problems, and bugs. Although I'm still not very happy about the idea of using JSON here, I'm pretty happy with the basic approach this patch takes. It demonstrates that the JSON parser can be used for non-trivial things in frontend code, and I'd say the code even looks reasonably clean - with the exception of small details like being buggy and under-commented.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
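The "function to parse a checksum algorithm name" in 0001 is a small name-to-algorithm lookup. A hedged sketch of that idea (the real helper is C in the server; the function name, the option list, and the `none` entry here are assumptions for illustration):

```python
# Sketch of mapping a user-supplied checksum algorithm name to an
# implementation, in the spirit of the patch's helper. Illustrative only.
import hashlib

ALGORITHMS = {
    "none": None,          # build a manifest without per-file checksums
    "sha224": hashlib.sha224,
    "sha256": hashlib.sha256,
    "sha384": hashlib.sha384,
    "sha512": hashlib.sha512,
}

def parse_checksum_algorithm(name):
    """Return a hash constructor for a recognized name, else raise."""
    try:
        return ALGORITHMS[name.lower()]
    except KeyError:
        raise ValueError('unrecognized checksum algorithm: "%s"' % name)

assert parse_checksum_algorithm("SHA256") is hashlib.sha256
```

Centralizing the lookup like this is what makes it cheap to add or retire algorithms later, which was one of the design goals stated at the top of the thread.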
Attachment
On 2/27/20 9:22 PM, Robert Haas wrote:
> Here's a new patch set for this feature.

Thanks Robert. After applying all 5 patches (v8-00*) against PG v13 (commit id afb5465e0cfce7637066eaaaeecab30b0f23fbe3), there are a few issues/observations:

1) Getting a segmentation fault if we run pg_validatebackup against a valid backup_manifest file but a WRONG data directory path:

[centos@tushar-ldap-docker bin]$ ./pg_basebackup -D bk --manifest-checksums=sha224
[centos@tushar-ldap-docker bin]$ cp bk/backup_manifest /tmp/.
[centos@tushar-ldap-docker bin]$ ./pg_validatebackup -m /tmp/backup_manifest random_directory/
pg_validatebackup: * manifest_checksum = f0460cd6aa13cf0c5e35426a41af940a9231e6425cd65115a19778b7abfdaef9
pg_validatebackup: error: could not open directory "random_directory": No such file or directory
Segmentation fault

2) When the '-R' option is used at the time of creating the base backup:

[centos@tushar-ldap-docker bin]$ ./pg_basebackup -D bar -R
[centos@tushar-ldap-docker bin]$ ./pg_validatebackup bar
pg_validatebackup: * manifest_checksum = a195d3a3a82a41200c9ac92c12d764d23c810e7e91b31c44a7d04f67ce012edc
pg_validatebackup: error: "standby.signal" is present on disk but not in the manifest
pg_validatebackup: error: "postgresql.auto.conf" has size 286 on disk but size 88 in the manifest
[centos@tushar-ldap-docker bin]$

--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
On 3/3/20 4:04 PM, tushar wrote:
> Thanks Robert. After applying all the 5 patches (v8-00*) against PG
> v13 (commit id -afb5465e0cfce7637066eaaaeecab30b0f23fbe3),

There is a scenario where pg_validatebackup does not throw an error if some file is deleted from the pg_wal/ folder, but later, at the time of restoring, we get an error:

[centos@tushar-ldap-docker bin]$ ./pg_basebackup -D test1
[centos@tushar-ldap-docker bin]$ ls test1/pg_wal/
000000010000000000000010 archive_status
[centos@tushar-ldap-docker bin]$ rm -rf test1/pg_wal/*
[centos@tushar-ldap-docker bin]$ ./pg_validatebackup test1
pg_validatebackup: * manifest_checksum = 88f1ed995c83e86252466a2c88b3e660a69cfc76c169991134b101c4f16c9df7
pg_validatebackup: backup successfully verified
[centos@tushar-ldap-docker bin]$ ./pg_ctl -D test1 start -o '-p 3333'
waiting for server to start....2020-03-02 20:05:22.732 IST [21441] LOG: starting PostgreSQL 13devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2020-03-02 20:05:22.733 IST [21441] LOG: listening on IPv6 address "::1", port 3333
2020-03-02 20:05:22.733 IST [21441] LOG: listening on IPv4 address "127.0.0.1", port 3333
2020-03-02 20:05:22.736 IST [21441] LOG: listening on Unix socket "/tmp/.s.PGSQL.3333"
2020-03-02 20:05:22.739 IST [21442] LOG: database system was interrupted; last known up at 2020-03-02 20:04:35 IST
2020-03-02 20:05:22.739 IST [21442] LOG: creating missing WAL directory "pg_wal/archive_status"
2020-03-02 20:05:22.886 IST [21442] LOG: invalid checkpoint record
2020-03-02 20:05:22.886 IST [21442] FATAL: could not locate required checkpoint record
2020-03-02 20:05:22.886 IST [21442] HINT: If you are restoring from a backup, touch "/home/centos/pg13_bk_mani/edb/edbpsql/bin/test1/recovery.signal" and add required recovery options.
If you are not restoring from a backup, try removing the file "/home/centos/pg13_bk_mani/edb/edbpsql/bin/test1/backup_label".
Be careful: removing "/home/centos/pg13_bk_mani/edb/edbpsql/bin/test1/backup_label" will result in a corrupt cluster if restoring from a backup.
2020-03-02 20:05:22.886 IST [21441] LOG: startup process (PID 21442) exited with exit code 1
2020-03-02 20:05:22.886 IST [21441] LOG: aborting startup due to startup process failure
2020-03-02 20:05:22.889 IST [21441] LOG: database system is shut down
stopped waiting
pg_ctl: could not start server
Examine the log output.
[centos@tushar-ldap-docker bin]$

--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
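The report above boils down to the extra/missing-file cross-check that a backup validator performs: compare the manifest's file list against what is actually on disk. A hedged sketch of that comparison (the directory layout, the `cross_check` name, and the assumption that pg_wal falls outside the manifest's scope are all for illustration, not a statement about what the patch does):

```python
# Sketch of the extra/missing-file cross-check a backup validator does:
# walk the backup directory and diff it against the manifest's file list.
# If pg_wal is excluded from the scope (an assumption here), deletions
# inside it go unnoticed, as in the report above.
import os
import tempfile

def cross_check(backup_dir, manifest_files, ignore=("pg_wal", "backup_manifest")):
    on_disk = set()
    for root, dirs, files in os.walk(backup_dir):
        dirs[:] = [d for d in dirs if d not in ignore]  # prune ignored dirs
        for f in files:
            if f not in ignore:
                on_disk.add(os.path.relpath(os.path.join(root, f), backup_dir))
    wanted = set(manifest_files)
    return sorted(on_disk - wanted), sorted(wanted - on_disk)  # extra, missing

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "PG_VERSION"), "w") as fh:
        fh.write("13")
    os.mkdir(os.path.join(d, "pg_wal"))  # anything deleted here is invisible
    extra, missing = cross_check(d, ["PG_VERSION", "global/pg_control"])
    assert extra == []
    assert missing == ["global/pg_control"]
```

Whether WAL should be covered by the manifest (or validated separately against the WAL needed to reach consistency) is exactly the open question this bug report raises.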
Hi, Another observation , if i change the ownership of a file which is under global/ directory i.e [root@tushar-ldap-docker global]# chown enterprisedb 2396 and run the pg_validatebackup command, i am getting this message - [centos@tushar-ldap-docker bin]$ ./pg_validatebackup gggg pg_validatebackup: * manifest_checksum = e8cb007bcc9c0deab6eff51cd8d9d9af6af35b86e02f3055e60e70e56737e877 pg_validatebackup: error: could not open file "global/2396": Permission denied *** Error in `./pg_validatebackup': double free or corruption (!prev): 0x0000000001850ba0 *** ======= Backtrace: ========= /lib64/libc.so.6(+0x81679)[0x7fa2248e3679] ./pg_validatebackup[0x401f4c] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fa224884505] ./pg_validatebackup[0x402049] ======= Memory map: ======== 00400000-00415000 r-xp 00000000 fd:03 4044545 /home/centos/pg13_bk_mani/edb/edbpsql/bin/pg_validatebackup 00614000-00615000 r--p 00014000 fd:03 4044545 /home/centos/pg13_bk_mani/edb/edbpsql/bin/pg_validatebackup 00615000-00616000 rw-p 00015000 fd:03 4044545 /home/centos/pg13_bk_mani/edb/edbpsql/bin/pg_validatebackup 017f3000-01878000 rw-p 00000000 00:00 0 [heap] 7fa218000000-7fa218021000 rw-p 00000000 00:00 0 7fa218021000-7fa21c000000 ---p 00000000 00:00 0 7fa21e122000-7fa21e137000 r-xp 00000000 fd:03 141697 /usr/lib64/libgcc_s-4.8.5-20150702.so.1 7fa21e137000-7fa21e336000 ---p 00015000 fd:03 141697 /usr/lib64/libgcc_s-4.8.5-20150702.so.1 7fa21e336000-7fa21e337000 r--p 00014000 fd:03 141697 /usr/lib64/libgcc_s-4.8.5-20150702.so.1 7fa21e337000-7fa21e338000 rw-p 00015000 fd:03 141697 /usr/lib64/libgcc_s-4.8.5-20150702.so.1 7fa21e338000-7fa224862000 r--p 00000000 fd:03 266442 /usr/lib/locale/locale-archive 7fa224862000-7fa224a25000 r-xp 00000000 fd:03 134456 /usr/lib64/libc-2.17.so 7fa224a25000-7fa224c25000 ---p 001c3000 fd:03 134456 /usr/lib64/libc-2.17.so 7fa224c25000-7fa224c29000 r--p 001c3000 fd:03 134456 /usr/lib64/libc-2.17.so 7fa224c29000-7fa224c2b000 rw-p 001c7000 fd:03 134456 
/usr/lib64/libc-2.17.so 7fa224c2b000-7fa224c30000 rw-p 00000000 00:00 0 7fa224c30000-7fa224c47000 r-xp 00000000 fd:03 134485 /usr/lib64/libpthread-2.17.so 7fa224c47000-7fa224e46000 ---p 00017000 fd:03 134485 /usr/lib64/libpthread-2.17.so 7fa224e46000-7fa224e47000 r--p 00016000 fd:03 134485 /usr/lib64/libpthread-2.17.so 7fa224e47000-7fa224e48000 rw-p 00017000 fd:03 134485 /usr/lib64/libpthread-2.17.so 7fa224e48000-7fa224e4c000 rw-p 00000000 00:00 0 7fa224e4c000-7fa224e90000 r-xp 00000000 fd:03 4044478 /home/centos/pg13_bk_mani/edb/edbpsql/lib/libpq.so.5.13 7fa224e90000-7fa225090000 ---p 00044000 fd:03 4044478 /home/centos/pg13_bk_mani/edb/edbpsql/lib/libpq.so.5.13 7fa225090000-7fa225093000 r--p 00044000 fd:03 4044478 /home/centos/pg13_bk_mani/edb/edbpsql/lib/libpq.so.5.13 7fa225093000-7fa225094000 rw-p 00047000 fd:03 4044478 /home/centos/pg13_bk_mani/edb/edbpsql/lib/libpq.so.5.13 7fa225094000-7fa2250b6000 r-xp 00000000 fd:03 130333 /usr/lib64/ld-2.17.so 7fa22527d000-7fa2252a2000 rw-p 00000000 00:00 0 7fa2252b3000-7fa2252b5000 rw-p 00000000 00:00 0 7fa2252b5000-7fa2252b6000 r--p 00021000 fd:03 130333 /usr/lib64/ld-2.17.so 7fa2252b6000-7fa2252b7000 rw-p 00022000 fd:03 130333 /usr/lib64/ld-2.17.so 7fa2252b7000-7fa2252b8000 rw-p 00000000 00:00 0 7ffdf354f000-7ffdf3570000 rw-p 00000000 00:00 0 [stack] 7ffdf3572000-7ffdf3574000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] Aborted [centos@tushar-ldap-docker bin]$ I am getting the error message but along with "*** Error in `./pg_validatebackup': double free or corruption (!prev): 0x0000000001850ba0 ***" messages Is this expected ? regards, On 3/3/20 8:19 PM, tushar wrote: > On 3/3/20 4:04 PM, tushar wrote: >> Thanks Robert. 
After applying all the 5 patches (v8-00*) against PG >> v13 (commit id -afb5465e0cfce7637066eaaaeecab30b0f23fbe3) , > > There is a scenario where pg_validatebackup is not throwing an error > if some file deleted from pg_wal/ folder and but later at the time of > restoring - we are getting an error > > [centos@tushar-ldap-docker bin]$ ./pg_basebackup -D test1 > > [centos@tushar-ldap-docker bin]$ ls test1/pg_wal/ > 000000010000000000000010 archive_status > > [centos@tushar-ldap-docker bin]$ rm -rf test1/pg_wal/* > > [centos@tushar-ldap-docker bin]$ ./pg_validatebackup test1 > pg_validatebackup: * manifest_checksum = > 88f1ed995c83e86252466a2c88b3e660a69cfc76c169991134b101c4f16c9df7 > pg_validatebackup: backup successfully verified > > [centos@tushar-ldap-docker bin]$ ./pg_ctl -D test1 start -o '-p 3333' > waiting for server to start....2020-03-02 20:05:22.732 IST [21441] > LOG: starting PostgreSQL 13devel on x86_64-pc-linux-gnu, compiled by > gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit > 2020-03-02 20:05:22.733 IST [21441] LOG: listening on IPv6 address > "::1", port 3333 > 2020-03-02 20:05:22.733 IST [21441] LOG: listening on IPv4 address > "127.0.0.1", port 3333 > 2020-03-02 20:05:22.736 IST [21441] LOG: listening on Unix socket > "/tmp/.s.PGSQL.3333" > 2020-03-02 20:05:22.739 IST [21442] LOG: database system was > interrupted; last known up at 2020-03-02 20:04:35 IST > 2020-03-02 20:05:22.739 IST [21442] LOG: creating missing WAL > directory "pg_wal/archive_status" > 2020-03-02 20:05:22.886 IST [21442] LOG: invalid checkpoint record > 2020-03-02 20:05:22.886 IST [21442] FATAL: could not locate required > checkpoint record > 2020-03-02 20:05:22.886 IST [21442] HINT: If you are restoring from a > backup, touch > "/home/centos/pg13_bk_mani/edb/edbpsql/bin/test1/recovery.signal" and > add required recovery options. > If you are not restoring from a backup, try removing the file > "/home/centos/pg13_bk_mani/edb/edbpsql/bin/test1/backup_label". 
> Be careful: removing > "/home/centos/pg13_bk_mani/edb/edbpsql/bin/test1/backup_label" will > result in a corrupt cluster if restoring from a backup. > 2020-03-02 20:05:22.886 IST [21441] LOG: startup process (PID 21442) > exited with exit code 1 > 2020-03-02 20:05:22.886 IST [21441] LOG: aborting startup due to > startup process failure > 2020-03-02 20:05:22.889 IST [21441] LOG: database system is shut down > stopped waiting > pg_ctl: could not start server > Examine the log output. > [centos@tushar-ldap-docker bin]$ > -- regards,tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
Another scenario, in which if we modify Manifest-Checksum" value from backup_manifest file , we are not getting an error [centos@tushar-ldap-docker bin]$ ./pg_validatebackup data/ pg_validatebackup: * manifest_checksum = 28d082921650d0ae881de8ceb122c8d2af5f449f51ecfb446827f7f49f91f65d pg_validatebackup: backup successfully verified open backup_manifest file and replace "Manifest-Checksum": "8d082921650d0ae881de8ceb122c8d2af5f449f51ecfb446827f7f49f91f65d"} with "Manifest-Checksum": "Hello World"} rerun the pg_validatebackup [centos@tushar-ldap-docker bin]$ ./pg_validatebackup data/ pg_validatebackup: * manifest_checksum = Hello World pg_validatebackup: backup successfully verified regards, On 3/4/20 3:26 PM, tushar wrote: > Hi, > Another observation , if i change the ownership of a file which is > under global/ directory > i.e > > [root@tushar-ldap-docker global]# chown enterprisedb 2396 > > and run the pg_validatebackup command, i am getting this message - > > [centos@tushar-ldap-docker bin]$ ./pg_validatebackup gggg > pg_validatebackup: * manifest_checksum = > e8cb007bcc9c0deab6eff51cd8d9d9af6af35b86e02f3055e60e70e56737e877 > pg_validatebackup: error: could not open file "global/2396": > Permission denied > *** Error in `./pg_validatebackup': double free or corruption (!prev): > 0x0000000001850ba0 *** > ======= Backtrace: ========= > /lib64/libc.so.6(+0x81679)[0x7fa2248e3679] > ./pg_validatebackup[0x401f4c] > /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fa224884505] > ./pg_validatebackup[0x402049] > ======= Memory map: ======== > 00400000-00415000 r-xp 00000000 fd:03 4044545 > /home/centos/pg13_bk_mani/edb/edbpsql/bin/pg_validatebackup > 00614000-00615000 r--p 00014000 fd:03 4044545 > /home/centos/pg13_bk_mani/edb/edbpsql/bin/pg_validatebackup > 00615000-00616000 rw-p 00015000 fd:03 4044545 > /home/centos/pg13_bk_mani/edb/edbpsql/bin/pg_validatebackup > 017f3000-01878000 rw-p 00000000 00:00 > 0 [heap] > 7fa218000000-7fa218021000 rw-p 00000000 00:00 0 > 
7fa218021000-7fa21c000000 ---p 00000000 00:00 0 > 7fa21e122000-7fa21e137000 r-xp 00000000 fd:03 > 141697 /usr/lib64/libgcc_s-4.8.5-20150702.so.1 > 7fa21e137000-7fa21e336000 ---p 00015000 fd:03 > 141697 /usr/lib64/libgcc_s-4.8.5-20150702.so.1 > 7fa21e336000-7fa21e337000 r--p 00014000 fd:03 > 141697 /usr/lib64/libgcc_s-4.8.5-20150702.so.1 > 7fa21e337000-7fa21e338000 rw-p 00015000 fd:03 > 141697 /usr/lib64/libgcc_s-4.8.5-20150702.so.1 > 7fa21e338000-7fa224862000 r--p 00000000 fd:03 > 266442 /usr/lib/locale/locale-archive > 7fa224862000-7fa224a25000 r-xp 00000000 fd:03 > 134456 /usr/lib64/libc-2.17.so > 7fa224a25000-7fa224c25000 ---p 001c3000 fd:03 > 134456 /usr/lib64/libc-2.17.so > 7fa224c25000-7fa224c29000 r--p 001c3000 fd:03 > 134456 /usr/lib64/libc-2.17.so > 7fa224c29000-7fa224c2b000 rw-p 001c7000 fd:03 > 134456 /usr/lib64/libc-2.17.so > 7fa224c2b000-7fa224c30000 rw-p 00000000 00:00 0 > 7fa224c30000-7fa224c47000 r-xp 00000000 fd:03 > 134485 /usr/lib64/libpthread-2.17.so > 7fa224c47000-7fa224e46000 ---p 00017000 fd:03 > 134485 /usr/lib64/libpthread-2.17.so > 7fa224e46000-7fa224e47000 r--p 00016000 fd:03 > 134485 /usr/lib64/libpthread-2.17.so > 7fa224e47000-7fa224e48000 rw-p 00017000 fd:03 > 134485 /usr/lib64/libpthread-2.17.so > 7fa224e48000-7fa224e4c000 rw-p 00000000 00:00 0 > 7fa224e4c000-7fa224e90000 r-xp 00000000 fd:03 4044478 > /home/centos/pg13_bk_mani/edb/edbpsql/lib/libpq.so.5.13 > 7fa224e90000-7fa225090000 ---p 00044000 fd:03 4044478 > /home/centos/pg13_bk_mani/edb/edbpsql/lib/libpq.so.5.13 > 7fa225090000-7fa225093000 r--p 00044000 fd:03 4044478 > /home/centos/pg13_bk_mani/edb/edbpsql/lib/libpq.so.5.13 > 7fa225093000-7fa225094000 rw-p 00047000 fd:03 4044478 > /home/centos/pg13_bk_mani/edb/edbpsql/lib/libpq.so.5.13 > 7fa225094000-7fa2250b6000 r-xp 00000000 fd:03 > 130333 /usr/lib64/ld-2.17.so > 7fa22527d000-7fa2252a2000 rw-p 00000000 00:00 0 > 7fa2252b3000-7fa2252b5000 rw-p 00000000 00:00 0 > 7fa2252b5000-7fa2252b6000 r--p 00021000 fd:03 > 130333 
/usr/lib64/ld-2.17.so > 7fa2252b6000-7fa2252b7000 rw-p 00022000 fd:03 > 130333 /usr/lib64/ld-2.17.so > 7fa2252b7000-7fa2252b8000 rw-p 00000000 00:00 0 > 7ffdf354f000-7ffdf3570000 rw-p 00000000 00:00 > 0 [stack] > 7ffdf3572000-7ffdf3574000 r-xp 00000000 00:00 > 0 [vdso] > ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 > 0 [vsyscall] > Aborted > [centos@tushar-ldap-docker bin]$ > > > I am getting the error message but along with "*** Error in > `./pg_validatebackup': double free or corruption (!prev): > 0x0000000001850ba0 ***" messages > > Is this expected ? > > regards, > > On 3/3/20 8:19 PM, tushar wrote: >> On 3/3/20 4:04 PM, tushar wrote: >>> Thanks Robert. After applying all the 5 patches (v8-00*) against PG >>> v13 (commit id -afb5465e0cfce7637066eaaaeecab30b0f23fbe3) , >> >> There is a scenario where pg_validatebackup is not throwing an error >> if some file deleted from pg_wal/ folder and but later at the time >> of restoring - we are getting an error >> >> [centos@tushar-ldap-docker bin]$ ./pg_basebackup -D test1 >> >> [centos@tushar-ldap-docker bin]$ ls test1/pg_wal/ >> 000000010000000000000010 archive_status >> >> [centos@tushar-ldap-docker bin]$ rm -rf test1/pg_wal/* >> >> [centos@tushar-ldap-docker bin]$ ./pg_validatebackup test1 >> pg_validatebackup: * manifest_checksum = >> 88f1ed995c83e86252466a2c88b3e660a69cfc76c169991134b101c4f16c9df7 >> pg_validatebackup: backup successfully verified >> >> [centos@tushar-ldap-docker bin]$ ./pg_ctl -D test1 start -o '-p 3333' >> waiting for server to start....2020-03-02 20:05:22.732 IST [21441] >> LOG: starting PostgreSQL 13devel on x86_64-pc-linux-gnu, compiled by >> gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit >> 2020-03-02 20:05:22.733 IST [21441] LOG: listening on IPv6 address >> "::1", port 3333 >> 2020-03-02 20:05:22.733 IST [21441] LOG: listening on IPv4 address >> "127.0.0.1", port 3333 >> 2020-03-02 20:05:22.736 IST [21441] LOG: listening on Unix socket >> "/tmp/.s.PGSQL.3333" >> 2020-03-02 
20:05:22.739 IST [21442] LOG: database system was >> interrupted; last known up at 2020-03-02 20:04:35 IST >> 2020-03-02 20:05:22.739 IST [21442] LOG: creating missing WAL >> directory "pg_wal/archive_status" >> 2020-03-02 20:05:22.886 IST [21442] LOG: invalid checkpoint record >> 2020-03-02 20:05:22.886 IST [21442] FATAL: could not locate required >> checkpoint record >> 2020-03-02 20:05:22.886 IST [21442] HINT: If you are restoring from >> a backup, touch >> "/home/centos/pg13_bk_mani/edb/edbpsql/bin/test1/recovery.signal" and >> add required recovery options. >> If you are not restoring from a backup, try removing the file >> "/home/centos/pg13_bk_mani/edb/edbpsql/bin/test1/backup_label". >> Be careful: removing >> "/home/centos/pg13_bk_mani/edb/edbpsql/bin/test1/backup_label" will >> result in a corrupt cluster if restoring from a backup. >> 2020-03-02 20:05:22.886 IST [21441] LOG: startup process (PID 21442) >> exited with exit code 1 >> 2020-03-02 20:05:22.886 IST [21441] LOG: aborting startup due to >> startup process failure >> 2020-03-02 20:05:22.889 IST [21441] LOG: database system is shut down >> stopped waiting >> pg_ctl: could not start server >> Examine the log output. >> [centos@tushar-ldap-docker bin]$ >> > -- regards,tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
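The "Hello World" report above is about the manifest's own integrity check: replacing the stored Manifest-Checksum should make validation fail, which means the validator must recompute a digest over the manifest bytes and compare it to the stored value. A hedged sketch of one plausible scheme (the exact byte range covered, the marker handling, and the function names are assumptions for illustration, not the patch's actual format rules):

```python
# What validating the manifest's own checksum could look like: recompute
# SHA-256 over the bytes preceding the stored digest and compare.
# Byte-range convention is an assumption for this sketch.
import hashlib

MARKER = b'"Manifest-Checksum": "'

def make_manifest(body):
    payload = b'{' + body + b', ' + MARKER
    digest = hashlib.sha256(payload).hexdigest().encode()
    return payload + digest + b'"}'

def verify_manifest(data):
    pos = data.rfind(MARKER)
    if pos < 0:
        return False
    payload = data[:pos + len(MARKER)]
    stored = data[pos + len(MARKER):].rstrip(b'"}').decode("ascii", "replace")
    return hashlib.sha256(payload).hexdigest() == stored

m = make_manifest(b'"Files": []')
assert verify_manifest(m)
# Tamper with one digit of the stored digest: verification must now fail.
tampered = m[:-3] + (b"0" if m[-3:-2] != b"0" else b"1") + m[-2:]
assert not verify_manifest(tampered)
```

The v8 behavior reported above (printing the checksum without comparing it) matches Robert's note that 0005 "doesn't validate the manifest checksum at this point; it just prints it out."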
Hi,
There is a scenario in which I add something inside the pg_tblspc directory and get an error like:
pg_validatebackup: * manifest_checksum = 77ddacb4e7e02e2b880792a19a3adf09266dd88553dd15cfd0c22caee7d9cc04
pg_validatebackup: error: "pg_tblspc/16385/PG_13_202002271/test" is present on disk but not in the manifest
but if I remove the 'PG_13_202002271' directory then there is no error:
[centos@tushar-ldap-docker bin]$ ./pg_validatebackup data
pg_validatebackup: * manifest_checksum = 77ddacb4e7e02e2b880792a19a3adf09266dd88553dd15cfd0c22caee7d9cc04
pg_validatebackup: backup successfully verified
Steps to reproduce -
--connect to a psql terminal, create a tablespace
postgres=# \! mkdir /tmp/my_tblspc
postgres=# create tablespace tbs location '/tmp/my_tblspc';
CREATE TABLESPACE
postgres=# \q
--run pg_basebackup
[centos@tushar-ldap-docker bin]$ ./pg_basebackup -D data_dir -T /tmp/my_tblspc/=/tmp/new_my_tblspc
[centos@tushar-ldap-docker bin]$
[centos@tushar-ldap-docker bin]$ ls /tmp/new_my_tblspc/
PG_13_202002271
--create a new file under PG_13_* folder
[centos@tushar-ldap-docker bin]$ touch /tmp/new_my_tblspc/PG_13_202002271/test
[centos@tushar-ldap-docker bin]$
--run pg_validatebackup, getting an error, which looks expected
[centos@tushar-ldap-docker bin]$ ./pg_validatebackup data_dir/
pg_validatebackup: * manifest_checksum = 3951308eab576906ebdb002ff00ca313b2c1862592168c1f5f7ecf051ac07907
pg_validatebackup: error: "pg_tblspc/16386/PG_13_202002271/test" is present on disk but not in the manifest
[centos@tushar-ldap-docker bin]$
--remove the added file
[centos@tushar-ldap-docker bin]$ rm -rf /tmp/new_my_tblspc/PG_13_202002271/test
--run pg_validatebackup, working fine
[centos@tushar-ldap-docker bin]$ ./pg_validatebackup data_dir/
pg_validatebackup: * manifest_checksum = 3951308eab576906ebdb002ff00ca313b2c1862592168c1f5f7ecf051ac07907
pg_validatebackup: backup successfully verified
[centos@tushar-ldap-docker bin]$
--remove the folder PG_13*
[centos@tushar-ldap-docker bin]$ rm -rf /tmp/new_my_tblspc/PG_13_202002271/
[centos@tushar-ldap-docker bin]$
[centos@tushar-ldap-docker bin]$ ls /tmp/new_my_tblspc/
--run pg_validatebackup, no error reported?
[centos@tushar-ldap-docker bin]$ ./pg_validatebackup data_dir/
pg_validatebackup: * manifest_checksum = 3951308eab576906ebdb002ff00ca313b2c1862592168c1f5f7ecf051ac07907
pg_validatebackup: backup successfully verified
[centos@tushar-ldap-docker bin]$
Start the server -
[centos@tushar-ldap-docker bin]$ ./pg_ctl -D data_dir/ start -o '-p 9033'
waiting for server to start....2020-03-04 19:18:54.839 IST [13097] LOG: starting PostgreSQL 13devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2020-03-04 19:18:54.840 IST [13097] LOG: listening on IPv6 address "::1", port 9033
2020-03-04 19:18:54.840 IST [13097] LOG: listening on IPv4 address "127.0.0.1", port 9033
2020-03-04 19:18:54.842 IST [13097] LOG: listening on Unix socket "/tmp/.s.PGSQL.9033"
2020-03-04 19:18:54.843 IST [13097] LOG: could not open directory "pg_tblspc/16386/PG_13_202002271": No such file or directory
2020-03-04 19:18:54.845 IST [13098] LOG: database system was interrupted; last known up at 2020-03-04 19:14:50 IST
2020-03-04 19:18:54.937 IST [13098] LOG: could not open directory "pg_tblspc/16386/PG_13_202002271": No such file or directory
2020-03-04 19:18:54.939 IST [13098] LOG: could not open directory "pg_tblspc/16386/PG_13_202002271": No such file or directory
2020-03-04 19:18:54.939 IST [13098] LOG: redo starts at 0/18000028
2020-03-04 19:18:54.939 IST [13098] LOG: consistent recovery state reached at 0/18000100
2020-03-04 19:18:54.939 IST [13098] LOG: redo done at 0/18000100
2020-03-04 19:18:54.941 IST [13098] LOG: could not open directory "pg_tblspc/16386/PG_13_202002271": No such file or directory
2020-03-04 19:18:54.984 IST [13097] LOG: database system is ready to accept connections
done
server started
[centos@tushar-ldap-docker bin]$
regards,
On 3/4/20 3:51 PM, tushar wrote:
Another scenario: if we modify the "Manifest-Checksum" value in the backup_manifest file, we do not get an error
[centos@tushar-ldap-docker bin]$ ./pg_validatebackup data/
pg_validatebackup: * manifest_checksum = 28d082921650d0ae881de8ceb122c8d2af5f449f51ecfb446827f7f49f91f65d
pg_validatebackup: backup successfully verified
open backup_manifest file and replace
"Manifest-Checksum": "8d082921650d0ae881de8ceb122c8d2af5f449f51ecfb446827f7f49f91f65d"}
with
"Manifest-Checksum": "Hello World"}
rerun the pg_validatebackup
[centos@tushar-ldap-docker bin]$ ./pg_validatebackup data/
pg_validatebackup: * manifest_checksum = Hello World
pg_validatebackup: backup successfully verified
regards,
On 3/4/20 3:26 PM, tushar wrote:
Hi,
Another observation , if i change the ownership of a file which is under global/ directory
i.e
[root@tushar-ldap-docker global]# chown enterprisedb 2396
and run the pg_validatebackup command, i am getting this message -
[centos@tushar-ldap-docker bin]$ ./pg_validatebackup gggg
pg_validatebackup: * manifest_checksum = e8cb007bcc9c0deab6eff51cd8d9d9af6af35b86e02f3055e60e70e56737e877
pg_validatebackup: error: could not open file "global/2396": Permission denied
*** Error in `./pg_validatebackup': double free or corruption (!prev): 0x0000000001850ba0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81679)[0x7fa2248e3679]
./pg_validatebackup[0x401f4c]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fa224884505]
./pg_validatebackup[0x402049]
Aborted
[centos@tushar-ldap-docker bin]$
I am getting the expected error message, but along with it come the "*** Error in `./pg_validatebackup': double free or corruption (!prev): 0x0000000001850ba0 ***" messages.
Is this expected?
regards,
On 3/3/20 8:19 PM, tushar wrote:
On 3/3/20 4:04 PM, tushar wrote:
Thanks Robert. After applying all 5 patches (v8-00*) against PG v13 (commit id afb5465e0cfce7637066eaaaeecab30b0f23fbe3),
there is a scenario where pg_validatebackup does not throw an error if some file is deleted from the pg_wal/ folder, but later, at the time of restoring, we get an error.
[centos@tushar-ldap-docker bin]$ ./pg_basebackup -D test1
[centos@tushar-ldap-docker bin]$ ls test1/pg_wal/
000000010000000000000010 archive_status
[centos@tushar-ldap-docker bin]$ rm -rf test1/pg_wal/*
[centos@tushar-ldap-docker bin]$ ./pg_validatebackup test1
pg_validatebackup: * manifest_checksum = 88f1ed995c83e86252466a2c88b3e660a69cfc76c169991134b101c4f16c9df7
pg_validatebackup: backup successfully verified
[centos@tushar-ldap-docker bin]$ ./pg_ctl -D test1 start -o '-p 3333'
waiting for server to start....2020-03-02 20:05:22.732 IST [21441] LOG: starting PostgreSQL 13devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2020-03-02 20:05:22.733 IST [21441] LOG: listening on IPv6 address "::1", port 3333
2020-03-02 20:05:22.733 IST [21441] LOG: listening on IPv4 address "127.0.0.1", port 3333
2020-03-02 20:05:22.736 IST [21441] LOG: listening on Unix socket "/tmp/.s.PGSQL.3333"
2020-03-02 20:05:22.739 IST [21442] LOG: database system was interrupted; last known up at 2020-03-02 20:04:35 IST
2020-03-02 20:05:22.739 IST [21442] LOG: creating missing WAL directory "pg_wal/archive_status"
2020-03-02 20:05:22.886 IST [21442] LOG: invalid checkpoint record
2020-03-02 20:05:22.886 IST [21442] FATAL: could not locate required checkpoint record
2020-03-02 20:05:22.886 IST [21442] HINT: If you are restoring from a backup, touch "/home/centos/pg13_bk_mani/edb/edbpsql/bin/test1/recovery.signal" and add required recovery options.
If you are not restoring from a backup, try removing the file "/home/centos/pg13_bk_mani/edb/edbpsql/bin/test1/backup_label".
Be careful: removing "/home/centos/pg13_bk_mani/edb/edbpsql/bin/test1/backup_label" will result in a corrupt cluster if restoring from a backup.
2020-03-02 20:05:22.886 IST [21441] LOG: startup process (PID 21442) exited with exit code 1
2020-03-02 20:05:22.886 IST [21441] LOG: aborting startup due to startup process failure
2020-03-02 20:05:22.889 IST [21441] LOG: database system is shut down
stopped waiting
pg_ctl: could not start server
Examine the log output.
[centos@tushar-ldap-docker bin]$
--
regards, tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
On Wed, Mar 4, 2020 at 3:51 PM tushar <tushar.ahuja@enterprisedb.com> wrote:
Another scenario: if we modify the "Manifest-Checksum" value in the backup_manifest file, we do not get an error.
[centos@tushar-ldap-docker bin]$ ./pg_validatebackup data/
pg_validatebackup: * manifest_checksum = 28d082921650d0ae881de8ceb122c8d2af5f449f51ecfb446827f7f49f91f65d
pg_validatebackup: backup successfully verified
Open the backup_manifest file and replace
"Manifest-Checksum": "8d082921650d0ae881de8ceb122c8d2af5f449f51ecfb446827f7f49f91f65d"}
with
"Manifest-Checksum": "Hello World"}
then rerun pg_validatebackup:
[centos@tushar-ldap-docker bin]$ ./pg_validatebackup data/
pg_validatebackup: * manifest_checksum = Hello World
pg_validatebackup: backup successfully verified
regards,
Yeah, this handling is missing in the provided WIP patch. I believe Robert will consider fixing this in an upcoming version of the validator patch.
--
Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.
On Wed, Mar 4, 2020 at 7:21 PM tushar <tushar.ahuja@enterprisedb.com> wrote:
Hi,
There is a scenario in which I add something inside the pg_tblspc directory, and I get an error like:
pg_validatebackup: * manifest_checksum = 77ddacb4e7e02e2b880792a19a3adf09266dd88553dd15cfd0c22caee7d9cc04
pg_validatebackup: error: "pg_tblspc/16385/PG_13_202002271/test" is present on disk but not in the manifest
but if I remove the 'PG_13_202002271' directory then there is no error:
[centos@tushar-ldap-docker bin]$ ./pg_validatebackup data
pg_validatebackup: * manifest_checksum = 77ddacb4e7e02e2b880792a19a3adf09266dd88553dd15cfd0c22caee7d9cc04
pg_validatebackup: backup successfully verified
This seems expected considering the current design, as we don't log directory entries in backup_manifest. In your case, you have a tablespace with no objects (an empty tablespace), so backup_manifest has no entry for it; hence, when you remove this tablespace directory, the validator cannot detect it.
We can either document this or add entries for directories in the manifest. Robert may have a better idea on this.
--
Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.
Hi,
In a negative test scenario, if I change a size to -1 in backup_manifest, pg_validatebackup gives an error with a huge size number.
[edb@localhost bin]$ ./pg_basebackup -p 5551 -D /tmp/bold --manifest-checksum 'SHA256'
[edb@localhost bin]$ ./pg_validatebackup /tmp/bold
pg_validatebackup: backup successfully verified
--change a file size to -1 and generate new checksum.
[edb@localhost bin]$ vi /tmp/bold/backup_manifest
[edb@localhost bin]$ shasum -a256 /tmp/bold/backup_manifest
c3d7838cbbf991c6108f9c1ab78f673c20d8073114500f14da6ed07ede2dc44a /tmp/bold/backup_manifest
[edb@localhost bin]$ vi /tmp/bold/backup_manifest
[edb@localhost bin]$ ./pg_validatebackup /tmp/bold
pg_validatebackup: error: "global/4183" has size 0 on disk but size 18446744073709551615 in the manifest
Thanks & Regards,
Rajkumar Raghuwanshi
Hi,
There is one scenario where I am somehow able to run pg_validatebackup successfully, but when I try to start the server, it fails.
Steps to reproduce -
--create 2 base backup directories
[centos@tushar-ldap-docker bin]$ ./pg_basebackup -D db1
[centos@tushar-ldap-docker bin]$ ./pg_basebackup -D db2
--run pg_validatebackup, using the backup_manifest of the db1 directory against db2/. We get an error:
[centos@tushar-ldap-docker bin]$ ./pg_validatebackup -m db1/backup_manifest db2/
pg_validatebackup: * manifest_checksum = 5b131aff4a4f86e2a53efd84b003a67b9f615decb0039f19033eefa6f43c1ede
pg_validatebackup: error: checksum mismatch for file "backup_label"
--copy the backup_label of db1 to the db2 folder
[centos@tushar-ldap-docker bin]$ cp db1/backup_label db2/.
--run pg_validatebackup .. working fine
[centos@tushar-ldap-docker bin]$ ./pg_validatebackup -m db1/backup_manifest db2/
pg_validatebackup: * manifest_checksum = 5b131aff4a4f86e2a53efd84b003a67b9f615decb0039f19033eefa6f43c1ede
pg_validatebackup: backup successfully verified
[centos@tushar-ldap-docker bin]$
--try to start the server
[centos@tushar-ldap-docker bin]$ ./pg_ctl -D db2 start -o '-p 7777'
waiting for server to start....2020-03-05 15:33:53.471 IST [24049] LOG: starting PostgreSQL 13devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2020-03-05 15:33:53.471 IST [24049] LOG: listening on IPv6 address "::1", port 7777
2020-03-05 15:33:53.471 IST [24049] LOG: listening on IPv4 address "127.0.0.1", port 7777
2020-03-05 15:33:53.473 IST [24049] LOG: listening on Unix socket "/tmp/.s.PGSQL.7777"
2020-03-05 15:33:53.476 IST [24050] LOG: database system was interrupted; last known up at 2020-03-05 15:32:51 IST
2020-03-05 15:33:53.573 IST [24050] LOG: invalid checkpoint record
2020-03-05 15:33:53.573 IST [24050] FATAL: could not locate required checkpoint record
2020-03-05 15:33:53.573 IST [24050] HINT: If you are restoring from a backup, touch "/home/centos/pg13_bk_mani/edb/edbpsql/bin/db2/recovery.signal" and add required recovery options.
If you are not restoring from a backup, try removing the file "/home/centos/pg13_bk_mani/edb/edbpsql/bin/db2/backup_label".
Be careful: removing "/home/centos/pg13_bk_mani/edb/edbpsql/bin/db2/backup_label" will result in a corrupt cluster if restoring from a backup.
2020-03-05 15:33:53.574 IST [24049] LOG: startup process (PID 24050) exited with exit code 1
2020-03-05 15:33:53.574 IST [24049] LOG: aborting startup due to startup process failure
2020-03-05 15:33:53.575 IST [24049] LOG: database system is shut down
stopped waiting
pg_ctl: could not start server
Examine the log output.
[centos@tushar-ldap-docker bin]$
regards,
There is one small observation: if we use a slash (/) with the -i option, we do not get the desired result.
Steps to reproduce -
==============
[centos@tushar-ldap-docker bin]$ ./pg_basebackup -D test
[centos@tushar-ldap-docker bin]$ touch test/pg_notify/dummy_file
--working
[centos@tushar-ldap-docker bin]$ ./pg_validatebackup --ignore=pg_notify test
pg_validatebackup: * manifest_checksum = be9b72e1320c6c34c131533de19371a10dd5011940181724e43277f786026c7b
pg_validatebackup: backup successfully verified
--not working
[centos@tushar-ldap-docker bin]$ ./pg_validatebackup --ignore=pg_notify/ test
pg_validatebackup: * manifest_checksum = be9b72e1320c6c34c131533de19371a10dd5011940181724e43277f786026c7b
pg_validatebackup: error: "pg_notify/dummy_file" is present on disk but not in the manifest
regards,
On Thu, Mar 5, 2020 at 7:05 AM tushar <tushar.ahuja@enterprisedb.com> wrote:
> There is one small observation if we use slash (/) with option -i then not getting the desired result
Here's an updated patch set responding to many of the comments received thus far. Since there are quite a few emails, let me consolidate my comments and responses here.
Report: Segmentation fault if -m is used to point to a valid manifest, but the actual backup directory is nonexistent.
Response: Fixed; thanks for the report.
Report: pg_validatebackup doesn't complain about problems within the pg_wal directory.
Response: That's out of scope. The WAL files are fetched separately and are therefore not part of the manifest.
Report: Inaccessible file in the data directory being validated leads to a double free.
Response: Fixed; thanks for the report.
Report: Patch 0005 doesn't validate the manifest checksum.
Response: I know. I mentioned that when posting the previous patch set. Fixed in this version, though.
Report: Removing an empty directory doesn't make backup validation fail, even though it might cause problems for the server.
Response: That's a little unfortunate, but I'm not sure it's really worth complicating the patch to deal with it. It's something of a corner case.
Report: Negative file sizes in the backup manifest are interpreted as large integers.
Response: That's also a little unfortunate, but I doubt it's worth adding code to catch it, since any such manifest is corrupt. Also, it's not like we're ignoring it; the error just isn't ideal.
Report: If I take the backup label from backup #1 and stick it into otherwise-identical backup #2, validation succeeds but the server won't start.
Response: That's because we can't validate the pg_wal directory. As noted above, that's out of scope.
Report: Using --ignore with a slash-terminated pathname doesn't work as expected.
Response: Fixed, thanks for the report.
Off-List Report: You forgot a PG_BINARY flag.
Response: Fixed. I thought I'd done this before but there were two places and I'd only fixed one of them.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thanks, Robert.
1: Getting the below error while compiling the 0002 patch.
edb@localhost:postgres$ mi > mi.log
basebackup.c: In function ‘AddFileToManifest’:
basebackup.c:1052:6: error: ‘pathname’ undeclared (first use in this function)
pathname);
^
basebackup.c:1052:6: note: each undeclared identifier is reported only once for each function it appears in
make[3]: *** [basebackup.o] Error 1
make[2]: *** [replication-recursive] Error 2
make[1]: *** [install-backend-recurse] Error 2
make: *** [install-src-recurse] Error 2
I can see you have renamed the filename argument of AddFileToManifest() to pathname, but those changes are part of 0003 (validator patch).
I think the changes related to src/backend/replication/basebackup.c should not be there in the validator patch (0003). We can move these changes to backup manifest patch, either in 0002 or 0004 for better readability of patch set.
2:
#define KW_MANIFEST_VERSION "PostgreSQL-Backup-Manifest-Version"
#define KW_MANIFEST_FILE "File"
#define KW_MANIFEST_CHECKSUM "Manifest-Checksum"
#define KWL_MANIFEST_VERSION (sizeof(KW_MANIFEST_VERSION)-1)
#define KWL_MANIFEST_FILE (sizeof(KW_MANIFEST_FILE)-1)
#define KWL_MANIFEST_CHECKSUM (sizeof(KW_MANIFEST_CHECKSUM)-1)
#define FIELDS_PER_FILE_LINE 4
A few macros defined in the 0003 patch are not used anywhere in the 0005 patch. We can either replace them with hard-coded values or remove them.
--
Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.
On 3/5/20 10:25 PM, Robert Haas wrote:
> Here's an updated patch set responding to many of the comments
> received thus far.
Thanks Robert. There is a scenario: if a user provides the port of a v11 server at the time of creating a base backup against pg_basebackup (v13 + your patch applied) with the option --manifest-checksums, it will lead to this below error:
[centos@tushar-ldap-docker bin]$ ./pg_basebackup -R -p 9045 --manifest-checksums=SHA224 -D dc1
pg_basebackup: error: could not initiate base backup: ERROR: syntax error
pg_basebackup: removing data directory "dc1"
[centos@tushar-ldap-docker bin]$
Steps to reproduce -
PG v11 is running
run pg_basebackup against that with the option --manifest-checksums
--
regards, tushar
EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
On Mon, Mar 9, 2020 at 12:22 PM tushar <tushar.ahuja@enterprisedb.com> wrote:
> There is a scenario - if user provide port of v11 server at the time of
> creating 'base backup' against pg_basebackup (v13 + your patch applied)
> with option --manifest-checksums, will lead to this below error
>
> [centos@tushar-ldap-docker bin]$ ./pg_basebackup -R -p 9045
> --manifest-checksums=SHA224 -D dc1
> pg_basebackup: error: could not initiate base backup: ERROR: syntax error
> pg_basebackup: removing data directory "dc1"
>
> Steps to reproduce -
> PG v11 is running
> run pg_basebackup against that with option --manifest-checksums
Seems like expected behavior to me. We could consider providing a more descriptive error message, but there's no way for it to work.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 6, 2020 at 3:58 AM Suraj Kharage <suraj.kharage@enterprisedb.com> wrote:
> 1: Getting below error while compiling 0002 patch.
> 2:
> Few macros defined in 0003 patch not used anywhere in 0005 patch. Either we can replace these with hard-coded values or remove them.
Thanks. I hope that I have straightened those things out in the new version which is attached.
This version also includes some other changes. The non-JSON code is now completely gone. Also, I've refactored the code that parses the JSON manifest to make it cleaner, and I've moved it out into a separate file. This might be useful if anyone ends up wanting to reuse that code for some other purpose, and I think it makes it easier to understand, too, since the manifest parsing is now much better separated from the task of actually validating the given directory against the manifest.
I've also added some tests, which are based in part on testing ideas from Rajkumar Raghuwanshi and Mark Dilger, but this test code was written by me.
So now it's like this:
0001 - checksum helper functions, same as before
0002 - patch the server to generate and send a manifest, and pg_basebackup to receive it
0003 - add pg_validatebackup
0004 - TAP tests
Comments?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On 3/9/20 10:46 PM, Robert Haas wrote: > Seems like expected behavior to me. We could consider providing a more > descriptive error message, but there's no way for it to work. Right, the error message needs to be more user-friendly. -- regards,tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
On 3/12/20 8:16 PM, tushar wrote:
Seems like expected behavior to me. We could consider providing a more
descriptive error message, but there's no way for it to work.
Right, the error message needs to be more user-friendly.
One scenario which, I feel, should error out even if the -s option is specified:
Create a base backup directory ( ./pg_basebackup data1)
Connect as the root user and remove read permission from the pg_hba.conf file ( chmod 004 pg_hba.conf)
run pg_validatebackup - [centos@tushar-ldap-docker bin]$ ./pg_validatebackup data1
pg_validatebackup: error: could not open file "pg_hba.conf": Permission denied
run pg_validatebackup with switch -s [centos@tushar-ldap-docker bin]$ ./pg_validatebackup data1 -s
pg_validatebackup: backup successfully verified
Here the file is not accessible, so I think it should throw an error (the same one as above) instead of blindly skipping it.
-- regards,tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
On Fri, Mar 13, 2020 at 9:53 AM tushar <tushar.ahuja@enterprisedb.com> wrote: > run pg_validatebackup - > > [centos@tushar-ldap-docker bin]$ ./pg_validatebackup data1 > pg_validatebackup: error: could not open file "pg_hba.conf": Permission denied > > run pg_validatebackup with switch -s > > [centos@tushar-ldap-docker bin]$ ./pg_validatebackup data1 -s > pg_validatebackup: backup successfully verified > > here the file is not accessible, so I think it should throw an error (the same one as above) instead of blindly skipping it. I don't really want to do that. That would require it to open every file even if it doesn't need to read the data in the files. I think in most cases that would just slow it down for no real benefit. If you've specified -s, you have to be OK with getting a less complete check for problems. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
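The trade-off Robert describes can be sketched as follows; the names and structure here are illustrative only, not pg_validatebackup's actual code. With checksums skipped, only stat() metadata is consulted, so a file that exists with the right size but is unreadable goes unnoticed:

```c
#include <stdio.h>
#include <stdbool.h>
#include <sys/stat.h>

/* Hypothetical sketch of the validation flow under discussion: with
 * skip_checksums (the -s switch), only presence and size are checked via
 * stat(), and the file is never opened -- so an unreadable file passes. */
static bool
verify_file(const char *path, long expected_size, bool skip_checksums)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return false;           /* missing file: always an error */
    if (st.st_size != expected_size)
        return false;           /* size mismatch: always an error */
    if (skip_checksums)
        return true;            /* -s: never opened, permissions unchecked */

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return false;           /* unreadable: only detected here */
    /* ... read the file and compare its checksum against the manifest ... */
    fclose(fp);
    return true;
}
```

Opening every file just to probe readability would roughly double the system-call cost of a -s run, which is the overhead Robert wants to avoid.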
On Thu, Mar 12, 2020 at 10:47 AM tushar <tushar.ahuja@enterprisedb.com> wrote: > On 3/9/20 10:46 PM, Robert Haas wrote: > > Seems like expected behavior to me. We could consider providing a more > > descriptive error message, but there's no way for it to work. > > Right, the error message needs to be more user-friendly. OK. Done in the attached version, which also includes a few other changes: - I expanded the regression tests. They now cover every line of code in parse_manifest.c except for a few that I believe to be unreachable (though I might be mistaken). Coverage for pg_validatebackup.c is also improved, but it's not 100%; there are some cases that I don't know how to hit outside of a kernel malfunction, and others that I only know how to hit on non-Windows systems. For instance, it's easy to use perl to make a file inaccessible on Linux with chmod(0, $filename), but I gather that doesn't work on Windows. I'm going to spend a bit more time looking at this, but I think it's already reasonably good. - I fixed a couple of very minor bugs which I discovered by writing those tests. - I added documentation, in part based on a draft Mark Dilger shared with me off-list. I don't think this is committable just yet, but I think it's getting fairly close, so if anyone has major objections please speak up soon. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Thank you, Robert.
Getting below warning while compiling the v11-0003-pg_validatebackup-Validate-a-backup-against-the-.patch.
pg_validatebackup.c: In function ‘report_manifest_error’:
pg_validatebackup.c:356:2: warning: function might be possible candidate for ‘gnu_printf’ format attribute [-Wsuggest-attribute=format]
pg_log_generic_v(PG_LOG_FATAL, fmt, ap);
To resolve this, can we use "pg_attribute_printf(2, 3)" in the function declarations, something like below?
e.g:
diff --git a/src/bin/pg_validatebackup/parse_manifest.h b/src/bin/pg_validatebackup/parse_manifest.h
index b0b18a5..25d140f 100644
--- a/src/bin/pg_validatebackup/parse_manifest.h
+++ b/src/bin/pg_validatebackup/parse_manifest.h
@@ -25,7 +25,7 @@ typedef void (*json_manifest_perfile_callback)(JsonManifestParseContext *,
size_t size, pg_checksum_type checksum_type,
int checksum_length, uint8 *checksum_payload);
typedef void (*json_manifest_error_callback)(JsonManifestParseContext *,
- char *fmt, ...);
+ char *fmt,...) pg_attribute_printf(2, 3);
struct JsonManifestParseContext
{
diff --git a/src/bin/pg_validatebackup/pg_validatebackup.c b/src/bin/pg_validatebackup/pg_validatebackup.c
index 0e7299b..6ccbe59 100644
--- a/src/bin/pg_validatebackup/pg_validatebackup.c
+++ b/src/bin/pg_validatebackup/pg_validatebackup.c
@@ -95,7 +95,7 @@ static void record_manifest_details_for_file(JsonManifestParseContext *context,
int checksum_length,
uint8 *checksum_payload);
static void report_manifest_error(JsonManifestParseContext *context,
- char *fmt, ...);
+ char *fmt,...) pg_attribute_printf(2, 3);
static void validate_backup_directory(validator_context *context,
char *relpath, char *fullpath);
Typos:
0004 patch
unexpctedly => unexpectedly
0005 patch
bacup => backup
--
Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.
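Suraj's suggested fix works because GCC's format attribute tells the compiler that a variadic function (or function pointer) forwards its arguments to a printf-family function, so format strings get type-checked and -Wsuggest-attribute=format is satisfied. A standalone sketch of the pattern; the names are simplified stand-ins, and this version uses format(printf, 1, 2) because it drops the context argument that makes the real callback use (2, 3) (pg_attribute_printf is PostgreSQL's portability wrapper for this attribute):

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Annotating both the callback typedef and the implementation lets GCC
 * verify callers' format strings against their arguments. */
typedef void (*error_callback) (const char *fmt, ...)
    __attribute__((format(printf, 1, 2)));

static char last_msg[256];      /* captures output for demonstration */

static void report_error(const char *fmt, ...)
    __attribute__((format(printf, 1, 2)));

static void
report_error(const char *fmt, ...)
{
    va_list     ap;

    va_start(ap, fmt);
    vsnprintf(last_msg, sizeof(last_msg), fmt, ap);
    va_end(ap);
}
```

With the attribute in place, a call like cb("missing %s", 42) through the typedef draws a -Wformat warning instead of compiling silently.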
One more suggestion, recent commit (1933ae62) has added the PostgreSQL home page to --help output.
e.g:
PostgreSQL home page: <https://www.postgresql.org/>
We might need to consider this change for pg_validatebackup binary.
--
Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.
On 3/14/20 2:04 AM, Robert Haas wrote: > OK. Done in the attached version Thanks. Verified. -- regards,tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
On Mon, Mar 16, 2020 at 2:03 AM Suraj Kharage <suraj.kharage@enterprisedb.com> wrote: > One more suggestion, recent commit (1933ae62) has added the PostgreSQL home page to --help output. Good catch. Fixed. I also attempted to address the compiler warning you mentioned in your other email. Also, I realized that the previous patch versions didn't handle the hex-encoded path format that we need to use for non-UTF8 filenames, and that there was no easy way to test that format. So, in this version I added an option to force all pathnames to be encoded in that format. I also made that option capable of suppressing the backup manifest altogether. Other than that, this version is pretty much the same as the last version, except for a few additional test cases which I added to get the code coverage up even a little more. It would be nice if someone could test whether the tests pass on Windows. I have squashed the series down to just 2 commits, since that seems like the way that this should probably be committed. Barring strong objections and/or the end of the world, I plan to do that next week. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
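For readers unfamiliar with the encoding being discussed: a pathname that is not a valid UTF-8 octet sequence cannot be stored verbatim in a JSON manifest, so such names are written hex-encoded instead, and the new testing option forces that encoding for every path. A minimal sketch of such an encoder, with hypothetical names rather than the patch's actual code:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: hex-encode a pathname byte-for-byte, as a manifest
 * might for filenames that are not valid UTF-8 (or for all filenames when
 * a force-escape testing option is in effect).  dst must have room for
 * 2 * len + 1 bytes. */
static void
hex_encode_path(const unsigned char *src, size_t len, char *dst)
{
    static const char digits[] = "0123456789abcdef";

    for (size_t i = 0; i < len; i++)
    {
        *dst++ = digits[src[i] >> 4];
        *dst++ = digits[src[i] & 0x0F];
    }
    *dst = '\0';
}
```

Forcing this path for every filename, as the new option does, is what lets the TAP tests exercise the decoder without needing to create files with non-UTF8 names on every platform.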
On Sat, Mar 21, 2020 at 4:00 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Mar 16, 2020 at 2:03 AM Suraj Kharage > <suraj.kharage@enterprisedb.com> wrote: > > One more suggestion, recent commit (1933ae62) has added the PostgreSQL home page to --help output. > > Good catch. Fixed. I also attempted to address the compiler warning > you mentioned in your other email. > > Also, I realized that the previous patch versions didn't handle the > hex-encoded path format that we need to use for non-UTF8 filenames, > and that there was no easy way to test that format. So, in this > version I added an option to force all pathnames to be encoded in that > format. I also made that option capable of suppressing the backup > manifest altogether. Other than that, this version is pretty much the > same as the last version, except for a few additional test cases which > I added to get the code coverage up even a little more. It would be > nice if someone could test whether the tests pass on Windows. > On my CentOS, the patch gives below compilation failure: pg_validatebackup.c: In function ‘parse_manifest_file’: pg_validatebackup.c:335:19: error: assignment left-hand side might be a candidate for a format attribute [-Werror=suggest-attribute=format] context.error_cb = report_manifest_error; I have tested it on Windows and found there are multiple failures. The failures are as below: Test Summary Report --------------------------------------- t/002_algorithm.pl (Wstat: 512 Tests: 5 Failed: 4) Failed tests: 2-5 Non-zero exit status: 2 Parse errors: Bad plan. You planned 19 tests but ran 5. t/003_corruption.pl (Wstat: 256 Tests: 14 Failed: 7) Failed tests: 2, 4, 6, 8, 10, 12, 14 Non-zero exit status: 1 Parse errors: Bad plan. You planned 44 tests but ran 14. 
t/004_options.pl (Wstat: 4352 Tests: 25 Failed: 17) Failed tests: 2, 4, 6-12, 14-17, 19-20, 22, 25 Non-zero exit status: 17 t/005_bad_manifest.pl (Wstat: 1792 Tests: 44 Failed: 7) Failed tests: 18, 24, 26, 30, 32, 34, 36 Non-zero exit status: 7 Files=6, Tests=109, 72 wallclock secs ( 0.05 usr + 0.01 sys = 0.06 CPU) Result: FAIL Failure Report ------------------------ t/002_algorithm.pl ..... 1/19 # Failed test 'backup ok with algorithm "none"' # at t/002_algorithm.pl line 33. # Failed test 'backup manifest exists' # at t/002_algorithm.pl line 39. t/002_algorithm.pl ..... 4/19 # Failed test 'validate backup with algorithm "none"' # at t/002_algorithm.pl line 53. # Failed test 'backup ok with algorithm "crc32c"' # at t/002_algorithm.pl line 33. # Looks like you planned 19 tests but ran 5. # Looks like you failed 4 tests of 5 run. # Looks like your test exited with 2 just after 5. t/002_algorithm.pl ..... Dubious, test returned 2 (wstat 512, 0x200) Failed 18/19 subtests t/003_corruption.pl .... 1/44 # Failed test 'intact backup validated' # at t/003_corruption.pl line 110. # Failed test 'corrupt backup fails validation: extra_file: matches' # at t/003_corruption.pl line 117. # 'pg_validatebackup: fatal: could not parse backup manifest: both pathname and encoded pathname # ' # doesn't match '(?^:extra_file.*present on disk but not in the manifest)' t/003_corruption.pl .... 5/44 # Failed test 'intact backup validated' # at t/003_corruption.pl line 110. t/003_corruption.pl .... 7/44 # Failed test 'corrupt backup fails validation: extra_tablespace_file: matches' # at t/003_corruption.pl line 117. # 'pg_validatebackup: fatal: could not parse backup manifest: both pathname and encoded pathname # ' # doesn't match '(?^:extra_ts_file.*present on disk but not in the manifest)' t/003_corruption.pl .... 9/44 # Failed test 'intact backup validated' # at t/003_corruption.pl line 110. 
# Failed test 'corrupt backup fails validation: missing_file: matches' # at t/003_corruption.pl line 117. # 'pg_validatebackup: fatal: could not parse backup manifest: both pathname and encoded pathname # ' # doesn't match '(?^:pg_xact/0000.*present in the manifest but not on disk)' t/003_corruption.pl .... 13/44 # Failed test 'intact backup validated' # at t/003_corruption.pl line 110. # Looks like you planned 44 tests but ran 14. # Looks like you failed 7 tests of 14 run. # Looks like your test exited with 1 just after 14. t/003_corruption.pl .... Dubious, test returned 1 (wstat 256, 0x100) Failed 37/44 subtests t/004_options.pl ....... 1/25 # Failed test '-q succeeds: exit code 0' # at t/004_options.pl line 25. # Failed test '-q succeeds: no stderr' # at t/004_options.pl line 27. # got: 'pg_validatebackup: fatal: could not parse backup manifest: both pathname and encoded pathname # ' # expected: '' # Failed test '-q checksum mismatch: matches' # at t/004_options.pl line 37. # 'pg_validatebackup: fatal: could not parse backup manifest: both pathname and encoded pathname # ' # doesn't match '(?^:checksum mismatch for file \"PG_VERSION\")' t/004_options.pl ....... 7/25 # Failed test '-s skips checksumming: exit code 0' # at t/004_options.pl line 43. # Failed test '-s skips checksumming: no stderr' # at t/004_options.pl line 43. # got: 'pg_validatebackup: fatal: could not parse backup manifest: both pathname and encoded pathname # ' # expected: '' # Failed test '-s skips checksumming: matches' # at t/004_options.pl line 43. # '' # doesn't match '(?^:backup successfully verified)' # Failed test '-i ignores problem file: exit code 0' # at t/004_options.pl line 48. # Failed test '-i ignores problem file: no stderr' # at t/004_options.pl line 48. # got: 'pg_validatebackup: fatal: could not parse backup manifest: both pathname and encoded pathname # ' # expected: '' # Failed test '-i ignores problem file: matches' # at t/004_options.pl line 48. 
# '' # doesn't match '(?^:backup successfully verified)' # Failed test '-i does not ignore all problems: matches' # at t/004_options.pl line 57. # 'pg_validatebackup: fatal: could not parse backup manifest: both pathname and encoded pathname # ' # doesn't match '(?^:pg_xact.*is present in the manifest but not on disk)' # Failed test 'multiple -i options work: exit code 0' # at t/004_options.pl line 62. # Failed test 'multiple -i options work: no stderr' # at t/004_options.pl line 62. # got: 'pg_validatebackup: fatal: could not parse backup manifest: both pathname and encoded pathname # ' # expected: '' # Failed test 'multiple -i options work: matches' # at t/004_options.pl line 62. # '' # doesn't match '(?^:backup successfully verified)' # Failed test 'multiple problems: missing files reported' # at t/004_options.pl line 71. # 'pg_validatebackup: fatal: could not parse backup manifest: both pathname and encoded pathname # ' # doesn't match '(?^:pg_xact.*is present in the manifest but not on disk)' # Failed test 'multiple problems: checksum mismatch reported' # at t/004_options.pl line 73. # 'pg_validatebackup: fatal: could not parse backup manifest: both pathname and encoded pathname # ' # doesn't match '(?^:checksum mismatch for file \"PG_VERSION\")' # Failed test '-e reports 1 error: missing files reported' # at t/004_options.pl line 80. # 'pg_validatebackup: fatal: could not parse backup manifest: both pathname and encoded pathname # ' # doesn't match '(?^:pg_xact.*is present in the manifest but not on disk)' # Failed test 'nonexistent backup directory: matches' # at t/004_options.pl line 86. # 'pg_validatebackup: fatal: could not parse backup manifest: both pathname and encoded pathname # ' # doesn't match '(?^:could not open directory)' # Looks like you failed 17 tests of 25. t/004_options.pl ....... Dubious, test returned 17 (wstat 4352, 0x1100) Failed 17/25 subtests t/005_bad_manifest.pl .. 
1/44 # Failed test 'missing pathname: matches' # at t/005_bad_manifest.pl line 156. # 'pg_validatebackup: fatal: could not parse backup manifest: missing size # ' # doesn't match '(?^:could not parse backup manifest: missing pathname)' # Failed test 'missing size: matches' # at t/005_bad_manifest.pl line 156. # 'pg_validatebackup: fatal: could not parse backup manifest: both pathname and encoded pathname # ' # doesn't match '(?^:could not parse backup manifest: missing size)' # Failed test 'file size is not an integer: matches' # at t/005_bad_manifest.pl line 156. # 'pg_validatebackup: fatal: could not parse backup manifest: both pathname and encoded pathname # ' # doesn't match '(?^:could not parse backup manifest: file size is not an integer)' # Failed test 'duplicate pathname in backup manifest: matches' # at t/005_bad_manifest.pl line 156. # 'pg_validatebackup: fatal: could not parse backup manifest: both pathname and encoded pathname # ' # doesn't match '(?^:fatal: duplicate pathname in backup manifest)' t/005_bad_manifest.pl .. 31/44 # Failed test 'checksum without algorithm: matches' # at t/005_bad_manifest.pl line 156. # 'pg_validatebackup: fatal: could not parse backup manifest: both pathname and encoded pathname # ' # doesn't match '(?^:could not parse backup manifest: checksum without algorithm)' # Failed test 'unrecognized checksum algorithm: matches' # at t/005_bad_manifest.pl line 156. # 'pg_validatebackup: fatal: could not parse backup manifest: both pathname and encoded pathname # ' # doesn't match '(?^:fatal: unrecognized checksum algorithm)' # Failed test 'invalid checksum for file: matches' # at t/005_bad_manifest.pl line 156. # 'pg_validatebackup: fatal: could not parse backup manifest: both pathname and encoded pathname # ' # doesn't match '(?^:fatal: invalid checksum for file)' # Looks like you failed 7 tests of 44. t/005_bad_manifest.pl .. Dubious, test returned 7 (wstat 1792, 0x700) Failed 7/44 subtests -- With Regards, Amit Kapila. 
EnterpriseDB: http://www.enterprisedb.com
On Sat, Mar 21, 2020 at 5:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On my CentOS, the patch gives below compilation failure: > pg_validatebackup.c: In function ‘parse_manifest_file’: > pg_validatebackup.c:335:19: error: assignment left-hand side might be > a candidate for a format attribute [-Werror=suggest-attribute=format] > context.error_cb = report_manifest_error; > > I have tested it on Windows and found there are multiple failures. > The failures are as below: > I have started to investigate the failures. > > Failure Report > ------------------------ > t/002_algorithm.pl ..... 1/19 > # Failed test 'backup ok with algorithm "none"' > # at t/002_algorithm.pl line 33. > I checked the log and it was giving error: /src/bin/pg_validatebackup/tmp_check/t_002_algorithm_master_data/backup/none --manifest-checksum none --no-sync \tmp_install\bin\pg_basebackup.EXE: illegal option -- manifest-checksum It seems the option to be used should be --manifest-checksums. The attached patch fixes this problem for me. > t/002_algorithm.pl ..... 4/19 # Failed test 'validate backup with > algorithm "none"' > # at t/002_algorithm.pl line 53. > The error message for the above failure is: pg_validatebackup: fatal: could not parse backup manifest: both pathname and encoded pathname I don't know at this stage what could cause this? Any pointers? Attached are logs of failed runs (regression.tar.gz). -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Mon, Mar 23, 2020 at 7:04 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > /src/bin/pg_validatebackup/tmp_check/t_002_algorithm_master_data/backup/none > --manifest-checksum none --no-sync > \tmp_install\bin\pg_basebackup.EXE: illegal option -- manifest-checksum > > It seems the option to be used should be --manifest-checksums. The > attached patch fixes this problem for me. OK, incorporated that. > > t/002_algorithm.pl ..... 4/19 # Failed test 'validate backup with > > algorithm "none"' > > # at t/002_algorithm.pl line 53. > > > > The error message for the above failure is: > pg_validatebackup: fatal: could not parse backup manifest: both > pathname and encoded pathname > > I don't know at this stage what could cause this? Any pointers? I think I forgot an initializer. Try this version. I also incorporated a fix previously proposed by Suraj for the compiler warning you mentioned in the other email. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
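To illustrate why a forgotten initializer can produce exactly the "both pathname and encoded pathname" symptom: in a hypothetical simplification (not the actual parse_manifest.c code), each file entry must carry exactly one of the two pathname spellings, and uninitialized stack garbage in either pointer can fake the "both" case; an aggregate initializer such as "= {0}" zeroes every member, including ones added after the struct was first written.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simplification of the bug class: a manifest file entry may
 * carry either a "path" or an "encoded-path" key, and seeing both (or
 * neither) is a parse error.  If the struct is not fully initialized,
 * leftover garbage in one pointer looks like a second key was present. */
typedef struct
{
    const char *pathname;           /* set when a "path" key is seen */
    const char *encoded_pathname;   /* set when an "encoded-path" key is seen */
} manifest_file_fields;

static bool
pathname_fields_valid(const manifest_file_fields *f)
{
    /* exactly one of the two spellings must have been supplied */
    return (f->pathname != NULL) != (f->encoded_pathname != NULL);
}
```
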
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > I think I forgot an initializer. Try this version. Just took a quick look through this. I'm pretty sure David wants to look at it too. Anyway, some comments below. > diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml > index f139ba0231..d1ff53e8e8 100644 > --- a/doc/src/sgml/protocol.sgml > +++ b/doc/src/sgml/protocol.sgml > @@ -2466,7 +2466,7 @@ The commands accepted in replication mode are: > </varlistentry> > > <varlistentry id="protocol-replication-base-backup" xreflabel="BASE_BACKUP"> > - <term><literal>BASE_BACKUP</literal> [ <literal>LABEL</literal> <replaceable>'label'</replaceable> ] [ <literal>PROGRESS</literal>] [ <literal>FAST</literal> ] [ <literal>WAL</literal> ] [ <literal>NOWAIT</literal> ] [ <literal>MAX_RATE</literal><replaceable>rate</replaceable> ] [ <literal>TABLESPACE_MAP</literal> ] [ <literal>NOVERIFY_CHECKSUMS</literal>] > + <term><literal>BASE_BACKUP</literal> [ <literal>LABEL</literal> <replaceable>'label'</replaceable> ] [ <literal>PROGRESS</literal>] [ <literal>FAST</literal> ] [ <literal>WAL</literal> ] [ <literal>NOWAIT</literal> ] [ <literal>MAX_RATE</literal><replaceable>rate</replaceable> ] [ <literal>TABLESPACE_MAP</literal> ] [ <literal>NOVERIFY_CHECKSUMS</literal>] [ <literal>MANIFEST</literal> <replaceable>manifest_option</replaceable> ] [ <literal>MANIFEST_CHECKSUMS</literal><replaceable>checksum_algorithm</replaceable> ] > <indexterm><primary>BASE_BACKUP</primary></indexterm> > </term> > <listitem> > @@ -2576,6 +2576,37 @@ The commands accepted in replication mode are: > </para> > </listitem> > </varlistentry> > + > + <varlistentry> > + <term><literal>MANIFEST</literal></term> > + <listitem> > + <para> > + When this option is specified with a value of <literal>yes</literal> > + or <literal>force-escape</literal>, a backup manifest is created > + and sent along with the backup.
The latter value forces all filenames > + to be hex-encoded; otherwise, this type of encoding is performed only > + for files whose names are non-UTF8 octet sequences. > + <literal>force-escape</literal> is intended primarily for testing > + purposes, to be sure that clients which read the backup manifest > + can handle this case. For compatibility with previous releases, > + the default is <literal>MANIFEST 'no'</literal>. > + </para> > + </listitem> > + </varlistentry> > + > + <varlistentry> > + <term><literal>MANIFEST_CHECKSUMS</literal></term> > + <listitem> > + <para> > + Specifies the algorithm that should be used to checksum each file > + for purposes of the backup manifest. Currently, the available > + algorithms are <literal>NONE</literal>, <literal>CRC32C</literal>, > + <literal>SHA224</literal>, <literal>SHA256</literal>, > + <literal>SHA384</literal>, and <literal>SHA512</literal>. > + The default is <literal>CRC32C</literal>. > + </para> > + </listitem> > + </varlistentry> > </variablelist> > </para> > <para> While I get the desire to have a default here that includes checksums, the way the command is structured, it strikes me as odd that the lack of MANIFEST_CHECKSUMS in the command actually results in checksums being included. I would think that we'd either: - have the lack of MANIFEST_CHECKSUMS mean 'No checksums' or - Require MANIFEST_CHECKSUMS to be specified and not have it be optional We aren't expecting people to actually be typing these commands out and so I don't think it's a *huge* deal to have it the way you've written it, but it still strikes me as odd. I don't think I have a real preference between the two options that I suggest above, maybe very slightly in favor of the first. 
> diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml > index 90638aad0e..bf6963a595 100644 > --- a/doc/src/sgml/ref/pg_basebackup.sgml > +++ b/doc/src/sgml/ref/pg_basebackup.sgml > @@ -561,6 +561,69 @@ PostgreSQL documentation > </para> > </listitem> > </varlistentry> > + > + <varlistentry> > + <term><option>--no-manifest</option></term> > + <listitem> > + <para> > + Disables generation of a backup manifest. If this option is not > + specified, the server will and send generate a backup manifest > + which can be verified using <xref linkend="app-pgvalidatebackup" />. > + </para> > + </listitem> > + </varlistentry> How about "If this option is not specified, the server will generate and send a backup manifest which can be verified using ..." > + <varlistentry> > + <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term> > + <listitem> > + <para> > + Specifies the algorithm that should be used to checksum each file > + for purposes of the backup manifest. Currently, the available > + algorithms are <literal>NONE</literal>, <literal>CRC32C</literal>, > + <literal>SHA224</literal>, <literal>SHA256</literal>, > + <literal>SHA384</literal>, and <literal>SHA512</literal>. > + The default is <literal>CRC32C</literal>. > + </para> As I recall, there was an invitation to argue about the defaults at one point, and so I'm going to say here that I would advocate for a different default than 'crc32c'. Specifically, I would think sha256 or 512 would be better. I don't recall seeing a debate about this that conclusively found crc32c to be better, but I'm happy to go back and reread anything someone wants to point me at. > + <para> > + If <literal>NONE</literal> is selected, the backup manifest will > + not contain any checksums. Otherwise, it will contain a checksum > + of each file in the backup using the specified algorithm. 
In addition, > + the manifest itself will always contain a <literal>SHA256</literal> > + checksum of its own contents. The <literal>SHA</literal> algorithms > + are significantly more CPU-intensive than <literal>CRC32C</literal>, > + so selecting one of them may increase the time required to complete > + the backup. > + </para> It also seems a bit silly to me that using the defaults means having to deal with two different algorithms- crc32c and sha256. Considering how fast these algorithms are, compared to everything else involved in a backup (particularly one that's likely going across a network...), I wonder if we should say "may slightly increase" above. > + <para> > + On the other hand, <literal>CRC32C</literal> is not a cryptographic > + hash function, so it is only suitable for protecting against > + inadvertent or random modifications to a backup. An adversary > + who can modify the backup could easily do so in such a way that > + the CRC does not change, whereas a SHA collision will be hard > + to manufacture. (However, note that if the attacker also has access > + to modify the backup manifest itself, no checksum algorithm will > + provide any protection.) An additional advantage of the > + <literal>SHA</literal> family of functions is that they output > + a much larger number of bits. > + </para> I'm not really sure that this paragraph is sensible to include.. We certainly don't talk about adversaries and cryptographic hash functions when we talk about our page-level checksums, for example. I'm not completely against including it, but I don't want to give the impression that this is something we routinely consider or that lack of discussion elsewhere implies we have protections against a determined attacker. 
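As background for the CRC32C-versus-SHA discussion above: CRC-32C is the Castagnoli CRC, a 32-bit error-detecting code rather than a cryptographic hash. A minimal bitwise sketch follows; production implementations typically use lookup tables or hardware CRC instructions instead of this bit-at-a-time loop:

```c
#include <stdint.h>
#include <stddef.h>

/* Bit-at-a-time CRC-32C (Castagnoli), reflected polynomial 0x82F63B78,
 * with the conventional ~0 initial value and final inversion.  Shown for
 * illustration only; it is far slower than table- or hardware-based
 * variants but computes the same values. */
static uint32_t
crc32c(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    crc = ~crc;
    while (len-- > 0)
    {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0x82F63B78 & (0U - (crc & 1)));
    }
    return ~crc;
}
```

Because the function is linear over GF(2), an attacker can always append or adjust bytes to force any desired CRC value, which is the substance of the "not a cryptographic hash function" caveat in the quoted documentation.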
> diff --git a/doc/src/sgml/ref/pg_validatebackup.sgml b/doc/src/sgml/ref/pg_validatebackup.sgml > new file mode 100644 > index 0000000000..1c171f6970 > --- /dev/null > +++ b/doc/src/sgml/ref/pg_validatebackup.sgml > @@ -0,0 +1,232 @@ > +<!-- > +doc/src/sgml/ref/pg_validatebackup.sgml > +PostgreSQL documentation > +--> > + > +<refentry id="app-pgvalidatebackup"> > + <indexterm zone="app-pgvalidatebackup"> > + <primary>pg_validatebackup</primary> > + </indexterm> > + > + <refmeta> > + <refentrytitle>pg_validatebackup</refentrytitle> > + <manvolnum>1</manvolnum> > + <refmiscinfo>Application</refmiscinfo> > + </refmeta> > + > + <refnamediv> > + <refname>pg_validatebackup</refname> > + <refpurpose>verify the integrity of a base backup of a > + <productname>PostgreSQL</productname> cluster</refpurpose> > + </refnamediv> "verify the integrity of a backup taken using pg_basebackup" > + <refsect1> > + <title> > + Description > + </title> > + <para> > + <application>pg_validatebackup</application> is used to check the integrity > + of a database cluster backup. The backup being checked should have been > + created by <command>pg_basebackup</command> or some other tool that includes > + a <literal>backup_manifest</literal> file with the backup. The backup > + must be stored in the "plain" format; a "tar" format backup can be checked > + after extracting it. Backup manifests are created by the server beginning > + with <productname>PostgreSQL</productname> version 13, so older backups > + cannot be validated using this tool. > + </para> This seems to invite the idea that pg_validatebackup should be able to work with external backup solutions- but I'm a bit concerned by that idea because it seems like it would then mean we'd have to be particularly careful when changing things in this area, and I'm not thrilled by that. 
I'd like to make sure that new versions of pg_validatebackup work with older backups, and, ideally, older versions of pg_validatebackup would work even with newer backups, all of which I think the json structure of the manifest helps us with, but that's when we're building the manifest and know what it's going to look like. Maybe to put it another way- would a patch be accepted to make pg_validatebackup work with other manifests..? If not, then I'd keep this to the more specific "this tool is used to validate backups taken using pg_basebackup". > + <para> > + <application>pg_validatebackup</application> reads the manifest file of a > + backup, verifies the manifest against its own internal checksum, and then > + verifies that the same files are present in the target directory as in the > + manifest itself. It then verifies that each file has the expected checksum, > + unless the backup was taken the checksum algorithm set to "was taken with the checksum algorithm"... > + <literal>none</literal>, in which case checksum verification is not > + performed. The presence or absence of directories is not checked, except > + indirectly: if a directory is missing, any files it should have contained > + will necessarily also be missing. Certain files and directories are > + excluded from verification: > + </para> > + > + <itemizedlist> > + <listitem> > + <para> > + <literal>backup_manifest</literal> is ignored because the backup > + manifest is logically not part of the backup and does not include > + any entry for itself. > + </para> > + </listitem> This seems a bit confusing, doesn't it? The backup_manifest must exist, and its checksum is internal, and is checked, isn't it? Why say that it's excluded..? > + <listitem> > + <para> > + <literal>pg_wal</literal> is ignored because WAL files are sent > + separately from the backup, and are therefore not described by the > + backup manifest. 
> + </para> > + </listitem> I don't agree with the choice to exclude the WAL files, considering they're an integral part of a backup, to exclude them means that if they've been corrupted at all then the entire backup is invalid. You don't want to be discovering that when you're trying to do a restore of a backup that you took with pg_basebackup and which pg_validatebackup says is valid. After all, the tool being used here, pg_basebackup, *does* also stream the WAL files- there's no reason why we can't calculate a checksum on them and store that checksum somewhere and use it to validate the WAL files. This, in my opinion, is actually a show-stopper for this feature. Claiming it's a valid backup when we don't check the absolutely necessary-for-restore WAL is making a false claim, no matter how well it's documented. I do understand that it's possible to run pg_basebackup without the WAL files being grabbed as part of that run- in such a case, we should be able to detect that was the case for the backup and when running pg_validatebackup we should issue a WARNING that the WAL files weren't able to be verified (we could have an option to suppress that warning if people feel that's needed). > + <listitem> > + <para> > + <literal>postgesql.auto.conf</literal>, > + <literal>standby.signal</literal>, > + and <literal>recovery.signal</literal> are ignored because they may > + sometimes be created or modified by the backup client itself. > + (For example, <literal>pg_basebackup -R</literal> will modify > + <literal>postgresql.auto.conf</literal> and create > + <literal>standby.signal</literal>.) > + </para> > + </listitem> > + </itemizedlist> > + </refsect1> Not really thrilled with this (pg_basebackup certainly could figure out the checksum for those files...), but I also don't think it's a huge issue as they can be recreated by a user (unlike a WAL file..). I got through most of the pg_basebackup changes, and they looked pretty good in general. 
Will try to review more tomorrow. Thanks, Stephen
Attachment
On Mon, Mar 23, 2020 at 9:46 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Mar 23, 2020 at 7:04 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > /src/bin/pg_validatebackup/tmp_check/t_002_algorithm_master_data/backup/none > > --manifest-checksum none --no-sync > > \tmp_install\bin\pg_basebackup.EXE: illegal option -- manifest-checksum > > > > It seems the option to be used should be --manifest-checksums. The > > attached patch fixes this problem for me. > > OK, incorporated that. > > > > t/002_algorithm.pl ..... 4/19 # Failed test 'validate backup with > > > algorithm "none"' > > > # at t/002_algorithm.pl line 53. > > > > > > > The error message for the above failure is: > > pg_validatebackup: fatal: could not parse backup manifest: both > > pathname and encoded pathname > > > > I don't know at this stage what could cause this? Any pointers? > > I think I forgot an initializer. Try this version. > All others except one are passing now. See the summary of the failed test below and attached are failed run logs. Test Summary Report ------------------- t/003_corruption.pl (Wstat: 65280 Tests: 14 Failed: 0) Non-zero exit status: 255 Parse errors: Bad plan. You planned 44 tests but ran 14. Files=6, Tests=123, 164 wallclock secs ( 0.06 usr + 0.02 sys = 0.08 CPU) Result: FAIL -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Mon, Mar 23, 2020 at 11:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > All others except one are passing now. See the summary of the failed > test below and attached are failed run logs. > > Test Summary Report > ------------------- > t/003_corruption.pl (Wstat: 65280 Tests: 14 Failed: 0) > Non-zero exit status: 255 > Parse errors: Bad plan. You planned 44 tests but ran 14. > Files=6, Tests=123, 164 wallclock secs ( 0.06 usr + 0.02 sys = 0.08 CPU) > Result: FAIL Hmm. It looks like it's trying to remove the symlink that points to the tablespace directory, and failing with no error message. I could set that permutation to be skipped on Windows, or maybe there's an alternate method you can suggest that would work? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Mar 23, 2020 at 6:42 PM Stephen Frost <sfrost@snowman.net> wrote: > While I get the desire to have a default here that includes checksums, > the way the command is structured, it strikes me as odd that the lack of > MANIFEST_CHECKSUMS in the command actually results in checksums being > included. I don't think that's quite accurate, because the default for the MANIFEST option is 'no', so the actual default if you say nothing about manifests at all, you don't get one. However, it is true that if you ask for a manifest and you don't specify the type of checksums, you get CRC-32C. We could change it so that if you ask for a manifest you must also specify the type of checksum, but I don't see any advantage in that approach. Nothing prevents the client from specifying the value if it cares, but making the default "I don't care, you pick" seems pretty sensible. It could be really helpful if, for example, we decide to remove the initial default in a future release for some reason. Then the client just keeps working without needing to change anything, but anyone who explicitly specified the old default gets an error. > > + Disables generation of a backup manifest. If this option is not > > + specified, the server will and send generate a backup manifest > > + which can be verified using <xref linkend="app-pgvalidatebackup" />. > > + </para> > > + </listitem> > > + </varlistentry> > > How about "If this option is not specified, the server will generate and > send a backup manifest which can be verified using ..." Good suggestion. :-) > > + <varlistentry> > > + <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term> > > + <listitem> > > + <para> > > + Specifies the algorithm that should be used to checksum each file > > + for purposes of the backup manifest. 
Currently, the available > > + algorithms are <literal>NONE</literal>, <literal>CRC32C</literal>, > > + <literal>SHA224</literal>, <literal>SHA256</literal>, > > + <literal>SHA384</literal>, and <literal>SHA512</literal>. > > + The default is <literal>CRC32C</literal>. > > + </para> > > As I recall, there was an invitation to argue about the defaults at one > point, and so I'm going to say here that I would advocate for a > different default than 'crc32c'. Specifically, I would think sha256 or > 512 would be better. I don't recall seeing a debate about this that > conclusively found crc32c to be better, but I'm happy to go back and > reread anything someone wants to point me at. It was discussed upthread. Andrew Dunstan argued that there was no reason to use a cryptographic checksum here and that we shouldn't do so gratuitously. Suraj Kharage found that CRC-32C has very little performance impact but that any of the SHA functions slow down backups considerably. David Steele pointed out that you'd need a better checksum if you wanted to use it for purposes such as delta restore, with which I agree, but that's not the design center for this feature. I concluded that different people wanted different things, so that we ought to make this configurable, but that CRC-32C is a good default. It has approximately a 99.9999999767169% chance of detecting a random error, which is pretty good, and it doesn't drastically slow down backups, which is also good. > It also seems a bit silly to me that using the defaults means having to > deal with two different algorithms- crc32c and sha256. Considering how > fast these algorithms are, compared to everything else involved in a > backup (particularly one that's likely going across a network...), I > wonder if we should say "may slightly increase" above. Actually, Suraj's results upthread show that it's a pretty big hit. 
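The detection figure cited above comes straight from the size of the checksum: a random corruption escapes a 32-bit checksum only if the damaged file happens to produce the identical 32-bit value. A quick check of the arithmetic:

```python
# A random error goes unnoticed only if the corrupted data hashes to
# the same 32-bit value, i.e. with probability 2**-32.
p_undetected = 2 ** -32
p_detect = 1 - p_undetected
print(f"{p_detect * 100:.13f}% chance of detecting a random error")
```

This reproduces the "approximately 99.9999999767169%" quoted in the email.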
> > + <para> > > + On the other hand, <literal>CRC32C</literal> is not a cryptographic > > + hash function, so it is only suitable for protecting against > > + inadvertent or random modifications to a backup. An adversary > > + who can modify the backup could easily do so in such a way that > > + the CRC does not change, whereas a SHA collision will be hard > > + to manufacture. (However, note that if the attacker also has access > > + to modify the backup manifest itself, no checksum algorithm will > > + provide any protection.) An additional advantage of the > > + <literal>SHA</literal> family of functions is that they output > > + a much larger number of bits. > > + </para> > > I'm not really sure that this paragraph is sensible to include.. We > certainly don't talk about adversaries and cryptographic hash functions > when we talk about our page-level checksums, for example. I'm not > completely against including it, but I don't want to give the impression > that this is something we routinely consider or that lack of discussion > elsewhere implies we have protections against a determined attacker. Given the skepticism from some quarters about CRC-32C on this thread, I didn't want to oversell it. Also, I do think that these things are possibly things that we should consider more widely. I agree with Andrew's complaint that it's far too easy to just throw SHA<lots> at problems that don't really require it without any actually good reason. Spelling out our reasons for choosing certain algorithms for certain purposes seems like a good habit to get into, and if we haven't done it in other places, maybe we should. On the other hand, while I'm inclined to keep this paragraph, I won't lose much sleep if we decide to remove it. 
> > + <refnamediv> > > + <refname>pg_validatebackup</refname> > > + <refpurpose>verify the integrity of a base backup of a > > + <productname>PostgreSQL</productname> cluster</refpurpose> > > + </refnamediv> > > "verify the integrity of a backup taken using pg_basebackup" OK. > This seems to invite the idea that pg_validatebackup should be able to > work with external backup solutions- but I'm a bit concerned by that > idea because it seems like it would then mean we'd have to be > particularly careful when changing things in this area, and I'm not > thrilled by that. I'd like to make sure that new versions of > pg_validatebackup work with older backups, and, ideally, older versions > of pg_validatebackup would work even with newer backups, all of which I > think the json structure of the manifest helps us with, but that's when > we're building the manifest and know what it's going to look like. Both you and David made forceful arguments that this needed to be JSON rather than an ad-hoc text format precisely so that other tools could parse it more easily, and I just spent *a lot* of time making the JSON parsing stuff work precisely so that you could have that. This project would've been done a month ago if not for that. I don't care all that much whether we remove the mention here, but the idea that using JSON was so that pg_validatebackup could manage compatibility issues is just not correct. The version number on line 1 of the file was more than sufficient for that purpose. > > + <para> > > + <application>pg_validatebackup</application> reads the manifest file of a > > + backup, verifies the manifest against its own internal checksum, and then > > + verifies that the same files are present in the target directory as in the > > + manifest itself. It then verifies that each file has the expected checksum, > > + unless the backup was taken the checksum algorithm set to > > "was taken with the checksum algorithm"... Oops. Will fix. 
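The forward-compatibility argument can be sketched concretely (the key names below are illustrative assumptions, not necessarily the exact manifest schema): a JSON reader checks the version field and simply ignores unknown keys, which is what lets an older reader tolerate a newer manifest that only adds fields.

```python
import json

# Hypothetical manifest fragment; key names are assumptions for
# illustration, not the authoritative schema.
manifest_text = """{
  "PostgreSQL-Backup-Manifest-Version": 1,
  "Files": [
    {"Path": "PG_VERSION", "Size": 3,
     "Checksum-Algorithm": "CRC32C", "Checksum": "3610a686"}
  ]
}"""

manifest = json.loads(manifest_text)
version = manifest["PostgreSQL-Backup-Manifest-Version"]
if version != 1:
    raise ValueError(f"unsupported manifest version: {version}")
# Unknown top-level keys are silently ignored here, so a manifest that
# only *adds* fields still parses with this reader.
for entry in manifest["Files"]:
    print(entry["Path"], entry["Size"], entry.get("Checksum"))
```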
> > + <itemizedlist> > > + <listitem> > > + <para> > > + <literal>backup_manifest</literal> is ignored because the backup > > + manifest is logically not part of the backup and does not include > > + any entry for itself. > > + </para> > > + </listitem> > > This seems a bit confusing, doesn't it? The backup_manifest must exist, > and its checksum is internal, and is checked, isn't it? Why say that > it's excluded..? Well, there's no entry in the backup manifest for backup_manifest itself. Normally, the presence of a file not mentioned in backup_manifest would cause a complaint about an extra file, but because backup_manifest is in the ignore list, it doesn't. > > + <listitem> > > + <para> > > + <literal>pg_wal</literal> is ignored because WAL files are sent > > + separately from the backup, and are therefore not described by the > > + backup manifest. > > + </para> > > + </listitem> > > I don't agree with the choice to exclude the WAL files, considering > they're an integral part of a backup, to exclude them means that if > they've been corrupted at all then the entire backup is invalid. You > don't want to be discovering that when you're trying to do a restore of > a backup that you took with pg_basebackup and which pg_validatebackup > says is valid. After all, the tool being used here, pg_basebackup, > *does* also stream the WAL files- there's no reason why we can't > calculate a checksum on them and store that checksum somewhere and use > it to validate the WAL files. This, in my opinion, is actually a > show-stopper for this feature. Claiming it's a valid backup when we > don't check the absolutely necessary-for-restore WAL is making a false > claim, no matter how well it's documented. The default for pg_basebackup is -Xstream, which means that the WAL files are being sent over a separate connection that has no connection to the original session. 
The server, when generating the backup manifest, has no idea what WAL files are being sent over that separate connection, and thus cannot include them in the manifest. This problem could be "solved" by having the client generate the manifest rather than the server, but I think that cure would be worse than the disease. As it stands, the manifest provides some protection against transmission errors, which would be lost with that design. As you point out, this clearly can't be done with -Xnone. I think it would be possible to support this with -Xfetch, but we'd have to have the manifest itself specify whether or not it included files in pg_wal, which would require complicating the format a bit. I don't think that makes sense. I assume -Xstream is the most commonly-used mode, because the default used to be -Xfetch and we changed it, which I think we would not have done unless people liked -Xstream significantly better. Adding complexity to cater to a non-default case which I suspect is not widely used doesn't really make sense to me. In the future, we might want to consider improvements which could make validation of pg_wal feasible in common cases. Specifically, suppose that pg_basebackup could receive the manifest from the server, keep all the entries for the existing files just as they are, but add entries for WAL files and anything else it may have added to the backup, recompute the manifest checksum, and store the resulting revised manifest with the backup. That, I think, would be fairly cool, but it's a significant body of additional development work, and this is already quite a large patch. The patch itself has grown to about 3000 lines, and already has 10 preparatory commits doing another ~1500 lines of refactoring to prepare for it. > Not really thrilled with this (pg_basebackup certainly could figure out > the checksum for those files...), but I also don't think it's a huge > issue as they can be recreated by a user (unlike a WAL file..).
Yeah, same issues, though. Here again, there are several possible fixes: (1) make the server modify those files rather than letting pg_basebackup do it; (2) make the client compute the manifest rather than the server; (3) have the client revise the manifest. (3) makes most sense to me, but I think that it would be better to return to that topic at a later date. This is certainly not a perfect feature as things stand but I believe it is good enough to provide significant benefits. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
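The "client revises the manifest" idea floated above could look roughly like this (a sketch under assumed key names, not the actual manifest schema): keep the server's entries as-is, append entries for client-added files such as streamed WAL, and recompute the whole-manifest checksum.

```python
import hashlib
import json

def revise_manifest(manifest_text, extra_entries):
    """Sketch of client-side manifest revision: preserve the server's
    file entries, add entries for files the client wrote itself, and
    store a fresh checksum of the revised manifest. Key names
    ("Files", "Manifest-Checksum") are illustrative assumptions."""
    manifest = json.loads(manifest_text)
    manifest.setdefault("Files", []).extend(extra_entries)
    # Drop any old whole-manifest checksum, re-serialize, and compute
    # a new SHA-256 over the revised contents.
    manifest.pop("Manifest-Checksum", None)
    body = json.dumps(manifest, sort_keys=True)
    manifest["Manifest-Checksum"] = hashlib.sha256(body.encode()).hexdigest()
    return manifest
```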
On Tue, Mar 24, 2020 at 10:30 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Mar 23, 2020 at 11:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > All others except one are passing now. See the summary of the failed > > test below and attached are failed run logs. > > > > Test Summary Report > > ------------------- > > t/003_corruption.pl (Wstat: 65280 Tests: 14 Failed: 0) > > Non-zero exit status: 255 > > Parse errors: Bad plan. You planned 44 tests but ran 14. > > Files=6, Tests=123, 164 wallclock secs ( 0.06 usr + 0.02 sys = 0.08 CPU) > > Result: FAIL > > Hmm. It looks like it's trying to remove the symlink that points to > the tablespace directory, and failing with no error message. I could > set that permutation to be skipped on Windows, or maybe there's an > alternate method you can suggest that would work? > We can use rmdir() for Windows. The attached patch fixes the failure for me. I have tried the test on CentOS as well after the fix and it passes there as well. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
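The rmdir-versus-unlink distinction the fix relies on can be illustrated in Python (a hypothetical helper, not part of the patch; on Windows a symlink pointing at a directory generally must be removed with rmdir(), while on POSIX unlink() removes any symlink):

```python
import os
import sys

def remove_symlink(path):
    """Remove a symlink that may point at a directory.

    Hypothetical Python analogue of the Perl rmdir-vs-unlink fix
    discussed above: Windows treats a directory symlink as a
    directory for removal purposes, POSIX does not.
    """
    if sys.platform == "win32" and os.path.isdir(path):
        os.rmdir(path)   # removes the link, not the target directory
    else:
        os.unlink(path)
```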
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Mon, Mar 23, 2020 at 6:42 PM Stephen Frost <sfrost@snowman.net> wrote: > > While I get the desire to have a default here that includes checksums, > > the way the command is structured, it strikes me as odd that the lack of > > MANIFEST_CHECKSUMS in the command actually results in checksums being > > included. > > I don't think that's quite accurate, because the default for the > MANIFEST option is 'no', so the actual default if you say nothing > about manifests at all, you don't get one. However, it is true that if > you ask for a manifest and you don't specify the type of checksums, > you get CRC-32C. We could change it so that if you ask for a manifest > you must also specify the type of checksum, but I don't see any > advantage in that approach. Nothing prevents the client from > specifying the value if it cares, but making the default "I don't > care, you pick" seems pretty sensible. It could be really helpful if, > for example, we decide to remove the initial default in a future > release for some reason. Then the client just keeps working without > needing to change anything, but anyone who explicitly specified the > old default gets an error. I get that the default for manifest is 'no', but I don't really see how that means that the lack of saying anything about checksums should mean "give me crc32c checksums". It's really rather common that if we don't specify something, it means don't do that thing- like an 'ORDER BY' clause. We aren't designing SQL here, so I'm not going to get terribly upset if you push forward with "if you don't want checksums, you have to explicitly say MANIFEST_CHECKSUMS no", but I don't agree with the reasoning here. 
> > > + <varlistentry> > > > + <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term> > > > + <listitem> > > > + <para> > > > + Specifies the algorithm that should be used to checksum each file > > > + for purposes of the backup manifest. Currently, the available > > > + algorithms are <literal>NONE</literal>, <literal>CRC32C</literal>, > > > + <literal>SHA224</literal>, <literal>SHA256</literal>, > > > + <literal>SHA384</literal>, and <literal>SHA512</literal>. > > > + The default is <literal>CRC32C</literal>. > > > + </para> > > > > As I recall, there was an invitation to argue about the defaults at one > > point, and so I'm going to say here that I would advocate for a > > different default than 'crc32c'. Specifically, I would think sha256 or > > 512 would be better. I don't recall seeing a debate about this that > > conclusively found crc32c to be better, but I'm happy to go back and > > reread anything someone wants to point me at. > > It was discussed upthread. Andrew Dunstan argued that there was no > reason to use a cryptographic checksum here and that we shouldn't do > so gratuitously. Suraj Kharage found that CRC-32C has very little > performance impact but that any of the SHA functions slow down backups > considerably. David Steele pointed out that you'd need a better > checksum if you wanted to use it for purposes such as delta restore, > with which I agree, but that's not the design center for this feature. > I concluded that different people wanted different things, so that we > ought to make this configurable, but that CRC-32C is a good default. > It has approximately a 99.9999999767169% chance of detecting a random > error, which is pretty good, and it doesn't drastically slow down > backups, which is also good. 
There were also comments made up-thread about how it might not be great for larger (eg: 1GB files, like we tend to have quite a few of...), and something about it being a 40 year old algorithm.. Having re-read some of the discussion, I'm actually more inclined to say we should be using sha256 instead of crc32c. > > It also seems a bit silly to me that using the defaults means having to > > deal with two different algorithms- crc32c and sha256. Considering how > > fast these algorithms are, compared to everything else involved in a > > backup (particularly one that's likely going across a network...), I > > wonder if we should say "may slightly increase" above. > > Actually, Suraj's results upthread show that it's a pretty big hit. So, I went back and re-read part of the thread and looked at the (seemingly, only one..?) post regarding timing and didn't understand what, exactly, was being timed there, because I didn't see the actual commands/script/whatever that was used to get those results included. I'm sure that sha256 takes a lot more time than crc32c, I'm certainly not trying to dispute that, but what's relevant here is how much it impacts the time required to run the overall backup (including sync'ing it to disk, and possibly network transmission time.. if we're just comparing the time to run it through memory then, sure, the sha256 computation time might end up being quite a bit of the time, but that's not really that interesting of a test..). > > > + <para> > > > + On the other hand, <literal>CRC32C</literal> is not a cryptographic > > > + hash function, so it is only suitable for protecting against > > > + inadvertent or random modifications to a backup. An adversary > > > + who can modify the backup could easily do so in such a way that > > > + the CRC does not change, whereas a SHA collision will be hard > > > + to manufacture.
(However, note that if the attacker also has access > > > + to modify the backup manifest itself, no checksum algorithm will > > > + provide any protection.) An additional advantage of the > > > + <literal>SHA</literal> family of functions is that they output > > > + a much larger number of bits. > > > + </para> > > > > I'm not really sure that this paragraph is sensible to include.. We > > certainly don't talk about adversaries and cryptographic hash functions > > when we talk about our page-level checksums, for example. I'm not > > completely against including it, but I don't want to give the impression > > that this is something we routinely consider or that lack of discussion > > elsewhere implies we have protections against a determined attacker. > > Given the skepticism from some quarters about CRC-32C on this thread, > I didn't want to oversell it. Also, I do think that these things are > possibly things that we should consider more widely. I agree with > Andrew's complaint that it's far too easy to just throw SHA<lots> at > problems that don't really require it without any actually good > reason. Spelling out our reasons for choosing certain algorithms for > certain purposes seems like a good habit to get into, and if we > haven't done it in other places, maybe we should. On the other hand, > while I'm inclined to keep this paragraph, I won't lose much sleep if > we decide to remove it. I don't mind spelling out reasoning for certain algorithms over others, in general, this just seems a bit much. I'm not sure we need to be going into what being a cryptographic hash function means every time we talk about any hash or checksum. Those who actually care about cryptographic hash function usage really don't need someone to explain to them that crc32c isn't cryptographically secure. The last sentence also seems kind of odd (why is a much larger number of bits, alone, an advantage..?). 
I tried to figure out a way to rewrite this and I feel like I keep ending up coming back to something like "CRC32C is a CRC, not a hash" and that kind of truism just doesn't feel terribly useful to include in our documentation. Maybe: "Using a SHA hash function provides a cryptographically secure digest of each file for users who wish to verify that the backup has not been tampered with, while the CRC32C algorithm provides a checksum which is much faster to calculate and good at catching errors due to accidental changes but is not resistant to targeted modifications. Note that, to be useful against an adversary who has access to the backup, the backup manifest would need to be stored securely elsewhere or otherwise verified to have not been modified since the backup was taken." This at least talks about things in a positive direction (SHA hash functions do this, CRC32C does that) rather than in a negative tone. > > This seems to invite the idea that pg_validatebackup should be able to > > work with external backup solutions- but I'm a bit concerned by that > > idea because it seems like it would then mean we'd have to be > > particularly careful when changing things in this area, and I'm not > > thrilled by that. I'd like to make sure that new versions of > > pg_validatebackup work with older backups, and, ideally, older versions > > of pg_validatebackup would work even with newer backups, all of which I > > think the json structure of the manifest helps us with, but that's when > > we're building the manifest and know what it's going to look like. > > Both you and David made forceful arguments that this needed to be JSON rather than an ad-hoc text format precisely so that other tools could parse it more easily, and I just spent *a lot* of time making the JSON parsing stuff work precisely so that you could have that. This project would've been done a month ago if not for that.
I don't care all that > much whether we remove the mention here, but the idea that using JSON > was so that pg_validatebackup could manage compatibility issues is > just not correct. The version number on line 1 of the file was more > than sufficient for that purpose. I stand by the decision that the manifest should be in JSON, but that's what is produced by the backend server as part of a base backup, which is quite likely going to be used by some external tools, and isn't at all the same as the external pg_validatebackup command that the discussion here is about. I also did make the argument up-thread, though I'll admit that it seemed to be mostly ignored, but I make it still, that a simple version number sucks and using JSON does avoid some of the downsides from it. Particularly, I'd love to see a v13 pg_validatebackup able to work with a v14 pg_basebackup, even if that v14 pg_basebackup added some extra stuff to the manifest. That's possible to do with a generic structure like JSON and not something that a simple version number would allow. Yes, I admit that we might change the structure or the contents in a way where that wouldn't be possible and I'm not going to raise a fuss if we do so, but this approach gives us more options. Anyway, my point here was really just that *pg_validatebackup* is about validating backups taken with pg_basebackup. While it's possible that it could be used for backups taken with other tools, I don't think that's really part of its actual mandate or that we're going to actively work to add such support in the future. > > > + <itemizedlist> > > > + <listitem> > > > + <para> > > > + <literal>backup_manifest</literal> is ignored because the backup > > > + manifest is logically not part of the backup and does not include > > > + any entry for itself. > > > + </para> > > > + </listitem> > > > > This seems a bit confusing, doesn't it? The backup_manifest must exist, > > and its checksum is internal, and is checked, isn't it? 
Why say that > > it's excluded..? > > Well, there's no entry in the backup manifest for backup_manifest > itself. Normally, the presence of a file not mentioned in > backup_manifest would cause a complaint about an extra file, but > because backup_manifest is in the ignore list, it doesn't. Yes, I get why it's excluded from the manifest and why we have code to avoid complaining about it being an extra file, but this is documentation and, in this part of the docs, we seem to be saying that we're not checking/validating the manifest, and that's certainly not actually true. In particular, the sentence right above this list is: "Certain files and directories are excluded from verification:" but we actually do verify the manifest, that's all I'm saying here. Maybe rewording that a bit is what would help, say: "Certain files and directories are not included in the manifest:" then have the entry for backup_manifest be something like: "backup_manifest is not included as it is the manifest itself and is not logically part of the backup; backup_manifest is checked using its own internal validation digest" or something along those lines. > > > + <listitem> > > > + <para> > > > + <literal>pg_wal</literal> is ignored because WAL files are sent > > > + separately from the backup, and are therefore not described by the > > > + backup manifest. > > > + </para> > > > + </listitem> > > > > I don't agree with the choice to exclude the WAL files, considering > > they're an integral part of a backup, to exclude them means that if > > they've been corrupted at all then the entire backup is invalid. You > > don't want to be discovering that when you're trying to do a restore of > > a backup that you took with pg_basebackup and which pg_validatebackup > > says is valid. 
After all, the tool being used here, pg_basebackup, > > *does* also stream the WAL files- there's no reason why we can't > > calculate a checksum on them and store that checksum somewhere and use > > it to validate the WAL files. This, in my opinion, is actually a > > show-stopper for this feature. Claiming it's a valid backup when we > > don't check the absolutely necessary-for-restore WAL is making a false > > claim, no matter how well it's documented. > > The default for pg_basebackup is -Xstream, which means that the WAL > files are being sent over a separate connection that has no connection > to the original session. The server, when generating the backup > manifest, has no idea what WAL files are being sent over that separate > connection, and thus cannot include them in the manifest. This problem > could be "solved" by having the client generate the manifest rather > than the server, but I think that cure would be worse than the > disease. As it stands, the manifest provides some protection against > transmission errors, which would be lost with that design. As you > point out, this clearly can't be done with -Xnone. I think it would be > possible to support this with -Xfetch, but we'd have to have the > manifest itself specify whether or not it included files in pg_wal, > which would require complicating the format a bit. I don't think that > makes sense. I assume -Xstream is the most commonly-used mode, because > the default used to be -Xfetch and we changed it, which I think we > would not have done unless people liked -Xstream significantly better. > Adding complexity to cater to a non-default case which I suspect is > not widely used doesn't really make sense to me. Yeah, I get that it's not easy to figure out how to validate the WAL, but I stand by my opinion that it's simply not acceptable to exclude the necessary WAL from verification and to claim that a backup is valid when we haven't checked the WAL. 
I agree that -Xfetch isn't commonly used and only supporting validation of WAL when that's used isn't a good answer. > In the future, we might want to consider improvements which could make > validation of pg_wal feasible in common cases. Specifically, suppose > that pg_basebackup could receive the manifest from the server, keep > all the entries for the existing files just as they are, but add > entries for WAL files and anything else it may have added to the > backup, recompute the manifest checksum, and store the resulting > revised manifest with the backup. That, I think, would be fairly cool, > but it's a significant body of additional development work, and this > is already quite a large patch. The patch itself has grown to about > 3000 lines, and already has 10 preparatory commits doing another ~1500 > lines of refactoring to prepare for it. Having the client calculate the checksums for the WAL and add them to the manifest is one approach and could work, but there are others:

- Have the WAL checksums be calculated during the base backup and kept somewhere, and then included in the manifest sent by the server- the backup_manifest is the last thing we send anyway, isn't it? And surely at the end of the backup we actually do know all of the WAL that's needed for the backup to be valid, because we pass that information to pg_basebackup to construct the necessary backup_label file.

- Validate the WAL using its own internal checksums instead of having the manifest involved at all. That's not ideal since we wouldn't have cryptographically secure digests for the WAL, but at least we will have validated it and raised the chances that the backup will be able to actually be restored using PG a whole bunch.

- With the 'checksum none' option, we aren't really validating contents of anything, so in that case it'd actually be alright to simply scan the WAL and make sure that we've at least got all of the WAL files needed to go from the start of the backup to the end. 
I don't think just checking that the WAL files exist is a proper solution when it comes to a backup where the user has asked for checksums to be included though. I will say that I'm really very surprised that pg_validatebackup wasn't already checking that we at least had the WAL that is needed, but I don't see any code for that. > > Not really thrilled with this (pg_basebackup certainly could figure out > > the checksum for those files...), but I also don't think it's a huge > > issue as they can be recreated by a user (unlike a WAL file..). > > Yeah, same issues, though. Here again, there are several possible > fixes: (1) make the server modify those files rather than letting > pg_basebackup do it; (2) make the client compute the manifest rather > than the server; (3) have the client revise the manifest. (3) makes > most sense to me, but I think that it would be better to return to > that topic at a later date. This is certainly not a perfect feature as > things stand but I believe it is good enough to provide significant > benefits. As I said, I don't consider these files to be as much of an issue and therefore excluding them and documenting that we do would be alright. I don't feel that's an acceptable option for the WAL though. Thanks, Stephen
On Wed, Mar 25, 2020 at 9:31 AM Stephen Frost <sfrost@snowman.net> wrote: > I get that the default for manifest is 'no', but I don't really see how > that means that the lack of saying anything about checksums should mean > "give me crc32c checksums". It's really rather common that if we don't > specify something, it means don't do that thing- like an 'ORDER BY' > clause. That's a fair argument, but I think the other relevant principle is that we try to give people useful defaults for things. I think that checksums are a sufficiently useful thing that having the default be not to do it doesn't make sense. I had the impression that you and David were in agreement on that point, actually. > There were also comments made up-thread about how it might not be great > for larger (eg: 1GB files, like we tend to have quite a few of...), and > something about it being a 40 year old algorithm.. Well, the 512MB "limit" for CRC-32C means only that for certain very specific types of errors, detection is not guaranteed above that file size. So if you have a single flipped bit, for example, and the file size is greater than 512MB, then CRC-32C has only a 99.9999999767169% chance of detecting the error, whereas if the file size is less than 512MB, it is 100% certain, because of the design of the algorithm. But nine nines is plenty, and neither SHA nor our page-level checksums provide guaranteed error detection properties anyway. I'm not sure why the fact that it's a 40-year-old algorithm is relevant. There are many 40-year-old algorithms that are very good. Generally, if we discover that we're using bad 40-year-old algorithms, like Knuth's tape sorting stuff, we eventually figure out how to replace them with something else that's better. But there's no reason to retire an algorithm simply because it's old. I have not heard anyone say, for example, that we should stop using CRC-32C for XLOG checksums. 
We continue to use it for that purpose because it (1) is highly likely to detect any errors and (2) is very fast. Those are the same reasons why I think it's a good fit for this case. My guess is that if this patch is adopted as currently proposed, we will eventually need to replace the cryptographic hash functions due to the march of time. As I'm sure you realize, the problem with hash functions that are designed to foil an adversary is that adversaries keep getting smarter. So, eventually someone will probably figure out how to do something nefarious with SHA-512. Some other technique that nobody's cracked yet will need to be adopted, and then people will begin trying to crack that, and the whole thing will repeat. But I suspect that we can keep using the same non-cryptographic hash function essentially forever. It does not matter that people know how the algorithm works because it makes no pretensions of trying to foil an opponent. It is just trying to mix up the bits in such a way that a change to the file is likely to cause a change in the checksum. The bit-mixing properties of the algorithm do not degrade with the passage of time. > I'm sure that sha256 takes a lot more time than crc32c, I'm certainly > not trying to dispute that, but what's relevant here is how much it > impacts the time required to run the overall backup (including sync'ing > it to disk, and possibly network transmission time.. if we're just > comparing the time to run it through memory then, sure, the sha256 > computation time might end up being quite a bit of the time, but that's > not really that interesting of a test..). I think that http://postgr.es/m/38e29a1c-0d20-fc73-badd-ca05f7f07ffa@pgmasters.net is one of the more interesting emails on this topic. My conclusion from that email, and the ones that led up to it, was that there is a 40-50% overhead from doing a SHA checksum, but in pgbackrest, users don't see it because backups are compressed. 
Because the compression uses so much CPU time, the additional overhead from the SHA checksum is only a few percent more. But I don't think that it would be smart to slow down uncompressed backups by 40-50%. That's going to cause a problem for somebody, almost for sure. > Maybe: > > "Using a SHA hash function provides a cryptographically secure digest > of each file for users who wish to verify that the backup has not been > tampered with, while the CRC32C algorithm provides a checksum which is > much faster to calculate and good at catching errors due to accidental > changes but is not resistant to targeted modifications. Note that, to > be useful against an adversary who has access to the backup, the backup > manifest would need to be stored securely elsewhere or otherwise > verified to have not been modified since the backup was taken." > > This at least talks about things in a positive direction (SHA hash > functions do this, CRC32C does that) rather than in a negative tone. Cool. I like it. > Anyway, my point here was really just that *pg_validatebackup* is about > validating backups taken with pg_basebackup. While it's possible that > it could be used for backups taken with other tools, I don't think > that's really part of its actual mandate or that we're going to actively > work to add such support in the future. I think you're kind of just nitpicking here, because the statement that pg_validatebackup can validate not only a backup taken by pg_basebackup but also a backup taken using some compatible method is just a tautology. But I'll remove the reference. > In particular, the sentence right above this list is: > > "Certain files and directories are excluded from verification:" > > but we actually do verify the manifest, that's all I'm saying here. > > Maybe rewording that a bit is what would help, say: > > "Certain files and directories are not included in the manifest:" Well, that'd be wrong, though. 
It's true that backup_manifest won't have an entry in the manifest, and neither will WAL files, but postgresql.auto.conf will. We'll just skip complaining about it if the checksum doesn't match or whatever. The server generates manifest entries for everything, and the client decides not to pay attention to some of them because it knows that pg_basebackup may have made certain changes that were not known to the server. > Yeah, I get that it's not easy to figure out how to validate the WAL, > but I stand by my opinion that it's simply not acceptable to exclude the > necessary WAL from verification and to claim that a backup is valid > when we haven't checked the WAL. I hear that, but I don't agree that having nothing is better than having this much committed. I would be fine with renaming the tool (pg_validatebackupmanifest? pg_validatemanifest?), or with updating the documentation to be more clear about what is and is not checked, but I'm not going to extend the tool to do totally new things for which we don't even have an agreed design yet. I believe in trying to create patches that do one thing and do it well, and this patch does that. The fact that it doesn't do some other thing that is conceptually related yet different is a good thing, not a bad one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
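To make the JSON forward-compatibility argument from earlier in the thread concrete: a reader that treats the manifest as generic JSON can check only the keys it understands and silently ignore anything a newer server adds, which a bare version number on line 1 of an ad-hoc format can't give you. The key names below are illustrative, not necessarily what the server actually emits:

```python
import json

def read_manifest(text: str) -> dict:
    # Treat the manifest as generic JSON: look only at the keys we
    # understand and skip fields a newer server may have added.
    doc = json.loads(text)
    files = {}
    for entry in doc.get("Files", []):
        files[entry["Path"]] = {
            "size": entry["Size"],
            # Checksum may be absent if the manifest was built without them.
            "checksum": entry.get("Checksum"),
        }
    return files

# A manifest from a hypothetical newer server with an extra per-file field:
sample = json.dumps({
    "Manifest-Version": 1,
    "Files": [{"Path": "base/1/1259", "Size": 8192,
               "Checksum-Algorithm": "CRC32C", "Checksum": "deadbeef",
               "Some-Future-Field": "ignored"}],
})
```

An older reader built this way keeps working when new fields appear, although, as conceded up-thread, a structural change to the format would still break it.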
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Wed, Mar 25, 2020 at 9:31 AM Stephen Frost <sfrost@snowman.net> wrote: > > I get that the default for manifest is 'no', but I don't really see how > > that means that the lack of saying anything about checksums should mean > > "give me crc32c checksums". It's really rather common that if we don't > > specify something, it means don't do that thing- like an 'ORDER BY' > > clause. > > That's a fair argument, but I think the other relevant principle is > that we try to give people useful defaults for things. I think that > checksums are a sufficiently useful thing that having the default be > not to do it doesn't make sense. I had the impression that you and > David were in agreement on that point, actually. I agree with wanting to have useful defaults and that checksums should be included by default, and I'm alright even with letting people pick what algorithms they'd like to have too. The construct here is made odd because we've got this idea that "no checksum" is an option, which is actually something that I don't particularly like, but that's what's making this particular syntax weird. I don't suppose you'd be open to the idea of just dropping that though..? There wouldn't be any issue with this syntax if we just always had checksums included when a manifest is requested. :) Somehow, I don't think I'm going to win that argument. > > There were also comments made up-thread about how it might not be great > > for larger (eg: 1GB files, like we tend to have quite a few of...), and > > something about it being a 40 year old algorithm.. > > Well, the 512MB "limit" for CRC-32C means only that for certain very > specific types of errors, detection is not guaranteed above that file > size. 
So if you have a single flipped bit, for example, and the file > size is greater than 512MB, then CRC-32C has only a 99.9999999767169% > chance of detecting the error, whereas if the file size is less than > 512MB, it is 100% certain, because of the design of the algorithm. But > nine nines is plenty, and neither SHA nor our page-level checksums > provide guaranteed error detection properties anyway. Right, so we know that CRC-32C has an upper-bound of 512MB to be useful for exactly what it's designed to be useful for, but we also know that we're going to have larger files- at least 1GB ones, and quite possibly larger, so why are we choosing this? At the least, wouldn't it make sense to consider a larger CRC, one whose limit is above the size of commonly expected files, if we're going to use a CRC? > I'm not sure why the fact that it's a 40-year-old algorithm is > relevant. There are many 40-year-old algorithms that are very good. Sure there are, but there probably wasn't a lot of thought about GB-sized files, and this doesn't really seem to be the direction people are going in for larger objects. s3, as an example, uses sha256. Google, it seems, suggests folks use "HighwayHash" (from their crc32c github repo- https://github.com/google/crc32c). Most CRC uses seem to be for much smaller data sets. > My guess is that if this patch is adopted as currently proposed, we > will eventually need to replace the cryptographic hash functions due > to the march of time. As I'm sure you realize, the problem with hash > functions that are designed to foil an adversary is that adversaries > keep getting smarter. So, eventually someone will probably figure out > how to do something nefarious with SHA-512. Some other technique that > nobody's cracked yet will need to be adopted, and then people will > begin trying to crack that, and the whole thing will repeat. But I > suspect that we can keep using the same non-cryptographic hash > function essentially forever. 
It does not matter that people know how > the algorithm works because it makes no pretensions of trying to foil > an opponent. It is just trying to mix up the bits in such a way that a > change to the file is likely to cause a change in the checksum. The > bit-mixing properties of the algorithm do not degrade with the passage > of time. Sure, there's a good chance we'll need newer algorithms in the future, I don't doubt that. On the other hand, if crc32c, or CRC whatever, was the perfect answer and no one will ever need something better, then what's with folks like Google suggesting something else..? > > I'm sure that sha256 takes a lot more time than crc32c, I'm certainly > > not trying to dispute that, but what's relevant here is how much it > > impacts the time required to run the overall backup (including sync'ing > > it to disk, and possibly network transmission time.. if we're just > > comparing the time to run it through memory then, sure, the sha256 > > computation time might end up being quite a bit of the time, but that's > > not really that interesting of a test..). > > I think that http://postgr.es/m/38e29a1c-0d20-fc73-badd-ca05f7f07ffa@pgmasters.net > is one of the more interesting emails on this topic. My conclusion > from that email, and the ones that led up to it, was that there is a > 40-50% overhead from doing a SHA checksum, but in pgbackrest, users > don't see it because backups are compressed. Because the compression > uses so much CPU time, the additional overhead from the SHA checksum > is only a few percent more. But I don't think that it would be smart > to slow down uncompressed backups by 40-50%. That's going to cause a > problem for somebody, almost for sure. I like that email on the topic also, as it points out again (as I tried to do earlier also..) that it depends on what we're actually including in the test- and it seems, again, that those tests didn't consider the time to actually write the data somewhere, either network or disk. 
As for folks who are that close to the edge on their backup timing that they can't have it slow down- chances are pretty darn good that they're not far from ending up needing to find a better solution than pg_basebackup anyway. Or they don't need to generate a manifest (or, I suppose, they could have one but not have checksums..). > > In particular, the sentence right above this list is: > > > > "Certain files and directories are excluded from verification:" > > > > but we actually do verify the manifest, that's all I'm saying here. > > > > Maybe rewording that a bit is what would help, say: > > > > "Certain files and directories are not included in the manifest:" > > Well, that'd be wrong, though. It's true that backup_manifest won't > have an entry in the manifest, and neither will WAL files, but > postgresql.auto.conf will. We'll just skip complaining about it if the > checksum doesn't match or whatever. The server generates manifest > entries for everything, and the client decides not to pay attention to > some of them because it knows that pg_basebackup may have made certain > changes that were not known to the server. Ok, but it's also wrong to say that the backup_label is excluded from verification. > > Yeah, I get that it's not easy to figure out how to validate the WAL, > > but I stand by my opinion that it's simply not acceptable to exclude the > > necessary WAL from verification and to claim that a backup is valid > > when we haven't checked the WAL. > > I hear that, but I don't agree that having nothing is better than > having this much committed. I would be fine with renaming the tool > (pg_validatebackupmanifest? pg_validatemanifest?), or with updating > the documentation to be more clear about what is and is not checked, > but I'm not going to extend the tool to do totally new things for > which we don't even have an agreed design yet. I believe in trying to > create patches that do one thing and do it well, and this patch does > that. 
The fact that it doesn't do some other thing that is > conceptually related yet different is a good thing, not a bad one. I fail to see the usefulness of a tool that doesn't actually verify that the backup is able to be restored from. Even pg_basebackup (in both fetch and stream modes...) checks that we at least got all the WAL that's needed for the backup from the server before considering the backup to be valid and telling the user that there was a successful backup. With what you're proposing here, we could have someone do a pg_basebackup, get back an ERROR saying the backup wasn't valid, and then run pg_validatebackup and be told that the backup is valid. I don't get how that's sensible. Thanks, Stephen
On Wed, Mar 25, 2020 at 4:54 PM Stephen Frost <sfrost@snowman.net> wrote: > > That's a fair argument, but I think the other relevant principle is > > that we try to give people useful defaults for things. I think that > > checksums are a sufficiently useful thing that having the default be > > not to do it doesn't make sense. I had the impression that you and > > David were in agreement on that point, actually. > > I agree with wanting to have useful defaults and that checksums should > be included by default, and I'm alright even with letting people pick > what algorithms they'd like to have too. The construct here is made odd > because we've got this idea that "no checksum" is an option, which is > actually something that I don't particularly like, but that's what's > making this particular syntax weird. I don't suppose you'd be open to > the idea of just dropping that though..? There wouldn't be any issue > with this syntax if we just always had checksums included when a > manifest is requested. :) > > Somehow, I don't think I'm going to win that argument. Well, it's not a crazy idea. So, at some point, I had the idea that you were always going to get a manifest, and therefore you should at least have the option of not checksumming to avoid the overhead. But, as things stand now, you can suppress the manifest altogether, so that you can still take a backup even if you've got no disk space to spool the manifest on the master. So, if you really want no overhead from manifests, just don't have a manifest. And if you are OK with some overhead, why not at least have a CRC-32C checksum, which is, after all, pretty cheap? Now, on the other hand, I don't have any strong evidence that the manifest-without-checksums mode is useless. You can still use it to verify that you have the correct files and that those files have the expected sizes. And, verifying those things is very cheap, because you only need to stat() each file, not open and read them all. 
True, you can do those things by using pg_validatebackup -s. But, you'd still incur the (admittedly fairly low) overhead of computing checksums that you don't intend to use. This is where I feel like I'm trying to make decisions in a vacuum. If we had a few more people weighing in on the thread on this point, I'd be happy to go with whatever the consensus was. If most people think having both --no-manifest (suppressing the manifest completely) and --manifest-checksums=none (suppressing only the checksums) is useless and confusing, then sure, let's rip the latter one out. If most people like the flexibility, let's keep it: it's already implemented and tested. But I hate to base the decision on what one or two people think. > > Well, the 512MB "limit" for CRC-32C means only that for certain very > > specific types of errors, detection is not guaranteed above that file > > size. So if you have a single flipped bit, for example, and the file > > size is greater than 512MB, then CRC-32C has only a 99.9999999767169% > > chance of detecting the error, whereas if the file size is less than > > 512MB, it is 100% certain, because of the design of the algorithm. But > > nine nines is plenty, and neither SHA nor our page-level checksums > > provide guaranteed error detection properties anyway. > > Right, so we know that CRC-32C has an upper-bound of 512MB to be useful > for exactly what it's designed to be useful for, but we also know that > we're going to have larger files- at least 1GB ones, and quite possibly > larger, so why are we choosing this? > > At the least, wouldn't it make sense to consider a larger CRC, one whose > limit is above the size of commonly expected files, if we're going to > use a CRC? I mean, you're just repeating the same argument here, and it's just not valid. Regardless of the file size, the chances of a false checksum match are literally less than one in a billion. 
There is every reason to believe that users will be happy with a low-overhead method that has a 99.9999999+% chance of detecting corrupt files. I do agree that a 64-bit CRC would probably be not much more expensive and improve the probability of detecting errors even further, but I wanted to restrict this patch to using infrastructure we already have. The choices there are the various SHA functions (so I supported those), MD5 (which I deliberately omitted, for reasons I hope you'll be the first to agree with), CRC-32C (which is fast), a couple of other CRC-32 variants (which I omitted because they seemed redundant and one of them only ever existed in PostgreSQL because of a coding mistake), and the hacked-up version of FNV that we use for page-level checksums (which is only 16 bits and seems to have no advantages for this purpose). > > I'm not sure why the fact that it's a 40-year-old algorithm is > > relevant. There are many 40-year-old algorithms that are very good. > > Sure there are, but there probably wasn't a lot of thought about > GB-sized files, and this doesn't really seem to be the direction people > are going in for larger objects. s3, as an example, uses sha256. > Google, it seems, suggests folks use "HighwayHash" (from their crc32c > github repo- https://github.com/google/crc32c). Most CRC uses seem to > be for much smaller data sets. Again, I really want to stick with infrastructure we already have. Trying to find a hash function that will please everybody is a hole with no bottom, or more to the point, a bikeshed in need of painting. There are TONS of great hash functions out there on the Internet, and as previous discussions of pgsql-hackers will attest, as soon as you go down that road, somebody will say "well, what about xxhash" or whatever, and then you spend the rest of your life trying to figure out what hash function we could try to commit that is fast and secure and doesn't have copyright or patent problems. 
There have been multiple efforts to introduce such hash functions in the past, and I think basically all of those have crashed into a brick wall. I don't think that's because introducing new hash functions is a bad idea. I think that there are various reasons why it might be a good idea. For instance, highwayhash purports to be a cryptographic hash function that is fast enough to replace non-cryptographic hash functions. It's easy to see why someone might want that, here. For example, it would be entirely reasonable to copy the backup manifest onto a USB key and store it in a vault. Later, if you get the USB key back out of the vault and validate it against the backup, you pretty much know that none of the data files have been tampered with, provided that you used a cryptographic hash. So, SHA is a good option for people who have a USB key and a vault, and a faster cryptographic hash might be even better. I don't have any desire to block such proposals, and I would be thrilled if this work inspires other people to add such options. However, I also don't want this patch to get blocked by an interminable argument about which hash functions we ought to use. The ones we have in core now are good enough for a start, and more can be added later. > Sure, there's a good chance we'll need newer algorithms in the future, I > don't doubt that. On the other hand, if crc32c, or CRC whatever, was > the perfect answer and no one will ever need something better, then > what's with folks like Google suggesting something else..? I have never said that CRC was the perfect answer, and the reason why Google is suggesting something different is because they wanted a fast hash (not SHA) that still has cryptographic properties. What I have said is that using CRC-32C by default means that there is very little downside as compared with current releases. Backups will not get slower, and error detection will get better. 
If you pick any other default from the menu of options currently available, then either backups get noticeably slower, or we get less error detection capability than that option gives us. > As for folks who are that close to the edge on their backup timing that > they can't have it slow down- chances are pretty darn good that they're > not far from ending up needing to find a better solution than > pg_basebackup anyway. Or they don't need to generate a manifest (or, I > suppose, they could have one but not have checksums..). 40-50% is a lot more than "if you were on the edge." > > Well, that'd be wrong, though. It's true that backup_manifest won't > > have an entry in the manifest, and neither will WAL files, but > > postgresql.auto.conf will. We'll just skip complaining about it if the > > checksum doesn't match or whatever. The server generates manifest > > entries for everything, and the client decides not to pay attention to > > some of them because it knows that pg_basebackup may have made certain > > changes that were not known to the server. > > Ok, but it's also wrong to say that the backup_label is excluded from > verification. The docs don't say that backup_label is excluded from verification. They do say that backup_manifest is excluded from verification *against the manifest*, because it is. I'm not sure if you're honestly confused here or if we're just devolving into arguing for the sake of argument, but right now the code looks like this:

simple_string_list_append(&context.ignore_list, "backup_manifest");
simple_string_list_append(&context.ignore_list, "pg_wal");
simple_string_list_append(&context.ignore_list, "postgresql.auto.conf");
simple_string_list_append(&context.ignore_list, "recovery.signal");
simple_string_list_append(&context.ignore_list, "standby.signal");

Notice that this is the same list of files mentioned in the documentation. 
Now let's suppose we remove the first of those lines of code, so that backup_manifest is not in the exclude list by default. Now let's try to validate a backup:

[rhaas pgsql]$ src/bin/pg_validatebackup/pg_validatebackup ~/pgslave
pg_validatebackup: error: "backup_manifest" is present on disk but not in the manifest

Oops. If you read that error carefully, you can see that the complaint is 100% valid. backup_manifest is indeed present on disk, but not in the manifest. However, because this situation is expected and known not to be a problem, the right thing to do is suppress the error. That is why it is in the ignore_list by default. The documentation is attempting to explain this. If it's unclear, we should try to make it better, but it is absolutely NOT saying that there is no internal validation of the backup_manifest. In fact, the previous paragraph tries to explain that:

+   <application>pg_validatebackup</application> reads the manifest file of a
+   backup, verifies the manifest against its own internal checksum, and then

It is, however, saying, and *entirely correctly*, that pg_validatebackup will not check the backup_manifest file against the backup_manifest. If it did, it would find that it's not there. It would then emit an error message like the one above even though there's no problem with the backup. > I fail to see the usefulness of a tool that doesn't actually verify that > the backup is able to be restored from. > > Even pg_basebackup (in both fetch and stream modes...) checks that we at > least got all the WAL that's needed for the backup from the server > before considering the backup to be valid and telling the user that > there was a successful backup. With what you're proposing here, we > could have someone do a pg_basebackup, get back an ERROR saying the > backup wasn't valid, and then run pg_validatebackup and be told that the > backup is valid. I don't get how that's sensible. 
I'm sorry that you can't see how that's sensible, but it doesn't mean that it isn't sensible. It is totally unrealistic to expect that any backup verification tool can verify that you won't get an error when trying to use the backup. That would require that the validation tool try to do everything that PostgreSQL will try to do when the backup is used, including running recovery and updating the data files. Anything less than that creates a real possibility that the backup will verify good but fail when used. This tool has a much narrower purpose, which is to try to verify that we (still) have the files the server sent as part of the backup and that, to the best of our ability to detect such things, they have not been modified. As you know, or should know, the WAL files are not sent as part of the backup, and so are not verified. Other things that would also be useful to check are also not verified. It would be fantastic to have more verification tools in the future, but it is difficult to see why anyone would bother trying if an attempt to get the first one committed gets blocked because it does not yet do everything. Very few patches try to do everything, and those that do usually get blocked because, by trying to do too much, they get some of it badly wrong. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
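[Editorial aside: the narrow purpose described here, files still present, sizes right, contents matching any recorded checksum, can be sketched roughly as follows. To be clear about assumptions: this is Python rather than the tool's C, the helper name verify_file is invented, and zlib.crc32 computes plain CRC-32, not the CRC-32C variant discussed in the thread.]

```python
import os
import zlib

def verify_file(path, expected_size, expected_crc=None):
    """Check one manifest entry against disk.

    Size is checked with a cheap stat(); the file is read only when a
    checksum was recorded, mirroring the manifest-without-checksums mode
    discussed up-thread. (Illustrative sketch, not the real tool's code;
    zlib.crc32 is CRC-32, not CRC-32C.)
    """
    actual_size = os.stat(path).st_size
    if actual_size != expected_size:
        return f'"{path}": size is {actual_size} on disk, {expected_size} in manifest'
    if expected_crc is not None:
        crc = 0
        with open(path, "rb") as f:
            while chunk := f.read(65536):
                crc = zlib.crc32(chunk, crc)
        if crc != expected_crc:
            return f'"{path}": checksum mismatch'
    return None  # entry verifies clean
```

Note what a check like this cannot promise: a clean result means the bytes still match what was recorded, not that recovery will succeed, which is exactly the distinction drawn above.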
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Wed, Mar 25, 2020 at 4:54 PM Stephen Frost <sfrost@snowman.net> wrote: > > > That's a fair argument, but I think the other relevant principle is > > > that we try to give people useful defaults for things. I think that > > > checksums are a sufficiently useful thing that having the default be > > > not to do it doesn't make sense. I had the impression that you and > > > David were in agreement on that point, actually. > > > > I agree with wanting to have useful defaults and that checksums should > > be included by default, and I'm alright even with letting people pick > > what algorithms they'd like to have too. The construct here is made odd > > because we've got this idea that "no checksum" is an option, which is > > actually something that I don't particularly like, but that's what's > > making this particular syntax weird. I don't suppose you'd be open to > > the idea of just dropping that though..? There wouldn't be any issue > > with this syntax if we just always had checksums included when a > > manifest is requested. :) > > > > Somehow, I don't think I'm going to win that argument. > > Well, it's not a crazy idea. So, at some point, I had the idea that > you were always going to get a manifest, and therefore you should at > least have the option of not checksumming to avoid the > overhead. But, as things stand now, you can suppress the manifest > altogether, so that you can still take a backup even if you've got no > disk space to spool the manifest on the master. So, if you really want > no overhead from manifests, just don't have a manifest. And if you are > OK with some overhead, why not at least have a CRC-32C checksum, which > is, after all, pretty cheap? > > Now, on the other hand, I don't have any strong evidence that the > manifest-without-checksums mode is useless. You can still use it to > verify that you have the correct files and that those files have the > expected sizes. 
And, verifying those things is very cheap, because you > only need to stat() each file, not open and read them all. True, you > can do those things by using pg_validatebackup -s. But, you'd still > incur the (admittedly fairly low) overhead of computing checksums that > you don't intend to use. > > This is where I feel like I'm trying to make decisions in a vacuum. If > we had a few more people weighing in on the thread on this point, I'd > be happy to go with whatever the consensus was. If most people think > having both --no-manifest (suppressing the manifest completely) and > --manifest-checksums=none (suppressing only the checksums) is useless > and confusing, then sure, let's rip the latter one out. If most people > like the flexibility, let's keep it: it's already implemented and > tested. But I hate to base the decision on what one or two people > think. I'm frustrated at the lack of involvement from others also. Just to be clear- I'm not completely against having a 'manifest but no checksum' option, but if that's what we're going to have then it seems like the syntax should be such that if you don't specify checksums then you don't get checksums and "MANIFEST_CHECKSUM none" shouldn't be a thing. All that said, as I said up-thread, I appreciate that we aren't designing SQL here and that this is pretty special syntax to begin with, so if you ended up committing it the way you have it now, so be it, I wouldn't be asking for it to be reverted over this. It's a bit awkward and kind of a thorn, but it's not entirely unreasonable, and we'd probably end up there anyway if we started out without a 'none' option and someone did come up with a good argument and a patch to add such an option in the future. > > > Well, the 512MB "limit" for CRC-32C means only that for certain very > > > specific types of errors, detection is not guaranteed above that file > > > size. 
So if you have a single flipped bit, for example, and the file > > > size is greater than 512MB, then CRC-32C has only a 99.9999999767169% > > > chance of detecting the error, whereas if the file size is less than > > > 512MB, it is 100% certain, because of the design of the algorithm. But > > > nine nines is plenty, and neither SHA nor our page-level checksums > > > provide guaranteed error detection properties anyway. > > > > Right, so we know that CRC-32C has an upper-bound of 512MB to be useful > > for exactly what it's designed to be useful for, but we also know that > > we're going to have larger files- at least 1GB ones, and quite possibly > > larger, so why are we choosing this? > > > > At the least, wouldn't it make sense to consider a larger CRC, one whose > > limit is above the size of commonly expected files, if we're going to > > use a CRC? > > I mean, you're just repeating the same argument here, and it's just > not valid. Regardless of the file size, the chances of a false > checksum match are literally less than one in a billion. There is > every reason to believe that users will be happy with a low-overhead > method that has a 99.9999999+% chance of detecting corrupt files. I do > agree that a 64-bit CRC would probably be not much more expensive and > improve the probability of detecting errors even further, but I wanted > to restrict this patch to using infrastructure we already have. The > choices there are the various SHA functions (so I supported those), > MD5 (which I deliberately omitted, for reasons I hope you'll be the > first to agree with), CRC-32C (which is fast), a couple of other > CRC-32 variants (which I omitted because they seemed redundant and one > of them only ever existed in PostgreSQL because of a coding mistake), > and the hacked-up version of FNV that we use for page-level checksums > (which is only 16 bits and seems to have no advantages for this > purpose). 
The argument that "well, we happened to already have it, even though we used it for much smaller data sets, which are well within the 100%-single-bit-error detection limit" certainly doesn't make me any more supportive of this. Choosing the right algorithm to use maybe shouldn't be based on the age of that algorithm, but it also certainly shouldn't be "just because we already have it" when we're using it for a very different use-case. I'm guessing folks have already seen it, but I thought this was an interesting run-down of actual collisions based on various checksum lengths using one data set (though it's not clear exactly how big it is, from what I can see)- http://www.backplane.com/matt/crc64.html I do agree with excluding things like md5 and others that aren't good options. I wasn't saying we should necessarily exclude crc32c either.. but rather saying that it shouldn't be the default. Here's another way to look at it- where do we use crc32c today, and how much data might we possibly be covering with that crc? Why was crc32c picked for that purpose? If the individual who decided to pick crc32c for that case was contemplating a checksum for up-to-1GB files, would they have picked crc32c? Seems unlikely to me. > > > I'm not sure why the fact that it's a 40-year-old algorithm is > > > relevant. There are many 40-year-old algorithms that are very good. > > > > Sure there are, but there probably wasn't a lot of thought about > > GB-sized files, and this doesn't really seem to be the direction people > > are going in for larger objects. s3, as an example, uses sha256. > > Google, it seems, suggests folks use "HighwayHash" (from their crc32c > > github repo- https://github.com/google/crc32c). Most CRC uses seem to > > be for much smaller data sets. > > Again, I really want to stick with infrastructure we already have. I don't agree with that as a sensible justification for picking it for this case, because it's clearly not the same use-case. 
> Trying to find a hash function that will please everybody is a hole > with no bottom, or more to the point, a bikeshed in need of painting. > There are TONS of great hash functions out there on the Internet, and > as previous discussions of pgsql-hackers will attest, as soon as you > go down that road, somebody will say "well, what about xxhash" or > whatever, and then you spend the rest of your life trying to figure > out what hash function we could try to commit that is fast and secure > and doesn't have copyright or patent problems. There have been > multiple efforts to introduce such hash functions in the past, and I > think basically all of those have crashed into a brick wall. > > I don't think that's because introducing new hash functions is a bad > idea. I think that there are various reasons why it might be a good > idea. For instance, highwayhash purports to be a cryptographic hash > function that is fast enough to replace non-cryptographic hash > functions. It's easy to see why someone might want that, here. For > example, it would be entirely reasonable to copy the backup manifest > onto a USB key and store it in a vault. Later, if you get the USB key > back out of the vault and validate it against the backup, you pretty > much know that none of the data files have been tampered with, > provided that you used a cryptographic hash. So, SHA is a good option > for people who have a USB key and a vault, and a faster cryptographic > hash might be even better. I don't have any desire to block such proposals, > and I would be thrilled if this work inspires other people to add such > options. However, I also don't want this patch to get blocked by an > interminable argument about which hash functions we ought to use. The > ones we have in core now are good enough for a start, and more can be > added later. 
I'm not actually arguing about which hash functions we should support, but rather what the default is and if crc32c, specifically, is actually a reasonable choice. Just because it's fast and we already had an implementation of it doesn't justify its use as the default. That it doesn't actually provide the check that is generally expected of CRC checksums (100% detection of single-bit errors) when the file size gets over 512MB makes me wonder whether we should have it at all, yes, but it definitely makes me think it shouldn't be our default. Folks look to PG as being pretty good at figuring things out and doing the thing that makes sense to minimize risk of data loss or corruption. I can understand and agree with the desire to have a faster alternative to sha256 for those who don't need a cryptographically safe hash, but if we're going to provide that option, it should be the right answer and it's pretty clear, at least to me, that crc32c isn't a good choice for gigabyte-size files. > > Sure, there's a good chance we'll need newer algorithms in the future, I > > don't doubt that. On the other hand, if crc32c, or CRC whatever, was > > the perfect answer and no one will ever need something better, then > > what's with folks like Google suggesting something else..? > > I have never said that CRC was the perfect answer, and the reason why > Google is suggesting something different is because they wanted a fast > hash (not SHA) that still has cryptographic properties. What I have > said is that using CRC-32C by default means that there is very little > downside as compared with current releases. Backups will not get > slower, and error detection will get better. If you pick any other > default from the menu of options currently available, then either > backups get noticeably slower, or we get less error detection > capability than that option gives us. I don't agree with limiting our view to only those algorithms that we've already got implemented in PG. 
> > As for folks who are that close to the edge on their backup timing that > > they can't have it slow down- chances are pretty darn good that they're > > not far from ending up needing to find a better solution than > > pg_basebackup anyway. Or they don't need to generate a manifest (or, I > > suppose, they could have one but not have checksums..). > > 40-50% is a lot more than "if you were on the edge." We can agree to disagree on this, it's not particularly relevant in the end. > > > Well, that'd be wrong, though. It's true that backup_manifest won't > > > have an entry in the manifest, and neither will WAL files, but > > > postgresql.auto.conf will. We'll just skip complaining about it if the > > > checksum doesn't match or whatever. The server generates manifest > > > entries for everything, and the client decides not to pay attention to > > > some of them because it knows that pg_basebackup may have made certain > > > changes that were not known to the server. > > > > Ok, but it's also wrong to say that the backup_label is excluded from > > verification. > > The docs don't say that backup_label is excluded from verification. > They do say that backup_manifest is excluded from verification > *against the manifest*, because it is. I'm not sure if you're honestly > confused here or if we're just devolving into arguing for the sake of > argument, but right now the code looks like this: That you're bringing up code here is really just not sensible- we're talking about the documentation, not about the code here. I do understand what the code is doing and I don't have any complaint about the code. > Oops. If you read that error carefully, you can see that the complaint > is 100% valid. backup_manifest is indeed present on disk, but not in > the manifest. However, because this situation is expected and known > not to be a problem, the right thing to do is suppress the error. That > is why it is in the ignore_list by default. 
The documentation is > attempting to explain this. If it's unclear, we should try to make it > better, but it is absolutely NOT saying that there is no internal > validation of the backup_manifest. In fact, the previous paragraph > tries to explain that: Yes, I think the documentation is unclear, as I said before, because it purports to list things that aren't being validated and then includes backup_manifest in that list, which doesn't make sense. The sentence in question does *not* say "Certain files and directories are excluded from the manifest" (which is wording that I actually proposed up-thread, to try to address this...), it says, from the patch: "Certain files and directories are excluded from verification:" Excluded from verification. Then lists backup_manifest. Even though, earlier in that same paragraph it says that the manifest is verified against its own checksum. > + <application>pg_validatebackup</application> reads the manifest file of a > + backup, verifies the manifest against its own internal checksum, and then > > It is, however, saying, and *entirely correctly*, that > pg_validatebackup will not check the backup_manifest file against the > backup_manifest. If it did, it would find that it's not there. It > would then emit an error message like the one above even though > there's no problem with the backup. It's saying, removing the listing aspect, exactly that "backup_manifest is excluded from verification". That's what I am taking issue with. I've made multiple attempts to suggest other language to avoid saying that because it's clearly wrong- the manifest is verified. > > I fail to see the usefulness of a tool that doesn't actually verify that > the backup is able to be restored from. > > Even pg_basebackup (in both fetch and stream modes...) checks that we at > least got all the WAL that's needed for the backup from the server > before considering the backup to be valid and telling the user that > there was a successful backup. 
With what you're proposing here, we > could have someone do a pg_basebackup, get back an ERROR saying the > backup wasn't valid, and then run pg_validatebackup and be told that the > backup is valid. I don't get how that's sensible. > > I'm sorry that you can't see how that's sensible, but it doesn't mean > that it isn't sensible. It is totally unrealistic to expect that any > backup verification tool can verify that you won't get an error when > trying to use the backup. That would require that the > validation tool try to do everything that PostgreSQL will try to do > when the backup is used, including running recovery and updating the > data files. Anything less than that creates a real possibility that > the backup will verify good but fail when used. This tool has a much > narrower purpose, which is to try to verify that we (still) have the > files the server sent as part of the backup and that, to the best of > our ability to detect such things, they have not been modified. As you > know, or should know, the WAL files are not sent as part of the > backup, and so are not verified. Other things that would also be > useful to check are also not verified. It would be fantastic to have > more verification tools in the future, but it is difficult to see why > anyone would bother trying if an attempt to get the first one > committed gets blocked because it does not yet do everything. Very few > patches try to do everything, and those that do usually get blocked > because, by trying to do too much, they get some of it badly wrong. I'm not talking about making sure that no error ever happens when doing a restore of a particular backup. You're arguing against something that I have not advocated for and which I don't advocate for. I'm saying that the existing tool that takes the backup has a *really* *important* verification check that this proposed "validate backup" tool doesn't have, and that isn't sensible. 
It leads to situations where the backup tool itself, pg_basebackup, can fail or be killed before it's actually completed, and the "validate backup" tool would say that the backup is perfectly fine. That is not sensible. That there might be other reasons why a backup can't be restored isn't relevant and I'm not asking for a tool that is perfect and does some kind of proof that the backup is able to be restored. Thanks, Stephen
> On Mar 26, 2020, at 9:34 AM, Stephen Frost <sfrost@snowman.net> wrote: > > I'm not actually arguing about which hash functions we should support, > but rather what the default is and if crc32c, specifically, is actually > a reasonable choice. Just because it's fast and we already had an > implementation of it doesn't justify its use as the default. Given that > it doesn't actually provide the check that is generally expected of > CRC checksums (100% detection of single-bit errors) when the file size > gets over 512MB makes me wonder if we should have it at all, yes, but it > definitely makes me think it shouldn't be our default. I don't understand your focus on the single-bit error issue. If you are sending your backup across the wire, single bit errors during transmission should already be detected as part of the networking protocol. The real issue has to be detection of the kinds of errors or modifications that are most likely to happen in practice. Which are those? People manually mucking with the files? Bugs in backup scripts? Corruption on the storage device? Truncated files? The more bits in the checksum (assuming a well designed checksum algorithm), the more likely we are to detect accidental modification, so it is no surprise if a 64-bit crc does better than 32-bit crc. But that logic can be taken arbitrarily far. I don't see the connection between, on the one hand, an analysis of single-bit error detection against file size, and on the other hand, the verification of backups. From a support perspective, I think the much more important issue is making certain that checksums are turned on. A one in a billion chance of missing an error seems pretty acceptable compared to the, let's say, one in two chance that your customer didn't use checksums. Why are we even allowing this to be turned off? Is there a usage case compelling that option? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
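[Editorial aside: Mark's "more bits, better detection" point is easy to quantify with the usual back-of-envelope model: for a well-mixed n-bit checksum, a random corruption has a one in 2^n chance of leaving the checksum unchanged. A small sketch follows; the model is an approximation, not a proven property of any particular CRC.]

```python
def undetected_probability(bits):
    """Back-of-envelope odds that a random corruption leaves an n-bit
    checksum unchanged: one chance in 2**bits."""
    return 2.0 ** -bits

# The 32-bit case reproduces the 99.9999999767169% detection figure
# quoted earlier in the thread; 64 bits buys about ten more nines.
for bits in (16, 32, 64):
    print(f"{bits}-bit checksum: 1 in {2**bits:,} corruptions goes undetected")
```

This is the sense in which both sides agree that more bits help; the disagreement in the thread is only over whether nine nines is already enough for the default.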
On Thu, Mar 26, 2020 at 12:34 PM Stephen Frost <sfrost@snowman.net> wrote: > I do agree with excluding things like md5 and others that aren't good > options. I wasn't saying we should necessarily exclude crc32c either.. > but rather saying that it shouldn't be the default. > > Here's another way to look at it- where do we use crc32c today, and how > much data might we possibly be covering with that crc? WAL record size is a 32-bit unsigned integer, so in theory, up to 4GB minus 1 byte. In practice, most of them are not more than a few hundred bytes, but the amount we might possibly be covering is a lot more. > Why was crc32c > picked for that purpose? Because it was discovered that 64-bit CRC was too slow, per commit 21fda22ec46deb7734f793ef4d7fa6c226b4c78e. > If the individual who decided to pick crc32c > for that case was contemplating a checksum for up-to-1GB files, would > they have picked crc32c? Seems unlikely to me. It's hard to be sure what someone who isn't us would have done in some situation that they didn't face, but we do have the discussion thread: https://www.postgresql.org/message-id/flat/9291.1117593389%40sss.pgh.pa.us#c4e413bbf3d7fbeced7786da1c3aca9c The question of how much data is protected by the CRC was discussed, mostly in the first few messages, in general terms, but it doesn't seem to have covered the question very thoroughly. I'm sure we could each draw things from that discussion that support our view of the situation, but I'm not sure it would be very productive. What confuses me is that you seem to have a view of the upsides and downsides of these various algorithms that seems to me to be highly skewed. Like, suppose we change the default from CRC-32C to SHA-something. On the upside, the error detection rate will increase from 99.9999999+% to something much closer to 100%. On the downside, backups will get as much as 40-50% slower for some users. I hope we can agree that both detecting errors and taking backups quickly are important. 
However, it is hard for me to imagine that the typical user would want to pay even a 5-10% performance penalty when taking a backup in order to improve an error detection feature which they may not even use and which already has less than a one-in-a-billion chance of going wrong. We routinely reject features for causing, say, a 2% regression on general workloads. Base backup speed is probably less important than how many SELECT or INSERT queries you can pump through the system in a second, but it's still a pain point for lots of people. I think if you said to some users "hey, would you like to have error detection for your backups? it'll cost 10%" many people would say "yes, please." But I think if you went to the same users and said "hey, would you like to make the error detection for your backups better? it currently has a less than 1-in-a-billion chance of failing to detect random corruption, and you can reduce that by many orders of magnitude for an extra 10% on your backup time," I think the results would be much more mixed. Some people would like it, but certainly not everybody. > I'm not actually arguing about which hash functions we should support, > but rather what the default is and if crc32c, specifically, is actually > a reasonable choice. Just because it's fast and we already had an > implementation of it doesn't justify its use as the default. Given that > it doesn't actually provide the check that is generally expected of > CRC checksums (100% detection of single-bit errors) when the file size > gets over 512MB makes me wonder if we should have it at all, yes, but it > definitely makes me think it shouldn't be our default. I mean, the property that I care about is the one where it detects better than 999,999,999 errors out of every 1,000,000,000, regardless of input length. > I don't agree with limiting our view to only those algorithms that we've > already got implemented in PG. 
I mean, opening that giant can of worms ~2 weeks before feature freeze is not very nice. This patch has been around for months, and the algorithms were openly discussed a long time ago. I checked and found out that the CRC-64 code was nuked in commit 404bc51cde9dce1c674abe4695635612f08fe27e, so in theory we could revert that, but how much confidence do we have that the code in question actually did the right thing, or that it's actually fast? An awful lot of work has been done on the CRC-32C code over the years, including several rounds of speeding it up (f044d71e331d77a0039cec0a11859b5a3c72bc95, 3dc2d62d0486325bf263655c2d9a96aee0b02abe) and one round of fixing it because it was producing completely wrong answers (5028f22f6eb0579890689655285a4778b4ffc460), so I don't have a lot of confidence about that CRC-64 code being totally without problems. The commit message for that last commit, 5028f22f6eb0579890689655285a4778b4ffc460, seems pretty relevant in this context, too. It observes that, because it "does not correspond to any bit-wise CRC calculation" it is "difficult to reason about its properties." In other words, the algorithm that we used for WAL records for many years likely did not have the guaranteed error-detection properties with which you are so concerned (nor do most hash functions we might choose; CRC-64 is probably the only choice that would). Despite that, the commit message also observed that "it has worked well in practice." I realize I'm not convincing you of anything here, but the guaranteed error-detection properties of CRC are almost totally uninteresting in this context. I'm not concerned that CRC-32C doesn't have those properties. I'm not concerned that SHA-n wouldn't have those properties. I'm not concerned that xxhash or HighwayHash don't have that property either. I doubt the fact that CRC-64 would have that property would give us much benefit. 
I think the only things that matter here are (1) how many bits you get (more bits = better chance of finding errors, but even *sixteen* bits would give you a pretty fair chance of noticing if things are broken) and (2) whether you want a cryptographic hash function so that you can keep the backup manifest in a vault. > It's saying, removing the listing aspect, exactly that "backup_manifest is > excluded from verification". That's what I am taking issue with. I've > made multiple attempts to suggest other language to avoid saying that > because it's clearly wrong- the manifest is verified. Well, it's talking about the particular kind of verification that has just been discussed, not any form of verification. As one idea, perhaps instead of:

+   Certain files and directories are
+   excluded from verification:

...I could maybe insert a paragraph break there and then continue with something like this: When pg_validatebackup compares the files and directories in the manifest to those which are present on disk, it will ignore the presence of, or changes to, certain files: backup_manifest will not be present in the manifest itself, and is therefore ignored. Note that the manifest is still verified internally, as described above, but no error will be issued about the presence of a backup_manifest file in the backup directory even though it is not listed in the manifest. Would that be more clear? Do you want to suggest something else? > I'm not talking about making sure that no error ever happens when doing > a restore of a particular backup. > I'm saying that the existing tool that takes the backup has a *really* > *important* verification check that this proposed "validate backup" tool > doesn't have, and that isn't sensible. It leads to situations where the > backup tool itself, pg_basebackup, can fail or be killed before it's > actually completed, and the "validate backup" tool would say that the > backup is perfectly fine. That is not sensible. 
If someone's procedure for taking and restoring backups involves not knowing whether or not pg_basebackup completed without error and then trying to use the backup anyway, they are doing something which is very foolish, and it's questionable whether any technological solution has much hope of getting them out of trouble. But on the plus side, this patch would have a good chance of detecting the problem, which is a noticeable improvement over what we have now, which has no chance of detecting the problem, because we have nothing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greetings, * Mark Dilger (mark.dilger@enterprisedb.com) wrote: > > On Mar 26, 2020, at 9:34 AM, Stephen Frost <sfrost@snowman.net> wrote: > > I'm not actually arguing about which hash functions we should support, > > but rather what the default is and if crc32c, specifically, is actually > > a reasonable choice. Just because it's fast and we already had an > > implementation of it doesn't justify its use as the default. Given that > > it doesn't actually provide the check that is generally expected of > > CRC checksums (100% detection of single-bit errors) when the file size > > gets over 512MB makes me wonder if we should have it at all, yes, but it > > definitely makes me think it shouldn't be our default. > > I don't understand your focus on the single-bit error issue. Maybe I'm wrong, but my understanding was that detecting single-bit errors was one of the primary design goals of CRC and why people talk about CRCs of certain sizes having 'limits'- that's the size at which single-bit errors will no longer, necessarily, be picked up and therefore that's where the CRC of that size starts falling down on that goal. > If you are sending your backup across the wire, single bit errors during transmission should already be detected as part of the networking protocol. The real issue has to be detection of the kinds of errors or modifications that are most likely to happen in practice. Which are those? People manually mucking with the files? Bugs in backup scripts? Corruption on the storage device? Truncated files? The more bits in the checksum (assuming a well designed checksum algorithm), the more likely we are to detect accidental modification, so it is no surprise if a 64-bit crc does better than 32-bit crc. But that logic can be taken arbitrarily far. I don't see the connection between, on the one hand, an analysis of single-bit error detection against file size, and on the other hand, the verification of backups. 
We'd like something that does a good job at detecting any differences between when the file was copied off of the server and when the command is run- potentially weeks or months later. I would expect most issues to end up being storage-level corruption over time where the backup is stored, which could be single bit flips or whole pages getting zeroed or various other things. Files changing size probably is one of the less common things, but, sure, that too. That we could take this "arbitrarily far" is actually entirely fine- that's a good reason to have alternatives, which this patch does have, but that doesn't mean we should have a default that's not suitable for the files that we know we're going to be storing. Consider that we could have used a 16-bit CRC instead, but does that actually make sense? Ok, sure, maybe someone really wants something super fast- but should that be our default? If not, then what criteria should we use for the default? > From a support perspective, I think the much more important issue is making certain that checksums are turned on. A onein a billion chance of missing an error seems pretty acceptable compared to the, let's say, one in two chance that yourcustomer didn't use checksums. Why are we even allowing this to be turned off? Is there a usage case compelling thatoption? The argument is that adding checksums takes more time. I can understand that argument, though I don't really agree with it. Certainly a few percent really shouldn't be that big of an issue, and in many cases even a sha256 hash isn't going to have that dramatic of an impact on the actual overall time. Thanks, Stephen
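[To make the algorithm trade-off being debated here concrete, the following is a rough, illustrative sketch of per-file checksumming with a selectable algorithm, roughly the shape of choice offered by --manifest-checksums. This is not PostgreSQL's actual manifest code, and note that Python's zlib.crc32 is plain CRC-32, standing in for the hardware-accelerated CRC-32C (Castagnoli) variant the server actually uses.]

```python
import hashlib
import zlib

def file_digest(path, algorithm="sha256", bufsize=1024 * 1024):
    """Checksum a file in fixed-size chunks, as a manifest builder might.

    "crc32" here is zlib's plain CRC-32, a stand-in for CRC-32C; the SHA
    family comes from hashlib, analogous to the SHA code the server
    already has for SCRAM.
    """
    if algorithm == "crc32":
        crc = 0
        with open(path, "rb") as f:
            while chunk := f.read(bufsize):
                crc = zlib.crc32(chunk, crc)
        return format(crc & 0xFFFFFFFF, "08x")
    h = hashlib.new(algorithm)  # sha224 / sha256 / sha384 / sha512
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()
```

[Either way the file is read once in chunks; the dispute above is only about how much CPU the per-chunk update costs and how strong the resulting digest is.]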
On 3/26/20 11:37 AM, Robert Haas wrote:
>> On Wed, Mar 25, 2020 at 4:54 PM Stephen Frost <sfrost@snowman.net> wrote:
>
> This is where I feel like I'm trying to make decisions in a vacuum. If we had a few more people weighing in on the thread on this point, I'd be happy to go with whatever the consensus was. If most people think having both --no-manifest (suppressing the manifest completely) and --manifest-checksums=none (suppressing only the checksums) is useless and confusing, then sure, let's rip the latter one out. If most people like the flexibility, let's keep it: it's already implemented and tested. But I hate to base the decision on what one or two people think.

I'm not sure I see a lot of value to being able to build a manifest with no checksums, especially if the overhead for the default checksum algorithm is negligible. However, I'd still prefer that the default be something more robust and allow users to tune it down rather than the other way around. But I've made that pretty clear up-thread and I consider that argument lost at this point.

>> As for folks who are that close to the edge on their backup timing that they can't have it slow down- chances are pretty darn good that they're not far from ending up needing to find a better solution than pg_basebackup anyway. Or they don't need to generate a manifest (or, I suppose, they could have one but not have checksums..).
>
> 40-50% is a lot more than "if you were on the edge."

For the record I think this is a very misleading number. Sure, if you are doing your backup to a local SSD on a powerful development laptop it makes sense. But backups are generally placed on slower storage, remotely, with compression. Even without compression the first two are going to bring this percentage down by a lot.
When you get to page-level incremental backups, which is where this all started, I'd still recommend using a stronger checksum algorithm to verify that the file was reconstructed correctly on restore. That much I believe we have agreed on.

>> Even pg_basebackup (in both fetch and stream modes...) checks that we at least got all the WAL that's needed for the backup from the server before considering the backup to be valid and telling the user that there was a successful backup. With what you're proposing here, we could have someone do a pg_basebackup, get back an ERROR saying the backup wasn't valid, and then run pg_validatebackup and be told that the backup is valid. I don't get how that's sensible.
>
> I'm sorry that you can't see how that's sensible, but it doesn't mean that it isn't sensible. It is totally unrealistic to expect that any backup verification tool can verify that you won't get an error when trying to use the backup. That would require that the validation tool try to do everything that PostgreSQL will try to do when the backup is used, including running recovery and updating the data files. Anything less than that creates a real possibility that the backup will verify good but fail when used. This tool has a much narrower purpose, which is to try to verify that we (still) have the files the server sent as part of the backup and that, to the best of our ability to detect such things, they have not been modified. As you know, or should know, the WAL files are not sent as part of the backup, and so are not verified. Other things that would also be useful to check are also not verified. It would be fantastic to have more verification tools in the future, but it is difficult to see why anyone would bother trying if an attempt to get the first one committed gets blocked because it does not yet do everything. Very few patches try to do everything, and those that do usually get blocked because, by trying to do too much, they get some of it badly wrong.

I agree with Stephen that this should be done, but I agree with you that it can wait for a future commit. However, I do think:

1) It should be called out rather plainly in the documentation.
2) If there are files in pg_wal then pg_validatebackup should inform the user that those files have not been validated.

I know you and Stephen have agreed on a number of doc changes; would it be possible to get a new patch with those included? I finally have time to do a review of this tomorrow. I saw some mistakes in the docs in the current patch but I know those patches are not current.

Regards,
--
-David
david@pgmasters.net
> On Mar 26, 2020, at 12:37 PM, Stephen Frost <sfrost@snowman.net> wrote:
>
> Greetings,
>
> * Mark Dilger (mark.dilger@enterprisedb.com) wrote:
>>> On Mar 26, 2020, at 9:34 AM, Stephen Frost <sfrost@snowman.net> wrote:
>>> I'm not actually arguing about which hash functions we should support, but rather what the default is and if crc32c, specifically, is actually a reasonable choice. Just because it's fast and we already had an implementation of it doesn't justify its use as the default. Given that it doesn't actually provide the check that is generally expected of CRC checksums (100% detection of single-bit errors) when the file size gets over 512MB makes me wonder if we should have it at all, yes, but it definitely makes me think it shouldn't be our default.
>>
>> I don't understand your focus on the single-bit error issue.
>
> Maybe I'm wrong, but my understanding was that detecting single-bit errors was one of the primary design goals of CRC and why people talk about CRCs of certain sizes having 'limits'- that's the size at which single-bit errors will no longer, necessarily, be picked up and therefore that's where the CRC of that size starts falling down on that goal.

I think I agree with all that. I'm not sure it is relevant. When people use CRCs to detect things *other than* transmission errors, they are in some sense using a hammer to drive a screw. At that point, the analysis of how good the hammer is, and how big a nail it can drive, is no longer relevant. The relevant discussion here is how appropriate a CRC is for our purpose. I don't know the answer to that, but it doesn't seem the single-bit error analysis is the right analysis.

>> If you are sending your backup across the wire, single bit errors during transmission should already be detected as part of the networking protocol. The real issue has to be detection of the kinds of errors or modifications that are most likely to happen in practice. Which are those? People manually mucking with the files? Bugs in backup scripts? Corruption on the storage device? Truncated files? The more bits in the checksum (assuming a well designed checksum algorithm), the more likely we are to detect accidental modification, so it is no surprise if a 64-bit crc does better than 32-bit crc. But that logic can be taken arbitrarily far. I don't see the connection between, on the one hand, an analysis of single-bit error detection against file size, and on the other hand, the verification of backups.
>
> We'd like something that does a good job at detecting any differences between when the file was copied off of the server and when the command is run- potentially weeks or months later. I would expect most issues to end up being storage-level corruption over time where the backup is stored, which could be single bit flips or whole pages getting zeroed or various other things. Files changing size probably is one of the less common things, but, sure, that too.
>
> That we could take this "arbitrarily far" is actually entirely fine- that's a good reason to have alternatives, which this patch does have, but that doesn't mean we should have a default that's not suitable for the files that we know we're going to be storing.
>
> Consider that we could have used a 16-bit CRC instead, but does that actually make sense? Ok, sure, maybe someone really wants something super fast- but should that be our default? If not, then what criteria should we use for the default?

I'll answer this below....

>> From a support perspective, I think the much more important issue is making certain that checksums are turned on. A one in a billion chance of missing an error seems pretty acceptable compared to the, let's say, one in two chance that your customer didn't use checksums. Why are we even allowing this to be turned off? Is there a usage case compelling that option?
>
> The argument is that adding checksums takes more time. I can understand that argument, though I don't really agree with it. Certainly a few percent really shouldn't be that big of an issue, and in many cases even a sha256 hash isn't going to have that dramatic of an impact on the actual overall time.

I see two dangers here:

(1) The user enables checksums of some type, and due to checksums not being perfect, corruption happens but goes undetected, leaving her in a bad place.

(2) The user makes no checksum selection at all, gets checksums of the *default* type, determines it is too slow for her purposes, and instead of adjusting the checksum algorithm to something faster, simply turns checksums off; corruption happens and of course is undetected, leaving her in a bad place.

I think the risk of (2) is far worse, which makes me tend towards a default that is fast enough not to encourage anybody to disable checksums altogether. I have no opinion about which algorithm is best suited to that purpose, because I haven't benchmarked any. I'm pretty much going off what Robert said, in terms of how big an impact using a heavier algorithm would be. Perhaps you'd like to run benchmarks and make a concrete proposal for another algorithm, with numbers showing the runtime changes? You mentioned up-thread that prior timings which showed a 40-50% slowdown were not including all the relevant stuff, so perhaps you could fix that in your benchmark and let us know what is included in the timings?

I don't think we should be contemplating for v13 any checksum algorithms for the default except the ones already in the options list. Doing that just derails the patch. If you want highwayhash or similar to be the default, can't we hold off until v14 and think about changing the default? Maybe I'm missing something, but I don't see any reason why it would be hard to change this after the first version has already been released.

--
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
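[Since concrete numbers keep being asked for, a rough throughput comparison is easy to script. This is an illustrative Python sketch, not a substitute for benchmarking pg_basebackup itself; note that zlib's crc32 is software CRC-32 rather than the hardware-accelerated CRC-32C the server uses, so the gap shown is only indicative.]

```python
import hashlib
import time
import zlib

def throughput(label, fn, data, repeats=8):
    # Time repeated passes over an in-memory buffer and report MB/s.
    # Real backups also pay I/O, network, and compression costs, which
    # dilute whatever overhead the checksum itself adds.
    start = time.perf_counter()
    for _ in range(repeats):
        fn(data)
    elapsed = time.perf_counter() - start
    rate = len(data) * repeats / (1024 * 1024) / elapsed
    print(f"{label}: {rate:.0f} MB/s")
    return rate

if __name__ == "__main__":
    data = b"\x00" * (16 * 1024 * 1024)  # one WAL-segment-sized buffer
    throughput("crc32 (zlib, software)", zlib.crc32, data)
    for algo in ("sha224", "sha256", "sha384", "sha512"):
        throughput(algo, lambda d, a=algo: hashlib.new(a, d).digest(), data)
```

[Per the up-thread debate, what matters is this rate relative to the storage/network rate the backup actually achieves, not the hash rate in isolation.]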
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Thu, Mar 26, 2020 at 12:34 PM Stephen Frost <sfrost@snowman.net> wrote:
> > I do agree with excluding things like md5 and others that aren't good options. I wasn't saying we should necessarily exclude crc32c either.. but rather saying that it shouldn't be the default.
> >
> > Here's another way to look at it- where do we use crc32c today, and how much data might we possibly be covering with that crc?
>
> WAL record size is a 32-bit unsigned integer, so in theory, up to 4GB minus 1 byte. In practice, most of them are not more than a few hundred bytes, but the amount we might possibly be covering is a lot more.

Is it actually possible, today, in PG, to have a 4GB WAL record? Judging this based on the WAL record size doesn't seem quite right.

> > Why was crc32c picked for that purpose?
>
> Because it was discovered that 64-bit CRC was too slow, per commit 21fda22ec46deb7734f793ef4d7fa6c226b4c78e.

... 15 years ago. I actually find it pretty interesting that we started out with a 64bit CRC there, I didn't know that was the case. Also interesting is that we had 64bit CRC code already.

> > If the individual who decided to pick crc32c for that case was contemplating a checksum for up-to-1GB files, would they have picked crc32c? Seems unlikely to me.
>
> It's hard to be sure what someone who isn't us would have done in some situation that they didn't face, but we do have the discussion thread:
>
> https://www.postgresql.org/message-id/flat/9291.1117593389%40sss.pgh.pa.us#c4e413bbf3d7fbeced7786da1c3aca9c
>
> The question of how much data is protected by the CRC was discussed, mostly in the first few messages, in general terms, but it doesn't seem to have covered the question very thoroughly. I'm sure we could each draw things from that discussion that support our view of the situation, but I'm not sure it would be very productive.

Interesting.

> What confuses me is that you seem to have a view of the upsides and downsides of these various algorithms that seems to me to be highly skewed. Like, suppose we change the default from CRC-32C to SHA-something. On the upside, the error detection rate will increase from 99.9999999+% to something much closer to 100%. On the downside, backups will get as much as 40-50% slower for some users. I hope we can agree that both detecting errors and taking backups quickly are important. However, it is hard for me to imagine that the typical user would want to pay even a 5-10% performance penalty when taking a backup in order to improve an error detection feature which they may not even use and which already has less than a one-in-a-billion chance of going wrong. We routinely reject features for causing, say, a 2% regression on general workloads. Base backup speed is probably less important than how many SELECT or INSERT queries you can pump through the system in a second, but it's still a pain point for lots of people. I think if you said to some users "hey, would you like to have error detection for your backups? it'll cost 10%" many people would say "yes, please." But I think if you went to the same users and said "hey, would you like to make the error detection for your backups better? it currently has a less than 1-in-a-billion chance of failing to detect random corruption, and you can reduce that by many orders of magnitude for an extra 10% on your backup time," I think the results would be much more mixed. Some people would like it, but certainly not everybody.

I think you're right that base backup speed is much less of an issue to slow down than SELECT or INSERT workloads, but I do also understand that it isn't completely unimportant, which is why having options isn't a bad idea here. That said, the options presented for users should all be reasonable options, and for the default we should pick something sensible, erring on the "be safer" side, if anything. There's lots of options for speeding up base backups, with this patch, even if the default is to have a manifest with sha256 hashes- it could be changed to some form of CRC, or changed to not have checksums, or changed to not have a manifest. Users will have options.

Again, I'm not against having a checksum algorithm as an option. I'm not saying that it must be SHA512 as the default.

> > I'm not actually arguing about which hash functions we should support, but rather what the default is and if crc32c, specifically, is actually a reasonable choice. Just because it's fast and we already had an implementation of it doesn't justify its use as the default. Given that it doesn't actually provide the check that is generally expected of CRC checksums (100% detection of single-bit errors) when the file size gets over 512MB makes me wonder if we should have it at all, yes, but it definitely makes me think it shouldn't be our default.
>
> I mean, the property that I care about is the one where it detects better than 999,999,999 errors out of every 1,000,000,000, regardless of input length.

Throwing these kinds of things around I really don't think is useful.

> > I don't agree with limiting our view to only those algorithms that we've already got implemented in PG.
>
> I mean, opening that giant can of worms ~2 weeks before feature freeze is not very nice. This patch has been around for months, and the algorithms were openly discussed a long time ago.

Yes, they were discussed before, and these issues were brought up before and there was specifically concern brought up about exactly the same issues that I'm repeating here. Those concerns seem to have been largely ignored, apparently with "we don't have that in PG today" as at least one of the considerations- even though we used to. I don't think that was the right response and, yeah, I saw that you were planning to commit and that prompted me to look into it right now. I don't think that's entirely uncommon around here. I also had hoped that David's concerns that were raised before had been heeded, as I knew he was involved in the discussion previously, but that turns out to not have been the case.

> > It's saying, removing the listing aspect, exactly that "backup_label is excluded from verification". That's what I am taking issue with. I've made multiple attempts to suggest other language to avoid saying that because it's clearly wrong- the manifest is verified.
>
> Well, it's talking about the particular kind of verification that has just been discussed, not any form of verification. As one idea, perhaps instead of:
>
> + Certain files and directories are
> + excluded from verification:
>
> ...I could maybe insert a paragraph break there and then continue with something like this:
>
> When pg_basebackup compares the files and directories in the manifest to those which are present on disk, it will ignore the presence of, or changes to, certain files:
>
> backup_manifest will not be present in the manifest itself, and is therefore ignored. Note that the manifest is still verified internally, as described above, but no error will be issued about the presence of a backup_manifest file in the backup directory even though it is not listed in the manifest.
>
> Would that be more clear? Do you want to suggest something else?

Yes, that looks fine. Feels slightly redundant to include the "as described above ..." bit, and I think that could be dropped, but up to you.

> > I'm not talking about making sure that no error ever happens when doing
> > I'm saying that the existing tool that takes the backup has a *really* *important* verification check that this proposed "validate backup" tool doesn't have, and that isn't sensible. It leads to situations where the backup tool itself, pg_basebackup, can fail or be killed before it's actually completed, and the "validate backup" tool would say that the backup is perfectly fine. That is not sensible.
>
> If someone's procedure for taking and restoring backups involves not knowing whether or not pg_basebackup completed without error and then trying to use the backup anyway, they are doing something which is very foolish, and it's questionable whether any technological solution has much hope of getting them out of trouble. But on the plus side, this patch would have a good chance of detecting the problem, which is a noticeable improvement over what we have now, which has no chance of detecting the problem, because we have nothing.

This doesn't address my concern at all. Even if it seems ridiculous and foolish to think that a backup was successful when the system was rebooted and pg_basebackup was killed before all of the WAL had made it into pg_wal, there is absolutely zero doubt in my mind that it's going to happen and users are going to, entirely reasonably, think that pg_validatebackup at least includes all the checks that pg_basebackup does about making sure that the backup is valid.

I really don't understand how we can have a backup validation tool that doesn't do the absolute basics, like making sure that we have all of the WAL for the backup. I've routinely, almost jokingly, said to folks that any backup tool that doesn't check that isn't really a backup tool, and I was glad that pg_basebackup had that check, so, yeah, I'm going to continue to object to committing a backup validation tool that doesn't have that absolutely basic and necessary check.

Thanks,

Stephen
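[The check under dispute here, that every WAL segment the backup needs actually exists, is conceptually simple. As an illustrative sketch (not the actual pg_basebackup code), the required segment file names can be derived from the backup's timeline and start/end LSNs, following the server's XLogFileName() naming and assuming the default 16MB wal_segment_size.]

```python
def wal_segments_needed(tli, start_lsn, end_lsn, segment_size=16 * 1024 * 1024):
    """List the WAL segment file names covering [start_lsn, end_lsn].

    Segment names are three 8-hex-digit fields: timeline, "xlog id",
    and segment-within-id, as in the server's XLogFileName() macro.
    LSNs are plain integers here; parsing the X/Y text form is omitted,
    and boundary cases (an LSN exactly at a segment boundary) are glossed
    over for simplicity.
    """
    segs_per_id = 0x100000000 // segment_size
    first = start_lsn // segment_size
    last = end_lsn // segment_size
    return [
        "%08X%08X%08X" % (tli, seg // segs_per_id, seg % segs_per_id)
        for seg in range(first, last + 1)
    ]


def missing_wal(pg_wal_files, tli, start_lsn, end_lsn):
    """Names of required segments absent from a pg_wal file listing."""
    needed = wal_segments_needed(tli, start_lsn, end_lsn)
    present = set(pg_wal_files)
    return [name for name in needed if name not in present]
```

[For example, a backup on timeline 1 whose start and end LSNs both fall at byte offset 0x1000028 needs segment 000000010000000000000001, matching what pg_walfile_name('0/1000028') reports with default settings.]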
Greetings,

* Mark Dilger (mark.dilger@enterprisedb.com) wrote:
> > On Mar 26, 2020, at 12:37 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > * Mark Dilger (mark.dilger@enterprisedb.com) wrote:
> >>> On Mar 26, 2020, at 9:34 AM, Stephen Frost <sfrost@snowman.net> wrote:
> >>> I'm not actually arguing about which hash functions we should support, but rather what the default is and if crc32c, specifically, is actually a reasonable choice. Just because it's fast and we already had an implementation of it doesn't justify its use as the default. Given that it doesn't actually provide the check that is generally expected of CRC checksums (100% detection of single-bit errors) when the file size gets over 512MB makes me wonder if we should have it at all, yes, but it definitely makes me think it shouldn't be our default.
> >>
> >> I don't understand your focus on the single-bit error issue.
> >
> > Maybe I'm wrong, but my understanding was that detecting single-bit errors was one of the primary design goals of CRC and why people talk about CRCs of certain sizes having 'limits'- that's the size at which single-bit errors will no longer, necessarily, be picked up and therefore that's where the CRC of that size starts falling down on that goal.
>
> I think I agree with all that. I'm not sure it is relevant. When people use CRCs to detect things *other than* transmission errors, they are in some sense using a hammer to drive a screw. At that point, the analysis of how good the hammer is, and how big a nail it can drive, is no longer relevant. The relevant discussion here is how appropriate a CRC is for our purpose. I don't know the answer to that, but it doesn't seem the single-bit error analysis is the right analysis.

I disagree that it's not relevant- it's, in fact, the one really clear thing we can get a pretty straight-forward answer on, and that seems really useful to me.

> >> If you are sending your backup across the wire, single bit errors during transmission should already be detected as part of the networking protocol. The real issue has to be detection of the kinds of errors or modifications that are most likely to happen in practice. Which are those? People manually mucking with the files? Bugs in backup scripts? Corruption on the storage device? Truncated files? The more bits in the checksum (assuming a well designed checksum algorithm), the more likely we are to detect accidental modification, so it is no surprise if a 64-bit crc does better than 32-bit crc. But that logic can be taken arbitrarily far. I don't see the connection between, on the one hand, an analysis of single-bit error detection against file size, and on the other hand, the verification of backups.
> >
> > We'd like something that does a good job at detecting any differences between when the file was copied off of the server and when the command is run- potentially weeks or months later. I would expect most issues to end up being storage-level corruption over time where the backup is stored, which could be single bit flips or whole pages getting zeroed or various other things. Files changing size probably is one of the less common things, but, sure, that too.
> >
> > That we could take this "arbitrarily far" is actually entirely fine- that's a good reason to have alternatives, which this patch does have, but that doesn't mean we should have a default that's not suitable for the files that we know we're going to be storing.
> >
> > Consider that we could have used a 16-bit CRC instead, but does that actually make sense? Ok, sure, maybe someone really wants something super fast- but should that be our default? If not, then what criteria should we use for the default?
>
> I'll answer this below....
>
> >> From a support perspective, I think the much more important issue is making certain that checksums are turned on. A one in a billion chance of missing an error seems pretty acceptable compared to the, let's say, one in two chance that your customer didn't use checksums. Why are we even allowing this to be turned off? Is there a usage case compelling that option?
> >
> > The argument is that adding checksums takes more time. I can understand that argument, though I don't really agree with it. Certainly a few percent really shouldn't be that big of an issue, and in many cases even a sha256 hash isn't going to have that dramatic of an impact on the actual overall time.
>
> I see two dangers here:
>
> (1) The user enables checksums of some type, and due to checksums not being perfect, corruption happens but goes undetected, leaving her in a bad place.
>
> (2) The user makes no checksum selection at all, gets checksums of the *default* type, determines it is too slow for her purposes, and instead of adjusting the checksum algorithm to something faster, simply turns checksums off; corruption happens and of course is undetected, leaving her in a bad place.

Alright, I have tried to avoid referring back to pgbackrest, but I can't help it here. We have never, ever, had a user come to us and complain that pgbackrest is too slow because we're using a SHA hash. We have also had them by default since absolutely day number one, and we even removed the option to disable them in 1.0. We've never even been asked if we should implement some other hash or checksum which is faster.

> I think the risk of (2) is far worse, which makes me tend towards a default that is fast enough not to encourage anybody to disable checksums altogether. I have no opinion about which algorithm is best suited to that purpose, because I haven't benchmarked any. I'm pretty much going off what Robert said, in terms of how big an impact using a heavier algorithm would be. Perhaps you'd like to run benchmarks and make a concrete proposal for another algorithm, with numbers showing the runtime changes? You mentioned up-thread that prior timings which showed a 40-50% slowdown were not including all the relevant stuff, so perhaps you could fix that in your benchmark and let us know what is included in the timings?

I don't even know what the 40-50% slowdown numbers included. Also, the general expectation in this community is that whoever is pushing a given patch forward should be providing the benchmarks to justify their choice.

> I don't think we should be contemplating for v13 any checksum algorithms for the default except the ones already in the options list. Doing that just derails the patch. If you want highwayhash or similar to be the default, can't we hold off until v14 and think about changing the default? Maybe I'm missing something, but I don't see any reason why it would be hard to change this after the first version has already been released.

I'd rather we default to something that we are all confident and happy with, erring on the side of it being overkill rather than something that we know isn't really appropriate for the data volume.

Thanks,

Stephen
Hi,

On 2020-03-26 11:37:48 -0400, Robert Haas wrote:
> I mean, you're just repeating the same argument here, and it's just not valid. Regardless of the file size, the chances of a false checksum match are literally less than one in a billion. There is every reason to believe that users will be happy with a low-overhead method that has a 99.9999999+% chance of detecting corrupt files. I do agree that a 64-bit CRC would probably be not much more expensive and improve the probability of detecting errors even further

I *seriously* doubt that it's true that 64bit CRCs wouldn't be slower. The only reason CRC32C is semi-fast is that we're accelerating it using hardware instructions (on x86-64 and ARM at least). Before that it was very regularly the bottleneck for processing WAL - and it still sometimes is. Most CRCs aren't actually very fast to compute, because they don't lend themselves to benefit from ILP or SIMD. We spent a fair bit of time optimizing our crc implementation before the hardware support was widespread.

> but I wanted to restrict this patch to using infrastructure we already have. The choices there are the various SHA functions (so I supported those), MD5 (which I deliberately omitted, for reasons I hope you'll be the first to agree with), CRC-32C (which is fast), a couple of other CRC-32 variants (which I omitted because they seemed redundant and one of them only ever existed in PostgreSQL because of a coding mistake), and the hacked-up version of FNV that we use for page-level checksums (which is only 16 bits and seems to have no advantages for this purpose).

FWIW, FNV is only 16bit because we reduce its size to 16 bit. See the tail of pg_checksum_page.

I'm not sure the error detection guarantees of various CRC algorithms are that relevant here, btw. IMO, for something like checksums in a backup, just having a single one-bit error isn't as common as having larger errors (e.g. entire blocks being zeroed). And to detect that, 32bit checksums aren't that good.

> > As for folks who are that close to the edge on their backup timing that they can't have it slow down- chances are pretty darn good that they're not far from ending up needing to find a better solution than pg_basebackup anyway. Or they don't need to generate a manifest (or, I suppose, they could have one but not have checksums..).
>
> 40-50% is a lot more than "if you were on the edge."

sha256 does approx 400MB/s per core on modern intel CPUs. That's way below commonly accessible storage / network capabilities (and even if you're only doing 200MB/s, you're still going to spend roughly half of the CPU time just doing hashing). It's unlikely that you're going to see much of a speedup for sha256 just by upgrading a CPU. While there are hardware instructions available, they don't result in all that large improvements. Of course, we could also start using the GPU (err, really no).

Defaulting to that makes very little sense to me. You're not just going to spend that time while backing up, but also when validating backups (i.e. network limits suddenly aren't a relevant bottleneck anymore).

> > I fail to see the usefulness of a tool that doesn't actually verify that the backup is able to be restored from.
> >
> > Even pg_basebackup (in both fetch and stream modes...) checks that we at least got all the WAL that's needed for the backup from the server before considering the backup to be valid and telling the user that there was a successful backup. With what you're proposing here, we could have someone do a pg_basebackup, get back an ERROR saying the backup wasn't valid, and then run pg_validatebackup and be told that the backup is valid. I don't get how that's sensible.
>
> I'm sorry that you can't see how that's sensible, but it doesn't mean that it isn't sensible. It is totally unrealistic to expect that any backup verification tool can verify that you won't get an error when trying to use the backup. That would require that the validation tool try to do everything that PostgreSQL will try to do when the backup is used, including running recovery and updating the data files. Anything less than that creates a real possibility that the backup will verify good but fail when used. This tool has a much narrower purpose, which is to try to verify that we (still) have the files the server sent as part of the backup and that, to the best of our ability to detect such things, they have not been modified. As you know, or should know, the WAL files are not sent as part of the backup, and so are not verified. Other things that would also be useful to check are also not verified. It would be fantastic to have more verification tools in the future, but it is difficult to see why anyone would bother trying if an attempt to get the first one committed gets blocked because it does not yet do everything. Very few patches try to do everything, and those that do usually get blocked because, by trying to do too much, they get some of it badly wrong.

It sounds to me that if there are to be manifests for the WAL, it should be a separate (set of) manifests. Trying to somehow tie together the manifest for the base backup, and the one for the WAL, makes little sense to me. They're commonly not computed in one place, often not even stored in the same place. For PITR relevant WAL doesn't even exist yet at the time the manifest is created (and thus obviously cannot be included in the base backup manifest). And fairly obviously one would want to be able to verify the correctness of WAL between two basebackups.

I don't see much point in complicating the design to somehow capture WAL in the manifest, when it's only going to solve a small set of cases. Seems better to (later?) add support for generating manifests for WAL files, and then have a tool that can verify all the manifests required to restore a base backup.

Greetings,

Andres Freund
Hi, On 2020-03-26 14:02:29 -0400, Robert Haas wrote: > On Thu, Mar 26, 2020 at 12:34 PM Stephen Frost <sfrost@snowman.net> wrote: > > Why was crc32c > > picked for that purpose? > > Because it was discovered that 64-bit CRC was too slow, per commit > 21fda22ec46deb7734f793ef4d7fa6c226b4c78e. Well, a 32bit crc, not crc32c. IIRC it was the ethernet polynomial (+ bug). We switched to crc32c at some point because there are hardware implementations:

commit 5028f22f6eb0579890689655285a4778b4ffc460
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   2014-11-04 11:35:15 +0200

    Switch to CRC-32C in WAL and other places.

> Like, suppose we change the default from CRC-32C to SHA-something. On > the upside, the error detection rate will increase from 99.9999999+% > to something much closer to 100%. FWIW, I don't buy the relevancy of 99.9999999+% at all. That's assuming a single bit error (at relevant lengths, before that it's single burst errors of a greater length), which isn't that relevant for our purposes. That's not to say that I don't think a CRC check can provide value. It does provide a high likelihood of detecting enough errors, including coding errors in how data is restored (not unimportant), that you would otherwise likely not find out about soon. > On the downside, > backups will get as much as 40-50% slower for some users. I hope we > can agree that both detecting errors and taking backups quickly are > important. However, it is hard for me to imagine that the typical user > would want to pay even a 5-10% performance penalty when taking a > backup in order to improve an error detection feature which they may > not even use and which already has less than a one-in-a-billion chance > of going wrong. FWIW, that seems far too large a slowdown to default to for me. Most people aren't going to be able to figure out that it's the checksum parameter that causes this slowdown; they're just going to feel the pain of the backup being much slower than their hardware can deliver.
A few hundred megabytes per second of streaming reads/writes really doesn't take a beefy server these days. Medium sized VMs + a bit larger network block devices at all the common cloud providers have considerably higher bandwidth. Even a raid5 of 4 spinning disks can deliver > 500MB/s. And plenty of even the smaller instances at many providers have > 5gbit/s network. At the upper end it's way more than that. Greetings, Andres Freund
Hi, On 2020-03-26 15:37:11 -0400, Stephen Frost wrote: > The argument is that adding checksums takes more time. I can understand > that argument, though I don't really agree with it. Certainly a few > percent really shouldn't be that big of an issue, and in many cases even > a sha256 hash isn't going to have that dramatic of an impact on the > actual overall time. I don't understand how you can come to that conclusion? It doesn't take very long to measure openssl's sha256 performance (which is pretty well optimized). Note that we do use openssl's sha256, when compiled with openssl support. On my workstation, with a pretty new (but not fastest single core perf model) intel Xeon Gold 5215, I get:

$ openssl speed sha256
...
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha256          76711.75k   172036.78k   321566.89k   399008.09k   431423.49k   433689.94k

IOW, ~430MB/s. On my laptop, with pretty fast cores:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha256          97054.91k   217188.63k   394864.13k   493441.02k   532100.44k   533441.19k

IOW, ~530MB/s.

530 MB/s is well within the realm of medium sized VMs. And, as mentioned before, even if you do only half of that, you're still going to be spending roughly half of the CPU time of sending a base backup. What makes you think that a few hundred MB/s is out of reach for a large fraction of PG installations that actually keep backups? Greetings, Andres Freund
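The same measurement is easy to reproduce without the openssl CLI. Here is a small Python sketch (illustrative only, not part of any patch) that times hashlib's SHA-256, which CPython typically backs with OpenSSL, feeding it fixed-size chunks much like `openssl speed sha256` does:

```python
import hashlib
import time

def sha256_throughput(total_mb=64, chunk_kb=16):
    """Measure local SHA-256 hashing speed in MB/s, hashing `total_mb`
    megabytes in chunks of `chunk_kb` KiB."""
    chunk = b"\x00" * (chunk_kb * 1024)
    iterations = (total_mb * 1024) // chunk_kb
    h = hashlib.sha256()
    start = time.perf_counter()
    for _ in range(iterations):
        h.update(chunk)
    return total_mb / (time.perf_counter() - start)

if __name__ == "__main__":
    print(f"sha256: {sha256_throughput():.0f} MB/s")
```

On typical modern x86 cores this should land in the same few-hundred-MB/s range reported above.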
Hi, On 2020-03-23 12:15:54 -0400, Robert Haas wrote: > + <varlistentry> > + <term><literal>MANIFEST</literal></term> > + <listitem> > + <para> > + When this option is specified with a value of <literal>ye'</literal> s/ye'/yes/ > + or <literal>force-escape</literal>, a backup manifest is created > + and sent along with the backup. The latter value forces all filenames > + to be hex-encoded; otherwise, this type of encoding is performed only > + for files whose names are non-UTF8 octet sequences. > + <literal>force-escape</literal> is intended primarily for testing > + purposes, to be sure that clients which read the backup manifest > + can handle this case. For compatibility with previous releases, > + the default is <literal>MANIFEST 'no'</literal>. > + </para> > + </listitem> > + </varlistentry> Are you planning to include a specification of the manifest file format anywhere? I looked through the patches and didn't find anything. I think it'd also be good to include more information about what the point of manifest files actually is. > + <para> > + <application>pg_validatebackup</application> reads the manifest file of a > + backup, verifies the manifest against its own internal checksum, and then > + verifies that the same files are present in the target directory as in the > + manifest itself. It then verifies that each file has the expected checksum, > + unless the backup was taken the checksum algorithm set to > + <literal>none</literal>, in which case checksum verification is not > + performed. The presence or absence of directories is not checked, except > + indirectly: if a directory is missing, any files it should have contained > + will necessarily also be missing. Certain files and directories are > + excluded from verification: > + </para> Depending on what you want to use the manifest for, we'd also need to check that there are no additional files. That seems to actually be implemented, which imo should be mentioned here. 
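The symmetric cross-check being discussed (missing files as well as extra files) is essentially a set comparison between the manifest's file list and the directory contents. A minimal Python sketch of the idea, with hypothetical helper names that are not the patch's actual API:

```python
import os

def cross_check(manifest_files, backup_dir, exclude=("backup_manifest",)):
    """Compare the set of relative paths listed in a manifest against what
    is actually on disk, reporting both missing and extra files.  A real
    validator would also honor a larger exclude list (postmaster.pid etc.)."""
    on_disk = set()
    for root, _, files in os.walk(backup_dir):
        for f in files:
            rel = os.path.relpath(os.path.join(root, f), backup_dir)
            if rel not in exclude:
                on_disk.add(rel)
    expected = set(manifest_files)
    missing = sorted(expected - on_disk)
    extra = sorted(on_disk - expected)
    return missing, extra
```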
> +/* > + * Finalize the backup manifest, and send it to the client. > + */ > +static void > +SendBackupManifest(manifest_info *manifest) > +{ > + StringInfoData protobuf; > + uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH]; > + char checksumstringbuf[PG_SHA256_DIGEST_STRING_LENGTH]; > + size_t manifest_bytes_done = 0; > + > + /* > + * If there is no buffile, then the user doesn't want a manifest, so > + * don't waste any time generating one. > + */ > + if (manifest->buffile == NULL) > + return; > + > + /* Terminate the list of files. */ > + AppendStringToManifest(manifest, "],\n"); > + > + /* > + * Append manifest checksum, so that the problems with the manifest itself > + * can be detected. > + * > + * We always use SHA-256 for this, regardless of what algorithm is chosen > + * for checksumming the files. If we ever want to make the checksum > + * algorithm used for the manifest file variable, the client will need a > + * way to figure out which algorithm to use as close to the beginning of > + * the manifest file as possible, to avoid having to read the whole thing > + * twice. > + */ > + manifest->still_checksumming = false; > + pg_sha256_final(&manifest->manifest_ctx, checksumbuf); > + AppendStringToManifest(manifest, "\"Manifest-Checksum\": \""); > + hex_encode((char *) checksumbuf, sizeof checksumbuf, checksumstringbuf); > + checksumstringbuf[PG_SHA256_DIGEST_STRING_LENGTH - 1] = '\0'; > + AppendStringToManifest(manifest, checksumstringbuf); > + AppendStringToManifest(manifest, "\"}\n"); Hm. Is it a great choice to include the checksum for the manifest inside the manifest itself? With a cryptographic checksum it seems like it could make a ton of sense to store the checksum somewhere "safe", but keep the manifest itself alongside the base backup itself. While not huge, they won't be tiny either. 
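The reader's side of that embedded checksum can be sketched as follows. The assumption here, inferred from the quoted code, is that the SHA-256 covers every byte preceding the "Manifest-Checksum" key and that this key is the last one in the file; the final format may differ in detail:

```python
import hashlib

def verify_embedded_checksum(manifest_bytes):
    """Split a manifest into (body, stated checksum) and recompute SHA-256
    over the body.  Assumes the checksum key is the last item and that the
    hash covers all bytes before it (an assumption, not a spec)."""
    marker = b'"Manifest-Checksum":'
    pos = manifest_bytes.rfind(marker)
    if pos < 0:
        raise ValueError("no manifest checksum found")
    body = manifest_bytes[:pos]
    # the stated checksum is the first quoted string after the key
    stated = manifest_bytes[pos + len(marker):].split(b'"')[1].decode()
    return hashlib.sha256(body).hexdigest() == stated
```

As a usage note, a user who wants the checksum stored somewhere "safe" could simply extract that last line and keep it apart from the backup.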
> diff --git a/src/bin/pg_validatebackup/parse_manifest.c b/src/bin/pg_validatebackup/parse_manifest.c > new file mode 100644 > index 0000000000..e6b42adfda > --- /dev/null > +++ b/src/bin/pg_validatebackup/parse_manifest.c > @@ -0,0 +1,576 @@ > +/*------------------------------------------------------------------------- > + * > + * parse_manifest.c > + * Parse a backup manifest in JSON format. > + * > + * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group > + * Portions Copyright (c) 1994, Regents of the University of California > + * > + * src/bin/pg_validatebackup/parse_manifest.c > + * > + *------------------------------------------------------------------------- > + */ Doesn't have to be in the first version, but could it be useful to move this to common/ or such? > +/* > + * Validate one directory. > + * > + * 'relpath' is NULL if we are to validate the top-level backup directory, > + * and otherwise the relative path to the directory that is to be validated. > + * > + * 'fullpath' is the backup directory with 'relpath' appended; i.e. the actual > + * filesystem path at which it can be found. > + */ > +static void > +validate_backup_directory(validator_context *context, char *relpath, > + char *fullpath) > +{ Hm. Should this warn if the directory's permissions are set too openly (world writable?)? > +/* > + * Validate the checksum of a single file. > + */ > +static void > +validate_file_checksum(validator_context *context, manifestfile *tabent, > + char *fullpath) > +{ > + pg_checksum_context checksum_ctx; > + char *relpath = tabent->pathname; > + int fd; > + int rc; > + uint8 buffer[READ_CHUNK_SIZE]; > + uint8 checksumbuf[PG_CHECKSUM_MAX_LENGTH]; > + int checksumlen; > + > + /* Open the target file. */ > + if ((fd = open(fullpath, O_RDONLY | PG_BINARY, 0)) < 0) > + { > + report_backup_error(context, "could not open file \"%s\": %m", > + relpath); > + return; > + } > + > + /* Initialize checksum context. 
*/ > + pg_checksum_init(&checksum_ctx, tabent->checksum_type); > + > + /* Read the file chunk by chunk, updating the checksum as we go. */ > + while ((rc = read(fd, buffer, READ_CHUNK_SIZE)) > 0) > + pg_checksum_update(&checksum_ctx, buffer, rc); > + if (rc < 0) > + report_backup_error(context, "could not read file \"%s\": %m", > + relpath); > + Hm. I think it'd be good to verify that the checksummed size is the same as the size of the file in the manifest. Greetings, Andres Freund
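The suggestion above, chunked checksumming plus cross-checking the number of bytes actually hashed against the size recorded in the manifest, can be sketched in a few lines. This is an illustration of the idea, not the patch's C implementation:

```python
import hashlib

READ_CHUNK_SIZE = 128 * 1024

def validate_file(path, expected_size, expected_sha256):
    """Stream a file through SHA-256 in fixed-size chunks, cross-checking
    the byte count actually hashed against the manifest's recorded size as
    well as the checksum itself.  Returns None if valid, else an error."""
    h = hashlib.sha256()
    bytes_read = 0
    with open(path, "rb") as f:
        while chunk := f.read(READ_CHUNK_SIZE):
            h.update(chunk)
            bytes_read += len(chunk)
    if bytes_read != expected_size:
        return f"size mismatch: read {bytes_read}, manifest says {expected_size}"
    if h.hexdigest() != expected_sha256:
        return "checksum mismatch"
    return None  # valid
```

The size cross-check catches exactly the race Robert asks about later in the thread: a file that grows or shrinks between the first directory pass and the checksum pass.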
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2020-03-26 11:37:48 -0400, Robert Haas wrote: > > I'm sorry that you can't see how that's sensible, but it doesn't mean > > that it isn't sensible. It is totally unrealistic to expect that any > > backup verification tool can verify that you won't get an error when > > trying to use the backup. That would require that everything that the > > validation tool try to do everything that PostgreSQL will try to do > > when the backup is used, including running recovery and updating the > > data files. Anything less than that creates a real possibility that > > the backup will verify good but fail when used. This tool has a much > > narrower purpose, which is to try to verify that we (still) have the > > files the server sent as part of the backup and that, to the best of > > our ability to detect such things, they have not been modified. As you > > know, or should know, the WAL files are not sent as part of the > > backup, and so are not verified. Other things that would also be > > useful to check are also not verified. It would be fantastic to have > > more verification tools in the future, but it is difficult to see why > > anyone would bother trying if an attempt to get the first one > > committed gets blocked because it does not yet do everything. Very few > > patches try to do everything, and those that do usually get blocked > > because, by trying to do too much, they get some of it badly wrong. > > It sounds to me that if there are to be manifests for the WAL, it should > be a separate (set of) manifests. Trying to somehow tie together the > manifest for the base backup, and the one for the WAL, makes little > sense to me. They're commonly not computed in one place, often not even > stored in the same place. For PITR relevant WAL doesn't even exist yet > at the time the manifest is created (and thus obviously cannot be > included in the base backup manifest). 
And fairly obviously one would > want to be able to verify the correctness of WAL between two > basebackups. We aren't talking about generic PITR or about tools other than pg_basebackup, which has specific options for grabbing the WAL, and making sure that it is all there for the backup that was taken. > I don't see much point in complicating the design to somehow capture WAL > in the manifest, when it's only going to solve a small set of cases. As it relates to this, I tend to think that it solves the exact case that pg_basebackup is built for and used for. I said up-thread that if someone does decide to use -X none then we could just throw a warning (and perhaps have a way to override that if there's desire for it). > Seems better to (later?) add support for generating manifests for WAL > files, and then have a tool that can verify all the manifests required > to restore a base backup. I'm not trying to expand on the feature set here or move the goalposts way down the road, which is what seems to be what's being suggested here. To be clear, I don't have any objection to adding a generic tool for validating WAL as you're talking about here, but I also don't think that's required for pg_validatebackup. What I do think we need is a check of the WAL that's fetched when people use pg_basebackup -Xstream or -Xfetch. pg_basebackup itself has that check because it's critical to the backup being successful and valid. Not having that basic validation of a backup really just isn't ok- there's a reason pg_basebackup has that check. Thanks, Stephen
On Thu, Mar 26, 2020 at 4:37 PM David Steele <david@pgmasters.net> wrote: > I know you and Stephen have agreed on a number of doc changes, would it > be possible to get a new patch with those included? I finally have time > to do a review of this tomorrow. I saw some mistakes in the docs in the > current patch but I know those patches are not current. Hi David, Here's a new version with some fixes:

- Fixes for doc typos noted by Stephen Frost and Andres Freund.
- Replace a doc paragraph about the advantages and disadvantages of CRC-32C with one by Stephen Frost, with a slight change by me that I thought made it sound more grammatical.
- Change the pg_validatebackup documentation so that it makes no mention of compatible tools, per Stephen.
- Reword the discussion of the exclude list in the pg_validatebackup documentation, per discussion between Stephen and myself.
- Try to make the documentation more clear about the fact that we check for both extra and missing files.
- Incorporate a fix from Amit Kapila to make 003_corruption.pl pass on Windows.

HTH, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 27, 2020 at 1:06 AM Andres Freund <andres@anarazel.de> wrote: > > Like, suppose we change the default from CRC-32C to SHA-something. On > > the upside, the error detection rate will increase from 99.9999999+% > > to something much closer to 100%. > > FWIW, I don't buy the relevancy of 99.9999999+% at all. That's assuming > a single bit error (at relevant lengths, before that it's single burst > errors of a greater length), which isn't that relevant for our purposes. > > That's not to say that I don't think a CRC check can provide value. It > does provide a high likelihood of detecting enough errors, including > coding errors in how data is restored (not unimportant), that you > would otherwise likely not find out about soon. So, I'm glad that you think a CRC check gives a sufficiently good chance of detecting errors, but I don't understand what your objection to the percentage is. Stephen just objected to it again, too: On Thu, Mar 26, 2020 at 4:44 PM Stephen Frost <sfrost@snowman.net> wrote: > > I mean, the property that I care about is the one where it detects > > better than 999,999,999 errors out of every 1,000,000,000, regardless > > of input length. > > Throwing these kinds of things around I really don't think is useful. ...but I don't understand his reasoning, or yours. My reasoning for thinking that the number is accurate is that a 32-bit checksum has 2^32 possible results. If all of those results are equally probable, then the probability that two files with unequal contents produce the same result is 2^-32. This does assume that the hash function is perfect, which no hash function is, so the actual probability of a collision is likely higher. But if the hash function is pretty good, it shouldn't be all that much higher. Note that I am making no assumptions here about how many bits are different, nor am I making any assumption about the length of a file.
I am simply saying that an n-bit checksum should detect a difference between two files with a probability of roughly 1-2^{-n}, modulo the imperfections of the hash function. I thought that this was a well-accepted fact that would produce little argument from anybody, and I'm confused that people seem to feel otherwise. One explanation that would make sense to me is if somebody said, well, the nature of this particular algorithm means that, although values are uniformly distributed in general, the kinds of errors that are likely to occur in practice are likely to cancel out. For instance, if you imagine trivial algorithms such as adding or xor-ing all the bytes, adding zero bytes doesn't change the answer, and neither do transpositions. However, CRC is, AIUI, designed to be resistant to such problems. Your remark about large blocks of zero bytes is interesting to me in this context, but in a quick search I couldn't find anything stating that CRC was weak for such use cases. The old thread about switching from 64-bit CRC to 32-bit CRC had a link to a page which has subsequently been moved to here: https://www.ece.unb.ca/tervo/ee4253/crc.shtml Down towards the bottom, it says: "In general, bit errors and bursts up to N-bits long will be detected for a P(x) of degree N. For arbitrary bit errors longer than N-bits, the odds are one in 2^{N} than a totally false bit pattern will nonetheless lead to a zero remainder." Which I think is the same thing I'm saying: the chances of failing to detect an error with a decent n-bit checksum ought to be about 2^{-N}. If that's not right, I'd really like to understand why. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
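The 1-2^{-n} argument can be checked empirically for a small n by truncating a well-mixed hash so that collisions actually become observable. The following sketch (illustrative, not from the thread) corrupts a block in random, arbitrary-sized ways and counts undetected corruptions; with a 16-bit checksum the miss rate should come out near 2^{-16}, independent of how many bytes were flipped:

```python
import hashlib
import random

def undetected_rate(n_bits=16, trials=50_000, seed=42):
    """Randomly corrupt a 1KB block and count how often an n-bit checksum
    (here: SHA-256 truncated to n bits) fails to notice.  For a well-mixed
    hash the miss rate should be about 2**-n_bits, regardless of how many
    bytes were corrupted."""
    rng = random.Random(seed)
    mask = (1 << n_bits) - 1

    def csum(data):
        return int.from_bytes(hashlib.sha256(data).digest()[:4], "big") & mask

    original = bytes(rng.randrange(256) for _ in range(1024))
    ref = csum(original)
    misses = 0
    for _ in range(trials):
        corrupted = bytearray(original)
        # corrupt a random number of randomly chosen bytes
        for _ in range(rng.randrange(1, 64)):
            corrupted[rng.randrange(len(corrupted))] = rng.randrange(256)
        corrupted = bytes(corrupted)
        if corrupted != original and csum(corrupted) == ref:
            misses += 1
    return misses / trials
```

This measures the "arbitrary error" case, not the single-bit or burst cases where CRC gives hard guarantees; for arbitrary corruption any well-mixed n-bit checksum behaves about the same.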
On Thu, Mar 26, 2020 at 4:37 PM David Steele <david@pgmasters.net> wrote: > I agree with Stephen that this should be done, but I agree with you that > it can wait for a future commit. However, I do think: > > 1) It should be called out rather plainly in the documentation. > 2) If there are files in pg_wal then pg_validatebackup should inform the > user that those files have not been validated. I agree with you about #1, and I suspect that there's a way to improve what I've got here now, but I think I might be too close to this to figure out what the best way would be, so suggestions welcome. I think #2 is an interesting idea and could possibly reduce the danger of user confusion on this point considerably - because, let's face it, not everyone is going to read the documentation. However, I'm having a hard time figuring out exactly what we'd print. Right now on success, unless you specify -q, you get:

[rhaas ~]$ pg_validatebackup ~/pgslave
backup successfully verified

But it feels strange and possibly confusing to me to print something like:

[rhaas ~]$ pg_validatebackup ~/pgslave
backup successfully verified (except for pg_wal)

...because there are a few other exceptions too, and also because it might make the user think that we normally check that but for some reason decided to skip it in this case. Maybe something more verbose like:

[rhaas ~]$ pg_validatebackup ~/pgslave
backup files successfully verified
your backup contains a pg_wal directory, but this tool can't validate that, so do it yourself

...but that seems a little obnoxious and a little silly to print out every time. Ideas? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Mar 26, 2020 at 4:44 PM Stephen Frost <sfrost@snowman.net> wrote: > Is it actually possible, today, in PG, to have a 4GB WAL record? > Judging this based on the WAL record size doesn't seem quite right. I'm not sure. I mean, most records are quite small, but I think if you set REPLICA IDENTITY FULL on a table with a bunch of very wide columns (and also wal_level=logical) it can get really big. I haven't tested to figure out just how big it can get. (If I have a table with lots of almost-1GB-blobs in it, does it work without logical replication and fail with logical replication? I don't know, but I doubt a WAL record >4GB is possible, because it seems unlikely that the code has a way to cope with that struct field overflowing.) > Again, I'm not against having a checksum algorithm as a option. I'm not > saying that it must be SHA512 as the default. I think that what we have seen so far is that all of the SHA-n algorithms that PostgreSQL supports are about equally slow, so it doesn't really matter which one you pick there from a performance point of view. If you're not saying it has to be SHA-512 but you do want it to be SHA-256, I don't think that really fixes anything. Using CRC-32C does fix the performance issue, but I don't think you like that, either. We could default to having no checksums at all, or even no manifest at all, but I didn't get the impression that David, at least, wanted to go that way, and I don't like it either. It's not the world's best feature, but I think it's good enough to justify enabling it by default. So I'm not sure we have any options here that will satisfy you. > > > I don't agree with limiting our view to only those algorithms that we've > > > already got implemented in PG. > > > > I mean, opening that giant can of worms ~2 weeks before feature freeze > > is not very nice. This patch has been around for months, and the > > algorithms were openly discussed a long time ago. 
> > Yes, they were discussed before, and these issues were brought up before > and there was specifically concern brought up about exactly the same > issues that I'm repeating here. Those concerns seem to have been > largely ignored, apparently because "we don't have that in PG today" as > at least one of the considerations- even though we used to. I might have missed something, but I don't remember any suggestion of CRC-64 or other algorithms for which PG does not currently have support prior to this week. The only thing I remember having been suggested previously was SHA, and I responded to that by adding support for SHA, not by ignoring the suggestion. If there was another suggestion made earlier, I must have missed it. > I also had hoped that > David's concerns that were raised before had been heeded, as I knew he > was involved in the discussion previously, but that turns out to not > have been the case. Well, I mean, I am trying pretty hard here, but I realize that I'm not succeeding. I don't know which specific suggestion you're talking about here. I understand that there is a concern about a 32-bit CRC somehow not being valid for more than 512MB, but based on my research, I believe that to be incorrect. I've explained the reasons why I believe it to be incorrect several times now, but I feel like we're just going around in circles. If my explanation of why it's incorrect is itself incorrect, tell me why, but let's not just keep saying the things we've both already said. > Yes, that looks fine. Feels slightly redundant to include the "as > described above ..." bit, and I think that could be dropped, but up to > you. Done in the version I posted a bit ago. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 27, 2020 at 2:29 AM Andres Freund <andres@anarazel.de> wrote: > s/ye'/yes/ Ugh, sorry. Fixed in the version posted earlier. > Are you planning to include a specification of the manifest file format > anywhere? I looked through the patches and didn't find anything. I thought about that. I think it would be good to have. I was sort of hoping to leave it for a follow-on patch, but maybe that's cheating too much. > I think it'd also be good to include more information about what the > point of manifest files actually is. What kind of information do you want to see included there? Basically, the way the documentation is written right now, it essentially says, well, we have this manifest thing so that you can later run pg_validatebackup, and pg_validatebackup says that it's there to check the integrity of backups using the manifest. This is all a bit circular, though, and maybe needs elaboration. What I've experienced is that:

- Sometimes people take a backup and then wonder later whether the disk has flipped some bits.
- Sometimes people restore a backup and forget some of the parts, like the user-defined tablespaces.
- Sometimes anti-virus software, or a poorly-run cron job run amok, wanders around inflicting unpredictable damage.

It would be nice to have a system that would notice these kinds of things on a running system, but here I've got the more modest goal of checking for them in the context of a backup. If the data gets corrupted in transit, or if the disk mutilates it, or if the user mutilates it, you need something to check the backup against to find out that bad things have happened; the manifest is that thing. But I don't know exactly how much of all that should go in the docs, or in what way.
> > + <para> > > + <application>pg_validatebackup</application> reads the manifest file of a > > + backup, verifies the manifest against its own internal checksum, and then > > + verifies that the same files are present in the target directory as in the > > + manifest itself. It then verifies that each file has the expected checksum, > > Depending on what you want to use the manifest for, we'd also need to > check that there are no additional files. That seems to actually be > implemented, which imo should be mentioned here. I intended the text to say that, because it says that it checks that the two things are "the same," which is symmetric. In the new version I posted a bit ago, I tried to make it more explicit, because apparently it was not sufficiently clear. > Hm. Is it a great choice to include the checksum for the manifest inside > the manifest itself? With a cryptographic checksum it seems like it > could make a ton of sense to store the checksum somewhere "safe", but > keep the manifest itself alongside the base backup itself. While not > huge, they won't be tiny either. Seems like the user could just copy the manifest checksum and store it somewhere, if they wish. Then they can check it against the manifest itself later, if they wish. Or they can take a SHA-512 of the whole file and store that securely. The problem is that we have no idea how to write that checksum to more secure storage. We could write backup_manifest and backup_manifest.checksum into separate files, but that seems like it's adding complexity without any real benefit. To me, the security-related uses of this patch seem to be fairly niche. I think it's nice that they exist, but I don't think that's the main selling point. For me, the main selling point is that you can check that your disk didn't eat your data and that nobody nuked any files that were supposed to be there. > Doesn't have to be in the first version, but could it be useful to move > this to common/ or such? Yeah.
At one point, this code was written in a way that was totally specific to pg_validatebackup, but I then realized that it would be better to make it more general, so I refactored it into the form you see now, where pg_validatebackup.c depends on parse_manifest.c but not the reverse. I suspect that if someone wants to use this for something else they might need to change a few more things - not sure exactly what - but I don't think it would be too hard. I thought it would be best to leave that task until someone has a concrete use case in mind, but I did want it to be relatively easy to do that down the road, and I hope that the way I've organized the code achieves that. > > +static void > > +validate_backup_directory(validator_context *context, char *relpath, > > + char *fullpath) > > +{ > > Hm. Should this warn if the directory's permissions are set too openly > (world writable?)? I don't think so, but it's pretty clear that different people have different ideas about what the scope of this tool ought to be, even in this first version. > Hm. I think it'd be good to verify that the checksummed size is the same > as the size of the file in the manifest. That's checked in an earlier phase. Are you worried about the file being modified after the first pass checks the size and before we come through to do the checksumming? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 27, 2020 at 11:26 AM Stephen Frost <sfrost@snowman.net> wrote: > > Seems better to (later?) add support for generating manifests for WAL > > files, and then have a tool that can verify all the manifests required > > to restore a base backup. > > I'm not trying to expand on the feature set here or move the goalposts > way down the road, which is what seems to be what's being suggested > here. To be clear, I don't have any objection to adding a generic tool > for validating WAL as you're talking about here, but I also don't think > that's required for pg_validatebackup. What I do think we need is a > check of the WAL that's fetched when people use pg_basebackup -Xstream > or -Xfetch. pg_basebackup itself has that check because it's critical > to the backup being successful and valid. Not having that basic > validation of a backup really just isn't ok- there's a reason > pg_basebackup has that check. I don't understand how this could be done without significantly complicating the architecture. As I said before, -Xstream sends WAL over a separate connection that is unrelated to the one running BASE_BACKUP, so the base-backup connection doesn't know what to include in the manifest. Now you could do something like: once all of the WAL files have been fetched, the client checksums all of those and sends their names and checksums to the server, which turns around and puts them into the manifest, which it then sends back to the client. But that is actually quite a bit of additional complexity, and it's pretty strange, too, because now you have the client checksumming some files and the server checksumming others. I know you mentioned a few different ideas before, but I think they all kinda have some problem along these lines. I also kinda disagree with the idea that the WAL should be considered an integral part of the backup. 
I don't know how pgBackRest does things, but BART stores each backup in a separate directory without any associated WAL, and then keeps all the WAL together in a different directory. I imagine that people who are using continuous archiving also tend to use -Xnone, or if they do backups by copying the files rather than using pgBackRest, they exclude pg_wal. In fact, for people with big, important databases, I'd assume that would be the normal pattern. You presumably wouldn't want to keep one copy of the WAL files taken during the backup with the backup itself, and a separate copy in the archive. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Thu, Mar 26, 2020 at 4:37 PM David Steele <david@pgmasters.net> wrote: > > I agree with Stephen that this should be done, but I agree with you that > > it can wait for a future commit. However, I do think: > > > > 1) It should be called out rather plainly in the documentation. > > 2) If there are files in pg_wal then pg_validatebackup should inform the > > user that those files have not been validated. > > I agree with you about #1, and I suspect that there's a way to improve > what I've got here now, but I think I might be too close to this to > figure out what the best way would be, so suggestions welcome. > > I think #2 is an interesting idea and could possibly reduce the danger > of user confusion on this point considerably - because, let's face it, > not everyone is going to read the documentation. However, I'm having a > hard time figuring out exactly what we'd print. Right now on success, > unless you specify -q, you get: > > [rhaas ~]$ pg_validatebackup ~/pgslave > backup successfully verified > > But it feels strange and possibly confusing to me to print something like: > > [rhaas ~]$ pg_validatebackup ~/pgslave > backup successfully verified (except for pg_wal) > > ...because there are a few other exceptions too, and also because it The exceptions you're referring to here are things like the various signal files, that the user can recreate pretty easily...? I don't think those really rise to the level of pg_wal. What I would hope to see (... well, we know what I *really* would hope to see, but if we really go this route) is something like: WARNING: pg_wal not empty, WAL files are not validated by this tool data files successfully verified and a non-zero exit code. Basically, if you're doing WAL yourself, then you'd use pg_receivewal and maybe your own manifest-building code for WAL or something and then use -X none with pg_basebackup. Then again, I'd have -X none throw a warning too.
I'd be alright with all of these having override switches to say "ok, I get it, don't complain about it". I disagree with the idea of writing "backup successfully verified" when we aren't doing any checking of the WAL that's essential for the backup (unlike various signal files and whatnot, which aren't...). Thanks, Stephen
On 3/27/20 1:53 PM, Robert Haas wrote: > On Thu, Mar 26, 2020 at 4:37 PM David Steele <david@pgmasters.net> wrote: >> I know you and Stephen have agreed on a number of doc changes, would it >> be possible to get a new patch with those included? I finally have time >> to do a review of this tomorrow. I saw some mistakes in the docs in the >> current patch but I know those patches are not current. > > Hi David, > > Here's a new version with some fixes: > > - Fixes for doc typos noted by Stephen Frost and Andres Freund. > - Replace a doc paragraph about the advantages and disadvantages of > CRC-32C with one by Stephen Frost, with a slight change by me that I > thought made it sound more grammatical. > - Change the pg_validatebackup documentation so that it makes no > mention of compatible tools, per Stephen. > - Reword the discussion of the exclude list in the pg_validatebackup > documentation, per discussion between Stephen and myself. > - Try to make the documentation more clear about the fact that we > check for both extra and missing files. > - Incorporate a fix from Amit Kapila to make 003_corruption.pl pass on Windows. Thanks! There appear to be conflicts with 67e0adfb3f98: $ git apply -3 ../download/v14-0002-Generate-backup-manifests-for-base-backups-and-v.patch ../download/v14-0002-Generate-backup-manifests-for-base-backups-and-v.patch:3396: trailing whitespace. sub cleanup_search_directory_fails error: patch failed: src/backend/replication/basebackup.c:258 Falling back to three-way merge... Applied patch to 'src/backend/replication/basebackup.c' with conflicts. U src/backend/replication/basebackup.c warning: 1 line adds whitespace errors. > + Specifies the algorithm that should be used to checksum each file > + for purposes of the backup manifest. Currently, the available perhaps "for inclusion in the backup manifest"? Anyway, I think this sentence is awkward.
> + Specifies the algorithm that should be used to checksum each file > + for purposes of the backup manifest. Currently, the available And again. > + because the files themselves do not need to read. should be "need to be read". > + the manifest itself will always contain a <literal>SHA256</literal> I think just "the manifest will always contain" is fine. > + manifeste itself, and is therefore ignored. Note that the manifest typo "manifeste", perhaps remove itself. > { "Path": "backup_label", "Size": 224, "Last-Modified": "2020-03-27 18:33:18 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "b914bec9" }, Storing the checksum type with each file seems pretty redundant. Perhaps that could go in the header? You could always override if a specific file had a different checksum type, though that seems unlikely. In general it might be good to go with shorter keys: "mod", "chk", etc. Manifests can get pretty big and that's a lot of extra bytes. I'm also partial to using epoch time in the manifest because it is generally easier for programs to work with. But, human-readable doesn't suck, either. > if (maxrate > 0) > maxrate_clause = psprintf("MAX_RATE %u", maxrate); > + if (manifest) A linefeed here would be nice. > + manifestfile *tabent; This is an odd name. A holdover from the tab-delimited version? > + printf(_("Usage:\n %s [OPTION]... BACKUPDIR\n\n"), progname); When I ran pg_validatebackup I expected to use -D to specify the backup dir since pg_basebackup does. On the other hand -D is weird because I *really* expect that to be the pg data dir. But, do we want this to be different from pg_basebackup? > + checksum_length = checksum_string_length / 2; This check is defeated if a single character is added to the checksum. Not too big a deal since you still get an error, but still. > + * Verify that the manifest checksum is correct. This is not working the way I would expect -- I could freely modify the manifest without getting a checksum error on the manifest.
For example: $ /home/vagrant/test/pg/bin/pg_validatebackup test/backup3 pg_validatebackup: fatal: invalid checksum for file "backup_label": "408901e0814f40f8ceb7796309a59c7248458325a21941e7c55568e381f53831?" So, if I deleted the entry above, I got a manifest checksum error. But if I just modified the checksum I get a file checksum error with no manifest checksum error. I would prefer a manifest checksum error in all cases where it is wrong, unless --exit-on-error is specified. -- -David david@pgmasters.net
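To make the per-entry cross-check being reviewed above concrete, here is a minimal sketch (not the tool's actual code) of validating one manifest file entry against a file's contents, using the draft key names quoted earlier (`Path`, `Size`, `Checksum-Algorithm`, `Checksum`). The helper name is hypothetical, and plain `zlib.crc32` stands in for CRC-32C, since the Castagnoli polynomial PostgreSQL uses isn't in the Python standard library:

```python
import hashlib
import zlib

def verify_entry(entry, data):
    """Cross-check one manifest entry against a file's actual contents.
    Keys follow the draft format quoted above; the helper itself is
    hypothetical, not part of the proposed tool."""
    if len(data) != entry["Size"]:
        return "size mismatch"
    algo = entry["Checksum-Algorithm"]
    if algo == "CRC32C":
        # zlib's plain CRC-32 stands in here; PostgreSQL actually uses the
        # Castagnoli polynomial, which the Python stdlib doesn't provide.
        actual = format(zlib.crc32(data) & 0xFFFFFFFF, "08x")
    elif algo.startswith("SHA"):
        actual = hashlib.new("sha" + algo[3:], data).hexdigest()
    else:
        return "unknown algorithm"
    # A hex digest always has an even number of characters, so an odd
    # length (e.g. one stray appended character) can be rejected outright.
    if len(entry["Checksum"]) % 2 != 0:
        return "invalid checksum string"
    return "ok" if actual == entry["Checksum"] else "checksum mismatch"
```

Note that the odd-length test mirrors the `checksum_string_length / 2` concern above: it catches a digest string of the wrong shape before the comparison is attempted.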
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Thu, Mar 26, 2020 at 4:44 PM Stephen Frost <sfrost@snowman.net> wrote: > > Is it actually possible, today, in PG, to have a 4GB WAL record? > > Judging this based on the WAL record size doesn't seem quite right. > > I'm not sure. I mean, most records are quite small, but I think if you > set REPLICA IDENTITY FULL on a table with a bunch of very wide columns > (and also wal_level=logical) it can get really big. I haven't tested > to figure out just how big it can get. (If I have a table with lots of > almost-1GB-blobs in it, does it work without logical replication and > fail with logical replication? I don't know, but I doubt a WAL record > >4GB is possible, because it seems unlikely that the code has a way to > cope with that struct field overflowing.) Interesting... Well, topic for another thread, but I'd say if we believe that's possible then we might want to consider whether crc32c is still a good choice there. > > Again, I'm not against having a checksum algorithm as an option. I'm not > > saying that it must be SHA512 as the default. > > I think that what we have seen so far is that all of the SHA-n > algorithms that PostgreSQL supports are about equally slow, so it > doesn't really matter which one you pick there from a performance > point of view. If you're not saying it has to be SHA-512 but you do > want it to be SHA-256, I don't think that really fixes anything. Using > CRC-32C does fix the performance issue, but I don't think you like > that, either. We could default to having no checksums at all, or even > no manifest at all, but I didn't get the impression that David, at > least, wanted to go that way, and I don't like it either. It's not the > world's best feature, but I think it's good enough to justify enabling > it by default. So I'm not sure we have any options here that will > satisfy you. I do like having a manifest by default.
At this point it's pretty clear that we've just got a fundamental disagreement that more words aren't going to fix. I'd rather we play it safe and use a sha256 hash and accept that it's going to be slower by default, and then give users an option to make it go faster if they want (though I'd much rather that alternative be a 64bit CRC than a 32bit one). Andres seems to agree with you. I'm not sure where David sits on this specific question. Thanks, Stephen
Hi, On 2020-03-27 14:13:17 -0400, Robert Haas wrote: > On Thu, Mar 26, 2020 at 4:44 PM Stephen Frost <sfrost@snowman.net> wrote: > > > I mean, the property that I care about is the one where it detects > > > better than 999,999,999 errors out of every 1,000,000,000, regardless > > > of input length. > > > > Throwing these kinds of things around I really don't think is useful. > > ...but I don't understand his reasoning, or yours. > > My reasoning for thinking that the number is accurate is that a 32-bit > checksum has 2^32 possible results. If all of those results are > equally probable, then the probability that two files with unequal > contents produce the same result is 2^-32. This does assume that the > hash function is perfect, which no hash function is, so the actual > probability of a collision is likely higher. But if the hash function > is pretty good, it shouldn't be all that much higher. Note that I am > making no assumptions here about how many bits are different, nor am I > making any assumption about the length of a file. I am simply saying > that an n-bit checksum should detect a difference between two files > with a probability of roughly 1-2^{-n}, modulo the imperfections of > the hash function. I thought that this was a well-accepted fact that > would produce little argument from anybody, and I'm confused that > people seem to feel otherwise. Well: crc32 is a terrible hash, if you're looking for even distribution of hashed values. That's not too surprising - its design goals included guaranteed error detection for certain lengths, and error correction of single bit errors. My understanding of the underlying math is spotty at best, but from what I understand that does pretty directly imply less independence between source data -> hash value than what we'd want from a good hash function. 
Here's an smhasher result page for crc32 (at least the latter is crc32 afaict): https://notabug.org/vaeringjar/smhasher/src/master/doc/crc32 https://notabug.org/vaeringjar/smhasher/src/master/doc/crc32_hw and then compare that with something like xxhash, or even lookup3 (which I think is what our hash is a variant of): https://notabug.org/vaeringjar/smhasher/src/master/doc/xxHash32 https://notabug.org/vaeringjar/smhasher/src/master/doc/lookup3 The birthday paradox doesn't apply (otherwise 32bit would never be enough, at a 50% chance of conflict at around 80k hashes), but still I do wonder if it matters that we're trying to detect errors in not one, but commonly tens of thousands to millions of files. But since we just need to detect one error to call the whole backup corrupt... > One explanation that would make sense to me is if somebody said, well, > the nature of this particular algorithm means that, although values > are uniformly distributed in general, the kinds of errors that are > likely to occur in practice are likely to cancel out. For instance, if > you imagine trivial algorithms such as adding or xor-ing all the > bytes, adding zero bytes doesn't change the answer, and neither do > transpositions. However, CRC is, AIUI, designed to be resistant to > such problems. Your remark about large blocks of zero bytes is > interesting to me in this context, but in a quick search I couldn't > find anything stating that CRC was weak for such use cases. My main point was that CRC's error detection guarantees are pretty much irrelevant for us. I.e. while the right CRC will guarantee that all single 2 bit errors will be detected, that's not a helpful property for us. There rarely are single bit errors, and the bursts are too long to benefit from any >2 bit guarantees. Nor are multiple failures rare once you hit a problem.
> The old thread about switching from 64-bit CRC to 32-bit CRC had a > link to a page which has subsequently been moved to here: > > https://www.ece.unb.ca/tervo/ee4253/crc.shtml > > Down towards the bottom, it says: > > "In general, bit errors and bursts up to N-bits long will be detected > for a P(x) of degree N. For arbitrary bit errors longer than N-bits, > the odds are one in 2^{N} that a totally false bit pattern will > nonetheless lead to a zero remainder." That's still about a single sequence of bit errors though, as far as I can tell. I.e. it doesn't hold for CRCs if you have two errors at different places. Greetings, Andres Freund
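The ~80k figure mentioned above for the birthday bound is easy to reproduce. Under the usual approximation, n uniformly distributed b-bit hashes collide with probability about 1 − exp(−n(n−1)/2^(b+1)); the snippet below (purely illustrative, nothing from the patch) checks where that crosses 50% for 32 bits:

```python
import math

def collision_probability(n, bits):
    """Birthday approximation: chance that at least two of n uniformly
    distributed `bits`-wide hash values collide."""
    return 1.0 - math.exp(-n * (n - 1) / 2.0 / 2.0 ** bits)

# n at which a 32-bit hash reaches a 50% chance of some collision:
n_half = math.sqrt(2.0 * math.log(2.0) * 2.0 ** 32)  # about 77,000
```

Note this is the chance of *some* pair colliding among many hashes. The probability that one specific corrupted file slips past its own checksum stays at roughly 2^-32 (for a well-distributed hash), which is the property that matters for backup verification, as the "just need to detect one error" remark says.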
On 3/27/20 3:20 PM, Robert Haas wrote: > On Fri, Mar 27, 2020 at 2:29 AM Andres Freund <andres@anarazel.de> wrote: > >> Hm. Is it a great choice to include the checksum for the manifest inside >> the manifest itself? With a cryptographic checksum it seems like it >> could make a ton of sense to store the checksum somewhere "safe", but >> keep the manifest itself alongside the base backup itself. While not >> huge, they won't be tiny either. > > Seems like the user could just copy the manifest checksum and store it > somewhere, if they wish. Then they can check it against the manifest > itself later, if they wish. Or they can take a SHA-512 of the whole > file and store that securely. The problem is that we have no idea how > to write that checksum to a more secure storage. We could write > backup_manifest and backup_manifest.checksum into separate files, but > that seems like it's adding complexity without any real benefit. I agree that this seems like a separate problem. What Robert has done here is detect random mutilation of the manifest. To prevent malicious modifications you either need to store the checksum in another place, or digitally sign the file and store that alongside it (or inside it even). Either way seems pretty far out of scope to me. >> Hm. I think it'd be good to verify that the checksummed size is the same >> as the size of the file in the manifest. > > That's checked in an earlier phase. Are you worried about the file > being modified after the first pass checks the size and before we come > through to do the checksumming? I prefer to validate the size and checksum in the same pass, but I'm not sure it's that big a deal. If the backup is being corrupted while validation is running, that would also apply to files that had already been validated. Regards, -- -David david@pgmasters.net
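As a sketch of the "store the checksum somewhere safe" idea discussed above: a detached digest kept outside the backup catches random mutilation, and a keyed signature additionally defeats an attacker who can rewrite both the manifest and a plain digest file. The file layout (a hypothetical backup_manifest.sha256) and helper names are assumptions for illustration:

```python
import hashlib
import hmac

def detached_digest(manifest_bytes):
    """SHA-256 over the whole manifest file, suitable for keeping in a
    separate, safer location than the backup directory itself."""
    return hashlib.sha256(manifest_bytes).hexdigest()

def signed_digest(manifest_bytes, key):
    """Keyed variant: an HMAC ties the digest to a secret, so rewriting
    the manifest plus a plain detached digest is no longer enough."""
    return hmac.new(key, manifest_bytes, hashlib.sha256).hexdigest()

def verify_detached(manifest_bytes, stored_hex):
    # compare_digest avoids timing side channels when comparing digests
    return hmac.compare_digest(detached_digest(manifest_bytes), stored_hex)
```

Either way, as noted, this stays out of scope for the tool itself: the user can copy the manifest's own checksum line wherever they like.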
Hi, On 2020-03-27 15:29:02 -0400, Robert Haas wrote: > On Fri, Mar 27, 2020 at 11:26 AM Stephen Frost <sfrost@snowman.net> wrote: > > > Seems better to (later?) add support for generating manifests for WAL > > > files, and then have a tool that can verify all the manifests required > > > to restore a base backup. > > > > I'm not trying to expand on the feature set here or move the goalposts > > way down the road, which is what seems to be what's being suggested > > here. To be clear, I don't have any objection to adding a generic tool > > for validating WAL as you're talking about here, but I also don't think > > that's required for pg_validatebackup. What I do think we need is a > > check of the WAL that's fetched when people use pg_basebackup -Xstream > > or -Xfetch. pg_basebackup itself has that check because it's critical > > to the backup being successful and valid. Not having that basic > > validation of a backup really just isn't ok- there's a reason > > pg_basebackup has that check. > > I don't understand how this could be done without significantly > complicating the architecture. As I said before, -Xstream sends WAL > over a separate connection that is unrelated to the one running > BASE_BACKUP, so the base-backup connection doesn't know what to > include in the manifest. Now you could do something like: once all of > the WAL files have been fetched, the client checksums all of those and > sends their names and checksums to the server, which turns around and > puts them into the manifest, which it then sends back to the client. > But that is actually quite a bit of additional complexity, and it's > pretty strange, too, because now you have the client checksumming some > files and the server checksumming others. I know you mentioned a few > different ideas before, but I think they all kinda have some problem > along these lines. How about having separate manifests for segments? And have them stay separate? 
And then have an option to verify the manifests for all the WAL files that are required for a specific restore? The easiest way would be to just add a separate manifest file for each segment, and name them accordingly. But inventing a naming pattern that specifies both start-end segments wouldn't be hard either, and result in fewer manifests. Base backups (in the backup sense, not for bringing up replicas etc) without the ability to apply newer WAL are fairly pointless imo. And if newer WAL is applied, there's not much point in just verifying the WAL that's necessary to restore the base backup. Instead you'd want to be able to verify all the WAL since the base backup to the "current" point (or the next base backup). For me having something inside pg_basebackup (or the server, for -Xfetch) that somehow includes the WAL files in the manifest doesn't really gain us much - it's obviously not something that'll help us to verify all the WAL that needs to be applied (to either get the base backup into a consistent state, or to roll forward to the desired point). > I also kinda disagree with the idea that the WAL should be considered > an integral part of the backup. I don't know how pgbackrest does > things, but BART stores each backup in a separate directly without any > associated WAL, and then keeps all the WAL together in a different > directory. I imagine that people who are using continuous archiving > also tend to use -Xnone, or if they do backups by copying the files > rather than using pg_backrest, they exclude pg_wal. In fact, for > people with big, important databases, I'd assume that would be the > normal pattern. You presumably wouldn't want to keep one copy of the > WAL files taken during the backup with the backup itself, and a > separate copy in the archive. 
+1 I also don't see them as being as important, due to the already existing checksums (which are of a much much much higher quality than what we have for database pages, both by being wider, and by being much more frequent in most cases). There's obviously a need to validate the WAL in a nicer way than scripting pg_waldump - but that seems separate anyway. Greetings, Andres Freund
Hi, On 2020-03-27 14:34:19 -0400, Robert Haas wrote: > I think #2 is an interesting idea and could possibly reduce the danger > of user confusion on this point considerably - because, let's face it, > not everyone is going to read the documentation. However, I'm having a > hard time figuring out exactly what we'd print. Right now on success, > unless you specify -q, you get: > > [rhaas ~]$ pg_validatebackup ~/pgslave > backup successfully verified > > But it feels strange and possibly confusing to me to print something like: > > [rhaas ~]$ pg_validatebackup ~/pgslave > backup successfully verified (except for pg_wal) You could print something like: WAL necessary to restore this base backup can be validated with: pg_waldump -p ~/pgslave -t tl -s backup_start_location -e backup_end_loc > /dev/null && echo true Obviously that specific invocation sucks, but it'd not be hard to add an option to waldump to not output anything. Greetings, Andres Freund
On 3/27/20 3:29 PM, Robert Haas wrote: > On Fri, Mar 27, 2020 at 11:26 AM Stephen Frost <sfrost@snowman.net> wrote: >>> Seems better to (later?) add support for generating manifests for WAL >>> files, and then have a tool that can verify all the manifests required >>> to restore a base backup. >> >> I'm not trying to expand on the feature set here or move the goalposts >> way down the road, which is what seems to be what's being suggested >> here. To be clear, I don't have any objection to adding a generic tool >> for validating WAL as you're talking about here, but I also don't think >> that's required for pg_validatebackup. What I do think we need is a >> check of the WAL that's fetched when people use pg_basebackup -Xstream >> or -Xfetch. pg_basebackup itself has that check because it's critical >> to the backup being successful and valid. Not having that basic >> validation of a backup really just isn't ok- there's a reason >> pg_basebackup has that check. > > I don't understand how this could be done without significantly > complicating the architecture. As I said before, -Xstream sends WAL > over a separate connection that is unrelated to the one running > BASE_BACKUP, so the base-backup connection doesn't know what to > include in the manifest. Now you could do something like: once all of > the WAL files have been fetched, the client checksums all of those and > sends their names and checksums to the server, which turns around and > puts them into the manifest, which it then sends back to the client. > But that is actually quite a bit of additional complexity, and it's > pretty strange, too, because now you have the client checksumming some > files and the server checksumming others. I know you mentioned a few > different ideas before, but I think they all kinda have some problem > along these lines. > > I also kinda disagree with the idea that the WAL should be considered > an integral part of the backup. 
I don't know how pgbackrest does > things, We checksum each WAL file while it is read and transmitted to the repo by the archive_command. Then at the end of the backup we ensure that all the WAL required to make the backup consistent has made it to the repo. > but BART stores each backup in a separate directly without any > associated WAL, and then keeps all the WAL together in a different > directory. I imagine that people who are using continuous archiving > also tend to use -Xnone, or if they do backups by copying the files > rather than using pg_backrest, they exclude pg_wal. In fact, for > people with big, important databases, I'd assume that would be the > normal pattern. You presumably wouldn't want to keep one copy of the > WAL files taken during the backup with the backup itself, and a > separate copy in the archive. pgBackRest does provide the option to copy WAL into the backup directory for the super-paranoid, though it is not the default. It is pretty handy for moving individual backups some other medium like tape, though. If -Xnone is specified then it seems like pg_validatebackup is completely off the hook. But in the case of -Xstream or -Xfetch couldn't we at least verify that the expected WAL segments are present and the correct size? Storing the start/stop lsn in the manifest would be a nice thing to have anyway and that would make this feature pretty trivial. Yeah, that's in the backup_label file as well but the manifest is so much easier to read. Regards, -- -David david@pgmasters.net
Hi, On 2020-03-27 15:20:27 -0400, Robert Haas wrote: > On Fri, Mar 27, 2020 at 2:29 AM Andres Freund <andres@anarazel.de> wrote: > > Are you planning to include a specification of the manifest file format > > anywhere? I looked through the patches and didn't find anything. > > I thought about that. I think it would be good to have. I was sort of > hoping to leave it for a follow-on patch, but maybe that's cheating > too much. I don't like having a file format that's intended to be used by external tools, too, but is undocumented except for code that assembles it in a piecemeal fashion. Do you mean in a follow-on patch this release, or later? I don't have a problem with the former. > > I think it'd also be good to include more information about what the > > point of manifest files actually is. > > What kind of information do you want to see included there? Basically, > the way the documentation is written right now, it essentially says, > well, we have this manifest thing so that you can later run > pg_validatebackup, and pg_validatebackup says that it's there to check > the integrity of backups using the manifest. This is all a bit > circular, though, and maybe needs elaboration. I do find it to be circular. I think we mostly need a paragraph or two somewhere that explains on a higher level what the point of verifying base backups is and what is verified. > > Hm. Is it a great choice to include the checksum for the manifest inside > > the manifest itself? With a cryptographic checksum it seems like it > > could make a ton of sense to store the checksum somewhere "safe", but > > keep the manifest itself alongside the base backup itself. While not > > huge, they won't be tiny either. > > Seems like the user could just copy the manifest checksum and store it > somewhere, if they wish. Then they can check it against the manifest > itself later, if they wish. Or they can take a SHA-512 of the whole > file and store that securely.
The problem is that we have no idea how > to write that checksum to a more secure storage. We could write > backup_manifest and backup_manifest.checksum into separate files, but > that seems like it's adding complexity without any real benefit. > > To me, the security-related uses of this patch seem to be fairly > niche. I think it's nice that they exist, but I don't think that's the > main selling point. For me, the main selling point is that you can > check that your disk didn't eat your data and that nobody nuked any > files that were supposed to be there. Oh, I agree. I wasn't really mentioning the crypto checksum because of it being "security" stuff, but because of the quality of the guarantee it gives. I don't know how large the manifest file will be for a setup with a lot of partitioned tables, but I'd expect it to not be tiny. So not having to store it in the 'archiving system' is nice. FWIW, I was thinking of backup_manifest.checksum potentially being desirable for another reason: The need to embed the checksum inside the document imo adds a fair bit of rigidity to the file format. See > +static void > +verify_manifest_checksum(JsonManifestParseState *parse, char *buffer, > + size_t size) > +{ ... > + > + /* Find the last two newlines in the file. */ > + for (i = 0; i < size; ++i) > + { > + if (buffer[i] == '\n') > + { > + ++number_of_newlines; > + penultimate_newline = ultimate_newline; > + ultimate_newline = i; > + } > + } > + > + /* > + * Make sure that the last newline is right at the end, and that there are > + * at least two lines total. We need this to be true in order for the > + * following code, which computes the manifest checksum, to work properly. > + */ > + if (number_of_newlines < 2) > + json_manifest_parse_failure(parse->context, > + "expected at least 2 lines"); > + if (ultimate_newline != size - 1) > + json_manifest_parse_failure(parse->context, > + "last line not newline-terminated"); > + > + /* Checksum the rest.
*/ > + pg_sha256_init(&manifest_ctx); > + pg_sha256_update(&manifest_ctx, (uint8 *) buffer, penultimate_newline + 1); > + pg_sha256_final(&manifest_ctx, manifest_checksum_actual); which certainly isn't "free form json". > > Doesn't have to be in the first version, but could it be useful to move > > this to common/ or such? > > Yeah. At one point, this code was written in a way that was totally > specific to pg_validatebackup, but I then realized that it would be > better to make it more general, so I refactored it into the form > you see now, where pg_validatebackup.c depends on parse_manifest.c but > not the reverse. I suspect that if someone wants to use this for > something else they might need to change a few more things - not sure > exactly what - but I don't think it would be too hard. I thought it > would be best to leave that task until someone has a concrete use case > in mind, but I did want it to be relatively easy to do that down > the road, and I hope that the way I've organized the code achieves > that. Cool. > > > +static void > > > +validate_backup_directory(validator_context *context, char *relpath, > > > + char *fullpath) > > > +{ > > > > Hm. Should this warn if the directory's permissions are set too openly > > (world writable?)? > > I don't think so, but it's pretty clear that different people have > different ideas about what the scope of this tool ought to be, even in > this first version. Yea. I don't have a strong opinion on this specific issue. I was mostly wondering because I've repeatedly seen people restore backups with world readable properties, and with that it's obviously possible for somebody else to change the contents after the checksum was computed. > > Hm. I think it'd be good to verify that the checksummed size is the same > > as the size of the file in the manifest. > > That's checked in an earlier phase.
Are you worried about the file > being modified after the first pass checks the size and before we come > through to do the checksumming? Not really, I wondered about it for a bit, and then decided that it's too remote an issue. What I've seen a couple of times is that actually reading a file can result in the end of the file being reported at a different position than what stat() said. So by crosschecking the size while reading with the one from stat (which was compared with the source system one) we'd make the error reporting much better. It's certainly easier to know where to start looking when validate says "error: read %llu bytes from file, expected %llu" or something along those lines, than when it just reports a checksum error. There's also some crypto hash algorithm weaknesses that are easier to exploit when it's possible to append data to a known prefix, but that doesn't seem an obvious threat here. Greetings, Andres Freund
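The rigidity Andres points at comes from the layout the quoted verify_manifest_checksum() depends on: the manifest's own digest is computed over everything up to and including the penultimate newline, and the stored digest sits on the final newline-terminated line. A rough Python rendering of that contract (the extract_stored callback stands in for parsing the final JSON line, whose exact shape is not fixed here):

```python
import hashlib

def split_manifest(raw):
    """Mirror the layout verify_manifest_checksum() expects: the file must
    end with a newline, and the final line carries the stored checksum;
    the manifest's own digest covers everything through the line before."""
    if not raw.endswith(b"\n"):
        raise ValueError("last line not newline-terminated")
    body, last = raw[:-1].rsplit(b"\n", 1)  # ValueError if fewer than 2 lines
    return body + b"\n", last

def manifest_checksum_ok(raw, extract_stored):
    """extract_stored pulls the hex digest out of the final line; its exact
    shape is an assumption here, not the committed manifest format."""
    body, last_line = split_manifest(raw)
    return hashlib.sha256(body).hexdigest() == extract_stored(last_line)
```

This also illustrates why the format "certainly isn't free form json": reflowing or pretty-printing the file moves the newlines and invalidates the embedded digest, which a detached backup_manifest.checksum file would not suffer from.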
On 3/27/20 3:55 PM, Stephen Frost wrote: > * Robert Haas (robertmhaas@gmail.com) wrote: >> I think that what we have seen so far is that all of the SHA-n >> algorithms that PostgreSQL supports are about equally slow, so it >> doesn't really matter which one you pick there from a performance >> point of view. If you're not saying it has to be SHA-512 but you do >> want it to be SHA-256, I don't think that really fixes anything. Using >> CRC-32C does fix the performance issue, but I don't think you like >> that, either. We could default to having no checksums at all, or even >> no manifest at all, but I didn't get the impression that David, at >> least, wanted to go that way, and I don't like it either. It's not the >> world's best feature, but I think it's good enough to justify enabling >> it by default. So I'm not sure we have any options here that will >> satisfy you. > > I do like having a manifest by default. At this point it's pretty clear > that we've just got a fundamental disagreement that more words aren't > going to fix. I'd rather we play it safe and use a sha256 hash and > accept that it's going to be slower by default, and then give users an > option to make it go faster if they want (though I'd much rather that > alternative be a 64bit CRC than a 32bit one). > > Andres seems to agree with you. I'm not sure where David sits on this > specific question. I would prefer a stronger checksum as the default but I would be fine with SHA1, which is a bit faster. I believe the overhead of checksums is being overblown. In my experience the vast majority of users are using compression and running the backup over a network. Once you have done those two things the cost of SHA1 is pretty negligible. As I posted way up-thread we found that just gzip -6 pushed the cost of SHA1 below 3% and that did not include network transfer. Regards, -- -David david@pgmasters.net
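David's point that compression dominates the cost is easy to spot-check with a toy measurement; the relative rates below will vary by machine and OpenSSL build, and this is only an illustration, not the up-thread benchmark:

```python
import hashlib
import time
import zlib

def mb_per_sec(fn, payload, reps=8):
    """Rough throughput of one operation over payload, in MB/s."""
    start = time.perf_counter()
    for _ in range(reps):
        fn(payload)
    return len(payload) * reps / (time.perf_counter() - start) / 1e6

payload = bytes(range(256)) * 4096  # 1 MiB of non-trivial data
rates = {
    "crc32": mb_per_sec(lambda d: zlib.crc32(d), payload),
    "sha1": mb_per_sec(lambda d: hashlib.sha1(d).digest(), payload),
    "sha256": mb_per_sec(lambda d: hashlib.sha256(d).digest(), payload),
    "gzip -6": mb_per_sec(lambda d: zlib.compress(d, 6), payload),
}
```

Typically the gzip -6 line comes out far slower than any of the hashes, which is the mechanism behind "gzip -6 pushed the cost of SHA1 below 3%": once each byte is being compressed anyway, hashing it too adds only a small fraction.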
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Fri, Mar 27, 2020 at 11:26 AM Stephen Frost <sfrost@snowman.net> wrote: > > > Seems better to (later?) add support for generating manifests for WAL > > > files, and then have a tool that can verify all the manifests required > > > to restore a base backup. > > > > I'm not trying to expand on the feature set here or move the goalposts > > way down the road, which is what seems to be what's being suggested > > here. To be clear, I don't have any objection to adding a generic tool > > for validating WAL as you're talking about here, but I also don't think > > that's required for pg_validatebackup. What I do think we need is a > > check of the WAL that's fetched when people use pg_basebackup -Xstream > > or -Xfetch. pg_basebackup itself has that check because it's critical > > to the backup being successful and valid. Not having that basic > > validation of a backup really just isn't ok- there's a reason > > pg_basebackup has that check. > > I don't understand how this could be done without significantly > complicating the architecture. As I said before, -Xstream sends WAL > over a separate connection that is unrelated to the one running > BASE_BACKUP, so the base-backup connection doesn't know what to > include in the manifest. Now you could do something like: once all of > the WAL files have been fetched, the client checksums all of those and > sends their names and checksums to the server, which turns around and > puts them into the manifest, which it then sends back to the client. > But that is actually quite a bit of additional complexity, and it's > pretty strange, too, because now you have the client checksumming some > files and the server checksumming others. I know you mentioned a few > different ideas before, but I think they all kinda have some problem > along these lines. I've made some suggestions before, also chatted about an idea with David that I'll outline here. 
First off- I'm a bit mystified why you are saying that the base backup connection doesn't know what to include in the manifest regarding WAL. The base-backup process determines the starting position (and then even puts it into the backup_label that's sent to the client), and then it directly returns the ending position at the end of the BASE_BACKUP command. Given that we do know that information, then we just need to get the checksums/hashes for each of the WAL files, if that's been asked for. How do we know checksums or hashes have been asked for in the WAL streaming connection? We can have the pg_basebackup process ask for that when it connects to stream the WAL that's needed. Now the only part that's a little grotty is dealing with passing the checksums/hashes that the WAL stream connection calculates over to the base backup connection to include in the manifest. Offhand though, it seems like we could drop a file in archive_status for that, perhaps "wal_checksums.PID" or such (the PID would be that of the PG backend that's doing the base backup, which we'd pass to START_REPLICATION). Of course, the backup process would have to check and make sure that it got all the needed WAL file checksums, but since it knows the end, that shouldn't be too bad. > I also kinda disagree with the idea that the WAL should be considered > an integral part of the backup. I don't know how pgBackRest does > things, but BART stores each backup in a separate directory without any > associated WAL, and then keeps all the WAL together in a different > directory. I imagine that people who are using continuous archiving > also tend to use -Xnone, or if they do backups by copying the files > rather than using pgBackRest, they exclude pg_wal. In fact, for > people with big, important databases, I'd assume that would be the > normal pattern. You presumably wouldn't want to keep one copy of the > WAL files taken during the backup with the backup itself, and a > separate copy in the archive.
I really don't know what to say to this. WAL is absolutely critical to a backup being valid. pgBackRest doesn't have a way to *just* validate a backup today, unfortunately, but we're planning to support it in the future and we will absolutely include in that validation checking all of the WAL that's part of the backup. I'm fine with forgoing all of this in the -X none case, as I've said elsewhere. I think it'd be great for pg_receivewal to have a way to validate WAL and such, but that's a clearly new feature and it's independent from validating a backup. As it relates to how pgBackRest stores WAL, we actually do support both of the options you mention, because people with big important databases like to be extra paranoid. WAL can either be stored in just the archive, or it can be stored in both the archive and in the backup (with '--archive-copy'). Note that this isn't done by just grabbing whatever is in pg_wal at the time of the backup, as that wouldn't actually work, but rather by copying the necessary WAL from the archive at the end of the backup. We do also check all WAL that's pulled from the archive by the restore command, though exactly what WAL is needed isn't something we know ahead of time (yet, anyway.. we are working on WAL parsing code that'll change that by actually scanning the WAL and storing all restore points, starting/ending times and transaction IDs, and anything else that can be used as a restore target, so we can figure out exactly all WAL that's needed to get to a particular restore target). We actually have someone who implemented an independent tool called check_pgbackrest which specifically has an "archives" check, for checking that the WAL is in the archive. We plan to also provide a way to ask pgbackrest to confirm that there's no missing WAL, and that all of the WAL is valid. WAL is critical to a backup that's been taken in an online manner, no matter where it's stored.
A backup isn't valid without the WAL that's needed to reach consistency. Thanks, Stephen
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2020-03-27 14:34:19 -0400, Robert Haas wrote: > > I think #2 is an interesting idea and could possibly reduce the danger > > of user confusion on this point considerably - because, let's face it, > > not everyone is going to read the documentation. However, I'm having a > > hard time figuring out exactly what we'd print. Right now on success, > > unless you specify -q, you get: > > > > [rhaas ~]$ pg_validatebackup ~/pgslave > > backup successfully verified > > > > But it feels strange and possibly confusing to me to print something like: > > > > [rhaas ~]$ pg_validatebackup ~/pgslave > > backup successfully verified (except for pg_wal) > > You could print something like: > WAL necessary to restore this base backup can be validated with: > > pg_waldump -p ~/pgslave -t tl -s backup_start_location -e backup_end_loc > /dev/null && echo true > > Obviously that specific invocation sucks, but it'd not be hard to add an > option to waldump to not output anything. Interesting idea to use pg_waldump. I had suggested up-thread, and I'm still fine with, having pg_validatebackup scan the WAL and check the internal checksums. I'd prefer an option that uses hashes to check when the user has asked for hashes with SHA256 or something, but at least scanning the WAL and making sure it validates its internal checksum (and is actually all there, which is pretty darn critical) would be enough to say that we're pretty sure the backup is valid. Thanks, Stephen
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2020-03-27 15:20:27 -0400, Robert Haas wrote: > > On Fri, Mar 27, 2020 at 2:29 AM Andres Freund <andres@anarazel.de> wrote: > > > Hm. Should this warn if the directory's permissions are set too openly > > > (world writable?)? > > > > I don't think so, but it's pretty clear that different people have > > different ideas about what the scope of this tool ought to be, even in > > this first version. > > Yea. I don't have a strong opinion on this specific issue. I was mostly > wondering because I've repeatedly seen people restore backups with world > readable properties, and with that it's obviously possible for somebody > else to change the contents after the checksum was computed. For my 2c, at least, I don't think we need to check the directory permissions, but I wouldn't object to including a warning if they're set such that PG won't start. I suppose +0 for "warn if they are such that PG won't start". Thanks, Stephen
Hi, On 2020-03-27 17:44:07 -0400, Stephen Frost wrote: > * Andres Freund (andres@anarazel.de) wrote: > > On 2020-03-27 15:20:27 -0400, Robert Haas wrote: > > > On Fri, Mar 27, 2020 at 2:29 AM Andres Freund <andres@anarazel.de> wrote: > > > > Hm. Should this warn if the directory's permissions are set too openly > > > > (world writable?)? > > > > > > I don't think so, but it's pretty clear that different people have > > > different ideas about what the scope of this tool ought to be, even in > > > this first version. > > > > Yea. I don't have a strong opinion on this specific issue. I was mostly > > wondering because I've repeatedly seen people restore backups with world > > readable properties, and with that it's obviously possible for somebody > > else to change the contents after the checksum was computed. > > For my 2c, at least, I don't think we need to check the directory > permissions, but I wouldn't object to including a warning if they're set > such that PG won't start. I suppose +0 for "warn if they are such that > PG won't start". I was thinking of that check not being just at the top-level, but in subdirectories too. It's easy to screw up the top and subdirectory permissions in different ways, e.g. when manually creating the database dir and then restoring a data directory directly into that. IIRC postmaster doesn't check that at start. Greetings, Andres Freund
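The kind of recursive check Andres describes might look like the following sketch (hypothetical Python; the policy shown, flagging any group/world access, is an assumption for illustration rather than anything postmaster enforces today):

```python
import os
import stat

def find_open_permissions(root, disallowed=0o077):
    """Walk a restored data directory and report any file or directory
    whose mode grants group/world access, at any depth, since a correct
    top-level mode can hide overly-open subdirectories underneath."""
    problems = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Check the directory itself, then each plain file in it.
        for name in [""] + filenames:
            path = os.path.join(dirpath, name) if name else dirpath
            mode = stat.S_IMODE(os.lstat(path).st_mode)
            if mode & disallowed:
                problems.append((path, oct(mode)))
    return problems
```

A verification tool could emit one warning per entry returned, rather than failing outright, matching the "+0 for warn" sentiment above.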
Hi, On 2020-03-27 17:07:42 -0400, Stephen Frost wrote: > I had suggested up-thread, and I'm still fine with, having > pg_validatebackup scan the WAL and check the internal checksums. I'd > prefer an option that uses hashes to check when the user has asked for > hashes with SHA256 or something, but at least scanning the WAL and > making sure it validates its internal checksum (and is actually all > there, which is pretty darn critical) would be enough to say that we're > pretty sure the backup is valid. I'd say that actually parsing the WAL will give you a lot higher confidence than verifying a sha256 for each file. There's plenty of ways to screw up the pg_wal on the source server (I've seen several restore_commands doing so, particularly when eagerly fetching). Sure, it'll not help against an attacker, but I'm not sure I see the threat model. There's imo a cost argument against doing WAL verification by reading it, but that'd mostly be a factor when comparing against a faster whole-file checksum. Greetings, Andres Freund
Hi, On 2020-03-27 16:57:46 -0400, Stephen Frost wrote: > I really don't know what to say to this. WAL is absolutely critical to > a backup being valid. pgBackRest doesn't have a way to *just* validate > a backup today, unfortunately, but we're planning to support it in the > future and we will absolutely include in that validation checking all of > the WAL that's part of the backup. Could you please address the fact that just about everybody uses base backups + later WAL to have a short data loss window? Integrating the WAL files necessary to make the base backup consistent doesn't achieve much if we can't verify the WAL files afterwards. And fairly obviously pg_basebackup can't do much about WAL created after its invocation. Given that we need something separate to address that "verification hole", I don't see why it's useful to have a special case solution (or rather multiple ones, for stream and fetch) inside pg_basebackup. Greetings, Andres Freund
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2020-03-27 17:44:07 -0400, Stephen Frost wrote: > > * Andres Freund (andres@anarazel.de) wrote: > > > On 2020-03-27 15:20:27 -0400, Robert Haas wrote: > > > > On Fri, Mar 27, 2020 at 2:29 AM Andres Freund <andres@anarazel.de> wrote: > > > > > Hm. Should this warn if the directory's permissions are set too openly > > > > > (world writable?)? > > > > > > > > I don't think so, but it's pretty clear that different people have > > > > different ideas about what the scope of this tool ought to be, even in > > > > this first version. > > > > > > Yea. I don't have a strong opinion on this specific issue. I was mostly > > > wondering because I've repeatedly seen people restore backups with world > > > readable properties, and with that it's obviously possible for somebody > > > else to change the contents after the checksum was computed. > > > > For my 2c, at least, I don't think we need to check the directory > > permissions, but I wouldn't object to including a warning if they're set > > such that PG won't start. I suppose +0 for "warn if they are such that > > PG won't start". > > I was thinking of that check not being just at the top-level, but in > subdirectories too. It's easy to screw up the top and subdirectory > permissions in different ways, e.g. when manually creating the database > dir and then restoring a data directory directly into that. IIRC > postmaster doesn't check that at start. Yeah, I'm pretty sure we don't check that at postmaster start.. which also means that we'll start up just fine even if the perms on subdirectories are odd or wrong, unless maybe we end up in a really odd state where a directory is 000'd or something. Of course.. this is all a mess when it comes to pg_basebackup, really, as previously discussed elsewhere, because what permissions and such you end up with actually depends on what *format* you use with pg_basebackup- it's different between 'tar' format and 'plain' format. 
That is, if you use 'tar' format, and then actually use 'tar' to extract, you get one set of privs, but if you use 'plain', you get something different. I mean.. pgBackRest sets all perms to whatever is in the manifest on restore (or delta), but this patch doesn't include the permissions on files, or ownership (something pgBackRest also tries to set, if possible, on restore), does it...? Doesn't look like it on a quick look. So if we want to compare to pgBackRest then, yes, we should include the permissions in the manifest and we should check that everything in the manifest matches what's on the filesystem. I don't think we should just compare all permissions or ownership with some arbitrary idea of what we think they should be, even though if you use pg_basebackup in 'plain' format, you actually end up with differences, today, from what the source system has. In my view, that should actually be fixed, to the extent possible. Thanks, Stephen
On 2020-Mar-27, Stephen Frost wrote: > I don't think we should just compare all permissions or ownership with > some arbitrary idea of what we think they should be, even though if you > use pg_basebackup in 'plain' format, you actually end up with > differences, today, from what the source system has. In my view, that > should actually be fixed, to the extent possible. I posted some thoughts about this at https://www.postgresql.org/message-id/20190904201117.GA12986%40alvherre.pgsql I didn't get time to work on that myself. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2020-03-27 16:57:46 -0400, Stephen Frost wrote: > > I really don't know what to say to this. WAL is absolutely critical to > > a backup being valid. pgBackRest doesn't have a way to *just* validate > > a backup today, unfortunately, but we're planning to support it in the > > future and we will absolutely include in that validation checking all of > > the WAL that's part of the backup. > > Could you please address the fact that just about everybody uses base > backups + later WAL to have a short data loss window? Integrating the > WAL files necessary to make the base backup consistent doesn't achieve > much if we can't verify the WAL files afterwards. And fairly obviously > pg_basebackup can't do much about WAL created after its invocation. I feel like we have very different ideas about what "just about everybody" does here. In my view, folks use pg_basebackup because it's easy and they can create self-contained backups that include all the WAL needed to get the backup up and running again and they don't typically care about PITR all that much. Folks who care about PITR use something that manages WAL for them, which pg_basebackup and pg_receivewal really don't do and it's not easy to add scripting around them to figure out what WAL is needed for what backup, etc. If we didn't think that the ability to create a self-contained backup was useful, it sure seems odd that we've done a lot to make that work (having both fetch and stream modes for it) and that it's the default. > Given that we need something separate to address that "verification > hole", I don't see why it's useful to have a special case solution (or > rather multiple ones, for stream and fetch) inside pg_basebackup. 
Well, the proposal up-thread would end up with almost zero changes to pg_basebackup itself, but, yes, there'd be changes to BASE_BACKUP and different ones for STREAMING_REPLICATION to support getting the WAL checksums into the manifest. Thanks, Stephen
On 3/27/20 6:07 PM, Andres Freund wrote: > Hi, > > On 2020-03-27 16:57:46 -0400, Stephen Frost wrote: >> I really don't know what to say to this. WAL is absolutely critical to >> a backup being valid. pgBackRest doesn't have a way to *just* validate >> a backup today, unfortunately, but we're planning to support it in the >> future and we will absolutely include in that validation checking all of >> the WAL that's part of the backup. > > Could you please address the fact that just about everybody uses base > backups + later WAL to have a short data loss window? Integrating the > WAL files necessary to make the base backup consistent doesn't achieve > much if we can't verify the WAL files afterwards. And fairly obviously > pg_basebackup can't do much about WAL created after its invocation. > > Given that we need something separate to address that "verification > hole", I don't see why it's useful to have a special case solution (or > rather multiple ones, for stream and fetch) inside pg_basebackup. There's a pretty big difference between not being able to play forward to the end of WAL and not being able to get the backup to restore to consistency at all. The WAL that is generated during the backup has special importance. Without it you have no backup at all. It's the difference between *some* data loss and *total* data loss. Regards, -- -David david@pgmasters.net
On Thu, Mar 26, 2020 at 12:34:52PM -0400, Stephen Frost wrote: > * Robert Haas (robertmhaas@gmail.com) wrote: > > This is where I feel like I'm trying to make decisions in a vacuum. If > > we had a few more people weighing in on the thread on this point, I'd > > be happy to go with whatever the consensus was. If most people think > > having both --no-manifest (suppressing the manifest completely) and > > --manifest-checksums=none (suppressing only the checksums) is useless > > and confusing, then sure, let's rip the latter one out. If most people > > like the flexibility, let's keep it: it's already implemented and > > tested. But I hate to base the decision on what one or two people > > think. > > I'm frustrated at the lack of involvement from others also. Well, the topic of backup manifests feels like it has generated a lot of bickering emails, and people don't want to spend their time dealing with that. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Greetings,
On Fri, Mar 27, 2020 at 18:36 Bruce Momjian <bruce@momjian.us> wrote:
On Thu, Mar 26, 2020 at 12:34:52PM -0400, Stephen Frost wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
> > This is where I feel like I'm trying to make decisions in a vacuum. If
> > we had a few more people weighing in on the thread on this point, I'd
> > be happy to go with whatever the consensus was. If most people think
> > having both --no-manifest (suppressing the manifest completely) and
> > --manifest-checksums=none (suppressing only the checksums) is useless
> > and confusing, then sure, let's rip the latter one out. If most people
> > like the flexibility, let's keep it: it's already implemented and
> > tested. But I hate to base the decision on what one or two people
> > think.
>
> I'm frustrated at the lack of involvement from others also.
Well, the topic of backup manifests feels like it has generated a lot of
bickering emails, and people don't want to spend their time dealing with
that.
I’d like not to, either. I suppose it’s just that this is an area I’m particularly concerned with, which allows me to overcome that. Backups are important to me.
Thanks,
Stephen
On Fri, Mar 27, 2020 at 06:38:33PM -0400, Stephen Frost wrote: > Greetings, > > On Fri, Mar 27, 2020 at 18:36 Bruce Momjian <bruce@momjian.us> wrote: > > On Thu, Mar 26, 2020 at 12:34:52PM -0400, Stephen Frost wrote: > > * Robert Haas (robertmhaas@gmail.com) wrote: > > > This is where I feel like I'm trying to make decisions in a vacuum. If > > > we had a few more people weighing in on the thread on this point, I'd > > > be happy to go with whatever the consensus was. If most people think > > > having both --no-manifest (suppressing the manifest completely) and > > > --manifest-checksums=none (suppressing only the checksums) is useless > > > and confusing, then sure, let's rip the latter one out. If most people > > > like the flexibility, let's keep it: it's already implemented and > > > tested. But I hate to base the decision on what one or two people > > > think. > > > > I'm frustrated at the lack of involvement from others also. > > Well, the topic of backup manifests feels like it has generated a lot of > bickering emails, and people don't want to spend their time dealing with > that. > > > I’d like to not also. I suppose it’s just an area that I’m particularly > concerned with that allows me to overcome that. Backups are important to me. The big question is whether the discussion _needs_ to be that way. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Fri, Mar 27, 2020 at 01:53:54PM -0400, Robert Haas wrote: > - Replace a doc paragraph about the advantages and disadvantages of > CRC-32C with one by Stephen Frost, with a slight change by me that I > thought made it sound more grammatical. Defaulting to CRC-32C seems prudent to me: - As Andres Freund said, SHA-512 is slow relative to storage now available. Since gzip is a needlessly-slow choice for backups (or any application that copies the compressed data just a few times), comparison to "gzip -6" speed is immaterial. - While I'm sure some other fast hash would be a superior default, introducing a new algorithm is a bikeshed, as you said. This design makes it easy, technically, for someone to introduce a new algorithm later. CRC-32C is not catastrophically unfit for 1GiB files. - Defaulting to SHA-512 would, in the absence of a WAL archive that also uses a cryptographic hash function, give a false sense of having achieved some coherent cryptographic goal. With the CRC-32C default, WAL and the rest get similar protection. I'm discounting the case of using BASE_BACKUP without a WAL archive, because I expect little intersection between sites "worried enough to hash everything" and those "not worried enough to use an archive". (On the other hand, the program that manages the WAL archive can reasonably own hashing base backups; putting ownership in the server isn't achieving much extra.) > + <refnamediv> > + <refname>pg_validatebackup</refname> > + <refpurpose>verify the integrity of a base backup of a > + <productname>PostgreSQL</productname> cluster</refpurpose> > + </refnamediv> > + <listitem> > + <para> > + <literal>pg_wal</literal> is ignored because WAL files are sent > + separately from the backup, and are therefore not described by the > + backup manifest. > + </para> > + </listitem> Stephen Frost mentioned that a backup could pass validation even if pg_basebackup were killed after writing the base backup and before finishing the writing of pg_wal.
One might avoid that by simply writing the manifest to a temporary name and renaming it to the final name after populating pg_wal. What do you think of having the verification process also call pg_waldump to validate the WAL CRCs (shown upthread)? That looked helpful and simple. I think this functionality doesn't belong in its own program. If you suspect pg_basebackup or pg_restore will eventually gain the ability to merge incremental backups into a recovery-ready base backup, I would put the functionality in that program. Otherwise, I would put it in pg_checksums. For me, part of the friction here is that the program description indicates general verification, but the actual functionality merely checks hashes on a directory tree that happens to represent a PostgreSQL base backup. > + parse->pathname = palloc(raw_length + 1); I don't see this freed anywhere; is it? (It's useful to make peak memory consumption not grow in proportion to the number of files backed up.) [This message is not a full code review.]
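Noah's write-then-rename suggestion relies on rename() being atomic within a filesystem, so an interrupted backup never has a final manifest and therefore cannot pass validation. A minimal sketch (file names are illustrative, not what pg_basebackup actually uses for the temporary name):

```python
import os

def publish_manifest(backup_dir, manifest_bytes):
    """Write the manifest under a temporary name and rename() it into
    place only once the rest of the backup (including pg_wal) is
    complete; a crash before the rename leaves no final manifest."""
    tmp = os.path.join(backup_dir, "backup_manifest.tmp")
    final = os.path.join(backup_dir, "backup_manifest")
    with open(tmp, "wb") as f:
        f.write(manifest_bytes)
        f.flush()
        os.fsync(f.fileno())   # make the temp file durable first
    os.rename(tmp, final)      # atomic within a filesystem
```

Verification can then treat the absence of backup_manifest as a hard failure, which is exactly the property desired here.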
On Fri, Mar 27, 2020 at 4:32 PM Andres Freund <andres@anarazel.de> wrote: > I don't like having a file format that's intended to be used by external > tools too that's undocumented except for code that assembles it in a > piecemeal fashion. Do you mean in a follow-on patch this release, or > later? I don't have a problem with the former. This release. I'm happy to work on that as soon as this gets committed, assuming it gets committed. > I do found it to be circular. I think we mostly need a paragraph or two > somewhere that explains on a higher level what the point of verifying > base backups is and what is verified. Fair enough. > FWIW, I was thinking of backup_manifest.checksum potentially being > desirable for another reason: The need to embed the checksum inside the > document imo adds a fair bit of rigidity to the file format. See Well, David Steele suggested this approach. I didn't particularly like it, but nobody showed up to agree with me or propose anything different, so here we are. I don't think it's the end of the world. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Mar 28, 2020 at 11:40 PM Noah Misch <noah@leadboat.com> wrote: > Stephen Frost mentioned that a backup could pass validation even if > pg_basebackup were killed after writing the base backup and before finishing > the writing of pg_wal. One might avoid that by simply writing the manifest to > a temporary name and renaming it to the final name after populating pg_wal. Huh, that's an idea. I'll have a look at the code and see what would be involved. > What do you think of having the verification process also call pg_waldump to > validate the WAL CRCs (shown upthread)? That looked helpful and simple. I don't love calls to external binaries, but I think the thing that really bothers me is that pg_waldump is practically bound to terminate with an error, because the last WAL segment will end with a partial record. For the same reason, I think there's really no such thing as validating a single WAL file. I suppose you'd need to know the exact start and end locations for a minimal WAL replay and check that all records between those LSNs appear OK, ignoring any apparent problems after the minimum ending point, or at least ignoring any problems due to an incomplete record in the last file. We don't have a tool for that currently, and I don't think I can write one this week. Or at least, not a good one. > I think this functionality doesn't belong in its own program. If you suspect > pg_basebackup or pg_restore will eventually gain the ability to merge > incremental backups into a recovery-ready base backup, I would put the > functionality in that program. Otherwise, I would put it in pg_checksums. > For me, part of the friction here is that the program description indicates > general verification, but the actual functionality merely checks hashes on a > directory tree that happens to represent a PostgreSQL base backup. Suraj's original patch made this part of pg_basebackup, but I didn't really like that, because I wanted it to have its own set of options. 
I still think all the options I've added are pretty useful ones, and I can think of other things somebody might want to do. It feels very uncomfortable to make pg_basebackup, or pg_checksums, take either options from set A and do thing X, or options from set B and do thing Y. But it feels clear that the name pg_validatebackup is not going over very well with anyone. I think I should rename it to pg_validatemanifest. > > + parse->pathname = palloc(raw_length + 1); > > I don't see this freed anywhere; is it? (It's useful to make peak memory > consumption not grow in proportion to the number of files backed up.) We need the hash table to remain populated for the whole run time of the tool, because we're essentially doing a full join of the actual directory contents against the manifest contents. That's a bit unfortunate but it doesn't seem simple to improve. I think the only people who are really going to suffer are people who have an enormous pile of empty or nearly-empty relations. People who have large databases for the normal reason - i.e. a reasonable number of tables that hold a lot of data - will have manifests of very manageable size. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
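The full join Robert describes, manifest entries against actual directory contents, can be condensed into a sketch like this (hypothetical Python, not the tool's C implementation): entries are popped from the manifest table as files are found, so whatever remains at the end is missing, and anything on disk with no entry is extra.

```python
import os

def cross_check(manifest_sizes, data_dir):
    """Join actual directory contents against manifest entries and
    report extra files, missing files, and size mismatches.
    manifest_sizes maps relative path -> expected size in bytes."""
    remaining = dict(manifest_sizes)   # entries not yet seen on disk
    extra, mismatched = [], []
    for dirpath, _, filenames in os.walk(data_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, data_dir)
            expected = remaining.pop(rel, None)
            if expected is None:
                extra.append(rel)
            elif os.stat(path).st_size != expected:
                mismatched.append(rel)
    missing = sorted(remaining)        # manifest entries never matched
    return extra, missing, mismatched
```

This also makes the memory point concrete: the table holds one entry per manifest file for the whole run, which only matters for clusters with huge numbers of tiny relations.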
On Fri, Mar 27, 2020 at 4:02 PM David Steele <david@pgmasters.net> wrote: > I prefer to validate the size and checksum in the same pass, but I'm not > sure it's that big a deal. If the backup is being corrupted under the > validate process that would also apply to files that had already been > validated. I did it like this because I thought that in typical scenarios it would be likely to produce useful results more quickly. For instance, suppose that you forget to restore the tablespace directories, and just get the main $PGDATA directory. Well, if you do it all in one pass, you might spend a long time checksumming things before you realize that some files are completely missing. I thought it would be useful to complain about files that are extra or missing or the wrong size FIRST, because that only requires us to stat() each file, and only after that do the comparatively expensive checksumming step that requires us to read the entire contents of each file. Granted, unless you use --exit-on-error, you're going to get all the complaints eventually anyway, but you might use that option, or you might hit ^C when you start to see a slew of complaints popping out. Maybe that was the wrong idea, but I thought people would like the idea of running cheaper checks first. I wasn't worried about concurrent modification of the backup because then you're super-hosed no matter what. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
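The cheap-checks-first ordering described above might be sketched like so (hypothetical; the sha256 choice, message wording, and the approximation of --exit-on-error by raising are all assumptions): a stat()-only pass over every manifest entry, and only then a second pass that reads file contents.

```python
import hashlib
import os

def validate_backup(manifest, data_dir, exit_on_error=False):
    """Run the cheap stat()-based checks over every file first, then a
    second pass that reads contents for checksumming, so gross problems
    (missing files, wrong sizes) surface before any expensive I/O.
    manifest maps relative path -> (size, sha256 hex digest)."""
    errors = []

    def report(msg):
        errors.append(msg)
        if exit_on_error:
            raise SystemExit(msg)

    # Pass 1: existence and size only, via stat().
    for rel, (size, digest) in manifest.items():
        path = os.path.join(data_dir, rel)
        if not os.path.exists(path):
            report(f"missing file: {rel}")
        elif os.stat(path).st_size != size:
            report(f"size mismatch: {rel}")

    # Pass 2: read contents and verify checksums.
    for rel, (size, digest) in manifest.items():
        path = os.path.join(data_dir, rel)
        if not os.path.exists(path):
            continue   # already reported in pass 1
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        if h.hexdigest() != digest:
            report(f"checksum mismatch: {rel}")
    return errors
```

With this structure a forgotten tablespace shows up after the fast first pass, before any checksumming has begun.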
On 3/29/20 8:33 PM, Robert Haas wrote: > On Fri, Mar 27, 2020 at 4:32 PM Andres Freund <andres@anarazel.de> wrote: > >> FWIW, I was thinking of backup_manifest.checksum potentially being >> desirable for another reason: The need to embed the checksum inside the >> document imo adds a fair bit of rigidity to the file format. See > > Well, David Steele suggested this approach. I didn't particularly like > it, but nobody showed up to agree with me or propose anything > different, so here we are. I don't think it's the end of the world. I prefer the embedded checksum even though it is a pain. It's a lot less likely to go missing. -- -David david@pgmasters.net
On 3/29/20 8:42 PM, Robert Haas wrote: > On Sat, Mar 28, 2020 at 11:40 PM Noah Misch <noah@leadboat.com> wrote: >> I don't see this freed anywhere; is it? (It's useful to make peak memory >> consumption not grow in proportion to the number of files backed up.) > > We need the hash table to remain populated for the whole run time of > the tool, because we're essentially doing a full join of the actual > directory contents against the manifest contents. That's a bit > unfortunate but it doesn't seem simple to improve. I think the only > people who are really going to suffer are people who have an enormous > pile of empty or nearly-empty relations. People who have large > databases for the normal reason - i.e. a reasonable number of tables > that hold a lot of data - will have manifests of very manageable size. +1 -- -David david@pgmasters.net
Hi,

On 2020-03-29 20:47:40 -0400, Robert Haas wrote:
> Maybe that was the wrong idea, but I thought people would like the
> idea of running cheaper checks first. I wasn't worried about
> concurrent modification of the backup because then you're super-hosed
> no matter what.

I do like that approach. To be clear: I'm suggesting the additional crosscheck not because I'm not concerned with concurrent modifications, but because I've seen filesystem per-inode metadata and the actual data / extent tree differ, leading to EOF being reported while reading at a different place than the size from stat() would indicate.

Greetings,

Andres Freund
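The crosscheck being suggested — trusting neither stat() nor the read loop alone — could look roughly like this (hypothetical sketch; the function names are made up for illustration):

```python
import os

def readable_size(path, chunk_size=8192):
    """Count the bytes actually readable from the file; on a damaged
    filesystem this can differ from what stat() reports."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total

def check_sizes(path):
    """Report a problem if per-inode metadata and the data disagree."""
    stat_size = os.stat(path).st_size
    read_size = readable_size(path)
    if stat_size != read_size:
        return "EOF at %d bytes but stat() reports %d for %s" % (
            read_size, stat_size, path)
    return None
```

Since a checksum pass has to read every byte anyway, this extra comparison is nearly free once you are already doing pass two.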
On 3/29/20 8:47 PM, Robert Haas wrote:
> On Fri, Mar 27, 2020 at 4:02 PM David Steele <david@pgmasters.net> wrote:
>> I prefer to validate the size and checksum in the same pass, but I'm not
>> sure it's that big a deal. If the backup is being corrupted under the
>> validate process that would also apply to files that had already been
>> validated.
>
> I did it like this because I thought that in typical scenarios it
> would be likely to produce useful results more quickly. For instance,
> suppose that you forget to restore the tablespace directories, and
> just get the main $PGDATA directory. Well, if you do it all in one
> pass, you might spend a long time checksumming things before you
> realize that some files are completely missing. I thought it would be
> useful to complain about files that are extra or missing or the wrong
> size FIRST, because that only requires us to stat() each file, and
> only after that do the comparatively expensive checksumming step that
> requires us to read the entire contents of each file. Granted, unless
> you use --exit-on-error, you're going to get all the complaints
> eventually anyway, but you might use that option, or you might hit ^C
> when you start to see a slew of complaints popping out.

Yeah, that seems reasonable. In our case backups are nearly always compressed and/or encrypted so even checking the original size is a bit of work. Getting the checksum at the same time seems like an obvious win.

Currently we don't have a separate validate command outside of restore, but when we do we'll consider doing a pass to check for file presence (and size when possible) first. Thanks!

> I wasn't worried about
> concurrent modification of the backup because then you're super-hosed
> no matter what.

Really, really, super-hosed.

Regards,
--
-David
david@pgmasters.net
Hi, On 2020-03-29 20:42:35 -0400, Robert Haas wrote: > > What do you think of having the verification process also call pg_waldump to > > validate the WAL CRCs (shown upthread)? That looked helpful and simple. > > I don't love calls to external binaries, but I think the thing that > really bothers me is that pg_waldump is practically bound to terminate > with an error, because the last WAL segment will end with a partial > record. I don't think that's the case here. You should know the last required record, which should allow to specify the precise end for pg_waldump. If it errors out reading to that point, we'd be in trouble. > For the same reason, I think there's really no such thing as > validating a single WAL file. I suppose you'd need to know the exact > start and end locations for a minimal WAL replay and check that all > records between those LSNs appear OK, ignoring any apparent problems > after the minimum ending point, or at least ignoring any problems due > to an incomplete record in the last file. We don't have a tool for > that currently, and I don't think I can write one this week. Or at > least, not a good one. pg_waldump -s / -e? > > > + parse->pathname = palloc(raw_length + 1); > > > > I don't see this freed anywhere; is it? (It's useful to make peak memory > > consumption not grow in proportion to the number of files backed up.) > > We need the hash table to remain populated for the whole run time of > the tool, because we're essentially doing a full join of the actual > directory contents against the manifest contents. That's a bit > unfortunate but it doesn't seem simple to improve. I think the only > people who are really going to suffer are people who have an enormous > pile of empty or nearly-empty relations. People who have large > databases for the normal reason - i.e. a reasonable number of tables > that hold a lot of data - will have manifests of very manageable size. 
Given that that's a pre-existing issue - at a significantly larger scale imo - e.g. for pg_dump (even in the --schema-only case), and that there are tons of backend-side issues with lots of relations too, I think that's fine.

You could of course implement something merge-join-like, and implement the sorted input via a disk-based sort. But that's a lot of work (good luck making tuplesort work in the frontend...). So I'd not go there unless there's a lot of evidence this is a serious practical issue.

If we find this uses too much memory, I think we'd be better off condensing pathnames into either fewer allocations, or a RelFileNode as part of the struct (with a fallback to string for other types of files). But I'd also not go there for now.

Greetings,

Andres Freund
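The pathname-condensing idea might look something like this sketch (hypothetical; the path pattern follows the usual base/<dboid>/<relfilenode> on-disk layout, but the function and its fallback behavior are made up for illustration):

```python
import re

# Condense a relation file path into a compact (dboid, relfilenode, fork,
# segno) tuple, falling back to the original string for other files.
RELFILE_RE = re.compile(r"^base/(\d+)/(\d+)(?:_(vm|fsm|init))?(?:\.(\d+))?$")

def condense(path):
    m = RELFILE_RE.match(path)
    if m is None:
        return path                     # fallback: keep the string
    dboid, relnode, fork, segno = m.groups()
    return (int(dboid), int(relnode), fork or "main",
            int(segno) if segno else 0)
```

A fixed-size tuple key like this keeps the hash table's per-entry cost flat even for installations with huge numbers of small relations, which is the case being worried about above.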
On 3/29/20 9:07 PM, Andres Freund wrote: > On 2020-03-29 20:42:35 -0400, Robert Haas wrote: >>> What do you think of having the verification process also call pg_waldump to >>> validate the WAL CRCs (shown upthread)? That looked helpful and simple. >> >> I don't love calls to external binaries, but I think the thing that >> really bothers me is that pg_waldump is practically bound to terminate >> with an error, because the last WAL segment will end with a partial >> record. > > I don't think that's the case here. You should know the last required > record, which should allow to specify the precise end for pg_waldump. If > it errors out reading to that point, we'd be in trouble. Exactly. All WAL generated during the backup should read fine with pg_waldump or there is a problem. -- -David david@pgmasters.net
Hi, On 2020-03-29 21:23:06 -0400, David Steele wrote: > On 3/29/20 9:07 PM, Andres Freund wrote: > > On 2020-03-29 20:42:35 -0400, Robert Haas wrote: > > > > What do you think of having the verification process also call pg_waldump to > > > > validate the WAL CRCs (shown upthread)? That looked helpful and simple. > > > > > > I don't love calls to external binaries, but I think the thing that > > > really bothers me is that pg_waldump is practically bound to terminate > > > with an error, because the last WAL segment will end with a partial > > > record. > > > > I don't think that's the case here. You should know the last required > > record, which should allow to specify the precise end for pg_waldump. If > > it errors out reading to that point, we'd be in trouble. > > Exactly. All WAL generated during the backup should read fine with > pg_waldump or there is a problem. See the attached minimal prototype for what I am thinking of. This would not correctly handle the case where the timeline changes while taking a base backup. But I'm not sure that'd be all that serious a limitation for now? I'd personally not want to use a base backup that included a timeline switch... Greetings, Andres Freund
Attachment
On Sun, Mar 29, 2020 at 08:42:35PM -0400, Robert Haas wrote: > On Sat, Mar 28, 2020 at 11:40 PM Noah Misch <noah@leadboat.com> wrote: > > I think this functionality doesn't belong in its own program. If you suspect > > pg_basebackup or pg_restore will eventually gain the ability to merge > > incremental backups into a recovery-ready base backup, I would put the > > functionality in that program. Otherwise, I would put it in pg_checksums. > > For me, part of the friction here is that the program description indicates > > general verification, but the actual functionality merely checks hashes on a > > directory tree that happens to represent a PostgreSQL base backup. > > Suraj's original patch made this part of pg_basebackup, but I didn't > really like that, because I wanted it to have its own set of options. > I still think all the options I've added are pretty useful ones, and I > can think of other things somebody might want to do. It feels very > uncomfortable to make pg_basebackup, or pg_checksums, take either > options from set A and do thing X, or options from set B and do thing > Y. pg_checksums does already have that property, for what it's worth. (More specifically, certain options dictate the mode, and it reports an error if another option is incompatible with the mode.) > But it feels clear that the name pg_validatebackup is not going > over very well with anyone. I think I should rename it to > pg_validatemanifest. Between those two, I would use "pg_validatebackup" if there's a fair chance it will end up doing the pg_waldump check. Otherwise, I would use "pg_validatemanifest". I still most prefer delivering this as a mode of an existing program. > > > + parse->pathname = palloc(raw_length + 1); > > > > I don't see this freed anywhere; is it? (It's useful to make peak memory > > consumption not grow in proportion to the number of files backed up.) 
> > We need the hash table to remain populated for the whole run time of > the tool, because we're essentially doing a full join of the actual > directory contents against the manifest contents. That's a bit > unfortunate but it doesn't seem simple to improve. I think the only > people who are really going to suffer are people who have an enormous > pile of empty or nearly-empty relations. People who have large > databases for the normal reason - i.e. a reasonable number of tables > that hold a lot of data - will have manifests of very manageable size. Okay.
On Mon, Mar 30, 2020 at 11:28 AM Noah Misch <noah@leadboat.com> wrote: > > On Sun, Mar 29, 2020 at 08:42:35PM -0400, Robert Haas wrote: > > > But it feels clear that the name pg_validatebackup is not going > > over very well with anyone. I think I should rename it to > > pg_validatemanifest. > > Between those two, I would use "pg_validatebackup" if there's a fair chance it > will end up doing the pg_waldump check. Otherwise, I would use > "pg_validatemanifest". > +1. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sun, Mar 29, 2020 at 10:08 PM Andres Freund <andres@anarazel.de> wrote:
> See the attached minimal prototype for what I am thinking of.
>
> This would not correctly handle the case where the timeline changes
> while taking a base backup. But I'm not sure that'd be all that serious
> a limitation for now?
>
> I'd personally not want to use a base backup that included a timeline
> switch...

Interesting concept. I've never (or almost never) used the -s and -e options to pg_waldump, so I didn't think about using those. I think having a --just-parse option to pg_waldump is a good idea, though maybe not with that name; e.g. we could call it --quiet.

It is less obvious to me what to do about all that as it pertains to the current patch. If we want pg_validatebackup to run pg_waldump in that mode or print out a hint about how to run pg_waldump in that mode, it would need to obtain the relevant LSNs. I guess that would require reading the backup_label file. It's not clear to me what we would do if the backup crosses a timeline switch, assuming that's even a case pg_basebackup allows. If we don't want to do anything in pg_validatebackup automatically but just want to document this as a possible technique, we could finesse that problem with some weasel-wording.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
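Obtaining the relevant LSNs from backup_label could be roughly this simple (hypothetical sketch; START WAL LOCATION and START TIMELINE are real backup_label fields, but note the stop LSN is not recorded in the label, so a validator would need it from somewhere else):

```python
import re

def parse_backup_label(text):
    """Pull the start LSN and timeline out of a backup_label file.

    The stop LSN is not in backup_label itself; a validator would have to
    get it elsewhere (e.g. from the manifest, if it were stored there).
    """
    info = {}
    m = re.search(r"^START WAL LOCATION: ([0-9A-F]+/[0-9A-F]+)", text, re.M)
    if m:
        info["start_lsn"] = m.group(1)
    m = re.search(r"^START TIMELINE: (\d+)", text, re.M)
    if m:
        info["timeline"] = int(m.group(1))
    return info
```

That missing stop LSN is part of why storing the WAL range in the manifest keeps coming up in this thread.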
Hi, On 2020-03-30 14:35:40 -0400, Robert Haas wrote: > On Sun, Mar 29, 2020 at 10:08 PM Andres Freund <andres@anarazel.de> wrote: > > See the attached minimal prototype for what I am thinking of. > > > > This would not correctly handle the case where the timeline changes > > while taking a base backup. But I'm not sure that'd be all that serious > > a limitation for now? > > > > I'd personally not want to use a base backup that included a timeline > > switch... > > Interesting concept. I've never (or almost never) used the -s and -e > options to pg_waldump, so I didn't think about using those. Oh - it's how I use it most of the time when investigating a specific problem. I just about always use -s, and often -e. Besides just reducing the logging output, and avoiding spurious errors, it makes it a lot easier to iteratively expand the logging for records that are problematic for the case at hand. > I think > having a --just-parse option to pg_waldump is a good idea, though > maybe not with that name e.g. we could call it --quiet. Yea, I didn't like the option's name. It's just the first thing that came to mind. > It is less obvious to me what to do about all that as it pertains to > the current patch. FWIW, I personally think we can live with this not validating WAL in the first release. But I also think it'd be within reach to do better and allow for WAL verification. > If we want pg_validatebackup to run pg_waldump in that mode or print > out a hint about how to run pg_waldump in that mode, it would need to > obtain the relevant LSNs. We could just include those in the manifest. Seems like good information to have in there to me, as it allows to build the complete list of files needed for a restore. > It's not clear to me what we would do if the backup crosses a timeline > switch, assuming that's even a case pg_basebackup allows. I've not tested it, but it sure looks like it's possible. 
Both by having a standby replaying from a node that promotes (multiple timeline switches possible too, I think, if the WAL source follows timelines), and by backing up from a standby that's being promoted.

> If we don't want to do anything in pg_validatebackup automatically but
> just want to document this as a possible technique, we could finesse
> that problem with some weasel-wording.

It'd probably not be too hard to simply emit multiple commands, one for each timeline "segment". I wonder if it'd not be best, independent of whether we build in this verification, to include that metadata in the manifest file. That's for sure better than having to build a separate tool to parse timeline history files.

I think it wouldn't be too hard to compute that information while taking the base backup. We know the end timeline (ThisTimeLineID), so we can just call readTimeLineHistory(ThisTimeLineID). Which should then allow for something pretty trivial along the lines of:

    timelines = readTimeLineHistory(ThisTimeLineID);

    last_start = InvalidXLogRecPtr;
    foreach(lc, timelines)
    {
        TimeLineHistoryEntry *he = lfirst(lc);

        if (he->end < startptr)
            continue;

        // manifest_emit_wal_range(Min(he->begin, startptr), he->end);

        last_start = he->end;
    }

    if (last_start == InvalidXLogRecPtr)
        start = startptr;
    else
        start = last_start;

    manifest_emit_wal_range(start, endptr);

Btw, just in case somebody suggests it: I don't think it's possible to compute the WAL checksums at this point. In stream mode WAL very well might already have been removed.

Greetings,

Andres Freund
On Mon, Mar 30, 2020 at 2:24 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Between those two, I would use "pg_validatebackup" if there's a fair chance it
> > will end up doing the pg_waldump check. Otherwise, I would use
> > "pg_validatemanifest".
>
> +1.

I guess I'd like to be clear here that I have no fundamental disagreement with taking this tool in any direction that people would like it to go. For me it's just a question of timing. Feature freeze is now a week or so away, and nothing complicated is going to get done in that time. If we can all agree on something simple based on Andres's recent proposal, cool, but I'm not yet sure that will be the case, so what's plan B?

We could decide that what I have here is just too little to be a viable facility on its own, but I think Stephen is the only one taking that position. We could release it as pg_validatemanifest with a plan to rename it if other backup-related checks are added later. We could release it as pg_validatebackup with the idea to avoid having to rename it when more backup-related checks are added later, but with a greater possibility of confusion in the meantime and no hard guarantee that anyone will actually develop such checks. We could put it into pg_checksums, but I think that's really backing ourselves into a corner: if backup validation develops other checks that are not checksum-related, what then? I'd much rather gamble on keeping things together by topic (backup) than technology used internally (checksum). Putting it into pg_basebackup is another option, and would avoid that problem, but it's not my preferred option, because as I noted before, I think the command-line options will get confusing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2020-03-30 15:04:55 -0400, Robert Haas wrote:
> I guess I'd like to be clear here that I have no fundamental
> disagreement with taking this tool in any direction that people would
> like it to go. For me it's just a question of timing. Feature freeze
> is now a week or so away, and nothing complicated is going to get done
> in that time. If we can all agree on something simple based on
> Andres's recent proposal, cool, but I'm not yet sure that will be the
> case, so what's plan B? We could decide that what I have here is just
> too little to be a viable facility on its own, but I think Stephen is
> the only one taking that position. We could release it as
> pg_validatemanifest with a plan to rename it if other backup-related
> checks are added later. We could release it as pg_validatebackup with
> the idea to avoid having to rename it when more backup-related checks
> are added later, but with a greater possibility of confusion in the
> meantime and no hard guarantee that anyone will actually develop such
> checks. We could put it into pg_checksums, but I think that's really
> backing ourselves into a corner: if backup validation develops other
> checks that are not checksum-related, what then? I'd much rather
> gamble on keeping things together by topic (backup) than technology
> used internally (checksum). Putting it into pg_basebackup is another
> option, and would avoid that problem, but it's not my preferred
> option, because as I noted before, I think the command-line options
> will get confusing.

I'm mildly inclined to name it pg_validate, pg_validate_dbdir or such. And eventually (definitely not this release) subsume pg_checksums in it. That way we can add other checkers too.

I don't really see a point in ending up with lots of different commands over time. Partially because there are probably plenty of checks where the overall cost can be drastically reduced by combining IO. Partially because there's probably plenty of shareable infrastructure.
And partially because I think it makes discovery for users a lot easier. Greetings, Andres Freund
On Mon, Mar 30, 2020 at 2:59 PM Andres Freund <andres@anarazel.de> wrote:
> I wonder if it'd not be best, independent of whether we build in this
> verification, to include that metadata in the manifest file. That's for
> sure better than having to build a separate tool to parse timeline
> history files.

I don't think that's better, or at least not "for sure better". The backup_label is going to include the START TIMELINE, and if -Xfetch is used, we're also going to have all the timeline history files. If the backup manifest includes those same pieces of information, then we've got two sources of truth: one copy in the files the server's actually going to read, and another copy in the backup_manifest which we're going to potentially use for validation but ignore at runtime. That seems not great.

> Btw, just in case somebody suggests it: I don't think it's possible to
> compute the WAL checksums at this point. In stream mode WAL very well
> might already have been removed.

Right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 27, 2020 at 3:51 PM David Steele <david@pgmasters.net> wrote: > There appear to be conflicts with 67e0adfb3f98: Rebased. > > + Specifies the algorithm that should be used to checksum > each file > > + for purposes of the backup manifest. Currently, the available > > perhaps "for inclusion in the backup manifest"? Anyway, I think this > sentence is awkward. I changed it to "Specifies the checksum algorithm that should be applied to each file included in the backup manifest." I hope that's better. I also added, in both of the places where this text occurs, an explanation a little higher up of what a backup manifest actually is. > > + because the files themselves do not need to read. > > should be "need to be read". Fixed. > > + the manifest itself will always contain a > <literal>SHA256</literal> > > I think just "the manifest will always contain" is fine. OK. > > + manifeste itself, and is therefore ignored. Note that the > manifest > > typo "manifeste", perhaps remove itself. OK, fixed. > > { "Path": "backup_label", "Size": 224, "Last-Modified": "2020-03-27 > 18:33:18 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "b914bec9" }, > > Storing the checksum type with each file seems pretty redundant. > Perhaps that could go in the header? You could always override if a > specific file had a different checksum type, though that seems unlikely. > > In general it might be good to go with shorter keys: "mod", "chk", etc. > Manifests can get pretty big and that's a lot of extra bytes. > > I'm also partial to using epoch time in the manifest because it is > generally easier for programs to work with. But, human-readable doesn't > suck, either. It doesn't seem impossible for it to come up; for example, consider a file-level incremental backup facility. You might retain whatever checksums you have for the unchanged files (to avoid rereading them) and add checksums for modified or added files. 
I am not convinced that minimizing the size of the file here is a particularly important goal, because I don't think it's going to get that big in normal cases. I also think having the keys and values be easily understandable by human beings is a plus. If we really want a minimal format without redundancy, we should've gone with what I proposed before (though admittedly that could've been tamped down even further if we'd cared to squeeze, which I didn't think was important then either).

>>     if (maxrate > 0)
>>         maxrate_clause = psprintf("MAX_RATE %u", maxrate);
>> +   if (manifest)
>>
>> A linefeed here would be nice.

Added.

>> + manifestfile *tabent;
>>
>> This is an odd name. A holdover from the tab-delimited version?

No, it was meant to stand for table entry. (Now we find out what happens when I break my own rule against using abbreviated words.)

>> + printf(_("Usage:\n %s [OPTION]... BACKUPDIR\n\n"), progname);
>>
>> When I ran pg_validatebackup I expected to use -D to specify the backup
>> dir since pg_basebackup does. On the other hand -D is weird because I
>> *really* expect that to be the pg data dir.
>>
>> But, do we want this to be different from pg_basebackup?

I think it's pretty distinguishable, because pg_basebackup needs an input (server) and an output (directory), whereas pg_validatebackup only needs one. I don't really care if we want to change it, but I was thinking of this as being more analogous to, say, pg_resetwal. Granted, that's a danger-don't-use-this tool and this isn't, but I don't think we want the -D-is-optional behavior that tools like pg_ctl have, because having a tool that isn't supposed to be used on a running cluster default to $PGDATA seems inadvisable. And if the argument is mandatory then it's not clear to me why we should make people type -D in front of it.

>> + checksum_length = checksum_string_length / 2;
>>
>> This check is defeated if a single character is added to the checksum.
>> Not too big a deal since you still get an error, but still.

I don't see what the problem is here. We speculatively divide by two and allocate memory assuming that the value was even, but then before doing anything critical we bail out if it was actually odd. That's harmless. We could get around it by saying:

    if (checksum_string_length % 2 != 0)
        context->error_cb(...);
    checksum_length = checksum_string_length / 2;
    checksum_payload = palloc(checksum_length);
    if (!hexdecode_string(...))
        context->error_cb(...);

...but that would be adding additional code, and error messages, for what's basically a can't-happen-unless-the-user-is-messing-with-us case.

>> + * Verify that the manifest checksum is correct.
>>
>> This is not working the way I would expect -- I could freely modify the
>> manifest without getting a checksum error on the manifest. For example:
>>
>> $ /home/vagrant/test/pg/bin/pg_validatebackup test/backup3
>> pg_validatebackup: fatal: invalid checksum for file "backup_label":
>> "408901e0814f40f8ceb7796309a59c7248458325a21941e7c55568e381f53831?"
>>
>> So, if I deleted the entry above, I got a manifest checksum error. But
>> if I just modified the checksum I get a file checksum error with no
>> manifest checksum error.
>>
>> I would prefer a manifest checksum error in all cases where it is wrong,
>> unless --exit-on-error is specified.

I think I would too, but I'm confused as to what you're doing, because if I just modified the manifest -- by deleting a file, for example, or changing the checksum of a file, I just get:

pg_validatebackup: fatal: manifest checksum mismatch

I'm confused as to why you're not seeing that. What's the exact sequence of steps?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
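To illustrate the point about the speculative divide-by-two in a hypothetical decoder (Python standing in for the C code here): integer division just rounds the odd case down, and the decode step still rejects the input cleanly, so no separate odd-length check is needed.

```python
def decode_checksum(hex_string):
    """Decode a hex checksum, rejecting odd-length or non-hex input.

    Integer division rounds the odd case down, so the "speculative"
    length // 2 is harmless: the decode step still fails cleanly.
    """
    checksum_length = len(hex_string) // 2   # speculative, as in the C code
    try:
        payload = bytes.fromhex(hex_string)
    except ValueError:
        return None                          # odd length or bad digit
    assert len(payload) == checksum_length
    return payload
```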
Attachment
On Sun, Mar 29, 2020 at 9:05 PM David Steele <david@pgmasters.net> wrote: > Yeah, that seems reasonable. > > In our case backups are nearly always compressed and/or encrypted so > even checking the original size is a bit of work. Getting the checksum > at the same time seems like an obvious win. Makes sense. If this even got extended so it could read from tar-files instead of the filesystem directly, we'd surely want to take the opposite approach and just make a single pass. I'm not sure whether it's worth doing that at some point in the future, but it might be. If we're going to add the capability to compress or encrypt backups to pg_basebackup, we might want to do that first, and then make this tool handle all of those formats in one go. (As always, I don't have the ability to control how arbitrary developers spend their development time... so this is just a thought.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On 2020-03-30 15:23:08 -0400, Robert Haas wrote:
> On Mon, Mar 30, 2020 at 2:59 PM Andres Freund <andres@anarazel.de> wrote:
> > I wonder if it'd not be best, independent of whether we build in this
> > verification, to include that metadata in the manifest file. That's for
> > sure better than having to build a separate tool to parse timeline
> > history files.
>
> I don't think that's better, or at least not "for sure better". The
> backup_label is going to include the START TIMELINE, and if -Xfetch is
> used, we're also going to have all the timeline history files. If the
> backup manifest includes those same pieces of information, then we've
> got two sources of truth: one copy in the files the server's actually
> going to read, and another copy in the backup_manifest which we're
> going to potentially use for validation but ignore at runtime. That
> seems not great.

The data in the backup label isn't sufficient though. Without having parsed the timeline file there's no way to verify that the correct WAL is present. I guess we can also add client-side tools to parse timelines, add commands to fetch all of the required files, and then interpret that somehow. But that seems much more complicated.

Imo it makes sense to want to be able to verify that WAL looks correct even when transporting WAL using another method (say archiving) and thus using pg_basebackup's -Xnone. For the manifest to actually list what's required for the base backup doesn't seem redundant to me. Imo it makes the manifest file make a good bit more sense, since afterwards it actually describes the whole base backup.

Taking the redundancy argument a bit further you can argue that we don't need a list of relation files at all, since they're in the catalog :P. Obviously going to that extreme doesn't make all that much sense... But I do think it's a second source of truth that's independent of what the backends actually are going to read.

Greetings,

Andres Freund
On 3/30/20 5:08 PM, Andres Freund wrote:
> The data in the backup label isn't sufficient though. Without having
> parsed the timeline file there's no way to verify that the correct WAL
> is present. I guess we can also add client-side tools to parse
> timelines, add commands to fetch all of the required files, and then
> interpret that somehow.
>
> But that seems much more complicated.
>
> Imo it makes sense to want to be able to verify that WAL looks correct
> even when transporting WAL using another method (say archiving) and
> thus using pg_basebackup's -Xnone.
>
> For the manifest to actually list what's required for the base backup
> doesn't seem redundant to me. Imo it makes the manifest file make a
> good bit more sense, since afterwards it actually describes the whole
> base backup.

FWIW, pgBackRest stores the backup WAL stop/start in the manifest. To get this information after the backup is complete requires parsing the .backup file, which doesn't get stored in the backup directory by pg_basebackup. As far as I know, this is only accessible to solutions that implement archive_command.

So, pgBackRest could do that but it seems far more trouble than it is worth.

Regards,
--
-David
david@pgmasters.net
On 3/30/20 4:16 PM, Robert Haas wrote:
> On Fri, Mar 27, 2020 at 3:51 PM David Steele <david@pgmasters.net> wrote:
>
>> { "Path": "backup_label", "Size": 224, "Last-Modified": "2020-03-27
>> 18:33:18 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "b914bec9" },
>>
>> Storing the checksum type with each file seems pretty redundant.
>> Perhaps that could go in the header? You could always override if a
>> specific file had a different checksum type, though that seems unlikely.
>>
>> In general it might be good to go with shorter keys: "mod", "chk", etc.
>> Manifests can get pretty big and that's a lot of extra bytes.
>>
>> I'm also partial to using epoch time in the manifest because it is
>> generally easier for programs to work with. But, human-readable doesn't
>> suck, either.
>
> It doesn't seem impossible for it to come up; for example, consider a
> file-level incremental backup facility. You might retain whatever
> checksums you have for the unchanged files (to avoid rereading them)
> and add checksums for modified or added files.

OK.

> I am not convinced that minimizing the size of the file here is a
> particularly important goal, because I don't think it's going to get
> that big in normal cases. I also think having the keys and values be
> easily understandable by human beings is a plus. If we really want a
> minimal format without redundancy, we should've gone with what I
> proposed before (though admittedly that could've been tamped down even
> further if we'd cared to squeeze, which I didn't think was important
> then either).

Well, "normal cases" is the key. But fine; in general we have found that the in-memory representation is more important in terms of supporting clusters with very large numbers of files.

>> When I ran pg_validatebackup I expected to use -D to specify the backup
>> dir since pg_basebackup does. On the other hand -D is weird because I
>> *really* expect that to be the pg data dir.
>> >> But, do we want this to be different from pg_basebackup? > I think it's pretty distinguishable, because pg_basebackup needs an > input (server) and an output (directory), whereas pg_validatebackup > only needs one. I don't really care if we want to change it, but I was > thinking of this as being more analogous to, say, pg_resetwal. > Granted, that's a danger-don't-use-this tool and this isn't, but I > don't think we want the -D-is-optional behavior that tools like pg_ctl > have, because having a tool that isn't supposed to be used on a > running cluster default to $PGDATA seems inadvisable. And if the > argument is mandatory then it's not clear to me why we should make > people type -D in front of it. Honestly I think pg_basebackup is the confusing one, because in most cases -D points at the running cluster dir. So, OK. >> > + checksum_length = checksum_string_length / 2; >> >> This check is defeated if a single character is added to the checksum. >> >> Not too big a deal since you still get an error, but still. > I don't see what the problem is here. We speculatively divide by two > and allocate memory assuming the value was even, but then > before doing anything critical we bail out if it was actually odd. > That's harmless. We could get around it by saying: > > if (checksum_string_length % 2 != 0) > context->error_cb(...); > checksum_length = checksum_string_length / 2; > checksum_payload = palloc(checksum_length); > if (!hexdecode_string(...)) > context->error_cb(...); > > ...but that would be adding additional code, and error messages, for > what's basically a can't-happen-unless-the-user-is-messing-with-us > case. Sorry, pasted the wrong code and even then still didn't get it quite right.
The problem: If I remove an even number of characters from a checksum, the file checksum check appears to pass but the manifest checksum fails: $ pg_basebackup -D test/backup5 --manifest-checksums=SHA256 $ vi test/backup5/backup_manifest * Remove two characters from the checksum of backup_label $ pg_validatebackup test/backup5 pg_validatebackup: fatal: manifest checksum mismatch But if I add any number of characters or remove an odd number of characters I get: pg_validatebackup: fatal: invalid checksum for file "backup_label": "a98e9164fd59d498d14cfdf19c67d1c2208a30e7b939d1b4a09f524c7adfc11fXX" and no manifest checksum failure. >> > + * Verify that the manifest checksum is correct. >> >> This is not working the way I would expect -- I could freely modify the >> manifest without getting a checksum error on the manifest. For example: >> >> $ /home/vagrant/test/pg/bin/pg_validatebackup test/backup3 >> pg_validatebackup: fatal: invalid checksum for file "backup_label": >> "408901e0814f40f8ceb7796309a59c7248458325a21941e7c55568e381f53831?" >> >> So, if I deleted the entry above, I got a manifest checksum error. But >> if I just modified the checksum I get a file checksum error with no >> manifest checksum error. >> >> I would prefer a manifest checksum error in all cases where it is wrong, >> unless --exit-on-error is specified. > I think I would too, but I'm confused as to what you're doing, because > if I just modified the manifest -- by deleting a file, for example, or > changing the checksum of a file, I just get: > > pg_validatebackup: fatal: manifest checksum mismatch > > I'm confused as to why you're not seeing that. What's the exact > sequence of steps?
$ pg_basebackup -D test/backup5 --manifest-checksums=SHA256 $ vi test/backup5/backup_manifest * Add 'X' to the checksum of backup_label $ pg_validatebackup test/backup5 pg_validatebackup: fatal: invalid checksum for file "backup_label": "a98e9164fd59d498d14cfdf19c67d1c2208a30e7b939d1b4a09f524c7adfc11fX" No mention of the manifest checksum being invalid. But if I remove the backup label file from the manifest: pg_validatebackup: fatal: manifest checksum mismatch -- -David david@pgmasters.net
On Mon, Mar 30, 2020 at 12:16:31PM -0700, Andres Freund wrote: > On 2020-03-30 15:04:55 -0400, Robert Haas wrote: > > I guess I'd like to be clear here that I have no fundamental > > disagreement with taking this tool in any direction that people would > > like it to go. For me it's just a question of timing. Feature freeze > > is now a week or so away, and nothing complicated is going to get done > > in that time. If we can all agree on something simple based on > > Andres's recent proposal, cool, but I'm not yet sure that will be the > > case, so what's plan B? We could decide that what I have here is just > > too little to be a viable facility on its own, but I think Stephen is > > the only one taking that position. We could release it as > > pg_validatemanifest with a plan to rename it if other backup-related > > checks are added later. We could release it as pg_validatebackup with > > the idea to avoid having to rename it when more backup-related checks > > are added later, but with a greater possibility of confusion in the > > meantime and no hard guarantee that anyone will actually develop such > > checks. We could put it in to pg_checksums, but I think that's really > > backing ourselves into a corner: if backup validation develops other > > checks that are not checksum-related, what then? I'd much rather > > gamble on keeping things together by topic (backup) than technology > > used internally (checksum). Putting it into pg_basebackup is another > > option, and would avoid that problem, but it's not my preferred > > option, because as I noted before, I think the command-line options > > will get confusing. > > I'm mildly inclined to name it pg_validate, pg_validate_dbdir or > such. And eventually (definitely not this release) subsume pg_checksums > in it. That way we can add other checkers too. Works for me; of those two, I prefer pg_validate.
On Tue, Mar 31, 2020 at 11:10 AM Noah Misch <noah@leadboat.com> wrote: > > On Mon, Mar 30, 2020 at 12:16:31PM -0700, Andres Freund wrote: > > On 2020-03-30 15:04:55 -0400, Robert Haas wrote: > > > I guess I'd like to be clear here that I have no fundamental > > > disagreement with taking this tool in any direction that people would > > > like it to go. For me it's just a question of timing. Feature freeze > > > is now a week or so away, and nothing complicated is going to get done > > > in that time. If we can all agree on something simple based on > > > Andres's recent proposal, cool, but I'm not yet sure that will be the > > > case, so what's plan B? We could decide that what I have here is just > > > too little to be a viable facility on its own, but I think Stephen is > > > the only one taking that position. We could release it as > > > pg_validatemanifest with a plan to rename it if other backup-related > > > checks are added later. We could release it as pg_validatebackup with > > > the idea to avoid having to rename it when more backup-related checks > > > are added later, but with a greater possibility of confusion in the > > > meantime and no hard guarantee that anyone will actually develop such > > > checks. We could put it in to pg_checksums, but I think that's really > > > backing ourselves into a corner: if backup validation develops other > > > checks that are not checksum-related, what then? I'd much rather > > > gamble on keeping things together by topic (backup) than technology > > > used internally (checksum). Putting it into pg_basebackup is another > > > option, and would avoid that problem, but it's not my preferred > > > option, because as I noted before, I think the command-line options > > > will get confusing. > > > > I'm mildly inclined to name it pg_validate, pg_validate_dbdir or > > such. And eventually (definitely not this release) subsume pg_checksums > > in it. That way we can add other checkers too. 
> > Works for me; of those two, I prefer pg_validate. > pg_validate sounds like a tool with a much bigger purpose. I think even things like amcheck could also fall under it. This patch has two parts: (a) Generate backup manifests for base backups, and (b) Validate backup (manifest). It seems to me that there are not many things pending for (a); can't we commit that first, or is it the case that (a) depends on (b)? This is *not* a suggestion to leave pg_validatebackup out of this release, rather just to commit if something is ready and meaningful on its own. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Mar 30, 2020 at 7:24 PM David Steele <david@pgmasters.net> wrote: > > I'm confused as to why you're not seeing that. What's the exact > > sequence of steps? > > $ pg_basebackup -D test/backup5 --manifest-checksums=SHA256 > > $ vi test/backup5/backup_manifest > * Add 'X' to the checksum of backup_label > > $ pg_validatebackup test/backup5 > pg_validatebackup: fatal: invalid checksum for file "backup_label": > "a98e9164fd59d498d14cfdf19c67d1c2208a30e7b939d1b4a09f524c7adfc11fX" > > No mention of the manifest checksum being invalid. But if I remove the > backup label file from the manifest: > > pg_validatebackup: fatal: manifest checksum mismatch Oh, I see what's happening now. If the checksum is not an even-length string of hexadecimal characters, it's treated as a syntax error, so it bails out at that point. Generally, a syntax error in the manifest file is treated as a fatal error, and you just die right there. You'd get the same behavior if you had malformed JSON, like a stray { or } or [ or ] someplace that it doesn't belong according to the rules of JSON. On the other hand, if you corrupt the checksum by adding AA or EE or 54 or some other even-length string of hex characters, then you have (in this code's view) a semantic error rather than a syntax error, so it will finish loading all the manifest data and then bail because the checksum doesn't match. We really can't avoid bailing out early sometimes, because if the file is totally malformed at the JSON level, there's just no way to continue. We could cause this particular error to get treated as a semantic error rather than a syntax error, but I don't really see much advantage in so doing. This way was easier to code, and I don't think it really matters which error we find first. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
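[Editor's note: the syntax-vs-semantic distinction Robert describes can be modeled in a few lines of Python. This is an illustrative sketch only, not the actual pg_validatebackup C code; the exception names are invented.]

```python
import binascii

class ManifestSyntaxError(Exception):
    """Raised while parsing: the validator bails out immediately,
    before the manifest's own checksum is ever compared."""

def parse_checksum(hex_string):
    # An odd-length string cannot decode to whole bytes, so it is
    # rejected up front -- the checksum_string_length / 2 case.
    if len(hex_string) % 2 != 0:
        raise ManifestSyntaxError("odd-length checksum string")
    try:
        # Even-length but non-hex input fails here instead (the
        # hexdecode_string case); both are syntax errors.
        return binascii.unhexlify(hex_string)
    except binascii.Error:
        raise ManifestSyntaxError("checksum string is not valid hex")
```

This matches the observed behavior: removing two hex characters still decodes cleanly, so parsing completes and the corruption is only reported later as a checksum mismatch, whereas appending 'X' or removing one character fails inside the parser and aborts with a per-file error first.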
Greetings, * Amit Kapila (amit.kapila16@gmail.com) wrote: > On Tue, Mar 31, 2020 at 11:10 AM Noah Misch <noah@leadboat.com> wrote: > > On Mon, Mar 30, 2020 at 12:16:31PM -0700, Andres Freund wrote: > > > On 2020-03-30 15:04:55 -0400, Robert Haas wrote: > > > > I guess I'd like to be clear here that I have no fundamental > > > > disagreement with taking this tool in any direction that people would > > > > like it to go. For me it's just a question of timing. Feature freeze > > > > is now a week or so away, and nothing complicated is going to get done > > > > in that time. If we can all agree on something simple based on > > > > Andres's recent proposal, cool, but I'm not yet sure that will be the > > > > case, so what's plan B? We could decide that what I have here is just > > > > too little to be a viable facility on its own, but I think Stephen is > > > > the only one taking that position. We could release it as > > > > pg_validatemanifest with a plan to rename it if other backup-related > > > > checks are added later. We could release it as pg_validatebackup with > > > > the idea to avoid having to rename it when more backup-related checks > > > > are added later, but with a greater possibility of confusion in the > > > > meantime and no hard guarantee that anyone will actually develop such > > > > checks. We could put it in to pg_checksums, but I think that's really > > > > backing ourselves into a corner: if backup validation develops other > > > > checks that are not checksum-related, what then? I'd much rather > > > > gamble on keeping things together by topic (backup) than technology > > > > used internally (checksum). Putting it into pg_basebackup is another > > > > option, and would avoid that problem, but it's not my preferred > > > > option, because as I noted before, I think the command-line options > > > > will get confusing. > > > > > > I'm mildly inclined to name it pg_validate, pg_validate_dbdir or > > > such. 
And eventually (definitely not this release) subsume pg_checksums > > > in it. That way we can add other checkers too. > > > > Works for me; of those two, I prefer pg_validate. > > pg_validate sounds like a tool with a much bigger purpose. I think > even things like amcheck could also fall under it. Yeah, I tend to agree with this. > This patch has two parts (a) Generate backup manifests for base > backups, and (b) Validate backup (manifest). It seems to me that > there are not many things pending for (a), can't we commit that first > or is it the case that (a) depends on (b)? This is *not* a suggestion > to leave pg_validatebackup from this release rather just to commit if > something is ready and meaningful on its own. I suspect the idea here is that we don't really want to commit something that nothing is actually using, and that's understandable and justified here- consider that even in this recent discussion there was talk that maybe we should have included permissions and ownership in the manifest, or starting and ending WAL positions, so that they'd be able to be checked by this tool more easily (and because it's just useful to have all that info in one place... I don't really agree with the concerns that it's an issue for static information like that to be duplicated). In other words, while the manifest creation code might be something we could commit, without a tool to use it (which does all the things that we think it needs to, to perform some high-level task, such as "validate a backup") we don't know that the manifest that's actually generated is really up to snuff and has what it needs to have to perform that task. I had been hoping that the discussion Andres was leading regarding leveraging pg_waldump (or maybe just code from it..) would get us to a point where pg_validatebackup would check that we have all of the WAL needed for the backup to be consistent and that it would then verify the internal checksums of the WAL. 
That would certainly be a good solution for this time around, in my view, and is already all existing client-side code. I do think we'd want to have a note about how we verify pg_wal differently from the other files which are in the manifest. Thanks, Stephen
On Mon, Mar 30, 2020 at 2:59 PM Andres Freund <andres@anarazel.de> wrote: > I think it wouldn't be too hard to compute that information while taking > the base backup. We know the end timeline (ThisTimeLineID), so we can > just call readTimeLineHistory(ThisTimeLineID). Which should then allow > for something pretty trivial along the lines of > > timelines = readTimeLineHistory(ThisTimeLineID); > last_start = InvalidXLogRecPtr; > foreach(lc, timelines) > { > TimeLineHistoryEntry *he = lfirst(lc); > > if (he->end < startptr) > continue; > > // > manifest_emit_wal_range(Min(he->begin, startptr), he->end); > last_start = he->end; > } > > if (last_start == InvalidXlogRecPtr) > start = startptr; > else > start = last_start; > > manifest_emit_wal_range(start, entptr); I made an attempt to implement this. In the attached patch set, 0001 and 0002 are (I think) unmodified from the last version. 0003 is a slightly-rejiggered version of your new pg_waldump option. 0004 whacks 0002 around so that the WAL ranges are included in the manifest and pg_validatebackup tries to run pg_waldump for each WAL range. It appears to work in light testing, but I haven't yet (1) tested it extensively, (2) written good regression tests for it above and beyond what pg_validatebackup had already, or (3) updated the documentation. I'm going to work on those things. I would appreciate *very timely* feedback on anything people do or do not like about this, because I want to commit this patch set by the end of the work week and that isn't very far away. I would also appreciate if people would bear in mind the principle that half a loaf is better than none, and further improvements can be made in future releases. 
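[Editor's note: the range computation sketched above can be modeled self-containedly. The following Python is only an illustration of the intended logic; the entry layout and the clamping to the backup's start/end LSNs are one reading of the sketch, not the committed backend code.]

```python
from collections import namedtuple

# Illustrative stand-in for the backend's TimeLineHistoryEntry: each
# historical timeline covers [begin, end); the current timeline's end
# is None, meaning "still being written".
TimeLineHistoryEntry = namedtuple("TimeLineHistoryEntry", "tli begin end")

def required_wal_ranges(history, startptr, endptr):
    """Return (tli, start_lsn, end_lsn) for every stretch of WAL the
    backup needs between startptr and endptr, oldest timeline first."""
    ranges = []
    for he in history:
        if he.end is not None and he.end <= startptr:
            continue            # timeline ended before the backup began
        start = max(he.begin, startptr)
        end = endptr if he.end is None else min(he.end, endptr)
        if start < end:
            ranges.append((he.tli, start, end))
    return ranges
```

With timeline 1 ending at LSN 0x100 and timeline 2 current, a backup spanning 0x80..0x180 needs two ranges, (1, 0x80, 0x100) and (2, 0x100, 0x180), which is exactly why the manifest stores a list of WAL ranges rather than a single start/end/timeline.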
As part of my light testing, I tried promoting a standby that was running pg_basebackup, and found that pg_basebackup failed like this: pg_basebackup: error: could not get COPY data stream: ERROR: the standby was promoted during online backup HINT: This means that the backup being taken is corrupt and should not be used. Try taking another online backup. pg_basebackup: removing data directory "/Users/rhaas/pgslave2" My first thought was that this error message is hard to reconcile with this comment: /* * Send timeline history files too. Only the latest timeline history * file is required for recovery, and even that only if there happens * to be a timeline switch in the first WAL segment that contains the * checkpoint record, or if we're taking a base backup from a standby * server and the target timeline changes while the backup is taken. * But they are small and highly useful for debugging purposes, so * better include them all, always. */ But then it occurred to me that this might be a cascading standby. Maybe the original master died and this machine's master got promoted, so it has to follow a timeline switch but doesn't itself get promoted. I think I might try to test out that scenario and see what happens, but I haven't done so as of this writing. Regardless, it seems like a really good idea to store a list of WAL ranges rather than a single start/end/timeline, because even if it's impossible today it might become possible in the future. Still, unless there's an easy way to set up a test scenario where multiple WAL ranges need to be verified, it may be hard to test that this code actually behaves properly. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2020-03-31 14:10:34 -0400, Robert Haas wrote: > I made an attempt to implement this. Awesome! > In the attached patch set, 0001 I'm going to work on those things. I > would appreciate *very timely* feedback on anything people do or do > not like about this, because I want to commit this patch set by the > end of the work week and that isn't very far away. I would also > appreciate if people would bear in mind the principle that half a loaf > is better than none, and further improvements can be made in future > releases. > > As part of my light testing, I tried promoting a standby that was > running pg_basebackup, and found that pg_basebackup failed like this: > > pg_basebackup: error: could not get COPY data stream: ERROR: the > standby was promoted during online backup > HINT: This means that the backup being taken is corrupt and should > not be used. Try taking another online backup. > pg_basebackup: removing data directory "/Users/rhaas/pgslave2" > > My first thought was that this error message is hard to reconcile with > this comment: > > /* > * Send timeline history files too. Only the latest timeline history > * file is required for recovery, and even that only if there happens > * to be a timeline switch in the first WAL segment that contains the > * checkpoint record, or if we're taking a base backup from a standby > * server and the target timeline changes while the backup is taken. > * But they are small and highly useful for debugging purposes, so > * better include them all, always. > */ > > But then it occurred to me that this might be a cascading standby. Yea. The check just prevents the walsender's database from being promoted: /* * Check if the postmaster has signaled us to exit, and abort with an * error in that case. The error handler further up will call * do_pg_abort_backup() for us. Also check that if the backup was * started while still in recovery, the server wasn't promoted. 
* do_pg_stop_backup() will check that too, but it's better to stop * the backup early than continue to the end and fail there. */ CHECK_FOR_INTERRUPTS(); if (RecoveryInProgress() != backup_started_in_recovery) ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("the standby was promoted during online backup"), errhint("This means that the backup being taken is corrupt " "and should not be used. " "Try taking another online backup."))); and if (strcmp(backupfrom, "standby") == 0 && !backup_started_in_recovery) ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("the standby was promoted during online backup"), errhint("This means that the backup being taken is corrupt " "and should not be used. " "Try taking another online backup."))); So that just prevents promotions of the current node, afaict. > Regardless, it seems like a really good idea to store a list of WAL > ranges rather than a single start/end/timeline, because even if it's > impossible today it might become possible in the future. Indeed. > Still, unless there's an easy way to set up a test scenario where > multiple WAL ranges need to be verified, it may be hard to test that > this code actually behaves properly. I think it'd be possible to test without a fully cascading setup, by creating an initial base backup, then do some work to create a bunch of new timelines, and then start the initial base backup. That'd have to follow all those timelines. Not sure that's better than a cascading setup though. > +/* > + * Add information about the WAL that will need to be replayed when restoring > + * this backup to the manifest. > + */ > +static void > +AddWALInfoToManifest(manifest_info *manifest, XLogRecPtr startptr, > + TimeLineID starttli, XLogRecPtr endptr, TimeLineID endtli) > +{ > + List *timelines = readTimeLineHistory(endtli); should probably happen after the manifest->buffile check. 
> + ListCell *lc; > + bool first_wal_range = true; > + bool found_ending_tli = false; > + > + /* If there is no buffile, then the user doesn't want a manifest. */ > + if (manifest->buffile == NULL) > + return; Not really about this patch/function specifically: I wonder if this'd look better if you added ManifestEnabled() macro instead of repeating the comment repeatedly. > + /* Unless --no-parse-wal was specified, we will need pg_waldump. */ > + if (!no_parse_wal) > + { > + int ret; > + > + pg_waldump_path = pg_malloc(MAXPGPATH); > + ret = find_other_exec(argv[0], "pg_waldump", > + "pg_waldump (PostgreSQL) " PG_VERSION "\n", > + pg_waldump_path); > + if (ret < 0) > + { > + char full_path[MAXPGPATH]; > + > + if (find_my_exec(argv[0], full_path) < 0) > + strlcpy(full_path, progname, sizeof(full_path)); > + if (ret == -1) > + pg_log_fatal("The program \"%s\" is needed by %s but was\n" > + "not found in the same directory as \"%s\".\n" > + "Check your installation.", > + "pg_waldump", "pg_validatebackup", full_path); > + else > + pg_log_fatal("The program \"%s\" was found by \"%s\" but was\n" > + "not the same version as %s.\n" > + "Check your installation.", > + "pg_waldump", full_path, "pg_validatebackup"); > + } > + } ISTM, and this can definitely wait for another time, that we should have one wrapper doing all of this, instead of having quite a few copies of very similar logic to the above. > +/* > + * Attempt to parse the WAL files required to restore from backup using > + * pg_waldump. 
> + */ > +static void > +parse_required_wal(validator_context *context, char *pg_waldump_path, > + char *wal_directory, manifest_wal_range *first_wal_range) > +{ > + manifest_wal_range *this_wal_range = first_wal_range; > + > + while (this_wal_range != NULL) > + { > + char *pg_waldump_cmd; > + > + pg_waldump_cmd = psprintf("\"%s\" --quiet --path=\"%s\" --timeline=%u --start=%X/%X --end=%X/%X\n", > + pg_waldump_path, wal_directory, this_wal_range->tli, > + (uint32) (this_wal_range->start_lsn >> 32), > + (uint32) this_wal_range->start_lsn, > + (uint32) (this_wal_range->end_lsn >> 32), > + (uint32) this_wal_range->end_lsn); > + if (system(pg_waldump_cmd) != 0) > + report_backup_error(context, > + "WAL parsing failed for timeline %u", > + this_wal_range->tli); > + > + this_wal_range = this_wal_range->next; > + } > +} Should we have a function to properly escape paths in cases like this? Not that it's likely or really problematic, but the quoting for path could be "circumvented". Greetings, Andres Freund
On Tue, Mar 31, 2020 at 03:50:34PM -0700, Andres Freund wrote: > On 2020-03-31 14:10:34 -0400, Robert Haas wrote: > > +/* > > + * Attempt to parse the WAL files required to restore from backup using > > + * pg_waldump. > > + */ > > +static void > > +parse_required_wal(validator_context *context, char *pg_waldump_path, > > + char *wal_directory, manifest_wal_range *first_wal_range) > > +{ > > + manifest_wal_range *this_wal_range = first_wal_range; > > + > > + while (this_wal_range != NULL) > > + { > > + char *pg_waldump_cmd; > > + > > + pg_waldump_cmd = psprintf("\"%s\" --quiet --path=\"%s\" --timeline=%u --start=%X/%X --end=%X/%X\n", > > + pg_waldump_path, wal_directory, this_wal_range->tli, > > + (uint32) (this_wal_range->start_lsn >> 32), > > + (uint32) this_wal_range->start_lsn, > > + (uint32) (this_wal_range->end_lsn >> 32), > > + (uint32) this_wal_range->end_lsn); > > + if (system(pg_waldump_cmd) != 0) > > + report_backup_error(context, > > + "WAL parsing failed for timeline %u", > > + this_wal_range->tli); > > + > > + this_wal_range = this_wal_range->next; > > + } > > +} > > Should we have a function to properly escape paths in cases like this? > Not that it's likely or really problematic, but the quoting for path > could be "circumvented". Are you looking for appendShellString(), or something different?
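[Editor's note: the quoting appendShellString performs amounts to wrapping the whole argument in single quotes and escaping embedded single quotes, so no character in the path can be reinterpreted by the shell. A minimal Python illustration follows; the real function lives in PostgreSQL's fe_utils and additionally rejects characters such as newlines, and the helper names here are invented.]

```python
def quote_shell_arg(arg):
    # An embedded single quote ends the quoted region, emits an
    # escaped quote, and reopens it: ' becomes '\''
    return "'" + arg.replace("'", "'\\''") + "'"

def waldump_command(pg_waldump_path, wal_directory, tli, start_lsn, end_lsn):
    # Every user-controlled piece of the command line is quoted, unlike
    # the bare "--path=\"%s\"" interpolation quoted above.
    return " ".join([
        quote_shell_arg(pg_waldump_path),
        "--quiet",
        "--path=" + quote_shell_arg(wal_directory),
        "--timeline=%u" % tli,
        "--start=%X/%X" % (start_lsn >> 32, start_lsn & 0xFFFFFFFF),
        "--end=%X/%X" % (end_lsn >> 32, end_lsn & 0xFFFFFFFF),
    ])
```

Python's shlex.quote implements the same single-quote escaping idea.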
On Tue, Mar 31, 2020 at 6:50 PM Andres Freund <andres@anarazel.de> wrote: > On 2020-03-31 14:10:34 -0400, Robert Haas wrote: > > I made an attempt to implement this. > > Awesome! Here's a new patch set. I haven't fixed the things in your latest round of review comments yet, but I did rewrite the documentation for pg_validatebackup, add documentation for the new pg_waldump option, and add regression tests for the new WAL-checking facility of pg_validatebackup. 0001 - add pg_waldump -q 0002 - add checksum helpers 0003 - core backup manifest patch, now with WAL verification included -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2020-03-31 22:15:04 -0700, Noah Misch wrote: > On Tue, Mar 31, 2020 at 03:50:34PM -0700, Andres Freund wrote: > > On 2020-03-31 14:10:34 -0400, Robert Haas wrote: > > > +/* > > > + * Attempt to parse the WAL files required to restore from backup using > > > + * pg_waldump. > > > + */ > > > +static void > > > +parse_required_wal(validator_context *context, char *pg_waldump_path, > > > + char *wal_directory, manifest_wal_range *first_wal_range) > > > +{ > > > + manifest_wal_range *this_wal_range = first_wal_range; > > > + > > > + while (this_wal_range != NULL) > > > + { > > > + char *pg_waldump_cmd; > > > + > > > + pg_waldump_cmd = psprintf("\"%s\" --quiet --path=\"%s\" --timeline=%u --start=%X/%X --end=%X/%X\n", > > > + pg_waldump_path, wal_directory, this_wal_range->tli, > > > + (uint32) (this_wal_range->start_lsn >> 32), > > > + (uint32) this_wal_range->start_lsn, > > > + (uint32) (this_wal_range->end_lsn >> 32), > > > + (uint32) this_wal_range->end_lsn); > > > + if (system(pg_waldump_cmd) != 0) > > > + report_backup_error(context, > > > + "WAL parsing failed for timeline %u", > > > + this_wal_range->tli); > > > + > > > + this_wal_range = this_wal_range->next; > > > + } > > > +} > > > > Should we have a function to properly escape paths in cases like this? > > Not that it's likely or really problematic, but the quoting for path > > could be "circumvented". > > Are you looking for appendShellString(), or something different? Looks like that'd be it. Thanks. Greetings, Andres Freund
Hi, On 2020-03-31 14:56:07 +0530, Amit Kapila wrote: > On Tue, Mar 31, 2020 at 11:10 AM Noah Misch <noah@leadboat.com> wrote: > > On Mon, Mar 30, 2020 at 12:16:31PM -0700, Andres Freund wrote: > > > On 2020-03-30 15:04:55 -0400, Robert Haas wrote: > > > I'm mildly inclined to name it pg_validate, pg_validate_dbdir or > > > such. And eventually (definitely not this release) subsume pg_checksums > > > in it. That way we can add other checkers too. > > > > Works for me; of those two, I prefer pg_validate. > > > > pg_validate sounds like a tool with a much bigger purpose. I think > even things like amcheck could also fall under it. Intentionally so. We don't serve our users by collecting a lot of differently named commands to work with data directories. A I wrote above, the point would be to eventually have that tool also perform checksum validation etc. Potentially even in a single pass over the data directory. > This patch has two parts (a) Generate backup manifests for base > backups, and (b) Validate backup (manifest). It seems to me that > there are not many things pending for (a), can't we commit that first > or is it the case that (a) depends on (b)? This is *not* a suggestion > to leave pg_validatebackup from this release rather just to commit if > something is ready and meaningful on its own. IDK, it seems easier to be able to modify both at the same time. Greetings, Andres Freund
On 3/31/20 7:57 AM, Robert Haas wrote: > On Mon, Mar 30, 2020 at 7:24 PM David Steele <david@pgmasters.net> wrote: >>> I'm confused as to why you're not seeing that. What's the exact >>> sequence of steps? >> >> $ pg_basebackup -D test/backup5 --manifest-checksums=SHA256 >> >> $ vi test/backup5/backup_manifest >> * Add 'X' to the checksum of backup_label >> >> $ pg_validatebackup test/backup5 >> pg_validatebackup: fatal: invalid checksum for file "backup_label": >> "a98e9164fd59d498d14cfdf19c67d1c2208a30e7b939d1b4a09f524c7adfc11fX" >> >> No mention of the manifest checksum being invalid. But if I remove the >> backup label file from the manifest: >> >> pg_validatebackup: fatal: manifest checksum mismatch > > Oh, I see what's happening now. If the checksum is not an even-length > string of hexademical characters, it's treated as a syntax error, so > it bails out at that point. Generally, a syntax error in the manifest > file is treated as a fatal error, and you just die right there. You'd > get the same behavior if you had malformed JSON, like a stray { or } > or [ or ] someplace that it doesn't belong according to the rules of > JSON. On the other hand, if you corrupt the checksum by adding AA or > EE or 54 or some other even-length string of hex characters, then you > have (in this code's view) a semantic error rather than a syntax > error, so it will finish loading all the manifest data and then bail > because the checksum doesn't match. > > We really can't avoid bailing out early sometimes, because if the file > is totally malformed at the JSON level, there's just no way to > continue. We could cause this particular error to get treated as a > semantic error rather than a syntax error, but I don't really see much > advantage in so doing. This way was easier to code, and I don't think > it really matters which error we find first. I think it would be good to know that the manifest checksum is bad in all cases because that may well inform other errors. 
That said, I know you have a lot on your plate with this patch so I'm not going to make a fuss about such a minor gripe. Perhaps this can be considered for future improvement. Regards, -- -David david@pgmasters.net
On Wed, Apr 1, 2020 at 4:47 PM Robert Haas <robertmhaas@gmail.com> wrote: > Here's a new patch set. I haven't fixed the things in your latest > round of review comments yet, but I did rewrite the documentation for > pg_validatebackup, add documentation for the new pg_waldump option, > and add regression tests for the new WAL-checking facility of > pg_validatebackup. > > 0001 - add pg_waldump -q > 0002 - add checksum helpers > 0003 - core backup manifest patch, now with WAL verification included And here's another new patch set. After some experimentation, I was able to manually test the timeline-switch-during-a-base-backup case and found that it had bugs in both pg_validatebackup and the code I added to the backend's basebackup.c. So I fixed those. It would be nice to have automated tests, but you need a large database (so that backing it up takes non-trivial time) and a load on the primary (so that WAL is being replayed during the backup) and there's a race condition (because the backup has to not finish before the cascading standby learns that the upstream has been promoted), so I don't at present see a practical way to automate that. I did verify, in manual testing, that a problem with WAL files on either timeline caused a validation failure. I also verified that the LSNs at which the standby began replay and reached consistency matched what was stored in the manifest. I also implemented Noah's suggestion that we should write the backup manifest under a temporary name and then rename it afterward. Stephen's original complaint that you could end up with a backup that validates successfully even though we died before we got the WAL is, at this point, moot, because pg_validatebackup is now capable of noticing that the WAL is missing. Nevertheless, this seems like a nice belt-and-suspenders check. 
I was able to position the rename *after* we fsync() the backup directory, as well as after we get all of the WAL, so unless those steps complete you'll have backup_manifest.tmp rather than backup_manifest. It's true that, if we suffered an OS crash before the fsync() completed and lost some files or some file data, pg_validatebackup ought to fail anyway, but this way it is absolutely certain to fail, and to do so immediately. Likewise for a failure while fetching WAL that manages to leave the output directory behind. This version has also had a visit from the pgindent police. I think this responds to pretty much all of the complaints that I know about and upon which we have a reasonable degree of consensus. There are still some things that not everybody is happy about. In particular, Stephen and David are unhappy about using CRC-32C as the default algorithm, but Andres and Noah both think it's a reasonable choice, even if not as robust as everybody will want. As I agree, I'm going to stick with that choice. Also, there is still some debate about what the tool ought to be called. My previous suggestion to rename this from pg_validatebackup to pg_validatemanifest seems wrong now that WAL validation has been added; in fact, given that we now have two independent sanity checks on a backup, I'm going to argue that it would be reasonable to extend that by adding more kinds of backup validation, perhaps even including the permissions check that Andres suggested before. I don't plan to pursue that at present, though. There remains the idea of merging this with some other tool, but I still don't like that. On the one hand, it's been suggested that it could be merged into pg_checksums, but I think that is less appealing now that it seems to be growing into a general-purpose backup validation tool. It may do things that have nothing to do with checksums. 
On the other hand, it's been suggested that it ought to be called pg_validate and that pg_checksums ought to eventually be merged into it, but I don't think we have sufficient consensus here to commit the project to such a plan. Nobody responsible for the pg_checksums work has endorsed it, for example. Moreover, pg_checksums does things other than validation, such as enabling and disabling checksums. Therefore, I think it's unclear that such a plan would achieve a sufficient degree of consensus. For my part, I think this is a general issue that is not really this patch's problem to solve. We have had multiple discussions over the years about reducing the number of binaries that we ship. We could have a general binary called "pg" or similar and use subcommands: pg createdb, pg basebackup, pg validatebackup, etc. I think such an approach is worth considering, though it would certainly be an adjustment for everyone. Or we might do something else. But I don't want to deal with that in this patch. A couple of other minor suggestions have been made: (1) rejigger things to avoid message duplication related to launching external binaries, (2) maybe use appendShellString, and (3) change some details of error-reporting related to manifest parsing. I don't believe anyone views these as blockers; (1) and (2) are preexisting issues that this patch extends to one new case. Considering all the foregoing, I would like to go ahead and commit this stuff. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On 2020-04-02 13:04:45 -0400, Robert Haas wrote:
> And here's another new patch set. After some experimentation, I was
> able to manually test the timeline-switch-during-a-base-backup case
> and found that it had bugs in both pg_validatebackup and the code I
> added to the backend's basebackup.c. So I fixed those.

Cool.

> It would be
> nice to have automated tests, but you need a large database (so that
> backing it up takes non-trivial time) and a load on the primary (so
> that WAL is being replayed during the backup) and there's a race
> condition (because the backup has to not finish before the cascading
> standby learns that the upstream has been promoted), so I don't at
> present see a practical way to automate that. I did verify, in manual
> testing, that a problem with WAL files on either timeline caused a
> validation failure. I also verified that the LSNs at which the standby
> began replay and reached consistency matched what was stored in the
> manifest.

I suspect it's possible to control the timing by preventing the checkpoint at the end of recovery from completing within a relevant timeframe. I think configuring a large checkpoint_timeout and using a non-fast base backup ought to do the trick. The state can be advanced by separately triggering an immediate checkpoint? Or by changing the checkpoint_timeout?

> I also implemented Noah's suggestion that we should write the backup
> manifest under a temporary name and then rename it afterward.
> Stephen's original complaint that you could end up with a backup that
> validates successfully even though we died before we got the WAL is,
> at this point, moot, because pg_validatebackup is now capable of
> noticing that the WAL is missing. Nevertheless, this seems like a nice
> belt-and-suspenders check.

Yea, it's imo generally a good idea.

> I think this responds to pretty much all of the complaints that I know
> about and upon which we have a reasonable degree of consensus.
> There
> are still some things that not everybody is happy about. In
> particular, Stephen and David are unhappy about using CRC-32C as the
> default algorithm, but Andres and Noah both think it's a reasonable
> choice, even if not as robust as everybody will want. As I agree, I'm
> going to stick with that choice.

I think it might be worth looking, in a later release, at something like blake3 for a fast cryptographic checksum. By allowing for instruction parallelism (by independently checksumming different blocks in data, and only advancing the "shared" checksum separately) it achieves considerably higher throughput rates.

I suspect we should also look at a better non-crypto hash. xxhash or whatever. Not just for these checksums, but also for in-memory.

> Also, there is still some debate about what the tool ought to be
> called. My previous suggestion to rename this from pg_validatebackup
> to pg_validatemanifest seems wrong now that WAL validation has been
> added; in fact, given that we now have two independent sanity checks
> on a backup, I'm going to argue that it would be reasonable to extend
> that by adding more kinds of backup validation, perhaps even including
> the permissions check that Andres suggested before.

FWIW, the only check I'd really like to see in this release is the crosscheck between the file's length and the actually read data (to be able to diagnose FS issues).

Greetings,

Andres Freund
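The instruction-parallelism idea mentioned above can be shown with a toy: deal the input bytes out to independent "lanes" whose mixing chains have no data dependency on one another (so a superscalar CPU can overlap them), then fold the lanes together at the end. The mixer here is a bare FNV-1a step chosen only for brevity; this is emphatically not blake3 and has no cryptographic strength whatsoever, it only illustrates the structure:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_LANES 4

/* One FNV-1a mixing step; a stand-in for a real compression function. */
static uint64_t
fnv1a_step(uint64_t h, uint8_t b)
{
	return (h ^ b) * 0x100000001b3ULL;
}

uint64_t
lane_hash(const uint8_t *data, size_t len)
{
	uint64_t	lanes[NUM_LANES];
	uint64_t	h = 0xcbf29ce484222325ULL;
	size_t		i;

	for (i = 0; i < NUM_LANES; i++)
		lanes[i] = 0xcbf29ce484222325ULL + i;

	/*
	 * Each lane only ever sees every NUM_LANES-th byte, so the four
	 * mixing chains are independent and can run in parallel; the
	 * "shared" state advances only in the final fold below.
	 */
	for (i = 0; i < len; i++)
		lanes[i % NUM_LANES] = fnv1a_step(lanes[i % NUM_LANES], data[i]);

	/* Fold the per-lane results into a single value. */
	for (i = 0; i < NUM_LANES; i++)
	{
		h = fnv1a_step(h, (uint8_t) (lanes[i] & 0xff));
		h ^= lanes[i] >> 8;
	}
	return h;
}
```

Real designs like blake3 do this with large SIMD-friendly blocks and a proper compression function; the payoff is the same, several dependency chains in flight instead of one.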
On Thu, Apr 2, 2020 at 1:23 PM Andres Freund <andres@anarazel.de> wrote:
> I suspect it's possible to control the timing by preventing the
> checkpoint at the end of recovery from completing within a relevant
> timeframe. I think configuring a large checkpoint_timeout and using a
> non-fast base backup ought to do the trick. The state can be advanced by
> separately triggering an immediate checkpoint? Or by changing the
> checkpoint_timeout?

That might make the window fairly wide on normal systems, but I'm not sure about Raspberry Pi BF members or things running CLOBBER_CACHE_ALWAYS/RECURSIVELY. I guess I could try it.

> I think it might be worth looking, in a later release, at something like
> blake3 for a fast cryptographic checksum. By allowing for instruction
> parallelism (by independently checksumming different blocks in data, and
> only advancing the "shared" checksum separately) it achieves
> considerably higher throughput rates.
>
> I suspect we should also look at a better non-crypto hash. xxhash or
> whatever. Not just for these checksums, but also for in-memory.

I have no problem with that. I don't feel that I am well-placed to recommend for or against specific algorithms. Speed is easy to measure, but there's also code stability, the license under which something is released, the quality of the hashes it produces, and the extent to which it is cryptographically secure. I'm not an expert in any of that stuff, but if we get consensus on something it should be easy enough to plug it into this framework. Even changing the default would be no big deal.

> FWIW, the only check I'd really like to see in this release is the
> crosscheck between the file's length and the actually read data (to be able
> to diagnose FS issues).

Not sure I understand this comment. Isn't that a subset of what the patch already does? Are you asking for something to be changed?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2020-04-02 14:16:27 -0400, Robert Haas wrote:
> On Thu, Apr 2, 2020 at 1:23 PM Andres Freund <andres@anarazel.de> wrote:
> > I suspect it's possible to control the timing by preventing the
> > checkpoint at the end of recovery from completing within a relevant
> > timeframe. I think configuring a large checkpoint_timeout and using a
> > non-fast base backup ought to do the trick. The state can be advanced by
> > separately triggering an immediate checkpoint? Or by changing the
> > checkpoint_timeout?
>
> That might make the window fairly wide on normal systems, but I'm not
> sure about Raspberry Pi BF members or things running
> CLOBBER_CACHE_ALWAYS/RECURSIVELY. I guess I could try it.

You can set checkpoint_timeout to be a day. If that's not enough, well, then I think we have other problems.

> > FWIW, the only check I'd really like to see in this release is the
> > crosscheck between the file's length and the actually read data (to be able
> > to diagnose FS issues).
>
> Not sure I understand this comment. Isn't that a subset of what the
> patch already does? Are you asking for something to be changed?

Yes, I am asking for something to be changed: I'd like the code that read()s the file when computing the checksum to add up how many bytes were read, and compare that to the size in the manifest. And if there's a difference report an error about that, instead of a checksum failure.

I've repeatedly seen filesystem issues lead to earlier EOFs when read()ing than what stat() returns. It'll be pretty annoying to have to debug a general "checksum failure", rather than just knowing that reading stopped after 100MB of 1GB.

Greetings,

Andres Freund
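The reporting being requested amounts to counting bytes during the checksum pass and diagnosing a size mismatch separately from a checksum mismatch. A rough sketch of the idea (the enum, the function name, and the pluggable update callback are invented for this example; this is not the pg_validatebackup code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Sketch: while checksumming a file, total up the bytes actually read
 * and compare against the size recorded in the manifest.  A premature
 * EOF is then reported as a short read ("read 100MB of 1GB") rather
 * than as a generic checksum failure.
 */
typedef enum
{
	VERIFY_OK,
	VERIFY_SHORT_READ,			/* size on disk != size in manifest */
	VERIFY_CHECKSUM_MISMATCH
} verify_result;

verify_result
verify_size_and_checksum(FILE *fp, uint64_t manifest_size,
						 uint32_t manifest_checksum,
						 uint32_t (*update) (uint32_t, const void *, size_t))
{
	uint8_t		buf[8192];
	uint64_t	bytes_read = 0;
	uint32_t	checksum = 0;
	size_t		n;

	while ((n = fread(buf, 1, sizeof(buf), fp)) > 0)
	{
		checksum = update(checksum, buf, n);
		bytes_read += n;
	}

	/* Diagnose the size discrepancy first, and distinctly. */
	if (bytes_read != manifest_size)
		return VERIFY_SHORT_READ;
	if (checksum != manifest_checksum)
		return VERIFY_CHECKSUM_MISMATCH;
	return VERIFY_OK;
}
```

Since the byte count is accumulated as a side effect of the read loop the checksum pass already performs, the extra check is essentially free.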
On Thu, Apr 2, 2020 at 2:23 PM Andres Freund <andres@anarazel.de> wrote:
> > That might make the window fairly wide on normal systems, but I'm not
> > sure about Raspberry Pi BF members or things running
> > CLOBBER_CACHE_ALWAYS/RECURSIVELY. I guess I could try it.
>
> You can set checkpoint_timeout to be a day. If that's not enough, well,
> then I think we have other problems.

I'm not sure that's the only issue here, but I'll try it.

> Yes, I am asking for something to be changed: I'd like the code that
> read()s the file when computing the checksum to add up how many bytes
> were read, and compare that to the size in the manifest. And if there's
> a difference report an error about that, instead of a checksum failure.
>
> I've repeatedly seen filesystem issues lead to earlier EOFs when
> read()ing than what stat() returns. It'll be pretty annoying to have to
> debug a general "checksum failure", rather than just knowing that
> reading stopped after 100MB of 1GB.

Is 0004 attached like what you have in mind?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2020-04-02 14:55:19 -0400, Robert Haas wrote:
> > Yes, I am asking for something to be changed: I'd like the code that
> > read()s the file when computing the checksum to add up how many bytes
> > were read, and compare that to the size in the manifest. And if there's
> > a difference report an error about that, instead of a checksum failure.
> >
> > I've repeatedly seen filesystem issues lead to earlier EOFs when
> > read()ing than what stat() returns. It'll be pretty annoying to have to
> > debug a general "checksum failure", rather than just knowing that
> > reading stopped after 100MB of 1GB.
>
> Is 0004 attached like what you have in mind?

Yes. Thanks!

- Andres
On 4/2/20 1:04 PM, Robert Haas wrote:
>
> There
> are still some things that not everybody is happy about. In
> particular, Stephen and David are unhappy about using CRC-32C as the
> default algorithm, but Andres and Noah both think it's a reasonable
> choice, even if not as robust as everybody will want. As I agree, I'm
> going to stick with that choice.

Yeah, I seem to be on the losing side of this argument, at least for now, so I don't think it should block the commit of this patch. It's an easy enough tweak if we change our minds.

> For my part, I think this is a general issue that is not really this
> patch's problem to solve. We have had multiple discussions over the
> years about reducing the number of binaries that we ship. We could
> have a general binary called "pg" or similar and use subcommands: pg
> createdb, pg basebackup, pg validatebackup, etc. I think such an
> approach is worth considering, though it would certainly be an
> adjustment for everyone. Or we might do something else. But I don't
> want to deal with that in this patch.

I'm fine with the current name, especially now that WAL is validated.

> A couple of other minor suggestions have been made: (1) rejigger
> things to avoid message duplication related to launching external
> binaries,

That'd be nice to have, but I think we can live without it for now.

> (2) maybe use appendShellString

Seems like this would be good to have but I'm not going to make a fuss about it.

> and (3) change some details
> of error-reporting related to manifest parsing. I don't believe anyone
> views these as blockers

I'd view this as later refinement once we see how the tool is being used and/or get gripes from the field.

So, with the addition of the 0004 patch down-thread this looks committable to me.

Regards,

--
-David
david@pgmasters.net
On Thu, Apr 2, 2020 at 2:55 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Apr 2, 2020 at 2:23 PM Andres Freund <andres@anarazel.de> wrote:
> > > That might make the window fairly wide on normal systems, but I'm not
> > > sure about Raspberry Pi BF members or things running
> > > CLOBBER_CACHE_ALWAYS/RECURSIVELY. I guess I could try it.
> >
> > You can set checkpoint_timeout to be a day. If that's not enough, well,
> > then I think we have other problems.
>
> I'm not sure that's the only issue here, but I'll try it.

I ran into a few problems here. In trying to set this up manually, I always began with the following steps:

====
# (1) create cluster
initdb

# (2) add to configuration file
log_checkpoints=on
checkpoint_timeout=1d
checkpoint_completion_target=0.99

# (3) fire it up
postgres
createdb
====

If at this point I do "pg_basebackup -D pgslave -R -c spread", it completes within a few seconds anyway, because there's basically nothing dirty, and no matter how slowly you write out no data, it's still pretty quick. If I run "pgbench -i" first, and then "pg_basebackup -D pgslave -R -c spread", it hangs, apparently essentially forever, because now the checkpoint has something to do, and it does it super-slowly, and "psql -c checkpoint" makes it finish immediately. However, this experiment isn't testing quite the right thing, because what I actually need is a slow backup off of a cascading standby, so that I have time to promote the parent standby before the backup completes. I tried continuing like this:

====
# (4) set up standby
pg_basebackup -D pgslave -R
postgres -D pgslave -c port=5433

# (5) set up cascading standby
pg_basebackup -D pgslave2 -d port=5433 -R
postgres -c port=5434 -D pgslave2

# (6) dirty some pages on the master
pgbench -i

# (7) start a backup of the cascading standby
pg_basebackup -D pgslave3 -d port=5434 -R -c spread
====

However, the pg_basebackup in the last step completes after only a few seconds.
If it were hanging, then I could continue with "pg_ctl promote -D pgslave" and that might give me what I need, but that's not what happens. I suspect I'm not doing quite what you had in mind here... thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Apr 2, 2020 at 3:26 PM David Steele <david@pgmasters.net> wrote: > So, with the addition of the 0004 patch down-thread this looks > committable to me. Glad to hear it. Thank you. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2020-04-02 15:42:48 -0400, Robert Haas wrote: > I suspect I'm not doing quite what you had in mind here... thoughts? I have some ideas, but I think it's complicated enough that I'd not put it in the "pre commit path" for now.
On 4/2/20 3:47 PM, Andres Freund wrote: > On 2020-04-02 15:42:48 -0400, Robert Haas wrote: >> I suspect I'm not doing quite what you had in mind here... thoughts? > > I have some ideas, but I think it's complicated enough that I'd not put > it in the "pre commit path" for now. +1. These would be great tests to have and a win for pg_basebackup overall but I don't think they should be a prerequisite for this commit. Regards, -- -David david@pgmasters.net
On Thu, Apr 2, 2020 at 4:34 PM David Steele <david@pgmasters.net> wrote: > +1. These would be great tests to have and a win for pg_basebackup > overall but I don't think they should be a prerequisite for this commit. Not to mention the server. I can't say that I have a lot of confidence that all of the server behavior in this area is well-understood and sane. I've pushed all the patches. Hopefully everyone is happy now, or at least not so unhappy that they're going to break quarantine to beat me up. I hope I acknowledged all of the relevant people in the commit message, but it's possible that I missed somebody; if so, my apologies. As is my usual custom, I added entries in roughly the order that people chimed in on the thread, so the ordering should not be taken as a reflection of magnitude of contribution or, well, anything other than the approximate order in which they chimed in. It looks like the buildfarm is unhappy though, so I guess I'd better go look at that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Apr 3, 2020 at 3:22 PM Robert Haas <robertmhaas@gmail.com> wrote:
> It looks like the buildfarm is unhappy though, so I guess I'd better
> go look at that.

I fixed two things so far, and there seems to be at least one more possible issue that I don't understand.

1. Apparently, we have an automated perlcritic run built in to the build farm, and apparently, it really hates Perl subroutines that don't end with an explicit return statement. We have that overridden to severity 5 in our Perl critic configuration. I guess I should've known this, but didn't. I've pushed a fix adding return statements. I believe I'm on record as thinking that perlcritic is a tool for complaining about a lot of things that don't really matter and very few that actually do -- but it's project style, so I'll suck it up!

2. Also, a bunch of machines were super-unhappy with 003_corruption.pl, failing with this sort of thing:

pg_basebackup: error: could not get COPY data stream: ERROR: symbolic link target too long for tar format: file name "pg_tblspc/16387", target "/home/fabien/pg/build-farm-11/buildroot/HEAD/pgsql.build/src/bin/pg_validatebackup/tmp_check/tmp_test_7w0w"

Apparently, this is a known problem and the solution is to use TestLib::tempdir_short instead of TestLib::tempdir, so I pushed a fix to make it do that.

3. spurfowl has failed its last two runs like this:

sh: 1: ./configure: not found

I am not sure how this patch could've caused that to happen, but the timing of the failures is certainly suspicious.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Apr 3, 2020 at 3:53 PM Robert Haas <robertmhaas@gmail.com> wrote:
> 2. Also, a bunch of machines were super-unhappy with
> 003_corruption.pl, failing with this sort of thing:
>
> pg_basebackup: error: could not get COPY data stream: ERROR: symbolic
> link target too long for tar format: file name "pg_tblspc/16387",
> target "/home/fabien/pg/build-farm-11/buildroot/HEAD/pgsql.build/src/bin/pg_validatebackup/tmp_check/tmp_test_7w0w"
>
> Apparently, this is a known problem and the solution is to use
> TestLib::tempdir_short instead of TestLib::tempdir, so I pushed a fix
> to make it do that.

By and large, the buildfarm is a lot happier now, but fairywren (Windows / Msys Server 2019 / 2 gcc 7.3.0 x86_64) failed like this:

# Postmaster PID for node "master" is 198420
error running SQL: 'psql:<stdin>:3: ERROR: directory "/tmp/9peoZHrEia" does not exist'
while running 'psql -XAtq -d port=51493 host=127.0.0.1 dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql '
CREATE TABLE x1 (a int);
INSERT INTO x1 VALUES (111);
CREATE TABLESPACE ts1 LOCATION '/tmp/9peoZHrEia';
CREATE TABLE x2 (a int) TABLESPACE ts1;
INSERT INTO x1 VALUES (222);
' at /home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/perl/PostgresNode.pm line 1531.
### Stopping node "master" using mode immediate

I wondered why this should be failing on this machine when none of the other places where tempdir_short is used are similarly failing. The answer appears to be that most of the TAP tests that use tempdir_short just do this:

my $tempdir_short = TestLib::tempdir_short;

...and then ignore that variable completely for the rest of the script. That's not ideal, and we should probably remove those calls to avoid giving the impression that it's actually used for something. The two TAP tests that actually do something with it - apart from the one I just added - are pg_basebackup's 010_pg_basebackup.pl and pg_ctl's 001_start_stop.pl. However, both of those are skipped on Windows.
Also, PostgresNode.pm itself uses it, but only when UNIX sockets are used, so again not on Windows. So it sorta looks to me like we have no preexisting tests that meaningfully exercise TestLib::tempdir_short on Windows. Given that, I suppose I should consider myself lucky if this ends up working on *any* of the Windows critters, but given the implementation I'm kinda surprised we have a problem. That function is just:

sub tempdir_short
{
	return File::Temp::tempdir(CLEANUP => 1);
}

And File::Temp's documentation says that the temporary directory is picked using File::Spec's tmpdir(), which says that it knows about different operating systems and will DTRT on Unix, Mac, OS2, Win32, and VMS. Yet on fairywren it is apparently DTWT. I'm not sure why. Any ideas?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2020-Apr-03, Robert Haas wrote:
> sub tempdir_short
> {
>
> return File::Temp::tempdir(CLEANUP => 1);
> }
>
> And File::Temp's documentation says that the temporary directory is
> picked using File::Spec's tmpdir(), which says that it knows about
> different operating systems and will DTRT on Unix, Mac, OS2, Win32,
> and VMS. Yet on fairywren it is apparently DTWT. I'm not sure why.

Maybe it needs perl2host?

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Apr 3, 2020 at 4:54 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Maybe it needs perl2host?

*jaw drops*

Wow, OK, yeah, that looks like the thing. Thanks for the suggestion; I didn't know that existed (and I kinda wish I still didn't). I'll go see about adding that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Apr 03, 2020 at 03:22:23PM -0400, Robert Haas wrote:
> I've pushed all the patches.

I didn't manage to look at this in advance but have some doc fixes. word-diff:

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 536de9a698..d84afb7b18 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2586,7 +2586,7 @@ The commands accepted in replication mode are:
and sent along with the backup. The manifest is a list of every file
present in the backup with the exception of any WAL files that may be
included. It also stores the size, last modification time, and
[-an optional-]{+optionally a+} checksum for each file. A value of
<literal>force-escape</literal> forces all filenames to be hex-encoded;
otherwise, this type of encoding is performed only for files whose names
are non-UTF8 octet sequences.
@@ -2602,7 +2602,7 @@ The commands accepted in replication mode are:
<term><literal>MANIFEST_CHECKSUMS</literal></term>
<listitem>
<para>
Specifies the {+checksum+} algorithm that should be applied to each file
included in the backup manifest. Currently, the available
algorithms are <literal>NONE</literal>, <literal>CRC32C</literal>,
<literal>SHA224</literal>, <literal>SHA256</literal>,
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index c778e061f3..922688e227 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -604,7 +604,7 @@ PostgreSQL documentation
not contain any checksums. Otherwise, it will contain a checksum
of each file in the backup using the specified algorithm. In addition,
the manifest will always contain a <literal>SHA256</literal> checksum
of its own [-contents.-]{+content.+} The <literal>SHA</literal> algorithms
are significantly more CPU-intensive than <literal>CRC32C</literal>,
so selecting one of them may increase the time required to complete
the backup.
@@ -614,7 +614,7 @@ PostgreSQL documentation
of each file for users who wish to verify that the backup has not been
tampered with, while the CRC32C algorithm provides a checksum which is
much faster to calculate and good at catching errors due to accidental
changes but is not resistant to [-targeted-]{+malicious+} modifications.
Note that, to be useful against an adversary who has access to the
backup, the backup manifest would need to be stored securely elsewhere
or otherwise verified not to have been modified since the backup was taken.
diff --git a/doc/src/sgml/ref/pg_validatebackup.sgml b/doc/src/sgml/ref/pg_validatebackup.sgml
index 19888dc196..748ac439a6 100644
--- a/doc/src/sgml/ref/pg_validatebackup.sgml
+++ b/doc/src/sgml/ref/pg_validatebackup.sgml
@@ -41,12 +41,12 @@ PostgreSQL documentation
</para>
<para>
It is important to note that[-that-] the validation which is performed by
<application>pg_validatebackup</application> does not and [-can not-]{+cannot+}
include every check which will be performed by a running server when
attempting to make use of the backup. Even if you use this tool, you
should still perform test restores and verify that the resulting databases
work as expected and that they[-appear to-] contain the correct data.
However, <application>pg_validatebackup</application> can detect many
problems that commonly occur due to storage problems or user error.
</para>
@@ -73,7 +73,7 @@ PostgreSQL documentation
a <literal>backup_manifest</literal> file in the target directory or
about anything inside <literal>pg_wal</literal>, even though these
files won't be listed in the backup manifest. Only files are checked;
the presence or absence [-or-]{+of+} directories is not verified,
except indirectly: if a directory is missing, any files it should have
contained will necessarily also be missing.
</para>
@@ -84,7 +84,7 @@ PostgreSQL documentation
for any files for which the computed checksum does not match the
checksum stored in the manifest. This step is not performed for any
files which produced errors in the previous step, since they are
already known to have problems. [-Also, files-]{+Files+} which were
ignored in the previous step are also ignored in this step.
</para>
@@ -123,7 +123,7 @@ PostgreSQL documentation
<title>Options</title>
<para>
The following command-line options control the [-behavior.-]{+behavior of this program.+}
<variablelist>
<varlistentry>
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 3b18e733cd..aa72a6ff10 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1148,7 +1148,7 @@ AddFileToManifest(manifest_info *manifest, const char *spcoid,
}
/*
 * Each file's entry [-need-]{+needs+} to be separated from any entry that follows by a
 * comma, but there's no comma before the first one or after the last one.
 * To make that work, adding a file to the manifest starts by terminating
 * the most recently added line, with a comma if appropriate, but does not

--
Justin
[ splitting this off into a separate thread ]

On Fri, Apr 3, 2020 at 5:07 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I'll go see about adding that.

Done now. Meanwhile, two more machines have reported the mysterious message:

sh: ./configure: not found

...that first appeared on spurfowl a few hours ago. The other two machines are eelpout and elver, both of which list Thomas Munro as a maintainer. spurfowl lists Stephen Frost. Thomas, Stephen, can one of you check and see what's going on? spurfowl has failed this way four times now, and eelpout and elver have each failed the last two runs, but since there's no helpful information in the logs, it's hard to guess what went wrong.

I'm sort of afraid that something in the new TAP tests accidentally removed way too many files during the cleanup phase - e.g. it decided the temporary directory was / and removed every file it could access, or something like that. It doesn't do that here, or I, uh, would've noticed by now. But sometimes strange things happen on other people's machines. Hopefully one of those strange things is not that my test code is single-handedly destroying the entire buildfarm, but it's possible.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello Robert,

> Done now. Meanwhile, two more machines have reported the mysterious message:
>
> sh: ./configure: not found
>
> ...that first appeared on spurfowl a few hours ago. The other two
> machines are eelpout and elver, both of which list Thomas Munro as a
> maintainer. spurfowl lists Stephen Frost. Thomas, Stephen, can one of
> you check and see what's going on? spurfowl has failed this way four
> times now, and eelpout and elver have each failed the last two runs,
> but since there's no helpful information in the logs, it's hard to
> guess what went wrong.
>
> I'm sort of afraid that something in the new TAP tests accidentally
> removed way too many files during the cleanup phase - e.g. it decided
> the temporary directory was / and removed every file it could access,
> or something like that. It doesn't do that here, or I, uh, would've
> noticed by now. But sometimes strange things happen on other people's
> machines. Hopefully one of those strange things is not that my test
> code is single-handedly destroying the entire buildfarm, but it's
> possible.

seawasp just failed the same way. Good news, I can see "configure" under "HEAD/pgsql". The only strange thing under buildroot I found is:

HEAD/pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans/

this last directory's perms are d---------, which seems to break cleanup. It may be a leftover from a previous run which failed (possibly 21dc488 ?). I cannot see how this would be related to configure, though. Maybe something else fails silently and the message is about a consequence of the prior silent failure.

I commented out the cron job and will try to look into it tomorrow if the status has not changed by then.

--
Fabien.
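The cleanup failure follows directly from the mode-0 directory: nothing can chdir into or list it, so a recursive remove has to restore permissions first. The buildfarm client is Perl, but the shape of the fix is language-independent; here is a rough, hypothetical C illustration (the function names are invented):

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <ftw.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical sketch: before a recursive remove, walk the tree and
 * give every directory rwx for the owner again, so a deliberately
 * chmod(0)'d directory (like the test's "open_directory_fails" copy of
 * pg_subtrans) can be entered and deleted.  Note nftw() cannot descend
 * into a directory until it is readable, so a tree with *nested*
 * unreadable directories would need repeated passes; one pass suffices
 * for the single mode-0 directory the TAP test creates.
 */
static int
restore_dir_perms(const char *path, const struct stat *sb,
				  int typeflag, struct FTW *ftwbuf)
{
	/* FTW_DNR marks a directory that could not be read, e.g. mode 0. */
	if (typeflag == FTW_D || typeflag == FTW_DNR)
		(void) chmod(path, 0700);
	return 0;					/* keep walking */
}

int
make_tree_removable(const char *root)
{
	/* FTW_PHYS: do not follow symbolic links while walking. */
	return nftw(root, restore_dir_perms, 16, FTW_PHYS);
}
```

In Perl terms the equivalent would be a chmod in a pre-order directory hook before the unlink step, but whether that belongs in the buildfarm client or in the test's own cleanup is a separate question.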
Fabien COELHO <coelho@cri.ensmp.fr> writes: > The only strange thing under buildroot I found is: > HEAD/pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans/ > this last directory perms are d--------- which seems to break cleanup. Locally, I observe that "make clean" in src/bin/pg_validatebackup fails to clean up the tmp_check directory left behind by "make check". So the new makefile is not fully plugged into its standard responsibilities. I don't see any unreadable subdirectories though. I wonder if VPATH versus not-VPATH might be a relevant factor ... regards, tom lane
On 2020-Apr-03, Tom Lane wrote: > I wonder if VPATH versus not-VPATH might be a relevant factor ... Oh, absolutely. The ones that failed show, in the last successful run, the configure line invoked as "./configure", while the animals that are still running are invoking configure from some other directory. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Apr 4, 2020 at 11:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Fabien COELHO <coelho@cri.ensmp.fr> writes: > > The only strange thing under buildroot I found is: > > > HEAD/pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans/ > > > this last directory perms are d--------- which seems to break cleanup. Same here, on elver. I see pg_subtrans has been chmod(0)'d, presumably by the perl subroutine mutilate_open_directory_fails. I see this in my inbox (the build farm wrote it to stderr or stdout rather than the log file): cannot chdir to child for pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans: Permission denied at ./run_build.pl line 1013. cannot remove directory for pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails: Directory not empty at ./run_build.pl line 1013. cannot remove directory for pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup: Directory not empty at ./run_build.pl line 1013. cannot remove directory for pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data: Directory not empty at ./run_build.pl line 1013. cannot remove directory for pgsql.build/src/bin/pg_validatebackup/tmp_check: Directory not empty at ./run_build.pl line 1013. cannot remove directory for pgsql.build/src/bin/pg_validatebackup: Directory not empty at ./run_build.pl line 1013. cannot remove directory for pgsql.build/src/bin: Directory not empty at ./run_build.pl line 1013. cannot remove directory for pgsql.build/src: Directory not empty at ./run_build.pl line 1013. cannot remove directory for pgsql.build: Directory not empty at ./run_build.pl line 1013. cannot chdir to child for pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans: Permission denied at ./run_build.pl line 589. 
cannot remove directory for pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails: Directory not empty at ./run_build.pl line 589. cannot remove directory for pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup: Directory not empty at ./run_build.pl line 589. cannot remove directory for pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data: Directory not empty at ./run_build.pl line 589. cannot remove directory for pgsql.build/src/bin/pg_validatebackup/tmp_check: Directory not empty at ./run_build.pl line 589. cannot remove directory for pgsql.build/src/bin/pg_validatebackup: Directory not empty at ./run_build.pl line 589. cannot remove directory for pgsql.build/src/bin: Directory not empty at ./run_build.pl line 589. cannot remove directory for pgsql.build/src: Directory not empty at ./run_build.pl line 589. cannot remove directory for pgsql.build: Directory not empty at ./run_build.pl line 589.
Greetings, * Thomas Munro (thomas.munro@gmail.com) wrote: > On Sat, Apr 4, 2020 at 11:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Fabien COELHO <coelho@cri.ensmp.fr> writes: > > > The only strange thing under buildroot I found is: > > > > > HEAD/pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans/ > > > > > this last directory perms are d--------- which seems to break cleanup. > > Same here, on elver. I see pg_subtrans has been chmod(0)'d, > presumably by the perl subroutine mutilate_open_directory_fails. I > see this in my inbox (the build farm wrote it to stderr or stdout > rather than the log file): Yup, saw the same here. chmod'ing it to 755 seemed to result in the next run cleaning it up, at least. Not sure how things will go on the next actual build tho. Thanks, Stephen
Thomas Munro <thomas.munro@gmail.com> writes: > Same here, on elver. I see pg_subtrans has been chmod(0)'d, > presumably by the perl subroutine mutilate_open_directory_fails. I > see this in my inbox (the build farm wrote it to stderr or stdout > rather than the log file): > cannot chdir to child for > pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans: > Permission denied at ./run_build.pl line 1013. > cannot remove directory for > pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails: > Directory not empty at ./run_build.pl line 1013. I'm guessing that we're looking at a platform-specific difference in whether "rm -rf" fails outright on an unreadable subdirectory, or just tries to carry on by unlinking it anyway. A partial fix would be to have the test script put back normal permissions on that directory before it exits ... but any failure partway through the script would leave a time bomb requiring manual cleanup. On the whole, I'd argue that testing that behavior is not valuable enough to take risks of periodically breaking buildfarm members in a way that will require manual recovery --- to say nothing of annoying developers who trip over it. So my vote is to remove that part of the test and be satisfied with checking the behavior for an unreadable file. This doesn't directly explain the failure-at-next-configure behavior that we're seeing in the buildfarm, but it wouldn't be too surprising if it ends up being that the buildfarm client script doesn't manage to fully recover from the situation. regards, tom lane
I wrote: > I'm guessing that we're looking at a platform-specific difference in > whether "rm -rf" fails outright on an unreadable subdirectory, or > just tries to carry on by unlinking it anyway. Yeah... on my RHEL6 box, "make check" cleans up the working directories under tmp_check, but on a FreeBSD 12.1 box, not so much: I'm left with $ ls tmp_check/ log/ t_003_corruption_master_data/ tgl@oldmini$ ls -R tmp_check/t_003_corruption_master_data/ backup/ tmp_check/t_003_corruption_master_data/backup: open_directory_fails/ tmp_check/t_003_corruption_master_data/backup/open_directory_fails: pg_subtrans/ tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans: ls: tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans: Permission denied I did not see any complaints printed to the terminal, but in regress_log_003_corruption there's ... ok 40 - corrupt backup fails validation: open_directory_fails: matches cannot chdir to child for /usr/home/tgl/pgsql/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans: Permission denied at t/003_corruption.pl line 126. cannot remove directory for /usr/home/tgl/pgsql/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails: Directory not empty at t/003_corruption.pl line 126. # Running: pg_basebackup -D /usr/home/tgl/pgsql/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/search_directory_fails --no-sync -T /tmp/lxaL_sLcnr=/tmp/_fegwVjoDR ok 41 - base backup ok ... This may be more of a Perl version issue than a platform issue, but either way it's a problem. 
Also, on the FreeBSD box, "rm -rf" isn't happy either: $ rm -rf tmp_check rm: tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans: Permission denied rm: tmp_check/t_003_corruption_master_data/backup/open_directory_fails: Directory not empty rm: tmp_check/t_003_corruption_master_data/backup: Directory not empty rm: tmp_check/t_003_corruption_master_data: Directory not empty rm: tmp_check: Directory not empty regards, tom lane
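The failure mode Tom demonstrates above — a mode-0 subdirectory defeating recursive removal on some platforms — can be reproduced, and worked around, with a few lines of shell. This is a minimal sketch (all paths invented for the demo), showing the defensive pattern of restoring owner permissions before attempting the cleanup:

```shell
#!/bin/sh
# Sketch of the cleanup problem: an unreadable (mode 0) subdirectory can
# make a plain "rm -rf" fail on some platforms. Restoring u+rwx on the
# whole tree first lets the removal succeed everywhere.
set -e
workdir=$(mktemp -d)
mkdir -p "$workdir/tmp_check/backup/open_directory_fails/pg_subtrans"
chmod 0 "$workdir/tmp_check/backup/open_directory_fails/pg_subtrans"

# Defensive cleanup: put permissions back before removing anything.
chmod -R u+rwx "$workdir/tmp_check"
rm -rf "$workdir/tmp_check"

[ ! -d "$workdir/tmp_check" ] && echo "cleaned up"
rmdir "$workdir"
```

The same idea is what the pushed fix does in the TAP test: chmod the mutilated directory back before exiting, so the framework's own rmtree does not trip over it.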
On Fri, Apr 3, 2020 at 6:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm guessing that we're looking at a platform-specific difference in > whether "rm -rf" fails outright on an unreadable subdirectory, or > just tries to carry on by unlinking it anyway. My intention was that it would be cleaned by the TAP framework itself, since the temporary directories it creates are marked for cleanup. But it may be that there's a platform dependency in the behavior of Perl's File::Path::rmtree, too. > A partial fix would be to have the test script put back normal > permissions on that directory before it exits ... but any failure > partway through the script would leave a time bomb requiring manual > cleanup. Yeah. I've pushed that fix for now, but as you say, it may not survive contact with the enemy. That's kind of disappointing, because I put a lot of work into trying to make the tests cover every line of code that they possibly could, and there's no reason to suppose that pg_validatebackup is the only tool that could benefit from having code coverage of those kinds of scenarios. It's probably not even the tool that is most in need of such testing; it must be far worse if, say, pg_rewind can't cope with it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Apr 3, 2020 at 5:58 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote: > seawasp just failed the same way. Good news, I can see "configure" under > "HEAD/pgsql". Ah, good. > The only strange thing under buildroot I found is: > > HEAD/pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans/ Huh. I wonder how that got left behind ... it should've been cleaned up by the TAP test framework. But I pushed a commit to change the permissions back explicitly before exiting. As Tom says, I probably need to remove that entire test, but I'm going to try this first. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Apr 3, 2020 at 6:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I'm guessing that we're looking at a platform-specific difference in >> whether "rm -rf" fails outright on an unreadable subdirectory, or >> just tries to carry on by unlinking it anyway. > My intention was that it would be cleaned by the TAP framework itself, > since the temporary directories it creates are marked for cleanup. But > it may be that there's a platform dependency in the behavior of Perl's > File::Path::rmtree, too. Yeah, so it would seem. The buildfarm script uses rmtree to clean out the old build tree. The man page for File::Path suggests (but can't quite bring itself to say in so many words) that by default, rmtree will adjust the permissions on target directories to allow the deletion to succeed. But that's very clearly not happening on some platforms. (Maybe that represents a local patch on the part of some packagers who thought it was too unsafe?) Anyway, the end state presumably is that the pgsql.build directory is still there at the end of the buildfarm run, and the next run's attempt to also rmtree it fares no better. Then look what it does to set up the new build: system("cp -R -p $target $build_path 2>&1"); Of course, if $build_path already exists, then cp copies to a subdirectory of the target not the target itself. So that explains the symptom "./configure does not exist" --- it exists all right, but in a subdirectory below the one where the buildfarm expects it to be. It looks to me like the same problem would occur with VPATH or no. The lack of failures among the VPATH-using critters probably has more to do with whether their rmtree is willing to deal with this case than with VPATH. Anyway, it's evident that the buildfarm critters that are busted will need manual cleanup, because the script is not going to be able to get out of this by itself. 
I remain of the opinion that the hazard of that happening again in the future (eg, if a buildfarm animal loses power during the test) is sufficient reason to remove this test case. regards, tom lane
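The `cp -R` behavior Tom describes — a surviving stale destination causing the copy to nest one level deeper — is easy to demonstrate. A short shell sketch (directory names invented for the demo, standing in for the buildfarm's `$target` and `$build_path`):

```shell
#!/bin/sh
# Demo: if the destination directory does not exist, "cp -R" creates it
# as a copy of the source; but if a stale destination survives an
# earlier failed cleanup, cp copies the source *into* it instead,
# leaving files like "configure" one level too deep.
set -e
tmp=$(mktemp -d); cd "$tmp"
mkdir pgsql && touch pgsql/configure

cp -R pgsql pgsql.build            # fresh: pgsql.build/configure
ls pgsql.build/configure >/dev/null && echo "first copy: top level"

cp -R pgsql pgsql.build            # stale dir: pgsql.build/pgsql/configure
ls pgsql.build/pgsql/configure >/dev/null && echo "second copy: nested"
```

This matches the observed symptom exactly: `./configure` exists, but in a subdirectory below where the buildfarm script expects it.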
BTW, some of the buildfarm is showing a simpler portability problem: they think you were too cavalier about the difference between time_t and pg_time_t. (On a platform with 32-bit time_t, that's an actual bug, probably.) lapwing is actually failing: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2020-04-03%2021%3A41%3A49 ccache gcc -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -O2 -Werror -I. -I. -I../../../src/include -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/et -c -o basebackup.o basebackup.c basebackup.c: In function 'AddFileToManifest': basebackup.c:1199:10: error: passing argument 1 of 'pg_gmtime' from incompatible pointer type [-Werror] In file included from ../../../src/include/access/xlog_internal.h:26:0, from basebackup.c:20: ../../../src/include/pgtime.h:49:22: note: expected 'const pg_time_t *' but argument is of type 'time_t *' cc1: all warnings being treated as errors make[3]: *** [basebackup.o] Error 1 but some others are showing it as a warning. I suppose that judicious s/time_t/pg_time_t/ would fix this. regards, tom lane
On Fri, Apr 3, 2020 at 6:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Locally, I observe that "make clean" in src/bin/pg_validatebackup fails > to clean up the tmp_check directory left behind by "make check". Fixed. I also tried to fix 'lapwing', which was complaining about a call to pg_gmtime, saying that it "expected 'const pg_time_t *' but argument is of type 'time_t *'". I was thinking that the problem had something to do with const, but Thomas pointed out to me that pg_time_t != time_t, so I pushed a fix which assumes that was the issue. (It was certainly *an* issue.) 'prairiedog' is also unhappy, and it looks related: /bin/sh ../../../../config/install-sh -c -d '/Users/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts'/tmp_check cd . && TESTDIR='/Users/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts' PATH="/Users/buildfarm/bf-data/HEAD/pgsql.build/tmp_install/Users/buildfarm/bf-data/HEAD/inst/bin:$PATH" DYLD_LIBRARY_PATH="/Users/buildfarm/bf-data/HEAD/pgsql.build/tmp_install/Users/buildfarm/bf-data/HEAD/inst/lib:$DYLD_LIBRARY_PATH" PGPORT='65678' PG_REGRESS='/Users/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/../../../../src/test/regress/pg_regress' REGRESS_SHLIB='/Users/buildfarm/bf-data/HEAD/pgsql.build/src/test/regress/regress.so' /usr/local/perl5.8.3/bin/prove -I ../../../../src/test/perl/ -I . t/*.pl t/001_base.........ok t/002_standby......FAILED--Further testing stopped: system pg_basebackup failed make: *** [check] Error 25 Unfortunately, that error message is not very informative and for some reason the TAP logs don't seem to be included in the buildfarm output in this case, so it's hard to tell exactly what went wrong. This appears to be another 32-bit critter, which may be related somehow, but I don't know how exactly. 'serinus' is also failing. This is less obviously related: [02:08:55] t/003_constraints.pl .. 
ok 2048 ms ( 0.01 usr 0.00 sys + 1.28 cusr 0.38 csys = 1.67 CPU) # poll_query_until timed out executing this query: # SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's'); # expecting this output: # t # last actual query output: # f # with stderr: But there's also this: 2020-04-04 02:08:57.297 CEST [5e87d019.506c1:1] LOG: connection received: host=[local] 2020-04-04 02:08:57.298 CEST [5e87d019.506c1:2] LOG: replication connection authorized: user=bf application_name=tap_sub_16390_sync_16384 2020-04-04 02:08:57.299 CEST [5e87d019.506c1:3] LOG: statement: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ 2020-04-04 02:08:57.299 CEST [5e87d019.506c1:4] LOG: received replication command: CREATE_REPLICATION_SLOT "tap_sub_16390_sync_16384" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT 2020-04-04 02:08:57.299 CEST [5e87d019.506c1:5] ERROR: replication slot "tap_sub_16390_sync_16384" already exists TRAP: FailedAssertion("owner->bufferarr.nitems == 0", File: "/home/bf/build/buildfarm-serinus/HEAD/pgsql.build/../pgsql/src/backend/utils/resowner/resowner.c", Line: 718) postgres: publisher: walsender bf [local] idle in transaction (aborted)(ExceptionalCondition+0x5c)[0x9a13ac] postgres: publisher: walsender bf [local] idle in transaction (aborted)(ResourceOwnerDelete+0x295)[0x9db8e5] postgres: publisher: walsender bf [local] idle in transaction (aborted)[0x54c61f] postgres: publisher: walsender bf [local] idle in transaction (aborted)(AbortOutOfAnyTransaction+0x122)[0x550e32] postgres: publisher: walsender bf [local] idle in transaction (aborted)[0x9b3bc9] postgres: publisher: walsender bf [local] idle in transaction (aborted)(shmem_exit+0x35)[0x80db45] postgres: publisher: walsender bf [local] idle in transaction (aborted)[0x80dc77] postgres: publisher: walsender bf [local] idle in transaction (aborted)(proc_exit+0x8)[0x80dd08] postgres: publisher: walsender bf [local] idle in transaction (aborted)(PostgresMain+0x59f)[0x83bd0f] postgres: publisher: 
walsender bf [local] idle in transaction (aborted)[0x7a0264] postgres: publisher: walsender bf [local] idle in transaction (aborted)(PostmasterMain+0xbfc)[0x7a2b8c] postgres: publisher: walsender bf [local] idle in transaction (aborted)(main+0x6fb)[0x49749b] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x7fc52d83bbbb] postgres: publisher: walsender bf [local] idle in transaction (aborted)(_start+0x2a)[0x49753a] 2020-04-04 02:08:57.302 CEST [5e87d018.5066b:4] LOG: server process (PID 329409) was terminated by signal 6: Aborted 2020-04-04 02:08:57.302 CEST [5e87d018.5066b:5] DETAIL: Failed process was running: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ That might well be related. I note in passing that the DETAIL emitted by the postmaster shows the previous SQL command rather than the more-recent replication command, which seems like something to fix. (I still really dislike the fact that we have this evil hack allowing one connection to mix and match those sets of commands...) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Apr 3, 2020 at 8:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Yeah, so it would seem. The buildfarm script uses rmtree to clean out > the old build tree. The man page for File::Path suggests (but can't > quite bring itself to say in so many words) that by default, rmtree > will adjust the permissions on target directories to allow the deletion > to succeed. But that's very clearly not happening on some platforms. > (Maybe that represents a local patch on the part of some packagers > who thought it was too unsafe?) Interestingly, on my machine, rmtree coped with a mode 0 directory just fine, but mode 0400 was more than its tiny brain could handle, so the originally committed fix had code to revert 0400 back to 0700, but I didn't add similar code to revert from 0 back to 0700 because that was working fine. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > 'prairiedog' is also unhappy, and it looks related: Yeah, gaur also failed in the same place. Both of those are alignment-picky 32-bit hardware, so I'm thinking the problem is pg_gmtime() trying to fetch a 64-bit pg_time_t from an insufficiently aligned address. I'm trying to confirm that on gaur's host right now, but it's a slow machine ... > 'serinus' is also failing. This is less obviously related: Dunno about this one. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > Interestingly, on my machine, rmtree coped with a mode 0 directory > just fine, but mode 0400 was more than its tiny brain could handle, so > the originally committed fix had code to revert 0400 back to 0700, but > I didn't add similar code to revert from 0 back to 0700 because that > was working fine. It seems really odd that an implementation could cope with mode-0 but not mode-400. Not sure I care enough to dig into the Perl library code, though. regards, tom lane
On Fri, Apr 3, 2020 at 9:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > 'prairiedog' is also unhappy, and it looks related: > > Yeah, gaur also failed in the same place. Both of those are > alignment-picky 32-bit hardware, so I'm thinking the problem is > pg_gmtime() trying to fetch a 64-bit pg_time_t from an insufficiently > aligned address. I'm trying to confirm that on gaur's host right now, > but it's a slow machine ... You might just want to wait until tomorrow and see whether it clears up in newer runs. I just pushed yet another fix that might be relevant. I think I've done about as much as I can do for tonight, though. Most things are green now, and the ones that aren't are failing because of stuff that is at least plausibly fixed. By morning it should be clearer how much broken stuff is left, although that will be somewhat complicated by at least sidewinder and seawasp needing manual intervention to get back on track. I apologize to everyone who has been or will be inconvenienced by all of this. So far I've pushed 4 test case fixes, 2 bug fixes, and 1 makefile fix, which I'm pretty sure is over quota for one patch. :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, Peter, Petr, CCed you because it's probably a bug somewhere around the initial copy code for logical replication. On 2020-04-03 20:48:09 -0400, Robert Haas wrote: > 'serinus' is also failing. This is less obviously related: Hm. Tests passed once since then. > 2020-04-04 02:08:57.299 CEST [5e87d019.506c1:4] LOG: received > replication command: CREATE_REPLICATION_SLOT > "tap_sub_16390_sync_16384" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT > 2020-04-04 02:08:57.299 CEST [5e87d019.506c1:5] ERROR: replication > slot "tap_sub_16390_sync_16384" already exists That already seems suspicious. I checked the following (successful) run and I did not see that in the stage's logs. Looking at the failing log, it fails because for some reason there are multiple rounds of copying the relation (once due to a refresh, once due to an intentional replication failure). Each creates its own temporary slot. first time: 2020-04-04 02:08:57.276 CEST [5e87d019.506bd:1] LOG: connection received: host=[local] 2020-04-04 02:08:57.278 CEST [5e87d019.506bd:4] LOG: received replication command: CREATE_REPLICATION_SLOT "tap_sub_16390_sync_16384" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT 2020-04-04 02:08:57.282 CEST [5e87d019.506bd:9] LOG: statement: COPY public.tab_rep TO STDOUT 2020-04-04 02:08:57.284 CEST [5e87d019.506bd:10] LOG: disconnection: session time: 0:00:00.007 user=bf database=postgres host=[local] second time: 2020-04-04 02:08:57.288 CEST [5e87d019.506bf:1] LOG: connection received: host=[local] 2020-04-04 02:08:57.289 CEST [5e87d019.506bf:4] LOG: received replication command: CREATE_REPLICATION_SLOT "tap_sub_16390_sync_16384" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT 2020-04-04 02:08:57.293 CEST [5e87d019.506bf:9] LOG: statement: COPY public.tab_rep TO STDOUT third time: 2020-04-04 02:08:57.297 CEST [5e87d019.506c1:1] LOG: connection received: host=[local] 2020-04-04 02:08:57.299 CEST [5e87d019.506c1:4] LOG: received replication command: CREATE_REPLICATION_SLOT "tap_sub_16390_sync_16384" TEMPORARY LOGICAL 
pgoutput USE_SNAPSHOT 2020-04-04 02:08:57.299 CEST [5e87d019.506c1:5] ERROR: replication slot "tap_sub_16390_sync_16384" already exists Note that the connection from the second attempt has not yet disconnected. Hence the error about the replication slot already existing - it's a temporary replication slot that'd otherwise already have been dropped. Seems the logical rep code needs to do something about this race? About the assertion failure: TRAP: FailedAssertion("owner->bufferarr.nitems == 0", File: "/home/bf/build/buildfarm-serinus/HEAD/pgsql.build/../pgsql/src/backend/utils/resowner/resowner.c",Line: 718) postgres: publisher: walsender bf [local] idle in transaction (aborted)(ExceptionalCondition+0x5c)[0x9a13ac] postgres: publisher: walsender bf [local] idle in transaction (aborted)(ResourceOwnerDelete+0x295)[0x9db8e5] postgres: publisher: walsender bf [local] idle in transaction (aborted)[0x54c61f] postgres: publisher: walsender bf [local] idle in transaction (aborted)(AbortOutOfAnyTransaction+0x122)[0x550e32] postgres: publisher: walsender bf [local] idle in transaction (aborted)[0x9b3bc9] postgres: publisher: walsender bf [local] idle in transaction (aborted)(shmem_exit+0x35)[0x80db45] postgres: publisher: walsender bf [local] idle in transaction (aborted)[0x80dc77] postgres: publisher: walsender bf [local] idle in transaction (aborted)(proc_exit+0x8)[0x80dd08] postgres: publisher: walsender bf [local] idle in transaction (aborted)(PostgresMain+0x59f)[0x83bd0f] postgres: publisher: walsender bf [local] idle in transaction (aborted)[0x7a0264] postgres: publisher: walsender bf [local] idle in transaction (aborted)(PostmasterMain+0xbfc)[0x7a2b8c] postgres: publisher: walsender bf [local] idle in transaction (aborted)(main+0x6fb)[0x49749b] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x7fc52d83bbbb] postgres: publisher: walsender bf [local] idle in transaction (aborted)(_start+0x2a)[0x49753a] 2020-04-04 02:08:57.302 CEST [5e87d018.5066b:4] LOG: server 
process (PID 329409) was terminated by signal 6: Aborted Due to the log_line_prefix used, I was at first not entirely sure the backend that crashed was the one with the ERROR. But it appears we print the pid as hex for '%c' (why?), so it indeed is the one. I, again, have to say that the amount of stuff that was done as part of commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920 Author: Peter Eisentraut <peter_e@gmx.net> Date: 2017-03-23 08:36:36 -0400 Logical replication support for initial data copy is insane. Adding support for running sql over replication connections and extending CREATE_REPLICATION_SLOT with new options (without even mentioning that in the commit message!) as part of a commit described as "Logical replication support for initial data copy" shouldn't happen. It's not obvious to me what buffer pins could be held at this point. I wonder if this could be somehow related to commit 3cb646264e8ced9f25557ce271284da512d92043 Author: Tom Lane <tgl@sss.pgh.pa.us> Date: 2018-07-18 12:15:16 -0400 Use a ResourceOwner to track buffer pins in all cases. ... In passing, remove some other ad-hoc resource owner creations that had gotten cargo-culted into various other places. As far as I can tell that was all unnecessary, and if it had been necessary it was incomplete, due to lacking any provision for clearing those resowners later. (Also worth noting in this connection is that a process that hasn't called InitBufferPoolBackend has no business accessing buffers; so there's more to do than just add the resowner if we want to touch buffers in processes not covered by this patch.) which removed the resowner previously used in walsender. At the very least we should remove the SavedResourceOwnerDuringExport dance that's still done in snapbuild.c. But it can't really be at fault here, because the crashing backend won't have used that. So I'm a bit confused here. The best approach is probably to try to reproduce this by adding an artificial delay into backend shutdown. 
> (I still really dislike the fact that we have this evil hack allowing > one connection to mix and match those sets of commands...) FWIW, I think the opposite. We should get rid of the difference as much as possible. Greetings, Andres Freund
On 04/04/2020 05:06, Andres Freund wrote: > Hi, > > Peter, Petr, CCed you because it's probably a bug somewhere around the > initial copy code for logical replication. > > > On 2020-04-03 20:48:09 -0400, Robert Haas wrote: >> 'serinus' is also failing: This is less obviously related: > > Hm. Tests passed once since then. > > >> 2020-04-04 02:08:57.299 CEST [5e87d019.506c1:4] LOG: received >> replication command: CREATE_REPLICATION_SLOT >> "tap_sub_16390_sync_16384" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT >> 2020-04-04 02:08:57.299 CEST [5e87d019.506c1:5] ERROR: replication >> slot "tap_sub_16390_sync_16384" already exists > > That already seems suspicious. I checked the following (successful) run > and I did not see that in the stage's logs. > > Looking at the failing log, it fails because for some reason there's > rounds (once due to a refresh, once due to an intention replication > failure) of copying the relation. Each creates its own temporary slot. > > first time: > 2020-04-04 02:08:57.276 CEST [5e87d019.506bd:1] LOG: connection received: host=[local] > 2020-04-04 02:08:57.278 CEST [5e87d019.506bd:4] LOG: received replication command: CREATE_REPLICATION_SLOT "tap_sub_16390_sync_16384" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT > 2020-04-04 02:08:57.282 CEST [5e87d019.506bd:9] LOG: statement: COPY public.tab_rep TO STDOUT > 2020-04-04 02:08:57.284 CEST [5e87d019.506bd:10] LOG: disconnection: session time: 0:00:00.007 user=bf database=postgres host=[local] > > second time: > 2020-04-04 02:08:57.288 CEST [5e87d019.506bf:1] LOG: connection received: host=[local] > 2020-04-04 02:08:57.289 CEST [5e87d019.506bf:4] LOG: received replication command: CREATE_REPLICATION_SLOT "tap_sub_16390_sync_16384" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT > 2020-04-04 02:08:57.293 CEST [5e87d019.506bf:9] LOG: statement: COPY public.tab_rep TO STDOUT > > third time: > 2020-04-04 02:08:57.297 CEST [5e87d019.506c1:1] LOG: connection received: host=[local] > 2020-04-04 02:08:57.299 CEST 
[5e87d019.506c1:4] LOG: received replication command: CREATE_REPLICATION_SLOT "tap_sub_16390_sync_16384" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT > 2020-04-04 02:08:57.299 CEST [5e87d019.506c1:5] ERROR: replication slot "tap_sub_16390_sync_16384" already exists > > Note that the connection from the second attempt has not yet > disconnected. Hence the error about the replication slot already > existing - it's a temporary replication slot that'd otherwise already > have been dropped. > > > Seems the logical rep code needs to do something about this race? > The downstream: > 2020-04-04 02:08:57.275 CEST [5e87d019.506bc:1] LOG: logical replication table synchronization worker for subscription "tap_sub", table "tab_rep" has started > 2020-04-04 02:08:57.282 CEST [5e87d019.506bc:2] ERROR: duplicate key value violates unique constraint "tab_rep_pkey" > 2020-04-04 02:08:57.282 CEST [5e87d019.506bc:3] DETAIL: Key (a)=(1) already exists. > 2020-04-04 02:08:57.282 CEST [5e87d019.506bc:4] CONTEXT: COPY tab_rep, line 1 > 2020-04-04 02:08:57.283 CEST [5e87d018.50689:5] LOG: background worker "logical replication worker" (PID 329404) exited with exit code 1 > 2020-04-04 02:08:57.287 CEST [5e87d019.506be:1] LOG: logical replication table synchronization worker for subscription "tap_sub", table "tab_rep" has started > 2020-04-04 02:08:57.293 CEST [5e87d019.506be:2] ERROR: duplicate key value violates unique constraint "tab_rep_pkey" > 2020-04-04 02:08:57.293 CEST [5e87d019.506be:3] DETAIL: Key (a)=(1) already exists. 
> 2020-04-04 02:08:57.293 CEST [5e87d019.506be:4] CONTEXT: COPY tab_rep, line 1 > 2020-04-04 02:08:57.295 CEST [5e87d018.50689:6] LOG: background worker "logical replication worker" (PID 329406) exited with exit code 1 > 2020-04-04 02:08:57.297 CEST [5e87d019.506c0:1] LOG: logical replication table synchronization worker for subscription "tap_sub", table "tab_rep" has started > 2020-04-04 02:08:57.299 CEST [5e87d019.506c0:2] ERROR: could not create replication slot "tap_sub_16390_sync_16384": ERROR: replication slot "tap_sub_16390_sync_16384" already exists > 2020-04-04 02:08:57.300 CEST [5e87d018.50689:7] LOG: background worker "logical replication worker" (PID 329408) exited with exit code Looks like we are simply retrying so fast that upstream will not have finished cleaning up the second attempt by the time we run the third one. The last_start_times mechanism is supposed to protect against that, so I guess there is some issue with how that works. -- Petr Jelinek 2ndQuadrant - PostgreSQL Solutions for the Enterprise https://www.2ndQuadrant.com/
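The guard Petr mentions (last_start_times in the subscriber's launcher) is meant to enforce a minimum interval between restarts of a sync worker for the same table. A toy model of that throttle in shell — all names and numbers here are invented for illustration, not taken from the actual launcher code:

```shell
#!/bin/sh
# Toy model of a per-table restart throttle: a worker that keeps failing
# is only relaunched if at least MIN_INTERVAL seconds of (simulated)
# clock have passed since its last start.
MIN_INTERVAL=2
last_start=-2     # "never started": far enough in the past to allow a start
attempts=0
starts=0
now=0             # simulated clock, advances one second per retry loop
while [ "$attempts" -lt 5 ]; do
    attempts=$((attempts + 1))
    if [ $((now - last_start)) -ge "$MIN_INTERVAL" ]; then
        starts=$((starts + 1))     # relaunch the sync worker
        last_start=$now
    fi                             # otherwise: too soon, skip this round
    now=$((now + 1))
done
echo "attempts=$attempts starts=$starts"   # prints: attempts=5 starts=3
```

If such a check worked as intended, the third CREATE_REPLICATION_SLOT above could not have raced the second attempt's still-open connection; the log suggests the throttle is being bypassed or reset somewhere.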
On Fri, Apr 3, 2020 at 11:06 PM Andres Freund <andres@anarazel.de> wrote: > On 2020-04-03 20:48:09 -0400, Robert Haas wrote: > > 'serinus' is also failing. This is less obviously related: > > Hm. Tests passed once since then. Yeah, but conchuela also failed once in what I think was a similar way. I suspect the fix I pushed last night (3e0d80fd8d3dd4f999e0d3aa3e591f480d8ad1fd) may have been enough to clear this up. > That already seems suspicious. I checked the following (successful) run > and I did not see that in the stage's logs. Yeah, the behavior of the test case doesn't seem to be entirely deterministic. > I, again, have to say that the amount of stuff that was done as part of > > commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920 > Author: Peter Eisentraut <peter_e@gmx.net> > Date: 2017-03-23 08:36:36 -0400 > > Logical replication support for initial data copy > > is insane. Adding support for running sql over replication connections > and extending CREATE_REPLICATION_SLOT with new options (without even > mentioning that in the commit message!) as part of a commit described as > "Logical replication support for initial data copy" shouldn't happen. I agreed then and still do. > So I'm a bit confused here. The best approach is probably to try to > reproduce this by adding an artificial delay into backend shutdown. I was able to reproduce an assertion failure by starting a transaction, running a replication command that failed, and then exiting the backend. 3e0d80fd8d3dd4f999e0d3aa3e591f480d8ad1fd made that go away. I had wrongly assumed that there was no other way for a walsender to have a ResourceOwner, and in the face of SQL commands also being executed by walsenders, that's clearly not true. I'm not sure *precisely* how that led to the BF failures, but it was really clear that it was wrong. > > (I still really dislike the fact that we have this evil hack allowing > > one connection to mix and match those sets of commands...) > > FWIW, I think the opposite.
> We should get rid of the difference as much as possible. Well, that's another approach. It's OK to have one system and it's OK to have two systems, but one and a half is not ideal. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Apr 3, 2020 at 8:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > BTW, some of the buildfarm is showing a simpler portability problem: > they think you were too cavalier about the difference between time_t > and pg_time_t. (On a platform with 32-bit time_t, that's an actual > bug, probably.) lapwing is actually failing: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2020-04-03%2021%3A41%3A49 > > ccache gcc -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -O2 -Werror -I. -I. -I../../../src/include -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/et -c -o basebackup.o basebackup.c > basebackup.c: In function 'AddFileToManifest': > basebackup.c:1199:10: error: passing argument 1 of 'pg_gmtime' from incompatible pointer type [-Werror] > In file included from ../../../src/include/access/xlog_internal.h:26:0, > from basebackup.c:20: > ../../../src/include/pgtime.h:49:22: note: expected 'const pg_time_t *' but argument is of type 'time_t *' > cc1: all warnings being treated as errors > make[3]: *** [basebackup.o] Error 1 > > but some others are showing it as a warning. > > I suppose that judicious s/time_t/pg_time_t/ would fix this. I think you sent this email just after I pushed db1531cae00941bfe4f6321fdef1e1ef355b6bed, or maybe after I'd committed it locally and just before I pushed it. If you prefer a different fix than what I did there, I can certainly whack it around some more. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Apr 3, 2020 at 10:43 PM Robert Haas <robertmhaas@gmail.com> wrote: > I think I've done about as much as I can do for tonight, though. Most > things are green now, and the ones that aren't are failing because of > stuff that is at least plausibly fixed. By morning it should be > clearer how much broken stuff is left, although that will be somewhat > complicated by at least sidewinder and seawasp needing manual > intervention to get back on track. Taking stock of the situation this morning, most of the buildfarm is now green. There are three failures, on eelpout (6 hours ago), fairywren (17 hours ago), and hyrax (3 days, 7 hours ago). eelpout is unhappy because:

+WARNING: could not remove shared memory segment "/PostgreSQL.248989127": No such file or directory
+WARNING: could not remove shared memory segment "/PostgreSQL.1450751626": No such file or directory
 multibatch
------------
 f
@@ -861,22 +863,15 @@
 select length(max(s.t)) from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
- length
---------
- 320000
-(1 row)
-
+ERROR: could not open shared memory segment "/PostgreSQL.605707657": No such file or directory
+CONTEXT: parallel worker

I'm not sure what caused that exactly, but it sorta looks like operator intervention. Thomas, any ideas? fairywren's last run was on 21dc488, and commit 460314db08e8688e1a54a0a26657941e058e45c5 was an attempt to fix what broke there. I guess we'll find out whether that worked the next time it runs. hyrax's last run was before any of this happened, so it seems to have an unrelated problem. The last two runs, three and six days ago, both failed like this: -ERROR: stack depth limit exceeded +ERROR: stack depth limit exceeded at character 8 Not sure what that's about. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Apr 3, 2020 at 8:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I suppose that judicious s/time_t/pg_time_t/ would fix this. > I think you sent this email just after I pushed > db1531cae00941bfe4f6321fdef1e1ef355b6bed, or maybe after I'd committed > it locally and just before I pushed it. If you prefer a different fix > than what I did there, I can certainly whack it around some more. Yeah, that commit showed up moments after I sent this. Your fix seems fine -- at least prairiedog and gaur are OK with it. (I did verify that gaur was reproducibly crashing at that new pg_strftime call, so we know it was that and not some on-again-off-again issue.) regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > hyrax's last run was before any of this happened, so it seems to have > an unrelated problem. The last two runs, three and six days ago, both > failed like this: > -ERROR: stack depth limit exceeded > +ERROR: stack depth limit exceeded at character 8 > Not sure what that's about. What it looks like is that hyrax is managing to detect stack overflow at a point where an errcontext callback is active that adds an error cursor to the failure. It's not so surprising that we could get a different result that way from a CLOBBER_CACHE_ALWAYS animal like hyrax, since CCA-forced cache reloads would cause extra stack expenditure at a lot of places. And it could vary depending on totally random details, like the number of local variables in seemingly unrelated code. What is odd is that (AFAIR) we've never seen this before. Maybe somebody recently added an error cursor callback in a place that didn't have it before, and is involved in SQL-function processing? None of the commits leading up to the earlier failure look promising for that, though. regards, tom lane
On Sat, Apr 4, 2020 at 10:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > It's not so surprising that we could get a different result that way > from a CLOBBER_CACHE_ALWAYS animal like hyrax, since CCA-forced > cache reloads would cause extra stack expenditure at a lot of places. > And it could vary depending on totally random details, like the number > of local variables in seemingly unrelated code. Oh, yeah. That's unfortunate. > What is odd is that > (AFAIR) we've never seen this before. Maybe somebody recently added > an error cursor callback in a place that didn't have it before, and > is involved in SQL-function processing? None of the commits leading > up to the earlier failure look promising for that, though. The relevant range of commits (e8b1774fc2 to a7b9d24e4e) includes an ereport change (bda6dedbea) and a couple of "simple expression" changes (8f59f6b9c0, fbc7a71608) but I don't know exactly why they would have caused this. It seems at least possible, though, that changing the return type of functions involved in error reporting would slightly change the amount of stack space used; and the others are related to SQL-function processing. Other than experimenting on that machine, I'm not sure how we could really determine the relevant factors here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Sat, Apr 4, 2020 at 10:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> What is odd is that >> (AFAIR) we've never seen this before. Maybe somebody recently added >> an error cursor callback in a place that didn't have it before, and >> is involved in SQL-function processing? None of the commits leading >> up to the earlier failure look promising for that, though. > The relevant range of commits (e8b1774fc2 to a7b9d24e4e) includes an > ereport change (bda6dedbea) and a couple of "simple expression" > changes (8f59f6b9c0, fbc7a71608) but I don't know exactly why they > would have caused this. When I first noticed hyrax's failure, some days ago, I immediately thought of the "simple expression" patch. But that should not have affected SQL-function processing in any way: the bulk of the changes were in plpgsql, and even the changes in plancache could not be relevant, because functions.c does not use the plancache. As for ereport, you'd think that that would only matter once you were already doing an ereport. The point at which the stack overflow check triggers should be in normal code, not error recovery. > It seems at least possible, though, that > changing the return type of functions involved in error reporting > would slightly change the amount of stack space used; Right, but if it's down to that sort of phase-of-the-moon codegen difference, you'd think this failure would have been coming and going for years. I still suppose that some fairly recent change must be contributing to this, but haven't had time to investigate. > Other than experimenting on > that machine, I'm not sure how we could really determine the relevant > factors here. We don't have a lot of CCA buildfarm machines, so I'm suspecting that it's probably not that hard to repro if you build with CCA. regards, tom lane
On Sun, Apr 5, 2020 at 2:36 AM Robert Haas <robertmhaas@gmail.com> wrote: > eelpout is unhappy because: > > +WARNING: could not remove shared memory segment > "/PostgreSQL.248989127": No such file or directory > +WARNING: could not remove shared memory segment > "/PostgreSQL.1450751626": No such file or directory Seems to have fixed itself while I was sleeping. I did happen to run apt-get upgrade on that box some time yesterday-ish, but I don't understand what mechanism would trash my /dev/shm in that process. /me eyes systemd with suspicion
On 2020-04-04 04:43, Robert Haas wrote: > I think I've done about as much as I can do for tonight, though. Most > things are green now, and the ones that aren't are failing because of > stuff that is at least plausibly fixed. By morning it should be > clearer how much broken stuff is left, although that will be somewhat > complicated by at least sidewinder and seawasp needing manual > intervention to get back on track. I fixed sidewinder I think. Should clear up the next time it runs. It was the mode on the directory it couldn't handle. A regular rm -rf didn't work; I had to do a chmod -R 700 on all directories to be able to remove it manually. /Mikael
Hi, On 2020-04-03 15:22:23 -0400, Robert Haas wrote: > I've pushed all the patches. Seeing new warnings in an optimized build /home/andres/src/postgresql-master/src/bin/pg_validatebackup/parse_manifest.c: In function 'json_manifest_object_end': /home/andres/src/postgresql-master/src/bin/pg_validatebackup/parse_manifest.c:591:2: warning: 'end_lsn' may be used uninitialized in this function [-Wmaybe-uninitialized] 591 | context->perwalrange_cb(context, tli, start_lsn, end_lsn); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/andres/src/postgresql-master/src/bin/pg_validatebackup/parse_manifest.c:567:5: note: 'end_lsn' was declared here 567 | end_lsn; | ^~~~~~~ /home/andres/src/postgresql-master/src/bin/pg_validatebackup/parse_manifest.c:591:2: warning: 'start_lsn' may be used uninitialized in this function [-Wmaybe-uninitialized] 591 | context->perwalrange_cb(context, tli, start_lsn, end_lsn); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/andres/src/postgresql-master/src/bin/pg_validatebackup/parse_manifest.c:566:13: note: 'start_lsn' was declared here 566 | XLogRecPtr start_lsn, | ^~~~~~~~~ The warnings don't seem too unreasonable. The compiler can't see that the error_cb inside json_manifest_parse_failure() is not expected to return. Probably worth adding a wrapper around the calls to context->error_cb and mark that as noreturn. - Andres
On 4/5/20 9:10 AM, Mikael Kjellström wrote: > On 2020-04-04 04:43, Robert Haas wrote: > >> I think I've done about as much as I can do for tonight, though. Most >> things are green now, and the ones that aren't are failing because of >> stuff that is at least plausibly fixed. By morning it should be >> clearer how much broken stuff is left, although that will be somewhat >> complicated by at least sidewinder and seawasp needing manual >> intervention to get back on track. > > I fixed sidewinder I think. Should clear up the next time it runs. > > It was the mode on the directory it couldn't handle- A regular rm -rf > didn't work I had to do a chmod -R 700 on all directories to be able > to manually remove it. > > Hmm, the buildfarm client does this at the beginning of each run to remove anything that might be left over from a previous run: rmtree("inst"); rmtree("$pgsql") unless ($from_source && !$use_vpath); Do I need to precede those with some recursive chmod commands? Perhaps the client should refuse to run if there is still something left after these. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > Hmm, the buildfarm client does this at the beginning of each run to > remove anything that might be left over from a previous run: > rmtree("inst"); > rmtree("$pgsql") unless ($from_source && !$use_vpath); Right, the point is precisely that some versions of rmtree() fail to remove a mode-0 subdirectory. > Do I need to precede those with some recursive chmod commands? Perhaps > the client should refuse to run if there is still something left after > these. I think the latter would be a very good idea, just so that this sort of failure is less obscure. Not sure about whether a recursive chmod is really going to be worth the cycles. (On the other hand, the normal case should be that there's nothing there anyway, so maybe it's not going to be costly.) regards, tom lane
Hello, >> Do I need to precede those with some recursive chmod commands? Perhaps >> the client should refuse to run if there is still something left after >> these. > > I think the latter would be a very good idea, just so that this sort of > failure is less obscure. Not sure about whether a recursive chmod is > really going to be worth the cycles. (On the other hand, the normal > case should be that there's nothing there anyway, so maybe it's not > going to be costly.) Could it be a two-stage process to minimize cost but still be resilient? rmtree if (-d $DIR) { emit warning chmodtree rmtree again if (-d $DIR) emit error } -- Fabien.
On Sun, Apr 5, 2020 at 4:07 PM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote: > Do I need to precede those with some recursive chmod commands? +1. > Perhaps > the client should refuse to run if there is still something left after > these. +1 to that, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 4/6/20 7:53 AM, Robert Haas wrote: > On Sun, Apr 5, 2020 at 4:07 PM Andrew Dunstan > <andrew.dunstan@2ndquadrant.com> wrote: >> Do I need to precede those with some recursive chmod commands? > +1. > >> Perhaps >> the client should refuse to run if there is still something left after >> these. > +1 to that, too. > See https://github.com/PGBuildFarm/client-code/commit/0ef76bb1e2629713898631b9a3380d02d41c60ad This will be in the next release, probably fairly soon. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Robert Haas <robertmhaas@gmail.com> writes: > Taking stock of the situation this morning, most of the buildfarm is > now green. There are three failures, on eelpout (6 hours ago), > fairywren (17 hours ago), and hyrax (3 days, 7 hours ago). fairywren has now done this twice in the pg_validatebackupCheck step: exec failed: Bad address at /home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/perl/TestLib.pm line 340. at /home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/perl/TestLib.pm line 340. I'm a tad suspicious that it needs another perl2host() somewhere, but the log isn't very clear as to where. More generally, I wonder if we ought to be trying to centralize those perl2host() calls instead of sticking them into individual test cases. regards, tom lane
On Mon, Apr 6, 2020 at 1:18 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote: > > > Hello, > > >> Do I need to precede those with some recursive chmod commands? Perhaps > >> the client should refuse to run if there is still something left after > >> these. > > > > I think the latter would be a very good idea, just so that this sort of > > failure is less obscure. Not sure about whether a recursive chmod is > > really going to be worth the cycles. (On the other hand, the normal > > case should be that there's nothing there anyway, so maybe it's not > > going to be costly.) > > Could it be a two-stage process to minimize cost but still be resilient? > > rmtree > if (-d $DIR) { > emit warning > chmodtree > rmtree again > if (-d $DIR) > emit error > } > I thought about doing that. However, it's not really necessary. In the normal course of events these directories should have been removed at the end of the previous run, so we're only dealing with exceptional cases here. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Apr 7, 2020 at 12:37 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > > Taking stock of the situation this morning, most of the buildfarm is > > now green. There are three failures, on eelpout (6 hours ago), > > fairywren (17 hours ago), and hyrax (3 days, 7 hours ago). > > fairywren has now done this twice in the pg_validatebackupCheck step: > > exec failed: Bad address at /home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/perl/TestLib.pm line 340. > at /home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/perl/TestLib.pm line 340. > > I'm a tad suspicious that it needs another perl2host() > somewhere, but the log isn't very clear as to where. > > More generally, I wonder if we ought to be trying to > centralize those perl2host() calls instead of sticking > them into individual test cases. > > Not sure about that. I'll see if I can run it by hand and get some more info. What's quite odd is that jacana (a very similar setup) is passing this happily. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020/04/04 4:22, Robert Haas wrote: > On Thu, Apr 2, 2020 at 4:34 PM David Steele <david@pgmasters.net> wrote: >> +1. These would be great tests to have and a win for pg_basebackup >> overall but I don't think they should be a prerequisite for this commit. > > Not to mention the server. I can't say that I have a lot of confidence > that all of the server behavior in this area is well-understood and > sane. > > I've pushed all the patches. When there is a backup_manifest in the database cluster, it's included in the backup even when --no-manifest is specified. ISTM that this is problematic because the backup_manifest is obviously not valid for the backup. So, isn't it better to always exclude the *existing* backup_manifest in the cluster from the backup, like backup_label/tablespace_map? Patch attached. Also I found the typo in the document. Patch attached. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On Wed, Apr 8, 2020 at 1:15 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > When there is a backup_manifest in the database cluster, it's included in > the backup even when --no-manifest is specified. ISTM that this is problematic > because the backup_manifest is obviously not valid for the backup. > So, isn't it better to always exclude the *existing* backup_manifest in the > cluster from the backup, like backup_label/tablespace_map? Patch attached. > > Also I found the typo in the document. Patch attached. Both patches look good. The second one is definitely a mistake on my part, and the first one seems like a totally reasonable change. Thanks! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 4/7/20 9:42 AM, Andrew Dunstan wrote: > On Tue, Apr 7, 2020 at 12:37 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> Taking stock of the situation this morning, most of the buildfarm is >>> now green. There are three failures, on eelpout (6 hours ago), >>> fairywren (17 hours ago), and hyrax (3 days, 7 hours ago). >> fairywren has now done this twice in the pg_validatebackupCheck step: >> >> exec failed: Bad address at /home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/perl/TestLib.pm line 340. >> at /home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/perl/TestLib.pm line 340. >> >> I'm a tad suspicious that it needs another perl2host() >> somewhere, but the log isn't very clear as to where. >> >> More generally, I wonder if we ought to be trying to >> centralize those perl2host() calls instead of sticking >> them into individual test cases. >> >> > > Not sure about that. I'll see if I can run it by hand and get some > more info. What's quite odd is that jacana (a very similar setup) is > passing this happily. > OK, tricky, but here's what I did to get this working on fairywren. First, on Msys2 there is a problem with name mangling. We've had to fix this before by telling it to ignore certain argument prefixes. Second, once that was fixed rmdir was failing on the tablespace. On Windows this is a junction, so unlink is the correct thing to do, I believe, just as it is on Unix where it's a symlink. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > OK, tricky, but here's what I did to get this working on fairywren. > First, on Msys2 there is a problem with name mangling. We've had to fix > this before by telling it to ignore certain argument prefixes. > Second, once that was fixed rmdir was failing on the tablespace. On > Windows this is a junction, so unlink is the correct thing to do, I > believe, just as it is on Unix where it's a symlink. Hmm, no opinion about the name mangling business, but the other part seems like it might break jacana and/or bowerbird, which are currently happy with this test? (AFAICS we only have four Windows animals running the TAP tests, and the fourth (drongo) hasn't reported in for awhile.) I guess we could commit it and find out. I'm all for the simpler coding if it works. regards, tom lane
On Wed, Apr 8, 2020 at 1:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I guess we could commit it and find out. I'm all for the simpler > coding if it works. I don't understand what the local $ENV{MSYS2_ARG_CONV_EXCL} = $source_ts_prefix does, but the remove/unlink condition was suggested by Amit Kapila on the basis of testing on his Windows development environment, so I suspect that's actually needed on at least some systems. I just work here, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 4/8/20 3:41 PM, Robert Haas wrote: > On Wed, Apr 8, 2020 at 1:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I guess we could commit it and find out. I'm all for the simpler >> coding if it works. > I don't understand what the local $ENV{MSYS2_ARG_CONV_EXCL} = > $source_ts_prefix does, You don't want to know .... See <https://www.msys2.org/wiki/Porting/#filesystem-namespaces> for the gory details. It's the tablespace map parameter that is upsetting it. > but the remove/unlink condition was suggested > by Amit Kapila on the basis of testing on his Windows development > environment, so I suspect that's actually needed on at least some > systems. I just work here, though. > Yeah, drongo doesn't like it, so we'll have to tweak the logic. I'll update after some more testing. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > On 4/8/20 3:41 PM, Robert Haas wrote: >> I don't understand what the local $ENV{MSYS2_ARG_CONV_EXCL} = >> $source_ts_prefix does, > You don't want to know .... > See <https://www.msys2.org/wiki/Porting/#filesystem-namespaces> for the > gory details. I don't want to know either, but maybe that reference should be cited somewhere near where we use this sort of hack. regards, tom lane
On 2020/04/09 2:35, Robert Haas wrote: > On Wed, Apr 8, 2020 at 1:15 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >> When there is a backup_manifest in the database cluster, it's included in >> the backup even when --no-manifest is specified. ISTM that this is problematic >> because the backup_manifest is obviously not valid for the backup. >> So, isn't it better to always exclude the *existing* backup_manifest in the >> cluster from the backup, like backup_label/tablespace_map? Patch attached. >> >> Also I found the typo in the document. Patch attached. > > Both patches look good. The second one is definitely a mistake on my > part, and the first one seems like a totally reasonable change. > Thanks! Thanks for reviewing them! I pushed them. Please note that the commit messages have not been delivered to pgsql-committers yet. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Greetings, * Fujii Masao (masao.fujii@oss.nttdata.com) wrote: > On 2020/04/09 2:35, Robert Haas wrote: > >On Wed, Apr 8, 2020 at 1:15 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > >>When there is a backup_manifest in the database cluster, it's included in > >>the backup even when --no-manifest is specified. ISTM that this is problematic > >>because the backup_manifest is obviously not valid for the backup. > >>So, isn't it better to always exclude the *existing* backup_manifest in the > >>cluster from the backup, like backup_label/tablespace_map? Patch attached. > >> > >>Also I found the typo in the document. Patch attached. > > > >Both patches look good. The second one is definitely a mistake on my > >part, and the first one seems like a totally reasonable change. > >Thanks! > > Thanks for reviewing them! I pushed them. > > Please note that the commit messages have not been delivered to > pgsql-committers yet. They've been released and your address whitelisted. Thanks, Stephen
On 2020/04/09 23:10, Stephen Frost wrote: > Greetings, > > * Fujii Masao (masao.fujii@oss.nttdata.com) wrote: >> On 2020/04/09 2:35, Robert Haas wrote: >>> On Wed, Apr 8, 2020 at 1:15 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >>>> When there is a backup_manifest in the database cluster, it's included in >>>> the backup even when --no-manifest is specified. ISTM that this is problematic >>>> because the backup_manifest is obviously not valid for the backup. >>>> So, isn't it better to always exclude the *existing* backup_manifest in the >>>> cluster from the backup, like backup_label/tablespace_map? Patch attached. >>>> >>>> Also I found the typo in the document. Patch attached. >>> >>> Both patches look good. The second one is definitely a mistake on my >>> part, and the first one seems like a totally reasonable change. >>> Thanks! >> >> Thanks for reviewing them! I pushed them. >> >> Please note that the commit messages have not been delivered to >> pgsql-committers yet. > > They've been released and your address whitelisted. Many thanks!! Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2020/04/09 23:06, Fujii Masao wrote: > > > On 2020/04/09 2:35, Robert Haas wrote: >> On Wed, Apr 8, 2020 at 1:15 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >>> When there is a backup_manifest in the database cluster, it's included in >>> the backup even when --no-manifest is specified. ISTM that this is problematic >>> because the backup_manifest is obviously not valid for the backup. >>> So, isn't it better to always exclude the *existing* backup_manifest in the >>> cluster from the backup, like backup_label/tablespace_map? Patch attached. >>> >>> Also I found the typo in the document. Patch attached. >> >> Both patches look good. The second one is definitely a mistake on my >> part, and the first one seems like a totally reasonable change. >> Thanks! > > Thanks for reviewing them! I pushed them. I found other minor issues. + When this option is specified with a value of <literal>yes</literal> + or <literal>force-escape</literal>, a backup manifest is created force-escape should be force-encode. Patch attached. - while ((c = getopt_long(argc, argv, "CD:F:r:RS:T:X:l:nNzZ:d:c:h:p:U:s:wWkvP", + while ((c = getopt_long(argc, argv, "CD:F:r:RS:T:X:l:nNzZ:d:c:h:p:U:s:wWkvPm:", "m:" seems unnecessary, so should be removed? Patch attached. + if (strcmp(basedir, "-") == 0) + { + char header[512]; + PQExpBufferData buf; + + initPQExpBuffer(&buf); + ReceiveBackupManifestInMemory(conn, &buf); backup_manifest should be received only when the manifest is enabled, so ISTM that the flag "manifest" should be checked in the above if-condition. Thought? Patch attached. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On Mon, Apr 13, 2020 at 11:09:34AM +0900, Fujii Masao wrote: > - while ((c = getopt_long(argc, argv, "CD:F:r:RS:T:X:l:nNzZ:d:c:h:p:U:s:wWkvP", > + while ((c = getopt_long(argc, argv, "CD:F:r:RS:T:X:l:nNzZ:d:c:h:p:U:s:wWkvPm:", > > "m:" seems unnecessary, so should be removed? > Patch attached. Smells like some remnant diff from a previous version. > + if (strcmp(basedir, "-") == 0) > + { > + char header[512]; > + PQExpBufferData buf; > + > + initPQExpBuffer(&buf); > + ReceiveBackupManifestInMemory(conn, &buf); > > backup_manifest should be received only when the manifest is enabled, > so ISTM that the flag "manifest" should be checked in the above if-condition. > Thought? Patch attached. > > - if (strcmp(basedir, "-") == 0) > + if (strcmp(basedir, "-") == 0 && manifest) > { > char header[512]; > PQExpBufferData buf; Indeed. Using the tar format with --no-manifest causes a failure: pg_basebackup -D - --format=t --wal-method=none \ --no-manifest > /dev/null The doc changes look right to me. Nice catches. -- Michael
Attachment
On Sun, Apr 12, 2020 at 10:09 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > I found other minor issues. I think these are all correct fixes. Thanks for the post-commit review, and sorry for these mistakes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 27, 2020 at 4:32 PM Andres Freund <andres@anarazel.de> wrote: > I don't like having a file format that's intended to be used by external > tools too that's undocumented except for code that assembles it in a > piecemeal fashion. Do you mean in a follow-on patch this release, or > later? I don't have a problem with the former. Here is a patch for that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Mon, Apr 13, 2020 at 01:40:56PM -0400, Robert Haas wrote: > On Fri, Mar 27, 2020 at 4:32 PM Andres Freund <andres@anarazel.de> wrote: > > I don't like having a file format that's intended to be used by external > > tools too that's undocumented except for code that assembles it in a > > piecemeal fashion. Do you mean in a follow-on patch this release, or > > later? I don't have a problem with the former. > > Here is a patch for that. typos: manifes hexademical (twice) -- Justin
On Mon, Apr 13, 2020 at 1:55 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > typos: > manifes > hexademical (twice) Thanks. v2 attached. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On 2020-04-13 20:08, Robert Haas wrote: > [v2-0001-Document-the-backup-manifest-file-format.patch] Can you double check this sentence? Seems strange to me but I don't know why; it may well be that my english is not good enough. Maybe a comma after 'required' makes reading easier? The timeline from which this range of WAL records will be required in order to make use of this backup. The value is an integer. One typo: 'when making using' should be 'when making use' Erik Rijkers
+ The LSN at which replay must begin on the indicated timeline in order to + make use of this backup. The LSN is stored in the format normally used + by <productname>PostgreSQL</productname>; that is, it is a string + consisting of two strings of hexademical characters, each with a length + of between 1 and 8, separated by a slash. typo "hexademical" Are these hex figures upper or lower case? No leading zeroes? This would normally not matter, but the toplevel checksum will care. Also, I see no mention of prettification-chars such as newlines or indentation. I suppose if I pass a manifest file through prettification (or Windows newline conversion), the checksum may break. As for Last-Modification, I think the spec should indicate the exact format that's used, because it'll also be critical for checksumming. Why is the top-level checksum only allowed to be SHA-256, if the files can use up to SHA-512? (Also, did we intentionally omit the dash in hash names, so "SHA-256" to make it SHA256? This will also be critical for checksumming the manifest itself.) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Apr 13, 2020 at 2:28 PM Erik Rijkers <er@xs4all.nl> wrote: > Can you double check this sentence? Seems strange to me but I don't > know why; it may well be that my english is not good enough. Maybe a > comma after 'required' makes reading easier? > > The timeline from which this range of WAL records will be required in > order to make use of this backup. The value is an integer. It sounds a little awkward to me, but not outright wrong. I'm not exactly sure how to rephrase it, though. Maybe just shorten it to "the timeline for this range of WAL records"? > One typo: > > 'when making using' should be > 'when making use' Right, thanks, fixed in my local copy. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 4/13/20 1:40 PM, Robert Haas wrote: > On Fri, Mar 27, 2020 at 4:32 PM Andres Freund <andres@anarazel.de> wrote: >> I don't like having a file format that's intended to be used by external >> tools too that's undocumented except for code that assembles it in a >> piecemeal fashion. Do you mean in a follow-on patch this release, or >> later? I don't have a problem with the former. > Here is a patch for that. > Seems ok. A tiny example, or an excerpt, might be nice. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Apr 13, 2020 at 3:34 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Are these hex figures upper or lower case? No leading zeroes? This > would normally not matter, but the toplevel checksum will care. Not really. You just feed the whole file except for the last line through shasum and you get the answer. It so happens that the server generates lower-case, but pg_verifybackup will accept either. Leading zeroes are not omitted. If the checksum's not the right length, it ain't gonna work. If SHA is used, it's the same output you would get from running shasum -a<whatever> on the file, which is certainly a fixed length. I assumed that this followed from the statement that there are two characters per byte in the checksum, and from the fact that no checksum algorithm I know about drops leading zeroes in the output. > Also, I > see no mention of prettification-chars such as newlines or indentation. > I suppose if I pass a manifest file through prettification (or Windows > newline conversion), the checksum may break. It would indeed break. I'm not sure what you want me to say here, though. If you're trying to parse a manifest, you shouldn't care about how the whitespace is arranged. If you're trying to generate one, you can arrange it any way you like, as long as you also include it in the checksum. > As for Last-Modification, I think the spec should indicate the exact > format that's used, because it'll also be critical for checksumming. Again, I don't think it really matters for checksumming, but it's "YYYY-MM-DD HH:MM:SS TZ" format, where TZ is always GMT. > Why is the top-level checksum only allowed to be SHA-256, if the files > can use up to SHA-512? If we allowed the top-level checksum to be changed to something else, then we'd probably we want to indicate which kind of checksum is being used at the beginning of the file, so as to enable incremental parsing with checksum verification at the end. 
pg_verifybackup doesn't currently do incremental parsing, but I'd like to add that sometime, if I get time to hash out the details. I think the use case for varying the checksum type of the manifest itself is much less than for varying it for the files. The big problem with checksumming the files is that it can be slow, because the files can be big. However, unless you have a truckload of empty files in the database, the manifest is going to be very small compared to the sizes of all the files, so it seemed harmless to use a stronger checksum algorithm for the manifest itself. Maybe someone with a ton of empty or nearly-empty relations will complain, but they can always use --no-manifest if they want. I agree that it's a little bit weird that you can have a stronger checksum for the files instead of the manifest itself, but I also wonder what the use case would be for using a stronger checksum on the manifest. David Steele argued that strong checksums on the files could be useful to software that wants to rifle through all the backups you've ever taken and find another copy of that file by looking for something with a matching checksum. CRC-32C wouldn't be strong enough for that, because eventually you could have enough files that you start to have collisions. The SHA algorithms output enough bits to make that quite unlikely. But this argument only makes sense for the files, not the manifest. Naturally, all this is arguable, though, and a good deal of arguing about it has been done, as you have probably noticed. I am still of the opinion that if somebody's goal is to use this facility for its intended purpose, which is to find out whether your backup got corrupted, any of these algorithms are fine, and are highly likely to tell you that you have a problem if, in fact, you do. In fact, I bet that even a checksum algorithm considerably stupider than anything I'd actually consider using would accomplish that goal in a high percentage of cases. 
But not everybody agrees with me, to the point where I am starting to wonder if I really understand how computers work. > (Also, did we intentionally omit the dash in > hash names, so "SHA-256" to make it SHA256? This will also be critical > for checksumming the manifest itself.) I debated this with myself, settled on this spelling, and nobody complained until now. It could be changed, though. I didn't have any particular reason for choosing it except the feeling that people would probably prefer to type --manifest-checksum=sha256 rather than --manifest-checksum=sha-256. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Apr 13, 2020 at 4:10 PM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote: > Seems ok. A tiny example, or an excerpt, might be nice. An empty database produces a manifest about 1200 lines long, so a full example seems like too much to include in the documentation. An excerpt could be included, I suppose. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 4/13/20 4:14 PM, Robert Haas wrote: > On Mon, Apr 13, 2020 at 3:34 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > >> Also, I >> see no mention of prettification-chars such as newlines or indentation. >> I suppose if I pass a manifest file through prettification (or Windows >> newline conversion), the checksum may break. > > It would indeed break. I'm not sure what you want me to say here, > though. If you're trying to parse a manifest, you shouldn't care about > how the whitespace is arranged. If you're trying to generate one, you > can arrange it any way you like, as long as you also include it in the > checksum. pgBackRest ignores whitespace but this is a legacy of the way Perl calculated checksums, not an intentional feature. This worked well when the manifest was loaded as a whole, converted to JSON, and checksummed, but it is a major pain for the streaming code we now have in C. I guarantee that our next manifest version will do a simple checksum of bytes as Robert has done in this feature. So, I'm +1 as implemented. >> Why is the top-level checksum only allowed to be SHA-256, if the files >> can use up to SHA-512? <snip> > I agree that it's a little bit weird that you can have a stronger > checksum for the files instead of the manifest itself, but I also > wonder what the use case would be for using a stronger checksum on the > manifest. David Steele argued that strong checksums on the files could > be useful to software that wants to rifle through all the backups > you've ever taken and find another copy of that file by looking for > something with a matching checksum. CRC-32C wouldn't be strong enough > for that, because eventually you could have enough files that you > start to have collisions. The SHA algorithms output enough bits to > make that quite unlikely. But this argument only makes sense for the > files, not the manifest. Agreed. I think SHA-256 is *more* than enough to protect the manifest against corruption. 
That said, since the cost of SHA-256 vs. SHA-512 in the context of the manifest is negligible, we could just use the stronger algorithm to deflect a similar question going forward. That choice might not age well, but we could always say, well, we picked it because it was the strongest available at the time. Allowing a choice of which algorithm to use for the manifest checksum seems like it will just make verifying the file harder with no tangible benefit. Maybe just a comment in the docs about why SHA-256 was used would be fine. >> (Also, did we intentionally omit the dash in >> hash names, so "SHA-256" to make it SHA256? This will also be critical >> for checksumming the manifest itself.) > > I debated this with myself, settled on this spelling, and nobody > complained until now. It could be changed, though. I didn't have any > particular reason for choosing it except the feeling that people would > probably prefer to type --manifest-checksum=sha256 rather than > --manifest-checksum=sha-256. +1 for sha256 rather than sha-256. Regards, -- -David david@pgmasters.net
On 2020-Apr-13, Robert Haas wrote: > On Mon, Apr 13, 2020 at 3:34 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > Are these hex figures upper or lower case? No leading zeroes? This > > would normally not matter, but the toplevel checksum will care. > > Not really. You just feed the whole file except for the last line > through shasum and you get the answer. > > It so happens that the server generates lower-case, but > pg_verifybackup will accept either. > > Leading zeroes are not omitted. If the checksum's not the right > length, it ain't gonna work. If SHA is used, it's the same output you > would get from running shasum -a<whatever> on the file, which is > certainly a fixed length. I assumed that this followed from the > statement that there are two characters per byte in the checksum, and > from the fact that no checksum algorithm I know about drops leading > zeroes in the output. Eh, apologies, I was completely unclear -- I was looking at the LSN fields when writing the above. So the leading zeroes and letter case comment refers to those in the LSN values. I agree that it doesn't matter as long as the same tool generates the json file and writes the checksum. > > Also, I see no mention of prettification-chars such as newlines or > > indentation. I suppose if I pass a manifest file through > > prettification (or Windows newline conversion), the checksum may > > break. > > It would indeed break. I'm not sure what you want me to say here, > though. If you're trying to parse a manifest, you shouldn't care about > how the whitespace is arranged. If you're trying to generate one, you > can arrange it any way you like, as long as you also include it in the > checksum. Yeah, I guess I'm just saying that it feels brittle to have a file format that's supposed to be good for data exchange and then make it itself depend on representation details such as the order that fields appear in, the letter case, or the format of newlines. 
Maybe this isn't really of concern, but it seemed strange. > > As for Last-Modification, I think the spec should indicate the exact > > format that's used, because it'll also be critical for checksumming. > > Again, I don't think it really matters for checksumming, but it's > "YYYY-MM-DD HH:MM:SS TZ" format, where TZ is always GMT. I agree that whatever format you use will work as long as it isn't modified. I think strict ISO 8601 might be preferable (with the T in the middle and ending in Z instead of " GMT"). > > Why is the top-level checksum only allowed to be SHA-256, if the > > files can use up to SHA-512? Thanks for the discussion. I think you mostly want to make sure that the manifest is sensible (not corrupt) rather than defend against somebody maliciously giving you an attacking manifest (??). I incline to agree that any SHA-2 hash is going to serve that purpose and have no further comment to make. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Apr 13, 2020 at 5:43 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Yeah, I guess I'm just saying that it feels brittle to have a file > format that's supposed to be good for data exchange and then make it > itself depend on representation details such as the order that fields > appear in, the letter case, or the format of newlines. Maybe this isn't > really of concern, but it seemed strange. I didn't want to use JSON for this at all, but I got outvoted. When I raised this issue, it was suggested that I deal with it in this way, so I did. I can't really defend it too far beyond that, although I do think that one nice thing about this is that you can verify the checksum using shell commands if you want. Just figure out the number of lines in the file, minus one, and do head -n$LINES backup_manifest | shasum -a256 and boom. If there were some whitespace-skipping thing figuring out how to reproduce the checksum calculation would be hard. > I think strict ISO 8601 might be preferable (with the T in the middle > and ending in Z instead of " GMT"). Hmm, did David suggest that before? I don't recall for sure. I think he had some suggestion, but I'm not sure if it was the same one. > > > Why is the top-level checksum only allowed to be SHA-256, if the > > > files can use up to SHA-512? > > Thanks for the discussion. I think you mostly want to make sure that > the manifest is sensible (not corrupt) rather than defend against > somebody maliciously giving you an attacking manifest (??). I incline > to agree that any SHA-2 hash is going to serve that purpose and have no > further comment to make. The code has other sanity checks against the manifest failing to parse properly, so you can't (I hope) crash it or anything even if you falsify the checksum. But suppose that there is a gremlin running around your system flipping occasional bits. 
If said gremlin flips a bit in a "0" that appears in a file's checksum string, it could become a "1", a "3", or a "7", all of which are still valid characters for a hex string. When you then tried to verify the backup, verification for that file would fail, but you'd think it was a problem with the file, rather than a problem with the manifest. The manifest checksum prevents that: you'll get a complaint about the manifest checksum being wrong rather than a complaint about the file not matching the manifest checksum. A sufficiently smart gremlin could figure out the expected checksum for the revised manifest and flip bits to make the actual value match the expected one, but I think we're worried about "chaotic neutral" gremlins, not "lawful evil" ones. That having been said, there was some discussion on the original thread about keeping your backup on regular storage and your manifest checksum in a concrete bunker at the bottom of the ocean; in that scenario, it should be possible to detect tampering in either the manifest itself or in non-WAL data files, as long as the adversary can't break SHA-256. But I'm not sure how much we should really worry about that. For me, the design center for this feature is a user who untars base.tar and forgets about 43965.tar. If that person runs pg_verifybackup, it's gonna tell them that things are broken, and that's good enough for me. It may not be good enough for everybody, but it's good enough for me. I think I'm going to go ahead and push this now, maybe with a small wording tweak as discussed upthread with Andrew. The rest of this discussion is really about whether the patch needs any design changes rather than about whether the documentation describes what the patch does, so it makes sense to me to commit this first and then if somebody wants to argue for a change they certainly can. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
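Robert's verification recipe (hash every line of the manifest except the last, then compare with the stored value) can be sketched end to end. The two-line manifest below is a fabricated stand-in, not real pg_basebackup output, and GNU coreutils (sha256sum, grep -o) is assumed; only the mechanics are the point.

```shell
# Build a toy manifest: one content line plus a final checksum line that
# covers all preceding bytes, mimicking the backup_manifest layout.
m=$(mktemp)
printf '{"PostgreSQL-Backup-Manifest-Version": 1}\n' > "$m"
sum=$(head -n1 "$m" | sha256sum | cut -d' ' -f1)
printf '{"Manifest-Checksum": "%s"}\n' "$sum" >> "$m"

# Verification, exactly as described in the message: feed everything
# except the last line through sha256sum and compare.
lines=$(( $(wc -l < "$m") - 1 ))
check=$(head -n"$lines" "$m" | sha256sum | cut -d' ' -f1)
stored=$(tail -n1 "$m" | grep -o '[0-9a-f]\{64\}')
[ "$check" = "$stored" ] && echo "manifest checksum OK"
```

Any edit to the earlier lines, including whitespace or newline changes, makes the recomputed value diverge from the stored one, which is the brittleness Alvaro is pointing at.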
On 4/14/20 12:56 PM, Robert Haas wrote: > On Mon, Apr 13, 2020 at 5:43 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> Yeah, I guess I'm just saying that it feels brittle to have a file >> format that's supposed to be good for data exchange and then make it >> itself depend on representation details such as the order that fields >> appear in, the letter case, or the format of newlines. Maybe this isn't >> really of concern, but it seemed strange. > > I didn't want to use JSON for this at all, but I got outvoted. When I > raised this issue, it was suggested that I deal with it in this way, > so I did. I can't really defend it too far beyond that, although I do > think that one nice thing about this is that you can verify the > checksum using shell commands if you want. Just figure out the number > of lines in the file, minus one, and do head -n$LINES backup_manifest > | shasum -a256 and boom. If there were some whitespace-skipping thing > figuring out how to reproduce the checksum calculation would be hard. > >> I think strict ISO 8601 might be preferable (with the T in the middle >> and ending in Z instead of " GMT"). > > Hmm, did David suggest that before? I don't recall for sure. I think > he had some suggestion, but I'm not sure if it was the same one. "I'm also partial to using epoch time in the manifest because it is generally easier for programs to work with. But, human-readable doesn't suck, either." Also you don't need to worry about time-zone conversion errors -- even if the source time is UTC this can easily happen if you are not careful. It also saves a parsing step. The downside is it is not human-readable but this is intended to be a machine-readable format so I don't think it's a big deal (encoded filenames will be just as opaque). If a user really needs to know the timestamp of some file (rare, I think) they can paste it into a web tool to find out. Regards, -- -David david@pgmasters.net
On 2020-Apr-14, David Steele wrote: > On 4/14/20 12:56 PM, Robert Haas wrote: > > > Hmm, did David suggest that before? I don't recall for sure. I think > > he had some suggestion, but I'm not sure if it was the same one. > > "I'm also partial to using epoch time in the manifest because it is > generally easier for programs to work with. But, human-readable doesn't > suck, either." Ugh. If you go down that road, why write human-readable contents at all? You may as well just use a binary format. But that's a very slippery slope and you won't like being at the bottom -- I don't see what that gains you. It's not like it's a lot of work to parse a timestamp in a non-internationalized well-defined human-readable format. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 4/14/20 1:27 PM, Alvaro Herrera wrote: > On 2020-Apr-14, David Steele wrote: > >> On 4/14/20 12:56 PM, Robert Haas wrote: >> >>> Hmm, did David suggest that before? I don't recall for sure. I think >>> he had some suggestion, but I'm not sure if it was the same one. >> >> "I'm also partial to using epoch time in the manifest because it is >> generally easier for programs to work with. But, human-readable doesn't >> suck, either." > > Ugh. If you go down that road, why write human-readable contents at > all? You may as well just use a binary format. But that's a very > slippery slope and you won't like to be in the bottom -- I don't see > what that gains you. It's not like it's a lot of work to parse a > timestamp in a non-internationalized well-defined human-readable format. Well, times are a special case because they are so easy to mess up. Try converting ISO-8601 to epoch time using the standard C functions on a system where TZ != UTC. Fun times. Regards, -- -David david@pgmasters.net
On 4/14/20 1:33 PM, David Steele wrote: > On 4/14/20 1:27 PM, Alvaro Herrera wrote: >> On 2020-Apr-14, David Steele wrote: >> >>> On 4/14/20 12:56 PM, Robert Haas wrote: >>> >>>> Hmm, did David suggest that before? I don't recall for sure. I think >>>> he had some suggestion, but I'm not sure if it was the same one. >>> >>> "I'm also partial to using epoch time in the manifest because it is >>> generally easier for programs to work with. But, human-readable >>> doesn't >>> suck, either." >> >> Ugh. If you go down that road, why write human-readable contents at >> all? You may as well just use a binary format. But that's a very >> slippery slope and you won't like to be in the bottom -- I don't see >> what that gains you. It's not like it's a lot of work to parse a >> timestamp in a non-internationalized well-defined human-readable format. > > Well, times are a special case because they are so easy to mess up. > Try converting ISO-8601 to epoch time using the standard C functions > on a system where TZ != UTC. Fun times. > > Even if it's a zulu time? That would be pretty damn sad. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 4/14/20 3:03 PM, Andrew Dunstan wrote: > > On 4/14/20 1:33 PM, David Steele wrote: >> On 4/14/20 1:27 PM, Alvaro Herrera wrote: >>> On 2020-Apr-14, David Steele wrote: >>> >>>> On 4/14/20 12:56 PM, Robert Haas wrote: >>>> >>>>> Hmm, did David suggest that before? I don't recall for sure. I think >>>>> he had some suggestion, but I'm not sure if it was the same one. >>>> >>>> "I'm also partial to using epoch time in the manifest because it is >>>> generally easier for programs to work with. But, human-readable >>>> doesn't >>>> suck, either." >>> >>> Ugh. If you go down that road, why write human-readable contents at >>> all? You may as well just use a binary format. But that's a very >>> slippery slope and you won't like to be in the bottom -- I don't see >>> what that gains you. It's not like it's a lot of work to parse a >>> timestamp in a non-internationalized well-defined human-readable format. >> >> Well, times are a special case because they are so easy to mess up. >> Try converting ISO-8601 to epoch time using the standard C functions >> on a system where TZ != UTC. Fun times. > > Even if it's a zulu time? That would be pretty damn sad. ZULU/GMT/UTC are all fine. But if the server timezone is EDT for example (not that I recommend this) you are likely to get the wrong result. Results vary based on your platform. For instance, we found MacOS was more likely to work the way you would expect and Linux was hopeless. There are all kinds of fun tricks to get around this (sort of). One is to temporarily set TZ=UTC which sucks if an error happens before it gets set back. There are some hacks to try to determine your offset which have inherent race conditions around DST changes. After some experimentation we just used the Posix definition for epoch time and used that to do our conversions: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_16 Regards, -- -David david@pgmasters.net
On 4/14/20 3:19 PM, David Steele wrote: > On 4/14/20 3:03 PM, Andrew Dunstan wrote: >> >> On 4/14/20 1:33 PM, David Steele wrote: >>> On 4/14/20 1:27 PM, Alvaro Herrera wrote: >>>> On 2020-Apr-14, David Steele wrote: >>>> >>>>> On 4/14/20 12:56 PM, Robert Haas wrote: >>>>> >>>>>> Hmm, did David suggest that before? I don't recall for sure. I think >>>>>> he had some suggestion, but I'm not sure if it was the same one. >>>>> >>>>> "I'm also partial to using epoch time in the manifest because it is >>>>> generally easier for programs to work with. But, human-readable >>>>> doesn't >>>>> suck, either." >>>> >>>> Ugh. If you go down that road, why write human-readable contents at >>>> all? You may as well just use a binary format. But that's a very >>>> slippery slope and you won't like to be in the bottom -- I don't see >>>> what that gains you. It's not like it's a lot of work to parse a >>>> timestamp in a non-internationalized well-defined human-readable >>>> format. >>> >>> Well, times are a special case because they are so easy to mess up. >>> Try converting ISO-8601 to epoch time using the standard C functions >>> on a system where TZ != UTC. Fun times. >> >> Even if it's a zulu time? That would be pretty damn sad. > ZULU/GMT/UTC are all fine. But if the server timezone is EDT for > example (not that I recommend this) you are likely to get the wrong > result. Results vary based on your platform. For instance, we found > MacOS was more likely to work the way you would expect and Linux was > hopeless. > > There are all kinds of fun tricks to get around this (sort of). One is > to temporarily set TZ=UTC which sucks if an error happens before it > gets set back. There are some hacks to try to determine your offset > which have inherent race conditions around DST changes. 
> > After some experimentation we just used the Posix definition for epoch > time and used that to do our conversions: > > https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_16 > > > OK, but I think if we're putting a timestamp string in ISO-8601 format in the manifest it should be in UTC / Zulu time, precisely to avoid these issues. If that's too much trouble then yes an epoch time will probably do. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 4/14/20 3:55 PM, Andrew Dunstan wrote: > > OK, but I think if we're putting a timestamp string in ISO-8601 format > in the manifest it should be in UTC / Zulu time, precisely to avoid > these issues. If that's too much trouble then yes an epoch time will > probably do. Happily ISO-8601 is always UTC. The problem I'm referring to is the timezone setting on the host system when doing conversions in C. To be fair most languages handle this well and C is C so I'm not sure we need to make a big deal of it. In JSON/XML it's pretty common to use ISO-8601 so that seems like a rational choice. Regards, -- -David david@pgmasters.net
On 2020-Apr-14, Andrew Dunstan wrote: > OK, but I think if we're putting a timestamp string in ISO-8601 format > in the manifest it should be in UTC / Zulu time, precisely to avoid > these issues. If that's too much trouble then yes an epoch time will > probably do. The timestamp is always specified and always UTC (except the code calls it GMT). + /* + * Convert last modification time to a string and append it to the + * manifest. Since it's not clear what time zone to use and since time + * zone definitions can change, possibly causing confusion, use GMT + * always. + */ + appendStringInfoString(&buf, "\"Last-Modified\": \""); + enlargeStringInfo(&buf, 128); + buf.len += pg_strftime(&buf.data[buf.len], 128, "%Y-%m-%d %H:%M:%S %Z", + pg_gmtime(&mtime)); + appendStringInfoString(&buf, "\""); I was merely saying that it's trivial to make this iso-8601 compliant as buf.len += pg_strftime(&buf.data[buf.len], 128, "%Y-%m-%dT%H:%M:%SZ", ie. omit the "GMT" string and replace it with a literal Z, and remove the space and replace it with a T. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
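For reference, the two spellings render like this, with GNU date as a stand-in for pg_strftime/pg_gmtime (the instant is arbitrary; note that %Z under TZ=UTC prints "UTC" where the server code prints "GMT"):

```shell
epoch=1586880540   # 2020-04-14 16:09:00 UTC, an arbitrary instant
# The committed format: space separator, zone-name suffix.
TZ=UTC date -d "@$epoch" '+%Y-%m-%d %H:%M:%S %Z'
# Alvaro's strict ISO-8601 variant: literal T separator, literal Z suffix.
TZ=UTC date -d "@$epoch" '+%Y-%m-%dT%H:%M:%SZ'   # prints 2020-04-14T16:09:00Z
```

The second form is what most JSON tooling expects and parses without a custom format string.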
On 2020-Apr-14, David Steele wrote: > Happily ISO-8601 is always UTC. Uh, it is not -- https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 4/14/20 4:09 PM, Alvaro Herrera wrote: > On 2020-Apr-14, Andrew Dunstan wrote: > >> OK, but I think if we're putting a timestamp string in ISO-8601 format >> in the manifest it should be in UTC / Zulu time, precisely to avoid >> these issues. If that's too much trouble then yes an epoch time will >> probably do. > The timestamp is always specified and always UTC (except the code calls > it GMT). > > + /* > + * Convert last modification time to a string and append it to the > + * manifest. Since it's not clear what time zone to use and since time > + * zone definitions can change, possibly causing confusion, use GMT > + * always. > + */ > + appendStringInfoString(&buf, "\"Last-Modified\": \""); > + enlargeStringInfo(&buf, 128); > + buf.len += pg_strftime(&buf.data[buf.len], 128, "%Y-%m-%d %H:%M:%S %Z", > + pg_gmtime(&mtime)); > + appendStringInfoString(&buf, "\""); > > I was merely saying that it's trivial to make this iso-8601 compliant as > > buf.len += pg_strftime(&buf.data[buf.len], 128, "%Y-%m-%dT%H:%M:%SZ", > > ie. omit the "GMT" string and replace it with a literal Z, and remove > the space and replace it with a T. > +1 cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 4/14/20 4:11 PM, Alvaro Herrera wrote:
> On 2020-Apr-14, David Steele wrote:
>
>> Happily ISO-8601 is always UTC.
>
> Uh, it is not --
> https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators

Whoops, you are correct. I've just never seen non-UTC in the wild yet.

--
-David
david@pgmasters.net
On 2020/04/14 0:15, Robert Haas wrote:
> On Sun, Apr 12, 2020 at 10:09 PM Fujii Masao
> <masao.fujii@oss.nttdata.com> wrote:
>> I found other minor issues.
>
> I think these are all correct fixes. Thanks for the post-commit
> review, and sorry for these mistakes.

Thanks for the review, Michael and Robert. Pushed the patches!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On 2020/04/14 2:40, Robert Haas wrote:
> On Fri, Mar 27, 2020 at 4:32 PM Andres Freund <andres@anarazel.de> wrote:
>> I don't like having a file format that's intended to be used by external
>> tools too that's undocumented except for code that assembles it in a
>> piecemeal fashion. Do you mean in a follow-on patch this release, or
>> later? I don't have a problem with the former.
>
> Here is a patch for that.

While reading the document that you pushed, I thought that it would be better
to define an index term for "backup manifest", so that we can easily reach
this document from the index page. Thoughts? Patch attached.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Attachment
On Tue, Apr 14, 2020 at 11:49 PM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:
> While reading the document that you pushed, I thought that it would be
> better to define an index term for "backup manifest", so that we can
> easily reach this document from the index page. Thoughts? Patch attached.

Fine with me. I tend not to think about the index very much, so I'm
glad you are. :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, 14 Apr 2020 12:56:49 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

> On Mon, Apr 13, 2020 at 5:43 PM Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:
>> Yeah, I guess I'm just saying that it feels brittle to have a file
>> format that's supposed to be good for data exchange and then make it
>> itself depend on representation details such as the order that fields
>> appear in, the letter case, or the format of newlines. Maybe this isn't
>> really of concern, but it seemed strange.
>
> I didn't want to use JSON for this at all, but I got outvoted. When I
> raised this issue, it was suggested that I deal with it in this way,
> so I did. I can't really defend it too far beyond that, although I do
> think that one nice thing about this is that you can verify the
> checksum using shell commands if you want. Just figure out the number
> of lines in the file, minus one, and do head -n$LINES backup_manifest
> | shasum -a256 and boom. If there were some whitespace-skipping thing,
> figuring out how to reproduce the checksum calculation would be hard.

FWIW, the shell commands (md5sum and sha*sum) read checksums from a separate
file with a very simple format: one file per line, in the form
"CHECKSUM FILEPATH". Thanks to JSON, it is fairly easy to extract the
checksums and filenames from the current manifest file format and check them
all with one command:

  jq -r '.Files|.[]|.Checksum+" "+.Path' backup_manifest > checksums.sha256
  sha256sum --check --quiet checksums.sha256

You can even pipe these commands together to avoid the intermediary file.

But for backup_manifest itself, it's kind of a shame that we have to check the
checksum against a transformed version of the file. Did you consider creating,
e.g., a separate backup_manifest.sha256 file?

I'm very sorry in advance if this has been discussed previously.

Regards,
On Wed, Apr 15, 2020 at 11:23 AM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:
> But for backup_manifest itself, it's kind of a shame that we have to check
> the checksum against a transformed version of the file. Did you consider
> creating, e.g., a separate backup_manifest.sha256 file?
>
> I'm very sorry in advance if this has been discussed previously.

It was briefly mentioned in the original (lengthy) discussion, but I
think there was one vote in favor and two votes against or something
like that, so it didn't go anywhere. I didn't realize that there were
handy command-line tools for manipulating JSON like that, or I probably
would have considered that idea more strongly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, 15 Apr 2020 12:03:28 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

> On Wed, Apr 15, 2020 at 11:23 AM Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote:
>> But for backup_manifest itself, it's kind of a shame that we have to check
>> the checksum against a transformed version of the file. Did you consider
>> creating, e.g., a separate backup_manifest.sha256 file?
>>
>> I'm very sorry in advance if this has been discussed previously.
>
> It was briefly mentioned in the original (lengthy) discussion, but I
> think there was one vote in favor and two votes against or something
> like that, so it didn't go anywhere.

Argh.

> I didn't realize that there were handy command-line tools for manipulating
> JSON like that, or I probably would have considered that idea more strongly.

That was indeed a lengthy thread with various details discussed. I'm sorry I
didn't catch the ball back then.

Regards,
On 4/15/20 6:43 PM, Jehan-Guillaume de Rorthais wrote:
> On Wed, 15 Apr 2020 12:03:28 -0400
> Robert Haas <robertmhaas@gmail.com> wrote:
>> It was briefly mentioned in the original (lengthy) discussion, but I
>> think there was one vote in favor and two votes against or something
>> like that, so it didn't go anywhere.
>
> Argh.
>
>> I didn't realize that there were handy command-line tools for manipulating
>> JSON like that, or I probably would have considered that idea more strongly.
>
> That was indeed a lengthy thread with various details discussed. I'm sorry I
> didn't catch the ball back then.

One of the reasons to use JSON was to be able to use command-line tools
like jq for tasks like this (I use it myself). But I think only the
pg_verifybackup tool should be used to verify the internal checksum.

Two thoughts:

1) You can always generate an external checksum when you generate the
backup if you want to do your own verification without running
pg_verifybackup.

2) Perhaps it would be good if the pg_verifybackup command had a
--verify-manifest-checksum option (or something) to check that the
manifest file looks valid without checking any files. That's not going
to happen for PG13, but it's possible for PG14.

Regards,

--
-David
david@pgmasters.net
On Wed, 15 Apr 2020 18:54:14 -0400
David Steele <david@pgmasters.net> wrote:

> One of the reasons to use JSON was to be able to use command-line tools
> like jq for tasks like this (I use it myself).

That's perfectly fine. I was only wondering about having the manifest
checksum outside of the manifest itself.

> But I think only the pg_verifybackup tool should be used to verify the
> internal checksum.

True.

> Two thoughts:
>
> 1) You can always generate an external checksum when you generate the
> backup if you want to do your own verification without running
> pg_verifybackup.

Sure, but by the time I want to produce an external checksum, the manifest
would have traveled around quite a bit, with various dangers of corruption
along the way. Checksumming it from the original process that produced it
sounds safer.

> 2) Perhaps it would be good if the pg_verifybackup command had a
> --verify-manifest-checksum option (or something) to check that the
> manifest file looks valid without checking any files. That's not going
> to happen for PG13, but it's possible for PG14.

Sure. I just liked the idea of being able to check the manifest using an
external command line implementing the same standardized checksum algorithm,
without editing the manifest first. But I understand it's too late to discuss
this now.

Regards,
On 2020/04/15 22:24, Robert Haas wrote:
> On Tue, Apr 14, 2020 at 11:49 PM Fujii Masao
> <masao.fujii@oss.nttdata.com> wrote:
>> While reading the document that you pushed, I thought that it would be
>> better to define an index term for "backup manifest", so that we can
>> easily reach this document from the index page. Thoughts? Patch attached.
>
> Fine with me. I tend not to think about the index very much, so I'm
> glad you are. :-)

Pushed! Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On 2020/04/15 11:18, Fujii Masao wrote:
> On 2020/04/14 0:15, Robert Haas wrote:
>> On Sun, Apr 12, 2020 at 10:09 PM Fujii Masao
>> <masao.fujii@oss.nttdata.com> wrote:
>>> I found other minor issues.
>>
>> I think these are all correct fixes. Thanks for the post-commit
>> review, and sorry for these mistakes.
>
> Thanks for the review, Michael and Robert. Pushed the patches!

I found three minor issues in pg_verifybackup.

	+ {"print-parse-wal", no_argument, NULL, 'p'},

This option is unused, so this line should be removed.

	+ printf(_("  -m, --manifest=PATH    use specified path for manifest\n"));

Typo: --manifest should be --manifest-path.

Also, pg_verifybackup accepts a --quiet option, but its usage() doesn't
print any message for --quiet.

Attached is a patch that fixes those issues.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Attachment
On Wed, Apr 22, 2020 at 12:21 PM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:
> I found three minor issues in pg_verifybackup.
>
> 	+ {"print-parse-wal", no_argument, NULL, 'p'},
>
> This option is unused, so this line should be removed.
>
> 	+ printf(_("  -m, --manifest=PATH    use specified path for manifest\n"));
>
> Typo: --manifest should be --manifest-path.
>
> Also, pg_verifybackup accepts a --quiet option, but its usage() doesn't
> print any message for --quiet.
>
> Attached is a patch that fixes those issues.

Thanks; LGTM.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2020/04/23 1:28, Robert Haas wrote:
> On Wed, Apr 22, 2020 at 12:21 PM Fujii Masao
> <masao.fujii@oss.nttdata.com> wrote:
>> I found three minor issues in pg_verifybackup.
>> [...]
>> Attached is a patch that fixes those issues.
>
> Thanks; LGTM.

Thanks for the review! Pushed.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Sun, Apr 5, 2020 at 3:31 PM Andres Freund <andres@anarazel.de> wrote:
> The warnings don't seem too unreasonable. The compiler can't see that
> the error_cb inside json_manifest_parse_failure() is not expected to
> return. Probably worth adding a wrapper around the calls to
> context->error_cb and mark that as noreturn.

Eh, how? The callback is declared as:

typedef void (*json_manifest_error_callback)(JsonManifestParseContext *,
                                             char *fmt, ...) pg_attribute_printf(2, 3);

I don't know of a way to create a wrapper around that, because of the
variable argument list. We could change the callback to take va_list, I
guess. Does it work for you to just add pg_attribute_noreturn() to this
typedef, as in the attached?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Hi,

On 2020-04-23 08:57:39 -0400, Robert Haas wrote:
> On Sun, Apr 5, 2020 at 3:31 PM Andres Freund <andres@anarazel.de> wrote:
>> The warnings don't seem too unreasonable. The compiler can't see that
>> the error_cb inside json_manifest_parse_failure() is not expected to
>> return. Probably worth adding a wrapper around the calls to
>> context->error_cb and mark that as noreturn.
>
> Eh, how? The callback is declared as:
>
> typedef void (*json_manifest_error_callback)(JsonManifestParseContext *,
>                                              char *fmt, ...) pg_attribute_printf(2, 3);
>
> I don't know of a way to create a wrapper around that, because of the
> variable argument list.

Didn't think that far...

> We could change the callback to take va_list, I guess.

I'd argue that that'd be a good idea anyway, otherwise there's no way to
wrap the invocation anywhere in the code. But that's an independent
consideration, as:

> Does it work for you to just add pg_attribute_noreturn() to this
> typedef, as in the attached?

does fix the problem for me, cool.

Do you not see a warning when compiling with optimizations enabled?

Greetings,

Andres Freund
On Thu, Apr 23, 2020 at 5:16 PM Andres Freund <andres@anarazel.de> wrote:
> Do you not see a warning when compiling with optimizations enabled?

No, I don't. I tried it with -O{0,1,2,3} and I always use -Wall -Werror.
No warnings.

[rhaas pgsql]$ clang -v
clang version 5.0.2 (tags/RELEASE_502/final)
Target: x86_64-apple-darwin19.4.0
Thread model: posix
InstalledDir: /opt/local/libexec/llvm-5.0/bin

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2020/04/15 5:33, Andrew Dunstan wrote:
> On 4/14/20 4:09 PM, Alvaro Herrera wrote:
>> On 2020-Apr-14, Andrew Dunstan wrote:
>>> OK, but I think if we're putting a timestamp string in ISO-8601 format
>>> in the manifest it should be in UTC / Zulu time, precisely to avoid
>>> these issues. If that's too much trouble then yes an epoch time will
>>> probably do.
>>
>> The timestamp is always specified and always UTC (except the code calls
>> it GMT).
>>
>> I was merely saying that it's trivial to make this ISO-8601 compliant as
>>
>> 	buf.len += pg_strftime(&buf.data[buf.len], 128, "%Y-%m-%dT%H:%M:%SZ",
>>
>> ie. omit the "GMT" string and replace it with a literal Z, and remove
>> the space and replace it with a T.

I have one question related to this: why don't we use log_timezone, like
backup_label does? log_timezone is used for the "START TIME" field in
backup_label. Sorry if this was already discussed.

	/* Use the log timezone here, not the session timezone */
	stamp_time = (pg_time_t) time(NULL);
	pg_strftime(strfbuf, sizeof(strfbuf),
				"%Y-%m-%d %H:%M:%S %Z",
				pg_localtime(&stamp_time, log_timezone));

OTOH, *if* we want to use the same timezone for backup-related files -- because
a backup can be used in different environments where the timezone setting may
differ, or for other reasons -- shouldn't backup_label also use GMT or
something, for the sake of consistency?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Fri, May 15, 2020 at 2:10 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> I have one question related to this: why don't we use log_timezone, like
> backup_label does? log_timezone is used for the "START TIME" field in
> backup_label. Sorry if this was already discussed.
>
> 	/* Use the log timezone here, not the session timezone */
> 	stamp_time = (pg_time_t) time(NULL);
> 	pg_strftime(strfbuf, sizeof(strfbuf),
> 				"%Y-%m-%d %H:%M:%S %Z",
> 				pg_localtime(&stamp_time, log_timezone));
>
> OTOH, *if* we want to use the same timezone for backup-related files -- because
> a backup can be used in different environments where the timezone setting may
> differ, or for other reasons -- shouldn't backup_label also use GMT or
> something, for the sake of consistency?

It's a good question. My inclination was to think that GMT would be
the clearest thing, but I also didn't realize that the result would
thus be inconsistent with backup_label. Not sure what's best here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> It's a good question. My inclination was to think that GMT would be
> the clearest thing, but I also didn't realize that the result would
> thus be inconsistent with backup_label. Not sure what's best here.

I vote for following the backup_label precedent; that's stood for quite
some years now.

			regards, tom lane
On 5/15/20 9:34 AM, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> It's a good question. My inclination was to think that GMT would be
>> the clearest thing, but I also didn't realize that the result would
>> thus be inconsistent with backup_label. Not sure what's best here.
>
> I vote for following the backup_label precedent; that's stood for quite
> some years now.

I'd rather keep it GMT. The timestamps in the backup label are purely
informational, but the timestamps in the manifest are useful, e.g. to set
the mtime on a restore to the original value. Forcing the user to do
timezone conversions is prone to error. Some languages, like C, simply
aren't good at it.

Of course, my actual preference is to use epoch time, which is easy to
work with and eliminates the possibility of conversion errors. It is
also compact.

Regards,

--
-David
david@pgmasters.net
David Steele <david@pgmasters.net> writes:
> On 5/15/20 9:34 AM, Tom Lane wrote:
>> I vote for following the backup_label precedent; that's stood for quite
>> some years now.
>
> Of course, my actual preference is to use epoch time, which is easy to
> work with and eliminates the possibility of conversion errors. It is
> also compact.

Well, if we did that then it'd be sufficiently different from the backup
label as to remove any risk of confusion. But "easy to work with" is in
the eye of the beholder; do we really want a format that's basically
unreadable to the naked eye?

			regards, tom lane
On 5/15/20 10:17 AM, Tom Lane wrote:
> David Steele <david@pgmasters.net> writes:
>> Of course, my actual preference is to use epoch time, which is easy to
>> work with and eliminates the possibility of conversion errors. It is
>> also compact.
>
> Well, if we did that then it'd be sufficiently different from the backup
> label as to remove any risk of confusion. But "easy to work with" is in
> the eye of the beholder; do we really want a format that's basically
> unreadable to the naked eye?

Well, I lost this argument before, so it seems I'm in the minority on
easy-to-use. We use epoch time in the pgBackRest manifests, which has been
easy to deal with in both C and Perl, so experience tells me it really is
easy, at least for programs.

The manifest (to me, at least) is generally intended to be
machine-processed. For instance, it contains checksums which are not all
that useful unless they are checked programmatically -- they can't just
be eye-balled.

Regards,

--
-David
david@pgmasters.net