Re: backup manifests - Mailing list pgsql-hackers
From: David Steele
Subject: Re: backup manifests
Msg-id: fd84612d-8bf4-0db1-14cf-02aa4f9ca396@pgmasters.net
In response to: Re: backup manifests (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On 9/20/19 2:55 PM, Robert Haas wrote:

> On Fri, Sep 20, 2019 at 11:09 AM David Steele <david@pgmasters.net> wrote:
>>
>> It sucks to make that a prereq for this project but the longer we kick
>> that can down the road...
>
> There are no doubt many patches that would benefit from having more
> backend infrastructure exposed in frontend contexts, and I think we're
> slowly moving in that direction, but I generally do not believe in
> burdening feature patches with major infrastructure improvements.

The hardest part about technical debt is knowing when to incur it. It is
never a cut-and-dried choice.

>> This talk was good fun. The largest number of tables we've seen is a
>> few hundred thousand, but that still adds up to more than a million
>> files to backup.
>
> A quick survey of some of my colleagues turned up a few examples of
> people with 2-4 million files to backup, so similar kind of ballpark.
> Probably not big enough for the manifest to hit the 1GB mark, but
> getting close.

I have so many doubts about clusters with this many tables, but we do
support it, so...

>>> I hear you saying that this is going to end up being just as complex
>>> in the end, but I don't think I believe it. It sounds to me like the
>>> difference between spending a couple of hours figuring this out and
>>> spending a couple of months trying to figure it out and maybe not
>>> actually getting anywhere.
>>
>> Maybe the initial implementation will be easier but I am confident we'll
>> pay for it down the road. Also, don't we want users to be able to read
>> this file? Do we really want them to need to cook up a custom parser in
>> Perl, Go, Python, etc.?
>
> Well, I haven't heard anybody complain that they can't read a
> backup_label file because it's too hard to cook up a parser. And I
> think the reason is pretty clear: such files are not hard to parse.
> Similarly for a pg_hba.conf file. This case is a little more
> complicated than those, but AFAICS, not enormously so.
> Actually, it seems like a combination of those two cases: it has some
> fixed metadata fields that can be represented with one line per field,
> like a backup_label, and then a bunch of entries for files that are
> somewhat like entries in a pg_hba.conf file, in that they can be
> represented by a line per record with a certain number of fields on
> each line.

Yeah, they are not hard to parse, but *everyone* has to cook up code
for it. A bit of a bummer, that.

> I attach here a couple of patches. The first one does some
> refactoring of relevant code in pg_basebackup, and the second one adds
> checksum manifests using a format that I pulled out of my ear. It
> probably needs some adjustment but I don't think it's crazy. Each
> file gets a line that looks like this:
>
> File $FILENAME $FILESIZE $FILEMTIME $FILECHECKSUM

We also include page checksum validation failures in the file record.
Not critical for the first pass, perhaps, but something to keep in mind.

> Right now, the file checksums are computed using SHA-256 but it could
> be changed to anything else for which we've got code. On my system,
> shasum -a256 $FILE produces the same answer that shows up here. At
> the bottom of the manifest there's a checksum of the manifest itself,
> which looks like this:
>
> Manifest-Checksum 385fe156a8c6306db40937d59f46027cc079350ecf5221027d71367675c5f781
>
> That's a SHA-256 checksum of the file contents excluding the final
> line. It can be verified by feeding all the file contents except the
> last line to shasum -a256. I can't help but observe that if the file
> were defined to be a JSONB blob, it's not very clear how you would
> include a checksum of the blob contents in the blob itself, but with a
> format based on a bunch of lines of data, it's super-easy to generate
> and super-easy to write tools that verify it.
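[Editor's note: a minimal Python sketch of a reader for the prototype format Robert describes — one "File" line per record, terminated by a "Manifest-Checksum" line covering everything before it. It assumes filenames and mtimes contain no whitespace, which the space-separated one-line-per-file layout implies but the actual patch may handle differently.]

```python
import hashlib

def parse_manifest(text):
    """Parse the prototype manifest format and verify its trailing checksum.

    Assumes each file record is 'File $FILENAME $FILESIZE $FILEMTIME
    $FILECHECKSUM' with whitespace-free fields, and that the final line is
    'Manifest-Checksum <sha256 of all preceding content>'.
    """
    lines = text.splitlines(keepends=True)
    last = lines[-1].split()
    if not last or last[0] != "Manifest-Checksum":
        raise ValueError("missing Manifest-Checksum line")
    expected = last[1]
    # The checksum covers the file contents excluding the final line,
    # i.e. the same bytes you would feed to shasum -a256.
    body = "".join(lines[:-1])
    if hashlib.sha256(body.encode()).hexdigest() != expected:
        raise ValueError("manifest checksum mismatch")
    files = []
    for line in lines[:-1]:
        fields = line.split()
        if fields and fields[0] == "File":
            name, size, mtime, checksum = fields[1], int(fields[2]), fields[3], fields[4]
            files.append((name, size, mtime, checksum))
    return files
```

The point of the trailing checksum placement is visible here: the verifier never has to parse anything to check integrity, it just hashes everything before the last line.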
You can do this in JSON pretty easily by handling the terminating
brace/bracket:

{
<some json contents>*,
"checksum":<sha256>
}

But of course a linefeed-delimited file is even easier.

> This is just a prototype so I haven't written a verification tool, and
> there's a bunch of testing and documentation and so forth that would
> need to be done aside from whatever we've got to hammer out in terms
> of design issues and file formats. But I think it's cool, and perhaps
> some discussion of how it could be evolved will get us closer to a
> resolution everybody can at least live with.

I had a quick look and it seems pretty reasonable. I'll need to
generate a manifest to see if I can spot any obvious gotchas.

--
-David
david@pgmasters.net
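[Editor's note: a sketch of the brace-splicing trick David outlines above — serialize the object, drop the closing brace, and append a checksum of everything that precedes the checksum member. The marker handling is an illustrative assumption; it requires a non-empty object and that the marker string not appear inside the payload.]

```python
import hashlib
import json

MARKER = ', "checksum": "'

def dump_with_checksum(obj):
    # Serialize the (non-empty) object, strip the closing '}', and splice
    # in a checksum of everything before the checksum member.
    body = json.dumps(obj)[:-1]
    digest = hashlib.sha256(body.encode()).hexdigest()
    return body + MARKER + digest + '"}'

def verify(text):
    # Re-derive the exact hashed prefix by locating the checksum member;
    # this assumes MARKER does not also occur inside the payload itself.
    body = text[: text.rindex(MARKER)]
    expected = json.loads(text)["checksum"]
    return hashlib.sha256(body.encode()).hexdigest() == expected
```

The catch, as the thread notes, is that verification depends on byte-exact serialization up to the checksum member, whereas the line-oriented format only needs "hash everything but the last line".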