Re: backup manifests - Mailing list pgsql-hackers

From Robert Haas
Subject Re: backup manifests
Date
Msg-id CA+TgmoZn6K4KiBvcAfYOxUwv-t9oot0ZULdPbOG5k8prpEVVZA@mail.gmail.com
Whole thread Raw
In response to Re: backup manifests  (Suraj Kharage <suraj.kharage@enterprisedb.com>)
Responses Re: backup manifests
List pgsql-hackers
On Tue, Dec 10, 2019 at 6:40 AM Suraj Kharage
<suraj.kharage@enterprisedb.com> wrote:
> Please find attached patch for backup validator implementation (0004 patch). This patch is based
> on Rushabh's latest patch for backup manifest.
>
> There are some functions required at client side as well, so I have moved those functions
> and some data structure at common place so that they can be accessible for both. (0003 patch).
>
> My colleague Rajkumar Raghuwanshi has prepared the WIP patch (0005) for tap test cases which
> is also attached. As of now, test cases related to the tablespace and tar backup  format are missing,
> will continue work on same and submit the complete patch.
>
> With this mail, I have attached the complete patch stack for backup manifest and backup
> validate implementation.
>
> Please let me know your thoughts on the same.

Well, for the second time on this thread, please don't take a bunch of
somebody else's code and post it in a patch that doesn't attribute
that person as one of the authors. For the second time on this thread,
the person is me, but don't borrow *anyone's* code without proper
attribution. It's really important!

On a related note, it's a very good idea to use git format-patch and
git rebase -i to maintain patch stacks like this. Rushabh seems to
have done that, but the files you're posting look like raw 'git diff'
output. Notice that this gives him a way to include authorship
information and a tentative commit message in each patch, but you
don't have any of that.

Also on a related note, part of the process of adapting existing code
to a new purpose is adapting the comments. You haven't done that:

+ * Search a result-set hash table for a row matching a given filename.
...
+ * Insert a row into a result-set hash table, provided no such row is already
...
+ * Most of the values
+ * that we're hashing are short integers formatted as text, so there
+ * shouldn't be much room for pathological input.

I think that what we should actually do here is try to use simplehash.
Right now, it won't work for frontend code, but I posted some patches
to try to address that issue:

https://www.postgresql.org/message-id/CA%2BTgmob8oyh02NrZW%3DxCScB%2B5GyJ-jVowE3%2BTWTUmPF%3DFsGWTA%40mail.gmail.com

That would have a few advantages. One, we wouldn't need to know the
number of elements in advance, because simplehash can grow
dynamically. Two, we could use the iteration interfaces to walk the
hash table.  Your solution to that is pgrhash_seq_search, but that's
actually not well-designed, because it's not a generic iterator
function but something that knows specifically about the 'touch' flag.
I incidentally suggest renaming 'touch' to 'matched;' 'touch' is not
bad, but I think 'matched' will be a little more recognizable.

Please run pgindent. If needed, first add locally defined types to
typedefs.list, so that things indent properly.

It's not a crazy idea to try to share some data structures and code
between the frontend and the backend here, but I think
src/common/backup.c and src/include/common/backup.h is a far too
generic name given what the code is actually doing. It's mostly about
checksums, not backup, and I think it should be named accordingly. I
suggest removing "manifestinfo" and renaming the rest to just talk
about checksums rather than manifests. That would make it logical to
reuse this for any other future code that needs a configurable
checksum type. Also, how about adding a function like:

extern bool parse_checksum_algorithm(char *name, ChecksumAlgorithm *algo);

...which would return true and set *algo if name is recognized, and
return false otherwise. That code could be used on both the client and
server sides of this patch, and by any future patches that want to
return this scaffolding.

The file header for backup.h has the wrong filename (string.h). The
header format looks somewhat atypical compared to what we normally do,
too.

It's arguable, but I tend to think that it would be better to
hex-encode the CRC rather than printing it as an integer.  Maybe
hex_encode() is another thing that could be moved into the new
src/common file.

As I said before about Rushabh's patch set, it's very confusing that
we have so many patches here stacked up. Like, you have 0002 moving
stuff, and then 0003 moving it again. That's super-confusing. Please
try to structure the patch set so as to make it as easy to review as
possible.

Regarding the test case patch, error checks are important! Don't do
things like this:

+open my $modify_file_sha256, '>>', "$tempdir/backup_verify/postgresql.conf";
+print $modify_file_sha256 "port = 5555\n";
+close $modify_file_sha256;

If the open fails, then it and the print and the close are going to
silently do nothing. That's bad. I don't know exactly what the
customary error-checking is for things like this in TAP tests, but I
hope it's not like this, because this has a pretty fair chance of
looking like it's testing something that it isn't. Let's figure out
what the best practice in this area is and adhere to it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: log bind parameter values on error
Next
From: Alvaro Herrera
Date:
Subject: Re: log bind parameter values on error