Re: pg_verifybackup: TAR format backup verification - Mailing list pgsql-hackers

From Robert Haas
Subject Re: pg_verifybackup: TAR format backup verification
Date
Msg-id CA+TgmoZTJzUyzDkdo6K7TTM2ePP0EMXTwxGhJN1Ymx3qNHvevQ@mail.gmail.com
Whole thread Raw
In response to Re: pg_verifybackup: TAR format backup verification  (Amul Sul <sulamul@gmail.com>)
List pgsql-hackers
On Mon, Aug 12, 2024 at 5:13 AM Amul Sul <sulamul@gmail.com> wrote:
> I tried this in the attached version and made a few additional changes
> based on Sravan's off-list comments regarding function names and
> descriptions.
>
> Now, verification happens in two passes. The first pass simply
> verifies the file names, determines their compression types, and
> returns a list of valid tar files whose contents need to be verified
> in the second pass. The second pass is called at the end of
> verify_backup_directory() after all files in that directory have been
> scanned. I named the functions for pass 1 and pass 2 as
> verify_tar_file_name() and verify_tar_file_contents(), respectively.
> The rest of the code flow is similar as in the previous version.
>
> In the attached patch set, I abandoned the changes, touching the
> progress reporting code of plain backups by dropping the previous 0009
> patch. The new 0009 patch adds missing APIs to simple_list.c to
> destroy SimplePtrList. The rest of the patch numbers remain unchanged.

I think you've entangled the code paths here for plain-format backup
and tar-format backup in a way that is not very nice. I suggest
refactoring things so that verify_backup_directory() is only called
for plain-format backups, and you have some completely separate
function (perhaps verify_tar_backup) that is called for tar-format
backups. I don't think verify_backup_file() should be shared between
tar-format and plain-format backups either. Let that be just for
plain-format backups, and have separate logic for tar-format backups.
Right now you've got "if" statements in various places trying to get
all the cases correct, but I think you've missed some (and there's
also the issue of updating all the comments).

For instance, verify_backup_file() recurses into subdirectories, but
that behavior is inappropriate for a tar format backup, where
subdirectories should instead be treated like stray files: complain
that they exist. pg_verify_backup() does this:

    /* If it's a directory, just recurse. */
    if (S_ISDIR(sb.st_mode))
    {
        verify_backup_directory(context, relpath, fullpath);
        return;
    }

    /* If it's not a directory, it should be a plain file. */
    if (!S_ISREG(sb.st_mode))
    {
        report_backup_error(context,
                            "\"%s\" is not a file or directory",
                            relpath);
        return;
    }

For a plain format backup, this whole thing should be just:

    /* In a tar format backup, we expect only plain files. */
    if (!S_ISREG(sb.st_mode))
    {
        report_backup_error(context,
                            "\"%s\" is not a plain file",
                            relpath);
        return;
    }

Also, immediately above, you do
simple_string_list_append(&context->ignore_list, relpath), but that is
pointless in the tar-file case, and arguably wrong, if -i is going to
ignore both pathnames in the base directory and also pathnames inside
the tar files, because we could add something to the ignore list here
-- accomplishing nothing useful -- and then that ignore-list entry
could cause us to disregard a stray file with the same name present
inside one of the tar files -- which is silly. Note that the actual
point of this logic is to make sure that if we can't stat() a certain
directory, we don't go off and issue a million complaints about all
the files in that directory being missing. But this doesn't accomplish
that goal for a tar-format backup. For a tar-format backup, you'd want
to figure out which files in the manifest we don't expect to see based
on this file being inaccessible, and then arrange to suppress future
complaints about all of those files. But you can't implement that
here, because you haven't parsed the file name yet. That happens
later, in verify_tar_file_name().

You could add a whole bunch more if statements here and try to work
around these issues, but it seems pretty obviously a dead end to me.
Almost the entire function is going to end up being inside of an
if-statement. Essentially the only thing in verify_backup_file() that
should actually be the same in the plain and tar-format cases is that
you should call stat() either way and check whether it throws an
error. But that is not enough justification for trying to share the
whole function.

I find the logic in verify_tar_file_name() to be somewhat tortured as
well. The strstr() calls will match those strings anywhere in the
filename, not just at the end. But also, even if you fixed that, why
work backward from the end of the filename toward the beginning? It
would seem like it would more sense to (1) first check if the string
starts with "base" and set suffix equal to pathname+4, (2) if not,
strtol(pathname, &suffix, 10) and complain if we didn't eat at least
one character or got 0 or something too big to be an OID, (3) check
whether suffix is .tar, .tar.gz, etc.

In verify_member_checksum(), you set mystreamer->verify_checksum =
false. That would be correct if there could only ever be one call to
verify_member_checksum() per member file, but there is no such rule.
There can be (and, I think, normally will be) more than one
ASTREAMER_MEMBER_CONTENTS chunk. I'm a little confused as to how this
code passes any kind of testing.

Also in verify_member_checksum(), the mystreamer->received_bytes <
m->size seems strange. I don't think this is the right way to do
something when you reach the end of an archive member. The right way
to do that is to do it when the ASTREAMER_MEMBER_TRAILER chunk shows
up.

In verify_member_control_data(), you use astreamer_buffer_untIl(). But
that's the same buffer that is being used by verify_member_checksum(),
so I don't really understand how you expect this to work. If this code
path were ever taken, verify_member_checksum() would see the same data
more than once.

The call to pg_log_debug() in this function also seems quite random.
In a plain-format backup, we'd actually be doing something different
for pg_controldata vs. other files, namely reading it during the
initial directory scan. But here we're reading the file in exactly the
same sense as we're reading every other file, neither more nor less,
so why mention this file and not all of the others? And why at this
exact spot in the code?

I suspect that the report_fatal_error("%s: could not read control
file: read %d of %zu", ...) call is unreachable. I agree that you need
to handle the case where the control file inside the tar file is not
the expected length, and in fact I think you should probably write a
TAP test for that exact scenario to make sure it works. I bet this
doesn't. Even if it did, the error message makes no sense in context.
In the plain-format backup, this error would come from code reading
the actual bytes off the disk -- i.e. the complaint about not being
able to read the control file would come from the read() system call.
Here it doesn't.

--
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Nathan Bossart
Date:
Subject: Re: Restart pg_usleep when interrupted
Next
From: Jeff Davis
Date:
Subject: Re: tiny step toward threading: reduce dependence on setlocale()