Re: pgsql: Add TAP tests for pg_verify_checksums - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: pgsql: Add TAP tests for pg_verify_checksums
Date
Msg-id CAOuzzgqH=J2QPxhQKAxh+Fbn6Uw-_x3rPkJbcFFbUh50qm0v8A@mail.gmail.com
Whole thread Raw
In response to Re: pgsql: Add TAP tests for pg_verify_checksums  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
Greetings,

On Sat, Oct 20, 2018 at 00:43 Michael Paquier <michael@paquier.xyz> wrote:
On Sat, Oct 20, 2018 at 12:25:19AM -0400, Stephen Frost wrote:
> On Fri, Oct 19, 2018 at 23:40 Michael Paquier <michael@paquier.xyz> wrote:
>> - I agree with not doing a simple revert to not turn the buildfarm red
>> again.  This is annoying for animal maintainers.  Andrew has done a very
>> nice work in disabling manually those tests temporarily.
>
> This is a red herring, and always was, so I’m rather unimpressed at how it
> keeps coming up- no, I’m not advocating that we should just make the build
> farm red and just leave it that way.  Yes, we should fix this case, and fix
> pg_basebackup, and maybe even try to add some regression tests which test
> this exact same case in pg_basebackup, but making the build farm green is
> *not* the only thing we should care about.

Well, the root of the problem was that pg_verify_checksums has been
committed without any tests on its own.  If we had those tests from the
start, then we would not be having this discussion post-release time,
still trying to figure out if whitelisting or blacklisting is
appropriate.

The validation checksums in base backups has been added in 4eb77d50, a
couple of days before pg_verify_checksums has been introduced.  This has
added corruption-related tests in src/bin/pg_basebackup, which is a good
thing.  However the feature has been designed so as checksum mismatches
are ignored after 5 failures, which actually *masked* the fact that
EXEC_BACKEND files like CONFIG_EXEC_PARAMS should have been skipped
instead of getting checksum failures.  And that's a bad thing.  So this
gives in my opinion a good point about using a whitelist.

That counter of checksum failures should have been getting reset between files..  that’s certainly what I had understood was intended.  The idea of the counter is to not flood the log with errors when a file is discovered that’s full of checksum failures (as can happen if large chunks of the file got replaced with something else, for example), but it should be reporting on each file that does have failures in it.

I don’t see how having a whitelist changes that or would have impacted that logic to make it correct initially or not.

I’m also trying to understand how this checksum logging limit is getting hit in the tests but they’re passing..?  If this was an intended failure check then surely there’s a test done that’s intended to be successful and it should have complained about this file too, or perhaps that’s what’s missing. I’m happy to look into this later this weekend.  Certainly seems like something here isn’t really making sense tho.

Thanks!

Stephen

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: pgsql: Add TAP tests for pg_verify_checksums
Next
From: Michael Paquier
Date:
Subject: Re: pgsql: Add TAP tests for pg_verify_checksums