Re: pgsql: Add TAP tests for pg_verify_checksums - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: pgsql: Add TAP tests for pg_verify_checksums
Date
Msg-id CAOuzzgpAk3C6uAdyP6MQdamVRK5u05FkqRRmZOk0xqAs9DTqKA@mail.gmail.com
Whole thread Raw
In response to Re: pgsql: Add TAP tests for pg_verify_checksums  (Michael Paquier <michael@paquier.xyz>)
Responses Re: pgsql: Add TAP tests for pg_verify_checksums  (Michael Paquier <michael@paquier.xyz>)
Re: pgsql: Add TAP tests for pg_verify_checksums  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Greetings,

On Fri, Oct 19, 2018 at 23:40 Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Oct 19, 2018 at 06:43:18PM -0400, Tom Lane wrote:
> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
>> I don't think just reverting it is really acceptable.
>
> +several.  I do not mind somebody writing and installing a better fix.
> I do object to turning the buildfarm red again.

I did not expect this thread to turn into a war zone.  Anyway, there are
a couple of things I agree with on this thread:
- I agree with Andres point here:
https://postgr.es/m/20181019171747.4uithw2sjkt6msne@alap3.anarazel.de
A blacklist is fundamentally more difficult to maintain as there are
way more things added in a data folder which do not have data checksums
than things which have checksums.  So using a blacklist approach looks
unmaintainable in the long term.  Future patches related to enabling
online checksum verification make me worry if we keep the code like
that.  I can also easily imagine that anybody willing to use the
pluggable storage API would like to put new files in tablespace-related
data folders, relying on "base/" being the default system tablespace

If we are going to decide that we only deal with files in our directories matching these whitelisted patterns, then shouldn’t we apply similar logic in things like DROP DATABASE and any other cases where we perform actions in a recursive manner across our database and table space directories?

Should we really be removing arbitrary files that we know nothing about, after all?

What about pg_basebackup?  Shall we update it to only stream through files matching these patterns as those are the only files we consider ourselves to be aware of?

I don’t buy off on any argument that presents pluggable storage as not including some way for us to track and be aware of what files are associated with that pluggable storage mechanism and which of those files have checksums and how to verify them if they do. In other words, I sure hope we don’t accept an approach like cstore *fdw* uses for pluggable storage where the core system has no idea whatsoever about what these random files dropped into our tablespace directories are. 

- I agree with Stephen's point that we should decide if a file has
checksums or not in a single place, and that we should use the same
logic for base backups and pg_verify_checksums.

Despite it being a lot of the discussion, I don’t think there was ever disagreement on this point.

- I agree with not doing a simple revert to not turn the buildfarm red
again.  This is annoying for animal maintainers.  Andrew has done a very
nice work in disabling manually those tests temporarily.

This is a red herring, and always was, so I’m rather unimpressed at how it keeps coming up- no, I’m not advocating that we should just make the build farm red and just leave it that way.  Yes, we should fix this case, and fix pg_basebackup, and maybe even try to add some regression tests which test this exact same case in pg_basebackup, but making the build farm green is *not* the only thing we should care about.

- The base backup logic deciding if a file has checksums looks broken to
me: it misses files generated by EXEC_BACKEND, and any instance of
Postgres using an extension with custom files and data checksums has its
backups broken.  cstore_fdw has been mentioned above, and I recall that
Heroku for example enables data checksums.  If you combine both, it
basically means that such an instance cannot take base backups anymore
while it was able to do so with pre-10 with default options.  That's not
cool.

This is incorrect, pg_basebackup will still back up and keep files which fail checksum checks- logic which David Steele and I pushed for when the checksumming logic was added to pg_basebackup, as I recall, but it’ll complain and warn about these files ...

As it *should*, even if we weren’t doing checksum checks, because these are random files that have been dropped into a PG directory that PG doesn’t know anything about, which aren’t handled through WAL and therefore there’s no way to know if they’ll be at all valid when they’re copied, or that backing them up is at all useful.

Backups are absolutely important and we wouldn’t want a backup to be aborted or failed due to checksum failures, in general, but the user should be made aware of these failures.  This approach of only taking responsibility for the files we know of through these patterns could also imply that we shouldn’t back up in pg_basebackup files that don’t match- but that strikes me as another similarly risky approach as any mistake in this whitelist of known files could then result in an inconsistent backup that’s missing things that should have been included.

As for which approach is easier to maintain, I don’t see one as being meaningfully much easier to maintain than the other, code wise, while having these patterns looks to me as having a lot more risk, with the only rather nebulous gain being that we don’t complain about extensions dropping random files in places they shouldn’t have been- and which, if they had actually been implemented in a way we’d be accepting of (I.e. pluggable storage), we would have been tracking and handling appropriately.

So what I think we ought to do is the following:
- Start a new thread, this one about TAP tests is not adapted.

I don’t have any objection to a new thread. I don’t know that it’ll really get more people to comment, but perhaps it will.

Thanks!

Stephen

pgsql-hackers by date:

Previous
From: Laurenz Albe
Date:
Subject: Re: Function to promote standby servers
Next
From: Michael Paquier
Date:
Subject: Re: pgsql: Add TAP tests for pg_verify_checksums