Re: BUG #17212: pg_amcheck fails on checking temporary relations - Mailing list pgsql-hackers

From Mark Dilger
Subject Re: BUG #17212: pg_amcheck fails on checking temporary relations
Date
Msg-id 81019B2A-C0EB-4890-BA3A-DB60590AAF43@enterprisedb.com
Whole thread Raw
In response to Re: BUG #17212: pg_amcheck fails on checking temporary relations  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: BUG #17212: pg_amcheck fails on checking temporary relations  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers

> On Oct 4, 2021, at 4:45 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>
> I'm guessing that you meant REINDEX CONCURRENTLY.

Yes.

> Since you're talking about the case where it has an error

Sorry, I realized after hitting <send> that you might take it that way, but I mean the logs generally, not just
postgreslogs.  If somebody runs "reindex concurrently" on all tables at midnight every night, and they see one morning
inthe (non-postgres) logs from the midnight hour weird error messages about their RAID controller, they may well want
tocheck all their indexes to see if any of them were corrupted.  This is a totally made-up example, but the idea that a
usermight want to check their indexes, tables, or both owing to errors of some kind is not far-fetched.   

> , the whole
> index build must have failed. So the user would get exactly what
> they'd expect -- verification of the original index, without any
> hindrance from the new/failed index.

Right, in that case, but not if hardware errors corrupted the index, and generated logs, without happening to trip up
thereindex concurrently statement itself. 

> (Thinks for a moment...)
>
> Actually, I think that we'd only verify the original index, even
> before the error with CONCURRENTLY (though I've not checked that point
> myself).

To get back on track, let me say that I'm not taking the position that what pg_amcheck currently does is necessarily
correct,but just that I'd like to be careful about what we change, if anything.  There are three things that seem to
irritatepeople: 

1) A non-zero exit code means "not everything was checked and passed" rather than "at least one thing is definitely
corrupt".

2) Skipping of indexes is reported to the user with the word 'ERROR' in the report rather than, say, 'NOTICE'.

3) Deadlocks can occur

I have resisted changing #1 on the theory that `pg_amcheck --all && ./post_all_checks_pass.sh` should only run the
post_all_checks_pass.shif indeed all checks have passed, and I'm interpreting skipping an index check as being contrary
tothat.  But maybe that's wrong of me.  I don't know.  There is already sloppiness between the time that pg_amcheck
resolveswhich database relations are matched by --all, --table, --index, etc. and the time that all the checks are
started,and again between that time and when the last one is complete.  Database objects could be created or dropped
duringthose spans of time, in which case --all doesn't have quite so well defined a meaning.  But the user running
pg_amcheckmight also *know* that they aren't running any such DDL, and therefore expect --all to actually result in
everythingbeing checked. 

I find it strange that I should do anything about #2 in pg_amcheck, since it's the function in verify_nbtree that
phrasesthe situation as an error.  But I suppose I can just ignore that and have it print as a notice.  I'm genuinely
nottrying to give you grief here -- I simply don't like that pg_amcheck is adding commentary atop what the checking
functionsare doing.  I see a clean division between what pg_amcheck is doing and what amcheck is doing, and this feels
tome to put that on the wrong side of the divide.  If refusing to check the index because it is not in the requisite
stateis a notice, then why wouldn't verify_nbtree raise it as one and return early rather than raising an error? 

I also find it strange that #3 is being attributed to pg_amcheck's choice of how to call the checking function, because
Ican't think of any other function where we require the SQL caller to do anything like what you are requiring here in
orderto prevent deadlocks, and also because the docs for the functions don't say that a deadlock is possible, merely
thatthe function may return with an error.  I was totally content to get an error back, since errors are how the
verify_nbtreefunctions communicate everything else, and the handler for those functions is already prepared to deal
withthe error messages so returned.  But it clearly annoys you that pg_amcheck is doing this, so I'll go forward with
thepatch that I already have written which does otherwise. 


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






pgsql-hackers by date:

Previous
From: Jaime Casanova
Date:
Subject: Re: Patch: Range Merge Join
Next
From: Andres Freund
Date:
Subject: Re: plperl on windows