Re: Intermittent "cache lookup failed for type" buildfarm failures - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Intermittent "cache lookup failed for type" buildfarm failures
Date
Msg-id 21004.1472131937@sss.pgh.pa.us
In response to Intermittent "cache lookup failed for type" buildfarm failures  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
I wrote:
> There is something rotten in the state of Denmark.  Here are four recent
> runs that failed with unexpected "cache lookup failed for type nnnn"
> errors:

> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grouse&dt=2016-08-16%2008%3A39%3A03
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nudibranch&dt=2016-08-13%2009%3A55%3A09
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2016-08-09%2001%3A46%3A17
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2016-08-09%2000%3A44%3A18

I believe I've figured this out.

I realized that all the possible instances of "cache lookup failed for
type" are reporting failures of SearchSysCache1(TYPEOID, ...) or related
calls, and therefore I could narrow this down by setting a breakpoint
there on the combination of cacheId = TYPEOID and key1 > 16384 (since the
OIDs reported for the failures are clearly for some non-builtin type).
After a bit of logging it became clear that the only such calls occurring
in the statements that are failing in the buildfarm are coming from the
parser's attempts to resolve an operator name.  And then it was blindingly
obvious what changed recently: commits f0c7b789a et al added a test case
in case.sql that creates and then drops both an '=' operator and the type
it's for.  And that runs in parallel with the failing tests, which all
need to resolve operators named '='.  So in the other sessions, the parser
is seeing that transient '=' operator as a possible candidate, but then
when it goes to test whether that operator could match the actual inputs,
the type is already gone (causing a failure in getBaseType or
get_element_type or possibly other places).

The best short-term fix, and the only one I'd consider back-patching,
is to band-aid the test to prevent this problem, probably by wrapping
that whole test case in BEGIN ... ROLLBACK so that concurrent tests
never see the transient '=' operator.
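
For illustration only, the race and the proposed band-aid can be sketched as a
toy model in Python. This is not PostgreSQL code: ToyCatalog, the helper
functions, the type name and the OID 16385 are all hypothetical, and the
"transaction" is modeled simply as a private copy of the catalog that is
thrown away. The one thing it tries to show is the ordering: another session
collects '=' candidates while the transient operator is visible, and only
checks their argument types after the type has been dropped.

```python
import copy


class CacheLookupFailed(Exception):
    """Stands in for the backend's "cache lookup failed for type" error."""


class ToyCatalog:
    """Hypothetical stand-in for committed catalog state seen by all sessions."""

    def __init__(self):
        self.types = {25: "text"}          # type OID -> type name
        self.operators = [("=", 25, 25)]   # (name, left type OID, right type OID)


def gather_candidates(catalog, name):
    # Step 1 of operator resolution: collect every visible operator by name.
    return [op for op in catalog.operators if op[0] == name]


def check_candidates(catalog, candidates):
    # Step 2: look up the argument types of each candidate collected earlier.
    for _, left, right in candidates:
        for oid in (left, right):
            if oid not in catalog.types:
                raise CacheLookupFailed(f"cache lookup failed for type {oid}")
    return len(candidates)


def create_transient_objects(catalog):
    # Hypothetical non-builtin type (OID above 16384) and its '=' operator.
    catalog.types[16385] = "transient_type"
    catalog.operators.append(("=", 16385, 16385))


def drop_transient_objects(catalog):
    catalog.operators.remove(("=", 16385, 16385))
    del catalog.types[16385]


def run_unprotected():
    # The test commits its CREATE and DROP, so other sessions can see both.
    committed = ToyCatalog()
    create_transient_objects(committed)
    candidates = gather_candidates(committed, "=")   # other session, step 1
    drop_transient_objects(committed)                # test drops operator and type
    try:
        check_candidates(committed, candidates)      # other session, step 2
    except CacheLookupFailed as err:
        return str(err)
    return "ok"


def run_in_transaction():
    # BEGIN ... ROLLBACK: changes live only in a private, never-committed copy.
    committed = ToyCatalog()
    private = copy.deepcopy(committed)
    create_transient_objects(private)
    candidates = gather_candidates(committed, "=")   # sees only the built-in '='
    del private                                      # ROLLBACK
    return check_candidates(committed, candidates)


if __name__ == "__main__":
    print(run_unprotected())
    print(run_in_transaction())
```

In the unprotected run the stale candidate list still names the dropped type,
so the second step fails; in the rolled-back run the other session never has
the transient operator in its candidate list at all.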

In the long run, it'd be nice if we were more robust about such
situations, but I have to admit I have no idea how to go about
making that so.  Certainly, just letting the parser ignore catalog
lookup failures doesn't sound attractive.
        regards, tom lane


