Re: error: could not find pg_class tuple for index 2662 - Mailing list pgsql-hackers

From Robert Haas
Subject Re: error: could not find pg_class tuple for index 2662
Date
Msg-id CA+TgmoZJbJ3iyFeHEQgPrjEvqGK_-87uwGSWtNj2fiEsKE7_2Q@mail.gmail.com
Whole thread Raw
In response to Re: error: could not find pg_class tuple for index 2662  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: error: could not find pg_class tuple for index 2662
List pgsql-hackers
On Thu, Aug 11, 2011 at 5:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I can reproduce the problem fairly conveniently with this crude hack:
>
> diff --git a/src/backend/storage/ipc/sinval.c b/src/backend/storage/ipc/sinval.c
> index 8499615..5ad2aee 100644
> *** a/src/backend/storage/ipc/sinval.c
> --- b/src/backend/storage/ipc/sinval.c
> *************** ReceiveSharedInvalidMessages(
> *** 106,112 ****
>        /* Try to get some more messages */
>        getResult = SIGetDataEntries(messages, MAXINVALMSGS);
>
> !       if (getResult < 0)
>        {
>            /* got a reset message */
>            elog(DEBUG4, "cache state reset");
> --- 106,112 ----
>        /* Try to get some more messages */
>        getResult = SIGetDataEntries(messages, MAXINVALMSGS);
>
> !       if (getResult != 0)
>        {
>            /* got a reset message */
>            elog(DEBUG4, "cache state reset");
>
> which forces every occurrence of an incoming sinval message to be treated
> as a reset.  The serial regression tests still work with that, but they
> fall over almost immediately if you run something like this in parallel:
>
>        while psql -c "vacuum full pg_class" regression; do usleep 100000; done
>
> and the parallel regression tests tend to fall over without any outside
> help because of the one occurrence of "vacuum full pg_class" in them.

Nice detective work.

> I'm inclined to think that a robust solution requires both of the ideas
> I proposed last week: we should update all the relmapping entries during
> pass 1 of RelationCacheInvalidate, *and* make pass 2 do things in a more
> robust order.  I think pg_class, pg_class_oid_index, other nailed
> relations, and then everything else ought to do it.
>
> Anyway, that's easily fixed now that we have a reproduceable case.
> What's bothering me at the moment is that the CLOBBER_CACHE_ALWAYS hack,
> which was meant to expose exactly this sort of problem, failed to do so
> --- buildfarm member jaguar has been running with that flag for ages and
> never showed this problem.  I'm thinking that we should take out the
> hack in AcceptInvalidationMessages and instead put in #ifdeffed code
> that causes ReceiveSharedInvalidMessages to forcibly always call the
> reset function.  Any thoughts about that?

I'm OK with that, but while we're whacking this around, is there any
chance that we could reduce the number of layers of function calls
that are happening in here?

Right now, we call AcceptInvalidationMessages().
...which basically does nothing but call ReceiveSharedInvalidMessages()
...which in turn calls SIGetDataEntries() ... and then, by way of a
function pointer, LocalExecuteInvalidMessage() or perhaps
InvalidateSystemCaches().

It would be nice to move the short-circuit test I recently inserted at
the top of SIGetDataEntries() somewhere higher up in the call stack,
but right now the layers of abstraction are so thick that it's not
exactly clear how to do that.  In my ideal world, I'd like to turn
AcceptInvalidationMessages into something like this:

#define AcceptInvalidationMessages() \   do { if (*flagvar) ReallyDoIt(); } while (0)

...where flagvar is a pointer to myStateP->hasMessages.  This isn't
really a huge hotspot any more, but a cycle saved is a cycle earned,
and we call this frequently enough that it seems awfully wasteful to
push and pop three call stack frames every time we check and find no
messages.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: index-only scans
Next
From: Robert Haas
Date:
Subject: Re: psql document fix about showing FDW options