"Greg Sabino Mullane" <greg@turnstep.com> writes:
> Tom Lane wrote:
>> Does the mentioned OID actually correspond to the OID of the table it's
>> supposed to be opening, or is it wrong? Is anything being done to
>> the table schema in parallel?
> Yes, it is the correct OID. No, nothing done to the schema in parallel,
> although there is a process that disables/re-enables triggers and rules
> on that table via pg_class tweaking (inside a txn, of course).

Oh! Duh, that's your issue right there, I'll bet. The problem is that
relcache-open tries to read the pg_class row under SnapshotNow rules,
and if there is another xact concurrently modifying the row, it is
entirely possible for none of the row versions to be committed good at
the instant they are visited. (The new row version either isn't seen at
all or isn't committed good yet when it's visited, and later when the
old row version is visited, it has become committed dead.) This results
in ScanPgRelation failing (returning NULL) which leads to exactly the
"could not open relation with OID xxx" symptom --- and in fact I see no
other code path that yields that failure.
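
To make that interleaving concrete, here is a toy Python model of it.
This is only a sketch, not PostgreSQL's code: the names RowVersion,
visible_now, make_versions and scan are invented for illustration, and
"SnapshotNow" is reduced to the rule that a row version is visible when
its inserting xact has committed and its deleting xact (if any) has not.

```python
from dataclasses import dataclass
from typing import List, Optional

IN_PROGRESS = "in progress"
COMMITTED = "committed"


@dataclass
class RowVersion:
    label: str
    xmin_state: str            # state of the xact that inserted this version
    xmax_state: Optional[str]  # state of the xact that deleted it, if any


def visible_now(v: RowVersion) -> bool:
    # Simplified SnapshotNow-style test, evaluated at the instant of the visit.
    return v.xmin_state == COMMITTED and v.xmax_state != COMMITTED


def commit_updater(versions: List[RowVersion]) -> None:
    # The concurrent updating xact commits: everything it did becomes final.
    for v in versions:
        if v.xmin_state == IN_PROGRESS:
            v.xmin_state = COMMITTED
        if v.xmax_state == IN_PROGRESS:
            v.xmax_state = COMMITTED


def make_versions() -> List[RowVersion]:
    # Scan order assumed here: the new version is visited before the old one.
    new = RowVersion("new", xmin_state=IN_PROGRESS, xmax_state=None)
    old = RowVersion("old", xmin_state=COMMITTED, xmax_state=IN_PROGRESS)
    return [new, old]


def scan(versions: List[RowVersion], commit_after: Optional[int] = None):
    # Visit versions one at a time; optionally let the updater commit
    # right after the visit at index commit_after.
    for i, v in enumerate(versions):
        if visible_now(v):
            return v.label
        if i == commit_after:
            commit_updater(versions)
    return None  # stands in for ScanPgRelation finding no row


print(scan(make_versions()))                  # updater never commits mid-scan
print(scan(make_versions(), commit_after=0))  # updater commits between visits
```

In the second call the new version is rejected because its inserter is
still in progress, and by the time the old version is visited its
deleter has committed, so the scan comes back empty-handed.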

As of 8.2 we have this problem fixed for system-initiated changes to the
pg_class row, but you're still going to be at risk if you are doing
manual "UPDATE pg_class" operations. Can you get away from needing to
do that? ALTER TABLE DISABLE TRIGGER might help, but we haven't got
anything like ALTER TABLE DISABLE RULE. In any case the important point
is that you have to take an AccessExclusive lock on a relation whose
pg_class row you would like to change, and you need to be on 8.2 because
prior releases weren't careful about obtaining the lock *before* reading
the row.
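
As a rough sketch of what the trigger half of that could look like, the
Python helper below just builds the statement sequence: take the lock
first, then use ALTER TABLE rather than touching pg_class by hand. The
function names and the table name "orders" are hypothetical, the choice
of DISABLE TRIGGER ALL is an assumption about what the existing process
wants, and rules are not covered since there is no equivalent command
for them.

```python
from typing import List


def quote_ident(name: str) -> str:
    # Minimal SQL identifier quoting: wrap in double quotes, double any inside.
    return '"' + name.replace('"', '""') + '"'


def trigger_toggle_statements(table: str, work: List[str]) -> List[str]:
    # Build one transaction that locks the table before anything else,
    # disables its triggers, runs the caller's statements, and re-enables them.
    t = quote_ident(table)
    return [
        "BEGIN",
        f"LOCK TABLE {t} IN ACCESS EXCLUSIVE MODE",
        f"ALTER TABLE {t} DISABLE TRIGGER ALL",
        *work,
        f"ALTER TABLE {t} ENABLE TRIGGER ALL",
        "COMMIT",
    ]


# Hypothetical usage; the table and the bulk statement are placeholders.
for stmt in trigger_toggle_statements("orders", ["DELETE FROM orders WHERE false"]):
    print(stmt + ";")
```

The explicit LOCK TABLE is there only to make the ordering obvious; the
point is that the lock is held before the pg_class row is read or
changed.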

regards, tom lane