Initializing initFileRelationIds list via write is unsafe - Mailing list pgsql-hackers

From: Tom Lane
Subject: Initializing initFileRelationIds list via write is unsafe
Msg-id: 7758.1433610350@sss.pgh.pa.us
Responses: Re: Initializing initFileRelationIds list via write is unsafe (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
I wrote:
> I've been chasing the intermittent "cache lookup failed for access method
> 403" failure at session startup that's been seen lately in the buildfarm,
> for instance here:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-06-04%2019%3A22%3A46
> (Axolotl has shown this 3 times in the last 90 days, not sure if any
> others have seen it.)  I hypothesized that this was triggered by the
> "VACUUM FULL pg_am" in the concurrently running vacuum.sql regression
> test, so I started running the regression tests in parallel with a
> shell script doing
> while sleep 0.1; do psql -c 'vacuum full pg_am' regression; done
> and sure enough, I can reproduce it once in awhile.

I've tracked down what's happening here, and it's pretty nasty.
relcache.c keeps a list of the OIDs of all system catalogs/indexes that
are in the relcache init file, which it uses to decide whether the init
file needs to be updated after a relcache invalidation has been issued
against some relation.  That list is created in one of two ways:

1. If we read in an init file, we just record all the entries we saw.

2. If we write out an init file, we just record all the entries we wrote
out.

Before doing #2, we open all the catalog caches, which should result in
populating the relcache with all the interesting system catalogs and
indexes.
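
In code terms, both paths just append each relation's OID to the list as its
entry goes by; a simplified paraphrase of the relcache.c logic (the real work
is in load_relcache_init_file and write_relcache_init_file, details elided)
looks about like this:

#include "postgres.h"
#include "nodes/pg_list.h"
#include "utils/rel.h"

/* Simplified paraphrase, not the exact relcache.c code */
static List *initFileRelationIds = NIL;

/* Both the read path and the write path record each entry the same way. */
static void
remember_init_file_entry(Relation rel)
{
    initFileRelationIds = lappend_oid(initFileRelationIds,
                                      RelationGetRelid(rel));
}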

Now there's a race condition here, which is that we might receive an
invalidation message telling us that one of those relcache entries is
stale before we get done writing it out.  Fine, we check for that, and
if it happens then we discard the already-stale init file instead of
installing it as valid.  That all works.  BUT: the session keeps on
using the initFileRelationIds list that it constructed while writing
the bogus init file.  Processing of the invalidation message might've
resulted in flushing one or more relcache entries, so that they were
never written out and thus are missing from the initFileRelationIds
list.  This can lead to wrong decisions later in the session about
whether its own actions require init file updates.
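
The tail end of the write path therefore behaves roughly like this (again a
simplified sketch, not write_relcache_init_file verbatim; the real code
detects invalidations received during the write and installs or discards the
file under a lock):

/* Simplified sketch of the end of the init-file write path */
static void
finish_init_file_write(const char *tempfilename,
                       const char *finalfilename,
                       bool invals_received_during_write)
{
    if (invals_received_during_write)
    {
        /* What we just wrote may already be stale: don't install it. */
        unlink(tempfilename);

        /*
         * BUT initFileRelationIds, built as a side effect of the write,
         * is kept.  Any rel whose relcache entry got flushed before we
         * reached it is now silently missing from the list for the rest
         * of the session.
         */
        return;
    }

    /* Otherwise, install the new init file as valid. */
    rename(tempfilename, finalfilename);
}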

To be concrete, here's the sequence of events that I see happening
in the above-described test case:

1. The "VACUUM FULL pg_am" in the vacuum.sql regression test rewrites
pg_am, sends out a relcache inval for pg_am, and unlinks the init file.
(Actually the init file was probably already gone from other activity.)

2. An incoming session from the manual script fails to open the init
file, so it loads up its catalog caches.  Somewhere pretty late in that
sequence, it notices the relcache inval for pg_am and flushes its pg_am
relcache entry, which it then has no need to reload later in the
sequence.  So when it goes to write the init file, it doesn't write a
pg_am entry and doesn't add pg_am's OID to the initFileRelationIds list
either.  At the end of that, it correctly notices that what it's written
is stale, so it doesn't install a busted init file ... but it does sally
on with a busted initFileRelationIds list.

3. Meanwhile, some other regression-test session that arrived slightly
later successfully builds and installs a valid init file.

4. When the manual script session does "vacuum full pg_am", it does not
realize that there ought to be an init file invalidation event from that.
So the init file created by step 3 is left alone, even though it now
contains obsolete data about the relfilenode of pg_am.

5. Specifically, the init file's pg_am entry is referencing a physical
disk file that has been truncated to zero size and is pending deletion at
the next checkpoint.  So, when incoming sessions decide they need to fetch
a pg_am entry, the symptom is "row not found" rather than "file not
found".  That matches the failure symptom we see in the buildfarm.

Now, this test case seems to require two successive vacuum fulls on pg_am
in order to create the observed problem, which is something that doesn't
occur in the regression tests themselves, so how come we saw it in the
buildfarm?  I think the answer is probably that an sinval overflow/reset
event can substitute for step 1.  I tried instrumenting
RelationCacheInitializePhase3 to log a complaint anytime the list length
wasn't the currently-expected 94 entries, and got results like these:

LOG:  length(initFileRelationIds) = 11 in pid 2054
LOG:  length(initFileRelationIds) = 93 in pid 2298
LOG:  length(initFileRelationIds) = 93 in pid 4631
LOG:  length(initFileRelationIds) = 11 in pid 4684
LOG:  length(initFileRelationIds) = 11 in pid 4680
LOG:  length(initFileRelationIds) = 11 in pid 4695
LOG:  length(initFileRelationIds) = 11 in pid 4692
LOG:  length(initFileRelationIds) = 11 in pid 4682
LOG:  length(initFileRelationIds) = 11 in pid 4687
LOG:  length(initFileRelationIds) = 93 in pid 5722

The events with one missing entry are probably consequences of the
sequence described above, but the events with only 11 entries seem like
they must reflect an sinval overflow causing everything not nailed in
cache to be lost.  (Note that there are 11 nailed non-shared relations
at present, so that's why 11 and not some other number.)
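
For the record, the instrumentation is nothing fancy; something like this at
the end of RelationCacheInitializePhase3 is all it takes:

    if (list_length(initFileRelationIds) != 94)
        elog(LOG, "length(initFileRelationIds) = %d in pid %d",
             list_length(initFileRelationIds), MyProcPid);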

Bottom line is that what we have is an unsafe way of constructing the
list of significant relations.  What I'm inclined to do is get rid of
initFileRelationIds and RelationIdIsInInitFile() as such altogether,
and instead have a function in syscache.c that reports "true" if the
relation OID being inquired of is one of the tables or indexes that
are listed in the static catalog cache definitions.  (RelationHasSysCache
is the same general idea, but it considers only tables not indexes.)
This is a bit more coupling between "is in init file" and "supports a
syscache" than was originally intended, but we clearly need a less
shaky way of determining "is in init file", and I don't see a better
alternative.
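
Concretely, a sketch of such a syscache.c function (name and details are
illustrative only; cacheinfo[] already carries the catalog and index OID for
each cache) might look like:

bool
RelationSupportsSysCache(Oid relid)
{
    int         cacheId;

    for (cacheId = 0; cacheId < SysCacheSize; cacheId++)
    {
        if (cacheinfo[cacheId].reloid == relid ||
            cacheinfo[cacheId].indoid == relid)
            return true;
    }
    return false;
}

The callers that now consult initFileRelationIds would ask this instead, so
the answer no longer depends on what a given session happened to write out.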

Comments?
        regards, tom lane


