Something is broken about connection startup - Mailing list pgsql-hackers

From Tom Lane
Subject Something is broken about connection startup
Date
Msg-id 16447.1478818294@sss.pgh.pa.us
Responses Re: Something is broken about connection startup  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
I noticed that buildfarm member piculet fell over this afternoon:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2016-11-10%2020%3A10%3A02
with this interesting failure during startup of the "collate" test:
psql: FATAL:  cache lookup failed for relation 1255

1255 is pg_proc, and nosing around, I noticed that the concurrent
"init_privs" test does this:
GRANT SELECT ON pg_proc TO CURRENT_USER;
GRANT SELECT (prosrc) ON pg_proc TO CURRENT_USER;

So that led me to hypothesize that GRANT on a system catalog can cause a
concurrent connection failure, which I tested by running

pgbench -U postgres -n -f script1.sql -T 300 regression
with this script:
GRANT SELECT ON pg_proc TO CURRENT_USER;
GRANT SELECT (prosrc) ON pg_proc TO CURRENT_USER;
REVOKE SELECT ON pg_proc FROM CURRENT_USER;
REVOKE SELECT (prosrc) ON pg_proc FROM CURRENT_USER;

and concurrently
pgbench -C -U postgres -n -f script2.sql -c 10 -j 10 -T 300 regression
with this script:
select 2 + 2;

and sure enough, the second one falls over after a bit with

connection to database "regression" failed:
FATAL:  cache lookup failed for relation 1255
client 5 aborted while establishing connection

For me, this typically happens within thirty seconds.  I thought
perhaps it only happened with --no-atomics which piculet is using, but
nope, I can reproduce it in a stock debug build.  For the record, I'm
testing on an 8-core x86_64 machine running RHEL6.
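
For anyone repeating the test, the two runs above could be wrapped in one
shell script, roughly as sketched here.  The pgbench options, the script
contents, and the database name "regression" are taken from the commands
above; the RUN_REPRO guard and the DURATION variable are assumptions added
for convenience, and a running server with a "postgres" role is assumed.

```shell
#!/bin/sh
# Sketch of the reproduction described above.  RUN_REPRO and DURATION are
# hypothetical conveniences, not part of the original recipe.
DURATION="${DURATION:-300}"

# script1: toggle table- and column-level privileges on pg_proc.
cat > script1.sql <<'EOF'
GRANT SELECT ON pg_proc TO CURRENT_USER;
GRANT SELECT (prosrc) ON pg_proc TO CURRENT_USER;
REVOKE SELECT ON pg_proc FROM CURRENT_USER;
REVOKE SELECT (prosrc) ON pg_proc FROM CURRENT_USER;
EOF

# script2: a trivial query; the interesting part is the connection startup.
cat > script2.sql <<'EOF'
select 2 + 2;
EOF

# Only start pgbench when explicitly asked to, e.g. RUN_REPRO=1 sh repro.sh
if [ "${RUN_REPRO:-0}" = 1 ]; then
    # One client running the GRANT/REVOKE loop in the background.
    pgbench -U postgres -n -f script1.sql -T "$DURATION" regression &
    grant_pid=$!

    # Ten clients that reconnect for every transaction (-C).
    pgbench -C -U postgres -n -f script2.sql -c 10 -j 10 -T "$DURATION" regression

    # Stop the background run once the connecting clients are done.
    kill "$grant_pid" 2>/dev/null
    wait "$grant_pid" 2>/dev/null || true
fi
```

The two pgbench processes are kept separate on purpose, so that only one
instance of script1 is ever running.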

Note: you can't merge this test scenario into one pgbench run with two
scripts, because then you can't keep pgbench from sometimes running two
instances of script1 concurrently, with ensuing "tuple concurrently
updated" errors.  That's something we've previously deemed not worth
changing, and in any case it's not what I'm on about at the moment.
I tried to make script1 safe for concurrent calls by putting "begin; lock
table pg_proc in share row exclusive mode; ...; commit;" around it, but
that caused the error to go away, or at least become far less frequent.
Which is odd in itself, since that lock level shouldn't block connection
startup accesses to pg_proc.
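
The lock-wrapped variant of script1 might look like the following sketch.
It assumes that the "..." above stands for the four GRANT/REVOKE statements
of script1; the file name script1_locked.sql is hypothetical.

```shell
#!/bin/sh
# Sketch: script1 wrapped in a transaction that first takes a
# SHARE ROW EXCLUSIVE lock on pg_proc.  Assumes the elided "..." is the
# original four statements; script1_locked.sql is a made-up file name.
cat > script1_locked.sql <<'EOF'
begin;
lock table pg_proc in share row exclusive mode;
GRANT SELECT ON pg_proc TO CURRENT_USER;
GRANT SELECT (prosrc) ON pg_proc TO CURRENT_USER;
REVOKE SELECT ON pg_proc FROM CURRENT_USER;
REVOKE SELECT (prosrc) ON pg_proc FROM CURRENT_USER;
commit;
EOF
```

It would be passed to pgbench with -f in place of script1.sql.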

A quick look through the sources confirms that this error means
SearchSysCache on the RELOID cache failed to find a tuple for pg_proc ---
there are many occurrences of this text, but they all report that.  Which
absolutely should not be happening now that we use
MVCC catalog scans, concurrent updates or no.  So I think this is a bug,
and possibly a fairly-recently-introduced one, because I can't remember
seeing buildfarm failures like this one before.

I've not dug further than that yet.  Any thoughts out there?
        regards, tom lane


