Thread: Error “cache lookup failed for function”
Hi all, On a Debian Buster/64 Bit I run Postgres 11 (apt-cache info): Package: postgresql-11 Version: 11.7-0+deb10u1 Maintainer: Debian PostgreSQL Maintainers <team+postgresql@tracker.debian.org> I tried to update the running server by executing the following SQL update script using psql: ---8<----------------------------------------------- BEGIN; -- add a column to an existing table -- add a new table -- add several db functions -- replace a DB function: DROP FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT metadata jsonb, OUT errortext text, OUT vanished boolean); CREATE FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT metadata jsonb, OUT errortext text, OUT vanishedboolean) RETURNS record […] COMMIT; ---8<----------------------------------------------- About ~350 clients were connected to the server when I ran the above script, a few of them using, inter alia, the functionget_result2() which ought to be replaced. Immediately after running the script, the log was filled with errors ---8<----------------------------------------------- ERROR: cache lookup failed for function 1821571 CONTEXT: PL/pgSQL function get_result2(bigint) while casting return value to function's return type STATEMENT: SELECT data, metadata, errortext, vanished FROM get_result2(26671107) ---8<----------------------------------------------- Interestingly, some clients were reporting errors while trying to call a /different/ DB function which had not been touchedby the above script. According to some older reports I found searching the internet for the error, the pg_catalog was probably damaged. As arestart of the daemon didn't help, and following the recommendations in the older reports, I dumped the data (no idea ifit was damaged, too, though) and re-installed the cluster form scratch. It now again works as expected using the modifiedschema. I would agree that updating the schema of a database server under considerable load is not the best idea, but it should *never*damage the database itself (it would be acceptable if the transaction just fails, though). Any idea what happened here, i.e. what caused the error, and how I can avoid it? Thanks in advance, Albrecht.
Attachment
Albrecht =?iso-8859-1?b?RHJl3w==?= <albrecht.dress@arcor.de> writes: > I tried to update the running server by executing the following SQL update script using psql: > ---8<----------------------------------------------- > BEGIN; > -- add a column to an existing table > -- add a new table > -- add several db functions > -- replace a DB function: > DROP FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT metadata jsonb, OUT errortext text, OUT vanishedboolean); > CREATE FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT metadata jsonb, OUT errortext text, OUT vanishedboolean) RETURNS record > […] > COMMIT; > ---8<----------------------------------------------- > About ~350 clients were connected to the server when I ran the above script, a few of them using, inter alia, the functionget_result2() which ought to be replaced. > Immediately after running the script, the log was filled with errors > ---8<----------------------------------------------- > ERROR: cache lookup failed for function 1821571 > CONTEXT: PL/pgSQL function get_result2(bigint) while casting return value to function's return type > STATEMENT: SELECT data, metadata, errortext, vanished FROM get_result2(26671107) > ---8<----------------------------------------------- This is, actually, not very surprising. You dropped the old function while clients were using it. The new function is a completely unrelated object, even if it happens to have the same name. What you should have done was CREATE OR REPLACE FUNCTION, which would have preserved the object's identity. It does seem a bit annoying that something in plpgsql is apparently doing a fresh catalog lookup to find information that likely was already cached at the start of function execution. But I think that's a performance deficiency, not a bug per se. regards, tom lane
I wrote: > It does seem a bit annoying that something in plpgsql is apparently > doing a fresh catalog lookup to find information that likely was > already cached at the start of function execution. But I think that's > a performance deficiency, not a bug per se. I reproduced this issue locally (to avoid race conditions, I just had the function drop itself ;-)) and traced the site of the failure to this bit in plpgsql function exit: /* * Need to look up the expected result type. XXX would be * better to cache the tupdesc instead of repeating * get_call_result_type(), but the only easy place to save it * is in the PLpgSQL_function struct, and that's too * long-lived: composite types could change during the * existence of a PLpgSQL_function. */ switch (get_call_result_type(fcinfo, &resultTypeId, &tupdesc)) The catalog access inside get_call_result_type is fairly essential, because this function has OUT parameters, so its pg_proc row is the ultimate source of truth about what it returns. We could imagine caching the info earlier during function execution, but as the comment says, that has its own failure modes ... and they're more important ones in practice. So I'm afraid there's not much to be done to improve this. regards, tom lane
Given the explicit begin before drop/create, this is a bit of an unexpected gotcha to have any impact on other transactions. Are there other known limitations of DDL in a transaction?
Am 20.02.20 19:32 schrieb(en) Tom Lane: > This is, actually, not very surprising. You dropped the old function while clients were using it. The new function isa completely unrelated object, even if it happens to have the same name. Yes, I agree that this was not a too clever approach… > It does seem a bit annoying that something in plpgsql is apparently doing a fresh catalog lookup to find information thatlikely was already cached at the start of function execution. OK, but after fully stopping the daemon via systemctl (which of course disconnects all clients) and re-starting it, the cacheis empty, isn't it? So the client after re-connecting /should/ find the proper function? In my case the full restartdid *not* help, but produced the same error again afterwards! I didn't reboot the whole box, though. This led me to the assumption that the database was “somehow” damaged. Thanks, Albrecht.
Attachment
Albrecht =?iso-8859-1?b?RHJl3w==?= <albrecht.dress@arcor.de> writes: > OK, but after fully stopping the daemon via systemctl (which of course disconnects all clients) and re-starting it, thecache is empty, isn't it? So the client after re-connecting /should/ find the proper function? In my case the full restartdid *not* help, but produced the same error again afterwards! I didn't reboot the whole box, though. Hm ... that's interesting, but it sounds like you've systematically expunged all the evidence about what happened :-( regards, tom lane
On 2/20/20 11:28 AM, Albrecht Dreß wrote: > Am 20.02.20 19:32 schrieb(en) Tom Lane: >> This is, actually, not very surprising. You dropped the old function >> while clients were using it. The new function is a completely >> unrelated object, even if it happens to have the same name. > > Yes, I agree that this was not a too clever approach… > >> It does seem a bit annoying that something in plpgsql is apparently >> doing a fresh catalog lookup to find information that likely was >> already cached at the start of function execution. > > OK, but after fully stopping the daemon via systemctl (which of course > disconnects all clients) and re-starting it, the cache is empty, isn't > it? So the client after re-connecting /should/ find the proper > function? In my case the full restart did *not* help, but produced the > same error again afterwards! I didn't reboot the whole box, though. It would be nice to know what: […] represented in: CREATE FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT metadata jsonb, OUT errortext text, OUT vanished boolean) RETURNS record […] COMMIT; The Postgres logs during and after restart might provide some info. Also the errors thrown when accessing the other function. > > This led me to the assumption that the database was “somehow” damaged. > > Thanks, > Albrecht. -- Adrian Klaver adrian.klaver@aklaver.com
Am 20.02.20 21:41 schrieb(en) Adrian Klaver: > It would be nice to know what: [snip] > represented in: Dropping and re-creating the function is actually the last operation in the script. The function is /very/ simple (justa wrapper to hide all internals from "agent" clients): ---8<-------------------------------------------------------------------- DROP FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT metadata jsonb, OUT errortext text, OUT vanished boolean); CREATE FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT metadata jsonb, OUT errortext text, OUT vanishedboolean) RETURNS record LANGUAGE plpgsql STABLE SECURITY DEFINER SET search_path TO 'public', 'pg_temp' AS $$ BEGIN SELECT r.data, r.metadata, r.errortext FROM results r INNER JOIN tasks USING(resultid) WHERE taskid = mytaskid LIMIT1 INTO data, metadata, errortext; SELECT COUNT(*) = 0 FROM tasks WHERE taskid = mytaskid INTO vanished; END; $$; ALTER FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT metadata jsonb, OUT errortext text, OUT vanished boolean)OWNER TO manager; REVOKE ALL ON FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT metadata jsonb, OUT errortext text, OUT vanishedboolean) FROM PUBLIC; GRANT ALL ON FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT metadata jsonb, OUT errortext text, OUT vanishedboolean) TO "agent"; COMMIT; ---8<-------------------------------------------------------------------- > The Postgres logs during and after restart might provide some info. > > Also the errors thrown when accessing the other function. I attach the (slightly stripped down; I don't want to post ~100k…) log, starting with the very first error at 13:39:59.302UTC. Prior to that line are *no* errors. I added a few [comments]. At 13:39:59.484 the error message changes, referring to an ancient function “retrieve_single_result()” which (according tothe person who wrote the “agent” client) is *not* called. The clients try periodically poll “get_result2()”. At 13:42:00 the “systemctl restart” has been initiated. At 13:42:02 the database has been stopped and is restarted immediately,revealing one misconfigured client (should be harmless), but the cache lookup error persists. Thanks in advance for your help, Albrecht.
Attachment
On 2/21/20 9:55 AM, Albrecht Dreß wrote: > Am 20.02.20 21:41 schrieb(en) Adrian Klaver: >> It would be nice to know what: > [snip] >> represented in: > > Dropping and re-creating the function is actually the last operation in > the script. The function is /very/ simple (just a wrapper to hide all > internals from "agent" clients): > > ---8<-------------------------------------------------------------------- > DROP FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT > metadata jsonb, OUT errortext text, OUT vanished boolean); > CREATE FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT > metadata jsonb, OUT errortext text, OUT vanished boolean) RETURNS record > LANGUAGE plpgsql STABLE SECURITY DEFINER > SET search_path TO 'public', 'pg_temp' > AS $$ > BEGIN > SELECT r.data, r.metadata, r.errortext FROM results r INNER JOIN > tasks USING(resultid) WHERE taskid = mytaskid LIMIT 1 INTO data, > metadata, errortext; > SELECT COUNT(*) = 0 FROM tasks WHERE taskid = mytaskid INTO vanished; > END; > $$; > ALTER FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT > metadata jsonb, OUT errortext text, OUT vanished boolean) OWNER TO manager; > REVOKE ALL ON FUNCTION public.get_result2(mytaskid bigint, OUT data > bytea, OUT metadata jsonb, OUT errortext text, OUT vanished boolean) > FROM PUBLIC; > GRANT ALL ON FUNCTION public.get_result2(mytaskid bigint, OUT data > bytea, OUT metadata jsonb, OUT errortext text, OUT vanished boolean) TO > "agent"; > COMMIT; > ---8<-------------------------------------------------------------------- > >> The Postgres logs during and after restart might provide some info. >> >> Also the errors thrown when accessing the other function. > > I attach the (slightly stripped down; I don't want to post ~100k…) log, > starting with the very first error at 13:39:59.302 UTC. Prior to that > line are *no* errors. I added a few [comments]. > > At 13:39:59.484 the error message changes, referring to an ancient > function “retrieve_single_result()” which (according to the person who > wrote the “agent” client) is *not* called. The clients try periodically > poll “get_result2()”. > > At 13:42:00 the “systemctl restart” has been initiated. At 13:42:02 the > database has been stopped and is restarted immediately, revealing one > misconfigured client (should be harmless), but the cache lookup error > persists. Hmm. More questions: 1) From your original post what does the below mean?: -- add several db functions -- replace a DB function: 2) What do you see when you do?: select * from pg_proc where oid = 1821571; 3) What is the definition for retrieve_single_result()? 4) What does the below show?: select proname, prosrc from pg_proc where proname in ('retrieve_single_result', 'get_result2'); > > Thanks in advance for your help, > Albrecht. -- Adrian Klaver adrian.klaver@aklaver.com
Sorry for the late reply, I've been on a short vacation… Am 21.02.20 21:44 schrieb(en) Adrian Klaver: > 1) From your original post what does the below mean?: > > -- add several db functions One trigger function, plus two “normal” ones, all (yet) unused. > -- replace a DB function: This was the ---8<-------------------------------------------------------------------- DROP FUNCTION public.get_result2[…] CREATE FUNCTION public.get_result2[…] ---8<-------------------------------------------------------------------- sequence triggering the error. > 2) What do you see when you do?: > > select * from pg_proc where oid = 1821571; Unfortunately, when I saw that restarting the cluster didn't solve the issue, I re-installed it from a dump (which appearsto be intact, btw), but /without/ taking a copy of the original data folder. Thus, this information is lost, sorry. > 3) What is the definition for retrieve_single_result()? Ancient implementation, somewhat similar to get_result2(), but with less security and different error checking approach: ---8<-------------------------------------------------------------------- CREATE FUNCTION public.retrieve_single_result(mytaskid bigint, OUT data bytea, OUT metadata jsonb) RETURNS record LANGUAGE plpgsql STABLE SECURITY DEFINER AS $$ DECLARE errortext text; BEGIN SELECT results.data, results.metadata, results.errortext FROM results LEFT JOIN tasks ON tasks.resultid = results.resultidWHERE taskid = mytaskid LIMIT 1 INTO data, metadata, errortext; IF errortext IS NOT NULL THEN RAISE EXCEPTION 'error result "%"', errortext; END IF; END; $$; ---8<-------------------------------------------------------------------- > 4) What does the below show?: > > select proname, prosrc from pg_proc where proname in ('retrieve_single_result', 'get_result2'); Same problem as with #2, sorry… Thanks, Albrecht.
Attachment
On 2/26/20 11:37 AM, Albrecht Dreß wrote: > Sorry for the late reply, I've been on a short vacation… > > Am 21.02.20 21:44 schrieb(en) Adrian Klaver: >> 1) From your original post what does the below mean?: >> >> -- add several db functions > > One trigger function, plus two “normal” ones, all (yet) unused. > >> -- replace a DB function: > > This was the > > ---8<-------------------------------------------------------------------- > DROP FUNCTION public.get_result2[…] > CREATE FUNCTION public.get_result2[…] > ---8<-------------------------------------------------------------------- > > sequence triggering the error. > >> 2) What do you see when you do?: >> >> select * from pg_proc where oid = 1821571; > > Unfortunately, when I saw that restarting the cluster didn't solve the > issue, I re-installed it from a dump (which appears to be intact, btw), > but /without/ taking a copy of the original data folder. Thus, this > information is lost, sorry. > >> 3) What is the definition for retrieve_single_result()? > > Ancient implementation, somewhat similar to get_result2(), but with less > security and different error checking approach: > > ---8<-------------------------------------------------------------------- > CREATE FUNCTION public.retrieve_single_result(mytaskid bigint, OUT data > bytea, OUT metadata jsonb) RETURNS record > LANGUAGE plpgsql STABLE SECURITY DEFINER > AS $$ > DECLARE > errortext text; > BEGIN > SELECT results.data, results.metadata, results.errortext FROM > results LEFT JOIN tasks ON tasks.resultid = results.resultid WHERE > taskid = mytaskid LIMIT 1 INTO data, metadata, errortext; > IF errortext IS NOT NULL THEN > RAISE EXCEPTION 'error result "%"', errortext; > END IF; > END; > $$; > ---8<-------------------------------------------------------------------- > >> 4) What does the below show?: >> >> select proname, prosrc from pg_proc where proname in >> ('retrieve_single_result', 'get_result2'); > > Same problem as with #2, sorry… Alright, as Tom said upstream, there is not much that can be done with information missing. Hopefully it does not happen again, but if it does you know what needs to be captured. > > Thanks, > Albrecht. -- Adrian Klaver adrian.klaver@aklaver.com