Thread: Error “cache lookup failed for function”

Error “cache lookup failed for function”

From
Albrecht Dreß
Date:
Hi all,

On a Debian Buster/64 Bit I run Postgres 11 (apt-cache info):

Package: postgresql-11
Version: 11.7-0+deb10u1
Maintainer: Debian PostgreSQL Maintainers <team+postgresql@tracker.debian.org>

I tried to update the running server by executing the following SQL update script using psql:

---8<-----------------------------------------------
BEGIN;
-- add a column to an existing table
-- add a new table
-- add several db functions
-- replace a DB function:
DROP FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT metadata jsonb, OUT errortext text, OUT vanished
boolean);
CREATE FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT metadata jsonb, OUT errortext text, OUT
vanishedboolean) RETURNS record 
[…]
COMMIT;
---8<-----------------------------------------------

About ~350 clients were connected to the server when I ran the above script, a few of them using, inter alia, the
functionget_result2() which ought to be replaced. 

Immediately after running the script, the log was filled with errors

---8<-----------------------------------------------
ERROR:  cache lookup failed for function 1821571
CONTEXT:  PL/pgSQL function get_result2(bigint) while casting return value to function's return type
STATEMENT:  SELECT data, metadata, errortext, vanished FROM get_result2(26671107)
---8<-----------------------------------------------

Interestingly, some clients were reporting errors while trying to call a /different/ DB function which had not been
touchedby the above script. 

According to some older reports I found searching the internet for the error, the pg_catalog was probably damaged.  As
arestart of the daemon didn't help, and following the recommendations in the older reports, I dumped the data (no idea
ifit was damaged, too, though) and re-installed the cluster form scratch.  It now again works as expected using the
modifiedschema. 

I would agree that updating the schema of a database server under considerable load is not the best idea, but it should
*never*damage the database itself (it would be acceptable if the transaction just fails, though). 

Any idea what happened here, i.e. what caused the error, and how I can avoid it?

Thanks in advance,
Albrecht.
Attachment

Re: Error “cache lookup failed for function”

From
Tom Lane
Date:
Albrecht =?iso-8859-1?b?RHJl3w==?= <albrecht.dress@arcor.de> writes:
> I tried to update the running server by executing the following SQL update script using psql:

> ---8<-----------------------------------------------
> BEGIN;
> -- add a column to an existing table
> -- add a new table
> -- add several db functions
> -- replace a DB function:
> DROP FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT metadata jsonb, OUT errortext text, OUT
vanishedboolean); 
> CREATE FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT metadata jsonb, OUT errortext text, OUT
vanishedboolean) RETURNS record 
> […]
> COMMIT;
> ---8<-----------------------------------------------

> About ~350 clients were connected to the server when I ran the above script, a few of them using, inter alia, the
functionget_result2() which ought to be replaced. 

> Immediately after running the script, the log was filled with errors

> ---8<-----------------------------------------------
> ERROR:  cache lookup failed for function 1821571
> CONTEXT:  PL/pgSQL function get_result2(bigint) while casting return value to function's return type
> STATEMENT:  SELECT data, metadata, errortext, vanished FROM get_result2(26671107)
> ---8<-----------------------------------------------

This is, actually, not very surprising.  You dropped the old function
while clients were using it.  The new function is a completely unrelated
object, even if it happens to have the same name.

What you should have done was CREATE OR REPLACE FUNCTION, which would
have preserved the object's identity.

It does seem a bit annoying that something in plpgsql is apparently
doing a fresh catalog lookup to find information that likely was
already cached at the start of function execution.  But I think that's
a performance deficiency, not a bug per se.

            regards, tom lane



Re: Error “cache lookup failed for function”

From
Tom Lane
Date:
I wrote:
> It does seem a bit annoying that something in plpgsql is apparently
> doing a fresh catalog lookup to find information that likely was
> already cached at the start of function execution.  But I think that's
> a performance deficiency, not a bug per se.

I reproduced this issue locally (to avoid race conditions, I just had the
function drop itself ;-)) and traced the site of the failure to this
bit in plpgsql function exit:

                /*
                 * Need to look up the expected result type.  XXX would be
                 * better to cache the tupdesc instead of repeating
                 * get_call_result_type(), but the only easy place to save it
                 * is in the PLpgSQL_function struct, and that's too
                 * long-lived: composite types could change during the
                 * existence of a PLpgSQL_function.
                 */
                switch (get_call_result_type(fcinfo, &resultTypeId, &tupdesc))

The catalog access inside get_call_result_type is fairly essential,
because this function has OUT parameters, so its pg_proc row is the
ultimate source of truth about what it returns.  We could imagine
caching the info earlier during function execution, but as the comment
says, that has its own failure modes ... and they're more important
ones in practice.  So I'm afraid there's not much to be done to
improve this.

            regards, tom lane



Re: Error “cache lookup failed for function”

From
Michael Lewis
Date:
Given the explicit begin before drop/create, this is a bit of an unexpected gotcha to have any impact on other transactions. Are there other known limitations of DDL in a transaction?

Re: Error“cache lookup failed for function”

From
Albrecht Dreß
Date:
Am 20.02.20 19:32 schrieb(en) Tom Lane:
> This is, actually, not very surprising.  You dropped the old function while clients were using it.  The new function
isa completely unrelated object, even if it happens to have the same name. 

Yes, I agree that this was not a too clever approach…

> It does seem a bit annoying that something in plpgsql is apparently doing a fresh catalog lookup to find information
thatlikely was already cached at the start of function execution. 

OK, but after fully stopping the daemon via systemctl (which of course disconnects all clients) and re-starting it, the
cacheis empty, isn't it?  So the client after re-connecting /should/ find the proper function?  In my case the full
restartdid *not* help, but produced the same error again afterwards!  I didn't reboot the whole box, though. 

This led me to the assumption that the database was “somehow” damaged.

Thanks,
Albrecht.
Attachment

Re: Error “cache lookup failed for function”

From
Tom Lane
Date:
Albrecht =?iso-8859-1?b?RHJl3w==?= <albrecht.dress@arcor.de> writes:
> OK, but after fully stopping the daemon via systemctl (which of course disconnects all clients) and re-starting it,
thecache is empty, isn't it?  So the client after re-connecting /should/ find the proper function?  In my case the full
restartdid *not* help, but produced the same error again afterwards!  I didn't reboot the whole box, though. 

Hm ... that's interesting, but it sounds like you've systematically
expunged all the evidence about what happened :-(

            regards, tom lane



Re: Error “cache lookup failed for function”

From
Adrian Klaver
Date:
On 2/20/20 11:28 AM, Albrecht Dreß wrote:
> Am 20.02.20 19:32 schrieb(en) Tom Lane:
>> This is, actually, not very surprising.  You dropped the old function 
>> while clients were using it.  The new function is a completely 
>> unrelated object, even if it happens to have the same name.
> 
> Yes, I agree that this was not a too clever approach…
> 
>> It does seem a bit annoying that something in plpgsql is apparently 
>> doing a fresh catalog lookup to find information that likely was 
>> already cached at the start of function execution.
> 
> OK, but after fully stopping the daemon via systemctl (which of course 
> disconnects all clients) and re-starting it, the cache is empty, isn't 
> it?  So the client after re-connecting /should/ find the proper 
> function?  In my case the full restart did *not* help, but produced the 
> same error again afterwards!  I didn't reboot the whole box, though.

It would be nice to know what:

[…]

represented in:

CREATE FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT 
metadata jsonb, OUT errortext text, OUT vanished boolean) RETURNS record
[…]
COMMIT;

The Postgres logs during and after restart might provide some info.

Also the errors thrown when accessing the other function.

> 
> This led me to the assumption that the database was “somehow” damaged.
> 
> Thanks,
> Albrecht.


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Error“cache lookup failed for function”

From
Albrecht Dreß
Date:
Am 20.02.20 21:41 schrieb(en) Adrian Klaver:
> It would be nice to know what:
[snip]
> represented in:

Dropping and re-creating the function is actually the last operation in the script.  The function is /very/ simple
(justa wrapper to hide all internals from "agent" clients): 

---8<--------------------------------------------------------------------
DROP FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT metadata jsonb, OUT errortext text, OUT vanished
boolean);
CREATE FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT metadata jsonb, OUT errortext text, OUT
vanishedboolean) RETURNS record 
     LANGUAGE plpgsql STABLE SECURITY DEFINER
     SET search_path TO 'public', 'pg_temp'
     AS $$
BEGIN
     SELECT r.data, r.metadata, r.errortext FROM results r INNER JOIN tasks USING(resultid) WHERE taskid = mytaskid
LIMIT1 INTO data, metadata, errortext; 
     SELECT COUNT(*) = 0 FROM tasks WHERE taskid = mytaskid INTO vanished;
END;
$$;
ALTER FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT metadata jsonb, OUT errortext text, OUT vanished
boolean)OWNER TO manager; 
REVOKE ALL ON FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT metadata jsonb, OUT errortext text, OUT
vanishedboolean) FROM PUBLIC; 
GRANT ALL ON FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT metadata jsonb, OUT errortext text, OUT
vanishedboolean) TO "agent"; 
COMMIT;
---8<--------------------------------------------------------------------

> The Postgres logs during and after restart might provide some info.
>
> Also the errors thrown when accessing the other function.

I attach the (slightly stripped down; I don't want to post ~100k…) log, starting with the very first error at
13:39:59.302UTC.  Prior to that line are *no* errors.  I added a few [comments]. 

At 13:39:59.484 the error message changes, referring to an ancient function “retrieve_single_result()” which (according
tothe person who wrote the “agent” client) is *not* called.  The clients try periodically poll “get_result2()”. 

At 13:42:00 the “systemctl restart” has been initiated.  At 13:42:02 the database has been stopped and is restarted
immediately,revealing one misconfigured client (should be harmless), but the cache lookup error persists. 

Thanks in advance for your help,
Albrecht.
Attachment

Re: Error “cache lookup failed for function”

From
Adrian Klaver
Date:
On 2/21/20 9:55 AM, Albrecht Dreß wrote:
> Am 20.02.20 21:41 schrieb(en) Adrian Klaver:
>> It would be nice to know what:
> [snip]
>> represented in:
> 
> Dropping and re-creating the function is actually the last operation in 
> the script.  The function is /very/ simple (just a wrapper to hide all 
> internals from "agent" clients):
> 
> ---8<--------------------------------------------------------------------
> DROP FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT 
> metadata jsonb, OUT errortext text, OUT vanished boolean);
> CREATE FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT 
> metadata jsonb, OUT errortext text, OUT vanished boolean) RETURNS record
>      LANGUAGE plpgsql STABLE SECURITY DEFINER
>      SET search_path TO 'public', 'pg_temp'
>      AS $$
> BEGIN
>      SELECT r.data, r.metadata, r.errortext FROM results r INNER JOIN 
> tasks USING(resultid) WHERE taskid = mytaskid LIMIT 1 INTO data, 
> metadata, errortext;
>      SELECT COUNT(*) = 0 FROM tasks WHERE taskid = mytaskid INTO vanished;
> END;
> $$;
> ALTER FUNCTION public.get_result2(mytaskid bigint, OUT data bytea, OUT 
> metadata jsonb, OUT errortext text, OUT vanished boolean) OWNER TO manager;
> REVOKE ALL ON FUNCTION public.get_result2(mytaskid bigint, OUT data 
> bytea, OUT metadata jsonb, OUT errortext text, OUT vanished boolean) 
> FROM PUBLIC;
> GRANT ALL ON FUNCTION public.get_result2(mytaskid bigint, OUT data 
> bytea, OUT metadata jsonb, OUT errortext text, OUT vanished boolean) TO 
> "agent";
> COMMIT;
> ---8<--------------------------------------------------------------------
> 
>> The Postgres logs during and after restart might provide some info.
>>
>> Also the errors thrown when accessing the other function.
> 
> I attach the (slightly stripped down; I don't want to post ~100k…) log, 
> starting with the very first error at 13:39:59.302 UTC.  Prior to that 
> line are *no* errors.  I added a few [comments].
> 
> At 13:39:59.484 the error message changes, referring to an ancient 
> function “retrieve_single_result()” which (according to the person who 
> wrote the “agent” client) is *not* called.  The clients try periodically 
> poll “get_result2()”.
> 
> At 13:42:00 the “systemctl restart” has been initiated.  At 13:42:02 the 
> database has been stopped and is restarted immediately, revealing one 
> misconfigured client (should be harmless), but the cache lookup error 
> persists.

Hmm. More questions:

1) From your original post what does the below mean?:

-- add several db functions
-- replace a DB function:

2) What do you see when you do?:

select * from pg_proc where oid = 1821571;

3) What is the definition for retrieve_single_result()?

4) What does the below show?:

select proname, prosrc from pg_proc  where proname in 
('retrieve_single_result', 'get_result2');

> 
> Thanks in advance for your help,
> Albrecht.


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Error“cache lookup failed for function”

From
Albrecht Dreß
Date:
Sorry for the late reply, I've been on a short vacation…

Am 21.02.20 21:44 schrieb(en) Adrian Klaver:
> 1) From your original post what does the below mean?:
>
> -- add several db functions

One trigger function, plus two “normal” ones, all (yet) unused.

> -- replace a DB function:

This was the

---8<--------------------------------------------------------------------
DROP FUNCTION public.get_result2[…]
CREATE FUNCTION public.get_result2[…]
---8<--------------------------------------------------------------------

sequence triggering the error.

> 2) What do you see when you do?:
>
> select * from pg_proc where oid = 1821571;

Unfortunately, when I saw that restarting the cluster didn't solve the issue, I re-installed it from a dump (which
appearsto be intact, btw), but /without/ taking a copy of the original data folder.  Thus, this information is lost,
sorry.

> 3) What is the definition for retrieve_single_result()?

Ancient implementation, somewhat similar to get_result2(), but with less security and different error checking
approach:

---8<--------------------------------------------------------------------
CREATE FUNCTION public.retrieve_single_result(mytaskid bigint, OUT data bytea, OUT metadata jsonb) RETURNS record
     LANGUAGE plpgsql STABLE SECURITY DEFINER
     AS $$
DECLARE
     errortext text;
BEGIN
     SELECT results.data, results.metadata, results.errortext FROM results LEFT JOIN tasks ON tasks.resultid =
results.resultidWHERE taskid = mytaskid LIMIT 1 INTO data, metadata, errortext; 
     IF errortext IS NOT NULL THEN
         RAISE EXCEPTION 'error result "%"', errortext;
     END IF;
END;
$$;
---8<--------------------------------------------------------------------

> 4) What does the below show?:
>
> select proname, prosrc from pg_proc  where proname in ('retrieve_single_result', 'get_result2');

Same problem as with #2, sorry…

Thanks,
Albrecht.
Attachment

Re: Error “cache lookup failed for function”

From
Adrian Klaver
Date:
On 2/26/20 11:37 AM, Albrecht Dreß wrote:
> Sorry for the late reply, I've been on a short vacation…
> 
> Am 21.02.20 21:44 schrieb(en) Adrian Klaver:
>> 1) From your original post what does the below mean?:
>>
>> -- add several db functions
> 
> One trigger function, plus two “normal” ones, all (yet) unused.
> 
>> -- replace a DB function:
> 
> This was the
> 
> ---8<--------------------------------------------------------------------
> DROP FUNCTION public.get_result2[…]
> CREATE FUNCTION public.get_result2[…]
> ---8<--------------------------------------------------------------------
> 
> sequence triggering the error.
> 
>> 2) What do you see when you do?:
>>
>> select * from pg_proc where oid = 1821571;
> 
> Unfortunately, when I saw that restarting the cluster didn't solve the 
> issue, I re-installed it from a dump (which appears to be intact, btw), 
> but /without/ taking a copy of the original data folder.  Thus, this 
> information is lost, sorry.
> 
>> 3) What is the definition for retrieve_single_result()?
> 
> Ancient implementation, somewhat similar to get_result2(), but with less 
> security and different error checking approach:
> 
> ---8<--------------------------------------------------------------------
> CREATE FUNCTION public.retrieve_single_result(mytaskid bigint, OUT data 
> bytea, OUT metadata jsonb) RETURNS record
>      LANGUAGE plpgsql STABLE SECURITY DEFINER
>      AS $$
> DECLARE
>      errortext text;
> BEGIN
>      SELECT results.data, results.metadata, results.errortext FROM 
> results LEFT JOIN tasks ON tasks.resultid = results.resultid WHERE 
> taskid = mytaskid LIMIT 1 INTO data, metadata, errortext;
>      IF errortext IS NOT NULL THEN
>          RAISE EXCEPTION 'error result "%"', errortext;
>      END IF;
> END;
> $$;
> ---8<--------------------------------------------------------------------
> 
>> 4) What does the below show?:
>>
>> select proname, prosrc from pg_proc  where proname in 
>> ('retrieve_single_result', 'get_result2');
> 
> Same problem as with #2, sorry…

Alright, as Tom said upstream, there is not much that can be done with 
information missing. Hopefully it does not happen again, but if it does 
you know what needs to be captured.


> 
> Thanks,
> Albrecht.


-- 
Adrian Klaver
adrian.klaver@aklaver.com