Thread: SPI_ERROR_CONNECT within pl/pgsql, PG 8.4

SPI_ERROR_CONNECT within pl/pgsql, PG 8.4

From
Frank van Vugt
Date:
L.S.

After an upgrade to v8.4 one of our clients is experiencing heaps of problems,
they get errors like "ERROR:  SPI_connect failed: SPI_ERROR_CONNECT".


I refer to this earlier post, it looks a lot like it:
    http://archives.postgresql.org/pgsql-general/2009-07/msg00388.php
We too had no problem when running v8.3



Tom, you stated:
    Really?  Could we see a self-contained example?


We're working on that, but its very difficult. We have a way to reproduce it
in our application, but this works only in the production environment, not in
our development situation (the more users, the more faster the problem
appears). Some users are able to work 15 minutes without problems and then
they get errors. Restarting the app and thus the connection silences it again
for a while, etc.

Though a reproducable 'psql' testcase is not yet available, I do have a bit of
additional information that might serve as a clue to some:


* when they happen, the code triggering the errors seems to be fairly limited

* a small part are all triggered in before triggers and all are in a custom
function session_userid()

* the bulk part are triggered in both after as well as deferred triggers that
in turn call a couple of custom functions




_all_ of these custom function have the following structure in common:

CREATE OR REPLACE FUNCTION xxxxxxxxx()
    RETURNS int
    LANGUAGE 'plpgsql'
    IMMUTABLE
    STRICT
    SECURITY INVOKER
    AS '    DECLARE
            result INT := 0;
        BEGIN
            BEGIN
                SELECT id INTO STRICT result FROM xxxxxxxxx WHERE xxxxxxxxx;
                EXCEPTION
                    WHEN NO_DATA_FOUND THEN
                        RAISE EXCEPTION ''Unknown (%)...!!'', xxxxxxxxx
                    WHEN TOO_MANY_ROWS THEN
                        RAISE EXCEPTION ''More than one found (%)...!!'', xxxxxxxxx;
            END;
            RETURN result;
        END;';

=> mind the fact that it is marked 'immutable'

=> mind the fact that due to the exception usage, it contains an inner
begin/end block



Though the frequency / localisation of these errors seems too high to be a
coincidence, I do have to point out that these are not the only functions with
the same structure. I have a whole set of _status() and _type() functions that
look exactly the same, but who never come up in an error. However, this might
simply be due to the usage-pattern of my application.


Any clues as to how to gather additional information that might bring us
closer to a solution is appreciated also. I'd have no problem with applying
some patch as long as it's safe enough for a production environment ;)



--
Best,




Frank.

Re: SPI_ERROR_CONNECT within pl/pgsql, PG 8.4

From
Tom Lane
Date:
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> Any clues as to how to gather additional information that might bring us
> closer to a solution is appreciated also.

A stack trace from the point of the error would probably tell us a great
deal.  Maybe you could attach to a backend with gdb, set a breakpoint
at the failure return in SPI_connect(), and then provoke the error
manually?

            regards, tom lane

Re: SPI_ERROR_CONNECT within pl/pgsql, PG 8.4

From
Frank van Vugt
Date:
Hi Tom,

Op maandag 13 juli 2009, schreef Tom Lane:
> Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> > Any clues as to how to gather additional information that might bring us
> > closer to a solution is appreciated also.
>
> A stack trace from the point of the error would probably tell us a great
> deal.  Maybe you could attach to a backend with gdb, set a breakpoint
> at the failure return in SPI_connect(), and then provoke the error
> manually?

Just fyi, a breakpoint at SPI_connect with condition
    _SPI_curid != _SPI_connected

produced the following backtrace:

Program received signal SIGUSR2, User defined signal 2.
0x00002b539af2b5f5 in recv () from /lib64/libc.so.6
(gdb) bt
#0  0x00002b539af2b5f5 in recv () from /lib64/libc.so.6
#1  0x000000000054d692 in secure_read ()
#2  0x0000000000552c74 in pq_recvbuf ()
#3  0x0000000000553077 in pq_getbyte ()
#4  0x00000000005ce5f6 in PostgresMain ()
#5  0x00000000005a50fb in ServerLoop ()
#6  0x00000000005a5c2a in PostmasterMain ()
#7  0x000000000055498e in main ()


However, after continuing this did NOT give the SPI_connect error message, so
this probably is about something else completely?

We cannot reproduce the error anymore due to end of working hours, will try
again tomorrow morning (localtime).


More to follow.




--
Best,




Frank.

Re: SPI_ERROR_CONNECT within pl/pgsql, PG 8.4

From
Tom Lane
Date:
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> Just fyi, a breakpoint at SPI_connect with condition
>     _SPI_curid != _SPI_connected

Right, that's what to look for.

> produced the following backtrace:

> Program received signal SIGUSR2, User defined signal 2.
> 0x00002b539af2b5f5 in recv () from /lib64/libc.so.6
> (gdb) bt
> #0  0x00002b539af2b5f5 in recv () from /lib64/libc.so.6
> #1  0x000000000054d692 in secure_read ()
> #2  0x0000000000552c74 in pq_recvbuf ()
> #3  0x0000000000553077 in pq_getbyte ()
> #4  0x00000000005ce5f6 in PostgresMain ()
> #5  0x00000000005a50fb in ServerLoop ()
> #6  0x00000000005a5c2a in PostmasterMain ()
> #7  0x000000000055498e in main ()

This is a normal interbackend communication signal.  You need to
configure gdb to ignore SIGUSR2 (ie, pass it on and not stop execution).
Probably SIGUSR1 too.

            regards, tom lane

Re: SPI_ERROR_CONNECT within pl/pgsql, PG 8.4

From
Frank van Vugt
Date:
Hi,

> This is a normal interbackend communication signal.  You need to
> configure gdb to ignore SIGUSR2 (ie, pass it on and not stop execution).
> Probably SIGUSR1 too.

Done.


Here's the result:

Breakpoint 1, 0x000000000054aec0 in SPI_connect ()
(gdb) bt
#0 0x000000000054aec0 in SPI_connect ()
#1 0x00002afdffd243c2 in plpgsql_call_handler () from
/usr/local/pgsql/lib/plpgsql.so
#2 0x00000000005331f7 in ExecMakeFunctionResult ()
#3 0x0000000000534f1a in ExecEvalExprSwitchContext ()
#4 0x00000000005919e4 in evaluate_expr ()
#5 0x0000000000592e69 in simplify_function ()
#6 0x000000000059354e in eval_const_expressions_mutator ()
#7 0x0000000000556d70 in expression_tree_mutator ()
#8 0x0000000000593414 in eval_const_expressions_mutator ()
#9 0x00000000005934e2 in eval_const_expressions_mutator ()
#10 0x00000000005566bd in expression_tree_mutator ()
#11 0x000000000059332d in eval_const_expressions_mutator ()
#12 0x0000000000556d70 in expression_tree_mutator ()
#13 0x000000000059332d in eval_const_expressions_mutator ()
#14 0x000000000059436f in eval_const_expressions ()
#15 0x00000000005862a1 in preprocess_expression ()
#16 0x00000000005889cf in subquery_planner ()
#17 0x0000000000588f63 in standard_planner ()
#18 0x00000000005cc6b1 in pg_plan_query ()
#19 0x00000000005cc76e in pg_plan_queries ()
#20 0x000000000066a522 in RevalidateCachedPlan ()
#21 0x00002afdffd27e12 in exec_eval_expr () from
/usr/local/pgsql/lib/plpgsql.so
#22 0x00002afdffd28122 in exec_eval_boolean () from
/usr/local/pgsql/lib/plpgsql.so
#23 0x00002afdffd2d927 in exec_stmt_if () from /usr/local/pgsql/lib/plpgsql.so
#24 0x00002afdffd2c3b8 in exec_stmts () from /usr/local/pgsql/lib/plpgsql.so
#25 0x00002afdffd2bb70 in exec_stmt_block () from
/usr/local/pgsql/lib/plpgsql.so
#26 0x00002afdffd2e26f in plpgsql_exec_function () from
/usr/local/pgsql/lib/plpgsql.so
#27 0x00002afdffd2447e in plpgsql_call_handler () from
/usr/local/pgsql/lib/plpgsql.so
#28 0x00000000005331f7 in ExecMakeFunctionResult ()
#29 0x0000000000530157 in ExecProject ()
#30 0x0000000000541dd3 in ExecResult ()
#31 0x000000000052fb86 in ExecProcNode ()
#32 0x000000000052d99f in standard_ExecutorRun ()
#33 0x00000000005cfff4 in PortalRunSelect ()
#34 0x00000000005d11e0 in PortalRun ()
#35 0x00000000005ccfcb in exec_simple_query ()
#36 0x00000000005cdce4 in PostgresMain ()
#37 0x00000000005a50fb in ServerLoop ()
#38 0x00000000005a5c2a in PostmasterMain ()
#39 0x000000000055498e in main ()



Looking forward to your reply !




--
Best,




Frank.

Re: SPI_ERROR_CONNECT within pl/pgsql, PG 8.4

From
Frank van Vugt
Date:
Hi Marek,

> Could you please send me gdb
> commands - I could also provide the backtrace

Basically the flow is as follows:

- identify the backend that you expect to run in to this error by pid

- start gdb with the option "-p <pid>" to attach to this backend

- set a conditional breakpoint:
    break SPI_connect if _SPI_curid != _SPI_connected
gdb should confirm this with something like Breakpoint 1 at 0x54af60

- tell gdb to ignore SIGUSR1:
    handle SIGUSR1 nostop
gdb should confirm this with something like:
    Signal        Stop      Print   Pass to program Description
    SIGUSR1       No        Yes     Yes             User defined signal 1

- tell gdb to ignore SIGUSR2:
    handle SIGUSR2 nostop
gdb should confirm this with something like:
    Signal        Stop      Print   Pass to program Description
    SIGUSR2       No        Yes     Yes             User defined signal 2

- type 'c' to allow the backend to continue

- reproduce the problem in your application, the moment you do it will seem to
'freeze' (so you will _not_ see any error message you usually were seeing, at
least not yet), because gdb will have catched the situation and should now
show a prompt again

- type 'bt' to get a backtrace, just copy&paste&post ;)

You can quit gdb by a plain <ctrl>-c, just confirm when it asks you to detach.
The backend will continue to run, your app should now show the error.





--
Best,




Frank.

Re: SPI_ERROR_CONNECT within pl/pgsql, PG 8.4

From
Marek Lewczuk
Date:
2009/7/14 Frank van Vugt <ftm.van.vugt@foxi.nl>:
> Hi,
>
>> This is a normal interbackend communication signal. =A0You need to
>> configure gdb to ignore SIGUSR2 (ie, pass it on and not stop execution).
>> Probably SIGUSR1 too.
Frank,
thanks for letting me know about your post - I didn't have time yet to
read yesterday posts, but I'm glad that I'm not only one who is
experiencing that problem :-). Thanks for the backtrace - I hope that
Tom will find the reason of that error. Could you please send me gdb
commands - I could also provide the backtrace, so Tom would have wider
view of what is happening.

Marek

Re: SPI_ERROR_CONNECT within pl/pgsql, PG 8.4

From
Marek Lewczuk
Date:
2009/7/14 Frank van Vugt <ftm.van.vugt@foxi.nl>:
> Basically the flow is as follows:
> You can quit gdb by a plain <ctrl>-c, just confirm when it asks you to detach.
> The backend will continue to run, your app should now show the error.

Thanks.

Re: SPI_ERROR_CONNECT within pl/pgsql, PG 8.4

From
Tom Lane
Date:
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> [ backtrace ]

Hmph.  So evidently the unexpected recursion into SPI is happening when
a cached plan has to be recomputed.  I suspected something of the sort,
but now the question is why you didn't see the same problem in 8.3 ...

            regards, tom lane

Re: SPI_ERROR_CONNECT within pl/pgsql, PG 8.4

From
Frank van Vugt
Date:
Hi,

> Hmph.  So evidently the unexpected recursion into SPI is happening when
> a cached plan has to be recomputed.  I suspected something of the sort,
> but now the question is why you didn't see the same problem in 8.3 ...

Just as an additional confirmation: nothing else but the db changed when we
migrated from 8.3.7 to 8.4.0 last weekend

Anything else I can do to assist?




FYI:

- the production site today ran into this problem just over a thousand times
resulting in the same number of rollbacks and reconnects (since that solves it
for a little while again), so needless to say there's some additional
motivation to help ;)

- with relation to recomputation of cached plans: our usage patterns basically
do not include DDL, with the exception of temp tables; the functions that bail
are all using a single table each, the records of which are not changing in
any way

- we use before/after and deferred triggers, all in sql or pgsql; we use a
single after-trigger written in C for keeping track of history, which does its
own SPI_connect/finish calls, but as stated earlier, all of this was in place
a long time when we were running 8.3.7, though obviously the C-trigger has
been recompiled

- the distribution over triggers / functions that run into this error seems
uneven, basically they all boil down to a couple of custom functions, the
specifics of which I described earlier (same structure, immutable, use
exception thus an extra begin/end block, etc), but since we have loads of
other simular functions that don't come up, it's hard to see what would be
special here


Depending on the other idea's you might have, we could fairly easily
temporarily redefine one or more of the pgsql functions that suffer from this
error as plain sql functions, just to see what would happen.......?
(I'm reaching, I know ;) )




--
Best,




Frank.

Re: SPI_ERROR_CONNECT within pl/pgsql, PG 8.4

From
Tom Lane
Date:
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
>> Hmph.  So evidently the unexpected recursion into SPI is happening when
>> a cached plan has to be recomputed.  I suspected something of the sort,
>> but now the question is why you didn't see the same problem in 8.3 ...

> Just as an additional confirmation: nothing else but the db changed when we
> migrated from 8.3.7 to 8.4.0 last weekend

I'm convinced that 8.3 has the same bug, in the sense that it could fail
this way if it had to revalidate a cached plan.  Probably the reason you
didn't see it before is that 8.4 has more conditions in which it will
invalidate cached plans.  In particular, is it possible that this
failure occurs when you try to call a plpgsql function that has just
been replaced via CREATE OR REPLACE FUNCTION?

[ looks at CVS logs ... ]  Another case that will cause plan
invalidation in 8.4 and not 8.3 is creating or dropping a schema.
Do you do a lot of that?

            regards, tom lane

Re: SPI_ERROR_CONNECT within pl/pgsql, PG 8.4

From
Tom Lane
Date:
I think the attached patch will fix it for you.

            regards, tom lane

Index: src/backend/utils/cache/plancache.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/cache/plancache.c,v
retrieving revision 1.27
diff -c -r1.27 plancache.c
*** src/backend/utils/cache/plancache.c    11 Jun 2009 14:49:05 -0000    1.27
--- src/backend/utils/cache/plancache.c    14 Jul 2009 15:04:38 -0000
***************
*** 45,50 ****
--- 45,51 ----
  #include "access/transam.h"
  #include "catalog/namespace.h"
  #include "executor/executor.h"
+ #include "executor/spi.h"
  #include "nodes/nodeFuncs.h"
  #include "optimizer/planmain.h"
  #include "storage/lmgr.h"
***************
*** 502,509 ****
--- 503,521 ----
          {
              /*
               * Generate plans for queries.
+              *
+              * The planner may try to call SPI-using functions, which causes
+              * a problem if we're already inside one.  Rather than expect
+              * all SPI-using code to do SPI_push whenever a replan could
+              * happen, it seems best to take care of the case here.
               */
+             bool    pushed;
+
+             pushed = SPI_push_conditional();
+
              slist = pg_plan_queries(slist, plansource->cursor_options, NULL);
+
+             SPI_pop_conditional(pushed);
          }

          /*

Re: SPI_ERROR_CONNECT within pl/pgsql, PG 8.4

From
Frank van Vugt
Date:
Hi,

> I'm convinced that 8.3 has the same bug, in the sense that it could fail
> this way if it had to revalidate a cached plan.  Probably the reason you
> didn't see it before is that 8.4 has more conditions in which it will
> invalidate cached plans.

Ok

> In particular, is it possible that this
> failure occurs when you try to call a plpgsql function that has just
> been replaced via CREATE OR REPLACE FUNCTION?

Ah, you might be on to something here, we have a few patterns that match
that....

-----

When starting a certain wizard in our app, we need to keep track of complex
dependancies; in order to do this, we call a custom function called
'tree_amounts_blockade_setup', which basically executes the following steps:

- create temp table
- create index on that table
- create before trigger on that table based on an existing pgsql function
- create after trigger on that table based on another existing pgsql function

Now, if I look at the distribution of users running into this error today, it
looks like this:

      1 a
      1 b
      1 c
      1 d
      2 e
      3 f
      3 g
      3 h
      3 i
      4 j
      4 k
     10 l
     16 m
     16 n
     22 o
    120 p
    853 q

User 'q' is precisely the single user that would indeed use this 'wizard' the
most and therefor would call 'tree_amounts_blockade_setup' the most. It has to
be said though that the winner's score is a bit skewed, I verified this and it
turns out the specific user was happily ignoring all errors and continuing the
work (or at least trying to) without reconnecting ;)

-----

One other thing that comes to mind is the fact that we have functions that
when called should include certain temp tables when they exist. We can and do
check for existence of a temp table through something like

(SELECT 1 FROM information_schema.tables WHERE table_schema ~''pg_temp'' AND
table_name = ''t_creditor_address'')

but for this to work the temp schema has to exists. For this we 'cheat', every
user will create and drop a temp table once upon connecting to the database,
the function itself doublechecks and bails out IF NOT pg_my_temp_schema() > 0


> [ looks at CVS logs ... ]  Another case that will cause plan
> invalidation in 8.4 and not 8.3 is creating or dropping a schema.
> Do you do a lot of that?

Apart from the temp schema described above, we don't do that at all.


==========

I'd also like to mention still our C-trigger..... :

- it starts of with three plan variables containing '0' and it uses
SPI_saveplan() to fill these with a cached plan

- the three underlying tables will never change structure during such a
session, so I expected these variables / plans to never change

- if these static SPIPlanPtr variables are non-zero, the cached plans will be
reused in a SPI_execute_plan() call

Now the docs of SPI_saveplan say the following:

"If one of the objects (a table, function, etc.) referenced by the prepared
plan is dropped or redefined, then future executions of SPI_execute_plan may
fail or return different results than the plan initially indicates. "


Given the fact that 8.4 invalidates plans much more often, this might become
tricky..... the 'SPI_execute_plan may fail' part can be handled, but I'm a bit
worried about the 'may return different results' part..... Is there a way to
determine (efficiently) that such a save plan has been invalidated?

Or maybe for the future: would a kind of guarded plan pointer be handy...?







--
Best,




Frank.

Re: SPI_ERROR_CONNECT within pl/pgsql, PG 8.4

From
Frank van Vugt
Date:
Hi,

> I think the attached patch will fix it for you.

Great, thanks!

I'll apply this tonight, will get back with results tomorrow.





--
Best,




Frank.

Re: SPI_ERROR_CONNECT within pl/pgsql, PG 8.4

From
Alvaro Herrera
Date:
Tom Lane wrote:

> I'm convinced that 8.3 has the same bug, in the sense that it could fail
> this way if it had to revalidate a cached plan.  Probably the reason you
> didn't see it before is that 8.4 has more conditions in which it will
> invalidate cached plans.  In particular, is it possible that this
> failure occurs when you try to call a plpgsql function that has just
> been replaced via CREATE OR REPLACE FUNCTION?

Do we have the stuff where an ANALYZE could invalidate cached plans?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: SPI_ERROR_CONNECT within pl/pgsql, PG 8.4

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> [ looks at CVS logs ... ]  Another case that will cause plan
> invalidation in 8.4 and not 8.3 is creating or dropping a schema.
> Do you do a lot of that?

Hmm, he says he uses temporary tables. Does creating/dropping a
temporary namespace trigger plan cache invalidation?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: SPI_ERROR_CONNECT within pl/pgsql, PG 8.4

From
Tom Lane
Date:
Frank van Vugt <ftm.van.vugt@foxi.nl> writes:
> I'd also like to mention still our C-trigger..... :

> - it starts of with three plan variables containing '0' and it uses
> SPI_saveplan() to fill these with a cached plan

> - the three underlying tables will never change structure during such a
> session, so I expected these variables / plans to never change

> - if these static SPIPlanPtr variables are non-zero, the cached plans will be
> reused in a SPI_execute_plan() call

> Now the docs of SPI_saveplan say the following:

> "If one of the objects (a table, function, etc.) referenced by the prepared
> plan is dropped or redefined, then future executions of SPI_execute_plan may
> fail or return different results than the plan initially indicates. "


> Given the fact that 8.4 invalidates plans much more often, this might become
> tricky..... the 'SPI_execute_plan may fail' part can be handled, but I'm a bit
> worried about the 'may return different results' part..... Is there a way to
> determine (efficiently) that such a save plan has been invalidated?

You don't need to worry about that as long as you're not changing the
table's DDL in a way that would affect the query result.  It's entirely
possible for the plan to get invalidated and regenerated underneath you,
but that should be invisible to a SPI caller.

            regards, tom lane

Re: SPI_ERROR_CONNECT within pl/pgsql, PG 8.4

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> I'm convinced that 8.3 has the same bug, in the sense that it could fail
>> this way if it had to revalidate a cached plan.  Probably the reason you
>> didn't see it before is that 8.4 has more conditions in which it will
>> invalidate cached plans.  In particular, is it possible that this
>> failure occurs when you try to call a plpgsql function that has just
>> been replaced via CREATE OR REPLACE FUNCTION?

> Do we have the stuff where an ANALYZE could invalidate cached plans?

That was true in 8.3, too.  But it doesn't affect this particular code
path because exec_eval_simple_expr won't be used for anything that
references a table.  In 8.4 you could have a plan inval on an expression
that just uses a user-defined function, but it's a lot harder to trigger
such a thing in 8.3.  Which is why we didn't get reports before.

            regards, tom lane

Re: SPI_ERROR_CONNECT within pl/pgsql, PG 8.4

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> [ looks at CVS logs ... ]  Another case that will cause plan
>> invalidation in 8.4 and not 8.3 is creating or dropping a schema.
>> Do you do a lot of that?

> Hmm, he says he uses temporary tables. Does creating/dropping a
> temporary namespace trigger plan cache invalidation?

It could, but I suppose that most DBs would soon reach a steady state
where they have all the pg_temp_ schemas they need.  (Hadn't occurred to
me before, but this is a real good reason for the implementation not
discarding the actual schema when a temp namespace is recycled ...)

            regards, tom lane

Re: SPI_ERROR_CONNECT within pl/pgsql, PG 8.4

From
Marek Lewczuk
Date:
2009/7/14 Tom Lane <tgl@sss.pgh.pa.us>:
> Tom Lane wrote:
I've applied the patch - I hope it will fix our problem. I will ask
users of our application to do those things they did when the error
was thrown. Thanks a lot Tom.

ML

Re: SPI_ERROR_CONNECT within pl/pgsql, PG 8.4

From
Frank van Vugt
Date:
Hi Tom,

Op dinsdag 14 juli 2009, schreef Tom Lane:
> I think the attached patch will fix it for you.

Yep, after applying it yesterday evening, we didn't see this problem at all
today, so it definitely looks as if this patch nailed it.

Thanks for the [great|quick] support !




--
Best,




Frank.

Re: SPI_ERROR_CONNECT within pl/pgsql, PG 8.4

From
Marek Lewczuk
Date:
2009/7/15 Frank van Vugt <ftm.van.vugt@foxi.nl>:
> Hi Tom,
> Yep, after applying it yesterday evening, we didn't see this problem at all
> today, so it definitely looks as if this patch nailed it.
> Thanks for the [great|quick] support !
I confirm that too - no errors after applying the patch. Great work
Tom, thanks!!!!

ML