Thread: Runaway backend at 100% CPU PostgreSQL v8.3.5

Runaway backend at 100% CPU PostgreSQL v8.3.5

From
"Stephen R. van den Berg"
Date:
I'm running Debian PostgreSQL v8.3.5-1 on x86 in 32-bit mode.
Every once in a while, some backends start taking 100% CPU, as can be seen
below in the excerpt from the process table:

27256 ?        Ss     0:04 /usr/lib/postgresql/8.3/bin/postgres -D /var/lib/post
27299 ?        Ss     0:00  \_ postgres: writer process                        
27300 ?        Ss     0:00  \_ postgres: wal writer process                    
27301 ?        Ss     0:00  \_ postgres: autovacuum launcher process           
27302 ?        Ss     0:26  \_ postgres: stats collector process               
12076 ?        Rs   5681:50  \_ postgres: cms cms 10.0.0.5(59125) PARSE   
12853 ?        Rs   5549:25  \_ postgres: cms cms 10.0.0.5(48437) PARSE   
25985 ?        Ss     0:00  \_ postgres: cms cms 10.0.0.18(45807) idle    
25986 ?        Ss     0:00  \_ postgres: cms cms 10.0.0.18(45808) idle    

It seems that the backend is stuck in some kind of endless loop.  Since
it's a production Debian server, the backend is not compiled with debugging
turned on.  The best I can do is ltrace it, in hopes of someone recognising
the infinite sequence.
Ltracing the 12853 process reveals the following libc calls:

Label a:
[pid 12853] __sigsetjmp(0xbfec7000, 0, 0x852e740, 0xb5b3fc16, 0x85e9228) = 0
[pid 12853] strlen("ExecutorState")              = 13
[pid 12853] strcpy(0x84fb0e8, "ExecutorState")   = 0x84fb0e8
[pid 12853] malloc(8192)                         = 0x860da48
[pid 12853] strlen("ExprContext")                = 11
[pid 12853] strcpy(0x84fb060, "ExprContext")     = 0x84fb060
[pid 12853] strlen("ExprContext")                = 11
[pid 12853] strcpy(0x84fbda0, "ExprContext")     = 0x84fbda0
[pid 12853] strncpy(0x860e498, "parent", 64)     = 0x860e498
[pid 12853] strncpy(0x860e500, "sid", 64)        = 0x860e500
[pid 12853] strlen("ExprContext")                = 11
[pid 12853] strcpy(0x85cd810, "ExprContext")     = 0x85cd810
[pid 12853] malloc(8256)                         = 0x8600a28
[pid 12853] memmove(0x860e738, 0x860e600, 40, 0, 0x82eada9) = 0x860e738
[pid 12853] strncpy(0x860e7f0, "parent", 64)     = 0x860e7f0
[pid 12853] strncpy(0x860e858, "sid", 64)        = 0x860e858
[pid 12853] strlen("SPI TupTable")               = 12
[pid 12853] strcpy(0x85cd1b0, "SPI TupTable")    = 0x85cd1b0
[pid 12853] malloc(8192)                         = 0x860fa50
[pid 12853] memmove(0x860e738, 0x860e600, 40, 0x830fab4, 0x84fb088) = 0x860e738
[pid 12853] free(0x8600a28)                      = <void>
[pid 12853] free(0x860da48)                      = <void>
[pid 12853] free(0x860fa50)                      = <void>
[pid 12853] __sigsetjmp(0xbfec7000, 0, 0x852e740, 0xb5b3fc16, 0x85e9228) = 0
[pid 12853] strlen("ExecutorState")              = 13
[pid 12853] strcpy(0x85cd1b0, "ExecutorState")   = 0x85cd1b0
[pid 12853] malloc(8192)                         = 0x860da48
[pid 12853] strlen("ExprContext")                = 11
[pid 12853] strcpy(0x84fb0e8, "ExprContext")     = 0x84fb0e8
[pid 12853] strlen("ExprContext")                = 11
[pid 12853] strcpy(0x84fb060, "ExprContext")     = 0x84fb060
[pid 12853] strncpy(0x860e498, "parent", 64)     = 0x860e498
[pid 12853] strncpy(0x860e500, "sid", 64)        = 0x860e500
[pid 12853] strlen("ExprContext")                = 11
[pid 12853] strcpy(0x84fbda0, "ExprContext")     = 0x84fbda0
[pid 12853] malloc(8256)                         = 0x8600a28
[pid 12853] memmove(0x860e738, 0x860e600, 40, 0, 0x82eada9) = 0x860e738
[pid 12853] strncpy(0x860e7f0, "parent", 64)     = 0x860e7f0
[pid 12853] strncpy(0x860e858, "sid", 64)        = 0x860e858
[pid 12853] strlen("SPI TupTable")               = 12
[pid 12853] strcpy(0x85cd810, "SPI TupTable")    = 0x85cd810
[pid 12853] malloc(8192)                         = 0x860fa50
[pid 12853] memmove(0x860e738, 0x860e600, 40, 0x830fab4, 0x85cd150) = 0x860e738
[pid 12853] free(0x8600a28)                      = <void>
[pid 12853] free(0x860da48)                      = <void>
[pid 12853] free(0x860fa50)                      = <void>
[pid 12853] __sigsetjmp(0xbfec7000, 0, 0x852e740, 0xb5b3fc16, 0x85e9228) = 0
[pid 12853] strlen("ExecutorState")              = 13
[pid 12853] strcpy(0x85cd810, "ExecutorState")   = 0x85cd810
[pid 12853] malloc(8192)                         = 0x860da48
[pid 12853] strlen("ExprContext")                = 11
[pid 12853] strcpy(0x85cd1b0, "ExprContext")     = 0x85cd1b0
[pid 12853] strlen("ExprContext")                = 11
[pid 12853] strcpy(0x84fb0e8, "ExprContext")     = 0x84fb0e8
[pid 12853] strncpy(0x860e498, "parent", 64)     = 0x860e498
[pid 12853] strncpy(0x860e500, "sid", 64)        = 0x860e500
[pid 12853] strlen("ExprContext")                = 11
[pid 12853] strcpy(0x84fb060, "ExprContext")     = 0x84fb060
[pid 12853] malloc(8256)                         = 0x8600a28
[pid 12853] memmove(0x860e738, 0x860e600, 40, 0, 0x82eada9) = 0x860e738
[pid 12853] strncpy(0x860e7f0, "parent", 64)     = 0x860e7f0
[pid 12853] strncpy(0x860e858, "sid", 64)        = 0x860e858
[pid 12853] strlen("SPI TupTable")               = 12
[pid 12853] strcpy(0x84fbda0, "SPI TupTable")    = 0x84fbda0
[pid 12853] malloc(8192)                         = 0x860fa50
[pid 12853] memmove(0x860e738, 0x860e600, 40, 0x830fab4, 0x85cd7b0) = 0x860e738
[pid 12853] free(0x8600a28)                      = <void>
[pid 12853] free(0x860da48)                      = <void>
[pid 12853] free(0x860fa50)                      = <void>
[pid 12853] __sigsetjmp(0xbfec7000, 0, 0x852e740, 0xb5b3fc16, 0x85e9228) = 0
[pid 12853] strlen("ExecutorState")              = 13
[pid 12853] strcpy(0x84fbda0, "ExecutorState")   = 0x84fbda0
[pid 12853] malloc(8192)                         = 0x860da48
[pid 12853] strlen("ExprContext")                = 11
[pid 12853] strcpy(0x85cd810, "ExprContext")     = 0x85cd810
[pid 12853] strlen("ExprContext")                = 11
[pid 12853] strcpy(0x85cd1b0, "ExprContext")     = 0x85cd1b0
[pid 12853] strncpy(0x860e498, "parent", 64)     = 0x860e498
[pid 12853] strncpy(0x860e500, "sid", 64)        = 0x860e500
[pid 12853] strlen("ExprContext")                = 11
[pid 12853] strcpy(0x84fb0e8, "ExprContext")     = 0x84fb0e8
[pid 12853] malloc(8256)                         = 0x8600a28
[pid 12853] memmove(0x860e738, 0x860e600, 40, 0, 0x82eada9) = 0x860e738
[pid 12853] strncpy(0x860e7f0, "parent", 64)     = 0x860e7f0
[pid 12853] strncpy(0x860e858, "sid", 64)        = 0x860e858
[pid 12853] strlen("SPI TupTable")               = 12
[pid 12853] strcpy(0x84fb060, "SPI TupTable")    = 0x84fb060
[pid 12853] malloc(8192)                         = 0x860fa50
[pid 12853] memmove(0x860e738, 0x860e600, 40, 0x830fab4, 0x84fbd40) = 0x860e738
[pid 12853] free(0x8600a28)                      = <void>
[pid 12853] free(0x860da48)                      = <void>
[pid 12853] free(0x860fa50)                      = <void>
[pid 12853] __sigsetjmp(0xbfec7000, 0, 0x852e740, 0xb5b3fc16, 0x85e9228) = 0
[pid 12853] strlen("ExecutorState")              = 13
[pid 12853] strcpy(0x84fb060, "ExecutorState")   = 0x84fb060
[pid 12853] malloc(8192)                         = 0x860da48
[pid 12853] strlen("ExprContext")                = 11
[pid 12853] strcpy(0x84fbda0, "ExprContext")     = 0x84fbda0
[pid 12853] strlen("ExprContext")                = 11
[pid 12853] strcpy(0x85cd810, "ExprContext")     = 0x85cd810
[pid 12853] strncpy(0x860e498, "parent", 64)     = 0x860e498
[pid 12853] strncpy(0x860e500, "sid", 64)        = 0x860e500
[pid 12853] strlen("ExprContext")                = 11
[pid 12853] strcpy(0x85cd1b0, "ExprContext")     = 0x85cd1b0
[pid 12853] malloc(8256)                         = 0x8600a28
[pid 12853] memmove(0x860e738, 0x860e600, 40, 0, 0x82eada9) = 0x860e738
[pid 12853] strncpy(0x860e7f0, "parent", 64)     = 0x860e7f0
[pid 12853] strncpy(0x860e858, "sid", 64)        = 0x860e858
[pid 12853] strlen("SPI TupTable")               = 12
[pid 12853] strcpy(0x84fb0e8, "SPI TupTable")    = 0x84fb0e8
[pid 12853] malloc(8192)                         = 0x860fa50
[pid 12853] memmove(0x860e738, 0x860e600, 40, 0x830fab4, 0x84fb000) = 0x860e738
[pid 12853] free(0x8600a28)                      = <void>
[pid 12853] free(0x860da48)                      = <void>
[pid 12853] free(0x860fa50)                      = <void>
Jump back to label a:, repeat ad infinitum, chewing up 100% CPU

Below a complete process list of all running backends:

27256 ?        Ss     0:04 /usr/lib/postgresql/8.3/bin/postgres -D /var/lib/post
27299 ?        Ss     0:00  \_ postgres: writer process                        
27300 ?        Ss     0:00  \_ postgres: wal writer process                    
27301 ?        Ss     0:00  \_ postgres: autovacuum launcher process           
27302 ?        Ss     0:26  \_ postgres: stats collector process               
12076 ?        Rs   5681:50  \_ postgres: cms cms 10.0.0.5(59125) PARSE   
12853 ?        Ts   5549:25  \_ postgres: cms cms 10.0.0.5(48437) PARSE   
25985 ?        Ss     0:00  \_ postgres: cms cms 10.0.0.18(45807) idle    
25986 ?        Ss     0:00  \_ postgres: cms cms 10.0.0.18(45808) idle    
21100 ?        Ss     0:00  \_ postgres: cms cms 10.0.0.12(45544) idle    
21101 ?        Ss     0:00  \_ postgres: cms cms 10.0.0.12(45545) idle    
26777 ?        Ss     0:00  \_ postgres: cms cms 10.0.0.12(43123) idle    
24711 ?        Ss     0:00  \_ postgres: cms cms 10.0.0.5(34684) idle     
24715 ?        Ss     0:00  \_ postgres: cms cms 10.0.0.5(34687) idle     
24716 ?        Ss     0:00  \_ postgres: cms cms 10.0.0.5(34688) idle     
24721 ?        Ss     0:00  \_ postgres: cms cms 10.0.0.5(34693) idle     8355 ?        Ss     0:01  \_ postgres: cms
cms10.0.0.5(59188) idle     
 
13504 ?        Ss     0:05  \_ postgres: cms cms 10.0.0.5(35244) idle     
17568 ?        Ss     0:01  \_ postgres: cms cms 10.0.0.5(50414) idle     
17615 ?        Ss     0:02  \_ postgres: cms cms 10.0.0.5(42313) idle     
17617 ?        Ss     0:01  \_ postgres: cms cms 10.0.0.5(42315) idle     
20412 ?        Ss     0:00  \_ postgres: cms cms 10.0.0.5(43902) idle     
20425 ?        Ss     0:00  \_ postgres: cms cms 10.0.0.5(43905) idle     
20427 ?        Ss     0:00  \_ postgres: cms cms 10.0.0.5(43907) idle     
20428 ?        Ss     0:00  \_ postgres: cms cms 10.0.0.5(43908) idle     
20429 ?        Ss     0:00  \_ postgres: cms cms 10.0.0.5(43909) idle     
20430 ?        Ss     0:00  \_ postgres: cms cms 10.0.0.5(43910) idle     
20431 ?        Ss     0:00  \_ postgres: cms cms 10.0.0.5(43911) idle     

-- 
Sincerely,          Stephen R. van den Berg.

"Very funny, Mr. Scott. Now beam down my clothes!"


Re: Runaway backend at 100% CPU PostgreSQL v8.3.5

From
Tom Lane
Date:
"Stephen R. van den Berg" <srb@cuci.nl> writes:
> It seems that the backend is stuck in some kind of endless loop.  Since
> it's a production Debian server, the backend is not compiled with debugging
> turned on.  The best I can do is ltrace it, in hopes of someone recognising
> the infinite sequence.

Well, it seems to be repeatedly executing a SPI call, which suggests but
doesn't prove that it's in a PL function ... but how would it get into
one while in PARSE state?

Anyway there's not much data here.  You might try attaching to the
backend process with gdb and trying "bt" and "p debug_query_string";
I think you should get something out of that even without debug symbols.
If not, maybe turning on query logging would be worthwhile.
        regards, tom lane


Re: Runaway backend at 100% CPU PostgreSQL v8.3.5

From
"Stephen R. van den Berg"
Date:
Tom Lane wrote:
>"Stephen R. van den Berg" <srb@cuci.nl> writes:
>> It seems that the backend is stuck in some kind of endless loop.  Since
>> it's a production Debian server, the backend is not compiled with debugging
>> turned on.  The best I can do is ltrace it, in hopes of someone recognising
>> the infinite sequence.

>Well, it seems to be repeatedly executing a SPI call, which suggests but
>doesn't prove that it's in a PL function ... but how would it get into
>one while in PARSE state?

>Anyway there's not much data here.  You might try attaching to the
>backend process with gdb and trying "bt" and "p debug_query_string";

Interesting, this is what I get:
(gdb) bt
#0  0xb5b3fd54 in ?? () from /usr/lib/postgresql/8.3/lib/plpgsql.so
#1  0x085e9228 in ?? ()
#2  0x00000048 in ?? ()
#3  0xbfec723b in ?? ()
#4  0x083cdb24 in ?? ()
#5  0xbfec74b4 in ?? ()
#6  0x0840d858 in ?? ()
#7  0xbfec6ff8 in ?? ()
#8  0x082e5102 in RevalidateCachedPlan ()
#9  0xb5b4077c in ?? () from /usr/lib/postgresql/8.3/lib/plpgsql.so
#10 0x00000017 in ?? ()
#11 0x085e9248 in ?? ()
#12 0x085e9240 in ?? ()
#13 0x085e9244 in ?? ()
#14 0xccdf2081 in ?? ()
#15 0x2066f5ee in ?? ()
#16 0x00000000 in ?? ()
(gdb) p debug_query_string
$1 = 138920681
(gdb) p (char*)debug_query_string
$2 = 0x847c2e9 "SELECT csi.sid,csi.name\n  FROM cmssite AS csi\n  WHERE csi.sid=cmsfindsid('506435')\n  LIMIT 1;"

Whereas:

CREATE OR REPLACE FUNCTIONcmsfindsid(INT) RETURNS INT AS '
DECLAREvfrom ALIAS FOR $1;vcsid INT;vold INT;vsid INT;
BEGINvcsid:=vfrom;LOOP vold:=vcsid; SELECT INTO vcsid,vsid parent,sid  FROM cmsstruct  WHERE csid=vcsid  LIMIT 1; EXIT
WHENvsid IS NOT NULL OR vcsid=vold;END LOOP;RETURN vsid;
 
END;' LANGUAGE PLPGSQL STABLE STRICT;

Where cmsstruct is a table where csid is a PRIMARY KEY INT column.

And, no, there have not been any function or table redefinitions.
-- 
Sincerely,          Stephen R. van den Berg.

"Very funny, Mr. Scott. Now beam down my clothes!"


Re: Runaway backend at 100% CPU PostgreSQL v8.3.5

From
Tom Lane
Date:
"Stephen R. van den Berg" <srb@cuci.nl> writes:
> BEGIN
>  vcsid:=vfrom;
>  LOOP
>   vold:=vcsid;
>   SELECT INTO vcsid,vsid parent,sid
>    FROM cmsstruct
>    WHERE csid=vcsid
>    LIMIT 1;
>   EXIT WHEN vsid IS NOT NULL OR vcsid=vold;
>  END LOOP;
>  RETURN vsid;
> END;

And are you certain that's never an infinite loop?  The query string
gives a starting point that one suspects doesn't terminate.
        regards, tom lane


Re: Runaway backend at 100% CPU PostgreSQL v8.3.5

From
"Stephen R. van den Berg"
Date:
Tom Lane wrote:
>"Stephen R. van den Berg" <srb@cuci.nl> writes:
>> BEGIN
>>  vcsid:=vfrom;
>>  LOOP

>>   EXIT WHEN vsid IS NOT NULL OR vcsid=vold;
>>  END LOOP;
>>  RETURN vsid;
>> END;

>And are you certain that's never an infinite loop?  The query string
>gives a starting point that one suspects doesn't terminate.

Indeed, you're right.  It starts at a deleted item due to a race condition.
Thanks, sorry for the noise.
-- 
Sincerely,          Stephen R. van den Berg.

"Very funny, Mr. Scott. Now beam down my clothes!"