Thread: pl/pgsql function spikes CPU 100%

pl/pgsql function spikes CPU 100%

From
Jeff Frost
Date:
I've got a client that has a function in a db which had been humming along
quite nicely on 2xOpteron 275, PG 8.1.5, 8GB of RAM.  Now suddenly many of the
functions in the DB if called will spike the CPU to 100%.  These are functions
that used to finish in 7ms, now run for 20-40 mins.  Interestingly, when you
strace the backend, it doesn't appear to be doing too much...here's some
sample output:

select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
semop(3932217, 0x7fbfffd150, 1)         = 0
semop(3932217, 0x7fbfffd150, 1)         = 0
semop(3932217, 0x7fbfffd150, 1)         = 0
semop(3932217, 0x7fbfffd150, 1)         = 0
semop(3932217, 0x7fbfffd150, 1)         = 0
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
semop(3997755, 0x7fbfffd170, 1)         = 0
semop(3932217, 0x7fbfffd150, 1)         = 0

Any chance we've stumbled into some corner case bug?  We actually moved the DB
to a different server thinking perhaps we had gotten to the limit of slow
hardware, but in fact it happens on the other server as well.

I don't see any ungranted locks in pg_locks, nor are there any other non idle
queries this time of the night.

I'll see if I can share the function code tomorrow when people are awake again
in case we have something especially silly in there.


--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: pl/pgsql function spikes CPU 100%

From
"Shoaib Mir"
Date:
Are the stat collector and autovacuum processes in good shape?

--
Shoaib Mir
EnterpriseDB (www.enterprisedb.com)

On 3/16/07, Jeff Frost <jeff@frostconsultingllc.com> wrote:
I've got a client that has a function in a db which had been humming along
quite nicely on 2xOpteron 275, PG 8.1.5, 8GB of RAM.  Now suddenly many of the
functions in the DB if called will spike the CPU to 100%.  These are functions
that used to finish in 7ms, now run for 20-40 mins.  Interestingly, when you
strace the backend, it doesn't appear to be doing too much...here's some
sample output:

select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
semop(3932217, 0x7fbfffd150, 1)         = 0
semop(3932217, 0x7fbfffd150, 1)         = 0
semop(3932217, 0x7fbfffd150, 1)         = 0
semop(3932217, 0x7fbfffd150, 1)         = 0
semop(3932217, 0x7fbfffd150, 1)         = 0
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
semop(3997755, 0x7fbfffd170, 1)         = 0
semop(3932217, 0x7fbfffd150, 1)         = 0

Any chance we've stumbled into some corner case bug?  We actually moved the DB
to a different server thinking perhaps we had gotten to the limit of slow
hardware, but in fact it happens on the other server as well.

I don't see any ungranted locks in pg_locks, nor are there any other non idle
queries this time of the night.

I'll see if I can share the function code tomorrow when people are awake again
in case we have something especially silly in there.


--
Jeff Frost, Owner       < jeff@frostconsultingllc.com>
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908     FAX: 650-649-1954

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

                 http://www.postgresql.org/about/donate

Re: pl/pgsql function spikes CPU 100%

From
Jeff Frost
Date:
Everything else except the one postmaster process hum along just fine.  I.e.
nothing else appears to take much system resources at all. Autovac is set with
the 8.2.x default settings.  Oh, and the data was ANALYZE'd after it got moved
to the new server. Here's the code in case we have something especially silly
going on:

CREATE OR REPLACE FUNCTION populate_page_view_indices()
   RETURNS integer AS
$BODY$
DECLARE
     v_page_view_row RECORD;
     v_cindex INTEGER;
     v_tindex INTEGER;
     v_visit_id BIGINT;
     v_get BOOLEAN;
     v_row_count INTEGER;
BEGIN
     RAISE NOTICE 'Populating page_view indices ...';

UPDATE visit SET status = 'H'
         FROM (SELECT visit_id
                   FROM page_view p, visit v
                   WHERE p.visit_id = v.id
                     AND v.status = 'N'
                   GROUP BY visit_id
                   HAVING max(p.stamp) < now() - INTERVAL '1 hour') AS ready
         WHERE visit.id = ready.visit_id
           AND visit.status = 'N';

     v_cindex := -1;
     v_tindex := -1;
     v_visit_id := -1;
     FOR v_page_view_row IN
         SELECT p.* FROM page_view p, visit v
             WHERE p.visit_id = v.id
               AND v.status = 'H'
             ORDER BY visit_id, p.stamp LOOP
         v_get := (v_page_view_row.method = 'GET');
         IF (v_visit_id != v_page_view_row.visit_id) THEN
             v_visit_id := v_page_view_row.visit_id;
             v_tindex := 1;
             v_cindex := 1;
         ELSE
             v_tindex := v_tindex + 1;
             IF v_get THEN
                 v_cindex := v_cindex + 1;
             END IF;
         END IF;

         UPDATE page_view
             SET tindex = v_tindex,
                 cindex = CASE WHEN v_get THEN v_cindex ELSE -1 END
             WHERE id = v_page_view_row.id;
     END LOOP;

     UPDATE visit SET status = 'I'
         WHERE status = 'H';
     GET DIAGNOSTICS v_row_count = ROW_COUNT;

     RAISE NOTICE 'Done populating page_view indices ...';
     RETURN v_row_count;
END;
$BODY$
   LANGUAGE 'plpgsql' VOLATILE;
ALTER FUNCTION populate_page_view_indices() OWNER TO postgres;

And the other that seems to tickle the problem:

CREATE OR REPLACE FUNCTION populate_page_view_clickstreams()
   RETURNS integer AS
$BODY$
DECLARE
     v_row_count INTEGER;
BEGIN
     RAISE NOTICE 'Populating page_view clickstreams ...';
     UPDATE visit SET status = 'K'
         WHERE status = 'I';

     INSERT INTO tlink (id, from_id, to_id)
         SELECT nextval('hibernate_sequence'),
                f.id AS from_id,
                t.id AS to_id
             FROM page_view f, page_view t, visit v
             WHERE f.visit_id = t.visit_id
               AND f.visit_id = v.id
               AND v.status = 'K'
               AND f.tindex = t.tindex - 1
             ORDER BY f.visit_id, f.tindex;

     INSERT INTO clink (id, from_id, to_id)
         SELECT nextval('hibernate_sequence'),
                f.id AS from_id,
                t.id AS to_id
             FROM page_view f, page_view t, visit v
             WHERE f.visit_id = t.visit_id
               AND f.visit_id = v.id
               AND v.status = 'K'
               AND f.cindex = t.cindex - 1
             ORDER BY f.visit_id, f.cindex;

     UPDATE visit SET status = 'L'
         WHERE status = 'K';
     GET DIAGNOSTICS v_row_count = ROW_COUNT;
     RAISE NOTICE 'Done populating page_view clickstreams ...';
     RETURN v_row_count;
END;
$BODY$
   LANGUAGE 'plpgsql' VOLATILE;
ALTER FUNCTION populate_page_view_clickstreams() OWNER TO postgres;


On Fri, 16 Mar 2007, Shoaib Mir wrote:

> Are the stat collector and autovacuum processes in good shape?
>
> --
> Shoaib Mir
> EnterpriseDB (www.enterprisedb.com)
>
> On 3/16/07, Jeff Frost <jeff@frostconsultingllc.com> wrote:
>>
>> I've got a client that has a function in a db which had been humming along
>> quite nicely on 2xOpteron 275, PG 8.1.5, 8GB of RAM.  Now suddenly many of
>> the
>> functions in the DB if called will spike the CPU to 100%.  These are
>> functions
>> that used to finish in 7ms, now run for 20-40 mins.  Interestingly, when
>> you
>> strace the backend, it doesn't appear to be doing too much...here's some
>> sample output:
>>
>> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
>> semop(3932217, 0x7fbfffd150, 1)         = 0
>> semop(3932217, 0x7fbfffd150, 1)         = 0
>> semop(3932217, 0x7fbfffd150, 1)         = 0
>> semop(3932217, 0x7fbfffd150, 1)         = 0
>> semop(3932217, 0x7fbfffd150, 1)         = 0
>> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
>> semop(3997755, 0x7fbfffd170, 1)         = 0
>> semop(3932217, 0x7fbfffd150, 1)         = 0
>>
>> Any chance we've stumbled into some corner case bug?  We actually moved
>> the DB
>> to a different server thinking perhaps we had gotten to the limit of slow
>> hardware, but in fact it happens on the other server as well.
>>
>> I don't see any ungranted locks in pg_locks, nor are there any other non
>> idle
>> queries this time of the night.
>>
>> I'll see if I can share the function code tomorrow when people are awake
>> again
>> in case we have something especially silly in there.
>>
>>
>> --
>> Jeff Frost, Owner       <jeff@frostconsultingllc.com>
>> Frost Consulting, LLC   http://www.frostconsultingllc.com/
>> Phone: 650-780-7908     FAX: 650-649-1954
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 7: You can help support the PostgreSQL project by donating at
>>
>>                 http://www.postgresql.org/about/donate
>>
>

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: pl/pgsql function spikes CPU 100%

From
Tom Lane
Date:
Jeff Frost <jeff@frostconsultingllc.com> writes:
> ... Interestingly, when you
> strace the backend, it doesn't appear to be doing too much...here's some
> sample output:

> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
> semop(3932217, 0x7fbfffd150, 1)         = 0
> semop(3932217, 0x7fbfffd150, 1)         = 0
> semop(3932217, 0x7fbfffd150, 1)         = 0
> semop(3932217, 0x7fbfffd150, 1)         = 0
> semop(3932217, 0x7fbfffd150, 1)         = 0
> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
> semop(3997755, 0x7fbfffd170, 1)         = 0
> semop(3932217, 0x7fbfffd150, 1)         = 0

This looks suspiciously like the sort of trace we saw in the various
"context swap storm" threads.  The test cases for those generally
involved really tight indexscan loops, ie, the backends were spending
all their time trying to access shared buffers.  If you haven't changed
the function or the data, then I concur with the nearby worry about
autovacuuming (large buildup of dead tuples could result in this symptom).
Or maybe you've got an old open transaction that is blocking cleanup?

            regards, tom lane

Re: pl/pgsql function spikes CPU 100%

From
Jeff Frost
Date:
On Fri, 16 Mar 2007, Tom Lane wrote:

> Jeff Frost <jeff@frostconsultingllc.com> writes:
>> ... Interestingly, when you
>> strace the backend, it doesn't appear to be doing too much...here's some
>> sample output:
>
>> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
>> semop(3932217, 0x7fbfffd150, 1)         = 0
>> semop(3932217, 0x7fbfffd150, 1)         = 0
>> semop(3932217, 0x7fbfffd150, 1)         = 0
>> semop(3932217, 0x7fbfffd150, 1)         = 0
>> semop(3932217, 0x7fbfffd150, 1)         = 0
>> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
>> semop(3997755, 0x7fbfffd170, 1)         = 0
>> semop(3932217, 0x7fbfffd150, 1)         = 0
>
> This looks suspiciously like the sort of trace we saw in the various
> "context swap storm" threads.  The test cases for those generally
> involved really tight indexscan loops, ie, the backends were spending
> all their time trying to access shared buffers.  If you haven't changed
> the function or the data, then I concur with the nearby worry about
> autovacuuming (large buildup of dead tuples could result in this symptom).
> Or maybe you've got an old open transaction that is blocking cleanup?

Tom,

I doubt it's a problem with autovacuum as the data in this server was just
loaded an hour before the strace above was taken, so there should have been
almost no dead tuples, especially since these tables are nearly write once.
I.e. they get written to once, then the populate function updates them, then
months later they get archived off.

There did not appear to be high context switch activity nor any IO wait to
mention during the time I was watching the postmaster.  If it's worth
mentioning, it's running CentOS 4.4 with the kernel-2.6.9-34.EL kernel.

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: pl/pgsql function spikes CPU 100%

From
Jeff Frost
Date:
On Fri, 16 Mar 2007, Jeff Frost wrote:

> On Fri, 16 Mar 2007, Tom Lane wrote:
>
>> Jeff Frost <jeff@frostconsultingllc.com> writes:
>>> ... Interestingly, when you
>>> strace the backend, it doesn't appear to be doing too much...here's some
>>> sample output:
>>
>>> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
>>> semop(3932217, 0x7fbfffd150, 1)         = 0
>>> semop(3932217, 0x7fbfffd150, 1)         = 0
>>> semop(3932217, 0x7fbfffd150, 1)         = 0
>>> semop(3932217, 0x7fbfffd150, 1)         = 0
>>> semop(3932217, 0x7fbfffd150, 1)         = 0
>>> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
>>> semop(3997755, 0x7fbfffd170, 1)         = 0
>>> semop(3932217, 0x7fbfffd150, 1)         = 0
>>
>> This looks suspiciously like the sort of trace we saw in the various
>> "context swap storm" threads.  The test cases for those generally
>> involved really tight indexscan loops, ie, the backends were spending
>> all their time trying to access shared buffers.  If you haven't changed
>> the function or the data, then I concur with the nearby worry about
>> autovacuuming (large buildup of dead tuples could result in this symptom).
>> Or maybe you've got an old open transaction that is blocking cleanup?
>
> Tom,
>
> I doubt it's a problem with autovacuum as the data in this server was just
> loaded an hour before the strace above was taken, so there should have been
> almost no dead tuples, especially since these tables are nearly write once.
> I.e. they get written to once, then the populate function updates them, then
> months later they get archived off.
>
> There did not appear to be high context switch activity nor any IO wait to
> mention during the time I was watching the postmaster.  If it's worth
> mentioning, it's running CentOS 4.4 with the kernel-2.6.9-34.EL kernel.

Oh, and I meant to mention that this query was the only thing in
pg_stat_activity during the painful time it was running, and there were no
ungranted locks in pg_locks.

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: pl/pgsql function spikes CPU 100%

From
Jeff Frost
Date:
On Fri, 16 Mar 2007, Jeff Frost wrote:

> On Fri, 16 Mar 2007, Jeff Frost wrote:
>
>> On Fri, 16 Mar 2007, Tom Lane wrote:
>>
>>> Jeff Frost <jeff@frostconsultingllc.com> writes:
>>>> ... Interestingly, when you
>>>> strace the backend, it doesn't appear to be doing too much...here's some
>>>> sample output:
>>>
>>>> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
>>>> semop(3932217, 0x7fbfffd150, 1)         = 0
>>>> semop(3932217, 0x7fbfffd150, 1)         = 0
>>>> semop(3932217, 0x7fbfffd150, 1)         = 0
>>>> semop(3932217, 0x7fbfffd150, 1)         = 0
>>>> semop(3932217, 0x7fbfffd150, 1)         = 0
>>>> select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
>>>> semop(3997755, 0x7fbfffd170, 1)         = 0
>>>> semop(3932217, 0x7fbfffd150, 1)         = 0
>>>
>>> This looks suspiciously like the sort of trace we saw in the various
>>> "context swap storm" threads.  The test cases for those generally
>>> involved really tight indexscan loops, ie, the backends were spending
>>> all their time trying to access shared buffers.  If you haven't changed
>>> the function or the data, then I concur with the nearby worry about
>>> autovacuuming (large buildup of dead tuples could result in this symptom).
>>> Or maybe you've got an old open transaction that is blocking cleanup?
>>
>> Tom,
>>
>> I doubt it's a problem with autovacuum as the data in this server was just
>> loaded an hour before the strace above was taken, so there should have been
>> almost no dead tuples, especially since these tables are nearly write once.
>> I.e. they get written to once, then the populate function updates them,
>> then months later they get archived off.
>>
>> There did not appear to be high context switch activity nor any IO wait to
>> mention during the time I was watching the postmaster.  If it's worth
>> mentioning, it's running CentOS 4.4 with the kernel-2.6.9-34.EL kernel.
>
> Oh, and I meant to mention that this query was the only thing in
> pg_stat_activity during the painful time it was running, and there were no
> ungranted locks in pg_locks.

Well, finally got this system upgraded to 8.1.8, but unfortunately, that did
not resolve this.  Is there any reasonable way to see where this function is
spending it's time?  The on disk size of the tables in question are about
500MB each including the indexes.  The machine in question has 8GB of RAM and
the total DB working set size is about 7GB.

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: pl/pgsql function spikes CPU 100%

From
Tom Lane
Date:
Jeff Frost <jeff@frostconsultingllc.com> writes:
> Well, finally got this system upgraded to 8.1.8, but unfortunately, that did
> not resolve this.  Is there any reasonable way to see where this function is
> spending it's time?

I've grown enamored of oprofile lately --- if you're on a recent Linux
system, try that.  Solaris' DTrace is said to have equivalent
capability.  On other systems you could consider plain 'ol gprof.

Any of these will require a certain amount of man-page-reading :-(

            regards, tom lane

Re: pl/pgsql function spikes CPU 100%

From
Jeff Frost
Date:
On Tue, 24 Apr 2007, Tom Lane wrote:

> Jeff Frost <jeff@frostconsultingllc.com> writes:
>> Well, finally got this system upgraded to 8.1.8, but unfortunately, that did
>> not resolve this.  Is there any reasonable way to see where this function is
>> spending it's time?
>
> I've grown enamored of oprofile lately --- if you're on a recent Linux
> system, try that.  Solaris' DTrace is said to have equivalent
> capability.  On other systems you could consider plain 'ol gprof.
>
> Any of these will require a certain amount of man-page-reading :-(

Got my oprofile man page reading done.  Here's the general opreport:

CPU: P4 / Xeon with 2 hyper-threads, speed 3196.26 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped)
with a unit mask of 0x01 (mandatory) count 100000
GLOBAL_POWER_E...|
   samples|      %|
------------------
  26902279 96.3837 postgres
    612276  2.1936 vmlinux
    172733  0.6189 oprofiled
     94914  0.3401 libc-2.3.4.so
     54608  0.1956 oprofile
     22887  0.0820 opreport
     10811  0.0387 jbd
      7792  0.0279 ext3
      7673  0.0275 ld-2.3.4.so
      6143  0.0220 libstdc++.so.6.0.3
      4336  0.0155 e1000
      3253  0.0117 ip_conntrack
      2305  0.0083 ip_tables
      1957  0.0070 libcrypto.so.0.9.7a
      1275  0.0046 hald
      1066  0.0038 uhci_hcd
       859  0.0031 iptable_nat

And the postgres one:

CPU: P4 / Xeon with 2 hyper-threads, speed 3196.26 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped)
with a unit mask of 0x01 (mandatory) count 100000
samples  %        symbol name
3894022  12.6488  LWLockAcquire
3535497  11.4842  slot_deform_tuple
3030280   9.8431  LWLockRelease
2279699   7.4050  HeapTupleSatisfiesSnapshot
1782097   5.7887  ExecMakeFunctionResultNoSets
1209262   3.9280  ExecEvalScalarVar
1128461   3.6655  heap_release_fetch
916447    2.9768  LockBuffer
898437    2.9183  _bt_checkkeys
887904    2.8841  btgettuple
728377    2.3660  slot_getattr
648121    2.1053  hash_search
608352    1.9761  _bt_restscan
574699    1.8668  index_getnext
570999    1.8547  _bt_step
517225    1.6801  FunctionCall2
501845    1.6301  _bt_next
452142    1.4687  IndexNext
426455    1.3852  PinBuffer

Is any of that helpful?  This is the 2.6.9-34.ELsmp centos kernel.

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: pl/pgsql function spikes CPU 100%

From
Tom Lane
Date:
Jeff Frost <jeff@frostconsultingllc.com> writes:
> Got my oprofile man page reading done.  Here's the general opreport:
> ...
> And the postgres one:
> samples  %        symbol name
> 3894022  12.6488  LWLockAcquire
> 3535497  11.4842  slot_deform_tuple
> 3030280   9.8431  LWLockRelease
> 2279699   7.4050  HeapTupleSatisfiesSnapshot
> 1782097   5.7887  ExecMakeFunctionResultNoSets
> 1209262   3.9280  ExecEvalScalarVar
> 1128461   3.6655  heap_release_fetch
> 916447    2.9768  LockBuffer
> 898437    2.9183  _bt_checkkeys
> 887904    2.8841  btgettuple
> 728377    2.3660  slot_getattr
> 648121    2.1053  hash_search
> 608352    1.9761  _bt_restscan
> 574699    1.8668  index_getnext
> 570999    1.8547  _bt_step
> 517225    1.6801  FunctionCall2
> 501845    1.6301  _bt_next
> 452142    1.4687  IndexNext
> 426455    1.3852  PinBuffer

It seems to basically be spending its time scanning tuples and applying
some function or other (probably a WHERE clause).  The btree references
suggest that the scan is an indexscan and not a brute-force seqscan, but
still I wonder if the problem isn't that the planner has switched to some
plan much less efficient than the one it was using before.  oprofile is
too low-level to give us much info about that, unfortunately.

Probably your next step should be to investigate the plans being used
for the queries in the slow functions.  See the archives for hints, eg
http://archives.postgresql.org/pgsql-performance/2006-06/msg00301.php

            regards, tom lane