Thread: Performance of aggregates over set-returning functions

From: "John Smith"

My set-up:

Postgres 8.2.5 on AMD x86_64, compiled with GCC 3.4.4, on Gentoo Linux 2.6.17
4 GB of RAM,
shared_buffers = 1000
work_mem = 1024


This is regarding performance of set-returning functions in queries. I
use generate_series() in the following as an example. The true
motivation is a need for a custom set-returning function that returns
large result sets, on which I'd like to use Postgres to compute groups
and aggregates.

Consider the following query:

         postgres=# select count(*) from generate_series(1,100000000000000000);

Watching vmstat while this query was running suggests that the
generate_series() result was being materialized to disk first and the
count computed afterwards (blocks were being written to disk even
though plenty of memory was still available).
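
(For anyone reproducing this: the tell-tale sign is the "bo" column
of vmstat climbing while the query runs, even though memory is free.
Something like:)

         $ vmstat 1    # "bo" (blocks written out) climbs steadily
                       # while the query runs, despite "free" and
                       # "cache" showing plenty of available memory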

A small variation in the query (i.e. placing the generate_series in
the select list without a from clause) exhibited a different
behaviour:

         postgres=# select count(*) from (select generate_series(1,100000000000000000)) as A;

This time, Postgres seemed to be building the entire
generate_series() result set in memory; it eventually started to
swap, hit the swap space limit, and crashed:

         server closed the connection unexpectedly
         This probably means the server terminated abnormally
                  before or while processing the request.
         The connection to the server was lost. Attempting reset: Failed.
         !>

Interestingly though, when the range in the generate_series() was
small enough to fit in a 4-byte integer (e.g.
generate_series(1,1000000000)), the above query completed while
consuming only a negligible amount of memory. So, it looked like the
aggregate computation was being pipelined with the tuples returned
from generate_series().

        postgres=# select count(*) from (select generate_series(1,1000000000)) as A;
           count
        ------------
         1000000000
        (1 row)

It seems the only difference between the queries is that in the former
case, the generate_series(bigint, bigint) overload is selected,
whereas in the latter case, the generate_series(int, int) overload is
selected. A little investigation seemed to suggest that in the former
case (where the generate_series() was returning bigint), Postgres was
using palloc to store the numbers so that it could pass them by
reference. By contrast, in the latter case, the numbers are small
enough (4-byte int) to be passed by value.
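
(To make the pass-by-value/pass-by-reference distinction concrete,
here is a hypothetical pair of C functions -- illustrative only, not
actual Postgres source -- showing where the palloc comes in:)

    #include "postgres.h"
    #include "fmgr.h"

    Datum
    hypothetical_int4_result(PG_FUNCTION_ARGS)
    {
        /* int4 is pass-by-value: the value is carried inside the
         * Datum itself, so returning it allocates nothing. */
        PG_RETURN_INT32(42);
    }

    Datum
    hypothetical_int8_result(PG_FUNCTION_ARGS)
    {
        /* int8 is pass-by-reference on a build whose Datum is 32 bits
         * (as in 8.2): Int64GetDatum, hidden inside PG_RETURN_INT64,
         * pallocs space in the current memory context and returns a
         * pointer to it.  If that context is not reset between rows,
         * those small allocations accumulate -- consistent with the
         * memory growth described above. */
        PG_RETURN_INT64((int64) 42);
    }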

Assuming the aggregate computation is pipelined with the reading of
the tuples from the function (as in the 4-byte case), the aggregate
was probably updated immediately. But then, the memory so allocated
wasn't freed, and that's what was eating up memory. Is this right? If
not, could you explain why this behaviour was being observed?

First question: why isn't the aggregate computation being pipelined
with the read of the tuples from the set-returning function in the
first query's case (i.e., when generate_series appears in the from
list)?

Second question: is there a recommended way of using set-returning
functions in queries that will enable such pipelining, i.e., without
materializing and without consuming huge amounts of memory?

Third question: more generally, is there a way I can ask Postgres to
report an out-of-memory error the moment it tries to consume more
than a certain percentage of memory (instead of letting it hit swap
and eventually die or thrash)?


Thanks!
- John

Re: Performance of aggregates over set-returning functions

From: Tom Lane

"John Smith" <sodgodofall@gmail.com> writes:
> Consider the following query:
>          postgres=# select count(*) from generate_series(1,100000000000000000);
> Watching vmstat while this query was running suggests that the
> generate_series() result was being materialized to disk first and the
> count computed afterwards

Yeah, that's what I'd expect given the way nodeFunctionscan.c works.
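
(Background for readers: a FunctionScan node first runs the SRF to
completion, buffering every row in a tuplestore, and only then hands
rows up to the aggregate; the tuplestore spills to disk once it
exceeds work_mem -- 1 MB in the configuration above. Roughly, with
arguments and details elided:)

    /* Simplified sketch of nodeFunctionscan.c's behaviour -- not the
     * actual source. */
    if (tuplestorestate == NULL)
    {
        /* First call: evaluate the SRF to completion.  All of its
         * rows land in a tuplestore, which writes to a temp file
         * once it grows past work_mem -- hence the disk writes. */
        tuplestorestate = ExecMakeTableFunctionResult(...);
    }
    /* Subsequent calls just fetch the buffered rows one at a time. */
    tuplestore_gettuple(tuplestorestate, forward, &should_free);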

> A small variation in the query (i.e. placing the generate_series in
> the select list without a from clause) exhibited a different
> behaviour:
>          postgres=# select count(*) from (select generate_series(1,100000000000000000)) as A;
> This time, Postgres seemed to be building the entire
> generate_series() result set in memory; it eventually started to
> swap, hit the swap space limit, and crashed.

Hmm, at worst you should get an "out of memory" error --- that's what
I get.  I think you're suffering from the dreaded "OOM kill" disease.

> Interestingly though, when the range in the generate_series() was
> small enough to fit in a 4-byte integer (e.g.
> generate_series(1,1000000000)), the above query completed while
> consuming only a negligible amount of memory. So, it looked like the
> aggregate computation was being pipelined with the tuples returned
> from generate_series().

It's pipelined either way.  But int8 is a pass-by-reference data type,
and it sounds like we have a memory leak for this case.
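
(Easy to confirm from the catalogs -- on an 8.2 build one sees
something like:)

    postgres=# select typname, typlen, typbyval from pg_type where typname in ('int4', 'int8');
     typname | typlen | typbyval
    ---------+--------+----------
     int4    |      4 | t
     int8    |      8 | f
    (2 rows)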

> Third question: more generally, is there a way I can ask Postgres to
> report an out-of-memory error the moment it tries to consume more
> than a certain percentage of memory (instead of letting it hit swap
> and eventually die or thrash)?

First, you want to turn off memory overcommit in your kernel;
second, you might find that running the postmaster under conservative
ulimit settings would make the behavior nicer.
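
(Concretely, on Linux that would be something along these lines --
the numbers are illustrative, and the ulimit must be issued in the
shell that launches the postmaster:)

    # as root: disable memory overcommit, so allocations fail up
    # front instead of triggering the OOM killer later
    sysctl -w vm.overcommit_memory=2

    # in the postmaster's start script: cap its address space (in KB)
    ulimit -v 2097152
    pg_ctl start -D /path/to/data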

            regards, tom lane

Re: Performance of aggregates over set-returning functions

From: "John Smith"

> > Interestingly though, when the range in the generate_series() was
> > small enough to fit in a 4-byte integer (e.g.
> > generate_series(1,1000000000)), the above query completed while
> > consuming only a negligible amount of memory. So, it looked like the
> > aggregate computation was being pipelined with the tuples returned
> > from generate_series().
>
> It's pipelined either way.  But int8 is a pass-by-reference data type,
> and it sounds like we have a memory leak for this case.

Thanks for your reply. How easy is it to fix this? Which portion of
the code should we look to change?

- John

Re: Performance of aggregates over set-returning functions

From: Tom Lane

"John Smith" <sodgodofall@gmail.com> writes:
>> It's pipelined either way.  But int8 is a pass-by-reference data type,
>> and it sounds like we have a memory leak for this case.

> Thanks for your reply. How easy is it to fix this? Which portion of
> the code should we look to change?

I was just looking at that.  The issue looks to me to be that
nodeResult.c (and other plan node types that support SRFs in their
targetlists) do this:

    /*
     * Check to see if we're still projecting out tuples from a previous scan
     * tuple (because there is a function-returning-set in the projection
     * expressions).  If so, try to project another one.
     */
    if (node->ps.ps_TupFromTlist)
    {
        resultSlot = ExecProject(node->ps.ps_ProjInfo, &isDone);
        if (isDone == ExprMultipleResult)
            return resultSlot;
        /* Done with that source tuple... */
        node->ps.ps_TupFromTlist = false;
    }

    /*
     * Reset per-tuple memory context to free any expression evaluation
     * storage allocated in the previous tuple cycle.  Note this can't happen
     * until we're done projecting out tuples from a scan tuple.
     */
    ResetExprContext(econtext);

whereas there would be no memory leak if these two chunks of code were
in the other order.  The question is whether resetting the context first
would throw away any data that we *do* still need for the repeated
ExecProject calls.  That second comment block implies there's something
we do need.
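
(Concretely, the reordering under discussion would look like the
sketch below; whether it is safe is exactly the question just
raised:)

    /* Sketch: reset the per-tuple context first ... */
    ResetExprContext(econtext);

    /* ... and only then continue projecting rows from a
     * function-returning-set, reusing the freshly reset context. */
    if (node->ps.ps_TupFromTlist)
    {
        resultSlot = ExecProject(node->ps.ps_ProjInfo, &isDone);
        if (isDone == ExprMultipleResult)
            return resultSlot;
        /* Done with that source tuple... */
        node->ps.ps_TupFromTlist = false;
    }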

I'm not sure why it's like this.  Some digging in the CVS history shows
that indeed the code used to be in the other order, and I switched it
(and added the second comment block) in this old commit:

http://archives.postgresql.org/pgsql-committers/2000-08/msg00218.php

I suppose that the SQL-function support at the time required that its
calling memory context be persistent until it returned ExprEndResult,
but I sure don't recall any details.  It's entirely possible that that
requirement no longer exists, or could easily be eliminated given all
the other changes that have happened since then.  nodeFunctionscan.c
seems to reset the current context for each call of a SRF, so I'd think
that anything that can't cope with that should have been flushed out
by now.

If you feel like poking at this, I *strongly* recommend doing your
testing in an --enable-cassert build.  You'll have no idea whether you
freed stuff too early if you don't have CLOBBER_FREED_MEMORY enabled.
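
(That is, build along these lines; CLOBBER_FREED_MEMORY is defined
automatically whenever assertions are enabled:)

    ./configure --enable-cassert --enable-debug
    make && make install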

            regards, tom lane

Re: Performance of aggregates over set-returning functions

From: Bruce Momjian

Is this a bug or TODO item?

---------------------------------------------------------------------------

Tom Lane wrote:
> "John Smith" <sodgodofall@gmail.com> writes:
> >> It's pipelined either way.  But int8 is a pass-by-reference data type,
> >> and it sounds like we have a memory leak for this case.
>
> > Thanks for your reply. How easy is it to fix this? Which portion of
> > the code should we look to change?
>
> I was just looking at that.  The issue looks to me to be that
> nodeResult.c (and other plan node types that support SRFs in their
> targetlists) do this:
>
>     /*
>      * Check to see if we're still projecting out tuples from a previous scan
>      * tuple (because there is a function-returning-set in the projection
>      * expressions).  If so, try to project another one.
>      */
>     if (node->ps.ps_TupFromTlist)
>     {
>         resultSlot = ExecProject(node->ps.ps_ProjInfo, &isDone);
>         if (isDone == ExprMultipleResult)
>             return resultSlot;
>         /* Done with that source tuple... */
>         node->ps.ps_TupFromTlist = false;
>     }
>
>     /*
>      * Reset per-tuple memory context to free any expression evaluation
>      * storage allocated in the previous tuple cycle.  Note this can't happen
>      * until we're done projecting out tuples from a scan tuple.
>      */
>     ResetExprContext(econtext);
>
> whereas there would be no memory leak if these two chunks of code were
> in the other order.  The question is whether resetting the context first
> would throw away any data that we *do* still need for the repeated
> ExecProject calls.  That second comment block implies there's something
> we do need.
>
> I'm not sure why it's like this.  Some digging in the CVS history shows
> that indeed the code used to be in the other order, and I switched it
> (and added the second comment block) in this old commit:
>
> http://archives.postgresql.org/pgsql-committers/2000-08/msg00218.php
>
> I suppose that the SQL-function support at the time required that its
> calling memory context be persistent until it returned ExprEndResult,
> but I sure don't recall any details.  It's entirely possible that that
> requirement no longer exists, or could easily be eliminated given all
> the other changes that have happened since then.  nodeFunctionscan.c
> seems to reset the current context for each call of a SRF, so I'd think
> that anything that can't cope with that should have been flushed out
> by now.
>
> If you feel like poking at this, I *strongly* recommend doing your
> testing in an --enable-cassert build.  You'll have no idea whether you
> freed stuff too early if you don't have CLOBBER_FREED_MEMORY enabled.
>
>             regards, tom lane

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Performance of aggregates over set-returning functions

From: Tom Lane

Bruce Momjian <bruce@momjian.us> writes:
> Is this a bug or TODO item?

TODO, I think.  I wouldn't want to risk pushing a change in this into
back branches.

            regards, tom lane

>> I'm not sure why it's like this.  Some digging in the CVS history shows
>> that indeed the code used to be in the other order, and I switched it
>> (and added the second comment block) in this old commit:
>>
>> http://archives.postgresql.org/pgsql-committers/2000-08/msg00218.php
>>
>> I suppose that the SQL-function support at the time required that its
>> calling memory context be persistent until it returned ExprEndResult,
>> but I sure don't recall any details.  It's entirely possible that that
>> requirement no longer exists, or could easily be eliminated given all
>> the other changes that have happened since then.  nodeFunctionscan.c
>> seems to reset the current context for each call of a SRF, so I'd think
>> that anything that can't cope with that should have been flushed out
>> by now.
>>
>> If you feel like poking at this, I *strongly* recommend doing your
>> testing in an --enable-cassert build.  You'll have no idea whether you
>> freed stuff too early if you don't have CLOBBER_FREED_MEMORY enabled.

Re: Performance of aggregates over set-returning functions

From: Bruce Momjian

OK, added to TODO:

* Reduce memory usage of aggregates in set returning functions

  http://archives.postgresql.org/pgsql-performance/2008-01/msg00031.php


---------------------------------------------------------------------------

Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Is this a bug or TODO item?
>
> TODO, I think.  I wouldn't want to risk pushing a change in this into
> back branches.
>
>             regards, tom lane
>
> >> I'm not sure why it's like this.  Some digging in the CVS history shows
> >> that indeed the code used to be in the other order, and I switched it
> >> (and added the second comment block) in this old commit:
> >>
> >> http://archives.postgresql.org/pgsql-committers/2000-08/msg00218.php
> >>
> >> I suppose that the SQL-function support at the time required that its
> >> calling memory context be persistent until it returned ExprEndResult,
> >> but I sure don't recall any details.  It's entirely possible that that
> >> requirement no longer exists, or could easily be eliminated given all
> >> the other changes that have happened since then.  nodeFunctionscan.c
> >> seems to reset the current context for each call of a SRF, so I'd think
> >> that anything that can't cope with that should have been flushed out
> >> by now.
> >>
> >> If you feel like poking at this, I *strongly* recommend doing your
> >> testing in an --enable-cassert build.  You'll have no idea whether you
> >> freed stuff too early if you don't have CLOBBER_FREED_MEMORY enabled.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Performance of aggregates over set-returning functions

From: "John Smith"

Sorry for the long delay in following up on this suggestion. The
change Tom suggested fixed the performance problems I was seeing, but
I never ran the full regression suite on the modified code, as
everything in my performance tests seemed to indicate the bug was
fixed (i.e., no errors even with --enable-cassert). When I recently
ran the regression suite on the "fixed" version of Postgres, the
"misc" test suite failed with the following error message: "ERROR:
cache lookup failed for type 2139062143". Is this a manifestation of
the problem where certain items are being freed too early? Any other
ideas as to what's going on here?

Thanks,
John

On Tue, Jan 8, 2008 at 8:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "John Smith" <sodgodofall@gmail.com> writes:
>>> It's pipelined either way.  But int8 is a pass-by-reference data type,
>>> and it sounds like we have a memory leak for this case.
>
>> Thanks for your reply. How easy is it to fix this? Which portion of
>> the code should we look to change?
>
> I was just looking at that.  The issue looks to me to be that
> nodeResult.c (and other plan node types that support SRFs in their
> targetlists) do this:
>
>    /*
>     * Check to see if we're still projecting out tuples from a previous scan
>     * tuple (because there is a function-returning-set in the projection
>     * expressions).  If so, try to project another one.
>     */
>    if (node->ps.ps_TupFromTlist)
>    {
>        resultSlot = ExecProject(node->ps.ps_ProjInfo, &isDone);
>        if (isDone == ExprMultipleResult)
>            return resultSlot;
>        /* Done with that source tuple... */
>        node->ps.ps_TupFromTlist = false;
>    }
>
>    /*
>     * Reset per-tuple memory context to free any expression evaluation
>     * storage allocated in the previous tuple cycle.  Note this can't happen
>     * until we're done projecting out tuples from a scan tuple.
>     */
>    ResetExprContext(econtext);
>
> whereas there would be no memory leak if these two chunks of code were
> in the other order.  The question is whether resetting the context first
> would throw away any data that we *do* still need for the repeated
> ExecProject calls.  That second comment block implies there's something
> we do need.
>
> I'm not sure why it's like this.  Some digging in the CVS history shows
> that indeed the code used to be in the other order, and I switched it
> (and added the second comment block) in this old commit:
>
> http://archives.postgresql.org/pgsql-committers/2000-08/msg00218.php
>
> I suppose that the SQL-function support at the time required that its
> calling memory context be persistent until it returned ExprEndResult,
> but I sure don't recall any details.  It's entirely possible that that
> requirement no longer exists, or could easily be eliminated given all
> the other changes that have happened since then.  nodeFunctionscan.c
> seems to reset the current context for each call of a SRF, so I'd think
> that anything that can't cope with that should have been flushed out
> by now.
>
> If you feel like poking at this, I *strongly* recommend doing your
> testing in an --enable-cassert build.  You'll have no idea whether you
> freed stuff too early if you don't have CLOBBER_FREED_MEMORY enabled.
>
>                        regards, tom lane
>

Re: Performance of aggregates over set-returning functions

From: Tom Lane

"John Smith" <sodgodofall@gmail.com> writes:
> Sorry for the long delay in following up on this suggestion. The
> change Tom suggested fixed the performance problems I was seeing, but
> I never ran the full regression suite on the modified code, as
> everything in my performance tests seemed to indicate the bug was
> fixed (i.e., no errors even with --enable-cassert). When I recently
> ran the regression suite on the "fixed" version of Postgres, the
> "misc" test suite failed with the following error message: "ERROR:
> cache lookup failed for type 2139062143". Is this a manifestation of
> the problem where certain items are being freed too early?

Yup --- something's trying to use memory that's already been freed.
The particular case is evidently a field containing a type OID:
2139062143 is 0x7F7F7F7F, the pattern CLOBBER_FREED_MEMORY writes
over freed memory, so the OID was read out of an already-freed chunk.

You might be able to get a clue where the problem is by getting a gdb
stack trace back from errfinish(), but this sort of kills my optimism
that the 7.0-era problem is gone ...
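
(A minimal recipe for that, assuming a debug build: attach gdb to the
backend session that runs the failing query, break on errfinish, and
take a backtrace when the breakpoint fires:)

    $ gdb -p <pid of the failing backend>
    (gdb) break errfinish
    (gdb) continue
    -- now re-run the failing "misc" query in that session --
    (gdb) bt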

            regards, tom lane