
From: John Smith
Subject: Performance of aggregates over set-returning functions
Date:
Msg-id: b88f0d670801081821r70fb10b4j4c176eed96918434@mail.gmail.com
Responses: Re: Performance of aggregates over set-returning functions (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-performance
My set-up:

Postgres 8.2.5 on AMD x86_64, compiled with GCC 3.4.4, on Gentoo Linux 2.6.17
4 GB of RAM
shared_buffers = 1000
work_mem = 1024

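With the default 8 kB block size, that works out to roughly 8 MB of
shared buffers and 1 MB of work_mem; the effective settings can be
confirmed from psql with:

         postgres=# show shared_buffers;
         postgres=# show work_mem;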

This is regarding performance of set-returning functions in queries. I
use generate_series() in the following as an example. The true
motivation is a need for a custom set-returning function that returns
large result sets, on which I'd like to use Postgres to compute groups
and aggregates.

Consider the following query:

         postgres=# select count(*) from generate_series(1,100000000000000000);

Watching vmstat while this query was running suggested that the
generate_series() result was being materialized to disk first and the
count computed only afterwards (blocks were being written to disk even
though plenty of memory was still available).
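
Presumably the plan here is nothing more than an aggregate on top of a
function scan, so the spill to disk wouldn't be visible from the plan
itself:

         postgres=# explain select count(*) from generate_series(1,100000000000000000);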

A small variation of the query (placing generate_series() in the
select list rather than in the from clause) exhibited a different
behaviour:

         postgres=# select count(*) from
                    (select generate_series(1,100000000000000000)) as A;

This time, Postgres appeared to be building the entire
generate_series() result set in memory; it started swapping, eventually
ran into the swap space limit, and then crashed:

         server closed the connection unexpectedly
         This probably means the server terminated abnormally
                  before or while processing the request.
         The connection to the server was lost. Attempting reset: Failed.
         !>

Interestingly though, when the range in the generate_series() was
small enough to fit in a 4-byte integer (e.g.
generate_series(1,1000000000)), the above query completed while
consuming only a negligible amount of memory. So it looked as though
the aggregate computation was being pipelined with the tuples returned
from generate_series().

        postgres=# select count(*) from
                   (select generate_series(1,1000000000)) as A;
           count
        ------------
         1000000000
        (1 row)

It seems the only difference between the queries is that in the former
case, the generate_series(bigint, bigint) overload is selected,
whereas in the latter case, the generate_series(int, int) overload is
selected. A little investigation seemed to suggest that in the former
case (where the generate_series() was returning bigint), Postgres was
using palloc to store the numbers so that it could pass them by
reference. By contrast, in the latter case, the numbers are small
enough (4-byte int) to be passed by value.
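
For reference, the available overloads can be listed from psql with:

         postgres=# \df generate_series

which should show both the (integer, integer) and the (bigint, bigint)
variants.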

Assuming the aggregate computation is pipelined with the reading of
the tuples from the function (as in the 4-byte case), the aggregate
was probably updated immediately for each tuple. But the memory
allocated for each value was never freed, and that is what was eating
up memory. Is this right? If not, could you explain why this behaviour
was being observed?
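
If that is what is happening, then forcing the bigint overload over a
range small enough to complete should presumably show the same memory
growth; something along these lines (I haven't re-run this exact
form):

         postgres=# select count(*) from
                    (select generate_series(1::bigint, 1000000000::bigint)) as A;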

First question: why isn't the aggregate computation being pipelined
with the reading of the tuples from the set-returning function in the
first query's case (i.e., when generate_series() appears in the from
list)?

Second question: is there a recommended way of using set-returning
functions in queries that enables such pipelining, i.e., without
materializing the result set and without consuming huge amounts of
memory?

Third question: more generally, is there a way to ask Postgres to
report an out-of-memory error the moment it tries to consume more than
a certain percentage of memory (instead of letting it hit swap and
eventually thrash or die)?


Thanks!
- John
