Performance of aggregates over set-returning functions - Mailing list pgsql-performance
From: John Smith
Subject: Performance of aggregates over set-returning functions
Msg-id: b88f0d670801081821r70fb10b4j4c176eed96918434@mail.gmail.com
Responses: Re: Performance of aggregates over set-returning functions
List: pgsql-performance
My set-up: Postgres 8.2.5 on AMD x86_64, compiled with GCC 3.4.4, on Gentoo Linux 2.6.17; 4 GB of RAM; shared_buffers = 1000; work_mem = 1024.

This is regarding the performance of set-returning functions in queries. I use generate_series() in the following as an example; the true motivation is a need for a custom set-returning function that returns large result sets, on which I'd like to use Postgres to compute groups and aggregates.

Consider the following query:

postgres=# select count(*) from generate_series(1,100000000000000000);

Running vmstat while this query executed suggested that the generate_series() result was being materialized to disk first and the count computed afterwards (blocks were being written to disk even though plenty of memory was still available).

A small variation of the query (i.e. placing the generate_series in the select list, without a from clause) exhibited different behaviour:

postgres=# select count(*) from (select generate_series(1,100000000000000000)) as A;

This time, Postgres seemed to be building the entire generate_series() result set in memory, and it eventually started to swap until it hit the swap space limit, after which it crashed:

server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

Interestingly, though, when the range in the generate_series() was small enough to fit in 4 bytes of memory (e.g. generate_series(1,1000000000)), the above query completed while consuming only a negligible amount of memory. So it looked like the aggregate computation was being pipelined with the tuples returned from generate_series().
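One way to compare the two query shapes without a multi-hour run is EXPLAIN ANALYZE on a scaled-down range (the 1,000,000 range below is my assumption, chosen only so that both forms finish quickly):

```sql
-- From-clause form: the plan shows a Function Scan node, and its actual
-- time includes producing the function's result set before counting.
explain analyze select count(*) from generate_series(1, 1000000);

-- Select-list form of the same count, for comparison.
explain analyze select count(*) from (select generate_series(1, 1000000)) as A;
```

Comparing the actual times of the two plans should give a rough measure of the materialization overhead described above.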
postgres=# select count(*) from (select generate_series(1,1000000000)) as A;
   count
------------
 1000000000
(1 row)

It seems the only difference between the queries is that in the former case the generate_series(bigint, bigint) overload is selected, whereas in the latter case the generate_series(int, int) overload is selected. A little investigation suggested that in the former case (where generate_series() returns bigint), Postgres was using palloc to store the numbers so that it could pass them by reference. By contrast, in the latter case, the numbers are small enough (4-byte int) to be passed by value. Assuming the aggregate computation is pipelined with the reading of tuples from the function (as in the 4-byte case), the aggregate was probably updated immediately, but the memory so allocated was never freed, and that is what was eating up memory. Is this right? If not, could you explain why this behaviour was observed?

First question: why isn't the aggregate computation pipelined with the reading of tuples from the set-returning function in the first query's case (i.e., when generate_series appears in the from list)?

Second question: is there a recommended way of using set-returning functions in queries that enables such pipelining, i.e., without materializing and without consuming huge amounts of memory?

Third question: more generally, is there a way to ask Postgres to report an out-of-memory error the moment it tries to consume more than a certain percentage of memory (instead of letting it hit swap and eventually die or thrash)?

Thanks!
- John