Thread: Performance of aggregates over set-returning functions
My set-up:

  Postgres 8.2.5 on AMD x86_64, compiled with GCC 3.4.4, on Gentoo Linux 2.6.17
  4 GB of RAM; shared_buffers = 1000; work_mem = 1024

This is regarding the performance of set-returning functions in queries. I
use generate_series() in the following as an example. The true motivation
is a need for a custom set-returning function that returns large result
sets, on which I'd like to use Postgres to compute groups and aggregates.

Consider the following query:

postgres=# select count(*) from generate_series(1,100000000000000000);

A vmstat while this query was running seemed to suggest that the
generate_series() result was being materialized to disk first and the
count computed afterwards (blocks were being written to disk even though
there was plenty of memory still available).

A small variation in the query (i.e. placing the generate_series in the
select list, without a from clause) exhibited a different behaviour:

postgres=# select count(*) from (select
generate_series(1,100000000000000000)) as A;

This time, Postgres seemed to be trying to build the entire
generate_series() result set in memory, and eventually started to swap
out until it hit the swap space limit, after which it crashed:

server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

Interestingly though, when the range in the generate_series() was small
enough to fit in 4 bytes (e.g. generate_series(1,1000000000)), the above
query completed consuming only a negligible amount of memory. So it looked
like the aggregate computation was being pipelined with the tuples
returned from generate_series():

postgres=# select count(*) from (select generate_series(1,1000000000)) as A;
   count
------------
 1000000000
(1 row)

It seems the only difference between the queries is that in the former
case the generate_series(bigint, bigint) overload is selected, whereas in
the latter the generate_series(int, int) overload is selected.

A little investigation suggested that in the former case (where
generate_series() returns bigint), Postgres was using palloc to store the
numbers so that it could pass them by reference, whereas in the latter
case the numbers are small enough (4-byte int) to be passed by value.
Assuming the aggregate computation is pipelined with the reading of the
tuples from the function (as in the 4-byte case), the aggregate was
probably updated immediately, but the memory allocated for each value was
never freed, and that is what was eating up memory. Is this right? If not,
could you explain why this behaviour was being observed?

First question: why isn't the aggregate computation pipelined with the
read of the tuples from the set-returning function in the first query's
case (i.e., when generate_series appears in the from list)?

Second question: is there a recommended way of using set-returning
functions in queries that will enable such pipelining, i.e., without
materializing and without consuming huge amounts of memory?

Third question: more generally, is there a way I can ask Postgres to
report an out-of-memory error the moment it tries to consume more than a
certain percentage of memory (instead of letting it hit swap and
eventually die or thrash)?

Thanks!
- John
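A minimal sketch of the pass-by-value versus pass-by-reference distinction
described above. This is simplified illustration code, not PostgreSQL
source: the pointer-sized Datum typedef and the helper names are invented
stand-ins, and malloc stands in for palloc.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef uintptr_t Datum;        /* stand-in: a Datum is pointer-sized */

    /* int4 fits in a Datum, so it is passed by value: no allocation. */
    static Datum
    int4_to_datum(int32_t v)
    {
        return (Datum) v;
    }

    /*
     * Where int8 does not fit in a Datum, each value must be copied into
     * freshly allocated memory and passed by reference.  If allocations
     * made while emitting SRF result rows are not freed until the query
     * ends, every row leaks one copy.
     */
    static Datum
    int8_to_datum(int64_t v)
    {
        int64_t *copy = malloc(sizeof *copy);   /* stand-in for palloc() */
        if (copy == NULL)
            abort();
        *copy = v;
        return (Datum) copy;
    }

    int
    main(void)
    {
        Datum d4 = int4_to_datum(42);           /* by value */
        Datum d8 = int8_to_datum(42);           /* by reference */

        printf("by value:     %ld\n", (long) d4);
        printf("by reference: %lld\n", (long long) *(int64_t *) d8);
        free((void *) d8);
        return 0;
    }

Repeat the int8 path once per result row without freeing, and per-query
memory grows linearly with the row count, which matches the swapping
behaviour reported above.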
"John Smith" <sodgodofall@gmail.com> writes: > Consider the following query: > postgres=# select count(*) from generate_series(1,100000000000000000); > A vmstat while this query was running seems to suggest that the > generate_series() was being materialized to disk first and then the > count computed Yeah, that's what I'd expect given the way nodeFunctionscan.c works. > A small variation in the query (i.e. placing the generate_series in > the select list without a from clause) exhibited a different > behaviour: > postgres=# select count(*) from (select > generate_series(1,100000000000000000)) as A; > This time, Postgres seemed to be trying to build the entire > generate_series() result set in memory and eventually started to swap > out until the swap space limit after which it crashed. Hmm, at worst you should get an "out of memory" error --- that's what I get. I think you're suffering from the dreaded "OOM kill" disease. > Interestingly though, when the range in the generate_series() was > small enough to fit in 4 bytes of memory (e.g. > generate_series(1,1000000000) ), the above query completed consuming > only negligible amount of memory. So, it looked like the aggregate > computation was being pipelined with the tuples returned from > generate_series(). It's pipelined either way. But int8 is a pass-by-reference data type, and it sounds like we have a memory leak for this case. > Third question: more generally, is there a way I can ask Postgres to > report an Out-of-memory the moment it tries to consume greater than a > certain percentage of memory (instead of letting it hit the swap and > eventually die or thrash) ? First, you want to turn off memory overcommit in your kernel; second, you might find that running the postmaster under conservative ulimit settings would make the behavior nicer. regards, tom lane
> > Interestingly though, when the range in the generate_series() was
> > small enough to fit in 4 bytes (e.g. generate_series(1,1000000000)),
> > the above query completed consuming only a negligible amount of
> > memory. So it looked like the aggregate computation was being
> > pipelined with the tuples returned from generate_series().
>
> It's pipelined either way.  But int8 is a pass-by-reference data type,
> and it sounds like we have a memory leak for this case.

Thanks for your reply. How easy is it to fix this? Which portion of the
code should we look to change?

- John
"John Smith" <sodgodofall@gmail.com> writes: >> It's pipelined either way. But int8 is a pass-by-reference data type, >> and it sounds like we have a memory leak for this case. > Thanks for your reply. How easy is it to fix this? Which portion of > the code should we look to change? I was just looking at that. The issue looks to me that nodeResult.c (and other plan node types that support SRFs in their targetlists) do this: /* * Check to see if we're still projecting out tuples from a previous scan * tuple (because there is a function-returning-set in the projection * expressions). If so, try to project another one. */ if (node->ps.ps_TupFromTlist) { resultSlot = ExecProject(node->ps.ps_ProjInfo, &isDone); if (isDone == ExprMultipleResult) return resultSlot; /* Done with that source tuple... */ node->ps.ps_TupFromTlist = false; } /* * Reset per-tuple memory context to free any expression evaluation * storage allocated in the previous tuple cycle. Note this can't happen * until we're done projecting out tuples from a scan tuple. */ ResetExprContext(econtext); whereas there would be no memory leak if these two chunks of code were in the other order. The question is whether resetting the context first would throw away any data that we *do* still need for the repeated ExecProject calls. That second comment block implies there's something we do need. I'm not sure why it's like this. Some digging in the CVS history shows that indeed the code used to be in the other order, and I switched it (and added the second comment block) in this old commit: http://archives.postgresql.org/pgsql-committers/2000-08/msg00218.php I suppose that the SQL-function support at the time required that its calling memory context be persistent until it returned ExprEndResult, but I sure don't recall any details. It's entirely possible that that requirement no longer exists, or could easily be eliminated given all the other changes that have happened since then. nodeFunctionscan.c seems to reset the current context for each call of a SRF, so I'd think that anything that can't cope with that should have been flushed out by now. If you feel like poking at this, I *strongly* recommend doing your testing in an --enable-cassert build. You'll have no idea whether you freed stuff too early if you don't have CLOBBER_FREED_MEMORY enabled. regards, tom lane
Is this a bug or a TODO item?

---------------------------------------------------------------------------

Tom Lane wrote:
> "John Smith" <sodgodofall@gmail.com> writes:
> >> It's pipelined either way.  But int8 is a pass-by-reference data type,
> >> and it sounds like we have a memory leak for this case.
>
> > Thanks for your reply. How easy is it to fix this? Which portion of
> > the code should we look to change?
>
> I was just looking at that.  The issue looks to me to be that
> nodeResult.c (and other plan node types that support SRFs in their
> targetlists) do this:
> [...]
> If you feel like poking at this, I *strongly* recommend doing your
> testing in an --enable-cassert build.  You'll have no idea whether you
> freed stuff too early if you don't have CLOBBER_FREED_MEMORY enabled.
Bruce Momjian <bruce@momjian.us> writes:
> Is this a bug or a TODO item?

TODO, I think.  I wouldn't want to risk pushing a change in this into
back branches.

			regards, tom lane
OK, added to TODO:

* Reduce memory usage of aggregates in set returning functions

  http://archives.postgresql.org/pgsql-performance/2008-01/msg00031.php

---------------------------------------------------------------------------

Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Is this a bug or a TODO item?
>
> TODO, I think.  I wouldn't want to risk pushing a change in this into
> back branches.
Sorry for the long delay in following up on this suggestion. The change
Tom suggested fixed the performance problems I was seeing, but I never
ran the full regression suite on the modified code, as everything in my
performance tests seemed to indicate the bug was fixed (i.e., no errors
even with --enable-cassert). When I recently ran the regression suite on
the "fixed" version of Postgres, the "misc" test suite fails with the
following error message:

ERROR:  cache lookup failed for type 2139062143

Is this a manifestation of the problem where certain items are being
freed too early? Any other ideas as to what's going on here?

Thanks,
John

On Tue, Jan 8, 2008 at 8:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I was just looking at that.  The issue looks to me to be that
> nodeResult.c (and other plan node types that support SRFs in their
> targetlists) do this:
> [...]
> If you feel like poking at this, I *strongly* recommend doing your
> testing in an --enable-cassert build.  You'll have no idea whether you
> freed stuff too early if you don't have CLOBBER_FREED_MEMORY enabled.
"John Smith" <sodgodofall@gmail.com> writes: > Sorry for the long delay in following up on this suggestion. The > change Tom suggested fixed the performance problems I was seeing, but > I never ran the full regression suite on the modified code, as > everything in my performance tests seemed to indicate the bug was > fixed (i.e, no errors even with --cassert-enabled). When I recently > ran the regression suite on the "fixed" version of Postgres, the > "misc" test suite fails with the following error message: "ERROR: > cache lookup failed for type 2139062143". Is this a manifestation of > the problem where certain items are being freed too early? Yup --- something's trying to use memory that's already been freed. The particular case is evidently a field containing a type OID. You might be able to get a clue where the problem is by getting a gdb stack trace back from errfinish(), but this sort of kills my optimism that the 7.0-era problem is gone ... regards, tom lane