Thread: OLAP CUBE/ROLLUP Operators and GROUP BY grouping sets
I'm curious if anyone has ever looked into adding OLAP functionality (per
the SQL99 specs) into PostgreSQL. I don't actually own the proper SQL99
specifications, and since the newer SQL2003 ones are coming out I don't
know which to purchase. Could someone point me in the right direction?

I've looked in the mailing lists and the docs and found some interest in
OLAP-like functionality, but not like what I found in other databases
(such as Oracle and DB2). More specifically, I would like to add grouping
sets and the CUBE and ROLLUP operators to PostgreSQL. Since modifying the
GROUP BY operation in this way would necessitate changing the query
structure, wouldn't that affect the query rewrites and genetic optimizer?
On a superficial level yes, but would any existing query rewriting care
about additional grouping being done in a GROUP BY operation? More
specifically, what would be affected by making the GROUP BY clause a list
of lists (the list of the grouping sets) instead of an expression list as
it currently is?

An example of a ROLLUP result might be (pseudoresults...):

CREATE TABLE SALES_SUMMARY (NAME TEXT, DEPARTMENT TEXT, SALES INTEGER);
-- populate with data
SELECT DEPARTMENT, NAME, SUM(SALES)
FROM SALES_SUMMARY
GROUP BY ROLLUP(DEPARTMENT, NAME);

DEPARTMENT     NAME     SUM(SALES)
-------------- -------- ----------
Dept one       Bob              13
Dept one       Jeff             12
Dept one       NULL             25
Dept two       Jim              10
Dept two       Mary             11
Dept two       NULL             21
NULL           NULL             46

where the rows with NULLs represent the subtotals and the grand total,
respectively. Any thoughts?

Along the same vein, window partitioning would be nice for aggregates,
with aggregate expressions like:

  RANK() OVER (PARTITION BY DEPARTMENT ORDER BY SALARY DESC)

They get rid of a lot of subselect operations quite nicely.

These are not simple projects, I know. There are a lot of features in
high-end databases I would like to have in open source tools, and I would
like to make a contribution towards getting them there ;)

PS - ...no, I won't even mention materialized views...

-----------------
Robert Bedell
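[Illustration: the "gets rid of subselects" point can be made concrete with
a sketch like the following; the EMPLOYEES table and its columns are
hypothetical and not part of the original message.]

  -- Hypothetical EMPLOYEES(NAME, DEPARTMENT, SALARY) table.

  -- SQL99/2003 window-function form:
  SELECT NAME, DEPARTMENT,
         RANK() OVER (PARTITION BY DEPARTMENT
                      ORDER BY SALARY DESC) AS SALARY_RANK
  FROM EMPLOYEES;

  -- Roughly equivalent form without window functions, using a correlated
  -- subselect (it assigns the same rank to ties, as RANK() does):
  SELECT E.NAME, E.DEPARTMENT,
         (SELECT COUNT(*) + 1
          FROM EMPLOYEES E2
          WHERE E2.DEPARTMENT = E.DEPARTMENT
            AND E2.SALARY > E.SALARY) AS SALARY_RANK
  FROM EMPLOYEES E;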
"Robert Bedell" <robert@friendlygenius.com> writes: > I'm curious if anyone has ever looked into adding OLAP functionality (per > the SQL99 specs) into PostGreSQL. There was a fairly crude CUBE implementation submitted (and rejected) a few months ago, but there's not been any work I thought had a chance of getting committed. This is an issue of implementation quality rather than whether we want the feature --- I think the community is interested in adding any and all features that are in SQL99. > More specifically I would like to add grouping sets, and the CUBE and ROLLUP > operators, into postgresql. Since modifying such the GROUP BY operation > would necessitate changing the query structure, wouldn't that affect the > query rewrites and genetic optimizer? I don't think either the rewriter or GEQO would notice at all. The regular optimizer definitely would though. > These are not simple projects, I know. Might be a tad ambitious for your first venture into backend hacking... regards, tom lane
Robert Bedell wrote:
> I'm curious if anyone has ever looked into adding OLAP functionality (per
> the SQL99 specs) into PostgreSQL. I don't actually own the proper SQL99
> specifications, and since the newer SQL2003 ones are coming out I don't
> know which to purchase. Could someone point me in the right direction?

For a recent draft of SQL2003 see:
http://www.wiscorp.com/sql/sql_2003_standard.zip

FWIW I've been thinking about working on some SQL 2003 OLAP features for
Postgres 7.5, but it isn't clear what I'll have time for yet. I'm hoping
to at least work on aggregates that accept two arguments (binary set
functions), e.g. REGR_*, and window frame aggregates (sections 4.15.4
and 10.9).

> PS - ...no, I won't even mention materialized views...

There has been a recent discussion about materialized views on one of
the lists. I think someone might already be working on it. Search the
archives.

Joe
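[Illustration: a minimal sketch of the two SQL2003 feature groups Joe
mentions; the MEASUREMENTS table and its columns are hypothetical.]

  -- Hypothetical MEASUREMENTS(TICK_DATE, X, Y) table.

  -- Binary set functions: aggregates that take two arguments, e.g. the
  -- linear-regression family.
  SELECT REGR_SLOPE(Y, X)     AS SLOPE,
         REGR_INTERCEPT(Y, X) AS INTERCEPT
  FROM MEASUREMENTS;

  -- Window frame aggregate: an ordinary aggregate evaluated over a
  -- sliding frame of rows.
  SELECT TICK_DATE, Y,
         AVG(Y) OVER (ORDER BY TICK_DATE
                      ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS MOVING_AVG
  FROM MEASUREMENTS;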
> > More specifically, I would like to add grouping sets and the CUBE and
> > ROLLUP operators to PostgreSQL. Since modifying the GROUP BY operation
> > in this way would necessitate changing the query structure, wouldn't
> > that affect the query rewrites and genetic optimizer?
>
> I don't think either the rewriter or GEQO would notice at all. The
> regular optimizer definitely would though.

Alright, I will focus on the regular optimizer more specifically then.
Thanks!

> > These are not simple projects, I know.
>
> Might be a tad ambitious for your first venture into backend hacking...

I agree completely. I'm not purporting to jump in quite that quickly, but
it is something I would like to see added, and am willing to put the time
towards it. If you have any suggestions for items on the TODO list that
would move me more quickly towards my goal, I would appreciate it. (It
would also be useful to avoid working on something someone else already
is!)

To Joe Conway: Thanks for the link! And if you have any specific ideas on
window frame aggregates I would like to discuss them with you out of
band ;)

Thanks for your time ;)

Cheers,
Robert
Tom Lane wrote on Thu, 18.12.2003 at 00:27:
> "Robert Bedell" <robert@friendlygenius.com> writes:
> > I'm curious if anyone has ever looked into adding OLAP functionality (per
> > the SQL99 specs) into PostgreSQL.

As a first project one could think of implementing NULLS FIRST/LAST (from
4.14.9) for all ORDER BY operations.

> There was a fairly crude CUBE implementation submitted (and rejected) a
> few months ago, but there's not been any work I thought had a chance of
> getting committed. This is an issue of implementation quality rather than
> whether we want the feature --- I think the community is interested in
> adding any and all features that are in SQL99.
>
> > More specifically, I would like to add grouping sets and the CUBE and
> > ROLLUP operators to PostgreSQL. Since modifying the GROUP BY operation
> > in this way would necessitate changing the query structure, wouldn't
> > that affect the query rewrites and genetic optimizer?
>
> I don't think either the rewriter or GEQO would notice at all. The
> regular optimizer definitely would though.

If it would mess up the optimiser, then could the extra aggregators not
be put in after optimisations, at least for hash aggregators? The
implementation using hash aggregators should be just a SMOP
(http://people.kldp.org/~eunjea/jargon/?idx=SMOP.html) ;)

While ROLLUP could easily be implemented on the same scan over sorted
data as an ordinary GROUP BY, by adding extra aggregators for partial
matches, CUBE and arbitrary grouping sets can't. I.e. the query

  SELECT SUM(a) GROUP BY ROLLUP (b,c)

if run over a sorted set, would have a distinct aggregator for each of
(b,c), (b,) and (,), which can be reinitialised/reused every time (b,c)
or b changes. CUBE(b,c) would need hash aggregators for at least (,c).

> > These are not simple projects, I know.
>
> Might be a tad ambitious for your first venture into backend hacking...

OTOH it may be quite easy to *test* implementation ideas by always
expanding GROUP BY into ROLLUP or CUBE, adding the aggregators just
before the executor, without touching the parser and pre-optimisation
query tree at all.

--------------
Hannu
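[Illustration: the expansions Hannu is describing, written out. The table
t(a, b, c) is hypothetical; the UNION ALL form is one way the same result
could be emulated with plain GROUP BY today, at the cost of multiple scans
(explicit casts on the NULL columns may be needed).]

  -- By definition:
  --   ROLLUP (b, c)  =  GROUPING SETS ((b, c), (b), ())
  --   CUBE (b, c)    =  GROUPING SETS ((b, c), (b), (c), ())

  -- The ROLLUP query
  --   SELECT b, c, SUM(a) FROM t GROUP BY ROLLUP (b, c);
  -- emulated with ordinary GROUP BYs:
  SELECT b,    c,    SUM(a) FROM t GROUP BY b, c
  UNION ALL
  SELECT b,    NULL, SUM(a) FROM t GROUP BY b
  UNION ALL
  SELECT NULL, NULL, SUM(a) FROM t;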
Robert Bedell wrote on Thu, 18.12.2003 at 01:02:
> > > These are not simple projects, I know.
> >
> > Might be a tad ambitious for your first venture into backend hacking...
>
> I agree completely. I'm not purporting to jump in quite that quickly, but
> it is something I would like to see added, and am willing to put the time
> towards it. If you have any suggestions for items on the TODO list that
> would move me more quickly towards my goal, I would appreciate it.

I guess that by adding hash aggregates Tom solved most problems of adding
ROLLUP, CUBE and GROUPING SETS.

OTOH, I'm not sure if hash aggregates can already spill to disk if not
enough memory is available for keeping them all. If not, then adding this
capability would be a great push towards their general use for GROUPING
SETS.

Also, a mix of scan-over-sorted-group-by + hash aggregates for
out-of-order extra groups would be a great way to use less memory for
hash aggregates.

----------------
Hannu
> I guess that by adding hash aggregates Tom solved most problems of adding
> ROLLUP, CUBE and GROUPING SETS.
>
> OTOH, I'm not sure if hash aggregates can already spill to disk if not
> enough memory is available for keeping them all. If not, then adding this
> capability would be a great push towards their general use for GROUPING
> SETS.
>
> Also, a mix of scan-over-sorted-group-by + hash aggregates for
> out-of-order extra groups would be a great way to use less memory for
> hash aggregates.

The other issue is that in a scan-over-sorted-group-by without
out-of-order grouping sets, you can return tuples as you reset the
aggregators. With out-of-order grouping sets you would have to wait until
the whole table was scanned - at least for those grouping sets - to
return the resulting tuples. Since this could get rather large, the
spill-to-disk functionality is necessary. It should probably mimic how
the sort does it...

Another point is selecting the best way to sort a given collection of
grouping sets for minimal memory usage.

Any ORDER BY in the query should really be applied after the grouping
operation.

The CUBE and ROLLUP operators should really be applied by expanding them
into the equivalent collections of grouping sets.

Cheers,
Robert
Robert Bedell wrote on Thu, 18.12.2003 at 01:55:
> > I guess that by adding hash aggregates Tom solved most problems of adding
> > ROLLUP, CUBE and GROUPING SETS.
> >
> > OTOH, I'm not sure if hash aggregates can already spill to disk if not
> > enough memory is available for keeping them all. If not, then adding this
> > capability would be a great push towards their general use for GROUPING
> > SETS.
> >
> > Also, a mix of scan-over-sorted-group-by + hash aggregates for
> > out-of-order extra groups would be a great way to use less memory for
> > hash aggregates.
>
> The other issue is that in a scan-over-sorted-group-by without
> out-of-order grouping sets, you can return tuples as you reset the
> aggregators. With out-of-order grouping sets you would have to wait until
> the whole table was scanned - at least for those grouping sets - to
> return the resulting tuples. Since this could get rather large, the
> spill-to-disk functionality is necessary. It should probably mimic how
> the sort does it...
>
> Another point is selecting the best way to sort a given collection of
> grouping sets for minimal memory usage.

It seems that the longest GROUPING SET and all its left-continuous
subsets could be collected from the sorted scan, and the rest from hash
aggregates.

GROUPING SET () will always need a "hash" ;)

To optimise any further would require use of statistics data, and is
probably not a good idea to do before having the simpler one implemented.

> Any ORDER BY in the query should really be applied after the grouping
> operation.
>
> The CUBE and ROLLUP operators should really be applied by expanding them
> into the equivalent collections of grouping sets.

For pure ROLLUP one could shortcut the split-into-groups and
put-together-again process, as ROLLUP is already doable from a single
sorted scan.

-------------
Hannu
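[Illustration: the split Hannu describes, applied to the CUBE example from
earlier in the thread; the table t(a, b, c) is still hypothetical.]

  -- CUBE (b, c) expands to GROUPING SETS ((b, c), (b), (c), ()).
  -- With the input sorted on (b, c):
  --   (b, c), (b)  - left-continuous prefixes of the sort order; their
  --                  aggregators are reset and emitted as b or (b, c)
  --                  changes during the scan.
  --   ()           - a single aggregator that never resets; emitted at
  --                  the end of the scan (the degenerate one-entry "hash").
  --   (c)          - out of order with respect to the sort, so it needs a
  --                  hash aggregate and can be emitted only after the scan.
  SELECT b, c, SUM(a)
  FROM t
  GROUP BY CUBE (b, c);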
> It seems that the longest GROUPING SET and all its left-continuous
> subsets could be collected from the sorted scan, and the rest from hash
> aggregates.
>
> GROUPING SET () will always need a "hash" ;)
>
> To optimise any further would require use of statistics data, and is
> probably not a good idea to do before having the simpler one implemented.

Absolutely right. That's a good starting point.

> > Any ORDER BY in the query should really be applied after the grouping
> > operation.
> >
> > The CUBE and ROLLUP operators should really be applied by expanding them
> > into the equivalent collections of grouping sets.
>
> For pure ROLLUP one could shortcut the split-into-groups and
> put-together-again process, as ROLLUP is already doable from a single
> sorted scan.

Actually, as long as the grouping sets are all left-continuous subsets of
the longest grouping set, it's doable from a single sorted scan. If done
with the right implementation, separating resettable aggregators and
out-of-order aggregators, you could get this optimization for free. This
avoids having to look for ROLLUP specifically.

Cheers,
Robert
> > For pure ROLLUP one could shortcut the split-into-groups and
> > put-together-again process, as ROLLUP is already doable from a single
> > sorted scan.
>
> Actually, as long as the grouping sets are all left-continuous subsets of
> the longest grouping set, it's doable from a single sorted scan. If done
> with the right implementation, separating resettable aggregators and
> out-of-order aggregators, you could get this optimization for free. This
> avoids having to look for ROLLUP specifically.

OK, putting it that way was kind of dumb - and you already said that,
mostly. What I meant to say is that you don't have to check for ROLLUP
specifically to get that optimization. This also allows you to do
"GROUP BY a, b, ROLLUP(c,d)" (resulting in grouping sets
{{a,b,c,d},{a,b,c},{a,b}}) in an optimized fashion, without having to
check for pure ROLLUP.

Cheers,
Robert
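[Illustration: the mixed form Robert mentions, spelled out; the sales table
and its columns are hypothetical.]

  -- Hypothetical table sales(a, b, c, d, amount).
  -- GROUP BY a, b, ROLLUP (c, d) is equivalent to
  --   GROUP BY GROUPING SETS ((a, b, c, d), (a, b, c), (a, b))
  -- and every one of those sets is a left-continuous prefix of the sort
  -- order (a, b, c, d), so a single sorted scan suffices.
  SELECT a, b, c, d, SUM(amount)
  FROM sales
  GROUP BY a, b, ROLLUP (c, d);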
Hannu Krosing <hannu@tm.ee> writes:
> OTOH, I'm not sure if hash aggregates can already spill to disk if not
> enough memory is available for keeping them all.

They do not, which is something it'd be good to fix, since if the planner
drastically underestimates the number of groups, you could end up with a
hashtable far bigger than the advertised SortMem. Even if this doesn't
end up with an "out of memory" error, it could still drive you into swap
hell.

(Maybe that should be Robert's get-your-feet-wet project.)

			regards, tom lane
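[Illustration: a hypothetical way to run into the situation Tom describes,
assuming a 7.4-era server; big_table and the numbers are made up, and the
plan actually chosen depends on the available statistics.]

  -- Hypothetical big_table(id, payload) with many millions of rows.
  SET sort_mem = 1024;   -- memory budget the planner assumes, in kB

  -- Grouping on an expression is a classic way for the planner to
  -- underestimate the number of distinct groups. If it then picks a
  -- hashed aggregate, the executor may build a hash table far larger
  -- than sort_mem, since it currently has no way to spill to disk.
  EXPLAIN
  SELECT id % 1000000 AS g, COUNT(*)
  FROM big_table
  GROUP BY id % 1000000;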
> Hannu Krosing <hannu@tm.ee> writes:
> > OTOH, I'm not sure if hash aggregates can already spill to disk if not
> > enough memory is available for keeping them all.
>
> They do not, which is something it'd be good to fix, since if the planner
> drastically underestimates the number of groups, you could end up with a
> hashtable far bigger than the advertised SortMem. Even if this doesn't
> end up with an "out of memory" error, it could still drive you into swap
> hell.
>
> (Maybe that should be Robert's get-your-feet-wet project.)

OK, I've been looking through the executor - nodeGroup.c, nodeAgg.c,
dynahash.c, execGrouping.c, etc. - and I've found the places where hashed
aggregation occurs. It looks like hashed groups simply read the entire
result into memory via a function chain through lookup_hash_entry() ->
LookupTupleHashEntry() -> hash_search(). I think that
LookupTupleHashEntry() is the best place to put the code to spill over
onto disk, since it is used only by the Group, Agg, and Subplan executor
nodes. Putting it there gives the added benefit of letting hashed
subselect results spill over onto disk as well (not sure if that's the
right thing to do).

I have a couple of questions:

1) When does the optimizer set the nodeAgg plan to HASHED? The current
implementation simply reads through the entire result before returning
anything, so obviously it's not always done. Sorry if I should RTFM on
this one...

2) What mechanism would be best to use for storing the data on disk? I
know there is a temporary table mechanism; I'll be hunting for that
shortly...

3) What should define the spillover point? The documentation points to
the 'sort_mem' parameter for this, but the code doesn't look to actually
implement that yet. Similarly for the hash_search() function - I didn't
see anything to store it to disk.

4) Should LookupTupleHashEntry() be worried about the pointers it
receives... similarly for hash_search()?

Obviously (2) is the really big question. What's the best way to do this?
I still have a bit more to go before I understand what's going on, but
I'm starting to grasp it. Any tips would be appreciated! :)

Cheers!
Robert
"Robert Bedell" <robert@friendlygenius.com> writes: > 1) When does the optimizer set the nodeAgg plan to HASHED? See grouping_planner() in src/backend/optimizer/plan/planner.c particularly the logic around use_hashed_grouping. > 2) What mechanism would be best to use for storing the data on disk? I know > there is a temporary table mechanism, I'll be hunting for that shortly.. Temp files, not temp tables. You could look at src/backend/utils/sort/tuplesort.c or src/backend/executor/nodeHash.c for examples. > 3) What should define the spillover point. sort_mem. > The documentation points to the > 'sort_mem' parameter for this, but the code doesn't look to actually > implement that yet. Well, yeah, that's sort of exactly the point ... it's considered during planning but the executor code has no fallback if the planner guesses wrong. > 4) Should LookupTupleHashEntry() be worried about the pointers it > receives...similarly for hash_search()? Eh? regards, tom lane
Thanks for the pointers!

> > The documentation points to the 'sort_mem' parameter for this, but the
> > code doesn't look to actually implement that yet.
>
> Well, yeah, that's sort of exactly the point ... it's considered during
> planning but the executor code has no fallback if the planner guesses
> wrong.

Yep. The doc is wishful thinking then ;) I was just verifying that.

> > 4) Should LookupTupleHashEntry() be worried about the pointers it
> > receives... similarly for hash_search()?

I was wondering how safe it is to write hashed data to disk... given your
suggestions (and the obvious necessity), pretty safe, but it requires
some handling. My bad.

Cheers!
Robert
Also ...

"Robert Bedell" <robert@friendlygenius.com> writes:
> ... I think that LookupTupleHashEntry() is the best place to put the
> code to spill over onto disk, since it is used only by the Group, Agg,
> and Subplan executor nodes.

That's the right area to be looking at, but I don't think you can expect
to do a decent job with localized hacking in LookupTupleHashEntry. That
function's API is predicated on the assumption that you have random
access to any entry in the hash table --- which stops being true as soon
as you spill some entries to disk. The key problem here is to figure out
how to reorganize the computation to cope efficiently with nonrandom
access to the hash table. You might want to look at the algorithm
embodied in nodeHash.c/nodeHashjoin.c, which partitions the hash table
and avoids cross-partition accesses. I don't think it's directly
applicable :-(, because it chooses the number of partitions in advance,
which is exactly the luxury we do not have here --- if the planner had
gotten the number of entries right then we'd not be needing to spill.
But it might give some ideas.

			regards, tom lane
> That's the right area to be looking at, but I don't think you can expect
> to do a decent job with localized hacking in LookupTupleHashEntry. That
> function's API is predicated on the assumption that you have random
> access to any entry in the hash table --- which stops being true as soon
> as you spill some entries to disk. The key problem here is to figure out
> how to reorganize the computation to cope efficiently with nonrandom
> access to the hash table. You might want to look at the algorithm
> embodied in nodeHash.c/nodeHashjoin.c, which partitions the hash table
> and avoids cross-partition accesses. I don't think it's directly
> applicable :-(, because it chooses the number of partitions in advance,
> which is exactly the luxury we do not have here --- if the planner had
> gotten the number of entries right then we'd not be needing to spill.
> But it might give some ideas.

Without detailed statistics on the distribution of the result set with
respect to the grouping set(s), and without sorting the data, I don't
think there is a way to efficiently chop it up.

One idea is this: if you scan the result set once for all grouping sets,
storing those uniquely in hash batches in temp files, you could do the
aggregation in multiple passes by rescanning the result set for each
batch and skipping the rows that didn't match anything in the current
hash batch. The problem is that you're rescanning the underlying result
all the time, and this is not going to scale. Hence I don't like this
idea.

I think a better idea is to actually leave the random access concept in
there and implement an LRU cache on the grouping sets in the memory space
provided by sort_mem, growing that cache as needed until it has to drop
entries out to disk. This is essentially what virtual memory would do for
us anyway, but it's nicer on the system since we cap the resources
instead of it doing so. I don't know how this would fit into the existing
hash table mechanism (maybe not at all), but it would allow grouping sets
to spill over onto disk. This, after all, is just an emergency brake for
a HASHED plan node that was not correctly planned in the first place.

It would also allow spillover of grouping sets later on rather
efficiently: if the strategy is AGG_HASHED, the entire thing should fit
into memory and we don't need to worry about sorting the input or
separating grouping sets into resettable aggregators and out-of-order
aggregators. If you DO need to worry about that, you:

- sort the input
- return the tuples resulting from resettable aggregators (those
  computable from the ordering of the sort) immediately
- use the hash table mechanism to store grouping results that are out of
  order (with its associated spillover onto disk). Since the hash table
  preserves the concept of random access, this is transparent.
- return the results from the hash table after you are done with the
  initial scan of the underlying table.

The main issues with an LRU cache of the hash table on disk are (save me
on this, I'm dumb... :)):

- nonuniform size of tuples
- storing buckets instead of single tuple entries
- avoiding thrashing

Anyway, those are just my two cents. Take them as they are - I haven't
finished looking at tuplestore.c and co. yet.

Cheers!
Robert
We have a rude hack that spills hashed aggs to disk to deal with the case
where there is not enough memory. I don't think it's an ideal solution,
but it certainly has the code to dump to file. I can post the patch later
in the day.

(This is some code for our undergrad db class assignment. I was planning
to clean it up and submit it properly but I never got the time.)

--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh