Re: OLAP CUBE/ROLLUP Operators and GROUP BY grouping sets - Mailing list pgsql-hackers
| From | Robert Bedell |
|---|---|
| Subject | Re: OLAP CUBE/ROLLUP Operators and GROUP BY grouping sets |
| Date | |
| Msg-id | 200312180039224.SM00984@xavier |
| In response to | Re: OLAP CUBE/ROLLUP Operators and GROUP BY grouping sets (Tom Lane <tgl@sss.pgh.pa.us>) |
| List | pgsql-hackers |
> That's the right area to be looking at, but I don't think you can expect
> to do a decent job with localized hacking in LookupTupleHashEntry. That
> function's API is predicated on the assumption that you have random
> access to any entry in the hash table --- which stops being true as soon
> as you spill some entries to disk. The key problem here is to figure
> out how to reorganize the computation to cope efficiently with nonrandom
> access to the hash table. You might want to look at the algorithm
> embodied in nodeHash.c/nodeHashjoin.c, which partitions the hash table
> and avoids cross-partition accesses. I don't think it's directly
> applicable :-(, because it chooses the number of partitions in advance,
> which is exactly the luxury we do not have here --- if the planner had
> gotten the number of entries right then we'd not be needing to spill.
> But it might give some ideas.

Without detailed statistics on the distribution of the result set with respect to the grouping set(s), and without sorting the data, I don't think there is a way to efficiently chop it up.

One idea is this: if you scan the result set once for all grouping sets, storing those uniquely in hash batches in temp files, you could do the aggregation in multiple passes by rescanning the result set for each batch and skipping the rows that don't match anything in the current hash batch. The problem is that you're rescanning the underlying result all the time, which is not going to scale, so I don't like this idea.

I think a better idea is to leave the random-access concept in place and implement an LRU cache on the grouping sets in the memory space provided by sort_mem, growing that cache as needed until it has to spill entries out to disk. This is essentially what virtual memory would do for us anyway, but it's nicer on the system since we cap the resources ourselves instead of leaving it to the OS.
I don't know how this would fit into the existing hash table mechanism (maybe not at all), but it would allow grouping sets to spill over onto disk. This, after all, is just an emergency brake for a HASHED plan node that was not correctly planned in the first place. It would also allow spillover of grouping sets later on rather efficiently. If the strategy is AGG_HASHED, the entire thing should fit into memory and we don't need to worry about sorting the input or separating grouping sets into resettable aggregators and out-of-order aggregators. If you DO need to worry about that, you:

- sort the input
- return the tuples resulting from resettable aggregators (those computable from the ordering of the sort) immediately
- use the hash table mechanism to store grouping results that are out of order (with its associated spillover onto disk); since the hash table preserves the concept of random access, this is transparent
- return the results from the hash table after you are done with the initial scan of the underlying table

The main issues with an LRU cache of the hash table on disk are (save me on this, I'm dumb.... :)):

- nonuniform size of tuples
- storing buckets instead of single tuple entries
- avoiding thrashing

Anyway, those are just my two cents. Take them as they are; I haven't finished looking at tuplestore.c and co. yet.

Cheers!

Robert