Re: OLAP CUBE/ROLLUP Operators and GROUP BY grouping sets - Mailing list pgsql-hackers

From Robert Bedell
Subject Re: OLAP CUBE/ROLLUP Operators and GROUP BY grouping sets
Date
Msg-id 200312180039224.SM00984@xavier
Whole thread Raw
In response to Re: OLAP CUBE/ROLLUP Operators and GROUP BY grouping sets  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
> That's the right area to be looking at, but I don't think you can expect
> to do a decent job with localized hacking in LookupTupleHashEntry.  That
> function's API is predicated on the assumption that you have random
> access to any entry in the hash table --- which stops being true as soon
> as you spill some entries to disk.  The key problem here is to figure
> out how to reorganize the computation to cope efficiently with nonrandom
> access to the hash table.  You might want to look at the algorithm
> embodied in nodeHash.c/nodeHashjoin.c, which partitions the hash table
> and avoids cross-partition accesses.  I don't think it's directly
> applicable :-(, because it chooses the number of partitions in advance,
> which is exactly the luxury we do not have here --- if the planner had
> gotten the number of entries right then we'd not be needing to spill.
> But it might give some ideas.

Without detailed statistics on the distribution of the result set with
respect to the grouping set(s), and without sorting the data, I don't think
there is a way to efficiently chop it up.

One idea is this: If you scan the result set once for all grouping sets,
storing those uniquely in hash batches in temp files, you could do the
aggregation in multiple passes by rescanning the result set for each batch
and skipping the rows that didn't match anything in the current hash batch.
The problem is that you're rescanning the underlying result all the time,
and this is not going to scale.  Hence I don't like this idea.

I think a better idea is to actually leave the random access concept in
there and implement an LRU cache on the grouping sets in the memory space
provided by sort_mem, growing that cache as needed until it has to drop it
out to disk.  This is essentially what virtual memory would do for us
anyway, but it's nicer on the system since we cap the resources ourselves.
I don't know how this would fit into the existing hash table mechanism
(maybe not at all), but it would allow grouping sets to spill over onto
disk.  This, after all, is just an emergency brake for a HASHED plan node
that was not correctly planned in the first place.

It would also allow spillover of grouping sets later on rather efficiently:
if the strategy is AGG_HASHED, the entire thing should fit into memory and
we don't need to worry about sorting the input or separating grouping sets
into resettable aggregators and out-of-order aggregators.  If you DO need to
worry about that, you:

- sort the input
- return the tuples resulting from resettable aggregators (those computable
from the ordering of the sort) immediately
- use the hash table mechanism to store grouping results that are out of
order (with its associated spillover onto disk).  Since the hash table
preserves the concept of random access, this is transparent.
- return the results from the hash table after you are done with the initial
scan of the underlying table.

The main issues with an LRU cache of the hash table on disk are (correct me
if I'm off base here :)):

- nonuniform size of tuples
- storing buckets instead of single tuple entries
- avoiding thrashing

Anyway those are just my two cents.  Take them as they are, I haven't
finished looking at tuplestore.c and co yet.

Cheers!

Robert


