Re: [HACKERS] Hash support for grouping sets - Mailing list pgsql-hackers

From: Andrew Gierth
Subject: Re: [HACKERS] Hash support for grouping sets
Date:
Msg-id: 87y3vw7z9g.fsf@news-spur.riddles.org.uk
In response to: Re: [HACKERS] Hash support for grouping sets (Mark Dilger <hornschnorter@gmail.com>)
Responses: Re: [HACKERS] Hash support for grouping sets (Mark Dilger <hornschnorter@gmail.com>)
List: pgsql-hackers

>>>>> "Mark" == Mark Dilger <hornschnorter@gmail.com> writes:
 Mark> You define DiscreteKnapsack to take integer weights and double
 Mark> values, and perform the usual Dynamic Programming algorithm to
 Mark> solve.  But the only place you call this, you pass in NULL for
 Mark> the values, indicating that all the values are equal.  I'm
 Mark> confused why in this degenerate case you bother with the DP at
 Mark> all.  Can't you just load the knapsack from lightest to heaviest
 Mark> after sorting, reducing the runtime complexity to O(n log n)?
 Mark> It's been a long day.  Sorry if I am missing something obvious.

That's actually a very good question.
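
(For concreteness, the greedy loading being suggested would look something
like the sketch below. This is just an illustration of the idea, not code
from the patch; the Item struct and the function names are made up here.)

    #include <stdlib.h>

    typedef struct { int weight; } Item;

    static int
    cmp_weight(const void *a, const void *b)
    {
        const Item *ia = (const Item *) a;
        const Item *ib = (const Item *) b;

        return (ia->weight > ib->weight) - (ia->weight < ib->weight);
    }

    /* Load items lightest-first until the capacity would be exceeded.
     * With all item values equal, this maximizes the number of items
     * that fit, and the qsort dominates the cost at O(n log n).
     * Returns the number of items loaded. */
    static int
    greedy_fill(Item *items, int nitems, int capacity)
    {
        int     used = 0;
        int     taken = 0;

        qsort(items, nitems, sizeof(Item), cmp_weight);
        for (int i = 0; i < nitems; i++)
        {
            if (used + items[i].weight > capacity)
                break;          /* everything after this is heavier */
            used += items[i].weight;
            taken++;
        }
        return taken;
    }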

(Though in passing I should point out that the runtime cost is going to
be negligible in all practical cases. Even an 8-way cube (256 grouping
sets) has only 70 rollups, and a 4-way cube has only 6; the limit of
4096 grouping sets was only chosen because it was a nice round number
that was larger than comparable limits in other database products. Other
stuff in the grouping sets code has worse time bounds; the
bipartite-match code used to minimize the number of rollups is
potentially O(n^2.5) in the number of grouping sets.)
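
(Those rollup counts aren't arbitrary, by the way, if I have the
combinatorics right: a rollup is a chain of grouping sets ordered by
inclusion, and the minimum number of chains needed to cover all 2^n sets
of an n-way cube is the width of the subset lattice,

    minimum rollups for an n-way cube = C(n, floor(n/2))
    n = 8:  C(8,4) = 70
    n = 4:  C(4,2) = 6

which is what the bipartite-match code ends up producing for a full cube.)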

Thinking about it, it's not at all obvious what we should be optimizing
for here in the absence of accurate sort costs. Right now what matters
is saving as many sort steps as possible, since that's a saving likely
to be many orders of magnitude greater than the differences between two
sorts. The one heuristic that might be useful is to prefer the smaller
estimated size if other factors are equal, just on memory usage grounds,
but even that seems a bit tenuous.

I'm inclined to leave things as they are in the current patch, and maybe
revisit it during beta if we get any relevant feedback from people
actually using grouping sets?

 Mark> I'm guessing you intend to extend the code at some later date to
 Mark> have different values for items.  Is that right?

Well, as I mentioned in a reply to Andres, we probably should eventually
optimize for greatest reduction in cost, and the code as it stands would
allow that to be added easily, but that would require having more
accurate cost info than we have now. cost_sort doesn't take into
consideration the number or types of sort columns or the costs of their
comparison functions, so all sorts of the same input data are costed
equally.
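
(To illustrate the shape of that eventual change -- hypothetical code, not
part of the patch, with the helper and variable names made up for the
example -- the caller would fill in the values array that is currently
passed as NULL, roughly:

    /* Score each candidate rollup by the sort cost it would save us,
     * rather than treating all rollups as equally valuable, and let the
     * knapsack maximize the total saving within the same weight budget. */
    double *values = (double *) palloc(num_items * sizeof(double));

    for (int i = 0; i < num_items; i++)
        values[i] = estimated_sort_cost_saving(root, rollups[i]);

    chosen = DiscreteKnapsack(max_weight, num_items, item_weights, values);

but as above, that only becomes worthwhile once cost_sort can actually
tell those sorts apart.)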

-- 
Andrew (irc:RhodiumToad)


