Re: 9.5: Memory-bounded HashAgg - Mailing list pgsql-hackers
| From | Jeff Davis |
|---|---|
| Subject | Re: 9.5: Memory-bounded HashAgg |
| Date | |
| Msg-id | 1408035285.2335.163.camel@jeff-desktop |
| In response to | Re: 9.5: Memory-bounded HashAgg ("Tomas Vondra" <tv@fuzzy.cz>) |
| Responses | Re: 9.5: Memory-bounded HashAgg |
| List | pgsql-hackers |
On Thu, 2014-08-14 at 16:17 +0200, Tomas Vondra wrote:
> Either it belongs to the current batch (and either it's in the hash
> table, or you add it there), or it's not - in that case write it to a
> temp file.

I think the part you left out is that you need two files per batch: one
for the dumped-out partially-computed state values, and one for the
tuples. In other words, you haven't really discussed the step where you
reunite the tuples with that partially-computed state. (A sketch of that
two-file bookkeeping follows at the end of this message.)

> For sure, it's not for free - it may write to quite a few files. Is it
> more expensive than what you propose? I'm not sure about that. With
> your batching scheme, you'll end up with lower number of large batches,
> and you'll need to read and split them, possibly repeatedly. The
> batching scheme from hashjoin minimizes this.

My approach only has fewer batches if it elects to have fewer batches,
which might happen for two reasons:

1. A cardinality misestimate. This certainly could happen, but we do
   have useful numbers to work from (we know the number of tuples and
   distincts that we've read so far), so it's far from a blind guess.

2. We're concerned about the random I/O from way too many partitions.

> I fail to see how this is different from your approach? How can you
> output any tuples before processing the whole inner relation?

Right, the only thing I avoid is scanning the hash table and dumping
out the groups. This isn't a major distinction, more like "my approach
does a little less work before returning tuples", and I'm not even sure
I can defend that, so I'll retract this point.

> Your approach is to do multi-level batching, and I was thinking whether
> it'd be possible to use the same approach (single level). But in
> retrospect it probably does not make much sense, because the multi-level
> batching is one of the points of the proposed approach.

Now that I think about it, many of the points we discussed could
actually work with either approach:

* In my approach, if I need more partitions, I could create more in
  much the same way as HashJoin to keep it single-level (as you suggest
  above).

* In your approach, if there are too many partitions, you could avoid
  random I/O by intentionally putting tuples from multiple partitions
  in a single file and moving them while reading.

* If given a way to write out the partially-computed states, I could
  evict some groups from the hash table to keep an array_agg() bounded.

Our approaches only differ on one fundamental trade-off that I see:

(A) My approach requires a hash lookup of an already-computed hash for
    every incoming tuple, not only the ones going into the hash table.

(B) Your approach requires scanning the hash table and dumping out the
    states every time the hash table fills up, which therefore requires
    a way to dump out the partial states.

You could probably win the argument by pointing out that (A) is O(N)
and (B) is O(log2(N)). But I suspect that cost (A) is very low.

Unfortunately, it would take some effort to test your approach because
we'd actually need a way to write out the partially-computed state, and
the algorithm itself seems a little more complex. So I'm not really
sure how to proceed.

Regards,
	Jeff Davis
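To make the "two files per batch" point above concrete, here is a minimal,
self-contained C sketch of that bookkeeping under toy assumptions: the
aggregate is SUM over (key, value) pairs, the "partition hash" is simply the
key's parity, and all names (Group, Tuple, process, MAX_GROUPS) are invented
for the example. It is not the proposed patch's code; it only illustrates the
mechanism: when the hash table fills, one partition's partially-computed
states are dumped to a state file and evicted, later tuples for that
partition go to a tuple file, and processing the batch later means reuniting
the saved states with the saved tuples.

```c
#include <stdio.h>

#define MAX_GROUPS 4                            /* stand-in for work_mem */

typedef struct { int key; long sum; } Group;    /* partially-computed state */
typedef struct { int key; int value; } Tuple;   /* input tuple */

static Group table[MAX_GROUPS];
static int   ngroups;
static int   spilled;           /* has the odd-key partition been spilled? */

static Group *lookup(int key)
{
    for (int i = 0; i < ngroups; i++)
        if (table[i].key == key)
            return &table[i];
    return NULL;
}

/* Aggregate one tuple, spilling to the batch's two files when memory is full. */
static void process(Tuple t, FILE *state_file, FILE *tuple_file)
{
    if (spilled && (t.key % 2) == 1)
    {
        /* Tuple belongs to the already-spilled batch: save it for later. */
        fwrite(&t, sizeof(Tuple), 1, tuple_file);
        return;
    }

    Group *g = lookup(t.key);
    if (g == NULL)
    {
        if (ngroups == MAX_GROUPS)
        {
            /* Table full: dump the odd-key groups' states and evict them. */
            int kept = 0;
            for (int i = 0; i < ngroups; i++)
            {
                if ((table[i].key % 2) == 1)
                    fwrite(&table[i], sizeof(Group), 1, state_file);
                else
                    table[kept++] = table[i];
            }
            ngroups = kept;
            spilled = 1;
            process(t, state_file, tuple_file); /* re-route this tuple */
            return;
        }
        table[ngroups] = (Group){ t.key, 0 };
        g = &table[ngroups++];
    }
    g->sum += t.value;
}

int main(void)
{
    FILE *state_file = tmpfile();   /* dumped partially-computed states */
    FILE *tuple_file = tmpfile();   /* tuples spilled for the same batch */

    Tuple input[] = { {1,10}, {2,20}, {3,30}, {4,40}, {5,50}, {1,1}, {3,3}, {2,2} };
    for (size_t i = 0; i < sizeof(input) / sizeof(input[0]); i++)
        process(input[i], state_file, tuple_file);

    for (int i = 0; i < ngroups; i++)           /* the in-memory batch is done */
        printf("key=%d sum=%ld\n", table[i].key, table[i].sum);

    /* Reunite the spilled batch: reload its states, then replay its tuples. */
    ngroups = 0;
    spilled = 0;                                /* everything is "current" again */
    rewind(state_file);
    Group g;
    while (fread(&g, sizeof(Group), 1, state_file) == 1)
        table[ngroups++] = g;

    rewind(tuple_file);
    Tuple t;
    while (fread(&t, sizeof(Tuple), 1, tuple_file) == 1)
        process(t, state_file, tuple_file);

    for (int i = 0; i < ngroups; i++)
        printf("key=%d sum=%ld\n", table[i].key, table[i].sum);
    return 0;
}
```

In a real executor the state file would have to hold serialized aggregate
transition values rather than fixed-size structs, which is exactly the
missing piece the message points at: approach (B) depends on every aggregate
being able to write out its partial state, array_agg() included.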