Re: 9.5: Memory-bounded HashAgg - Mailing list pgsql-hackers
| From | Jeff Davis |
|---|---|
| Subject | Re: 9.5: Memory-bounded HashAgg |
| Date | |
| Msg-id | 1407906142.2335.47.camel@jeff-desktop |
| In response to | Re: 9.5: Memory-bounded HashAgg ("Tomas Vondra" <tv@fuzzy.cz>) |
| Responses | Re: 9.5: Memory-bounded HashAgg |
| List | pgsql-hackers |
On Tue, 2014-08-12 at 14:58 +0200, Tomas Vondra wrote:
> CREATE AGGREGATE myaggregate (
>     ...
>     SERIALIZE_FUNC = 'dump_data',
>     DESERIALIZE_FUNC = 'read_data',
>     ...
> );

Seems reasonable.

> I don't see why it should get messy? In the end, you have a chunk of
> data and a hash for it.

Perhaps it's fine; I'd have to see the approach.

> It just means you need to walk through the hash table, look at the
> hashes and dump ~50% of the groups to a file.

If you have fixed-size states, why would you *want* to remove the group?
What is gained?

One thing I like about my simple approach is that it returns a good
number of groups after each pass, and then those are completely finished
(returned to the operator above, even). That's impossible with HashJoin
because the hashing all needs to be done before the probe phase begins.

The weakness of my approach is the array_agg case that you mention,
because this approach doesn't offer a way to dump out transition states.
It seems like that could be added later, but let me know if you see a
problem there.

> I think you're missing the point, here. You need to compute the hash
> in both cases. And then you either can do a lookup or just peek at the
> first few bits of the hash to see whether it's in the current batch or
> not.

I understood that. The point I was trying to make (which might or might
not be true) was that: (a) this only matters for a failed lookup,
because a successful lookup would just go in the hash table anyway; and
(b) a failed lookup probably doesn't cost much compared to all of the
other things that need to happen along that path.

I should have chosen a better example though. For instance: if the
lookup fails, we need to write the tuple, and writing the tuple is sure
to swamp the cost of a failed hash lookup.

> is much faster than a lookup. Also, as the hash table grows (beyond L3
> cache size, which is a few MBs today), it becomes much slower in my
> experience - that's one of the lessons I learnt while hacking on the
> hashjoin. And we're dealing with hashagg not fitting into work_mem, so
> this seems to be relevant.

Could be, but this is also the path that goes to disk, so I'm not sure
how significant it is.

> > Because I suspect there are costs in having an extra file around
> > that I'm not accounting for directly. We are implicitly assuming
> > that the OS will keep around enough buffers for each BufFile to do
> > sequential writes when needed. If we create a zillion partitions,
> > then either we end up with random I/O or we push the memory burden
> > into the OS buffer cache.
>
> Assuming I understand it correctly, I think this logic is broken. Are
> you saying "We'll try to do memory-bounded hashagg, but not for the
> really large datasets because of fear we might cause random I/O"?

No, the memory is still bounded even for very high cardinality inputs
(ignoring array_agg case for now). When a partition is processed later,
it also might exhaust work_mem, and need to write out tuples to its own
set of partitions. This allows memory-bounded execution to succeed even
if the number of partitions each iteration is one, though it will result
in repeated I/O for the same tuple.
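To make that control flow concrete, here is a minimal, self-contained C
sketch of the recursion described above. It is only an illustration, not
the patch: the names (hashagg, MAX_GROUPS_IN_MEM, NUM_PARTITIONS) are
invented, int64 keys stand in for grouping keys, a COUNT-style counter
stands in for the transition state, a tiny fixed-size array stands in
for the work_mem-bounded hash table, and tmpfile() stands in for
BufFile. Groups that fit are finished and emitted at the end of each
pass; tuples whose groups don't fit are routed to partitions by
as-yet-unused hash bits and reprocessed later under the same bound.

```c
/*
 * Minimal sketch (not the actual patch) of memory-bounded hashagg:
 * aggregate what fits in the in-memory table, spill tuples of groups
 * that don't fit into partition files chosen by unused hash bits, emit
 * the finished groups, then recurse on each partition with the same
 * memory bound.
 */
#include <stdio.h>
#include <stdint.h>

#define MAX_GROUPS_IN_MEM 4     /* stand-in for the work_mem limit */
#define NUM_PARTITIONS    2     /* partitions created per pass */

typedef struct { int64_t key; int64_t count; } Group;

static uint64_t hash_key(int64_t key) { return (uint64_t) key * 2654435761u; }

/* Aggregate one stream of keys; 'depth' selects which hash bits pick a partition. */
static void hashagg(FILE *input, int depth)
{
    Group   groups[MAX_GROUPS_IN_MEM];
    int     ngroups = 0;
    FILE   *part[NUM_PARTITIONS] = {NULL};
    int64_t key;

    while (fread(&key, sizeof(key), 1, input) == 1)
    {
        int i;

        for (i = 0; i < ngroups; i++)       /* linear scan stands in for a hash lookup */
            if (groups[i].key == key)
                break;

        if (i < ngroups)
            groups[i].count++;              /* existing group: just advance the state */
        else if (ngroups < MAX_GROUPS_IN_MEM)
        {
            groups[ngroups].key = key;      /* room left: start a new group */
            groups[ngroups].count = 1;
            ngroups++;
        }
        else
        {
            /* no room: spill the tuple to a partition picked by unused hash bits */
            int p = (int) ((hash_key(key) >> depth) % NUM_PARTITIONS);

            if (part[p] == NULL)
                part[p] = tmpfile();
            fwrite(&key, sizeof(key), 1, part[p]);
        }
    }

    /* groups that fit are completely finished: return them to the caller now */
    for (int i = 0; i < ngroups; i++)
        printf("key=%lld count=%lld\n",
               (long long) groups[i].key, (long long) groups[i].count);

    /* each partition is processed later, and may itself spill again */
    for (int p = 0; p < NUM_PARTITIONS; p++)
        if (part[p] != NULL)
        {
            rewind(part[p]);
            hashagg(part[p], depth + 1);
            fclose(part[p]);
        }
}

int main(void)
{
    FILE   *input = tmpfile();
    int64_t keys[] = {1, 2, 3, 4, 5, 6, 1, 2, 5, 6, 7, 8, 7, 9};

    fwrite(keys, sizeof(keys[0]), sizeof(keys) / sizeof(keys[0]), input);
    rewind(input);
    hashagg(input, 0);
    fclose(input);
    return 0;
}
```

The point of the sketch is that memory stays bounded at every level,
even when each pass can only peel off a few groups; the price is
re-writing and re-reading the spilled tuples.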
> While I certainly understand your concerns about generating excessive
> amount of random I/O, I think the modern filesystem are handling that
> just fine (coalescing the writes into mostly sequential writes, etc.).
> Also, current hardware is really good at handling this (controllers
> with write cache, SSDs etc.).

All of that requires memory. We shouldn't dodge a work_mem limit by
using the kernel's memory, instead.

> Also, if hash-join does not worry about number of batches, why should
> hashagg worry about that? I expect the I/O patterns to be very
> similar.

One difference with HashJoin is that, to create a large number of
batches, the inner side must be huge, which is not the expected
operating mode for HashJoin[1].

Regardless, every partition that is active *does* have a memory cost.
HashJoin might ignore that cost, but that doesn't make it right.

I think the right analogy here is to Sort's poly-phase merge -- it
doesn't merge all of the runs at once; see the comments at the top of
tuplesort.c. In other words, sometimes it's better to have fewer
partitions (for hashing) or merge fewer runs at once (for sorting). It
does more repeated I/O, but the I/O is more sequential.

> In any case, trying to fix this by limiting number of partitions seems
> like a bad approach. I think factoring those concerns into a costing
> model is more appropriate.

Fair enough. I haven't modeled the cost yet; and I agree that an upper
limit is quite crude.

> (a) COUNT(DISTINCT) -> this is solved by a custom aggregate

Is there a reason we can't offer a hash-based strategy for this one? It
would have to be separate hash tables for different aggregates, but it
seems like it could work.
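A toy illustration of that "separate hash tables for different
aggregates" idea follows -- again just a sketch with invented names, not
executor code: each DISTINCT aggregate keeps its own hash set over its
argument and advances its transition state only when the value is new,
so two COUNT(DISTINCT ...) aggregates over the same input need two
independent sets.

```c
/*
 * Minimal sketch (hypothetical, not PostgreSQL executor code) of a
 * hash-based DISTINCT strategy: one hash set per aggregate, and the
 * transition "function" (here, a counter) runs only for unseen values.
 * The set is a toy open-addressing table with no growth or spilling.
 */
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SET_SIZE 1024            /* toy per-aggregate hash set capacity */

typedef struct
{
    bool    used[SET_SIZE];
    int64_t vals[SET_SIZE];
    int64_t count;               /* the aggregate's transition state */
} DistinctAgg;

/* Returns true if 'val' was newly inserted (i.e. not a duplicate). */
static bool set_insert(DistinctAgg *agg, int64_t val)
{
    uint64_t h = ((uint64_t) val * 2654435761u) % SET_SIZE;

    while (agg->used[h])
    {
        if (agg->vals[h] == val)
            return false;        /* duplicate: skip the transition */
        h = (h + 1) % SET_SIZE;  /* linear probing */
    }
    agg->used[h] = true;
    agg->vals[h] = val;
    return true;
}

int main(void)
{
    /* input "tuples" with two columns; one DISTINCT aggregate per column */
    int64_t     rows[][2] = {{1, 10}, {1, 20}, {2, 10}, {2, 20}, {1, 10}};
    DistinctAgg agg_a, agg_b;

    memset(&agg_a, 0, sizeof(agg_a));
    memset(&agg_b, 0, sizeof(agg_b));

    for (size_t i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
    {
        if (set_insert(&agg_a, rows[i][0]))
            agg_a.count++;       /* COUNT(DISTINCT a) */
        if (set_insert(&agg_b, rows[i][1]))
            agg_b.count++;       /* COUNT(DISTINCT b) */
    }

    printf("count(distinct a) = %lld, count(distinct b) = %lld\n",
           (long long) agg_a.count, (long long) agg_b.count);
    return 0;
}
```

The same memory-bounding question from above applies here too: each
per-aggregate set would itself need to respect work_mem, by spilling or
otherwise.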
> (b) bad estimate of required memory -> this is common for aggregates
> passing 'internal' state (planner uses some quite high defaults)

Maybe some planner hooks? Ideas?

Regards,
        Jeff Davis