Re: 9.5: Memory-bounded HashAgg - Mailing list pgsql-hackers
From | Jeff Davis
Subject | Re: 9.5: Memory-bounded HashAgg
Msg-id | 1407820015.6623.54.camel@jeff-desktop
In response to | Re: 9.5: Memory-bounded HashAgg (Tomas Vondra <tv@fuzzy.cz>)
Responses | Re: 9.5: Memory-bounded HashAgg
List | pgsql-hackers
On Mon, 2014-08-11 at 01:29 +0200, Tomas Vondra wrote:
> On 10.8.2014 23:26, Jeff Davis wrote:
> > This patch requires the Memory Accounting patch, or something
> > similar to track memory usage.
>
> I think the patch you sent actually includes the accounting patch. Is
> that on purpose, or by accident?

Accident, thank you.

> So once a group gets into memory, it stays there? That's going to work
> fine for aggregates with fixed-size state (int4, or generally state that
> gets allocated and does not grow), but I'm afraid for aggregates with
> growing state (as for example array_agg and similar) that's not really a
> solution.

I agree in theory, but for now I'm just not handling that case at all
because there is other work that needs to be done first. For one thing,
we would need a way to save the transition state, and we don't really
have that. In the case of array_agg, the state is not serialized and
there's no generic way to ask it to serialize itself without finalizing.

I'm open to ideas. Do you think my patch is going generally in the right
direction, and we can address this problem later; or do you think we
need a different approach entirely?

> While hacking on the hash join, I envisioned the hash aggregate might
> work in a very similar manner, i.e. something like this:
>
> * nbatches=1, nbits=0
> * when work_mem gets full => nbatches *= 2, nbits += 1
> * get rid of half the groups, using nbits from the hash
>   => dump the current states into 'states.batchno' file
>   => dump further tuples to 'tuples.batchno' file
> * continue until the end, or until work_mem gets full again

It would get a little messy with HashAgg. Hashjoin is dealing entirely
with tuples; HashAgg deals with tuples and groups.

Also, if the transition state is fixed-size (or even nearly so), it
makes no sense to remove groups from the hash table before they are
finished.
We'd need to detect that somehow, and it seems almost like two different
algorithms (though maybe it's not a bad idea to use a different
algorithm for things like array_agg). Not saying that it can't be done,
but (unless you have an idea) it requires quite a bit more work than
what I did here.

> It also seems to me that the logic of the patch is about this:
>
> * try to lookup the group in the hash table
> * found => call the transition function
> * not found
>   * enough space => call transition function
>   * not enough space => tuple/group goes to a batch
>
> Which pretty much means all tuples need to do the lookup first. The nice
> thing on the hash-join approach is that you don't really need to do the
> lookup - you just need to peek at the hash whether the group belongs to
> the current batch (and if not, to which batch it should go).

That's an interesting point. I suspect that, in practice, the cost of
hashing the tuple is more expensive than (or at least not much cheaper
than) a failed lookup.

> For aggregates using 'internal' to pass pointers that requires some help
> from the author - serialization/deserialization functions.

Ah, yes, this is what I was referring to earlier.

> > * EXPLAIN details for disk usage
> > * choose number of partitions intelligently
>
> What is the purpose of HASH_DISK_MAX_PARTITIONS? I mean, when we decide
> we need 2048 partitions, why should we use less if we believe it will
> get us over work_mem?

Because I suspect there are costs in having an extra file around that
I'm not accounting for directly. We are implicitly assuming that the OS
will keep around enough buffers for each BufFile to do sequential writes
when needed. If we create a zillion partitions, then we either end up
with random I/O or we push the memory burden into the OS buffer cache.

We could try to model those costs explicitly to put some downward
pressure on the number of partitions we select, but I just chose to cap
it for now.
> For us, removing the sort is a big deal, because we're working with
> >100M rows regularly. It's more complicated though, because the sort is
> usually enforced by COUNT(DISTINCT) and that's not going to disappear
> because of this patch. But that's solvable with a custom aggregate.

I hope this offers you a good alternative. I'm not sure it will ever
beat sort for very high cardinality cases, but I hope it can beat sort
when the group size averages something higher than one. It will also be
safer, so the optimizer can be more aggressive about choosing HashAgg.

Thank you for taking a look so quickly!

Regards,
    Jeff Davis