Re: 9.5: Memory-bounded HashAgg - Mailing list pgsql-hackers
From: Jeff Davis
Subject: Re: 9.5: Memory-bounded HashAgg
Date:
Msg-id: 1408000948.2335.103.camel@jeff-desktop
In response to: Re: 9.5: Memory-bounded HashAgg ("Tomas Vondra" <tv@fuzzy.cz>)
Responses:
  Re: 9.5: Memory-bounded HashAgg
  Re: 9.5: Memory-bounded HashAgg
List: pgsql-hackers
I think the hash-join like approach is reasonable, but I also think
you're going to run into a lot of challenges that make it more complex
for HashAgg. For instance, let's say you have the query:

  SELECT x, array_agg(y) FROM foo GROUP BY x;

Say the transition state is an array (for the sake of simplicity), so
the hash table has something like:

  1000 => {7, 8, 9}
  1001 => {12, 13, 14}

You run out of memory and need to split the hash table, so you scan the
hash table and find that group 1001 needs to be written to disk. So you
serialize the key and array and write them out. Then the next tuple you
get is (1001, 19). What do you do? Create a new group 1001 => {19} (how
do you combine it later with the first one)? Or try to fetch the
existing group 1001 from disk and advance it (horrible random I/O)?

On Wed, 2014-08-13 at 12:31 +0200, Tomas Vondra wrote:
> My understanding of the batching algorithm (and I may be wrong on this
> one) is that once you choose the number of batches, it's pretty much
> fixed. Is that the case?

It's only fixed for that one "work item" (iteration). A different K can
be selected if memory is exhausted again. But you're right: this is a
little less flexible than HashJoin.

> But what will happen in case of significant cardinality underestimate?
> I.e. what will happen if you decide to use 16 batches, and then find
> out 256 would be more appropriate? I believe you'll end up with batches
> 16x the size you'd want, most likely exceeding work_mem.

Yes, except that work_mem would never be exceeded. If the partitions
are 16X work_mem, then each would be added as another work_item, and
hopefully it would choose better the next time.

> > One thing I like about my simple approach is that it returns a good
> > number of groups after each pass, and then those are completely
> > finished (returned to the operator above, even). That's impossible
> > with HashJoin because the hashing all needs to be done before the
> > probe phase begins.
>
> The hash-join approach returns ~1/N groups after each pass, so I fail
> to see how this is better?

You can't return any tuples until you begin the probe phase, and that
doesn't happen until you've hashed the entire inner side (which
involves splitting and other work). With my patch, it will return some
tuples after the first scan. Perhaps I'm splitting hairs here, but the
idea of finalizing some groups as early as possible seems appealing.

> Aha! And the new batches are 'private' to the work item, making it a
> bit recursive, right? Is there any reason not to just double the
> number of batches globally?

I didn't quite follow this proposal.

> It also seems to me that using HASH_DISK_MAX_PARTITIONS, and then
> allowing each work item to create its own set of additional partitions
> effectively renders the HASH_DISK_MAX_PARTITIONS futile.

It's the number of active partitions that matters, because that's what
causes the random I/O.

Regards,
    Jeff Davis
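P.S. To make the "work item" recursion above a bit more concrete, here
is a toy, standalone C sketch of the idea (the names and structure are
simplified and of my own invention; this is not code from either
patch). Each work item is aggregated with a bounded in-memory table;
tuples that no longer fit are routed into a small number of partitions,
and each non-empty partition is pushed back as a new work item, so the
fanout can be chosen again on the next pass:

#include <stdio.h>
#include <stdlib.h>

#define MAX_GROUPS 4            /* stand-in for work_mem */
#define NUM_PARTS  2            /* spill fanout K; could differ per work item */

typedef struct WorkItem
{
    int    *keys;               /* group keys still to be aggregated */
    int     nkeys;
    struct WorkItem *next;
} WorkItem;

static WorkItem *worklist = NULL;

static void
push_item(int *keys, int nkeys)
{
    WorkItem *item = malloc(sizeof(WorkItem));

    item->keys = keys;
    item->nkeys = nkeys;
    item->next = worklist;
    worklist = item;
}

static void
process_item(WorkItem *item)
{
    int     groups[MAX_GROUPS];
    long    counts[MAX_GROUPS];
    int     ngroups = 0;
    int    *part_keys[NUM_PARTS];
    int     part_n[NUM_PARTS] = {0};
    int     i, g, p;

    for (p = 0; p < NUM_PARTS; p++)
        part_keys[p] = malloc(sizeof(int) * item->nkeys);

    for (i = 0; i < item->nkeys; i++)
    {
        int     key = item->keys[i];

        /* look up the group in the bounded in-memory "hash table" */
        for (g = 0; g < ngroups; g++)
            if (groups[g] == key)
                break;

        if (g < ngroups)
            counts[g]++;                /* group is in memory: advance it */
        else if (ngroups < MAX_GROUPS)
        {
            groups[ngroups] = key;      /* room left: start a new group */
            counts[ngroups++] = 1;
        }
        else
        {
            /* table full: spill the tuple to a partition (a real version
             * would use different hash bits on recursive passes) */
            p = key % NUM_PARTS;
            part_keys[p][part_n[p]++] = key;
        }
    }

    /* groups that fit in memory are finalized and returned right away */
    for (i = 0; i < ngroups; i++)
        printf("group %d => count %ld\n", groups[i], counts[i]);

    /* each non-empty partition becomes another work item */
    for (p = 0; p < NUM_PARTS; p++)
    {
        if (part_n[p] > 0)
            push_item(part_keys[p], part_n[p]);
        else
            free(part_keys[p]);
    }
}

int
main(void)
{
    int     input[] = {1, 2, 3, 4, 5, 6, 1, 2, 7, 8, 9, 5, 6, 7};

    push_item(input, (int) (sizeof(input) / sizeof(input[0])));

    while (worklist != NULL)
    {
        WorkItem *item = worklist;

        worklist = item->next;
        process_item(item);
        free(item);             /* key arrays are leaked; it's only a demo */
    }
    return 0;
}

The real executor of course has to serialize real transition states and
manage spill files, but the shape is the same: each pass finalizes
whatever fit in memory and defers the rest to later work items.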