Re: Spilling hashed SetOps and aggregates to disk - Mailing list pgsql-hackers
From: David Gershuni
Subject: Re: Spilling hashed SetOps and aggregates to disk
Msg-id: 49FB34E0-337F-4FEE-AB3D-97879E4E6954@cs.cmu.edu
In response to: Re: Spilling hashed SetOps and aggregates to disk (Jeff Davis <pgsql@j-davis.com>)
Responses: Re: Spilling hashed SetOps and aggregates to disk
List: pgsql-hackers
> On Jun 19, 2018, at 10:36 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>
> But I am worried that I am missing something, because it appears that
> for AGG_MIXED, we wait until the end of the last phase to emit the
> contents of the hash tables. Wouldn't they be complete after the first
> phase?

You're right. They're complete after the first phase, but I believe
they're not emitted until the end in order to minimize the complexity
of the code. I don't think the possibility of exceeding work_mem was
taken into account when this decision was made.

> I was imagining that (in the simplest approach) each hash table would
> have its own set of spilled tuples. If grouping set A can be advanced
> (because it's present in the hash table) but B cannot (because it's not
> in the hash table and we are at the memory limit), then it goes into
> the spill file for B but not A. That means that A is still finished
> after the first pass.
...
> I realize that this simple approach could be inefficient if multiple
> hash tables are spilling because we'd duplicate the spilled tuples. But
> let me know if it won't work at all.

This approach seems functionally correct, but I don't like the idea of
transforming O(N) tuples of disk I/O into O(S*N) tuples of disk I/O (in
the worst case). As of yesterday, I think I have a better approach that
will work even for non-sortable grouping keys (see below).

> I am having a hard time trying to satisfy all of the constraints that
> have been raised so far:
>
> * handle data types with hash ops but no btree ops
> * handle many different group size distributions and input orders
>   efficiently
> * avoid regressions or inefficiencies for grouping sets
> * handle variable-size (particularly O(n)-space) transition states
>
> without taking on a lot of complexity. I like to work toward the right
> design, but I think simplicity also factors in to the right design.
> NodeAgg.c is already one of the larger and more complicated files that
> we have.
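To make the I/O concern concrete, here is a minimal Python sketch of the simple per-table approach quoted above (the function and variable names are hypothetical illustrations, not PostgreSQL internals): each grouping set keeps its own hash table and spill file, and a tuple whose group cannot be advanced in a full table is written to that set's spill file, duplicating the tuple once per blocked set.

```python
def simple_spill_pass(tuples, grouping_sets, table_capacity):
    """One pass of the simple approach: per grouping set, one hash table
    (here just counting rows per group) and one spill list. A tuple whose
    group is absent from a full table goes to that set's spill file."""
    tables = [{} for _ in grouping_sets]
    spills = [[] for _ in grouping_sets]
    for row in tuples:
        for i, cols in enumerate(grouping_sets):
            key = tuple(row[c] for c in cols)
            if key in tables[i]:
                tables[i][key] += 1       # group can be advanced in memory
            elif len(tables[i]) < table_capacity:
                tables[i][key] = 1        # room for a new group
            else:
                spills[i].append(row)     # whole tuple duplicated per blocked set
    return tables, spills
```

In the worst case (every table full, every incoming group absent), each of the N input tuples is written once per grouping set, i.e. S*N spilled tuples — the duplication objected to above.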
Yes, there are a lot of moving pieces here! And it will be non-trivial
to satisfy them all with one approach.

I'm working on a new aggstrategy that could be used in conjunction with
AGG_SORT. This approach could replace the current AGG_HASH and would
satisfy all of the requirements you mentioned for aggregates that have
serial/deserial functions. My approach is based on the HashSort
algorithm that I mentioned before, but with new modifications to
a) handle multiple grouping sets and b) not require btree ops on the
grouping keys.

a) is accomplished by hashing all grouping sets simultaneously and
putting each set in its own memory context. Whenever work_mem is
exceeded, we sort and spill the hash table with the largest memory
context.

b) is accomplished by sorting groups not by their actual grouping keys
but by their hash values. This works because we don't need a true sort:
we just need to process identical groups in a consistent order during
the merge step. As long as we're careful about collisions during the
merge, this should work.

> A lot of the complexity we already have is that grouping sets are
> performed in one giant executor node rather than exposing the
> individual phases as separate executor nodes. I suppose the reason
> (correct me if I'm wrong) is that there are two outputs from a phase:
> the aggregated groups, and the original input tuples in a new sort
> order. That seems solvable -- the executor passes bitmaps around until
> BitmapHeapScan turns them into tuples.

That is an interesting idea, but it's somewhat orthogonal to the spill
implementation. Still, it could be a nice way to decompose aggregate
nodes that require multiple phases/strategies. I'm curious to hear what
others think.

Best,
David
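For readers following along, modifications a) and b) above can be sketched in a few lines of Python (all names here are hypothetical illustrations under the stated assumptions, not PostgreSQL code): one table per grouping set, spill the largest table when the memory budget is exceeded, sort spilled runs by hash value rather than by grouping key, and merge runs while comparing real keys inside each hash cluster to separate genuine collisions.

```python
import heapq
import itertools

WORK_MEM = 2  # total in-memory groups allowed (tiny, to force spilling)

def hashsort_aggregate(tuples, grouping_sets):
    """Count rows per group for every grouping set at once, spilling the
    largest table (sorted by hash value, not by key) when over budget."""
    tables = [{} for _ in grouping_sets]   # one "memory context" per set
    runs = [[] for _ in grouping_sets]     # hash-sorted spilled runs

    def spill_largest():
        j = max(range(len(tables)), key=lambda k: len(tables[k]))
        runs[j].append(sorted(tables[j].items(), key=lambda kv: hash(kv[0])))
        tables[j].clear()

    for row in tuples:
        for i, cols in enumerate(grouping_sets):
            key = tuple(row[c] for c in cols)
            if key not in tables[i] and sum(map(len, tables)) >= WORK_MEM:
                spill_largest()            # make room: evict the biggest set
            tables[i][key] = tables[i].get(key, 0) + 1

    # Merge step: runs are sorted by hash value, so identical groups (which
    # have equal hashes) meet in a consistent order across runs; within one
    # hash cluster we compare the real keys to separate genuine collisions.
    results = []
    for i in range(len(grouping_sets)):
        runs[i].append(sorted(tables[i].items(), key=lambda kv: hash(kv[0])))
        stream = heapq.merge(*runs[i], key=lambda kv: hash(kv[0]))
        out = {}
        for _, cluster in itertools.groupby(stream, key=lambda kv: hash(kv[0])):
            for key, count in cluster:     # one hash value per cluster
                out[key] = out.get(key, 0) + count
        results.append(out)
    return results
```

The sketch uses a plain count as the aggregate; the real node would combine spilled partial transition states via the serial/deserial functions mentioned above. The key property it demonstrates is that sorting by hash value gives a consistent merge order without requiring btree ops on the grouping keys.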