Re: 9.5: Memory-bounded HashAgg - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: 9.5: Memory-bounded HashAgg
Date
Msg-id 53F0DAA7.3000805@fuzzy.cz
In response to 9.5: Memory-bounded HashAgg  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On 10.8.2014 23:26, Jeff Davis wrote:
> This patch requires the Memory Accounting patch, or something similar
> to track memory usage.
>
> The attached patch enables hashagg to spill to disk, which means that
> hashagg will contain itself to work_mem even if the planner makes a
> bad misestimate of the cardinality.
>
> This is a well-known concept; there's even a Berkeley homework
> assignment floating around to implement it -- in postgres 7.2, no
> less. I didn't take the exact same approach as the homework assignment
> suggests, but it's not much different, either. My apologies if some
> classes are still using this as a homework assignment, but postgres
> needs to eventually have an answer to this problem.
>
> Included is a GUC, "enable_hashagg_disk" (default on), which allows
> the planner to choose hashagg even if it doesn't expect the hashtable
> to fit in memory. If it's off, and the planner misestimates the
> cardinality, hashagg will still use the disk to contain itself to
> work_mem.
>
> One situation that might surprise the user is if work_mem is set too
> low, and the user is *relying* on a misestimate to pick hashagg. With
> this patch, it would end up going to disk, which might be
> significantly slower. The solution for the user is to increase
> work_mem.
>
> Rough Design:
>
> Change the hash aggregate algorithm to accept a generic "work item",
> which consists of an input file as well as some other bookkeeping
> information.
>
> Initially prime the algorithm by adding a single work item where the
> file is NULL, indicating that it should read from the outer plan.
>
> If the memory is exhausted during execution of a work item, then
> continue to allow existing groups to be aggregated, but do not allow new
> groups to be created in the hash table. Tuples representing new groups
> are saved in an output partition file referenced in the work item that
> is currently being executed.
>
> When the work item is done, emit any groups in the hash table, clear the
> hash table, and turn each output partition file into a new work item.
>
> Each time through at least some groups are able to stay in the hash
> table, so eventually none will need to be saved in output partitions, no
> new work items will be created, and the algorithm will terminate. This
> is true even if the number of output partitions is always one.
>
> Open items:
>    * costing
>    * EXPLAIN details for disk usage
>    * choose number of partitions intelligently
>    * performance testing
>
> Initial tests indicate that it can be competitive with sort+groupagg
> when the disk is involved, but more testing is required.
>
> Feedback welcome.

I've been working on this for a few hours - getting familiar with the
code, testing queries etc. Two comments below, after a quick sketch of
how I read the design.
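
To make sure I'm reading the "Rough Design" right, here's a toy
standalone model of the work-item loop as I understand it - integer
keys and counts only, plain arrays standing in for the partition temp
files, a tiny fixed-size array standing in for the work_mem-limited
hash table, and all names made up by me rather than taken from the
patch:

/*
 * Toy model of the work-item loop (integer keys, counts only).
 * Arrays stand in for the partition temp files, a fixed-size array
 * for the work_mem-limited hash table; names are mine, not the patch's.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_GROUPS   4          /* pretend work_mem only fits 4 groups */
#define NPARTITIONS  2          /* spill partitions per work item */

typedef struct WorkItem
{
    int        *tuples;         /* spilled keys (first item models the outer plan) */
    int         ntuples;
} WorkItem;

typedef struct Group { int key; long count; } Group;

static void
process_item(WorkItem *item, WorkItem *parts)
{
    Group       groups[MAX_GROUPS];
    int         ngroups = 0;

    for (int i = 0; i < item->ntuples; i++)
    {
        int         key = item->tuples[i];
        int         g;

        /* look the group up in the "hash table" */
        for (g = 0; g < ngroups; g++)
            if (groups[g].key == key)
                break;

        if (g < ngroups)
            groups[g].count++;              /* existing group: just aggregate */
        else if (ngroups < MAX_GROUPS)
        {
            groups[ngroups].key = key;      /* room left: create the group */
            groups[ngroups].count = 1;
            ngroups++;
        }
        else
        {
            /* "memory" exhausted: spill the tuple to an output partition */
            WorkItem   *p = &parts[key % NPARTITIONS];

            p->tuples = realloc(p->tuples, (p->ntuples + 1) * sizeof(int));
            p->tuples[p->ntuples++] = key;
        }
    }

    /* work item done: emit whatever stayed in the hash table */
    for (int g = 0; g < ngroups; g++)
        printf("key %d -> count %ld\n", groups[g].key, groups[g].count);
}

int
main(void)
{
    int         input[100];
    WorkItem    queue[64];      /* plenty for this toy input */
    int         nitems = 1;

    /* fake "outer plan": keys 0..9, each appearing 10 times */
    for (int i = 0; i < 100; i++)
        input[i] = i % 10;

    queue[0].tuples = input;
    queue[0].ntuples = 100;

    /* keep going until no work item produces any spill partitions */
    for (int i = 0; i < nitems; i++)
    {
        WorkItem    parts[NPARTITIONS];

        memset(parts, 0, sizeof(parts));
        process_item(&queue[i], parts);

        for (int p = 0; p < NPARTITIONS; p++)
            if (parts[p].ntuples > 0)
                queue[nitems++] = parts[p];     /* partition -> new work item */
    }

    return 0;
}

Which also illustrates why it terminates even with a single output
partition - every pass keeps at least one group in memory and spills
only the rest.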

1) Apparently there's something broken, because with this:

   create table table_b (fk_id int, val_a int, val_b int);
   insert into table_b
      select i, mod(i,1000), mod(i,1000)
        from generate_series(1,10000000) s(i);
   analyze table_b;

   I get this:

   set work_mem = '8MB';
   explain analyze select fk_id, count(*)
           from table_b where val_a < 50 and val_b < 50 group by 1;
   > The connection to the server was lost. Attempting reset: Failed.

   Stacktrace attached, but apparently there's a segfault in
   advance_transition_function when accessing pergroupstate.

   This happened for all queries that I tried, once they needed to do
   the batching.

2) Using the same hash value both for dynahash and batching seems
   really fishy to me. I'm not familiar with dynahash, but I'd bet
   the way it's done now will lead to bad distribution in the hash
   table (some buckets will always be empty in some batches, etc.).

   This is why hashjoin tries so hard to use non-overlapping parts
   of the hash for batchno/bucketno (see the sketch at the end of
   this point).

   Hashjoin implements its own hash table, which makes it clear
   how the bucket is derived from the hash value. I'm not sure how
   dynahash does that, but I'm pretty sure we can't just reuse the
   hash value like this.

   I see two options - compute our own hash value, or somehow derive
   a new one (e.g. by doing "hashvalue XOR random_seed"). I'm not sure
   the latter would work, though.
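
   For reference, this is roughly how ExecHashGetBucketAndBatch() in
   nodeHash.c splits the hash value (quoting from memory, so the details
   may be off) - bucketno comes from the low bits, batchno from the bits
   just above them, so the two never overlap:

   /* nbuckets and nbatch are both powers of two */
   static void
   get_bucket_and_batch(uint32 hashvalue, int log2_nbuckets,
                        int nbuckets, int nbatch,
                        int *bucketno, int *batchno)
   {
       *bucketno = hashvalue & (nbuckets - 1);

       if (nbatch > 1)
           *batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);
       else
           *batchno = 0;
   }

   Whatever we end up doing for hashagg, I think the batch number needs
   to come from bits the hash table does not use for the bucket, one
   way or another.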

regards
Tomas

