Re: 9.5: Memory-bounded HashAgg - Mailing list pgsql-hackers

From: Jeff Davis
Subject: Re: 9.5: Memory-bounded HashAgg
Date:
Msg-id: 1418294807.5584.32.camel@jeff-desktop
In response to: 9.5: Memory-bounded HashAgg (Jeff Davis <pgsql@j-davis.com>)
Responses: Re: 9.5: Memory-bounded HashAgg (Tomas Vondra <tv@fuzzy.cz>)
           Re: 9.5: Memory-bounded HashAgg (Jeff Davis <pgsql@j-davis.com>)
List: pgsql-hackers
On Sun, 2014-08-10 at 14:26 -0700, Jeff Davis wrote:
> This patch requires the Memory Accounting patch, or something
> similar, to track memory usage.
>
> The attached patch enables hashagg to spill to disk, which means that
> hashagg will confine itself to work_mem even if the planner badly
> misestimates the cardinality.

New patch attached. All open items are complete, though the patch may
have a few rough edges.

Summary of changes:

 * rebased on top of latest memory accounting patch
http://www.postgresql.org/message-id/1417497257.5584.5.camel@jeff-desktop
 * added a flag to hash_create to prevent it from creating an extra
level of memory context
   - without this, the memory accounting would have a measurable impact
on performance
 * cost model for the disk usage
 * intelligently choose the number of partitions for each pass of the
data (see the sketch after this list)
 * explain support
 * in build_hash_table(), be more intelligent about the value of
nbuckets to pass to BuildTupleHashTable()
   - BuildTupleHashTable tries to choose a value to keep the table in
work_mem, but it isn't very accurate.
 * some very rudimentary testing (sanity checks, really) shows good
results
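
On the partition-count item above: here is a minimal standalone
sketch of the idea (the function name, the cap, and the exact formula
are my illustration, not necessarily what the patch does). The goal
is to pick enough partitions that each partition's groups are
expected to fit in work_mem on the next pass:

#include <math.h>

/*
 * Illustrative only -- names and constants are invented, not the
 * patch's actual code.  Choose enough spill partitions that each
 * partition's groups should fit in work_mem on the next pass.
 */
static int
choose_num_partitions(double input_groups, double bytes_per_group,
                      double work_mem_bytes)
{
    double  groups_per_pass = work_mem_bytes / bytes_per_group;
    int     npartitions = 1;

    if (input_groups > groups_per_pass)
    {
        int     target = (int) ceil(input_groups / groups_per_pass);

        /* round up to a power of two: partition chosen by hash bits */
        while (npartitions < target && npartitions < 256)
            npartitions <<= 1;

        /* the 256 cap bounds the number of open temp files */
    }
    return npartitions;
}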

Summary of previous discussion (my summary; I may have missed some
points):

Tom Lane requested that the patch also handle the case where the
transition values themselves grow beyond work_mem (e.g. array_agg). I
feel this patch provides a lot of benefit as it is, and handling that
case would be a lot more work: at a minimum, we'd need a way to write
transition values out to disk, and perhaps to combine them with other
transition values. I also don't think my patch would interfere with
such a fix in the future.
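
To make that concrete, spilling transition values would presumably
require at least per-aggregate support along these lines (purely
hypothetical -- no such interface exists in this patch):

/*
 * Hypothetical sketch only; not part of the patch.  Spilling a
 * grown transition value needs a way to flatten it to disk and
 * read it back, and merging partial states for the same group
 * would need a combine operation.
 */
typedef struct AggSpillSupport
{
    bytea  *(*serialize) (Datum transvalue);     /* state -> bytes */
    Datum   (*deserialize) (bytea *flatstate);   /* bytes -> state */
    Datum   (*combine) (Datum state1, Datum state2); /* merge */
} AggSpillSupport;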

Tomas Vondra suggested an alternative design that more closely resembles
HashJoin: instead of filling up the hash table and then spilling any new
groups, the idea would be to split the current data into two partitions,
keep one in the hash table, and spill the other (see
ExecHashIncreaseNumBatches()). This has the advantage that it's very
fast to identify whether a tuple belongs to the in-memory batch; if it
doesn't, we can skip the hash table lookup entirely.
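
For reference, HashJoin's batch test is just bit arithmetic on the
hash value; the following is simplified from
ExecHashGetBucketAndBatch() in nodeHash.c (standalone types
substituted for the backend's):

#include <stdint.h>

/*
 * Simplified from ExecHashGetBucketAndBatch().  nbuckets and
 * nbatch are both powers of two, so the bucket and batch numbers
 * come from disjoint bit ranges of the hash value; a tuple whose
 * batchno is not the current batch can be written straight to its
 * batch file without probing the hash table at all.
 */
static void
get_bucket_and_batch(uint32_t hashvalue,
                     uint32_t nbuckets, int log2_nbuckets,
                     uint32_t nbatch,
                     uint32_t *bucketno, uint32_t *batchno)
{
    *bucketno = hashvalue & (nbuckets - 1);
    if (nbatch > 1)
        *batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);
    else
        *batchno = 0;
}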

The batch-splitting approach has a major downside, however: you are
likely to evict a skew value from the in-memory batch, which will result
in all subsequent tuples with that skew value going to disk. My approach
never evicts from the in-memory table until we actually finalize the
groups, so the skew values are likely to be completely processed in the
first pass.
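
Schematically, the per-tuple loop in my approach looks like this
(helper names are invented, error paths are omitted, and this is not
the patch's actual code; LookupTupleHashEntry() with a NULL isnew
pointer probes without inserting):

/*
 * Schematic sketch only.  Existing groups always advance in
 * memory; once the memory accounting reports that work_mem is
 * exhausted, lookups stop inserting, and tuples belonging to
 * unseen groups are spilled for a later pass.  advance_group()
 * is assumed to initialize the transition state when the entry
 * is new.
 */
for (;;)
{
    TupleTableSlot *slot = ExecProcNode(outerPlan);
    TupleHashEntry  entry;
    bool            isnew;

    if (TupIsNull(slot))
        break;

    entry = LookupTupleHashEntry(hashtable, slot,
                                 hash_mem_exceeded ? NULL : &isnew);
    if (entry != NULL)
        advance_group(entry, slot);   /* skew groups stay in memory */
    else
        spill_to_partition(slot);     /* new group, no room left */
}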

So, the attached patch implements my original approach, which I still
feel is the best solution.

Regards,
    Jeff Davis

