Re: Memory-Bounded Hash Aggregation - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Memory-Bounded Hash Aggregation
Date
Msg-id 20191128174644.snpuqldt5rndjru7@development
Whole thread Raw
In response to Re: Memory-Bounded Hash Aggregation  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Memory-Bounded Hash Aggregation  (Melanie Plageman <melanieplageman@gmail.com>)
Re: Memory-Bounded Hash Aggregation  (Jeff Davis <pgsql@j-davis.com>)
Re: Memory-Bounded Hash Aggregation  (Jeff Davis <pgsql@j-davis.com>)
Re: Memory-Bounded Hash Aggregation  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On Wed, Nov 27, 2019 at 02:58:04PM -0800, Jeff Davis wrote:
>On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote:
>> Right now the patch always initializes 32 spill partitions. Have you
>> given
>> any thought into how to intelligently pick an optimal number of
>> partitions yet?
>
>Attached a new patch that addresses this.
>
>1. Divide hash table memory used by the number of groups in the hash
>table to get the average memory used per group.
>2. Multiply by the number of groups spilled -- which I pessimistically
>estimate as the number of tuples spilled -- to get the total amount of
>memory that we'd like to have to process all spilled tuples at once.

Isn't the "number of tuples = number of groups" estimate likely to be
way too pessimistic? IIUC the consequence is that it pushes us to pick
more partitions than necessary, correct?

Could we instead track how many tuples we actually consumed for the the
in-memory groups, and then use this information to improve the estimate
of number of groups? I mean, if we know we've consumed 1000 tuples which
created 100 groups, then we know there's ~1:10 ratio.

>3. Divide the desired amount of memory by work_mem to get the number of
>partitions we'd like to have such that each partition can be processed
>in work_mem without spilling.
>4. Apply a few sanity checks, fudge factors, and limits.
>
>Using this runtime information should be substantially better than
>using estimates and projections.
>
>Additionally, I removed some branches from the common path. I think I
>still have more work to do there.
>
>I also rebased of course, and fixed a few other things.
>

A couple of comments based on eye-balling the patch:


1) Shouldn't the hashagg_mem_overflow use the other GUC naming, i.e.
maybe it should be enable_hashagg_mem_overflow or something similar?


2) I'm a bit puzzled by this code in ExecInterpExpr (there are multiple
such blocks, this is just an example)

     aggstate = op->d.agg_init_trans.aggstate;
     pergroup_allaggs = aggstate->all_pergroups[op->d.agg_init_trans.setoff];
     pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];

     /* If transValue has not yet been initialized, do so now. */
     if (pergroup_allaggs != NULL && pergroup->noTransValue)
     { ... }

How could the (pergroup_allaggs != NULL) protect against anything? Let's
assume the pointer really is NULL. Surely we'll get a segfault on the
preceding line which does dereference it

     pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];

Or am I missing anything?


3) execGrouping.c

A couple of functions would deserve a comment, explaining what it does.

  - LookupTupleHashEntryHash
  - prepare_hash_slot
  - calculate_hash

And it's not clear to me why we should remove part of the comment before
TupleHashTableHash.


4) I'm not sure I agree with this reasoning that HASH_PARTITION_FACTOR
making the hash tables smaller is desirable - it may be, but if that was
generally the case we'd just use small hash tables all the time. It's a
bit annoying to give user the capability to set work_mem and then kinda
override that.

  * ... Another benefit of having more, smaller partitions is that small
  * hash tables may perform better than large ones due to memory caching
  * effects.


5) Not sure what "directly" means in this context?

  * partitions at the time we need to spill, and because this algorithm
  * shouldn't depend too directly on the internal memory needs of a
  * BufFile.

#define HASH_PARTITION_MEM (HASH_MIN_PARTITIONS * BLCKSZ)

Does that mean we don't want to link to PGAlignedBlock, or what?


6) I think we should have some protection against underflows in this
piece of code:

- this would probably deserve some protection against underflow if HASH_PARTITION_MEM gets too big

     if (hashagg_mem_overflow)
         aggstate->hash_mem_limit = SIZE_MAX;
     else
         aggstate->hash_mem_limit = (work_mem * 1024L) - HASH_PARTITION_MEM;

At the moment it's safe because work_mem is 64kB at least, and
HASH_PARTITION_MEM is 32kB (4 partitions, 8kB each). But if we happen to
bump HASH_MIN_PARTITIONS up, this can underflow.


7) Shouldn't lookup_hash_entry briefly explain why/how it handles the
memory limit?


8) The comment before lookup_hash_entries says:

  ...
  * Return false if hash table has exceeded its memory limit.
  ..

But that's clearly bogus, because that's a void function.


9) Shouldn't the hash_finish_initial_spills calls in agg_retrieve_direct
have a comment, similar to the surrounding code? Might be an overkill,
not sure.


10) The comment for agg_refill_hash_table says

  * Should only be called after all in memory hash table entries have been
  * consumed.

Can we enforce that with an assert, somehow?


11) The hash_spill_npartitions naming seems a bit confusing, because it
seems to imply it's about the "spill" while in practice it just choses
number of spill partitions. Maybe hash_choose_num_spill_partitions would
be better?


12) It's not clear to me why we need HASH_MAX_PARTITIONS? What's the
reasoning behind the current value (256)? Not wanting to pick too many
partitions? Comment?

     if (npartitions > HASH_MAX_PARTITIONS)
         npartitions = HASH_MAX_PARTITIONS;


13) As for this:

     /* make sure that we don't exhaust the hash bits */
     if (partition_bits + input_bits >= 32)
         partition_bits = 32 - input_bits;

We already ran into this issue (exhausting bits in a hash value) in
hashjoin batching, we should be careful to use the same approach in both
places (not the same code, just general approach).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Ranier Vilela
Date:
Subject: RE: [Incident report]Backend process crashed when executing 2pctransaction
Next
From: Pavel Stehule
Date:
Subject: Re: missing estimation for coalesce function