Re: 9.5: Memory-bounded HashAgg - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: 9.5: Memory-bounded HashAgg
Date
Msg-id ba3bef7f482caae5707b89f6e2243bd5.squirrel@sq.gransy.com
In response to Re: 9.5: Memory-bounded HashAgg  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On 14 August 2014, 9:22, Jeff Davis wrote:
> I think the hash-join like approach is reasonable, but I also think
> you're going to run into a lot of challenges that make it more complex
> for HashAgg. For instance, let's say you have the query:
>
>   SELECT x, array_agg(y) FROM foo GROUP BY x;
>
> Say the transition state is an array (for the sake of simplicity), so
> the hash table has something like:
>
>   1000 => {7,   8,  9}
>   1001 => {12, 13, 14}
>
> You run out of memory and need to split the hash table, so you scan the
> hash table and find that group 1001 needs to be written to disk. So you
> serialize the key and array and write them out.
>
> Then the next tuple you get is (1001, 19). What do you do? Create a new
> group 1001 => {19} (how do you combine it later with the first one)? Or
> try to fetch the existing group 1001 from disk and advance it (horrible
> random I/O)?

No, that's not how it works. The batching algorithm works with a hash of
the group. For example, let's suppose you do this:
  batchno = hash % nbatches;

which essentially keeps the last few bits of the hash: 0 bits for
nbatches=1, 1 bit for nbatches=2, 2 bits for nbatches=4, etc.
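
To make that concrete, here's a minimal C sketch (the function name is
mine, not from any patch), assuming a 32-bit hash and nbatches always a
power of two:

   #include <stdint.h>

   /* Keeps exactly the low-order log2(nbatches) bits of the hash;
    * since nbatches is a power of two, the modulo is just a bit mask. */
   static uint32_t
   batch_of(uint32_t hash, uint32_t nbatches)
   {
       return hash % nbatches;    /* same as hash & (nbatches - 1) */
   }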

So let's say we have 2 batches, and we're working on the first batch.
That means we're using 1 bit:
   batchno = hash % 2;

and for the first batch we're keeping only the groups with batchno=0,
i.e. only the groups whose hash ends with a 0 bit.

When running out of memory, you simply do
   nbatches *= 2

and start considering one more bit from the hash. So if you had this
before:
   group_a => batchno=0 => {7,   8,  9}
   group_b => batchno=0 => {12, 13, 14}
   group_c => batchno=0 => {23,  1, 45}
   group_d => batchno=0 => {77, 37, 54}

(where batchno is a bit string), after doubling the number of batches
you get something like this:
   group_a => batchno=10 => {7,   8,  9}
   group_b => batchno=00 => {12, 13, 14}
   group_c => batchno=00 => {23,  1, 45}
   group_d => batchno=10 => {77, 37, 54}

So each group gets one of only two possible batchno values here,
depending on the new most-significant bit - either that bit is 0 (which
means the group is still in the current batch) or 1 (and you need to
move it to the temp file of the new batch).
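
As a rough sketch of that rescan (the names and the flat array standing
in for the hash table are mine, not actual hashjoin code):

   #include <stddef.h>
   #include <stdint.h>

   typedef struct Group
   {
       uint32_t hash;       /* hash of the grouping key */
       /* key and transition state omitted for brevity */
   } Group;

   /* stand-in for serializing a group into the temp file of batchno */
   static void
   spill_group(Group *group, uint32_t batchno)
   {
       (void) group; (void) batchno;
   }

   /* After nbatches was doubled, groups whose new batchno gained a 1
    * as the most-significant bit move to the new batch's temp file;
    * the others (new bit 0) stay in memory. */
   static void
   repartition(Group *groups, size_t ngroups,
               uint32_t nbatches, uint32_t curbatch)
   {
       for (size_t i = 0; i < ngroups; i++)
       {
           uint32_t batchno = groups[i].hash % nbatches;

           if (batchno != curbatch)
               spill_group(&groups[i], batchno);
       }
   }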

Then, when you get a new tuple, you take its hash and do a simple check
of the last few bits - effectively computing batchno just like before:
  batchno = hash % nbatches;

Either it belongs to the current batch (and either it's in the hash
table, or you add it there), or it's not - in that case write it to a
temp file.
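
The per-tuple routing is then just this (hypothetical helper names, not
the patch's API):

   #include <stdint.h>

   /* advance_group() stands for the hash table lookup/insert plus the
    * transition function call; write_tuple() stands for appending the
    * raw tuple to the temp file of the given batch. */
   static void advance_group(uint32_t hash)    { (void) hash; }
   static void write_tuple(uint32_t batchno)   { (void) batchno; }

   static void
   route_tuple(uint32_t hash, uint32_t nbatches, uint32_t curbatch)
   {
       uint32_t batchno = hash % nbatches;

       if (batchno == curbatch)
           advance_group(hash);     /* process in memory now */
       else
           write_tuple(batchno);    /* defer to a later batch */
   }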

It gets a bit more complex when you increase the number of batches
repeatedly (effectively you need to do the check/move when reading the
batches).
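
That is, a tuple read back from a batch file that was written when
nbatches was smaller may now belong to a later batch, so it must be
re-routed with the current nbatches - the same test as above, applied
on read. A hypothetical sketch of that read loop, reusing the helper
names from the previous sketch (read_tuple() stands for fetching the
next tuple and its stored hash from the current batch's temp file):

   while (read_tuple(batch_file, &hash))
   {
       uint32_t batchno = hash % nbatches;  /* current, possibly larger */

       if (batchno == curbatch)
           advance_group(hash);     /* really ours, process it */
       else
           write_tuple(batchno);    /* forward to its new batch file */
   }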

For sure, it's not for free - it may write to quite a few files. Is it
more expensive than what you propose? I'm not sure about that. With
your batching scheme, you'll end up with a smaller number of larger
batches, and you'll need to read and split them, possibly repeatedly.
The batching scheme from hashjoin minimizes this.

IMHO the only way to find out is to run some actual tests.

> On Wed, 2014-08-13 at 12:31 +0200, Tomas Vondra wrote:
>> My understanding of the batching algorithm (and I may be wrong on this
>> one) is that once you choose the number of batches, it's pretty much
>> fixed. Is that the case?
>
> It's only fixed for that one "work item" (iteration). A different K can
> be selected if memory is exhausted again. But you're right: this is a
> little less flexible than HashJoin.
>
>> But what will happen in case of significant cardinality underestimate?
>> I.e. what will happen if you decide to use 16 batches, and then find
>> out 256 would be more appropriate? I believe you'll end up with batches
>> 16x the size you'd want, most likely exceeding work_mem.
>
> Yes, except that work_mem would never be exceeded. If the partitions are
> 16X work_mem, then each would be added as another work_item, and
> hopefully it would choose better the next time.

Only for aggregates with fixed-length state. For aggregates with
growing state (the ones that need serialize/deserialize), the states
may eventually exceed work_mem.

>> > One thing I like about my simple approach is that it returns a good
>> > number of groups after each pass, and then those are completely
>> > finished (returned to the operator above, even). That's impossible
>> > with HashJoin because the hashing all needs to be done before the
>> > probe phase begins.
>>
>> The hash-join approach returns ~1/N groups after each pass, so I fail to
>> see how this is better?
>
> You can't return any tuples until you begin the probe phase, and that
> doesn't happen until you've hashed the entire inner side (which involves
> splitting and other work). With my patch, it will return some tuples
> after the first scan. Perhaps I'm splitting hairs here, but the idea of
> finalizing some groups as early as possible seems appealing.

I fail to see how this is different from your approach. How can you
output any tuples before processing the whole inner relation?

After the first scan, the hash-join approach is pretty much guaranteed
to output ~1/N tuples.

>> Aha! And the new batches are 'private' to the work item, making it a bit
>> recursive, right? Is there any reason not to just double the number of
>> batches globally?
>
> I didn't quite follow this proposal.

Again, it's about a difference between your batching approach and the
hashjoin-style batching. The hashjoin batching keeps a single level of
batches, and when hitting work_mem just doubles the number of batches.

Your approach is to do multi-level batching, and I was thinking whether
it'd be possible to use the same approach (single level). But in
retrospect it probably does not make much sense, because the multi-level
batching is one of the points of the proposed approach.

>> It also seems to me that using HASH_DISK_MAX_PARTITIONS, and then
>> allowing each work item to create its own set of additional
>> partitions effectively renders the HASH_DISK_MAX_PARTITIONS futile.
>
> It's the number of active partitions that matter, because that's what
> causes the random I/O.

OK, point taken. While I share the general concern about random I/O,
I'm not sure this case is particularly problematic.

regards
Tomas



