Re: 9.5: Memory-bounded HashAgg - Mailing list pgsql-hackers

From: Jeff Davis
Subject: Re: 9.5: Memory-bounded HashAgg
Date:
Msg-id: 1408000948.2335.103.camel@jeff-desktop
In response to: Re: 9.5: Memory-bounded HashAgg ("Tomas Vondra" <tv@fuzzy.cz>)
List: pgsql-hackers
I think the hash-join-like approach is reasonable, but I also think
you're going to run into a lot of challenges that make it more complex
for HashAgg. For instance, let's say you have the query:
 SELECT x, array_agg(y) FROM foo GROUP BY x;

Say the transition state is an array (for the sake of simplicity), so
the hash table has something like:
 1000 => {7,   8,  9}
 1001 => {12, 13, 14}

You run out of memory and need to split the hash table, so you scan the
hash table and find that group 1001 needs to be written to disk. So you
serialize the key and array and write them out.

Then the next tuple you get is (1001, 19). What do you do? Create a new
group 1001 => {19} (how do you combine it later with the first one)? Or
try to fetch the existing group 1001 from disk and advance it (horrible
random I/O)?
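
To make that concrete, here's a toy sketch of the situation (plain C, nothing
from either patch; Group, memtable, lookup_or_create and the spill array are
all made up for illustration):

#include <stdio.h>

#define MAXVALS 8

typedef struct Group
{
    int key;
    int vals[MAXVALS];      /* toy "transition state": the collected y values */
    int nvals;
} Group;

static Group memtable[4];   /* in-memory hash table (toy: linear scan) */
static int   nmem = 0;
static Group spilled[4];    /* stand-in for partial states written to disk */
static int   nspilled = 0;

static Group *
lookup_or_create(int key)
{
    for (int i = 0; i < nmem; i++)
        if (memtable[i].key == key)
            return &memtable[i];
    memtable[nmem] = (Group) {.key = key};
    return &memtable[nmem++];
}

int
main(void)
{
    /* state before memory runs out: 1000 => {7,8,9}, 1001 => {12,13,14} */
    *lookup_or_create(1000) = (Group) {1000, {7, 8, 9}, 3};
    *lookup_or_create(1001) = (Group) {1001, {12, 13, 14}, 3};

    /* memory pressure: "serialize" group 1001 to the spill area and evict it */
    spilled[nspilled++] = memtable[1];
    memtable[1] = memtable[nmem - 1];
    nmem--;

    /* the next tuple is (1001, 19): start a new, second partial state */
    Group *g = lookup_or_create(1001);
    g->vals[g->nvals++] = 19;

    printf("in-memory 1001 holds %d value(s); spilled 1001 holds %d value(s)\n",
           g->nvals, spilled[0].nvals);
    return 0;
}

Either way you end up with two partial transition states for the same key and
no obvious way to merge them afterwards, or you pay random I/O to fetch the
spilled state back every time such a tuple shows up.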



On Wed, 2014-08-13 at 12:31 +0200, Tomas Vondra wrote:
> My understanding of the batching algorithm (and I may be wrong on this
> one) is that once you choose the number of batches, it's pretty much
> fixed. Is that the case?

It's only fixed for that one "work item" (iteration). A different K can
be selected if memory is exhausted again. But you're right: this is a
little less flexible than HashJoin.

> But what will happen in case of significant cardinality underestimate?
> I.e. what will happen if you decide to use 16 batches, and then find
> out 256 would be more appropriate? I believe you'll end up with batches
> 16x the size you'd want, most likely exceeding work_mem.

Yes, except that work_mem would never be exceeded. If the partitions are
16X work_mem, then each would be added as another work_item, and
hopefully it would choose better the next time.
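
Roughly, the loop looks like this toy sketch (not the patch's actual code;
WorkItem, push_work, WORK_MEM and NUM_PARTITIONS are just stand-ins for the
real structures, work_mem and the chosen fan-out K):

#include <stdio.h>

typedef struct WorkItem
{
    long ntuples;               /* how much input this item has to process */
    struct WorkItem *next;
} WorkItem;

#define WORK_MEM       1000     /* toy stand-in for work_mem, in "tuples" */
#define NUM_PARTITIONS    4     /* toy stand-in for the partition fan-out */

static WorkItem *worklist = NULL;

static void
push_work(long ntuples)
{
    static WorkItem pool[64];
    static int used = 0;

    pool[used] = (WorkItem) {ntuples, worklist};
    worklist = &pool[used++];
}

int
main(void)
{
    push_work(16000);           /* the input turns out to be 16x work_mem */

    while (worklist != NULL)
    {
        WorkItem *item = worklist;

        worklist = item->next;

        if (item->ntuples <= WORK_MEM)
        {
            /* fits: hash it, finalize its groups, return them */
            printf("hashed %ld tuples in memory; groups finalized\n",
                   item->ntuples);
            continue;
        }

        /*
         * Doesn't fit (grossly simplified: pretend the first WORK_MEM
         * tuples' groups fit and everything after that spills).  Spill the
         * remainder into NUM_PARTITIONS partitions and queue each one as
         * its own work item; a partition that is still too big just goes
         * around the loop again with a fresh fan-out, so work_mem itself
         * is never exceeded.
         */
        long remaining = item->ntuples - WORK_MEM;

        printf("hashed %d tuples; spilling %ld tuples into %d partitions\n",
               WORK_MEM, remaining, NUM_PARTITIONS);
        for (int i = 0; i < NUM_PARTITIONS; i++)
            push_work(remaining / NUM_PARTITIONS);
    }
    return 0;
}

With a 16x misestimate the first pass spills into four partitions, each of
those spills once more, and the third level finally fits. And although the
run creates 20 partitions in total, only NUM_PARTITIONS of them are ever open
for writing at once, which is the role the cap on partitions plays.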

> > One thing I like about my simple approach is that it returns a good
> > number of groups after each pass, and then those are completely finished
> > (returned to the operator above, even). That's impossible with HashJoin
> > because the hashing all needs to be done before the probe phase begins.
> 
> The hash-join approach returns ~1/N groups after each pass, so I fail to
> see how this is better?

You can't return any tuples until you begin the probe phase, and that
doesn't happen until you've hashed the entire inner side (which involves
splitting and other work). With my patch, it will return some tuples
after the first scan. Perhaps I'm splitting hairs here, but the idea of
finalizing some groups as early as possible seems appealing.

> Aha! And the new batches are 'private' to the work item, making it a bit
> recursive, right? Is there any reason not to just double the number of
> batches globally?

I didn't quite follow this proposal.

> It also seems to me that using HASH_DISK_MAX_PARTITIONS, and then allowing
> each work item to create its own set of additional partitions effectively
> renders the HASH_DISK_MAX_PARTITIONS futile.

It's the number of active partitions that matters, because that's what
causes the random I/O.

Regards,
    Jeff Davis