Re: 9.5: Memory-bounded HashAgg - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: 9.5: Memory-bounded HashAgg
Date
Msg-id 1407820015.6623.54.camel@jeff-desktop
Whole thread Raw
In response to Re: 9.5: Memory-bounded HashAgg  (Tomas Vondra <tv@fuzzy.cz>)
Responses Re: 9.5: Memory-bounded HashAgg
List pgsql-hackers
On Mon, 2014-08-11 at 01:29 +0200, Tomas Vondra wrote:
> On 10.8.2014 23:26, Jeff Davis wrote:
> > This patch is requires the Memory Accounting patch, or something
> > similar to track memory usage.
> 
> I think the patch you sent actually includes the accounting patch. Is
> that on purpose, or by accident?

Accident, thank you.

> So once a group gets into memory, it stays there? That's going to work
> fine for aggregates with fixed-size state (int4, or generally state that
> gets allocated and does not grow), but I'm afraid for aggregates with
> growing state (as for example array_agg and similar) that's not really a
> solution.

I agree in theory, but for now I'm just not handling that case at all
because there is other work that needs to be done first. For one thing,
we would need a way to save the transition state, and we don't really
have that. In the case of array_agg, the state is not serialized and
there's no generic way to ask it to serialize itself without finalizing.

I'm open to ideas. Do you think my patch is going generally in the right
direction, and we can address this problem later; or do you think we
need a different approach entirely?

> While hacking on the hash join, I envisioned the hash aggregate might
> work in a very similar manner, i.e. something like this:
> 
>   * nbatches=1, nbits=0
>   * when work_mem gets full => nbatches *= 2, nbits += 1
>   * get rid of half the groups, using nbits from the hash
>      => dump the current states into 'states.batchno' file
>      => dump further tuples to 'tuples.batchno' file
>   * continue until the end, or until work_mem gets full again

It would get a little messy with HashAgg. Hashjoin is dealing entirely
with tuples; HashAgg deals with tuples and groups.

Also, if the transition state is fixed-size (or even nearly so), it
makes no sense to remove groups from the hash table before they are
finished. We'd need to detect that somehow, and it seems almost like two
different algorithms (though maybe not a bad idea to use a different
algorithm for things like array_agg).

Not saying that it can't be done, but (unless you have an idea) requires
quite a bit more work than what I did here.

> It also seems to me that the logic of the patch is about this:
> 
>   * try to lookup the group in the hash table
>     * found => call the transition function
>     * not found
>         * enough space => call transition function
>         * not enough space => tuple/group goes to a batch
> 
> Which pretty much means all tuples need to do the lookup first. The nice
> thing on the hash-join approach is that you don't really need to do the
> lookup - you just need to peek at the hash whether the group belongs to
> the current batch (and if not, to which batch it should go).

That's an interesting point. I suspect that, in practice, the cost of
hashing the tuple is more expensive (or at least not much cheaper than)
doing a failed lookup.

> For aggregates using 'internal' to pass pointers that requires some help
> from the author - serialization/deserialization functions.

Ah, yes, this is what I was referring to earlier.

> >    * EXPLAIN details for disk usage
> >    * choose number of partitions intelligently
> 
> What is the purpose of HASH_DISK_MAX_PARTITIONS? I mean, when we decide
> we need 2048 partitions, why should we use less if we believe it will
> get us over work_mem?

Because I suspect there are costs in having an extra file around that
I'm not accounting for directly. We are implicitly assuming that the OS
will keep around enough buffers for each BufFile to do sequential writes
when needed. If we create a zillion partitions, then either we end up
with random I/O or we push the memory burden into the OS buffer cache.

We could try to model those costs explicitly to put some downward
pressure on the number of partitions we select, but I just chose to cap
it for now.

> For us, removing the sort is a big deal, because we're working with
> >100M rows regularly. It's more complicated though, because the sort is
> usually enforced by COUNT(DISTINCT) and that's not going to disappear
> because of this patch. But that's solvable with a custom aggregate.

I hope this offers you a good alternative.

I'm not sure it will ever beat sort for very high cardinality cases, but
I hope it can beat sort when the group size averages something higher
than one. It will also be safer, so the optimizer can be more aggressive
about choosing HashAgg.

Thank you for taking a look so quickly!

Regards,Jeff Davis







pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Removing dependency to wsock32.lib when compiling code on WIndows
Next
From: Noah Misch
Date:
Subject: Re: Removing dependency to wsock32.lib when compiling code on WIndows