Thread: AllocSetEstimateChunkSpace()

AllocSetEstimateChunkSpace()

From
Jeff Davis
Date: 2020-03-24 18:12:03 -0700
Attached is a small patch that introduces a simple function,
AllocSetEstimateChunkSpace(), and uses it to improve the estimate for
the size of a hash entry for hash aggregation.

Getting reasonable estimates for the hash entry size (including the
TupleHashEntryData, the group key tuple, the per-group state, and by-
ref transition values) is important for multiple reasons:

* It helps the planner know when hash aggregation is likely to spill,
and how to cost it.

* It helps hash aggregation regulate itself by setting a group limit
(separate from the memory limit) to account for growing by-ref
transition values.

* It helps choose a good initial size for the hash table. Too large of
a hash table will crowd out space that could be used for the group keys
or the per-group state.

Each group requires up to three palloc chunks: one for the group key
tuple, one for the per-group states, and one for a by-ref transition
value. Each chunk can have a lot of overhead: in addition to the chunk
header (16 bytes overhead), it also rounds the value up to a power of
two (~25% overhead). So, it's important to account for this chunk
overhead.
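
For illustration only (this is not the attached patch; the constants
and names are assumptions based on the description above), a rough
per-chunk space estimate might look like:

#include <stddef.h>

/*
 * Rough estimate of the space consumed by a palloc of "size" bytes:
 * the request is rounded up to the next power of two, and an assumed
 * 16-byte chunk header is added on top.
 */
static size_t
estimate_chunk_space(size_t size)
{
    size_t      alloc = 8;          /* assumed minimum chunk size */

    while (alloc < size)
        alloc <<= 1;                /* power-of-two rounding */

    return alloc + 16;              /* plus chunk header overhead */
}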

I considered making it a method of a memory context, but the planner
will call this function before the hash agg memory context is created.
It seems to make more sense for this to just be an AllocSet-specific
function for now.

Regards,
    Jeff Davis


Attachment

Re: AllocSetEstimateChunkSpace()

From
Jeff Davis
Date: 2020-03-25 11:46:31 -0700
On Tue, 2020-03-24 at 18:12 -0700, Jeff Davis wrote:
> I considered making it a method of a memory context, but the planner
> will call this function before the hash agg memory context is
> created.

I implemented this approach also; attached.

It works a little better by having access to the memory context, so it
knows set->allocChunkLimit. It also allows it to work with the slab
allocator, which needs the memory context to return a useful number.
However, this introduces more code and awkwardly needs to use
CurrentMemoryContext when called from the planner (because the actual
memory context we're trying to estimate for doesn't exist yet).

I slightly favor the previous approach (mentioned in the parent email) 
because it's simple and effective. But I'm fine with this one if
someone thinks it will be better for other use cases.

Regards,
    Jeff Davis


Attachment

Re: AllocSetEstimateChunkSpace()

From
Andres Freund
Date: 2020-03-25 12:42 -0700
Hi,

On 2020-03-24 18:12:03 -0700, Jeff Davis wrote:
> Attached is a small patch that introduces a simple function,
> AllocSetEstimateChunkSpace(), and uses it to improve the estimate for
> the size of a hash entry for hash aggregation.
> 
> Getting reasonable estimates for the hash entry size (including the
> TupleHashEntryData, the group key tuple, the per-group state, and by-
> ref transition values) is important for multiple reasons:
> 
> * It helps the planner know when hash aggregation is likely to spill,
> and how to cost it.
> 
> * It helps hash aggregation regulate itself by setting a group limit
> (separate from the memory limit) to account for growing by-ref
> transition values.
> 
> * It helps choose a good initial size for the hash table. Too large of
> a hash table will crowd out space that could be used for the group keys
> or the per-group state.
> 
> Each group requires up to three palloc chunks: one for the group key
> tuple, one for the per-group states, and one for a by-ref transition
> value. Each chunk can have a lot of overhead: in addition to the chunk
> header (16 bytes overhead), it also rounds the value up to a power of
> two (~25% overhead). So, it's important to account for this chunk
> overhead.

As mentioned on IM/call: I think this is mainly an argument for combining
the group tuple & per-group state allocations. I'm kind of fine with
afterwards taking the allocator overhead into account.


I still don't buy that it's useful to estimate the by-ref transition
value overhead. We just don't have anything even close to a meaningful
value to base this on. Even if we want to consider the initial
transition value or something, we'd be better off initially
over-estimating the size of the transition state by a lot more than 25%
(I am thinking more like 4x or so, with a minimum of 128 bytes or
so). Since this is about the initial size of the hashtable, we're better
off with a too-small table, imo. A "too large" table is more likely to
end up needing to spill when filled to only a small degree.
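
A minimal sketch of that heuristic (the function name is made up; the
4x multiplier and 128-byte floor are just the numbers above):

#include <stddef.h>

static size_t
transition_space_guess(size_t declared_trans_space)
{
    size_t      guess = declared_trans_space * 4;   /* ~4x over-estimate */

    return guess < 128 ? 128 : guess;               /* 128-byte floor */
}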


I am kind of wondering if there's actually much point in trying to be
accurate here at all. Especially when doing this from the
planner. Since, for a large fraction of aggregates, we're going to very
roughly guess at transition space anyway, it seems like a bunch of
"overhead factors" could turn out to be better than trying to be
accurate on some parts, while still wildly guessing at transition space.
But I'm not sure.


> I considered making it a method of a memory context, but the planner
> will call this function before the hash agg memory context is created.
> It seems to make more sense for this to just be an AllocSet-specific
> function for now.

-1 to this approach. I think it's architecturally the wrong direction to
add more direct calls to functions within specific contexts.



On 2020-03-25 11:46:31 -0700, Jeff Davis wrote:
> On Tue, 2020-03-24 at 18:12 -0700, Jeff Davis wrote:
> > I considered making it a method of a memory context, but the planner
> > will call this function before the hash agg memory context is
> > created.
> 
> I implemented this approach also; attached.
> 
> It works a little better by having access to the memory context, so it
> knows set->allocChunkLimit. It also allows it to work with the slab
> allocator, which needs the memory context to return a useful number.
> However, this introduces more code and awkwardly needs to use
> CurrentMemoryContext when called from the planner (because the actual
> memory context we're trying to estimate for doesn't exist yet).

Yea, the "context needs to exist" part sucks. I really don't want to add
calls directly into AllocSet from more places though. And just ignoring
the parameters to the context seems wrong too.

Greetings,

Andres Freund



Re: AllocSetEstimateChunkSpace()

From
Jeff Davis
Date: 2020-03-25 14:43:43 -0700
On Wed, 2020-03-25 at 12:42 -0700, Andres Freund wrote:
> As mentioned on IM/call: I think this is mainly an argument for
> combining
> the group tuple & per-group state allocations. I'm kind of fine with
> afterwards taking the allocator overhead into account.

The overhead comes from two places: (a) the chunk header; and (b) the
round-up-to-nearest-power-of-two behavior.

Combining the group tuple and per-group states only saves the overhead
from (a); it does nothing to help (b), which is often bigger. And it
only saves that overhead when there *is* a per-group state (i.e. not
for a DISTINCT query).

> I still don't buy that it's useful to estimate the by-ref transition
> value overhead. We just don't have anything even close to a
> meaningful value to base this on.

By-ref transition values aren't a primary motivation for me. I'm fine
leaving that discussion separate if that's a sticking point. But if we
do have a way to measure chunk overhead, I don't really see a reason
not to use it for by-ref as well.

> -1 to [AllocSet-specific] approach. I think it's architecturally the
> wrong direction to
> add more direct calls to functions within specific contexts.

OK.

> Yea, the "context needs to exist" part sucks. I really don't want to
> add
> calls directly into AllocSet from more places though. And just
> ignoring
> the parameters to the context seems wrong too.

So do you generally favor the EstimateMemoryChunkSpace() patch (that
works for all context types)? Or do you have another suggestion? Or do
you think we should just do nothing?

Regards,
    Jeff Davis





Re: AllocSetEstimateChunkSpace()

From
Andres Freund
Date: 2020-03-25 15:09 -0700
Hi,

On 2020-03-25 14:43:43 -0700, Jeff Davis wrote:
> On Wed, 2020-03-25 at 12:42 -0700, Andres Freund wrote:
> > As mentioned on IM/call: I think this is mainly an argument for
> > combining
> > the group tuple & per-group state allocations. I'm kind of fine with
> > afterwards taking the allocator overhead into account.
> 
> The overhead comes from two places: (a) the chunk header; and (b) the
> round-up-to-nearest-power-of-two behavior.
> 
> Combining the group tuple and per-group states only saves the overhead
> from (a); it does nothing to help (b), which is often bigger.

Hm? It very well can help with b), since the round-up only happens once
now? I guess you could argue that it's possible that afterwards we'd be
more likely to end up in a bigger size class, and thus have roughly the
same amount of waste due to rounding? But I don't think that's all that
convincing.

I still, as I mentioned on the call, suspect that the right thing here
is to use an allocation strategy that suffers from neither a nor b (for
tuple and pergroup) and that has faster allocations too. That then also
would have the consequence that we don't need to care about per-alloc
overhead anymore (be it a or b).


> And it only saves that overhead when there *is* a per-group state
> (i.e. not for a DISTINCT query).

So?


> > I still don't buy that it's useful to estimate the by-ref transition
> > value overhead. We just don't have anything even close to a
> > meaningful value to base this on.
> 
> By-ref transition values aren't a primary motivation for me. I'm fine
> leaving that discussion separate if that's a sticking point. But if we
> do have a way to measure chunk overhead, I don't really see a reason
> not to use it for by-ref as well.

Well, my point is that it's pretty much pointless for by-ref types. The
size estimates, if they exist, are so inaccurate that we don't gain
anything by including it. As I said before, I think we'd be better off
initially assuming a higher transition space estimate.


> > Yea, the "context needs to exist" part sucks. I really don't want to
> > add
> > calls directly into AllocSet from more places though. And just
> > ignoring
> > the parameters to the context seems wrong too.
> 
> So do you generally favor the EstimateMemoryChunkSpace() patch (that
> works for all context types)? Or do you have another suggestion? Or do
> you think we should just do nothing?

I think I'm increasingly leaning towards either using a constant
overhead factor, or just getting rid of all memory context
overhead. There's clearly no obviously correct design for the "chunk
size" functions, and not having overhead is better than ~correctly
estimating it.

Greetings,

Andres Freund



Re: AllocSetEstimateChunkSpace()

From
Jeff Davis
Date:
On Wed, 2020-03-25 at 15:09 -0700, Andres Freund wrote:
> > The overhead comes from two places: (a) the chunk header; and (b)
> > the
> > round-up-to-nearest-power-of-two behavior.
> > 
> > Combining the group tuple and per-group states only saves the
> > overhead
> > from (a); it does nothing to help (b), which is often bigger.
> 
> Hm? It very well can help with b), since the round-up only happens once
> now? I guess you could argue that it's possible that afterwards we'd be
> more likely to end up in a bigger size class, and thus have roughly the
> same amount of waste due to rounding? But I don't think that's all that
> convincing.

Why is that not convincing? Each size class is double the previous one,
so piling double the memory into a single allocation doesn't help at
all. Two palloc(20)s turn into two 32-byte chunks; one palloc(40) turns
into a 64-byte chunk.
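
To make that arithmetic concrete (illustration only, ignoring chunk
headers):

#include <stdio.h>

/* Round a request up to the next power-of-two size class. */
static size_t
size_class(size_t request)
{
    size_t      alloc = 8;

    while (alloc < request)
        alloc <<= 1;
    return alloc;
}

int
main(void)
{
    /* Two separate 20-byte requests: 32 + 32 = 64 bytes of chunk space. */
    printf("2 x palloc(20): %zu\n", size_class(20) + size_class(20));

    /* One combined 40-byte request: also a 64-byte chunk. */
    printf("1 x palloc(40): %zu\n", size_class(40));

    return 0;
}

Either way, rounding wastes the same 24 bytes; combining only saves the
second chunk header.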

You might get lucky and the second chunk will fit in the wasted space
from the first chunk; but when it does cross a boundary, it will be a
bigger boundary and wipe out any efficiencies that you gained
previously.

Of course it depends on the exact distribution. But I don't see any
reason why we'd expect a distribution that would be favorable to
combining chunks together (except to avoid the chunk header, problem
(a)).

> I still, as I mentioned on the call, suspect that the right thing
> here
> is to use an allocation strategy that suffers from neither a nor b
> (for
> tuple and pergroup) and that has faster allocations too. That then
> also
> would have the consequence that we don't need to care about per-alloc
> overhead anymore (be it a or b).

It might make sense for the next release but I'm wary of more churn in
nodeAgg.c at this stage. It's not a trivial change because the
different allocations happen in different places and combining them
would be tricky.

> > So do you generally favor the EstimateMemoryChunkSpace() patch
> > (that
> > works for all context types)? Or do you have another suggestion? Or
> > do
> > you think we should just do nothing?
> 
> I think I'm increasingly leaning towards either using a constant
> overhead factor, or just getting rid of all memory context
> overhead. There's clearly no obviously correct design for the "chunk
> size" functions, and not having overhead is better than ~correctly
> estimating it.

Trying to actually eliminate the overhead sounds like v14 to me.

I believe the formula for AllocSet overhead can be approximated with:
   16 + size/4

That would probably be better than a constant, but it's a little hacky.
I can do that as a spot fix if this patch proves unpopular.
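
For illustration, that approximation as a helper (hypothetical, not
part of any attached patch):

#include <stddef.h>

/* Approximate AllocSet overhead: 16-byte header plus ~25% rounding waste. */
static size_t
approx_chunk_overhead(size_t size)
{
    return 16 + size / 4;
}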

Regards,
    Jeff Davis