Thread: AllocSetEstimateChunkSpace()
Attached is a small patch that introduces a simple function, AllocSetEstimateChunkSpace(), and uses it to improve the estimate for the size of a hash entry for hash aggregation.

Getting reasonable estimates for the hash entry size (including the TupleHashEntryData, the group key tuple, the per-group state, and by-ref transition values) is important for multiple reasons:

* It helps the planner know when hash aggregation is likely to spill, and how to cost it.

* It helps hash aggregation regulate itself by setting a group limit (separate from the memory limit) to account for growing by-ref transition values.

* It helps choose a good initial size for the hash table. Too large of a hash table will crowd out space that could be used for the group keys or the per-group state.

Each group requires up to three palloc chunks: one for the group key tuple, one for the per-group states, and one for a by-ref transition value. Each chunk can have a lot of overhead: in addition to the chunk header (16 bytes overhead), it also rounds the value up to a power of two (~25% overhead). So, it's important to account for this chunk overhead.

I considered making it a method of a memory context, but the planner will call this function before the hash agg memory context is created. It seems to make more sense for this to just be an AllocSet-specific function for now.

Regards,
	Jeff Davis
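[Editor's note] Purely as an illustration of the arithmetic described above (and not the patch itself), here is a standalone sketch of how one might estimate the space consumed per group: a 16-byte chunk header plus rounding the request up to a power of two, applied to the three allocations Jeff lists. The function name, the minimum chunk size, and the example request sizes are assumptions; the real AllocSet logic (size classes, allocChunkLimit, separately allocated large chunks) is more involved.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for AllocSetEstimateChunkSpace(): estimate the
 * space actually consumed by a palloc chunk of the requested size,
 * assuming a 16-byte chunk header and power-of-two rounding as described
 * in the thread.  The real allocator also has a chunk-size limit beyond
 * which requests get dedicated blocks, which is not modeled here.
 */
static size_t
estimate_chunk_space(size_t request)
{
	size_t		chunk = 8;		/* assumed minimum chunk size */

	while (chunk < request)
		chunk <<= 1;			/* round up to the next power of two */

	return chunk + 16;			/* plus the chunk header */
}

int
main(void)
{
	/* illustrative request sizes only, not taken from any real plan */
	size_t		tuple_width = 40;	/* group key tuple */
	size_t		pergroup = 24;		/* per-group transition states */
	size_t		transvalue = 20;	/* one by-ref transition value */

	size_t		est = estimate_chunk_space(tuple_width) +
		estimate_chunk_space(pergroup) +
		estimate_chunk_space(transvalue);

	printf("estimated space per group: %zu bytes (vs. %zu bytes requested)\n",
		   est, tuple_width + pergroup + transvalue);
	return 0;
}
```

Under these assumptions the three chunks consume 176 bytes for 84 bytes of requested data, which is the kind of gap the patch tries to account for.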
On Tue, 2020-03-24 at 18:12 -0700, Jeff Davis wrote:
> I considered making it a method of a memory context, but the planner will call this function before the hash agg memory context is created.

I implemented this approach also; attached.

It works a little better by having access to the memory context, so it knows set->allocChunkLimit. It also allows it to work with the slab allocator, which needs the memory context to return a useful number.

However, this introduces more code and awkwardly needs to use CurrentMemoryContext when called from the planner (because the actual memory context we're trying to estimate for doesn't exist yet).

I slightly favor the previous approach (mentioned in the parent email) because it's simple and effective. But I'm fine with this one if someone thinks it will be better for other use cases.

Regards,
	Jeff Davis
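[Editor's note] For readers unfamiliar with the context-method idea, here is a heavily simplified model, not PostgreSQL's actual structs and not the attached patch, of how a generic EstimateMemoryChunkSpace() could dispatch through a per-context method table, so that an aset-style and a slab-style context each answer with their own arithmetic. All type layouts, constants, and function names below are illustrative assumptions.

```c
#include <stddef.h>
#include <stdio.h>

/* Illustrative, heavily simplified stand-ins for memory-context types. */
typedef struct MemoryContextData MemoryContextData;
typedef MemoryContextData *MemoryContext;

typedef struct MemoryContextMethods
{
	/* alloc/free/reset methods would live here in a real method table */
	size_t		(*estimate_chunk_space) (MemoryContext context, size_t size);
} MemoryContextMethods;

struct MemoryContextData
{
	const MemoryContextMethods *methods;
	size_t		fixed_chunk_size;	/* used by the slab-like context only */
};

/* aset-like estimate: chunk header plus power-of-two rounding */
static size_t
aset_estimate_chunk_space(MemoryContext context, size_t size)
{
	size_t		chunk = 8;

	(void) context;
	while (chunk < size)
		chunk <<= 1;
	return chunk + 16;
}

/* slab-like estimate: every chunk is the same size, regardless of request */
static size_t
slab_estimate_chunk_space(MemoryContext context, size_t size)
{
	(void) size;
	return context->fixed_chunk_size + 16;
}

/* Generic entry point: works for any context type via its method table. */
static size_t
EstimateMemoryChunkSpace(MemoryContext context, size_t size)
{
	return context->methods->estimate_chunk_space(context, size);
}

int
main(void)
{
	const MemoryContextMethods aset_methods = {aset_estimate_chunk_space};
	const MemoryContextMethods slab_methods = {slab_estimate_chunk_space};
	MemoryContextData aset_ctx = {&aset_methods, 0};
	MemoryContextData slab_ctx = {&slab_methods, 64};

	printf("aset estimate for 40 bytes: %zu\n",
		   EstimateMemoryChunkSpace(&aset_ctx, 40));
	printf("slab estimate for 40 bytes: %zu\n",
		   EstimateMemoryChunkSpace(&slab_ctx, 40));
	return 0;
}
```

The awkward part Jeff mentions is visible in this shape: the dispatch needs a context object to exist, which is why the planner would have to fall back to CurrentMemoryContext.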
Hi,

On 2020-03-24 18:12:03 -0700, Jeff Davis wrote:
> Attached is a small patch that introduces a simple function, AllocSetEstimateChunkSpace(), and uses it to improve the estimate for the size of a hash entry for hash aggregation.
>
> Getting reasonable estimates for the hash entry size (including the TupleHashEntryData, the group key tuple, the per-group state, and by-ref transition values) is important for multiple reasons:
>
> * It helps the planner know when hash aggregation is likely to spill, and how to cost it.
>
> * It helps hash aggregation regulate itself by setting a group limit (separate from the memory limit) to account for growing by-ref transition values.
>
> * It helps choose a good initial size for the hash table. Too large of a hash table will crowd out space that could be used for the group keys or the per-group state.
>
> Each group requires up to three palloc chunks: one for the group key tuple, one for the per-group states, and one for a by-ref transition value. Each chunk can have a lot of overhead: in addition to the chunk header (16 bytes overhead), it also rounds the value up to a power of two (~25% overhead). So, it's important to account for this chunk overhead.

As mentioned on IM/call: I think this is mainly an argument for combining the group tuple & per-group state allocations. I'm kind of fine with afterwards taking the allocator overhead into account.

I still don't buy that it's useful to estimate the by-ref transition value overhead. We just don't have anything even close to a meaningful value to base this on. Even if we want to consider the initial transition value or something, we'd be better off initially over-estimating the size of the transition state by a lot more than 25% (I am thinking more like 4x or so, with a minimum of 128 bytes or so). Since this is about the initial size of the hashtable, we're better off with a too-small table, imo. A "too large" table is more likely to end up needing to spill when filled to only a small degree.

I am kind of wondering if there's actually much point in trying to be accurate here at all, especially when doing this from the planner. Since, for a large fraction of aggregates, we're going to very roughly guess at transition space anyway, it seems like a bunch of "overhead factors" could turn out to be better than trying to be accurate on some parts while still wildly guessing at transition space. But I'm not sure.

> I considered making it a method of a memory context, but the planner will call this function before the hash agg memory context is created. It seems to make more sense for this to just be an AllocSet-specific function for now.

-1 to this approach. I think it's architecturally the wrong direction to add more direct calls to functions within specific contexts.

On 2020-03-25 11:46:31 -0700, Jeff Davis wrote:
> On Tue, 2020-03-24 at 18:12 -0700, Jeff Davis wrote:
> > I considered making it a method of a memory context, but the planner will call this function before the hash agg memory context is created.
>
> I implemented this approach also; attached.
>
> It works a little better by having access to the memory context, so it knows set->allocChunkLimit. It also allows it to work with the slab allocator, which needs the memory context to return a useful number.
>
> However, this introduces more code and awkwardly needs to use CurrentMemoryContext when called from the planner (because the actual memory context we're trying to estimate for doesn't exist yet).

Yea, the "context needs to exist" part sucks. I really don't want to add calls directly into AllocSet from more places though. And just ignoring the parameters to the context seems wrong too.

Greetings,

Andres Freund
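[Editor's note] A minimal sketch of the conservative over-estimate Andres suggests for by-ref transition state, using the "4x or so, with a minimum of 128 bytes" numbers from the message above. The helper name and the way a planner-supplied estimate would feed in are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: instead of modelling allocator overhead for a
 * by-ref transition value, take whatever rough transition-space number
 * the planner has and over-estimate it outright.  The 4x factor and
 * 128-byte floor are the values suggested in the thread.
 */
static size_t
overestimate_transition_space(size_t planner_estimate)
{
	size_t		est = planner_estimate * 4;

	return (est < 128) ? 128 : est;
}

int
main(void)
{
	printf("estimate for 20 bytes: %zu\n", overestimate_transition_space(20));
	printf("estimate for 100 bytes: %zu\n", overestimate_transition_space(100));
	return 0;
}
```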
On Wed, 2020-03-25 at 12:42 -0700, Andres Freund wrote:
> As mentioned on IM/call: I think this is mainly an argument for combining the group tuple & per-group state allocations. I'm kind of fine with afterwards taking the allocator overhead into account.

The overhead comes from two places: (a) the chunk header; and (b) the round-up-to-nearest-power-of-two behavior.

Combining the group tuple and per-group states only saves the overhead from (a); it does nothing to help (b), which is often bigger. And it only saves that overhead when there *is* a per-group state (i.e. not for a DISTINCT query).

> I still don't buy that it's useful to estimate the by-ref transition value overhead. We just don't have anything even close to a meaningful value to base this on.

By-ref transition values aren't a primary motivation for me. I'm fine leaving that discussion separate if that's a sticking point. But if we do have a way to measure chunk overhead, I don't really see a reason not to use it for by-ref as well.

> -1 to the [AllocSet-specific] approach. I think it's architecturally the wrong direction to add more direct calls to functions within specific contexts.

OK.

> Yea, the "context needs to exist" part sucks. I really don't want to add calls directly into AllocSet from more places though. And just ignoring the parameters to the context seems wrong too.

So do you generally favor the EstimateMemoryChunkSpace() patch (that works for all context types)? Or do you have another suggestion? Or do you think we should just do nothing?

Regards,
	Jeff Davis
Hi,

On 2020-03-25 14:43:43 -0700, Jeff Davis wrote:
> On Wed, 2020-03-25 at 12:42 -0700, Andres Freund wrote:
> > As mentioned on IM/call: I think this is mainly an argument for combining the group tuple & per-group state allocations. I'm kind of fine with afterwards taking the allocator overhead into account.
>
> The overhead comes from two places: (a) the chunk header; and (b) the round-up-to-nearest-power-of-two behavior.
>
> Combining the group tuple and per-group states only saves the overhead from (a); it does nothing to help (b), which is often bigger.

Hm? It very well can help with (b), since the round-up only happens once now. I guess you could argue that it's possible that afterwards we'd be more likely to end up in a bigger size class, and thus have roughly the same amount of waste due to rounding? But I don't think that's all that convincing.

I still, as I mentioned on the call, suspect that the right thing here is to use an allocation strategy that suffers from neither (a) nor (b) (for tuple and pergroup) and that has faster allocations too. That would also have the consequence that we don't need to care about per-alloc overhead anymore (be it (a) or (b)).

> And it only saves that overhead when there *is* a per-group state (i.e. not for a DISTINCT query).

So?

> > I still don't buy that it's useful to estimate the by-ref transition value overhead. We just don't have anything even close to a meaningful value to base this on.
>
> By-ref transition values aren't a primary motivation for me. I'm fine leaving that discussion separate if that's a sticking point. But if we do have a way to measure chunk overhead, I don't really see a reason not to use it for by-ref as well.

Well, my point is that it's pretty much pointless for by-ref types. The size estimates, if they exist, are so inaccurate that we don't gain anything by including it. As I said before, I think we'd be better off initially assuming a higher transition space estimate.

> > Yea, the "context needs to exist" part sucks. I really don't want to add calls directly into AllocSet from more places though. And just ignoring the parameters to the context seems wrong too.
>
> So do you generally favor the EstimateMemoryChunkSpace() patch (that works for all context types)? Or do you have another suggestion? Or do you think we should just do nothing?

I think I'm increasingly leaning towards either using a constant overhead factor, or just getting rid of all memory context overhead. There's clearly no obviously correct design for the "chunk size" functions, and not having overhead is better than ~correctly estimating it.

Greetings,

Andres Freund
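[Editor's note] One way to read "an allocation strategy that suffers from neither (a) nor (b)" is a bump-style allocator that packs requests into large blocks with no per-chunk header and no power-of-two rounding. The toy sketch below only illustrates that general idea; it is not part of the patches in this thread, and a real implementation would need block chaining, resets, and alignment rules suited to the tuples stored in it.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Toy bump allocator: requests are carved sequentially out of one large
 * block, with no per-allocation header and no rounding, only alignment.
 * Individual allocations can't be freed; the whole arena is released at
 * once (similar in spirit to resetting a memory context).
 */
typedef struct BumpArena
{
	char	   *base;
	size_t		used;
	size_t		capacity;
} BumpArena;

static void *
bump_alloc(BumpArena *arena, size_t size)
{
	size_t		aligned = (arena->used + 7) & ~(size_t) 7;	/* 8-byte align */

	if (aligned + size > arena->capacity)
		return NULL;			/* a real version would chain a new block */
	arena->used = aligned + size;
	return arena->base + aligned;
}

int
main(void)
{
	BumpArena	arena;
	void	   *tuple;
	void	   *pergroup;

	arena.capacity = 1024 * 1024;
	arena.base = malloc(arena.capacity);
	arena.used = 0;
	if (arena.base == NULL)
		return 1;

	/* e.g. group key tuple and per-group state, packed back to back */
	tuple = bump_alloc(&arena, 40);
	pergroup = bump_alloc(&arena, 24);
	if (tuple == NULL || pergroup == NULL)
		return 1;
	memset(tuple, 0, 40);
	memset(pergroup, 0, 24);

	free(arena.base);
	return 0;
}
```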
On Wed, 2020-03-25 at 15:09 -0700, Andres Freund wrote:
> > The overhead comes from two places: (a) the chunk header; and (b) the round-up-to-nearest-power-of-two behavior.
> >
> > Combining the group tuple and per-group states only saves the overhead from (a); it does nothing to help (b), which is often bigger.
>
> Hm? It very well can help with (b), since the round-up only happens once now. I guess you could argue that it's possible that afterwards we'd be more likely to end up in a bigger size class, and thus have roughly the same amount of waste due to rounding? But I don't think that's all that convincing.

Why is that not convincing? Each size class is double the previous one, so piling double the memory into a single allocation doesn't help at all. Two palloc(20)s turn into two 32-byte chunks; one palloc(40) turns into a 64-byte chunk.

You might get lucky and the second chunk will fit in the wasted space from the first chunk; but when it does cross a boundary, it will be a bigger boundary and wipe out any efficiencies that you gained previously. Of course it depends on the exact distribution, but I don't see any reason why we'd expect a distribution that would be favorable to combining chunks together (except to avoid the chunk header, problem (a)).

> I still, as I mentioned on the call, suspect that the right thing here is to use an allocation strategy that suffers from neither (a) nor (b) (for tuple and pergroup) and that has faster allocations too. That would also have the consequence that we don't need to care about per-alloc overhead anymore (be it (a) or (b)).

It might make sense for the next release, but I'm wary of more churn in nodeAgg.c at this stage. It's not a trivial change because the different allocations happen in different places, and combining them would be tricky.

> > So do you generally favor the EstimateMemoryChunkSpace() patch (that works for all context types)? Or do you have another suggestion? Or do you think we should just do nothing?
>
> I think I'm increasingly leaning towards either using a constant overhead factor, or just getting rid of all memory context overhead. There's clearly no obviously correct design for the "chunk size" functions, and not having overhead is better than ~correctly estimating it.

Trying to actually eliminate the overhead sounds like v14 to me.

I believe the AllocSet overhead can be approximated with:

    16 + size/4

That would probably be better than a constant, but it's a little hacky. I can do that as a spot fix if this patch proves unpopular.

Regards,
	Jeff Davis
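[Editor's note] A quick standalone check of the arithmetic in this exchange: exact per-chunk overhead under the header-plus-rounding model discussed earlier in the thread, for two palloc(20)s versus one palloc(40), alongside the proposed 16 + size/4 approximation. The rounding model and minimum chunk size are simplifications and don't capture AllocSet's real size classes or chunk limit.

```c
#include <stddef.h>
#include <stdio.h>

/* exact overhead under the model used in this thread: header + rounding */
static size_t
exact_overhead(size_t request)
{
	size_t		chunk = 8;

	while (chunk < request)
		chunk <<= 1;
	return (chunk - request) + 16;
}

/* the proposed approximation: 16 + size/4 */
static size_t
approx_overhead(size_t request)
{
	return 16 + request / 4;
}

int
main(void)
{
	/* two separate palloc(20)s vs. one combined palloc(40) */
	printf("two palloc(20)s: exact overhead %zu, approx %zu\n",
		   2 * exact_overhead(20), 2 * approx_overhead(20));
	printf("one palloc(40):  exact overhead %zu, approx %zu\n",
		   exact_overhead(40), approx_overhead(40));
	return 0;
}
```

Under this model the combined allocation saves exactly one 16-byte chunk header (40 bytes of overhead vs. 56), which matches Jeff's point that merging the allocations mainly helps with (a) rather than (b).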