Thread: Improve hash-agg performance
Hi,

There are two things I found while working on faster expression
evaluation, slot deforming, and batched execution. As those two issues
often seemed quite dominant cost-wise, it seemed worthwhile to evaluate
them independently.

1) At the moment we do one ExecProject() to compute each aggregate's
arguments. It turns out to be noticeably faster to compute the
arguments for all aggregates in one go, both because it reduces the
number of function calls and moves more work into a relatively tight
loop, and because it allows deforming all the required columns at once
rather than one by one. For a single aggregate it would be faster to
avoid ExecProject() altogether (i.e. directly evaluate the expression
as we used to), but as soon as you have two, the new approach is
faster.

2) For hash-aggs, we currently store the representative tuple using
the input tuple's format, with unneeded columns set to NULL. That
turns out to be expensive if the aggregated-on columns are not leading
columns, because we have to skip over a potentially large number of
NULLs. The fix is simply to use a different tuple format for the
hashtable. That doesn't add overhead, because we already move columns
in/out of the hash slot explicitly anyway.

Comments?

Regards,

Andres Freund
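To make point 1 concrete, here is a minimal standalone sketch of the
idea; Row, deform(), argcol, and the rest are invented for the example
and are not the actual PostgreSQL executor code. The win comes from
deforming the input row once and then picking every aggregate's
arguments out of the deformed copy, instead of running a separate
projection, with its own deforming pass, per aggregate:

    #include <stdio.h>

    #define NCOLS 8
    #define NAGGS 3

    /* Stand-in for a heap tuple with fixed-width columns. */
    typedef struct { int cols[NCOLS]; } Row;

    /*
     * Stand-in for slot deforming: extract the row's columns into an
     * array of datums.  This is the expensive step we want to do once
     * per input row rather than once per aggregate.
     */
    static void
    deform(const Row *row, int *datums)
    {
        for (int i = 0; i < NCOLS; i++)
            datums[i] = row->cols[i];
    }

    int
    main(void)
    {
        Row row = {{10, 20, 30, 40, 50, 60, 70, 80}};
        int argcol[NAGGS] = {1, 4, 6};  /* input column per aggregate */
        int datums[NCOLS];
        int args[NAGGS];

        /* One combined "projection": deform once... */
        deform(&row, datums);

        /* ...then fill all aggregates' arguments in a tight loop. */
        for (int i = 0; i < NAGGS; i++)
            args[i] = datums[argcol[i]];

        for (int i = 0; i < NAGGS; i++)
            printf("agg %d arg = %d\n", i, args[i]);
        return 0;
    }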
On 11/03/2016 01:07 PM, Andres Freund wrote:
> 1) At the moment we do one ExecProject() to compute each aggregate's
> arguments. It turns out to be noticeably faster to compute the
> arguments for all aggregates in one go [...]

Makes sense. If we do your refactoring of ExecEvalExpr into an
intermediate opcode representation, I assume the performance
difference will go away anyway.

> 2) For hash-aggs, we currently store the representative tuple using
> the input tuple's format, with unneeded columns set to NULL. That
> turns out to be expensive if the aggregated-on columns are not
> leading columns [...] The fix is simply to use a different tuple
> format for the hashtable.

Heh, I came to the same conclusion a couple of months ago when I was
profiling the aggregate code. I never got around to finishing up and
posting the patch I wrote back then, but here you go, for comparison.
It's pretty much the same as what you've got here. So yeah, seems like
a good idea.

- Heikki
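As an aside, the cost described in point 2 can be shown with a
self-contained sketch; the Row struct and fetch_attr() below are
invented for the illustration and come from neither patch. With a null
bitmap there is no fixed offset for column k once earlier columns can
be NULL, so fetching the grouping column still walks every preceding
attribute. A hashtable-only format that stores just the grouping
columns makes that walk trivially short:

    #include <stdbool.h>
    #include <stdio.h>

    #define NCOLS 100

    typedef struct
    {
        bool isnull[NCOLS];   /* null flag per column */
        int  values[NCOLS];   /* fixed-width column values */
    } Row;

    /* Walk the row attribute by attribute, as deforming must when
     * earlier columns can be NULL. */
    static int
    fetch_attr(const Row *row, int attno)
    {
        int val = 0;

        for (int i = 0; i <= attno; i++)
        {
            if (row->isnull[i])
                continue;       /* skipping still costs an iteration */
            val = row->values[i];
        }
        return val;
    }

    int
    main(void)
    {
        Row input = {0};

        /* Input-format representative tuple: 99 NULLs, key last. */
        for (int i = 0; i < NCOLS - 1; i++)
            input.isnull[i] = true;
        input.values[NCOLS - 1] = 42;

        /* Finding the key walks all 100 columns... */
        printf("input format: %d\n", fetch_attr(&input, NCOLS - 1));

        /* ...while a hashtable-only format keeps the key in front. */
        Row compact = {0};
        compact.values[0] = 42;
        printf("compact format: %d\n", fetch_attr(&compact, 0));
        return 0;
    }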
On 2016-11-04 15:18:49 +0200, Heikki Linnakangas wrote:
> Makes sense. If we do your refactoring of ExecEvalExpr into an
> intermediate opcode representation, I assume the performance
> difference will go away anyway.

Precisely.

> Heh, I came to the same conclusion a couple of months ago when I was
> profiling the aggregate code. I never got around to finishing up and
> posting the patch I wrote back then, but here you go, for comparison.
> It's pretty much the same as what you've got here. So yeah, seems
> like a good idea.

> + /*
> +  * Note that we don't deduplicate key columns. That would
> +  * complicate the comparisons. Don't write silly queries! (Not
> +  * sure if that would get through the parser and optimizer,
> +  * though.)
> +  */

Hehe. You made me spill more coffee.

Thanks for looking!

Greetings,

Andres Freund
Hi,

On 2016-11-03 04:07:21 -0700, Andres Freund wrote:
> 1) At the moment we do one ExecProject() to compute each aggregate's
> arguments. [...]
>
> 2) For hash-aggs, we currently store the representative tuple using
> the input tuple's format, with unneeded columns set to NULL. [...]

I pushed a bit more polished versions of these.

Andres