Thread: Improve hash-agg performance

Improve hash-agg performance

From
Andres Freund
Date:
Hi,

There's two things I found while working on faster expression
evaluation, slot deforming and batched execution. As those two issues
often seemed quite dominant cost-wise it seemed worthwhile to evaluate
them independently.

1) We atm do one ExecProject() to compute each aggregate's
   arguments. Turns out it's noticeably faster to compute the argument
   for all aggregates in one go. Both because it reduces the amount of
   function call / moves more things into a relatively tight loop, and
   because it allows to deform all the required columns at once, rather
   than one-by-one.  For a single aggregate it'd be faster to avoid
   ExecProject alltogether (i.e. directly evaluate the expression as we
   used to), but as soon you have two the new approach is faster.

2) For hash-aggs we right now we store the representative tuple using
   the input tuple's format, with unneeded columns set to NULL. That
   turns out to be expensive if the aggregated-on columns are not
   leading columns, because we have to skip over a potentially large
   number of NULLs.  The fix here is to simply use a different tuple
   format for the hashtable.  That doesn't cause overhead, because we
   already move columns in/out of the hashslot explicitly anyway.

Comments?

Regards,

Andres Freund

Attachment

Re: Improve hash-agg performance

From
Heikki Linnakangas
Date:
On 11/03/2016 01:07 PM, Andres Freund wrote:
> Hi,
>
> There's two things I found while working on faster expression
> evaluation, slot deforming and batched execution. As those two issues
> often seemed quite dominant cost-wise it seemed worthwhile to evaluate
> them independently.
>
> 1) We atm do one ExecProject() to compute each aggregate's
>    arguments. Turns out it's noticeably faster to compute the argument
>    for all aggregates in one go. Both because it reduces the amount of
>    function call / moves more things into a relatively tight loop, and
>    because it allows to deform all the required columns at once, rather
>    than one-by-one.  For a single aggregate it'd be faster to avoid
>    ExecProject alltogether (i.e. directly evaluate the expression as we
>    used to), but as soon you have two the new approach is faster.

Makes sense. If we do your refactoring of ExecEvalExpr into an
intermediate opcode representation, I assume the performance difference
will go away anyway.

> 2) For hash-aggs we right now we store the representative tuple using
>    the input tuple's format, with unneeded columns set to NULL. That
>    turns out to be expensive if the aggregated-on columns are not
>    leading columns, because we have to skip over a potentially large
>    number of NULLs.  The fix here is to simply use a different tuple
>    format for the hashtable.  That doesn't cause overhead, because we
>    already move columns in/out of the hashslot explicitly anyway.

Heh, I came to the same conclusion a couple of months ago when I was
profiling the aggregate code. I never got around to finish up and post
the patch I wrote back then, but here you go, for comparison. It's
pretty much the same as what you got here. So yeah, seems like a good idea.

- Heikki


Attachment

Re: Improve hash-agg performance

From
Andres Freund
Date:
On 2016-11-04 15:18:49 +0200, Heikki Linnakangas wrote:
> On 11/03/2016 01:07 PM, Andres Freund wrote:
> > Hi,
> > 
> > There's two things I found while working on faster expression
> > evaluation, slot deforming and batched execution. As those two issues
> > often seemed quite dominant cost-wise it seemed worthwhile to evaluate
> > them independently.
> > 
> > 1) We atm do one ExecProject() to compute each aggregate's
> >    arguments. Turns out it's noticeably faster to compute the argument
> >    for all aggregates in one go. Both because it reduces the amount of
> >    function call / moves more things into a relatively tight loop, and
> >    because it allows to deform all the required columns at once, rather
> >    than one-by-one.  For a single aggregate it'd be faster to avoid
> >    ExecProject alltogether (i.e. directly evaluate the expression as we
> >    used to), but as soon you have two the new approach is faster.
> 
> Makes sense. If we do your refactoring of ExecEvalExpr into an intermediate
> opcode representation, I assume the performance difference will go away
> anyway.

Precisely.


> > 2) For hash-aggs we right now we store the representative tuple using
> >    the input tuple's format, with unneeded columns set to NULL. That
> >    turns out to be expensive if the aggregated-on columns are not
> >    leading columns, because we have to skip over a potentially large
> >    number of NULLs.  The fix here is to simply use a different tuple
> >    format for the hashtable.  That doesn't cause overhead, because we
> >    already move columns in/out of the hashslot explicitly anyway.

> Heh, I came to the same conclusion a couple of months ago when I was
> profiling the aggregate code. I never got around to finish up and post the
> patch I wrote back then, but here you go, for comparison. It's pretty much
> the same as what you got here. So yeah, seems like a good idea.


> +        /*
> +         * Note that we don't deduplicate key columns. That would complicate
> +         * the comparisons. Don't write silly queries! (Not sure if that would get
> +         * through the parser and optimizer, though).

Hehe. You made me spill more coffee.


Thanks for looking!

Greetings,

Andres Freund



Re: Improve hash-agg performance

From
Andres Freund
Date:
Hi,

On 2016-11-03 04:07:21 -0700, Andres Freund wrote:
> Hi,
>
> There's two things I found while working on faster expression
> evaluation, slot deforming and batched execution. As those two issues
> often seemed quite dominant cost-wise it seemed worthwhile to evaluate
> them independently.
>
> 1) We atm do one ExecProject() to compute each aggregate's
>    arguments. Turns out it's noticeably faster to compute the argument
>    for all aggregates in one go. Both because it reduces the amount of
>    function call / moves more things into a relatively tight loop, and
>    because it allows to deform all the required columns at once, rather
>    than one-by-one.  For a single aggregate it'd be faster to avoid
>    ExecProject alltogether (i.e. directly evaluate the expression as we
>    used to), but as soon you have two the new approach is faster.
>
> 2) For hash-aggs we right now we store the representative tuple using
>    the input tuple's format, with unneeded columns set to NULL. That
>    turns out to be expensive if the aggregated-on columns are not
>    leading columns, because we have to skip over a potentially large
>    number of NULLs.  The fix here is to simply use a different tuple
>    format for the hashtable.  That doesn't cause overhead, because we
>    already move columns in/out of the hashslot explicitly anyway.

I pushed a bit more polished versions of these.

Andres