Thread: [PATCH] Lazy hashaggregate when no aggregation is needed

[PATCH] Lazy hashaggregate when no aggregation is needed

From
Ants Aasma
Date:
A user complained on pgsql-performance that SELECT col FROM table
GROUP BY col LIMIT 2; performs a full table scan. ISTM that it's safe
to return tuples from hash-aggregate as they are found when no
aggregate functions are in use. Attached is a first shot at that. The
planner is modified so that when the optimization applies, hash table
size check is compared against the limit and start up cost comes from
the input. The executor is modified so that when the hash table is not
filled yet and the optimization applies, nodes are returned
immediately.

Can somebody poke holes in this? The patch definitely needs some code
cleanup in nodeAgg.c, but otherwise it passes regression tests and
seems to work as intended. It also optimizes the SELECT DISTINCT col
FROM table LIMIT 2; case, but not SELECT DISTINCT ON (col) col FROM
table LIMIT 2 because it is explicitly forced to use sorted
aggregation.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

Attachment

Re: [PATCH] Lazy hashaggregate when no aggregation is needed

From
Tom Lane
Date:
Ants Aasma <ants@cybertec.at> writes:
> A user complained on pgsql-performance that SELECT col FROM table
> GROUP BY col LIMIT 2; performs a full table scan. ISTM that it's safe
> to return tuples from hash-aggregate as they are found when no
> aggregate functions are in use. Attached is a first shot at that.

As I commented in the other thread, the user would be a lot better off
if he'd had an index on the column in question.  I'm not sure it's worth
complicating the hashagg logic when an indexscan + groupagg would
address the case better.
        regards, tom lane


Re: [PATCH] Lazy hashaggregate when no aggregation is needed

From
Jay Levitt
Date:
Tom Lane wrote:
> Ants Aasma<ants@cybertec.at> writes:
>> A user complained on pgsql-performance that SELECT col FROM table
>> GROUP BY col LIMIT 2; performs a full table scan. ISTM that it's safe
>> to return tuples from hash-aggregate as they are found when no
>> aggregate functions are in use. Attached is a first shot at that.
>
> As I commented in the other thread, the user would be a lot better off
> if he'd had an index on the column in question. I'm not sure it's worth
> complicating the hashagg logic when an indexscan + groupagg would
> address the case better.

Would this patch help in the case where "table" is actually a set-returning 
function, and thus can't have an index? (I don't yet know enough about the 
tree to know when hashaggs get used). I'm wondering if this is a useful 
exception to the "restrictions can't get pushed down through GROUP BYs" rule.

Jay


Re: [PATCH] Lazy hashaggregate when no aggregation is needed

From
Etsuro Fujita
Date:
Hi,

I would like to ask a question before looking into the patch.

At 21:56 12/03/30 -0400, Jay Levitt wrote:
>Tom Lane wrote:
>>Ants Aasma<ants@cybertec.at> writes:
>>>A user complained on pgsql-performance that SELECT col FROM table
>>>GROUP BY col LIMIT 2; performs a full table scan. ISTM that it's safe
>>>to return tuples from hash-aggregate as they are found when no
>>>aggregate functions are in use. Attached is a first shot at that.
>>
>>As I commented in the other thread, the user would be a lot better off
>>if he'd had an index on the column in question. I'm not sure it's worth
>>complicating the hashagg logic when an indexscan + groupagg would
>>address the case better.
>
>Would this patch help in the case where "table" is actually a 
>set-returning function, and thus can't have an index?

ISTM that in many cases, the result size of a set-returning function is 
not so large compared with that of a full plain table scan.  So, in such a 
case a full hash aggregation is not so time consuming.  Am I wrong?

Best regards,
Etsuro Fujita 



Re: [PATCH] Lazy hashaggregate when no aggregation is needed

From
Robert Haas
Date:
On Fri, Jun 15, 2012 at 6:55 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:
>>>> A user complained on pgsql-performance that SELECT col FROM table
>>>> GROUP BY col LIMIT 2; performs a full table scan. ISTM that it's safe
>>>> to return tuples from hash-aggregate as they are found when no
>>>> aggregate functions are in use. Attached is a first shot at that.
>>>
>>>
>>> As I commented in the other thread, the user would be a lot better off
>>> if he'd had an index on the column in question. I'm not sure it's worth
>>> complicating the hashagg logic when an indexscan + groupagg would
>>> address the case better.
>>
>> Would this patch help in the case where "table" is actually a
>> set-returning function, and thus can't have an index?
>
> ISTM that in many cases, the result size of a set-returning function is not
> so large compared with that of a full plain table scan.  So, in such a case
> a full hash aggregation is not so time consuming.  Am I wrong?

This query is a little unusual in that it involves both an aggregate
and a limit.

Now, sorted aggregates work pretty well with limit, because you can be
sure upon seeing the beginning of the next group that you are done
with the previous group.  But in a hash aggregate, you normally can't
start returning results until you've read the entire input, so it
doesn't work so well with limit.

However, as Ants points out, we could make it work better for the
special case where we're not actually doing any aggregation, because
in that case we can emit the row for each group when the group is
created, rather than waiting until end-of-input.  This is only going
to help when there is a LIMIT, though.  Moreover, if there happens to
be an ORDER BY, then the data will have to be pre-sorted, in which
case you may as well use a sorted aggregate.  So the use case for this
optimization is basically DISTINCT plus LIMIT but not ORDER BY.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH] Lazy hashaggregate when no aggregation is needed

From
Ants Aasma
Date:
On Fri, Jun 15, 2012 at 3:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> However, as Ants points out, we could make it work better for the
> special case where we're not actually doing any aggregation, because
> in that case we can emit the row for each group when the group is
> created, rather than waiting until end-of-input.  This is only going
> to help when there is a LIMIT, though.  Moreover, if there happens to
> be an ORDER BY, then the data will have to be pre-sorted, in which
> case you may as well use a sorted aggregate.  So the use case for this
> optimization is basically DISTINCT plus LIMIT but not ORDER BY.

Exactly. I think the first question for this patch should be whether
this use-case is worth the complexity of the patch. I can't imagine
any really compelling use cases that need an arbitrary distinct subset
of results. The original complaint on -performance [1], didn't specify
a real world use case, but it seemed to be a case of an ORM generating
suboptimal queries. On the other hand, the patch itself is in my
opinion rather simple, so it might be worth it.

It has one outstanding issue, query_planner chooses the cheapest path
based on total cost. This can be suboptimal when that path happens to
have high startup cost. It seems to me that enabling the query_planner
to find the cheapest unsorted path returning a limited amount of
tuples would require some major surgery to the planner. To be clear,
this is only a case of missed optimization, not a regression.

It won't help set returning functions because the tuplestore for those
is fully materialized when the first row is fetched.

[1] http://archives.postgresql.org/message-id/16737833.463.1332881676120.JavaMail.geo-discussion-forums%40pbcpw7

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


Re: [PATCH] Lazy hashaggregate when no aggregation is needed

From
Robert Haas
Date:
On Fri, Jun 15, 2012 at 9:22 AM, Ants Aasma <ants@cybertec.at> wrote:
> Exactly. I think the first question for this patch should be whether
> this use-case is worth the complexity of the patch. I can't imagine
> any really compelling use cases that need an arbitrary distinct subset
> of results.

Me neither.

> The original complaint on -performance [1], didn't specify
> a real world use case, but it seemed to be a case of an ORM generating
> suboptimal queries. On the other hand, the patch itself is in my
> opinion rather simple, so it might be worth it.

Yeah.

> It has one outstanding issue, query_planner chooses the cheapest path
> based on total cost. This can be suboptimal when that path happens to
> have high startup cost. It seems to me that enabling the query_planner
> to find the cheapest unsorted path returning a limited amount of
> tuples would require some major surgery to the planner. To be clear,
> this is only a case of missed optimization, not a regression.

I'm confused by this remark, because surely the query planner does it
this way only if there's no LIMIT.  When there is a LIMIT, we choose
based on the startup cost plus the estimated fraction of the total
cost we expect to pay based on dividing the LIMIT by the overall row
count estimate.  Or is this not what you're talking about?

> It won't help set returning functions because the tuplestore for those
> is fully materialized when the first row is fetched.

That's surely a problem for another day.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH] Lazy hashaggregate when no aggregation is needed

From
"Etsuro Fujita"
Date:
Hi,

> -----Original Message-----
> From: Robert Haas [mailto:robertmhaas@gmail.com]
> Sent: Tuesday, June 19, 2012 3:12 AM
> To: Ants Aasma
> Cc: Etsuro Fujita; Jay Levitt; Tom Lane; PostgreSQL-development; Francois
> Deliege
> Subject: Re: [HACKERS] [PATCH] Lazy hashaggregate when no aggregation is
> needed
>
> On Fri, Jun 15, 2012 at 9:22 AM, Ants Aasma <ants@cybertec.at> wrote:
> > Exactly. I think the first question for this patch should be whether
> > this use-case is worth the complexity of the patch. I can't imagine
> > any really compelling use cases that need an arbitrary distinct subset
> > of results.
>
> Me neither.
>
> > The original complaint on -performance [1], didn't specify a real
> > world use case, but it seemed to be a case of an ORM generating
> > suboptimal queries. On the other hand, the patch itself is in my
> > opinion rather simple, so it might be worth it.
>
> Yeah.
>
> > It has one outstanding issue, query_planner chooses the cheapest path
> > based on total cost. This can be suboptimal when that path happens to
> > have high startup cost. It seems to me that enabling the query_planner
> > to find the cheapest unsorted path returning a limited amount of
> > tuples would require some major surgery to the planner. To be clear,
> > this is only a case of missed optimization, not a regression.
>
> I'm confused by this remark, because surely the query planner does it this
> way only if there's no LIMIT.  When there is a LIMIT, we choose based on
> the startup cost plus the estimated fraction of the total cost we expect
> to pay based on dividing the LIMIT by the overall row count estimate.  Or
> is this not what you're talking about?

I think that Ants is pointing the way of estimating costs in
choose_hashed_grouping()/choose_hashed_distinct(), ie cost_agg() for
cheapest_path + hashagg, where the costs are calculated based on the total
cost only of cheapest_path.  I think that it might be good to do cost_agg()
for the discussed case with the AGG_SORTED strategy, not the AGG_HASHED
strategy.

> > It won't help set returning functions because the tuplestore for those
> > is fully materialized when the first row is fetched.
>
> That's surely a problem for another day.

OK

Best regards,
Etsuro Fujita





Re: [PATCH] Lazy hashaggregate when no aggregation is needed

From
Robert Haas
Date:
On Tue, Jun 19, 2012 at 5:41 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:
>> I'm confused by this remark, because surely the query planner does it this
>> way only if there's no LIMIT.  When there is a LIMIT, we choose based on
>> the startup cost plus the estimated fraction of the total cost we expect
>> to pay based on dividing the LIMIT by the overall row count estimate.  Or
>> is this not what you're talking about?
>
> I think that Ants is pointing the way of estimating costs in
> choose_hashed_grouping()/choose_hashed_distinct(), ie cost_agg() for
> cheapest_path + hashagg, where the costs are calculated based on the total
> cost only of cheapest_path.  I think that it might be good to do cost_agg()
> for the discussed case with the AGG_SORTED strategy, not the AGG_HASHED
> strategy.

Well, Ants already made some adjustments to those functions; not sure
if this means they need some more adjustment, but I don't see that
there's a general problem with the costing algorithm around LIMIT.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH] Lazy hashaggregate when no aggregation is needed

From
Robert Haas
Date:
On Fri, Jun 22, 2012 at 10:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jun 19, 2012 at 5:41 AM, Etsuro Fujita
> <fujita.etsuro@lab.ntt.co.jp> wrote:
>>> I'm confused by this remark, because surely the query planner does it this
>>> way only if there's no LIMIT.  When there is a LIMIT, we choose based on
>>> the startup cost plus the estimated fraction of the total cost we expect
>>> to pay based on dividing the LIMIT by the overall row count estimate.  Or
>>> is this not what you're talking about?
>>
>> I think that Ants is pointing the way of estimating costs in
>> choose_hashed_grouping()/choose_hashed_distinct(), ie cost_agg() for
>> cheapest_path + hashagg, where the costs are calculated based on the total
>> cost only of cheapest_path.  I think that it might be good to do cost_agg()
>> for the discussed case with the AGG_SORTED strategy, not the AGG_HASHED
>> strategy.
>
> Well, Ants already made some adjustments to those functions; not sure
> if this means they need some more adjustment, but I don't see that
> there's a general problem with the costing algorithm around LIMIT.

Ants, do you intend to update this patch for this CommitFest?  Or at
all?  It seems nobody's too excited about this, so I'm not sure
whether it makes sense for you to put more work on it.  But please
advise as to your plans.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH] Lazy hashaggregate when no aggregation is needed

From
"Etsuro Fujita"
Date:
Hi,

> -----Original Message-----
> From: Robert Haas [mailto:robertmhaas@gmail.com]
> Sent: Wednesday, June 27, 2012 5:09 AM
> To: Etsuro Fujita
> Cc: Ants Aasma; Jay Levitt; Tom Lane; PostgreSQL-development; Francois Deliege
> Subject: Re: [HACKERS] [PATCH] Lazy hashaggregate when no aggregation is
needed
>
> On Fri, Jun 22, 2012 at 10:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Tue, Jun 19, 2012 at 5:41 AM, Etsuro Fujita
> > <fujita.etsuro@lab.ntt.co.jp> wrote:
> >>> I'm confused by this remark, because surely the query planner does it this
> >>> way only if there's no LIMIT.  When there is a LIMIT, we choose based on
> >>> the startup cost plus the estimated fraction of the total cost we expect
> >>> to pay based on dividing the LIMIT by the overall row count estimate.  Or
> >>> is this not what you're talking about?
> >>
> >> I think that Ants is pointing the way of estimating costs in
> >> choose_hashed_grouping()/choose_hashed_distinct(), ie cost_agg() for
> >> cheapest_path + hashagg, where the costs are calculated based on the total
> >> cost only of cheapest_path.  I think that it might be good to do cost_agg()
> >> for the discussed case with the AGG_SORTED strategy, not the AGG_HASHED
> >> strategy.
> >
> > Well, Ants already made some adjustments to those functions; not sure
> > if this means they need some more adjustment, but I don't see that
> > there's a general problem with the costing algorithm around LIMIT.
>
> Ants, do you intend to update this patch for this CommitFest?  Or at
> all?  It seems nobody's too excited about this, so I'm not sure
> whether it makes sense for you to put more work on it.  But please
> advise as to your plans.

Please excuse my slow response, I would also like to know your plan.

Best regards,
Etsuro Fujita

> Thanks,
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>




Re: [PATCH] Lazy hashaggregate when no aggregation is needed

From
Ants Aasma
Date:
Sorry about the delay in answering. I have been swamped with non-PG
related things lately.

On Tue, Jun 26, 2012 at 11:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jun 22, 2012 at 10:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Jun 19, 2012 at 5:41 AM, Etsuro Fujita
>> <fujita.etsuro@lab.ntt.co.jp> wrote:
>>>> I'm confused by this remark, because surely the query planner does it this
>>>> way only if there's no LIMIT.  When there is a LIMIT, we choose based on
>>>> the startup cost plus the estimated fraction of the total cost we expect
>>>> to pay based on dividing the LIMIT by the overall row count estimate.  Or
>>>> is this not what you're talking about?

My reasoning was that query_planner returns the cheapest-total path
and cheapest fractional presorted (by the aggregation pathkeys). When
evaluating hash-aggregates with this patch these two are indeed
compared considering the esimated fraction of the total cost, but this
might miss cheapest fractional unordered path for lazy hashaggregates.

Reviewing the code now I discovered this path could be picked out from
the pathlist, just like it is done by
get_cheapest_fractional_path_for_pathkeys when pathkeys is nil. This
would need to be returned in addition to the other two paths. To
minimize overhead, this should only be done when we possibly want to
consider lazy hash-aggregation (there is a group clause with no
aggregates and grouping is hashable) But this is starting to get
pretty crufty considering that there doesn't seem to be any really
compelling usecases for this.

> Ants, do you intend to update this patch for this CommitFest?  Or at
> all?  It seems nobody's too excited about this, so I'm not sure
> whether it makes sense for you to put more work on it.  But please
> advise as to your plans.

If anyone thinks that this patch might be worth considering, then I'm
prepared to do minor cleanup this CF (I saw some possibly unnecessary
cruft in agg_fill_hash_and_retrieve). On the other hand, if you think
the use case is too marginal to consider for inclusion then I won't
shed a tear if this gets rejected. For me this was mostly a learning
experience for poking around in the planner.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


Re: [PATCH] Lazy hashaggregate when no aggregation is needed

From
"Etsuro Fujita"
Date:
Hi Ants,

> -----Original Message-----
> From: Ants Aasma [mailto:ants@cybertec.at]
> Sent: Wednesday, June 27, 2012 9:23 PM
> To: Robert Haas
> Cc: Etsuro Fujita; Jay Levitt; Tom Lane; PostgreSQL-development; Francois
> Deliege
> Subject: Re: [HACKERS] [PATCH] Lazy hashaggregate when no aggregation is
needed
>
> Sorry about the delay in answering. I have been swamped with non-PG
> related things lately.
>
> On Tue, Jun 26, 2012 at 11:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Fri, Jun 22, 2012 at 10:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> On Tue, Jun 19, 2012 at 5:41 AM, Etsuro Fujita
> >> <fujita.etsuro@lab.ntt.co.jp> wrote:
> >>>> I'm confused by this remark, because surely the query planner does it
this
> >>>> way only if there's no LIMIT.  When there is a LIMIT, we choose based on
> >>>> the startup cost plus the estimated fraction of the total cost we expect
> >>>> to pay based on dividing the LIMIT by the overall row count estimate.  Or
> >>>> is this not what you're talking about?
>
> My reasoning was that query_planner returns the cheapest-total path
> and cheapest fractional presorted (by the aggregation pathkeys). When
> evaluating hash-aggregates with this patch these two are indeed
> compared considering the esimated fraction of the total cost, but this
> might miss cheapest fractional unordered path for lazy hashaggregates.
>
> Reviewing the code now I discovered this path could be picked out from
> the pathlist, just like it is done by
> get_cheapest_fractional_path_for_pathkeys when pathkeys is nil. This
> would need to be returned in addition to the other two paths. To
> minimize overhead, this should only be done when we possibly want to
> consider lazy hash-aggregation (there is a group clause with no
> aggregates and grouping is hashable) But this is starting to get
> pretty crufty considering that there doesn't seem to be any really
> compelling usecases for this.
>
> > Ants, do you intend to update this patch for this CommitFest?  Or at
> > all?  It seems nobody's too excited about this, so I'm not sure
> > whether it makes sense for you to put more work on it.  But please
> > advise as to your plans.
>
> If anyone thinks that this patch might be worth considering, then I'm
> prepared to do minor cleanup this CF (I saw some possibly unnecessary
> cruft in agg_fill_hash_and_retrieve). On the other hand, if you think
> the use case is too marginal to consider for inclusion then I won't
> shed a tear if this gets rejected. For me this was mostly a learning
> experience for poking around in the planner.

Honestly, I'm not sure that it's worth including this, considering the use
case...

Thanks,

Best regards,
Etsuro Fujita

> Ants Aasma
> --
> Cybertec Schönig & Schönig GmbH
> Gröhrmühlgasse 26
> A-2700 Wiener Neustadt
> Web: http://www.postgresql-support.de
>




Re: [PATCH] Lazy hashaggregate when no aggregation is needed

From
Robert Haas
Date:
On Thu, Jun 28, 2012 at 10:22 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:
> Honestly, I'm not sure that it's worth including this, considering the use
> case...

Since nobody seems crazy about pursuing this, I'm marking this patch Rejected.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [PATCH] Lazy hashaggregate when no aggregation is needed

From
Ants Aasma
Date:
On Mon, Jul 2, 2012 at 5:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jun 28, 2012 at 10:22 PM, Etsuro Fujita
> <fujita.etsuro@lab.ntt.co.jp> wrote:
>> Honestly, I'm not sure that it's worth including this, considering the use
>> case...
>
> Since nobody seems crazy about pursuing this, I'm marking this patch Rejected.

Thank you, for considering this and many thanks to Etsuro Fujita for reviewing.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de