Thread: [PATCH] Lazy hashaggregate when no aggregation is needed
A user complained on pgsql-performance that SELECT col FROM table GROUP
BY col LIMIT 2; performs a full table scan. ISTM that it's safe to
return tuples from hash-aggregate as they are found when no aggregate
functions are in use. Attached is a first shot at that.

The planner is modified so that when the optimization applies, the hash
table size check is compared against the limit and the startup cost is
taken from the input. The executor is modified so that when the hash
table is not yet filled and the optimization applies, tuples are
returned immediately.

Can somebody poke holes in this? The patch definitely needs some code
cleanup in nodeAgg.c, but otherwise it passes regression tests and seems
to work as intended. It also optimizes the SELECT DISTINCT col FROM
table LIMIT 2; case, but not SELECT DISTINCT ON (col) col FROM table
LIMIT 2, because that is explicitly forced to use sorted aggregation.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Attachment
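The executor-side idea in the patch description can be modeled in a few
lines of Python. This is an illustrative sketch only — the function and
variable names are invented, and this is not the actual nodeAgg.c logic:
when there are no aggregate functions, a group is complete the moment its
key is first inserted into the hash table, so it can be emitted
immediately, and a LIMIT above the node can stop the input scan early.

```python
from itertools import islice

def lazy_hash_distinct(rows):
    """Hash aggregation with no aggregate functions: each group key is
    emitted as soon as it is first added to the hash table, instead of
    only after the whole input has been consumed."""
    seen = set()
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row  # safe: there is no per-group state left to update

# A LIMIT on top stops pulling input as soon as enough groups exist:
col_values = [3, 3, 1, 3, 1, 2, 5, 4]  # hypothetical column contents
print(list(islice(lazy_hash_distinct(col_values), 2)))  # prints [3, 1]
```

Because the generator is lazy, the islice(..., 2) consumer stops the scan
after the third input row, which is the behavior the patch aims for.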
Ants Aasma <ants@cybertec.at> writes:
> A user complained on pgsql-performance that SELECT col FROM table
> GROUP BY col LIMIT 2; performs a full table scan. ISTM that it's safe
> to return tuples from hash-aggregate as they are found when no
> aggregate functions are in use. Attached is a first shot at that.

As I commented in the other thread, the user would be a lot better off
if he'd had an index on the column in question. I'm not sure it's worth
complicating the hashagg logic when an indexscan + groupagg would
address the case better.

			regards, tom lane
Tom Lane wrote:
> Ants Aasma <ants@cybertec.at> writes:
>> A user complained on pgsql-performance that SELECT col FROM table
>> GROUP BY col LIMIT 2; performs a full table scan. ISTM that it's safe
>> to return tuples from hash-aggregate as they are found when no
>> aggregate functions are in use. Attached is a first shot at that.
>
> As I commented in the other thread, the user would be a lot better off
> if he'd had an index on the column in question. I'm not sure it's worth
> complicating the hashagg logic when an indexscan + groupagg would
> address the case better.

Would this patch help in the case where "table" is actually a
set-returning function, and thus can't have an index? (I don't yet know
enough about the tree to know when hashaggs get used.)

I'm wondering if this is a useful exception to the "restrictions can't
get pushed down through GROUP BYs" rule.

Jay
Hi,

I would like to ask a question before looking into the patch.

At 21:56 12/03/30 -0400, Jay Levitt wrote:
> Tom Lane wrote:
>> Ants Aasma <ants@cybertec.at> writes:
>>> A user complained on pgsql-performance that SELECT col FROM table
>>> GROUP BY col LIMIT 2; performs a full table scan. ISTM that it's safe
>>> to return tuples from hash-aggregate as they are found when no
>>> aggregate functions are in use. Attached is a first shot at that.
>>
>> As I commented in the other thread, the user would be a lot better off
>> if he'd had an index on the column in question. I'm not sure it's worth
>> complicating the hashagg logic when an indexscan + groupagg would
>> address the case better.
>
> Would this patch help in the case where "table" is actually a
> set-returning function, and thus can't have an index?

ISTM that in many cases, the result size of a set-returning function is
not so large compared with that of a full plain table scan. So, in such
a case a full hash aggregation is not so time consuming. Am I wrong?

Best regards,
Etsuro Fujita
On Fri, Jun 15, 2012 at 6:55 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:
>>>> A user complained on pgsql-performance that SELECT col FROM table
>>>> GROUP BY col LIMIT 2; performs a full table scan. ISTM that it's safe
>>>> to return tuples from hash-aggregate as they are found when no
>>>> aggregate functions are in use. Attached is a first shot at that.
>>>
>>> As I commented in the other thread, the user would be a lot better off
>>> if he'd had an index on the column in question. I'm not sure it's worth
>>> complicating the hashagg logic when an indexscan + groupagg would
>>> address the case better.
>>
>> Would this patch help in the case where "table" is actually a
>> set-returning function, and thus can't have an index?
>
> ISTM that in many cases, the result size of a set-returning function is not
> so large compared with that of a full plain table scan. So, in such a case
> a full hash aggregation is not so time consuming. Am I wrong?

This query is a little unusual in that it involves both an aggregate
and a limit. Now, sorted aggregates work pretty well with limit,
because you can be sure upon seeing the beginning of the next group
that you are done with the previous group. But in a hash aggregate,
you normally can't start returning results until you've read the
entire input, so it doesn't work so well with limit.

However, as Ants points out, we could make it work better for the
special case where we're not actually doing any aggregation, because
in that case we can emit the row for each group when the group is
created, rather than waiting until end-of-input. This is only going
to help when there is a LIMIT, though. Moreover, if there happens to
be an ORDER BY, then the data will have to be pre-sorted, in which
case you may as well use a sorted aggregate. So the use case for this
optimization is basically DISTINCT plus LIMIT but not ORDER BY.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
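The contrast between the two aggregation strategies can be sketched with
a toy Python model (hypothetical code, not PostgreSQL's executor): with
presorted input, a group is provably complete as soon as the next key
appears, so results stream out; a hash aggregate that is actually
aggregating can emit nothing until the whole input has been read.

```python
def sorted_group_agg(sorted_rows, agg=len):
    """Streaming: input arrives grouped, so each group can be emitted
    at the next group boundary -- this cooperates well with LIMIT."""
    key, vals = None, None
    for k, v in sorted_rows:
        if vals is not None and k != key:
            yield key, agg(vals)  # previous group is provably complete
            vals = None
        if vals is None:
            key, vals = k, []
        vals.append(v)
    if vals is not None:
        yield key, agg(vals)

def hash_group_agg(rows, agg=len):
    """Blocking: any group might gain more rows at any point, so
    nothing can be emitted until end-of-input."""
    groups = {}
    for k, v in rows:                 # must consume everything first
        groups.setdefault(k, []).append(v)
    for k, vals in groups.items():
        yield k, agg(vals)
```

Under a LIMIT, the sorted version stops reading input after the first
few group boundaries; the hash version reads it all regardless — unless,
as discussed above, there is no aggregate to compute.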
On Fri, Jun 15, 2012 at 3:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> However, as Ants points out, we could make it work better for the
> special case where we're not actually doing any aggregation, because
> in that case we can emit the row for each group when the group is
> created, rather than waiting until end-of-input. This is only going
> to help when there is a LIMIT, though. Moreover, if there happens to
> be an ORDER BY, then the data will have to be pre-sorted, in which
> case you may as well use a sorted aggregate. So the use case for this
> optimization is basically DISTINCT plus LIMIT but not ORDER BY.

Exactly. I think the first question for this patch should be whether
this use case is worth the complexity of the patch. I can't imagine
any really compelling use cases that need an arbitrary distinct subset
of results. The original complaint on -performance [1] didn't specify
a real-world use case, but it seemed to be a case of an ORM generating
suboptimal queries. On the other hand, the patch itself is in my
opinion rather simple, so it might be worth it.

It has one outstanding issue: query_planner chooses the cheapest path
based on total cost. This can be suboptimal when that path happens to
have a high startup cost. It seems to me that enabling query_planner
to find the cheapest unsorted path returning a limited number of
tuples would require some major surgery to the planner. To be clear,
this is only a case of missed optimization, not a regression.

It won't help set-returning functions because the tuplestore for those
is fully materialized when the first row is fetched.

[1] http://archives.postgresql.org/message-id/16737833.463.1332881676120.JavaMail.geo-discussion-forums%40pbcpw7

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On Fri, Jun 15, 2012 at 9:22 AM, Ants Aasma <ants@cybertec.at> wrote:
> Exactly. I think the first question for this patch should be whether
> this use-case is worth the complexity of the patch. I can't imagine
> any really compelling use cases that need an arbitrary distinct subset
> of results.

Me neither.

> The original complaint on -performance [1], didn't specify
> a real world use case, but it seemed to be a case of an ORM generating
> suboptimal queries. On the other hand, the patch itself is in my
> opinion rather simple, so it might be worth it.

Yeah.

> It has one outstanding issue, query_planner chooses the cheapest path
> based on total cost. This can be suboptimal when that path happens to
> have high startup cost. It seems to me that enabling the query_planner
> to find the cheapest unsorted path returning a limited amount of
> tuples would require some major surgery to the planner. To be clear,
> this is only a case of missed optimization, not a regression.

I'm confused by this remark, because surely the query planner does it
this way only if there's no LIMIT. When there is a LIMIT, we choose
based on the startup cost plus the estimated fraction of the total
cost we expect to pay, based on dividing the LIMIT by the overall row
count estimate. Or is this not what you're talking about?

> It won't help set returning functions because the tuplestore for those
> is fully materialized when the first row is fetched.

That's surely a problem for another day.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
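The LIMIT costing rule described here can be written out explicitly. The
sketch below is a simplified model with made-up numbers, not the
planner's actual code:

```python
def limited_cost(startup_cost, total_cost, plan_rows, limit):
    """Estimated cost of fetching only `limit` rows from a plan: the
    full startup cost is always paid, plus the fraction of the run
    cost proportional to the fraction of rows actually fetched."""
    fraction = min(limit / plan_rows, 1.0)
    return startup_cost + fraction * (total_cost - startup_cost)

# A conventional hash aggregate pays nearly all of its cost before the
# first row comes out, so LIMIT 2 barely helps it:
print(limited_cost(startup_cost=1000.0, total_cost=1010.0,
                   plan_rows=100, limit=2))  # ~1000.2
# A lazy hash aggregate would have near-zero startup cost instead:
print(limited_cost(startup_cost=0.0, total_cost=1010.0,
                   plan_rows=100, limit=2))  # ~20.2
```

The point of the patch's planner changes is to make the second,
low-startup shape available so that this fractional comparison can
favor it.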
Hi,

> -----Original Message-----
> From: Robert Haas [mailto:robertmhaas@gmail.com]
> Sent: Tuesday, June 19, 2012 3:12 AM
> To: Ants Aasma
> Cc: Etsuro Fujita; Jay Levitt; Tom Lane; PostgreSQL-development; Francois
> Deliege
> Subject: Re: [HACKERS] [PATCH] Lazy hashaggregate when no aggregation is
> needed
>
> On Fri, Jun 15, 2012 at 9:22 AM, Ants Aasma <ants@cybertec.at> wrote:
> > Exactly. I think the first question for this patch should be whether
> > this use-case is worth the complexity of the patch. I can't imagine
> > any really compelling use cases that need an arbitrary distinct subset
> > of results.
>
> Me neither.
>
> > The original complaint on -performance [1], didn't specify a real
> > world use case, but it seemed to be a case of an ORM generating
> > suboptimal queries. On the other hand, the patch itself is in my
> > opinion rather simple, so it might be worth it.
>
> Yeah.
>
> > It has one outstanding issue, query_planner chooses the cheapest path
> > based on total cost. This can be suboptimal when that path happens to
> > have high startup cost. It seems to me that enabling the query_planner
> > to find the cheapest unsorted path returning a limited amount of
> > tuples would require some major surgery to the planner. To be clear,
> > this is only a case of missed optimization, not a regression.
>
> I'm confused by this remark, because surely the query planner does it this
> way only if there's no LIMIT. When there is a LIMIT, we choose based on
> the startup cost plus the estimated fraction of the total cost we expect
> to pay based on dividing the LIMIT by the overall row count estimate. Or
> is this not what you're talking about?

I think that Ants is referring to the way costs are estimated in
choose_hashed_grouping()/choose_hashed_distinct(), ie cost_agg() for
cheapest_path + hashagg, where the costs are calculated based only on
the total cost of cheapest_path. I think that it might be good to do
cost_agg() for the discussed case with the AGG_SORTED strategy, not
the AGG_HASHED strategy.

> > It won't help set returning functions because the tuplestore for those
> > is fully materialized when the first row is fetched.
>
> That's surely a problem for another day.

OK

Best regards,
Etsuro Fujita
On Tue, Jun 19, 2012 at 5:41 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:
>> I'm confused by this remark, because surely the query planner does it this
>> way only if there's no LIMIT. When there is a LIMIT, we choose based on
>> the startup cost plus the estimated fraction of the total cost we expect
>> to pay based on dividing the LIMIT by the overall row count estimate. Or
>> is this not what you're talking about?
>
> I think that Ants is referring to the way costs are estimated in
> choose_hashed_grouping()/choose_hashed_distinct(), ie cost_agg() for
> cheapest_path + hashagg, where the costs are calculated based only on
> the total cost of cheapest_path. I think that it might be good to do
> cost_agg() for the discussed case with the AGG_SORTED strategy, not the
> AGG_HASHED strategy.

Well, Ants already made some adjustments to those functions; not sure
if this means they need some more adjustment, but I don't see that
there's a general problem with the costing algorithm around LIMIT.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jun 22, 2012 at 10:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jun 19, 2012 at 5:41 AM, Etsuro Fujita
> <fujita.etsuro@lab.ntt.co.jp> wrote:
>>> I'm confused by this remark, because surely the query planner does it this
>>> way only if there's no LIMIT. When there is a LIMIT, we choose based on
>>> the startup cost plus the estimated fraction of the total cost we expect
>>> to pay based on dividing the LIMIT by the overall row count estimate. Or
>>> is this not what you're talking about?
>>
>> I think that Ants is referring to the way costs are estimated in
>> choose_hashed_grouping()/choose_hashed_distinct(), ie cost_agg() for
>> cheapest_path + hashagg, where the costs are calculated based only on
>> the total cost of cheapest_path. I think that it might be good to do
>> cost_agg() for the discussed case with the AGG_SORTED strategy, not the
>> AGG_HASHED strategy.
>
> Well, Ants already made some adjustments to those functions; not sure
> if this means they need some more adjustment, but I don't see that
> there's a general problem with the costing algorithm around LIMIT.

Ants, do you intend to update this patch for this CommitFest? Or at
all? It seems nobody's too excited about this, so I'm not sure
whether it makes sense for you to put more work on it. But please
advise as to your plans.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

> -----Original Message-----
> From: Robert Haas [mailto:robertmhaas@gmail.com]
> Sent: Wednesday, June 27, 2012 5:09 AM
> To: Etsuro Fujita
> Cc: Ants Aasma; Jay Levitt; Tom Lane; PostgreSQL-development; Francois Deliege
> Subject: Re: [HACKERS] [PATCH] Lazy hashaggregate when no aggregation is needed
>
> On Fri, Jun 22, 2012 at 10:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Tue, Jun 19, 2012 at 5:41 AM, Etsuro Fujita
> > <fujita.etsuro@lab.ntt.co.jp> wrote:
> >>> I'm confused by this remark, because surely the query planner does it this
> >>> way only if there's no LIMIT. When there is a LIMIT, we choose based on
> >>> the startup cost plus the estimated fraction of the total cost we expect
> >>> to pay based on dividing the LIMIT by the overall row count estimate. Or
> >>> is this not what you're talking about?
> >>
> >> I think that Ants is referring to the way costs are estimated in
> >> choose_hashed_grouping()/choose_hashed_distinct(), ie cost_agg() for
> >> cheapest_path + hashagg, where the costs are calculated based only on
> >> the total cost of cheapest_path. I think that it might be good to do
> >> cost_agg() for the discussed case with the AGG_SORTED strategy, not the
> >> AGG_HASHED strategy.
> >
> > Well, Ants already made some adjustments to those functions; not sure
> > if this means they need some more adjustment, but I don't see that
> > there's a general problem with the costing algorithm around LIMIT.
>
> Ants, do you intend to update this patch for this CommitFest? Or at
> all? It seems nobody's too excited about this, so I'm not sure
> whether it makes sense for you to put more work on it. But please
> advise as to your plans.

Please excuse my slow response. I would also like to know your plans.

Best regards,
Etsuro Fujita

> Thanks,
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Sorry about the delay in answering. I have been swamped with non-PG
related things lately.

On Tue, Jun 26, 2012 at 11:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jun 22, 2012 at 10:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Jun 19, 2012 at 5:41 AM, Etsuro Fujita
>> <fujita.etsuro@lab.ntt.co.jp> wrote:
>>>> I'm confused by this remark, because surely the query planner does it this
>>>> way only if there's no LIMIT. When there is a LIMIT, we choose based on
>>>> the startup cost plus the estimated fraction of the total cost we expect
>>>> to pay based on dividing the LIMIT by the overall row count estimate. Or
>>>> is this not what you're talking about?

My reasoning was that query_planner returns the cheapest-total path
and the cheapest fractional path presorted by the aggregation
pathkeys. When evaluating hash aggregates with this patch, these two
are indeed compared considering the estimated fraction of the total
cost, but this might miss the cheapest fractional unordered path for
lazy hash aggregates.

Reviewing the code now, I discovered this path could be picked out of
the pathlist, just as get_cheapest_fractional_path_for_pathkeys does
when pathkeys is nil. It would need to be returned in addition to the
other two paths. To minimize overhead, this should only be done when
we might want to consider lazy hash aggregation (there is a group
clause with no aggregates and the grouping is hashable). But this is
starting to get pretty crufty considering that there doesn't seem to
be any really compelling use case for this.

> Ants, do you intend to update this patch for this CommitFest? Or at
> all? It seems nobody's too excited about this, so I'm not sure
> whether it makes sense for you to put more work on it. But please
> advise as to your plans.

If anyone thinks that this patch might be worth considering, then I'm
prepared to do minor cleanup this CF (I saw some possibly unnecessary
cruft in agg_fill_hash_and_retrieve). On the other hand, if you think
the use case is too marginal to consider for inclusion, then I won't
shed a tear if this gets rejected. For me this was mostly a learning
experience for poking around in the planner.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
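The concern above — that query_planner hands back only the
cheapest-total and cheapest-presorted paths, possibly missing the best
unordered path for a limited lazy hash aggregate — can be illustrated
with a toy example (hypothetical path names and costs, not planner
code):

```python
# Two hypothetical unordered paths for the same input relation, as
# (startup_cost, total_cost) pairs; 100 estimated rows, LIMIT 2.
paths = {"seqscan": (0.0, 1000.0), "other_scan": (400.0, 900.0)}
fraction = 2 / 100  # LIMIT divided by the row count estimate

def fractional_cost(startup, total):
    # Full startup cost, plus the fetched fraction of the run cost.
    return startup + fraction * (total - startup)

cheapest_total = min(paths, key=lambda p: paths[p][1])
cheapest_fractional = min(paths, key=lambda p: fractional_cost(*paths[p]))
print(cheapest_total)       # other_scan: 900 beats 1000 on total cost
print(cheapest_fractional)  # seqscan: 20.0 beats 410.0 once LIMIT counts
```

Picking only the cheapest-total path here would choose other_scan, even
though seqscan is far cheaper when the lazy hash aggregate stops after
two groups.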
Hi Ants,

> -----Original Message-----
> From: Ants Aasma [mailto:ants@cybertec.at]
> Sent: Wednesday, June 27, 2012 9:23 PM
> To: Robert Haas
> Cc: Etsuro Fujita; Jay Levitt; Tom Lane; PostgreSQL-development; Francois
> Deliege
> Subject: Re: [HACKERS] [PATCH] Lazy hashaggregate when no aggregation is needed
>
> Sorry about the delay in answering. I have been swamped with non-PG
> related things lately.
>
> On Tue, Jun 26, 2012 at 11:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Fri, Jun 22, 2012 at 10:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> On Tue, Jun 19, 2012 at 5:41 AM, Etsuro Fujita
> >> <fujita.etsuro@lab.ntt.co.jp> wrote:
> >>>> I'm confused by this remark, because surely the query planner does it this
> >>>> way only if there's no LIMIT. When there is a LIMIT, we choose based on
> >>>> the startup cost plus the estimated fraction of the total cost we expect
> >>>> to pay based on dividing the LIMIT by the overall row count estimate. Or
> >>>> is this not what you're talking about?
>
> My reasoning was that query_planner returns the cheapest-total path
> and the cheapest fractional path presorted by the aggregation
> pathkeys. When evaluating hash aggregates with this patch, these two
> are indeed compared considering the estimated fraction of the total
> cost, but this might miss the cheapest fractional unordered path for
> lazy hash aggregates.
>
> Reviewing the code now, I discovered this path could be picked out of
> the pathlist, just as get_cheapest_fractional_path_for_pathkeys does
> when pathkeys is nil. It would need to be returned in addition to the
> other two paths. To minimize overhead, this should only be done when
> we might want to consider lazy hash aggregation (there is a group
> clause with no aggregates and the grouping is hashable). But this is
> starting to get pretty crufty considering that there doesn't seem to
> be any really compelling use case for this.
>
> > Ants, do you intend to update this patch for this CommitFest? Or at
> > all? It seems nobody's too excited about this, so I'm not sure
> > whether it makes sense for you to put more work on it. But please
> > advise as to your plans.
>
> If anyone thinks that this patch might be worth considering, then I'm
> prepared to do minor cleanup this CF (I saw some possibly unnecessary
> cruft in agg_fill_hash_and_retrieve). On the other hand, if you think
> the use case is too marginal to consider for inclusion, then I won't
> shed a tear if this gets rejected. For me this was mostly a learning
> experience for poking around in the planner.

Honestly, I'm not sure that it's worth including this, considering the
use case...

Thanks,

Best regards,
Etsuro Fujita

> Ants Aasma
> --
> Cybertec Schönig & Schönig GmbH
> Gröhrmühlgasse 26
> A-2700 Wiener Neustadt
> Web: http://www.postgresql-support.de
On Thu, Jun 28, 2012 at 10:22 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:
> Honestly, I'm not sure that it's worth including this, considering the
> use case...

Since nobody seems crazy about pursuing this, I'm marking this patch
Rejected.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jul 2, 2012 at 5:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jun 28, 2012 at 10:22 PM, Etsuro Fujita
> <fujita.etsuro@lab.ntt.co.jp> wrote:
>> Honestly, I'm not sure that it's worth including this, considering the
>> use case...
>
> Since nobody seems crazy about pursuing this, I'm marking this patch
> Rejected.

Thank you for considering this, and many thanks to Etsuro Fujita for
reviewing.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de