[HACKERS] AGG_HASHED cost estimate - Mailing list pgsql-hackers

From Jeff Janes
Subject [HACKERS] AGG_HASHED cost estimate
Date
Msg-id CAMkU=1xKoOkh04=6p03+5vaSGaWYuJN-wnt0+eR2EC0cnnWedg@mail.gmail.com
Whole thread Raw
Responses Re: [HACKERS] AGG_HASHED cost estimate  (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
List pgsql-hackers

In cost_size.c, there is this comment block:

+        * Note: in this cost model, AGG_SORTED and AGG_HASHED have exactly the
+        * same total CPU cost, but AGG_SORTED has lower startup cost.  If the
+        * input path is already sorted appropriately, AGG_SORTED should be
+        * preferred (since it has no risk of memory overflow).  This will happen
+        * as long as the computed total costs are indeed exactly equal --- but
+        * if there's roundoff error we might do the wrong thing.  So be sure
+        * that the computations below form the same intermediate values in the
+        * same order.

But, why should they have the same cost in the first place?  A sorted aggregation just has to do an equality comparison on each adjacent pair, which is costed as (cpu_operator_cost * numGroupCols) * input_tuples.   A hash aggregation has to do a hashcode computation for each input, which apparently is also costed at (cpu_operator_cost * numGroupCols) * input_tuples.  But it also has to do the equality comparison between the input tuple and any aggregation it is about to be aggregated into, to make sure the hashcode is not a false collision.  That should be another (cpu_operator_cost * numGroupCols) * (input_tuples - numGroups), shouldn't it?  Currently, that is apparently assumed to be free.  

Is there something here I am overlooking?

Cheers,

Jeff

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: [HACKERS] Continuous buildfarm failures on hamster with bin-check
Next
From: Noah Misch
Date:
Subject: Re: [HACKERS] pg_dump emits ALTER TABLE ONLY partitioned_table