Re: hashagg slowdown due to spill changes - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: hashagg slowdown due to spill changes
Date
Msg-id 20200625091024.dhslw2gqgqtgjsoh@development
Whole thread Raw
In response to Re: hashagg slowdown due to spill changes  (Melanie Plageman <melanieplageman@gmail.com>)
List pgsql-hackers
On Wed, Jun 24, 2020 at 05:26:07PM -0700, Melanie Plageman wrote:
>On Tue, Jun 23, 2020 at 10:06 AM Andres Freund <andres@anarazel.de> wrote:
>
>> Hi,
>>
>> On 2020-06-23 09:23:57 -0700, Melanie Plageman wrote:
>> > On Mon, Jun 22, 2020 at 9:02 PM Andres Freund <andres@anarazel.de>
>> wrote:
>> > > It's not this patch's fault, but none, really none, of this stuff
>> should
>> > > be in the executor.
>> > >
>> > >
>> > Were you thinking it could be done in grouping_planner() and then the
>> > bitmaps could be saved in the PlannedStmt?
>> > Or would you have to wait until query_planner()? Or are you imagining
>> > somewhere else entirely?
>>
>> I haven't thought about it in too much detail, but I would say
>> create_agg_plan() et al. I guess there's some argument to be made to do
>> it in setrefs.c, because we already do convert_combining_aggrefs() there
>> (but I don't like that much).
>>
>> There's no reason to do it before we actually decided on one specific
>> path, so doing it earlier than create_plan() seems unnecessary. And
>> having it in agg specific code seems better than putting it into global
>> routines.
>>
>> There's probably an argument for having a bit more shared code between
>> create_agg_plan(), create_group_plan() and
>> create_groupingsets_plan(). But even just adding a new extract_*_cols()
>> call to each of those would probably be ok.
>>
>>
>So, my summary of this point in the context of the other discussion
>upthread is:
>
>Planner should extract the columns that hashagg will need later during
>planning. Planner should not have HashAgg/MixedAgg nodes request smaller
>targetlists from their children with CP_SMALL_TLIST to avoid unneeded
>projection overhead.
>Also, even this extraction should only be done when the number of groups
>is large enough to suspect a spill.
>

IMO we should extract the columns irrespectedly of the estimates,
otherwise we won't be able to handle underestimates efficiently.

>
>Not to stir the pot, but I did notice that hashjoin uses CP_SMALL_TLIST
>in create_hashjoin_plan() for the inner side sub-tree and the outer side
>one if there are multiple batches. I wondered what was different about
>that vs hashagg (i.e. why it is okay to do that there).
>

Yeah. That means that if we have to start batching during execution, we
may need to spill much more datai. I'd say that's a hashjoin issue that
we should fix too (in v14).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Backpatch b61d161c14 (Introduce vacuum errcontext ...)
Next
From: Daniel Gustafsson
Date:
Subject: Re: Missing some ifndef FRONTEND at the top of logging.c andfile_utils.c