Re: hashagg slowdown due to spill changes - Mailing list pgsql-hackers

From Melanie Plageman
Subject Re: hashagg slowdown due to spill changes
Date
Msg-id CAAKRu_YsSu2uTyi5wHUGk2VYmtc2giciNzciXhtr+L07uoNqSw@mail.gmail.com
Whole thread Raw
In response to Re: hashagg slowdown due to spill changes  (Andres Freund <andres@anarazel.de>)
Responses Re: hashagg slowdown due to spill changes
List pgsql-hackers


On Tue, Jun 23, 2020 at 10:06 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2020-06-23 09:23:57 -0700, Melanie Plageman wrote:
> On Mon, Jun 22, 2020 at 9:02 PM Andres Freund <andres@anarazel.de> wrote:
> > It's not this patch's fault, but none, really none, of this stuff should
> > be in the executor.
> >
> >
> Were you thinking it could be done in grouping_planner() and then the
> bitmaps could be saved in the PlannedStmt?
> Or would you have to wait until query_planner()? Or are you imagining
> somewhere else entirely?

I haven't thought about it in too much detail, but I would say
create_agg_plan() et al. I guess there's some argument to be made to do
it in setrefs.c, because we already do convert_combining_aggrefs() there
(but I don't like that much).

There's no reason to do it before we actually decided on one specific
path, so doing it earlier than create_plan() seems unnecessary. And
having it in agg specific code seems better than putting it into global
routines.

There's probably an argument for having a bit more shared code between
create_agg_plan(), create_group_plan() and
create_groupingsets_plan(). But even just adding a new extract_*_cols()
call to each of those would probably be ok.


So, my summary of this point in the context of the other discussion
upthread is:

Planner should extract the columns that hashagg will need later during
planning. Planner should not have HashAgg/MixedAgg nodes request smaller
targetlists from their children with CP_SMALL_TLIST to avoid unneeded
projection overhead.
Also, even this extraction should only be done when the number of groups
is large enough to suspect a spill.

So, I wrote a patch that extracts the columns the same way as in
ExecInitAgg but in create_agg_plan() and it doesn't work because we
haven't called set_plan_references().

Then, I wrote a patch that does this in set_upper_references(), and it
seems to work. I've attached that one.
It is basically Jeff's patch (based somewhat on my patch) which extracts
the columns in ExecInitAgg but I moved the functions over to setrefs.c
and gave them a different name.

It's not very elegant.
I shoved it in at the end of set_upper_references(), but I think there
should be a nice way to do it while setting the references for each var
instead of walking over the nodes again.
Also, I think that the bitmapsets for the colnos should maybe be put
somewhere less prominent (than in the main Agg plan node?), since they
are only used in one small place.
I tried putting both bitmaps in an array of two bitmaps in the Agg node
(since there will always be two) to make it look a bit neater, but it
was pretty confusing and error prone to remember which one was
aggregated and which one was unaggregated.

Note that I didn't do anything with costing like only extracting the
columns if there are a lot of groups.

Also, I didn't revert the CP_SMALL_TLIST change in create_agg_plan() or
create_groupingsets_plan().

Not to stir the pot, but I did notice that hashjoin uses CP_SMALL_TLIST
in create_hashjoin_plan() for the inner side sub-tree and the outer side
one if there are multiple batches. I wondered what was different about
that vs hashagg (i.e. why it is okay to do that there).

--
Melanie Plageman
Attachment

pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: Default setting for enable_hashagg_disk
Next
From: vignesh C
Date:
Subject: Re: [PATCH] Initial progress reporting for COPY command