Thread: hashagg slowdown due to spill changes
Hi, While preparing my pgcon talk I noticed that our hash-agg performance degraded noticeably. Looks to me like it's due to the spilling-hashagg changes. Sample benchmark: config: -c huge_pages=on -c shared_buffers=32GB -c jit=0 -c max_parallel_workers_per_gather=0 (largely just to reduce variance) data prep: CREATE TABLE fewgroups_many_rows AS SELECT (random() * 4)::int cat, (random()*10000)::int val FROM generate_series(1, 100000000); VACUUM (FREEZE, ANALYZE) fewgroups_many_rows; test prep: CREATE EXTENSION IF NOT EXISTS pg_prewarm;SELECT pg_prewarm('fewgroups_many_rows', 'buffer'); test: SELECT cat, count(*) FROM fewgroups_many_rows GROUP BY 1; Using best-of-three timing: 12 12221.031 ms master 13855.129 ms While not the end of the world, that's a definitely noticable and reproducible slowdown (~12%). I don't think this is actually an inherent cost, but a question of how the code ended up being organized. Here's a perf diff of profiles for both versions: # Baseline Delta Abs Shared Object Symbol # ........ ......... ................ ......................................... # +6.70% postgres [.] LookupTupleHashEntryHash +6.37% postgres [.] prepare_hash_slot +4.74% postgres [.] TupleHashTableHash_internal.isra.0 20.36% -2.89% postgres [.] ExecInterpExpr 6.31% -2.73% postgres [.] lookup_hash_entries +2.36% postgres [.] lookup_hash_entry +2.14% postgres [.] ExecJustAssignScanVar 2.28% +1.97% postgres [.] ExecScan 2.54% +1.93% postgres [.] MemoryContextReset 3.84% -1.86% postgres [.] SeqNext 10.19% -1.50% postgres [.] tts_buffer_heap_getsomeattrs +1.42% postgres [.] hash_bytes_uint32 +1.39% postgres [.] TupleHashTableHash +1.10% postgres [.] tts_virtual_clear 3.36% -0.74% postgres [.] ExecAgg +0.45% postgres [.] CheckForSerializableConflictOutNeeded 0.25% +0.44% postgres [.] hashint4 5.80% -0.35% postgres [.] tts_minimal_getsomeattrs 1.91% -0.33% postgres [.] heap_getnextslot 4.86% -0.32% postgres [.] heapgettup_pagemode 1.46% -0.32% postgres [.] tts_minimal_clear While some of this is likely is just noise, it's pretty clear that we spend a substantial amount of additional time below lookup_hash_entries(). And looking at the code, I'm not too surprised: Before there was basically one call from nodeAgg.c to execGrouping.c for each tuple and hash table. Now it's a lot more complicated: 1) nodeAgg.c: prepare_hash_slot() 2) execGrouping.c: TupleHashTableHash() 3) nodeAgg.c: lookup_hash_entry() 4) execGrouping.c: LookupTupleHashEntryHash() For each of these data needs to be peeled out of one or more of AggState / AggStatePerHashData / TupleHashTable. There's no way the compiler can know that nothing inside those changes, therefore it has to reload the contents repeatedly. By my look at the profiles, that's where most of the time is going. There's also the issue that the signalling whether to insert / not to insert got unnecessarily complicated. There's several checks: 1) lookup_hash_entry() (p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;) 2) LookupTupleHashEntry_internal() (if (isnew)) 3) lookup_hash_entry() (if (entry == NULL) and if (isnew)) 4) lookup_hash_entries() if (!in_hash_table) Not performance related: I am a bit confused why the new per-hash stuff in lookup_hash_entries() isn't in lookup_hash_entry()? I assume that's because of agg_refill_hash_table()? Why isn't the flow more like this: 1) prepare_hash_slot() 2) if (aggstate->hash_spill_mode) goto 3; else goto 4 3) entry = LookupTupleHashEntry(&hash); if (!entry) hashagg_spill_tuple(); 4) InsertTupleHashEntry(&hash, &isnew); if (isnew) initialize(entry) That way there's again exactly one call to execGrouping.c, there's no need for nodeAgg to separately compute the hash, there's far fewer branches... Doing things this way might perhaps make agg_refill_hash_table() a tiny bit more complicated, but it'll also avoid the slowdown for the vast majority of cases where we're not spilling. Greetings, Andres Freund
On 2020-Jun-05, Andres Freund wrote: > While preparing my pgcon talk I noticed that our hash-agg performance > degraded noticeably. Looks to me like it's due to the spilling-hashagg > changes. Jeff, what are your thoughts on this? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote: > Before there was basically one call from nodeAgg.c to execGrouping.c > for > each tuple and hash table. Now it's a lot more complicated: > 1) nodeAgg.c: prepare_hash_slot() > 2) execGrouping.c: TupleHashTableHash() > 3) nodeAgg.c: lookup_hash_entry() > 4) execGrouping.c: LookupTupleHashEntryHash() The reason that I did it that way was to be able to store the hash along with the saved tuple (similar to what HashJoin does), which avoids recalculation. That could be a nice savings for some cases, like when work_mem is small but the data still fits in system memory, which I expect to be fairly common. But based on your numbers, it might be a bad trade-off overall. > Why isn't the flow more like this: > 1) prepare_hash_slot() > 2) if (aggstate->hash_spill_mode) goto 3; else goto 4 > 3) entry = LookupTupleHashEntry(&hash); if (!entry) > hashagg_spill_tuple(); > 4) InsertTupleHashEntry(&hash, &isnew); if (isnew) initialize(entry) I'll work up a patch to refactor this. I'd still like to see if we can preserve the calculate-hash-once behavior somehow. Regards, Jeff Davis
On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote: > Why isn't the flow more like this: > 1) prepare_hash_slot() > 2) if (aggstate->hash_spill_mode) goto 3; else goto 4 > 3) entry = LookupTupleHashEntry(&hash); if (!entry) > hashagg_spill_tuple(); > 4) InsertTupleHashEntry(&hash, &isnew); if (isnew) initialize(entry) I see, you are suggesting that I change around the execGrouping.c signatures to return the hash, which will avoid the extra call. That makes more sense. Regards, Jeff Davis
Hi, On 2020-06-08 13:41:29 -0700, Jeff Davis wrote: > On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote: > > Before there was basically one call from nodeAgg.c to execGrouping.c > > for > > each tuple and hash table. Now it's a lot more complicated: > > 1) nodeAgg.c: prepare_hash_slot() > > 2) execGrouping.c: TupleHashTableHash() > > 3) nodeAgg.c: lookup_hash_entry() > > 4) execGrouping.c: LookupTupleHashEntryHash() > > The reason that I did it that way was to be able to store the hash > along with the saved tuple (similar to what HashJoin does), which > avoids recalculation. That makes sense. But then you can just use a separate call into execGrouping for that purpose. > > Why isn't the flow more like this: > > 1) prepare_hash_slot() > > 2) if (aggstate->hash_spill_mode) goto 3; else goto 4 > > 3) entry = LookupTupleHashEntry(&hash); if (!entry) > > hashagg_spill_tuple(); > > 4) InsertTupleHashEntry(&hash, &isnew); if (isnew) initialize(entry) > > I'll work up a patch to refactor this. I'd still like to see if we can > preserve the calculate-hash-once behavior somehow. Cool! Greetings, Andres Freund
On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote: > Hi, > > While preparing my pgcon talk I noticed that our hash-agg performance > degraded noticeably. Looks to me like it's due to the spilling- > hashagg > changes. Attached a proposed fix. (Might require some minor cleanup). The only awkward part is that LookupTupleHashEntry() needs a new out parameter to pass the hash value back to the caller. Ordinarily, the caller can get that from the returned entry, but if isnew==NULL, then the function might return NULL (and the caller wouldn't have an entry from which to read the hash value). Regards, Jeff Davis
Attachment
Hi, On June 10, 2020 6:15:39 PM PDT, Jeff Davis <pgsql@j-davis.com> wrote: >On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote: >> Hi, >> >> While preparing my pgcon talk I noticed that our hash-agg performance >> degraded noticeably. Looks to me like it's due to the spilling- >> hashagg >> changes. > >Attached a proposed fix. (Might require some minor cleanup). > >The only awkward part is that LookupTupleHashEntry() needs a new out >parameter to pass the hash value back to the caller. Ordinarily, the >caller can get that from the returned entry, but if isnew==NULL, then >the function might return NULL (and the caller wouldn't have an entry >from which to read the hash value). Great! Did you run any performance tests? Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Thu, 2020-06-11 at 10:45 -0700, Andres Freund wrote: > Did you run any performance tests? Yes, I reproduced your ~12% regression from V12, and this patch nearly eliminated it for me. Regards, Jeff Davis
Hi, On 2020-06-11 11:14:02 -0700, Jeff Davis wrote: > On Thu, 2020-06-11 at 10:45 -0700, Andres Freund wrote: > > Did you run any performance tests? > > Yes, I reproduced your ~12% regression from V12, and this patch nearly > eliminated it for me. I spent a fair bit of time looking at the difference. Jeff had let me know on chat that he was still seeing some difference, but couldn't quite figure out where that was. Trying it out myself, I observed that the patch helped, but not that much. After a bit I found one major reason for why: LookupTupleHashEntryHash() assigned the hash to pointer provided by the caller's before doing the insertion. That ended up causing a pipeline stall (I assume it's store forwarding, but not sure). Moving the assignment to the caller variable to after the insertion got rid of that. It got within 3-4% after that change. I did a number of small microoptimizations that each helped, but didn't get quite get to the level of 12. Finally I figured out that that's due to an issue outside of nodeAgg.c itself: commit 4cad2534da6d17067d98cf04be2dfc1bda8f2cd0 Author: Tomas Vondra <tomas.vondra@postgresql.org> Date: 2020-05-31 14:43:13 +0200 Use CP_SMALL_TLIST for hash aggregate Due to this change we end up with an additional projection in queries like this: postgres[212666][1]=# \d fewgroups_many_rows Table "public.fewgroups_many_rows" ┌────────┬─────────┬───────────┬──────────┬─────────┐ │ Column │ Type │ Collation │ Nullable │ Default │ ├────────┼─────────┼───────────┼──────────┼─────────┤ │ cat │ integer │ │ not null │ │ │ val │ integer │ │ not null │ │ └────────┴─────────┴───────────┴──────────┴─────────┘ postgres[212666][1]=# explain SELECT cat, count(*) FROM fewgroups_many_rows GROUP BY 1; ┌───────────────────────────────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├───────────────────────────────────────────────────────────────────────────────────────┤ │ HashAggregate (cost=1942478.48..1942478.53 rows=5 width=12) │ │ Group Key: cat │ │ -> Seq Scan on fewgroups_many_rows (cost=0.00..1442478.32 rows=100000032 width=4) │ └───────────────────────────────────────────────────────────────────────────────────────┘ (3 rows) as 'val' is "projected away".. After neutering the tlist change, Jeff's patch and my changes to it yield performance *above* v12. I don't see why it's ok to force an additional projection in the very common case of hashaggs over a few rows. So I think we need to rethink 4cad2534da6. Greetings, Andres Freund
Attachment
On Fri, 2020-06-12 at 14:37 -0700, Andres Freund wrote: > I don't see why it's ok to force an additional projection in the very > common case of hashaggs over a few rows. So I think we need to > rethink > 4cad2534da6. One possibility is to project only spilled tuples, more similar to Melanie's patch from a while ago: https://www.postgresql.org/message-id/CAAKRu_aefEsv+UkQWqu+ioEnoiL2LJu9Diuh9BR8MbyXuZ0j4A@mail.gmail.com Which makes sense, but it's also more code. Regards, Jeff Davis
On Fri, Jun 12, 2020 at 03:29:08PM -0700, Jeff Davis wrote: >On Fri, 2020-06-12 at 14:37 -0700, Andres Freund wrote: >> I don't see why it's ok to force an additional projection in the very >> common case of hashaggs over a few rows. So I think we need to >> rethink >> 4cad2534da6. > >One possibility is to project only spilled tuples, more similar to >Melanie's patch from a while ago: > > >https://www.postgresql.org/message-id/CAAKRu_aefEsv+UkQWqu+ioEnoiL2LJu9Diuh9BR8MbyXuZ0j4A@mail.gmail.com > >Which makes sense, but it's also more code. > I agree, we should revert 4cad2534da and only project tuples when we actually need to spill them. Did any of the WIP patches actually implement that, or do we need to write that patch from scratch? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2020-06-13 01:06:25 +0200, Tomas Vondra wrote: > I agree, we should revert 4cad2534da and only project tuples when we > actually need to spill them. There are cases where projecting helps for non-spilling aggregates too, but only for the representative tuple. It doesn't help in the case at hand, because there's just 5 hashtable entries but millions of rows. So we're unnecessarily projecting all-5 rows. But when there are many different groups, it'd be different, because then the size of the representative tuple can matter substantially. Do you think we should tackle this for 13? To me 4cad2534da seems like a somewhat independent improvement to spillable hashaggs. Greetings, Andres Freund
On Fri, 2020-06-12 at 17:12 -0700, Andres Freund wrote: > Do you think we should tackle this for 13? To me 4cad2534da seems > like a > somewhat independent improvement to spillable hashaggs. We've gone back and forth on this issue a few times, so let's try to get some agreement before we revert 4cad2534da. I added Robert because he also seemed to think it was a reasonable idea. Regards, Jeff Davis
On Sat, Jun 13, 2020 at 11:48:09AM -0700, Jeff Davis wrote: >On Fri, 2020-06-12 at 17:12 -0700, Andres Freund wrote: >> Do you think we should tackle this for 13? To me 4cad2534da seems >> like a >> somewhat independent improvement to spillable hashaggs. > >We've gone back and forth on this issue a few times, so let's try to >get some agreement before we revert 4cad2534da. I added Robert because >he also seemed to think it was a reasonable idea. > I can't speak for Robert, but I haven't expected the extra projection would be this high. And I agree with Andres it's not very nice we have to do this even for aggregates with just a handful of groups that don't need to spill. In any case, I think we need to address this somehow for v13 - either we keep the 4cad2534da patch in, or we tweak the cost model to reflect the extra I/O costs, or we project only when spilling. I'm not in a position to whip up a patch soon, though :-( regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2020-06-12 15:29:08 -0700, Jeff Davis wrote: > On Fri, 2020-06-12 at 14:37 -0700, Andres Freund wrote: > > I don't see why it's ok to force an additional projection in the very > > common case of hashaggs over a few rows. So I think we need to > > rethink > > 4cad2534da6. > > One possibility is to project only spilled tuples, more similar to > Melanie's patch from a while ago: > > > https://www.postgresql.org/message-id/CAAKRu_aefEsv+UkQWqu+ioEnoiL2LJu9Diuh9BR8MbyXuZ0j4A@mail.gmail.com > > Which makes sense, but it's also more code. I'm somewhat inclined to think that we should revert 4cad2534da6 and then look at how precisely to tackle this in 14. It'd probably make sense to request small tlists when the number of estimated groups is large, and not otherwise. Greetings, Andres Freund
On Sun, 2020-06-14 at 11:14 -0700, Andres Freund wrote: > I'm somewhat inclined to think that we should revert 4cad2534da6 and > then look at how precisely to tackle this in 14. I'm fine with that. > It'd probably make sense to request small tlists when the number of > estimated groups is large, and not otherwise. That seems like a nice compromise that would be non-invasive, at least for create_agg_plan(). Regards, Jeff Davis
On Sun, Jun 14, 2020 at 11:09:55PM -0700, Jeff Davis wrote: >On Sun, 2020-06-14 at 11:14 -0700, Andres Freund wrote: >> I'm somewhat inclined to think that we should revert 4cad2534da6 and >> then look at how precisely to tackle this in 14. > >I'm fine with that. > I don't see how we could just revert 4cad2534d and leave this for v14. The hashagg spilling is IMHO almost guaranteed to be a pain point for some users, as it will force some queries to serialize large amounts of data. Yes, some of this is a cost for hashagg enforcing work_mem at runtime, I'm fine with that. We'd get reports about that too, but we can justify that cost ... But just reverting 4cad2534d will make this much worse, I think, as illustrated by the benchmarks I did in [1]. And no, this is not really fixable by tweaking the cost parameters - even with the current code (i.e. 4cad2534d in place) I had to increase random_page_cost to 60 on the temp tablespace (on SATA RAID) to get good plans with parallelism enabled. I haven't tried, but I presume without 4cad2534d I'd have to push r_p_c even further ... [1] https://www.postgresql.org/message-id/20200519151202.u2p2gpiawoaznsv2%40development >> It'd probably make sense to request small tlists when the number of >> estimated groups is large, and not otherwise. > >That seems like a nice compromise that would be non-invasive, at least >for create_agg_plan(). > Maybe. It'd certainly better than nothing. It's not clear to me what would a good threshold be, though. And it's not going to handle cases of under-estimates. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jun 15, 2020 at 9:34 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > But just reverting 4cad2534d will make this much worse, I think, as > illustrated by the benchmarks I did in [1]. I share this concern, although I do not know what we should do about it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Jun 15, 2020 at 9:34 AM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: >> But just reverting 4cad2534d will make this much worse, I think, as >> illustrated by the benchmarks I did in [1]. > I share this concern, although I do not know what we should do about it. Well, it's only June. Let's put it on the open issues list for v13 and continue to think about it. I concur that the hashagg spill patch has made this something that we should worry about for v13, so just reverting without a better answer isn't very appetizing. regards, tom lane
On Mon, 2020-06-15 at 11:19 -0400, Robert Haas wrote: > On Mon, Jun 15, 2020 at 9:34 AM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > But just reverting 4cad2534d will make this much worse, I think, as > > illustrated by the benchmarks I did in [1]. > > I share this concern, although I do not know what we should do about > it. I attached an updated version of Melanie's patch, combined with the changes to copy only the necessary attributes to a new slot before spilling. There are a couple changes: * I didn't see a reason to descend into a GroupingFunc node, so I removed that. * I used a flag in the context rather than two separate callbacks to the expression walker. This patch gives the space benefits that we see on master, without the regression for small numbers of tuples. I saw a little bit of noise in my test results, but I'm pretty sure it's a win all around. It could use some review/cleanup though. Regards, Jeff Davis
Attachment
On Mon, Jun 15, 2020 at 07:38:45PM -0700, Jeff Davis wrote: >On Mon, 2020-06-15 at 11:19 -0400, Robert Haas wrote: >> On Mon, Jun 15, 2020 at 9:34 AM Tomas Vondra >> <tomas.vondra@2ndquadrant.com> wrote: >> > But just reverting 4cad2534d will make this much worse, I think, as >> > illustrated by the benchmarks I did in [1]. >> >> I share this concern, although I do not know what we should do about >> it. > >I attached an updated version of Melanie's patch, combined with the >changes to copy only the necessary attributes to a new slot before >spilling. There are a couple changes: > >* I didn't see a reason to descend into a GroupingFunc node, so I >removed that. > >* I used a flag in the context rather than two separate callbacks to >the expression walker. > >This patch gives the space benefits that we see on master, without the >regression for small numbers of tuples. I saw a little bit of noise in >my test results, but I'm pretty sure it's a win all around. It could >use some review/cleanup though. > Looks reasonable. I can't redo my tests at the moment, the machine is busy with something else. I'll give it a try over the weekend. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, I think it'd be good to get the changes that aren't related to projection merged. As far as I can tell there's performance regressions both because of the things I'd listed upthread, and due to the projection issue. That's not obvious because because we first won performance and then lost it again in several incremental steps. COPY (SELECT (random() * 4)::int cat, (random()*10000)::int val FROM generate_series(1, 100000000)) TO '/tmp/data' WITH BINARY; BEGIN; DROP TABLE IF EXISTS fewgroups_many_rows; CREATE TABLE fewgroups_many_rows(cat int4 not null, val int4 not null); COPY fewgroups_many_rows FROM '/tmp/data' WITH (FORMAT BINARY, FREEZE); COMMIT; VACUUM FREEZE fewgroups_many_rows; Test prep: Test query: SET seed=0;SELECT cat, count(*) FROM fewgroups_many_rows GROUP BY 1; (the seed seems to reduce noise due to hashtable iv being the same) best of six: 9e1c9f959422192bbe1b842a2a1ffaf76b080196 12031.906 ms d52eaa094847d395f942827a6f413904e516994c 12045.487 ms ac88807f9b227ddcd92b8be9a053094837c1b99a 11950.006 ms 36d22dd95bc87ca68e742da91f47f8826f8758c9 11769.991 ms 5ac4e9a12c6543414891cd8972b2cd36a08e40cc 11551.932 ms 1fdb7f9789c4550204cd62d1746a7deed1dc4c29 11706.948 ms 4eaea3db150af56aa2e40efe91997fd25f3b6d73 11999.908 ms 11de6c903da99a4b2220acfa776fc26c7f384ccc 11999.054 ms b7fabe80df9a65010bfe5e5d0a979bacebfec382 12165.463 ms 2742c45080077ed3b08b810bb96341499b86d530 12137.505 ms 1f39bce021540fde00990af55b4432c55ef4b3c7 12501.764 ms 9b60c4b979bce060495e2b05ba01d1cc6bffdd2d 12389.047 ms 4cad2534da6d17067d98cf04be2dfc1bda8f2cd0 13319.786 ms 1b2c29469a58cd9086bd86e20c708eb437564a80 13330.616 ms There's certainly some noise in here, but I think the trends are valid. > /* > - * find_unaggregated_cols > - * Construct a bitmapset of the column numbers of un-aggregated Vars > - * appearing in our targetlist and qual (HAVING clause) > + * Walk tlist and qual to find referenced colnos, dividing them into > + * aggregated and unaggregated sets. > */ > -static Bitmapset * > -find_unaggregated_cols(AggState *aggstate) > +static void > +find_cols(AggState *aggstate, Bitmapset **aggregated, Bitmapset **unaggregated) > { It's not this patch's fault, but none, really none, of this stuff should be in the executor. Greetings, Andres Freund
On Mon, Jun 22, 2020 at 9:02 PM Andres Freund <andres@anarazel.de> wrote:
> /*
> - * find_unaggregated_cols
> - * Construct a bitmapset of the column numbers of un-aggregated Vars
> - * appearing in our targetlist and qual (HAVING clause)
> + * Walk tlist and qual to find referenced colnos, dividing them into
> + * aggregated and unaggregated sets.
> */
> -static Bitmapset *
> -find_unaggregated_cols(AggState *aggstate)
> +static void
> +find_cols(AggState *aggstate, Bitmapset **aggregated, Bitmapset **unaggregated)
> {
It's not this patch's fault, but none, really none, of this stuff should
be in the executor.
Were you thinking it could be done in grouping_planner() and then the
bitmaps could be saved in the PlannedStmt?
Or would you have to wait until query_planner()? Or are you imagining
somewhere else entirely?
bitmaps could be saved in the PlannedStmt?
Or would you have to wait until query_planner()? Or are you imagining
somewhere else entirely?
--
Melanie Plageman
Hi, On 2020-06-23 09:23:57 -0700, Melanie Plageman wrote: > On Mon, Jun 22, 2020 at 9:02 PM Andres Freund <andres@anarazel.de> wrote: > > It's not this patch's fault, but none, really none, of this stuff should > > be in the executor. > > > > > Were you thinking it could be done in grouping_planner() and then the > bitmaps could be saved in the PlannedStmt? > Or would you have to wait until query_planner()? Or are you imagining > somewhere else entirely? I haven't thought about it in too much detail, but I would say create_agg_plan() et al. I guess there's some argument to be made to do it in setrefs.c, because we already do convert_combining_aggrefs() there (but I don't like that much). There's no reason to do it before we actually decided on one specific path, so doing it earlier than create_plan() seems unnecessary. And having it in agg specific code seems better than putting it into global routines. There's probably an argument for having a bit more shared code between create_agg_plan(), create_group_plan() and create_groupingsets_plan(). But even just adding a new extract_*_cols() call to each of those would probably be ok. Greetings, Andres Freund
On Tue, Jun 23, 2020 at 10:06 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2020-06-23 09:23:57 -0700, Melanie Plageman wrote:
> On Mon, Jun 22, 2020 at 9:02 PM Andres Freund <andres@anarazel.de> wrote:
> > It's not this patch's fault, but none, really none, of this stuff should
> > be in the executor.
> >
> >
> Were you thinking it could be done in grouping_planner() and then the
> bitmaps could be saved in the PlannedStmt?
> Or would you have to wait until query_planner()? Or are you imagining
> somewhere else entirely?
I haven't thought about it in too much detail, but I would say
create_agg_plan() et al. I guess there's some argument to be made to do
it in setrefs.c, because we already do convert_combining_aggrefs() there
(but I don't like that much).
There's no reason to do it before we actually decided on one specific
path, so doing it earlier than create_plan() seems unnecessary. And
having it in agg specific code seems better than putting it into global
routines.
There's probably an argument for having a bit more shared code between
create_agg_plan(), create_group_plan() and
create_groupingsets_plan(). But even just adding a new extract_*_cols()
call to each of those would probably be ok.
So, my summary of this point in the context of the other discussion
upthread is:
Planner should extract the columns that hashagg will need later during
planning. Planner should not have HashAgg/MixedAgg nodes request smaller
targetlists from their children with CP_SMALL_TLIST to avoid unneeded
projection overhead.
upthread is:
Planner should extract the columns that hashagg will need later during
planning. Planner should not have HashAgg/MixedAgg nodes request smaller
targetlists from their children with CP_SMALL_TLIST to avoid unneeded
projection overhead.
Also, even this extraction should only be done when the number of groups
is large enough to suspect a spill.
So, I wrote a patch that extracts the columns the same way as in
ExecInitAgg but in create_agg_plan() and it doesn't work because we
haven't called set_plan_references().
Then, I wrote a patch that does this in set_upper_references(), and it
seems to work. I've attached that one.
It is basically Jeff's patch (based somewhat on my patch) which extracts
the columns in ExecInitAgg but I moved the functions over to setrefs.c
and gave them a different name.
It's not very elegant.
I shoved it in at the end of set_upper_references(), but I think there
should be a nice way to do it while setting the references for each var
instead of walking over the nodes again.
Also, I think that the bitmapsets for the colnos should maybe be put
somewhere less prominent (than in the main Agg plan node?), since they
are only used in one small place.
I tried putting both bitmaps in an array of two bitmaps in the Agg node
(since there will always be two) to make it look a bit neater, but it
was pretty confusing and error prone to remember which one was
aggregated and which one was unaggregated.
Note that I didn't do anything with costing like only extracting the
columns if there are a lot of groups.
Also, I didn't revert the CP_SMALL_TLIST change in create_agg_plan() or
create_groupingsets_plan().
Not to stir the pot, but I did notice that hashjoin uses CP_SMALL_TLIST
in create_hashjoin_plan() for the inner side sub-tree and the outer side
one if there are multiple batches. I wondered what was different about
that vs hashagg (i.e. why it is okay to do that there).
is large enough to suspect a spill.
So, I wrote a patch that extracts the columns the same way as in
ExecInitAgg but in create_agg_plan() and it doesn't work because we
haven't called set_plan_references().
Then, I wrote a patch that does this in set_upper_references(), and it
seems to work. I've attached that one.
It is basically Jeff's patch (based somewhat on my patch) which extracts
the columns in ExecInitAgg but I moved the functions over to setrefs.c
and gave them a different name.
It's not very elegant.
I shoved it in at the end of set_upper_references(), but I think there
should be a nice way to do it while setting the references for each var
instead of walking over the nodes again.
Also, I think that the bitmapsets for the colnos should maybe be put
somewhere less prominent (than in the main Agg plan node?), since they
are only used in one small place.
I tried putting both bitmaps in an array of two bitmaps in the Agg node
(since there will always be two) to make it look a bit neater, but it
was pretty confusing and error prone to remember which one was
aggregated and which one was unaggregated.
Note that I didn't do anything with costing like only extracting the
columns if there are a lot of groups.
Also, I didn't revert the CP_SMALL_TLIST change in create_agg_plan() or
create_groupingsets_plan().
Not to stir the pot, but I did notice that hashjoin uses CP_SMALL_TLIST
in create_hashjoin_plan() for the inner side sub-tree and the outer side
one if there are multiple batches. I wondered what was different about
that vs hashagg (i.e. why it is okay to do that there).
Melanie Plageman
Attachment
On Wed, Jun 24, 2020 at 05:26:07PM -0700, Melanie Plageman wrote: >On Tue, Jun 23, 2020 at 10:06 AM Andres Freund <andres@anarazel.de> wrote: > >> Hi, >> >> On 2020-06-23 09:23:57 -0700, Melanie Plageman wrote: >> > On Mon, Jun 22, 2020 at 9:02 PM Andres Freund <andres@anarazel.de> >> wrote: >> > > It's not this patch's fault, but none, really none, of this stuff >> should >> > > be in the executor. >> > > >> > > >> > Were you thinking it could be done in grouping_planner() and then the >> > bitmaps could be saved in the PlannedStmt? >> > Or would you have to wait until query_planner()? Or are you imagining >> > somewhere else entirely? >> >> I haven't thought about it in too much detail, but I would say >> create_agg_plan() et al. I guess there's some argument to be made to do >> it in setrefs.c, because we already do convert_combining_aggrefs() there >> (but I don't like that much). >> >> There's no reason to do it before we actually decided on one specific >> path, so doing it earlier than create_plan() seems unnecessary. And >> having it in agg specific code seems better than putting it into global >> routines. >> >> There's probably an argument for having a bit more shared code between >> create_agg_plan(), create_group_plan() and >> create_groupingsets_plan(). But even just adding a new extract_*_cols() >> call to each of those would probably be ok. >> >> >So, my summary of this point in the context of the other discussion >upthread is: > >Planner should extract the columns that hashagg will need later during >planning. Planner should not have HashAgg/MixedAgg nodes request smaller >targetlists from their children with CP_SMALL_TLIST to avoid unneeded >projection overhead. >Also, even this extraction should only be done when the number of groups >is large enough to suspect a spill. > IMO we should extract the columns irrespectedly of the estimates, otherwise we won't be able to handle underestimates efficiently. > >Not to stir the pot, but I did notice that hashjoin uses CP_SMALL_TLIST >in create_hashjoin_plan() for the inner side sub-tree and the outer side >one if there are multiple batches. I wondered what was different about >that vs hashagg (i.e. why it is okay to do that there). > Yeah. That means that if we have to start batching during execution, we may need to spill much more datai. I'd say that's a hashjoin issue that we should fix too (in v14). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2020-06-12 14:37:15 -0700, Andres Freund wrote: > On 2020-06-11 11:14:02 -0700, Jeff Davis wrote: > > On Thu, 2020-06-11 at 10:45 -0700, Andres Freund wrote: > > > Did you run any performance tests? > > > > Yes, I reproduced your ~12% regression from V12, and this patch nearly > > eliminated it for me. > > I spent a fair bit of time looking at the difference. Jeff had let me > know on chat that he was still seeing some difference, but couldn't > quite figure out where that was. > > Trying it out myself, I observed that the patch helped, but not that > much. After a bit I found one major reason for why: > LookupTupleHashEntryHash() assigned the hash to pointer provided by the > caller's before doing the insertion. That ended up causing a pipeline > stall (I assume it's store forwarding, but not sure). Moving the > assignment to the caller variable to after the insertion got rid of > that. This is still not resolved. We're right now slower than 12. It's effectively not possible to do performance comparisons right now. This was nearly two months ago. Greetings, Andres Freund
On Fri, Jul 24, 2020 at 4:51 PM Andres Freund <andres@anarazel.de> wrote: > This is still not resolved. We're right now slower than 12. It's > effectively not possible to do performance comparisons right now. This > was nearly two months ago. I have added a new open item for this separate LookupTupleHashEntryHash()/lookup_hash_entry() pipeline-stall issue. (For the record I mistakenly believed that commit 23023022 resolved all of the concerns raised on this thread, which is why I closed out the open item associated with this thread. Evidently work remains to fix a remaining regression that affects simple in-memory hash aggregation, though.) -- Peter Geoghegan
On Sat, Jul 25, 2020 at 12:41 PM Peter Geoghegan <pg@bowt.ie> wrote: > I have added a new open item for this separate > LookupTupleHashEntryHash()/lookup_hash_entry() pipeline-stall issue. Attached is a rebased version of Andres' now-bitrot 2020-06-12 patch ("aggspeed.diff"). I find that Andres original "SELECT cat, count(*) FROM fewgroups_many_rows GROUP BY 1;" test case is noticeably improved by the patch. Without the patch, v13 takes ~11.46 seconds. With the patch, it takes only ~10.64 seconds. Didn't test it against v12 yet, but I have no reason to doubt Andres' explanation. I gather that if we can get this patch committed, we can close the relevant LookupTupleHashEntryHash() open item. Can you take this off my hands, Jeff? Thanks -- Peter Geoghegan
Attachment
On Sat, 2020-07-25 at 15:08 -0700, Peter Geoghegan wrote: > I find that Andres original "SELECT cat, count(*) FROM > fewgroups_many_rows GROUP BY 1;" test case is noticeably improved by > the patch. Without the patch, v13 takes ~11.46 seconds. With the > patch, it takes only ~10.64 seconds. I saw less of an improvement than you or Andres (perhaps just more noise). But given that both you and Andres are reporting a measurable improvement, then I went ahead and committed it. Thank you. Regards, Jeff Davis
On Sun, Jul 26, 2020 at 4:17 PM Jeff Davis <pgsql@j-davis.com> wrote: > I saw less of an improvement than you or Andres (perhaps just more > noise). But given that both you and Andres are reporting a measurable > improvement, then I went ahead and committed it. Thank you. Thanks! -- Peter Geoghegan