Thread: significant slowdown of HashAggregate between 9.6 and 10
Hi
A Czech Postgres user reported a performance issue related to the speed of HashAggregate inside a nested loop.
The speed on 9.6:
HashAggregate (cost=27586.10..27728.66 rows=14256 width=24)
(actual time=0.003..0.049 rows=39 loops=599203)
The speed on 10.7:
HashAggregate (cost=27336.78..27552.78 rows=21600 width=24)
(actual time=0.011..0.156 rows=38 loops=597137)
So it looks like HashAgg is about 3x slower - with a brutal nested loop that is a problem.
I wrote a simple benchmark, and it really looks like our hash aggregate gets slower with every release.
create table foo(a int, b int, c int, d int, e int, f int);
insert into foo select random()*1000, random()*4, random()*4, random()* 2, random()*100, random()*100 from generate_series(1,2000000);
analyze foo;
9.6.7
postgres=# explain (analyze, buffers) select i from generate_series(1,500000) g(i) where exists (select count(*) cx from foo group by b, c, d having count(*) = i);
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Function Scan on generate_series g (cost=0.00..57739020.00 rows=500 width=4) (actual time=807.485..3364.515 rows=74 loops=1) │
│ Filter: (SubPlan 1) │
│ Rows Removed by Filter: 499926 │
│ Buffers: shared hit=12739, temp read=856 written=855 │
│ SubPlan 1 │
│ -> HashAggregate (cost=57739.00..57739.75 rows=75 width=20) (actual time=0.006..0.006 rows=0 loops=500000) │
│ Group Key: foo.b, foo.c, foo.d │
│ Filter: (count(*) = g.i) │
│ Rows Removed by Filter: 75 │
│ Buffers: shared hit=12739 │
│ -> Seq Scan on foo (cost=0.00..32739.00 rows=2000000 width=12) (actual time=0.015..139.736 rows=2000000 loops=1) │
│ Buffers: shared hit=12739 │
│ Planning time: 0.276 ms │
│ Execution time: 3365.758 ms │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(14 rows)
10.9
postgres=# explain (analyze, buffers) select i from generate_series(1,500000) g(i) where exists (select count(*) cx from foo group by b, c, d having count(*) = i);
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Function Scan on generate_series g (cost=0.00..57739020.00 rows=500 width=4) (actual time=825.468..4919.063 rows=74 loops=1) │
│ Filter: (SubPlan 1) │
│ Rows Removed by Filter: 499926 │
│ Buffers: shared hit=12739, temp read=856 written=855 │
│ SubPlan 1 │
│ -> HashAggregate (cost=57739.00..57739.75 rows=75 width=20) (actual time=0.009..0.009 rows=0 loops=500000) │
│ Group Key: foo.b, foo.c, foo.d │
│ Filter: (count(*) = g.i) │
│ Rows Removed by Filter: 75 │
│ Buffers: shared hit=12739 │
│ -> Seq Scan on foo (cost=0.00..32739.00 rows=2000000 width=12) (actual time=0.025..157.887 rows=2000000 loops=1) │
│ Buffers: shared hit=12739 │
│ Planning time: 0.829 ms │
│ Execution time: 4920.800 ms │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(14 rows)
master
postgres=# explain (analyze, buffers) select i from generate_series(1,500000) g(i) where exists (select count(*) cx from foo group by b, c, d having count(*) = i);
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ QUERY PLAN
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
│ Function Scan on generate_series g (cost=0.00..28869973750.00 rows=250000 width=4) (actual time=901.639..6057.943 rows=74 loops=1)
│ Filter: (SubPlan 1)
│ Rows Removed by Filter: 499926
│ Buffers: shared hit=12739, temp read=855 written=855
│ SubPlan 1
│ -> HashAggregate (cost=57739.00..57739.94 rows=1 width=20) (actual time=0.012..0.012 rows=0 loops=500000)
│ Group Key: foo.b, foo.c, foo.d
│ Filter: (count(*) = g.i)
│ Peak Memory Usage: 37 kB
│ Rows Removed by Filter: 75
│ Buffers: shared hit=12739
│ -> Seq Scan on foo (cost=0.00..32739.00 rows=2000000 width=12) (actual time=0.017..262.497 rows=2000000 loops=1)
│ Buffers: shared hit=12739
│ Planning Time: 0.275 ms
│ Buffers: shared hit=1
│ Execution Time: 6059.266 ms
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
(16 rows)
Pavel
On Wed, Jun 3, 2020 at 5:32 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
I tried to run the same query on half the data size, and the performance is almost the same. So the performance issue is probably related to the initialization or finalization of the aggregation.
Pavel
Hi,

Not sure what's the root cause, but I can reproduce it. Timings for 9.6,
10 and master (all built from git with the same options) without explain
analyze look like this:

9.6
-----------------
Time: 1971.314 ms
Time: 1995.875 ms
Time: 1997.408 ms
Time: 2069.913 ms
Time: 2004.196 ms

10
-----------------------------
Time: 2815.434 ms (00:02.815)
Time: 2862.589 ms (00:02.863)
Time: 2841.126 ms (00:02.841)
Time: 2803.040 ms (00:02.803)
Time: 2805.527 ms (00:02.806)

master
-----------------------------
Time: 3479.233 ms (00:03.479)
Time: 3537.901 ms (00:03.538)
Time: 3459.314 ms (00:03.459)
Time: 3542.810 ms (00:03.543)
Time: 3482.141 ms (00:03.482)

So there seems to be +40% between 9.6 and 10, and further +25% between 10
and master. However, plain hashagg, measured e.g. like this:

    select count(*) cx from foo group by b, c, d having count(*) = 1;

does not indicate any slowdown at all, so I think you're right it has
something to do with the looping.

Profiles from those versions look like this:

9.6
---------------------------------------------------------
Overhead  Shared Object  Symbol
  14.19%  postgres       [.] ExecMakeFunctionResultNoSets
  13.65%  postgres       [.] finalize_aggregates
  12.54%  postgres       [.] hash_seq_search
   6.70%  postgres       [.] finalize_aggregate.isra.0
   5.71%  postgres       [.] ExecEvalParamExec
   5.54%  postgres       [.] ExecEvalAggref
   5.00%  postgres       [.] ExecStoreMinimalTuple
   4.34%  postgres       [.] ExecAgg
   4.08%  postgres       [.] ExecQual
   2.67%  postgres       [.] slot_deform_tuple
   2.24%  postgres       [.] pgstat_init_function_usage
   2.22%  postgres       [.] check_stack_depth
   2.14%  postgres       [.] MemoryContextReset
   1.89%  postgres       [.] hash_search_with_hash_value
   1.72%  postgres       [.] project_aggregates
   1.68%  postgres       [.] pgstat_end_function_usage
   1.59%  postgres       [.] slot_getattr

10
------------------------------------------------------------
Overhead  Shared Object  Symbol
  15.18%  postgres       [.] slot_deform_tuple
  13.09%  postgres       [.] agg_retrieve_hash_table
  12.02%  postgres       [.] ExecInterpExpr
   7.47%  postgres       [.] finalize_aggregates
   7.38%  postgres       [.] tuplehash_iterate
   5.13%  postgres       [.] prepare_projection_slot
   4.86%  postgres       [.] finalize_aggregate.isra.0
   4.05%  postgres       [.] bms_is_member
   3.97%  postgres       [.] slot_getallattrs
   3.59%  postgres       [.] ExecStoreMinimalTuple
   2.85%  postgres       [.] project_aggregates
   1.95%  postgres       [.] ExecClearTuple
   1.71%  libc-2.30.so   [.] __memset_avx2_unaligned_erms
   1.69%  postgres       [.] ExecEvalParamExec
   1.58%  postgres       [.] MemoryContextReset
   1.17%  postgres       [.] slot_getattr
   1.03%  postgres       [.] slot_getsomeattrs

master
--------------------------------------------------------------
Overhead  Shared Object  Symbol
  17.07%  postgres       [.] agg_retrieve_hash_table
  15.46%  postgres       [.] tuplehash_iterate
  11.83%  postgres       [.] tts_minimal_getsomeattrs
   9.39%  postgres       [.] ExecInterpExpr
   6.94%  postgres       [.] prepare_projection_slot
   4.85%  postgres       [.] finalize_aggregates
   4.27%  postgres       [.] bms_is_member
   3.80%  postgres       [.] finalize_aggregate.isra.0
   3.80%  postgres       [.] tts_minimal_store_tuple
   2.22%  postgres       [.] project_aggregates
   2.07%  postgres       [.] tts_virtual_clear
   2.07%  postgres       [.] MemoryContextReset
   1.78%  postgres       [.] tts_minimal_clear
   1.61%  postgres       [.] ExecEvalParamExec
   1.46%  postgres       [.] slot_getsomeattrs_int
   1.34%  libc-2.30.so   [.] __memset_avx2_unaligned_erms

Not sure what to think about this. Seems slot_deform_tuple got way more
expensive between 9.6 and 10, for some reason.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2020-06-03 21:31:01 +0200, Tomas Vondra wrote:
> So there seems to be +40% between 9.6 and 10, and further +25% between
> 10 and master. However, plain hashagg, measured e.g. like this:

Ugh. Since I am a likely culprit, I'm taking a look.

> Not sure what to think about this. Seems slot_deform_tuple got way more
> expensive between 9.6 and 10, for some reason.

Might indicate too many calls instead. Or it could just be the fact that
we have expressions deform all columns once, instead of deforming columns
one-by-one now.

Greetings,

Andres Freund
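To make that deforming change concrete, here is a minimal sketch using
invented names, not the executor's API: lazy access deforms attributes
incrementally up to the one requested, while the expression engine deforms
everything an expression needs in one up-front step, so the same per-tuple
work concentrates in a single routine and can climb in a profile without
more total work being done.

#include <stdio.h>

#define NATTS 6

typedef struct
{
	int			values[NATTS];		/* stand-in for the on-disk attributes */
	int			deformed[NATTS];	/* extracted copies */
	int			nvalid;				/* leading attributes extracted so far */
} Slot;

/* extract attributes [nvalid, upto) in one pass */
static void
deform_upto(Slot *slot, int upto)
{
	for (int i = slot->nvalid; i < upto; i++)
		slot->deformed[i] = slot->values[i];	/* stand-in for real decoding */
	if (upto > slot->nvalid)
		slot->nvalid = upto;
}

/* pre-10 style: each access lazily deforms up to the attribute it needs */
static int
getattr_lazy(Slot *slot, int attnum)
{
	if (attnum > slot->nvalid)
		deform_upto(slot, attnum);
	return slot->deformed[attnum - 1];
}

/* 10-style: before evaluating an expression, deform all it will touch */
static void
getsomeattrs(Slot *slot, int max_attnum)
{
	deform_upto(slot, max_attnum);
}

int
main(void)
{
	Slot		slot = {{10, 20, 30, 40, 50, 60}, {0}, 0};

	/* eager path: one deforming step covering attributes 1..4 ... */
	getsomeattrs(&slot, 4);
	/* ... after which attribute fetches are plain array lookups */
	printf("b=%d c=%d d=%d\n",
		   slot.deformed[1], slot.deformed[2], slot.deformed[3]);

	/* lazy path: a later access to attribute 6 deforms the remainder */
	printf("f=%d\n", getattr_lazy(&slot, 6));
	return 0;
}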
Hi,

On 2020-06-03 13:26:43 -0700, Andres Freund wrote:
> On 2020-06-03 21:31:01 +0200, Tomas Vondra wrote:
> > So there seems to be +40% between 9.6 and 10, and further +25% between
> > 10 and master. However, plain hashagg, measured e.g. like this:

As far as I can tell the 10->master difference comes largely from the
difference in the number of buckets in the hashtable. In 10 it is:

    Breakpoint 1, tuplehash_create (ctx=0x5628251775c8, nelements=75,
        private_data=0x5628251952f0)

and in master it is:

    Breakpoint 1, tuplehash_create (ctx=0x5628293a0a70, nelements=256,
        private_data=0x5628293a0b90)

As far as I can tell the timing difference simply is the cost of
iterating 500k times over a hashtable with fairly few entries. Which is,
unsurprisingly, more expensive if the hashtable is larger.

The reason the hashtable got bigger in 13 is

    commit 1f39bce021540fde00990af55b4432c55ef4b3c7
    Author: Jeff Davis <jdavis@postgresql.org>
    Date:   2020-03-18 15:42:02 -0700

        Disk-based Hash Aggregation.

which introduced

    +/* minimum number of initial hash table buckets */
    +#define HASHAGG_MIN_BUCKETS 256

I don't really see much explanation for that part in the commit, perhaps
Jeff can chime in?

I think optimizing for the gazillion hash table scans isn't particularly
important. Rarely is a query going to have 500k scans of unchanging
aggregated data. So I'm not too concerned about the 13 regression - but I
also see very little reason to just always use 256 buckets? It's pretty
darn common to end up with 1-2 groups, what's the point of this?

I'll look into 9.6->10 after buying groceries... But I'd wish there were
a relevant benchmark, I don't think it's worth optimizing for this case.

Greetings,

Andres Freund
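To make the iteration cost concrete, here is a small standalone sketch --
not PostgreSQL code, and the bucket counts are only illustrative: a
sequential scan of an open-addressing hash table visits every bucket,
occupied or not, so per-scan cost tracks the bucket count rather than the
entry count, and 500k such scans amplify any over-sizing.

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
	int			used;
	long		value;
} Bucket;

static long
scan_table(const Bucket *buckets, int nbuckets)
{
	long		sum = 0;

	/* like a simplehash iterator: walk all buckets, skip the empty ones */
	for (int i = 0; i < nbuckets; i++)
		if (buckets[i].used)
			sum += buckets[i].value;
	return sum;
}

int
main(void)
{
	const int	nentries = 75;	/* groups in the benchmark query */
	const int	nscans = 500000;	/* one scan per outer row */
	const int	sizes[] = {96, 256};	/* illustrative small size vs. the
										 * 256-bucket floor */

	for (int s = 0; s < 2; s++)
	{
		int			nbuckets = sizes[s];
		Bucket	   *buckets = calloc(nbuckets, sizeof(Bucket));
		long		total = 0;

		/* spread the entries across the table */
		for (int i = 0; i < nentries; i++)
		{
			int			pos = (int) ((long) i * nbuckets / nentries);

			buckets[pos].used = 1;
			buckets[pos].value = 1;
		}

		for (int scan = 0; scan < nscans; scan++)
			total += scan_table(buckets, nbuckets);

		printf("nbuckets=%d: summed %ld over %d scans\n",
			   nbuckets, total, nscans);
		free(buckets);
	}
	return 0;
}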
On Thu, 2020-06-04 at 11:41 -0700, Andres Freund wrote:
> +/* minimum number of initial hash table buckets */
> +#define HASHAGG_MIN_BUCKETS 256
>
> I don't really see much explanation for that part in the commit,
> perhaps Jeff can chime in?

I did this in response to a review comment (point #5):

https://www.postgresql.org/message-id/20200219191636.gvdywx32kwbix6kd@development

Tomas suggested a min of 1024, and I thought I was being more
conservative choosing 256. Still too high, I guess?

I can lower it. What do you think is a reasonable minimum?

Regards,
	Jeff Davis
Hi,

On 2020-06-04 18:22:03 -0700, Jeff Davis wrote:
> On Thu, 2020-06-04 at 11:41 -0700, Andres Freund wrote:
> > +/* minimum number of initial hash table buckets */
> > +#define HASHAGG_MIN_BUCKETS 256
> >
> > I don't really see much explanation for that part in the commit,
> > perhaps Jeff can chime in?
>
> I did this in response to a review comment (point #5):
>
> https://www.postgresql.org/message-id/20200219191636.gvdywx32kwbix6kd@development
>
> Tomas suggested a min of 1024, and I thought I was being more
> conservative choosing 256. Still too high, I guess?
>
> I can lower it. What do you think is a reasonable minimum?

I don't really see why there needs to be a minimum bigger than 1?

Greetings,

Andres Freund
On Thu, Jun 04, 2020 at 06:57:58PM -0700, Andres Freund wrote:
> On 2020-06-04 18:22:03 -0700, Jeff Davis wrote:
> > Tomas suggested a min of 1024, and I thought I was being more
> > conservative choosing 256. Still too high, I guess?
> >
> > I can lower it. What do you think is a reasonable minimum?
>
> I don't really see why there needs to be a minimum bigger than 1?

I think you're right. I think I was worried about having to resize the
hash table in case of an under-estimate, and it seemed fine to waste a
tiny bit more memory to prevent that.

But this example shows we may need to scan the hash table sequentially,
which means it's not just about memory consumption. So in hindsight we
either don't need the limit at all, or maybe it could be much lower (IIRC
it reduces the probability of collision, but maybe dynahash does that
anyway internally).

I wonder if hashjoin has the same issue, but probably not - I don't think
we'll ever scan that internal hash table sequentially.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2020-06-05 15:25:26 +0200, Tomas Vondra wrote:
> I think you're right. I think I was worried about having to resize the
> hash table in case of an under-estimate, and it seemed fine to waste a
> tiny bit more memory to prevent that.

It's pretty cheap to resize a hashtable with a handful of entries, so I'm
not worried about that. It's also how it has worked for a *long* time, so
I think unless we have some good reason to change that, I wouldn't.

> But this example shows we may need to scan the hash table sequentially,
> which means it's not just about memory consumption.

We *always* scan the hashtable sequentially, no? Otherwise there's no way
to get at the aggregated data.

> So in hindsight we either don't need the limit at all, or maybe it
> could be much lower (IIRC it reduces probability of collision, but
> maybe dynahash does that anyway internally).

This is simplehash using code. Which resizes on a load factor of 0.9.

> I wonder if hashjoin has the same issue, but probably not - I don't
> think we'll ever scan that internal hash table sequentially.

I think we do for some outer joins (c.f. ExecPrepHashTableForUnmatched()),
but it's probably not relevant performance-wise.

Greetings,

Andres Freund
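On the resize-cost point, here is a simplified stand-in for that behavior
with invented names -- not the actual simplehash code: the table doubles
once the fill factor would exceed 0.9, and rehashing a handful of entries
during growth is nearly free, which is why starting small and growing on
demand recovers cheaply from an under-estimate.

#include <stdio.h>
#include <stdlib.h>

#define MAX_LOAD 0.9			/* grow when fuller than this */

typedef struct
{
	unsigned   *keys;			/* 0 means an empty bucket */
	int			size;
	int			members;
} Table;

static void table_insert(Table *t, unsigned key);

static void
table_grow(Table *t)
{
	unsigned   *oldkeys = t->keys;
	int			oldsize = t->size;

	t->size *= 2;
	t->keys = calloc(t->size, sizeof(unsigned));
	t->members = 0;

	/* rehash the existing entries into the doubled table */
	for (int i = 0; i < oldsize; i++)
		if (oldkeys[i])
			table_insert(t, oldkeys[i]);
	free(oldkeys);
}

static void
table_insert(Table *t, unsigned key)
{
	if (t->members + 1 > t->size * MAX_LOAD)
		table_grow(t);

	/* linear probing; keys are assumed nonzero */
	for (int i = key % t->size;; i = (i + 1) % t->size)
	{
		if (t->keys[i] == 0)
		{
			t->keys[i] = key;
			t->members++;
			return;
		}
		if (t->keys[i] == key)
			return;
	}
}

int
main(void)
{
	Table		t = {calloc(2, sizeof(unsigned)), 2, 0};

	/* 75 distinct groups, starting from a 2-bucket table */
	for (unsigned k = 1; k <= 75; k++)
		table_insert(&t, k);
	printf("%d entries ended up in a %d-bucket table\n", t.members, t.size);
	free(t.keys);
	return 0;
}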