significant slowdown of HashAggregate between 9.6 and 10

From: Pavel Stehule
Date: 2020-06-03
Hi

A Czech Postgres user reported a performance issue related to the speed of HashAggregate inside a nested loop.

The timing on 9.6:

HashAggregate  (cost=27586.10..27728.66 rows=14256 width=24)
(actual time=0.003..0.049 rows=39 loops=599203)

The timing on 10.7:

HashAggregate  (cost=27336.78..27552.78 rows=21600 width=24)
(actual time=0.011..0.156 rows=38 loops=597137)

So it looks like HashAgg is about 3x slower - combined with a brutal nested loop, that is a problem.

I wrote a simple benchmark, and it really does look like our hash aggregate gets slower with every release.

create table foo(a int, b int, c int, d int, e int, f int);
insert into foo select random()*1000, random()*4, random()*4, random()* 2, random()*100, random()*100 from generate_series(1,2000000);

analyze foo;

9.6.7
postgres=# explain (analyze, buffers) select i from generate_series(1,500000) g(i) where exists (select count(*) cx from foo group by b, c, d having count(*) = i);
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                          QUERY PLAN                                                           │
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Function Scan on generate_series g  (cost=0.00..57739020.00 rows=500 width=4) (actual time=807.485..3364.515 rows=74 loops=1) │
│   Filter: (SubPlan 1)                                                                                                         │
│   Rows Removed by Filter: 499926                                                                                              │
│   Buffers: shared hit=12739, temp read=856 written=855                                                                        │
│   SubPlan 1                                                                                                                   │
│     ->  HashAggregate  (cost=57739.00..57739.75 rows=75 width=20) (actual time=0.006..0.006 rows=0 loops=500000)              │
│           Group Key: foo.b, foo.c, foo.d                                                                                      │
│           Filter: (count(*) = g.i)                                                                                            │
│           Rows Removed by Filter: 75                                                                                          │
│           Buffers: shared hit=12739                                                                                           │
│           ->  Seq Scan on foo  (cost=0.00..32739.00 rows=2000000 width=12) (actual time=0.015..139.736 rows=2000000 loops=1)  │
│                 Buffers: shared hit=12739                                                                                     │
│ Planning time: 0.276 ms                                                                                                       │
│ Execution time: 3365.758 ms                                                                                                   │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(14 rows)

10.9

postgres=# explain (analyze, buffers) select i from generate_series(1,500000) g(i) where exists (select count(*) cx from foo group by b, c, d having count(*) = i);
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                          QUERY PLAN                                                           │
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Function Scan on generate_series g  (cost=0.00..57739020.00 rows=500 width=4) (actual time=825.468..4919.063 rows=74 loops=1) │
│   Filter: (SubPlan 1)                                                                                                         │
│   Rows Removed by Filter: 499926                                                                                              │
│   Buffers: shared hit=12739, temp read=856 written=855                                                                        │
│   SubPlan 1                                                                                                                   │
│     ->  HashAggregate  (cost=57739.00..57739.75 rows=75 width=20) (actual time=0.009..0.009 rows=0 loops=500000)              │
│           Group Key: foo.b, foo.c, foo.d                                                                                      │
│           Filter: (count(*) = g.i)                                                                                            │
│           Rows Removed by Filter: 75                                                                                          │
│           Buffers: shared hit=12739                                                                                           │
│           ->  Seq Scan on foo  (cost=0.00..32739.00 rows=2000000 width=12) (actual time=0.025..157.887 rows=2000000 loops=1)  │
│                 Buffers: shared hit=12739                                                                                     │
│ Planning time: 0.829 ms                                                                                                       │
│ Execution time: 4920.800 ms                                                                                                   │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(14 rows)

master


postgres=# explain (analyze, buffers) select i from generate_series(1,500000) g(i) where exists (select count(*) cx from foo group by b, c, d having count(*) = i);
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│                                                             QUERY PLAN                                                              
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
│ Function Scan on generate_series g  (cost=0.00..28869973750.00 rows=250000 width=4) (actual time=901.639..6057.943 rows=74 loops=1)
│   Filter: (SubPlan 1)                                                                                                              
│   Rows Removed by Filter: 499926                                                                                                    
│   Buffers: shared hit=12739, temp read=855 written=855                                                                              
│   SubPlan 1                                                                                                                        
│     ->  HashAggregate  (cost=57739.00..57739.94 rows=1 width=20) (actual time=0.012..0.012 rows=0 loops=500000)                    
│           Group Key: foo.b, foo.c, foo.d                                                                                            
│           Filter: (count(*) = g.i)                                                                                                  
│           Peak Memory Usage: 37 kB                                                                                                  
│           Rows Removed by Filter: 75                                                                                                
│           Buffers: shared hit=12739                                                                                                
│           ->  Seq Scan on foo  (cost=0.00..32739.00 rows=2000000 width=12) (actual time=0.017..262.497 rows=2000000 loops=1)        
│                 Buffers: shared hit=12739                                                                                          
│ Planning Time: 0.275 ms                                                                                                            
│   Buffers: shared hit=1                                                                                                            
│ Execution Time: 6059.266 ms                                                                                                        
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
(16 rows)

Regards

Pavel

Re: significant slowdown of HashAggregate between 9.6 and 10

From: Pavel Stehule
Date:


On Wed, Jun 3, 2020 at 17:32 Pavel Stehule <pavel.stehule@gmail.com> wrote:
> [original message quoted in full; trimmed]

I tried to run the same query on half the data size, and the performance was almost the same - so the cost is apparently not proportional to the amount of data. The issue is probably related to initialization or finalization of the aggregation.

Pavel

Re: significant slowdown of HashAggregate between 9.6 and 10

From: Tomas Vondra
Date: 2020-06-03
Hi,

Not sure what the root cause is, but I can reproduce it. Timings for 9.6,
10 and master (all built from git with the same options), without explain
analyze, look like this:

9.6
-----------------
Time: 1971.314 ms
Time: 1995.875 ms
Time: 1997.408 ms
Time: 2069.913 ms
Time: 2004.196 ms

10
-----------------------------
Time: 2815.434 ms (00:02.815)
Time: 2862.589 ms (00:02.863)
Time: 2841.126 ms (00:02.841)
Time: 2803.040 ms (00:02.803)
Time: 2805.527 ms (00:02.806)

master
-----------------------------
Time: 3479.233 ms (00:03.479)
Time: 3537.901 ms (00:03.538)
Time: 3459.314 ms (00:03.459)
Time: 3542.810 ms (00:03.543)
Time: 3482.141 ms (00:03.482)

So there seems to be +40% between 9.6 and 10, and further +25% between
10 and master. However, plain hashagg, measured e.g. like this:

   select count(*) cx from foo group by b, c, d having count(*) = 1;

does not indicate any slowdown at all, so I think you're right that it has
something to do with the looping.

Profiles from those versions look like this:

9.6
---------------------------------------------------------
Samples
Overhead  Shared Objec  Symbol
   14.19%  postgres      [.] ExecMakeFunctionResultNoSets
   13.65%  postgres      [.] finalize_aggregates
   12.54%  postgres      [.] hash_seq_search
    6.70%  postgres      [.] finalize_aggregate.isra.0
    5.71%  postgres      [.] ExecEvalParamExec
    5.54%  postgres      [.] ExecEvalAggref
    5.00%  postgres      [.] ExecStoreMinimalTuple
    4.34%  postgres      [.] ExecAgg
    4.08%  postgres      [.] ExecQual
    2.67%  postgres      [.] slot_deform_tuple
    2.24%  postgres      [.] pgstat_init_function_usage
    2.22%  postgres      [.] check_stack_depth
    2.14%  postgres      [.] MemoryContextReset
    1.89%  postgres      [.] hash_search_with_hash_value
    1.72%  postgres      [.] project_aggregates
    1.68%  postgres      [.] pgstat_end_function_usage
    1.59%  postgres      [.] slot_getattr


10
------------------------------------------------------------
Samples
Overhead  Shared Object       Symbol
   15.18%  postgres            [.] slot_deform_tuple
   13.09%  postgres            [.] agg_retrieve_hash_table
   12.02%  postgres            [.] ExecInterpExpr
    7.47%  postgres            [.] finalize_aggregates
    7.38%  postgres            [.] tuplehash_iterate
    5.13%  postgres            [.] prepare_projection_slot
    4.86%  postgres            [.] finalize_aggregate.isra.0
    4.05%  postgres            [.] bms_is_member
    3.97%  postgres            [.] slot_getallattrs
    3.59%  postgres            [.] ExecStoreMinimalTuple
    2.85%  postgres            [.] project_aggregates
    1.95%  postgres            [.] ExecClearTuple
    1.71%  libc-2.30.so        [.] __memset_avx2_unaligned_erms
    1.69%  postgres            [.] ExecEvalParamExec
    1.58%  postgres            [.] MemoryContextReset
    1.17%  postgres            [.] slot_getattr
    1.03%  postgres            [.] slot_getsomeattrs


master
--------------------------------------------------------------
Samples
Overhead  Shared Object       Symbol
   17.07%  postgres            [.] agg_retrieve_hash_table
   15.46%  postgres            [.] tuplehash_iterate
   11.83%  postgres            [.] tts_minimal_getsomeattrs
    9.39%  postgres            [.] ExecInterpExpr
    6.94%  postgres            [.] prepare_projection_slot
    4.85%  postgres            [.] finalize_aggregates
    4.27%  postgres            [.] bms_is_member
    3.80%  postgres            [.] finalize_aggregate.isra.0
    3.80%  postgres            [.] tts_minimal_store_tuple
    2.22%  postgres            [.] project_aggregates
    2.07%  postgres            [.] tts_virtual_clear
    2.07%  postgres            [.] MemoryContextReset
    1.78%  postgres            [.] tts_minimal_clear
    1.61%  postgres            [.] ExecEvalParamExec
    1.46%  postgres            [.] slot_getsomeattrs_int
    1.34%  libc-2.30.so        [.] __memset_avx2_unaligned_erms

Not sure what to think about this. Seems slot_deform_tuple got way more
expensive between 9.6 and 10, for some reason. 


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: significant slowdown of HashAggregate between 9.6 and 10

From: Andres Freund
Date: 2020-06-03
Hi,

On 2020-06-03 21:31:01 +0200, Tomas Vondra wrote:
> So there seems to be +40% between 9.6 and 10, and further +25% between
> 10 and master. However, plain hashagg, measured e.g. like this:

Ugh.

Since I am a likely culprit, I'm taking a look.


> Not sure what to think about this. Seems slot_deform_tuple got way more
> expensive between 9.6 and 10, for some reason.

That might indicate too many calls instead. Or it could just be the fact
that expressions now deform all referenced columns at once, instead of
deforming them one-by-one.
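
Roughly, the shape of that change looks like this (a standalone C sketch of
the idea, not the actual executor code; deform_upto() is a made-up stand-in
for slot_deform_tuple()):

#include <stdio.h>

#define NATTS 6

/* stand-in for slot_deform_tuple(): extract attributes [*nvalid, upto) */
static void
deform_upto(int *values, const int *tuple, int *nvalid, int upto)
{
    for (int i = *nvalid; i < upto; i++)
        values[i] = tuple[i];       /* pretend this is the expensive part */
    if (upto > *nvalid)
        *nvalid = upto;
}

int
main(void)
{
    int tuple[NATTS] = {10, 20, 30, 40, 50, 60};
    int values[NATTS];
    int nvalid = 0;

    /* 9.6 style: each Var evaluation lazily deforms up to its own attnum,
     * so the deforming machinery is entered once per Var */
    for (int attnum = 2; attnum <= 4; attnum++)
        deform_upto(values, tuple, &nvalid, attnum);

    /* 10+ style: a single up-front step deforms up to the highest attnum
     * the expression needs; each Var is then a plain array fetch */
    nvalid = 0;
    deform_upto(values, tuple, &nvalid, 4);

    printf("%d %d %d\n", values[1], values[2], values[3]);
    return 0;
}

The same attributes get extracted either way; what differs is how many times
the deforming path is entered, and where the time shows up in a profile.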

Greetings,

Andres Freund



Re: significant slowdown of HashAggregate between 9.6 and 10

From: Andres Freund
Date: 2020-06-04
Hi,

On 2020-06-03 13:26:43 -0700, Andres Freund wrote:
> On 2020-06-03 21:31:01 +0200, Tomas Vondra wrote:
> > So there seems to be +40% between 9.6 and 10, and further +25% between
> > 10 and master. However, plain hashagg, measured e.g. like this:

As far as I can tell, the 10->master difference comes largely from the
difference in the number of buckets in the hashtable.

In 10 it is:
Breakpoint 1, tuplehash_create (ctx=0x5628251775c8, nelements=75, private_data=0x5628251952f0)
and in master it is:
Breakpoint 1, tuplehash_create (ctx=0x5628293a0a70, nelements=256, private_data=0x5628293a0b90)

As far as I can tell the timing difference simply is the cost of
iterating 500k times over a hashtable with fairly few entries. Which is,
unsurprisingly, more expensive if the hashtable is larger.
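
To make that concrete: a sequential scan of an open-addressing hash table has
to visit every bucket, occupied or not, so each scan costs O(nbuckets) rather
than O(nentries). A minimal standalone sketch (my toy model, not the simplehash
code itself):

#include <stdbool.h>
#include <stdio.h>

typedef struct { bool used; long value; } Bucket;

/* analogous to tuplehash_iterate(): the loop length is nbuckets, no matter
 * how few of the buckets actually hold an entry */
static long
scan_table(const Bucket *buckets, int nbuckets)
{
    long sum = 0;
    for (int i = 0; i < nbuckets; i++)
        if (buckets[i].used)
            sum += buckets[i].value;
    return sum;
}

int
main(void)
{
    /* 75 groups spread over 256 buckets: most of each scan is spent
     * skipping empty buckets, and the benchmark scans 500k times */
    Bucket table[256] = {0};
    for (int i = 0; i < 75; i++)
    {
        table[(i * 37) % 256].used = true;      /* arbitrary spread */
        table[(i * 37) % 256].value = i;
    }

    long total = 0;
    for (int rescan = 0; rescan < 500000; rescan++)
        total += scan_table(table, 256);
    printf("%ld\n", total);
    return 0;
}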

The reason the hashtable got bigger in 13 is

commit 1f39bce021540fde00990af55b4432c55ef4b3c7
Author: Jeff Davis <jdavis@postgresql.org>
Date:   2020-03-18 15:42:02 -0700

    Disk-based Hash Aggregation.

which introduced

+/* minimum number of initial hash table buckets */
+#define HASHAGG_MIN_BUCKETS 256


I don't really see much explanation for that part in the commit, perhaps
Jeff can chime in?


I think optimizing for the gazillion hash table scans isn't particularly
important. Rarely is a query going to have 500k scans of unchanging
aggregated data. So I'm not too concerned about the 13 regression - but
I also see very little reason to just always use 256 buckets? It's
pretty darn common to end up with 1-2 groups, what's the point of this?


I'll look into 9.6->10 after buying groceries... But I wish there were
a relevant benchmark; I don't think it's worth optimizing for this case.

Greetings,

Andres Freund



Re: significant slowdown of HashAggregate between 9.6 and 10

From: Jeff Davis
Date: 2020-06-04
On Thu, 2020-06-04 at 11:41 -0700, Andres Freund wrote:
> +/* minimum number of initial hash table buckets */
> +#define HASHAGG_MIN_BUCKETS 256
> 
> 
> I don't really see much explanation for that part in the commit,
> perhaps
> Jeff can chime in?

I did this in response to a review comment (point #5):


https://www.postgresql.org/message-id/20200219191636.gvdywx32kwbix6kd@development

Tomas suggested a min of 1024, and I thought I was being more
conservative choosing 256. Still too high, I guess?

I can lower it. What do you think is a reasonable minimum?

Regards,
    Jeff Davis





Re: significant slowdown of HashAggregate between 9.6 and 10

From: Andres Freund
Date: 2020-06-04
Hi,

On 2020-06-04 18:22:03 -0700, Jeff Davis wrote:
> On Thu, 2020-06-04 at 11:41 -0700, Andres Freund wrote:
> > +/* minimum number of initial hash table buckets */
> > +#define HASHAGG_MIN_BUCKETS 256
> > 
> > 
> > I don't really see much explanation for that part in the commit,
> > perhaps
> > Jeff can chime in?
> 
> I did this in response to a review comment (point #5):
> 
> 
> https://www.postgresql.org/message-id/20200219191636.gvdywx32kwbix6kd@development
> 
> Tomas suggested a min of 1024, and I thought I was being more
> conservative choosing 256. Still too high, I guess?

> I can lower it. What do you think is a reasonable minimum?

I don't really see why there needs to be a minimum bigger than 1?

Greetings,

Andres Freund



Re: significant slowdown of HashAggregate between 9.6 and 10

From: Tomas Vondra
Date: 2020-06-05
On Thu, Jun 04, 2020 at 06:57:58PM -0700, Andres Freund wrote:
>Hi,
>
>On 2020-06-04 18:22:03 -0700, Jeff Davis wrote:
>> On Thu, 2020-06-04 at 11:41 -0700, Andres Freund wrote:
>> > +/* minimum number of initial hash table buckets */
>> > +#define HASHAGG_MIN_BUCKETS 256
>> >
>> >
>> > I don't really see much explanation for that part in the commit,
>> > perhaps
>> > Jeff can chime in?
>>
>> I did this in response to a review comment (point #5):
>>
>>
>> https://www.postgresql.org/message-id/20200219191636.gvdywx32kwbix6kd@development
>>
>> Tomas suggested a min of 1024, and I thought I was being more
>> conservative choosing 256. Still too high, I guess?
>
>> I can lower it. What do you think is a reasonable minimum?
>
>I don't really see why there needs to be a minimum bigger than 1?
>

I think you're right. I think I was worried about having to resize the
hash table in case of an under-estimate, and it seemed fine to waste a
tiny bit more memory to prevent that. But this example shows we may need
to scan the hash table sequentially, which means it's not just about
memory consumption. So in hindsight we either don't need the limit at
all, or maybe it could be much lower (IIRC a larger table reduces the
probability of collisions, but maybe dynahash does that internally anyway).

I wonder if hashjoin has the same issue, but probably not - I don't
think we'll ever scan that internal hash table sequentially.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: significant slowdown of HashAggregate between 9.6 and 10

From: Andres Freund
Date:
Hi,

On 2020-06-05 15:25:26 +0200, Tomas Vondra wrote:
> I think you're right. I think I was worried about having to resize the
> hash table in case of an under-estimate, and it seemed fine to waste a
> tiny bit more memory to prevent that.

It's pretty cheap to resize a hashtable with a handful of entries, so I'm not
worried about that. It's also how it has worked for a *long* time, so I think
unless we have some good reason to change that, I wouldn't.


> But this example shows we may need to scan the hash table
> sequentially, which means it's not just about memory consumption.

We *always* scan the hashtable sequentially, no? Otherwise there's no way to
get at the aggregated data.


> So in hindsight we either don't need the limit at all, or maybe it
> could be much lower (IIRC it reduces probability of collision, but
> maybe dynahash does that anyway internally).

This is code using simplehash, which resizes at a load factor of 0.9.
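
And the resize cost with a small group count is trivial either way - a quick
back-of-the-envelope model (assuming doubling growth at the 0.9 fill factor;
again a toy model, not the simplehash code):

#include <stdio.h>

int
main(void)
{
    int size = 2;       /* hypothetical initial size, were the minimum ~1 */
    long moved = 0;     /* entries rehashed across all grow operations */

    for (int n = 0; n < 75; n++)    /* insert the benchmark's 75 groups */
    {
        if (n + 1 > size * 0.9)     /* would exceed the fill factor */
        {
            moved += n;             /* growing rehashes what's stored */
            size *= 2;
        }
    }

    /* prints "final size 128, entries moved 110": six resizes that
     * collectively rehash 110 entries -- cheap, and paid only once,
     * not once per rescan */
    printf("final size %d, entries moved %ld\n", size, moved);
    return 0;
}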


> I wonder if hashjoin has the same issue, but probably not - I don't
> think we'll ever scan that internal hash table sequentially.

I think we do for some outer joins (cf. ExecPrepHashTableForUnmatched()), but
it's probably not relevant performance-wise.

Greetings,

Andres Freund