Thread: Hash join gets slower as work_mem increases?

From:
Albe Laurenz
Date:

I have a query that runs *slower* if I increase work_mem.

The execution plans are identical in both cases, except that a temp file
is used when work_mem is smaller.

The relevant lines of EXPLAIN ANALYZE output are:

With work_mem='100MB':
->  Hash Join  (cost=46738.74..285400.61 rows=292 width=8) (actual time=4296.986..106087.683 rows=187222 loops=1)
      Hash Cond: ("*SELECT* 1_2".postadresse_id = p.postadresse_id)
      Buffers: shared hit=1181177 dirtied=1, temp read=7232 written=7230

With work_mem='500MB':
->  Hash Join  (cost=46738.74..285400.61 rows=292 width=8) (actual time=3802.849..245970.049 rows=187222 loops=1)
      Hash Cond: ("*SELECT* 1_2".postadresse_id = p.postadresse_id)
      Buffers: shared hit=1181175 dirtied=111

I ran operf on both backends, and they look quite similar, except that the
number of samples is different (this is "opreport -c" output):

CPU: Intel Sandy Bridge microarchitecture, speed 2899.8 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples  %        image name               symbol name
-------------------------------------------------------------------------------
  112       0.0019  postgres                 ExecProcNode
  3020116  49.9904  postgres                 ExecScanHashBucket
  3021162  50.0077  postgres                 ExecHashJoin
3020116  92.8440  postgres                 ExecScanHashBucket
  3020116  49.9207  postgres                 ExecScanHashBucket [self]
  3020116  49.9207  postgres                 ExecScanHashBucket
  8190      0.1354  vmlinux                  apic_timer_interrupt

What could be an explanation for this?
Is this known behaviour?

Yours,
Laurenz Albe

From:
Pavel Stehule
Date:

Hi



I ran operf on both backends, and they look quite similar, except that the
number of samples is different (this is "opreport -c" output):

CPU: Intel Sandy Bridge microarchitecture, speed 2899.8 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples  %        image name               symbol name
-------------------------------------------------------------------------------
  112       0.0019  postgres                 ExecProcNode
  3020116  49.9904  postgres                 ExecScanHashBucket
  3021162  50.0077  postgres                 ExecHashJoin
3020116  92.8440  postgres                 ExecScanHashBucket
  3020116  49.9207  postgres                 ExecScanHashBucket [self]
  3020116  49.9207  postgres                 ExecScanHashBucket
  8190      0.1354  vmlinux                  apic_timer_interrupt

What could be an explanation for this?
Is this known behaviour?

one issue was fixed in 9.5

large hash table can introduce a lot of outs from L1, L2 caches.

Pavel
 

Yours,
Laurenz Albe

--
Sent via pgsql-performance mailing list ()
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

From:
Tomas Vondra
Date:

Hi,

On 01/29/2016 04:17 PM, Albe Laurenz wrote:
> I have a query that runs *slower* if I increase work_mem.
>
> The execution plans are identical in both cases, except that a temp file
> is used when work_mem is smaller.
>
> The relevant lines of EXPLAIN ANALYZE output are:
>
> With work_mem='100MB':
> ->  Hash Join  (cost=46738.74..285400.61 rows=292 width=8) (actual time=4296.986..106087.683 rows=187222 loops=1)
>        Hash Cond: ("*SELECT* 1_2".postadresse_id = p.postadresse_id)
>        Buffers: shared hit=1181177 dirtied=1, temp read=7232 written=7230
>
> With work_mem='500MB':
> ->  Hash Join  (cost=46738.74..285400.61 rows=292 width=8) (actual time=3802.849..245970.049 rows=187222 loops=1)
>        Hash Cond: ("*SELECT* 1_2".postadresse_id = p.postadresse_id)
>        Buffers: shared hit=1181175 dirtied=111
>
> I ran operf on both backends, and they look quite similar, except that the
> number of samples is different (this is "opreport -c" output):
>
> CPU: Intel Sandy Bridge microarchitecture, speed 2899.8 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 90000
> samples  %        image name               symbol name
> -------------------------------------------------------------------------------
>    112       0.0019  postgres                 ExecProcNode
>    3020116  49.9904  postgres                 ExecScanHashBucket
>    3021162  50.0077  postgres                 ExecHashJoin
> 3020116  92.8440  postgres                 ExecScanHashBucket
>    3020116  49.9207  postgres                 ExecScanHashBucket [self]
>    3020116  49.9207  postgres                 ExecScanHashBucket
>    8190      0.1354  vmlinux                  apic_timer_interrupt
>
> What could be an explanation for this?
> Is this known behaviour?

There is a bunch of possible causes for such behavior, but it's quite
impossible to say if this is an example of one of them as you have not
posted the interesting parts of the explain plan. Also, knowing
PostgreSQL version would be useful.

I don't think the example you posted is due to exceeding on-CPU cache as
that's just a few MBs per socket, so the smaller work_mem is
significantly larger.

What I'd expect to be the issue here is under-estimate of the hash table
size, resulting in too few buckets and thus long chains of tuples that
need to be searched sequentially. Smaller work_mem values usually limit
the length of those chains in favor of batching.

Please, post the whole explain plan - especially the info about number
of buckets/batches and the Hash node details.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From:
Albe Laurenz
Date:

Tomas Vondra wrote:
> On 01/29/2016 04:17 PM, Albe Laurenz wrote:
>> I have a query that runs *slower* if I increase work_mem.
>>
>> The execution plans are identical in both cases, except that a temp file
>> is used when work_mem is smaller.

>> What could be an explanation for this?
>> Is this known behaviour?
> 
> There is a bunch of possible causes for such behavior, but it's quite
> impossible to say if this is an example of one of them as you have not
> posted the interesting parts of the explain plan. Also, knowing
> PostgreSQL version would be useful.
> 
> I don't think the example you posted is due to exceeding on-CPU cache as
> that's just a few MBs per socket, so the smaller work_mem is
> significantly larger.
> 
> What I'd expect to be the issue here is under-estimate of the hash table
> size, resulting in too few buckets and thus long chains of tuples that
> need to be searched sequentially. Smaller work_mem values usually limit
> the length of those chains in favor of batching.
> 
> Please, post the whole explain plan - especially the info about number
> of buckets/batches and the Hash node details.

Thanks for looking at this.
Sorry, I forgot to mention that this is PostgreSQL 9.3.10.

I didn't post the whole plan since it is awfully long, I'll include hyperlinks
for the whole plan.

work_mem = '100MB' (http://explain.depesz.com/s/7b6a):

->  Hash Join  (cost=46738.74..285400.61 rows=292 width=8) (actual time=4296.986..106087.683 rows=187222 loops=1)
       Hash Cond: ("*SELECT* 1_2".postadresse_id = p.postadresse_id)
       Buffers: shared hit=1181177 dirtied=1, temp read=7232 written=7230
[...]
       ->  Hash  (cost=18044.92..18044.92 rows=4014 width=8) (actual time=4206.892..4206.892 rows=3096362 loops=1)
             Buckets: 1024  Batches: 2 (originally 1)  Memory Usage: 102401kB
             Buffers: shared hit=1134522 dirtied=1, temp written=5296

work_mem = '500MB' (http://explain.depesz.com/s/Cgkl):

->  Hash Join  (cost=46738.74..285400.61 rows=292 width=8) (actual time=3802.849..245970.049 rows=187222 loops=1)
       Hash Cond: ("*SELECT* 1_2".postadresse_id = p.postadresse_id)
       Buffers: shared hit=1181175 dirtied=111
[...]
       ->  Hash  (cost=18044.92..18044.92 rows=4014 width=8) (actual time=3709.584..3709.584 rows=3096360 loops=1)
             Buckets: 1024  Batches: 1  Memory Usage: 120952kB
             Buffers: shared hit=1134520 dirtied=111

Does that support your theory?

There is clearly an underestimate here, caused by correlated attributes, but
is that the cause for the bad performance with increased work_mem?

Yours,
Laurenz Albe

From:
Tomas Vondra
Date:

On 02/01/2016 10:38 AM, Albe Laurenz wrote:
> Tomas Vondra wrote:
...
> I didn't post the whole plan since it is awfully long, I'll include hyperlinks
> for the whole plan.
>
> work_mem = '100MB' (http://explain.depesz.com/s/7b6a):
>
> ->  Hash Join  (cost=46738.74..285400.61 rows=292 width=8) (actual time=4296.986..106087.683 rows=187222 loops=1)
>         Hash Cond: ("*SELECT* 1_2".postadresse_id = p.postadresse_id)
>         Buffers: shared hit=1181177 dirtied=1, temp read=7232 written=7230
> [...]
>         ->  Hash  (cost=18044.92..18044.92 rows=4014 width=8) (actual time=4206.892..4206.892 rows=3096362 loops=1)
>               Buckets: 1024  Batches: 2 (originally 1)  Memory Usage: 102401kB
>               Buffers: shared hit=1134522 dirtied=1, temp written=5296
>
> work_mem = '500MB' (http://explain.depesz.com/s/Cgkl):
>
> ->  Hash Join  (cost=46738.74..285400.61 rows=292 width=8) (actual time=3802.849..245970.049 rows=187222 loops=1)
>         Hash Cond: ("*SELECT* 1_2".postadresse_id = p.postadresse_id)
>         Buffers: shared hit=1181175 dirtied=111
> [...]
>         ->  Hash  (cost=18044.92..18044.92 rows=4014 width=8) (actual time=3709.584..3709.584 rows=3096360 loops=1)
>               Buckets: 1024  Batches: 1  Memory Usage: 120952kB
>               Buffers: shared hit=1134520 dirtied=111
>
> Does that support your theory?
>
> There is clearly an underestimate here, caused by correlated attributes, but
> is that the cause for the bad performance with increased work_mem?

Yes, that's clearly the culprit here. In both cases we estimate here are
only ~4000 tuples in the hash, and 9.3 sizes the hash table to have at
most ~10 tuples per bucket (in a linked list).

However we actually get ~3M rows, so there will be ~3000 tuples per
bucket, and that's extremely expensive to walk. The reason why 100MB is
faster is that it's using 2 batches, thus making the lists "just" ~1500
tuples long.

This is pretty much exactly the reason why I reworked hash joins in 9.5.
I'd bet it's going to be ~20x faster on that version.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From:
Albe Laurenz
Date:

Tomas Vondra wrote:
> Yes, that's clearly the culprit here. In both cases we estimate here are
> only ~4000 tuples in the hash, and 9.3 sizes the hash table to have at
> most ~10 tuples per bucket (in a linked list).
> 
> However we actually get ~3M rows, so there will be ~3000 tuples per
> bucket, and that's extremely expensive to walk. The reason why 100MB is
> faster is that it's using 2 batches, thus making the lists "just" ~1500
> tuples long.
> 
> This is pretty much exactly the reason why I reworked hash joins in 9.5.
> I'd bet it's going to be ~20x faster on that version.

Thank you for the explanation!

Yours,
Laurenz Albe