Re: Why is a hash join preferred when it does not fit in work_mem - Mailing list pgsql-general

From Dimitrios Apostolou
Subject Re: Why is a hash join preferred when it does not fit in work_mem
Msg-id a32ad594-5c54-7bb5-dcff-bd12f8ed368b@gmx.net
In response to Re: Why is a hash join preferred when it does not fit in work_mem  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: Why is a hash join preferred when it does not fit in work_mem
List pgsql-general
On Fri, 13 Jan 2023, David Rowley wrote:
>
> I'd expect reducing random_page_cost to make the Merge Join cheaper as
> that's where the Index Scan is. I'm not quite sure where you think the
> random I/O is coming from in a batched hash join.

Thanks for the feedback; indeed you are right! Decreasing random_page_cost
to values way below the default makes the planner prefer the merge join,
which seems strange to me.
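
For reference, the flip happens with something like the following (the 1.1
here is only an example of "way below the default" of 4.0; I did not bisect
the exact threshold):

# SET random_page_cost = 1.1;
SET
# EXPLAIN SELECT * FROM tasks_mm_workitems NATURAL JOIN workitem_ids;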

Please correct me if I'm wrong, as I'm a newcomer to PostgreSQL, but here
is how I understand things, based on posts I've read and on the classical
algorithms:

+ The Hash Join is fastest when one side fits in work_mem. Then on one
   hand you have a hash table lookup (amortized O(1)) and on the other
   hand, if the probing table has M rows that do not fit in memory, you
   have sequential reads from the disk (given low fragmentation of the
   table or index files): for every row you read from the disk, you look
   up the key in the hash table.

   If the hash table does not fit in RAM then the cost becomes prohibitive:
   every lookup is a random access, possibly hitting the disk, and the
   total cost should be random_page_cost * M.

+ The Merge Join involves mostly sequential accesses, if the disk files
   are not fragmented: it reads sequentially and in parallel from both
   tables, merging the results where the key matches.

   It requires sorting on disk first (because the tables don't fit in
   work_mem), but even this operation involves little disk seeking: an
   external merge-sort might have a random access cost of only
   logN * random_page_cost.

So I would expect an increased random_page_cost to benefit the Merge Join.
And since my setup involves spinning disks, it does make sense to
increase it.
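
Putting rough numbers on that (row counts taken from the plans below; this
is only back-of-envelope arithmetic under my model above, which may well be
where I go wrong):

# SELECT 3702994 * 4.0         AS hash_probe_cost,  -- M * random_page_cost
         log(2, 3702994) * 4.0 AS sort_seek_cost;   -- logN * random_page_cost

The first comes out around 15 million page units, the second under a
hundred, which is why the planner's preference for the hash join surprises
me.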


> It would be interesting to see the same plans with SET track_io_timing
> = on; set.  It's possible that there's less *actual* I/O going on with
> the Merge Join plan vs the Hash Join plan.  Since we do buffered I/O,
> without track_io_timing, we don't know if the read buffers resulted in
> an actual disk read or a read from the kernel buffers.


The database has been VACUUM ANALYZEd first and is otherwise idle.
Every query has been run twice, and I paste here only the 2nd run.
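
Concretely, the preparation was along these lines:

# VACUUM ANALYZE;
VACUUM
# SET track_io_timing = on;
SET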


Slow Hash Join:

# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM tasks_mm_workitems NATURAL JOIN workitem_ids;

 Hash Join  (cost=121222.68..257633.01 rows=3702994 width=241) (actual time=145641.295..349682.387 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n, workitem_ids.workitem_id
   Inner Unique: true
   Hash Cond: (tasks_mm_workitems.workitem_n = workitem_ids.workitem_n)
   Buffers: shared hit=12121 read=50381, temp read=56309 written=56309
   I/O Timings: shared/local read=745.925, temp read=162199.307 write=172758.699
   ->  Seq Scan on public.tasks_mm_workitems  (cost=0.00..53488.94 rows=3702994 width=8) (actual time=0.114..1401.896 rows=3702994 loops=1)
         Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
         Buffers: shared hit=65 read=16394
         I/O Timings: shared/local read=183.959
   ->  Hash  (cost=59780.19..59780.19 rows=1373719 width=237) (actual time=145344.555..145344.557 rows=1373737 loops=1)
         Output: workitem_ids.workitem_id, workitem_ids.workitem_n
         Buckets: 4096  Batches: 512  Memory Usage: 759kB
         Buffers: shared hit=12056 read=33987, temp written=43092
         I/O Timings: shared/local read=561.966, temp write=142221.740
         ->  Seq Scan on public.workitem_ids  (cost=0.00..59780.19 rows=1373719 width=237) (actual time=0.033..1493.652 rows=1373737 loops=1)
               Output: workitem_ids.workitem_id, workitem_ids.workitem_n
               Buffers: shared hit=12056 read=33987
               I/O Timings: shared/local read=561.966
 Settings: effective_cache_size = '500MB', enable_memoize = 'off', hash_mem_multiplier = '1', max_parallel_workers_per_gather = '1', work_mem = '1MB'
 Planning:
   Buffers: shared hit=8
 Planning Time: 0.693 ms
 Execution Time: 350290.496 ms
(24 rows)
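
For scale: the hashed side is ~1.4M rows of width 237 (roughly 300MB of raw
data), which with work_mem = 1MB lines up with the 512 batches above. The
on-disk sizes can be double-checked with:

# SELECT pg_size_pretty(pg_table_size('workitem_ids'))       AS ids_size,
         pg_size_pretty(pg_table_size('tasks_mm_workitems')) AS mm_size;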


Fast Merge Join:

# SET enable_hashjoin TO off;
SET

# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM tasks_mm_workitems NATURAL JOIN workitem_ids;

 Merge Join  (cost=609453.49..759407.78 rows=3702994 width=241) (actual time=4602.623..9700.435 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n, workitem_ids.workitem_id
   Merge Cond: (workitem_ids.workitem_n = tasks_mm_workitems.workitem_n)
   Buffers: shared hit=5310 read=66086, temp read=32621 written=32894
   I/O Timings: shared/local read=566.121, temp read=228.063 write=526.739
   ->  Index Scan using workitem_ids_pkey on public.workitem_ids  (cost=0.43..81815.86 rows=1373719 width=237) (actual time=0.034..1080.800 rows=1373737 loops=1)
         Output: workitem_ids.workitem_n, workitem_ids.workitem_id
         Buffers: shared hit=5310 read=49627
         I/O Timings: shared/local read=448.952
   ->  Materialize  (cost=609372.91..627887.88 rows=3702994 width=8) (actual time=4602.576..6621.072 rows=3702994 loops=1)
         Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
         Buffers: shared read=16459, temp read=32621 written=32894
         I/O Timings: shared/local read=117.168, temp read=228.063 write=526.739
         ->  Sort  (cost=609372.91..618630.40 rows=3702994 width=8) (actual time=4602.569..5414.072 rows=3702994 loops=1)
               Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
               Sort Key: tasks_mm_workitems.workitem_n
               Sort Method: external merge  Disk: 65256kB
               Buffers: shared read=16459, temp read=32621 written=32894
               I/O Timings: shared/local read=117.168, temp read=228.063 write=526.739
               ->  Seq Scan on public.tasks_mm_workitems  (cost=0.00..53488.94 rows=3702994 width=8) (actual time=0.034..1113.868 rows=3702994 loops=1)
                     Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
                     Buffers: shared read=16459
                     I/O Timings: shared/local read=117.168
 Settings: effective_cache_size = '500MB', enable_hashjoin = 'off', enable_memoize = 'off', hash_mem_multiplier = '1', max_parallel_workers_per_gather = '1', work_mem = '1MB'
 Planning:
   Buffers: shared hit=2 read=6
   I/O Timings: shared/local read=0.152
 Planning Time: 0.570 ms
 Execution Time: 12165.894 ms
(29 rows)


Regards,
Dimitris



