Thread: Hash Join performance

Hash Join performance

From: Vamsidhar Thummala

From the documentation, I understand that the range of "actual time" represents the time taken to retrieve the first result and the last result, respectively. However, the following EXPLAIN ANALYZE output confuses me:
GroupAggregate  (cost=632185.58..632525.55 rows=122884 width=57) (actual time=187382.499..187383.241 rows=57 loops=1)
  ->  Sort  (cost=632185.58..632201.78 rows=122884 width=57) (actual time=187167.792..187167.905 rows=399 loops=1)
        Sort Key: orders.o_totalprice, orders.o_orderdate, customer.c_name, customer.c_custkey, orders.o_orderkey
        ->  Hash Join  (cost=399316.78..629388.21 rows=122884 width=57) (actual time=122805.133..186107.210 rows=399 loops=1)
              Hash Cond: (public.lineitem.l_orderkey = orders.o_orderkey)
              ->  Seq Scan on lineitem  (cost=0.00..163912.71 rows=6000742 width=14) (actual time=0.022..53597.555 rows=6001215 loops=1)
              ->  Hash  (cost=398960.15..398960.15 rows=30713 width=51) (actual time=112439.592..112439.592 rows=57 loops=1)
                    ->  Hash Join  (cost=369865.37..398960.15 rows=30713 width=51) (actual time=80638.283..111855.510 rows=57 loops=1)
                          Hash Cond: (orders.o_custkey = customer.c_custkey)
                          ->  Nested Loop  (cost=364201.67..391753.70 rows=30713 width=29) (actual time=75598.246..107760.054 rows=57 loops=1)
                                ->  GroupAggregate  (cost=364201.67..366634.97 rows=30713 width=14) (actual time=75427.115..96167.167 rows=57 loops=1)
                                      Filter: (sum(l_quantity) > 300::numeric)
                                      ->  Sort  (cost=364201.67..364992.54 rows=6000742 width=14) (actual time=74987.112..86289.063 rows=6001215 loops=1)
                                            Sort Key: public.lineitem.l_orderkey
                                            ->  Seq Scan on lineitem  (cost=0.00..163912.71 rows=6000742 width=14) (actual time=0.006..51153.880 rows=6001215 loops=1)
                                ->  Index Scan using orders_pkey on orders  (cost=0.00..0.81 rows=1 width=25) (actual time=169.485..173.006 rows=1 loops=57)
                                      Index Cond: (orders.o_orderkey = "IN_subquery".l_orderkey)
                          ->  Hash  (cost=4360.96..4360.96 rows=150072 width=26) (actual time=998.378..998.378 rows=150000 loops=1)
                                ->  Seq Scan on customer  (cost=0.00..4360.96 rows=150072 width=26) (actual time=8.188..883.778 rows=150000 loops=1)
Total runtime: 187644.927 ms
(20 rows)

My settings: Memory - 1GB, Data size - 1GB, Lineitem ~ 650MB, shared_buffers: 200MB, work_mem: 1MB.
PostgreSQL version: 8.2, OS: Sun Solaris 10u4

Query: TPC-H Q18, the Large Volume Customer query
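
For reference, this is the standard Q18 text (with the quantity parameter at 300, matching the Filter line above; Q18's usual 100-row limit was not applied here, which is why the plan shows no Limit node):

    select c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice,
           sum(l_quantity)
    from customer, orders, lineitem
    where o_orderkey in (
            select l_orderkey
            from lineitem
            group by l_orderkey
            having sum(l_quantity) > 300)
      and c_custkey = o_custkey
      and o_orderkey = l_orderkey
    group by c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
    order by o_totalprice desc, o_orderdate;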
Questions:

1) The actual time on the Seq Scan on lineitem shows that the first record is fetched at 0.022ms and the last record at 53.5s. Does that mean the sequential scan completed within the first 53.5s of the query (absolute time)? Or does it mean the sequential scan started at 112.4s (after the build phase of the Hash Join) and finished at 165.9s (112.4s + 53.5s)? My understanding is that the former is true. If so, the sequential scan has to have fetched around 6M records (~650MB) ahead of the build phase of the Hash Join, which seems surprising. Is this called prefetching at the DB level? Where does the DB hold all these records? It certainly can't hold them in shared_buffers, which is only 200MB.

2) Why is the Hash Join (top most) so slow? The hash is built over the output of the subplan, which produces 57 records (~20KB); we can assume these 57 records fit in work_mem. Now, the Hash Join produces its first record at 122.8s, whereas the hash build completes at 112.4s (a 10.4s gap; in some cases I have seen this gap be even worse). Also, the total time for the Hash Join is 63.3s (186.1s - 122.8s), which seems too high given that lineitem is already in the buffer. What is happening here?

Appreciate your help!

Regards,
~Vamsi

Re: Hash Join performance

From: Tom Lane
Date: Fri, Mar 13, 2009, 5:34 PM
Vamsidhar Thummala <vamsi@cs.duke.edu> writes:
> 1) The actual time on Seq Scan on Lineitem shows that the first record is
> fetched at time 0.022ms and the last record is fetched at 53.5s. Does it
> mean the sequential scan is completed with-in first 53.4s (absolute time)?

No, it means that we spent a total of 53.5 seconds executing that plan
node and its children.  There's no direct way to determine how that was
interleaved with the execution of a peer plan node.  In the particular
case here, since that seqscan is the outer child of a hash join, you
can infer that all the time charged to the inner child (the Hash node
and its children) happened first, while we were building the hashtable,
which is then probed for each row of the outer relation.
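
To see that accounting in isolation, here is a toy sketch (hypothetical tables; the numbers you would get are machine-dependent):

    -- small inner relation, large outer relation
    CREATE TABLE small AS SELECT g AS k FROM generate_series(1, 100) g;
    CREATE TABLE big   AS SELECT g % 100 AS k FROM generate_series(1, 1000000) g;
    ANALYZE small;
    ANALYZE big;

    EXPLAIN ANALYZE SELECT count(*) FROM big JOIN small USING (k);
    -- Expected shape: the Hash node's total time is the build time, and
    -- the Hash Join's startup time is at least that, because no row can
    -- come out before the hash table is complete:
    --   Hash Join  (actual time=<build + first match>..<total> ...)
    --     ->  Seq Scan on big    <- outer; probed row by row
    --     ->  Hash  (actual time=<build>..<build> ...)
    --           ->  Seq Scan on small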

> 2) Why is the Hash Join (top most) so slow?

Doesn't look that bad to me.  The net time charged to the HashJoin node
is 186107.210 - 53597.555 - 112439.592 = 20070.063 msec.  In addition it
would be reasonable to count the hashtable build time, which evidently
is 112439.592 - 111855.510 = 584.082 msec.  So the hashtable build took
about 10 msec/row, in addition to the data fetching; and then the actual
join spent about 3 microsec per outer row, again exclusive of obtaining
those rows.  The table build seems a bit slow, maybe, but I don't see a
problem with the join speed.

            regards, tom lane

Re: Hash Join performance

From: Vamsidhar Thummala
Thanks for such a quick response.

On Fri, Mar 13, 2009 at 5:34 PM, Tom Lane wrote:
>> 2) Why is the Hash Join (top most) so slow?
>
> Doesn't look that bad to me.  The net time charged to the HashJoin node
> is 186107.210 - 53597.555 - 112439.592 = 20070.063 msec.  In addition it
> would be reasonable to count the hashtable build time, which evidently
> is 112439.592 - 111855.510 = 584.082 msec.  So the hashtable build took
> about 10 msec/row, in addition to the data fetching; and then the actual
> join spent about 3 microsec per outer row, again exclusive of obtaining
> those rows.  The table build seems a bit slow, maybe, but I don't see a
> problem with the join speed.

I am wondering why we are subtracting the entire Seq Scan time of lineitem from the total time to calculate the Hash Join time.
Does the hash probing start as soon as the first record of lineitem is available, i.e., after 112439.592ms?

Here is another plan I have for the same TPC-H 18 query with different configuration parameters (shared_buffers set to 400MB, just for experimental purposes), and the Hash Join seems to take longer (at least 155.58s based on the above calculation):

GroupAggregate  (cost=905532.09..912092.04 rows=119707 width=57) (actual time=392705.160..392705.853 rows=57 loops=1)
  ->  Sort  (cost=905532.09..906082.74 rows=119707 width=57) (actual time=392705.116..392705.220 rows=399 loops=1)
        Sort Key: orders.o_totalprice, orders.o_orderdate, customer.c_name, customer.c_custkey, orders.o_orderkey
        ->  Hash Join  (cost=507550.05..877523.36 rows=119707 width=57) (actual time=72616.327..392703.675 rows=399 loops=1)
              Hash Cond: (public.lineitem.l_orderkey = orders.o_orderkey)
              ->  Seq Scan on lineitem  (cost=0.00..261655.05 rows=6000947 width=14) (actual time=0.027..178712.709 rows=6001215 loops=1)
              ->  Hash  (cost=506580.84..506580.84 rows=29921 width=51) (actual time=58421.050..58421.050 rows=57 loops=1)
                    ->  Hash Join  (cost=416568.25..506580.84 rows=29921 width=51) (actual time=25208.925..58419.502 rows=57 loops=1)
                          Hash Cond: (orders.o_custkey = customer.c_custkey)
                          ->  Merge IN Join  (cost=405349.14..493081.88 rows=29921 width=29) (actual time=37.244..57646.024 rows=57 loops=1)
                                Merge Cond: (orders.o_orderkey = "IN_subquery".l_orderkey)
                                ->  Index Scan using orders_pkey on orders  (cost=0.00..79501.17 rows=1499952 width=25) (actual time=0.100..5379.828 rows=1496151 loops=1)
                                ->  Materialize  (cost=405349.14..406004.72 rows=29921 width=4) (actual time=34.825..51619.816 rows=57 loops=1)
                                      ->  GroupAggregate  (cost=0.00..404639.71 rows=29921 width=14) (actual time=34.818..51619.488 rows=57 loops=1)
                                            Filter: (sum(l_quantity) > 300::numeric)
                                            ->  Index Scan using fkey_lineitem_1 on lineitem  (cost=0.00..348617.14 rows=6000947 width=14) (actual time=0.079..44140.117 rows=6001215 loops=1)
                          ->  Hash  (cost=6803.60..6803.60 rows=149978 width=26) (actual time=640.980..640.980 rows=150000 loops=1)
                                ->  Seq Scan on customer  (cost=0.00..6803.60 rows=149978 width=26) (actual time=0.021..510.993 rows=150000 loops=1)

I re-ran the query multiple times to verify the accuracy of the results.

Regards,
~Vamsi

Re: Hash Join performance

From: Tom Lane
Date: Fri, Mar 13, 2009, 7:08 PM
Vamsidhar Thummala <vamsi@cs.duke.edu> writes:
> I am wondering why we are subtracting the entire Seq Scan time of lineitem
> from the total time to calculate the Hash Join time.

Well, if you're trying to identify the speed of the join itself and not
how long it takes to provide the input for it, that seems like a
sensible calculation to make.

> Here is another plan I have for the same TPC-H 18 query with different
> configuration parameters (shared_buffers set to 400MB, just for experimental
> purposes), and the Hash Join seems to take longer (at least 155.58s based on
> the above calculation):

Yeah, that seems to work out to about 25us per row instead of 3us, which
is a lot slower.  Maybe the hash got split up into multiple batches ...
what have you got work_mem set to?  Try turning on log_temp_files and
see if it records any temp files as getting created.
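
(Note that the number of batches is chosen up front from the planner's estimate of the inner relation size -- here 29921 rows of width 51, which would not fit in a 1MB work_mem -- not from the 57 rows actually produced, so a multi-batch hash is possible even though the finished table is tiny.) For reference, a minimal sketch of the log_temp_files check (it is a superuser setting; the value is a size threshold in kB, and 0 logs every temp file):

    SET log_temp_files = 0;
    -- re-run the query, then look in the server log for lines like:
    --   LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp12345.0", size 1445888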

            regards, tom lane

Re: Hash Join performance

From: Vamsidhar Thummala

On Fri, Mar 13, 2009 at 7:08 PM, Tom Lane wrote:
>> I am wondering why we are subtracting the entire Seq Scan time of lineitem
>> from the total time to calculate the Hash Join time.
>
> Well, if you're trying to identify the speed of the join itself and not
> how long it takes to provide the input for it, that seems like a
> sensible calculation to make.

I am still not clear on this. I am thinking the output is produced in a pipelined fashion, i.e., as soon as a record of the outer child is read (sequentially here), and if the Hash Join finds a match by probing the inner hash table (in memory), we have an output record. Please correct me if I am wrong here.
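
For instance, on a toy pair of tables like the sketch upthread (again hypothetical, just to check my mental model), I would expect a LIMIT to return almost immediately after the build finishes, precisely because the probe side streams:

    EXPLAIN ANALYZE
    SELECT * FROM big JOIN small USING (k) LIMIT 10;
    -- If the probe is pipelined, the Hash Join's total time should barely
    -- exceed its startup time: the build must complete first, but output
    -- rows are emitted as soon as outer rows find matches.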

>> Here is another plan I have for the same TPC-H 18 query with different
>> configuration parameters (shared_buffers set to 400MB, just for experimental
>> purposes), and the Hash Join seems to take longer (at least 155.58s based on
>> the above calculation):
>
> Yeah, that seems to work out to about 25us per row instead of 3us, which
> is a lot slower.  Maybe the hash got split up into multiple batches ...
> what have you got work_mem set to?  Try turning on log_temp_files and
> see if it records any temp files as getting created.

Unfortunately, I am working with PostgreSQL 8.2, which doesn't have log_temp_files. work_mem is still at 1MB (all other parameters were kept the same as in the previous configuration, apart from shared_buffers). The hash is built on the 57 records (~20KB; the customer row length is 179 bytes and the orders row length is 104 bytes) produced by the inner subplan, so I will be surprised if multiple batches are created.
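
Is there another way to check this on 8.2? Two things I can think of (a sketch with assumptions: pg_ls_dir() needs superuser rights, and I am guessing the default tablespace keeps its temp files under base/pgsql_tmp, which may differ by version):

    -- watch for spill files while the query runs (path is an assumption)
    SELECT pg_ls_dir('base/pgsql_tmp');

    -- or re-run Q18 with a larger work_mem for the session; if batching
    -- is the culprit, the Hash Join time should drop sharply
    SET work_mem = '64MB';
    -- EXPLAIN ANALYZE ... (the Q18 query from the top of the thread)
    RESET work_mem;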

Thank you.

Regards,
-Vamsi