Re: [HACKERS] How to change order sort of table in HashJoin - Mailing list pgsql-general
From | Man |
---|---|
Subject | Re: [HACKERS] How to change order sort of table in HashJoin |
Date | |
Msg-id | be95a39d-c9e7-6640-7804-b90d4164172f@gmail.com Whole thread Raw |
In response to | Re: [HACKERS] How to change order sort of table in HashJoin (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: [HACKERS] How to change order sort of table in HashJoin
|
List | pgsql-general |
Thanks for response, sir. On 11/20/2016 1:18 AM, Tom Lane wrote: > Man Trieu <man.trieu@gmail.com> writes: >> As in the example below, i think the plan which hash table is created on >> testtbl2 (the fewer tuples) should be choosen. > The planner usually prefers to hash on the table that has a flatter > MCV histogram, since a hash table with many key collisions will be > inefficient. You might find it illuminating to read the comments around > estimate_hash_bucketsize(). Thanks, I will read it. Additional information. In 9.6 the second table (lesser tuple) was choosen (the same testdata). There are something (cost estimation?) different in previous versions. --- In 9.6.1 --- postgres=# explain analyze select * from testtbl1 inner join testtbl2 using(c1,c2,c3); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------- Hash Join (cost=6935.57..60389.58 rows=1 width=60) (actual time=80.214..1165.762 rows=142857 loops=1) Hash Cond: ((testtbl1.c1 = testtbl2.c1) AND (testtbl1.c2 = testtbl2.c2) AND (testtbl1.c3 = testtbl2.c3)) -> Seq Scan on testtbl1 (cost=0.00..21276.00 rows=1000000 width=56) (actual time=0.038..226.324 rows=1000000 loops=1) -> Hash (cost=3039.57..3039.57 rows=142857 width=56) (actual time=79.632..79.632 rows=142857 loops=1) Buckets: 65536 Batches: 4 Memory Usage: 3658kB -> Seq Scan on testtbl2 (cost=0.00..3039.57 rows=142857 width=56) (actual time=0.028..20.646 rows=142857 loops=1) Planning time: 0.252 ms Execution time: 1174.588 ms (8 rows) ------ --- In 9.4.10 --- postgres=# explain analyze select * from testtbl1 inner join testtbl2 using(c1,c2,c3); QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------- Hash Join (cost=48542.00..67353.86 rows=1 width=60) (actual time=880.580..1277.611 rows=142857 loops=1) Hash Cond: ((testtbl2.c1 = testtbl1.c1) AND (testtbl2.c2 = testtbl1.c2) AND (testtbl2.c3 = testtbl1.c3)) -> Seq Scan on testtbl2 (cost=0.00..3039.57 rows=142857 width=56) (actual time=0.016..24.421 rows=142857 loops=1) -> Hash (cost=21276.00..21276.00 rows=1000000 width=56) (actual time=878.296..878.296 rows=1000000 loops=1) Buckets: 8192 Batches: 32 Memory Usage: 2839kB -> Seq Scan on testtbl1 (cost=0.00..21276.00 rows=1000000 width=56) (actual time=0.025..258.193 rows=1000000 loops=1) Planning time: 2.683 ms Execution time: 1285.868 ms (8 rows) ------ > In general, given a hashtable that fits in memory and light bucket > loading, a hash join is more or less O(M) + O(N); it doesn't matter > so much whether the larger table is on the inside. It does matter if > the table gets big enough to force batching of the join, but that's > not happening in your example (at least not the first one; it's unclear > to me why it did happen in the second one). The key thing that will > drive the choice, then, is avoiding a skewed bucket distribution that > causes lots of comparisons for common values. > > regards, tom lane Thanks and best regards,
pgsql-general by date: