Thread: No hash join across partitioned tables?

No hash join across partitioned tables?

From

Kris Jurka

Date:

16 April 2009, 20:09:43

PG (8.3.7) doesn't seem to want to do a hash join across two partitioned
tables.  I have two partition hierarchies: impounds (with different
impound sources) and liens (with vehicle liens from different companies).
Trying to match those up gives:

EXPLAIN SELECT COUNT(*)
FROM impounds i
     JOIN liens l ON (i.vin = l.vin);

  Aggregate  (cost=11164042.66..11164042.67 rows=1 width=0)
    ->  Nested Loop  (cost=0.27..3420012.94 rows=3097611886 width=0)
          Join Filter: ((i.vin)::text = (l.vin)::text)
          ->  Append  (cost=0.00..1072.77 rows=33577 width=21)
                ->  Seq Scan on impounds i  (cost=0.00..11.40 rows=140 width=21)
                ->  Seq Scan on impounds_s1 i  (cost=0.00..926.87 rows=29587 width=18)
                ->  Seq Scan on impounds_s2 i  (cost=0.00..99.96 rows=3296 width=18)
                ->  Seq Scan on impounds_s3 i  (cost=0.00..23.14 rows=414 width=18)
                ->  Seq Scan on impounds_s4 i  (cost=0.00..11.40 rows=140 width=21)
          ->  Append  (cost=0.27..101.64 rows=15 width=21)
                ->  Bitmap Heap Scan on liens l  (cost=0.27..5.60 rows=2 width=21)
                      Recheck Cond: ((l.vin)::text = (i.vin)::text)
                      ->  Bitmap Index Scan on liens_pk  (cost=0.00..0.27 rows=2 width=0)
                            Index Cond: ((l.vin)::text = (i.vin)::text)
                ->  Index Scan using liens_s1_pk on liens_s1 l  (cost=0.00..7.02 rows=1 width=18)
                      Index Cond: ((l.vin)::text = (i.vin)::text)
                ->  Index Scan using liens_s2_pk on liens_s2 l  (cost=0.00..3.47 rows=1 width=21)
                      Index Cond: ((l.vin)::text = (i.vin)::text)
                ->  Index Scan using newliens_s3_pk on liens_s3 l  (cost=0.00..7.52 rows=1 width=18)
                      Index Cond: ((l.vin)::text = (i.vin)::text)
                ->  Index Scan using newliens_s4_pk on liens_s4 l  (cost=0.00..7.67 rows=1 width=18)
                      Index Cond: ((l.vin)::text = (i.vin)::text)
                ->  Index Scan using newliens_s5_pk on liens_s5 l  (cost=0.00..7.62 rows=1 width=18)
                      Index Cond: ((l.vin)::text = (i.vin)::text)
                ->  Index Scan using newliens_s6_pk on liens_s6 l  (cost=0.00..7.61 rows=1 width=18)
                      Index Cond: ((l.vin)::text = (i.vin)::text)
                ->  Index Scan using newliens_s7_pk on liens_s7 l  (cost=0.00..7.50 rows=1 width=18)
                      Index Cond: ((l.vin)::text = (i.vin)::text)
                ->  Index Scan using newliens_s8_pk on liens_s8 l  (cost=0.00..7.36 rows=1 width=18)
                      Index Cond: ((l.vin)::text = (i.vin)::text)
                ->  Index Scan using newliens_s9_pk on liens_s9 l  (cost=0.00..7.43 rows=1 width=18)
                      Index Cond: ((l.vin)::text = (i.vin)::text)
                ->  Index Scan using newliens_s10_pk on liens_s10 l  (cost=0.00..7.79 rows=1 width=18)
                      Index Cond: ((l.vin)::text = (i.vin)::text)
                ->  Index Scan using newliens_s11_pk on liens_s11 l  (cost=0.00..8.07 rows=1 width=18)
                      Index Cond: ((l.vin)::text = (i.vin)::text)
                ->  Index Scan using newliens_s12_pk on liens_s12 l  (cost=0.00..8.45 rows=1 width=18)
                      Index Cond: ((l.vin)::text = (i.vin)::text)
                ->  Index Scan using newliens_s13_pk on liens_s13 l  (cost=0.00..8.53 rows=1 width=18)
                      Index Cond: ((l.vin)::text = (i.vin)::text)


This takes quite a while as it's got to do tons of index probes which
results it tons of random IO.  I killed this after five minutes of
running.

But if I do:

CREATE TABLE i1 AS SELECT * FROM impounds;
CREATE TABLE l1 AS SELECT * FROM liens;

I get a reasonable plan, which runs in about 15 seconds, from:

EXPLAIN SELECT COUNT(*)
FROM i1 i
         JOIN l1 l ON (i.vin = l.vin);

  Aggregate  (cost=749054.78..749054.79 rows=1 width=0)
    ->  Hash Join  (cost=1444.18..748971.43 rows=33338 width=0)
          Hash Cond: ((l.vin)::text = (i.vin)::text)
          ->  Seq Scan on l1 l  (cost=0.00..332068.96 rows=18449996
width=18)
          ->  Hash  (cost=1027.97..1027.97 rows=33297 width=18)
                ->  Seq Scan on i1 i  (cost=0.00..1027.97 rows=33297
width=18)


I've tried to force the hash join plan on the partitioned tables via:

set enable_nestloop to off;

This results in a merge join plan which needs to do a giant sort, again
killed after five minutes.

  Aggregate  (cost=58285765.20..58285765.21 rows=1 width=0)
    ->  Merge Join  (cost=4077389.31..50541735.48 rows=3097611886 width=0)
          Merge Cond: ((i.vin)::text = (l.vin)::text)
          ->  Sort  (cost=4286.45..4370.39 rows=33577 width=21)
                Sort Key: i.vin
                ->  Append  (cost=0.00..1072.77 rows=33577 width=21)
                      ->  Seq Scan on impounds i  (cost=0.00..11.40 rows=140 width=21)
                      ->  [Seq Scans on other partitions]
          ->  Materialize  (cost=4073102.86..4303737.81 rows=18450796 width=21)
                ->  Sort  (cost=4073102.86..4119229.85 rows=18450796 width=21)
                      Sort Key: l.vin
                      ->  Append  (cost=0.00..332797.96 rows=18450796 width=21)
                            ->  Seq Scan on liens l  (cost=0.00..14.00 rows=400 width=21)
                            ->  [Seq Scans on other partitions]


Disabling mergejoin pushes it back to a nestloop join.  Why can't it hash
join these two together?

Kris Jurka

Re: No hash join across partitioned tables?

From

Tom Lane

Date:

16 April 2009, 20:12:15

Kris Jurka <books@ejurka.com> writes:
> PG (8.3.7) doesn't seem to want to do a hash join across two partitioned
> tables.

Could we see the whole declaration of these tables?  (pg_dump -s output
would be convenient)

            regards, tom lane

Re: No hash join across partitioned tables?

From

Kris Jurka

Date:

16 April 2009, 20:30:55

On Thu, 16 Apr 2009, Tom Lane wrote:

> Kris Jurka <books@ejurka.com> writes:
>> PG (8.3.7) doesn't seem to want to do a hash join across two partitioned
>> tables.
>
> Could we see the whole declaration of these tables?  (pg_dump -s output
> would be convenient)
>

The attached table definition with no data wants to mergejoin first, but
after disabling mergejoin it does indeed do a hashjoin.

Looking back at the cost estimates for the merge and nestloop joins, it
seems to be selecting the number of rows in the cartesian product * .005
while the number of output rows in this case is 2437 (cartesian product *
4e-9).  Perhaps the cost estimates for the real data are so high because
of this bogus row count that the fudge factor to disable mergejoin isn't
enough?

Kris Jurka

Attachment

hash-join-partition.sql

Re: No hash join across partitioned tables?

From

Kris Jurka

Date:

16 April 2009, 22:02:08

On Thu, 16 Apr 2009, Kris Jurka wrote:

> Perhaps the cost estimates for the real data are so high because of this
> bogus row count that the fudge factor to disable mergejoin isn't enough?
>

Indeed, I get these cost estimates on 8.4b1 with an increased
disable_cost value:

nestloop:  11171206.18
merge:     58377401.39
hash:     116763544.76

So the default disable_cost isn't enough to push it to use the hash join
plan and goes back to nestloop.  Since disable_cost hasn't been touched
since January 2000, perhaps it's time to bump that up to match today's
hardware and problem sizes?  This isn't even a particularly big problem,
it's joing 18M rows against 30k.

The real problem is getting reasonable stats to pass through the partition
Append step, so it can make a reasonable estimate of the join output size.

Kris Jurka

Re: No hash join across partitioned tables?

From

Tom Lane

Date:

17 April 2009, 12:02:39

Kris Jurka <books@ejurka.com> writes:
> So the default disable_cost isn't enough to push it to use the hash join
> plan and goes back to nestloop.  Since disable_cost hasn't been touched
> since January 2000, perhaps it's time to bump that up to match today's
> hardware and problem sizes?

I think disable_cost was originally set at a time when costs were
integers :-(.  Yeah, there's probably no reason not to throw another
zero or two on it.

Is there another issue here besides that one?  I think you were hoping
that the hash join would be faster than the alternatives, but the cost
estimate says it's a lot slower.  Is that actually the case?

            regards, tom lane

Re: No hash join across partitioned tables?

From

Tom Lane

Date:

17 April 2009, 12:08:27

Kris Jurka <books@ejurka.com> writes:
> The hash join takes less than twenty seconds, the other two joins I
> killed after five minutes.  I can try to collect explain analyze results
> later today if you'd like.

Please, unless the test case you already posted has similar behavior.

            regards, tom lane

Re: No hash join across partitioned tables?

From

Kris Jurka

Date:

17 April 2009, 12:12:41

Tom Lane wrote:

> Is there another issue here besides that one?  I think you were hoping
> that the hash join would be faster than the alternatives, but the cost
> estimate says it's a lot slower.  Is that actually the case?
>

The hash join takes less than twenty seconds, the other two joins I
killed after five minutes.  I can try to collect explain analyze results
later today if you'd like.

Kris Jurka

Re: No hash join across partitioned tables?

From

Kris Jurka

Date:

17 April 2009, 14:05:45

Tom Lane wrote:
> Kris Jurka <books@ejurka.com> writes:
>> The hash join takes less than twenty seconds, the other two joins I
>> killed after five minutes.  I can try to collect explain analyze results
>> later today if you'd like.
>

Attached are the explain analyze results.  The analyze part hits the
hash join worst of all, so I've also included the timings without analyzing.

Method       Time (ms)  Time w/Analyze (ms)
nestloop     304853     319060
merge        514517     683757
hash          18957     143731

Kris Jurka
 Aggregate  (cost=116546928.75..116546928.76 rows=1 width=0) (actual time=143731.602..143731.604 rows=1 loops=1)
   ->  Hash Join  (cost=1470.48..108802907.84 rows=3097608361 width=0) (actual time=308.015..143724.055 rows=2437
loops=1)
         Hash Cond: ((l.vin)::text = (i.vin)::text)
         ->  Append  (cost=0.00..332098.75 rows=18450775 width=18) (actual time=0.069..99984.899 rows=18449996 loops=1)
               ->  Seq Scan on liens l  (cost=0.00..14.00 rows=400 width=21) (actual time=0.003..0.003 rows=0 loops=1)
               ->  Seq Scan on liens_s1 l  (cost=0.00..18633.44 rows=917444 width=18) (actual time=0.060..1828.740
rows=917444loops=1) 
               ->  Seq Scan on liens_s2 l  (cost=0.00..20.92 rows=1192 width=18) (actual time=0.010..2.274 rows=1192
loops=1)
               ->  Seq Scan on liens_s3 l  (cost=0.00..53793.79 rows=2934179 width=18) (actual time=0.054..5777.782
rows=2934179loops=1) 
               ->  Seq Scan on liens_s4 l  (cost=0.00..21069.39 rows=1214139 width=18) (actual time=0.065..2413.429
rows=1214139loops=1) 
               ->  Seq Scan on liens_s5 l  (cost=0.00..29966.37 rows=1726837 width=18) (actual time=0.046..3394.974
rows=1726837loops=1) 
               ->  Seq Scan on liens_s6 l  (cost=0.00..10587.18 rows=462918 width=18) (actual time=0.053..936.379
rows=462918loops=1) 
               ->  Seq Scan on liens_s7 l  (cost=0.00..14.00 rows=400 width=21) (actual time=0.003..0.003 rows=0
loops=1)
               ->  Seq Scan on liens_s8 l  (cost=0.00..86004.68 rows=4956168 width=18) (actual time=0.045..9729.965
rows=4956182loops=1) 
               ->  Seq Scan on liens_s9 l  (cost=0.00..320.29 rows=18429 width=18) (actual time=0.010..34.880
rows=18429loops=1) 
               ->  Seq Scan on liens_s10 l  (cost=0.00..18398.16 rows=951016 width=18) (actual time=0.055..1889.948
rows=951016loops=1) 
               ->  Seq Scan on liens_s11 l  (cost=0.00..9956.22 rows=543022 width=18) (actual time=0.055..1070.156
rows=543022loops=1) 
               ->  Seq Scan on liens_s12 l  (cost=0.00..78813.85 rows=4541785 width=18) (actual time=0.012..9431.035
rows=4541792loops=1) 
               ->  Seq Scan on liens_s13 l  (cost=0.00..4506.46 rows=182846 width=18) (actual time=0.049..374.788
rows=182846loops=1) 
         ->  Hash  (cost=1050.77..1050.77 rows=33577 width=18) (actual time=256.374..256.374 rows=33297 loops=1)
               ->  Append  (cost=0.00..1050.77 rows=33577 width=18) (actual time=0.019..188.152 rows=33297 loops=1)
                     ->  Seq Scan on impounds i  (cost=0.00..11.40 rows=140 width=21) (actual time=0.002..0.002 rows=0
loops=1)
                     ->  Seq Scan on impounds_s1 i  (cost=0.00..11.40 rows=140 width=21) (actual time=0.002..0.002
rows=0loops=1) 
                     ->  Seq Scan on impounds_s2 i  (cost=0.00..913.87 rows=29587 width=18) (actual time=0.008..60.728
rows=29587loops=1) 
                     ->  Seq Scan on impounds_s3 i  (cost=0.00..18.14 rows=414 width=18) (actual time=0.009..0.848
rows=414loops=1) 
                     ->  Seq Scan on impounds_s4 i  (cost=0.00..95.96 rows=3296 width=18) (actual time=0.012..6.894
rows=3296loops=1) 
 Total runtime: 143731.788 ms
(26 rows)

                                                                    QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=57241210.61..57241210.62 rows=1 width=0) (actual time=683467.350..683467.352 rows=1 loops=1)
   ->  Merge Join  (cost=2940810.41..49497189.70 rows=3097608361 width=0) (actual time=434026.342..683460.545 rows=2437
loops=1)
         Merge Cond: ((l.vin)::text = (i.vin)::text)
         ->  Sort  (cost=2937235.46..2983362.40 rows=18450775 width=18) (actual time=433519.957..637389.755
rows=18449961loops=1) 
               Sort Key: l.vin
               Sort Method:  external merge  Disk: 504728kB
               ->  Append  (cost=0.00..332098.75 rows=18450775 width=18) (actual time=14.764..102905.170 rows=18449996
loops=1)
                     ->  Seq Scan on liens l  (cost=0.00..14.00 rows=400 width=21) (actual time=0.003..0.003 rows=0
loops=1)
                     ->  Seq Scan on liens_s1 l  (cost=0.00..18633.44 rows=917444 width=18) (actual
time=14.755..2167.668rows=917444 loops=1) 
                     ->  Seq Scan on liens_s2 l  (cost=0.00..20.92 rows=1192 width=18) (actual time=0.012..2.304
rows=1192loops=1) 
                     ->  Seq Scan on liens_s3 l  (cost=0.00..53793.79 rows=2934179 width=18) (actual
time=8.751..6311.537rows=2934179 loops=1) 
                     ->  Seq Scan on liens_s4 l  (cost=0.00..21069.39 rows=1214139 width=18) (actual
time=0.039..2412.847rows=1214139 loops=1) 
                     ->  Seq Scan on liens_s5 l  (cost=0.00..29966.37 rows=1726837 width=18) (actual
time=0.038..3805.214rows=1726837 loops=1) 
                     ->  Seq Scan on liens_s6 l  (cost=0.00..10587.18 rows=462918 width=18) (actual time=0.037..990.027
rows=462918loops=1) 
                     ->  Seq Scan on liens_s7 l  (cost=0.00..14.00 rows=400 width=21) (actual time=0.003..0.003 rows=0
loops=1)
                     ->  Seq Scan on liens_s8 l  (cost=0.00..86004.68 rows=4956168 width=18) (actual
time=0.037..9992.213rows=4956182 loops=1) 
                     ->  Seq Scan on liens_s9 l  (cost=0.00..320.29 rows=18429 width=18) (actual time=0.012..35.403
rows=18429loops=1) 
                     ->  Seq Scan on liens_s10 l  (cost=0.00..18398.16 rows=951016 width=18) (actual
time=0.039..2078.649rows=951016 loops=1) 
                     ->  Seq Scan on liens_s11 l  (cost=0.00..9956.22 rows=543022 width=18) (actual
time=0.039..1254.598rows=543022 loops=1) 
                     ->  Seq Scan on liens_s12 l  (cost=0.00..78813.85 rows=4541785 width=18) (actual
time=0.044..10506.463rows=4541792 loops=1) 
                     ->  Seq Scan on liens_s13 l  (cost=0.00..4506.46 rows=182846 width=18) (actual time=7.880..591.710
rows=182846loops=1) 
         ->  Sort  (cost=3574.95..3658.89 rows=33577 width=18) (actual time=494.662..559.550 rows=33312 loops=1)
               Sort Key: i.vin
               Sort Method:  quicksort  Memory: 4138kB
               ->  Append  (cost=0.00..1050.77 rows=33577 width=18) (actual time=0.037..178.794 rows=33297 loops=1)
                     ->  Seq Scan on impounds i  (cost=0.00..11.40 rows=140 width=21) (actual time=0.005..0.005 rows=0
loops=1)
                     ->  Seq Scan on impounds_s1 i  (cost=0.00..11.40 rows=140 width=21) (actual time=0.002..0.002
rows=0loops=1) 
                     ->  Seq Scan on impounds_s2 i  (cost=0.00..913.87 rows=29587 width=18) (actual time=0.022..59.502
rows=29587loops=1) 
                     ->  Seq Scan on impounds_s3 i  (cost=0.00..18.14 rows=414 width=18) (actual time=0.009..0.829
rows=414loops=1) 
                     ->  Seq Scan on impounds_s4 i  (cost=0.00..95.96 rows=3296 width=18) (actual time=0.011..6.646
rows=3296loops=1) 
 Total runtime: 683757.901 ms
(31 rows)


 Aggregate  (cost=11171206.17..11171206.18 rows=1 width=0) (actual time=319059.9 81..319059.983 rows=1 loops=1)
   ->  Nested Loop  (cost=0.27..3427185.27 rows=3097608361 width=0) (actual time =2918.842..319051.903 rows=2437
loops=1)
         Join Filter: ((i.vin)::text = (l.vin)::text)
         ->  Append  (cost=0.00..1050.77 rows=33577 width=18) (actual time=0.034 ..259.582 rows=33297 loops=1)
               ->  Seq Scan on impounds i  (cost=0.00..11.40 rows=140 width=21) (actual time=0.004..0.004 rows=0
loops=1)
               ->  Seq Scan on impounds_s1 i  (cost=0.00..11.40 rows=140 width=21) (actual time=0.002..0.002 rows=0
loops=1)
               ->  Seq Scan on impounds_s2 i  (cost=0.00..913.87 rows=29587 width=18) (actual time=0.021..111.905
rows=29587loops=1) 
               ->  Seq Scan on impounds_s3 i  (cost=0.00..18.14 rows=414 width=18) (actual time=4.488..6.009 rows=414
loops=1)
               ->  Seq Scan on impounds_s4 i  (cost=0.00..95.96 rows=3296 width=18) (actual time=8.542..27.109
rows=3296loops=1) 
         ->  Append  (cost=0.27..101.85 rows=15 width=19) (actual time=9.048..9.563 rows=0 loops=33297)
               ->  Bitmap Heap Scan on liens l  (cost=0.27..5.60 rows=2 width=21) (actual time=0.007..0.007 rows=0
loops=33297)
                     Recheck Cond: ((l.vin)::text = (i.vin)::text)
                     ->  Bitmap Index Scan on liens_pk (cost=0.00..0.27 rows=2 width=0) (actual time=0.003..0.003
rows=0loops=33297) 
                           Index Cond: ((l.vin)::text = (i.vin)::text)
               ->  Index Scan using newliens_s1_pk on liens_s1 l (cost=0.00..7.60 rows=1 width=18) (actual
time=1.109..1.109rows=0 loops=33297) 
                     Index Cond: ((l.vin)::text = (i.vin)::text)
               ->  Index Scan using liens_s2_pk on liens_s2 l (cost=0.00..7.32 rows=1 width=18) (actual
time=0.008..0.008rows=0 loops=33297) 
                     Index Cond: ((l.vin)::text = (i.vin)::text)
               ->  Index Scan using newliens_s3_pk on liens_s3 l (cost=0.00..8.07 rows=1 width=18) (actual
time=0.921..0.921rows=0 loops=33297) 
                     Index Cond: ((l.vin)::text = (i.vin)::text)
               ->  Index Scan using newliens_s4_pk on liens_s4 l (cost=0.00..7.67 rows=1 width=18) (actual
time=1.334..1.334rows=0 loops=33297) 
                     Index Cond: ((l.vin)::text = (i.vin)::text)
               ->  Index Scan using newliens_s5_pk on liens_s5 l (cost=0.00..7.79 rows=1 width=18) (actual
time=1.800..1.800rows=0 loops=33297) 
                     Index Cond: ((l.vin)::text = (i.vin)::text)
               ->  Index Scan using newliens_s6_pk on liens_s6 l (cost=0.00..7.49 rows=1 width=18) (actual
time=0.604..0.604rows=0 loops=33297) 
                     Index Cond: ((l.vin)::text = (i.vin)::text)
               ->  Index Scan using liens_s7_pk on liens_s7 l (cost=0.00..3.47 rows=1 width=21) (actual
time=0.003..0.003rows=0 loops=33297) 
                     Index Cond: ((l.vin)::text = (i.vin)::text)
               ->  Index Scan using newliens_s8_pk on liens_s8 l (cost=0.00..8.53 rows=1 width=18) (actual
time=1.141..1.141rows=0 loops=33297) 
                     Index Cond: ((l.vin)::text = (i.vin)::text)
               ->  Index Scan using newliens_s9_pk on liens_s9 l (cost=0.00..7.33 rows=1 width=18) (actual
time=0.037..0.037rows=0 loops=33297) 
                     Index Cond: ((l.vin)::text = (i.vin)::text)
               ->  Index Scan using newliens_s10_pk on liens_s10 l (cost=0.00..7.61 rows=1 width=18) (actual
time=1.162..1.162rows=0 loops=33297) 
                     Index Cond: ((l.vin)::text = (i.vin)::text)
               ->  Index Scan using newliens_s11_pk on liens_s11 l (cost=0.00..7.51 rows=1 width=18) (actual
time=0.099..0.099rows=0 loops=33297) 
                     Index Cond: ((l.vin)::text = (i.vin)::text)
               ->  Index Scan using newliens_s12_pk on liens_s12 l (cost=0.00..8.43 rows=1 width=18) (actual
time=1.073..1.074rows=0 loops=33297) 
                     Index Cond: ((l.vin)::text = (i.vin)::text)
               ->  Index Scan using newliens_s13_pk on liens_s13 l (cost=0.00..7.43 rows=1 width=18) (actual
time=0.238..0.238rows=0 loops=33297) 
                     Index Cond: ((l.vin)::text = (i.vin)::text)
 Total runtime: 319060.547 ms
(41 rows)

Re: No hash join across partitioned tables?

From

Tom Lane

Date:

19 April 2009, 20:31:58

Kris Jurka <books@ejurka.com> writes:
> The real problem is getting reasonable stats to pass through the partition
> Append step, so it can make a reasonable estimate of the join output size.

I dug around a bit and concluded that the lack of stats for the Append
relation is indeed the main problem.  It's not so much the bad join size
estimate (although that could hurt for cases where you need to join this
result to another table).  Rather, it's that the planner is deliberately
biased against picking hash joins in the absence of stats for the inner
relation.  Per the comments for estimate_hash_bucketsize:

 * If no statistics are available, use a default estimate of 0.1.  This will
 * discourage use of a hash rather strongly if the inner relation is large,
 * which is what we want.  We do not want to hash unless we know that the
 * inner rel is well-dispersed (or the alternatives seem much worse).

While we could back off the default a bit here, I think it'd be better
to fix it by not punting on the stats-for-append-relations problem.
That doesn't seem like material for 8.4 at this point, though.

            regards, tom lane

Re: No hash join across partitioned tables?

From

Bruce Momjian

Date:

25 February 2010, 19:46:45

Did this get addressed?

---------------------------------------------------------------------------

Tom Lane wrote:
> Kris Jurka <books@ejurka.com> writes:
> > The real problem is getting reasonable stats to pass through the partition
> > Append step, so it can make a reasonable estimate of the join output size.
>
> I dug around a bit and concluded that the lack of stats for the Append
> relation is indeed the main problem.  It's not so much the bad join size
> estimate (although that could hurt for cases where you need to join this
> result to another table).  Rather, it's that the planner is deliberately
> biased against picking hash joins in the absence of stats for the inner
> relation.  Per the comments for estimate_hash_bucketsize:
>
>  * If no statistics are available, use a default estimate of 0.1.  This will
>  * discourage use of a hash rather strongly if the inner relation is large,
>  * which is what we want.  We do not want to hash unless we know that the
>  * inner rel is well-dispersed (or the alternatives seem much worse).
>
> While we could back off the default a bit here, I think it'd be better
> to fix it by not punting on the stats-for-append-relations problem.
> That doesn't seem like material for 8.4 at this point, though.
>
>             regards, tom lane
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +

Re: No hash join across partitioned tables?

From

Tom Lane

Date:

25 February 2010, 20:05:02

Bruce Momjian <bruce@momjian.us> writes:
> Did this get addressed?

Partially.  There are stats now but autovacuum is not bright about
when to update them.

            regards, tom lane

Re: No hash join across partitioned tables?

From

Robert Haas

Date:

02 March 2010, 12:18:03

On Thu, Feb 25, 2010 at 7:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> Did this get addressed?
>
> Partially.  There are stats now but autovacuum is not bright about
> when to update them.

Is that something you're planning to fix for 9.0?  If not, we at least
need to document what we intend for people to do about it.

...Robert

Re: No hash join across partitioned tables?

From

Tom Lane

Date:

02 March 2010, 12:23:32

Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Feb 25, 2010 at 7:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Partially. �There are stats now but autovacuum is not bright about
>> when to update them.

> Is that something you're planning to fix for 9.0?  If not, we at least
> need to document what we intend for people to do about it.

I want to look at it, but I'm not sure whether the fix will be small
enough that we want to put it in during beta.

            regards, tom lane

Re: No hash join across partitioned tables?

From

Grzegorz Jaśkiewicz

Date:

02 March 2010, 12:27:28

On Tue, Mar 2, 2010 at 4:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Feb 25, 2010 at 7:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Partially. There are stats now but autovacuum is not bright about
>> when to update them.

> Is that something you're planning to fix for 9.0? If not, we at least
> need to document what we intend for people to do about it.

I want to look at it, but I'm not sure whether the fix will be small
enough that we want to put it in during beta.

I am pretty sure many people will appreciate it, even if it isn't going to be small.

Is that stat collection across child tables any useful by it self ?

--
GJ

Re: No hash join across partitioned tables?

From

Robert Haas

Date:

09 June 2010, 16:59:23

On Tue, Mar 2, 2010 at 12:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Feb 25, 2010 at 7:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Partially.  There are stats now but autovacuum is not bright about
>>> when to update them.
>
>> Is that something you're planning to fix for 9.0?  If not, we at least
>> need to document what we intend for people to do about it.
>
> I want to look at it, but I'm not sure whether the fix will be small
> enough that we want to put it in during beta.

In going back through emails I had marked as possibly needing another
look before 9.0 is released, I came across this issue again.  As I
understand it, analyze (or analyse) now collects statistics for both
the parent individually, and for the parent and its children together.
 However, as I further understand it, autovacuum won't actually fire
off an analyze unless there's enough activity on the parent table
considered individually to warrant it.  So if you have an empty parent
and a bunch of children with data in it, your stats will still stink,
unless you analyze by hand.

Assuming my understanding of the problem is correct, we could:

(a) fix it,
(b) document that you should consider periodic manual analyze commands
in this situation, or
(c) do nothing.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Re: No hash join across partitioned tables?

From

Tom Lane

Date:

09 June 2010, 17:11:55

Robert Haas <robertmhaas@gmail.com> writes:
> In going back through emails I had marked as possibly needing another
> look before 9.0 is released, I came across this issue again.  As I
> understand it, analyze (or analyse) now collects statistics for both
> the parent individually, and for the parent and its children together.
>  However, as I further understand it, autovacuum won't actually fire
> off an analyze unless there's enough activity on the parent table
> considered individually to warrant it.  So if you have an empty parent
> and a bunch of children with data in it, your stats will still stink,
> unless you analyze by hand.

Check.

> Assuming my understanding of the problem is correct, we could:

> (a) fix it,
> (b) document that you should consider periodic manual analyze commands
> in this situation, or
> (c) do nothing.

> Thoughts?

The objections to (a) are that it might result in excessive ANALYZE work
if not done intelligently, and that we haven't got a patch ready anyway.
I would have liked to get to this for 9.0 but I feel it's a bit late
now.

            regards, tom lane

Re: No hash join across partitioned tables?

From

Bruce Momjian

Date:

02 July 2010, 00:00:14

Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > In going back through emails I had marked as possibly needing another
> > look before 9.0 is released, I came across this issue again.  As I
> > understand it, analyze (or analyse) now collects statistics for both
> > the parent individually, and for the parent and its children together.
> >  However, as I further understand it, autovacuum won't actually fire
> > off an analyze unless there's enough activity on the parent table
> > considered individually to warrant it.  So if you have an empty parent
> > and a bunch of children with data in it, your stats will still stink,
> > unless you analyze by hand.
>
> Check.
>
> > Assuming my understanding of the problem is correct, we could:
>
> > (a) fix it,
> > (b) document that you should consider periodic manual analyze commands
> > in this situation, or
> > (c) do nothing.
>
> > Thoughts?
>
> The objections to (a) are that it might result in excessive ANALYZE work
> if not done intelligently, and that we haven't got a patch ready anyway.
> I would have liked to get to this for 9.0 but I feel it's a bit late
> now.

What do we want to do about the above issue?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + None of us is going to be here forever. +

Re: No hash join across partitioned tables?

From

Tom Lane

Date:

02 July 2010, 01:05:16

Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> I would have liked to get to this for 9.0 but I feel it's a bit late
>> now.

> What do we want to do about the above issue?

TODO item.

            regards, tom lane

Re: No hash join across partitioned tables?

From

Bruce Momjian

Date:

02 July 2010, 17:54:00

Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Tom Lane wrote:
> >> I would have liked to get to this for 9.0 but I feel it's a bit late
> >> now.
>
> > What do we want to do about the above issue?
>
> TODO item.

Added to TODO:

    Have autoanalyze of parent tables occur when child tables are modified

    * http://archives.postgresql.org/message-id/AANLkTinx8lLTEKWcyEQ1rxVz6WMJVKNezfXW5TKnNAU6@mail.gmail.com

I am surprised there is no documentation update requirement for this.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + None of us is going to be here forever. +

Re: No hash join across partitioned tables?

From

Tom Lane

Date:

02 July 2010, 17:58:59

Bruce Momjian <bruce@momjian.us> writes:
> I am surprised there is no documentation update requirement for this.

Somebody put something about it in the docs a few days ago, IIRC.

            regards, tom lane

Re: No hash join across partitioned tables?

From

Robert Haas

Date:

02 July 2010, 18:11:07

On Fri, Jul 2, 2010 at 4:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> I am surprised there is no documentation update requirement for this.
>
> Somebody put something about it in the docs a few days ago, IIRC.

That was me.

http://archives.postgresql.org/pgsql-committers/2010-06/msg00144.php

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Re: No hash join across partitioned tables?

From

Bruce Momjian

Date:

02 July 2010, 18:12:33

Robert Haas wrote:
> On Fri, Jul 2, 2010 at 4:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> >> I am surprised there is no documentation update requirement for this.
> >
> > Somebody put something about it in the docs a few days ago, IIRC.
>
> That was me.
>
> http://archives.postgresql.org/pgsql-committers/2010-06/msg00144.php

Oh, thanks, I missed that.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + None of us is going to be here forever. +

Re: No hash join across partitioned tables?

From

Alvaro Herrera

Date:

16 October 2010, 02:03:25

Excerpts from Robert Haas's message of mié jun 09 15:47:55 -0400 2010:

> In going back through emails I had marked as possibly needing another
> look before 9.0 is released, I came across this issue again.  As I
> understand it, analyze (or analyse) now collects statistics for both
> the parent individually, and for the parent and its children together.
>  However, as I further understand it, autovacuum won't actually fire
> off an analyze unless there's enough activity on the parent table
> considered individually to warrant it.  So if you have an empty parent
> and a bunch of children with data in it, your stats will still stink,
> unless you analyze by hand.

So, is there something we could now do about this, while there's still
time before 9.1?

I haven't followed this issue very closely, but it seems to me that what
we want is that we want an ANALYZE in a child table to be mutated into
an analyze of its parent table, if the conditions are right; and that an
ANALYZE of a parent removes the child tables from being analyzed on the
same run.

If we analyze the parent, do we also update the children stats, or is it
just that we keep two stats for the parent, one with children and one
without, both being updated when the parent is analyzed?

If the latter's the case, maybe we should modify ANALYZE a bit more, so
that we can analyze the whole hierarchy in one go, and store the lot of
stats with a single pass (each child alone, the parent alone, the parent
plus children).  However it's not real clear how would this work with
multiple inheritance levels.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: No hash join across partitioned tables?

From

Tom Lane

Date:

16 October 2010, 02:22:53

Alvaro Herrera <alvherre@commandprompt.com> writes:
> If we analyze the parent, do we also update the children stats, or is it
> just that we keep two stats for the parent, one with children and one
> without, both being updated when the parent is analyzed?

The latter.

The trick here is that we need to fire an analyze on the parent even
though only its children may have had any updates.

> If the latter's the case, maybe we should modify ANALYZE a bit more, so
> that we can analyze the whole hierarchy in one go, and store the lot of
> stats with a single pass (each child alone, the parent alone, the parent
> plus children).  However it's not real clear how would this work with
> multiple inheritance levels.

It's also not clear how it works without blowing out memory...

            regards, tom lane

Re: No hash join across partitioned tables?

From

Samuel Gendler

Date:

16 October 2010, 02:35:57

On Fri, Oct 15, 2010 at 10:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:
> If we analyze the parent, do we also update the children stats, or is it
> just that we keep two stats for the parent, one with children and one
> without, both being updated when the parent is analyzed?

The latter.

The trick here is that we need to fire an analyze on the parent even
though only its children may have had any updates.

> If the latter's the case, maybe we should modify ANALYZE a bit more, so
> that we can analyze the whole hierarchy in one go, and store the lot of
> stats with a single pass (each child alone, the parent alone, the parent
> plus children). However it's not real clear how would this work with
> multiple inheritance levels.

An issue with automatically analyzing the entire hierarchy is 'abstract' table definitions. I've got a set of tables for storing the same data at different granularities of aggregation. Within each granularity, I've got partitions, but because the set of columns is identical for each granularity, I've got an abstract table definition that is inherited by everything. I don't need or want statistics kept on that table because I never query across the abstract table, only the parent table of each aggregation granularity

create table abstract_fact_table (

time timestamp,

measure1 bigint,

measure2 bigint,

measure3 bigint,

fk1 bigint,

fk2 bigint

);

create table minute_scale_fact_table (

} inherits abstract_fact_table;

// Then there are several partitions for minute scale data

create table hour_scale_fact_table (

) inherits abstract_fact_table;

// then several partitions for hour scale data

etc. I do run queries on the minute_scale_fact_table and hour_scale_fact_table but never do so on abstract_fact_table. I could certainly modify my schema such that the abstract table goes away entirely easily enough, but I find this easier for new developers to come in and comprehend, since the similarity between the table definitions is explicit.

I'm glad this topic came up, as I was unaware that I need to run analyze on the parent partitions separately - and no data is every inserted directly into the top level of each granularity hierarchy, so it will never fire by itself.

If I am using ORM and I've got functionality in a common baseclass in the source code, I'll often implement its mapping in the database via a parent table that the table for any subclass mapping can inherit from. Again, I have no interest in maintaining statistics on the parent table, since I never query against it directly.

Re: No hash join across partitioned tables?

From

Alvaro Herrera

Date:

16 October 2010, 12:29:57

Excerpts from Samuel Gendler's message of sáb oct 16 02:35:46 -0300 2010:

> An issue with automatically analyzing the entire hierarchy is 'abstract'
> table definitions.  I've got a set of tables for storing the same data at
> different granularities of aggregation.  Within each granularity, I've got
> partitions, but because the set of columns is identical for each
> granularity, I've got an abstract table definition that is inherited by
> everything.  I don't need or want statistics kept on that table because I
> never query across the abstract table, only the parent table of each
> aggregation granularity

Hmm, I think you'd be better served by using LIKE instead of regular
inheritance.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: No hash join across partitioned tables?

From

Samuel Gendler

Date:

18 October 2010, 03:15:42

On Sat, Oct 16, 2010 at 8:29 AM, Alvaro Herrera <alvherre@commandprompt.com> wrote:

Excerpts from Samuel Gendler's message of sáb oct 16 02:35:46 -0300 2010:

> An issue with automatically analyzing the entire hierarchy is 'abstract'
> table definitions. I've got a set of tables for storing the same data at
> different granularities of aggregation. Within each granularity, I've got
> partitions, but because the set of columns is identical for each
> granularity, I've got an abstract table definition that is inherited by
> everything. I don't need or want statistics kept on that table because I
> never query across the abstract table, only the parent table of each
> aggregation granularity

Hmm, I think you'd be better served by using LIKE instead of regular
inheritance.

Yep. I inherited the architecture, though, and changing it hasn't been a high priority.

--sam

Re: No hash join across partitioned tables?

From

Alvaro Herrera

Date:

18 October 2010, 11:45:12

Excerpts from Samuel Gendler's message of lun oct 18 03:13:01 -0300 2010:
> On Sat, Oct 16, 2010 at 8:29 AM, Alvaro Herrera
> <alvherre@commandprompt.com>wrote:
>
> > Excerpts from Samuel Gendler's message of sáb oct 16 02:35:46 -0300 2010:
> >
> > > An issue with automatically analyzing the entire hierarchy is
> > > 'abstract' table definitions.  I've got a set of tables for
> > > storing the same data at different granularities of aggregation.
> > > Within each granularity, I've got partitions, but because the set
> > > of columns is identical for each granularity, I've got an abstract
> > > table definition that is inherited by everything.  I don't need or
> > > want statistics kept on that table because I never query across
> > > the abstract table, only the parent table of each aggregation
> > > granularity
> >
> > Hmm, I think you'd be better served by using LIKE instead of regular
> > inheritance.
>
> Yep.  I inherited the architecture, though, and changing it hasn't been a
> high priority.

I understand that; my point is merely that maybe we shouldn't work
through many hoops to solve this particular facet of the problem,
because it seems to be pilot error.  (If you really needed to avoid the
extra I/O that would be caused by unnecessary analyzes, you could turn
autovac off for the abstract tables).

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: No hash join across partitioned tables?

From

Robert Haas

Date:

26 October 2010, 09:23:38

On Sat, Oct 16, 2010 at 1:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> If we analyze the parent, do we also update the children stats, or is it
>> just that we keep two stats for the parent, one with children and one
>> without, both being updated when the parent is analyzed?
>
> The latter.
>
> The trick here is that we need to fire an analyze on the parent even
> though only its children may have had any updates.

Can we execute a SQL query at the point where we need this
information?  Because it doesn't seem too hard to work up a query that
totals the inserts, updates, and reltuples across all children of each
table.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company