Thread: issue with query optimizer when joining two partitioned tables

issue with query optimizer when joining two partitioned tables

From
Anish Kejariwal
Date:
I have run into issue where the query optimizer is choosing the wrong execution plan when I'm trying to join two large tables that have been partitioned.  I would really appreciate it if someone could help me out this.  I don't know whether I've found a bug in the optimizer, or whether there is some parameter/option I need to set in postgres.  Below, I've included my execution plans.  I'm using postgres 9.0.3, and I'm running this on a pretty beefy Linux server.

My two tables:
-widget: has 4041866 records, and is broken up into 4 partitions (no records are in the parent table).
-icecream: I'm starting with zero records, but since this there could be billions of ice-cream records, I will partition and will not have any records in the parent table.

So, then I then create my first partition in icecream table, and load 4041866 records into it.

Here is the query I'm using to join the two tables:
explain analyze
SELECT
      r.widget_id,
      r.widget_type_id,
      avg(rc.cost)::double precision cost_avg
FROM
      widget r,
      icecream rc
WHERE
r.widget_type_id = 4
and r.widgetset_id = 5
AND r.widget_id = rc.widget_id
and rc.dataset_id = 281
group by r.widget_id,r.chromosome, r.start_pos, r.end_pos,r.widget_type_id
;

Here is the corresponding execution plan:

                                                                                          QUERY PLAN                                                                                          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=147262.20..147299.12 rows=1136 width=41) (actual time=31876.290..31904.880 rows=11028 loops=1)
   ->  Merge Join  (cost=95574.83..112841.79 rows=1147347 width=41) (actual time=31130.870..31832.922 rows=11028 loops=1)
         Merge Cond: (r.widget_id = rc.widget_id)
         ->  Sort  (cost=1913.89..1942.27 rows=11352 width=21) (actual time=56.818..68.701 rows=11028 loops=1)
               Sort Key: r.widget_id
               Sort Method:  quicksort  Memory: 1246kB
               ->  Append  (cost=4.28..1149.30 rows=11352 width=21) (actual time=0.139..40.513 rows=11028 loops=1)
                     ->  Bitmap Heap Scan on widget r  (cost=4.28..12.75 rows=1 width=48) (actual time=0.030..0.030 rows=0 loops=1)
                           Recheck Cond: (widgetset_id = 5)
                           Filter: (widget_type_id = 4)
                           ->  Bitmap Index Scan on widget_widgetset_id_idx  (cost=0.00..4.28 rows=4 width=0) (actual time=0.023..0.023 rows=0 loops=1)
                                 Index Cond: (widgetset_id = 5)
                     ->  Index Scan using widget_part_5_widget_widget_type_id_idx on widget_part_5 r  (cost=0.00..1136.55 rows=11351 width=21) (actual time=0.106..18.489 rows=11028 loops=1)
                           Index Cond: (widget_type_id = 4)
                           Filter: (widgetset_id = 5)
         ->  Sort  (cost=93660.94..93711.47 rows=20214 width=24) (actual time=29730.522..30766.354 rows=946140 loops=1)
               Sort Key: rc.widget_id
               Sort Method:  external sort  Disk: 165952kB
               ->  Append  (cost=0.00..92215.33 rows=20214 width=24) (actual time=0.057..13731.204 rows=4041866 loops=1)
                     ->  Seq Scan on icecream rc  (cost=0.00..23.00 rows=5 width=24) (actual time=0.002..0.002 rows=0 loops=1)
                           Filter: (dataset_id = 281)
                     ->  Seq Scan on icecream_part_281 rc  (cost=0.00..92192.33 rows=20209 width=24) (actual time=0.051..5427.730 rows=4041866 loops=1)
                           Filter: (dataset_id = 281)
 Total runtime: 33182.945 ms
(24 rows)


The query is doing a merge join, is taking 33 seconds, but should take less than a second.  So, then I do: select * from icecream;

Now, when I run the same query again, I get a different and correct execution plan (nested loop), and the query takes less than 1 second as I would expect.


                                                                                        QUERY PLAN                                                                                         
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=7223611.41..7223648.33 rows=1136 width=41) (actual time=392.822..420.166 rows=11028 loops=1)
   ->  Nested Loop  (cost=4.28..341195.22 rows=229413873 width=41) (actual time=0.231..331.800 rows=11028 loops=1)
         Join Filter: (r.widget_id = rc.widget_id)
         ->  Append  (cost=4.28..1149.30 rows=11352 width=21) (actual time=0.051..50.181 rows=11028 loops=1)
               ->  Bitmap Heap Scan on widget r  (cost=4.28..12.75 rows=1 width=48) (actual time=0.013..0.013 rows=0 loops=1)
                     Recheck Cond: (widgetset_id = 5)
                     Filter: (widget_type_id = 4)
                     ->  Bitmap Index Scan on widget_widgetset_id_idx  (cost=0.00..4.28 rows=4 width=0) (actual time=0.007..0.007 rows=0 loops=1)
                           Index Cond: (widgetset_id = 5)
               ->  Index Scan using widget_part_5_widget_widget_type_id_idx on widget_part_5 r  (cost=0.00..1136.55 rows=11351 width=21) (actual time=0.033..21.254 rows=11028 loops=1)
                     Index Cond: (widget_type_id = 4)
                     Filter: (widgetset_id = 5)
         ->  Append  (cost=0.00..29.88 rows=6 width=24) (actual time=0.014..0.018 rows=1 loops=11028)
               ->  Seq Scan on icecream rc  (cost=0.00..23.00 rows=5 width=24) (actual time=0.001..0.001 rows=0 loops=11028)
                     Filter: (rc.dataset_id = 281)
               ->  Index Scan using icecream_part_281_widget_id_idx on icecream_part_281 rc  (cost=0.00..6.88 rows=1 width=24) (actual time=0.009..0.010 rows=1 loops=11028)
                     Index Cond: (rc.widget_id = r.widget_id)
                     Filter: (rc.dataset_id = 281)
 Total runtime: 431.935 ms
(19 rows)


My guess as to what happened:
-because the icecream parent table has zero records, the query optimizer chooses the incorrect execution plan
-when I do select * from icecream, the optimizer now knows how many records are really in the icecream table, by knowing that the icecream table has partitions.

Next, if I run vacuum analyze on the parent table, I again get a wrong/slow execution plan (this time it uses the hash join).  Again, I think this is because the parent table itself has zero records.


                                                                                       QUERY PLAN                                                                                       
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=191926.03..191962.95 rows=1136 width=41) (actual time=28967.567..28994.395 rows=11028 loops=1)
   ->  Hash Join  (cost=166424.79..191585.47 rows=11352 width=41) (actual time=28539.196..28917.830 rows=11028 loops=1)
         Hash Cond: (r.widget_id = rc.widget_id)
         ->  Append  (cost=4.28..1149.30 rows=11352 width=21) (actual time=0.054..54.068 rows=11028 loops=1)
               ->  Bitmap Heap Scan on widget r  (cost=4.28..12.75 rows=1 width=48) (actual time=0.013..0.013 rows=0 loops=1)
                     Recheck Cond: (widgetset_id = 5)
                     Filter: (widget_type_id = 4)
                     ->  Bitmap Index Scan on widget_widgetset_id_idx  (cost=0.00..4.28 rows=4 width=0) (actual time=0.007..0.007 rows=0 loops=1)
                           Index Cond: (widgetset_id = 5)
               ->  Index Scan using widget_part_5_widget_widget_type_id_idx on widget_part_5 r  (cost=0.00..1136.55 rows=11351 width=21) (actual time=0.035..22.419 rows=11028 loops=1)
                     Index Cond: (widget_type_id = 4)
                     Filter: (widgetset_id = 5)
         ->  Hash  (cost=92214.73..92214.73 rows=4041823 width=24) (actual time=28438.419..28438.419 rows=4041866 loops=1)
               Buckets: 524288  Batches: 2  Memory Usage: 118449kB
               ->  Append  (cost=0.00..92214.73 rows=4041823 width=24) (actual time=0.020..14896.908 rows=4041866 loops=1)
                     ->  Seq Scan on icecream rc  (cost=0.00..23.00 rows=5 width=24) (actual time=0.002..0.002 rows=0 loops=1)
                           Filter: (dataset_id = 281)
                     ->  Seq Scan on icecream_part_281 rc  (cost=0.00..92191.73 rows=4041818 width=24) (actual time=0.012..5718.592 rows=4041866 loops=1)
                           Filter: (dataset_id = 281)
 Total runtime: 29007.937 ms
(20 rows)


select * from icecream does not fix this issue.


I could of course disable hash join and merge join to force postgres to use a nested loop, but my system is often joining these two tables, and I'd rather not have to set this in every single place.
set enable_mergejoin=off;
set enable_hashjoin=off;
set enable_nestloop = on;


thanks in advance!!!

Anish






Re: issue with query optimizer when joining two partitioned tables

From
Heikki Linnakangas
Date:
On 09.07.2011 00:36, Anish Kejariwal wrote:
> My guess as to what happened:
> -because the icecream parent table has zero records, the query optimizer
> chooses the incorrect execution plan
> -when I do select * from icecream, the optimizer now knows how many records
> are really in the icecream table, by knowing that the icecream table has
> partitions.

"select * from icecream" won't have any direct effect on the
optimization of subsequent queries. What probably happened is that
autoanalyze ran in the background while you ran that select, and
analyzed some of the partitions. Simply waiting a while would've had the
same effect.

> Next, if I run vacuum analyze on the parent table, I again get a wrong/slow
> execution plan (this time it uses the hash join).  Again, I think this is
> because the parent table itself has zero records.
>
>
>
>             QUERY PLAN
>
>
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>   HashAggregate  (cost=191926.03..191962.95 rows=1136 width=41) (actual
> time=28967.567..28994.395 rows=11028 loops=1)
>     ->   Hash Join  (cost=166424.79..191585.47 rows=11352 width=41) (actual
> time=28539.196..28917.830 rows=11028 loops=1)
>           Hash Cond: (r.widget_id = rc.widget_id)
>           ->   Append  (cost=4.28..1149.30 rows=11352 width=21) (actual
> time=0.054..54.068 rows=11028 loops=1)
>                 ->   Bitmap Heap Scan on widget r  (cost=4.28..12.75 rows=1
> width=48) (actual time=0.013..0.013 rows=0 loops=1)
>                       Recheck Cond: (widgetset_id = 5)
>                       Filter: (widget_type_id = 4)
>                       ->   Bitmap Index Scan on widget_widgetset_id_idx
>   (cost=0.00..4.28 rows=4 width=0) (actual time=0.007..0.007 rows=0 loops=1)
>                             Index Cond: (widgetset_id = 5)
>                 ->   Index Scan using widget_part_5_widget_widget_type_id_idx
> on widget_part_5 r  (cost=0.00..1136.55 rows=11351 width=21) (actual
> time=0.035..22.419 rows=11028 loops=1)
>                       Index Cond: (widget_type_id = 4)
>                       Filter: (widgetset_id = 5)
>           ->   Hash  (cost=92214.73..92214.73 rows=4041823 width=24) (actual
> time=28438.419..28438.419 rows=4041866 loops=1)
>                 Buckets: 524288  Batches: 2  Memory Usage: 118449kB
>                 ->   Append  (cost=0.00..92214.73 rows=4041823 width=24)
> (actual time=0.020..14896.908 rows=4041866 loops=1)
>                       ->   Seq Scan on icecream rc  (cost=0.00..23.00 rows=5
> width=24) (actual time=0.002..0.002 rows=0 loops=1)
>                             Filter: (dataset_id = 281)
>                       ->   Seq Scan on icecream_part_281 rc
>   (cost=0.00..92191.73 rows=4041818 width=24) (actual time=0.012..5718.592
> rows=4041866 loops=1)
>                             Filter: (dataset_id = 281)
>   Total runtime: 29007.937 ms
> (20 rows)

The cost estimates in the above slow plan are pretty accurate, so I
suspect the cost estimates for the fast plan are not, or the planner
would choose that.

> I could of course disable hash join and merge join to force postgres to use
> a nested loop, but my system is often joining these two tables, and I'd
> rather not have to set this in every single place.
> set enable_mergejoin=off;
> set enable_hashjoin=off;
> set enable_nestloop = on;

Can you do explain analyze with these settings? That might give us a
clue on where it's going wrong.

Also, I suspect that when you load more data into icecream, the planner
might start to pick the faster plan, because the seqscan on icecream
will start to look more expensive compared to the index scan and nested
loop join in the faster plan.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: issue with query optimizer when joining two partitioned tables

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 09.07.2011 00:36, Anish Kejariwal wrote:
>> My guess as to what happened:
>> -because the icecream parent table has zero records, the query optimizer
>> chooses the incorrect execution plan
>> -when I do select * from icecream, the optimizer now knows how many records
>> are really in the icecream table, by knowing that the icecream table has
>> partitions.

> "select * from icecream" won't have any direct effect on the
> optimization of subsequent queries. What probably happened is that
> autoanalyze ran in the background while you ran that select, and
> analyzed some of the partitions. Simply waiting a while would've had the
> same effect.

Yeah.  Also, the reason that a manual vacuum on icecream changes things
yet again is that in 9.0 and up, we have a notion of summary stats
across the whole inheritance tree, but autoanalyze hasn't been taught to
gather those.  The manual command on the parent table does gather them,
though.

So what's happening here is that we suddenly have an accurate idea of
the size of the join product as a result of having inheritance summary
stats to estimate with, and that drives the estimated cost of the merge
or hash join down out of the stratosphere.  The estimated cost of the
nestloop goes down a lot too, but not as much.

I experimented with a similar case here, and it seems like a lot of the
remaining error in the nestloop estimate comes from this:

>> ->   Seq Scan on icecream rc  (cost=0.00..23.00 rows=5 width=24) (actual time=0.002..0.002 rows=0 loops=1)
>>        Filter: (dataset_id = 281)

The indexscan on the nonempty child partition is estimated at less than
10 cost units, so this is a *large* fraction of what the planner sees as
the per-outer-row cost of a nestloop.  And with more than 11000 rows on
the other side of the join, that discourages it from using the nestloop.
In reality of course this takes negligible time compared to examining
the child partition.

Now why is the seqscan cost estimate so large, when actually the parent
icecream table is totally empty?  It's because the planner has been
taught to never believe that an empty table is empty.  If memory serves,
it's really estimating on an assumption that the table contains 10 pages
and some corresponding number of rows.  This is a reasonable defensive
posture when dealing with ordinary tables, I think, since most likely
if the catalogs say the table is empty that's just a leftover from when
it was created.  But maybe we should reconsider the heuristic for tables
that are members of inheritance trees --- particularly parents of
inheritance trees.

I was able to defeat the empty-table heuristic here by doing

update pg_class set relpages = 1 where relname = 'icecream';

and then I started getting much more realistic estimates in my test
case.  (It still wanted to use a merge join initially, but after
knocking down random_page_cost it went to the nestloop.)  It would
be interesting to see what sorts of results Anish gets with that.

            regards, tom lane

Re: issue with query optimizer when joining two partitioned tables

From
Anish Kejariwal
Date:
Thanks Tom and Heikki!  I really appreciate your help.

I went ahead and loaded all the data.  In the icream table, I now have ~175 partitions, each with 4041866 records.

The data finished loading 12 hours ago, and I then ran the same query I gave you guys, and it took 25 seconds since it used the wrong execution plan as expected.

 HashAggregate  (cost=27680.90..28045.81 rows=11228 width=41) (actual time=24769.190..24817.618 rows=11028 loops=1)
   ->  Hash Join  (cost=1304.04..18901.88 rows=292634 width=41) (actual time=3938.965..24688.718 rows=11028 loops=1)
         Hash Cond: (rc.widget_id = r.widget_id)
         ->  Append  (cost=0.00..12110.95 rows=292634 width=24) (actual time=2854.925..22887.638 rows=309579 loops=1)
               ->  Seq Scan on icecream rc  (cost=0.00..25.60 rows=1 width=24) (actual time=0.003..0.003 rows=0 loops=1)
                     Filter: ((widgetset_id = 5) AND (dataset_id = 283))
               ->  Index Scan using icecream_part_283_widgetset_id_idx on icecream_part_283 rc  (cost=0.00..12085.35 rows=292633 width=24) (actual time=2854.915..2
1784.769 rows=309579 loops=1)
                     Index Cond: (widgetset_id = 5)
                     Filter: (dataset_id = 283)
         ->  Hash  (cost=1163.69..1163.69 rows=11228 width=21) (actual time=1083.704..1083.704 rows=11028 loops=1)
               Buckets: 2048  Batches: 1  Memory Usage: 604kB
               ->  Append  (cost=4.28..1163.69 rows=11228 width=21) (actual time=528.216..1066.659 rows=11028 loops=1)
                     ->  Bitmap Heap Scan on widget r  (cost=4.28..12.75 rows=1 width=48) (actual time=528.017..528.017 rows=0 loops=1)
                           Recheck Cond: (widgetset_id = 5)
                           Filter: (widget_type_id = 4)
                           ->  Bitmap Index Scan on widget_widgetset_id_idx  (cost=0.00..4.28 rows=4 width=0) (actual time=527.995..527.995 rows=0 loops=1)
                                 Index Cond: (widgetset_id = 5)
                     ->  Index Scan using widget_part_5_widget_widget_type_id_idx on widget_part_5 r  (cost=0.00..1150.94 rows=11227 width=21) (actual time=0.191..512.847 rows=1
1028 loops=1)
                           Index Cond: (widget_type_id = 4)
                           Filter: (widgetset_id = 5)
 Total runtime: 24844.016 ms
(21 rows)


I then changed my enable* options to force it to use a nested loop:
set enable_mergejoin=off;
set enable_hashjoin=off;
set enable_nestloop = on;


 HashAggregate  (cost=460004.79..460369.70 rows=11228 width=41) (actual time=298.014..341.822 rows=11028 loops=1)
   ->  Nested Loop  (cost=4.28..338747.85 rows=4041898 width=41) (actual time=0.175..248.529 rows=11028 loops=1)
         Join Filter: (r.widget_id = rc.widget_id)
         ->  Append  (cost=4.28..1163.69 rows=11228 width=21) (actual time=0.053..42.532 rows=11028 loops=1)
               ->  Bitmap Heap Scan on widget r  (cost=4.28..12.75 rows=1 width=48) (actual time=0.014..0.014 rows=0 loops=1)
                     Recheck Cond: (widgetset_id = 5)
                     Filter: (widget_type_id = 4)
                     ->  Bitmap Index Scan on widget_widgetset_id_idx  (cost=0.00..4.28 rows=4 width=0) (actual time=0.008..0.008 rows=0 loops=1)
                           Index Cond: (widgetset_id = 5)
               ->  Index Scan using widget_part_5_widget_widget_type_id_idx on widget_part_5 r  (cost=0.00..1150.94 rows=11227 width=21) (actual time=0.032..18.410 rows=11028 lo
ops=1)
                     Index Cond: (widget_type_id = 4)
                     Filter: (widgetset_id = 5)
         ->  Append  (cost=0.00..29.99 rows=6 width=24) (actual time=0.009..0.012 rows=1 loops=11028)
               ->  Seq Scan on icecream rc  (cost=0.00..23.00 rows=5 width=24) (actual time=0.001..0.001 rows=0 loops=11028)
                     Filter: (rc.dataset_id = 283)
               ->  Index Scan using icecream_part_283_widget_id_idx on icecream_part_283 rc  (cost=0.00..6.99 rows=1 width=24) (actual time=0.004..0.006 rows=1 loo
ps=11028)
                     Index Cond: (rc.widget_id = r.widget_id)
                     Filter: (rc.dataset_id = 283)
 Total runtime: 361.180 ms
(19 rows)

The query was nice and fast as expected.

I then restored the enable* options back to default.  The query was slow again and taking around ~19 seconds.  So, this gives us some information about whether autoanalyze is running in the background:
-I don't think waiting is making a difference.  I loaded the data 12 hours ago, and then ran the query and the query was very slow.  I then ran the query again but for a different dataset and it was also slow.  It was only once I changed the enable* parameters could I get my expected performance.  
-I would have tried select * from ice-cream again, but with 176 partitions this query is no longer feasible.  But, I do agree that select * from icecream is causing the statistics to be updated.

So, then I followed Tom's suggestion to defeat the empty-table heuristic.

select relpages from pg_class where relname = 'icecream';

 relpages 
----------
        0
(1 row)

Ok, so the planner thinks that the parent table is empty.  I then ran:
update pg_class set relpages = 1 where relname = 'icecream';


 HashAggregate  (cost=201199.27..201564.18 rows=11228 width=41) (actual time=277.195..304.620 rows=11028 loops=1)
   ->  Nested Loop  (cost=4.28..79942.45 rows=4041894 width=41) (actual time=0.227..231.181 rows=11028 loops=1)
         Join Filter: (r.widget_id = rc.widget_id)
         ->  Append  (cost=4.28..1163.69 rows=11228 width=21) (actual time=0.125..40.834 rows=11028 loops=1)
               ->  Bitmap Heap Scan on widget r  (cost=4.28..12.75 rows=1 width=48) (actual time=0.022..0.022 rows=0 loops=1)
                     Recheck Cond: (widgetset_id = 5)
                     Filter: (widget_type_id = 4)
                     ->  Bitmap Index Scan on widget_widgetset_id_idx  (cost=0.00..4.28 rows=4 width=0) (actual time=0.019..0.019 rows=0 loops=1)
                           Index Cond: (widgetset_id = 5)
               ->  Index Scan using widget_part_5_widget_widget_type_id_idx on widget_part_5 r  (cost=0.00..1150.94 rows=11227 width=21) (actual time=0.100..18.964 rows=11028 lo
ops=1)
                     Index Cond: (widget_type_id = 4)
                     Filter: (widgetset_id = 5)
         ->  Append  (cost=0.00..6.99 rows=2 width=24) (actual time=0.008..0.012 rows=1 loops=11028)
               ->  Seq Scan on icecream rc  (cost=0.00..0.00 rows=1 width=24) (actual time=0.001..0.001 rows=0 loops=11028)
                     Filter: (rc.dataset_id = 283)
               ->  Index Scan using icecream_part_283_widget_id_idx on icecream_part_283 rc  (cost=0.00..6.99 rows=1 width=24) (actual time=0.004..0.006 rows=1 loo
ps=11028)
                     Index Cond: (rc.widget_id = r.widget_id)
                     Filter: (rc.dataset_id = 283)
 Total runtime: 318.634 ms
(19 rows)


Wow!  that fixes it.  Thanks you so much!!!! I've been struggling with this issue for 2-3 days.  (Also, in the past, I've seen inconsistent performance with this query, which may be the result of the planner sometimes choosing the wrong plan, but I'll chase that down later).

Tom said: But maybe we should reconsider the heuristic for tables that are members of inheritance trees --- particularly parents of inheritance trees.

I agree.  I think postgres should get updated to take this into account.  I shouldn't have to set the relpages to 1 for all the empty parent tables that I have partitioned.  Should I file this as a bug/enhancement?

Also, do I need to worry about about autoanalyze/autovacuum setting back relpages to zero for the parent icecream table?

thanks!!!
Anish



On Sat, Jul 9, 2011 at 10:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 09.07.2011 00:36, Anish Kejariwal wrote:
>> My guess as to what happened:
>> -because the icecream parent table has zero records, the query optimizer
>> chooses the incorrect execution plan
>> -when I do select * from icecream, the optimizer now knows how many records
>> are really in the icecream table, by knowing that the icecream table has
>> partitions.

> "select * from icecream" won't have any direct effect on the
> optimization of subsequent queries. What probably happened is that
> autoanalyze ran in the background while you ran that select, and
> analyzed some of the partitions. Simply waiting a while would've had the
> same effect.

Yeah.  Also, the reason that a manual vacuum on icecream changes things
yet again is that in 9.0 and up, we have a notion of summary stats
across the whole inheritance tree, but autoanalyze hasn't been taught to
gather those.  The manual command on the parent table does gather them,
though.

So what's happening here is that we suddenly have an accurate idea of
the size of the join product as a result of having inheritance summary
stats to estimate with, and that drives the estimated cost of the merge
or hash join down out of the stratosphere.  The estimated cost of the
nestloop goes down a lot too, but not as much.

I experimented with a similar case here, and it seems like a lot of the
remaining error in the nestloop estimate comes from this:

>> ->   Seq Scan on icecream rc  (cost=0.00..23.00 rows=5 width=24) (actual time=0.002..0.002 rows=0 loops=1)
>>        Filter: (dataset_id = 281)

The indexscan on the nonempty child partition is estimated at less than
10 cost units, so this is a *large* fraction of what the planner sees as
the per-outer-row cost of a nestloop.  And with more than 11000 rows on
the other side of the join, that discourages it from using the nestloop.
In reality of course this takes negligible time compared to examining
the child partition.

Now why is the seqscan cost estimate so large, when actually the parent
icecream table is totally empty?  It's because the planner has been
taught to never believe that an empty table is empty.  If memory serves,
it's really estimating on an assumption that the table contains 10 pages
and some corresponding number of rows.  This is a reasonable defensive
posture when dealing with ordinary tables, I think, since most likely
if the catalogs say the table is empty that's just a leftover from when
it was created.  But maybe we should reconsider the heuristic for tables
that are members of inheritance trees --- particularly parents of
inheritance trees.

I was able to defeat the empty-table heuristic here by doing

update pg_class set relpages = 1 where relname = 'icecream';

and then I started getting much more realistic estimates in my test
case.  (It still wanted to use a merge join initially, but after
knocking down random_page_cost it went to the nestloop.)  It would
be interesting to see what sorts of results Anish gets with that.

                       regards, tom lane

Re: issue with query optimizer when joining two partitioned tables

From
Jeremy Harris
Date:
On 2011-07-09 18:43, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> On 09.07.2011 00:36, Anish Kejariwal wrote:
>>> My guess as to what happened:
>>> -because the icecream parent table has zero records, the query optimizer
>>> chooses the incorrect execution plan
>>> -when I do select * from icecream, the optimizer now knows how many records
>>> are really in the icecream table, by knowing that the icecream table has
>>> partitions.
>
>> "select * from icecream" won't have any direct effect on the
>> optimization of subsequent queries. What probably happened is that
>> autoanalyze ran in the background while you ran that select, and
>> analyzed some of the partitions. Simply waiting a while would've had the
>> same effect.
>
> Yeah.  Also, the reason that a manual vacuum on icecream changes things
> yet again is that in 9.0 and up, we have a notion of summary stats
> across the whole inheritance tree, but autoanalyze hasn't been taught to
> gather those.  The manual command on the parent table does gather them,
> though.

Is stats-gathering significantly more expensive than an FTS?   Could an FTS
update stats as a matter of course (or perhaps only if enough changes in table)?
--
Jeremy