Thread: Strange query optimization in 7.3.2

Strange query optimization in 7.3.2

From: Alec Mitchell
Hello,

    I've encountered what seems to be a very strange behavior in the query
optimizer using postgresql 7.3.2.  Using a query that looks like this:

EXPLAIN ANALYZE SELECT * from (trucks tr JOIN terminals t ON (t.terminal =
tr.terminal) JOIN  manifests m ON (tr.trailer = m.trailer)) JOIN stops s ON
(m.manifest = s.manifest) WHERE ((tr.group_num = 1) AND (t.city_id = 2) AND
(s.date BETWEEN '1/1/2003' AND '1/31/2003'));

I get a somewhat slow query with a plan that is not terribly optimal.

However when I accidentally made the query look like this:

EXPLAIN ANALYZE SELECT * from (trucks tr JOIN terminals t ON (t.terminal =
tr.terminal) JOIN  manifests m ON (tr.trailer = m.trailer)) JOIN stops s ON
(m.manifest = s.manifest) WHERE ((s.manifest = m.manifest) AND (tr.trailer =
m.trailer) AND (t.terminal = tr.terminal) AND (tr.group_num = 1) AND
(t.city_id = 2) AND (s.date BETWEEN '1/1/2003' AND '1/31/2003'));

The query plan chosen is more than three times as fast.  This is especially
strange considering it is identical to the first query apart from redundant
information for the joins.

If I change the JOIN ON to NATURAL JOIN the plan chosen is the slow one,
unless of course the redundant information is included as well.  Using
non-explicit joins also results in the slow query plan.  The redundant
information has a drastic effect on the query plan, but makes no change to
the results.  Does anyone have any idea why this might be?

Here are the results of EXPLAIN ANALYZE for the two queries:

The Slow query:

 Hash Join  (cost=202.44..7332.06 rows=9711 width=140) (actual
time=21.20..6567.87 rows=19775 loops=1)
   Hash Cond: ("outer".manifest = "inner".manifest)
   ->  Seq Scan on stops s  (cost=0.00..6421.15 rows=117416 width=88) (actual
time=0.18..4811.78 rows=118606 loops=1)
         Filter: ((date >= '2003-01-01'::date) AND (date <=
'2003-01-31'::date))
   ->  Hash  (cost=199.64..199.64 rows=1119 width=52) (actual
time=20.54..20.54 rows=0 loops=1)
         ->  Merge Join  (cost=21.86..199.64 rows=1119 width=52) (actual
time=14.72..20.14 rows=52 loops=1)
               Merge Cond: ("outer".trailer = "inner".trailer)
               ->  Index Scan using manifests_trailer_idx on manifests m
(cost=0.00..516.82 rows=13526 width=15) (actual time=0.41..4.25 rows=174
loops=1)
               ->  Sort  (cost=21.86..22.01 rows=62 width=37) (actual
time=10.86..11.01 rows=52 loops=1)
                     Sort Key: tr.trailer
                     ->  Hash Join  (cost=1.47..20.01 rows=62 width=37)
(actual time=1.63..9.49 rows=52 loops=1)
                           Hash Cond: ("outer".terminal = "inner".terminal)
                           ->  Seq Scan on trucks tr  (cost=0.00..15.38
rows=319 width=21) (actual time=0.05..5.91 rows=319 loops=1)
                                 Filter: (group_num = 1)
                           ->  Hash  (cost=1.45..1.45 rows=7 width=16) (actual
time=0.32..0.32 rows=0 loops=1)
                                 ->  Seq Scan on terminals t  (cost=0.00..1.45
rows=7 width=16) (actual time=0.06..0.27 rows=7 loops=1)
                                       Filter: (city_id = 2)
 Total runtime: 6635.96 msec


The faster (redundant) Query:

 Nested Loop  (cost=21.86..6496.24 rows=9711 width=140) (actual
time=14.90..2045.87 rows=19775 loops=1)
   ->  Merge Join  (cost=21.86..199.64 rows=1119 width=52) (actual
time=14.67..32.01 rows=52 loops=1)
         Merge Cond: ("outer".trailer = "inner".trailer)
         ->  Index Scan using manifests_trailer_idx on manifests m
(cost=0.00..516.82 rows=13526 width=15) (actual time=0.40..4.47 rows=174
loops=1)
         ->  Sort  (cost=21.86..22.01 rows=62 width=37) (actual
time=10.80..11.02 rows=52 loops=1)
               Sort Key: tr.trailer
               ->  Hash Join  (cost=1.47..20.01 rows=62 width=37) (actual
time=1.49..9.42 rows=52 loops=1)
                     Hash Cond: ("outer".terminal = "inner".terminal)
                     ->  Seq Scan on trucks tr  (cost=0.00..15.38 rows=319
width=21) (actual time=0.04..5.88 rows=319 loops=1)
                           Filter: (group_num = 1)
                     ->  Hash  (cost=1.45..1.45 rows=7 width=16) (actual
time=0.31..0.31 rows=0 loops=1)
                           ->  Seq Scan on terminals t  (cost=0.00..1.45
rows=7 width=16) (actual time=0.06..0.26 rows=7 loops=1)
                                 Filter: (city_id = 2)
   ->  Index Scan using stops_mainfest_date_idx on stops s  (cost=0.00..5.62
rows=1 width=88) (actual time=0.13..17.13 rows=380 loops=52)
         Index Cond: (("outer".manifest = s.manifest) AND (s.manifest =
"outer".manifest) AND (s.date >= '2003-01-01'::date) AND (s.date <=
'2003-01-31'::date))
 Total runtime: 2123.36 msec

As you can see the planner estimates are all the same for the operations
common to both plans, but a different choice is made for the final (big)
join.  Thanks in advance for any enlightenment on this strange issue.  I'd
rather not be using such a kludgy query to optimize things unless I have no
other option.

Alec Mitchell


Re: Strange query optimization in 7.3.2

From: Tom Lane
Alec Mitchell <apm13@columbia.edu> writes:
>     I've encountered what seems to be a very strange behavior in the query
> optimizer using postgresql 7.3.2.

I think the reason for the change in plan is the same bug discussed at
http://fts.postgresql.org/db/mw/msg.html?mid=1064055

However, you will probably not like the fix, since it eliminates the
bogusly small cost estimate for the duplicated index condition, and
thereby ensures that your less-favored plan will always be chosen :-(

What would be interesting is to look into why the planner's estimated
costs are inaccurate.  I think the main cause is the badly-off join
estimate for the tr/t join --- notice it's estimating 1119 rows out
where only 52 are actually produced.  The nestloop's runtime is directly
proportional to the number of outer rows, so this leads directly to a
factor-of-20 overestimate of the nestloop's cost, discouraging the
planner from using it.  The bug that's triggered by the duplicate
index condition underestimates the cost, thereby negating that error to
some extent.

You should look into whether increasing the statistics targets for
t.terminal and tr.terminal would improve the accuracy of the join
estimate.
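
For concreteness, raising a per-column statistics target and refreshing the
statistics would look something like this (a sketch only; the column names
come from the query in this thread, and 100 is an arbitrary value larger
than the default target):

```sql
-- Raise the statistics target for the join columns, then re-ANALYZE so
-- the planner sees the richer histograms and MCV lists.
ALTER TABLE terminals ALTER COLUMN terminal SET STATISTICS 100;
ALTER TABLE trucks ALTER COLUMN terminal SET STATISTICS 100;
ANALYZE terminals;
ANALYZE trucks;
```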

            regards, tom lane


Re: Strange query optimization in 7.3.2

From: Alec Mitchell
On Monday 14 April 2003 01:25 pm, Tom Lane wrote:
> I think the reason for the change in plan is the same bug discussed at
> http://fts.postgresql.org/db/mw/msg.html?mid=1064055
>
> However, you will probably not like the fix, since it eliminates the
> bogusly small cost estimate for the duplicated index condition, and
> thereby ensures that your less-favored plan will always be chosen :-(
>
> What would be interesting is to look into why the planner's estimated
> costs are inaccurate.  I think the main cause is the badly-off join
> estimate for the tr/t join --- notice it's estimating 1119 rows out
> where only 52 are actually produced.  The nestloop's runtime is directly
> proportional to the number of outer rows, so this leads directly to a
> factor-of-20 overestimate of the nestloop's cost, discouraging the
> planner from using it.  The bug that's triggered by the duplicate
> index condition underestimates the cost, thereby negating that error to
> some extent.
>
> You should look into whether increasing the statistics targets for
> t.terminal and tr.terminal would improve the accuracy of the join
> estimate.

That bug does seem to be the cause of the confusing plan change.  I really
don't understand that bad estimate for the join though (the estimate of 1119
is for the tr/m join, rather than the tr/t join).

Here are some details about the tables and joins involved:

The tr/t join produces 52 rows with unique trailers (the primary key on tr)
out of the 750 available (the planner estimates 62).  These are then joined
with the manifests table m, which has 13526 rows.  The relationship between
tr.trailer and m.trailer is a bit complex.  Of the 750 possible trailer
values in tr, 607 have a one-to-one mapping to rows in m.  The remaining 143
values are each referenced in 1-70 (avg 24) different rows in m.
Additionally, 9510 rows in m (the vast majority) have a null value for
trailer (perhaps that is the cause of these bad statistics).

This particular query happens to touch only trailers with a one-to-one
relationship to rows in m (not unlikely, considering that this condition
holds for 607 of the 750 values).  Setting the statistics target for
m.trailer to 200 or even 750, both of which should be overkill considering
that there are only 750 possible values, and then performing a VACUUM ANALYZE
strangely makes no difference to the row estimates.

I'm having a lot of trouble tracking down the reason for this bad estimate
(all the preceding estimates are quite good).  Any further guidance would be
greatly appreciated.

Thanks,
Alec Mitchell


Re: Strange query optimization in 7.3.2

From: Tom Lane
Alec Mitchell <apm13@columbia.edu> writes:
> The tr/t join produces 52 rows with unique trailers (the primary key on tr)
> out of the 750 available (the planner estimates 62).  These are then joined
> with the manifests table m, which has 13526 rows.  The relationship between
> tr.trailer and m.trailer is a bit complex.  Of the 750 possible trailer
> values in tr, 607 have a one to one mapping to rows in m.  The remaining 143
> values are each referenced in 1-70 (avg 24) different rows in m.
> Additionally, there are 9510 rows in m (the vast majority), which have a null
> value for trailer (perhaps that is the cause of these bad statistics).

Hmm ... we fixed a bug last fall in which NULLs were twice-excluded from
the estimates for range queries, leading to silly results when the
proportion of nulls got nontrivial.  This isn't a range query, but
I wonder if there's a similar bug lurking here ...

Could you see your way to sending me a dump of these tables for testing
purposes?  You could exclude the columns not used in the FROM/WHERE
clauses, if that is needed to satisfy privacy worries.

            regards, tom lane


Re: Strange query optimization in 7.3.2

From: Tom Lane
Alec Mitchell <apm13@columbia.edu> writes:
> On Monday 14 April 2003 04:08 pm, Tom Lane wrote:
>> Could you see your way to sending me a dump of these tables for testing
>> purposes?  You could exclude the columns not used in the FROM/WHERE
>> clauses, if that is needed to satisfy privacy worries.

> I've attached dumps of the relevant tables (not including the stops
> table (s), it is huge, and doesn't come into play until after the bad
> planner estimate).

Indeed, there's a problem here with nulls --- there is a case in
eqjoinsel() that didn't account for nulls, and should've.  I've applied
the attached patch to fix it.  This reduces the estimate of the tr/t/m
join size from 1119 to 332, which is still large compared to the true
size of 52, but it's a lot better.  Without the stops table, I can't
really tell if this changes the complete plan or not --- could you apply
the patch and find out?

BTW, the remaining error seems to be because the tr.group_num = 1
constraint skews the fraction of matching "trailer" values.  I fear
there's little chance of persuading the system to figure that out in the
near future --- it can't be done without cross-column correlation
statistics, which we do not keep.

            regards, tom lane

*** src/backend/utils/adt/selfuncs.c.orig    Sat Mar 22 20:49:13 2003
--- src/backend/utils/adt/selfuncs.c    Tue Apr 15 01:13:01 2003
***************
*** 1552,1578 ****
          {
              /*
               * We do not have MCV lists for both sides.  Estimate the join
!              * selectivity as MIN(1/nd1, 1/nd2).  This is plausible if we
!              * assume that the values are about equally distributed: a
!              * given tuple of rel1 will join to either 0 or N2/nd2 rows of
!              * rel2, so total join rows are at most N1*N2/nd2 giving a
!              * join selectivity of not more than 1/nd2.  By the same logic
!              * it is not more than 1/nd1, so MIN(1/nd1, 1/nd2) is an upper
!              * bound.  Using the MIN() means we estimate from the point of
!              * view of the relation with smaller nd (since the larger nd
!              * is determining the MIN).  It is reasonable to assume that
!              * most tuples in this rel will have join partners, so the
!              * bound is probably reasonably tight and should be taken
!              * as-is.
               *
               * XXX Can we be smarter if we have an MCV list for just one
               * side? It seems that if we assume equal distribution for the
               * other side, we end up with the same answer anyway.
               */
              if (nd1 > nd2)
!                 selec = 1.0 / nd1;
              else
!                 selec = 1.0 / nd2;
          }

          if (have_mcvs1)
--- 1552,1584 ----
          {
              /*
               * We do not have MCV lists for both sides.  Estimate the join
!              * selectivity as MIN(1/nd1,1/nd2)*(1-nullfrac1)*(1-nullfrac2).
!              * This is plausible if we assume that the join operator is
!              * strict and the non-null values are about equally distributed:
!              * a given non-null tuple of rel1 will join to either zero or
!              * N2*(1-nullfrac2)/nd2 rows of rel2, so total join rows are at
!              * most N1*(1-nullfrac1)*N2*(1-nullfrac2)/nd2 giving a join
!              * selectivity of not more than (1-nullfrac1)*(1-nullfrac2)/nd2.
!              * By the same logic it is not more than
!              * (1-nullfrac1)*(1-nullfrac2)/nd1, so the expression with MIN()
!              * is an upper bound.  Using the MIN() means we estimate from the
!              * point of view of the relation with smaller nd (since the larger
!              * nd is determining the MIN).  It is reasonable to assume that
!              * most tuples in this rel will have join partners, so the bound
!              * is probably reasonably tight and should be taken as-is.
               *
               * XXX Can we be smarter if we have an MCV list for just one
               * side? It seems that if we assume equal distribution for the
               * other side, we end up with the same answer anyway.
               */
+             double        nullfrac1 = stats1->stanullfrac;
+             double        nullfrac2 = stats2->stanullfrac;
+
+             selec = (1.0 - nullfrac1) * (1.0 - nullfrac2);
              if (nd1 > nd2)
!                 selec /= nd1;
              else
!                 selec /= nd2;
          }

          if (have_mcvs1)


Re: Strange query optimization in 7.3.2

From: Alec Mitchell
On Monday 14 April 2003 10:39 pm, Tom Lane wrote:
> Indeed, there's a problem here with nulls --- there is a case in
> eqjoinsel() that didn't account for nulls, and should've.  I've applied
> the attached patch to fix it.  This reduces the estimate of the tr/t/m
> join size from 1119 to 332, which is still large compared to the true
> size of 52, but it's a lot better.  Without the stops table, I can't
> really tell if this changes the complete plan or not --- could you apply
> the patch and find out?
>
> BTW, the remaining error seems to be because the tr.group_num = 1
> constraint skews the fraction of matching "trailer" values.  I fear
> there's little chance of persuading the system to figure that out in the
> near future --- it can't be done without cross-column correlation
> statistics, which we do not keep.

I applied the patch to the 7.3.2 sources, but strangely I get a segfault when
I run initdb using the patched PostgreSQL.  Specifically I get this error:

creating system views... ok
loading pg_description... /usr/local/pgsql/bin/initdb: line 801: 14633 Done
( cat  <<EOF
    CREATE TEMP TABLE tmp_pg_description (      objoid oid,     classname
name,objsubid int4,   description text) WITHOUT OIDS;
    COPY tmp_pg_description FROM STDIN;
EOF
 cat "$POSTGRES_DESCR"; cat  <<EOF
\.
    INSERT INTO pg_description SELECT   t.objoid, c.oid, t.objsubid,
t.description     FROM tmp_pg_description t, pg_class c WHERE c.relname =
t.classname;
EOF
 )
     14634 Segmentation fault      | "$PGPATH"/postgres $PGSQL_OPT template1
>/dev/null

initdb failed.
Removing /var/lib/postgres/data.

This doesn't make much sense to me as the patch is pretty straightforward.
I'm running an up to date Debian Woody distribution, with a 2.4.21-pre6
kernel.  Let me know if there is anything I can do to help figure this out.

Having spent so much time thinking about this query and its odd results, I've
realized that there is a query which does not involve the manifests table
but returns the same results and always provides more accurate statistics for
the planner (the stops table is not normalized and has some redundant data,
allowing a few different paths to the same results).  It's marginally slower
but should scale better as the stops and manifests tables grow.  As a result,
I'm not too worried about the particulars of this optimization anymore
(though looking at the plan and estimates for my new query, an estimate of
332 rows from m should probably result in the faster nested loop plan being
chosen).

Thanks for all your help,
Alec Mitchell


Re: Strange query optimization in 7.3.2

From: Tom Lane
Alec Mitchell <apm13@columbia.edu> writes:
> I applied the patch to the 7.3.2 sources, but strangely I get a segfault when
> I run initdb using the patched PostgreSQL.  Specifically I get this error:

Sigh ... I do know better than to commit changes without having
regression-tested 'em.  Honest ;-)

Add this patch atop the last:

*** src/backend/utils/adt/selfuncs.c.orig    Tue Apr 15 01:18:12 2003
--- src/backend/utils/adt/selfuncs.c    Wed Apr 16 00:37:58 2003
***************
*** 1610,1617 ****
               * side? It seems that if we assume equal distribution for the
               * other side, we end up with the same answer anyway.
               */
!             double        nullfrac1 = stats1->stanullfrac;
!             double        nullfrac2 = stats2->stanullfrac;

              selec = (1.0 - nullfrac1) * (1.0 - nullfrac2);
              if (nd1 > nd2)
--- 1610,1617 ----
               * side? It seems that if we assume equal distribution for the
               * other side, we end up with the same answer anyway.
               */
!             double        nullfrac1 = stats1 ? stats1->stanullfrac : 0.0;
!             double        nullfrac2 = stats2 ? stats2->stanullfrac : 0.0;

              selec = (1.0 - nullfrac1) * (1.0 - nullfrac2);
              if (nd1 > nd2)



            regards, tom lane


Re: Strange query optimization in 7.3.2

From: Alec Mitchell
On Tuesday 15 April 2003 09:40 pm, Tom Lane wrote:
> Sigh ... I do know better than to commit changes without having
> regression-tested 'em.  Honest ;-)
>
> Add this patch atop the last:
>
    It is running once again, thanks.  Unfortunately this doesn't seem to change
the query plan (actual times have changed because this is running on a faster
computer):

 Hash Join  (cost=123.97..7169.64 rows=2890 width=140) (actual
time=134.53..3106.29 rows=19775 loops=1)
   Hash Cond: ("outer".manifest = "inner".manifest)
   ->  Seq Scan on stops s  (cost=0.00..6421.15 rows=117680 width=88) (actual
time=0.19..2241.00 rows=118606 loops=1)
         Filter: ((date >= '01/01/2003'::date) AND (date <=
'01/31/2003'::date))
   ->  Hash  (cost=123.13..123.13 rows=332 width=52) (actual
time=134.09..134.09 rows=0 loops=1)
         ->  Merge Join  (cost=16.79..123.13 rows=332 width=52) (actual
time=132.26..133.88 rows=52 loops=1)
               Merge Cond: ("outer".trailer = "inner".trailer)
               ->  Index Scan using manifests_trailer_idx on manifests m
(cost=0.00..309.32 rows=13526 width=15) (actual time=51.96..83.88 rows=174
loops=1)
               ->  Sort  (cost=16.79..16.94 rows=62 width=37) (actual
time=48.68..48.74 rows=52 loops=1)
                     Sort Key: tr.trailer
                     ->  Hash Join  (cost=1.47..14.94 rows=62 width=37)
(actual time=36.33..48.31 rows=52 loops=1)
                           Hash Cond: ("outer".terminal = "inner".terminal)
                           ->  Index Scan using trucks_group_idx on trucks tr
(cost=0.00..10.31 rows=319 width=21) (actual time=29.79..41.05 rows=319
loops=1)
                                 Index Cond: (group_num = 1)
                           ->  Hash  (cost=1.45..1.45 rows=7 width=16) (actual
time=5.82..5.82 rows=0 loops=1)
                                 ->  Seq Scan on terminals t  (cost=0.00..1.45
rows=7 width=16) (actual time=5.70..5.79 rows=7 loops=1)
                                       Filter: (city_id = 2)
 Total runtime: 3130.95 msec

It seems odd to me that the Index Scan on manifests predicts that all rows
will be returned, but I guess that doesn't really influence the join decision.
Also it turns out the "better" query I was using mistakenly included redundant
join conditions again.  The same slow execution path is chosen without the
redundancy (silly me).  So even a query that gives accurate planner
statistics (estimate of 62 rows vs. 52 actual rows at the join) results in a
hash join which ignores the existing indices rather than a nested loop which
uses them efficiently.  Maybe I should just disable seqscan for this query,
and reenable it afterward (it will be run within a function, so that should
be easy).  If you think that this issue merits further consideration, I'd be
happy to send you the relevant columns of the stops table.  Many thanks for
your help so far.
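
The override mentioned above might look like this (a sketch only; SET
applies to the session, so restoring the setting afterward matters if the
connection is reused):

```sql
-- Discourage sequential scans for just this query, then restore the default.
SET enable_seqscan TO off;
SELECT * FROM trucks tr
  JOIN terminals t ON (t.terminal = tr.terminal)
  JOIN manifests m ON (tr.trailer = m.trailer)
  JOIN stops s ON (m.manifest = s.manifest)
WHERE tr.group_num = 1 AND t.city_id = 2
  AND s.date BETWEEN '1/1/2003' AND '1/31/2003';
SET enable_seqscan TO on;
```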

Alec Mitchell


Re: Strange query optimization in 7.3.2

From: Alec Mitchell
On Thursday 17 April 2003 10:12 am, Alec Mitchell wrote:
> It seems odd to me that Index Scan on manifests predicts that all rows will
> be returned, but I guess that doesn't really influence the join decision.
> Also it turns out the "better" query I was using mistakenly included
> redundant join conditions again.  The same slow execution path is chosen
> without the redundancy (silly me).  So even a query that gives accurate
> planner statistics (estimate of 62 rows vs. 52 actual rows at the join)
> results in a hash join which ignores the existing indices rather than a
> nested loop which uses them efficiently.  Maybe I should just disable
> seqscan for this query, and reenable it afterward (it will be run within a
> function, so that should be easy).  If you think that this issue merits
> further consideration, I'd be happy to send you the relevant columns of the
> stops table.  Many thanks for your help so far.
>
I'll reply to myself to provide some details on why the nested loop isn't
chosen unless enable_seqscan is set to off.  This query skips the indirect
route through the manifests table and joins directly to the stops table on
(s.trailer = tr.trailer) like this:

SELECT * from (trucks tr JOIN terminals t ON (t.terminal = tr.terminal)) JOIN
stops s ON (tr.trailer = s.trailer) WHERE ((tr.group_num = 1) AND (t.city_id
= 2) AND (s.date BETWEEN '1/1/2003' AND '1/31/2003'));

I had been doing the join through manifests because a join on an integer is
faster than a join on a varchar, but manifests will grow much faster than
trailers; that, combined with the unsolvably poor statistics for the tr/m
join, makes it a poor choice for this query as the tables grow.

With enable_seqscan set to off, I get the fastest query plan (at least with
the patched version; unpatched, I have to turn off both enable_hashjoin and
enable_mergejoin).  It looks like the reason the fast nested loop isn't
chosen by default is that the planner estimates a cost of 477.54 per loop
for 62 loops, whereas the actual cost is 8.34 per loop for 52 loops.  That's
an order of magnitude farther off than the planner's other cost estimates.
The EXPLAIN ANALYZE output follows:

 Nested Loop  (cost=6.02..29697.65 rows=7545 width=125) (actual
time=1.43..756.54 rows=19775 loops=1)
   ->  Hash Join  (cost=6.02..19.50 rows=62 width=37) (actual time=1.32..5.22
rows=52 loops=1)
         Hash Cond: ("outer".terminal = "inner".terminal)
         ->  Index Scan using trucks_group_idx on trucks tr  (cost=0.00..10.31
rows=319 width=21) (actual time=0.04..3.12 rows=319 loops=1)
               Index Cond: (group_num = 1)
         ->  Hash  (cost=6.00..6.00 rows=7 width=16) (actual time=0.58..0.58
rows=0 loops=1)
               ->  Index Scan using terminals_pkey on terminals t
(cost=0.00..6.00 rows=7 width=16) (actual time=0.38..0.55 rows=7 loops=1)
                     Filter: (city_id = 2)
   ->  Index Scan using stops_trailer_date_idx on stops s  (cost=0.00..476.91
rows=124 width=88) (actual time=0.05..7.89 rows=380 loops=52)
         Index Cond: (("outer".trailer = s.trailer) AND (s.date >=
'01/01/2003'::date) AND (s.date <= '01/31/2003'::date))
 Total runtime: 785.12 msec


Thanks Again,
Alec Mitchell


Re: Strange query optimization in 7.3.2

From: Tom Lane
Alec Mitchell <apm13@columbia.edu> writes:
> With enable_seqscan set to off, I get the fastest query plan (at least with
> the patched version, unpatched I have to turn off both enable_hashjoin and
> enable_mergejoin).  It looks like the reason the fast nested loop isn't
> chosen by default is that the planner estimates a cost of 477.54 per loop
> for 62 loops, whereas the actual cost is 8.34 per loop for 52 loops.

This is a known issue --- the planner overestimates the cost of a
nestloop with inner indexscan, because it doesn't allow for the fact
that the successive inner indexscans aren't independent.  (The upper
btree layers, at least, are sure to stay in memory over the successive
probes of the inner table ... but the costing charges for a from-scratch
indexscan each time through.)  We've discussed this before but I don't
think anyone's found an appropriate substitute equation.
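
For a sense of scale, the per-probe figures quoted in the previous message
work out like this (simple arithmetic on the numbers as Alec reported them,
not planner internals):

```python
estimated = 477.54   # planner's cost per inner indexscan probe
actual = 8.34        # observed cost per probe, once cache effects kick in
print(f"overestimate: {estimated / actual:.0f}x")
```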

            regards, tom lane