Thread: osdl-dbt3 run results - puzzled by the execution plans

osdl-dbt3 run results - puzzled by the execution plans

From
Jenny Zhang
Date:
Our hardware/software configuration:
kernel: 2.5.74
distro: RH7.2
pgsql:  7.3.3
CPUS:   8
MHz:    700.217
model:  Pentium III (Cascades)
memory: 8298888 kB
shmmax: 3705032704

We did several sets of runs(repeating runs with the same database
parameters) and have the following observation:

1. With everything else the same, we did two run sets with
small effective_cache_size (default=1000) and large (655360 i.e. 5GB
or 60% of the system memory 8GB).  It seems to me that small
effective_cache_size favors the choice of nested loop joins (NLJ)
while the big effective_cache_size is in favor of merge joins (MJ).
We thought the large effective_cache_size should lead us to better
plans. But we found the opposite.

Three plans out of 22 are different.  Two of those plans are worse
in execution time by 2 times and 8 times.   For example, one plan,
that included  NLJ ran in 4 seconds but the other, switching to an
MJ, ran in  32 seconds.  Please refer to the link at the end of
this mail for the query and plans.  Did we miss something, or
improvements are needed for the optimizer?

2. Thanks to all the response we got from this mailing list, we
decided to use SETSEED(0) default_statistics_target=1000 to reduce
the variation.   We get now the exact the same execution plans
and costs with repeated runs and that reduced the variation a lot.
However, within the same run set consist of 6 runs, we see 2-3%
standard deviation for the run metrics associated with the multiple
stream part of the test (as opposed to the single stream part).

We would like to reduce the variation to be less than 1% so that a
2% change between two different kernels would be significant.
Is there anything else we can do?

query: http://developer.osdl.org/~jenny/11.sql
plan with small effective_cache_size:
http://developer.osdl.org/~jenny/small_effective_cache_size_plan
plan with large effective_cache_size:
http://developer.osdl.org/~jenny/large_effective_cache_size_plan

Thanks,
Jenny


Re: osdl-dbt3 run results - puzzled by the execution plans

From
"Matt Clark"
Date:
> We thought the large effective_cache_size should lead us to better
> plans. But we found the opposite.

Maybe it's inappropriate for little old me to jump in here, but the plan
isn't usually that important compared to the actual runtime.  The links you
give show the output of 'explain' but not 'explain analyze', so it's not
clear wich plan is actually _faster_.

If you really do have only 8MB of FS cache, then either plan will run
slowly.  If you really do have 5GB of FS cache then either plan will run a
lot faster.  Why would you deliberately give the planner false information
about this?

PG obviously thinks plan 1 is 'better' when pages have to be fetched from
disk, and plan 2 is 'better' when they don't.  Which is really better
depends on whether those pages do have to be fetched from disk or not, and
PG can only know what you tell it about that, so changing ECS without
actually removing the RAM from the system seems a little pointless to me...

M





Re: osdl-dbt3 run results - puzzled by the execution

From
Jenny Zhang
Date:
Thanks for your prompt reply.

On Thu, 2003-09-18 at 16:19, Matt Clark wrote:
> > We thought the large effective_cache_size should lead us to better
> > plans. But we found the opposite.
>
> Maybe it's inappropriate for little old me to jump in here, but the plan
> isn't usually that important compared to the actual runtime.  The links you
> give show the output of 'explain' but not 'explain analyze', so it's not
> clear wich plan is actually _faster_.
>
I put the EXPLAIN ANALYZE output at:
http://developer.osdl.org/~jenny/large_explain_analyze
http://developer.osdl.org/~jenny/small_explain_analyze
The actual execution time is 37 seconds(large) vs 5 seconds (small).

I concluded the one with nested loop one is faster since we saw it
consistently faster than the merge join one in our runs.
> If you really do have only 8MB of FS cache, then either plan will run
> slowly.  If you really do have 5GB of FS cache then either plan will run a
> lot faster.  Why would you deliberately give the planner false information
> about this?
>
We did not.  A little history of our runs:
When we first started, not knowing PG well, we just used the default ECS
value(1000).
Then we realized since we have 8G of RAM, we should set ECS to 655360.
But this leads the optimizer to pick a bad plan.  This is the reason why
we post this message.
> PG obviously thinks plan 1 is 'better' when pages have to be fetched from
> disk, and plan 2 is 'better' when they don't.  Which is really better
> depends on whether those pages do have to be fetched from disk or not, and
> PG can only know what you tell it about that, so changing ECS without
> actually removing the RAM from the system seems a little pointless to me...
>
> M
>
>
>
Regards,
Jenny


Re: osdl-dbt3 run results - puzzled by the execution plans

From
Tom Lane
Date:
Jenny Zhang <jenny@osdl.org> writes:
> ...  It seems to me that small
> effective_cache_size favors the choice of nested loop joins (NLJ)
> while the big effective_cache_size is in favor of merge joins (MJ).

No, I wouldn't think that, because a nestloop plan will involve repeated
fetches of the same tuples whereas a merge join doesn't (at least not
when it sorts its inner input, as this plan does).  Larger cache
improves the odds of a repeated fetch not having to do I/O.  In practice
a larger cache area would also have some effects on access costs for the
sort's temp file, but I don't think the planner's cost model for sorting
takes that into account.

As Matt Clark points out nearby, the real question is whether these
planner estimates have anything to do with reality.  EXPLAIN ANALYZE
results would be far more interesting than plain EXPLAIN.

> However, within the same run set consist of 6 runs, we see 2-3%
> standard deviation for the run metrics associated with the multiple
> stream part of the test (as opposed to the single stream part).

<python> Och, laddie, we useta *dream* of 2-3% variation </python>

> We would like to reduce the variation to be less than 1% so that a
> 2% change between two different kernels would be significant.

I think this is a pipe dream.  Variation in where the data gets laid
down on your disk drive would alone create more than that kind of delta.
I'm frankly amazed you could get repeatability within 2-3%.

            regards, tom lane

Re: osdl-dbt3 run results - puzzled by the execution

From
"Matt Clark"
Date:
> I put the EXPLAIN ANALYZE output at:
> http://developer.osdl.org/~jenny/large_explain_analyze
> http://developer.osdl.org/~jenny/small_explain_analyze
> The actual execution time is 37 seconds(large) vs 5 seconds (small).
>

There's an obvious row count misestimation in the 'large' plan:

->  Sort  (cost=519.60..520.60 rows=400 width=31) (actual time=106.88..143.49 rows=30321 loops=1)

but I'm not good enough at reading these things to tell if that's the cause of the problem, or if so how to fix it :-(




Re: osdl-dbt3 run results - puzzled by the execution plans

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> I think this is a pipe dream.  Variation in where the data gets laid
> down on your disk drive would alone create more than that kind of delta.
> I'm frankly amazed you could get repeatability within 2-3%.

I think the reason he gets good repeatability is because he's talking about
the aggregate results for a whole test run. Not individual queries. In theory
you could just run the whole test multiple times. The more times you run it
the lower the variation in the total run time would be.

Actually, the variation in run time is also a useful statistic, both for
postgres and the kernel. It might be useful to do multiple complete runs and
keep track of the average standard deviation of the time required for each
step.

Higher standard deviation implies queries can't be reliably depended on not to
take inordinately long, which can be a problem for some working models. For
the kernel it could mean latency issues or it could mean the swapper or buffer
cache was overly aggressive.

--
greg

Re: osdl-dbt3 run results - puzzled by the execution plans

From
Manfred Koizar
Date:
On Thu, 18 Sep 2003 15:36:50 -0700, Jenny Zhang <jenny@osdl.org>
wrote:
>We thought the large effective_cache_size should lead us to better
>plans. But we found the opposite.

The common structure of your query plans is:

 Sort
   Sort Key: sum((partsupp.ps_supplycost * partsupp.ps_availqty))
   InitPlan
     ->  Aggregate
           ->  SubPlan
   ->  Aggregate
         Filter: (sum((ps_supplycost * ps_availqty)) > $0)
         ->  Group
               ->  Sort
                     Sort Key: partsupp.ps_partkey
                     ->  SubPlan (same as above)

where the SubPlan is

 ->  Merge Join  (cost=519.60..99880.05 rows=32068 width=65)
                 (actual time=114.78..17435.28 rows=30400 loops=1)
                 ctr=5.73
       Merge Cond: ("outer".ps_suppkey = "inner".s_suppkey)
       ->  Index Scan using i_ps_suppkey on partsupp
                 (cost=0.00..96953.31 rows=801712 width=34)
                 (actual time=0.42..14008.92 rows=799361 loops=1)
                 ctr=6.92
       ->  Sort  (cost=519.60..520.60 rows=400 width=31)
                 (actual time=106.88..143.49 rows=30321 loops=1)
                 ctr=3.63
             Sort Key: supplier.s_suppkey
             ->  SubSubPlan

for large effective_cache_size and

 ->  Nested Loop  (cost=0.00..130168.30 rows=32068 width=65)
                  (actual time=0.56..1374.41 rows=30400 loops=1)
                  ctr=94.71
       ->  SubSubPlan
       ->  Index Scan using i_ps_suppkey on partsupp
                 (cost=0.00..323.16 rows=80 width=34)
                 (actual time=0.16..2.98 rows=80 loops=380)
                 ctr=108.44
             Index Cond: (partsupp.ps_suppkey = "outer".s_suppkey)

for small effective_cache_size.  Both subplans have an almost
identical subsubplan:

->  Nested Loop  (cost=0.00..502.31 rows=400 width=31)
                 (actual time=0.23..110.51 rows=380 loops=1)
                 ctr=4.55
      Join Filter: ("inner".s_nationkey = "outer".n_nationkey)
      ->  Seq Scan on nation  (cost=0.00..1.31 rows=1 width=10)
                              (actual time=0.08..0.14 rows=1 loops=1)
                              ctr=9.36
            Filter: (n_name = 'ETHIOPIA'::bpchar)
      ->  Seq Scan on supplier (cost=0.00..376.00 rows=10000 width=21)
                          (actual time=0.10..70.72 rows=10000 loops=1)
                               ctr=5.32

I have added the ctr (cost:time ratio) for each plan node.  These
values are mostly between 5 and 10 with two notable exceptions:

1)     ->  Sort  (cost=519.60..520.60 rows=400 width=31)
                 (actual time=106.88..143.49 rows=30321 loops=1)
                 ctr=3.63

It has already been noticed by Matt Clark that this is the only plan
node where the row count estimation looks wrong.  However, I don't
believe that this has great influence on the total cost of the plan,
because the ctr is not far from the usual range and if it were a bit
higher, it would only add a few hundred cost units to a branch costing
almost 100000 units.  BTW I vaguely remember that there is something
strange with the way actual rows are counted inside a merge join.
Look at the branch below this plan node:  It shows an actual row count
of 380.

2)     ->  Index Scan using i_ps_suppkey on partsupp
                 (cost=0.00..323.16 rows=80 width=34)
                 (actual time=0.16..2.98 rows=80 loops=380)
                 ctr=108.44

Here we have the only plan node where loops > 1, and it is the only
one where the ctr is far off.  The planner computes the cost for one
loop and multiplies it by the number of loops (which it estimates
quite accurately to be 400), thus getting a total cost of ca. 130000.
We have no reason to believe that the single loop cost is very far
from reality (for a *single* index scan), but the planner does not
account for additional index scans hitting pages in the cache that
have been brought in by preceding scans.  This is a known problem, Tom
has mentioned it several times, IIRC.

Now I'm very interested in getting a better understanding of this
problem, so could you please report the results of

. \d i_ps_suppkey

. VACUUM VERBOSE ANALYSE partsupp;
  VACUUM VERBOSE ANALYSE supplier;

. SELECT attname, null_frac, avg_witdh, n_distinct, correlation
    FROM pg_stats
   WHERE tablename = 'partsupp' AND attname IN ('ps_suppkey', ...);

  Please insert other interesting column names for ..., especially
  those contained in i_ps_suppkey, if any.

. SELECT relname, relpages, reltuples
    FROM pg_class
   WHERE relname IN ('partsupp', 'supplier', ...);
                                             ^^^
                    Add relevant index names here.

. EXPLAIN ANALYSE
  SELECT ps_partkey, ps_supplycost, ps_availqty
    FROM partsupp, supplier
   WHERE ps_suppkey = s_suppkey AND s_nationkey = '<youknowit>';

  The idea is to eliminate parts of the plan that are always the same.
  Omitting nation is possibly to much a simplification.  In this case
  please re-add it.
  Do this test for small and large effective_cache_size.
  Force the use of other join methods by setting enable_<joinmethod>
  to off.  Post all results.


Jenny, I understand that this long message contains more questions
than answers and is not of much help for you.  OTOH your tests might
be very helpful for Postgres development ...

Servus
 Manfred

Re: osdl-dbt3 run results - puzzled by the execution

From
Jenny Zhang
Date:
I posted more results as you requested:

On Fri, 2003-09-19 at 08:08, Manfred Koizar wrote:
> On Thu, 18 Sep 2003 15:36:50 -0700, Jenny Zhang <jenny@osdl.org>
> wrote:
> >We thought the large effective_cache_size should lead us to better
> >plans. But we found the opposite.
>
> The common structure of your query plans is:
>
>  Sort
>    Sort Key: sum((partsupp.ps_supplycost * partsupp.ps_availqty))
>    InitPlan
>      ->  Aggregate
>            ->  SubPlan
>    ->  Aggregate
>          Filter: (sum((ps_supplycost * ps_availqty)) > $0)
>          ->  Group
>                ->  Sort
>                      Sort Key: partsupp.ps_partkey
>                      ->  SubPlan (same as above)
>
> where the SubPlan is
>
>  ->  Merge Join  (cost=519.60..99880.05 rows=32068 width=65)
>                  (actual time=114.78..17435.28 rows=30400 loops=1)
>                  ctr=5.73
>        Merge Cond: ("outer".ps_suppkey = "inner".s_suppkey)
>        ->  Index Scan using i_ps_suppkey on partsupp
>                  (cost=0.00..96953.31 rows=801712 width=34)
>                  (actual time=0.42..14008.92 rows=799361 loops=1)
>                  ctr=6.92
>        ->  Sort  (cost=519.60..520.60 rows=400 width=31)
>                  (actual time=106.88..143.49 rows=30321 loops=1)
>                  ctr=3.63
>              Sort Key: supplier.s_suppkey
>              ->  SubSubPlan
>
> for large effective_cache_size and
>
>  ->  Nested Loop  (cost=0.00..130168.30 rows=32068 width=65)
>                   (actual time=0.56..1374.41 rows=30400 loops=1)
>                   ctr=94.71
>        ->  SubSubPlan
>        ->  Index Scan using i_ps_suppkey on partsupp
>                  (cost=0.00..323.16 rows=80 width=34)
>                  (actual time=0.16..2.98 rows=80 loops=380)
>                  ctr=108.44
>              Index Cond: (partsupp.ps_suppkey = "outer".s_suppkey)
>
> for small effective_cache_size.  Both subplans have an almost
> identical subsubplan:
>
> ->  Nested Loop  (cost=0.00..502.31 rows=400 width=31)
>                  (actual time=0.23..110.51 rows=380 loops=1)
>                  ctr=4.55
>       Join Filter: ("inner".s_nationkey = "outer".n_nationkey)
>       ->  Seq Scan on nation  (cost=0.00..1.31 rows=1 width=10)
>                               (actual time=0.08..0.14 rows=1 loops=1)
>                               ctr=9.36
>             Filter: (n_name = 'ETHIOPIA'::bpchar)
>       ->  Seq Scan on supplier (cost=0.00..376.00 rows=10000 width=21)
>                           (actual time=0.10..70.72 rows=10000 loops=1)
>                                ctr=5.32
>
> I have added the ctr (cost:time ratio) for each plan node.  These
> values are mostly between 5 and 10 with two notable exceptions:
>
> 1)     ->  Sort  (cost=519.60..520.60 rows=400 width=31)
>                  (actual time=106.88..143.49 rows=30321 loops=1)
>                  ctr=3.63
>
> It has already been noticed by Matt Clark that this is the only plan
> node where the row count estimation looks wrong.  However, I don't
> believe that this has great influence on the total cost of the plan,
> because the ctr is not far from the usual range and if it were a bit
> higher, it would only add a few hundred cost units to a branch costing
> almost 100000 units.  BTW I vaguely remember that there is something
> strange with the way actual rows are counted inside a merge join.
> Look at the branch below this plan node:  It shows an actual row count
> of 380.
>
> 2)     ->  Index Scan using i_ps_suppkey on partsupp
>                  (cost=0.00..323.16 rows=80 width=34)
>                  (actual time=0.16..2.98 rows=80 loops=380)
>                  ctr=108.44
>
> Here we have the only plan node where loops > 1, and it is the only
> one where the ctr is far off.  The planner computes the cost for one
> loop and multiplies it by the number of loops (which it estimates
> quite accurately to be 400), thus getting a total cost of ca. 130000.
> We have no reason to believe that the single loop cost is very far
> from reality (for a *single* index scan), but the planner does not
> account for additional index scans hitting pages in the cache that
> have been brought in by preceding scans.  This is a known problem, Tom
> has mentioned it several times, IIRC.
>
> Now I'm very interested in getting a better understanding of this
> problem, so could you please report the results of
>
> . \d i_ps_suppkey
>
http://developer.osdl.org/~jenny/pgsql-optimizer/disc_i_ps_suppkey
> . VACUUM VERBOSE ANALYSE partsupp;
>   VACUUM VERBOSE ANALYSE supplier;
>
http://developer.osdl.org/~jenny/pgsql-optimizer/vacuum_verbose_analyze_partsupp
http://developer.osdl.org/~jenny/pgsql-optimizer/vacuum_verbose_analyze_suppler
> . SELECT attname, null_frac, avg_witdh, n_distinct, correlation
>     FROM pg_stats
>    WHERE tablename = 'partsupp' AND attname IN ('ps_suppkey', ...);
>
>   Please insert other interesting column names for ..., especially
>   those contained in i_ps_suppkey, if any.
>
I put all the related columns
http://developer.osdl.org/~jenny/pgsql-optimizer/info_partsupp_col

> . SELECT relname, relpages, reltuples
>     FROM pg_class
>    WHERE relname IN ('partsupp', 'supplier', ...);
>                                              ^^^
>                     Add relevant index names here.
>
I put all the related tables
http://developer.osdl.org/~jenny/pgsql-optimizer/info_table

> . EXPLAIN ANALYSE
>   SELECT ps_partkey, ps_supplycost, ps_availqty
>     FROM partsupp, supplier
>    WHERE ps_suppkey = s_suppkey AND s_nationkey = '<youknowit>';
>
>   The idea is to eliminate parts of the plan that are always the same.
>   Omitting nation is possibly to much a simplification.  In this case
>   please re-add it.
>   Do this test for small and large effective_cache_size.
>   Force the use of other join methods by setting enable_<joinmethod>
>   to off.  Post all results.
>
http://developer.osdl.org/~jenny/pgsql-optimizer/explain_query_mk
>
> Jenny, I understand that this long message contains more questions
> than answers and is not of much help for you.  OTOH your tests might
> be very helpful for Postgres development ...
Let me know if you need anything else

Jenny


Re: osdl-dbt3 run results - puzzled by the execution

From
Jenny Zhang
Date:
On Thu, 2003-09-18 at 20:20, Tom Lane wrote:
> Jenny Zhang <jenny@osdl.org> writes:
> > ...  It seems to me that small
> > effective_cache_size favors the choice of nested loop joins (NLJ)
> > while the big effective_cache_size is in favor of merge joins (MJ).
>
> No, I wouldn't think that, because a nestloop plan will involve repeated
> fetches of the same tuples whereas a merge join doesn't (at least not
> when it sorts its inner input, as this plan does).  Larger cache
> improves the odds of a repeated fetch not having to do I/O.  In practice
> a larger cache area would also have some effects on access costs for the
> sort's temp file, but I don't think the planner's cost model for sorting
> takes that into account.
I think there is some misunderstanding here.  What I meant to say is:
From the plans we got, the optimizer favors the choice of nested loop
joins (NLJ) while the big effective_cache_size is in favor of merge
joins (MJ).  Which we think is not appropriate.  We verified that
sort_mem has no impact on the plans.  Though it would be nice to take
that into account.
>
> As Matt Clark points out nearby, the real question is whether these
> planner estimates have anything to do with reality.  EXPLAIN ANALYZE
> results would be far more interesting than plain EXPLAIN.
>
> > However, within the same run set consist of 6 runs, we see 2-3%
> > standard deviation for the run metrics associated with the multiple
> > stream part of the test (as opposed to the single stream part).
>
> <python> Och, laddie, we useta *dream* of 2-3% variation </python>
>
BTW, I am a she :-)
> > We would like to reduce the variation to be less than 1% so that a
> > 2% change between two different kernels would be significant.
>
> I think this is a pipe dream.  Variation in where the data gets laid
> down on your disk drive would alone create more than that kind of delta.
> I'm frankly amazed you could get repeatability within 2-3%.
>
Greg is right.  The repeatability is due to the aggregate results for a
whole test run.  As for individual query, the power test(single stream)
is very consistent, and the throughput test(multiple streams), any given
query execution time varies up to 15% if no swapping.  If we set
sort_mem too high and swapping occurs, the variation is bigger.

Jenny


Re: osdl-dbt3 run results - puzzled by the execution

From
Jenny Zhang
Date:
On Fri, 2003-09-19 at 06:12, Greg Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
> > I think this is a pipe dream.  Variation in where the data gets laid
> > down on your disk drive would alone create more than that kind of delta.
> > I'm frankly amazed you could get repeatability within 2-3%.
>
> I think the reason he gets good repeatability is because he's talking about
> the aggregate results for a whole test run. Not individual queries. In theory
> you could just run the whole test multiple times. The more times you run it
> the lower the variation in the total run time would be.
>
That is right.  The repeatability is due to the aggregate results for a
whole test run.  As for individual query, the power test(single stream)
is very consistent, and the throughput test(multiple streams), any given
query execution time varies up to 15% if no swapping.  If we set
sort_mem too high and swapping occurs, the variation is bigger.

> Actually, the variation in run time is also a useful statistic, both for
> postgres and the kernel. It might be useful to do multiple complete runs and
> keep track of the average standard deviation of the time required for each
> step.
>
I created a page with the execution time(in seconds), average, and
stddev for each query and each steps.  The data is collected from 6 dbt3
runs.
http://developer.osdl.org/~jenny/pgsql-optimizer/exetime.html

> Higher standard deviation implies queries can't be reliably depended on not to
> take inordinately long, which can be a problem for some working models. For
> the kernel it could mean latency issues or it could mean the swapper or buffer
> cache was overly aggressive.
I agree.  I can think of another reason why the performance varies even
the swapper and buffer cache is not overly aggressive.  Since PG depends
on OS to manage the buffer cache(correct me if I am wrong), it is up to
OS to decide what to keep in the cache.  And OS can not anticipate what
is likely needed next.

Thanks,
Jenny


Re: osdl-dbt3 run results - puzzled by the execution

From
Manfred Koizar
Date:
On Fri, 19 Sep 2003 11:35:35 -0700, Jenny Zhang <jenny@osdl.org>
wrote:
>I posted more results as you requested:

Unfortunately they only confirm what I suspected earlier:

>> 2)     ->  Index Scan using i_ps_suppkey on partsupp
>>                  (cost=0.00..323.16 rows=80 width=34)
>>                  (actual time=0.16..2.98 rows=80 loops=380)
>>                  ctr=108.44

>> the planner does not
>> account for additional index scans hitting pages in the cache that
>> have been brought in by preceding scans.  This is a known problem

PF1 = estimated number of page fetches for one loop ~ 320
L   = estimated number of loops ~ 400
P   = number of pages in relation ~ 21000

Cutting down the number of heap page fetches if PF1 * L > P and P <
effective_cache_size seems like an obvious improvement, but I was not
able to figure out where to make this change.  Maybe it belongs into
costsize.c near

    run_cost += outer_path_rows *
        (inner_path->total_cost - inner_path->startup_cost) *
        joininfactor;

in cost_nestloop() or it should be pushed into the index cost
estimation functions.  Hackers?

For now you have to keep lying about effective_cache_size to make the
planner overestimate merge joins to compensate for the planner's
overestimation of nested loops.  Sorry for having no better answer.

Servus
 Manfred

Re: [HACKERS] osdl-dbt3 run results - puzzled by the execution

From
Tom Lane
Date:
Manfred Koizar <mkoi-pg@aon.at> writes:
> Cutting down the number of heap page fetches if PF1 * L > P and P <
> effective_cache_size seems like an obvious improvement, but I was not
> able to figure out where to make this change.  Maybe it belongs into
> costsize.c near
>     run_cost += outer_path_rows *
>         (inner_path->total_cost - inner_path->startup_cost) *
>         joininfactor;

I've been intending for some time to try to restructure the cost
estimator so that repeated indexscans can be costed more accurately.
Within the context of the heap-fetch-estimating algorithm, I think
the entire execution of a nestloop-with-inner-index-scan could probably
be treated as a single scan.  I'm not sure how we adjust the estimates
for the index-access part, though clearly those are too high as well.

This doesn't seem to be a localized change unfortunately.  Certainly
costsize.c can't do it alone.

            regards, tom lane