Re: Merge joins on index scans - Mailing list pgsql-performance

From: Tom Lane
Subject: Re: Merge joins on index scans
Date:
Msg-id: 13979.1457917039@sss.pgh.pa.us
In response to Merge joins on index scans  (James Parks <james.parks@meraki.net>)
List pgsql-performance
James Parks <james.parks@meraki.net> writes:
> On Mon, Feb 29, 2016 at 5:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The other explain shows a scan of "a" reading about 490k rows and
>> returning 395 of them, so there's a factor of about 200 re-read here.
>> I wonder if the planner should have inserted a materialize node to
>> reduce that.
>>
>> However, I think the real problem is upstream of that: if that indexscan
>> was estimated at 26163.20 units, how'd the mergejoin above it get costed
>> at only 7850.13 units?  The answer has to be that the planner thought the
>> merge would stop before reading most of "a", as a result of limited range
>> of b.a_id.  It would be interesting to look into what the actual maximum
>> b.a_id value is.

> I've attached a pg_dump of a database that contains all of the data, in the
> event you (or others) would like to look at it. The attachment is ~1.8MB
> (gzipped), and you can replay the pg_dump file on a database that has just
> been created with initdb.

Thanks for sending the test data --- I got a chance to look at this
finally.  It looks like my first suspicion was right and the second one
wrong: the planner is badly underestimating the amount of rescan required
and thus not putting in a Materialize buffer node where needed.

If I force a Materialize node to be put in, without changing any cost
estimates (basically, set path->materialize_inner = 1 at the end of
final_cost_mergejoin), then I get this:

 Sort  (cost=7343.28..7343.62 rows=137 width=16) (actual time=218.342..232.817 rows=83427 loops=1)
   Sort Key: b.id
   Sort Method: external merge  Disk: 2120kB
   ->  Merge Join  (cost=696.93..7338.41 rows=137 width=16) (actual time=18.627..136.549 rows=83427 loops=1)
         Merge Cond: (b.a_id = a.id)
         ->  Index Scan using a_idx on b  (cost=0.29..2708.59 rows=83427 width=16) (actual time=0.048..28.876 rows=83427 loops=1)
         ->  Materialize  (cost=0.42..24368.01 rows=805 width=8) (actual time=18.568..74.447 rows=83658 loops=1)
               ->  Index Scan using a_pkey on a  (cost=0.42..24366.00 rows=805 width=8) (actual time=18.560..66.186 rows=238 loops=1)
                     Filter: (nonce = 64)
                     Rows Removed by Filter: 87955
 Execution time: 239.195 ms

which is a pretty satisfactory result in terms of runtime, though
still a bit slower than the hash alternative.

Now, there are 490166 "a" rows, but we can see that the inner indexscan
stopped after reading 87955+238 = 88193 of them.  So there really is an
early-stop effect and the planner seems to have gotten that about right.
The problem is the rescan effect, which we can now quantify at 83658/238
or about 350:1.  Despite which, stepping through final_cost_mergejoin
finds that it estimates *zero* rescan effect.  If I force the
rescannedtuples estimate to be the correct value of 83658 - 238 = 83420,
it properly decides that a materialize node ought to be put in; but that
also increases its overall cost estimate for this mergejoin to 7600, which
causes it to prefer doing the join in the other direction:

 Sort  (cost=7343.28..7343.62 rows=137 width=16) (actual time=169.579..184.184 rows=83427 loops=1)
   Sort Key: b.id
   Sort Method: external merge  Disk: 2120kB
   ->  Merge Join  (cost=696.93..7338.41 rows=137 width=16) (actual time=16.460..108.562 rows=83427 loops=1)
         Merge Cond: (a.id = b.a_id)
         ->  Index Scan using a_pkey on a  (cost=0.42..24366.00 rows=805 width=8) (actual time=16.442..63.197 rows=238 loops=1)
               Filter: (nonce = 64)
               Rows Removed by Filter: 87955
         ->  Index Scan using a_idx on b  (cost=0.29..2708.59 rows=83427 width=16) (actual time=0.013..26.811 rows=83427 loops=1)
 Execution time: 190.441 ms

which is an even more satisfactory outcome; the cheaper merge direction
is properly estimated as cheaper, and this is actually a tad faster than
the hash join for me.
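
As a sanity check, the rescan arithmetic above can be re-derived directly from the EXPLAIN output (a minimal sketch; all constants are copied from the plans in this message):

```python
# Re-deriving the rescan figures quoted above from the EXPLAIN output.
# All numeric constants are copied from the plans in this message.

removed_by_filter = 87955   # "Rows Removed by Filter" on the indexscan of a
matched           = 238     # rows that indexscan actually returned
materialized      = 83658   # rows produced by the Materialize node

rows_read = removed_by_filter + matched   # how far the inner scan had to run
rescanned = materialized - matched        # tuples re-fetched due to rescans

print(rows_read)                # 88193 of 490166 "a" rows: the early-stop effect
print(rescanned)                # 83420: the correct rescannedtuples value
print(materialized // matched)  # 351: the "about 350:1" rescan factor
```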

So the entire problem here is a bogus rescannedtuples estimate.  The code
comment about that estimate is

     * When there are equal merge keys in the outer relation, the mergejoin
     * must rescan any matching tuples in the inner relation. This means
     * re-fetching inner tuples; we have to estimate how often that happens.
     *
     * For regular inner and outer joins, the number of re-fetches can be
     * estimated approximately as size of merge join output minus size of
     * inner relation. Assume that the distinct key values are 1, 2, ..., and
     * denote the number of values of each key in the outer relation as m1,
     * m2, ...; in the inner relation, n1, n2, ...  Then we have
     *
     * size of join = m1 * n1 + m2 * n2 + ...
     *
     * number of rescanned tuples = (m1 - 1) * n1 + (m2 - 1) * n2 + ... = m1 *
     * n1 + m2 * n2 + ... - (n1 + n2 + ...) = size of join - size of inner
     * relation
     *
     * This equation works correctly for outer tuples having no inner match
     * (nk = 0), but not for inner tuples having no outer match (mk = 0); we
     * are effectively subtracting those from the number of rescanned tuples,
     * when we should not.  Can we do better without expensive selectivity
     * computations?
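
To see the failure mode the comment points out, here is a small sketch applying the formula to a made-up key distribution (not data from the test database) in which one inner key has no outer match:

```python
# Worked example (hypothetical key distribution) of the rescannedtuples
# formula quoted above, showing how inner tuples with no outer match
# (mk = 0) drag the estimate below the true value.

# (mk, nk): multiplicity of each distinct key in the outer / inner relation
keys = [
    (2, 3),  # matched on both sides: outer appears twice -> one rescan pass
    (0, 5),  # present only in the inner relation (mk = 0)
]

join_size  = sum(m * n for m, n in keys)                 # m1*n1 + m2*n2 + ...
inner_size = sum(n for _, n in keys)                     # n1 + n2 + ...
estimate   = join_size - inner_size                      # the comment's formula
actual     = sum((m - 1) * n for m, n in keys if m > 0)  # true rescans

print(join_size, inner_size, estimate, actual)  # 6 8 -2 3
```

The gap between estimate and actual is exactly the 5 unmatched inner tuples. With the huge population of a.id values having no match in b.a_id, the same subtraction would drive the estimate negative, which (presumably after clamping to zero) would explain the *zero* rescan effect observed above.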

Some investigation of the actual key values says that indeed there are a
whole lot of a.id values that have no match in b.a_id, so I think the
comment at the end is telling us what the problem is.  Is there a better
way to make this estimate?

            regards, tom lane

