Re: question about executing JOINs - Mailing list pgsql-general

From Josh Burdick
Subject Re: question about executing JOINs
Date
Msg-id 3DAC28F1.9080306@gradient.cis.upenn.edu
Whole thread Raw
In response to question about executing JOINs  (Josh Burdick <jburdick@gradient.cis.upenn.edu>)
List pgsql-general
Tom Lane wrote:

>Josh Burdick <jburdick@gradient.cis.upenn.edu> writes:
>
>
>>Nested Loop  (cost=0.00..198183643.06 rows=85230539 width=51)
>>  ->  Seq Scan on ucsc_ref_seq_ali_hg12  (cost=0.00..817.65 rows=15165
>>width=29)
>>  ->  Index Scan using ucsc_snp_tsc_hg12_chrom_start on
>>ucsc_snp_tsc_hg12  (cost=0.00..12962.39 rows=4713 width=22)
>>
>>
>
>What's the actual numbers of rows involved?  (EXPLAIN ANALYZE output
>would be far more useful than plain EXPLAIN.)
>
>
    The two tables have 15485 and 1095497 rows, respectively.
    The join returns 11135953 records, which is humongous, but is still
only 0.066% of the number of pairs of rows.  (Which is why I stopped
EXPLAIN ANALYZE after 45 minutes.)

>Do you have a feeling for the number of rows that would be produced by
>just the JOIN/ON condition (no constraint on snp_start)?  How does that
>compare to EXPLAIN's estimate of that number of rows?
>
>
    The JOIN/ON condition is on a column with 24 possible values.  I'm
not quite sure how close the planner would be on that case, but it would
be a lot closer.

>
>
>>    The planner is assuming that it's a cross join: that we need all
>>pairs of records.  It's not taking into account the WHERE clause which
>>restricts to a tiny fraction of the records.
>>
>>
>
>No it isn't, and yes it is, but it evidently is making a bad estimate of
>the fraction of rows eliminated by those clauses.  I'd like to find out
>just what its estimate of that fraction is and what the correct value
>would be.
>
>
    The planner's original estimates of how many rows are needed from
each table (15165 and 1145280) are quite close; pretty much all rows are
needed from each table.  It estimates that the MERGE JOIN will return
85230539 rows, and 85230539 / (15165 * 1145280) = 0.49%.  (As opposed to
0.066%.)

    For what its worth, if you just multiply the number of rows together
for the original query,  you get 17368171200, while for the version
using the index scan, you get 71472645, which is a lot less, and within
a factor of 10 of the actual number of rows returned.  So perhaps that
number could be factored into the cost estimate better somehow.

>
>
>>Perhaps the planner should assume that a nested loop over an index
>>scan only looks at 1% of its records?
>>
>>
>
>Arbitrary assumptions designed to fix one example tend to break other
>examples ...
>
>
    Fair enough.  I think the current hack is good enough -- if you
really need to override the planner, you can.  And the eventual fix
should be to make the planner smarter.

>            regards, tom lane
>
>
>
>
    Hope this helps,
    Josh

--
Josh Burdick
jburdick@gradient.cis.upenn.edu
http://www.cis.upenn.edu/~jburdick





pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: RV: bad result in a query!! :-(
Next
From: "Jose Antonio Leo"
Date:
Subject: Re: RV: bad result in a query!! hopeless