Re: incorrect row estimates for primary key join - Mailing list pgsql-performance

From Ben
Subject Re: incorrect row estimates for primary key join
Date
Msg-id FBA13180-4428-4384-B41D-A9D9A7096790@gmail.com
In response to Re: incorrect row estimates for primary key join  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: incorrect row estimates for primary key join
List pgsql-performance
On Jun 25, 2013, at 4:36 PM, Tom Lane wrote:

>> That seems intuitive, but some of the estimates need to be made
>> before all such information is available.  Maybe we can do
>> something about that some day....
>> Maybe someone else will jump in here with more details than I can
>> provide (at least without hours digging in the source code).
>
> It does not attempt to match up query WHERE clauses with indexes during
> selectivity estimation, so the existence of a multi-column unique
> constraint wouldn't help it improve the estimate.

thanks tom, that answered my question.

> In the case at hand, I doubt that a better result rowcount estimate
> would have changed the planner's opinion of how to do the join.  The OP
> seems to be imagining that 2 million index probes into a large table
> would be cheap, but that's hardly true.  It's quite likely that the
> mergejoin actually is the best way to do the query.  If it isn't really
> best on his hardware, I would think that indicates a need for some
> tuning of the cost parameters.  Another thing that might be helpful for
> working with such large tables is increasing work_mem, to make hashes
> and sorts run faster.

i apologize if i seemed like i was presuming to know what the best query plan is.  i fully understand that the query planner sometimes makes unintuitive decisions which turn out to be for the best, having experienced it first hand many times.  since i nudged my company to use postgresql (instead of mysql/sqlite), we've been very happy with it.  also, having tried my hand (and failed) at making good gist selectivity estimators, i think i've got a not-completely-ignorant 10,000 ft view of the trade-offs it tries to make, such as when sequential scans are better than repeated index lookups, et cetera.  i'm writing because i found this example, which shows yet another thing i don't understand about the query planner, and i am trying to learn more about it.

you've already answered my main question (whether or not unique constraints are used to help row estimation).  there are a couple more issues which i don't quite understand :

1) when i give the query planner a hint to not expect more than number-of-rows-in-jointable (via a limit), it switches to a nested loop + index scan, but with the same row estimates.  here is the plan i showed in my first email :

Limit  (cost=0.00..178452647.11 rows=2500000 width=28)
  ->  Nested Loop  (cost=0.00..1306127545.35 rows=18297957 width=28)
        ->  Seq Scan on jointable  (cost=0.00..35867.11 rows=2066911 width=24)
        ->  Index Scan using bigtable_pkey on bigtable  (cost=0.00..631.88 rows=1 width=28)
              Index Cond: ((id1 = jointable.id1) AND (id2 = jointable.id2) AND (id3 = jointable.id3) AND (id4 = jointable.id4) AND (id5 = jointable.id5))
(5 rows)
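
(the query driving this plan is roughly the following; the table and column names come from the plan above, but the aliases and select list are simplifications on my part :

explain select b.*
  from jointable j
  join bigtable b using (id1, id2, id3, id4, id5)
  limit 2500000;
)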

before, i was misreading this as saying the planner was going to execute the nested loop fully (rows=18 million) and then limit the results.  i am now reading it as saying that the inner nested loop will be short-circuited after it generates enough rows.  if this is true, it seems to imply that, in a query plan with deeply nested inner loops, one should read the inner loop row estimates with a grain of salt, as there might be limits (potentially many levels outwards) which can short-circuit them.  am i wrong about this?
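
(if it helps answer my own question : i suppose one way to confirm the short-circuiting is to run the limited query, the same sketch as above, under explain analyze and compare estimated versus actual rows on the inner nodes :

explain (analyze, buffers)
  select b.*
  from jointable j
  join bigtable b using (id1, id2, id3, id4, id5)
  limit 2500000;

if the limit really does cut things short, the nested loop node should report actual rows near 2.5 million rather than the ~18 million estimate.)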

2) when doing the sort + merge join, it chooses to sort bigtable rather than use an index scan.  i've tried to give hints by requesting the results come back in primary key order, but it keeps sorting by a permutation of the primary key and then re-sorting the join results at the end.  so evidently the random seek cost dominates the sequential read + sort (which i find surprising, but again i am happy to be surprised).  that seems fine for a query which is going to touch the whole table, but i can't seem to come up with a query which would ever favor using an index scan.  for example this :

explain select * from bigtable order by (id1, id2, id3, id4, id5) limit 1;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Limit  (cost=91923132.04..91923132.04 rows=1 width=28)
   ->  Sort  (cost=91923132.04..99842047.88 rows=3167566336 width=28)
         Sort Key: (ROW(id1, id2, id3, id4, id5))
         ->  Seq Scan on bigtable  (cost=0.00..76085300.36 rows=3167566336 width=28)
(4 rows)

(apologies, bigtable has grown since i first started this thread.)  shouldn't an index scan definitely be fastest here?  you don't need to touch the whole table or the index.  maybe there's something i have misconfigured here?
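
one thing i notice while writing this up : with the parentheses, the sort key is the row constructor ROW(id1, id2, id3, id4, id5) rather than the five columns themselves, so perhaps the planner simply can't match that expression to the primary key ordering.  the plain column-list form might be a fairer test (just a guess on my part) :

explain select * from bigtable order by id1, id2, id3, id4, id5 limit 1;

i would expect that form to be able to walk bigtable_pkey and stop after one row.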

best regards, ben
