Re: Explain explained - Mailing list pgsql-novice

From Markus Stocker
Subject Re: Explain explained
Msg-id a9dfaf710803041252u47a623f5yce1ab9af5a39e4cb@mail.gmail.com
In response to Re: Explain explained  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Explain explained  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-novice
Tom,

Thanks for your input, very enlightening!

On Tue, Mar 4, 2008 at 1:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Markus Stocker" <markus@wilabs.ch> writes:
>  > 1/ How does postgresql know that the constraint individual_name.name =
>  > 'http://www.University0.edu' matches one row (which is in fact true)?
>
>  It doesn't *know* that, except in the special case where it sees there's
>  a unique index on that column.

Indeed, there is a unique index on that column, sorry about that. In
fact, pg_stats says that there are no MCVs for that column
(obviously). So that explains a lot. I get

sel(individual_name.name = 'http://www.University0.edu')
= (1 - null_frac) / num_distinct
= (1 - 0) / 17174

rows
= rel_cardinality * sel
= 17174 * (1 / 17174)
= 1

which, I suppose, also explains why the expected number of rows is 1.
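
For the record, the inputs to the formula above come straight out of
the pg_stats view, e.g.

SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'individual_name'
  AND attname = 'name';

(null_frac and n_distinct are the two values used above; for this
column most_common_vals is empty, as every value is unique.)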

> There's a new section in the 8.3 docs that gives some more details about the estimation process:
>  http://www.postgresql.org/docs/8.3/static/planner-stats-details.html

Thanks for pointing to this, very well documented.

>  > Further, why does postgresql expect
>  > a row count of 1 for each operation?
>
>  It's not expecting more than one join partner at each step.  Again
>  that's just statistical.

Sounds reasonable. The first constraint (i.e. on individual_name.name)
returns an estimated 1 row (see above), hence successive joins can't
return more than one.
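
(The per-step estimates are just the rows=... figures in the EXPLAIN
output. A stripped-down sketch of one such join step -- the real query
joins more tables, and the join condition here simply follows the
foreign key in the schema below:

EXPLAIN
SELECT *
FROM individual_name i
JOIN concept_assertion ca ON ca.individual = i.id
WHERE i.name = 'http://www.University0.edu';
)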

>  > 2/ Sequential scans seem to me more expensive compared to index scans.
>  > I'm wondering why the sequential scan on individual_name is the first
>  > executed in the plan.
>
>  I was wondering that too; it looks like it should be a candidate for an
>  index search.  Datatype problem maybe?  Again, you've not shown us the
>  table definitions...

I guess this is now at least partially explained, too. Even if I drop
the index on individual_name.name, pg_stats still reports no MCVs for
the column. Hence, the expected number of rows doesn't change (even
after ANALYZE; the values simply are unique). I'm not sure how the
sequential scan influences the overall cost compared to an index
scan, though.
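
One way to compare the two alternatives is to penalize the sequential
scan for the session and look at what EXPLAIN then estimates (just a
diagnostic trick, not something to leave enabled):

SET enable_seqscan = off;
EXPLAIN SELECT * FROM individual_name
WHERE name = 'http://www.University0.edu';
RESET enable_seqscan;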

The schema for the individual_name relation is

 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer | not null
 name   | text    | not null
Indexes:
    "individual_name_pkey" PRIMARY KEY, btree (id)
    "individual_name_name_idx" UNIQUE, btree (name)

whereas the schema for the concept_assertion relation is

   Column   |  Type   | Modifiers
------------+---------+-----------
 concept    | integer | not null
 individual | integer | not null
Indexes:
    "concept_assertion_pkey" PRIMARY KEY, btree (concept, individual)
    "concept_assertion_concept_idx" btree (concept)
Foreign-key constraints:
    "concept_assertion_concept_fkey" FOREIGN KEY (concept) REFERENCES
tbox_name(id)
    "concept_assertion_individual_fkey" FOREIGN KEY (individual)
REFERENCES individual_name(id)


>  > 3/ There is one condition in the query, i.e. concept_assertion.concept
>  > = 5 with an empty result set, i.e. selectivity 0. In fact, the last
>  > index scan on concept_assertion ca_1 in the plan is never executed
>  > (this is what 'explain analyze' tells me). I'm wondering why this
>  > constraint is not executed first. By executing this first, we could
>  > just never execute everything else.
>
>  Postgres never uses a rowcount estimate of less than one row,
>  so it's not going to try to optimize that way.  There are various
>  rationales behind that choice, but the main one is that we don't
>  trust the statistics unreservedly.  The odds that an estimate of zero
>  is more accurate than an estimate of one just aren't good enough,
>  and the likelihood of generating a really awful plan if we did believe
>  zero rows is too high.

Agreed, makes a lot of sense. If I calculate the estimated rows for
concept_assertion.concept = 5 (i.e. the constraint which effectively
returns an empty result set), I get

sel(concept_assertion.concept = 5)
= (1 - sum(mvf)) / (num_distinct - num_mcv)
= 0.021545454

rows
= rel_cardinality * sel
= 18128 * 0.021545454
= 391 (rounded)
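
EXPLAIN ANALYZE on the predicate alone should show the mismatch
directly, i.e. an estimate of about the 391 rows computed above
against an actual row count of 0 (exact cost figures will of course
vary):

EXPLAIN ANALYZE
SELECT * FROM concept_assertion WHERE concept = 5;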

Clearly, at least when inspecting the two constraints individually,
i.e. individual_name.name = 'http://www.University0.edu' and
concept_assertion.concept = 5, the optimizer chooses the first as it
returns fewer expected rows, i.e. 1 vs. 391.

Hope the explanations sound somewhat reasonable.

Thanks again,
markus
