Re: Performance improvement hints - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Performance improvement hints
Date
Msg-id 7154.968767944@sss.pgh.pa.us
Whole thread Raw
In response to Performance improvement hints  (devik@cdi.cz)
List pgsql-hackers
devik@cdi.cz writes:
> 1) when optimizer computes size of join it does it as
>    card(R1)*card(R2)*selectivity. Suppose two relations
>    (R1 & R2) each 10000 rows. If you (inner) join them
>    using equality operator, the result is at most 10000
>    rows (min(card(R1),card(R2)). But pg estimates
>    1 000 000 (uses selectivity 0.01 here).

0.01 is only the default estimate used if you've never done a VACUUM
ANALYZE (hint hint).  After ANALYZE, there are column statistics
available that will give a better estimate.

Note that your claim above is incorrect unless you are joining on unique
columns, anyway.  In the extreme case, if all the entries have the same
value in the column being used, you'd get card(R1)*card(R2) output rows.
I'm unwilling to make the system assume column uniqueness without
evidence to back it up, because the consequences of assuming an overly
small output row count are a lot worse than assuming an overly large
one.

One form of evidence that the planner should take into account here is
the existence of a UNIQUE index on a column --- if one has been created,
we could assume column uniqueness even if no VACUUM ANALYZE has ever
been done on the table.  This is on the to-do list, but I don't feel
it's real high priority.  The planner's results are pretty much going
to suck in the absence of VACUUM ANALYZE stats anyway :-(

>    Then when computing cost it will result in very high
>    cost in case of hash and loop join BUT low (right)
>    cost for merge join. It is because for hash and loop
>    joins the cost is estimated from row count but merge
>    join uses another estimation (as it always know that
>    merge join can be done only on equality op).
>    It then leads to use of mergejoin for majority of joins.
>    Unfortunately I found that in majority of such cases
>    the hash join is two times faster.

The mergejoin cost calculation may be overly optimistic.  The cost
estimates certainly need further work.

>    But indexscan always lookups actual record in heap even if
>    all needed attributes are contained in the index.
>    Oracle and even MSSQL reads attributes directly from index
>    without looking for actual tuple at heap.

Doesn't work in Postgres' storage management scheme --- the heap
tuple must be consulted to see if it's still valid.
        regards, tom lane


pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: Dibs for upcoming commit
Next
From: Tom Lane
Date:
Subject: Re: Weird function behavior from Sept 11 snapshot