Thread: Data point on the competition regarding selectivity of unknown parameters

From: Greg Stark <gsstark@mit.edu>
Date:
One of the things I think has to change with Postgres is the default
selectivity assumptions for inequality operators. They're way too high
currently. Probably the single most frequently asked question on -performance
and -general is why Postgres isn't using someone's index. And while
some of the cases are caused by other things, easily half the time it's simply
Postgres making unreasonably pessimistic assumptions about the selectivity of
inequalities.
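
[For context, the fallback in question is the hard-wired DEFAULT_INEQ_SEL
constant (1/3) in the planner's selectivity-estimation code. A minimal
sketch of how to observe it on a modern PostgreSQL; the table and column
names here are made up for illustration:

    -- Comparing two columns defeats the per-column histogram, so the
    -- planner falls back to its default inequality selectivity.
    CREATE TABLE test1 AS
      SELECT g AS a, g AS b FROM generate_series(1, 100000) g;
    ANALYZE test1;

    EXPLAIN SELECT 1 FROM test1 WHERE a > b;
    -- The row estimate comes out near 33333, i.e. 1/3 of the table,
    -- the DEFAULT_INEQ_SEL fallback discussed in this thread.
]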

Just out of curiosity I checked what Oracle does. This took me a fair amount
of effort, so I hope you find the data useful.

The short of it is that Oracle assumes 5% for a single inequality. It assumes
0.25% for a range query, but that could just be the product of two 5%
estimates (0.05 × 0.05 = 0.0025). I would guess it's not handling range
queries as a special case.

This is actually a startling coincidence. I was already planning to argue for
precisely 5% myself as a reasonable compromise. I think even lower values would
make sense, but at least 5% would be low enough to consistently produce index
scans, which seems to match users' expectations.


Methodology:

This is with Oracle 10g (aka 10.1.0.2.0) on Linux.

I created a table with 100,000 records containing a single integer column
populated with integers ranging from 1 to 100,000 and ran "analyze table test2
compute statistics" on it. This is vaguely equivalent to "VACUUM FULL ANALYZE"
as far as gathering statistics goes: it tells Oracle to scan the entire table
rather than use any sampling.
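
[A sketch of that setup in Oracle SQL; the CONNECT BY row-generation trick
is an assumption, since the original doesn't say how the table was populated:

    CREATE TABLE test2 (a NUMBER);
    INSERT INTO test2 (a)
      SELECT level FROM dual CONNECT BY level <= 100000;
    COMMIT;
    ANALYZE TABLE test2 COMPUTE STATISTICS;
]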

I then ran

    explain plan for select 1 from test2 where a > :0;
    select cardinality from plan_table;

The result was 5000.

I did the same thing for "where a > :0 and a < :1"; the result was 250.
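
[For completeness, a hedged reconstruction of that second run; clearing
plan_table between runs is an assumption, since EXPLAIN PLAN appends rows
to it otherwise:

    delete from plan_table;
    explain plan for select 1 from test2 where a > :0 and a < :1;
    select cardinality from plan_table;
    -- returned 250, i.e. 0.25% of the 100,000-row table
]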

I intend to try some other distributions and other where clauses tomorrow. But
for now it's time for some sleep.

-- 
greg



Greg Stark <gsstark@mit.edu> writes:
> One of the things I think has to change with Postgres is the default
> selectivity assumptions for inequality operators. They're way too high
> currently.

Maybe so, but 5% is grossly too low.  We'd just be throwing ourselves
into a different set of badly misoptimized queries.  On what grounds can
you argue that "WHERE x > y" will select only 5% of the rows, if you
have no knowledge about either x or y?  (And no, I won't buy the
argument that users are only interested in queries that fetch small
proportions of tables.  That argument comes from thinking of only one
class of applications.)

Perhaps more to the point, 5% is still too high to ensure selection of
an indexscan.  You'd need something more like 0.5%, which is even more
obviously hopelessly misguided.
        regards, tom lane
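
[A back-of-the-envelope sketch of Tom's last point on a modern PostgreSQL;
all names and figures here are illustrative assumptions, not from the thread:

    -- Wide rows and randomized physical order make index fetches expensive.
    CREATE TABLE wide AS
      SELECT g AS a, repeat('x', 500) AS pad
      FROM generate_series(1, 100000) g
      ORDER BY random();          -- destroy index/heap correlation
    CREATE INDEX wide_a_idx ON wide (a);
    ANALYZE wide;

    EXPLAIN SELECT * FROM wide WHERE a < 5000;   -- ~5% of the rows
    -- With roughly 15 rows per 8kB page the table spans about 6,700 pages.
    -- 5,000 uncorrelated heap fetches at random_page_cost = 4 cost more
    -- than reading all 6,700 pages sequentially, so the planner tends to
    -- reject a plain index scan here (it may use a bitmap scan instead).
]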