Re: [SQL] optimizer not using an index... - Mailing list pgsql-sql

From: Tom Lane
Subject: Re: [SQL] optimizer not using an index...
Msg-id: 7075.935772727@sss.pgh.pa.us
In response to: Re: [SQL] optimizer not using an index...  (Howie <caffeine@toodarkpark.org>)
List: pgsql-sql

Howie <caffeine@toodarkpark.org> writes:
> On Fri, 27 Aug 1999, Tom Lane wrote:
>> Well, mumble.  The optimizer certainly needs work, but what's your
>> evidence that this query would be faster if done another way?  Hashing
>> the smaller tables, as it's doing, ought to be a pretty good strategy.

> My primary focus was on why pgsql decided to do a complete table scan of
> customers (searching for customer.email) when there's an index on it;
> surely an index scan would be loads faster than a complete table scan.

No, an index scan will be loads slower if most or all of the table has
to be visited.  If it were always faster, the optimizer would have a
much easier task ;-).  The join we are considering here is on
pincodes.codenum = customers.codenum, and since the system (mistakenly)
thinks it is going to be joining a lot of rows from pincodes, doing it
with a hash rather than an indexscan on customers.codenum looks good.
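
For concreteness, here is a sketch of the query shape under discussion
(only pincodes.codenum, customers.codenum, and customers.email appear in
this thread; the SELECT list and the email literal are made up):

    SELECT c.email, p.codenum
      FROM pincodes p, customers c
     WHERE p.codenum = c.codenum
       AND c.email = 'someone@example.com';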

You bring up a good point though: why isn't it deciding to use the
restriction on customer.email for an indexscan of customers, using the
index on email rather than the one on codenum?  (I assume you have both)
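
For illustration, assuming index definitions along these lines (the
thread only implies both exist; the names here are hypothetical):

    CREATE UNIQUE INDEX customers_codenum_idx ON customers (codenum);
    CREATE INDEX customers_email_idx ON customers (email);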

It looks like the thing is misestimating the selectivity of that
restriction too: there are not going to be 8757 customer rows that pass
that WHERE clause --- but the system thinks that, and that's why it's
rejecting the indexscan as being slower than the sequential.  I had
missed that point before, being distracted by the even sillier row count
from the other subjoin.  There are clearly two different selectivity
mistakes being made here :-(
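
(On a current PostgreSQL you could sanity-check the planner's
selectivity inputs against the pg_stats view --- illustrative only,
since 6.5's statistics table carries much less:)

    SELECT attname, n_distinct, most_common_vals
      FROM pg_stats
     WHERE tablename = 'customers'
       AND attname IN ('email', 'codenum');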

> The other interesting item I ran into was why one query used the indexes
> provided and the other didn't.  Granted, pincodes has at least 10x more
> entries than customers does (~300k+ vs. ~11k).

Right, and the issue seems to be that when you are restricting pincodes
with an extra WHERE clause, the estimated output row count is realistic,
whereas when you aren't restricting pincodes it ain't.  I'm not sure
why, yet.
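
Illustratively (the actual queries aren't quoted in this message, and
p.used stands in for whatever column the extra clause restricts):

    -- Restricted: the row-count estimate comes out realistic.
    EXPLAIN SELECT c.email
      FROM pincodes p, customers c
      WHERE p.codenum = c.codenum AND p.used = 'f';

    -- Unrestricted: the subjoin estimate balloons to ~300k rows.
    EXPLAIN SELECT c.email
      FROM pincodes p, customers c
      WHERE p.codenum = c.codenum;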

>> But it's estimating it's going to have to probe that table 300k times,
>> which makes the hashjoin look mighty attractive...

> Why would it have to probe pincodes 300k times when there's a unique
> index on pincodes (pincodes.codenum) and a unique index on customers
> (customers.codenum)?

It's guessing that it's going to get 300k rows out of the join of
pincodes to all the little tables (ie, about one row per pincode row;
it doesn't know there's only going to be one match), and then it thinks
it's going to have to look through most of the customers table for
match(es) to each of those rows.  If the selectivity estimates that the
system was making were in the right ballpark, then the hash join
probably *would* be the right way to do it.  The problem here seems
to be that the estimated row counts are so far off-base.  I'd like to
find out why.
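
Roughly, the arithmetic the planner is doing, with the numbers from
this thread:

    -- Estimated outer rows from the pincodes subjoin:  ~300,000
    -- Nestloop + indexscan: ~300,000 separate index probes into customers.
    -- Hashjoin: one seq scan of customers (~11,000 rows) to build the
    --           hash table, then one cheap in-memory probe per outer row.
    --
    -- With 300k estimated probes, the hash wins easily; with the true,
    -- far smaller row count, the indexscan plan would win instead.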

> I could send you a dump, but the db is fairly large; 7M uncompressed, 2M
> gzip -9'ed.  None of the data is overly sensitive.  I am, however, on an
> overly lagged 28k8.

2m gzipped is no problem at this end, but I'll bet we'd see the same
behavior with only half or a third of the data, if you can extract a
consistent subset easily...
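
Something along these lines would keep the subset referentially
consistent (table names from the thread; the sampling rule is arbitrary
and assumes codenum is an integer):

    -- Keep roughly a third of customers, plus only those pincodes rows
    -- that still have a matching customer.
    SELECT * INTO customers_small FROM customers WHERE codenum % 3 = 0;
    SELECT p.* INTO pincodes_small
      FROM pincodes p, customers_small c
      WHERE p.codenum = c.codenum;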

> I haven't tried trimming the db down.  I'm not sure that would 'fix' the
> problem if the optimizer is misguessing the number of rows it's going to
> have to look at...

No, it wouldn't be a fix for your problem.  I was just thinking of
making a smaller, easier-to-ship test case that I could use to make
sure the problem is fixed for 6.6.

> Over the weekend I'm planning on upgrading to 6.5.1, but I don't recall
> seeing any changes to the optimizer in the changelog...

There are not; this is work for 6.6.  The fixes involve storing more and
different info in the system statistics table, so there is no prospect
of back-patching them into 6.5.*.  (BTW, I think 6.5.2 will be out
shortly, so you might want to wait another week before updating.)
        regards, tom lane

