Thread: Yet another "Why won't PostgreSQL use my index?"

Yet another "Why won't PostgreSQL use my index?"

From: "Gregory Wood"
Trying to use a single column index on a somewhat large table (1.9M rows),
and PostgreSQL really doesn't want to. It estimates the number of rows at
12749 (actual 354), which is only .6% of the table... well within reasonable
index range I would think. And yes, I've run an analyze on the table.

Here are the queries I've run:
===============
cns=# analyze re_site_listings_index;
ANALYZE
cns=# select count(1) from re_site_listings_index;
  count
---------
 1906455
(1 row)

cns=# explain analyze select * from re_site_listings_index where
idx_siteid=237;
NOTICE:  QUERY PLAN:

Seq Scan on re_site_listings_index  (cost=0.00..41050.76 rows=12749
width=302) (actual time=158.57..2839.78 rows=354 loops=1)
Total runtime: 2841.60 msec

EXPLAIN
cns=# set enable_seqscan=false;
SET VARIABLE
cns=# explain analyze select * from re_site_listings_index where
idx_siteid=237;
NOTICE:  QUERY PLAN:

Index Scan using bill_idx_siteid on re_site_listings_index
(cost=0.00..48402.08 rows=12749 width=302) (actual time=0.30..3.95 rows=354
loops=1)
Total runtime: 5.76 msec

EXPLAIN
cns=# select version();
                               version
---------------------------------------------------------------------
 PostgreSQL 7.2.1 on i386-unknown-freebsd4.3, compiled by GCC 2.95.3
(1 row)

===============
I *think* that's all the relevant information... please let me know if I
forgot anything.

Greg


Re: Yet another "Why won't PostgreSQL use my index?"

From: Stephan Szabo
On Thu, 20 Jun 2002, Gregory Wood wrote:

> Trying to use a single column index on a somewhat large table (1.9M rows),
> and PostgreSQL really doesn't want to. It estimates the number of rows at
> 12749 (actual 354), which is only .6% of the table... well within reasonable
> index range I would think. And yes, I've run an analyze on the table.

It might be interesting to figure out why it's overestimating the rows
so badly.  What do the statistics in pg_statistic for that
relation/attribute look like?
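
A query along these lines should show them (a sketch against the pg_stats
view, which is the readable presentation of pg_statistic in 7.2; table and
column names taken from your post):

    SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 're_site_listings_index'
      AND attname = 'idx_siteid';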


Re: Yet another "Why won't PostgreSQL use my index?"

From: Tom Lane
"Gregory Wood" <gregw@com-stock.com> writes:
> Trying to use a single column index on a somewhat large table (1.9M rows),
> and PostgreSQL really doesn't want to. It estimates the number of rows at
> 12749 (actual 354), which is only .6% of the table... well within reasonable
> index range I would think. And yes, I've run an analyze on the table.

Could we see the analyze results --- ie, the pg_stats row for the
idx_siteid column?

Increasing the statistics target for that column might help, but I'm
interested to see what it thinks the distribution is right now.  The
problem AFAICT is the factor-of-36 overestimation of the number of
matching rows; had that estimate been even within a factor of 20 of
correct, the correct plan would have been chosen.

            regards, tom lane

Re: Yet another "Why won't PostgreSQL use my index?"

From: Manfred Koizar
On Thu, 20 Jun 2002 12:05:53 -0400, "Gregory Wood"
<gregw@com-stock.com> wrote:
>cns=# explain analyze select * from re_site_listings_index where
>idx_siteid=237;
>NOTICE:  QUERY PLAN:
>
>Seq Scan on re_site_listings_index  (cost=0.00..41050.76 rows=12749
>width=302) (actual time=158.57..2839.78 rows=354 loops=1)
>Total runtime: 2841.60 msec
>
>EXPLAIN
>cns=# set enable_seqscan=false;
>SET VARIABLE
>cns=# explain analyze select * from re_site_listings_index where
>idx_siteid=237;
>NOTICE:  QUERY PLAN:
>
>Index Scan using bill_idx_siteid on re_site_listings_index
>(cost=0.00..48402.08 rows=12749 width=302) (actual time=0.30..3.95 rows=354
>loops=1)
>Total runtime: 5.76 msec

Greg,

apparently random_page_cost is set to the default value of 4.
The planner assumes that the rows are scattered all over the table and
that it has to do 12000 random page reads;  the total cost is
calculated to be approx. 12000 * random_page_cost = 48000, which is
more than the estimated 41000 for a seq scan.  So a seq scan looks
cheaper.

    SET random_page_cost=3;

and try again.  Experiment with other values, I guess you will see a
change somewhere between 3.3 and 3.5.
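
Something like this, repeated with different settings, makes the experiment
easy (a sketch using the query from your post; SET only affects the current
session, so nothing is changed permanently):

    SET random_page_cost = 3.5;
    EXPLAIN ANALYZE
        SELECT * FROM re_site_listings_index WHERE idx_siteid = 237;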

In fact the tuples seem to be close to each other, so several of them
fit on the same page, but the planner does not know this.  I'm sorry,
I don't know how to tell it.

But as long as setting random_page_cost to a lower value helps, this
should be ok.  The default value of 4 seems to be too high for many
situations, anyway.
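
If a lower value proves out, it can be made the default for all sessions in
postgresql.conf (a sketch; the exact number is whatever your experiments
suggest):

    random_page_cost = 3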

Servus
 Manfred

Re: Yet another "Why won't PostgreSQL use my index?"

From: "Gregory Wood"
I am assuming that this query is retrieving what you are looking for:

SELECT pg_statistic.* FROM pg_statistic JOIN pg_attribute ON
starelid=attrelid AND staattnum=attnum WHERE attname='idx_siteid';

I've attached the results so they don't wrap quite as badly.

Also included is a count for the individual idx_siteid values, if that might
be useful to see how they are distributed.

Greg

----- Original Message -----
From: "Stephan Szabo" <sszabo@megazone23.bigpanda.com>
To: "Gregory Wood" <gregw@com-stock.com>
Cc: "PostgreSQL-General" <pgsql-general@postgresql.org>
Sent: Thursday, June 20, 2002 12:19 PM
Subject: Re: [GENERAL] Yet another "Why won't PostgreSQL use my index?"


>
> On Thu, 20 Jun 2002, Gregory Wood wrote:
>
> > Trying to use a single column index on a somewhat large table (1.9M rows),
> > and PostgreSQL really doesn't want to. It estimates the number of rows at
> > 12749 (actual 354), which is only .6% of the table... well within reasonable
> > index range I would think. And yes, I've run an analyze on the table.
>
> It might be interesting to figure out why it's overestimating the rows
> so badly.  What do the statistics in pg_statistic for that
> relation/attribute look like?
>

Attachment

Re: Yet another "Why won't PostgreSQL use my index?"

From: "Gregory Wood"
Nothing like replying to your own message...

I updated that file with the pg_stats information as well.

Don't know if that was included with my other message as I was editing it
when it got sent :/

Greg

----- Original Message -----
From: "Gregory Wood" <gregw@com-stock.com>
To: "PostgreSQL-General" <pgsql-general@postgresql.org>
Sent: Thursday, June 20, 2002 1:13 PM
Subject: Re: [GENERAL] Yet another "Why won't PostgreSQL use my index?"


> I am assuming that this query is retrieving what you are looking for:
>
> SELECT pg_statistic.* FROM pg_statistic JOIN pg_attribute ON
> starelid=attrelid AND staattnum=attnum WHERE attname='idx_siteid';
>
> I've attached the results so they don't wrap quite as badly.
>
> Also included is a count for the individual idx_siteid values, if that might
> be useful to see how they are distributed.
>
> Greg
>
> ----- Original Message -----
> From: "Stephan Szabo" <sszabo@megazone23.bigpanda.com>
> To: "Gregory Wood" <gregw@com-stock.com>
> Cc: "PostgreSQL-General" <pgsql-general@postgresql.org>
> Sent: Thursday, June 20, 2002 12:19 PM
> Subject: Re: [GENERAL] Yet another "Why won't PostgreSQL use my index?"
>
>
> >
> > On Thu, 20 Jun 2002, Gregory Wood wrote:
> >
> > > Trying to use a single column index on a somewhat large table (1.9M rows),
> > > and PostgreSQL really doesn't want to. It estimates the number of rows at
> > > 12749 (actual 354), which is only .6% of the table... well within reasonable
> > > index range I would think. And yes, I've run an analyze on the table.
> >
> > It might be interesting to figure out why it's overestimating the rows
> > so badly.  What do the statistics in pg_statistic for that
> > relation/attribute look like?
> >
>

Attachment

Re: Yet another "Why won't PostgreSQL use my index?"

From: "Gregory Wood"
----- Original Message -----
From: "Manfred Koizar" <mkoi-pg@aon.at>
To: "Gregory Wood" <gregw@com-stock.com>
Cc: "PostgreSQL-General" <pgsql-general@postgresql.org>
Sent: Thursday, June 20, 2002 12:55 PM
Subject: Re: [GENERAL] Yet another "Why won't PostgreSQL use my index?"


> On Thu, 20 Jun 2002 12:05:53 -0400, "Gregory Wood"
> <gregw@com-stock.com> wrote:
> >cns=# explain analyze select * from re_site_listings_index where
> >idx_siteid=237;
> >NOTICE:  QUERY PLAN:
> >
> >Seq Scan on re_site_listings_index  (cost=0.00..41050.76 rows=12749
> >width=302) (actual time=158.57..2839.78 rows=354 loops=1)
> >Total runtime: 2841.60 msec
> >
> >EXPLAIN
> >cns=# set enable_seqscan=false;
> >SET VARIABLE
> >cns=# explain analyze select * from re_site_listings_index where
> >idx_siteid=237;
> >NOTICE:  QUERY PLAN:
> >
> >Index Scan using bill_idx_siteid on re_site_listings_index
> >(cost=0.00..48402.08 rows=12749 width=302) (actual time=0.30..3.95 rows=354
> >loops=1)
> >Total runtime: 5.76 msec
>
> Greg,
>
> apparently random_page_cost is set to the default value of 4.
> The planner assumes that the rows are scattered all over the table and
> that it has to do 12000 random page reads;  the total cost is
> calculated to be approx. 12000 * random_page_cost = 48000, which is
> more than the estimated 41000 for a seq scan.  So a seq scan looks
> cheaper.
>
> SET random_page_cost=3;
>
> and try again.  Experiment with other values, I guess you will see a
> change somewhere between 3.3 and 3.5.
>
> In fact the tuples seem to be close to each other, so several of them
> fit on the same page, but the planner does not know this.  I'm sorry,
> I don't know how to tell it.
>
> But as long as setting random_page_cost to a lower value helps, this
> should be ok.  The default value of 4 seems to be too high for many
> situations, anyway.
>
> Servus
>  Manfred

Very, very helpful information... thank you Manfred!

I'll have to play around with this setting a bit to find what values are
best; I'm just worried that I might set it too far in the other direction
and thereby screw things up.

I guess the best way to approach that particular tuning problem is to find a
query where the estimated number of rows is close to the actual number, and
then try different values until the random page reads start to become slower
than the sequential scan. Fun fun.

Of course if PostgreSQL were estimating the number of rows correctly, that
would be less of a problem. Seems that our data is throwing off the
statistics... we have some values that appear tens of thousands of times and
others that appear only a few times, with a few values (such as the example
I sent) in between. Perhaps it's time to look at TABLE SET STATISTICS...

Greg


Re: Yet another "Why won't PostgreSQL use my index?"

From: Manfred Koizar
On Thu, 20 Jun 2002 15:00:09 -0400, "Gregory Wood"
<gregw@com-stock.com> wrote:
>I guess the best way to approach that particular tuning problem is to find a
>query where the estimated number of rows is close to the actual number, and
>then try different values until the random page reads start to become slower
>than the sequential scan. Fun fun.
>
>Of course if PostgreSQL were estimating the number of rows correctly, that
>would be less of a problem.

Yes, less of a problem.  But it wouldn't make the problem go away.
About two weeks ago I had that kind of fun you mentioned in a
situation where estimated numbers of rows were +/- 5% close to
reality.  However, if the planner believes that one random page read
is necessary for each tuple, but in reality 50 or more tuples can be
fetched with one page read, then the planner is off by a factor of 50 or
more in favour of seq scans.

If you know that there is a tendency for your data to be physically
ordered by index value, you can put in a counterweight in favour of
index scans by lowering random_page_cost.  Of course this won't work,
if you have multiple indices implying very different sort orders.
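
If one index matters most, the CLUSTER command can create that physical
ordering (a sketch in the 7.2-era syntax, using the index name from this
thread; CLUSTER rewrites the table, so check its caveats in the 7.2
documentation before running it):

    CLUSTER bill_idx_siteid ON re_site_listings_index;
    ANALYZE re_site_listings_index;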

I thought that the planner had a notion of "clustering", but I cannot
recall where I got this idea from.  Must have read something on the
hackers list ...  Sorry, I cannot provide any details.  Most probably
Tom Lane can ...

>Seems that our data is throwing off the
>statistics... we have some values that appear tens of thousands of times and
>others that appear only a few times, with a few values (such as the example
>I sent) in between. Perhaps it's time to look at TABLE SET STATISTICS...

At least it can't hurt :-)

Servus
 Manfred

Re: Yet another "Why won't PostgreSQL use my index?"

From: Tom Lane
"Gregory Wood" <gregw@com-stock.com> writes:
> Of course if PostgreSQL were estimating the number of rows correctly, that
> would be less of a problem. Seems that our data is throwing off the
> statistics... we have some values that appear tens of thousands of times and
> others that appear only a few times, with a few values (such as the example
> I sent) in between. Perhaps it's time to look at TABLE SET STATISTICS...

I believe that pushing the SET STATISTICS target up to 50 or so would
solve the problem nicely, at the cost of making ANALYZE run longer.
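
That would look something like this (a sketch using the names from this
thread; the new target only takes effect at the next ANALYZE):

    ALTER TABLE re_site_listings_index
        ALTER COLUMN idx_siteid SET STATISTICS 50;
    ANALYZE re_site_listings_index;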

However, it also bothered me that your actual runtime ratio was nearly
500:1 when the row estimate was off "only" 36:1.  There's still an
estimation error of more than a factor of 10 in there, and that can't be
explained by arguing about the appropriate value of random_page_cost.
(random_page_cost less than one is certainly nonsensical.)

I'm wondering whether the indexscan case was benefiting from pages
having been read into memory by the preceding seqscan.  If you run the
seqscan plan twice in a row, does the runtime stay about the same?

            regards, tom lane

Re: Yet another "Why won't PostgreSQL use my index?"

From: "Gregory Wood"
> > Of course if PostgreSQL were estimating the number of rows correctly, that
> > would be less of a problem. Seems that our data is throwing off the
> > statistics... we have some values that appear tens of thousands of times and
> > others that appear only a few times, with a few values (such as the example
> > I sent) in between. Perhaps it's time to look at TABLE SET STATISTICS...
>
> I believe that pushing the SET STATISTICS target up to 50 or so would
> solve the problem nicely, at the cost of making ANALYZE run longer.

Better that ANALYZE take a few seconds longer than my queries <grin>.

> However, it also bothered me that your actual runtime ratio was nearly
> 500:1 when the row estimate was off "only" 36:1.  There's still an
> estimation error of more than a factor of 10 in there, and that can't be
> explained by arguing about the appropriate value of random_page_cost.
> (random_page_cost less than one is certainly nonsensical.)
>
> I'm wondering whether the indexscan case was benefiting from pages
> having been read into memory by the preceding seqscan.  If you run the
> seqscan plan twice in a row, does the runtime stay about the same?

Appears that way:

cns=# explain analyze select * from re_site_listings_index where
idx_siteid=387;
NOTICE:  QUERY PLAN:

Seq Scan on re_site_listings_index  (cost=0.00..42110.74 rows=13828
width=302) (actual time=2095.26..2095.26 rows=0 loops=1)
Total runtime: 2095.43 msec

EXPLAIN
cns=# explain analyze select * from re_site_listings_index where
idx_siteid=387;
NOTICE:  QUERY PLAN:

Seq Scan on re_site_listings_index  (cost=0.00..42110.74 rows=13828
width=302) (actual time=2095.12..2095.12 rows=0 loops=1)
Total runtime: 2095.24 msec

EXPLAIN
cns=# explain analyze select * from re_site_listings_index where
idx_siteid=387;
NOTICE:  QUERY PLAN:

Seq Scan on re_site_listings_index  (cost=0.00..42110.74 rows=13828
width=302) (actual time=2082.50..2082.50 rows=0 loops=1)
Total runtime: 2082.62 msec

EXPLAIN
cns=# explain analyze select * from re_site_listings_index where
idx_siteid=387;
NOTICE:  QUERY PLAN:

Seq Scan on re_site_listings_index  (cost=0.00..42110.74 rows=13828
width=302) (actual time=2057.49..2057.49 rows=0 loops=1)
Total runtime: 2057.60 msec

Greg


Re: Yet another "Why won't PostgreSQL use my index?"

From
Tom Lane
Date:
Manfred Koizar <mkoi-pg@aon.at> writes:
> If you know that there is a tendency for your data to be physically
> ordered by index value, you can put in a counterweight in favour of
> index scans by lowering random_page_cost.  Of course this won't work,
> if you have multiple indices implying very different sort orders.

Of course, that's a hack that is quite unrelated to the real problem...

> I thought that the planner had a notion of "clustering", but I cannot
> recall where I got this idea from.

It does, as of 7.2, but it's entirely possible that the fudge-factor
being applied for that is all wrong.  I have not had any time to work on
that problem recently, and so the equation that made it into 7.2 was
just a crude first hack with no theory behind it.  See the
indexCorrelation adjustment code in cost_index() in
src/backend/optimizer/path/costsize.c if you're interested in fooling
with it.

Even in the uncorrelated case, the estimation equation *does* consider
the probability of multiple hits on the same heap page.  Before you
assert that "the planner believes that one random page read is necessary
for each tuple", I suggest reading the code...

            regards, tom lane

Re: Yet another "Why won't PostgreSQL use my index?"

From: Stephan Szabo
On Thu, 20 Jun 2002, Gregory Wood wrote:

> I guess the best way to approach that particular tuning problem is to find a
> query where the estimated row numbers is close to the actual page numbers
> and then try different values until the random page reads start to become
> slower than the sequential scan. Fun fun.
>
> Of course if PostgreSQL were estimating the number of rows correctly, that
> would be less of a problem. Seems that our data is throwing off the
> statistics... we have some values that appear tens of thousands of times and
> others that appear only a few times, with a few values (such as the example
> I sent) in between. Perhaps it's time to look at TABLE SET STATISTICS...

Yeah.  Since the number of the high frequency values is greater than the
number it keeps track of, you're in the same general boat as the earlier
statistics weirdnesses.  It's really easy to play with the set statistics
though. :)