Thread: Poor index choice -- multiple indexes of the same columns
PostgreSQL 8.0.3

Hi,

I have a query

  select 1
    from census
    where date < '1975-9-21' and sname = 'RAD' and status != 'A'
    limit 1;

Explain analyze says it always uses the index made by:

  CREATE INDEX census_date_sname ON census (date, sname);

this is even after I made the index:

  CREATE INDEX census_sname_date ON census (sname, date);

I made census_sname_date because it ran too slow.  By deleting
census_date_sname (temporarily, because my apps don't like this) I can
force the use of census_sname_date and the query runs fine.

Seems to me that when there's a constant value in the query and an =
comparison, it will always be faster to use the (b-tree) index that's
ordered first by the constant value, as then all further blocks are
guaranteed to have a higher relevant information density, at least when
compared with another index that has the same columns in it.

As you might imagine, there are relatively few sname values and
relatively many date values in my data.

I use a query like the above in a trigger to enforce bounds
limitations.  I don't expect (want) to find any rows returned.

I've figured out how to avoid executing this code very often, so this
is not presently a serious problem for me.

Karl <kop@meme.com>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein
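For context, a bounds-enforcing trigger of the kind described might look
roughly like the sketch below.  This is only an illustration: the
triggered table interval_data, its start_date column, and the error
wording are assumptions, not details given in the thread.

  -- Hypothetical sketch of the bounds check; names other than
  -- census/date/sname/status are assumed for illustration.
  CREATE OR REPLACE FUNCTION census_bounds_check() RETURNS trigger AS $$
  BEGIN
      -- Same "expect no rows" probe as the query above.
      PERFORM 1
         FROM census
        WHERE date < NEW.start_date
          AND sname = NEW.sname
          AND status != 'A'
        LIMIT 1;
      IF FOUND THEN
          RAISE EXCEPTION 'census rows for % violate the date bounds',
              NEW.sname;
      END IF;
      RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER interval_data_bounds
      BEFORE INSERT OR UPDATE ON interval_data
      FOR EACH ROW EXECUTE PROCEDURE census_bounds_check();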
Karl,

> Seems to me that when there's a constant value in the query
> and an = comparison it will always be faster to use the (b-tree)
> index that's ordered first by the constant value, as then all further
> blocks are guaranteed to have a higher relevant information
> density.  At least when compared with another index that has the
> same columns in it.

That really depends on the stats.  Such a choice would *not* be
appropriate if the < comparison was expected to return 1- rows while
the = condition applied to 15% of the table.

What is your STATISTICS_TARGET for the relevant columns set to?  When's
the last time you ran ANALYZE?  If this is all updated, do you want to
post the pg_stats rows for the relevant columns?

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
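For reference, a per-column statistics target can be checked and raised
along these lines; the value 100 below is just an illustrative choice,
not something recommended in the thread:

  -- Raise the statistics target for the columns in question
  -- (100 is an arbitrary illustrative value), then re-gather stats.
  ALTER TABLE census ALTER COLUMN date  SET STATISTICS 100;
  ALTER TABLE census ALTER COLUMN sname SET STATISTICS 100;
  ANALYZE census;

  -- See what the planner now knows about those columns.
  SELECT attname, n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
   WHERE tablename = 'census' AND attname IN ('date', 'sname');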
On 06/27/2005 05:37:41 PM, Josh Berkus wrote:
> Karl,
>
> > Seems to me that when there's a constant value in the query
> > and an = comparison it will always be faster to use the (b-tree)
> > index that's ordered first by the constant value, as then all
> > further blocks are guaranteed to have a higher relevant information
> > density.  At least when compared with another index that has the
> > same columns in it.
>
> That really depends on the stats.  Such a choice would *not* be
> appropriate if the < comparison was expected to return 1- rows while
> the = condition applied to 15% of the table.

We're talking internals here so I don't know what I'm talking about,
but when the = comparison returns 15% of the table you can find your
way straight to the 1- (sic) relevant rows, because that 15% is further
sorted by the second column of the index.  So that's one disk read, and
after that, when you scan the rest of the blocks, every datum is
relevant/returned.  So your scan will pass through fewer disk blocks.
The only case where it would make sense to consider the other index is
if the planner knew it could get the answer in one disk read, in which
case it should be able to get the answer out of either index in one
disk read, as both indexes are on the same columns.

> What is your STATISTICS_TARGET for the relevant columns set to?

STATISTICS_TARGET is the default, which I read as 10 in the docs.

> When's the last time you ran ANALYZE?

I'm doing this in a torture test script, loading data.  Every Fibonacci
number of rows * 100 I VACUUM ANALYZE.  So, 100, 200, 300, 500, 800,
etc.

Just for grins I've created the index I'd like it to use, run VACUUM
ANALYZE, and shown the EXPLAIN ANALYZE below.

> If this is all updated, do you want to post the pg_stats rows for the
> relevant columns?

pg_stats rows below.  (I've tried to wrap the lines short so as not to
mess up anybody's mailer.)
# create index census_sname_date on census (sname, date);
CREATE INDEX
# vacuum analyze census;
VACUUM
# explain analyze select 1 from census
    where date < '1975-9-21' and sname = 'RAD' and status != 'A';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Scan using census_date_sname on census  (cost=0.00..2169.51 rows=1437 width=0) (actual time=40.610..40.610 rows=0 loops=1)
   Index Cond: ((date < '1975-09-21'::date) AND (sname = 'RAD'::bpchar))
   Filter: (status <> 'A'::bpchar)
 Total runtime: 40.652 ms
(4 rows)

Compare with:

# drop index census_date_sname;
DROP INDEX
# explain analyze select date from census
    where sname = 'RAD' and date < '1975-9-21' and status != 'A' limit 1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.00..3.37 rows=1 width=4) (actual time=0.097..0.097 rows=0 loops=1)
   ->  Index Scan using census_sname_date on census  (cost=0.00..5203.95 rows=1544 width=4) (actual time=0.094..0.094 rows=0 loops=1)
         Index Cond: ((sname = 'RAD'::bpchar) AND (date < '1975-09-21'::date))
         Filter: (status <> 'A'::bpchar)
 Total runtime: 0.133 ms
(5 rows)

# select * from pg_stats
    where tablename = 'census' and (attname = 'sname' or attname = 'date');

-[ RECORD 1 ]-----+--------------------------------------------------------
schemaname        | babase
tablename         | census
attname           | date
null_frac         | 0
avg_width         | 4
n_distinct        | 4687
most_common_vals  | {1979-02-01,1976-06-16,1977-03-23,1978-08-25,1979-09-20,
                  |  1971-06-28,1972-04-28,1972-08-27,1974-04-06,1975-03-19}
most_common_freqs | {0.002,0.00166667,0.00166667,0.00166667,0.00166667,
                  |  0.00133333,0.00133333,0.00133333,0.00133333,0.00133333}
histogram_bounds  | {1959-07-15,1966-02-18,1969-02-22,1971-01-10,1972-07-26,
                  |  1974-02-09,1975-05-27,1976-07-28,1977-08-19,1978-08-07,
                  |  1979-10-02}
correlation       | 1
-[ RECORD 2 ]-----+--------------------------------------------------------
schemaname        | babase
tablename         | census
attname           | sname
null_frac         | 0
avg_width         | 7
n_distinct        | 177
most_common_vals  | {MAX,ALT,PRE,COW,EST,JAN,RIN,ZUM,DUT,LUL}
most_common_freqs | {0.0166667,0.015,0.015,0.0146667,0.0143333,0.014,
                  |  0.0136667,0.0136667,0.0133333,0.0133333}
histogram_bounds  | {ALI,BUN,FAN,IBI,LER,NDO,PET,RUS,SLM,TOT,XEN}
correlation       | 0.0446897

Thanks.

Karl <kop@meme.com>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein
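As an aside, because DDL is transactional in PostgreSQL, this sort of
before/after comparison can be made without leaving the index dropped,
roughly like so (note that DROP INDEX holds an exclusive lock on the
table until the transaction ends):

  BEGIN;
  DROP INDEX census_date_sname;
  EXPLAIN ANALYZE
      SELECT 1 FROM census
       WHERE date < '1975-9-21' AND sname = 'RAD' AND status != 'A'
       LIMIT 1;
  ROLLBACK;   -- the index comes back untouched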
On 06/27/2005 09:36:51 PM, Karl O. Pinc wrote:

> I'm doing this in a torture test script, loading data.  Every
> Fibonacci number of rows * 100 I VACUUM ANALYZE.  So, 100, 200, 300,
> 500, 800, etc.

(And of course I disconnect my client and re-connect so as to use the
new statistics.  It sure would be nice if I didn't have to do this.)

Karl <kop@meme.com>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein
"Karl O. Pinc" <kop@meme.com> writes: > I have a query > select 1 > from census > where date < '1975-9-21' and sname = 'RAD' and status != 'A' > limit 1; > Explain analyze says it always uses the index made by: > CREATE INDEX census_date_sname ON census (date, sname); > this is even after I made the index: > CREATE INDEX census_sname_date ON census (sname, date); I don't believe that any existing release can tell the difference between these two indexes as far as costs go. I just recently added some code to btcostestimate that would cause it to prefer the index on (sname, date) but of course that's not released yet. However: isn't the above query pretty seriously underspecified? With a LIMIT and no ORDER BY, you are asking for a random one of the rows matching the condition. I realize that with "select 1" you may not care much, but adding a suitable ORDER BY would help push the planner towards using the right index. In this case "ORDER BY sname DESC, date DESC" would probably do the trick. regards, tom lane
On 06/28/2005 01:40:56 AM, Tom Lane wrote:
> "Karl O. Pinc" <kop@meme.com> writes:
>
> > I have a query
> >
> > select 1
> >   from census
> >   where date < '1975-9-21' and sname = 'RAD' and status != 'A'
> >   limit 1;
> >
> > Explain analyze says it always uses the index made by:
> >
> > CREATE INDEX census_date_sname ON census (date, sname);
> >
> > this is even after I made the index:
> >
> > CREATE INDEX census_sname_date ON census (sname, date);
>
> I don't believe that any existing release can tell the difference
> between these two indexes as far as costs go.  I just recently
> added some code to btcostestimate that would cause it to prefer
> the index on (sname, date), but of course that's not released yet.
>
> However: isn't the above query pretty seriously underspecified?
> With a LIMIT and no ORDER BY, you are asking for a random one
> of the rows matching the condition.  I realize that with
> "select 1" you may not care much, but adding a suitable ORDER BY
> would help push the planner towards using the right index.  In
> this case "ORDER BY sname DESC, date DESC" would probably do the
> trick.

Yes, that works.  I'd already tried "ORDER BY date DESC" before I first
wrote, and that did not work.  (I started with no LIMIT either, and
tried adding specifications until I gave up.  It's very good that the
new planner will figure out things by itself.)

"ORDER BY sname DESC" works as well.  This is a bit odd, as with the
constant in the = comparison "ORDER BY date DESC" is the same as
"ORDER BY sname DESC, date DESC".  I guess that's why I gave up on my
attempts to get the planner to use the (sname, date) index before I got
to your solution.

Thanks everybody for the help.

Karl <kop@meme.com>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein