Thread: Help with unpredictable use of indexes on large tables...

Help with unpredictable use of indexes on large tables...

From
Mike Leahy
Date:
Hello list,

I've been having a bit of difficulty getting Postgres to use indexes on
some large tables that I have.  Included below are the results from
'explain analyze' for two queries that should get the unique years of
data that are available from two different tables (tbl_ind_schools_edu
and tbl_ind_schools_con). The *_edu table has two years of data in it
while the *_con table has one year, so *_edu essentially has nearly
twice as many rows (i.e., 55k vs. 26k).  There is an integer column
called "year" in each table that flags what year each row of data is
from, and a btree index on this column for both tables (called
"schoolse_year" and "schoolsc_year" respectively).  Both tables have
been fully vacuumed/analyzed.  When I try to get the distinct number of
years from these tables, it does a sequential scan to get two unique
values from the "year" column in the *_edu table, but it uses an index
scan to get a single unique value from the "year" column from the *_con
table.  In both cases, I would have expected the index scan to be used.
 I also tried this using the year column as group by clause, but in that
case, neither of the queries use the index scan.

I know I can force PostgreSQL to use indexes by setting enable_seqscan
to off, and this does improve the performance of the query.  But I'm
wondering why the query analyzer doesn't use this index on the larger
table.  I have several other tables of a similar nature (basically the
same data aggregated at different levels), where one table has two years
of data, and the other has one.  In several cases, the table with two
years never utilizes its index on the year column unless I force it to
do so.  I should point out that the table with two years of data also
has a much larger number of columns, all with indexes since they are all
potentially used for querying subsets from the tables.  Is there
something particularly wrong that I might doing (or something that I'm
not doing) to prevent the indexes from being properly used?

Thanks in advance for any advice...
Mike

========================================================================

dbname=# explain analyze select distinct year from tbl_ind_schools_edu;
                         QUERY PLAN

------------------------------------------------------------------------
 Unique  (cost=32302.16..32579.31 rows=2 width=2) (actual
time=1545.911..2084.170 rows=2 loops=1)
   ->  Sort  (cost=32302.16..32440.74 rows=55431 width=2) (actual
time=1545.901..1892.026 rows=55431 loops=1)
         Sort Key: "year"
         ->  Seq Scan on tbl_ind_schools_edu  (cost=0.00..27485.31
rows=55431 width=2) (actual time=0.074..1180.303 rows=55431 loops=1)
 Total runtime: 2085.294 ms
(5 rows)

dbname=# explain analyze select distinct year from tbl_ind_schools_con;
                        QUERY PLAN

------------------------------------------------------------------------
 Unique  (cost=0.00..954.37 rows=1 width=2) (actual time=76.277..372.526
rows=1 loops=1)
   ->  Index Scan using schoolsc_year on tbl_ind_schools_con
(cost=0.00..887.08 rows=26916 width=2) (actual time=76.265..275.314
rows=26916 loops=1)
 Total runtime: 372.659 ms
(3 rows)

Re: Help with unpredictable use of indexes on large tables...

From
Tom Lane
Date:
Mike Leahy <mgleahy@alumni.uwaterloo.ca> writes:
> ... When I try to get the distinct number of
> years from these tables, it does a sequential scan to get two unique
> values from the "year" column in the *_edu table, but it uses an index
> scan to get a single unique value from the "year" column from the *_con
> table.  In both cases, I would have expected the index scan to be used.

You have a fundamental misunderstanding of what's going on here.  Both
plans fetch the entire table contents.  The difference is how the data
is brought into sorted order for the UNIQUE step --- either by an
explicit sort, or by scanning the table in index order.

A full-table index scan is usually pretty darn inefficient (too much
random access) and so it's often the case that the sort approach is
actually faster.

The two EXPLAINs you provided aren't compelling evidence of anything
wrong because they are for two different-sized tables ... but to the
extent that the results are comparable it appears that the planner is
actually biased in favor of the indexscan plan (cost divided by actual
time is way lower for the indexscan).

What you should look at is performance of the two approaches on the
*same* table (fool with enable_sort and/or enable_indexscan to force
the alternative choices) and then see whether it makes sense to tweak
the planner cost parameters for your installation.

Also I'd suggest trying

    select year from [table] group by year

which is capable of using a hash aggregation approach; that will likely
beat either of these plans.

            regards, tom lane

Re: Help with unpredictable use of indexes on large tables...

From
Mike Leahy
Date:
Tom,

Thanks for the advice.  I realize that I have little understanding of
index usage in PostgreSQL - I'm doing my best to improve this.  Below is
another comparison of the 'distinct' and 'group by' queries from the
same table with seqscan set to on and off.  I does look like the group
by works better (with seqscan off), as you suggested.  I'll try some
more tinkering to see what I can make happen.

However, I guess what I'm really trying to do in the context that I'm
currently working is summarize the unique values stored in the index,
rather than querying the table itself.  Is this possible, or reasonable
to do?

Thanks again for your help,
Mike

(P.S. - Sorry if you get this twice Tom)

=======================================================================

dbname=# set session enable_seqscan to on;
SET
dbname=# explain analyze select distinct year from tbl_ind_schools_edu;
                        QUERY PLAN

-----------------------------------------------------------------------
 Unique  (cost=32302.16..32579.31 rows=2 width=2) (actual
time=2871.115..3705.652 rows=2 loops=1)
   ->  Sort  (cost=32302.16..32440.74 rows=55431 width=2) (actual
time=2871.105..3268.114 rows=55431 loops=1)
         Sort Key: "year"
         ->  Seq Scan on tbl_ind_schools_edu  (cost=0.00..27485.31
rows=55431 width=2) (actual time=0.091..1903.820 rows=55431 loops=1)
 Total runtime: 3707.879 ms
(5 rows)

dbname=# set session enable_seqscan to off;
SET
dbname=# explain analyze select distinct year from tbl_ind_schools_edu;
                        QUERY PLAN

-----------------------------------------------------------------------
 Unique  (cost=0.00..161575.24 rows=2 width=2) (actual
time=0.312..2143.846 rows=2 loops=1)
   ->  Index Scan using schoolse_school_year on tbl_ind_schools_edu
(cost=0.00..161436.67 rows=55431 width=2) (actual time=0.286..1717.445
rows=55431 loops=1)
 Total runtime: 2144.100 ms
(3 rows)

dbname=# set session enable_seqscan to on;
SET
dbname=# explain analyze select year from tbl_ind_schools_edu group by year;
                        QUERY PLAN

-----------------------------------------------------------------------
 HashAggregate  (cost=27623.89..27623.91 rows=2 width=2) (actual
time=2176.003..2176.010 rows=2 loops=1)
   ->  Seq Scan on tbl_ind_schools_edu  (cost=0.00..27485.31 rows=55431
width=2) (actual time=0.072..1697.776 rows=55431 loops=1)
 Total runtime: 2254.643 ms
(3 rows)

dbname=# set session enable_seqscan to off;
SET
dbname=# explain analyze select year from tbl_ind_schools_edu group by year;
                        QUERY PLAN

-----------------------------------------------------------------------
 Group  (cost=0.00..161575.24 rows=2 width=2) (actual
time=0.350..2128.425 rows=2 loops=1)
   ->  Index Scan using schoolse_school_year on tbl_ind_schools_edu
(cost=0.00..161436.67 rows=55431 width=2) (actual time=0.296..1689.331
rows=55431 loops=1)
 Total runtime: 2129.799 ms
(3 rows)




Tom Lane wrote:
> Mike Leahy <mgleahy@alumni.uwaterloo.ca> writes:
>> ... When I try to get the distinct number of
>> years from these tables, it does a sequential scan to get two unique
>> values from the "year" column in the *_edu table, but it uses an index
>> scan to get a single unique value from the "year" column from the *_con
>> table.  In both cases, I would have expected the index scan to be used.
>
> You have a fundamental misunderstanding of what's going on here.  Both
> plans fetch the entire table contents.  The difference is how the data
> is brought into sorted order for the UNIQUE step --- either by an
> explicit sort, or by scanning the table in index order.
>
> A full-table index scan is usually pretty darn inefficient (too much
> random access) and so it's often the case that the sort approach is
> actually faster.
>
> The two EXPLAINs you provided aren't compelling evidence of anything
> wrong because they are for two different-sized tables ... but to the
> extent that the results are comparable it appears that the planner is
> actually biased in favor of the indexscan plan (cost divided by actual
> time is way lower for the indexscan).
>
> What you should look at is performance of the two approaches on the
> *same* table (fool with enable_sort and/or enable_indexscan to force
> the alternative choices) and then see whether it makes sense to tweak
> the planner cost parameters for your installation.
>
> Also I'd suggest trying
>
>     select year from [table] group by year
>
> which is capable of using a hash aggregation approach; that will likely
> beat either of these plans.
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that your
>        message can get through to the mailing list cleanly
>

Re: Help with unpredictable use of indexes on large tables...

From
"John D. Burger"
Date:
Tom Lane wrote:

> Also I'd suggest trying
>
>     select year from [table] group by year
>
> which is capable of using a hash aggregation approach; that will likely
> beat either of these plans.

Just out of curiosity, why doesn't the planner consider the same plan
for the OP's original query:

   select distinct year from [table]

Aren't these equivalent (except for the order)?

Thanks.

- John Burger


Re: Help with unpredictable use of indexes on large tables...

From
Tom Lane
Date:
"John D. Burger" <john@mitre.org> writes:
> Tom Lane wrote:
>> Also I'd suggest trying
>> select year from [table] group by year
>> which is capable of using a hash aggregation approach; that will likely
>> beat either of these plans.

> Just out of curiosity, why doesn't the planner consider the same plan
> for the OP's original query:
>    select distinct year from [table]
> Aren't these equivalent (except for the order)?

Mainly just that the DISTINCT code hasn't been rewritten to consider the
possibility.  The current implementation of DISTINCT is pretty tightly
intertwined with ORDER BY; I think it would take some serious hacking to
disentangle the two, and no one's got round to it.

            regards, tom lane