Thread: Index not used on group by

Index not used on group by

From
Andrey Repko
Date:
Hello all,

    I have table ma_data, that contain above 300000 rows.
   This table has primary key id, and field alias_id.
   I create index (btree)on this field.
   Set statistic:

 ALTER TABLE "public"."ma_data"
   ALTER COLUMN "alias_id" SET STATISTICS 998;

   So, when I do something like
     SELECT alias_id FROM ma_data GROUP BY alias_id
   and have (with seq_scan off):

   Group  (cost=0.00..1140280.63 rows=32 width=4) (actual time=0.159..2640.090 rows=32 loops=1)
      ->  Index Scan using reference_9_fk on ma_data  (cost=0.00..1139526.57 rows=301624 width=4) (actual
time=0.120..1471.128rows=301624 loops=1) 
    Total runtime: 2640.407 ms
   (3 rows)

   As I understand there are some problems with visibility of records,
   but some others DBMS used indexes without problems(for example
   FireBird)? Or maybe some another information be helpful for me and
   community.

--
С наилучшими пожеланиями,
  Репко Андрей Владимирович       mailto:repko@sart.must-ipra.com


Re: Index not used on group by

From
Richard Huxton
Date:
Andrey Repko wrote:
>
>     I have table ma_data, that contain above 300000 rows.
>    This table has primary key id, and field alias_id.
>    I create index (btree)on this field.
>    Set statistic:
>
>  ALTER TABLE "public"."ma_data"
>    ALTER COLUMN "alias_id" SET STATISTICS 998;
>
>    So, when I do something like
>      SELECT alias_id FROM ma_data GROUP BY alias_id

Why are you using GROUP BY without any aggregate functions?

What happens if you use something like
   SELECT DISTINCT alias_id FROM ma_data;

--
   Richard Huxton
   Archonet Ltd

Re: Index not used on group by

From
Андрей Репко
Date:
Здравствуйте Richard,

Tuesday, September 27, 2005, 1:48:15 PM, Вы писали:

RH> Andrey Repko wrote:
>>
>>     I have table ma_data, that contain above 300000 rows.
>>    This table has primary key id, and field alias_id.
>>    I create index (btree)on this field.
>>    Set statistic:
>>
>>  ALTER TABLE "public"."ma_data"
>>    ALTER COLUMN "alias_id" SET STATISTICS 998;
>>
>>    So, when I do something like
>>      SELECT alias_id FROM ma_data GROUP BY alias_id

RH> Why are you using GROUP BY without any aggregate functions?

RH> What happens if you use something like
RH>    SELECT DISTINCT alias_id FROM ma_data;
sart_ma=# EXPLAIN ANALYZE SELECT DISTINCT alias_id FROM ma_data;
                                                          QUERY PLAN


-------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=65262.63..66770.75 rows=32 width=4) (actual time=16780.214..18250.761 rows=32 loops=1)
   ->  Sort  (cost=65262.63..66016.69 rows=301624 width=4) (actual time=16780.204..17255.129 rows=301624 loops=1)
         Sort Key: alias_id
         ->  Seq Scan on ma_data  (cost=0.00..37811.24 rows=301624 width=4) (actual time=6.896..15321.023 rows=301624
loops=1)
 Total runtime: 18292.542 ms
(5 rows)

sart_ma=# EXPLAIN ANALYZE SELECT alias_id FROM ma_data GROUP BY alias_id;
                                                       QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=38565.30..38565.62 rows=32 width=4) (actual time=15990.863..15990.933 rows=32 loops=1)
   ->  Seq Scan on ma_data  (cost=0.00..37811.24 rows=301624 width=4) (actual time=3.446..14572.141 rows=301624
loops=1)
 Total runtime: 15991.244 ms
(3 rows)

--
С наилучшими пожеланиями,
  Репко Андрей Владимирович       mailto:repko@sart.must-ipra.com


Re: Index not used on group by

From
Андрей Репко
Date:
Здравствуйте Richard,

Tuesday, September 27, 2005, 2:08:31 PM, Вы писали:


>> sart_ma=# EXPLAIN ANALYZE SELECT alias_id FROM ma_data GROUP BY alias_id;
>>                                                        QUERY PLAN
>>
-------------------------------------------------------------------------------------------------------------------------
>>  HashAggregate  (cost=38565.30..38565.62 rows=32 width=4)
>> (actual time=15990.863..15990.933 rows=32 loops=1)
>>    ->  Seq Scan on ma_data  (cost=0.00..37811.24 rows=301624
>> width=4) (actual time=3.446..14572.141 rows=301624 loops=1)
>>  Total runtime: 15991.244 ms

RH> OK - the planner thinks it's doing the right thing, your cost estimates
RH> are way off. If you look back at where you got an index-scan, it's cost
RH> was 1.1 million.
RH>    Index Scan using reference_9_fk on ma_data (cost=0.00..1139526.57
But why PG scan _all_ the records in the table? As I understand we can
"just" select information from index, not scaning all the table? Of
course if we select ALL records from table index can't help us.
If I write something like:
SELECT (SELECT  alias_id  FROM ma_data WHERE alias_id =1 LIMIT 1)
UNION ALL SELECT (SELECT  alias_id  FROM ma_data WHERE alias_id =2 LIMIT 1)
UNION ALL SELECT (SELECT  alias_id  FROM ma_data WHERE alias_id =3 LIMIT 1)
UNION ALL SELECT (SELECT  alias_id  FROM ma_data WHERE alias_id =4 LIMIT 1)
UNION ALL SELECT (SELECT  alias_id  FROM ma_data WHERE alias_id =5 LIMIT 1)
UNION ALL SELECT (SELECT  alias_id  FROM ma_data WHERE alias_id =6 LIMIT 1)
UNION ALL SELECT (SELECT  alias_id  FROM ma_data WHERE alias_id =7 LIMIT 1)
UNION ALL SELECT (SELECT  alias_id  FROM ma_data WHERE alias_id =8 LIMIT 1)
UNION ALL SELECT (SELECT  alias_id  FROM ma_data WHERE alias_id =9 LIMIT 1)
...
UNION ALL SELECT (SELECT  alias_id  FROM ma_data WHERE alias_id=max_alias_id LIMIT 1)
It works better, much better.

RH> That's way above the numbers for seq-scan+hash/sort, so if the cost
RH> estimate was right PG would be making the right choice. Looks like you
RH> need to check your configuration settings. Have you read:
RH>    http://www.powerpostgresql.com/PerfList
RH> or
RH>    http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html
Thanks.


--
С наилучшими пожеланиями,
  Репко Андрей Владимирович       mailto:repko@sart.must-ipra.com


Re: Index not used on group by

From
Richard Huxton
Date:
Андрей Репко wrote:
> RH> What happens if you use something like
> RH>    SELECT DISTINCT alias_id FROM ma_data;
> sart_ma=# EXPLAIN ANALYZE SELECT DISTINCT alias_id FROM ma_data;
>                                                           QUERY PLAN
>
>
-------------------------------------------------------------------------------------------------------------------------------
>  Unique  (cost=65262.63..66770.75 rows=32 width=4) (actual time=16780.214..18250.761 rows=32 loops=1)
>    ->  Sort  (cost=65262.63..66016.69 rows=301624 width=4) (actual time=16780.204..17255.129 rows=301624 loops=1)
>          Sort Key: alias_id
>          ->  Seq Scan on ma_data  (cost=0.00..37811.24 rows=301624 width=4) (actual time=6.896..15321.023 rows=301624
loops=1)
>  Total runtime: 18292.542 ms

> sart_ma=# EXPLAIN ANALYZE SELECT alias_id FROM ma_data GROUP BY alias_id;
>                                                        QUERY PLAN
>
-------------------------------------------------------------------------------------------------------------------------
>  HashAggregate  (cost=38565.30..38565.62 rows=32 width=4) (actual time=15990.863..15990.933 rows=32 loops=1)
>    ->  Seq Scan on ma_data  (cost=0.00..37811.24 rows=301624 width=4) (actual time=3.446..14572.141 rows=301624
loops=1)
>  Total runtime: 15991.244 ms

OK - the planner thinks it's doing the right thing, your cost estimates
are way off. If you look back at where you got an index-scan, it's cost
was 1.1 million.
   Index Scan using reference_9_fk on ma_data (cost=0.00..1139526.57

That's way above the numbers for seq-scan+hash/sort, so if the cost
estimate was right PG would be making the right choice. Looks like you
need to check your configuration settings. Have you read:
   http://www.powerpostgresql.com/PerfList
or
   http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html
--
   Richard Huxton
   Archonet Ltd


Re: Index not used on group by

From
Richard Huxton
Date:
Андрей Репко wrote:
> Здравствуйте Richard,
>
> Tuesday, September 27, 2005, 2:08:31 PM, Вы писали:
>
>
>
>>>sart_ma=# EXPLAIN ANALYZE SELECT alias_id FROM ma_data GROUP BY alias_id;
>>>                                                       QUERY PLAN

>>>-------------------------------------------------------------------------------------------------------------------------
>>> HashAggregate  (cost=38565.30..38565.62 rows=32 width=4)
>>>(actual time=15990.863..15990.933 rows=32 loops=1)
>>>   ->  Seq Scan on ma_data  (cost=0.00..37811.24 rows=301624
>>>width=4) (actual time=3.446..14572.141 rows=301624 loops=1)
>>> Total runtime: 15991.244 ms
>
>
> RH> OK - the planner thinks it's doing the right thing, your cost estimates
> RH> are way off. If you look back at where you got an index-scan, it's cost
> RH> was 1.1 million.
> RH>    Index Scan using reference_9_fk on ma_data (cost=0.00..1139526.57
> But why PG scan _all_ the records in the table? As I understand we can
> "just" select information from index, not scaning all the table? Of
> course if we select ALL records from table index can't help us.

Actually, if you select more than 5-10% of the rows (in general) you are
better off using a seq-scan.

PostgreSQL estimates the total cost of possible query plans and picks
the cheapest. In your case your configuration settings seem to be
pushing the cost of an index scan much higher than it is. So, it picks
the sequential-scan.

> If I write something like:
> SELECT (SELECT  alias_id  FROM ma_data WHERE alias_id =1 LIMIT 1)
> UNION ALL SELECT (SELECT  alias_id  FROM ma_data WHERE alias_id =2 LIMIT 1)
> UNION ALL SELECT (SELECT  alias_id  FROM ma_data WHERE alias_id =3 LIMIT 1)
> UNION ALL SELECT (SELECT  alias_id  FROM ma_data WHERE alias_id =4 LIMIT 1)
> UNION ALL SELECT (SELECT  alias_id  FROM ma_data WHERE alias_id =5 LIMIT 1)
> UNION ALL SELECT (SELECT  alias_id  FROM ma_data WHERE alias_id =6 LIMIT 1)
> UNION ALL SELECT (SELECT  alias_id  FROM ma_data WHERE alias_id =7 LIMIT 1)
> UNION ALL SELECT (SELECT  alias_id  FROM ma_data WHERE alias_id =8 LIMIT 1)
> UNION ALL SELECT (SELECT  alias_id  FROM ma_data WHERE alias_id =9 LIMIT 1)
> ...
> UNION ALL SELECT (SELECT  alias_id  FROM ma_data WHERE alias_id=max_alias_id LIMIT 1)
> It works better, much better.

Of course - it will always choose index queries here - it can see you
are only fetching one row in each subquery.

Correct your configuration settings so PG estimates the cost of an index
  query correctly and all should be well.
--
   Richard Huxton
   Archonet Ltd