Thread: hash aggregation
Hello! Is it possible to speed up the plan?

hashes=# \d hashcheck
                                 Table "public.hashcheck"
 Column |       Type        |                        Modifiers
--------+-------------------+--------------------------------------------------------
 id     | integer           | not null default nextval('hashcheck_id_seq'::regclass)
 name   | character varying |
 value  | character varying |
Indexes:
    "btr" btree (name)

hashes=# select version();
                                                   version
--------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.2.1 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.7.2 20120921 (Red Hat 4.7.2-2), 64-bit
(1 row)

hashes=# explain analyse verbose select name, count(name) as cnt from hashcheck group by name order by name desc;
                                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=573977.88..573978.38 rows=200 width=32) (actual time=10351.280..10351.551 rows=4000 loops=1)
   Output: name, (count(name))
   Sort Key: hashcheck.name
   Sort Method: quicksort  Memory: 315kB
   ->  HashAggregate  (cost=573968.24..573970.24 rows=200 width=32) (actual time=10340.507..10341.288 rows=4000 loops=1)
         Output: name, count(name)
         ->  Seq Scan on public.hashcheck  (cost=0.00..447669.16 rows=25259816 width=32) (actual time=0.019..2798.058 rows=25259817 loops=1)
               Output: id, name, value
 Total runtime: 10351.989 ms
(9 rows)

hashes=#

Thank you.
On Wed, Oct 10, 2012 at 9:09 AM, Korisk <Korisk@yandex.ru> wrote:
> Hello! Is it possible to speed up the plan?
>  Sort  (cost=573977.88..573978.38 rows=200 width=32) (actual time=10351.280..10351.551 rows=4000 loops=1)
>    Output: name, (count(name))
>    Sort Key: hashcheck.name
>    Sort Method: quicksort  Memory: 315kB
>    ->  HashAggregate  (cost=573968.24..573970.24 rows=200 width=32) (actual time=10340.507..10341.288 rows=4000 loops=1)
>          Output: name, count(name)
>          ->  Seq Scan on public.hashcheck  (cost=0.00..447669.16 rows=25259816 width=32) (actual time=0.019..2798.058 rows=25259817 loops=1)
>                Output: id, name, value
>  Total runtime: 10351.989 ms

AFAIU there is no query-optimization solution for this.

It may be worth creating a table hashcheck_stat (name, cnt) and
incrementing/decrementing the cnt values with triggers if you need to
get counts fast.

--
Sergey Konoplev

a database and software architect
http://www.linkedin.com/in/grayhemp

Jabber: gray.ru@gmail.com  Skype: gray-hemp  Phone: +14158679984
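A minimal sketch of the trigger-maintained counter table Sergey describes (the table, function, and trigger names are illustrative, not from the thread; it assumes the hashcheck schema shown above and handles only INSERT/DELETE - an UPDATE of name would need extra handling; 9.2 has no INSERT ... ON CONFLICT, hence the manual upsert):

```sql
-- Hypothetical counter table: one row per distinct name.
CREATE TABLE hashcheck_stat (
    name character varying PRIMARY KEY,
    cnt  bigint NOT NULL DEFAULT 0
);

CREATE OR REPLACE FUNCTION hashcheck_stat_maintain() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        -- Manual upsert: try to bump the counter, insert on first sight.
        UPDATE hashcheck_stat SET cnt = cnt + 1 WHERE name = NEW.name;
        IF NOT FOUND THEN
            INSERT INTO hashcheck_stat (name, cnt) VALUES (NEW.name, 1);
        END IF;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE hashcheck_stat SET cnt = cnt - 1 WHERE name = OLD.name;
    END IF;
    RETURN NULL;  -- AFTER trigger, return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER hashcheck_stat_trg
    AFTER INSERT OR DELETE ON hashcheck
    FOR EACH ROW EXECUTE PROCEDURE hashcheck_stat_maintain();
```

The aggregate query then becomes a scan of the small stats table instead of the 25M-row heap:

```sql
SELECT name, cnt FROM hashcheck_stat ORDER BY name DESC;
```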
Thanks for the advice, but a counter table is not acceptable because
there would need to be a great many of them. Nevertheless, the
investigation made some progress (7.4 sec vs 19.6 sec). But with the
index-only scan you can see an abnormal cost calculation, which makes
me suspect small bugs.

Thanks for your answer.

hashes=# \d hashcheck
                                 Table "public.hashcheck"
 Column |       Type        |                        Modifiers
--------+-------------------+--------------------------------------------------------
 id     | integer           | not null default nextval('hashcheck_id_seq'::regclass)
 name   | character varying |
 value  | character varying |
Indexes:
    "hashcheck_name_idx" btree (name)

hashes=# vacuum hashcheck;
VACUUM
hashes=# set random_page_cost=0.1;
SET
hashes=# set seq_page_cost=0.1;
SET
hashes=# explain analyse verbose select name, count(name) as cnt from hashcheck group by name order by name desc;
                                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=407366.72..407367.22 rows=200 width=32) (actual time=10712.505..10712.765 rows=4001 loops=1)
   Output: name, (count(name))
   Sort Key: hashcheck.name
   Sort Method: quicksort  Memory: 315kB
   ->  HashAggregate  (cost=407357.08..407359.08 rows=200 width=32) (actual time=10702.285..10703.054 rows=4001 loops=1)
         Output: name, count(name)
         ->  Seq Scan on public.hashcheck  (cost=0.00..277423.12 rows=25986792 width=32) (actual time=0.054..2877.100 rows=25990002 loops=1)
               Output: id, name, value
 Total runtime: 10712.989 ms
(9 rows)

hashes=# set enable_seqscan = off;
SET
hashes=# explain analyse verbose select name, count(name) as cnt from hashcheck group by name order by name desc;
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=10000000000.00..10000528610.88 rows=200 width=32) (actual time=0.116..7452.005 rows=4001 loops=1)
   Output: name, count(name)
   ->  Index Only Scan Backward using hashcheck_name_idx on public.hashcheck  (cost=10000000000.00..10000398674.92 rows=25986792 width=32) (actual time=0.104..3785.767 rows=25990002 loops=1)
         Output: name
         Heap Fetches: 0
 Total runtime: 7452.509 ms
(6 rows)

Thanks to the tricks described (in Russian) at:
http://www.sql.ru/forum/actualthread.aspx?tid=974484

11.10.2012, 01:30, "Sergey Konoplev" <gray.ru@gmail.com>:
> On Wed, Oct 10, 2012 at 9:09 AM, Korisk <Korisk@yandex.ru> wrote:
>> Hello! Is it possible to speed up the plan?
>> [cut]
>
> AFAIU there is no query-optimization solution for this.
>
> It may be worth creating a table hashcheck_stat (name, cnt) and
> incrementing/decrementing the cnt values with triggers if you need to
> get counts fast.
On 10/11/2012 12:13 PM, Korisk wrote:
> Thanks for the advice, but a counter table is not acceptable because
> there would need to be a great many of them.
> Nevertheless, the investigation made some progress (7.4 sec vs 19.6 sec).
> But using IOS scan

"IOS scan"? Do you mean some kind of I/O monitoring tool?

> you can see an abnormal cost calculation, which makes me suspect small bugs.

Abnormal how? The cost estimates aren't times, I/Os, or anything like
that; they're a purely relative figure for comparing plan costs.

> hashes=# set enable_seqscan = off;
> SET

What are your seq_page_cost and random_page_cost?

> hashes=# explain analyse verbose select name, count(name) as cnt from hashcheck group by name order by name desc;
>                                                        QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------
>  GroupAggregate  (cost=10000000000.00..10000528610.88 rows=200 width=32) (actual time=0.116..7452.005 rows=4001 loops=1)
>    Output: name, count(name)
>    ->  Index Only Scan Backward using hashcheck_name_idx on public.hashcheck
>        ^^^^^^^^^^^^^^^^^^^^^^^^

If you don't mind the increased cost of insert/update/delete, try:

    CREATE INDEX hashcheck_name_rev_idx
    ON public.hashcheck (name DESC);

i.e. create the index in descending order.

--
Craig Ringer
On Wed, Oct 10, 2012 at 9:13 PM, Korisk <Korisk@yandex.ru> wrote:
> ->  Index Only Scan Backward using hashcheck_name_idx on public.hashcheck
>     (cost=10000000000.00..10000398674.92 rows=25986792 width=32)

It seems odd. Is it possible to look at the non-default configuration?

SELECT name, setting, reset_val FROM pg_settings WHERE setting <> reset_val;

> (actual time=0.104..3785.767 rows=25990002 loops=1)
> Output: name
> Heap Fetches: 0
> Total runtime: 7452.509 ms
> (6 rows)
>
> Thanks to the tricks described (in Russian) at:
> http://www.sql.ru/forum/actualthread.aspx?tid=974484
>
> 11.10.2012, 01:30, "Sergey Konoplev" <gray.ru@gmail.com>:
> [cut]

--
Sergey Konoplev

a database and software architect
http://www.linkedin.com/in/grayhemp

Jabber: gray.ru@gmail.com  Skype: gray-hemp  Phone: +14158679984
> "IOS scan"?

Index Only Scan.

> What are your seq_page_cost and random_page_cost?

hashes=# SELECT name, setting, reset_val FROM pg_settings WHERE setting <> reset_val;
           name           |    setting     | reset_val
--------------------------+----------------+-----------
 archive_command          | (disabled)     |
 enable_bitmapscan        | off            | on
 enable_indexscan         | off            | on
 enable_seqscan           | off            | on
 log_file_mode            | 0600           | 384
 random_page_cost         | 0.1            | 4
 seq_page_cost            | 0.1            | 1
 transaction_isolation    | read committed | default
 unix_socket_permissions  | 0777           | 511
(9 rows)

PostgreSQL 9.2.1 was configured and built with default settings.

Thank you.
On Thu, Oct 11, 2012 at 8:15 AM, Korisk <Korisk@yandex.ru> wrote:
> What are your seq_page_cost and random_page_cost?
> hashes=# SELECT name, setting, reset_val FROM pg_settings WHERE setting <> reset_val;
>            name           |    setting     | reset_val
> --------------------------+----------------+-----------
>  archive_command          | (disabled)     |
>  enable_bitmapscan        | off            | on
>  enable_indexscan         | off            | on
>  enable_seqscan           | off            | on
>  log_file_mode            | 0600           | 384
>  random_page_cost         | 0.1            | 4
>  seq_page_cost            | 0.1            | 1
>  transaction_isolation    | read committed | default
>  unix_socket_permissions  | 0777           | 511

Could you please try to set *_page_cost to 1 and then EXPLAIN ANALYZE it again?

> ->  Index Only Scan Backward using hashcheck_name_idx on public.hashcheck
>     (cost=10000000000.00..10000398674.92 rows=25986792 width=32)
>     (actual time=0.104..3785.767 rows=25990002 loops=1)

I am just guessing, but it might be some kind of a precision bug, and I
would like to check this.

> (9 rows)
>
> Postgresql 9.2.1 was configured and built with default settings.
>
> Thank you.

--
Sergey Konoplev

a database and software architect
http://www.linkedin.com/in/grayhemp

Jabber: gray.ru@gmail.com  Skype: gray-hemp  Phone: +14158679984
Again the same cost.

hashes=# SELECT name, setting, reset_val FROM pg_settings WHERE setting <> reset_val;
           name           |    setting     | reset_val
--------------------------+----------------+-----------
 archive_command          | (disabled)     |
 enable_bitmapscan        | off            | on
 enable_indexscan         | off            | on
 enable_seqscan           | off            | on
 log_file_mode            | 0600           | 384
 random_page_cost         | 1              | 4
 transaction_isolation    | read committed | default
 unix_socket_permissions  | 0777           | 511
(8 rows)

hashes=# explain analyse verbose select name, count(name) as cnt from hashcheck group by name order by name desc;
                                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=10000000000.00..10000596612.97 rows=200 width=32) (actual time=0.136..7272.240 rows=4001 loops=1)
   Output: name, count(name)
   ->  Index Only Scan using hashcheck_name_rev_idx on public.hashcheck  (cost=10000000000.00..10000466660.96 rows=25990002 width=32) (actual time=0.121..3624.624 rows=25990002 loops=1)
         Output: name
         Heap Fetches: 0
 Total runtime: 7272.735 ms
(6 rows)

11.10.2012, 21:55, "Sergey Konoplev" <gray.ru@gmail.com>:
> On Thu, Oct 11, 2012 at 8:15 AM, Korisk <Korisk@yandex.ru> wrote:
>> What are your seq_page_cost and random_page_cost?
>> [cut]
>> random_page_cost         | 0.1            | 4
>> seq_page_cost            | 0.1            | 1
>> [cut]
>
> Could you please try to set *_page_cost to 1 and then EXPLAIN ANALYZE it again?
>
>> ->  Index Only Scan Backward using hashcheck_name_idx on public.hashcheck
>>     (cost=10000000000.00..10000398674.92 rows=25986792 width=32)
>>     (actual time=0.104..3785.767 rows=25990002 loops=1)
>
> I am just guessing, but it might be some kind of a precision bug, and
> I would like to check this.
>
> [cut]
On Thu, Oct 11, 2012 at 8:55 PM, Korisk <Korisk@yandex.ru> wrote:
> hashes=# explain analyse verbose select name, count(name) as cnt from hashcheck group by name order by name desc;

Now set enable_bitmapscan and enable_indexscan to on and try it again.
Then set enable_seqscan to on and run it one more time.

>                                                        QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  GroupAggregate  (cost=10000000000.00..10000596612.97 rows=200 width=32) (actual time=0.136..7272.240 rows=4001 loops=1)
>    Output: name, count(name)
>    ->  Index Only Scan using hashcheck_name_rev_idx on public.hashcheck  (cost=10000000000.00..10000466660.96 rows=25990002 width=32) (actual time=0.121..3624.624 rows=25990002 loops=1)
>          Output: name
>          Heap Fetches: 0
>  Total runtime: 7272.735 ms
> (6 rows)
>
> 11.10.2012, 21:55, "Sergey Konoplev" <gray.ru@gmail.com>:
> [cut]

--
Sergey Konoplev

a database and software architect
http://www.linkedin.com/in/grayhemp

Jabber: gray.ru@gmail.com  Skype: gray-hemp  Phone: +14158679984
Strange situation: after enabling indexscan the cost decreased dramatically.

hashes=# set enable_bitmapscan=on;
SET
hashes=# explain analyse verbose select name, count(name) as cnt from hashcheck group by name order by name desc;
                                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=10000000000.00..10000596612.97 rows=200 width=32) (actual time=0.187..7424.799 rows=4001 loops=1)
   Output: name, count(name)
   ->  Index Only Scan using hashcheck_name_rev_idx on public.hashcheck  (cost=10000000000.00..10000466660.96 rows=25990002 width=32) (actual time=0.166..3698.776 rows=25990002 loops=1)
         Output: name
         Heap Fetches: 0
 Total runtime: 7425.403 ms
(6 rows)

hashes=# set enable_indexscan=on;
SET
hashes=# explain analyse verbose select name, count(name) as cnt from hashcheck group by name order by name desc;
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.00..596612.97 rows=200 width=32) (actual time=0.148..7339.115 rows=4001 loops=1)
   Output: name, count(name)
   ->  Index Only Scan using hashcheck_name_rev_idx on public.hashcheck  (cost=0.00..466660.96 rows=25990002 width=32) (actual time=0.129..3653.848 rows=25990002 loops=1)
         Output: name
         Heap Fetches: 0
 Total runtime: 7339.592 ms
(6 rows)

hashes=# set enable_seqscan=on;
SET
hashes=# explain analyse verbose select name, count(name) as cnt from hashcheck group by name order by name desc;
                                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=565411.67..565412.17 rows=200 width=32) (actual time=21746.799..21747.026 rows=4001 loops=1)
   Output: name, (count(name))
   Sort Key: hashcheck.name
   Sort Method: quicksort  Memory: 315kB
   ->  HashAggregate  (cost=565402.03..565404.03 rows=200 width=32) (actual time=21731.551..21733.277 rows=4001 loops=1)
         Output: name, count(name)
         ->  Seq Scan on public.hashcheck  (cost=0.00..435452.02 rows=25990002 width=32) (actual time=29.431..13383.812 rows=25990002 loops=1)
               Output: id, name, value
 Total runtime: 21747.356 ms
(9 rows)
Hi,

On 12 October 2012 15:14, Korisk <Korisk@yandex.ru> wrote:
> Strange situation: after enabling indexscan the cost decreased dramatically.

You cannot really disable any scan method; enable_xxx = off just sets a
very high cost (10000000000) for that operation.

--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)
(http://www.linkedin.com/in/ondrejivanic)
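This is easy to see in a session sketch (the plan line below is illustrative, not actual output from the thread): the planner adds a fixed penalty (disable_cost = 1.0e10 in the source) to "disabled" paths, which is where the 10000000000.00 in the EXPLAIN output above comes from.

```sql
SET enable_seqscan = off;
-- The seq scan can still be chosen if no other path exists; it just
-- carries the 1.0e10 penalty in its estimate, e.g. something like:
EXPLAIN SELECT count(*) FROM hashcheck;
--   Seq Scan on hashcheck  (cost=10000000000.00..10000447669.16 ...)
```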
On Thu, Oct 11, 2012 at 9:14 PM, Korisk <Korisk@yandex.ru> wrote:
> Strange situation: after enabling indexscan the cost decreased dramatically.

AFAIK when the planner has to choose between index scans and seq scans
and both of these options are off, it uses one of these strategies
anyway, but puts 10000000000.00 as a lower cost for it (thanks Maxim
Boguk for the explanation in chat).

> ->  Index Only Scan using hashcheck_name_rev_idx on public.hashcheck  (cost=10000000000.00..10000466660.96 rows=25990002 width=32) (actual time=0.166..3698.776 rows=25990002 loops=1)

So when you enabled one of these options it started using it as usual.

> hashes=# set enable_indexscan=on;
> SET
> hashes=# explain analyse verbose select name, count(name) as cnt from hashcheck group by name order by name desc;

[cut]

> ->  Index Only Scan using hashcheck_name_rev_idx on public.hashcheck  (cost=0.00..466660.96 rows=25990002 width=32) (actual time=0.129..3653.848 rows=25990002 loops=1)

What I cannot understand is why the seq scan's estimated cost is better
than the index scan's. It depends on the number of pages in the
index/relation. Maybe the index is heavily bloated?

Let's see the sizes:

select pg_total_relation_size('hashcheck');
select pg_total_relation_size('hashcheck_name_rev_idx');

> hashes=# set enable_seqscan=on;
> SET
> hashes=# explain analyse verbose select name, count(name) as cnt from hashcheck group by name order by name desc;
> [cut]
>  Total runtime: 21747.356 ms
> (9 rows)

--
Sergey Konoplev

a database and software architect
http://www.linkedin.com/in/grayhemp

Jabber: gray.ru@gmail.com  Skype: gray-hemp  Phone: +14158679984
> What I cannot understand is why the seq scan's estimated cost is better
> than the index scan's. It depends on the number of pages in the
> index/relation. Maybe the index is heavily bloated?

Mm, I don't know how to check the bloat level. But the index was created with

    create index on hashcheck using btree (name)

after the table was populated.

Sizes:

hashes=# select pg_total_relation_size('hashcheck');
 pg_total_relation_size
------------------------
             2067701760
(1 row)

hashes=# select pg_total_relation_size('hashcheck_name_rev_idx');
 pg_total_relation_size
------------------------
              629170176
(1 row)
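One way to check b-tree bloat, not mentioned in the thread but available in this release: the contrib module pgstattuple ships a pgstatindex() function (installable as an extension since 9.1). A sketch, assuming the contrib modules were built:

```sql
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- avg_leaf_density near 90% is typical for a freshly built index;
-- much lower values (with many leaf pages) suggest bloat.
SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('hashcheck_name_rev_idx');
```

Since this index was built after the table was populated, heavy bloat is unlikely here; the check mainly rules it out.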
On 12.10.2012 09:10, Sergey Konoplev wrote:
> What I cannot understand is why the seq scan's estimated cost is better
> than the index scan's. It depends on the number of pages in the
> index/relation. Maybe the index is heavily bloated?

The IOS cost depends on other things too. The index can't be read simply
as a sequence of pages; the scan needs to jump around the tree to read
the tuples in the right order. With the index size being close to the
size of the table, the cost of these operations may easily outweigh the
benefits.

And I suspect this is the case here, because the table has only three
columns (an INT and two text ones), and each row has some overhead (the
header) that may further decrease the difference between index and table
size.

Nevertheless, the cost estimate here is wrong - either it's estimating
something incorrectly, or maybe everything is in the cache and the
planner does not know about that.

Tomas
On 11.10.2012 17:15, Korisk wrote:
> hashes=# SELECT name, setting, reset_val FROM pg_settings WHERE setting <> reset_val;
>            name           |    setting     | reset_val
> --------------------------+----------------+-----------
>  archive_command          | (disabled)     |
>  enable_bitmapscan        | off            | on
>  enable_indexscan         | off            | on
>  enable_seqscan           | off            | on
>  log_file_mode            | 0600           | 384
>  random_page_cost         | 0.1            | 4
>  seq_page_cost            | 0.1            | 1
>  transaction_isolation    | read committed | default
>  unix_socket_permissions  | 0777           | 511
> (9 rows)
>
> Postgresql 9.2.1 was configured and built with default settings.

Hi,

so how much RAM does the system have? Because if you're using the
default shared_buffers size (32MB IIRC), that's the first thing you
should bump up. It's usually recommended to set it to ~25% of RAM, but
not more than ~10GB.

Set also work_mem and maintenance_work_mem, depending on the amount of
RAM you have. Then set effective_cache_size to 75% of RAM (this is just
a hint to the planner; it won't really allocate memory).

Restart the database and try the queries again. Don't run them with
EXPLAIN ANALYZE, because that adds overhead that may easily make some of
the queries much slower. It's great to see the estimates and actual row
counts, but for timing queries it's a poor choice (even the TIMING OFF
option added in 9.2 is not exactly overhead-free). Maybe this is what
made the seqscan look much slower?

I usually run them from psql like this

    \o /dev/null
    \timing on
    SELECT ...

which gives me more reliable timing results (especially when executed
right on the server).

Only if all this tuning fails is it time to fine-tune the knobs, i.e.
the cost variables. Please don't change seq_page_cost - always keep it
at 1.0 and change only the other values.

For example, if everything fits into RAM you may change random_page_cost
to 1.5 or lower (I'd never recommend setting it lower than
seq_page_cost), and then you may start tuning the cpu_* costs. But
please, this is the last thing you should do - tune the server properly
first.

There's even a very nice wiki page about tuning PostgreSQL servers:
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

regards
Tomas
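The advice above, collected into a postgresql.conf fragment (the values assume a hypothetical 16 GB server, which the thread never states; scale them to your actual RAM):

```
# postgresql.conf -- illustrative values for a 16 GB machine
shared_buffers = 4GB            # ~25% of RAM, but not more than ~10GB
work_mem = 64MB                 # per sort/hash operation, per backend
maintenance_work_mem = 1GB      # for VACUUM, CREATE INDEX, etc.
effective_cache_size = 12GB     # ~75% of RAM; planner hint only, no allocation
seq_page_cost = 1.0             # always keep at 1.0
random_page_cost = 1.5          # only lower this if the data set fits in RAM
```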