Understanding GROUP BY Plan in PostgreSQL - Mailing list pgsql-novice

From Sameer Kumar
Subject Understanding GROUP BY Plan in PostgreSQL
Date
Msg-id CADp-Sm7dOrsyW9jaQCXMQ0isuLnmS0b8BeZeSNm9tO2-ywn2Qg@mail.gmail.com
Whole thread Raw
List pgsql-novice
I am trying to understand PostgreSQL's plan for GROUP BY query. 

Let me first share some details of the env-

RAM
edbstore=# \! free
             total       used       free     shared    buffers     cached
Mem:       1915456    1800936     114520      46268      98000    1194272
-/+ buffers/cache:     508664    1406792
Swap:      2097148         76    2097072


CPU
edbstore=# \! lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2

OS-

 cat /etc/redhat-release
CentOS release 6.6 (Final)

usnam -a gives me below info-
 2.6.32-504.el6.x86_64


Here is my table structure-
edbstore=# \d cust_hist
   Table "edbstore.cust_hist"
  Column   |  Type   | Modifiers
------------+---------+-----------
customerid | integer | not null
orderid    | integer | not null
prod_id    | integer | not null
Indexes:
   "idx_cust_hist_ord" btree (orderid)
   "ix_cust_hist_customerid" btree (customerid)
Foreign-key constraints:
   "fk_cust_hist_customerid" FOREIGN KEY (customerid) REFERENCES customers(customerid) ON DELETE CASCADE


The size of table
edbstore=# select pg_size_pretty(pg_relation_size('cust_hist'));
pg_size_pretty
----------------
2616 kB
(1 row)

edbstore=# select pg_size_pretty(pg_relation_size('idx_cust_hist_ord'));
pg_size_pretty
----------------
1336 kB
(1 row)

Stats in pg_class
edbstore=# select relpages, reltuples from pg_class where relname='cust_hist';
relpages | reltuples
----------+-----------
     327 |     60350
(1 row)

edbstore=# select relpages, reltuples from pg_class where relname='idx_cust_hist_ord';
relpages | reltuples
----------+-----------
     167 |     60350
(1 row)


edbstore=# select count(*) from cust_hist;
count
-------
60350
(1 row)



My DB parameters- 
edbstore=# show work_mem;
work_mem
----------
1MB
(1 row)

edbstore=# show shared_buffers;
shared_buffers
----------------
128MB
(1 row)

edbstore=# show random_page_cost ;
random_page_cost
------------------
2
(1 row)

edbstore=# show seq_page_cost ;
seq_page_cost
---------------
1
(1 row)

edbstore=# select version();
   version
---------------------------------------------------------------------------------------------------------------
PostgreSQL 9.3.6 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-52), 64-bit
(1 row)


The explain plan-

edbstore=# set enable_seqscan to on;
SET



edbstore=# explain analyze select orderid, count(1) from edbstore.cust_hist group by orderid;
                                                     QUERY PLAN

------------------------------------------------------------------------------------------------------------------
--
 HashAggregate  (cost=1232.25..1343.88 rows=11163 width=4) (actual time=49.494..55.302 rows=12000 loops=1)
   ->  Seq Scan on cust_hist  (cost=0.00..930.50 rows=60350 width=4) (actual time=0.016..14.353 rows=60350 loops=1
)
 Total runtime: 56.749 ms
(3 rows)

edbstore=# set enable_seqscan to off;
SET
edbstore=# explain analyze select orderid, count(1) from edbstore.cust_hist group by orderid;
                                                                     QUERY PLAN

------------------------------------------------------------------------------------------------------------------
----------------------------------
 GroupAggregate  (cost=0.29..1980.92 rows=11163 width=4) (actual time=0.055..51.766 rows=12000 loops=1)
   ->  Index Only Scan using idx_cust_hist_ord on cust_hist  (cost=0.29..1567.54 rows=60350 width=4) (actual time=
0.041..28.532 rows=60350 loops=1)
         Heap Fetches: 60350
 Total runtime: 53.227 ms
(4 rows)

I tried to increase data to 4 times (by copying the same set of data to the table 3 more times).

edbstore=# set enable_seqscan to on;
SET
edbstore=# explain analyze select orderid, count(1) from edbstore.cust_hist group by orderid;
                                                      QUERY PLAN

------------------------------------------------------------------------------------------------------------------
-----
 HashAggregate  (cost=4926.00..5035.16 rows=10916 width=4) (actual time=242.233..247.740 rows=12000 loops=1)
   ->  Seq Scan on cust_hist  (cost=0.00..3719.00 rows=241400 width=4) (actual time=0.040..96.845 rows=241400 loop
s=1)
 Total runtime: 249.415 ms
(3 rows)

edbstore=# set enable_seqscan to off;
SET
edbstore=# explain analyze select orderid, count(1) from edbstore.cust_hist group by orderid;
                                                                      QUERY PLAN

------------------------------------------------------------------------------------------------------------------
-------------------------------------
 GroupAggregate  (cost=0.42..8796.20 rows=10916 width=4) (actual time=0.228..217.513 rows=12000 loops=1)
   ->  Index Only Scan using idx_cust_hist_ord on cust_hist  (cost=0.42..7480.04 rows=241400 width=4) (actual time
=0.190..145.130 rows=241400 loops=1)
         Heap Fetches: 241400
 Total runtime: 219.003 ms
(4 rows)

It seems PostgreSQL always prefers to choose Sequential Scan over Index only scan. Also the cost of just the specific step involving Index Only Scan seems to be higher than sequential scan. I am also not able to understand the plan involving the index only scan where the Index Only scan starts with 0.42 cost and even the next step GroupAggregate starts with same cost.

Can someone please help me understand this plan also why PostgreSQL prefers the Sequential Scan for GROUP BY clause?

pgsql-novice by date:

Previous
From: Luca Ferrari
Date:
Subject: Re: Why is Hash index not transaction safe.
Next
From: Wei Shan
Date:
Subject: Managing PostgreSQL Streaming Replication Cluster