Understanding GROUP BY Plan in PostgreSQL - Mailing list pgsql-novice
From | Sameer Kumar |
---|---|
Subject | Understanding GROUP BY Plan in PostgreSQL |
Date | |
Msg-id | CADp-Sm7dOrsyW9jaQCXMQ0isuLnmS0b8BeZeSNm9tO2-ywn2Qg@mail.gmail.com Whole thread Raw |
List | pgsql-novice |
I am trying to understand PostgreSQL's plan for GROUP BY query.
Let me first share some details of the env-
RAM
edbstore=# \! free
total used free shared buffers cached
Mem: 1915456 1800936 114520 46268 98000 1194272
-/+ buffers/cache: 508664 1406792
Swap: 2097148 76 2097072
CPU
edbstore=# \! lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
OS-
cat /etc/redhat-release
CentOS release 6.6 (Final)
usnam -a gives me below info-
2.6.32-504.el6.x86_64
Here is my table structure-
edbstore=# \d cust_hist
Table "edbstore.cust_hist"
Column | Type | Modifiers
------------+---------+-----------
customerid | integer | not null
orderid | integer | not null
prod_id | integer | not null
Indexes:
"idx_cust_hist_ord" btree (orderid)
"ix_cust_hist_customerid" btree (customerid)
Foreign-key constraints:
"fk_cust_hist_customerid" FOREIGN KEY (customerid) REFERENCES customers(customerid) ON DELETE CASCADE
The size of table
edbstore=# select pg_size_pretty(pg_relation_size('cust_hist'));
pg_size_pretty
----------------
2616 kB
(1 row)
edbstore=# select pg_size_pretty(pg_relation_size('idx_cust_hist_ord'));
pg_size_pretty
----------------
1336 kB
(1 row)
Stats in pg_class
edbstore=# select relpages, reltuples from pg_class where relname='cust_hist';
relpages | reltuples
----------+-----------
327 | 60350
(1 row)
edbstore=# select relpages, reltuples from pg_class where relname='idx_cust_hist_ord';
relpages | reltuples
----------+-----------
167 | 60350
(1 row)
edbstore=# select count(*) from cust_hist;
count
-------
60350
(1 row)
My DB parameters-
edbstore=# show work_mem;
work_mem
----------
1MB
(1 row)
edbstore=# show shared_buffers;
shared_buffers
----------------
128MB
(1 row)
edbstore=# show random_page_cost ;
random_page_cost
------------------
2
(1 row)
edbstore=# show seq_page_cost ;
seq_page_cost
---------------
1
(1 row)
edbstore=# select version();
version
---------------------------------------------------------------------------------------------------------------
PostgreSQL 9.3.6 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-52), 64-bit
(1 row)
The explain plan-
edbstore=# set enable_seqscan to on;
SET
edbstore=# explain analyze select orderid, count(1) from edbstore.cust_hist group by orderid;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
--
HashAggregate (cost=1232.25..1343.88 rows=11163 width=4) (actual time=49.494..55.302 rows=12000 loops=1)
-> Seq Scan on cust_hist (cost=0.00..930.50 rows=60350 width=4) (actual time=0.016..14.353 rows=60350 loops=1
)
Total runtime: 56.749 ms
(3 rows)
edbstore=# set enable_seqscan to off;
SET
edbstore=# explain analyze select orderid, count(1) from edbstore.cust_hist group by orderid;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
----------------------------------
GroupAggregate (cost=0.29..1980.92 rows=11163 width=4) (actual time=0.055..51.766 rows=12000 loops=1)
-> Index Only Scan using idx_cust_hist_ord on cust_hist (cost=0.29..1567.54 rows=60350 width=4) (actual time=
0.041..28.532 rows=60350 loops=1)
Heap Fetches: 60350
Total runtime: 53.227 ms
(4 rows)
I tried to increase data to 4 times (by copying the same set of data to the table 3 more times).
edbstore=# set enable_seqscan to on;
SET
edbstore=# explain analyze select orderid, count(1) from edbstore.cust_hist group by orderid;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
-----
HashAggregate (cost=4926.00..5035.16 rows=10916 width=4) (actual time=242.233..247.740 rows=12000 loops=1)
-> Seq Scan on cust_hist (cost=0.00..3719.00 rows=241400 width=4) (actual time=0.040..96.845 rows=241400 loop
s=1)
Total runtime: 249.415 ms
(3 rows)
edbstore=# set enable_seqscan to off;
SET
edbstore=# explain analyze select orderid, count(1) from edbstore.cust_hist group by orderid;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
-------------------------------------
GroupAggregate (cost=0.42..8796.20 rows=10916 width=4) (actual time=0.228..217.513 rows=12000 loops=1)
-> Index Only Scan using idx_cust_hist_ord on cust_hist (cost=0.42..7480.04 rows=241400 width=4) (actual time
=0.190..145.130 rows=241400 loops=1)
Heap Fetches: 241400
Total runtime: 219.003 ms
(4 rows)
It seems PostgreSQL always prefers to choose Sequential Scan over Index only scan. Also the cost of just the specific step involving Index Only Scan seems to be higher than sequential scan. I am also not able to understand the plan involving the index only scan where the Index Only scan starts with 0.42 cost and even the next step GroupAggregate starts with same cost.
Can someone please help me understand this plan also why PostgreSQL prefers the Sequential Scan for GROUP BY clause?
pgsql-novice by date: