Home > mailing lists

Understanding GROUP BY Plan in PostgreSQL - Mailing list pgsql-novice

From	Sameer Kumar
Subject	Understanding GROUP BY Plan in PostgreSQL
Date	May 14, 2015 01:01:40
Msg-id	CADp-Sm7dOrsyW9jaQCXMQ0isuLnmS0b8BeZeSNm9tO2-ywn2Qg@mail.gmail.com Whole thread Raw
List	pgsql-novice

Tree view

I am trying to understand PostgreSQL's plan for GROUP BY query.

Let me first share some details of the env-

RAM

edbstore=# \! free

total used free shared buffers cached

Mem: 1915456 1800936 114520 46268 98000 1194272

-/+ buffers/cache: 508664 1406792

Swap: 2097148 76 2097072

CPU

edbstore=# \! lscpu

Architecture: x86_64

CPU op-mode(s): 32-bit, 64-bit

Byte Order: Little Endian

CPU(s): 2

OS-

cat /etc/redhat-release

CentOS release 6.6 (Final)

usnam -a gives me below info-

2.6.32-504.el6.x86_64

Here is my table structure-

edbstore=# \d cust_hist

Table "edbstore.cust_hist"

Column | Type | Modifiers

------------+---------+-----------

customerid | integer | not null

orderid | integer | not null

prod_id | integer | not null

Indexes:

"idx_cust_hist_ord" btree (orderid)

"ix_cust_hist_customerid" btree (customerid)

Foreign-key constraints:

"fk_cust_hist_customerid" FOREIGN KEY (customerid) REFERENCES customers(customerid) ON DELETE CASCADE

The size of table

edbstore=# select pg_size_pretty(pg_relation_size('cust_hist'));

pg_size_pretty

----------------

2616 kB

(1 row)

edbstore=# select pg_size_pretty(pg_relation_size('idx_cust_hist_ord'));

pg_size_pretty

----------------

1336 kB

(1 row)

Stats in pg_class

edbstore=# select relpages, reltuples from pg_class where relname='cust_hist';

relpages | reltuples

----------+-----------

327 | 60350

(1 row)

edbstore=# select relpages, reltuples from pg_class where relname='idx_cust_hist_ord';

relpages | reltuples

----------+-----------

167 | 60350

(1 row)

edbstore=# select count(*) from cust_hist;

count

-------

60350

(1 row)

My DB parameters-

edbstore=# show work_mem;

work_mem

----------

1MB

(1 row)

edbstore=# show shared_buffers;

shared_buffers

----------------

128MB

(1 row)

edbstore=# show random_page_cost ;

random_page_cost

------------------

(1 row)

edbstore=# show seq_page_cost ;

seq_page_cost

---------------

(1 row)

edbstore=# select version();

version

---------------------------------------------------------------------------------------------------------------

PostgreSQL 9.3.6 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-52), 64-bit

(1 row)

The explain plan-

edbstore=# set enable_seqscan to on;

SET

edbstore=# explain analyze select orderid, count(1) from edbstore.cust_hist group by orderid;

QUERY PLAN

------------------------------------------------------------------------------------------------------------------

HashAggregate (cost=1232.25..1343.88 rows=11163 width=4) (actual time=49.494..55.302 rows=12000 loops=1)

-> Seq Scan on cust_hist (cost=0.00..930.50 rows=60350 width=4) (actual time=0.016..14.353 rows=60350 loops=1

)

Total runtime: 56.749 ms

(3 rows)

edbstore=# set enable_seqscan to off;

SET

edbstore=# explain analyze select orderid, count(1) from edbstore.cust_hist group by orderid;

QUERY PLAN

------------------------------------------------------------------------------------------------------------------

----------------------------------

GroupAggregate (cost=0.29..1980.92 rows=11163 width=4) (actual time=0.055..51.766 rows=12000 loops=1)

-> Index Only Scan using idx_cust_hist_ord on cust_hist (cost=0.29..1567.54 rows=60350 width=4) (actual time=

0.041..28.532 rows=60350 loops=1)

Heap Fetches: 60350

Total runtime: 53.227 ms

(4 rows)

I tried to increase data to 4 times (by copying the same set of data to the table 3 more times).

edbstore=# set enable_seqscan to on;

SET

edbstore=# explain analyze select orderid, count(1) from edbstore.cust_hist group by orderid;

QUERY PLAN

------------------------------------------------------------------------------------------------------------------

-----

HashAggregate (cost=4926.00..5035.16 rows=10916 width=4) (actual time=242.233..247.740 rows=12000 loops=1)

-> Seq Scan on cust_hist (cost=0.00..3719.00 rows=241400 width=4) (actual time=0.040..96.845 rows=241400 loop

s=1)

Total runtime: 249.415 ms

(3 rows)

edbstore=# set enable_seqscan to off;

SET

edbstore=# explain analyze select orderid, count(1) from edbstore.cust_hist group by orderid;

QUERY PLAN

------------------------------------------------------------------------------------------------------------------

-------------------------------------

GroupAggregate (cost=0.42..8796.20 rows=10916 width=4) (actual time=0.228..217.513 rows=12000 loops=1)

-> Index Only Scan using idx_cust_hist_ord on cust_hist (cost=0.42..7480.04 rows=241400 width=4) (actual time

=0.190..145.130 rows=241400 loops=1)

Heap Fetches: 241400

Total runtime: 219.003 ms

(4 rows)

It seems PostgreSQL always prefers to choose Sequential Scan over Index only scan. Also the cost of just the specific step involving Index Only Scan seems to be higher than sequential scan. I am also not able to understand the plan involving the index only scan where the Index Only scan starts with 0.42 cost and even the next step GroupAggregate starts with same cost.

Can someone please help me understand this plan also why PostgreSQL prefers the Sequential Scan for GROUP BY clause?

pgsql-novice by date:

From: Luca Ferrari
Date: 12 May 2015, 07:07:21
Subject: Re: Why is Hash index not transaction safe.

From: Wei Shan
Date: 14 May 2015, 15:38:50
Subject: Managing PostgreSQL Streaming Replication Cluster

Understanding GROUP BY Plan in PostgreSQL - Mailing list pgsql-novice

Previous

Next