Re: [HACKERS] [PATCH] Incremental sort - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: [HACKERS] [PATCH] Incremental sort
Date
Msg-id CAPpHfdtHggE3gK1xytjk4QV1skXoeO2DHtcw37tBA5H+ew7N5Q@mail.gmail.com
In response to Re: [HACKERS] [PATCH] Incremental sort  (Alexander Korotkov <a.korotkov@postgrespro.ru>)
Responses Re: [HACKERS] [PATCH] Incremental sort  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Sat, Mar 10, 2018 at 6:42 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Thu, Mar 8, 2018 at 2:49 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Thank you very much for testing and benchmarking.  I'll investigate
the regressions you found.
 
Now, there's a caveat in those tests - the data set is synthetic and
perfectly random, i.e. all groups equally likely, no correlations or
anything like that.

I wonder what is the "worst case" scenario, i.e. how to construct a data
set with particularly bad behavior of the incremental sort.

I think that depends on the reason for the bad behavior of incremental sort.
For example, our quicksort implementation behaves very well on
presorted data.  But incremental sort appears not to be as good in
this case, as Heikki showed upthread.  That prompted me to test
presorted datasets (which appeared to be the "worst case") more intensively.
But I suspect that the regressions you found have a different reason, and
correspondingly the "worst case" would also be different.
Once I investigate the reason for the regressions, I'll try to construct
a "worst case" as well.
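
For reference, a presorted "worst case" along the lines Heikki described can be sketched like this (the table and index names here are hypothetical, not from the benchmark above): when nearly every prefix group contains only one row, incremental sort pays its per-group startup cost on every row without saving any work.

```sql
-- Hypothetical worst-case setup: input already ordered on the presorted key,
-- with one row per group, so incremental sort restarts for every single row.
CREATE TABLE t_presorted AS
    SELECT i AS a, (random() * 100)::int AS b
    FROM generate_series(1, 10000000) AS i;
CREATE INDEX t_presorted_a_idx ON t_presorted (a);
ANALYZE t_presorted;

-- Each value of "a" is unique, so sorting by (a, b) via the index on (a)
-- maximizes per-group overhead:
EXPLAIN ANALYZE SELECT * FROM t_presorted ORDER BY a, b;
```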

After some investigation of the benchmark results, I found two sources of
regressions in incremental sort.

Case 1: Underlying node scan loss is bigger than the incremental sort win

===== 33 [Wed Mar  7 10:14:14 CET 2018] scale:10000000 groups:10 work_mem:64MB incremental:on max_workers:0 =====
SELECT * FROM s_1 ORDER BY a, b
                                                                   QUERY PLAN                                                                    
-------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1588080.84..1588080.84 rows=1 width=20) (actual time=5874.527..5874.527 rows=0 loops=1)
   ->  Incremental Sort  (cost=119371.51..1488081.45 rows=9999939 width=20) (actual time=202.842..5653.224 rows=10000000 loops=1)
         Sort Key: s_1.a, s_1.b
         Presorted Key: s_1.a
         Sort Method: external merge  Disk: 29408kB
         Sort Groups: 11
         ->  Index Scan using s_1_a_idx on s_1  (cost=0.43..323385.52 rows=9999939 width=20) (actual time=0.051..1494.105 rows=10000000 loops=1)
 Planning time: 0.269 ms
 Execution time: 5877.367 ms
(9 rows)

===== 37 [Wed Mar  7 10:15:51 CET 2018] scale:10000000 groups:10 work_mem:64MB incremental:off max_workers:0 =====
SELECT * FROM s_1 ORDER BY a, b
                                                          QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1656439.93..1656439.93 rows=1 width=20) (actual time=4741.716..4741.716 rows=0 loops=1)
   ->  Sort  (cost=1531440.69..1556440.54 rows=9999939 width=20) (actual time=3522.156..4519.278 rows=10000000 loops=1)
         Sort Key: s_1.a, s_1.b
         Sort Method: external merge  Disk: 293648kB
         ->  Seq Scan on s_1  (cost=0.00..163694.39 rows=9999939 width=20) (actual time=0.021..650.322 rows=10000000 loops=1)
 Planning time: 0.249 ms
 Execution time: 4777.088 ms
(7 rows)

In this case the optimizer decided that "Index Scan + Incremental Sort" would be
cheaper than "Seq Scan + Sort".  But it appears that the amount of time we lose by
selecting Index Scan over Seq Scan is bigger than the amount of time we win by
selecting Incremental Sort over Sort.  I would note that the regular Sort consumes
about 10X more disk space.  I bet that all of this space fit into the OS cache of the
test machine, while the optimizer expected actual I/O to take place.  That made
the actual times inconsistent with the costing.
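
One way to probe this mismatch (a sketch, not part of the benchmark runs above) is to tell the planner that pages come from cache by lowering the page cost constants, and then re-check which plan it picks:

```sql
-- When the working set is fully cached by the OS, the default page costs
-- overstate I/O, penalizing the plan that writes/reads more spill data.
-- Lowering them should make the estimates track actual times more closely.
SET seq_page_cost = 0.1;      -- assumed: sequential pages served from OS cache
SET random_page_cost = 0.1;   -- assumed: random pages served from OS cache

EXPLAIN ANALYZE SELECT * FROM s_1 ORDER BY a, b;
```

If the plan then flips back to "Seq Scan + Sort", that would support the OS-cache explanation.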

Case 2: Underlying node is not parallelized

===== 178 [Wed Mar  7 11:18:53 CET 2018] scale:10000000 groups:100 work_mem:8MB incremental:on max_workers:2 =====
SELECT * FROM s_2 ORDER BY a, b, c
                                                                     QUERY PLAN                                                                     
----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1179047.88..1179047.88 rows=1 width=20) (actual time=4819.999..4819.999 rows=0 loops=1)
   ->  Incremental Sort  (cost=89.04..1079047.34 rows=10000054 width=20) (actual time=0.203..4603.197 rows=10000000 loops=1)
         Sort Key: s_2.a, s_2.b, s_2.c
         Presorted Key: s_2.a, s_2.b
         Sort Method: quicksort  Memory: 135kB
         Sort Groups: 10201
         ->  Index Scan using s_2_a_b_idx on s_2  (cost=0.43..406985.62 rows=10000054 width=20) (actual time=0.052..1461.177 rows=10000000 loops=1)
 Planning time: 0.313 ms
 Execution time: 4820.037 ms
(9 rows)

===== 182 [Wed Mar  7 11:20:11 CET 2018] scale:10000000 groups:100 work_mem:8MB incremental:off max_workers:2 =====
SELECT * FROM s_2 ORDER BY a, b, c
                                                                 QUERY PLAN                                                                 
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1705580.76..1705580.76 rows=1 width=20) (actual time=3985.818..3985.818 rows=0 loops=1)
   ->  Gather Merge  (cost=649951.66..1622246.98 rows=8333378 width=20) (actual time=1782.354..3750.868 rows=10000000 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=648951.64..659368.36 rows=4166689 width=20) (actual time=1778.362..2091.253 rows=3333333 loops=3)
               Sort Key: s_2.a, s_2.b, s_2.c
               Sort Method: external merge  Disk: 99136kB
               Worker 0:  Sort Method: external merge  Disk: 96984kB
               Worker 1:  Sort Method: external merge  Disk: 97496kB
               ->  Parallel Seq Scan on s_2  (cost=0.00..105361.89 rows=4166689 width=20) (actual time=0.022..233.640 rows=3333333 loops=3)
 Planning time: 0.265 ms
 Execution time: 4007.591 ms
(12 rows)

The situation is similar to case #1, except that in the "Seq Scan + Sort" pair the Sort
also gets parallelized.  As in the previous case, disk writes/reads during the
external sort are overestimated, because they actually hit the OS cache.
I would also say that this is not necessarily a wrong decision by the optimizer,
because doing this work in a single backend may consume fewer resources despite
being slower overall.
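
To separate the parallelism effect from the incremental-sort effect, one can force both plans into a single backend and compare them on equal footing (a sketch, assuming the s_2 table from the benchmark):

```sql
-- Disable parallel query so "Seq Scan + Sort" also runs in one backend;
-- any remaining difference is then attributable to incremental sort itself.
SET max_parallel_workers_per_gather = 0;
EXPLAIN ANALYZE SELECT * FROM s_2 ORDER BY a, b, c;

-- Restore the default for subsequent comparison runs.
RESET max_parallel_workers_per_gather;
```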

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
