Thread: PoC: Partial sort

PoC: Partial sort

From
Alexander Korotkov
Date:
Hackers!

Currently, when we need to get an ordered result from a table, we have to choose one of two approaches: get results from an index in exactly the order we need, or sort the tuples. However, it could be useful to mix both methods: get results from an index in an order that partially meets our requirements, and do the rest of the work from the heap.

The two attached patches are a proof of concept for this approach.

partial-sort-1.patch

This patch allows an index to be used for ORDER BY when the ORDER BY clause and the index have a non-empty common prefix. The index then provides the right ordering for the first n ORDER BY columns. In order to provide the right order for the remaining m columns, a sort node is inserted. This sort node sorts groups of tuples in which the values of the first n ORDER BY columns are equal.

See an example.

create table test as (select id, (random()*10000)::int as v1, random() as v2 from generate_series(1,1000000) id);
create index test_v1_idx on test (v1);

We have an index on the v1 column, but we can still get results ordered by v1, v2.

postgres=# select * from test order by v1, v2 limit 10;
   id   | v1 |         v2         
--------+----+--------------------
 390371 |  0 | 0.0284479795955122
 674617 |  0 | 0.0322008323855698
 881905 |  0 |  0.042586590629071
 972877 |  0 | 0.0531588457524776
 364903 |  0 | 0.0594307743012905
  82333 |  0 | 0.0666455538012087
 266488 |  0 |  0.072808934841305
 892215 |  0 | 0.0744258034974337
  13805 |  0 | 0.0794667331501842
 338435 |  0 |  0.171817752998322
(10 rows)

And it's fast, using the following plan.

                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=69214.06..69214.08 rows=10 width=16) (actual time=0.097..0.099 rows=10 loops=1)
   ->  Sort  (cost=69214.06..71714.06 rows=1000000 width=16) (actual time=0.096..0.097 rows=10 loops=1)
         Sort Key: v1, v2
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Index Scan using test_v1_idx on test  (cost=0.42..47604.42 rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)
 Total runtime: 0.125 ms
(6 rows)

Of course, this approach is effective only when the first n ORDER BY columns provide enough distinct values (so that the sorted groups are small). The patch is only a PoC because it doesn't contain any attempt to estimate the real cost of using a partial sort.
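For illustration, here is a sketch of the opposite case (a made-up table, not part of the patch or its tests), where the indexed prefix has too few distinct values for a partial sort to pay off:

create table test_lowcard as (select id, (random()*2)::int as v1, random() as v2 from generate_series(1,1000000) id);
create index test_lowcard_v1_idx on test_lowcard (v1);
-- v1 has only ~3 distinct values, so the first group alone holds hundreds
-- of thousands of rows; the partial sort gains little over an ordinary
-- top-N sort here.
select * from test_lowcard order by v1, v2 limit 10;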

partial-knn-1.patch

KNN-GiST provides the ability to get ordered results from an index, but this order is based only on index information. For instance, a GiST index contains bounding rectangles for polygons, so we can't get the exact distance to a polygon from the index (the situation is similar in PostGIS). In the attached patch, the GiST distance method can set a recheck flag (similar to the consistent method). This flag means that the distance method returned a lower bound of the distance, and we should recheck it against the heap.

See an example.

create table test as (select id, polygon(3+(random()*10)::int, circle(point(random(), random()), 0.0003 + random()*0.001)) as p from generate_series(1,1000000) id);
create index test_idx on test using gist (p);

We can get results ordered by the distance from each polygon to a point.

postgres=# select id, p <-> point(0.5,0.5) from test order by p <-> point(0.5,0.5) limit 10;
   id   |       ?column?       
--------+----------------------
 755611 | 0.000405855808916853
 807562 | 0.000464123777564343
 437778 | 0.000738524708741959
 947860 |  0.00076250998760724
 389843 | 0.000886362723569568
  17586 | 0.000981960100555216
 411329 |  0.00145338112316853
 894191 |  0.00149399559703506
 391907 |   0.0016647896049741
 235381 |  0.00167554614889509
(10 rows)

It's fast, using just an index scan.

                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230 rows=10 loops=1)
   ->  Index Scan using test_idx on test  (cost=0.29..157672.29 rows=1000000 width=36) (actual time=0.179..0.228 rows=10 loops=1)
         Order By: (p <-> '(0.5,0.5)'::point)
 Total runtime: 0.305 ms
(4 rows)

This patch is also only a PoC, for the following reasons:
1) It's probably wrong to fetch the heap tuple from the index scan node at all; this work should be done by another node.
2) The assumption that the order-by operator returns a float8 comparable with the result of the GiST distance method is wrong in the general case.

------
With best regards,
Alexander Korotkov.
Attachment

Re: PoC: Partial sort

From
Andres Freund
Date:
Hi,

Cool stuff.

On 2013-12-14 13:59:02 +0400, Alexander Korotkov wrote:
> Currently when we need to get ordered result from table we have to choose
> one of two approaches: get results from index in exact order we need or do
> sort of tuples. However, it could be useful to mix both methods: get
> results from index in order which partially meets our requirements and do
> rest of work from heap.

>
------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=69214.06..69214.08 rows=10 width=16) (actual
> time=0.097..0.099 rows=10 loops=1)
>    ->  Sort  (cost=69214.06..71714.06 rows=1000000 width=16) (actual
> time=0.096..0.097 rows=10 loops=1)
>          Sort Key: v1, v2
>          Sort Method: top-N heapsort  Memory: 25kB
>          ->  Index Scan using test_v1_idx on test  (cost=0.42..47604.42
> rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)
>  Total runtime: 0.125 ms
> (6 rows)

Is that actually all that beneficial when sorting with a bog standard
qsort() since that doesn't generally benefit from data being pre-sorted?
I think we might need to switch to a different algorithm to really
benefit from mostly pre-sorted input.

> *partial-knn-1.patch*
> 
> KNN-GiST provides ability to get ordered results from index, but this order
> is based only on index information. For instance, GiST index contains
> bounding rectangles for polygons, and we can't get exact distance to
> polygon from index (similar situation is in PostGIS). In attached patch,
> GiST distance method can set recheck flag (similar to consistent method).
> This flag means that distance method returned lower bound of distance and
> we should recheck it from heap.

> See an example.
> 
> create table test as (select id, polygon(3+(random()*10)::int,
> circle(point(random(), random()), 0.0003 + random()*0.001)) as p from
> generate_series(1,1000000) id);
> create index test_idx on test using gist (p);
> 
> We can get results ordered by distance from polygon to point.
> 
> postgres=# select id, p <-> point(0.5,0.5) from test order by p <->
> point(0.5,0.5) limit 10;

>
----------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230
> rows=10 loops=1)
>    ->  Index Scan using test_idx on test  (cost=0.29..157672.29
> rows=1000000 width=36) (actual time=0.179..0.228 rows=10 loops=1)
>          Order By: (p <-> '(0.5,0.5)'::point)
>  Total runtime: 0.305 ms
> (4 rows)

Rechecking from the heap means adding a sort node though, which I don't
see here? Or am I misunderstanding something?
Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: PoC: Partial sort

From
Alexander Korotkov
Date:
Hi!

Thanks for feedback!

On Sat, Dec 14, 2013 at 4:54 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Hi,

Cool stuff.

On 2013-12-14 13:59:02 +0400, Alexander Korotkov wrote:
> Currently when we need to get ordered result from table we have to choose
> one of two approaches: get results from index in exact order we need or do
> sort of tuples. However, it could be useful to mix both methods: get
> results from index in order which partially meets our requirements and do
> rest of work from heap.

> ------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=69214.06..69214.08 rows=10 width=16) (actual
> time=0.097..0.099 rows=10 loops=1)
>    ->  Sort  (cost=69214.06..71714.06 rows=1000000 width=16) (actual
> time=0.096..0.097 rows=10 loops=1)
>          Sort Key: v1, v2
>          Sort Method: top-N heapsort  Memory: 25kB
>          ->  Index Scan using test_v1_idx on test  (cost=0.42..47604.42
> rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)
>  Total runtime: 0.125 ms
> (6 rows)

Is that actually all that beneficial when sorting with a bog standard
qsort() since that doesn't generally benefit from data being pre-sorted?
I think we might need to switch to a different algorithm to really
benefit from mostly pre-sorted input.

In this patch I don't do a full sort of the dataset. For instance, the index returns data ordered by the first column and we need it ordered also by the second column. This node then sorts groups (assumed to be small) in which the values of the first column are the same, by the value of the second column. And with a LIMIT clause, only the required number of such groups will be processed. But I don't think we should expect pre-sorted values of the second column inside a group.
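For instance, on the test table from my first message the group sizes can be checked directly (an illustrative query, not something from the patch):

select v1, count(*) from test group by v1 order by count(*) desc limit 3;
-- with ~10000 distinct v1 values over 1M rows, each group holds about 100
-- tuples, so roughly only the first such group needs to be sorted to
-- satisfy a LIMIT 10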
 
> *partial-knn-1.patch*
>
> KNN-GiST provides ability to get ordered results from index, but this order
> is based only on index information. For instance, GiST index contains
> bounding rectangles for polygons, and we can't get exact distance to
> polygon from index (similar situation is in PostGIS). In attached patch,
> GiST distance method can set recheck flag (similar to consistent method).
> This flag means that distance method returned lower bound of distance and
> we should recheck it from heap.

> See an example.
>
> create table test as (select id, polygon(3+(random()*10)::int,
> circle(point(random(), random()), 0.0003 + random()*0.001)) as p from
> generate_series(1,1000000) id);
> create index test_idx on test using gist (p);
>
> We can get results ordered by distance from polygon to point.
>
> postgres=# select id, p <-> point(0.5,0.5) from test order by p <->
> point(0.5,0.5) limit 10;

> ----------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230
> rows=10 loops=1)
>    ->  Index Scan using test_idx on test  (cost=0.29..157672.29
> rows=1000000 width=36) (actual time=0.179..0.228 rows=10 loops=1)
>          Order By: (p <-> '(0.5,0.5)'::point)
>  Total runtime: 0.305 ms
> (4 rows)

Rechecking from the heap means adding a sort node though, which I don't
see here? Or am I misunderstanding something?

KNN-GiST contains an RB-tree of scanned items. In this patch an item is rechecked inside GiST and reinserted into the same RB-tree. That turned out to be a much easier implementation for a PoC, and it also looks very efficient. I'm not sure it is actually the right design, though. This is what I'd like to discuss.

------
With best regards,
Alexander Korotkov.
 

Re: PoC: Partial sort

From
Jeremy Harris
Date:
On 14/12/13 12:54, Andres Freund wrote:
> On 2013-12-14 13:59:02 +0400, Alexander Korotkov wrote:
>> Currently when we need to get ordered result from table we have to choose
>> one of two approaches: get results from index in exact order we need or do
>> sort of tuples. However, it could be useful to mix both methods: get
>> results from index in order which partially meets our requirements and do
>> rest of work from heap.
>
>>
------------------------------------------------------------------------------------------------------------------------------------------
>>   Limit  (cost=69214.06..69214.08 rows=10 width=16) (actual
>> time=0.097..0.099 rows=10 loops=1)
>>     ->  Sort  (cost=69214.06..71714.06 rows=1000000 width=16) (actual
>> time=0.096..0.097 rows=10 loops=1)
>>           Sort Key: v1, v2
>>           Sort Method: top-N heapsort  Memory: 25kB
>>           ->  Index Scan using test_v1_idx on test  (cost=0.42..47604.42
>> rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)
>>   Total runtime: 0.125 ms
>> (6 rows)
>
> Is that actually all that beneficial when sorting with a bog standard
> qsort() since that doesn't generally benefit from data being pre-sorted?
> I think we might need to switch to a different algorithm to really
> benefit from mostly pre-sorted input.

Eg:  http://www.postgresql.org/message-id/5291467E.6070807@wizmail.org

Maybe Alexander and I should bash our heads together.

-- 
Cheers,   Jeremy



Re: PoC: Partial sort

From
Martijn van Oosterhout
Date:
On Sat, Dec 14, 2013 at 06:21:18PM +0400, Alexander Korotkov wrote:
> > Is that actually all that beneficial when sorting with a bog standard
> > qsort() since that doesn't generally benefit from data being pre-sorted?
> > I think we might need to switch to a different algorithm to really
> > benefit from mostly pre-sorted input.
> >
>
> In this patch I don't do full sort of dataset. For instance, index returns
> data ordered by first column and we need to order them also by second
> column. Then this node sorts groups (assumed to be small) where values of
> the first column are same by value of second column. And with limit clause
> only required number of such groups will be processed. But, I don't think
> we should expect pre-sorted values of second column inside a group.

Nice. I imagine this would be mostly beneficial for fast-start plans,
since you no longer need to sort the whole table prior to returning the
first tuple.

Reduced memory usage might be a factor, especially for large sorts
where you otherwise might need to spool to disk.

You can now use an index on (a) to improve sorting for (a,b).

Cost of sorting n groups of size l goes from O(nl log nl) to just O(nl
log l), useful for large n.

Minor comments:

I find cmpTuple a bad name. That's what it's doing but perhaps
cmpSkipColumns would be clearer.

I think it's worthwhile adding a separate path for the skipCols = 0
case, to avoid extra copies.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.  -- Arthur Schopenhauer

Re: PoC: Partial sort

From
Andres Freund
Date:
Hi,

> > >  Limit  (cost=69214.06..69214.08 rows=10 width=16) (actual
> > > time=0.097..0.099 rows=10 loops=1)
> > >    ->  Sort  (cost=69214.06..71714.06 rows=1000000 width=16) (actual
> > > time=0.096..0.097 rows=10 loops=1)
> > >          Sort Key: v1, v2
> > >          Sort Method: top-N heapsort  Memory: 25kB
> > >          ->  Index Scan using test_v1_idx on test  (cost=0.42..47604.42
> > > rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)
> > >  Total runtime: 0.125 ms
> > > (6 rows)
> >
> > Is that actually all that beneficial when sorting with a bog standard
> > qsort() since that doesn't generally benefit from data being pre-sorted?
> > I think we might need to switch to a different algorithm to really
> > benefit from mostly pre-sorted input.
> >
> 
> In this patch I don't do full sort of dataset. For instance, index returns
> data ordered by first column and we need to order them also by second
> column.

Ah, that makes sense.

> But, I don't think we should expect pre-sorted values of second column
> inside a group.

Yes, if you do it that way, there doesn't seem to be any need to assume
that any more than we usually do.

I think you should make the explain output reflect the fact that we're
assuming v1 is presorted and just sorting v2. I'd be happy enough with:
Sort Key: v1, v2
Partial Sort: v2
or even just
"Partial Sort Key: [v1,] v2"
but I am sure others disagree.

> > > *partial-knn-1.patch*

> > Rechecking from the heap means adding a sort node though, which I don't
> > see here? Or am I misunderstanding something?

> KNN-GiST contain RB-tree of scanned items. In this patch item is rechecked
> inside GiST and reinserted into same RB-tree. It appears to be much easier
> implementation for PoC and also looks very efficient. I'm not sure what is
> actually right design for it. This is what I like to discuss.

I don't have enough clue about GiST to say whether it's the right design,
but it doesn't look wrong to my eyes. It'd probably be useful to export
to the outside the knowledge that we are rechecking, and how often that
happens.
While I didn't really look into the patch, I noticed in passing that you
pass an all_dead variable to heap_hot_search_buffer without using the
result; just pass NULL instead, that performs a bit less work.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: PoC: Partial sort

From
Andreas Karlsson
Date:
On 12/14/2013 10:59 AM, Alexander Korotkov wrote:
> This patch allows to use index for order-by if order-by clause and index
> has non-empty common prefix. So, index gives right ordering for first n
> order-by columns. In order to provide right order for rest m columns,
> sort node is inserted. This sort node sorts groups of tuples where
> values of first n order-by columns are equal.

I recently looked at the same problem. I see that you solved the 
rescanning problem by simply forcing the sort to be redone on 
ExecReScanSort if you have done a partial sort.

My idea for a solution was to modify tuplesort to allow storing the 
already sorted keys in either memtuples or the sort result file, but 
setting a field so it does not sort the already sorted tuples again. 
This would allow the rescan to work as it used to, but I am unsure how 
clean or ugly this code would be. Was this something you considered?

-- 
Andreas Karlsson



Re: PoC: Partial sort

From
Alexander Korotkov
Date:
On Sat, Dec 14, 2013 at 6:39 PM, Martijn van Oosterhout <kleptog@svana.org> wrote:
On Sat, Dec 14, 2013 at 06:21:18PM +0400, Alexander Korotkov wrote:
> > Is that actually all that beneficial when sorting with a bog standard
> > qsort() since that doesn't generally benefit from data being pre-sorted?
> > I think we might need to switch to a different algorithm to really
> > benefit from mostly pre-sorted input.
> >
>
> In this patch I don't do full sort of dataset. For instance, index returns
> data ordered by first column and we need to order them also by second
> column. Then this node sorts groups (assumed to be small) where values of
> the first column are same by value of second column. And with limit clause
> only required number of such groups will be processed. But, I don't think
> we should expect pre-sorted values of second column inside a group.

Nice. I imagine this would be mostly beneficial for fast-start plans,
since you no longer need to sort the whole table prior to returning the
first tuple.

Reduced memory usage might be a factor, especially for large sorts
where you otherwise might need to spool to disk.

You can now use an index on (a) to improve sorting for (a,b).

Cost of sorting n groups of size l goes from O(nl log nl) to just O(nl
log l), useful for large n.

Agreed, your reasoning looks correct.
 
Minor comments:

I find cmpTuple a bad name. That's what it's doing but perhaps
cmpSkipColumns would be clearer.

I think it's worthwhile adding a seperate path for the skipCols = 0
case, to avoid extra copies.

Thanks. I'll take care of that.

------
With best regards,
Alexander Korotkov.

Re: PoC: Partial sort

From
Alexander Korotkov
Date:
On Sat, Dec 14, 2013 at 7:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Hi,

> > >  Limit  (cost=69214.06..69214.08 rows=10 width=16) (actual
> > > time=0.097..0.099 rows=10 loops=1)
> > >    ->  Sort  (cost=69214.06..71714.06 rows=1000000 width=16) (actual
> > > time=0.096..0.097 rows=10 loops=1)
> > >          Sort Key: v1, v2
> > >          Sort Method: top-N heapsort  Memory: 25kB
> > >          ->  Index Scan using test_v1_idx on test  (cost=0.42..47604.42
> > > rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)
> > >  Total runtime: 0.125 ms
> > > (6 rows)
> >
> > Is that actually all that beneficial when sorting with a bog standard
> > qsort() since that doesn't generally benefit from data being pre-sorted?
> > I think we might need to switch to a different algorithm to really
> > benefit from mostly pre-sorted input.
> >
>
> In this patch I don't do full sort of dataset. For instance, index returns
> data ordered by first column and we need to order them also by second
> column.

Ah, that makes sense.

> But, I don't think we should expect pre-sorted values of second column
> inside a group.

Yes, if you do it that way, there doesn't seem to any need to assume
that any more than we usually do.

I think you should make the explain output reflect the fact that we're
assuming v1 is presorted and just sorting v2. I'd be happy enough with:
Sort Key: v1, v2
Partial Sort: v2
or even just
"Partial Sort Key: [v1,] v2"
but I am sure others disagree.

Sure, I just haven't changed the explain output yet. It should look like what you propose.

> > > *partial-knn-1.patch*

> > Rechecking from the heap means adding a sort node though, which I don't
> > see here? Or am I misunderstanding something?

> KNN-GiST contain RB-tree of scanned items. In this patch item is rechecked
> inside GiST and reinserted into same RB-tree. It appears to be much easier
> implementation for PoC and also looks very efficient. I'm not sure what is
> actually right design for it. This is what I like to discuss.

I don't have enough clue about gist to say wether it's the right design,
but it doesn't look wrong to my eyes. It'd probably be useful to export
the knowledge that we are rechecking and how often that happens to the
outside.
While I didn't really look into the patch, I noticed in passing that you
pass a all_dead variable to heap_hot_search_buffer without using the
result - just pass NULL instead, that performs a bit less work.

Useful note, thanks.

------
With best regards,
Alexander Korotkov.

Re: PoC: Partial sort

From
Alexander Korotkov
Date:
On Sat, Dec 14, 2013 at 11:47 PM, Andreas Karlsson <andreas@proxel.se> wrote:
On 12/14/2013 10:59 AM, Alexander Korotkov wrote:
This patch allows to use index for order-by if order-by clause and index
has non-empty common prefix. So, index gives right ordering for first n
order-by columns. In order to provide right order for rest m columns,
sort node is inserted. This sort node sorts groups of tuples where
values of first n order-by columns are equal.

I recently looked at the same problem. I see that you solved the rescanning problem by simply forcing the sort to be redone on ExecReScanSort if you have done a partial sort.

Honestly, I'm not sure I solved it at all :) I just got a version of the patch working for very limited use cases.
 
My idea for a solution was to modify tuplesort to allow storing the already sorted keys in either memtuples or the sort result file, but setting a field so it does not sort thee already sorted tuples again. This would allow the rescan to work as it used to, but I am unsure how clean or ugly this code would be. Was this something you considered?

I'm not sure. I believe the best answer depends on several parameters: how much memory we have for the sort, how expensive the underlying node is and how it handles rescan, and how big the groups in the partial sort are.

------
With best regards,
Alexander Korotkov.

Re: PoC: Partial sort

From
Andreas Karlsson
Date:
On 12/18/2013 01:02 PM, Alexander Korotkov wrote:
>     My idea for a solution was to modify tuplesort to allow storing the
>     already sorted keys in either memtuples or the sort result file, but
>     setting a field so it does not sort thee already sorted tuples
>     again. This would allow the rescan to work as it used to, but I am
>     unsure how clean or ugly this code would be. Was this something you
>     considered?
>
>
> I'm not sure. I believe that best answer depends on particular
> parameter: how much memory we've for sort, how expensive is underlying
> node and how it performs rescan, how big are groups in partial sort.

Yes, if one does not need a rescan your solution will use less memory 
and about the same amount of CPU (if the tuplesort does not spill to 
disk). If instead we keep all the already sorted tuples in the tuplesort, 
rescans will be cheap, but more memory will be used, with an increased 
chance of spilling to disk.

-- 
Andreas Karlsson



Re: PoC: Partial sort

From
Alexander Korotkov
Date:
Hi!

Next revision. It is expected to work better with the optimizer. It introduces a presorted_keys argument to the cost_sort function, which represents the number of keys already sorted in the Path. This function then uses estimate_num_groups to estimate the number of groups with distinct values of the presorted keys, and assumes that the dataset is uniformly divided among those groups. get_cheapest_fractional_path_for_pathkeys tries to select the path matching the largest part of the path keys.
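Roughly speaking, with N input tuples and G groups estimated by estimate_num_groups, and ignoring per-tuple and I/O costs, that assumption means the comparison count goes from a full sort's

N \log_2 N

down to about

G \cdot \frac{N}{G} \log_2 \frac{N}{G} \;=\; N \log_2 \frac{N}{G}

(this is just my shorthand for the idea above, not the exact formula used in the patch).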
You can see it working pretty well on single-table queries.

create table test as (select id, (random()*5)::int as v1, (random()*1000)::int as v2 from generate_series(1,1000000) id);
create index test_v1_idx on test (v1);
create index test_v1_v2_idx on test (v1, v2);
create index test_v2_idx on test (v2);
vacuum analyze;

postgres=# explain analyze select * from test order by v1, id;
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Sort  (cost=149244.84..151744.84 rows=1000000 width=12) (actual time=2111.476..2586.493 rows=1000000 loops=1)
   Sort Key: v1, id
   Sort Method: external merge  Disk: 21512kB
   ->  Seq Scan on test  (cost=0.00..15406.00 rows=1000000 width=12) (actual time=0.012..113.815 rows=1000000 loops=1)
 Total runtime: 2683.011 ms
(5 rows)

postgres=# explain analyze select * from test order by v1, id limit 10;
                                                                  QUERY PLAN                                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=11441.77..11442.18 rows=10 width=12) (actual time=79.980..79.982 rows=10 loops=1)
   ->  Partial sort  (cost=11441.77..53140.44 rows=1000000 width=12) (actual time=79.978..79.978 rows=10 loops=1)
         Sort Key: v1, id
         Presorted Key: v1
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Index Scan using test_v1_idx on test  (cost=0.42..47038.83 rows=1000000 width=12) (actual time=0.031..38.275 rows=100213 loops=1)
 Total runtime: 81.786 ms
(7 rows)

postgres=# explain analyze select * from test order by v1, v2 limit 10;
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..0.90 rows=10 width=12) (actual time=0.031..0.047 rows=10 loops=1)
   ->  Index Scan using test_v1_v2_idx on test  (cost=0.42..47286.28 rows=1000000 width=12) (actual time=0.029..0.043 rows=10 loops=1)
 Total runtime: 0.083 ms
(3 rows)

postgres=# explain analyze select * from test order by v2, id;
                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Partial sort  (cost=97.75..99925.50 rows=1000000 width=12) (actual time=1.069..1299.481 rows=1000000 loops=1)
   Sort Key: v2, id
   Presorted Key: v2
   Sort Method: quicksort  Memory: 52kB
   ->  Index Scan using test_v2_idx on test  (cost=0.42..47603.79 rows=1000000 width=12) (actual time=0.030..812.083 rows=1000000 loops=1)
 Total runtime: 1393.850 ms
(6 rows)

However, work with joins needs more improvements.

------
With best regards,
Alexander Korotkov.
Attachment

Re: PoC: Partial sort

From
Martijn van Oosterhout
Date:
On Sun, Dec 22, 2013 at 07:38:05PM +0400, Alexander Korotkov wrote:
> Hi!
>
> Next revision. It expected to do better work with optimizer. It introduces
> presorted_keys argument of cost_sort function which represent number of
> keys already sorted in Path. Then this function uses estimate_num_groups to
> estimate number of groups with different values of presorted keys and
> assumes that dataset is uniformly divided by
> groups. get_cheapest_fractional_path_for_pathkeys tries to select the path
> matching most part of path keys.
> You can see it's working pretty good on single table queries.

Nice work! The plans look good and the calculated costs seem sane also.

I suppose the problem with the joins is generating the pathkeys?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.  -- Arthur Schopenhauer

Re: PoC: Partial sort

From
Alexander Korotkov
Date:
On Sun, Dec 22, 2013 at 8:12 PM, Martijn van Oosterhout <kleptog@svana.org> wrote:
On Sun, Dec 22, 2013 at 07:38:05PM +0400, Alexander Korotkov wrote:
> Hi!
>
> Next revision. It expected to do better work with optimizer. It introduces
> presorted_keys argument of cost_sort function which represent number of
> keys already sorted in Path. Then this function uses estimate_num_groups to
> estimate number of groups with different values of presorted keys and
> assumes that dataset is uniformly divided by
> groups. get_cheapest_fractional_path_for_pathkeys tries to select the path
> matching most part of path keys.
> You can see it's working pretty good on single table queries.

Nice work! The plans look good and the calculated costs seem sane also.

I suppose the problem with the joins is generating the pathkeys?

In general, the problem is that a partial sort is an alternative to doing a less restrictive merge join and filtering its results. As far as I can see, taking care of that requires some rework of the merge join optimization. For now, I haven't figured out what that is going to look like. I'll try to dig more into the details.

------
With best regards,
Alexander Korotkov.

Re: PoC: Partial sort

From
Alexander Korotkov
Date:
On Sat, Dec 14, 2013 at 6:30 PM, Jeremy Harris <jgh@wizmail.org> wrote:
On 14/12/13 12:54, Andres Freund wrote:
On 2013-12-14 13:59:02 +0400, Alexander Korotkov wrote:
Currently when we need to get ordered result from table we have to choose
one of two approaches: get results from index in exact order we need or do
sort of tuples. However, it could be useful to mix both methods: get
results from index in order which partially meets our requirements and do
rest of work from heap.

------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=69214.06..69214.08 rows=10 width=16) (actual
time=0.097..0.099 rows=10 loops=1)
    ->  Sort  (cost=69214.06..71714.06 rows=1000000 width=16) (actual
time=0.096..0.097 rows=10 loops=1)
          Sort Key: v1, v2
          Sort Method: top-N heapsort  Memory: 25kB
          ->  Index Scan using test_v1_idx on test  (cost=0.42..47604.42
rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)
  Total runtime: 0.125 ms
(6 rows)

Is that actually all that beneficial when sorting with a bog standard
qsort() since that doesn't generally benefit from data being pre-sorted?
I think we might need to switch to a different algorithm to really
benefit from mostly pre-sorted input.

Eg:  http://www.postgresql.org/message-id/5291467E.6070807@wizmail.org

Maybe Alexander and I should bash our heads together.

The partial sort patch is mostly an optimizer/executor improvement rather than an improvement of the sort algorithm itself. But I would appreciate using enhancements of sorting algorithms in my work.

------
With best regards,
Alexander Korotkov. 

Re: PoC: Partial sort

From
Andreas Karlsson
Date:
On 12/22/2013 04:38 PM, Alexander Korotkov wrote:
> postgres=# explain analyze select * from test order by v1, id limit 10;
>                                                                    QUERY
> PLAN
>
-----------------------------------------------------------------------------------------------------------------------------------------------
>   Limit  (cost=11441.77..11442.18 rows=10 width=12) (actual
> time=79.980..79.982 rows=10 loops=1)
>     ->  Partial sort  (cost=11441.77..53140.44 rows=1000000 width=12)
> (actual time=79.978..79.978 rows=10 loops=1)
>           Sort Key: v1, id
>           Presorted Key: v1
>           Sort Method: top-N heapsort  Memory: 25kB
>           ->  Index Scan using test_v1_idx on test  (cost=0.42..47038.83
> rows=1000000 width=12) (actual time=0.031..38.275 rows=100213 loops=1)
>   Total runtime: 81.786 ms
> (7 rows)

Have you thought about how you plan to print which sort method and 
how much memory was used? Several different sort methods may have been 
used in the query. Should the largest amount of memory/disk be printed?

> However, work with joins needs more improvements.

That would be really nice to have, but the patch seems useful even 
without the improvements to joins.

-- 
Andreas Karlsson



Re: PoC: Partial sort

From
Alexander Korotkov
Date:
On Tue, Dec 24, 2013 at 6:02 AM, Andreas Karlsson <andreas@proxel.se> wrote:
On 12/22/2013 04:38 PM, Alexander Korotkov wrote:
postgres=# explain analyze select * from test order by v1, id limit 10;
                                                                   QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=11441.77..11442.18 rows=10 width=12) (actual
time=79.980..79.982 rows=10 loops=1)
    ->  Partial sort  (cost=11441.77..53140.44 rows=1000000 width=12)
(actual time=79.978..79.978 rows=10 loops=1)
          Sort Key: v1, id
          Presorted Key: v1
          Sort Method: top-N heapsort  Memory: 25kB
          ->  Index Scan using test_v1_idx on test  (cost=0.42..47038.83
rows=1000000 width=12) (actual time=0.031..38.275 rows=100213 loops=1)
  Total runtime: 81.786 ms
(7 rows)

Have you thought about how do you plan to print which sort method and how much memory was used? Several different sort methods may have been use in the query. Should the largest amount of memory/disk be printed?

Right now, the amount of memory for the last sorted group is printed. Your proposal makes sense: the largest amount of memory/disk should be printed.
 
However, work with joins needs more improvements.

That would be really nice to have, but the patch seems useful even without the improvements to joins.

The attached revision of the patch implements partial sort usage in merge joins.

create table test1 as (
select id, 
(random()*100)::int as v1, 
(random()*10000)::int as v2
from generate_series(1,1000000) id);

create table test2 as (
select id, 
(random()*100)::int as v1, 
(random()*10000)::int as v2
from generate_series(1,1000000) id);
create index test1_v1_idx on test1 (v1);
create index test2_v1_idx on test2 (v1);


# explain select * from test1 t1 join test2 t2 on t1.v1 = t2.v1 and t1.v2 = t2.v2;
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Merge Join  (cost=2257.67..255273.39 rows=983360 width=24)
   Merge Cond: ((t1.v1 = t2.v1) AND (t1.v2 = t2.v2))
   ->  Partial sort  (cost=1128.84..116470.79 rows=1000000 width=12)
         Sort Key: t1.v1, t1.v2
         Presorted Key: t1.v1
         ->  Index Scan using test1_v1_idx on test1 t1  (cost=0.42..47604.01 rows=1000000 width=12)
   ->  Materialize  (cost=1128.83..118969.00 rows=1000000 width=12)
         ->  Partial sort  (cost=1128.83..116469.00 rows=1000000 width=12)
               Sort Key: t2.v1, t2.v2
               Presorted Key: t2.v1
               ->  Index Scan using test2_v1_idx on test2 t2  (cost=0.42..47602.22 rows=1000000 width=12)

I believe the patch now covers the desired functionality. I'm going to focus on nailing down details, refactoring, and documenting.

------
With best regards,
Alexander Korotkov.
Attachment

Re: PoC: Partial sort

From
David Rowley
Date:
On Sat, Dec 28, 2013 at 9:28 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Tue, Dec 24, 2013 at 6:02 AM, Andreas Karlsson <andreas@proxel.se> wrote:
Attached revision of patch implements partial sort usage in merge joins.



I'm looking forward to doing a bit of testing on this patch. I think it is a really useful feature to get a bit more out of existing indexes.

I was about to test it tonight, but I'm having trouble getting the patch to compile... I'm really wondering which compiler you are using as it seems you're declaring your variables in some strange places.. See nodeSort.c line 101. These variables are declared after there has been an if statement in the same scope. That's not valid in C. (The patch did however apply without any complaints).

Here's a list of the errors I get when compiling with visual studios on windows.

"D:\Postgres\c\pgsql.sln" (default target) (1) ->
"D:\Postgres\c\postgres.vcxproj" (default target) (2) ->
(ClCompile target) ->
  src\backend\executor\nodeSort.c(101): error C2275: 'Sort' : illegal use of this type as an expression [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(101): error C2065: 'plannode' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(102): error C2275: 'PlanState' : illegal use of this type as an expression [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(102): error C2065: 'outerNode' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(103): error C2275: 'TupleDesc' : illegal use of this type as an expression [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(103): error C2146: syntax error : missing ';' before identifier 'tupDesc' [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(103): error C2065: 'tupDesc' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(120): error C2065: 'outerNode' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(121): error C2065: 'tupDesc' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(121): error C2065: 'outerNode' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(125): error C2065: 'tupDesc' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(126): error C2065: 'plannode' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(126): error C2223: left of '->numCols' must point to struct/union [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(127): error C2065: 'plannode' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(127): error C2223: left of '->sortColIdx' must point to struct/union [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(128): error C2065: 'plannode' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(128): error C2223: left of '->sortOperators' must point to struct/union [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(129): error C2065: 'plannode' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(129): error C2223: left of '->collations' must point to struct/union [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(130): error C2065: 'plannode' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(130): error C2223: left of '->nullsFirst' must point to struct/union [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(132): error C2198: 'tuplesort_begin_heap' : too few arguments for call [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(143): error C2065: 'outerNode' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(167): error C2065: 'tupDesc' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]

    13 Warning(s)
    24 Error(s)
 

Regards

David Rowley

Re: PoC: Partial sort

From
Alexander Korotkov
Date:
On Sat, Dec 28, 2013 at 1:04 PM, David Rowley <dgrowleyml@gmail.com> wrote:
On Sat, Dec 28, 2013 at 9:28 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Tue, Dec 24, 2013 at 6:02 AM, Andreas Karlsson <andreas@proxel.se> wrote:
Attached revision of patch implements partial sort usage in merge joins.



I'm looking forward to doing a bit of testing on this patch. I think it is a really useful feature to get a bit more out of existing indexes.

I was about to test it tonight, but I'm having trouble getting the patch to compile... I'm really wondering which compiler you are using as it seems you're declaring your variables in some strange places.. See nodeSort.c line 101. These variables are declared after there has been an if statement in the same scope. That's not valid in C. (The patch did however apply without any complaints).

Here's a list of the errors I get when compiling with visual studios on windows.

"D:\Postgres\c\pgsql.sln" (default target) (1) ->
"D:\Postgres\c\postgres.vcxproj" (default target) (2) ->
(ClCompile target) ->
  src\backend\executor\nodeSort.c(101): error C2275: 'Sort' : illegal use of this type as an expression [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(101): error C2065: 'plannode' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(102): error C2275: 'PlanState' : illegal use of this type as an expression [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(102): error C2065: 'outerNode' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(103): error C2275: 'TupleDesc' : illegal use of this type as an expression [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(103): error C2146: syntax error : missing ';' before identifier 'tupDesc' [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(103): error C2065: 'tupDesc' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(120): error C2065: 'outerNode' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(121): error C2065: 'tupDesc' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(121): error C2065: 'outerNode' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(125): error C2065: 'tupDesc' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(126): error C2065: 'plannode' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(126): error C2223: left of '->numCols' must point to struct/union [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(127): error C2065: 'plannode' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(127): error C2223: left of '->sortColIdx' must point to struct/union [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(128): error C2065: 'plannode' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(128): error C2223: left of '->sortOperators' must point to struct/union [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(129): error C2065: 'plannode' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(129): error C2223: left of '->collations' must point to struct/union [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(130): error C2065: 'plannode' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(130): error C2223: left of '->nullsFirst' must point to struct/union [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(132): error C2198: 'tuplesort_begin_heap' : too few arguments for call [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(143): error C2065: 'outerNode' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]
  src\backend\executor\nodeSort.c(167): error C2065: 'tupDesc' : undeclared identifier [D:\Postgres\c\postgres.vcxproj]

    13 Warning(s)
    24 Error(s)

I've compiled it with clang. Yeah, there were mixed declarations. I've rechecked it with gcc, and now it gives no warnings. I didn't try it with Visual Studio, but I hope it will be OK.

------
With best regards,
Alexander Korotkov.
Attachment

Re: PoC: Partial sort

From
David Rowley
Date:
On Sun, Dec 29, 2013 at 4:51 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
I've compiled it with clang. Yeah, there was mixed declarations. I've rechecked it with gcc, now it gives no warnings. I didn't try it with visual studio, but I hope it will be OK.


Thanks for the patch. It now compiles without any problems.
I've been doing a bit of testing with the patch, trying a few different workloads. One thing I've found is that in my test case, when the table contains only 1 tuple for any given set of presort column values, the query is actually slower than when there are, say, 100 tuples to sort for any given presort group.

Here is my test case:

DROP TABLE IF EXISTS temperature_readings;

CREATE TABLE temperature_readings (
  readingid SERIAL NOT NULL,
  timestamp TIMESTAMP NOT NULL,
  locationid INT NOT NULL,
  temperature INT NOT NULL,
  PRIMARY KEY (readingid)
);

INSERT INTO temperature_readings (timestamp,locationid,temperature)
SELECT ts.timestamp, loc.locationid, -10 + random() * 40
FROM generate_series('1900-04-01','2000-04-01','1 day'::interval) ts(timestamp)
CROSS JOIN generate_series(1,1) loc(locationid);

VACUUM ANALYZE temperature_readings;

-- Warm buffers
SELECT AVG(temperature) FROM temperature_readings;

explain (buffers, analyze) select * from temperature_readings order by timestamp,locationid; -- (seqscan -> sort) 70.805ms

-- create an index to allow presorting on timestamp.
CREATE INDEX temperature_readings_timestamp_idx ON temperature_readings (timestamp);

-- warm index buffers
SELECT COUNT(*) FROM (SELECT timestamp FROM temperature_readings ORDER BY timestamp) c;

explain (buffers, analyze) select * from temperature_readings order by timestamp,locationid; -- index scan -> partial sort 253.032ms

The first query, without the index to presort on, runs in 70.805 ms; the second query, which uses the index to presort, runs in 253.032 ms.
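For reference, the ~100 tuples per presort group case can be reproduced by reloading the table with more locations, for example (an illustrative variant of the script above, timings omitted):

TRUNCATE temperature_readings;
INSERT INTO temperature_readings (timestamp,locationid,temperature)
SELECT ts.timestamp, loc.locationid, -10 + random() * 40
FROM generate_series('1900-04-01','2000-04-01','1 day'::interval) ts(timestamp)
CROSS JOIN generate_series(1,100) loc(locationid); -- ~100 rows per timestamp
VACUUM ANALYZE temperature_readings;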

I ran the code through a performance profiler and found that about 80% of the time is spent in tuplesort_end and tuplesort_begin_heap. 

If it were possible to devise some way to reuse the previous tuplesortstate, perhaps just by inventing a reset method which clears out the tuples, then we could see performance exceed the standard seqscan -> sort. The code as it is seems to look up the sort functions from the syscache for each group and then allocate some sort space, so quite a bit of time is also spent in palloc0() and pfree().

If that is not possible, then maybe adding a cost for the number of sort groups would be better, so that the optimization is skipped if there are too many sort groups.

Regards

David Rowley




 

Re: PoC: Partial sort

From
Andreas Karlsson
Date:
On 12/29/2013 08:24 AM, David Rowley wrote:
> If it was possible to devise some way to reuse any
> previous tuplesortstate perhaps just inventing a reset method which
> clears out tuples, then we could see performance exceed the standard
> seqscan -> sort. The code the way it is seems to lookup the sort
> functions from the syscache for each group then allocate some sort
> space, so quite a bit of time is also spent in palloc0() and pfree()
>
> If it was not possible to do this then maybe adding a cost to the number
> of sort groups would be better so that the optimization is skipped if
> there are too many sort groups.

It should be possible. I have hacked a quick proof of concept for
reusing the tuplesort state. Can you try it and see if the performance
regression is fixed by this?

One thing which has to be fixed with my patch is that we probably want
to close the tuplesort once we have returned the last tuple from ExecSort().

I have attached my patch and the incremental patch on top of Alexander's patch.

--
Andreas Karlsson

Attachment

Re: PoC: Partial sort

From
David Rowley
Date:
On Tue, Dec 31, 2013 at 2:41 PM, Andreas Karlsson <andreas@proxel.se> wrote:
On 12/29/2013 08:24 AM, David Rowley wrote:
If it was possible to devise some way to reuse any
previous tuplesortstate perhaps just inventing a reset method which
clears out tuples, then we could see performance exceed the standard
seqscan -> sort. The code the way it is seems to lookup the sort
functions from the syscache for each group then allocate some sort
space, so quite a bit of time is also spent in palloc0() and pfree()

If it was not possible to do this then maybe adding a cost to the number
of sort groups would be better so that the optimization is skipped if
there are too many sort groups.

It should be possible. I have hacked a quick proof of concept for reusing the tuplesort state. Can you try it and see if the performance regression is fixed by this?

One thing which have to be fixed with my patch is that we probably want to close the tuplesort once we have returned the last tuple from ExecSort().

I have attached my patch and the incremental patch on Alexander's patch.


Thanks, the attached is about 5 times faster than it was previously with my test case upthread.

The times now look like:

No pre-sortable index:
Total runtime: 86.278 ms

With pre-sortable index with partial sorting
Total runtime: 47.500 ms

With the query where there is no index the sort remained in memory.

I spent some time trying to find a case where the partial sort is slower than the seqscan -> sort. The only place the partial sort seems slower is when the number of estimated sort groups is around the crossover point where the planner would be starting to think about performing a seqscan -> sort instead. I'm thinking right now that it's not worth raising the costs around this, as the partial sort is less likely to become a disk sort than the full sort is.

I'll keep going with trying to break it.

Regards

David Rowley

 

Re: PoC: Partial sort

From
Alvaro Herrera
Date:
David Rowley escribió:

> I was about to test it tonight, but I'm having trouble getting the patch to
> compile... I'm really wondering which compiler you are using as it seems
> you're declaring your variables in some strange places.. See nodeSort.c
> line 101. These variables are declared after there has been an if statement
> in the same scope. That's not valid in C. (The patch did however apply
> without any complaints).

AFAIR C99 allows mixed declarations and code.  Visual Studio only
implements C89 though, which is why it fails to compile there.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: PoC: Partial sort

From
Andreas Karlsson
Date:
On 12/28/2013 04:51 PM, Alexander Korotkov wrote:
> I've compiled it with clang. Yeah, there was mixed declarations. I've
> rechecked it with gcc, now it gives no warnings. I didn't try it with
> visual studio, but I hope it will be OK.

I looked at this version of the patch and noticed a possibility for 
improvement. You could decrement the bound for the tuplesort after every 
completed sort. Otherwise the optimizations for small limits won't apply 
to partial sort.

-- 
Andreas Karlsson



Re: PoC: Partial sort

From
Alexander Korotkov
Date:
On Tue, Dec 31, 2013 at 5:41 AM, Andreas Karlsson <andreas@proxel.se> wrote:
On 12/29/2013 08:24 AM, David Rowley wrote:
If it was possible to devise some way to reuse any
previous tuplesortstate perhaps just inventing a reset method which
clears out tuples, then we could see performance exceed the standard
seqscan -> sort. The code the way it is seems to lookup the sort
functions from the syscache for each group then allocate some sort
space, so quite a bit of time is also spent in palloc0() and pfree()

If it was not possible to do this then maybe adding a cost to the number
of sort groups would be better so that the optimization is skipped if
there are too many sort groups.

It should be possible. I have hacked a quick proof of concept for reusing the tuplesort state. Can you try it and see if the performance regression is fixed by this?

One thing which have to be fixed with my patch is that we probably want to close the tuplesort once we have returned the last tuple from ExecSort().

I have attached my patch and the incremental patch on Alexander's patch.

Thanks. It's included in the attached version of the patch, as well as estimation improvements, more comments, and a fix for the regression tests.

------
With best regards,
Alexander Korotkov.
Attachment

Re: PoC: Partial sort

From
Marti Raudsepp
Date:
Hi Alexander,

First, thanks a lot for working on this feature. This PostgreSQL shortcoming crops up all the time in web applications that implement paging by multiple sorted columns.

I've been trying it out in a few situations. I implemented a new enable_partialsort GUC to make it easier to turn on/off, this way it's a lot easier to test. The attached patch applies on top of partial-sort-5.patch

I will spend more time reviewing the patch, but some of this planner code is over my head. If there's any way I can help to make sure this lands in the next version, let me know.

----

The patch performs just as well as I would expect it to:

marti=# select ac.name, r.name from artist_credit ac join release r on (ac.id=r.artist_credit) order by ac.name, r.name limit 1000;
Time: 9.830 ms
marti=# set enable_partialsort = off;
marti=# select ac.name, r.name from artist_credit ac join release r on (ac.id=r.artist_credit) order by ac.name, r.name limit 1000;
Time: 1442.815 ms

A difference of almost 150x!

There's a missed opportunity in that the code doesn't consider pushing new Sort steps into subplans. For example, if there's no index on language(name) then this query cannot take advantage of partial sorts:

marti=# explain select l.name, r.name from language l join release r on (l.id=r.language) order by l.name, r.name limit 1000;
 Limit  (cost=123203.20..123205.70 rows=1000 width=32)
   ->  Sort  (cost=123203.20..126154.27 rows=1180430 width=32)
         Sort Key: l.name, r.name
         ->  Hash Join  (cost=229.47..58481.49 rows=1180430 width=32)
               Hash Cond: (r.language = l.id)
               ->  Seq Scan on release r  (cost=0.00..31040.10 rows=1232610 width=26)
               ->  Hash  (cost=131.43..131.43 rows=7843 width=14)
                     ->  Seq Scan on language l  (cost=0.00..131.43 rows=7843 width=14)

But because there are so few languages, it would be a lot faster to sort the languages in advance and then do a partial sort:
 Limit  (rows=1000 width=31)
   ->  Partial sort  (rows=1180881 width=31)
         Sort Key: l.name, r.name
         Presorted Key: l.name
         ->  Nested Loop  (rows=1180881 width=31)
               ->  Sort  (rows=7843 width=10)
                     Sort Key: name
                     ->  Seq Scan on language  (rows=7843 width=14)
               ->  Index Scan using release_language_idx on release r  (rows=11246 width=25)
                     Index Cond: (language = l.id)

Even an explicitly sorted CTE cannot take advantage of partial sorts:
marti=# explain with sorted_lang as (select id, name from language order by name)
marti-# select l.name, r.name from sorted_lang l join release r on (l.id=r.language) order by l.name, r.name limit 1000;
 Limit  (cost=3324368.83..3324371.33 rows=1000 width=240)
   CTE sorted_lang
     ->  Sort  (cost=638.76..658.37 rows=7843 width=14)
           Sort Key: language.name
           ->  Seq Scan on language  (cost=0.00..131.43 rows=7843 width=14)
   ->  Sort  (cost=3323710.46..3439436.82 rows=46290543 width=240)
         Sort Key: l.name, r.name
         ->  Merge Join  (cost=664.62..785649.92 rows=46290543 width=240)
               Merge Cond: (r.language = l.id)
               ->  Index Scan using release_language_idx on release r  (cost=0.43..87546.06 rows=1232610 width=26)
               ->  Sort  (cost=664.19..683.80 rows=7843 width=222)
                     Sort Key: l.id
                     ->  CTE Scan on sorted_lang l  (cost=0.00..156.86 rows=7843 width=222)

But even with these limitations, this will easily be the killer feature of the next release, for me at least.

Regards,
Marti


On Mon, Jan 13, 2014 at 8:01 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Tue, Dec 31, 2013 at 5:41 AM, Andreas Karlsson <andreas@proxel.se> wrote:
On 12/29/2013 08:24 AM, David Rowley wrote:
If it was possible to devise some way to reuse any
previous tuplesortstate perhaps just inventing a reset method which
clears out tuples, then we could see performance exceed the standard
seqscan -> sort. The code the way it is seems to lookup the sort
functions from the syscache for each group then allocate some sort
space, so quite a bit of time is also spent in palloc0() and pfree()

If it was not possible to do this then maybe adding a cost to the number
of sort groups would be better so that the optimization is skipped if
there are too many sort groups.

It should be possible. I have hacked a quick proof of concept for reusing the tuplesort state. Can you try it and see if the performance regression is fixed by this?

One thing which have to be fixed with my patch is that we probably want to close the tuplesort once we have returned the last tuple from ExecSort().

I have attached my patch and the incremental patch on Alexander's patch.

Thanks. It's included into attached version of patch. As wall as estimation improvements, more comments and regression tests fix.

------
With best regards,
Alexander Korotkov.




Attachment

Re: PoC: Partial sort

From
Alexander Korotkov
Date:
Hi!

On Tue, Jan 14, 2014 at 12:54 AM, Marti Raudsepp <marti@juffo.org> wrote:
First, thanks a lot for working on this feature. This PostgreSQL shortcoming crops up in all the time in web applications that implement paging by multiple sorted columns.
 
Thanks!

I've been trying it out in a few situations. I implemented a new enable_partialsort GUC to make it easier to turn on/off, this way it's a lot easier to test. The attached patch applies on top of partial-sort-5.patch

I thought about such an option, generally not because of testing convenience, but because of the overhead of planning. The way you implemented it is quite naive :) For instance, a merge join may rely on a partial sort which will then be replaced with a simple sort.
 
I will spend more time reviewing the patch, but some of this planner code is over my head. If there's any way I can help to make sure this lands in the next version, let me know.

----

The patch performs just as well as I would expect it to:

marti=# select ac.name, r.name from artist_credit ac join release r on (ac.id=r.artist_credit) order by ac.name, r.name limit 1000;
Time: 9.830 ms
marti=# set enable_partialsort = off;
marti=# select ac.name, r.name from artist_credit ac join release r on (ac.id=r.artist_credit) order by ac.name, r.name limit 1000;
Time: 1442.815 ms

A difference of almost 150x!

There's a missed opportunity in that the code doesn't consider pushing new Sort steps into subplans. For example, if there's no index on language(name) then this query cannot take advantage partial sorts:

marti=# explain select l.name, r.name from language l join release r on (l.id=r.language) order by l.name, r.name limit 1000;
 Limit  (cost=123203.20..123205.70 rows=1000 width=32)
   ->  Sort  (cost=123203.20..126154.27 rows=1180430 width=32)
         Sort Key: l.name, r.name
         ->  Hash Join  (cost=229.47..58481.49 rows=1180430 width=32)
               Hash Cond: (r.language = l.id)
               ->  Seq Scan on release r  (cost=0.00..31040.10 rows=1232610 width=26)
               ->  Hash  (cost=131.43..131.43 rows=7843 width=14)
                     ->  Seq Scan on language l  (cost=0.00..131.43 rows=7843 width=14)

But because there are only so few languages, it would be a lot faster to sort languages in advance and then do partial sort:
 Limit  (rows=1000 width=31)
   ->  Partial sort  (rows=1180881 width=31)
         Sort Key: l.name, r.name
         Presorted Key: l.name
         ->  Nested Loop  (rows=1180881 width=31)
               ->  Sort  (rows=7843 width=10)
                     Sort Key: name
                     ->  Seq Scan on language  (rows=7843 width=14)
               ->  Index Scan using release_language_idx on release r  (rows=11246 width=25)
                     Index Cond: (language = l.id)

Even an explicit sorted CTE cannot take advantage of partial sorts:
marti=# explain with sorted_lang as (select id, name from language order by name)
marti-# select l.name, r.name from sorted_lang l join release r on (l.id=r.language) order by l.name, r.name limit 1000;
 Limit  (cost=3324368.83..3324371.33 rows=1000 width=240)
   CTE sorted_lang
     ->  Sort  (cost=638.76..658.37 rows=7843 width=14)
           Sort Key: language.name
           ->  Seq Scan on language  (cost=0.00..131.43 rows=7843 width=14)
   ->  Sort  (cost=3323710.46..3439436.82 rows=46290543 width=240)
         Sort Key: l.name, r.name
         ->  Merge Join  (cost=664.62..785649.92 rows=46290543 width=240)
               Merge Cond: (r.language = l.id)
               ->  Index Scan using release_language_idx on release r  (cost=0.43..87546.06 rows=1232610 width=26)
               ->  Sort  (cost=664.19..683.80 rows=7843 width=222)
                     Sort Key: l.id
                     ->  CTE Scan on sorted_lang l  (cost=0.00..156.86 rows=7843 width=222)

But even with these limitations, this will easily be the killer feature of the next release, for me at least.

I see. But I don't think it can be achieved by small changes in the planner. Moreover, I didn't check, but I think that even if you remove the ordering by r.name you will still not get the languages sorted in the inner node. So, this problem is not directly related to partial sort.

------
With best regards,
Alexander Korotkov

 

Re: PoC: Partial sort

From
Marti Raudsepp
Date:
On Tue, Jan 14, 2014 at 5:49 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> I implemented a new
>> enable_partialsort GUC to make it easier to turn on/off

> I though about such option. Generally not because of testing convenience,
> but because of overhead of planning. This way you implement it is quite
> naive :) For instance, merge join rely on partial sort which will be
> replaced with simple sort.

Oh, this actually highlights a performance regression with the partial sort patch. I assumed the planner would discard the full sort because of higher costs, but it looks like the new code always assumes that a Partial sort will be cheaper than a Join Filter without considering costs. When doing a join USING (unique_indexed_value, something), the new plan is significantly worse.

Unpatched:
marti=# explain analyze select * from release a join release b using (id, name);
 Merge Join  (cost=0.85..179810.75 rows=12 width=158) (actual time=0.011..1279.596 rows=1232610 loops=1)
   Merge Cond: (a.id = b.id)
   Join Filter: ((a.name)::text = (b.name)::text)
   ->  Index Scan using release_id_idx on release a  (cost=0.43..79120.04 rows=1232610 width=92) (actual time=0.005..211.928 rows=1232610 loops=1)
   ->  Index Scan using release_id_idx on release b  (cost=0.43..79120.04 rows=1232610 width=92) (actual time=0.004..371.592 rows=1232610 loops=1)
 Total runtime: 1309.049 ms

Patched:
 Merge Join  (cost=0.98..179810.87 rows=12 width=158) (actual time=0.037..5034.158 rows=1232610 loops=1)
   Merge Cond: ((a.id = b.id) AND ((a.name)::text = (b.name)::text))
   ->  Partial sort  (cost=0.49..82201.56 rows=1232610 width=92) (actual time=0.013..955.938 rows=1232610 loops=1)
         Sort Key: a.id, a.name
         Presorted Key: a.id
         Sort Method: quicksort  Memory: 25kB
         ->  Index Scan using release_id_idx on release a  (cost=0.43..79120.04 rows=1232610 width=92) (actual time=0.007..449.332 rows=1232610 loops=1)
   ->  Materialize  (cost=0.49..85283.09 rows=1232610 width=92) (actual time=0.019..1352.377 rows=1232610 loops=1)
         ->  Partial sort  (cost=0.49..82201.56 rows=1232610 width=92) (actual time=0.018..1223.251 rows=1232610 loops=1)
               Sort Key: b.id, b.name
               Presorted Key: b.id
               Sort Method: quicksort  Memory: 25kB
               ->  Index Scan using release_id_idx on release b  (cost=0.43..79120.04 rows=1232610 width=92) (actual time=0.004..597.258 rows=1232610 loops=1)
 Total runtime: 5166.906 ms

----

There's another "wishlist" kind of thing with top-N heapsort bounds; if I do a query with LIMIT 1000 then every sort batch has Tuplesortstate.bound set to 1000, but it could be reduced after each batch. If the first batch is 900 rows then the 2nd batch only needs the top 100 rows at most.

Also, I find the name "partial sort" a bit confusing; this feature is not actually sorting *partially*, it's finishing the sort of partially-sorted data. Perhaps "batched sort" would explain the feature better? Because it does the sort in multiple batches instead of all at once. But maybe that's just me.

Regards,
Marti

Re: PoC: Partial sort

From
Alexander Korotkov
Date:
On Tue, Jan 14, 2014 at 11:16 PM, Marti Raudsepp <marti@juffo.org> wrote:
On Tue, Jan 14, 2014 at 5:49 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> I implemented a new
>> enable_partialsort GUC to make it easier to turn on/off

> I though about such option. Generally not because of testing convenience,
> but because of overhead of planning. This way you implement it is quite
> naive :) For instance, merge join rely on partial sort which will be
> replaced with simple sort.

Oh, this actually highlights a performance regression with the partial sort patch. I assumed the planner will discard the full sort because of higher costs, but it looks like the new code always assumes that a Partial sort will be cheaper than a Join Filter without considering costs. When doing a join USING (unique_indexed_value, something), the new plan is significantly worse.

Unpatched:
marti=# explain analyze select * from release a join release b using (id, name);
 Merge Join  (cost=0.85..179810.75 rows=12 width=158) (actual time=0.011..1279.596 rows=1232610 loops=1)
   Merge Cond: (a.id = b.id)
   Join Filter: ((a.name)::text = (b.name)::text)
   ->  Index Scan using release_id_idx on release a  (cost=0.43..79120.04 rows=1232610 width=92) (actual time=0.005..211.928 rows=1232610 loops=1)
   ->  Index Scan using release_id_idx on release b  (cost=0.43..79120.04 rows=1232610 width=92) (actual time=0.004..371.592 rows=1232610 loops=1)
 Total runtime: 1309.049 ms

Patched:
 Merge Join  (cost=0.98..179810.87 rows=12 width=158) (actual time=0.037..5034.158 rows=1232610 loops=1)
   Merge Cond: ((a.id = b.id) AND ((a.name)::text = (b.name)::text))
   ->  Partial sort  (cost=0.49..82201.56 rows=1232610 width=92) (actual time=0.013..955.938 rows=1232610 loops=1)
         Sort Key: a.id, a.name
         Presorted Key: a.id
         Sort Method: quicksort  Memory: 25kB
         ->  Index Scan using release_id_idx on release a  (cost=0.43..79120.04 rows=1232610 width=92) (actual time=0.007..449.332 rows=1232610 loops=1)
   ->  Materialize  (cost=0.49..85283.09 rows=1232610 width=92) (actual time=0.019..1352.377 rows=1232610 loops=1)
         ->  Partial sort  (cost=0.49..82201.56 rows=1232610 width=92) (actual time=0.018..1223.251 rows=1232610 loops=1)
               Sort Key: b.id, b.name
               Presorted Key: b.id
               Sort Method: quicksort  Memory: 25kB
               ->  Index Scan using release_id_idx on release b  (cost=0.43..79120.04 rows=1232610 width=92) (actual time=0.004..597.258 rows=1232610 loops=1)
 Total runtime: 5166.906 ms

----
 
Interesting. Could you share the dataset?

There's another "wishlist" kind of thing with top-N heapsort bounds; if I do a query with LIMIT 1000 then every sort batch has Tuplesortstate.bound set to 1000, but it could be reduced after each batch. If the first batch is 900 rows then the 2nd batch only needs the top 100 rows at most.

Right. Just didn't implement it yet.
 
Also, I find the name "partial sort" a bit confusing; this feature is not actually sorting *partially*, it's finishing the sort of partially-sorted data. Perhaps "batched sort" would explain the feature better? Because it does the sort in multiple batches instead of all at once. But maybe that's just me.

I'm not sure. To me "batched sort" sounds like we're going to sort in one batch something that was sorted separately before. Probably I'm wrong because I'm far from a native English speaker :)

------
With best regards,
Alexander Korotkov.  

Re: PoC: Partial sort

From
Marti Raudsepp
Date:
On Tue, Jan 14, 2014 at 9:28 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Tue, Jan 14, 2014 at 11:16 PM, Marti Raudsepp <marti@juffo.org> wrote:
Oh, this actually highlights a performance regression with the partial sort patch.
 
Interesting. Could you share the dataset?

It occurs with many datasets if work_mem is sufficiently low (10MB in my case). Here's a quicker way to reproduce a similar issue:

create table foo as select i, i as j from generate_series(1,10000000) i;
create index on foo(i);
explain analyze select * from foo a join foo b using (i, j);

The real data is from the "release" table from MusicBrainz database dump: https://musicbrainz.org/doc/MusicBrainz_Database/Download . It's nontrivial to set up though, so if you still need the real data, I can upload a pgdump for you.

Regards,
Marti

Re: PoC: Partial sort

From
Jeremy Harris
Date:
On 22/12/13 20:26, Alexander Korotkov wrote:
> On Sat, Dec 14, 2013 at 6:30 PM, Jeremy Harris <jgh@wizmail.org> wrote:
>
>> On 14/12/13 12:54, Andres Freund wrote:
>>> Is that actually all that beneficial when sorting with a bog standard
>>> qsort() since that doesn't generally benefit from data being pre-sorted?
>>> I think we might need to switch to a different algorithm to really
>>> benefit from mostly pre-sorted input.
>>>
>>
>> Eg:  http://www.postgresql.org/message-id/5291467E.6070807@wizmail.org
>>
>> Maybe Alexander and I should bash our heads together.
>
>
> Partial sort patch is mostly optimizer/executor improvement rather than
> improvement of sort algorithm itself.

I finally got as far as understanding Alexander's cleverness, and it
does make the performance advantage (on partially-sorted input) of the
merge-sort irrelevant.

There's a slight tradeoff possible between the code complexity of
the chunking code front-ending the sorter and just using the
enhanced sorter.  The chunking does reduce the peak memory usage
quite nicely too.

The implementation of the chunker does O(n) compares using the
keys of the feed-stream index, to identify the chunk boundaries.
Would it be possible to get this information from the Index Scan?
-- 
Cheers,   Jeremy




Re: PoC: Partial sort

From
Jeremy Harris
Date:
On 13/01/14 18:01, Alexander Korotkov wrote:
> Thanks. It's included into attached version of patch. As wall as estimation
> improvements, more comments and regression tests fix.

Would it be possible to totally separate the two sets of sort-keys,
only giving the non-index set to the tuplesort?  At present tuplesort
will, when it has a group to sort, make wasted compares on the
indexed set of keys before starting on the non-indexed set.
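Concretely, the idea would amount to roughly the following when setting up the tuplesort (a sketch only; skipCols, the number of presorted leading columns, comes from the patch and is not an existing field):

tuplesortstate = tuplesort_begin_heap(tupDesc,
                                      plannode->numCols - skipCols,
                                      plannode->sortColIdx + skipCols,
                                      plannode->sortOperators + skipCols,
                                      plannode->collations + skipCols,
                                      plannode->nullsFirst + skipCols,
                                      work_mem,
                                      node->randomAccess);

so that, within a group, tuplesort never re-compares the indexed prefix.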
-- 
Cheers,  Jeremy




Re: PoC: Partial sort

From
Marti Raudsepp
Date:
Hi,

There's another small regression with this patch when used with expensive comparison functions, such as long text fields.

If we go through all this trouble in cmpSortSkipCols to prove that the first N sortkeys are equal, it would be nice if Tuplesort could skip their comparisons entirely; that's another nice optimization this patch can provide.

I've implemented that in the attached patch, which applies on top of your partial-sort-5.patch

Should the "Sort Key" field in EXPLAIN output be changed as well? I'd say no, I think that makes the partial sort steps harder to read.

Generate test data:
create table longtext as select (select repeat('a', 1000*100)) a, generate_series(1,1000) i;
create index on longtext(a);

Unpatched (using your original partial-sort-5.patch):
=# explain analyze select * from longtext order by a, i limit 10;
 Limit  (cost=2.34..19.26 rows=10 width=1160) (actual time=13477.739..13477.756 rows=10 loops=1)
   ->  Partial sort  (cost=2.34..1694.15 rows=1000 width=1160) (actual time=13477.737..13477.742 rows=10 loops=1)
         Sort Key: a, i
         Presorted Key: a
         Sort Method: top-N heapsort  Memory: 45kB
         ->  Index Scan using longtext_a_idx on longtext  (cost=0.65..1691.65 rows=1000 width=1160) (actual time=0.015..2.364 rows=1000 loops=1)
 Total runtime: 13478.158 ms
(7 rows)

=# set enable_indexscan=off;
=# explain analyze select * from longtext order by a, i limit 10;
 Limit  (cost=198.61..198.63 rows=10 width=1160) (actual time=6970.439..6970.458 rows=10 loops=1)
   ->  Sort  (cost=198.61..201.11 rows=1000 width=1160) (actual time=6970.438..6970.444 rows=10 loops=1)
         Sort Key: a, i
         Sort Method: top-N heapsort  Memory: 45kB
         ->  Seq Scan on longtext  (cost=0.00..177.00 rows=1000 width=1160) (actual time=0.007..1.763 rows=1000 loops=1)
 Total runtime: 6970.491 ms

Patched:
=# explain analyze select * from longtext order by a, i ;
 Partial sort  (cost=2.34..1694.15 rows=1000 width=1160) (actual time=0.024..4.603 rows=1000 loops=1)
   Sort Key: a, i
   Presorted Key: a
   Sort Method: quicksort  Memory: 27kB
   ->  Index Scan using longtext_a_idx on longtext  (cost=0.65..1691.65 rows=1000 width=1160) (actual time=0.013..2.094 rows=1000 loops=1)
 Total runtime: 5.418 ms

Regards,
Marti
Attachment

Re: PoC: Partial sort

From
Marti Raudsepp
Date:
Funny, I just wrote a patch to do that some minutes ago (didn't see your email yet).

http://www.postgresql.org/message-id/CABRT9RCK=wmFUYZdqU_+MOFW5PDevLxJmZ5B=eTJJNUBvyARxw@mail.gmail.com

Regards,
Marti



On Sat, Jan 18, 2014 at 7:10 PM, Jeremy Harris <jgh@wizmail.org> wrote:
On 13/01/14 18:01, Alexander Korotkov wrote:
Thanks. It's included into attached version of patch. As wall as estimation
improvements, more comments and regression tests fix.

Would it be possible to totally separate the two sets of sort-keys,
only giving the non-index set to the tuplesort?  At present tuplesort
will, when it has a group to sort, make wasted compares on the
indexed set of keys before starting on the non-indexed set.
--
Cheers,
  Jeremy





Re: PoC: Partial sort

From
Jeremy Harris
Date:
On 31/12/13 01:41, Andreas Karlsson wrote:
> On 12/29/2013 08:24 AM, David Rowley wrote:
>> If it was possible to devise some way to reuse any
>> previous tuplesortstate perhaps just inventing a reset method which
>> clears out tuples, then we could see performance exceed the standard
>> seqscan -> sort. The code the way it is seems to lookup the sort
>> functions from the syscache for each group then allocate some sort
>> space, so quite a bit of time is also spent in palloc0() and pfree()
>>
>> If it was not possible to do this then maybe adding a cost to the number
>> of sort groups would be better so that the optimization is skipped if
>> there are too many sort groups.
>
> It should be possible. I have hacked a quick proof of concept for
> reusing the tuplesort state. Can you try it and see if the performance
> regression is fixed by this?
>
> One thing which have to be fixed with my patch is that we probably want
> to close the tuplesort once we have returned the last tuple from
> ExecSort().
>
> I have attached my patch and the incremental patch on Alexander's patch.

How does this work in combination with randomAccess ?
-- 
Thanks,   Jeremy




Re: PoC: Partial sort

From
Marti Raudsepp
Date:
On Sat, Jan 18, 2014 at 7:22 PM, Marti Raudsepp <marti@juffo.org> wrote:
> Total runtime: 5.418 ms

Oops, shouldn't have rushed this. Clearly the timings should have
tipped me off that it's broken. I didn't notice that cmpSortSkipCols
was re-using tuplesort's sortkeys.

Here's a patch that actually works; I added a new skipKeys attribute
to SortState. I had to extract the SortSupport-creation code from
tuplesort_begin_heap to a new function; but that's fine, because it
was already duplicated in ExecInitMergeAppend too.

I reverted the addition of tuplesort_get_sortkeys, which is not needed now.
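In other words, the group-boundary check keeps its own SortSupport array and the presorted columns are never handed to tuplesort. A sketch of the shape (details illustrative, not the exact patch; skipCols/skipKeys are the new fields mentioned above):

static bool
cmpSortSkipCols(SortState *node, TupleTableSlot *a, TupleTableSlot *b)
{
    int         i;

    for (i = 0; i < node->skipCols; i++)
    {
        AttrNumber  attno = node->skipKeys[i].ssup_attno;
        Datum       datumA,
                    datumB;
        bool        isnullA,
                    isnullB;

        datumA = slot_getattr(a, attno, &isnullA);
        datumB = slot_getattr(b, attno, &isnullB);

        if (ApplySortComparator(datumA, isnullA, datumB, isnullB,
                                &node->skipKeys[i]) != 0)
            return false;       /* a new sort group starts here */
    }
    return true;
}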

Now the timings are:
Unpatched partial sort: 13478.158 ms
Full sort: 6802.063 ms
Patched partial sort: 6618.962 ms

Regards,
Marti

Attachment

Re: PoC: Partial sort

From
Andreas Karlsson
Date:
On 01/18/2014 08:13 PM, Jeremy Harris wrote:
> On 31/12/13 01:41, Andreas Karlsson wrote:
>> On 12/29/2013 08:24 AM, David Rowley wrote:
>>> If it was possible to devise some way to reuse any
>>> previous tuplesortstate perhaps just inventing a reset method which
>>> clears out tuples, then we could see performance exceed the standard
>>> seqscan -> sort. The code the way it is seems to lookup the sort
>>> functions from the syscache for each group then allocate some sort
>>> space, so quite a bit of time is also spent in palloc0() and pfree()
>>>
>>> If it was not possible to do this then maybe adding a cost to the number
>>> of sort groups would be better so that the optimization is skipped if
>>> there are too many sort groups.
>>
>> It should be possible. I have hacked a quick proof of concept for
>> reusing the tuplesort state. Can you try it and see if the performance
>> regression is fixed by this?
>>
>> One thing which have to be fixed with my patch is that we probably want
>> to close the tuplesort once we have returned the last tuple from
>> ExecSort().
>>
>> I have attached my patch and the incremental patch on Alexander's patch.
>
> How does this work in combination with randomAccess ?

As far as I can tell randomAccess was broken by the partial sort patch 
even before my change since it would not iterate over multiple 
tuplesorts anyway.

Alexander: Is this true or am I missing something?

-- 
Andreas Karlsson



Re: PoC: Partial sort

From
Alexander Korotkov
Date:
On Sun, Jan 19, 2014 at 5:57 AM, Andreas Karlsson <andreas@proxel.se> wrote:
On 01/18/2014 08:13 PM, Jeremy Harris wrote:
On 31/12/13 01:41, Andreas Karlsson wrote:
On 12/29/2013 08:24 AM, David Rowley wrote:
If it was possible to devise some way to reuse any
previous tuplesortstate perhaps just inventing a reset method which
clears out tuples, then we could see performance exceed the standard
seqscan -> sort. The code the way it is seems to lookup the sort
functions from the syscache for each group then allocate some sort
space, so quite a bit of time is also spent in palloc0() and pfree()

If it was not possible to do this then maybe adding a cost to the number
of sort groups would be better so that the optimization is skipped if
there are too many sort groups.

It should be possible. I have hacked a quick proof of concept for
reusing the tuplesort state. Can you try it and see if the performance
regression is fixed by this?

One thing which have to be fixed with my patch is that we probably want
to close the tuplesort once we have returned the last tuple from
ExecSort().

I have attached my patch and the incremental patch on Alexander's patch.

How does this work in combination with randomAccess ?

As far as I can tell randomAccess was broken by the partial sort patch even before my change since it would not iterate over multiple tuplesorts anyway.

Alexander: Is this true or am I missing something?

Yes, I decided that the Sort node shouldn't provide randomAccess in the case of skipCols != 0. See the assert at the beginning of ExecInitSort. I decided that it would be better to add an explicit Materialize node rather than store extra tuples in the tuplesortstate each time.
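Roughly, the check looks like this (the exact form in the patch may differ); the point is that a Sort with presorted columns must not be asked for mark/restore, rewind or backward scan:

Assert(node->skipCols == 0 ||
       (eflags & (EXEC_FLAG_REWIND |
                  EXEC_FLAG_BACKWARD |
                  EXEC_FLAG_MARK)) == 0);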
I also adjusted ExecSupportsMarkRestore and ExecMaterializesOutput to make the planner believe so. I found that path->pathtype is never T_Sort. Correct me if I'm wrong.

Other changes in this version of the patch:
1) Applied Marti Raudsepp's patch to avoid comparing skipCols in tuplesort.
2) Adjusting the sort bound after processing buckets.

------
With best regards,
Alexander Korotkov.
Attachment

Re: PoC: Partial sort

From
Marti Raudsepp
Date:
Hi,

On Tue, Jan 14, 2014 at 5:49 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
>On Tue, Jan 14, 2014 at 12:54 AM, Marti Raudsepp <marti@juffo.org> wrote:
>> I've been trying it out in a few situations. I implemented a new
>> enable_partialsort GUC to make it easier to turn on/off, this way it's a lot
>> easier to test. The attached patch applies on top of partial-sort-5.patch
>
> I though about such option. Generally not because of testing convenience,
> but because of overhead of planning. This way you implement it is quite
> naive :)

I don't understand. I had another look at this and cost_sort still
seems like the best place to implement this, since that's where the
patch decides how many pre-sorted columns to use. Both mergejoin and
simple order-by plans call into it. If enable_partialsort=false then I
skip all pre-sorted options except full sort, making cost_sort behave
pretty much like it did before the patch.
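In other words, the change boils down to something like this near the top
of cost_sort() (a sketch, assuming the patch's signature where the number
of presorted pathkeys is passed in):

/* With the GUC off, cost everything as a full sort, as before the patch. */
if (!enable_partialsort)
    presorted_keys = 0;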

I could change pathkeys_common to return 0, but that seems like a
generic function that shouldn't be tied to partialsort. The old code
paths called pathkeys_contained_in anyway, which has similar
complexity. (Apart from initial_cost_mergejoin, but that doesn't seem
special enough to make an exception for).

Or should I use?: enable_partialsort ? pathkeys_common(...) : 0

> For instance, merge join rely on partial sort which will be
> replaced with simple sort.

Are you saying that enable_partialsort=off should keep
partialsort-based mergejoins enabled?

Or are you saying that merge joins shouldn't use "simple sort" at all?
But merge join was already able to use full Sort nodes before your
patch.

What am I missing?

Regards,
Marti



Re: PoC: Partial sort

From
Marti Raudsepp
Date:
On Mon, Jan 20, 2014 at 2:43 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> Another changes in this version of patch:
> 1) Applied patch to don't compare skipCols in tuplesort by Marti Raudsepp
> 2) Adjusting sort bound after processing buckets.

Hi,

Here's a patch with some whitespace and coding style fixes for
partial-sort-6.patch

I tried to understand the mergejoin regression, but this code still
looks like Chinese to me. Can anyone else have a look at it?

Test case: http://www.postgresql.org/message-id/CABRT9RDd-P2RLRdHsMq8rCOB46k4a5O+bGz_up2bRGeeH4R6oQ@mail.gmail.com
Original report:
http://www.postgresql.org/message-id/CABRT9RCLLUyJ=bkeB132aVA_mVNx5==LvVvQMvUqDguFZtW+cg@mail.gmail.com

Regards,
Marti

Attachment

Re: PoC: Partial sort

From
Alexander Korotkov
Date:
Hi!

On Tue, Jan 21, 2014 at 3:24 AM, Marti Raudsepp <marti@juffo.org> wrote:
On Tue, Jan 14, 2014 at 5:49 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
>On Tue, Jan 14, 2014 at 12:54 AM, Marti Raudsepp <marti@juffo.org> wrote:
>> I've been trying it out in a few situations. I implemented a new
>> enable_partialsort GUC to make it easier to turn on/off, this way it's a lot
>> easier to test. The attached patch applies on top of partial-sort-5.patch
>
> I though about such option. Generally not because of testing convenience,
> but because of overhead of planning. This way you implement it is quite
> naive :)

I don't understand. I had another look at this and cost_sort still
seems like the best place to implement this, since that's where the
patch decides how many pre-sorted columns to use. Both mergejoin and
simple order-by plans call into it. If enable_partialsort=false then I
skip all pre-sorted options except full sort, making cost_sort behave
pretty much like it did before the patch.

I could change pathkeys_common to return 0, but that seems like a
generic function that shouldn't be tied to partialsort. The old code
paths called pathkeys_contained_in anyway, which has similar
complexity. (Apart for initial_cost_mergejoin, but that doesn't seem
special enough to make an exception for).

Or should I use?:
  enable_partialsort ? pathkeys_common(...) : 0

> For instance, merge join rely on partial sort which will be
> replaced with simple sort.

Are you saying that enable_partialsort=off should keep
partialsort-based mergejoins enabled?

Or are you saying that merge joins shouldn't use "simple sort" at all?
But merge join was already able to use full Sort nodes before your
patch.

Sorry that I didn't explain it. In particular I mean the following:
1) With enable_partialsort = off, all mergejoin logic should behave as it did without the partial sort patch.
2) With the partial sort patch, the get_cheapest_fractional_path_for_pathkeys function is much more expensive to execute. With enable_partialsort = off it should be as cheap as without the partial sort patch.
I'll try to implement this option this week.
For now, I have attempted to fix the extra-columns-in-mergejoin problem. It would be nice if you could test it.

------
With best regards,
Alexander Korotkov. 
Attachment

Re: PoC: Partial sort

From
Marti Raudsepp
Date:
On Mon, Jan 27, 2014 at 9:26 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> For now, I have attempt to fix extra columns in mergejoin problem. It would
> be nice if you test it.

Yes, it solves the test cases I was trying with, thanks.

> 1) With enable_partialsort = off all mergejoin logic should behave as
> without partial sort patch.
> 2) With partial sort patch get_cheapest_fractional_path_for_pathkeys
> function is much more expensive to execute. With enable_partialsort = off it
> should be as cheap as without partial sort patch.

When it comes to planning time, I really don't think you should
bother. The planner enable_* settings are meant for troubleshooting,
debugging and learning about the planner. You should not expect people
to disable them in a production setting. It's not worth complicating
the code for that rare case.

This is stated in the documentation
(http://www.postgresql.org/docs/current/static/runtime-config-query.html)
and repeatedly on the mailing lists.

But some benchmarks of planning performance are certainly warranted.

Regards,
Marti



Re: PoC: Partial sort

From
Alexander Korotkov
Date:
On Tue, Jan 28, 2014 at 7:41 AM, Marti Raudsepp <marti@juffo.org> wrote:
On Mon, Jan 27, 2014 at 9:26 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> For now, I have attempt to fix extra columns in mergejoin problem. It would
> be nice if you test it.

Yes, it solves the test cases I was trying with, thanks.

> 1) With enable_partialsort = off all mergejoin logic should behave as
> without partial sort patch.
> 2) With partial sort patch get_cheapest_fractional_path_for_pathkeys
> function is much more expensive to execute. With enable_partialsort = off it
> should be as cheap as without partial sort patch.

When it comes to planning time, I really don't think you should
bother. The planner enable_* settings are meant for troubleshooting,
debugging and learning about the planner. You should not expect people
to disable them in a production setting. It's not worth complicating
the code for that rare case.

This is stated in the documentation
(http://www.postgresql.org/docs/current/static/runtime-config-query.html)
and repeatedly on the mailing lists.

But some benchmarks of planning performance are certainly warranted.

I didn't test it, but I worry that overhead might be high.
If it's true then it could be like the constraint_exclusion option, which is off by default because of planning overhead.

------
With best regards,
Alexander Korotkov.

Re: PoC: Partial sort

From
Marti Raudsepp
Date:
On Tue, Jan 28, 2014 at 7:51 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> I didn't test it, but I worry that overhead might be high.
> If it's true then it could be like constraint_exclusion option which id off
> by default because of planning overhead.

I see, that makes sense.

I will try to find the time to run some benchmarks in the coming few days.

Regards,
Marti



Re: PoC: Partial sort

From
Marti Raudsepp
Date:
On Tue, Jan 28, 2014 at 7:51 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Tue, Jan 28, 2014 at 7:41 AM, Marti Raudsepp <marti@juffo.org> wrote:
But some benchmarks of planning performance are certainly warranted.

I didn't test it, but I worry that overhead might be high.
If it's true then it could be like constraint_exclusion option which id off by default because of planning overhead.

Sorry I didn't get around to this before.

I ran some synthetic benchmarks with single-column inner joins between 5 tables, with indexes on both joined columns, using only EXPLAIN (so measuring planning time, not execution) in 9 scenarios to exercise different code paths. According to these measurements, the overhead ranges between 1.0 and 4.5% depending on the scenario.

----
Merge join with partial sort children seems like a fairly obscure use case (though I'm sure it can help a lot in those cases). The default should definitely allow partial sort in normal ORDER BY queries. What's under question here is whether to enable partial sort for mergejoin.

So I see 3 possible resolutions:
1. The overhead is deemed acceptable to enable by default, in which case we're done here.
2. Add a three-value runtime setting like: enable_partialsort = [ off | no_mergejoin | on ], defaulting to no_mergejoin (just to get the point across, clearly we need better naming). This is how constraint_exclusion works.
3. Remove the partialsort mergejoin code entirely, keeping the rest of the cases.

What do you think?

----
All the tests are available here: https://github.com/intgr/benchjunk/tree/master/partial_sort (using script run2.sh)

Overhead by test (partial-sort-7.patch.gz):
join5.sql 2.9% (all joins on the same column)
star5.sql 1.7% ("star schema" kind of join)
line5.sql 1.9% (joins chained to each other)
lim_join5.sql 4.5% (same as above, with LIMIT 1)
lim_star5.sql 2.8%
lim_line5.sql 1.8%
limord_join5.sql 4.3% (same as above, with ORDER BY & LIMIT 1)
limord_star5.sql 3.9%
limord_line5.sql 1.0%

Full data:
PostgreSQL @ git ac8bc3b
join5.sql tps = 499.490173 (excluding connections establishing)
join5.sql tps = 503.756335 (excluding connections establishing)
join5.sql tps = 504.814072 (excluding connections establishing)
star5.sql tps = 492.799230 (excluding connections establishing)
star5.sql tps = 492.570615 (excluding connections establishing)
star5.sql tps = 491.949985 (excluding connections establishing)
line5.sql tps = 773.945050 (excluding connections establishing)
line5.sql tps = 773.858068 (excluding connections establishing)
line5.sql tps = 774.551240 (excluding connections establishing)
lim_join5.sql tps = 392.539745 (excluding connections establishing)
lim_join5.sql tps = 391.867549 (excluding connections establishing)
lim_join5.sql tps = 393.361655 (excluding connections establishing)
lim_star5.sql tps = 418.431804 (excluding connections establishing)
lim_star5.sql tps = 419.258985 (excluding connections establishing)
lim_star5.sql tps = 419.434697 (excluding connections establishing)
lim_line5.sql tps = 713.852506 (excluding connections establishing)
lim_line5.sql tps = 713.636694 (excluding connections establishing)
lim_line5.sql tps = 712.971719 (excluding connections establishing)
limord_join5.sql tps = 381.068465 (excluding connections establishing)
limord_join5.sql tps = 380.379359 (excluding connections establishing)
limord_join5.sql tps = 381.182385 (excluding connections establishing)
limord_star5.sql tps = 412.997935 (excluding connections establishing)
limord_star5.sql tps = 411.401352 (excluding connections establishing)
limord_star5.sql tps = 413.209784 (excluding connections establishing)
limord_line5.sql tps = 688.906406 (excluding connections establishing)
limord_line5.sql tps = 689.445483 (excluding connections establishing)
limord_line5.sql tps = 688.758042 (excluding connections establishing)

partial-sort-7.patch.gz
join5.sql tps = 479.508034 (excluding connections establishing)
join5.sql tps = 488.263674 (excluding connections establishing)
join5.sql tps = 490.127433 (excluding connections establishing)
star5.sql tps = 482.106063 (excluding connections establishing)
star5.sql tps = 484.179687 (excluding connections establishing)
star5.sql tps = 483.027372 (excluding connections establishing)
line5.sql tps = 758.092993 (excluding connections establishing)
line5.sql tps = 759.697814 (excluding connections establishing)
line5.sql tps = 759.792792 (excluding connections establishing)
lim_join5.sql tps = 375.517211 (excluding connections establishing)
lim_join5.sql tps = 375.539109 (excluding connections establishing)
lim_join5.sql tps = 375.841645 (excluding connections establishing)
lim_star5.sql tps = 407.683110 (excluding connections establishing)
lim_star5.sql tps = 407.414409 (excluding connections establishing)
lim_star5.sql tps = 407.526613 (excluding connections establishing)
lim_line5.sql tps = 699.905101 (excluding connections establishing)
lim_line5.sql tps = 700.349675 (excluding connections establishing)
lim_line5.sql tps = 700.661762 (excluding connections establishing)
limord_join5.sql tps = 364.607236 (excluding connections establishing)
limord_join5.sql tps = 364.367705 (excluding connections establishing)
limord_join5.sql tps = 363.694065 (excluding connections establishing)
limord_star5.sql tps = 397.036792 (excluding connections establishing)
limord_star5.sql tps = 397.197359 (excluding connections establishing)
limord_star5.sql tps = 395.797940 (excluding connections establishing)
limord_line5.sql tps = 680.907397 (excluding connections establishing)
limord_line5.sql tps = 682.206481 (excluding connections establishing)
limord_line5.sql tps = 681.210267 (excluding connections establishing)

Regards,
Marti

Re: PoC: Partial sort

From
Robert Haas
Date:
On Wed, Feb 5, 2014 at 6:58 PM, Marti Raudsepp <marti@juffo.org> wrote:
> I ran some synthetic benchmarks with single-column inner joins between 5
> tables, with indexes on both joined columns, using only EXPLAIN (so
> measuring planning time, not execution) in 9 scenarios to excercise
> different code paths. According to these measurements, the overhead ranges
> between 1.0 and 4.5% depending on the scenario.

Hmm, sounds a little steep.  Why is it so expensive?  I'm probably
missing something here, because I would have thought that planner
support for partial sorts would consist mostly of considering the same
sorts we consider today, but with the costs reduced by the batching.
Changing the cost estimation that way can't be that much more
expensive than what we're already doing, so the overhead should be
minimal.  What the patch is actually doing seems to be something quite
a bit more invasive than that, but I'm not sure what it is exactly, or
why.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC: Partial sort

From
Marti Raudsepp
Date:
On Thu, Feb 6, 2014 at 5:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Hmm, sounds a little steep.  Why is it so expensive?  I'm probably
> missing something here, because I would have thought that planner
> support for partial sorts would consist mostly of considering the same
> sorts we consider today, but with the costs reduced by the batching.

I guess it's because the patch undoes some optimizations in the
mergejoin planner wrt caching merge clauses and adds a whole lot of
code to find_mergeclauses_for_pathkeys. In other code paths the
overhead does seem to be negligible.

Notice the removal of:
/* Select the right mergeclauses, if we didn't already */
/*
 * Avoid rebuilding clause list if we already made one;
 * saves memory in big join trees...
 */

Regards,
Marti



Re: PoC: Partial sort

From
Robert Haas
Date:
On Thu, Feb 6, 2014 at 3:39 AM, Marti Raudsepp <marti@juffo.org> wrote:
> On Thu, Feb 6, 2014 at 5:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Hmm, sounds a little steep.  Why is it so expensive?  I'm probably
>> missing something here, because I would have thought that planner
>> support for partial sorts would consist mostly of considering the same
>> sorts we consider today, but with the costs reduced by the batching.
>
> I guess it's because the patch undoes some optimizations in the
> mergejoin planner wrt caching merge clauses and adds a whole lot of
> code to find_mergeclauses_for_pathkeys. In other code paths the
> overhead does seem to be negligible.
>
> Notice the removal of:
> /* Select the right mergeclauses, if we didn't already */
> /*
>  * Avoid rebuilding clause list if we already made one;
>  * saves memory in big join trees...
>  */

Yeah, I noticed that.  My feeling is that those optimizations got put
in there because someone found them to be important, so I'm skeptical
about removing them.  It may be that having the capability to do a
partial sort makes it seem worth spending more CPU looking for merge
joins, but I'd vote for making any such change a separate patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC: Partial sort

From
Marti Raudsepp
Date:
On Thu, Feb 6, 2014 at 9:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> It may be that having the capability to do a
> partial sort makes it seem worth spending more CPU looking for merge
> joins, but I'd vote for making any such change a separate patch.

Agreed.

Alexander, should I work on splitting up the patch in two, or do you
want to do it yourself?

Should I merge my coding style and enable_partialsort patches while at
it, or do you still have reservations about those?

Regards,
Marti



Re: PoC: Partial sort

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Feb 6, 2014 at 3:39 AM, Marti Raudsepp <marti@juffo.org> wrote:
>> I guess it's because the patch undoes some optimizations in the
>> mergejoin planner wrt caching merge clauses and adds a whole lot of
>> code to find_mergeclauses_for_pathkeys. In other code paths the
>> overhead does seem to be negligible.

> Yeah, I noticed that.  My feeling is that those optimizations got put
> in there because someone found them to be important, so I'm skeptical
> about removing them.

I put them in, and yeah they are important.  Even with those, and even
with the rather arbitrary heuristic restrictions that joinpath.c puts on
what mergeclause lists to consider, the existing planner spends a whole
lot of effort on mergejoins --- possibly disproportionate to their actual
value.  I think that any patch that removes those optimizations is not
going to fly.  If anything, it'd be better to reduce the number of
mergejoins considered even further, because a lot of the possible plans
are not usefully different.

It's already the case that we expect indxpath.c to predict the useful
orderings (by reference to query_pathkeys and available mergejoin clauses)
and generate suitable paths, rather than trying to identify the orderings
at join time.  Can't that approach be extended to cover this technique?

In any case, the bottom line is that we don't want this patch to cause
the planner to consider large numbers of new but useless sort orderings.
        regards, tom lane



Re: PoC: Partial sort

From
Alexander Korotkov
Date:
On Thu, Feb 6, 2014 at 12:39 PM, Marti Raudsepp <marti@juffo.org> wrote:
On Thu, Feb 6, 2014 at 5:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Hmm, sounds a little steep.  Why is it so expensive?  I'm probably
> missing something here, because I would have thought that planner
> support for partial sorts would consist mostly of considering the same
> sorts we consider today, but with the costs reduced by the batching.

I guess it's because the patch undoes some optimizations in the
mergejoin planner wrt caching merge clauses and adds a whole lot of
code to find_mergeclauses_for_pathkeys. In other code paths the
overhead does seem to be negligible.

Notice the removal of:
/* Select the right mergeclauses, if we didn't already */
/*
 * Avoid rebuilding clause list if we already made one;
 * saves memory in big join trees...
 */

This is not the only place that worries me about planning overhead. See get_cheapest_fractional_path_for_pathkeys. I had to estimate the number of groups for each sorting column in order to get the right fractional path. For a partial sort path, the cost of the first batch should be included in the initial cost.
If we don't do so, the optimizer can pick strange plans based on the assumption that it needs only a few rows from the inner node. See an example.

create table test1 as (
select id, 
(random()*100)::int as v1, 
(random()*10000)::int as v2
from generate_series(1,1000000) id);

create table test2 as (
select id, 
(random()*100)::int as v1, 
(random()*10000)::int as v2
from generate_series(1,1000000) id);

create index test1_v1_idx on test1 (v1);

Plan without fraction estimation in get_cheapest_fractional_path_for_pathkeys:

postgres=# explain select * from test1 t1 join test2 t2 on t1.v1 = t2.v1 order by t1.v1, t1.id limit 10;
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Limit  (cost=198956893.20..198956913.33 rows=10 width=24)
   ->  Partial sort  (cost=198956893.20..19909637942.82 rows=9791031169 width=24)
         Sort Key: t1.v1, t1.id
         Presorted Key: t1.v1
         ->  Nested Loop  (cost=0.42..19883065506.84 rows=9791031169 width=24)
               Join Filter: (t1.v1 = t2.v1)
               ->  Index Scan using test1_v1_idx on test1 t1  (cost=0.42..47600.84 rows=1000000 width=12)
               ->  Materialize  (cost=0.00..25289.00 rows=1000000 width=12)
                     ->  Seq Scan on test2 t2  (cost=0.00..15406.00 rows=1000000 width=12)
(9 rows)

Current version of patch:

postgres=# explain select * from test1 t1 join test2 t2 on t1.v1 = t2.v1 order by t1.v1, t1.id limit 10;
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Limit  (cost=3699913.43..3699913.60 rows=10 width=24)
   ->  Partial sort  (cost=3699913.43..173638549.67 rows=9791031169 width=24)
         Sort Key: t1.v1, t1.id
         Presorted Key: t1.v1
         ->  Merge Join  (cost=150444.79..147066113.70 rows=9791031169 width=24)
               Merge Cond: (t1.v1 = t2.v1)
               ->  Index Scan using test1_v1_idx on test1 t1  (cost=0.42..47600.84 rows=1000000 width=12)
               ->  Materialize  (cost=149244.84..154244.84 rows=1000000 width=12)
                     ->  Sort  (cost=149244.84..151744.84 rows=1000000 width=12)
                           Sort Key: t2.v1
                           ->  Seq Scan on test2 t2  (cost=0.00..15406.00 rows=1000000 width=12)
(11 rows)

I don't compare actual execution times because I didn't wait for the first plan's execution to finish :-)
But anyway, the costs are extraordinary and the inner sequential scan of 1000000 rows is odd.

------
With best regards,
Alexander Korotkov. 

Re: PoC: Partial sort

From
Marti Raudsepp
Date:
On Sun, Feb 9, 2014 at 7:37 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> This is not only place that worry me about planning overhead. See
> get_cheapest_fractional_path_for_pathkeys. I had to estimate number of
> groups for each sorting column in order to get right fractional path.

AFAICT this only happens once per plan and the overhead is O(n) to the
number of pathkeys? I can't get worried about that, but I guess it's
better to test anyway.

PS: You didn't answer my questions about splitting the patch. I guess
I'll have to do that anyway to run the tests.

Regards,
Marti



Re: PoC: Partial sort

From
Alexander Korotkov
Date:
On Mon, Feb 10, 2014 at 2:33 PM, Marti Raudsepp <marti@juffo.org> wrote:
On Sun, Feb 9, 2014 at 7:37 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> This is not only place that worry me about planning overhead. See
> get_cheapest_fractional_path_for_pathkeys. I had to estimate number of
> groups for each sorting column in order to get right fractional path.

AFAICT this only happens once per plan and the overhead is O(n) to the
number of pathkeys? I can't get worried about that, but I guess it's
better to test anyway.

PS: You didn't answer my questions about splitting the patch. I guess
I'll have to do that anyway to run the tests.

Done. The patch is split.

------
With best regards,
Alexander Korotkov.  
Attachment

Re: PoC: Partial sort

From
Marti Raudsepp
Date:
On Mon, Feb 10, 2014 at 8:59 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> Done. Patch is splitted.

Thanks!

I think the 1st patch now has a bug in initial_cost_mergejoin; you
still pass the "presorted_keys" argument to cost_sort, making it
calculate a partial sort cost, but generated plans never use partial
sort. I think 0 should be passed instead. Patch attached, needs to be
applied on top of partial-sort-basic-1 and then reverse-applied on
partial-sort-merge-1.

With partial-sort-basic-1 and this fix on the same test suite, the
planner overhead is now a more manageable 0.5% to 1.3%; one test is
faster by 0.5%. Built with asserts disabled, ran on Intel i5-3570K. In
an effort to reduce variance, I locked the server and pgbench to a
single CPU core (taskset -c 3), but there are still noticeable
run-to-run differences, so these numbers are a bit fuzzy. The faster
result is definitely not a fluke, however; it happens every time.

> On Mon, Feb 10, 2014 at 2:33 PM, Marti Raudsepp <marti@juffo.org> wrote:
>> AFAICT this only happens once per plan and the overhead is O(n) to the
>> number of pathkeys?

I was of course wrong about that, it also adds extra overhead when
iterating over the paths list.

----
Test "taskset -c 3 run2.sh" from
https://github.com/intgr/benchjunk/tree/master/partial_sort

Overhead percentages (between best of each 3 runs):
join5.sql 0.7
star5.sql 0.8
line5.sql 0.5
lim_join5.sql -0.5
lim_star5.sql 1.3
lim_line5.sql 0.5
limord_join5.sql 0.6
limord_star5.sql 0.5
limord_line5.sql 0.7

Raw results:
git 48870dd
join5.sql tps = 509.328070 (excluding connections establishing)
join5.sql tps = 509.772190 (excluding connections establishing)
join5.sql tps = 510.651517 (excluding connections establishing)
star5.sql tps = 499.208698 (excluding connections establishing)
star5.sql tps = 498.200314 (excluding connections establishing)
star5.sql tps = 496.269315 (excluding connections establishing)
line5.sql tps = 797.968831 (excluding connections establishing)
line5.sql tps = 797.011690 (excluding connections establishing)
line5.sql tps = 796.379258 (excluding connections establishing)
lim_join5.sql tps = 394.946024 (excluding connections establishing)
lim_join5.sql tps = 395.417689 (excluding connections establishing)
lim_join5.sql tps = 395.482958 (excluding connections establishing)
lim_star5.sql tps = 423.434393 (excluding connections establishing)
lim_star5.sql tps = 423.774305 (excluding connections establishing)
lim_star5.sql tps = 424.386099 (excluding connections establishing)
lim_line5.sql tps = 733.007330 (excluding connections establishing)
lim_line5.sql tps = 731.794731 (excluding connections establishing)
lim_line5.sql tps = 732.356280 (excluding connections establishing)
limord_join5.sql tps = 385.317921 (excluding connections establishing)
limord_join5.sql tps = 385.915870 (excluding connections establishing)
limord_join5.sql tps = 384.747848 (excluding connections establishing)
limord_star5.sql tps = 417.992615 (excluding connections establishing)
limord_star5.sql tps = 416.944685 (excluding connections establishing)
limord_star5.sql tps = 418.262647 (excluding connections establishing)
limord_line5.sql tps = 708.979203 (excluding connections establishing)
limord_line5.sql tps = 710.926866 (excluding connections establishing)
limord_line5.sql tps = 710.928907 (excluding connections establishing)

48870dd + partial-sort-basic-1.patch.gz + fix-cost_sort.patch
join5.sql tps = 505.488181 (excluding connections establishing)
join5.sql tps = 507.222759 (excluding connections establishing)
join5.sql tps = 506.549654 (excluding connections establishing)
star5.sql tps = 495.432915 (excluding connections establishing)
star5.sql tps = 494.906793 (excluding connections establishing)
star5.sql tps = 492.623808 (excluding connections establishing)
line5.sql tps = 789.315968 (excluding connections establishing)
line5.sql tps = 793.875456 (excluding connections establishing)
line5.sql tps = 790.545990 (excluding connections establishing)
lim_join5.sql tps = 396.956732 (excluding connections establishing)
lim_join5.sql tps = 397.515213 (excluding connections establishing)
lim_join5.sql tps = 397.578669 (excluding connections establishing)
lim_star5.sql tps = 417.459963 (excluding connections establishing)
lim_star5.sql tps = 418.024803 (excluding connections establishing)
lim_star5.sql tps = 418.830234 (excluding connections establishing)
lim_line5.sql tps = 729.186915 (excluding connections establishing)
lim_line5.sql tps = 726.288788 (excluding connections establishing)
lim_line5.sql tps = 728.123296 (excluding connections establishing)
limord_join5.sql tps = 383.484767 (excluding connections establishing)
limord_join5.sql tps = 383.021960 (excluding connections establishing)
limord_join5.sql tps = 383.722051 (excluding connections establishing)
limord_star5.sql tps = 414.138460 (excluding connections establishing)
limord_star5.sql tps = 414.063766 (excluding connections establishing)
limord_star5.sql tps = 416.130110 (excluding connections establishing)
limord_line5.sql tps = 706.002589 (excluding connections establishing)
limord_line5.sql tps = 705.632796 (excluding connections establishing)
limord_line5.sql tps = 704.991305 (excluding connections establishing)

Regards,
Marti

Attachment

Re: PoC: Partial sort

From
Marti Raudsepp
Date:
On Wed, Feb 12, 2014 at 11:54 PM, Marti Raudsepp <marti@juffo.org> wrote:
> With partial-sort-basic-1 and this fix on the same test suite, the
> planner overhead is now a more manageable 0.5% to 1.3%; one test is
> faster by 0.5%.

Ping, Robert or anyone, does this overhead seem bearable or is that
still too much?

Do these numbers look conclusive enough or should I run more tests?

> I think the 1st patch now has a bug in initial_cost_mergejoin; you
> still pass the "presorted_keys" argument to cost_sort, making it
> calculate a partial sort cost

Ping, Alexander?

Regards,
Marti



Re: PoC: Partial sort

From
Alexander Korotkov
Date:
On Thu, Feb 13, 2014 at 1:54 AM, Marti Raudsepp <marti@juffo.org> wrote:
I think the 1st patch now has a bug in initial_cost_mergejoin; you
still pass the "presorted_keys" argument to cost_sort, making it
calculate a partial sort cost, but generated plans never use partial
sort. I think 0 should be passed instead. Patch attached, needs to be
applied on top of partial-sort-basic-1 and then reverse-applied on
partial-sort-merge-1.

It doesn't look that way to me. Merge join doesn't look for partial sort specifically, but if a path with some presorted pathkeys happens to be selected, then partial sort will be used; see the create_mergejoin_plan function. So, I think this cost_sort call is consistent with create_mergejoin_plan. If we don't want partial sort to be used in such rare cases, then we should remove it from both places. However, I doubt that it adds any overhead, so we can leave it as is.

------
With best regards,
Alexander Korotkov.   

Re: PoC: Partial sort

From
Robert Haas
Date:
On Wed, Feb 19, 2014 at 1:39 PM, Marti Raudsepp <marti@juffo.org> wrote:
> On Wed, Feb 12, 2014 at 11:54 PM, Marti Raudsepp <marti@juffo.org> wrote:
>> With partial-sort-basic-1 and this fix on the same test suite, the
>> planner overhead is now a more manageable 0.5% to 1.3%; one test is
>> faster by 0.5%.
>
> Ping, Robert or anyone, does this overhead seem bearable or is that
> still too much?
>
> Do these numbers look conclusive enough or should I run more tests?

Tom should really be the one to comment on this, I think.  I read
through the patch quickly and it looks much less scary than the early
versions, but it's not obvious to me whether the remaining overhead is
enough to worry about.  I'd need to spend more time studying it to
form a really sound opinion on that topic, and unfortunately I don't
have that time right now.

I think it'd be interesting to try to determine specifically where
that overhead is coming from.  Pick the test case where it's the worst
(1.3%) and do a "perf" with and without the patch and look at the
difference in the call graph.  It's possible we could have changes on
that order of magnitude just from more or less fortuitous code layout
decisions as code shifts around, but it's also possible that there's a
real effect there we should think harder about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC: Partial sort

From
Peter Geoghegan
Date:
On Mon, Feb 10, 2014 at 10:59 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> Done. Patch is splitted.

I took a quick look at this.

Have you thought about making your new cmpSortSkipCols() function not
use real comparisons? Since in the circumstances in which this
optimization is expected to be effective (e.g. your original example)
we can also expect a relatively low cardinality for the first n
indexed attributes (again, as in your original example), in general
when cmpSortSkipCols() is called there is a high chance that it will
return true. If any pair of tuples (logically adjacent tuples fed in
to cmpSortSkipCols() by an index scan in logical order) are not fully
equal (i.e. their leading, indexed attributes are not equal) then we
don't care about the details -- we just know that a new sort grouping
is required.

The idea here is that you can get away with simple binary equality
comparisons, as we do when considering HOT-safety. Of course, you
might find that two bitwise unequal values are equal according to
their ordinary B-Tree support function 1 comparator (e.g. two numerics
that differ only in their display scale). AFAICT this should be okay,
since that just means that you have smaller sort groupings than
strictly necessary. I'm not sure if that's worth it to more or less
duplicate heap_tuple_attr_equals() to save a "mere" n expensive
comparisons, but it's something to think about (actually, there are
probably less than even n comparisons in practice because there'll be
a limit).

A similar idea appears in my SortSupport for text ("Poor man's
normalized key"/strxfrm()) patch. A poor man's key comparison didn't
work out, and there may be further differences that aren't captured in
the special simple key representation, so we need to do a "proper
comparison" to figure it out for sure. However, within the sortsupport
routine comparator, we know that we're being called in this context,
as a tie-breaker for a poor man's normalized key comparison that
returned 0, and so are optimistic about the two datums being fully
equal. An optimistic memcmp() is attempted before a strcoll() here if
the lengths also match.

I have not actually added special hints so that we're optimistic about
keys being equal in other places (places that have nothing to do with
the general idea of poor man's normalized keys), but that might not be
a bad idea. Actually, it might not be a bad idea to just always have
varstr_cmp() attempt a memcmp() first when two texts have equal
length, no matter how it's called.
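
To illustrate the general shape of that idea, here is a hedged sketch in plain C (not PostgreSQL's actual varstr_cmp(); the function name and signature are made up, and it assumes plain NUL-terminated strings): try the cheap bytewise check first, and only fall back to the expensive locale-aware comparison when the cheap check cannot prove equality.

#include <string.h>

/*
 * Illustrative only: when two strings have the same length, attempt a
 * cheap bytewise comparison first; bitwise equality proves logical
 * equality, so strcoll() is only needed when the cheap check fails.
 */
static int
optimistic_text_cmp(const char *a, const char *b)
{
    size_t alen = strlen(a);
    size_t blen = strlen(b);

    /* memcmp() is nearly free compared to a locale-aware comparison */
    if (alen == blen && memcmp(a, b, alen) == 0)
        return 0;

    /* lengths or bytes differ: do the collation-aware comparison */
    return strcoll(a, b);
}

The point is simply that the memcmp() costs almost nothing next to strcoll(), so attempting it first is cheap insurance when equal keys are expected to be common.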

-- 
Peter Geoghegan



Re: PoC: Partial sort

From
David Rowley
Date:
On Tue, Feb 11, 2014 at 7:59 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Done. Patch is splitted. 

I've started to look at this, and for now I'm still finding my way around the patch, so I'm not quite there yet with understanding everything. Nevertheless, it seems best to post my comments early, so that the review and the work of getting the patch into shape can proceed in parallel.

I've only been looking at partial-sort-basic-1.patch so far;

The patch no longer applies to master, but this was only due to a tab being replaced by 2 spaces in a pgident run. I've attached an updated patch which currently applies without any issues.

Here's a few notes from reading over the code:

* pathkeys.c

  EquivalenceMember *member = (EquivalenceMember *)
lfirst(list_head(key->pk_eclass->ec_members));

You can use linitial() instead of lfirst(list_head()). The same thing occurs in costsize.c

* pathkeys.c

The following fragment:

    n = pathkeys_common(root->query_pathkeys, pathkeys);

    if (n != 0)
    {
        /* It's useful ... or at least the first N keys are */
        return n;
    }

    return 0;           /* path ordering not useful */
}

Could just read:

/* return the number of path keys in common, or 0 if there are none */
return pathkeys_common(root->query_pathkeys, pathkeys);

* execnodes.h

In struct SortState, some new fields don't have a comment.


I've also thrown a few different workloads at the patch and I'm very impressed with most of the results, especially when LIMIT is used. However, I've found a regression case which I thought I should highlight, though for now I can't quite see what could be done to fix it.

create table a (x int not null, y int not null);
insert into a select x.x,y.y from generate_series(1,1000000) x(x) cross join generate_series(1,10) y(y);

Patched:
explain analyze select x,y from a where x+0=1 order by x,y limit 10;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=92.42..163.21 rows=10 width=8) (actual time=6239.426..6239.429 rows=10 loops=1)
   ->  Partial sort  (cost=92.42..354064.37 rows=50000 width=8) (actual time=6239.406..6239.407 rows=10 loops=1)
         Sort Key: x, y
         Presorted Key: x
         Sort Method: quicksort  Memory: 25kB
         ->  Index Scan using a_x_idx on a  (cost=0.44..353939.13 rows=50000 width=8) (actual time=0.059..6239.319 rows=10 loops=1)
               Filter: ((x + 0) = 1)
               Rows Removed by Filter: 9999990
 Planning time: 0.212 ms
 Execution time: 6239.505 ms
(10 rows)


Time: 6241.220 ms

Unpatched:
explain analyze select x,y from a where x+0=1 order by x,y limit 10;
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Limit  (cost=195328.26..195328.28 rows=10 width=8) (actual time=3077.759..3077.761 rows=10 loops=1)
   ->  Sort  (cost=195328.26..195453.26 rows=50000 width=8) (actual time=3077.757..3077.757 rows=10 loops=1)
         Sort Key: x, y
         Sort Method: quicksort  Memory: 25kB
         ->  Seq Scan on a  (cost=0.00..194247.77 rows=50000 width=8) (actual time=0.018..3077.705 rows=10 loops=1)
               Filter: ((x + 0) = 1)
               Rows Removed by Filter: 9999990
 Planning time: 0.510 ms
 Execution time: 3077.837 ms
(9 rows)


Time: 3080.201 ms

As you can see, the patched version performs an index scan in order to get the partially sorted results, but it does end up quite a bit slower than the seqscan/sort that the unpatched master performs. I'm not quite sure how realistic the x+0 = 1 WHERE clause is, but perhaps the same would happen if something like x+y = 1 was used too. After a bit more analysis on this, I see that if I change the 50k estimate to 10 in the debugger, then num_groups is properly estimated at 1 and it performs the seq scan instead. So it looks like the costings of the patch are not to blame here. (The 50k row estimate comes from rel tuples / DEFAULT_NUM_DISTINCT, i.e. 10,000,000 / 200 = 50,000.)

That's all I have at the moment... More to follow soon.

Regards

David Rowley

Attachment

Re: PoC: Partial sort

From
Alexander Korotkov
Date:
On Sun, Jul 13, 2014 at 6:45 AM, Peter Geoghegan <pg@heroku.com> wrote:
On Mon, Feb 10, 2014 at 10:59 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> Done. Patch is splitted.

I took a quick look at this.

Have you thought about making your new cmpSortSkipCols() function not
use real comparisons? Since in the circumstances in which this
optimization is expected to be effective (e.g. your original example)
we can also expect a relatively low cardinality for the first n
indexed attributes (again, as in your original example), in general
when cmpSortSkipCols() is called there is a high chance that it will
return true. If any pair of tuples (logically adjacent tuples fed in
to cmpSortSkipCols() by an index scan in logical order) are not fully
equal (i.e. their leading, indexed attributes are not equal) then we
don't care about the details -- we just know that a new sort grouping
is required.

Actually, higher cardinality in the skip columns is better. Sorting many small groups is faster than sorting fewer large groups of the same total size. Also, with smaller groups you reach the limit more precisely (on average), i.e. you sort a smaller number of rows in total.
 
The idea here is that you can get away with simple binary equality
comparisons, as we do when considering HOT-safety. Of course, you
might find that two bitwise unequal values are equal according to
their ordinary B-Tree support function 1 comparator (e.g. two numerics
that differ only in their display scale). AFAICT this should be okay,
since that just means that you have smaller sort groupings than
strictly necessary. I'm not sure if that's worth it to more or less
duplicate heap_tuple_attr_equals() to save a "mere" n expensive
comparisons, but it's something to think about (actually, there are
probably less than even n comparisons in practice because there'll be
a limit).

Not correct. Smaller groups are not OK. Imagine that two representations of the same skip column value exist. The index may return them in any order, even alternating between them. In that case the sort on the other columns never takes place, while it should. But some optimizations are still possible:
  1. Use a bitwise comparison first, then recheck with the real comparator (a rough sketch of this follows below). There is no guarantee that this actually gives a speedup, though.
  2. Use an equality check instead of the btree comparison. For the "text" datatype it would be rather faster, because no locale-aware comparison is needed.
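
For what it's worth, here is a rough, self-contained sketch of option 1 (hypothetical names, not the patch's actual cmpSortSkipCols()): a bitwise match is enough to prove that two tuples belong to the same sort group, but a bitwise mismatch proves nothing, so the real comparator still has to decide in that case.

#include <stdbool.h>
#include <string.h>

/* hypothetical stand-in for a B-Tree support function 1 comparator */
typedef int (*real_cmp_fn) (const void *a, const void *b);

static bool
same_sort_group(const void *a, const void *b, size_t len, real_cmp_fn cmp)
{
    /* cheap positive answer: bitwise-equal values are logically equal */
    if (memcmp(a, b, len) == 0)
        return true;

    /*
     * Bitwise-unequal values may still be logically equal (e.g. numerics
     * differing only in display scale), so recheck with the real
     * comparator before declaring a new group.
     */
    return cmp(a, b) == 0;
}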

------
With best regards,
Alexander Korotkov.    

Re: PoC: Partial sort

From
Alexander Korotkov
Date:
On Tue, Aug 19, 2014 at 2:02 PM, David Rowley <dgrowleyml@gmail.com> wrote:
Here's a few notes from reading over the code:

* pathkeys.c

  EquivalenceMember *member = (EquivalenceMember *)
lfirst(list_head(key->pk_eclass->ec_members));

You can use linitial() instead of lfirst(list_head()). The same thing occurs in costsize.c

Fixed.
 
* pathkeys.c

The following fragment:

    n = pathkeys_common(root->query_pathkeys, pathkeys);

    if (n != 0)
    {
        /* It's useful ... or at least the first N keys are */
        return n;
    }

    return 0;           /* path ordering not useful */
}

Could just read:

/* return the number of path keys in common, or 0 if there are none */
return pathkeys_common(root->query_pathkeys, pathkeys);

Fixed.
 
* execnodes.h

In struct SortState, some new fields don't have a comment.

Fixed.
 
I've also thrown a few different workloads at the patch and I'm very impressed with most of the results, especially when LIMIT is used. However, I've found a regression case which I thought I should highlight, though for now I can't quite see what could be done to fix it.

create table a (x int not null, y int not null);
insert into a select x.x,y.y from generate_series(1,1000000) x(x) cross join generate_series(1,10) y(y);

Patched:
explain analyze select x,y from a where x+0=1 order by x,y limit 10;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=92.42..163.21 rows=10 width=8) (actual time=6239.426..6239.429 rows=10 loops=1)
   ->  Partial sort  (cost=92.42..354064.37 rows=50000 width=8) (actual time=6239.406..6239.407 rows=10 loops=1)
         Sort Key: x, y
         Presorted Key: x
         Sort Method: quicksort  Memory: 25kB
         ->  Index Scan using a_x_idx on a  (cost=0.44..353939.13 rows=50000 width=8) (actual time=0.059..6239.319 rows=10 loops=1)
               Filter: ((x + 0) = 1)
               Rows Removed by Filter: 9999990
 Planning time: 0.212 ms
 Execution time: 6239.505 ms
(10 rows)


Time: 6241.220 ms

Unpatched:
explain analyze select x,y from a where x+0=1 order by x,y limit 10;
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Limit  (cost=195328.26..195328.28 rows=10 width=8) (actual time=3077.759..3077.761 rows=10 loops=1)
   ->  Sort  (cost=195328.26..195453.26 rows=50000 width=8) (actual time=3077.757..3077.757 rows=10 loops=1)
         Sort Key: x, y
         Sort Method: quicksort  Memory: 25kB
         ->  Seq Scan on a  (cost=0.00..194247.77 rows=50000 width=8) (actual time=0.018..3077.705 rows=10 loops=1)
               Filter: ((x + 0) = 1)
               Rows Removed by Filter: 9999990
 Planning time: 0.510 ms
 Execution time: 3077.837 ms
(9 rows)


Time: 3080.201 ms

As you can see, the patched version performs an index scan in order to get the partially sorted results, but it does end up quite a bit slower than the seqscan/sort that the unpatched master performs. I'm not quite sure how realistic the x+0 = 1 WHERE clause is, but perhaps the same would happen if something like x+y = 1 was used too. After a bit more analysis on this, I see that if I change the 50k estimate to 10 in the debugger, then num_groups is properly estimated at 1 and it performs the seq scan instead. So it looks like the costings of the patch are not to blame here. (The 50k row estimate comes from rel tuples / DEFAULT_NUM_DISTINCT, i.e. 10,000,000 / 200 = 50,000.)

Yes, the error comes from the assumed 50k row estimate. I've checked a similar example where the estimate is fine.

create table b as (select x.x,y.y,x.x z from generate_series(1,1000000) x(x) cross join generate_series(1,10) y(y));
create index b_x_idx on b(x);
analyze b;

There is a column z which is neither in the index nor in the "order by" clause. If we replace "x+0=1" with "z=1", the optimizer doesn't decide to use partial sort.

explain analyze select x,y,z from b where z=1 order by x,y limit 10;
                                                    QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Limit  (cost=179056.59..179056.61 rows=10 width=12) (actual time=1072.498..1072.500 rows=10 loops=1)
   ->  Sort  (cost=179056.59..179056.63 rows=18 width=12) (actual time=1072.495..1072.495 rows=10 loops=1)
         Sort Key: x, y
         Sort Method: quicksort  Memory: 25kB
         ->  Seq Scan on b  (cost=0.00..179056.21 rows=18 width=12) (actual time=0.020..1072.454 rows=10 loops=1)
               Filter: (z = 1)
               Rows Removed by Filter: 9999990
 Planning time: 0.501 ms
 Execution time: 1072.555 ms
(9 rows)

If we even force the optimizer to use partial sort, then the cost estimate is fine.

set enable_seqscan = off;
explain analyze select x,y,z from b where z=1 order by x,y limit 10;
                                                            QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Limit  (cost=169374.43..263471.04 rows=10 width=12) (actual time=2237.082..2237.083 rows=10 loops=1)
   ->  Partial sort  (cost=169374.43..338748.34 rows=18 width=12) (actual time=2237.082..2237.083 rows=10 loops=1)
         Sort Key: x, y
         Presorted Key: x
         Sort Method: quicksort  Memory: 25kB
         ->  Index Scan using b_x_idx on b  (cost=0.43..338748.13 rows=18 width=12) (actual time=0.047..2237.062 rows=10 loops=1)
               Filter: (z = 1)
               Rows Removed by Filter: 9999990
 Planning time: 0.089 ms
 Execution time: 2237.133 ms
(10 rows)

AFAICS, wrong selectivity estimates are a general problem that causes optimizer failures. But in your "x+y=1" example, if an expression index on "x+y" existed, then statistics over "x+y" would be collected. So, with an expression index, the estimate would be fine.

------
With best regards,
Alexander Korotkov.   
Attachment

Re: PoC: Partial sort

From
Peter Geoghegan
Date:
On Fri, Sep 12, 2014 at 2:19 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> Actually, higher cardinality in the skip columns is better. Sorting many
> small groups is faster than sorting fewer large groups of the same total
> size. Also, with smaller groups you reach the limit more precisely (on
> average), i.e. you sort a smaller number of rows in total.

Higher cardinality leading key columns are better for what? Do you
mean they're better in that those cases are more sympathetic to your
patch, or better in the general sense that they'll perform better for
the user? The first example query that you chose to demonstrate your
patch had a leading, indexed column (column "v1") with much lower
cardinality/n_distinct than the column that had to be sorted on
(column "v2"). That was what my comments were based on.

When this feature is used, there will often be a significantly lower
cardinality in the leading, indexed column (as in your example).
Otherwise, the user might well have been happy to just order on the
indexed/leading column alone. That isn't definitely true, but it's
often true.

>> I'm not sure if that's worth it to more or less
>> duplicate heap_tuple_attr_equals() to save a "mere" n expensive
>> comparisons, but it's something to think about (actually, there are
>> probably less than even n comparisons in practice because there'll be
>> a limit).
>
> Not correct. Smaller groups are not OK. Imagine that two representations of
> the same skip column value exist. The index may return them in any order,
> even alternating between them. In that case the sort on the other columns
> never takes place, while it should.

I think I explained this badly - it wouldn't be okay to make the
grouping smaller, but as you say we could tie-break with a proper
B-Tree support function 1 comparison to make sure we really had
reached the end of our grouping.

FWIW I want all bttextcmp()/varstr_cmp() comparisons to try a memcmp()
first, so the use of the equality operator with text in mind that you
mention may soon not be useful at all. The evidence suggests that
memcmp() is so much cheaper than special preparatory NUL-termination +
strcoll() that we should always try it first when sorting text, even
when we have no good reason to think it will work. The memcmp() is
virtually free. And so, you see why it might be worth thinking about
this when we already have reasonable confidence that many comparisons
will indicate that datums are equal. Other datatypes will have
expensive "real" comparators, not just text or numeric.

-- 
Peter Geoghegan



Re: PoC: Partial sort

From
Peter Geoghegan
Date:
Some quick comments on partial-sort-basic-2.patch:

> *** a/src/include/utils/tuplesort.h
> --- b/src/include/utils/tuplesort.h
> ***************
> *** 24,29 ****
> --- 24,30 ----
>   #include "executor/tuptable.h"
>   #include "fmgr.h"
>   #include "utils/relcache.h"
> + #include "utils/sortsupport.h"

Why include sortsupport.h here?

I would like to see more comments, especially within ExecSort(). The
structure of that routine is quite unclear.

I don't really like this MakeSortSupportKeys() stuff within ExecSort():

> !       /* Support structures for cmpSortSkipCols - already sorted columns */
> !       if (skipCols)
> !           node->skipKeys = MakeSortSupportKeys(skipCols,
> !                                                plannode->sortColIdx,
> !                                                plannode->sortOperators,
> !                                                plannode->collations,
> !                                                plannode->nullsFirst);
>
> +       /* Only pass on remaining columns that are unsorted */
>         tuplesortstate = tuplesort_begin_heap(tupDesc,
> !                                             plannode->numCols - skipCols,
> !                                             &(plannode->sortColIdx[skipCols]),
> !                                             &(plannode->sortOperators[skipCols]),
> !                                             &(plannode->collations[skipCols]),
> !                                             &(plannode->nullsFirst[skipCols]),
>                                               work_mem,
>                                               node->randomAccess);

You're calling the new function MakeSortSupportKeys() (which
encapsulates setting up sortsupport state for all attributes) twice;
first, to populate the skip keys (the indexed attribute(s)), and
second, when tuplesort_begin_heap() is called, so that there is state
for unindexed sort groups that must be manually sorted. That doesn't
seem great.

I think we might be better off if a tuplesort function was called
shortly after tuplesort_begin_heap() is called. How top-n heap sorts
work is something that largely lives in tuplesort's head. Today, we
call tuplesort_set_bound() to hint to tuplesort "By the way, this is a
top-n heap sort applicable sort". I think that with this patch, we
should then hint (where applicable) "by the way, you won't actually be
required to sort those first n indexed attributes; rather, you can
expect to scan those in logical order. You may work the rest out
yourself, and may be clever about exploiting the sorted-ness of the
first few columns". The idea of managing a bunch of tiny sorts from
within ExecSort(), and calling the new function tuplesort_reset() seems
questionable. tuplesortstate is supposed to be private/opaque to
nodeSort.c, and the current design strains that.

I'd like to keep nodeSort.c simple. I think it's pretty clear that the
guts of this do not belong within ExecSort(), in any case. Also, the
additions there should be much better commented, wherever they finally
end up.

In this struct:

> *** a/src/include/nodes/execnodes.h
> --- b/src/include/nodes/execnodes.h
> *************** typedef struct SortState
> *** 1670,1678 ****
> --- 1670,1682 ----
>      bool        bounded;        /* is the result set bounded? */
>      int64       bound;          /* if bounded, how many tuples are needed */
>      bool        sort_Done;      /* sort completed yet? */
> +   bool        finished;       /* fetching tuples from outer node
> +                                  is finished ? */
>      bool        bounded_Done;   /* value of bounded we did the sort with */
>      int64       bound_Done;     /* value of bound we did the sort with */
>      void       *tuplesortstate; /* private state of tuplesort.c */
> +   SortSupport skipKeys;       /* columns already sorted in input */
> +   HeapTuple   prev;           /* previous tuple from outer node */
>   } SortState;

I think you should be clearer about the scope and duration of fields
like "finished", if this really belongs here. In general, there should
be some high-level comments about how the feature added by the patch
fits together, which there isn't right now.

That's all I have for now.

-- 
Peter Geoghegan



Re: PoC: Partial sort

From
Alexander Korotkov
Date:
On Sun, Sep 14, 2014 at 7:39 AM, Peter Geoghegan <pg@heroku.com> wrote:
On Fri, Sep 12, 2014 at 2:19 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> Actually, higher cardinality in the skip columns is better. Sorting many
> small groups is faster than sorting fewer large groups of the same total
> size. Also, with smaller groups you reach the limit more precisely (on
> average), i.e. you sort a smaller number of rows in total.

Higher cardinality leading key columns are better for what? Do you
mean they're better in that those cases are more sympathetic to your
patch, or better in the general sense that they'll perform better for
the user? The first example query that you chose to demonstrate your
patch had a leading, indexed column (column "v1") with much lower
cardinality/n_distinct than the column that had to be sorted on
(column "v2"). That was what my comments were based on.

When this feature is used, there will often be a significantly lower
cardinality in the leading, indexed column (as in your example).
Otherwise, the user might well have been happy to just order on the
indexed/leading column alone. That isn't definitely true, but it's
often true.

I just meant that higher cardinality is cheaper for partial sort. We may have a misunderstanding here because the notions "high" and "low" are relative. When the cardinality is 1, partial sort seems useless. When the cardinality is low, it could still be cheaper to do a sequential scan + sort rather than an index scan + partial sort. When the cardinality is close to the number of rows, then, as you mentioned, the user probably doesn't need to sort by the rest of the columns; at least one exception is when the user needs to force a unique ordering. So we need to target something in between these two corner cases.
 
>> I'm not sure if that's worth it to more or less
>> duplicate heap_tuple_attr_equals() to save a "mere" n expensive
>> comparisons, but it's something to think about (actually, there are
>> probably less than even n comparisons in practice because there'll be
>> a limit).
>
> Not correct. Smaller groups are not OK. Imagine that two representations of
> the same skip column value exist. The index may return them in any order,
> even alternating between them. In that case the sort on the other columns
> never takes place, while it should.

I think I explained this badly - it wouldn't be okay to make the
grouping smaller, but as you say we could tie-break with a proper
B-Tree support function 1 comparison to make sure we really had
reached the end of our grouping.

FWIW I want all bttextcmp()/varstr_cmp() comparisons to try a memcmp()
first, so the use of the equality operator with text in mind that you
mention may soon not be useful at all. The evidence suggests that
memcmp() is so much cheaper than special preparatory NUL-termination +
strcoll() that we should always try it first when sorting text, even
when we have no good reason to think it will work. The memcmp() is
virtually free. And so, you see why it might be worth thinking about
this when we already have reasonable confidence that many comparisons
will indicate that datums are equal. Other datatypes will have
expensive "real" comparators, not just text or numeric.

When strings are not equal, bttextcmp still needs to use strcoll, while texteq doesn't. So there would still be an advantage to using the equality operator over the comparison function: the equality operator doesn't have to order unequal values. However, how large this advantage really is depends on the cardinality of the presorted columns as well.

------
With best regards,
Alexander Korotkov.  

Re: PoC: Partial sort

From
Alexander Korotkov
Date:
On Sun, Sep 14, 2014 at 9:32 AM, Peter Geoghegan <pg@heroku.com> wrote:
I think we might be better off if a tuplesort function was called
shortly after tuplesort_begin_heap() is called. How top-n heap sorts
work is something that largely lives in tuplesort's head. Today, we
call tuplesort_set_bound() to hint to tuplesort "By the way, this is a
top-n heap sort applicable sort". I think that with this patch, we
should then hint (where applicable) "by the way, you won't actually be
required to sort those first n indexed attributes; rather, you can
expect to scan those in logical order. You may work the rest out
yourself, and may be clever about exploiting the sorted-ness of the
first few columns". The idea of managing a bunch of tiny sorts from
within ExecSort(), and calling the new function tuplesort_reset() seems
questionable. tuplesortstate is supposed to be private/opaque to
nodeSort.c, and the current design strains that.

I'd like to keep nodeSort.c simple. I think it's pretty clear that the
guts of this do not belong within ExecSort(), in any case. Also, the
additions there should be much better commented, wherever they finally
end up.

As I understand it, you propose to encapsulate the partial sort algorithm into tuplesort. However, in that case we would anyway need a significant change to its interface: letting tuplesort decide when it is able to return a tuple. Otherwise we would miss a significant part of the LIMIT clause optimization. tuplesort_set_bound() can't cover all the cases; there could be other planner nodes between the partial sort and the LIMIT.

------
With best regards,
Alexander Korotkov.   

Re: PoC: Partial sort

From
Andreas Karlsson
Date:
On 09/15/2014 01:58 PM, Alexander Korotkov wrote:
> On Sun, Sep 14, 2014 at 9:32 AM, Peter Geoghegan <pg@heroku.com
> <mailto:pg@heroku.com>> wrote:
>
>     I think we might be better off if a tuplesort function was called
>     shortly after tuplesort_begin_heap() is called. How top-n heap sorts
>     work is something that largely lives in tuplesort's head. Today, we
>     call tuplesort_set_bound() to hint to tuplesort "By the way, this is a
>     top-n heap sort applicable sort". I think that with this patch, we
>     should then hint (where applicable) "by the way, you won't actually be
>     required to sort those first n indexed attributes; rather, you can
>     expect to scan those in logical order. You may work the rest out
>     yourself, and may be clever about exploiting the sorted-ness of the
>     first few columns". The idea of managing a bunch of tiny sorts from
>     within ExecSort(), and calling the new function tuplesort_reset() seems
>     questionable. tuplesortstate is supposed to be private/opaque to
>     nodeSort.c, and the current design strains that.
>
>     I'd like to keep nodeSort.c simple. I think it's pretty clear that the
>     guts of this do not belong within ExecSort(), in any case. Also, the
>     additions there should be much better commented, wherever they finally
>     end up.
>
>
> As I understand it, you propose to encapsulate the partial sort algorithm
> into tuplesort. However, in that case we would anyway need a significant
> change to its interface: letting tuplesort decide when it is able to return
> a tuple. Otherwise we would miss a significant part of the LIMIT clause
> optimization. tuplesort_set_bound() can't cover all the cases; there could
> be other planner nodes between the partial sort and the LIMIT.

Hi,

Are you planning to work on this patch for 9.6?

I generally agree with Peter that the changes to the sorting probably 
belong in the tuplesort code rather than in the executor. This way it 
should also be theoretically possible to support mark/restore.

Andreas



Re: PoC: Partial sort

From
Peter Geoghegan
Date:
On Sun, Jun 7, 2015 at 8:10 AM, Andreas Karlsson <andreas@proxel.se> wrote:
> Are you planning to work on this patch for 9.6?

FWIW I hope so. It's a nice patch.

-- 
Peter Geoghegan



Re: PoC: Partial sort

From
Alexander Korotkov
Date:
On Sun, Jun 7, 2015 at 11:01 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Sun, Jun 7, 2015 at 8:10 AM, Andreas Karlsson <andreas@proxel.se> wrote:
> Are you planning to work on this patch for 9.6?

FWIW I hope so. It's a nice patch.

I'm trying to whisk the dust off. A rebased version of the patch is attached. This patch isn't passing the regression tests because of plan changes. I'm not yet sure about those changes: why do they happen, and are they really a regression?
Since I'm not very familiar with the planning of INSERT ON CONFLICT and RLS, any help is appreciated.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: PoC: Partial sort

From
Alexander Korotkov
Date:
On Fri, Oct 16, 2015 at 7:11 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Sun, Jun 7, 2015 at 11:01 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Sun, Jun 7, 2015 at 8:10 AM, Andreas Karlsson <andreas@proxel.se> wrote:
> Are you planning to work on this patch for 9.6?

FWIW I hope so. It's a nice patch.

I'm trying to whisk the dust off. A rebased version of the patch is attached. This patch isn't passing the regression tests because of plan changes. I'm not yet sure about those changes: why do they happen, and are they really a regression?
Since I'm not very familiar with the planning of INSERT ON CONFLICT and RLS, any help is appreciated.

Planner regression is fixed in the attached version of patch. It appears that get_cheapest_fractional_path_for_pathkeys() behaved wrong when no ordering is required.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: PoC: Partial sort

From
Peter Geoghegan
Date:
On Tue, Oct 20, 2015 at 4:17 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> Planner regression is fixed in the attached version of patch. It appears
> that get_cheapest_fractional_path_for_pathkeys() behaved wrong when no
> ordering is required.

I don't see an entry in the CF app for this. This seems like something
I should review, though.

-- 
Peter Geoghegan



Re: PoC: Partial sort

From
Peter Geoghegan
Date:
On Tue, Oct 20, 2015 at 4:17 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> Planner regression is fixed in the attached version of patch. It appears
> that get_cheapest_fractional_path_for_pathkeys() behaved wrong when no
> ordering is required.

I took a look at this. My remarks are not comprehensive, but are just
what I noticed having looked at this for the first time in well over a
year.

Before anything else, I would like to emphasize that I think that this
is pretty important work; it's not just a "nice to have". I'm very
glad you picked it up, because we need it. In the real world, there
will be *lots* of cases that this helps.

Explain output
-------------------

A query like your original test query looks like this for me:

postgres=# explain analyze select * from test order by v1, v2 limit 100;
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=429.54..434.53 rows=100 width=16) (actual time=15125.819..22414.038 rows=100 loops=1)
   ->  Partial sort  (cost=429.54..50295.52 rows=1000000 width=16) (actual time=15125.799..22413.297 rows=100 loops=1)
         Sort Key: v1, v2
         Presorted Key: v1
         Sort Method: top-N heapsort  Memory: 27kB
         ->  Index Scan using test_v1_idx on test  (cost=0.42..47604.43 rows=1000000 width=16) (actual time=1.663..13.066 rows=151 loops=1)
 Planning time: 0.948 ms
 Execution time: 22414.895 ms
(8 rows)

I thought about it for a while, and decided that you have the basic
shape of the explain output right here. I see where you are going by
having the sort node drive things.

I don't think the node should be called "Partial sort", though. I
think that this is better presented as just a "Sort" node with a
presorted key.

I think it might be a good idea to also have a "Sort Groups: 2" field
above. That illustrates that you are in fact performing 2 small sorts
per group, which is the reality. As you said, it's good to have this
be high, because then the sort operations don't need to do too many
comparisons, which could be expensive.

Sort Method
----------------

Even though the explain analyze above shows "top-N heapsort" as its
sort method, that isn't really true. I actually ran this through a
debugger, which is why the above plan took so long to execute, in case
you wondered. I saw that in practice the first sort executed for the
first group uses a quicksort, while only the second sort (needed for
the 2nd and last group in this example) used a top-N heapsort.

Is it really worth using a top-N heapsort to avoid sorting the last
little bit of tuples in the last group? Maybe we should never push
down to an individual sort operation (we have one
tuplesort_performsort() per group) that it should be bounded. Our
quicksort falls back on insertion sort in the event of only having 7
elements (at that level of recursion), so having this almost always
use quicksort may be no bad thing.

Even if you don't like that, the "Sort Method" shown above is just
misleading. I wonder, also, if you need to be more careful about
whether or not "Memory" is really the high watermark, as opposed to
the memory used by the last sort operation of many. There could be
many large tuples in one grouping, for example. Note that the current
code will not show any "Memory" in explain analyze for cases that have
memory freed before execution is done, which this is not consistent
with. Maybe that's not so important. Unsure.

trace_sort output shows that these queries often use a large number of
tiny individual sorts. Maybe that's okay, or maybe we should make it
clearer that the sorts are related. I now use trace_sort a lot.

Abbreviated Keys
-----------------------

It could be very bad for performance that the first non-presorted key
uses abbreviated keys. There needs to be a way to tell tuplesort to
not waste its time with them, just as there currently is for bounded
(top-N heapsort) sorts. They're almost certainly the wrong way to go,
unless you have huge groups (where partial sorting is unlikely to win
in the first place).

Other issues in executor
--------------------------------

This is sort of an optimizer issue, but code lives in execAmi.c.
Assert is redundant here:

+               case T_Sort:
+                       /* We shouldn't reach here without having plan node */
+                       Assert(node);
+                       /* With skipCols sort node holds only last bucket */
+                       if (node && ((Sort *)node)->skipCols == 0)
+                               return true;
+                       else
+                               return false;

I don't like that you've added a Plan node argument to
ExecMaterializesOutput() in this function, too.

There is similar assert/pointer test code within
ExecSupportsBackwardScan() and ExecSupportsMarkRestore(). In general,
I have concerns about the way the determination of a sort's ability to
do stuff like be scanned backwards is now made dynamic, which this new
code demonstrates:
       /*
+        * skipCols can't be used with either EXEC_FLAG_REWIND, EXEC_FLAG_BACKWARD
+        * or EXEC_FLAG_MARK, because we hold only current bucket in
+        * tuplesortstate.
+        */
+       Assert(node->skipCols == 0 || (eflags & (EXEC_FLAG_REWIND |
+                 EXEC_FLAG_BACKWARD |
+                 EXEC_FLAG_MARK)) == 0);
+

I need to think some more about this general issue.

Misc. issues
----------------

_readSort() needs READ_INT_FIELD().  _outSort() similarly needs
WRITE_INT_FIELD(). You've mostly missed this stuff.

Please be more careful about this. It's always a good idea to run the
regression tests with "#define COPY_PARSE_PLAN_TREES" from time to
time, which tends to highlight these problems.

tuplesort.h should not include sortsupport.h. It's a modularity
violation, and besides which is unnecessary. Similarly, pathkeys.c
should not include optimizer/cost.h.

What is this?

+               if (inner_cheapest_total &&
+                   inner_cheapest_total->pathtype == T_Sort)
+                       elog(ERROR, "Sort");

Optimizer
-------------

I am not an expert on the optimizer, but I do have some feedback.

* cost_sort() needs way way more comments.  Doesn't even mention
indexes. Not worth commenting further on until I know what it's
*supposed* to do, though.

* pathkeys_useful_for_ordering() now looks like a private convenience
wrapper for the new public function pathkeys_common().  I think that
comments should make this quite clear.

* compare_bifractional_path_costs() should live beside
compare_fractional_path_costs() and be public, I think. The existing
compare_fractional_path_costs() also only has a small number of
possible clients, and is still not static.

* Think it's not okay that there are new arguments, such as the
"tuples" argument for get_cheapest_fractional_path_for_pathkeys().

It seems a bad sign, design-wise, that a new argument of "PlannerInfo
*root" was added at end, for the narrow purpose of passing stuff to
estimate number of groups for the benefit of this patch.  ISTM that
grouping_planner() caller should do the
work itself as and when it alone needs to.

* New loop within get_cheapest_fractional_path_for_pathkeys() requires
far more explanation.

Explain theory behind derivation of compare_bifractional_path_costs()
fraction arguments, please. I think there might be simple heuristics
like this elsewhere in the optimizer or selfuncs.c, but you need to
share why you did things that way in the code.

* Within planner.c, "partial_sort_path" variable name in
grouping_planner() might be called something else.

Its purpose isn't clear. Also, the way that you mix path costs from
the new get_cheapest_fractional_path_for_pathkeys() into the new
cost_sort() needs to be explained in detail (as I already said,
cost_sort() is currently very under-documented).

Obviously the optimizer part of this needs the most work -- no
surprises there. I wonder if we cover all useful cases? I haven't yet
got around to using "#define OPTIMIZER_DEBUG" to see what's really
going on, which seems essential to understanding what is really
happening, but it looks like merge append paths can currently use the
optimization, whereas unique paths cannot. Have you thought about
that?

That's all I have for now...

-- 
Peter Geoghegan



Re: PoC: Partial sort

From
Tomas Vondra
Date:
Hi,

On 10/20/2015 01:17 PM, Alexander Korotkov wrote:
> On Fri, Oct 16, 2015 at 7:11 PM, Alexander Korotkov
> <aekorotkov@gmail.com <mailto:aekorotkov@gmail.com>> wrote:
>
>     On Sun, Jun 7, 2015 at 11:01 PM, Peter Geoghegan <pg@heroku.com
>     <mailto:pg@heroku.com>> wrote:
>
>         On Sun, Jun 7, 2015 at 8:10 AM, Andreas Karlsson
>         <andreas@proxel.se <mailto:andreas@proxel.se>> wrote:
>         > Are you planning to work on this patch for 9.6?
>
>         FWIW I hope so. It's a nice patch.
>
>
>     I'm trying to to whisk dust. Rebased version of patch is attached.
>     This patch isn't passing regression tests because of plan changes.
>     I'm not yet sure about those changes: why they happens and are they
>     really regression?
>     Since I'm not very familiar with planning of INSERT ON CONFLICT and
>     RLS, any help is appreciated.
>
>
> Planner regression is fixed in the attached version of patch. It appears
> that get_cheapest_fractional_path_for_pathkeys() behaved wrong when no
> ordering is required.
>

Alexander, are you working on this patch? I'd like to look at the patch, 
but the last available version (v4) no longer applies - there's plenty 
of bitrot. Do you plan to send an updated / rebased version?


The main thing I'm particularly interested in is how much is this 
coupled with the Sort node, and whether it's possible to feed partially 
sorted tuples into other nodes.

I'm particularly thinking about Hash Aggregate, because the partial sort 
allows to keep only the "current group" in a hash table, making it much 
more memory efficient / faster. What do you think?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PoC: Partial sort

From
Peter Geoghegan
Date:
On Sat, Jan 23, 2016 at 4:07 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> The main thing I'm particularly interested in is how much is this coupled
> with the Sort node, and whether it's possible to feed partially sorted
> tuples into other nodes.

That's cool, but I'm particularly interested in seeing Alexander get
back to this because it's an important project on its own. We should
really have this.

-- 
Peter Geoghegan



Re: PoC: Partial sort

From
Alexander Korotkov
Date:
Hi, Tomas!

On Sat, Jan 23, 2016 at 3:07 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 10/20/2015 01:17 PM, Alexander Korotkov wrote:
On Fri, Oct 16, 2015 at 7:11 PM, Alexander Korotkov
<aekorotkov@gmail.com <mailto:aekorotkov@gmail.com>> wrote:

    On Sun, Jun 7, 2015 at 11:01 PM, Peter Geoghegan <pg@heroku.com
    <mailto:pg@heroku.com>> wrote:

        On Sun, Jun 7, 2015 at 8:10 AM, Andreas Karlsson
        <andreas@proxel.se <mailto:andreas@proxel.se>> wrote:
        > Are you planning to work on this patch for 9.6?

        FWIW I hope so. It's a nice patch.


    I'm trying to to whisk dust. Rebased version of patch is attached.
    This patch isn't passing regression tests because of plan changes.
    I'm not yet sure about those changes: why they happens and are they
    really regression?
    Since I'm not very familiar with planning of INSERT ON CONFLICT and
    RLS, any help is appreciated.


Planner regression is fixed in the attached version of patch. It appears
that get_cheapest_fractional_path_for_pathkeys() behaved wrong when no
ordering is required.


Alexander, are you working on this patch? I'd like to look at the patch, but the last available version (v4) no longer applies - there's plenty of bitrot. Do you plan to send an updated / rebased version?

I'm sorry that I haven't found time for this yet. I'm certainly planning to get back to it in the near future. The attached version is just rebased, without any further optimization.

The main thing I'm particularly interested in is how much is this coupled with the Sort node, and whether it's possible to feed partially sorted tuples into other nodes.

I'm particularly thinking about Hash Aggregate, because the partial sort allows to keep only the "current group" in a hash table, making it much more memory efficient / faster. What do you think?

This seems to me a very reasonable optimization. And it would be nice to implement some generalized way of processing presorted groups. For instance, we could have some special node, say "Group Scan", which has 2 children: the source and a node which processes every group. For "partial sort" the second node would be a Sort node, but it could be a Hash Aggregate node as well.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
 
Attachment

Re: PoC: Partial sort

From
Alexander Korotkov
Date:
Hi!

On Sat, Jan 23, 2016 at 10:07 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Sat, Jan 23, 2016 at 4:07 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> The main thing I'm particularly interested in is how much is this coupled
> with the Sort node, and whether it's possible to feed partially sorted
> tuples into other nodes.

That's cool, but I'm particularly interested in seeing Alexander get
back to this because it's an important project on its own. We should
really have this.

Thank you for your review and interest in this patch! I'm sorry for the huge delay. I'm going to get back to this soon.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: PoC: Partial sort

From
Alvaro Herrera
Date:
Alexander Korotkov wrote:
> I'm sorry that I haven't found time for this yet. I'm certainly planning to
> get back to it in the near future. The attached version is just rebased,
> without any further optimization.

Great to have a new version -- there seems to be a lot of interest in
this patch.  I'm moving this one to the next commitfest, thanks.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PoC: Partial sort

From
Peter Geoghegan
Date:
On Sun, Jan 31, 2016 at 4:16 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Great to have a new version -- there seems to be a lot of interest in
> this patch.  I'm moving this one to the next commitfest, thanks.

I am signed up to review this patch.

I was very surprised to see it in "Needs Review" state in the CF app
(Alexander just rebased the patch, and didn't do anything with the CF
app entry). Once again, this seems to have happened just because
Alvaro moved the patch to the next CF.

I've marked it "Waiting on Author" once more. Hopefully the CF app
will be fixed soon, so moving a patch to the next commitfest no longer
clobbers its state.

-- 
Peter Geoghegan



Re: PoC: Partial sort

From
Alexander Korotkov
Date:
Hi, Peter!

I finally went over your review.

On Wed, Nov 4, 2015 at 4:44 AM, Peter Geoghegan <pg@heroku.com> wrote:
Explain output
-------------------

A query like your original test query looks like this for me:

postgres=# explain analyze select * from test order by v1, v2 limit 100;
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=429.54..434.53 rows=100 width=16) (actual time=15125.819..22414.038 rows=100 loops=1)
   ->  Partial sort  (cost=429.54..50295.52 rows=1000000 width=16) (actual time=15125.799..22413.297 rows=100 loops=1)
         Sort Key: v1, v2
         Presorted Key: v1
         Sort Method: top-N heapsort  Memory: 27kB
         ->  Index Scan using test_v1_idx on test  (cost=0.42..47604.43 rows=1000000 width=16) (actual time=1.663..13.066 rows=151 loops=1)
 Planning time: 0.948 ms
 Execution time: 22414.895 ms
(8 rows)

I thought about it for a while, and decided that you have the basic
shape of the explain output right here. I see where you are going by
having the sort node drive things.

I don't think the node should be called "Partial sort", though. I
think that this is better presented as just a "Sort" node with a
presorted key.

I think it might be a good idea to also have a "Sort Groups: 2" field
above. That illustrates that you are in fact performing 2 small sorts
per group, which is the reality. As you said, it's good to have this
be high, because then the sort operations don't need to do too many
comparisons, which could be expensive.

I agree with your notes. In the attached version of the patch, the explain output was revised as you proposed.
 
Sort Method
----------------

Even though the explain analyze above shows "top-N heapsort" as its
sort method, that isn't really true. I actually ran this through a
debugger, which is why the above plan took so long to execute, in case
you wondered. I saw that in practice the first sort executed for the
first group uses a quicksort, while only the second sort (needed for
the 2nd and last group in this example) used a top-N heapsort.

Is it really worth using a top-N heapsort to avoid sorting the last
little bit of tuples in the last group? Maybe we should never push
down to an individual sort operation (we have one
tuplesort_performsort() per group) that it should be bounded. Our
quicksort falls back on insertion sort in the event of only having 7
elements (at that level of recursion), so having this almost always
use quicksort may be no bad thing.

Even if you don't like that, the "Sort Method" shown above is just
misleading. I wonder, also, if you need to be more careful about
whether or not "Memory" is really the high watermark, as opposed to
the memory used by the last sort operation of many. There could be
many large tuples in one grouping, for example. Note that the current
code will not show any "Memory" in explain analyze for cases that have
memory freed before execution is done, which this is not consistent
with. Maybe that's not so important. Unsure.

trace_sort output shows that these queries often use a large number of
tiny individual sorts. Maybe that's okay, or maybe we should make it
clearer that the sorts are related. I now use trace_sort a lot.

With partial sort we run multiple sorts in the same node. Ideally, we need to provide some information aggregated over all the runs.
This situation looks very similar to a subplan which is called multiple times. I checked how that currently works.

# explain analyze select (select sum(x.i) from (select i from generate_series(1,t * 1000) i order by i desc) x) from generate_series(1, 20, 1) t;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series t  (cost=0.00..74853.92 rows=1000 width=4) (actual time=0.777..83.498 rows=20 loops=1)
   SubPlan 1
     ->  Aggregate  (cost=74.83..74.84 rows=1 width=4) (actual time=4.173..4.173 rows=1 loops=20)
           ->  Sort  (cost=59.83..62.33 rows=1000 width=4) (actual time=2.822..3.361 rows=10500 loops=20)
                 Sort Key: i.i
                 Sort Method: quicksort  Memory: 1706kB
                 ->  Function Scan on generate_series i  (cost=0.01..10.01 rows=1000 width=4) (actual time=0.499..1.106 rows=10500 loops=20)
 Planning time: 0.080 ms
 Execution time: 83.625 ms
(9 rows)

# explain analyze select (select sum(x.i) from (select i from generate_series(1,t * 1000) i order by i desc) x) from generate_series(20, 1, -1) t;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series t  (cost=0.00..74853.92 rows=1000 width=4) (actual time=11.414..86.127 rows=20 loops=1)
   SubPlan 1
     ->  Aggregate  (cost=74.83..74.84 rows=1 width=4) (actual time=4.305..4.305 rows=1 loops=20)
           ->  Sort  (cost=59.83..62.33 rows=1000 width=4) (actual time=2.944..3.486 rows=10500 loops=20)
                 Sort Key: i.i
                 Sort Method: quicksort  Memory: 71kB
                 ->  Function Scan on generate_series i  (cost=0.01..10.01 rows=1000 width=4) (actual time=0.527..1.125 rows=10500 loops=20)
 Planning time: 0.080 ms
 Execution time: 86.165 ms
(9 rows)

In the case of a subplan, explain analyze gives us information only about the last subplan run. This makes me uneasy. On one hand, it's probably OK for partial sort to behave like a subplan and show information just about the last sort run. On the other hand, we need some better solution for this in the general case.

 
Abbreviated Keys
-----------------------

It could be very bad for performance that the first non-presorted key
uses abbreviated keys. There needs to be a way to tell tuplesort to
not waste its time with them, just as there currently is for bounded
(top-N heapsort) sorts. They're almost certainly the wrong way to go,
unless you have huge groups (where partial sorting is unlikely to win
in the first place).

Agree. I made 
 
Other issues in executor
--------------------------------

This is sort of an optimizer issue, but code lives in execAmi.c.
Assert is redundant here:

+               case T_Sort:
+                       /* We shouldn't reach here without having plan node */
+                       Assert(node);
+                       /* With skipCols sort node holds only last bucket */
+                       if (node && ((Sort *)node)->skipCols == 0)
+                               return true;
+                       else
+                               return false;

I don't like that you've added a Plan node argument to
ExecMaterializesOutput() in this function, too.

I don't like this either. But I didn't find a better solution without a significant rework of the planner.
However, "Upper planner pathification" by Tom Lane seems to include such a rework. It's likely that sort becomes a separate path node there.
Then ExecMaterializesOutput could read the parameters of the path node.
 
There is similar assert/pointer test code within
ExecSupportsBackwardScan() and ExecSupportsMarkRestore(). In general,
I have concerns about the way the determination of a sort's ability to
do stuff like be scanned backwards is now made dynamic, which this new
code demonstrates:

        /*
+        * skipCols can't be used with either EXEC_FLAG_REWIND, EXEC_FLAG_BACKWARD
+        * or EXEC_FLAG_MARK, because we hold only current bucket in
+        * tuplesortstate.
+        */
+       Assert(node->skipCols == 0 || (eflags & (EXEC_FLAG_REWIND |
+                                                EXEC_FLAG_BACKWARD |
+                                                EXEC_FLAG_MARK)) == 0);
+

I need to think some more about this general issue.

It has to be dynamic if we want to keep full sort and partial sort in the same node. If the properties of full sort and partial sort differ and they share the same node, then these properties of the Sort node have to be dynamic.
An alternative idea I have is that the Sort node should fall back to full sort if it sees any of the above flags. But I'm not sure this is right. In some cases it might be cheaper to partial sort and then materialize than to fall back to a full sort.

Misc. issues
----------------

_readSort() needs READ_INT_FIELD().  _outSort() similarly needs
WRITE_INT_FIELD(). You've mostly missed this stuff.

Please be more careful about this. It's always a good idea to run the
regression tests with "#define COPY_PARSE_PLAN_TREES" from time to
time, which tends to highlight these problems.
 
Fixed. I've tried "#define COPY_PARSE_PLAN_TREES"; the regression tests now pass with it.

tuplesort.h should not include sortsupport.h. It's a modularity
violation, and besides which is unnecessary. Similarly, pathkeys.c
should not include optimizer/cost.h.

Fixed. 

What is this?

+               if (inner_cheapest_total &&
+                       inner_cheapest_total->pathtype == T_Sort)
+                       elog(ERROR, "Sort");

It's just a piece of junk I used for debugging. Deleted.
 

Optimizer
-------------

I am not an expert on the optimizer, but I do have some feedback.

* cost_sort() needs way way more comments.  Doesn't even mention
indexes. Not worth commenting further on until I know what it's
*supposed* to do, though.

I've added some comments.
 
* pathkeys_useful_for_ordering() now looks like a private convenience
wrapper for the new public function pathkeys_common().  I think that
comments should make this quite clear.

Exactly. An explicit comment about that has been added.
 
* compare_bifractional_path_costs() should live beside
compare_fractional_path_costs() and be public, I think. The existing
compare_fractional_path_costs() also only has a small number of
possible clients, and is still not static.

Now compare_bifractional_path_costs() is placed beside compare_fractional_path_costs().
 
* Think it's not okay that there are new arguments, such as the
"tuples" argument for get_cheapest_fractional_path_for_pathkeys().

It seems a bad sign, design-wise, that a new argument of "PlannerInfo
*root" was added at end, for the narrow purpose of passing stuff to
estimate number of groups for the benefit of this patch.  ISTM that
grouping_planner() caller should do the
work itself as and when it alone needs to.

Now grouping_planner() calls a separate function, estimate_pathkeys_groups(), which is responsible for estimating the number of groups.
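For illustration, a hedged sketch of what such a helper could look like, built on the existing estimate_num_groups() in selfuncs.c (the function name is from the patch; the body is only an assumed shape, not the patch's actual code):

static double
estimate_pathkeys_groups(PlannerInfo *root, List *pathkeys, double tuples)
{
	List	   *exprs = NIL;
	ListCell   *lc;

	foreach(lc, pathkeys)
	{
		PathKey    *pathkey = (PathKey *) lfirst(lc);
		EquivalenceMember *em;

		/* Use the first equivalence member as a representative expression. */
		em = (EquivalenceMember *) linitial(pathkey->pk_eclass->ec_members);
		exprs = lappend(exprs, em->em_expr);
	}

	return estimate_num_groups(root, exprs, tuples, NULL);
}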
 
* New loop within get_cheapest_fractional_path_for_pathkeys() requires
far more explanation.

Explain theory behind derivation of compare_bifractional_path_costs()
fraction arguments, please. I think there might be simple heuristics
like this elsewhere in the optimizer or selfuncs.c, but you need to
share why you did things that way in the code.
 
The idea is that since partial sort fetches data per group, it would require fetching more data than a fully presorted path.
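A hedged sketch of that intuition (not the patch's exact formula; num_groups here is an assumed estimate of the number of presorted groups):

/*
 * To emit a fraction f of the result, a partial sort has to consume whole
 * groups from the presorted input, so expect to read roughly one extra
 * group's worth of tuples on top of the fraction itself.
 */
input_fraction = Min(1.0, f + 1.0 / num_groups);

The two fractions can then be handed to compare_bifractional_path_costs() to compare the presorted path against the fully sorted one.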

* Within planner.c, "partial_sort_path" variable name in
grouping_planner() might be called something else.

Its purpose isn't clear. Also, the way that you mix path costs from
the new get_cheapest_fractional_path_for_pathkeys() into the new
cost_sort() needs to be explained in detail (as I already said,
cost_sort() is currently very under-documented).

I've tried to make it clearer. partial_sort_path is renamed to presorted_path. 
 
Obviously the optimizer part of this needs the most work -- no
surprises there. I wonder if we cover all useful cases? I haven't yet
got around to using "#define OPTIMIZER_DEBUG" to see what's really
going on, which seems essential to understanding what is really
happening, but it looks like merge append paths can currently use the
optimization, whereas unique paths cannot. Have you thought about
that?

Unique paths can occasionally use this optimization.

# create table test as (select id, random() as v from generate_series(1,1000000) id);
# create index test_v_idx on test(v);

# explain select distinct v, id from test;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Unique  (cost=0.47..55104.41 rows=1000000 width=12)
   ->  Sort  (cost=0.47..50104.41 rows=1000000 width=12)
         Sort Key: v, id
         Presorted Key: v
         ->  Index Scan using test_v_idx on test  (cost=0.42..47604.41 rows=1000000 width=12)
(5 rows)

# explain select distinct id, v from test;
                                QUERY PLAN
---------------------------------------------------------------------------
 Unique  (cost=132154.34..139654.34 rows=1000000 width=12)
   ->  Sort  (cost=132154.34..134654.34 rows=1000000 width=12)
         Sort Key: id, v
         ->  Seq Scan on test  (cost=0.00..15406.00 rows=1000000 width=12)
(4 rows)

But it depends on attribute order. I could work out this case, but I would prefer to get some simpler case committed first. I already threw the merge join optimization away for the sake of simplicity.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
 
Attachment

Re: PoC: Partial sort

From
Alexander Korotkov
Date:
Hi!

Tom committed the upper planner pathification patch.
The partial sort patch rebased to master is attached.  It was quite a huge rebase in the planner part of the patch, but I think the patch now becomes better, much more logical.
It's probable that something was missed after the rebase.  I'm going to examine this patch more carefully next week.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: PoC: Partial sort

From
Peter Geoghegan
Date:
Hi,

On Tue, Mar 1, 2016 at 7:06 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> I finally went over your review.

I'll respond to your points here. Note that I'm reviewing
"partial-sort-basic-7.patch", which you sent on March 13. I respond
here because this is where you answered my questions (I had no
feedback on "partial-sort-basic-6.patch", which didn't use the new
upper planner pathification stuff, unlike this latest version).

> On Wed, Nov 4, 2015 at 4:44 AM, Peter Geoghegan <pg@heroku.com> wrote:
>>
>> Explain output
>> -------------------

>> I think it might be a good idea to also have a "Sort Groups: 2" field
>> above. That illustrates that you are in fact performing 2 small sorts
>> per group, which is the reality. As you said, it's good to have this
>> be high, because then the sort operations don't need to do too many
>> comparisons, which could be expensive.
>
>
> I agree with your notes. In the attached version of path explain output was
> revised as you proposed.

Cool.

>> Sort Method
>> ----------------
>>
>> Even thought the explain analyze above shows "top-N heapsort" as its
>> sort method, that isn't really true. I actually ran this through a
>> debugger, which is why the above plan took so long to execute, in case
>> you wondered. I saw that in practice the first sort executed for the
>> first group uses a quicksort, while only the second sort (needed for
>> the 2 and last group in this example) used a top-N heapsort.

> With partial sort we run multiple sorts in the same node. Ideally, we need
> to provide some aggregated information over runs.
> This situation looks very similar to subplan which is called multiple times.
> I checked how it works for now.

Noticed this in nodeSort.c:

+       if (node->tuplesortstate != NULL)
+       {
+               tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+               node->groupsCount++;
+       }
+       else
+       {
+               /* Support structures for cmpSortSkipCols - already sorted columns */
+               if (skipCols)
+                       prepareSkipCols(plannode, node);

+               /*
+                * Only pass on remaining columns that are unsorted.  Skip
+                * abbreviated keys usage for partial sort.  We unlikely will
+                * have huge groups with partial sort.  Therefore usage of
+                * abbreviated keys would be likely a waste of time.
+                */
                tuplesortstate = tuplesort_begin_heap(tupDesc,

You should comment on which case is which, and put common case (no
skip cols) first. Similarly, the ExecSort() for(;;) should put the
common (non-partial) case first, which it does, but then the "first
tuple in partial sort" case first, then the "second or subsequent
partial sort" case last.

More comments here, please:

+typedef struct SkipKeyData
+{
+ FunctionCallInfoData fcinfo;
+ FmgrInfo flinfo;
+ OffsetNumber attno;
+} SkipKeyData;

(What's SkipKeyData?)

Also want comments for new SortState fields. SortState.prev is a
palloc()'d copy of tuple, which should be directly noted, as it is for
similar aggregate cases, etc.

Should you be more aggressive about freeing memory allocated for
SortState.prev tuples?

The new function cmpSortSkipCols() should say "Special case for
NULL-vs-NULL, else use standard comparison", or something. "Lets
pretend NULL is a value for implementation convenience" cases are
considered the exception, and are always noted as the exception.

> In the case of subplan explain analyze gives us just information about last
> subplan run. This makes me uneasy. From one side, it's probably OK that
> partial sort behaves like subplan while showing information just about last
> sort run. From the other side, we need some better solution for that in
> general case.

I see what you mean, but I wasn't so much complaining about that, as
complaining about the simple fact that we use a top-N heap sort *at
all*. This feels like the "limit" case is playing with partial sort
sub-sorts in a way that it shouldn't.

I see you have code like this to make this work:

+       /*
+        * Adjust bound_Done with number of tuples we've actually sorted.
+        */
+       if (node->bounded)
+       {
+               if (node->finished)
+                       node->bound_Done = node->bound;
+               else
+                       node->bound_Done = Min(node->bound,
+                                              node->bound_Done + nTuples);

But, why bother? Why not simply prevent tuplesort.c from ever using
the top-N heapsort method when it is called from nodeSort.c for a
partial sort (probably in the planner)?

Why, at a high level, does it make sense to pass down a limit to *any*
sort operation that makes up a partial sort, even the last? This seems
to be adding complexity without a benefit. A big advantage of top-N
heapsorts is that much less memory could be used, but this patch
already has the memory allocated that belonged to previous performsort
calls (mostly -- certainly has all those tuplesort.c memtuples
throughout, a major user of memory overall).  It's not going to be
very good at preventing work, except almost by accident because we
happen to have a limit up to just past the beginning of a smaller
partial sort group. I'd rather use quicksort, which is very versatile.
Top-N sorts make sense when sorting itself is the bottleneck, which it
probably won't be for a partial sort (that's the whole point).
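For what it's worth, a hedged sketch of that suggestion as it might look around the tuplesort setup in nodeSort.c (skipCols as the partial-sort indicator is an assumption taken from this patch):

/*
 * Only pass the bound down for an ordinary (non-partial) sort, so that the
 * per-group sorts of a partial sort always stay on the quicksort path
 * instead of switching to a top-N heapsort.
 */
if (node->bounded && plannode->skipCols == 0)
	tuplesort_set_bound(tuplesortstate,
						node->bound - node->bound_Done);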

If the sort method was very likely the same for every performsort
(quicksort), which it otherwise would be, then I'd care way way less
that that could be a bit misleading in EXPLAIN ANALYZE output, because
typically the last one would be "close enough". Although, this isn't
quite like your SubPlan example, because the Sort node isn't reported
as e.g. "SubPlan 1" by EXPLAIN.

I think that this has bugs for external sorts:

+void
+tuplesort_reset(Tuplesortstate *state)
+{
+       int i;
+
+       if (state->tapeset)
+               LogicalTapeSetClose(state->tapeset);
+
+       for (i = 0; i < state->memtupcount; i++)
+               free_sort_tuple(state, state->memtuples + i);
+
+       state->status = TSS_INITIAL;
+       state->memtupcount = 0;
+       state->boundUsed = false;
+       state->tapeset = NULL;
+       state->currentRun = 0;
+       state->result_tape = -1;
+       state->bounded = false;
+}

It's not okay to reset like this, especially not after the recent
commit 0011c0091, which could make this code unacceptably leak memory.
I realize that we really should never use an external sort here, but,
as you know, this is not the point.

So, I want to suggest that you use the regular code to destroy and
recreate a tuplesort in this case. Now, obviously that has some
significant disadvantages -- you want to reuse everything in the
common case when each sort is tiny. But you can still do that for that
very common case.

I think you need to use sortcontext memory context here on general
principle, even if current usage isn't broken by that.

Even if you get this right for external sorts once, it will break
again without anyone noticing until it's too late. Better to not rely
on it staying in sync, and find a way of having the standard
tuplesort.c initialization begin again.

Even though these free_sort_tuple() calls are still needed, you might
also call "MemoryContextReset(state->tuplecontext)" at the end. That
might prevent palloc() fragmentation when groups are of wildly
different sizes. Just an idea.

>> I don't like that you've added a Plan node argument to
>> ExecMaterializesOutput() in this function, too.
>
>
> I don't like this too. But I didn't find better solution without significant
> rework of planner.
> However, "Upper planner pathification" by Tom Lane seems to have such
> rework. It's likely sort becomes separate path node there.
> Then ExecMaterializesOutput could read parameters of path node.

A tuplesort may be randomAccess, or !randomAccess, as the caller
wishes. It's good for performance if the caller does not want
randomAccess, because then we can do our final merge on-the-fly if
it's an external sort, which helps a lot.

How is this different? ExecMaterializesOutput() seems to be about
whether or not the plan *could* materialize its output in principle,
even though you might well want to make it not do so in specific
cases. So, it's not so much that the new argument is ugly; rather, I
worry that it's wrong to make ExecMaterializesOutput() give a more
specific answer than anticipated by current callers.

Is the difference basically just that a partial sort could be
enormously faster, whereas a !randomAccess conventional sort is nice
to have, but not worth e.g. changing cost_sort() to account for?

You might just make a new function, ExecPlanMaterializesOutput(),
instead. That would call ExecMaterializesOutput() for non-Sort cases.

>> Optimizer
>> -------------
>>

>> * cost_sort() needs way way more comments.  Doesn't even mention
>> indexes. Not worth commenting further on until I know what it's
>> *supposed* to do, though.
>
>
> I've added some comments.

Looking at cost_sort() now, it's a bit clearer. I think that you
should make sure that everything is costed as a quicksort, though, if
you accept that we should try and make every small sort done by the
partial sort a quicksort. Which I think is a good idea. The common
case is that groups are small, but the qsort() insertion sort will be
very very fast for that case.
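A hedged sketch of that per-group quicksort costing, reusing the comparison-cost model cost_sort() already applies to in-memory sorts (num_groups and group_tuples are assumed estimates, not existing variables):

/* Cost num_groups small in-memory quicksorts of roughly equal size. */
group_tuples = tuples / num_groups;
comparison_cost = 2.0 * cpu_operator_cost;
if (group_tuples > 1.0)
	run_cost = num_groups * comparison_cost * group_tuples * LOG2(group_tuples);
else
	run_cost = num_groups * comparison_cost;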

>> * New loop within get_cheapest_fractional_path_for_pathkeys() requires
>> far more explanation.
>>
>> Explain theory behind derivation of compare_bifractional_path_costs()
>> fraction arguments, please. I think there might be simple heuristics
>> like this elsewhere in the optimizer or selfuncs.c, but you need to
>> share why you did things that way in the code.
>
>
> Idea is that since partial sort fetches data per group then it would require
> fetching more data than fully presorted path.

I think I get it.

>> * Within planner.c, "partial_sort_path" variable name in
>> grouping_planner() might be called something else.
>>
>> Its purpose isn't clear. Also, the way that you mix path costs from
>> the new get_cheapest_fractional_path_for_pathkeys() into the new
>> cost_sort() needs to be explained in detail (as I already said,
>> cost_sort() is currently very under-documented).
>
>
> I've tried to make it more clear. partial_sort_path is renamed to
> presorted_path.

> Unique paths occasionally can use this optimization.

> But it depends on attribute order. I could work out this case, but I would
> prefer some simple case to commit before. I already throw merge join
> optimization away for the sake of simplicity.

I think that was the right decision under our time constraints.
However, I suggest noting that this should happen for unique paths in
the future, say within create_unique_path().

Other notes:

This looks like an old change you missed:

- * compare_path_fractional_costs
+ * compare_fractional_path_costs

All in all, this looks significantly better. Thanks for your work on
this. Sorry for the delay in my response, and that my review was
relatively rushed, but I'm rather busy at the moment with fighting
fires.

-- 
Peter Geoghegan



Re: PoC: Partial sort

From
David Steele
Date:
Hi Alexander,

On 3/23/16 8:39 PM, Peter Geoghegan wrote:

> This looks like an old change you missed:
>
> - * compare_path_fractional_costs
> + * compare_fractional_path_costs
>
> All in all, this looks significantly better. Thanks for your work on
> this. Sorry for the delay in my response, and that my review was
> relatively rushed, but I'm rather busy at the moment with fighting
> fires.

It looks like a new patch is required before this can be marked "ready 
for committer".  Will you have that ready soon?

Thanks,
-- 
-David
david@pgmasters.net



Re: PoC: Partial sort

From
Alexander Korotkov
Date:
On Tue, Mar 29, 2016 at 4:56 PM, David Steele <david@pgmasters.net> wrote:
On 3/23/16 8:39 PM, Peter Geoghegan wrote:

This looks like an old change you missed:

- * compare_path_fractional_costs
+ * compare_fractional_path_costs

All in all, this looks significantly better. Thanks for your work on
this. Sorry for the delay in my response, and that my review was
relatively rushed, but I'm rather busy at the moment with fighting
fires.

It looks like a new patch is required before this can be marked "ready for committer".  Will you have that ready soon?

Yes, that's right.  I'm working on it now.  I'm going to post it by tomorrow.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: PoC: Partial sort

From
Alexander Korotkov
Date:
Hi, Peter!

Thank you for review!

On Thu, Mar 24, 2016 at 3:39 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> Sort Method
>> ----------------
>>
>> Even thought the explain analyze above shows "top-N heapsort" as its
>> sort method, that isn't really true. I actually ran this through a
>> debugger, which is why the above plan took so long to execute, in case
>> you wondered. I saw that in practice the first sort executed for the
>> first group uses a quicksort, while only the second sort (needed for
>> the 2 and last group in this example) used a top-N heapsort.

> With partial sort we run multiple sorts in the same node. Ideally, we need
> to provide some aggregated information over runs.
> This situation looks very similar to subplan which is called multiple times.
> I checked how it works for now.

Noticed this in nodeSort.c:

+       if (node->tuplesortstate != NULL)
+       {
+               tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+               node->groupsCount++;
+       }
+       else
+       {
+               /* Support structures for cmpSortSkipCols - already sorted columns */
+               if (skipCols)
+                       prepareSkipCols(plannode, node);

+               /*
+                * Only pass on remaining columns that are unsorted.  Skip
+                * abbreviated keys usage for partial sort.  We unlikely will
+                * have huge groups with partial sort.  Therefore usage of
+                * abbreviated keys would be likely a waste of time.
+                */
                tuplesortstate = tuplesort_begin_heap(tupDesc,

You should comment on which case is which, and put common case (no
skip cols) first. Similarly, the ExecSort() for(;;) should put the
common (non-partial) case first, which it does, but then the "first
tuple in partial sort" case first, then the "second or subsequent
partial sort" case last.
 
Done.

More comments here, please:

+typedef struct SkipKeyData
+{
+ FunctionCallInfoData fcinfo;
+ FmgrInfo flinfo;
+ OffsetNumber attno;
+} SkipKeyData;

(What's SkipKeyData?)

Also want comments for new SortState fields.

Done.
 
SortState.prev is a
palloc()'d copy of tuple, which should be directly noted, as it is for
similar aggregate cases, etc.

Should you be more aggressive about freeing memory allocated for
SortState.prev tuples?

Fixed.
 
The new function cmpSortSkipCols() should say "Special case for
NULL-vs-NULL, else use standard comparison", or something. "Lets
pretend NULL is a value for implementation convenience" cases are
considered the exception, and are always noted as the exception.

Comment is added.
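For illustration, a hedged sketch of the kind of special case being described (this is not the patch's cmpSortSkipCols() body; the prepared equality fcinfo is assumed to come from the SkipKeyData setup quoted above):

/*
 * Group-boundary check for one already-sorted column: two NULLs compare as
 * equal purely for implementation convenience, a single NULL starts a new
 * group, otherwise the column's equality function decides.
 */
static bool
skip_col_equal(Datum d1, bool isnull1, Datum d2, bool isnull2,
			   FunctionCallInfo fcinfo)
{
	if (isnull1 && isnull2)
		return true;
	if (isnull1 || isnull2)
		return false;

	fcinfo->arg[0] = d1;
	fcinfo->arg[1] = d2;
	fcinfo->argnull[0] = false;
	fcinfo->argnull[1] = false;
	fcinfo->isnull = false;

	return DatumGetBool(FunctionCallInvoke(fcinfo));
}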
 
> In the case of subplan explain analyze gives us just information about last
> subplan run. This makes me uneasy. From one side, it's probably OK that
> partial sort behaves like subplan while showing information just about last
> sort run. From the other side, we need some better solution for that in
> general case.

I see what you mean, but I wasn't so much complaining about that, as
complaining about the simple fact that we use a top-N heap sort *at
all*. This feels like the "limit" case is playing with partial sort
sub-sorts in a way that it shouldn't.

I see you have code like this to make this work:

+       /*
+        * Adjust bound_Done with number of tuples we've actually sorted.
+        */
+       if (node->bounded)
+       {
+               if (node->finished)
+                       node->bound_Done = node->bound;
+               else
+                       node->bound_Done = Min(node->bound,
+                                              node->bound_Done + nTuples);

But, why bother? Why not simply prevent tuplesort.c from ever using
the top-N heapsort method when it is called from nodeSort.c for a
partial sort (probably in the planner)?

Why, at a high level, does it make sense to pass down a limit to *any*
sort operation that makes up a partial sort, even the last? This seems
to be adding complexity without a benefit. A big advantage of top-N
heapsorts is that much less memory could be used, but this patch
already has the memory allocated that belonged to previous performsort
calls (mostly -- certainly has all those tuplesort.c memtuples
throughout, a major user of memory overall).  It's not going to be
very good at preventing work, except almost by accident because we
happen to have a limit up to just past the beginning of a smaller
partial sort group. I'd rather use quicksort, which is very versatile.
Top-N sorts make sense when sorting itself is the bottleneck, which it
probably won't be for a partial sort (that's the whole point).

Hmm... I don't completely agree with that. In typical usage partial sort should definitely use quicksort.  However, a fallback to other sort methods is very useful.  The decision to use partial sort is made by the planner, but the planner makes mistakes.  For example, our HashAggregate is purely in-memory; in the case of a planner mistake it causes OOM.  I have met such a situation in production more than once.  This is why I'd like partial sort to have graceful degradation for such cases.

If the sort method was very likely the same for every performsort
(quicksort), which it otherwise would be, then I'd care way way less
that that could be a bit misleading in EXPLAIN ANALYZE output, because
typically the last one would be "close enough". Although, this isn't
quite like your SubPlan example, because the Sort node isn't reported
as e.g. "SubPlan 1" by EXPLAIN.

I've adjusted EXPLAIN ANALYZE to show the maximum resource consumption.
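A hedged sketch of how that might be tracked (the tuplesort_updatemax() name shows up in a later version of the patch; the body and field names here are assumptions, not the patch's actual code):

static void
tuplesort_updatemax(Tuplesortstate *state)
{
	int64		spaceUsed;
	bool		spaceOnDisk;

	/* Measure the resources used by the sort run that just finished. */
	if (state->tapeset)
	{
		spaceOnDisk = true;
		spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
	}
	else
	{
		spaceOnDisk = false;
		spaceUsed = state->allowedMem - state->availMem;
	}

	/* Remember the worst run seen so far, for EXPLAIN ANALYZE. */
	if (spaceUsed > state->maxSpace)
	{
		state->maxSpace = spaceUsed;
		state->maxSpaceOnDisk = spaceOnDisk;
		state->maxSpaceStatus = state->status;
	}
}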
 

I think that this has bugs for external sorts:

+void
+tuplesort_reset(Tuplesortstate *state)
+{
+       int i;
+
+       if (state->tapeset)
+               LogicalTapeSetClose(state->tapeset);
+
+       for (i = 0; i < state->memtupcount; i++)
+               free_sort_tuple(state, state->memtuples + i);
+
+       state->status = TSS_INITIAL;
+       state->memtupcount = 0;
+       state->boundUsed = false;
+       state->tapeset = NULL;
+       state->currentRun = 0;
+       state->result_tape = -1;
+       state->bounded = false;
+}

It's not okay to reset like this, especially not after the recent
commit 0011c0091, which could make this code unacceptably leak memory.
I realize that we really should never use an external sort here, but,
as you know, this is not the point.

So, I want to suggest that you use the regular code to destroy and
recreate a tuplesort in this case. Now, obviously that has some
significant disadvantages -- you want to reuse everything in the
common case when each sort is tiny. But you can still do that for that
very common case.

I think you need to use sortcontext memory context here on general
principle, even if current usage isn't broken by that.

Even if you get this right for external sorts once, it will break
again without anyone noticing until it's too late. Better to not rely
on it staying in sync, and find a way of having the standard
tuplesort.c initialization begin again.

Even though these free_sort_tuple() calls are still needed, you might
also call "MemoryContextReset(state->tuplecontext)" at the end. That
might prevent palloc() fragmentation when groups are of wildly
different sizes. Just an idea.

I tried to manage this by introducing another memory context which is persistent between partial sort batches.  Other memory contexts are reset.
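A hedged sketch of that arrangement (the maincontext name and the exact reset points are assumptions for illustration, not necessarily what the patch does):

/* Created once, alongside sortcontext, and kept across resets. */
state->maincontext = AllocSetContextCreate(CurrentMemoryContext,
										   "TupleSort main",
										   ALLOCSET_DEFAULT_MINSIZE,
										   ALLOCSET_DEFAULT_INITSIZE,
										   ALLOCSET_DEFAULT_MAXSIZE);

/* ... while tuplesort_reset() only clears the per-batch contexts ... */
MemoryContextReset(state->tuplecontext);
MemoryContextReset(state->sortcontext);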
 
>> I don't like that you've added a Plan node argument to
>> ExecMaterializesOutput() in this function, too.
>
>
> I don't like this too. But I didn't find better solution without significant
> rework of planner.
> However, "Upper planner pathification" by Tom Lane seems to have such
> rework. It's likely sort becomes separate path node there.
> Then ExecMaterializesOutput could read parameters of path node.

A tuplesort may be randomAccess, or !randomAccess, as the caller
wishes. It's good for performance if the caller does not want
randomAccess, because then we can do our final merge on-the-fly if
it's an external sort, which helps a lot.

How is this different? ExecMaterializesOutput() seems to be about
whether or not the plan *could* materialize its output in principle,
even though you might well want to make it not do so in specific
cases. So, it's not so much that the new argument is ugly; rather, I
worry that it's wrong to make ExecMaterializesOutput() give a more
specific answer than anticipated by current callers.

Is the difference basically just that a partial sort could be
enormously faster, whereas a !randomAccess conventional sort is nice
to have, but not worth e.g. changing cost_sort() to account for?

You might just make a new function, ExecPlanMaterializesOutput(),
instead. That would call ExecMaterializesOutput() for non-Sort cases.

I've added ExecPlanMaterializesOutput() function.
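A hedged sketch of the shape such a wrapper could take (skipCols comes from this patch; the exact signature and placement are assumptions):

/*
 * Like ExecMaterializesOutput(), but may look at the concrete plan node:
 * a Sort that keeps only the current group (skipCols > 0) does not
 * materialize its whole output.
 */
bool
ExecPlanMaterializesOutput(Plan *node)
{
	if (IsA(node, Sort))
		return ((Sort *) node)->skipCols == 0;

	return ExecMaterializesOutput(nodeTag(node));
}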
 
>> Optimizer
>> -------------
>>

>> * cost_sort() needs way way more comments.  Doesn't even mention
>> indexes. Not worth commenting further on until I know what it's
>> *supposed* to do, though.
>
>
> I've added some comments.

Looking at cost_sort() now, it's a bit clearer. I think that you
should make sure that everything is costed as a quicksort, though, if
you accept that we should try and make every small sort done by the
partial sort a quicksort. Which I think is a good idea. The common
case is that groups are small, but the qsort() insertion sort will be
very very fast for that case.

I'm not sure that for partial sort we should estimate the sorting of one group any differently than a simple sort does.  I see the following reasons:
1) I'd like partial sort to be able to use other sorting methods, to provide graceful degradation in the case of planner mistakes, as I pointed out above.
2) The planner shouldn't choose partial sort plans in some cases.  That should happen because of the higher cost of those plans, including the high cost of the partial sort itself.  Estimating the other sort methods looks like a good way of reporting such high costs.
 
This looks like an old change you missed:

- * compare_path_fractional_costs
+ * compare_fractional_path_costs

I think this is rather a typo fix, because currently the comment doesn't match the function name.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
 
Attachment

Re: PoC: Partial sort

From
Peter Geoghegan
Date:
On Wed, Mar 30, 2016 at 8:02 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> Hmm... I'm not completely agree with that. In typical usage partial sort
> should definitely use quicksort.  However, fallback to other sort methods is
> very useful.  Decision of partial sort usage is made by planner.  But
> planner makes mistakes.  For example, our HashAggregate is purely in-memory.
> In the case of planner mistake it causes OOM.  I met such situation in
> production and not once.  This is why I'd like partial sort to have graceful
> degradation for such cases.

I think that this should be moved to the next CF, unless a committer
wants to pick it up today.


-- 
Peter Geoghegan



Re: PoC: Partial sort

From
Alexander Korotkov
Date:
On Fri, Apr 8, 2016 at 10:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Wed, Mar 30, 2016 at 8:02 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> Hmm... I'm not completely agree with that. In typical usage partial sort
> should definitely use quicksort.  However, fallback to other sort methods is
> very useful.  Decision of partial sort usage is made by planner.  But
> planner makes mistakes.  For example, our HashAggregate is purely in-memory.
> In the case of planner mistake it causes OOM.  I met such situation in
> production and not once.  This is why I'd like partial sort to have graceful
> degradation for such cases.

I think that this should be moved to the next CF, unless a committer
wants to pick it up today.

Patch was rebased to current master.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: PoC: Partial sort

From
Michael Paquier
Date:
On Tue, Sep 13, 2016 at 5:32 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> On Fri, Apr 8, 2016 at 10:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>
>> On Wed, Mar 30, 2016 at 8:02 AM, Alexander Korotkov
>> <aekorotkov@gmail.com> wrote:
>> > Hmm... I'm not completely agree with that. In typical usage partial sort
>> > should definitely use quicksort.  However, fallback to other sort
>> > methods is
>> > very useful.  Decision of partial sort usage is made by planner.  But
>> > planner makes mistakes.  For example, our HashAggregate is purely
>> > in-memory.
>> > In the case of planner mistake it causes OOM.  I met such situation in
>> > production and not once.  This is why I'd like partial sort to have
>> > graceful
>> > degradation for such cases.
>>
>> I think that this should be moved to the next CF, unless a committer
>> wants to pick it up today.
>
>
> Patch was rebased to current master.

Applies on HEAD at e8bdee27 and passes make-check, now I am seeing
zero documentation so it is a bit hard to see what this patch is
achieving without reading the thread.

$ git diff master --check
src/backend/optimizer/prep/prepunion.c:967: trailing whitespace.
+   cost_sort(&sorted_p, root, NIL, 0,
src/backend/utils/sort/tuplesort.c:1244: trailing whitespace.
+ * tuplesort_updatemax

+ * Returns true if the plan node isautomatically materializes its output
Typo here.

Still, this has received no reviews, so moved to next CF.
-- 
Michael



Re: PoC: Partial sort

From
Haribabu Kommi
Date:
Hi Peter,
 

This is a gentle reminder.

You are assigned as a reviewer of the current patch in the 11-2016 commitfest,
but you haven't shared your review in this commitfest on the latest
patch posted by the author. If you don't have any comments on the patch,
please move the patch into the "ready for committer" state to get a committer's
attention. This will help us in the smoother operation of the commitfest.

Please ignore this if you have already shared your review.
 
Regards,
Hari Babu
Fujitsu Australia

Re: PoC: Partial sort

From
Robert Haas
Date:
On Tue, Sep 13, 2016 at 4:32 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> On Fri, Apr 8, 2016 at 10:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Wed, Mar 30, 2016 at 8:02 AM, Alexander Korotkov
>> <aekorotkov@gmail.com> wrote:
>> > Hmm... I'm not completely agree with that. In typical usage partial sort
>> > should definitely use quicksort.  However, fallback to other sort
>> > methods is
>> > very useful.  Decision of partial sort usage is made by planner.  But
>> > planner makes mistakes.  For example, our HashAggregate is purely
>> > in-memory.
>> > In the case of planner mistake it causes OOM.  I met such situation in
>> > production and not once.  This is why I'd like partial sort to have
>> > graceful
>> > degradation for such cases.
>>
>> I think that this should be moved to the next CF, unless a committer
>> wants to pick it up today.
>
> Patch was rebased to current master.

Just a few quick observations on this...

It strikes me that the API contract change in ExecMaterializesOutput
is pretty undesirable.  I think it would be better to have a new
executor node for this node rather than overloading the existing
"Sort" node, sharing code where possible of course.  The fact that
this would distinguish them more clearly in an EXPLAIN plan seems
good, too.  "Partial Sort" is the obvious thing, but there might be
even better alternatives -- maybe "Incremental Sort" or something like
that?  Because it's not partially sorting the data, it's making data
that already has some sort order have a more rigorous sort order.

I think that it will probably be pretty common to have queries where
the data is sorted by (mostly_unique_col) and we want to get it sorted
by (mostly_unique_col, disambiguation_col).  In such cases I wonder if
we'll incur a lot of overhead by feeding single tuples to the
tuplesort stuff and performing lots of 1-item sorts.  Not sure if that
case needs any special optimization.

I also think that the "HeapTuple prev" bit in SortState is probably
not the right way of doing things.  I think that should use an
additional TupleTableSlot rather than a HeapTuple.
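A hedged sketch of what using a slot instead could look like (the prevSlot field and the call sites are assumptions for illustration, not the patch's code):

/* At node initialization: one extra slot with the sort's tuple descriptor. */
sortstate->prevSlot = ExecInitExtraTupleSlot(estate);
ExecSetSlotDescriptor(sortstate->prevSlot, tupDesc);

/* At each group boundary: remember the last tuple by copying the slot. */
ExecCopySlot(sortstate->prevSlot, slot);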

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PoC: Partial sort

From
Peter Geoghegan
Date:
On Mon, Nov 21, 2016 at 11:04 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> you assigned as reviewer to the current patch in the 11-2016 commitfest.
> But you haven't shared your review yet in this commitfest on the latest
> patch posted by the author. If you don't have any comments on the patch,
> please move the patch into "ready for committer" state to get committer's
> attention. This will help us in smoother operation of commitfest.

Sorry for the delay on this.

I agree with Robert's remarks today on TupleTableSlot, and would like
to see a revision that does that.

-- 
Peter Geoghegan



Re: PoC: Partial sort

From
Haribabu Kommi
Date:


On Fri, Dec 2, 2016 at 4:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Sep 13, 2016 at 4:32 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> On Fri, Apr 8, 2016 at 10:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Wed, Mar 30, 2016 at 8:02 AM, Alexander Korotkov
>> <aekorotkov@gmail.com> wrote:
>> > Hmm... I'm not completely agree with that. In typical usage partial sort
>> > should definitely use quicksort.  However, fallback to other sort
>> > methods is
>> > very useful.  Decision of partial sort usage is made by planner.  But
>> > planner makes mistakes.  For example, our HashAggregate is purely
>> > in-memory.
>> > In the case of planner mistake it causes OOM.  I met such situation in
>> > production and not once.  This is why I'd like partial sort to have
>> > graceful
>> > degradation for such cases.
>>
>> I think that this should be moved to the next CF, unless a committer
>> wants to pick it up today.
>
> Patch was rebased to current master.

Just a few quick observations on this...

It strikes me that the API contract change in ExecMaterializesOutput
is pretty undesirable.  I think it would be better to have a new
executor node for this node rather than overloading the existing
"Sort" node, sharing code where possible of course.  The fact that
this would distinguish them more clearly in an EXPLAIN plan seems
good, too.  "Partial Sort" is the obvious thing, but there might be
even better alternatives -- maybe "Incremental Sort" or something like
that?  Because it's not partially sorting the data, it's making data
that already has some sort order have a more rigorous sort order.

I think that it will probably be pretty common to have queries where
the data is sorted by (mostly_unique_col) and we want to get it sorted
by (mostly_unique_col, disambiguation_col).  In such cases I wonder if
we'll incur a lot of overhead by feeding single tuples to the
tuplesort stuff and performing lots of 1-item sorts.  Not sure if that
case needs any special optimization.

I also think that the "HeapTuple prev" bit in SortState is probably
not the right way of doing things.  I think that should use an
additional TupleTableSlot rather than a HeapTuple.


The feedback from the reviewer was received at the end of the commitfest,
so moved to the next CF with "waiting on author" status.

 
Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] PoC: Partial sort

From
Michael Paquier
Date:
On Mon, Dec 5, 2016 at 2:04 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
>
> On Fri, Dec 2, 2016 at 4:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Tue, Sep 13, 2016 at 4:32 AM, Alexander Korotkov
>> <aekorotkov@gmail.com> wrote:
>> > On Fri, Apr 8, 2016 at 10:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> >> On Wed, Mar 30, 2016 at 8:02 AM, Alexander Korotkov
>> >> <aekorotkov@gmail.com> wrote:
>> >> > Hmm... I'm not completely agree with that. In typical usage partial
>> >> > sort
>> >> > should definitely use quicksort.  However, fallback to other sort
>> >> > methods is
>> >> > very useful.  Decision of partial sort usage is made by planner.  But
>> >> > planner makes mistakes.  For example, our HashAggregate is purely
>> >> > in-memory.
>> >> > In the case of planner mistake it causes OOM.  I met such situation
>> >> > in
>> >> > production and not once.  This is why I'd like partial sort to have
>> >> > graceful
>> >> > degradation for such cases.
>> >>
>> >> I think that this should be moved to the next CF, unless a committer
>> >> wants to pick it up today.
>> >
>> > Patch was rebased to current master.
>>
>> Just a few quick observations on this...
>>
>> It strikes me that the API contract change in ExecMaterializesOutput
>> is pretty undesirable.  I think it would be better to have a new
>> executor node for this node rather than overloading the existing
>> "Sort" node, sharing code where possible of course.  The fact that
>> this would distinguish them more clearly in an EXPLAIN plan seems
>> good, too.  "Partial Sort" is the obvious thing, but there might be
>> even better alternatives -- maybe "Incremental Sort" or something like
>> that?  Because it's not partially sorting the data, it's making data
>> that already has some sort order have a more rigorous sort order.
>>
>> I think that it will probably be pretty common to have queries where
>> the data is sorted by (mostly_unique_col) and we want to get it sorted
>> by (mostly_unique_col, disambiguation_col).  In such cases I wonder if
>> we'll incur a lot of overhead by feeding single tuples to the
>> tuplesort stuff and performing lots of 1-item sorts.  Not sure if that
>> case needs any special optimization.
>>
>> I also think that the "HeapTuple prev" bit in SortState is probably
>> not the right way of doing things.  I think that should use an
>> additional TupleTableSlot rather than a HeapTuple.
>>
>
> The feedback from the reviewer has received at the end of commitfest.
> So Moved to next CF with "waiting on author" status.

This patch is on its 6th commitfest now.  As the thread has died and
feedback has been provided but not answered, I am marking it as
returned with feedback.
-- 
Michael