Thread: [HACKERS] [PATCH] Incremental sort (was: PoC: Partial sort)

[HACKERS] [PATCH] Incremental sort (was: PoC: Partial sort)

From
Alexander Korotkov
Date:
Hi all!

I decided to start a new thread for this patch for the following two reasons.
 * It's renamed from "Partial sort" to "Incremental sort" per suggestion by Robert Haas [1].  The new name much better characterizes the essence of the algorithm.
 * I think it's not a PoC anymore.  The patch has received several rounds of review and is now in pretty good shape.

The attached revision of the patch has the following changes.
 * According to review [1], two new path and plan nodes are responsible for incremental sort: IncSortPath and IncSort, which inherit from SortPath and Sort correspondingly.  That allowed getting rid of a set of hacks with minimal code changes.
 * According to review [1] and comment [2], the previous tuple is stored in a standalone tuple slot of SortState rather than as a bare HeapTuple.
 * A new GUC parameter, enable_incsort, is introduced to control the planner's ability to choose incremental sort.
 * The postgres_fdw test with a non-pushed-down cross join is corrected.  It turned out that with incremental sort such a query becomes profitable to push down.  I changed the ORDER BY columns so that the index couldn't be used.  I think this solution is more elegant than setting enable_incsort = off.

The patch also has a set of assorted code and comment improvements.
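
To give an idea of what the renamed node does, here is a minimal sketch of the idea in Python (illustrative only, not the patch's C code; rows are represented as dicts keyed by column name, and the function name is mine).  The input already arrives sorted by a prefix of the required sort key, so tuples are accumulated into groups sharing the same prefix values, and each group is sorted separately by the full key:

def incremental_sort(rows, presorted_keys, full_keys):
    """Sort rows by full_keys, exploiting that they already arrive sorted
    by presorted_keys (a prefix of full_keys).  The patch implements this
    in C on top of tuplesort, resetting it between groups."""
    prefix = lambda row: tuple(row[k] for k in presorted_keys)
    sort_key = lambda row: tuple(row[k] for k in full_keys)
    group = []
    for row in rows:
        if group and prefix(row) != prefix(group[0]):
            # The presorted prefix changed: the current group is complete,
            # so sort it by the remaining columns and emit it.
            yield from sorted(group, key=sort_key)
            group = []
        group.append(row)
    if group:
        yield from sorted(group, key=sort_key)

# With input already ordered by "a" (e.g. from an index scan), this yields
# output ordered by (a, c) while only ever sorting one group at a time.
rows = [{"a": 1, "c": 2}, {"a": 1, "c": 1}, {"a": 2, "c": 3}, {"a": 2, "c": 0}]
print(list(incremental_sort(rows, ["a"], ["a", "c"])))

This is what lets the executor sort many small groups instead of the whole input at once, and start returning tuples before the whole input has been read.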

Links

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: [HACKERS] [PATCH] Incremental sort (was: PoC: Partial sort)

From
Robert Haas
Date:
On Sat, Feb 18, 2017 at 4:01 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> I decided to start new thread for this patch for following two reasons.
>  * It's renamed from "Partial sort" to "Incremental sort" per suggestion by
> Robert Haas [1].  New name much better characterizes the essence of
> algorithm.
>  * I think it's not PoC anymore.  Patch received several rounds of review
> and now it's in the pretty good shape.
>
> Attached revision of patch has following changes.
>  * According to review [1], two new path and plan nodes are responsible for
> incremental sort: IncSortPath and IncSort which are inherited from SortPath
> and Sort correspondingly.  That allowed to get rid of set of hacks with
> minimal code changes.
>  * According to review [1] and comment [2], previous tuple is stored in
> standalone tuple slot of SortState rather than just HeapTuple.
>  * New GUC parameter enable_incsort is introduced to control planner ability
> to choose incremental sort.
>  * Test of postgres_fdw with not pushed down cross join is corrected.  It
> appeared that with incremental sort such query is profitable to push down.
> I changed ORDER BY columns so that index couldn't be used.  I think this
> solution is more elegant than setting enable_incsort = off.

I usually advocate for spelling things out instead of abbreviating, so
I guess I'll stay true to form here and suggest that abbreviating
incremental to inc doesn't seem like a great idea.  Is that sort
incrementing, incremental, incredible, incautious, or incorporated?

The first hunk in the patch, a change in the postgres_fdw regression
test output, looks an awful lot like a bug: now the query that
formerly returned various different numbers is returning all zeroes.
It might not actually be a bug, because you've also changed the test
query (not sure why), but anyway the new regression test output that
is all zeroes seems less useful for catching bugs in, say, the
ordering of the results than the old output where the different rows
were different.

I don't know of any existing cases where the same executor file is
responsible for executing more than 1 different type of executor node.
I was imagining a more-complete separation of the new executor node.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Incremental sort (was: PoC: Partial sort)

From
Alexander Korotkov
Date:
On Sun, Feb 19, 2017 at 2:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Feb 18, 2017 at 4:01 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> I decided to start new thread for this patch for following two reasons.
>  * It's renamed from "Partial sort" to "Incremental sort" per suggestion by
> Robert Haas [1].  New name much better characterizes the essence of
> algorithm.
>  * I think it's not PoC anymore.  Patch received several rounds of review
> and now it's in the pretty good shape.
>
> Attached revision of patch has following changes.
>  * According to review [1], two new path and plan nodes are responsible for
> incremental sort: IncSortPath and IncSort which are inherited from SortPath
> and Sort correspondingly.  That allowed to get rid of set of hacks with
> minimal code changes.
>  * According to review [1] and comment [2], previous tuple is stored in
> standalone tuple slot of SortState rather than just HeapTuple.
>  * New GUC parameter enable_incsort is introduced to control planner ability
> to choose incremental sort.
>  * Test of postgres_fdw with not pushed down cross join is corrected.  It
> appeared that with incremental sort such query is profitable to push down.
> I changed ORDER BY columns so that index couldn't be used.  I think this
> solution is more elegant than setting enable_incsort = off.

I usually advocate for spelling things out instead of abbreviating, so
I guess I'll stay true to form here and suggest that abbreviating
incremental to inc doesn't seem like a great idea.  Is that sort
incrementing, incremental, incredible, incautious, or incorporated?

I'm not that sure about the naming of GUCs, because we already have enable_hashagg instead of enable_hashaggregate, enable_material instead of enable_materialize, and enable_nestloop instead of enable_nestedloop.  But anyway, I renamed "inc" to "Incremental" everywhere in the code.  I also renamed the enable_incsort GUC to enable_incrementalsort, since I don't have a strong opinion here.

The first hunk in the patch, a change in the postgres_fdw regression
test output, looks an awful lot like a bug: now the query that
formerly returned various different numbers is returning all zeroes.
It might not actually be a bug, because you've also changed the test
query (not sure why), but anyway the new regression test output that
is all zeroes seems less useful for catching bugs in, say, the
ordering of the results than the old output where the different rows
were different.

Yes, I've changed the regression test query, as I mentioned in the previous message.  With the incremental sort feature, the original query can't serve as an example of a non-pushed-down join anymore.  However, you're right that a query which returns all zeroes doesn't look good there either.  So, I changed that query to order by column "c3", which is actually a non-indexed textual representation of "c1".
 
I don't know of any existing cases where the same executor file is
responsible for executing more than 1 different type of executor node.
I was imagining a more-complete separation of the new executor node.

OK, I put incremental sort into a separate executor node.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: [HACKERS] [PATCH] Incremental sort (was: PoC: Partial sort)

From
Mithun Cy
Date:
On Mon, Feb 27, 2017 at 8:29 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

This patch needs to be rebased.

1. It fails while applying as below

patching file src/test/regress/expected/sysviews.out
Hunk #1 FAILED at 70.
1 out of 1 hunk FAILED -- saving rejects to file
src/test/regress/expected/sysviews.out.rej
patching file src/test/regress/sql/inherit.sql

2. Also, there are compilation errors due to new commits.

-fwrapv -fexcess-precision=standard -O2 -I../../../../src/include
-D_GNU_SOURCE   -c -o createplan.o createplan.c
createplan.c: In function ‘create_gather_merge_plan’:
createplan.c:1510:11: warning: passing argument 3 of ‘make_sort’ makes
integer from pointer without a cast [enabled by default]
           gm_plan->nullsFirst);
           ^
createplan.c:235:14: note: expected ‘int’ but argument is of type ‘AttrNumber *’
 static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
              ^
createplan.c:1510:11: warning: passing argument 4 of ‘make_sort’ from
incompatible pointer type [enabled by default]
           gm_plan->nullsFirst);

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] [PATCH] Incremental sort (was: PoC: Partial sort)

From
Alexander Korotkov
Date:
Dear Mithun,

On Mon, Mar 20, 2017 at 10:01 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
On Mon, Feb 27, 2017 at 8:29 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

This patch needs to be rebased.

1. It fails while applying as below

patching file src/test/regress/expected/sysviews.out
Hunk #1 FAILED at 70.
1 out of 1 hunk FAILED -- saving rejects to file
src/test/regress/expected/sysviews.out.rej
patching file src/test/regress/sql/inherit.sql

2. Also, there are compilation errors due to new commits.

-fwrapv -fexcess-precision=standard -O2 -I../../../../src/include
-D_GNU_SOURCE   -c -o createplan.o createplan.c
createplan.c: In function ‘create_gather_merge_plan’:
createplan.c:1510:11: warning: passing argument 3 of ‘make_sort’ makes
integer from pointer without a cast [enabled by default]
           gm_plan->nullsFirst);
           ^
createplan.c:235:14: note: expected ‘int’ but argument is of type ‘AttrNumber *’
 static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
              ^
createplan.c:1510:11: warning: passing argument 4 of ‘make_sort’ from
incompatible pointer type [enabled by default]
           gm_plan->nullsFirst);

Thank you for the report.
Please find the rebased patch attached.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
 
Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Heikki Linnakangas
Date:
On 03/20/2017 11:33 AM, Alexander Korotkov wrote:
> Please, find rebased patch in the attachment.

I had a quick look at this.

* I'd love to have an explanation of what an Incremental Sort is, in the 
file header comment for nodeIncrementalSort.c.

* I didn't understand the maxMem stuff in tuplesort.c. The comments 
there use the phrase "on-disk memory", which seems like an oxymoron. 
Also, "maximum status" seems weird, as it assumes that there's a natural 
order to the states.

* In the below example, the incremental sort is significantly slower 
than the Seq Scan + Sort you get otherwise:

create table sorttest (a int4, b int4, c int4);
insert into sorttest select g, g, g from generate_series(1, 1000000) g;
vacuum sorttest;
create index i_sorttest on sorttest (a, b, c);
set work_mem='100MB';


postgres=# explain select count(*) from (select * from sorttest order by a, c) as t;
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Aggregate  (cost=138655.68..138655.69 rows=1 width=8)
   ->  Incremental Sort  (cost=610.99..124870.38 rows=1102824 width=12)
         Sort Key: sorttest.a, sorttest.c
         Presorted Key: sorttest.a
         ->  Index Only Scan using i_sorttest on sorttest  (cost=0.43..53578.79 rows=1102824 width=12)
(5 rows)

Time: 0.409 ms
postgres=# select count(*) from (select * from sorttest order by a, c) as t;
  count
---------
 1000000
(1 row)

Time: 387.091 ms


postgres=# explain select count(*) from (select * from sorttest order by a, c) as t;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Aggregate  (cost=130063.84..130063.85 rows=1 width=8)
   ->  Sort  (cost=115063.84..117563.84 rows=1000000 width=12)
         Sort Key: sorttest.a, sorttest.c
         ->  Seq Scan on sorttest  (cost=0.00..15406.00 rows=1000000 width=12)
(4 rows)

Time: 0.345 ms
postgres=# select count(*) from (select * from sorttest order by a, c) as t;
  count
---------
 1000000
(1 row)

Time: 231.668 ms

According to 'perf', 85% of the CPU time is spent in ExecCopySlot(). To
alleviate that, it might be worthwhile to add a special case for when
the group contains exactly one tuple, and not put the tuple into the
tuplesort in that case. Or if we cannot ensure that the Incremental Sort
is actually faster, the cost model should probably be smarter, to avoid
picking an incremental sort when it's not a win.

- Heikki




Re: [PATCH] Incremental sort

From
David Steele
Date:
Hi Alexander,

On 3/20/17 10:19 AM, Heikki Linnakangas wrote:
> On 03/20/2017 11:33 AM, Alexander Korotkov wrote:
>> Please, find rebased patch in the attachment.
>
> I had a quick look at this.

<...>

> According to 'perf', 85% of the CPU time is spent in ExecCopySlot(). To
> alleviate that, it might be worthwhile to add a special case for when
> the group contains exactly one group, and not put the tuple to the
> tuplesort in that case. Or if we cannot ensure that the Incremental Sort
> is actually faster, the cost model should probably be smarter, to avoid
> picking an incremental sort when it's not a win.

This thread has been idle for over a week.  Please respond with a new 
patch by 2017-03-30 00:00 AoE (UTC-12) or this submission will be marked 
"Returned with Feedback".

-- 
-David
david@pgmasters.net



Re: [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Tue, Mar 28, 2017 at 5:27 PM, David Steele <david@pgmasters.net> wrote:
Hi Alexander,

On 3/20/17 10:19 AM, Heikki Linnakangas wrote:
On 03/20/2017 11:33 AM, Alexander Korotkov wrote:
Please, find rebased patch in the attachment.

I had a quick look at this.

<...>

According to 'perf', 85% of the CPU time is spent in ExecCopySlot(). To
alleviate that, it might be worthwhile to add a special case for when
the group contains exactly one group, and not put the tuple to the
tuplesort in that case. Or if we cannot ensure that the Incremental Sort
is actually faster, the cost model should probably be smarter, to avoid
picking an incremental sort when it's not a win.

This thread has been idle for over a week.  Please respond with a new patch by 2017-03-30 00:00 AoE (UTC-12) or this submission will be marked "Returned with Feedback".

Thank you for the reminder!

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Mon, Mar 20, 2017 at 5:19 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 03/20/2017 11:33 AM, Alexander Korotkov wrote:
Please, find rebased patch in the attachment.

I had a quick look at this.

* I'd love to have an explanation of what an Incremental Sort is, in the file header comment for nodeIncrementalSort.c.
 
Done.

* I didn't understand the maxMem stuff in tuplesort.c. The comments there use the phrase "on-disk memory", which seems like an oxymoron. Also, "maximum status" seems weird, as it assumes that there's a natural order to the states.
 
Variables were renamed.

* In the below example, the incremental sort is significantly slower than the Seq Scan + Sort you get otherwise:

create table sorttest (a int4, b int4, c int4);
insert into sorttest select g, g, g from generate_series(1, 1000000) g;
vacuum sorttest;
create index i_sorttest on sorttest (a, b, c);
set work_mem='100MB';


postgres=# explain select count(*) from (select * from sorttest order by a, c) as t;
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Aggregate  (cost=138655.68..138655.69 rows=1 width=8)
   ->  Incremental Sort  (cost=610.99..124870.38 rows=1102824 width=12)
         Sort Key: sorttest.a, sorttest.c
         Presorted Key: sorttest.a
         ->  Index Only Scan using i_sorttest on sorttest (cost=0.43..53578.79 rows=1102824 width=12)
(5 rows)

Time: 0.409 ms
postgres=# select count(*) from (select * from sorttest order by a, c) as t;
  count
---------
 1000000
(1 row)

Time: 387.091 ms


postgres=# explain select count(*) from (select * from sorttest order by a, c) as t;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Aggregate  (cost=130063.84..130063.85 rows=1 width=8)
   ->  Sort  (cost=115063.84..117563.84 rows=1000000 width=12)
         Sort Key: sorttest.a, sorttest.c
         ->  Seq Scan on sorttest  (cost=0.00..15406.00 rows=1000000 width=12)
(4 rows)

Time: 0.345 ms
postgres=# select count(*) from (select * from sorttest order by a, c) as t;
  count
---------
 1000000
(1 row)

Time: 231.668 ms

According to 'perf', 85% of the CPU time is spent in ExecCopySlot(). To alleviate that, it might be worthwhile to add a special case for when the group contains exactly one group, and not put the tuple to the tuplesort in that case.

I'm not sure we should do such an optimization for one tuple per group, since the situation is similar with 2 or 3 tuples per group.
 
Or if we cannot ensure that the Incremental Sort is actually faster, the cost model should probably be smarter, to avoid picking an incremental sort when it's not a win.

I added extra costing for incremental sort to cost_sort(): the cost of extra tuple copying and comparing, as well as the cost of the tuplesort reset.
The only problem is that I made the following estimate for the tuplesort reset:

run_cost += 10.0 * cpu_tuple_cost * num_groups;

It makes the ordinary sort be selected in your example, but it contains the constant 10, which is quite arbitrary.  It would be nice to avoid such hard-coded constants, but I don't know how we could calculate such a cost realistically.
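
To illustrate why such a per-group term matters, here is a toy comparison-count model in Python (purely illustrative, not the planner's actual cost function; the per-tuple and per-group constants are made up).  A full sort pays roughly N*log2(N), while an incremental sort pays the per-group sort costs plus per-tuple copy/compare overhead plus a per-group reset overhead:

import math

def full_sort_cost(n):
    # Rough comparison-count model for one big quicksort over n tuples.
    return n * math.log2(n)

def incremental_sort_cost(group_sizes, per_tuple=1.0, per_group_reset=10.0):
    # Per-group quicksorts, plus a per-tuple charge for copying and comparing
    # the presorted columns, plus a per-group charge for resetting the
    # tuplesort.  per_group_reset plays the role of the arbitrary
    # "10.0 * cpu_tuple_cost" term above.
    n = sum(group_sizes)
    sort_cost = sum(g * math.log2(g) for g in group_sizes if g > 1)
    return sort_cost + per_tuple * n + per_group_reset * len(group_sizes)

# 1,000,000 rows in 1,000 groups of 1,000 tuples: sorting small groups is much
# cheaper than one big sort, so incremental sort wins comfortably.
print(full_sort_cost(1_000_000), incremental_sort_cost([1_000] * 1_000))

# 1,000,000 single-tuple groups (your example): almost all of the cost is
# copy/reset overhead, so the verdict depends entirely on how those overhead
# constants are calibrated, which is exactly the problem with the constant 10.
print(full_sort_cost(1_000_000), incremental_sort_cost([1] * 1_000_000))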

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: [PATCH] Incremental sort

From
Andres Freund
Date:
On 2017-03-29 00:17:02 +0300, Alexander Korotkov wrote:
> On Tue, Mar 28, 2017 at 5:27 PM, David Steele <david@pgmasters.net> wrote:
> 
> > Hi Alexander,
> >
> > On 3/20/17 10:19 AM, Heikki Linnakangas wrote:
> >
> >> On 03/20/2017 11:33 AM, Alexander Korotkov wrote:
> >>
> >>> Please, find rebased patch in the attachment.
> >>>
> >>
> >> I had a quick look at this.
> >>
> >
> > <...>
> >
> > According to 'perf', 85% of the CPU time is spent in ExecCopySlot(). To
> >> alleviate that, it might be worthwhile to add a special case for when
> >> the group contains exactly one group, and not put the tuple to the
> >> tuplesort in that case. Or if we cannot ensure that the Incremental Sort
> >> is actually faster, the cost model should probably be smarter, to avoid
> >> picking an incremental sort when it's not a win.
> >>
> >
> > This thread has been idle for over a week.  Please respond with a new
> > patch by 2017-03-30 00:00 AoE (UTC-12) or this submission will be marked
> > "Returned with Feedback".

> Thank you for reminder!

I've just done so.  Please resubmit once updated, it's a cool feature.

- Andres



Re: [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Mon, Apr 3, 2017 at 9:34 PM, Andres Freund <andres@anarazel.de> wrote:
On 2017-03-29 00:17:02 +0300, Alexander Korotkov wrote:
> On Tue, Mar 28, 2017 at 5:27 PM, David Steele <david@pgmasters.net> wrote:
> > On 3/20/17 10:19 AM, Heikki Linnakangas wrote:
> >
> >> On 03/20/2017 11:33 AM, Alexander Korotkov wrote:
> >>
> >>> Please, find rebased patch in the attachment.
> >>>
> >>
> >> I had a quick look at this.
> >>
> >
> > <...>
> >
> > According to 'perf', 85% of the CPU time is spent in ExecCopySlot(). To
> >> alleviate that, it might be worthwhile to add a special case for when
> >> the group contains exactly one group, and not put the tuple to the
> >> tuplesort in that case. Or if we cannot ensure that the Incremental Sort
> >> is actually faster, the cost model should probably be smarter, to avoid
> >> picking an incremental sort when it's not a win.
> >>
> >
> > This thread has been idle for over a week.  Please respond with a new
> > patch by 2017-03-30 00:00 AoE (UTC-12) or this submission will be marked
> > "Returned with Feedback".

> Thank you for reminder!

I've just done so.  Please resubmit once updated, it's a cool feature.

Thank you!
I already sent a version of the patch after David's reminder.
Please find the rebased patch attached.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: [PATCH] Incremental sort

From
Andres Freund
Date:

On April 3, 2017 12:03:56 PM PDT, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>On Mon, Apr 3, 2017 at 9:34 PM, Andres Freund <andres@anarazel.de>
>wrote:
>
>> On 2017-03-29 00:17:02 +0300, Alexander Korotkov wrote:
>> > On Tue, Mar 28, 2017 at 5:27 PM, David Steele <david@pgmasters.net>
>> wrote:
>> > > On 3/20/17 10:19 AM, Heikki Linnakangas wrote:
>> > >
>> > >> On 03/20/2017 11:33 AM, Alexander Korotkov wrote:
>> > >>
>> > >>> Please, find rebased patch in the attachment.
>> > >>>
>> > >>
>> > >> I had a quick look at this.
>> > >>
>> > >
>> > > <...>
>> > >
>> > > According to 'perf', 85% of the CPU time is spent in
>ExecCopySlot(). To
>> > >> alleviate that, it might be worthwhile to add a special case for
>when
>> > >> the group contains exactly one group, and not put the tuple to
>the
>> > >> tuplesort in that case. Or if we cannot ensure that the
>Incremental
>> Sort
>> > >> is actually faster, the cost model should probably be smarter,
>to
>> avoid
>> > >> picking an incremental sort when it's not a win.
>> > >>
>> > >
>> > > This thread has been idle for over a week.  Please respond with a
>new
>> > > patch by 2017-03-30 00:00 AoE (UTC-12) or this submission will be
>> marked
>> > > "Returned with Feedback".
>>
>> > Thank you for reminder!
>>
>> I've just done so.  Please resubmit once updated, it's a cool
>feature.
>>
>
>Thank you!
>I already sent version of patch after David's reminder.
>Please find rebased patch in the attachment.

Cool. I think that's still a bit late for v10?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Mon, Apr 3, 2017 at 10:05 PM, Andres Freund <andres@anarazel.de> wrote:
On April 3, 2017 12:03:56 PM PDT, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>On Mon, Apr 3, 2017 at 9:34 PM, Andres Freund <andres@anarazel.de>
>wrote:
>
>> On 2017-03-29 00:17:02 +0300, Alexander Korotkov wrote:
>> > On Tue, Mar 28, 2017 at 5:27 PM, David Steele <david@pgmasters.net>
>> wrote:
>> > > On 3/20/17 10:19 AM, Heikki Linnakangas wrote:
>> > >
>> > >> On 03/20/2017 11:33 AM, Alexander Korotkov wrote:
>> > >>
>> > >>> Please, find rebased patch in the attachment.
>> > >>>
>> > >>
>> > >> I had a quick look at this.
>> > >>
>> > >
>> > > <...>
>> > >
>> > > According to 'perf', 85% of the CPU time is spent in
>ExecCopySlot(). To
>> > >> alleviate that, it might be worthwhile to add a special case for
>when
>> > >> the group contains exactly one group, and not put the tuple to
>the
>> > >> tuplesort in that case. Or if we cannot ensure that the
>Incremental
>> Sort
>> > >> is actually faster, the cost model should probably be smarter,
>to
>> avoid
>> > >> picking an incremental sort when it's not a win.
>> > >>
>> > >
>> > > This thread has been idle for over a week.  Please respond with a
>new
>> > > patch by 2017-03-30 00:00 AoE (UTC-12) or this submission will be
>> marked
>> > > "Returned with Feedback".
>>
>> > Thank you for reminder!
>>
>> I've just done so.  Please resubmit once updated, it's a cool
>feature.
>>
>
>Thank you!
>I already sent version of patch after David's reminder.
>Please find rebased patch in the attachment.

Cool. I think that's still a bit late for v10?

I don't know.  ISTM that I've addressed all the issues raised by reviewers.
Also, this patch has been pending since late 2013.  It would be very nice to finally get it in...

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [PATCH] Incremental sort

From
Andres Freund
Date:
Hi,

On 2017-04-04 00:04:09 +0300, Alexander Korotkov wrote:
> > >Thank you!
> > >I already sent version of patch after David's reminder.
> > >Please find rebased patch in the attachment.
> >
> > Cool. I think that's still a bit late for v10?
> >
> 
> I don't know.  ISTM, that I addressed all the issues raised by reviewers.
> Also, this patch is pending since late 2013.  It would be very nice to
> finally get it in...

To me this hasn't gotten even remotely enough performance evaluation.
And I don't think it's fair to characterize it as pending since 2013,
given it was essentially "waiting on author" for most of that.

Greetings,

Andres Freund



Re: [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Tue, Apr 4, 2017 at 12:09 AM, Andres Freund <andres@anarazel.de> wrote:
On 2017-04-04 00:04:09 +0300, Alexander Korotkov wrote:
> > >Thank you!
> > >I already sent version of patch after David's reminder.
> > >Please find rebased patch in the attachment.
> >
> > Cool. I think that's still a bit late for v10?
> >
>
> I don't know.  ISTM, that I addressed all the issues raised by reviewers.
> Also, this patch is pending since late 2013.  It would be very nice to
> finally get it in...

To me this hasn't gotten even remotely enough performance evaluation.

I'm ready to put my effort into that.
 
And I don't think it's fair to characterize it as pending since 2013,

Probably, this duration isn't a good characteristic at all.
 
given it was essentially "waiting on author" for most of that.

What makes you think so?  Do you have some statistics?  Or is it just a random assumption?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [PATCH] Incremental sort

From
Robert Haas
Date:
On Mon, Apr 3, 2017 at 5:09 PM, Andres Freund <andres@anarazel.de> wrote:
> To me this hasn't gotten even remotely enough performance evaluation.
> And I don't think it's fair to characterize it as pending since 2013,
> given it was essentially "waiting on author" for most of that.

This is undeniably a patch which has been kicking around for a lot of
time without getting a lot of attention, and if it just keeps getting
punted down the road, it's never going to become committable.
Alexander's questions upthread about what decisions the committer who
took an interest (Heikki) would prefer never really got an answer, for
example.  I don't deny that there may be some work left to do here,
but I think blaming the author for a week's delay when this has been
ignored so often for so long is unfair.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PATCH] Incremental sort

From
Andres Freund
Date:
On 2017-04-03 22:18:21 -0400, Robert Haas wrote:
> On Mon, Apr 3, 2017 at 5:09 PM, Andres Freund <andres@anarazel.de> wrote:
> > To me this hasn't gotten even remotely enough performance evaluation.
> > And I don't think it's fair to characterize it as pending since 2013,
> > given it was essentially "waiting on author" for most of that.
> 
> This is undeniably a patch which has been kicking around for a lot of
> time without getting a lot of attention, and if it just keeps getting
> punted down the road, it's never going to become committable.

Indeed, it's old.  And it hasn't gotten enough timely feedback.

But I don't think the wait time can meaningfully be measured by
subtracting two dates:
The first version of the patch, as a PoC, has been posted 2013-12-14,
which then got a good amount of feedback & revisions, and then stalled
till 2014-07-12.  There, a few back-and-forths yielded a new version.
From 2014-09-15 till 2015-10-16 the patch stalled, waiting on its
author.  That version had open todos ([1]), as had the version from
2016-03-13 [2], which weren't addressed by 2016-03-30 - unfortunately that
was pretty much when the tree was frozen.  2016-09-13 a rebased patch
was sent, some minor points were raised 2016-10-02 (unaddressed), a
larger review was done 2016-12-01 ([5]), unaddressed till 2017-02-18.
At that point we're in this thread.

There's obviously some long waiting-on-author periods in there.  And
some long needs-review periods.


> Alexander's questions upthread about what decisions the committer who
> took an interest (Heikki) would prefer never really got an answer, for
> example.  I don't deny that there may be some work left to do here,
> but I think blaming the author for a week's delay when this has been
> ignored so often for so long is unfair.

I'm not trying to blame Alexander for a week's worth of delay, at all.
It's just that, well, we're past the original code-freeze date, three
days before the "final" code freeze. I don't think fairness is something
we can achieve at this point :(.  Given the risk of regressions -
demonstrated in this thread although partially addressed - and the very
limited amount of benchmarking done, it seems unlikely that this is
going to be merged.

Regards,

Andres


[1] http://archives.postgresql.org/message-id/CAPpHfdvhwMsG69exCRUGK3ms-ng0PSPcucH5FU6tAaM-qL-1%2Bw%40mail.gmail.com
[2] http://archives.postgresql.org/message-id/CAPpHfdvzjYGLTyA-8ib8UYnKLPrewd9Z%3DT4YJNCRWiHWHHweWw%40mail.gmail.com
[3] http://archives.postgresql.org/message-id/CAPpHfdtCcHZ-mLWzsFrRCvHpV1LPSaOGooMZ3sa40AkwR=7ouQ@mail.gmail.com
[4] http://archives.postgresql.org/message-id/CAPpHfdvj1Tdi2WA64ZbBp5-yG-uzaRXzk3K7J7zt-cRX6YSd0A@mail.gmail.com
[5] http://archives.postgresql.org/message-id/CA+TgmoZapyHRm7NVyuyZ+yAV=U1a070BOgRe7PkgyrAegR4JDA@mail.gmail.com
[6] http://archives.postgresql.org/message-id/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com



Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Wed, Mar 29, 2017 at 5:14 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
I added to cost_sort() extra costing for incremental sort: cost of extra tuple copying and comparing as well as cost of tuplesort reset.
The only problem is that I made following estimate for tuplesort reset:

run_cost += 10.0 * cpu_tuple_cost * num_groups;

It makes ordinal sort to be selected in your example, but it contains constant 10 which is quite arbitrary.  It would be nice to evade such hard coded constants, but I don't know how could we calculate such cost realistically.

That appears to be wrong.  I intended to make cost_sort() prefer plain sort over incremental sort for this dataset size.  But that appears not to always be the right solution.  Quicksort is so fast only on presorted data.
On my laptop I get the following numbers for the test case provided by Heikki.

Presorted data – very fast.

# explain select count(*) from (select * from sorttest order by a, c) as t;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Aggregate  (cost=147154.34..147154.35 rows=1 width=8)
   ->  Sort  (cost=132154.34..134654.34 rows=1000000 width=12)
         Sort Key: sorttest.a, sorttest.c
         ->  Seq Scan on sorttest  (cost=0.00..15406.00 rows=1000000 width=12)
(4 rows)

# select count(*) from (select * from sorttest order by a, c) as t;
  count
---------
 1000000
(1 row)

Time: 260,752 ms

Non-presorted data – not so fast.  It's actually slower than the incremental sort was.

# explain select count(*) from (select * from sorttest order by a desc, c desc) as t;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Aggregate  (cost=130063.84..130063.85 rows=1 width=8)
   ->  Sort  (cost=115063.84..117563.84 rows=1000000 width=12)
         Sort Key: sorttest.a DESC, sorttest.c DESC
         ->  Seq Scan on sorttest  (cost=0.00..15406.00 rows=1000000 width=12)
(4 rows)

# select count(*) from (select * from sorttest order by a desc, c desc) as t;
  count
---------
 1000000
(1 row)

Time: 416,207 ms

Thus, it would be nice to reflect the fact that our quicksort implementation is very fast on presorted data.  Fortunately, we have the corresponding statistics: STATISTIC_KIND_CORRELATION.  However, that should probably be the subject of a separate patch.

But I'd like to make incremental sort not slower than quicksort in the case of presorted data.  A new idea about it came to my mind: since the cause of incremental sort's slowness in this case is too frequent resetting of the tuplesort, what if we artificially put the data into larger groups?  The attached revision of the patch implements this: it doesn't stop accumulating tuples into the tuplesort until we have MIN_GROUP_SIZE tuples.
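
A minimal sketch of this variant in Python (illustrative only, not the patch's C code; the MIN_GROUP_SIZE value here is made up, and rows are dicts keyed by column name):

MIN_GROUP_SIZE = 32  # illustrative value; the real constant is defined in the patch

def incremental_sort_min_group(rows, presorted_keys, full_keys):
    """Keep buffering past a presorted-prefix change until at least
    MIN_GROUP_SIZE tuples are collected, so the tuplesort is reset far
    less often when the prefix values are nearly unique."""
    prefix = lambda row: tuple(row[k] for k in presorted_keys)
    sort_key = lambda row: tuple(row[k] for k in full_keys)
    buf = []
    for row in rows:
        # Cut the buffer only at a prefix boundary AND once it is big enough.
        # Sorting by the full key keeps the output order correct even when
        # several small prefix groups end up in one buffer.
        if len(buf) >= MIN_GROUP_SIZE and prefix(row) != prefix(buf[-1]):
            yield from sorted(buf, key=sort_key)
            buf = []
        buf.append(row)
    if buf:
        yield from sorted(buf, key=sort_key)

The price, as shown for the LIMIT case below, is that a few extra tuples may have to be fetched from the outer node before the first buffer can be emitted.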

# explain select count(*) from (select * from sorttest order by a, c) as t;
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Aggregate  (cost=85412.43..85412.43 rows=1 width=8)
   ->  Incremental Sort  (cost=0.46..72912.43 rows=1000000 width=12)
         Sort Key: sorttest.a, sorttest.c
         Presorted Key: sorttest.a
         ->  Index Only Scan using i_sorttest on sorttest  (cost=0.42..30412.42 rows=1000000 width=12)
(5 rows)

# select count(*) from (select * from sorttest order by a, c) as t;
  count
---------
 1000000
(1 row)

Time: 251,227 ms

# explain select count(*) from (select * from sorttest order by a desc, c desc) as t;
                                                   QUERY PLAN
────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Aggregate  (cost=85412.43..85412.43 rows=1 width=8)
   ->  Incremental Sort  (cost=0.46..72912.43 rows=1000000 width=12)
         Sort Key: sorttest.a DESC, sorttest.c DESC
         Presorted Key: sorttest.a
         ->  Index Only Scan Backward using i_sorttest on sorttest  (cost=0.42..30412.42 rows=1000000 width=12)
(5 rows)

# select count(*) from (select * from sorttest order by a desc, c desc) as t;
  count
---------
 1000000
(1 row)

Time: 253,270 ms

Now incremental sort is not slower than quicksort.  And this seems to be cool.
However, in the LIMIT case we will pay the price of fetching some extra tuples from the outer node.  But that doesn't seem to hurt us too much.

# explain select * from sorttest order by a, c limit 10;
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Limit  (cost=0.46..0.84 rows=10 width=12)
   ->  Incremental Sort  (cost=0.46..37500.78 rows=1000000 width=12)
         Sort Key: a, c
         Presorted Key: a
         ->  Index Only Scan using i_sorttest on sorttest  (cost=0.42..30412.42 rows=1000000 width=12)
(5 rows)

# select * from sorttest order by a, c limit 10;
 a  | b  | c
----+----+----
  1 |  1 |  1
  2 |  2 |  2
  3 |  3 |  3
  4 |  4 |  4
  5 |  5 |  5
  6 |  6 |  6
  7 |  7 |  7
  8 |  8 |  8
  9 |  9 |  9
 10 | 10 | 10
(10 rows)

Time: 0,903 ms

Any thoughts?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Peter Geoghegan
Date:
On Wed, Apr 26, 2017 at 8:39 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> That appears to be wrong.  I intended to make cost_sort prefer plain sort
> over incremental sort for this dataset size.  But, that appears to be not
> always right solution.  Quick sort is so fast only on presorted data.

As you may know, I've often said that the precheck for sorted input
added to our quicksort implementation by a3f0b3d is misguided. It
sometimes throws away a ton of work if the presorted input isn't
*perfectly* presorted. This happens when the first out of order tuple
is towards the end of the presorted input.

I think that it isn't fair to credit our qsort with doing so well on a
100% presorted case, because it doesn't do the necessary bookkeeping
to not throw that work away completely in certain important cases.

-- 
Peter Geoghegan




Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Wed, Apr 26, 2017 at 7:56 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Apr 26, 2017 at 8:39 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> That appears to be wrong.  I intended to make cost_sort prefer plain sort
> over incremental sort for this dataset size.  But, that appears to be not
> always right solution.  Quick sort is so fast only on presorted data.

As you may know, I've often said that the precheck for sorted input
added to our quicksort implementation by a3f0b3d is misguided. It
sometimes throws away a ton of work if the presorted input isn't
*perfectly* presorted. This happens when the first out of order tuple
is towards the end of the presorted input.

I think that it isn't fair to credit our qsort with doing so well on a
100% presorted case, because it doesn't do the necessary bookkeeping
to not throw that work away completely in certain important cases.

OK, I get it.  Our qsort is so fast not only in the 100% presorted case.
However, that doesn't change much in the context of incremental sort.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] [PATCH] Incremental sort

From
Peter Geoghegan
Date:
On Wed, Apr 26, 2017 at 10:10 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> OK, I get it.  Our qsort is so fast not only on 100% presorted case.
> However, that doesn't change many things in context of incremental sort.

The important point is to make any presorted test case only ~99%
presorted, so as to not give too much credit to the "high risk"
presort check optimization.

The switch to insertion sort that we left in (not the bad one removed
by a3f0b3d -- the insertion sort that actually comes from the B&M
paper) does "legitimately" make sorting faster with presorted cases.

-- 
Peter Geoghegan




Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Wed, Apr 26, 2017 at 8:20 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Apr 26, 2017 at 10:10 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> OK, I get it.  Our qsort is so fast not only on 100% presorted case.
> However, that doesn't change many things in context of incremental sort.

The important point is to make any presorted test case only ~99%
presorted, so as to not give too much credit to the "high risk"
presort check optimization.

The switch to insertion sort that we left in (not the bad one removed
by a3f0b3d -- the insertion sort that actually comes from the B&M
paper) does "legitimately" make sorting faster with presorted cases.

I'm still focusing on making incremental sort not slower than qsort with the presorted optimization, independently of whether this is a "high risk" optimization or not...
However, adding more test cases is always good.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] [PATCH] Incremental sort

From
Robert Haas
Date:
On Wed, Apr 26, 2017 at 11:39 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> But I'd like to make incremental sort not slower than quicksort in case of
> presorted data.  New idea about it comes to my mind.  Since cause of
> incremental sort slowness in this case is too frequent reset of tuplesort,
> then what if we would artificially put data in larger groups.  Attached
> revision of patch implements this: it doesn't stop to accumulate tuples to
> tuplesort until we have MIN_GROUP_SIZE tuples.
>
> Now, incremental sort is not slower than quicksort.  And this seems to be
> cool.
> However, in the LIMIT case we will pay the price of fetching some extra
> tuples from outer node.  But, that doesn't seem to hurt us too much.
>
> Any thoughts?

Nice idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Thu, Apr 27, 2017 at 5:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Apr 26, 2017 at 11:39 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> But I'd like to make incremental sort not slower than quicksort in case of
> presorted data.  New idea about it comes to my mind.  Since cause of
> incremental sort slowness in this case is too frequent reset of tuplesort,
> then what if we would artificially put data in larger groups.  Attached
> revision of patch implements this: it doesn't stop to accumulate tuples to
> tuplesort until we have MIN_GROUP_SIZE tuples.
>
> Now, incremental sort is not slower than quicksort.  And this seems to be
> cool.
> However, in the LIMIT case we will pay the price of fetching some extra
> tuples from outer node.  But, that doesn't seem to hurt us too much.
>
> Any thoughts?

Nice idea.


Cool.
Then I'm going to make a set of synthetic performance tests in order to ensure that there is no regression.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
 

Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Thu, Apr 27, 2017 at 5:23 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Thu, Apr 27, 2017 at 5:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Apr 26, 2017 at 11:39 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> But I'd like to make incremental sort not slower than quicksort in case of
> presorted data.  New idea about it comes to my mind.  Since cause of
> incremental sort slowness in this case is too frequent reset of tuplesort,
> then what if we would artificially put data in larger groups.  Attached
> revision of patch implements this: it doesn't stop to accumulate tuples to
> tuplesort until we have MIN_GROUP_SIZE tuples.
>
> Now, incremental sort is not slower than quicksort.  And this seems to be
> cool.
> However, in the LIMIT case we will pay the price of fetching some extra
> tuples from outer node.  But, that doesn't seem to hurt us too much.
>
> Any thoughts?

Nice idea.

Cool.
Than I'm going to make a set of synthetic performance tests in order to ensure that there is no regression.

The next revision of the patch is attached.
This revision contains one important optimization.  I found that it's not necessary to make every tuple go through the prevTuple slot.  It's enough to save a single sample tuple per sort group in order to compare the skip columns with it.  This optimization avoids the regression on large sort groups which I had observed.
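
A rough illustration of that optimization in Python (a sketch of the idea only, not the patch's C code; rows are dicts keyed by column name):

def group_starts(rows, skip_cols):
    """Flag the first row of each sort group by comparing the skip
    (presorted) columns against a single sample tuple saved per group,
    instead of copying every row into a previous-tuple slot, which is the
    copy that showed up under ExecCopySlot in the profiles."""
    same_group = lambda a, b: all(a[c] == b[c] for c in skip_cols)
    sample = None
    for row in rows:
        if sample is None or not same_group(row, sample):
            sample = row        # the only per-group "copy": one sample tuple
            yield True, row     # this row starts a new group
        else:
            yield False, row    # same group: nothing needs to be copied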

I'm also attaching the Python script (incsort_test.py) which I use for synthetic performance benchmarking.  This script runs benchmarks similar to the one posted by Heikki, but with some variations.  These benchmarks are aimed at checking whether there are cases where incremental sort is slower than plain sort.

This script generates tables with the structure described in the 'tables' array.  For generation of text values, the md5 function is used.  For the first GroupedCols table columns, groups of GroupSize equal values are generated.  Then there are columns whose values are just sequential.  The last column has a PreorderedFrac fraction of sequential values, and the rest of the values are random.  Therefore, we can measure the influence of the presorted optimization in qsort with various fractions of presorted data.  There is also a btree index which covers all the columns of the table.

The benchmark query selects the contents of the generated table ordered by the grouped columns and by the last column.  The index-only scan outputs tuples ordered by the grouped columns, and incremental sort has to perform sorting inside those groups.  The plain sort case is forced to also use index-only scans, in order to compare sort methods rather than scan methods.

Results are also attached (results.csv).  The last column contains the difference between incremental and plain sort time in percent.  A negative value means that incremental sort is faster in that case.

Incremental sort is faster in the vast majority of cases.  It appears to be slower only when the whole dataset is one sort group.  In this case incremental sort is useless, and it should be considered a misuse of incremental sort.  The slowdown is related to the fact that we have to do the extra comparisons anyway, unless we somehow push our comparison result into qsort itself and save some CPU cycles (but that would be an unreasonable break of encapsulation).  Thus, in such cases a regression seems to be inevitable anyway.  I think we could avoid this regression during query planning: if we see that there would be only a few groups, we should choose plain sort instead of incremental sort.

Any thoughts?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Robert Haas
Date:
On Fri, May 5, 2017 at 11:13 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Incremental sort is faster in vast majority of cases.  It appears to be
> slower only when whose dataset is one sort group.  In this case incremental
> sort is useless, and it should be considered as misuse of incremental sort.
> Slowdown is related to the fact that we anyway have to do extra comparisons,
> unless we somehow push our comparison result into qsort itself and save some
> cpu cycles (but that would be unreasonable break of encapsulation).  Thus,
> in such cases regression seems to be inevitable anyway.  I think we could
> evade this regression during query planning.  If we see that there would be
> only few groups, we should choose plain sort instead of incremental sort.

I'm sorry that I don't have time to review this in detail right now,
but it sounds like you are doing good work to file down cases where
this might cause regressions, which is great.  Regarding the point in
the paragraph above, I'd say that it's OK for the planner to be
responsible for picking between Sort and Incremental Sort in some way.
It is, after all, the planner's job to decide between different
strategies for executing the same query and, of course, sometimes it
will be wrong, but that's OK as long as it's not wrong too often (or
by too much, hopefully).  It may be a little difficult to get this
right, though, because I'm not sure that the information you need
actually exists (or is reliable).  For example, consider the case
where we need to sort 100m rows and there are 2 groups.  If 1 group
contains 1 row and the other group contains all of the rest, there is
really no point in an incremental sort.  On the other hand, if each
group contains 50m rows and we can get the data presorted by the
grouping column, there might be a lot of point to an incremental sort,
because two 50m-row sorts might be a lot cheaper than one 100m sort.
More generally, it's quite easy to imagine situations where the
individual groups can be quicksorted but sorting all of the rows
requires I/O, even when the number of groups isn't that big.  On the
other hand, the real sweet spot for this is probably the case where
the number of groups is very large, with many single-row groups or
many groups with just a few rows each, so if we can at least get this
to work in those cases that may be good enough.  On the third hand,
when costing aggregation, I think we often underestimate the number of
groups and there might well be similar problems here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Mon, May 8, 2017 at 6:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, May 5, 2017 at 11:13 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Incremental sort is faster in vast majority of cases.  It appears to be
> slower only when whose dataset is one sort group.  In this case incremental
> sort is useless, and it should be considered as misuse of incremental sort.
> Slowdown is related to the fact that we anyway have to do extra comparisons,
> unless we somehow push our comparison result into qsort itself and save some
> cpu cycles (but that would be unreasonable break of encapsulation).  Thus,
> in such cases regression seems to be inevitable anyway.  I think we could
> evade this regression during query planning.  If we see that there would be
> only few groups, we should choose plain sort instead of incremental sort.

I'm sorry that I don't have time to review this in detail right now,
but it sounds like you are doing good work to file down cases where
this might cause regressions, which is great.

Thank you for paying attention to this patch!
 
Regarding the point in
the paragraph above, I'd say that it's OK for the planner to be
responsible for picking between Sort and Incremental Sort in some way.
It is, after all, the planner's job to decide between different
strategies for executing the same query and, of course, sometimes it
will be wrong, but that's OK as long as it's not wrong too often (or
by too much, hopefully).

Right, I agree.
 
It may be a little difficult to get this
right, though, because I'm not sure that the information you need
actually exists (or is reliable).  For example, consider the case
where we need to sort 100m rows and there are 2 groups.  If 1 group
contains 1 row and the other group contains all of the rest, there is
really no point in an incremental sort.  On the other hand, if each
group contains 50m rows and we can get the data presorted by the
grouping column, there might be a lot of point to an incremental sort,
because two 50m-row sorts might be a lot cheaper than one 100m sort.
More generally, it's quite easy to imagine situations where the
individual groups can be quicksorted but sorting all of the rows
requires I/O, even when the number of groups isn't that big.  On the
other hand, the real sweet spot for this is probably the case where
the number of groups is very large, with many single-row groups or
many groups with just a few rows each, so if we can at least get this
to work in those cases that may be good enough.  On the third hand,
when costing aggregation, I think we often underestimate the number of
groups and there might well be similar problems here.

I agree with that.  I need to test this patch more carefully in the case when groups have different sizes.  It's likely I need to add yet another parameter to my testing script: group size skew.

A patch rebased to the current master is attached.  I'm going to improve my testing script and post new results.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Thu, Sep 14, 2017 at 2:48 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Patch rebased to current master is attached.  I'm going to improve my testing script and post new results. 

A new benchmarking script and results are attached.  A new dataset parameter is introduced there: the skew factor.  The skew factor defines the skew in the distribution of group sizes.
My idea for generating it is just to use a power function where the power is between 0 and 1.  The following formula is used to get the group number for a particular item number i (where number_of_indexes is the total number of rows, i.e. index entries):
group_number(i) = floor(((i / number_of_indexes) ^ power) * number_of_groups)
For example, power = 1/6 gives following distribution of groups sizes:
group number    group size
0               2
1               63
2               665
3               3367
4               11529
5               31031
6               70993
7               144495
8               269297
9               468558

For convenience, instead of the power itself, I use a skew factor where power = 1.0 / (1.0 + skew).  Therefore, with skew = 0.0, the distribution of group sizes is uniform.  A larger skew gives a more skewed distribution (and that seems to be quite intuitive).  For negative skew, group sizes are mirrored relative to the corresponding positive skew.  For example, skew factor = -5.0 gives the following distribution of group sizes:
group number    group size
0               468558
1               269297
2               144495
3               70993
4               31031
5               11529
6               3367
7               665
8               63
9               2
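
For reference, here is a small Python sketch of that group-size generation, reconstructed from the formula above (so treat it as an approximation of the attached script, not its actual code):

from collections import Counter

def group_sizes(num_rows, num_groups, skew):
    """Distribute num_rows items into num_groups groups: power = 1 / (1 + |skew|),
    item i goes to group floor((i / num_rows) ** power * num_groups), and a
    negative skew mirrors the distribution of the corresponding positive skew."""
    power = 1.0 / (1.0 + abs(skew))
    counts = Counter(
        min(int((i / num_rows) ** power * num_groups), num_groups - 1)
        for i in range(1, num_rows + 1)
    )
    sizes = [counts.get(g, 0) for g in range(num_groups)]
    return sizes[::-1] if skew < 0 else sizes

# skew = 5.0 (power = 1/6) over 1,000,000 rows reproduces the shape of the
# first table above (..., 665, 3367, 11529, ..., 269297, ~468558), up to
# rounding at the outermost group boundaries.
print(group_sizes(1_000_000, 10, 5.0))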

Results show that among 2172 test cases, incremental sort gives a speedup in 2113 of them, while in 59 it causes a slowdown.  The following 4 test cases show the most significant slowdown (>10% of time).

Table                   GroupedCols GroupCount Skew PreorderedFrac FullSortMedian IncSortMedian TimeChangePercent
int4|int4|numeric       1                  100  -10              0   1.5688240528  2.0607631207             31.36
text|int8|text|int4     1                    1    0              0   1.7785198689  2.1816160679             22.66
int8|int8|int4          1                   10  -10              0    1.136412859  1.3166360855             15.86
numeric|text|int4|int8  2                   10  -10              1   0.4403841496  0.5070910454             15.15

As you can see, 3 of these 4 test cases have skewed distributions, while one of them is related to the costly locale-aware comparison of text.  I have no particular idea of how to cope with these slowdowns.  Probably it's OK to have a slowdown in some cases while having a speedup in the majority of cases (assuming there is an option to turn off the new behavior).  Probably we should teach the optimizer more about skewed distributions of groups, but that doesn't seem feasible to me.

Any thoughts?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Sat, Sep 16, 2017 at 2:46 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Thu, Sep 14, 2017 at 2:48 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Patch rebased to current master is attached.  I'm going to improve my testing script and post new results. 

New benchmarking script and results are attached.  There new dataset parameter is introduced: skew factor.  Skew factor defines skew in distribution of groups sizes.
My idea of generating is just usage of power function where power is from 0 to 1.  Following formula is used to get group number for particular item number i.
[((i / number_of_indexes) ^ power) * number_of_groups]
For example, power = 1/6 gives following distribution of groups sizes:
group number    group size
0               2
1               63
2               665
3               3367
4               11529
5               31031
6               70993
7               144495
8               269297
9               468558

For convenience, instead of power itself, I use skew factor where power = 1.0 / (1.0 + skew).  Therefore, with skew = 0.0, distribution of groups sizes is uniform.  Larger skew gives more skewed distribution (and that seems to be quite intuitive).  For, negative skew, group sizes are mirrored as for corresponding positive skew.  For example, skew factor = -5.0 gives following groups sizes distribution:
group number    group size
0               468558
1               269297
2               144495
3               70993
4               31031
5               11529
6               3367
7               665
8               63
9               2

Results shows that between 2172 test cases, in 2113 incremental sort gives speedup while in 59 it causes slowdown.  The following 4 test cases show most significant slowdown (>10% of time).

Table                   GroupedCols GroupCount Skew PreorderedFrac FullSortMedian IncSortMedian TimeChangePercent
int4|int4|numeric       1                  100  -10              0   1.5688240528  2.0607631207             31.36
text|int8|text|int4     1                    1    0              0   1.7785198689  2.1816160679             22.66
int8|int8|int4          1                   10  -10              0    1.136412859  1.3166360855             15.86
numeric|text|int4|int8  2                   10  -10              1   0.4403841496  0.5070910454             15.15

As you can see, 3 of this 4 test cases have skewed distribution while one of them is related to costly location-aware comparison of text.  I've no particular idea of how to cope these slowdowns.  Probably, it's OK to have slowdown in some cases while have speedup in majority of cases (assuming there is an option to turn off new behavior).  Probably, we should teach optimizer more about skewed distributions of groups, but that doesn't seem feasible for me.

Any thoughts?

BTW, replacement selection sort was removed by 8b304b8b.  I think it's worth rerunning the benchmarks after that, because the results might change.  Will do.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Sat, Sep 30, 2017 at 11:20 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Sat, Sep 16, 2017 at 2:46 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Thu, Sep 14, 2017 at 2:48 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Patch rebased to current master is attached.  I'm going to improve my testing script and post new results. 

A new benchmarking script and results are attached.  A new dataset parameter is introduced: the skew factor.  The skew factor defines the skew in the distribution of group sizes.
My idea for generating it is simply to use a power function where the power is between 0 and 1.  The following formula is used to get the group number for a particular item number i.
[((i / number_of_indexes) ^ power) * number_of_groups]
For example, power = 1/6 gives the following distribution of group sizes:
group number    group size
0               2
1               63
2               665
3               3367
4               11529
5               31031
6               70993
7               144495
8               269297
9               468558

For convenience, instead of the power itself, I use a skew factor where power = 1.0 / (1.0 + skew).  Therefore, with skew = 0.0, the distribution of group sizes is uniform.  A larger skew gives a more skewed distribution (which seems quite intuitive).  For negative skew, the group sizes are mirrored relative to the corresponding positive skew.  For example, skew factor = -5.0 gives the following distribution of group sizes:
group number    group size
0               468558
1               269297
2               144495
3               70993
4               31031
5               11529
6               3367
7               665
8               63
9               2

The results show that among 2172 test cases, incremental sort gives a speedup in 2113 of them, while in 59 it causes a slowdown.  The following 4 test cases show the most significant slowdowns (>10% of time).

Table                   GroupedCols GroupCount Skew PreorderedFrac FullSortMedian IncSortMedian TimeChangePercent
int4|int4|numeric       1                  100  -10              0   1.5688240528  2.0607631207             31.36
text|int8|text|int4     1                    1    0              0   1.7785198689  2.1816160679             22.66
int8|int8|int4          1                   10  -10              0    1.136412859  1.3166360855             15.86
numeric|text|int4|int8  2                   10  -10              1   0.4403841496  0.5070910454             15.15

As you can see, 3 of these 4 test cases have skewed distributions, while one of them is related to costly locale-aware comparison of text.  I have no particular idea of how to cope with these slowdowns.  Probably it's OK to have a slowdown in some cases while having a speedup in the majority of cases (assuming there is an option to turn off the new behavior).  Probably we should teach the optimizer more about skewed distributions of groups, but that doesn't seem feasible to me.

Any thoughts?

BTW, replacement selection sort was removed by 8b304b8b.  I think it's worth rerunning the benchmarks after that, because the results might change.  Will do.

I've applied the patch on top of c12d570f and rerun the same benchmarks.
A CSV file with the results is attached.  There are no dramatic changes: there is still a minority of cases with performance regressions, while the majority of cases show improvement.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Robert Haas
Date:
On Mon, Oct 2, 2017 at 12:37 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> I've applied patch on top of c12d570f and rerun the same benchmarks.
> CSV-file with results is attached.  There is no dramatical changes.  There
> is still minority of performance regression cases while majority of cases
> has improvement.

Yes, I think these results look pretty good.  But are these times in
seconds?  You might need to do some testing with bigger sorts.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Tue, Oct 3, 2017 at 2:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Oct 2, 2017 at 12:37 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> I've applied patch on top of c12d570f and rerun the same benchmarks.
> CSV-file with results is attached.  There is no dramatical changes.  There
> is still minority of performance regression cases while majority of cases
> has improvement.

Yes, I think these results look pretty good.  But are these times in
seconds?  You might need to do some testing with bigger sorts.

Good point.  I'll rerun the benchmarks with a larger dataset size.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] [PATCH] Incremental sort

From
Antonin Houska
Date:
Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

> Patch rebased to current master is attached. I'm going to improve my testing script and post new results.

I wanted to review this patch but incremental-sort-8.patch fails to apply. Can
you please rebase it again?

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Tue, Nov 14, 2017 at 7:00 PM, Antonin Houska <ah@cybertec.at> wrote:
Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

> Patch rebased to current master is attached. I'm going to improve my testing script and post new results.

I wanted to review this patch but incremental-sort-8.patch fails to apply. Can
you please rebase it again?

Sure, please find the rebased patch attached.
Also, I'd like to share partial results of the benchmarks with 100M rows.  It appears that with 100M rows the benchmark takes quite a lot of time.  Perhaps in the cases where there was a degradation with 1M rows, it becomes somewhat bigger...

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Antonin Houska
Date:
Antonin Houska <ah@cybertec.at> wrote:

> Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>
> > Patch rebased to current master is attached. I'm going to improve my testing script and post new results.
>
> I wanted to review this patch but incremental-sort-8.patch fails to apply. Can
> you please rebase it again?

I could find the matching HEAD quite easily (9b6cb46), so following are my comments:

* cost_sort()

** "presorted_keys" missing in the prologue

** when called from label_sort_with_costsize(), 0 is passed for
   "presorted_keys". However label_sort_with_costsize() can sometimes be
   called on an IncrementalSort, in which case there are some "presorted
   keys". See create_mergejoin_plan() for example. (IIUC this should only
   make EXPLAIN inaccurate, but should not cause incorrect decisions.)


** instead of

if (!enable_incrementalsort)
   presorted_keys = false;

you probably meant

if (!enable_incrementalsort)
   presorted_keys = 0;


** instead of

/* Extract presorted keys as list of expressions */
foreach(l, pathkeys)
{
        PathKey *key = (PathKey *)lfirst(l);
        EquivalenceMember *member = (EquivalenceMember *)
                lfirst(list_head(key->pk_eclass->ec_members));

you can use linitial():

/* Extract presorted keys as list of expressions */
foreach(l, pathkeys)
{
        PathKey *key = (PathKey *)lfirst(l);
        EquivalenceMember *member = (EquivalenceMember *)
                linitial(key->pk_eclass->ec_members);


* get_cheapest_fractional_path_for_pathkeys()

The prologue says "... at least partially satisfies the given pathkeys ..."
but I see no change in the function code. In particular the use of
pathkeys_contained_in() does not allow for any kind of partial sorting.


* pathkeys_useful_for_ordering()

Extra whitespace following the comment opening string "/*":

/*
 * When incremental sort is disabled, pathkeys are useful only when they


* make_sort_from_pathkeys() - the "skipCols" argument should be mentioned in the prologue.


* create_sort_plan()

I noticed that pathkeys_common() is called, but the value of n_common_pathkeys
should already be in the path's "skipCols" field if the underlying path is
actually IncrementalSortPath.


* create_unique_plan() does not seem to make use of the incremental sort. Shouldn't it do?


* nodeIncrementalSort.c

** These comments seem to contain typos:

"Incremental sort algorithm would sort by xfollowing groups, which have ..."

"Interate while skip cols are same as in saved tuple"

** (This is rather a pedantic comment) I think prepareSkipCols() should be  defined in front of cmpSortSkipCols().

** the MIN_GROUP_SIZE constant deserves a comment.


* ExecIncrementalSort()

** if (node->tuplesortstate == NULL)

If both branches contain the expression
    node->groupsCount++;

I suggest it to be moved outside the "if" construct.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


Re: [HACKERS] [PATCH] Incremental sort

From
Thomas Munro
Date:
On Wed, Nov 15, 2017 at 7:42 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Sure, please find rebased patch attached.

+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+                                                             TupleTableSlot *b)
+ {
+     int n, i;
+
+     Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+     n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+     for (i = 0; i < n; i++)
+     {
+         Datum datumA, datumB, result;
+         bool isnullA, isnullB;
+         AttrNumber attno = node->skipKeys[i].attno;
+         SkipKeyData *key;
+
+         datumA = slot_getattr(a, attno, &isnullA);
+         datumB = slot_getattr(b, attno, &isnullB);
+
+         /* Special case for NULL-vs-NULL, else use standard comparison */
+         if (isnullA || isnullB)
+         {
+             if (isnullA == isnullB)
+                 continue;
+             else
+                 return false;
+         }
+
+         key = &node->skipKeys[i];
+
+         key->fcinfo.arg[0] = datumA;
+         key->fcinfo.arg[1] = datumB;
+
+         /* just for paranoia's sake, we reset isnull each time */
+         key->fcinfo.isnull = false;
+
+         result = FunctionCallInvoke(&key->fcinfo);
+
+         /* Check for null result, since caller is clearly not expecting one */
+         if (key->fcinfo.isnull)
+             elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+         if (!DatumGetBool(result))
+             return false;
+     }
+     return true;
+ }

Is there some reason not to use ApplySortComparator for this?  I think
you're missing out on lower-overhead comparators, and in any case it's
probably good code reuse, no?

Embarrassingly, I was unaware of this patch and started prototyping
exactly the same thing independently[1].  I hadn't got very far and
will now abandon that, but that's one thing I did differently.  Two
other things that may be different: I had a special case for groups of
size 1 that skipped the sorting, and I only sorted on the suffix
because I didn't put tuples with different prefixes into the sorter (I
was assuming that tuplesort_reset was going to be super efficient,
though I hadn't got around to writing that).  I gather that you have
determined empirically that it's better to be able to sort groups of
at least MIN_GROUP_SIZE than to be able to skip the comparisons on the
leading attributes, but why is that the case?

[1] https://github.com/macdice/postgres/commit/ab0f8aff9c4b25d5598aa6b3c630df864302f572

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
Hi!

Thank you very much for the review.  I really appreciate that this topic is getting attention.  Please find the next revision of the patch in the attachment.

On Wed, Nov 15, 2017 at 7:20 PM, Antonin Houska <ah@cybertec.at> wrote:
Antonin Houska <ah@cybertec.at> wrote:

> Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>
> > Patch rebased to current master is attached. I'm going to improve my testing script and post new results.
>
> I wanted to review this patch but incremental-sort-8.patch fails to apply. Can
> you please rebase it again?

I could find the matching HEAD quite easily (9b6cb46), so following are my comments:

* cost_sort()

** "presorted_keys" missing in the prologue
 
Comment is added.

** when called from label_sort_with_costsize(), 0 is passed for
   "presorted_keys". However label_sort_with_costsize() can sometimes be
   called on an IncrementalSort, in which case there are some "presorted
   keys". See create_mergejoin_plan() for example. (IIUC this should only
   make EXPLAIN inaccurate, but should not cause incorrect decisions.)
 
Good catch.  Fixed.

** instead of

if (!enable_incrementalsort)
   presorted_keys = false;

you probably meant

if (!enable_incrementalsort)
   presorted_keys = 0;

Absolutely correct.  Fixed.
 
** instead of

/* Extract presorted keys as list of expressions */
foreach(l, pathkeys)
{
        PathKey *key = (PathKey *)lfirst(l);
        EquivalenceMember *member = (EquivalenceMember *)
                lfirst(list_head(key->pk_eclass->ec_members));

you can use linitial():

/* Extract presorted keys as list of expressions */
foreach(l, pathkeys)
{
        PathKey *key = (PathKey *)lfirst(l);
        EquivalenceMember *member = (EquivalenceMember *)
                linitial(key->pk_eclass->ec_members);
 
Sure.  Fixed.

* get_cheapest_fractional_path_for_pathkeys()

The prologue says "... at least partially satisfies the given pathkeys ..."
but I see no change in the function code. In particular the use of
pathkeys_contained_in() does not allow for any kind of partial sorting.
 
Good catch.  This is part of an optimization for build_minmax_path() which existed in an earlier version of the patch.  That optimization contained a set of arguable solutions.  This is why I decided to remove it from the patch and let it wait until the initial implementation is committed.

* pathkeys_useful_for_ordering()

Extra whitespace following the comment opening string "/*":

/*
 * When incremental sort is disabled, pathkeys are useful only when they

Fixed.
 
* make_sort_from_pathkeys() - the "skipCols" argument should be mentioned in
  the prologue.

Comment is added.
 
* create_sort_plan()

I noticed that pathkeys_common() is called, but the value of n_common_pathkeys
should already be in the path's "skipCols" field if the underlying path is
actually IncrementalSortPath.

Sounds like a reasonable optimization.  Done.
 
* create_unique_plan() does not seem to make use of the incremental
  sort. Shouldn't it do?
 
It definitely should.  But a proper solution doesn't seem easy to me.  We would have to construct the possibly useful paths beforehand, and that should be done in a manner agnostic to the order of pathkeys.  I'm afraid of possible regressions in query planning.  Therefore, it seems like a topic for a separate discussion.  I would prefer to commit some basic implementation first and then consider smaller patches with possible enhancements, including this one.

* nodeIncrementalSort.c

** These comments seem to contain typos:

"Incremental sort algorithm would sort by xfollowing groups, which have ..."

"Interate while skip cols are same as in saved tuple"

Fixed.
 

** (This is rather a pedantic comment) I think prepareSkipCols() should be
   defined in front of cmpSortSkipCols().
 
That's a good comment.  We're trying to be as pedantic about code as we can :)
Fixed.

** the MIN_GROUP_SIZE constant deserves a comment.

Sure.  Explanation was added.
 
* ExecIncrementalSort()

** if (node->tuplesortstate == NULL)

If both branches contain the expression

     node->groupsCount++;

I suggest it to be moved outside the "if" construct.

Done.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
 
Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
Hi!

On Mon, Nov 20, 2017 at 12:24 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Wed, Nov 15, 2017 at 7:42 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Sure, please find rebased patch attached.

+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(IncrementalSortState *node, TupleTableSlot *a,
+                                                             TupleTableSlot *b)
+ {
+     int n, i;
+
+     Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+     n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;
+
+     for (i = 0; i < n; i++)
+     {
+         Datum datumA, datumB, result;
+         bool isnullA, isnullB;
+         AttrNumber attno = node->skipKeys[i].attno;
+         SkipKeyData *key;
+
+         datumA = slot_getattr(a, attno, &isnullA);
+         datumB = slot_getattr(b, attno, &isnullB);
+
+         /* Special case for NULL-vs-NULL, else use standard comparison */
+         if (isnullA || isnullB)
+         {
+             if (isnullA == isnullB)
+                 continue;
+             else
+                 return false;
+         }
+
+         key = &node->skipKeys[i];
+
+         key->fcinfo.arg[0] = datumA;
+         key->fcinfo.arg[1] = datumB;
+
+         /* just for paranoia's sake, we reset isnull each time */
+         key->fcinfo.isnull = false;
+
+         result = FunctionCallInvoke(&key->fcinfo);
+
+         /* Check for null result, since caller is clearly not expecting one */
+         if (key->fcinfo.isnull)
+             elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+         if (!DatumGetBool(result))
+             return false;
+     }
+     return true;
+ }

Is there some reason not to use ApplySortComparator for this?  I think
you're missing out on lower-overhead comparators, and in any case it's
probably good code reuse, no?

However, for the incremental sort case we don't need to know here whether A > B or B > A.  It's enough for us to know whether A = B or A != B.  In some cases that's way cheaper.  For instance, for texts the equality check is basically a memcmp, while a comparison may use the collation.
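Just to illustrate the point (a toy sketch, not the patch code): a group-boundary check only needs an equality test, which for text-like values can be a length check plus memcmp, with no collation involved:

#include <stdbool.h>
#include <string.h>

/*
 * Toy illustration: detecting whether two values fall into the same
 * sort group only requires equality.  For text, bytewise comparison is
 * enough for that, while producing an ordering would require a
 * collation-aware comparison.
 */
static bool
texts_equal(const char *a, size_t alen, const char *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}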

Embarrassingly, I was unaware of this patch and started prototyping
exactly the same thing independently[1].  I hadn't got very far and
will now abandon that, but that's one thing I did differently.  Two
other things that may be different: I had a special case for groups of
size 1 that skipped the sorting, and I only sorted on the suffix
because I didn't put tuples with different prefixes into the sorter (I
was assuming that tuplesort_reset was going to be super efficient,
though I hadn't got around to writing that).  I gather that you have
determined empirically that it's better to be able to sort groups of
at least MIN_GROUP_SIZE than to be able to skip the comparisons on the
leading attributes, but why is that the case?

Right.  The issue is that not only the case of one tuple per group causes overhead; a few tuples per group (like 2 or 3) is also a case of overhead.  Also, the overhead is related not only to sorting.  While investigating the regression case provided by Heikki [1], I saw the extra time spent mostly in extra copying of the sample tuple and comparison with it.  In order to cope with this overhead I've introduced MIN_GROUP_SIZE, which allows us to avoid copying the sample tuple too frequently.
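A toy model of this batching, just to make the control flow concrete (plain arrays and qsort() standing in for the executor and tuplesort; the constant and helper names here are illustrative, not the patch's):

#include <stdio.h>
#include <stdlib.h>

#define MIN_GROUP_SIZE 32       /* illustrative value */

typedef struct Row { int a; int b; } Row;   /* presorted by a, sorting by (a, b) */

static int
cmp_full(const void *p1, const void *p2)
{
    const Row  *r1 = p1;
    const Row  *r2 = p2;

    if (r1->a != r2->a)
        return (r1->a > r2->a) - (r1->a < r2->a);
    return (r1->b > r2->b) - (r1->b < r2->b);
}

static void
flush(Row *batch, int nbatch)
{
    int         j;

    qsort(batch, nbatch, sizeof(Row), cmp_full);
    for (j = 0; j < nbatch; j++)
        printf("%d %d\n", batch[j].a, batch[j].b);
}

int
main(void)
{
    Row         input[1000];
    Row         batch[1000];
    int         nbatch = 0;
    int         i;

    /* input presorted by a (10 rows per a-value), b descending within each a */
    for (i = 0; i < 1000; i++)
    {
        input[i].a = i / 10;
        input[i].b = 9 - i % 10;
    }

    for (i = 0; i < 1000; i++)
    {
        /*
         * Only after at least MIN_GROUP_SIZE rows have been accumulated do we
         * compare the presorted column against the last accumulated ("pivot")
         * row; once it changes, sort the whole batch by (a, b) and emit it.
         */
        if (nbatch >= MIN_GROUP_SIZE && input[i].a != batch[nbatch - 1].a)
        {
            flush(batch, nbatch);
            nbatch = 0;
        }
        batch[nbatch++] = input[i];
    }
    flush(batch, nbatch);

    return 0;
}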


------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
 

Re: [HACKERS] [PATCH] Incremental sort

From
Peter Geoghegan
Date:
On Mon, Nov 20, 2017 at 3:34 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Thank you very much for review.  I really appreciate this topic gets
> attention.  Please, find next revision of patch in the attachment.

I would really like to see this get into v11. This is an important
patch, that has fallen through the cracks far too many times.

-- 
Peter Geoghegan


Re: [HACKERS] [PATCH] Incremental sort

From
Thomas Munro
Date:
On Tue, Nov 21, 2017 at 1:00 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Mon, Nov 20, 2017 at 12:24 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Is there some reason not to use ApplySortComparator for this?  I think
>> you're missing out on lower-overhead comparators, and in any case it's
>> probably good code reuse, no?
>
> However, for incremental sort case we don't need to know here whether A > B
> or B > A.  It's enough for us to know if A = B or A != B.  In some cases
> it's way cheaper.  For instance, for texts equality check is basically
> memcmp while comparison may use collation.

Ah, right, of course.

>> I gather that you have
>> determined empirically that it's better to be able to sort groups of
>> at least MIN_GROUP_SIZE than to be able to skip the comparisons on the
>> leading attributes, but why is that the case?
>
> Right.  The issue that not only case of one tuple per group could cause
> overhead, but few tuples (like 2 or 3) is also case of overhead.  Also,
> overhead is related not only to sorting.  While investigate of regression
> case provided by Heikki [1], I've seen extra time spent mostly in extra
> copying of sample tuple and comparison with that.  In order to cope this
> overhead I've introduced MIN_GROUP_SIZE which allows to skip copying sample
> tuples too frequently.

I see.  I wonder if there could ever be a function like
ExecMoveTuple(dst, src).  Given the polymorphism involved it'd be
slightly complicated and you'd probably have a general case that just
copies the tuple to dst and clears src, but there might be a bunch of
cases where you can do something more efficient like moving a pointer
and pin ownership.  I haven't really thought that through and
there may be fundamental problems with it...

If you're going to push the tuples into the sorter every time, then I
guess there are some special cases that could allow future
optimisations: (1) if you noticed that every prefix was different, you
can skip the sort operation (that is, you can use the sorter as a dumb
tuplestore and just get the tuples out in the same order you put them
in; not sure if Tuplesort supports that but it presumably could), (2)
if you noticed that every prefix was the same (that is, you have only
one prefix/group in the sorter) then you could sort only on the suffix
(that is, you could somehow tell Tuplesort to ignore the leading
columns), (3) as a more complicated optimisation for intermediate
group sizes 1 < n < MIN_GROUP_SIZE, you could somehow number the
groups with an integer that increments whenever you see the prefix
change, and somehow tell tuplesort.c to use that instead of the
leading columns.  Ok, that last one is probably hard but the first two
might be easier...

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] [PATCH] Incremental sort

From
Antonin Houska
Date:
Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

> Antonin Houska <ah@cybertec.at> wrote:

> >  * ExecIncrementalSort()
> >
> >  ** if (node->tuplesortstate == NULL)
> >
> >  If both branches contain the expression
> >
> >  node->groupsCount++;
> >
> >  I suggest it to be moved outside the "if" construct.
>
> Done.

One more comment on this: I wonder if the field isn't incremented too
early. It seems to me that the value can end up non-zero even if the input
set turns out to be empty (not sure if it can happen in practice).

And finally one question about regression tests: what's the purpose of the
changes in contrib/postgres_fdw/sql/postgres_fdw.sql ? I see no
IncrementalSort node in the output.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


Re: [HACKERS] [PATCH] Incremental sort (was: PoC: Partial sort)

From
Michael Paquier
Date:
On Mon, Mar 20, 2017 at 6:33 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Thank you for the report.
> Please, find rebased patch in the attachment.

This patch cannot be applied. Please provide a rebased version. I am
moving it to next CF with waiting on author as status.
-- 
Michael


Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Wed, Nov 22, 2017 at 12:01 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> I gather that you have
>> determined empirically that it's better to be able to sort groups of
>> at least MIN_GROUP_SIZE than to be able to skip the comparisons on the
>> leading attributes, but why is that the case?
>
> Right.  The issue that not only case of one tuple per group could cause
> overhead, but few tuples (like 2 or 3) is also case of overhead.  Also,
> overhead is related not only to sorting.  While investigate of regression
> case provided by Heikki [1], I've seen extra time spent mostly in extra
> copying of sample tuple and comparison with that.  In order to cope this
> overhead I've introduced MIN_GROUP_SIZE which allows to skip copying sample
> tuples too frequently.

I see.  I wonder if there could ever be a function like
ExecMoveTuple(dst, src).  Given the polymorphism involved it'd be
slightly complicated and you'd probably have a general case that just
copies the tuple to dst and clears src, but there might be a bunch of
cases where you can do something more efficient like moving a pointer
and pin ownership.  I haven't really thought that through and
there may be fundamental problems with it...

ExecMoveTuple(dst, src) would be good.  But it would be hard to implement the "moving a pointer and pin ownership" principle in our current infrastructure, because the source and destination can have different memory contexts.  AFAICS, we can't just move a memory area between memory contexts: we have to allocate a new area, memcpy, and then deallocate the old area.
 
If you're going to push the tuples into the sorter every time, then I
guess there are some special cases that could allow future
optimisations: (1) if you noticed that every prefix was different, you
can skip the sort operation (that is, you can use the sorter as a dumb
tuplestore and just get the tuples out in the same order you put them
in; not sure if Tuplesort supports that but it presumably could),

In order to notice that every prefix is different, I have to compare every prefix.  But that may introduce overhead.  So the reason I introduced MIN_GROUP_SIZE is exactly to avoid comparing every prefix...
 
(2)
if you noticed that every prefix was the same (that is, you have only
one prefix/group in the sorter) then you could sort only on the suffix
(that is, you could somehow tell Tuplesort to ignore the leading
columns),

Yes, I did so before.  But again, after introducing MIN_GROUP_SIZE, I lose the knowledge of whether all the prefixes were the same or different.  This is why I have to sort by the full column list for now...

(3) as a more complicated optimisation for intermediate
group sizes 1 < n < MIN_GROUP_SIZE, you could somehow number the
groups with an integer that increments whenever you see the prefix
change, and somehow tell tuplesort.c to use that instead of the
leading columns.

That is an interesting idea.  The reason we have overhead in comparison with plain sort is that we do an extra comparison (and copying), but the result of that comparison is then lost for the sorting itself.  With this approach, the sort could "reuse" the prefix comparison, and the overhead would be lower.  But the problem is that we would have to reformat tuples before putting them into the tuplesort.  I wonder whether the tuple reformatting could eat the potential performance win...
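For illustration, a comparator over such tagged tuples could look roughly like this (hypothetical layout, just to show how the prefix comparison would be reused as a plain integer comparison):

/*
 * Hypothetical illustration of idea (3): each tuple carries the number of
 * the prefix group it belongs to (incremented whenever the presorted
 * columns change), so the sort comparator does a cheap integer comparison
 * instead of re-comparing the prefix columns.
 */
typedef struct TaggedTuple
{
    int         group_id;       /* prefix group counter */
    int         suffix_key;     /* stand-in for the non-presorted column(s) */
} TaggedTuple;

static int
cmp_tagged(const void *p1, const void *p2)
{
    const TaggedTuple *t1 = p1;
    const TaggedTuple *t2 = p2;

    if (t1->group_id != t2->group_id)
        return (t1->group_id > t2->group_id) - (t1->group_id < t2->group_id);
    return (t1->suffix_key > t2->suffix_key) - (t1->suffix_key < t2->suffix_key);
}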

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
Hi!

On Fri, Dec 1, 2017 at 11:39 AM, Antonin Houska <ah@cybertec.at> wrote:
I expected the number of groups that actually appear in the output, while
you consider it the number of groups started. I can't find a similar case
elsewhere in the code (e.g. the Agg node does not report this kind of
information), so I have no clue. Someone else will have to decide.
 
OK.

> But there is IncrementalSort node on the remote side.
> Let's see what happens. Idea of "CROSS JOIN, not pushed down" test is that cross join with ORDER BY LIMIT is not beneficial to push down, because LIMIT is not pushed down and remote side wouldn't be able to use top-N heapsort. But if remote side has incremental sort then it can be
> used, and fetching first 110 rows is cheap. Let's see plan of original "CROSS JOIN, not pushed down" test with incremental sort.
>
> # EXPLAIN (ANALYZE, VERBOSE) SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;

ok, understood, thanks. Perhaps it's worth a comment in the test script.

I'm afraid I still see a problem. The diff removes a query that (although a
bit different from the one above) lets the CROSS JOIN be pushed down and
does introduce the IncrementalSort in the remote database. This query is
replaced with one that does not allow the join push-down.

*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 510,517 ****
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 510,517 ----
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;

Shouldn't the test contain *both* cases?

Thank you for pointing that out.  Sure, both cases are better.  I've added the second case as well as comments.  Patch is attached.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Fri, Dec 8, 2017 at 4:06 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Thank you for pointing that out.  Sure, both cases are better.  I've added the second case as well as comments.  Patch is attached.

I just found that the patch fails to apply according to commitfest.cputube.org.  Please find the rebased patch attached.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
"Tels"
Date:
Hello Alexander,

On Thu, January 4, 2018 4:36 pm, Alexander Korotkov wrote:
> On Fri, Dec 8, 2017 at 4:06 PM, Alexander Korotkov <
> a.korotkov@postgrespro.ru> wrote:
>
>> Thank you for pointing that.  Sure, both cases are better.  I've added
>> second case as well as comments.  Patch is attached.

I had a quick look, this isn't a full review, but a few things struck me
on a read through the diff:

There are quite a few places where lines are broken like so:

+            ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+                                                pwcxt);
Or like this:

+            result = (PlanState *) ExecInitIncrementalSort(
+                                    (IncrementalSort *) node, estate, eflags);

e.g. a param is on the next line, but aligned to the very same place where
it would be w/o the linebreak. Or is this just some sort of artefact
because I viewed the diff with tabspacing = 8?

I'd fix the grammar here:

+ *        Incremental sort is specially optimized kind of multikey sort when
+ *        input is already presorted by prefix of required keys list.

Like so:

"Incremental sort is a specially optimized kind of multikey sort used when
the input is already presorted by a prefix of the required keys list."

+ *        Consider following example.  We have input tuples consisting from

"Consider the following example: We have ..."

+         * In incremental sort case we also have to cost to detect sort groups.

"we also have to cost the detection of sort groups."

"+         * It turns out into extra copy and comparison for each tuple."

"This turns out to be one extra copy and comparison per tuple."

+ "Portions Copyright (c) 1996-2017"

Should probably be 2018 now - time flies fast :)

         return_value = _readMaterial();
     else if (MATCH("SORT", 4))
         return_value = _readSort();
+    else if (MATCH("INCREMENTALSORT", 7))
+        return_value = _readIncrementalSort();
     else if (MATCH("GROUP", 5))
         return_value = _readGroup();

I think the ", 7" here is left-over from when it was named "INCSORT", and
it should be MATCH("INCREMENTALSORT", 15)), shouldn't it?
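(For reference, MATCH() in readfuncs.c is essentially "length == namelen && memcmp(token, tokname, namelen) == 0", so the second argument has to be the full token length; a quick standalone check:)

#include <stdio.h>
#include <string.h>

/* a simplified stand-in for the length-sensitive check MATCH() performs */
static int
match(const char *token, int length, const char *tokname, int namelen)
{
    return length == namelen && memcmp(token, tokname, namelen) == 0;
}

int
main(void)
{
    const char *token = "INCREMENTALSORT";
    int         length = (int) strlen(token);       /* 15 */

    printf("with 7:  %d\n", match(token, length, "INCREMENTALSORT", 7));    /* 0 */
    printf("with 15: %d\n", match(token, length, "INCREMENTALSORT", 15));   /* 1 */
    return 0;
}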

+                                   space, fase when it's value for in-memory

typo: "space, false when ..."

+            bool    cmp;
+            cmp = cmpSortSkipCols(node, node->sampleSlot, slot);
+
+            if (cmp)

In the above, the variable cmp could be optimized away with:

+            if (cmpSortSkipCols(node, node->sampleSlot, slot))

(not sure if modern compilers won't do this, anyway, though)

+typedef struct IncrementalSortState
+{
+    ScanState    ss;                /* its first field is NodeTag */
+    bool        bounded;        /* is the result set
bounded? */
+    int64        bound;            /* if bounded, how many
tuples are needed */

If I'm not wrong, the layout of the struct will include quite a bit of
padding on 64 bit due to the mixing of bool and int64, maybe it would be
better to sort the fields differently, e.g. pack 4 or 8 bools together?
Not sure if that makes much of a difference, though.
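A quick standalone check of that concern on a typical 64-bit target, with toy stand-ins for the fields in question (not the actual node definition):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy stand-ins just to show the alignment effect on a 64-bit ABI */
typedef struct MixedLayout
{
    bool        bounded;        /* 1 byte + 7 bytes padding before int64 */
    int64_t     bound;
    bool        sort_Done;      /* 1 byte + 7 bytes trailing padding */
} MixedLayout;

typedef struct PackedLayout
{
    int64_t     bound;
    bool        bounded;        /* bools packed together at the end */
    bool        sort_Done;
} PackedLayout;

int
main(void)
{
    printf("mixed: %zu bytes, packed: %zu bytes\n",
           sizeof(MixedLayout), sizeof(PackedLayout));
    return 0;
}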

That's all for now :)

Thank you for your work,

Tels


Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
Hi!

On Fri, Jan 5, 2018 at 2:21 AM, Tels <nospam-pg-abuse@bloodgate.com> wrote:
On Thu, January 4, 2018 4:36 pm, Alexander Korotkov wrote:
> On Fri, Dec 8, 2017 at 4:06 PM, Alexander Korotkov <
> a.korotkov@postgrespro.ru> wrote:
>
>> Thank you for pointing that.  Sure, both cases are better.  I've added
>> second case as well as comments.  Patch is attached.

I had a quick look, this isn't a full review, but a few things struck me
on a read through the diff:

There are quite a few places where lines are broken like so:

+                       ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+                                                                                               pwcxt);

It's quite common practice to align the second argument to the same position as the first argument.  See other lines nearby.
 
Or like this:

+                       result = (PlanState *) ExecInitIncrementalSort(
+                                                                       (IncrementalSort *) node, estate, eflags);

It was probably not such a good idea to insert a line break before the first argument.  Fixed.
 

e.g. a param is on the next line, but aligned to the very same place where
it would be w/o the linebreak. Or is this just some sort of artefact
because I viewed the diff with tabspacing = 8?

I'd fix the grammar here:

+ *             Incremental sort is specially optimized kind of multikey sort when
+ *             input is already presorted by prefix of required keys list.

Like so:

"Incremental sort is a specially optimized kind of multikey sort used when
the input is already presorted by a prefix of the required keys list."

+ *             Consider following example.  We have input tuples consisting from

"Consider the following example: We have ..."

+                * In incremental sort case we also have to cost to detect sort groups.

"we also have to cost the detection of sort groups."

"+               * It turns out into extra copy and comparison for each tuple."

"This turns out to be one extra copy and comparison per tuple."

Many thanks for noticing these.  Fixed.
 
+ "Portions Copyright (c) 1996-2017"

Should probably be 2018 now - time flies fast :)

Right.  Happy New Year! :)
 
                return_value = _readMaterial();
        else if (MATCH("SORT", 4))
                return_value = _readSort();
+       else if (MATCH("INCREMENTALSORT", 7))
+               return_value = _readIncrementalSort();
        else if (MATCH("GROUP", 5))
                return_value = _readGroup();

I think the ", 7" here is left-over from when it was named "INCSORT", and
it should be MATCH("INCREMENTALSORT", 15)), shouldn't it?

Good catch, thank you!
 

+                                                                  space, fase when it's value for in-memory

typo: "space, false when ..."

Right.  Fixed.
 
+                       bool    cmp;
+                       cmp = cmpSortSkipCols(node, node->sampleSlot, slot);
+
+                       if (cmp)

In the above, the variable cmp could be optimized away with:

+                       if (cmpSortSkipCols(node, node->sampleSlot, slot))
 
Right.  This comes from a time when there was more complicated code which had to use the cmp variable multiple times.

(not sure if modern compilers won't do this, anyway, though)
 
Anyway, it's a code simplification, which is good regardless of whether compilers are able to do it themselves or not.

+typedef struct IncrementalSortState
+{
+       ScanState       ss;                             /* its first field is NodeTag */
+       bool            bounded;                /* is the result set
bounded? */
+       int64           bound;                  /* if bounded, how many
tuples are needed */

If I'm not wrong, the layout of the struct will include quite a bit of
padding on 64 bit due to the mixing of bool and int64, maybe it would be
better to sort the fields differently, e.g. pack 4 or 8 bools together?
Not sure if that makes much of a difference, though.
 
I'd like to keep the members common to SortState and IncrementalSortState ordered the same way.
Thus, I think that if we're going to reorder, then we should do it in both data structures.
But I'm not sure it's worth considering, because these data structures are very unlikely to be the source of significant memory consumption...

That's all for now :)

Great, thank you for review.

BTW, I also fixed documentation markup (regarding migration to xml).

Rebased patch is attached.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Antonin Houska
Date:
Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

> Antonin Houska <ah@cybertec.at> wrote:

> >  Shouldn't the test contain *both* cases?

> Thank you for pointing that. Sure, both cases are better. I've added second case as well as comments. Patch is
attached.

I'm fine with the tests now but have a minor comment on this comment:

-- CROSS JOIN, not pushed down, because we don't push down LIMIT and remote side
-- can't perform top-N sort like local side can.

I think the note on LIMIT push-down makes the comment less clear because
there's no difference in processing the LIMIT: EXPLAIN shows that both

SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET
100 LIMIT 10;

and

SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET
100 LIMIT 10;

evaluate the LIMIT clause only locally.

What I consider the important difference is that the 2nd case does not
generate the appropriate input for remote incremental sort (while incremental
sort tends to be very cheap). Therefore it's cheaper to do no remote sort at
all and perform the top-N sort locally than to do a regular (non-incremental)
remote sort.

I have no other questions about this patch. I expect the CFM to set the status
to "ready for committer" as soon as the other reviewers confirm they're happy
about the patch status.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Mon, Jan 8, 2018 at 2:29 PM, Antonin Houska <ah@cybertec.at> wrote:
Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

> Antonin Houska <ah@cybertec.at> wrote:

> >  Shouldn't the test contain *both* cases?

> Thank you for pointing that. Sure, both cases are better. I've added second case as well as comments. Patch is attached.

I'm fine with the tests now but have a minor comment on this comment:

-- CROSS JOIN, not pushed down, because we don't push down LIMIT and remote side
-- can't perform top-N sort like local side can.

I think the note on LIMIT push-down makes the comment less clear because
there's no difference in processing the LIMIT: EXPLAIN shows that both

SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET
100 LIMIT 10;

and

SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET
100 LIMIT 10;

evaluate the LIMIT clause only locally.

What I consider the important difference is that the 2nd case does not
generate the appropriate input for remote incremental sort (while incremental
sort tends to be very cheap). Therefore it's cheaper to do no remote sort at
all and perform the top-N sort locally than to do a regular (non-incremental)
remote sort.

Agreed, these comments are not clear enough.  I've rewritten the comments: they became much
wordier, but now they look clearer to me.  Also I've swapped the order of the queries; to me
that seems easier to understand.
 
I have no other questions about this patch. I expect the CFM to set the status
to "ready for committer" as soon as the other reviewers confirm they're happy
about the patch status.

Good, thank you.  Let's see what other reviewers will say.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Mon, Jan 8, 2018 at 10:17 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
I have no other questions about this patch. I expect the CFM to set the status
to "ready for committer" as soon as the other reviewers confirm they're happy
about the patch status.

Good, thank you.  Let's see what other reviewers will say.

Rebased patch is attached.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Tomas Vondra
Date:
Hi,

I have started reviewing the patch and doing some testing, and I have
pretty quickly run into a segfault. Attached is a simple reproducer and
a backtrace. AFAICS the bug seems to be somewhere in the tuplesort
changes, likely resetting a memory context too soon or something like
that. I haven't investigated it further, but it matches my hunch that
tuplesort is likely where the bugs will be.

Otherwise the patch seems fairly complete. A couple of minor things that
I noticed while eyeballing the changes in a diff editor.


1) On a couple of places the new code has this comment

    /* even when not parallel-aware */

while all the immediately preceding blocks use

    /* even when not parallel-aware, for EXPLAIN ANALYZE */

I suggest using the same comment, otherwise it kinda suggests it's not
because of EXPLAIN ANALYZE.


2) I think the purpose of sampleSlot should be explicitly documented
(and I'm not sure "sample" is a good term here, as it suggests some sort
of sampling; for example nodeAgg uses grp_firstTuple).


3) skipCols/SkipKeyData seems a bit strange too, I think. I'd use
PresortedKeyData or something like that.


4) In cmpSortSkipCols, when checking if the columns changed, the patch
does this:

    n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;

    for (i = 0; i < n; i++)
    {
        ... check i-th key ...
    }

My hunch is that checking the keys from the last one, i.e.

    for (i = (n-1); i >= 0; i--)
    {
        ....
    }

would be faster. The reasoning is that with "ORDER BY a,b" the column
"b" changes more often. But I've been unable to test this because of the
segfault crashes.


5) The changes from

    if (pathkeys_contained_in(...))

to

    n = pathkeys_common(pathkeys, subpath->pathkeys);


    if (n == 0)

seem rather inconvenient to me, as it makes the code unnecessarily
verbose. I wonder if there's a better way to deal with this.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
Hi!

Thank you for reviewing this patch!
Revised version is attached.

On Mon, Mar 5, 2018 at 1:19 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
I have started reviewing the patch and doing some testing, and I have
pretty quickly run into a segfault. Attached is a simple reproducer and
a backtrace. AFAICS the bug seems to be somewhere in the tuplesort
changes, likely resetting a memory context too soon or something like
that. I haven't investigated it further, but it matches my hunch that
tuplesort is likely where the bugs will be.
 
Right.  The incremental sort patch introduces a main memory context (maincontext) which
is persistent between incremental sort groups.  But mergeruns()
reallocates memtuples in the sortcontext, which is cleared by tuplesort_reset().
Fixed in the revised patch.

Otherwise the patch seems fairly complete. A couple of minor things that
I noticed while eyeballing the changes in a diff editor.


1) On a couple of places the new code has this comment

    /* even when not parallel-aware */

while all the immediately preceding blocks use

    /* even when not parallel-aware, for EXPLAIN ANALYZE */

I suggest using the same comment, otherwise it kinda suggests it's not
because of EXPLAIN ANALYZE.
 
Right, fixed.  I also found that incremental sort, similarly to regular sort, shouldn't
support DSM reinitialization.  Fixed in the revised patch.

2) I think the purpose of sampleSlot should be explicitly documented
(and I'm not sure "sample" is a good term here, as it suggests some sort
of sampling; for example nodeAgg uses grp_firstTuple).

Yes, "sample" isn't a good term here.  However, "first" isn't really correct,
because we can skip some tuples from beginning of the group in
order to not form groups too frequently.  I'd rather name it "pivot" tuple
slot.
 
3) skipCols/SkipKeyData seems a bit strange too, I think. I'd use
PresortedKeyData or something like that.

Good point, renamed. 

4) In cmpSortSkipCols, when checking if the columns changed, the patch
does this:

    n = ((IncrementalSort *) node->ss.ps.plan)->skipCols;

    for (i = 0; i < n; i++)
    {
        ... check i-th key ...
    }

My hunch is that checking the keys from the last one, i.e.

    for (i = (n-1); i >= 0; i--)
    {
        ....
    }

would be faster. The reasoning is that with "ORDER BY a,b" the column
"b" changes more often. But I've been unable to test this because of the
segfault crashes.
 
Agreed.

5) The changes from

    if (pathkeys_contained_in(...))

to

    n = pathkeys_common(pathkeys, subpath->pathkeys);


    if (n == 0)

seem rather inconvenient to me, as it makes the code unnecessarily
verbose. I wonder if there's a better way to deal with this.

I would rather say that it changes from 

    if (pathkeys_contained_in(...))

to

    n = pathkeys_common(pathkeys, subpath->pathkeys);

    if (n == list_length(pathkeys))

I've introduced pathkeys_common_contained_in(), which returns the same
result as pathkeys_contained_in() but also sets the number of common pathkeys
into the last argument.  It simplifies the code a little bit.  The name could
probably be improved.
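Roughly along these lines (a sketch of that helper as described; the actual patch code may differ in details, and it assumes the planner headers that pathkeys.c already includes):

/*
 * Sketch of the helper described above.  Pathkeys are canonical, so
 * pointer comparison is enough.  Returns true iff keys1 is fully
 * contained in keys2, and sets *n_common to the length of the common
 * leading prefix of the two lists.
 */
static bool
pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
{
    int         n = 0;
    ListCell   *key1,
               *key2;

    forboth(key1, keys1, key2, keys2)
    {
        PathKey    *pathkey1 = (PathKey *) lfirst(key1);
        PathKey    *pathkey2 = (PathKey *) lfirst(key2);

        if (pathkey1 != pathkey2)
            break;
        n++;
    }

    *n_common = n;
    return (n == list_length(keys1));
}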

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
 
Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Tomas Vondra
Date:
On 03/05/2018 11:07 PM, Alexander Korotkov wrote:
> Hi!
> 
> Thank you for reviewing this patch!
> Revised version is attached.
> 

OK, the revised patch works fine - I've done a lot of testing and
benchmarking, and not a single segfault or any other crash.

Regarding the benchmarks, I generally used queries of the form

    SELECT * FROM (SELECT * FROM t ORDER BY a) foo ORDER BY a,b

with the first sort done in various ways:

    * regular Sort node
    * indexes with Index Scan
    * indexes with Index Only Scan

and all these three options with and without LIMIT (the limit was set to
1% of the source table).

I've also varied parallelism (max_parallel_workers_per_gather was set to
either 0 or 2), work_mem (from 4MB to 256MB) and data set size (tables
from 1000 rows to 10M rows).

All of this may seem like an overkill, but I've found a couple of
regressions thanks to that.

The full scripts and results are available here:

    https://github.com/tvondra/incremental-sort-tests

The queries actually executed are a bit more complicated, to eliminate
overhead due to data transfer to client etc. The same approach was used
in the other sorting benchmarks we've done in the past.

I'm attaching results for two scales - 10k and 10M rows, preprocessed
into .ods format. I haven't looked at the other scales yet, but I don't
expect any surprises there.

Each .ods file contains raw data for one of the tests (matching the .sh
script filename), pivot table, and comparison of durations with and
without the incremental sort.

In general, I think the results look pretty impressive. Almost all the
comparisons are green, which means "faster than master" - usually by
tens of percent (without limit), or by up to ~95% (with LIMIT).

There are a couple of regressions in two cases sort-indexes and
sort-indexes-ios.

On the small dataset this seems to be related to the number of groups
(essentially, number of distinct values in a column). My assumption is
that there is some additional overhead when "switching" between the
groups, and with many groups it's significant enough to affect results
on these tiny tables (where master only takes ~3ms to do the sort). The
slowdown seems to be

On the large data set it seems to be somehow related to both work_mem
and number of groups, but I didn't have time to investigate that yet
(there are explain analyze plans in the results, so feel free to look).

In general, I think this looks really nice. It's certainly awesome with
the LIMIT case, as it allows us to leverage indexes on a subset of the
ORDER BY columns.

Now, there's a caveat in those tests - the data set is synthetic and
perfectly random, i.e. all groups equally likely, no correlations or
anything like that.

I wonder what is the "worst case" scenario, i.e. how to construct a data
set with particularly bad behavior of the incremental sort.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Thu, Mar 8, 2018 at 2:49 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
OK, the revised patch works fine - I've done a lot of testing and
benchmarking, and not a single segfault or any other crash.

Regarding the benchmarks, I generally used queries of the form

    SELECT * FROM (SELECT * FROM t ORDER BY a) foo ORDER BY a,b

with the first sort done in various ways:

    * regular Sort node
    * indexes with Index Scan
    * indexes with Index Only Scan

and all these three options with and without LIMIT (the limit was set to
1% of the source table).

I've also varied parallelism (max_parallel_workers_per_gather was set to
either 0 or 2), work_mem (from 4MB to 256MB) and data set size (tables
from 1000 rows to 10M rows).

All of this may seem like an overkill, but I've found a couple of
regressions thanks to that.

The full scripts and results are available here:

    https://github.com/tvondra/incremental-sort-tests

The queries actually executed are a bit more complicated, to eliminate
overhead due to data transfer to client etc. The same approach was used
in the other sorting benchmarks we've done in the past.

I'm attaching results for two scales - 10k and 10M rows, preprocessed
into .ods format. I haven't looked at the other scales yet, but I don't
expect any surprises there.

Each .ods file contains raw data for one of the tests (matching the .sh
script filename), pivot table, and comparison of durations with and
without the incremental sort.

In general, I think the results look pretty impressive. Almost all the
comparisons are green, which means "faster than master" - usually by
tens of percent (without limit), or by up to ~95% (with LIMIT).

There are a couple of regressions in two cases sort-indexes and
sort-indexes-ios.

On the small dataset this seems to be related to the number of groups
(essentially, number of distinct values in a column). My assumption is
that there is some additional overhead when "switching" between the
groups, and with many groups it's significant enough to affect results
on these tiny tables (where master only takes ~3ms to do the sort). The
slowdown seems to be

On the large data set it seems to be somehow related to both work_mem
and number of groups, but I didn't have time to investigate that yet
(there are explain analyze plans in the results, so feel free to look).

In general, I think this looks really nice. It's certainly awesome with
the LIMIT case, as it allows us to leverage indexes on a subset of the
ORDER BY columns.

Thank you very much for testing and benchmarking.  I'll investigate
the regressions you found.
 
Now, there's a caveat in those tests - the data set is synthetic and
perfectly random, i.e. all groups equally likely, no correlations or
anything like that.

I wonder what is the "worst case" scenario, i.e. how to construct a data
set with particularly bad behavior of the incremental sort.

I think that depends on the reason for the bad behavior of incremental sort.
For example, our quicksort implementation behaves very well on
presorted data.  But incremental sort appears to be not so good in
this case, as Heikki showed upthread.  That prompted me to test
presorted datasets (which appeared to be the "worst case") more intensively. 
But I suspect that the regressions you found have another reason, and
correspondingly the "worst case" would also be different.
Once I investigate the reason for the regressions, I'll try to construct
a "worst case" as well.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Sat, Mar 10, 2018 at 6:42 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Thu, Mar 8, 2018 at 2:49 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Thank you very much for testing and benchmarking.  I'll investigate
the regressions you found.
 
Now, there's a caveat in those tests - the data set is synthetic and
perfectly random, i.e. all groups equally likely, no correlations or
anything like that.

I wonder what is the "worst case" scenario, i.e. how to construct a data
set with particularly bad behavior of the incremental sort.

I think that depends on the reason for the bad behavior of incremental sort.
For example, our quicksort implementation behaves very well on
presorted data.  But incremental sort appears to be not so good in
this case, as Heikki showed upthread.  That prompted me to test
presorted datasets (which appeared to be the "worst case") more intensively. 
But I suspect that the regressions you found have another reason, and
correspondingly the "worst case" would also be different.
Once I investigate the reason for the regressions, I'll try to construct
a "worst case" as well.

After some investigation of the benchmark results, I found two sources of
regressions with incremental sort.

Case 1: The loss from the underlying node's scan is bigger than the incremental sort's win

===== 33 [Wed Mar  7 10:14:14 CET 2018] scale:10000000 groups:10 work_mem:64MB incremental:on max_workers:0 =====
SELECT * FROM s_1 ORDER BY a, b
                                                                   QUERY PLAN                                                                    
-------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1588080.84..1588080.84 rows=1 width=20) (actual time=5874.527..5874.527 rows=0 loops=1)
   ->  Incremental Sort  (cost=119371.51..1488081.45 rows=9999939 width=20) (actual time=202.842..5653.224 rows=10000000 loops=1)
         Sort Key: s_1.a, s_1.b
         Presorted Key: s_1.a
         Sort Method: external merge  Disk: 29408kB
         Sort Groups: 11
         ->  Index Scan using s_1_a_idx on s_1  (cost=0.43..323385.52 rows=9999939 width=20) (actual time=0.051..1494.105 rows=10000000 loops=1)
 Planning time: 0.269 ms
 Execution time: 5877.367 ms
(9 rows)

===== 37 [Wed Mar  7 10:15:51 CET 2018] scale:10000000 groups:10 work_mem:64MB incremental:off max_workers:0 =====
SELECT * FROM s_1 ORDER BY a, b
                                                          QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1656439.93..1656439.93 rows=1 width=20) (actual time=4741.716..4741.716 rows=0 loops=1)
   ->  Sort  (cost=1531440.69..1556440.54 rows=9999939 width=20) (actual time=3522.156..4519.278 rows=10000000 loops=1)
         Sort Key: s_1.a, s_1.b
         Sort Method: external merge  Disk: 293648kB
         ->  Seq Scan on s_1  (cost=0.00..163694.39 rows=9999939 width=20) (actual time=0.021..650.322 rows=10000000 loops=1)
 Planning time: 0.249 ms
 Execution time: 4777.088 ms
(7 rows)

In this case the optimizer decided that "Index Scan + Incremental Sort" would be 
cheaper than "Seq Scan + Sort".  But it appears that the amount of time we lose by
selecting Index Scan over Seq Scan is bigger than the amount of time we win by
selecting Incremental Sort over Sort.  I would note that regular Sort consumes
about 10X more disk space.  I bet that all of this space fit into the OS cache of the test
machine, while the optimizer expected actual I/O to take place in this case.  This
caused the actual time to disagree with the costing.

Case 2: The underlying node is not parallelized

===== 178 [Wed Mar  7 11:18:53 CET 2018] scale:10000000 groups:100 work_mem:8MB incremental:on max_workers:2 =====
SELECT * FROM s_2 ORDER BY a, b, c
                                                                     QUERY PLAN                                                                     
----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1179047.88..1179047.88 rows=1 width=20) (actual time=4819.999..4819.999 rows=0 loops=1)
   ->  Incremental Sort  (cost=89.04..1079047.34 rows=10000054 width=20) (actual time=0.203..4603.197 rows=10000000 loops=1)
         Sort Key: s_2.a, s_2.b, s_2.c
         Presorted Key: s_2.a, s_2.b
         Sort Method: quicksort  Memory: 135kB
         Sort Groups: 10201
         ->  Index Scan using s_2_a_b_idx on s_2  (cost=0.43..406985.62 rows=10000054 width=20) (actual time=0.052..1461.177 rows=10000000 loops=1)
 Planning time: 0.313 ms
 Execution time: 4820.037 ms
(9 rows)

===== 182 [Wed Mar  7 11:20:11 CET 2018] scale:10000000 groups:100 work_mem:8MB incremental:off max_workers:2 =====
SELECT * FROM s_2 ORDER BY a, b, c
                                                                 QUERY PLAN                                                                 
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1705580.76..1705580.76 rows=1 width=20) (actual time=3985.818..3985.818 rows=0 loops=1)
   ->  Gather Merge  (cost=649951.66..1622246.98 rows=8333378 width=20) (actual time=1782.354..3750.868 rows=10000000 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=648951.64..659368.36 rows=4166689 width=20) (actual time=1778.362..2091.253 rows=3333333 loops=3)
               Sort Key: s_2.a, s_2.b, s_2.c
               Sort Method: external merge  Disk: 99136kB
               Worker 0:  Sort Method: external merge  Disk: 96984kB
               Worker 1:  Sort Method: external merge  Disk: 97496kB
               ->  Parallel Seq Scan on s_2  (cost=0.00..105361.89 rows=4166689 width=20) (actual time=0.022..233.640 rows=3333333 loops=3)
 Planning time: 0.265 ms
 Execution time: 4007.591 ms
(12 rows)

The situation is similar to case #1, except that in the "Seq Scan + Sort" pair the Sort also
gets parallelized.  As in the previous case, disk writes/reads during the
external sort are overestimated, because they actually hit the OS cache.
I would also say that it's not necessarily a wrong decision by the optimizer, because
doing this work in a single backend may consume fewer resources despite being
slower overall.  

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] [PATCH] Incremental sort

From
Tomas Vondra
Date:

On 03/10/2018 06:05 PM, Alexander Korotkov wrote:
> On Sat, Mar 10, 2018 at 6:42 PM, Alexander Korotkov
> <a.korotkov@postgrespro.ru <mailto:a.korotkov@postgrespro.ru>> wrote:
>
> ...
>
> After some investigation of benchmark results, I found 2 sources of
> regressions of incremental sort.
> 
> *Case 1: Underlying node scan lose is bigger than incremental sort win*
> 
> ===== 33 [Wed Mar  7 10:14:14 CET 2018] scale:10000000 groups:10
> work_mem:64MB incremental:on max_workers:0 =====
> SELECT * FROM s_1 ORDER BY a, b
>                                                                    QUERY
> PLAN                                                                    
>
-------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=1588080.84..1588080.84 rows=1 width=20) (actual
> time=5874.527..5874.527 rows=0 loops=1)
>    ->  Incremental Sort  (cost=119371.51..1488081.45 rows=9999939
> width=20) (actual time=202.842..5653.224 rows=10000000 loops=1)
>          Sort Key: s_1.a, s_1.b
>          Presorted Key: s_1.a
>          Sort Method: external merge  Disk: 29408kB
>          Sort Groups: 11
>          ->  Index Scan using s_1_a_idx on s_1  (cost=0.43..323385.52
> rows=9999939 width=20) (actual time=0.051..1494.105 rows=10000000 loops=1)
>  Planning time: 0.269 ms
>  Execution time: 5877.367 ms
> (9 rows)
> 
> ===== 37 [Wed Mar  7 10:15:51 CET 2018] scale:10000000 groups:10
> work_mem:64MB incremental:off max_workers:0 =====
> SELECT * FROM s_1 ORDER BY a, b
>                                                           QUERY PLAN   
>                                                       
>
------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=1656439.93..1656439.93 rows=1 width=20) (actual
> time=4741.716..4741.716 rows=0 loops=1)
>    ->  Sort  (cost=1531440.69..1556440.54 rows=9999939 width=20) (actual
> time=3522.156..4519.278 rows=10000000 loops=1)
>          Sort Key: s_1.a, s_1.b
>          Sort Method: external merge  Disk: 293648kB
>          ->  Seq Scan on s_1  (cost=0.00..163694.39 rows=9999939
> width=20) (actual time=0.021..650.322 rows=10000000 loops=1)
>  Planning time: 0.249 ms
>  Execution time: 4777.088 ms
> (7 rows)
> 
> In this case optimizer have decided that "Index Scan + Incremental
> Sort" would be cheaper than "Seq Scan + Sort".  But it appears that
> the amount of time we loose by selecting Index Scan over Seq Scan is
> bigger than amount of time we win by selecting Incremental Sort over
> Sort.  I would note that regular Sort consumes about 10X more disk
> space.  I bet that all this space has fit to OS cache of test 
> machine.  But optimizer did expect actual IO to take place in this 
> case.  This has lead actual time to be inadequate the costing.
> 

Yes, you're right the temporary file(s) likely fit into RAM in this test
(and even if they did not, the storage system is pretty good).

> *Case 2: Underlying node is not parallelyzed*
> 
> ===== 178 [Wed Mar  7 11:18:53 CET 2018] scale:10000000 groups:100
> work_mem:8MB incremental:on max_workers:2 =====
> SELECT * FROM s_2 ORDER BY a, b, c
>                                                                    
>  QUERY PLAN                                                             
>        
>
----------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=1179047.88..1179047.88 rows=1 width=20) (actual
> time=4819.999..4819.999 rows=0 loops=1)
>    ->  Incremental Sort  (cost=89.04..1079047.34 rows=10000054 width=20)
> (actual time=0.203..4603.197 rows=10000000 loops=1)
>          Sort Key: s_2.a, s_2.b, s_2.c
>          Presorted Key: s_2.a, s_2.b
>          Sort Method: quicksort  Memory: 135kB
>          Sort Groups: 10201
>          ->  Index Scan using s_2_a_b_idx on s_2  (cost=0.43..406985.62
> rows=10000054 width=20) (actual time=0.052..1461.177 rows=10000000 loops=1)
>  Planning time: 0.313 ms
>  Execution time: 4820.037 ms
> (9 rows)
> 
> ===== 182 [Wed Mar  7 11:20:11 CET 2018] scale:10000000 groups:100
> work_mem:8MB incremental:off max_workers:2 =====
> SELECT * FROM s_2 ORDER BY a, b, c
>                                                                  QUERY
> PLAN                                                                 
>
--------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=1705580.76..1705580.76 rows=1 width=20) (actual
> time=3985.818..3985.818 rows=0 loops=1)
>    ->  Gather Merge  (cost=649951.66..1622246.98 rows=8333378 width=20)
> (actual time=1782.354..3750.868 rows=10000000 loops=1)
>          Workers Planned: 2
>          Workers Launched: 2
>          ->  Sort  (cost=648951.64..659368.36 rows=4166689 width=20)
> (actual time=1778.362..2091.253 rows=3333333 loops=3)
>                Sort Key: s_2.a, s_2.b, s_2.c
>                Sort Method: external merge  Disk: 99136kB
>                Worker 0:  Sort Method: external merge  Disk: 96984kB
>                Worker 1:  Sort Method: external merge  Disk: 97496kB
>                ->  Parallel Seq Scan on s_2  (cost=0.00..105361.89
> rows=4166689 width=20) (actual time=0.022..233.640 rows=3333333 loops=3)
>  Planning time: 0.265 ms
>  Execution time: 4007.591 ms
> (12 rows)
> 
> The situation is similar to case #1 except that in the pair "Seq Scan
> + Sort" Sort also gets paralellyzed.  In the same way as in previous
> case, disk writes/reads during external sort are overestimated,
> because they actually use OS cache. I would also say that it's not
> necessary wrong decision of optimizer, because doing this work in
> single backend may consume less resources despite being overall
> slower.
> 

Yes, that seems like a likely explanation too.

I agree those don't seem like an issue in the Incremental Sort patch,
but more like generic costing problems.


Thanks for looking into the benchmark results.

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Fri, Mar 16, 2018 at 5:12 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
I agree those don't seem like an issue in the Incremental Sort patch,
but like a more generic costing problems.

Yes, I think so too.
Do you think we can mark this patch RFC, assuming that it has already received
plenty of review previously?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] [PATCH] Incremental sort

From
Tomas Vondra
Date:
On 03/16/2018 09:47 AM, Alexander Korotkov wrote:
> On Fri, Mar 16, 2018 at 5:12 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:
> 
>     I agree those don't seem like an issue in the Incremental Sort patch,
>     but like a more generic costing problems.
> 
> 
> Yes, I think so too.

I wonder if we could make the costing a bit more pessimistic, to make
these losses less likely, while still keeping the main wins (particularly
for the LIMIT queries). But that seems a bit like a lost cause, I guess.

> Do you think we can mark this patch RFC assuming that it have
> already got pretty much of review previously.
> 

Actually, I was going to propose to switch it to RFC, so I've just done
that. I think the patch is clearly ready for a committer to take a
closer look. I really like this improvement.

I'm going to rerun the tests, but that's mostly because I'm interested
if the change from i++ to i-- in cmpSortPresortedCols makes a measurable
difference. I don't expect to find any issues, so why wait with the RFC?
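
(For readers who haven't looked at the patch: cmpSortPresortedCols presumably
checks whether an incoming tuple still belongs to the current group by comparing
its presorted columns against the saved pivot tuple. The toy program below is not
patch code, just a sketch under that assumption, illustrating why scanning the
presorted columns from the last one backwards can pay off: with input sorted on
(a, b), a group boundary almost always shows up in the trailing column first.)

/*
 * Toy sketch (not patch code): with input sorted on (a, b), consecutive rows
 * that start a new group usually differ in the trailing presorted column
 * first, so checking columns from the last one backwards tends to bail out
 * after a single comparison, while checking from the first one forwards
 * often has to look at every column.
 */
#include <stdio.h>

#define NCOLS 2

typedef struct Row { int col[NCOLS]; } Row;

/* returns the number of column comparisons spent deciding "same group?" */
static int
same_group(const Row *a, const Row *b, int backwards, int *is_same)
{
    int ncmp = 0;

    *is_same = 1;
    for (int k = 0; k < NCOLS; k++)
    {
        int i = backwards ? NCOLS - 1 - k : k;

        ncmp++;
        if (a->col[i] != b->col[i])
        {
            *is_same = 0;
            break;
        }
    }
    return ncmp;
}

int
main(void)
{
    long fwd = 0, bwd = 0;
    Row  prev = {{0, 0}}, cur;
    int  dummy;

    /* stream sorted on (a, b): a has few distinct values, b has many */
    for (int a = 0; a < 10; a++)
        for (int b = 0; b < 1000; b++)
        {
            cur.col[0] = a;
            cur.col[1] = b;
            fwd += same_group(&prev, &cur, 0, &dummy);
            bwd += same_group(&prev, &cur, 1, &dummy);
            prev = cur;
        }

    printf("forward comparisons:  %ld\n", fwd);
    printf("backward comparisons: %ld\n", bwd);
    return 0;
}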

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
Hi!

A revised patch is attached.  It's rebased onto the latest master.

On Fri, Mar 16, 2018 at 3:55 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 03/16/2018 09:47 AM, Alexander Korotkov wrote:
> On Fri, Mar 16, 2018 at 5:12 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>
>     I agree those don't seem like an issue in the Incremental Sort patch,
>     but like a more generic costing problems.
>
>
> Yes, I think so too.

I wonder if we could make the costing a bit more pessimistic, to make
these loses less likely, while still keeping the main wins (particularly
for the LIMIT queries). But that seems a bit like a lost case, I guess.
 
Making the costing more pessimistic makes sense.  The revised patch does
it in a quite rough way: the volumes of groups in incremental sort are multiplied
by 1.5.  That makes one query in the regression tests fall back from incremental
sort to full sort.  Could you test it?  If this reduces the number of
cases where incremental sort causes a regression, then it might be an
acceptable way to do more pessimistic costing.
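
For a rough feel of what such a multiplier does, here is a toy back-of-the-envelope
model (only a sketch: it uses plain n*log2(n) comparison counts instead of the real
cost_sort formulas, and the 1.5 factor simply mirrors the multiplier mentioned
above). In this toy model the factor mostly matters when there are only a few large
groups, which is where it can flip the decision back to a full sort.

/*
 * Toy model (not the patch's costing code): compare an n*log2(n)
 * comparison-count estimate for one big sort against sorting ngroups
 * batches of n/ngroups tuples, optionally inflating the per-group size
 * by a pessimism factor such as the 1.5 mentioned above.
 */
#include <math.h>
#include <stdio.h>

static double
full_sort_cmp(double ntuples)
{
    return ntuples * log2(ntuples);
}

static double
incremental_sort_cmp(double ntuples, double ngroups, double pessimism)
{
    double per_group = pessimism * ntuples / ngroups;

    /* ngroups batches, each roughly per_group tuples */
    return ngroups * per_group * log2(per_group > 2.0 ? per_group : 2.0);
}

int
main(void)
{
    double ntuples = 1e6;
    double groups[] = {10, 1000, 100000};

    for (int i = 0; i < 3; i++)
    {
        double g = groups[i];

        printf("groups=%-8.0f full=%.3g incr=%.3g incr(x1.5)=%.3g\n",
               g,
               full_sort_cmp(ntuples),
               incremental_sort_cmp(ntuples, g, 1.0),
               incremental_sort_cmp(ntuples, g, 1.5));
    }
    return 0;
}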

> Do you think we can mark this patch RFC assuming that it have
> already got pretty much of review previously.
>

Actually, I was going to propose to switch it to RFC, so I've just done
that. I think the patch is clearly ready for a committer to take a
closer look. I really like this improvement.

I'm going to rerun the tests, but that's mostly because I'm interested
if the change from i++ to i-- in cmpSortPresortedCols makes a measurable
difference. I don't expect to find any issues, so why wait with the RFC?

Good, thanks.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Darafei "Komяpa" Praliaskouski
Date:
Hi,

On a PostGIS system tuned to prefer parallelism heavily (min_parallel_table_scan_size=10kB), we experience issues with the QGIS table discovery query with this patch:

Failing query is:
[local] gis@gis=# SELECT l.f_table_name,l.f_table_schema,l.f_geometry_column,upper(l.type),l.srid,l.coord_dimension,c.relkind,obj_description(c.oid) FROM geometry_columns l,pg_class c,pg_namespace n WHERE c.relname=l.f_table_name AND l.f_table_schema=n.
nspname AND n.oid=c.relnamespace AND has_schema_privilege(n.nspname,'usage') AND has_table_privilege('"'||n.nspname||'"."'||c.relname||'"','select') AND l.f_table_schema='public' ORDER BY n.nspname,c.relname,l.f_geometry_column;         
ERROR:  XX000: badly formatted node string "INCREMENTALSORT :startup_cost 37"...
CONTEXT:  parallel worker
LOCATION:  parseNodeString, readfuncs.c:2693
Time: 42,052 ms


Query plan:

                                                                                                                            QUERY PLAN                                                                                                                             
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sort  (cost=38717.21..38717.22 rows=1 width=393)
  Sort Key: c_1.relname, a.attname
  ->  Nested Loop  (cost=36059.35..38717.20 rows=1 width=393)
        ->  Index Scan using pg_namespace_nspname_index on pg_namespace n  (cost=0.28..2.30 rows=1 width=68)
              Index Cond: (nspname = 'public'::name)
              Filter: has_schema_privilege((nspname)::text, 'usage'::text)
        ->  Nested Loop  (cost=36059.08..38714.59 rows=1 width=407)
              ->  Nested Loop Left Join  (cost=36058.65..38712.12 rows=1 width=334)
                    Join Filter: ((s_2.connamespace = n_1.oid) AND (a.attnum = ANY (s_2.conkey)))
                    ->  Nested Loop Left Join  (cost=36058.51..38711.94 rows=1 width=298)
                          Join Filter: ((s_1.connamespace = n_1.oid) AND (a.attnum = ANY (s_1.conkey)))
                          ->  Nested Loop  (cost=36058.38..38711.75 rows=1 width=252)
                                Join Filter: (a.atttypid = t.oid)
                                ->  Gather Merge  (cost=36057.95..38702.65 rows=444 width=256)
                                      Workers Planned: 10
                                      ->  Merge Left Join  (cost=35057.76..37689.01 rows=44 width=256)
                                            Merge Cond: ((n_1.oid = s.connamespace) AND (c_1.oid = s.conrelid))
                                            Join Filter: (a.attnum = ANY (s.conkey))
                                            ->  Incremental Sort  (cost=37687.19..37687.30 rows=44 width=210)
                                                  Sort Key: n_1.oid, c_1.oid
                                                  Presorted Key: n_1.oid
                                                  ->  Nested Loop  (cost=34837.25..37685.99 rows=44 width=210)
                                                        ->  Merge Join  (cost=34836.82..34865.99 rows=9 width=136)
                                                              Merge Cond: (c_1.relnamespace = n_1.oid)
                                                              ->  Sort  (cost=34834.52..34849.05 rows=5814 width=72)
                                                                    Sort Key: c_1.relnamespace
                                                                    ->  Parallel Seq Scan on pg_class c_1  (cost=0.00..34470.99 rows=5814 width=72)
                                                                          Filter: ((relname <> 'raster_columns'::name) AND (NOT pg_is_other_temp_schema(relnamespace)) AND has_table_privilege(oid, 'SELECT'::text) AND (relkind = ANY ('{r,v,m,f,p}'::"char"[])))
                                                              ->  Sort  (cost=2.30..2.31 rows=1 width=68)
                                                                    Sort Key: n_1.oid
                                                                    ->  Index Scan using pg_namespace_nspname_index on pg_namespace n_1  (cost=0.28..2.29 rows=1 width=68)
                                                                          Index Cond: (nspname = 'public'::name)
                                                        ->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a  (cost=0.43..200.52 rows=11281 width=78)
                                                              Index Cond: (attrelid = c_1.oid)
                                                              Filter: (NOT attisdropped)
                                            ->  Sort  (cost=1.35..1.35 rows=1 width=77)
                                                  Sort Key: s.connamespace, s.conrelid
                                                  ->  Seq Scan on pg_constraint s  (cost=0.00..1.34 rows=1 width=77)
                                                        Filter: (consrc ~~* '%geometrytype(% = %'::text)
                                ->  Materialize  (cost=0.42..2.45 rows=1 width=4)
                                      ->  Index Scan using pg_type_typname_nsp_index on pg_type t  (cost=0.42..2.44 rows=1 width=4)
                                            Index Cond: (typname = 'geometry'::name)
                          ->  Index Scan using pg_constraint_conrelid_index on pg_constraint s_1  (cost=0.14..0.16 rows=1 width=77)
                                Index Cond: (conrelid = c_1.oid)
                                Filter: (consrc ~~* '%ndims(% = %'::text)
                    ->  Index Scan using pg_constraint_conrelid_index on pg_constraint s_2  (cost=0.14..0.16 rows=1 width=77)
                          Index Cond: (conrelid = c_1.oid)
                          Filter: (consrc ~~* '%srid(% = %'::text)
              ->  Index Scan using pg_class_relname_nsp_index on pg_class c  (cost=0.42..2.46 rows=1 width=73)
                    Index Cond: ((relname = c_1.relname) AND (relnamespace = n.oid))
                    Filter: has_table_privilege((((('"'::text || (n.nspname)::text) || '"."'::text) || (relname)::text) || '"'::text), 'select'::text)
(51 rows)

Darafei Praliaskouski, 
GIS Engineer / Juno Minsk

Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
Hi!

On Wed, Mar 21, 2018 at 2:30 PM, Darafei "Komяpa" Praliaskouski <me@komzpa.net> wrote:
on a PostGIS system tuned for preferring parallelism heavily (min_parallel_table_scan_size=10kB) we experience issues with QGIS table discovery query with this patch:

Failing query is:
[local] gis@gis=# SELECT l.f_table_name,l.f_table_schema,l.f_geometry_column,upper(l.type),l.srid,l.coord_dimension,c.relkind,obj_description(c.oid) FROM geometry_columns l,pg_class c,pg_namespace n WHERE c.relname=l.f_table_name AND l.f_table_schema=n.
nspname AND n.oid=c.relnamespace AND has_schema_privilege(n.nspname,'usage') AND has_table_privilege('"'||n.nspname||'"."'||c.relname||'"','select') AND l.f_table_schema='public' ORDER BY n.nspname,c.relname,l.f_geometry_column;         
ERROR:  XX000: badly formatted node string "INCREMENTALSORT :startup_cost 37"...
CONTEXT:  parallel worker
LOCATION:  parseNodeString, readfuncs.c:2693
Time: 42,052 ms


Query plan:

                                                                                                                            QUERY PLAN                                                                                                                             
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sort  (cost=38717.21..38717.22 rows=1 width=393)
  Sort Key: c_1.relname, a.attname
  ->  Nested Loop  (cost=36059.35..38717.20 rows=1 width=393)
        ->  Index Scan using pg_namespace_nspname_index on pg_namespace n  (cost=0.28..2.30 rows=1 width=68)
              Index Cond: (nspname = 'public'::name)
              Filter: has_schema_privilege((nspname)::text, 'usage'::text)
        ->  Nested Loop  (cost=36059.08..38714.59 rows=1 width=407)
              ->  Nested Loop Left Join  (cost=36058.65..38712.12 rows=1 width=334)
                    Join Filter: ((s_2.connamespace = n_1.oid) AND (a.attnum = ANY (s_2.conkey)))
                    ->  Nested Loop Left Join  (cost=36058.51..38711.94 rows=1 width=298)
                          Join Filter: ((s_1.connamespace = n_1.oid) AND (a.attnum = ANY (s_1.conkey)))
                          ->  Nested Loop  (cost=36058.38..38711.75 rows=1 width=252)
                                Join Filter: (a.atttypid = t.oid)
                                ->  Gather Merge  (cost=36057.95..38702.65 rows=444 width=256)
                                      Workers Planned: 10
                                      ->  Merge Left Join  (cost=35057.76..37689.01 rows=44 width=256)
                                            Merge Cond: ((n_1.oid = s.connamespace) AND (c_1.oid = s.conrelid))
                                            Join Filter: (a.attnum = ANY (s.conkey))
                                            ->  Incremental Sort  (cost=37687.19..37687.30 rows=44 width=210)
                                                  Sort Key: n_1.oid, c_1.oid
                                                  Presorted Key: n_1.oid
                                                  ->  Nested Loop  (cost=34837.25..37685.99 rows=44 width=210)
                                                        ->  Merge Join  (cost=34836.82..34865.99 rows=9 width=136)
                                                              Merge Cond: (c_1.relnamespace = n_1.oid)
                                                              ->  Sort  (cost=34834.52..34849.05 rows=5814 width=72)
                                                                    Sort Key: c_1.relnamespace
                                                                    ->  Parallel Seq Scan on pg_class c_1  (cost=0.00..34470.99 rows=5814 width=72)
                                                                          Filter: ((relname <> 'raster_columns'::name) AND (NOT pg_is_other_temp_schema(relnamespace)) AND has_table_privilege(oid, 'SELECT'::text) AND (relkind = ANY ('{r,v,m,f,p}'::"char"[])))
                                                              ->  Sort  (cost=2.30..2.31 rows=1 width=68)
                                                                    Sort Key: n_1.oid
                                                                    ->  Index Scan using pg_namespace_nspname_index on pg_namespace n_1  (cost=0.28..2.29 rows=1 width=68)
                                                                          Index Cond: (nspname = 'public'::name)
                                                        ->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a  (cost=0.43..200.52 rows=11281 width=78)
                                                              Index Cond: (attrelid = c_1.oid)
                                                              Filter: (NOT attisdropped)
                                            ->  Sort  (cost=1.35..1.35 rows=1 width=77)
                                                  Sort Key: s.connamespace, s.conrelid
                                                  ->  Seq Scan on pg_constraint s  (cost=0.00..1.34 rows=1 width=77)
                                                        Filter: (consrc ~~* '%geometrytype(% = %'::text)
                                ->  Materialize  (cost=0.42..2.45 rows=1 width=4)
                                      ->  Index Scan using pg_type_typname_nsp_index on pg_type t  (cost=0.42..2.44 rows=1 width=4)
                                            Index Cond: (typname = 'geometry'::name)
                          ->  Index Scan using pg_constraint_conrelid_index on pg_constraint s_1  (cost=0.14..0.16 rows=1 width=77)
                                Index Cond: (conrelid = c_1.oid)
                                Filter: (consrc ~~* '%ndims(% = %'::text)
                    ->  Index Scan using pg_constraint_conrelid_index on pg_constraint s_2  (cost=0.14..0.16 rows=1 width=77)
                          Index Cond: (conrelid = c_1.oid)
                          Filter: (consrc ~~* '%srid(% = %'::text)
              ->  Index Scan using pg_class_relname_nsp_index on pg_class c  (cost=0.42..2.46 rows=1 width=73)
                    Index Cond: ((relname = c_1.relname) AND (relnamespace = n.oid))
                    Filter: has_table_privilege((((('"'::text || (n.nspname)::text) || '"."'::text) || (relname)::text) || '"'::text), 'select'::text)
(51 rows)

Thank you for pointing this out.  I'll try to reproduce this issue and fix it.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Wed, Mar 21, 2018 at 2:32 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Wed, Mar 21, 2018 at 2:30 PM, Darafei "Komяpa" Praliaskouski <me@komzpa.net> wrote:
on a PostGIS system tuned for preferring parallelism heavily (min_parallel_table_scan_size=10kB) we experience issues with QGIS table discovery query with this patch:

Failing query is:
[local] gis@gis=# SELECT l.f_table_name,l.f_table_schema,l.f_geometry_column,upper(l.type),l.srid,l.coord_dimension,c.relkind,obj_description(c.oid) FROM geometry_columns l,pg_class c,pg_namespace n WHERE c.relname=l.f_table_name AND l.f_table_schema=n.
nspname AND n.oid=c.relnamespace AND has_schema_privilege(n.nspname,'usage') AND has_table_privilege('"'||n.nspname||'"."'||c.relname||'"','select') AND l.f_table_schema='public' ORDER BY n.nspname,c.relname,l.f_geometry_column;         
ERROR:  XX000: badly formatted node string "INCREMENTALSORT :startup_cost 37"...
CONTEXT:  parallel worker
LOCATION:  parseNodeString, readfuncs.c:2693
Time: 42,052 ms


Query plan:

                                                                                                                            QUERY PLAN                                                                                                                             
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sort  (cost=38717.21..38717.22 rows=1 width=393)
  Sort Key: c_1.relname, a.attname
  ->  Nested Loop  (cost=36059.35..38717.20 rows=1 width=393)
        ->  Index Scan using pg_namespace_nspname_index on pg_namespace n  (cost=0.28..2.30 rows=1 width=68)
              Index Cond: (nspname = 'public'::name)
              Filter: has_schema_privilege((nspname)::text, 'usage'::text)
        ->  Nested Loop  (cost=36059.08..38714.59 rows=1 width=407)
              ->  Nested Loop Left Join  (cost=36058.65..38712.12 rows=1 width=334)
                    Join Filter: ((s_2.connamespace = n_1.oid) AND (a.attnum = ANY (s_2.conkey)))
                    ->  Nested Loop Left Join  (cost=36058.51..38711.94 rows=1 width=298)
                          Join Filter: ((s_1.connamespace = n_1.oid) AND (a.attnum = ANY (s_1.conkey)))
                          ->  Nested Loop  (cost=36058.38..38711.75 rows=1 width=252)
                                Join Filter: (a.atttypid = t.oid)
                                ->  Gather Merge  (cost=36057.95..38702.65 rows=444 width=256)
                                      Workers Planned: 10
                                      ->  Merge Left Join  (cost=35057.76..37689.01 rows=44 width=256)
                                            Merge Cond: ((n_1.oid = s.connamespace) AND (c_1.oid = s.conrelid))
                                            Join Filter: (a.attnum = ANY (s.conkey))
                                            ->  Incremental Sort  (cost=37687.19..37687.30 rows=44 width=210)
                                                  Sort Key: n_1.oid, c_1.oid
                                                  Presorted Key: n_1.oid
                                                  ->  Nested Loop  (cost=34837.25..37685.99 rows=44 width=210)
                                                        ->  Merge Join  (cost=34836.82..34865.99 rows=9 width=136)
                                                              Merge Cond: (c_1.relnamespace = n_1.oid)
                                                              ->  Sort  (cost=34834.52..34849.05 rows=5814 width=72)
                                                                    Sort Key: c_1.relnamespace
                                                                    ->  Parallel Seq Scan on pg_class c_1  (cost=0.00..34470.99 rows=5814 width=72)
                                                                          Filter: ((relname <> 'raster_columns'::name) AND (NOT pg_is_other_temp_schema(relnamespace)) AND has_table_privilege(oid, 'SELECT'::text) AND (relkind = ANY ('{r,v,m,f,p}'::"char"[])))
                                                              ->  Sort  (cost=2.30..2.31 rows=1 width=68)
                                                                    Sort Key: n_1.oid
                                                                    ->  Index Scan using pg_namespace_nspname_index on pg_namespace n_1  (cost=0.28..2.29 rows=1 width=68)
                                                                          Index Cond: (nspname = 'public'::name)
                                                        ->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a  (cost=0.43..200.52 rows=11281 width=78)
                                                              Index Cond: (attrelid = c_1.oid)
                                                              Filter: (NOT attisdropped)
                                            ->  Sort  (cost=1.35..1.35 rows=1 width=77)
                                                  Sort Key: s.connamespace, s.conrelid
                                                  ->  Seq Scan on pg_constraint s  (cost=0.00..1.34 rows=1 width=77)
                                                        Filter: (consrc ~~* '%geometrytype(% = %'::text)
                                ->  Materialize  (cost=0.42..2.45 rows=1 width=4)
                                      ->  Index Scan using pg_type_typname_nsp_index on pg_type t  (cost=0.42..2.44 rows=1 width=4)
                                            Index Cond: (typname = 'geometry'::name)
                          ->  Index Scan using pg_constraint_conrelid_index on pg_constraint s_1  (cost=0.14..0.16 rows=1 width=77)
                                Index Cond: (conrelid = c_1.oid)
                                Filter: (consrc ~~* '%ndims(% = %'::text)
                    ->  Index Scan using pg_constraint_conrelid_index on pg_constraint s_2  (cost=0.14..0.16 rows=1 width=77)
                          Index Cond: (conrelid = c_1.oid)
                          Filter: (consrc ~~* '%srid(% = %'::text)
              ->  Index Scan using pg_class_relname_nsp_index on pg_class c  (cost=0.42..2.46 rows=1 width=73)
                    Index Cond: ((relname = c_1.relname) AND (relnamespace = n.oid))
                    Filter: has_table_privilege((((('"'::text || (n.nspname)::text) || '"."'::text) || (relname)::text) || '"'::text), 'select'::text)
(51 rows)

Thank you for pointing.  I'll try to reproduce this issue and fix it.

I found that Darafei used a build made from incremental-sort-7.patch.  That version contained a bug in incremental sort node deserialization.
Recent patch versions don't contain that bug.
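
As a toy analogy of the failure mode (this is not PostgreSQL code): plan trees are
serialized to a text form when they are shipped to parallel workers, and the reader
must recognize every node label and read back exactly the fields the writer emitted;
if the reader lags behind the writer, as in that older patch version, parsing fails
with a "badly formatted node string"-style error.

#include <stdio.h>

/* writer side: emits a node label followed by its fields */
static void
write_node(char *buf, size_t len, double startup_cost, int presorted_cols)
{
    snprintf(buf, len, "{INCSORTDEMO :startup_cost %.2f :presorted_cols %d}",
             startup_cost, presorted_cols);
}

/* reader side: only knows how to parse the labels it has been taught */
static int
read_node(const char *buf, int reader_knows_incsort)
{
    double startup_cost;
    int    presorted_cols;

    if (reader_knows_incsort &&
        sscanf(buf, "{INCSORTDEMO :startup_cost %lf :presorted_cols %d}",
               &startup_cost, &presorted_cols) == 2)
    {
        printf("parsed: startup_cost=%.2f presorted_cols=%d\n",
               startup_cost, presorted_cols);
        return 0;
    }
    printf("badly formatted node string \"%.32s\"...\n", buf);
    return -1;
}

int
main(void)
{
    char buf[128];

    write_node(buf, sizeof(buf), 37.0, 1);
    read_node(buf, 1);          /* reader and writer agree: OK */
    read_node(buf, 0);          /* reader doesn't know the node: parse error */
    return 0;
}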

I've checked that it works.

create table t (i int, value float8);
insert into t select i%1000, random() from generate_series(1,1000000) i;
set force_parallel_mode = on;

# explain select count(*) from (select * from (select * from t order by i) x order by i, value) y;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Gather  (cost=254804.94..254805.05 rows=1 width=8)
   Workers Planned: 1
   Single Copy: true
   ->  Aggregate  (cost=253804.94..253804.95 rows=1 width=8)
         ->  Incremental Sort  (cost=132245.97..241304.94 rows=1000000 width=12)
               Sort Key: t.i, t.value
               Presorted Key: t.i
               ->  Sort  (cost=132154.34..134654.34 rows=1000000 width=12)
                     Sort Key: t.i
                     ->  Seq Scan on t  (cost=0.00..15406.00 rows=1000000 width=12)
(10 rows)

# select count(*) from (select * from (select * from t order by i) x order by i, value) y;
  count
---------
 1000000
(1 row)

BTW, patch had conflicts with master.  Please, find rebased version attached.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Teodor Sigaev
Date:
> BTW, patch had conflicts with master.  Please, find rebased version attached.

Sorry, but patch conflicts with master again.


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/


Re: [HACKERS] [PATCH] Incremental sort

From
Teodor Sigaev
Date:
> BTW, patch had conflicts with master.  Please, find rebased version attached.

Despite the patch conflict, the patch looks committable; does anybody object to 
committing it?

The patch has received several rounds of review over 2 years, and it seems to me that 
keeping it out of the sources may cause us to lose it, although it promises a 
performance improvement in rather wide use cases.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/


Re: [HACKERS] [PATCH] Incremental sort

From
Tomas Vondra
Date:
On 03/28/2018 03:28 PM, Teodor Sigaev wrote:
>> BTW, patch had conflicts with master.  Please, find rebased version
>> attached.
> 
> Despite by patch conflist patch looks commitable, has anybody objections
> to commit it?
> 
> Patch recieved several rounds of review during 2 years, and seems to me,
> keeping it out from sources may cause a lost it. Although it suggests
> performance improvement in rather wide usecases.
> 

No objections from me - if you want me to do one final round of review
after the rebase (not sure how invasive it'll turn out), let me know.

BTW one detail I'd change is name of the GUC variable. enable_incsort
seems unnecessarily terse - let's go for enable_incremental_sort or
something like that.

regards


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Wed, Mar 28, 2018 at 4:44 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 03/28/2018 03:28 PM, Teodor Sigaev wrote:
>> BTW, patch had conflicts with master.  Please, find rebased version
>> attached.
>
> Despite by patch conflist patch looks commitable, has anybody objections
> to commit it?
>
> Patch recieved several rounds of review during 2 years, and seems to me,
> keeping it out from sources may cause a lost it. Although it suggests
> performance improvement in rather wide usecases.
>

No objections from me - if you want me to do one final round of review
after the rebase (not sure how invasive it'll turn out), let me know.

The rebased patch is attached.  Incremental sort gets used in multiple places
in the partition_aggregate regression test.  I've checked those cases, and it seems
that incremental sort was selected correctly. 

BTW one detail I'd change is name of the GUC variable. enable_incsort
seems unnecessarily terse - let's go for enable_incremental_sort or
something like that.

enable_incsort was already renamed to enable_incrementalsort in [1].

1. https://www.postgresql.org/message-id/CAPpHfduAVmiGDZC%2BdfNL1rEGu0mt45Rd_mxwjY57uqwWhrvQzg%40mail.gmail.com
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Tomas Vondra
Date:

On 03/28/2018 05:12 PM, Alexander Korotkov wrote:
> On Wed, Mar 28, 2018 at 4:44 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:
> 
>     On 03/28/2018 03:28 PM, Teodor Sigaev wrote:
>     >> BTW, patch had conflicts with master.  Please, find rebased version
>     >> attached.
>     >
>     > Despite by patch conflist patch looks commitable, has anybody objections
>     > to commit it?
>     >
>     > Patch recieved several rounds of review during 2 years, and seems to me,
>     > keeping it out from sources may cause a lost it. Although it suggests
>     > performance improvement in rather wide usecases.
>     >
> 
>     No objections from me - if you want me to do one final round of review
>     after the rebase (not sure how invasive it'll turn out), let me know.
> 
> 
> Rebased patch is attached.  Incremental sort get used in multiple places
> of partition_aggregate regression test.  I've checked those cases, and
> it seems that incremental sort was selected right.
> 

OK, I'll take a look.

>     BTW one detail I'd change is name of the GUC variable. enable_incsort
>     seems unnecessarily terse - let's go for enable_incremental_sort or
>     something like that.
> 
> 
> Already enable_incsort was already renamed to enable_incrementalsort
> since [1].  
> 
> 1.
> https://www.postgresql.org/message-id/CAPpHfduAVmiGDZC%2BdfNL1rEGu0mt45Rd_mxwjY57uqwWhrvQzg%40mail.gmail.com 
> 

Ah, apologies. I've been looking at the wrong version.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] [PATCH] Incremental sort

From
Alvaro Herrera
Date:
Teodor Sigaev wrote:
> > BTW, patch had conflicts with master.  Please, find rebased version attached.
> 
> Despite by patch conflist patch looks commitable, has anybody objections to
> commit it?
> 
> Patch recieved several rounds of review during 2 years, and seems to me,
> keeping it out from sources may cause a lost it. Although it suggests
> performance improvement in rather wide usecases.

Can we have a recap on what the patch *does*?  I see there's a
description in Alexander's first email
https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
but that was a long time ago, and the patch has likely changed in the
meantime ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] [PATCH] Incremental sort

From
Andres Freund
Date:
Hi,

On 2018-03-28 16:28:01 +0300, Teodor Sigaev wrote:
> > BTW, patch had conflicts with master.  Please, find rebased version attached.
> 
> Despite by patch conflist patch looks commitable, has anybody objections to
> commit it?

> Patch recieved several rounds of review during 2 years, and seems to me,
> keeping it out from sources may cause a lost it. Although it suggests
> performance improvement in rather wide usecases.

My impression is that it has *NOT* received enough review to be RFC. I'm not saying
it's impossible to get there this release, but just committing it
doesn't seem wise.

Greetings,

Andres Freund


Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Wed, Mar 28, 2018 at 7:17 PM, Andres Freund <andres@anarazel.de> wrote:
On 2018-03-28 16:28:01 +0300, Teodor Sigaev wrote:
> > BTW, patch had conflicts with master.  Please, find rebased version attached.
>
> Despite by patch conflist patch looks commitable, has anybody objections to
> commit it?

> Patch recieved several rounds of review during 2 years, and seems to me,
> keeping it out from sources may cause a lost it. Although it suggests
> performance improvement in rather wide usecases.

My impression it has *NOT* received enough review to be RFC. Not saying
it's impossible to get there this release, but that just committing it
doesn't seem wise.

I would say that the executor part of this patch has already received plenty of review.
For sure, there still might be issues.  I just mean that the amount of review of
the executor part of this patch is not less than for the average patch we commit.
But the optimizer and costing parts of this patch still need somebody to take
a look at them.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Kuzmenkov
Date:
Hi Alexander,

I took a quick look at the patch. Some things I fixed myself in the 
attached patch v.21. Here is the summary:

Typo in compare_fractional_path_costs() should be fixed as a separate patch.
Remove unused function estimate_pathkeys_groups.
The extra MemoryContextReset before tuplesort_end() shouldn't be a big deal, 
so we don't have to add a parameter to tuplesort_free().
Add comment to maincontext declaration.
Fix typo in INITIAL_MEMTUPSIZE.
Remove trailing whitespace.

Some other things I found:

In tuplesort_reset:
if (state->memtupsize < INITIAL_MEMTUPSIZE)
     <reallocate memtuples to INITIAL_MEMTUPSIZE>
I'd add a comment explaining when and why we have to do this. Also maybe 
a comment to other allocations of memtuples in tuplesort_begin() and 
mergeruns(), explaining why it is reallocated and why in maincontext.


In tuplesort_updatemax:
     /* XXX */
     if (spaceUsedOnDisk > state->maxSpaceOnDisk ||
         (spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > 
state->maxSpace))
XXX. Also comparing bools with '>' looks confusing to me.


We should add a comment on top of tuplesort.c, explaining that we now 
have a faster way to sort multiple batches of data using the same sort 
conditions.


The name 'main context' sounds somewhat vague. Maybe 'top context'? Not 
sure.


In ExecSupportBackwardsScan:
         case T_IncrementalSort:
             return false;
This separate case looks useless, I'd either add a comment explaining 
why it can't scan backwards, or just return false by default.



That's all I have for today; tomorrow I'll continue with reviewing the 
planner part of the patch.

-- 
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Tomas Vondra
Date:
Hi,

I've been doing a bit more review of the patch today, focusing on the
planner part, and I'm starting to have some doubts regarding the way
incremental sort paths are created. I do have some question about the
executor and other parts too.

I'll mark this as 'waiting on author' to make it clear the patch is
still being discussed; RFC is not an appropriate status for that.

Attached is a patch that highlights some of the interesting places, and
also suggests some minor changes to comments and other tweaks.


1) planning/costing of incremental vs. non-incremental sorts
------------------------------------------------------------

In short, all the various places that create/cost sorts:

 * createplan.c (make_sort)
 * planner.c (create_sort_path)
 * pathnode.c (cost_sort)

seem to prefer incremental sorts whenever available. Consider for
example this code from create_merge_append_plan():

    if (!pathkeys_common_contained_in(pathkeys, subpath->pathkeys,
                                      &n_common_pathkeys))
    {
        Sort       *sort = make_sort(subplan, numsortkeys,
                                     n_common_pathkeys,
                                     sortColIdx, sortOperators,
                                     collations, nullsFirst);

        label_sort_with_costsize(root, sort, best_path->limit_tuples);
        subplan = (Plan *) sort;
    }

This essentially says that when (n_common_pathkeys > 0), the sort is
going to be incremental.

That however seems to rely on an important assumption - when the input
is presorted, the incremental sort is expected to be cheaper than
regular sort.

This assumption however seems to be proven invalid by cost_sort, which
does the common part for both sort modes (incremental/non-incremental)
first, and then does this:

    /* Extra costs of incremental sort */
    if (presorted_keys > 0)
    {
        ... add something to the costs ...
    }

That is, the incremental sort seems pretty much guaranteed to be costed as
more expensive than a regular Sort (with the exception of LIMIT queries,
where it's guaranteed to win thanks to its lower startup cost).

I don't know how significant the cost difference may be (perhaps not
much), or if it may lead to inefficient plans. For example what if the
cheapest total path is partially sorted by chance, and only has a single
prefix group? Then all the comparisons with pivotSlot are unnecessary.

But I'm pretty sure it may lead to surprising behavior - for example if
you disable incremental sorts (enable_incrementalsort=off), the plan
will switch to plain sort without the additional costs. So you'll get a
cheaper plan by disabling some operation. That's surprising.

So I think it would be more appropriate if those places actually did a
costing of incremental vs. non-incremental sorts, and then constructed
the cheaper option. Essentially we should consider both plain and
incremental sort for each partially sorted input path, and then pick the
right one.

Of course, this is going to be tricky in createplan.c which builds the
plans directly - in that case it might be integrated into make_sort() or
something like that.
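
To illustrate the idea, here is a hypothetical variant of the
create_merge_append_plan() snippet quoted above, reusing its local variables.
This is only a sketch, not actual patch code, and it assumes that passing zero
presorted keys to the patched make_sort() yields a plain sort:

    /*
     * Hypothetical sketch: build both a full sort and an incremental sort
     * over the same subplan, cost them, and keep whichever is cheaper
     * instead of always preferring the incremental one.
     */
    Sort       *full_sort = make_sort(subplan, numsortkeys, 0,
                                      sortColIdx, sortOperators,
                                      collations, nullsFirst);
    Sort       *inc_sort = make_sort(subplan, numsortkeys, n_common_pathkeys,
                                     sortColIdx, sortOperators,
                                     collations, nullsFirst);

    label_sort_with_costsize(root, full_sort, best_path->limit_tuples);
    label_sort_with_costsize(root, inc_sort, best_path->limit_tuples);

    if (inc_sort->plan.total_cost < full_sort->plan.total_cost)
        subplan = (Plan *) inc_sort;
    else
        subplan = (Plan *) full_sort;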

Also, I wonder if we could run into problems, due to incremental sort
not supporting things the regular sort does (rewind, backwards scans and
mark/restore).


2) nodeIncrementalsort.c
------------------------

There's a couple of obsolete comments, that came from nodeSort.c and did
not get tweaked (and so talk about first-time through when incremental
sort needs to do that for each group, etc.). The attached diff tweaks
those, and clarifies a couple of others. I've also added some comments
explaining what the pivotSlot is about etc. There's also a couple of XXX
comments asking additional questions/clarifications.

I'm wondering if a static MIN_GROUP_SIZE is good idea. For example, what
if the subplan is expected to return only very few tuples (say, 33), but
the query includes LIMIT 1. Now, let's assume the startup/total cost of
the subplan is 1 and 1000000. With MIN_GROUP_SIZE 32 we're bound to
execute it pretty much till the end, while we could terminate after the
first tuple (if the prefix changes).

So I think we should use a Min(limit,MIN_GROUP_SIZE) here, and perhaps
this should depend on average group size too.
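
A minimal sketch of that idea (the names, the MIN_GROUP_SIZE value and the exact
policy are assumptions here, not the patch's code):

#include <stdint.h>
#include <stdio.h>

#define MIN_GROUP_SIZE 32       /* assumed batch floor, as discussed above */

/*
 * Toy sketch of an adaptive batch size: batch more when groups are expected
 * to be large (fewer sort restarts), but never read more tuples from the
 * subplan than a small LIMIT could possibly need.
 */
static int64_t
choose_batch_size(int64_t limit_hint, double est_group_size)
{
    int64_t batch = MIN_GROUP_SIZE;

    if (est_group_size > (double) batch)
        batch = (int64_t) est_group_size;   /* avoid tiny per-group sorts */

    if (limit_hint > 0 && limit_hint < batch)
        batch = limit_hint;                 /* Min(limit, batch) */

    return batch;
}

int
main(void)
{
    printf("%lld\n", (long long) choose_batch_size(1, 0.0));   /* LIMIT 1 -> 1 */
    printf("%lld\n", (long long) choose_batch_size(0, 500.0)); /* big groups -> 500 */
    return 0;
}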

The other questionable thing seems to be this claim:

 * We unlikely will have huge groups with incremental sort.  Therefore
 * usage of abbreviated keys would be likely a waste of time.

followed by disabling abbreviated keys in the tuplesort_begin_heap call.
I find this rather dubious and unsupported by any arguments (I certainly
don't see any in the comments).

It would be more acceptable if the estimated number of groups was used
when deciding whether to use incremental sort or not, but that's not the
case - as explained in the first part, we simply prefer incremental
sorts whenever there is a prefix. In those cases we have very little
idea (or even guarantees) regarding the average group size.

Furthermore, cost_sort is estimating the number of groups, so it does
know the average group size. I don't see why we couldn't consider it
here too, and disable/enable abbreviated keys depending on that.


3) pathkeys.c
-------------

The new function pathkeys_useful_for_ordering() does actually change
behavior depending on enable_incrementalsort. That seems like a rather
bad idea, for a couple of reasons.

AFAICS pathkeys.c is supposed to provide generic utils for work with
pathkeys, and no one would expect the functions to change behavior
depending on enable_* GUCs. I certainly would not.

In short, this does not seem like the right place to enable/disable
incremental sorts, that should be done when costing the plan (i.e. in
costsize.c) or creating the plan.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Tomas Vondra
Date:
On 03/31/2018 10:43 PM, Tomas Vondra wrote:
> ...
> But I'm pretty sure it may lead to surprising behavior - for example if
> you disable incremental sorts (enable_incrementalsort=off), the plan
> will switch to plain sort without the additional costs. So you'll get a
> cheaper plan by disabling some operation. That's surprising.
> 

To illustrate this is a valid issue, consider this trivial example:

create table t (a int, b int, c int);

insert into t select 10*random(), 10*random(), 10*random()
  from generate_series(1,1000000) s(i);

analyze t;

explain select * from (select * from t order by a,b) foo order by a,b,c;

                               QUERY PLAN
------------------------------------------------------------------------
 Incremental Sort  (cost=133100.48..264139.27 rows=1000000 width=12)
   Sort Key: t.a, t.b, t.c
   Presorted Key: t.a, t.b
   ->  Sort  (cost=132154.34..134654.34 rows=1000000 width=12)
         Sort Key: t.a, t.b
         ->  Seq Scan on t  (cost=0.00..15406.00 rows=1000000 width=12)
(6 rows)

set enable_incrementalsort = off;

explain select * from (select * from t order by a,b) foo order by a,b,c;
                               QUERY PLAN
------------------------------------------------------------------------
 Sort  (cost=261402.69..263902.69 rows=1000000 width=12)
   Sort Key: t.a, t.b, t.c
   ->  Sort  (cost=132154.34..134654.34 rows=1000000 width=12)
         Sort Key: t.a, t.b
         ->  Seq Scan on t  (cost=0.00..15406.00 rows=1000000 width=12)
(5 rows)

So the cost with incremental sort was 264139, and after disabling
incremental sort it dropped to 263902. Granted, the difference is
negligible in this case, but it's still surprising.

Also, it can be made much more significant by reducing the number of
prefix groups in the data:

truncate t;

insert into t select 1,1,1 from generate_series(1,1000000) s(i);

analyze t;

set enable_incrementalsort = on;

explain select * from (select * from t order by a,b) foo order by a,b,c;

                               QUERY PLAN
------------------------------------------------------------------------
 Incremental Sort  (cost=324165.83..341665.85 rows=1000000 width=12)
   Sort Key: t.a, t.b, t.c
   Presorted Key: t.a, t.b
   ->  Sort  (cost=132154.34..134654.34 rows=1000000 width=12)
         Sort Key: t.a, t.b
         ->  Seq Scan on t  (cost=0.00..15406.00 rows=1000000 width=12)
(6 rows)

So that's 263902 vs. 341665, yet we still prefer the incremental mode.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Sun, Apr 1, 2018 at 12:06 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 03/31/2018 10:43 PM, Tomas Vondra wrote:
> ...
> But I'm pretty sure it may lead to surprising behavior - for example if
> you disable incremental sorts (enable_incrementalsort=off), the plan
> will switch to plain sort without the additional costs. So you'll get a
> cheaper plan by disabling some operation. That's surprising.
>

To illustrate this is a valid issue, consider this trivial example:

create table t (a int, b int, c int);

insert into t select 10*random(), 10*random(), 10*random()
  from generate_series(1,1000000) s(i);

analyze t;

explain select * from (select * from t order by a,b) foo order by a,b,c;

                               QUERY PLAN
------------------------------------------------------------------------
 Incremental Sort  (cost=133100.48..264139.27 rows=1000000 width=12)
   Sort Key: t.a, t.b, t.c
   Presorted Key: t.a, t.b
   ->  Sort  (cost=132154.34..134654.34 rows=1000000 width=12)
         Sort Key: t.a, t.b
         ->  Seq Scan on t  (cost=0.00..15406.00 rows=1000000 width=12)
(6 rows)

set enable_incrementalsort = off;

explain select * from (select * from t order by a,b) foo order by a,b,c;
                               QUERY PLAN
------------------------------------------------------------------------
 Sort  (cost=261402.69..263902.69 rows=1000000 width=12)
   Sort Key: t.a, t.b, t.c
   ->  Sort  (cost=132154.34..134654.34 rows=1000000 width=12)
         Sort Key: t.a, t.b
         ->  Seq Scan on t  (cost=0.00..15406.00 rows=1000000 width=12)
(5 rows)

So the cost with incremental sort was 264139, and after disabling the
incremental cost it dropped to 263902. Granted, the difference is
negligible in this case, but it's still surprising.

Also, it can be made much more significant by reducing the number of
prefix groups in the data:

truncate t;

insert into t select 1,1,1 from generate_series(1,1000000) s(i);

analyze t;

set enable_incrementalsort = on;

explain select * from (select * from t order by a,b) foo order by a,b,c;

                               QUERY PLAN
------------------------------------------------------------------------
 Incremental Sort  (cost=324165.83..341665.85 rows=1000000 width=12)
   Sort Key: t.a, t.b, t.c
   Presorted Key: t.a, t.b
   ->  Sort  (cost=132154.34..134654.34 rows=1000000 width=12)
         Sort Key: t.a, t.b
         ->  Seq Scan on t  (cost=0.00..15406.00 rows=1000000 width=12)
(6 rows)

So that's 263902 vs. 341665, yet we still prefer the incremental mode.

The problem is well-defined, thank you.
I'll check today what can be done here.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] [PATCH] Incremental sort

From
Tomas Vondra
Date:
On 04/03/2018 11:09 AM, Alexander Korotkov wrote:
> On Sun, Apr 1, 2018 at 12:06 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:
> 
>     On 03/31/2018 10:43 PM, Tomas Vondra wrote:
>     > ...
>     > But I'm pretty sure it may lead to surprising behavior - for example if
>     > you disable incremental sorts (enable_incrementalsort=off), the plan
>     > will switch to plain sort without the additional costs. So you'll get a
>     > cheaper plan by disabling some operation. That's surprising.
>     >
> 
>     To illustrate this is a valid issue, consider this trivial example:
> 
>     create table t (a int, b int, c int);
> 
>     insert into t select 10*random(), 10*random(), 10*random()
>       from generate_series(1,1000000) s(i);
> 
>     analyze t;
> 
>     explain select * from (select * from t order by a,b) foo order by a,b,c;
> 
>                                    QUERY PLAN
>     ------------------------------------------------------------------------
>      Incremental Sort  (cost=133100.48..264139.27 rows=1000000 width=12)
>        Sort Key: t.a, t.b, t.c
>        Presorted Key: t.a, t.b
>        ->  Sort  (cost=132154.34..134654.34 rows=1000000 width=12)
>              Sort Key: t.a, t.b
>              ->  Seq Scan on t  (cost=0.00..15406.00 rows=1000000 width=12)
>     (6 rows)
> 
>     set enable_incrementalsort = off;
> 
>     explain select * from (select * from t order by a,b) foo order by a,b,c;
>                                    QUERY PLAN
>     ------------------------------------------------------------------------
>      Sort  (cost=261402.69..263902.69 rows=1000000 width=12)
>        Sort Key: t.a, t.b, t.c
>        ->  Sort  (cost=132154.34..134654.34 rows=1000000 width=12)
>              Sort Key: t.a, t.b
>              ->  Seq Scan on t  (cost=0.00..15406.00 rows=1000000 width=12)
>     (5 rows)
> 
>     So the cost with incremental sort was 264139, and after disabling the
>     incremental cost it dropped to 263902. Granted, the difference is
>     negligible in this case, but it's still surprising.
> 
>     Also, it can be made much more significant by reducing the number of
>     prefix groups in the data:
> 
>     truncate t;
> 
>     insert into t select 1,1,1 from generate_series(1,1000000) s(i);
> 
>     analyze t;
> 
>     set enable_incrementalsort = on;
> 
>     explain select * from (select * from t order by a,b) foo order by a,b,c;
> 
>                                    QUERY PLAN
>     ------------------------------------------------------------------------
>      Incremental Sort  (cost=324165.83..341665.85 rows=1000000 width=12)
>        Sort Key: t.a, t.b, t.c
>        Presorted Key: t.a, t.b
>        ->  Sort  (cost=132154.34..134654.34 rows=1000000 width=12)
>              Sort Key: t.a, t.b
>              ->  Seq Scan on t  (cost=0.00..15406.00 rows=1000000 width=12)
>     (6 rows)
> 
>     So that's 263902 vs. 341665, yet we still prefer the incremental mode.
> 
> 
> Problem is well-defined, thank you.
> I'll check what can be done in this field today.
> 

I think solving this may be fairly straight-forward. Essentially, until
now we only had one way to do the sort, so it was OK to make the sort
implicit by checking if the path is sorted

    if (input not sorted)
    {
        ... add a Sort node ...
    }

But now we have multiple possible ways to do the sort, with different
startup/total costs. So the places that create the sorts need to
actually generate the Sort paths for each sort alternative, and store
the information in the Sort node (instead of relying on pathkeys).

Ultimately, this should simplify the createplan.c places making all the
make_sort calls unnecessary (i.e. the input should be already sorted
when needed). Otherwise it'd mean the decision needs to be done locally,
but I don't think that should be needed.

But it's surely a fairly invasive change to the patch ...
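To make the idea a bit more concrete, here is a rough sketch (not the actual
patch code; create_incremental_sort_path() and pathkeys_common() are
assumptions based on the patch, so treat the exact signatures as such) of how
a call site could generate both alternatives and let add_path() keep the
cheaper one:

    /*
     * Rough sketch only: instead of implicitly adding a Sort in
     * createplan.c, generate both sorted alternatives as paths and let
     * add_path() decide.  create_incremental_sort_path() and
     * pathkeys_common() are assumed helpers, not committed code.
     */
    static void
    add_sorted_path_alternatives(PlannerInfo *root, RelOptInfo *rel,
                                 Path *subpath, List *required_pathkeys,
                                 double limit_tuples)
    {
        int         n_common;

        /* Full sort is always applicable. */
        add_path(rel, (Path *)
                 create_sort_path(root, rel, subpath,
                                  required_pathkeys, limit_tuples));

        /* Incremental sort needs a non-empty presorted prefix. */
        n_common = pathkeys_common(required_pathkeys, subpath->pathkeys);
        if (n_common > 0 && n_common < list_length(required_pathkeys))
            add_path(rel, (Path *)
                     create_incremental_sort_path(root, rel, subpath,
                                                  required_pathkeys,
                                                  n_common,
                                                  limit_tuples));
    }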

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
Hi!

On Tue, Apr 3, 2018 at 2:10 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
I think solving this may be fairly straight-forward. Essentially, until
now we only had one way to do the sort, so it was OK to make the sort
implicit by checking if the path is sorted

    if (input not sorted)
    {
        ... add a Sort node ...
    }

But now we have multiple possible ways to do the sort, with different
startup/total costs. So the places that create the sorts need to
actually generate the Sort paths for each sort alternative, and store
the information in the Sort node (instead of relying on pathkeys).

Ultimately, this should simplify the createplan.c places making all the
make_sort calls unnecessary (i.e. the input should be already sorted
when needed). Otherwise it'd mean the decision needs to be done locally,
but I don't think that should be needed.

But it's surely a fairly invasive change to the patch ...

Right, there are situations when incremental sort has a lower startup cost,
but a higher total cost.  In order to find the lower cost, we ideally should
generate paths for both full sort and incremental sort.  However, that would
increase the total number of paths and could slow down planning.  Another
issue is that we don't always generate paths for sorts.  And yes, it would be
rather invasive.  So, that doesn't look feasible to get into 11.

Instead, I decided to cut the usage of incremental sort.  Now, incremental
sort is generated only in create_sort_path().  The cheaper path is selected
between incremental sort and full sort, taking limit_tuples into account.
That limits the usage of incremental sort, but the risk of regression from
this patch is also minimal.  In fact, incremental sort will be used only when
a sort is explicitly specified and either a LIMIT is also specified or the
dataset to be sorted is large enough that incremental sort saves disk I/O.

Attached patch also incorporates following commits made by Alexander Kuzmenkov:
* Rename fields of IncrementalSortState to snake_case for the sake of consistency.
* Rename group test function to isCurrentGroup.
* Comments from Tomas Vondra about nodeIncrementalSort.c
* Add a test for incremental sort.
* Add a separate function to calculate costs of incremental sort.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Tomas Vondra
Date:
On 04/06/2018 01:43 AM, Alexander Korotkov wrote:
> Hi!
> 
> On Tue, Apr 3, 2018 at 2:10 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:
> 
>     I think solving this may be fairly straight-forward. Essentially, until
>     now we only had one way to do the sort, so it was OK to make the sort
>     implicit by checking if the path is sorted
> 
>         if (input not sorted)
>         {
>             ... add a Sort node ...
>         }
> 
>     But now we have multiple possible ways to do the sort, with different
>     startup/total costs. So the places that create the sorts need to
>     actually generate the Sort paths for each sort alternative, and store
>     the information in the Sort node (instead of relying on pathkeys).
> 
>     Ultimately, this should simplify the createplan.c places making all the
>     make_sort calls unnecessary (i.e. the input should be already sorted
>     when needed). Otherwise it'd mean the decision needs to be done locally,
>     but I don't think that should be needed.
> 
>     But it's surely a fairly invasive change to the patch ...
> 
> 
> Right, there are situation when incremental sort has lower startup cost,
> but higher total cost.  In order to find lower cost, we ideally should
> generate
> paths for both full sort and incremental sort.  However, that would increase
> number of total pathes, and could slowdown planning time.  Another issue
> that we don't always generate pathes for sort.  And yes, it would be rather
> invasive.  So, that doesn't look feasible to get into 11.
> 

I agree that's probably not feasible for PG11, considering the freeze is
about 48h from now. Not necessarily because of the amount of code needed to
do that (it might be fairly small, actually), but because of the risk of
regressions in other types of plans and the lack of time for review/testing.

I do not think this would cause a significant increase in the number of
paths. We already have the (partially sorted) paths in the pathlist,
otherwise v21 wouldn't be able to build the incremental sort path anyway. And
the places that make the decision in createplan.c could fairly easily make it
when constructing the path, I believe.

Looking v21, this affects three different places:

1) create_merge_append_plan

For merge_append the issue is that generate_mergeappend_paths() calls
get_cheapest_path_for_pathkeys(), which only looks at the cheapest
startup/total path and then simply falls back to cheapest_total_path if
there are no suitably sorted paths. IMHO if you modify this to also
consider partially-sorted paths, it should work. You'll have to account
for the extra cost of the incremental sort, and that check needs to be
fairly cheap (perhaps even by first quickly computing some initial cost
estimate - see e.g. initial_cost_nestloop/final_cost_nestloop).

2) create_mergejoin_plan

For mergejoin, the issue is that sort_inner_and_outer() only looks at
cheapest_total_path for both sides, even before computing the merge
pathkeys. So some of the code would have to move into the foreach loop,
and the paths would be picked by get_cheapest_path_for_pathkeys(). This
time only using total cost, per the comment in sort_inner_and_outer().

3) create_gather_merge_plan

This seems fairly simple - the issue is that gather_grouping_paths()
only looks at cheapest_partial_path. Should be a matter of simply
calling the improved get_cheapest_path_for_pathkeys().

Of course, this is all fairly hand-wavy and it may turn out to be fairly
expensive in some cases. But we can use another trick - we don't need to
search through all partially sorted paths, because for each pathkey
prefix there can be just one "best" path for startup cost and one for
total cost. So we could maintain a much shorter list of partially sorted
paths, I think.
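As a rough sketch of that last idea (these names are assumptions for
illustration, not actual planner code; pathkeys_common() is assumed to return
the shared pathkey prefix length), picking the partially sorted candidates
could look like this:

    /*
     * Sketch only: among the paths sorted by some non-empty prefix of
     * required_pathkeys, remember the cheapest-startup and the
     * cheapest-total candidate.
     */
    ListCell   *lc;
    Path       *best_startup = NULL;
    Path       *best_total = NULL;

    foreach(lc, rel->pathlist)
    {
        Path   *path = (Path *) lfirst(lc);

        if (pathkeys_common(required_pathkeys, path->pathkeys) == 0)
            continue;           /* not even partially sorted */

        if (best_startup == NULL ||
            path->startup_cost < best_startup->startup_cost)
            best_startup = path;

        if (best_total == NULL ||
            path->total_cost < best_total->total_cost)
            best_total = path;
    }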

>
> Intead, I decided to cut usage of incremental sort.  Now, incremental sort
> is generated only in create_sort_path().  Cheaper path selected between
> incremental sort and full sort with taking limit_tuples into account.
> That limits usage of incremental sort, however risk of regression by this
> patch is also minimal.  In fact, incremental sort will be used only when
> sort is explicitly specified and simultaneously LIMIT is specified or
> dataset to be sorted is large and incremental sort saves disk IO.
> 

I personally am OK with reducing the scope of the patch like this. It's
still beneficial for the common ORDER BY + LIMIT case, which is good. I
don't think it may negatively affect other cases (at least I can't think
of any).

It's pretty obvious it may be extremely useful for the other cases too
(particularly for mergejoin on large tables, where it can allow
in-memory sort with quick startup).

But even if you managed to make the necessary code changes, it's
unlikely any experienced committer will look into such a significant
change this close to the cutoff. Either they are going to be away for the
weekend, or they are already looking at other patches :-(

> Attached patch also incorporates following commits made by Alexander
> Kuzmenkov:
> * Rename fields of IncrementalSortState to snake_case for the sake of
> consistency.
> * Rename group test function to isCurrentGroup.
> * Comments from Tomas Vondra about nodeIncrementalSort.c
> * Add a test for incremental sort.
> * Add a separate function to calculate costs of incremental sort.
> 

Those changes seem fine, but there are still a couple of issues remaining:

1) pathkeys_useful_for_ordering() still uses enable_incrementalsort,
which I think is a bad idea. I've complained about it in my review on
31/3, and I don't see any explanation why this is a good idea.

2) Likewise, I've suggested that the claim about abbreviated keys in
nodeIncrementalsort.c is dubious. No response, and the XXX comment was
instead merged into the patch:

 * XXX The claim about abbreviated keys seems rather dubious, IMHO.

3) There is a comment at cost_merge_append, despite there being no
relevant changes in that function. Misplaced comment?

4) It's not clear to me why INITIAL_MEMTUPSIZE is defined the way it is.
There needs to be a comment - the intent seems to be making it large
enough to exceed ALLOCSET_SEPARATE_THRESHOLD, but it's not quite clear
why that's a good idea.

5) I do get this warning when building the code:

costsize.c: In function ‘cost_incremental_sort’:
costsize.c:1812:2: warning: ISO C90 forbids mixed declarations and code
[-Wdeclaration-after-statement]
  List    *presortedExprs = NIL;
  ^~~~

6) The comment at cost_incremental_sort talks about cost_sort_internal,
but it's cost_sort_tuplesort I guess.

7) The new code in create_sort_path is somewhat ugly, I guess. It's
correct, but it really needs to be made easier to comprehend. I might
have time to look into that tomorrow, but I can't promise that.

Attached is a diff highlighting some of those places, and a couple of
minor code formatting fixes.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Kuzmenkov
Date:
Hi all,

This is the other Alexander K. speaking.


On 06.04.2018 20:26, Tomas Vondra wrote:
> I personally am OK with reducing the scope of the patch like this. It's
> still beneficial for the common ORDER BY + LIMIT case, which is good. I
> don't think it may negatively affect other cases (at least I can't think
> of any).

I think we can reduce it even further. Just try incremental sort along 
with full sort over the cheapest path in create_ordered_paths, and don't 
touch anything else. This is a very minimal and a probably safe start, 
and then we can continue working on other, more complex cases. In the 
attached patch I tried to do this. We probably should also remove 
changes in make_sort() and create a separate function 
make_incremental_sort() for it, but I'm done for today.


> 1) pathkeys_useful_for_ordering() still uses enable_incrementalsort,
> which I think is a bad idea. I've complained about it in my review on
> 31/3, and I don't see any explanation why this is a good idea.

Removed.

> 2) Likewise, I've suggested that the claim about abbreviated keys in
> nodeIncrementalsort.c is dubious. No response, and the XXX comment was
> instead merged into the patch:
>
>   * XXX The claim about abbreviated keys seems rather dubious, IMHO.

Not sure about that, maybe just use abbreviated keys for the first 
version? Later we can research this more closely and maybe start 
deciding whether to use abbrev on planning stage.

> 3) There is a comment at cost_merge_append, despite there being no
> relevant changes in that function. Misplaced comment?

Removed.

> 4) It's not clear to me why INITIAL_MEMTUPSIZE is defined the way it is.
> There needs to be a comment - the intent seems to be making it large
> enough to exceed ALLOCSET_SEPARATE_THRESHOLD, but it's not quite clear
> why that's a good idea.

Not sure myself, let's ask the other Alexander.


> 5) I do get this warning when building the code:
>
> costsize.c: In function ‘cost_incremental_sort’:
> costsize.c:1812:2: warning: ISO C90 forbids mixed declarations and code
> [-Wdeclaration-after-statement]
>    List    *presortedExprs = NIL;
>    ^~~~
>
> 6) The comment at cost_incremental_sort talks about cost_sort_internal,
> but it's cost_sort_tuplesort I guess.

Fixed.

> 7) The new code in create_sort_path is somewhat ugly, I guess. It's
> correct, but it really needs to be made easier to comprehend. I might
> have time to look into that tomorrow, but I can't promise that.

Removed this code altogether, now the costs are compared by add_path as 
usual.

> Attached is a diff highlighting some of those places, and couple of
> minor code formatting fixes.

Applied.

Also some other changes from me:

* Remove extra blank lines.
* label_sort_with_costsize shouldn't have to deal with IncrementalSort plans,
  because they are only created from corresponding Path nodes.
* Reword a comment in ExecSupportsBackwardsScan.
* Clarify cost calculations.
* enable_incrementalsort is checked at path level, we don't have to check it
  again at plan level.
* enable_sort should act as a cost-based soft disable for both incremental
  and normal sort.

-- 
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Fri, Apr 6, 2018 at 11:40 PM, Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru> wrote:
On 06.04.2018 20:26, Tomas Vondra wrote:
I personally am OK with reducing the scope of the patch like this. It's
still beneficial for the common ORDER BY + LIMIT case, which is good. I
don't think it may negatively affect other cases (at least I can't think
of any).

I think we can reduce it even further. Just try incremental sort along with full sort over the cheapest path in create_ordered_paths, and don't touch anything else. This is a very minimal and a probably safe start, and then we can continue working on other, more complex cases. In the attached patch I tried to do this. We probably should also remove changes in make_sort() and create a separate function make_incremental_sort() for it, but I'm done for today.
 
I've done further unwedding of sort and incremental sort, providing them with separate functions for plan creation.

2) Likewise, I've suggested that the claim about abbreviated keys in
nodeIncrementalsort.c is dubious. No response, and the XXX comment was
instead merged into the patch:

  * XXX The claim about abbreviated keys seems rather dubious, IMHO.

Not sure about that, maybe just use abbreviated keys for the first version? Later we can research this more closely and maybe start deciding whether to use abbrev on planning stage.

That comes from the time when we were trying to make incremental sort always
no worse than full sort.  Now we have separate paths for full and incremental
sorts, and some costing penalty for incremental sort.  So incremental sort should be
selected only when it's expected to give a big win.  Thus, we can give up on this
optimization, at least in the initial version.

So, removed.

4) It's not clear to me why INITIAL_MEMTUPSIZE is defined the way it is.
There needs to be a comment - the intent seems to be making it large
enough to exceed ALLOCSET_SEPARATE_THRESHOLD, but it's not quite clear
why that's a good idea.

Not sure myself, let's ask the other Alexander.
 
I've added a comment to INITIAL_MEMTUPSIZE.  However, to be fair, it's not an
invention of this patch: the initial size of the memtuples array was the same previously.
What this patch did is just move it into a macro.
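For reference, the definition is roughly as follows (quoting from memory, so
treat the exact expression as an assumption); the point is that the initial
memtuples allocation already exceeds ALLOCSET_SEPARATE_THRESHOLD:

    /*
     * Roughly (exact expression is an assumption): make the initial
     * memtuples allocation exceed ALLOCSET_SEPARATE_THRESHOLD, so that
     * it gets its own allocation chunk and later repallocs don't waste
     * space inside an allocation set block.
     */
    #define INITIAL_MEMTUPSIZE Max(1024, \
        ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)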

Also, this note hadn't been addressed yet.

On Sat, Mar 31, 2018 at 11:43 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
I'm wondering if a static MIN_GROUP_SIZE is good idea. For example, what
if the subplan is expected to return only very few tuples (say, 33), but
the query includes LIMIT 1. Now, let's assume the startup/total cost of
the subplan is 1 and 1000000. With MIN_GROUP_SIZE 32 we're bound to
execute it pretty much till the end, while we could terminate after the
first tuple (if the prefix changes).

So I think we should use a Min(limit,MIN_GROUP_SIZE) here, and perhaps
this should depend on average group size too.

I agree with that.  For bounded sort, the attached patch now selects the minimal
group size as Min(DEFAULT_MIN_GROUP_SIZE, bound).  That should improve the
"LIMIT small_number" case.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Sat, Apr 7, 2018 at 4:56 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Fri, Apr 6, 2018 at 11:40 PM, Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru> wrote:
On 06.04.2018 20:26, Tomas Vondra wrote:
I personally am OK with reducing the scope of the patch like this. It's
still beneficial for the common ORDER BY + LIMIT case, which is good. I
don't think it may negatively affect other cases (at least I can't think
of any).

I think we can reduce it even further. Just try incremental sort along with full sort over the cheapest path in create_ordered_paths, and don't touch anything else. This is a very minimal and a probably safe start, and then we can continue working on other, more complex cases. In the attached patch I tried to do this. We probably should also remove changes in make_sort() and create a separate function make_incremental_sort() for it, but I'm done for today.
 
I've done further unwedding of sort and incremental sort, providing them with separate functions for plan creation.

2) Likewise, I've suggested that the claim about abbreviated keys in
nodeIncrementalsort.c is dubious. No response, and the XXX comment was
instead merged into the patch:

  * XXX The claim about abbreviated keys seems rather dubious, IMHO.

Not sure about that, maybe just use abbreviated keys for the first version? Later we can research this more closely and maybe start deciding whether to use abbrev on planning stage.

That comes from the time when we were trying to make incremental sort always
no worse than full sort.  Now we have separate paths for full and incremental
sorts, and some costing penalty for incremental sort.  So incremental sort should be
selected only when it's expected to give a big win.  Thus, we can give up on this
optimization, at least in the initial version.

So, removed.

4) It's not clear to me why INITIAL_MEMTUPSIZE is defined the way it is.
There needs to be a comment - the intent seems to be making it large
enough to exceed ALLOCSET_SEPARATE_THRESHOLD, but it's not quite clear
why that's a good idea.

Not sure myself, let's ask the other Alexander.
 
I've added a comment to INITIAL_MEMTUPSIZE.  However, to be fair, it's not an
invention of this patch: the initial size of the memtuples array was the same previously.
What this patch did is just move it into a macro.

Also, this note hadn't been addressed yet.

On Sat, Mar 31, 2018 at 11:43 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
I'm wondering if a static MIN_GROUP_SIZE is good idea. For example, what
if the subplan is expected to return only very few tuples (say, 33), but
the query includes LIMIT 1. Now, let's assume the startup/total cost of
the subplan is 1 and 1000000. With MIN_GROUP_SIZE 32 we're bound to
execute it pretty much till the end, while we could terminate after the
first tuple (if the prefix changes).

So I think we should use a Min(limit,MIN_GROUP_SIZE) here, and perhaps
this should depend on average group size too.

I agree with that.  For bounded sort, the attached patch now selects the minimal
group size as Min(DEFAULT_MIN_GROUP_SIZE, bound).  That should improve the
"LIMIT small_number" case.

I've just noticed that incremental sort is now not used in contrib/postgres_fdw.
That's even better, assuming we're going to limit the use-cases of incremental
sort.  I've rolled back all the changes made in the tests of contrib/postgres_fdw
by this patch.  Revised version is attached.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Teodor Sigaev
Date:
> by this patch.  Revised version is attached.

Fine, the patch got several rounds of review in all its parts. Are there any
places which should be improved before commit?

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/


Re: [HACKERS] [PATCH] Incremental sort

From
Tomas Vondra
Date:
On 04/07/2018 04:37 PM, Teodor Sigaev wrote:
>> by this patch.  Revised version is attached.
> 
> Fine, patch got several rounds of review in all its parts. Is any
> places which should be improved before commit?
> 

I personally feel rather uneasy about committing it, TBH.

While I don't see any obvious issues in the patch at the moment, the
recent changes were rather significant so I might easily miss some
unexpected consequences. (OTOH it's true it was mostly about reduction
of scope, to limit the risks.)

I don't have time to do more review and testing on the latest patch
version, unfortunately, certainly not before the CF end.


So I guess the ultimate review / decision is up to you ...

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Wed, Mar 28, 2018 at 6:38 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Teodor Sigaev wrote:
> > BTW, patch had conflicts with master.  Please, find rebased version attached.
>
> Despite the patch conflict, the patch looks committable; has anybody objections to
> commit it?
>
> The patch received several rounds of review during 2 years, and it seems to me
> keeping it out of the sources may cause it to be lost, although it offers
> performance improvements in rather wide use cases.

Can we have a recap on what the patch *does*?  I see there's a
description in Alexander's first email
https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
but that was a long time ago, and the patch has likely changed in the
meantime ...

The general idea hasn't changed much since the first email.
Incremental sort gives a benefit when you need to sort your dataset
by some list of columns while the input is already presorted
by some prefix of that list of columns.  Then you don't do a full sort
of the dataset, but rather sort the groups where the values of the prefix
columns are equal (see the header comment in nodeIncrementalSort.c).
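In rough pseudo-C, the executor loop looks like this (a simplified sketch,
not the actual nodeIncrementalSort.c code; all helper names are
hypothetical):

    /*
     * Simplified sketch of the execution loop: accumulate tuples while
     * the presorted prefix columns keep the same values, then sort and
     * emit that group.  All helper functions here are hypothetical.
     */
    while ((slot = fetch_tuple_from_subplan()) != NULL)
    {
        if (group_is_empty() || prefix_columns_equal(slot, group_sample))
        {
            add_tuple_to_group(slot);
            continue;
        }

        /* Prefix value changed: the current group is complete. */
        sort_group();           /* the prefix is constant within the group */
        emit_group_tuples();
        start_new_group(slot);
    }
    sort_group();               /* flush the last group */
    emit_group_tuples();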

The same example as in the first letter works, but the plan displays
differently.

create table test as (select id, (random()*10000)::int as v1, random() as
v2 from generate_series(1,1000000) id);
create index test_v1_idx on test (v1);

# explain select * from test order by v1, v2 limit 10;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Limit  (cost=1.26..1.26 rows=10 width=16)
   ->  Incremental Sort  (cost=1.26..1.42 rows=1000000 width=16)
         Sort Key: v1, v2
         Presorted Key: v1
         ->  Index Scan using test_v1_idx on test  (cost=0.42..47602.50 rows=1000000 width=16)
(5 rows)

# select * from test order by v1, v2 limit 10;
   id   | v1 |         v2
--------+----+--------------------
 216426 |  0 | 0.0697950166650116
  96649 |  0 |  0.230586454737931
 892243 |  0 |  0.677791305817664
 323001 |  0 |  0.708638620562851
  87458 |  0 |  0.923310813494027
 224291 |  0 |    0.9349579163827
 446366 |  0 |  0.984529701061547
 376781 |  0 |  0.997424073051661
 768246 |  1 |  0.127851997036487
 666102 |  1 |   0.27093240711838
(10 rows)

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] [PATCH] Incremental sort

From
Tom Lane
Date:
Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> On Wed, Mar 28, 2018 at 6:38 PM, Alvaro Herrera <alvherre@alvh.no-ip.org>
> wrote:
>> Can we have a recap on what the patch *does*?

> The general idea hasn't changed much since the first email.
> Incremental sort gives a benefit when you need to sort your dataset
> by some list of columns while the input is already presorted
> by some prefix of that list of columns.  Then you don't do a full sort
> of the dataset, but rather sort the groups where the values of the prefix
> columns are equal (see the header comment in nodeIncrementalSort.c).

I dunno, how would you estimate whether this is actually a win or not?
I don't think our model of sort costs is anywhere near refined enough
or accurate enough to reliably predict whether this is better than
just doing it in one step.  Even if the cost model is good, it's not
going to be better than our statistics about the number/size of the
groups in the first column(s), and that's a notoriously unreliable stat.

Given that we already have more than enough dubious patches that have
been shoved in in the last few days, I'd rather not pile on stuff that
there's any question about.

            regards, tom lane


Re: [HACKERS] [PATCH] Incremental sort

From
Andres Freund
Date:
On 2018-04-07 12:06:52 -0400, Tom Lane wrote:
> Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> > On Wed, Mar 28, 2018 at 6:38 PM, Alvaro Herrera <alvherre@alvh.no-ip.org>
> > wrote:
> >> Can we have a recap on what the patch *does*?
> 
> > The general idea hasn't changed much since the first email.
> > Incremental sort gives a benefit when you need to sort your dataset
> > by some list of columns while the input is already presorted
> > by some prefix of that list of columns.  Then you don't do a full sort
> > of the dataset, but rather sort the groups where the values of the prefix
> > columns are equal (see the header comment in nodeIncrementalSort.c).
> 
> I dunno, how would you estimate whether this is actually a win or not?
> I don't think our model of sort costs is anywhere near refined enough
> or accurate enough to reliably predict whether this is better than
> just doing it in one step.  Even if the cost model is good, it's not
> going to be better than our statistics about the number/size of the
> groups in the first column(s), and that's a notoriously unreliable stat.
> 
> Given that we already have more than enough dubious patches that have
> been shoved in in the last few days, I'd rather not pile on stuff that
> there's any question about.

I don't disagree with any of that. Just wanted to pipe up to say that
there's a fair argument to be made that this patch, which has lingered
for years, "deserves" more to mature in tree than some of the rest.

Greetings,

Andres Freund


Re: [HACKERS] [PATCH] Incremental sort

From
Teodor Sigaev
Date:
> I dunno, how would you estimate whether this is actually a win or not?
> I don't think our model of sort costs is anywhere near refined enough
> or accurate enough to reliably predict whether this is better than
> just doing it in one step.  Even if the cost model is good, it's not
> going to be better than our statistics about the number/size of the
> groups in the first column(s), and that's a notoriously unreliable stat.
I think that improving the cost calculation of sorts should be a
separate patch, not directly connected to this one. Postponing patches
until other parts are ready, in order to get the maximum improvement for the
postponed ones, doesn't seem very good to me, especially if the patch offers
some improvement right now.


-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/


Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Sat, Apr 7, 2018 at 5:37 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
by this patch.  Revised version is attached.

Fine, patch got several rounds of review in all its parts. Is any places which should be improved before commit?

Also, I found that after Alexander Kuzmenkov's planner changes, incremental sort
was used in cheapest_input_path() only if its child is the cheapest total path.
That means incremental sort would almost never get used.
I've changed that to consider an incremental sort path when we have some
presorted columns.  I also had to put the changes in the postgres_fdw regression
tests back, because incremental sort is used right there.

This revision of the patch also includes a commit message.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: [HACKERS] [PATCH] Incremental sort

From
Tom Lane
Date:
Teodor Sigaev <teodor@sigaev.ru> writes:
>> I dunno, how would you estimate whether this is actually a win or not?
>> I don't think our model of sort costs is anywhere near refined enough
>> or accurate enough to reliably predict whether this is better than
>> just doing it in one step.  Even if the cost model is good, it's not
>> going to be better than our statistics about the number/size of the
>> groups in the first column(s), and that's a notoriously unreliable stat.

> I think that improvement in cost calculation of sort should be a 
> separate patch, not directly connected to this one. Postpone patches 
> till other part will be ready to get max improvement for postponed ones 
> doesn't seem to me very good, especially if it suggests some improvement 
> right now.

No, you misunderstand the point of my argument.  Without a reasonably
reliable cost model, this patch could easily make performance *worse*
not better for many people, due to choosing incremental-sort plans
where they were really a loss.

If we were at the start of a development cycle and work were being
promised to be done later in the cycle to improve the planning aspect,
I'd be more charitable about it.  But this isn't merely the end of a
cycle, it's the *last day*.  Now is not the time to commit stuff that
needs, or even just might need, follow-on work.

            regards, tom lane


Re: [HACKERS] [PATCH] Incremental sort

From
Tomas Vondra
Date:
On 04/07/2018 06:23 PM, Tom Lane wrote:
> Teodor Sigaev <teodor@sigaev.ru> writes:
>>> I dunno, how would you estimate whether this is actually a win or not?
>>> I don't think our model of sort costs is anywhere near refined enough
>>> or accurate enough to reliably predict whether this is better than
>>> just doing it in one step.  Even if the cost model is good, it's not
>>> going to be better than our statistics about the number/size of the
>>> groups in the first column(s), and that's a notoriously unreliable stat.
> 
>> I think that improvement in cost calculation of sort should be a 
>> separate patch, not directly connected to this one. Postpone patches 
>> till other part will be ready to get max improvement for postponed ones 
>> doesn't seem to me very good, especially if it suggests some improvement 
>> right now.
> 
> No, you misunderstand the point of my argument.  Without a reasonably
> reliable cost model, this patch could easily make performance *worse*
> not better for many people, due to choosing incremental-sort plans
> where they were really a loss.
> 

Yeah. Essentially the patch could push the planner to pick a path that
has low startup cost (and very high total cost), assuming it'll only
need to read small part of the input. But if the number of groups in the
input is low (perhaps just one huge group), that would be a regression.

> If we were at the start of a development cycle and work were being
> promised to be done later in the cycle to improve the planning aspect,
> I'd be more charitable about it.  But this isn't merely the end of a
> cycle, it's the *last day*.  Now is not the time to commit stuff that
> needs, or even just might need, follow-on work.
> 

+1 to that

FWIW I'm willing to spend some time on the patch for PG12, particularly
on the planner / costing part. The potential gains are too interesting.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Sat, Apr 7, 2018 at 11:57 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 04/07/2018 06:23 PM, Tom Lane wrote:
> Teodor Sigaev <teodor@sigaev.ru> writes:
>>> I dunno, how would you estimate whether this is actually a win or not?
>>> I don't think our model of sort costs is anywhere near refined enough
>>> or accurate enough to reliably predict whether this is better than
>>> just doing it in one step.  Even if the cost model is good, it's not
>>> going to be better than our statistics about the number/size of the
>>> groups in the first column(s), and that's a notoriously unreliable stat.
>
>> I think that improvement in cost calculation of sort should be a
>> separate patch, not directly connected to this one. Postpone patches
>> till other part will be ready to get max improvement for postponed ones
>> doesn't seem to me very good, especially if it suggests some improvement
>> right now.
>
> No, you misunderstand the point of my argument.  Without a reasonably
> reliable cost model, this patch could easily make performance *worse*
> not better for many people, due to choosing incremental-sort plans
> where they were really a loss.
>

Yeah. Essentially the patch could push the planner to pick a path that
has low startup cost (and very high total cost), assuming it'll only
need to read small part of the input. But if the number of groups in the
input is low (perhaps just one huge group), that would be a regression.
 
Yes, I think the biggest risk here is too small a number of groups.  More
precisely, the risk is groups that are too large, even though the total number
of groups might be large enough.

> If we were at the start of a development cycle and work were being
> promised to be done later in the cycle to improve the planning aspect,
> I'd be more charitable about it.  But this isn't merely the end of a
> cycle, it's the *last day*.  Now is not the time to commit stuff that
> needs, or even just might need, follow-on work.
>

+1 to that

FWIW I'm willing to spend some time on the patch for PG12, particularly
on the planner / costing part. The potential gains are too interesting

Thank you very much for your efforts on reviewing this patch.
I'm looking forward work with you on this feature for PG12.

FWIW, I think that we're moving this patch in the right direction by
providing separate paths for incremental sort.  It's much better than
deciding between full or incremental sort in-place.  For sure, considering
extra paths might cause a planning time regression.  But I think the
same statement is true about many other planning optimizations.
One thing we can do is to make enable_incrementalsort = off by
default.  Then only users who understand the importance of incremental
sort will pay the extra planning time.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
 

Re: [HACKERS] [PATCH] Incremental sort

From
David Steele
Date:
On 4/9/18 11:56 AM, Alexander Korotkov wrote:
> 
> Thank you very much for your efforts on reviewing this patch.
> I'm looking forward work with you on this feature for PG12.

Since there's a new patch I have changed the status to Needs Review and
moved the entry to the next CF.

Still, it seems to me that discussion and new patches will be required
to address Tom's concerns.

Regards,
-- 
-David
david@pgmasters.net


Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
On Tue, Apr 10, 2018 at 4:15 PM, David Steele <david@pgmasters.net> wrote:
On 4/9/18 11:56 AM, Alexander Korotkov wrote:
>
> Thank you very much for your efforts on reviewing this patch.
> I'm looking forward work with you on this feature for PG12.

Since there's a new patch I have changed the status to Needs Review and
moved the entry to the next CF.
 
Right, thank you.

Still, it seems to me that discussion and new patches will be required
to address Tom's concerns.

Sounds correct for me.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
 

Re: Re: [HACKERS] [PATCH] Incremental sort

From
James Coleman
Date:
I've attached an updated copy of the patch that applies cleanly to 
current master.



Attachment

Re: Re: [HACKERS] [PATCH] Incremental sort

From
Alexander Korotkov
Date:
Hi, James!

On Thu, May 31, 2018 at 11:10 PM James Coleman <jtc331@gmail.com> wrote:
I've attached an updated copy of the patch that applies cleanly to
current master.

Thank you for rebasing this patch.  Next time you send a patch, please make sure you've bumped
its version, even if you made no changes besides a pure rebase.  Otherwise, it's
hard to distinguish patch versions, because the patch files have exactly the same names.

I'd like to note that I'm going to provide a revised version of this patch for the next commitfest.
After some conversations at PGCon regarding this patch, I'm more confident that providing
separate paths for incremental sort is right.  In the next revision of this patch, incremental
sort paths will be provided in more cases.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

A fairly common planning problem for us is what we call "most recent first" queries; i.e., "the 50 most recent <table> rows for a <foreign key>".

Here's a basic setup:

-- created_at has very high cardinality
create table foo(pk serial primary key, owner_fk integer, created_at timestamp);
create index idx_foo_on_owner_and_created_at on foo(owner_fk, created_at);

-- technically this data guarantees unique created_at values,
-- but there's no reason it couldn't be modified to have a few
-- random non-unique values to prove the point
insert into foo(owner_fk, created_at)
  select i % 100, now() - (i::text || ' minutes')::interval
  from generate_series(1, 1000000) t(i);


And here's the naive query to get the results we want:

select *
from foo
where owner_fk = 23
-- pk is only here to disambiguate/guarantee a stable sort
-- in the rare case that there are collisions in the other
-- sort field(s)
order by created_at desc, pk desc
limit 50;


On stock Postgres this ends up being pretty terrible for cases where the fk filter represents a large number of rows,
because the planner generates a sort node under the limit node and therefore fetches all matches, sorts them, and then
applies the limit. Here's the plan:

 Limit  (cost=61386.12..61391.95 rows=50 width=16) (actual time=187.814..191.653 rows=50 loops=1)
   ->  Gather Merge  (cost=61386.12..70979.59 rows=82224 width=16) (actual time=187.813..191.647 rows=50 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=60386.10..60488.88 rows=41112 width=16) (actual time=185.639..185.642 rows=42 loops=3)
               Sort Key: created_at DESC, pk DESC
               Sort Method: top-N heapsort  Memory: 27kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 27kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 27kB
                ->  Parallel Bitmap Heap Scan on foo  (cost=3345.24..59020.38 rows=41112 width=16) (actual time=25.150..181.804 rows=33333 loops=3)
                     Recheck Cond: (owner_fk = 23)
                     Heap Blocks: exact=18014
                      ->  Bitmap Index Scan on idx_foo_on_owner_and_created_at  (cost=0.00..3320.57 rows=98668 width=0) (actual time=16.992..16.992 rows=100000 loops=1)
                           Index Cond: (owner_fk = 23)
 Planning Time: 0.384 ms
 Execution Time: 191.704 ms


I have a recursive CTE that implements the algorithm:
- Find first n+1 results
- If result at n+1’s created_at value differs from the n’s value, return first n values.
- If those equal, gather more results until a new created_at value is encountered.
- Sort all results by created_at and a tie-breaker (e.g., pk) and return the first n values.
But nobody wants to use/write that normally (it's quite complex).

This patch solves the problem presented; here's the plan:

 Limit  (cost=2.70..2.76 rows=50 width=16) (actual time=0.233..0.367 rows=50 loops=1)
   ->  Incremental Sort  (cost=2.70..111.72 rows=98668 width=16) (actual time=0.232..0.362 rows=50 loops=1)
         Sort Key: created_at DESC, pk DESC
         Presorted Key: created_at
         Sort Method: quicksort  Memory: 26kB
         Sort Groups: 2
         ->  Index Scan Backward using idx_foo_on_owner_and_created_at on foo  (cost=0.56..210640.79 rows=98668 width=16) (actual time=0.054..0.299 rows=65 loops=1)
               Index Cond: (owner_fk = 23)
 Planning Time: 0.428 ms
 Execution Time: 0.393 ms


While check world fails, the only failure appears to be a plan output change in
test/isolation/expected/drop-index-concurrently-1.out that just needs to be updated (incremental sort is now used in
this plan); I don't see any functionality breakage.

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tom Lane
Date:
James Coleman <jtc331@gmail.com> writes:
> A fairly common planning problem for us is what we call "most recent first" queries; i.e., "the 50 most recent <table> rows for a <foreign key>".

> Here's a basic setup:

> -- created_at has very high cardinality
> create table foo(pk serial primary key, owner_fk integer, created_at timestamp);
> create index idx_foo_on_owner_and_created_at on foo(owner_fk, created_at);

> -- technically this data guarantees unique created_at values,
> -- but there's no reason it couldn't be modified to have a few
> -- random non-unique values to prove the point
> insert into foo(owner_fk, created_at)
>   select i % 100, now() - (i::text || ' minutes')::interval
>   from generate_series(1, 1000000) t(i);

> And here's the naive query to get the results we want:

> select *
> from foo
> where owner_fk = 23
> -- pk is only here to disambiguate/guarantee a stable sort
> -- in the rare case that there are collisions in the other
> -- sort field(s)
> order by created_at desc, pk desc
> limit 50;

If you're concerned about the performance of this case, why don't you make
an index that actually matches the query?

regression=# create index on foo (owner_fk, created_at, pk);
CREATE INDEX
regression=# explain analyze select * from foo where owner_fk = 23 order by created_at desc, pk desc limit 50;
                                                                           QUERY PLAN
                                         

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..110.92 rows=50 width=16) (actual time=0.151..0.280 rows=50 loops=1)
   ->  Index Only Scan Backward using foo_owner_fk_created_at_pk_idx on foo  (cost=0.42..20110.94 rows=9100 width=16) (actual time=0.146..0.255 rows=50 loops=1)
         Index Cond: (owner_fk = 23)
         Heap Fetches: 50
 Planning Time: 0.290 ms
 Execution Time: 0.361 ms
(6 rows)

There may be use-cases for Alexander's patch, but I don't find this
one to be terribly convincing.

            regards, tom lane


Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
I disagree because it's not ideal to basically have to append pk to every index in the system just to get the ability to tie-break when there's actually very low likelihood of ties anyway.

A similar use case is trying to batch through a result set, in which case you need a stable sort as well.

The benefit is retaining the generality of indexes (and saving space in them etc.) while still allowing using them for more specific sorts. Any time you paginate or batch this way you benefit from this, which in many applications applies to a very high percentage of indexes.

On Thu, Sep 6, 2018 at 10:39 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
James Coleman <jtc331@gmail.com> writes:
> A fairly common planning problem for us is what we call "most recent first" queries; i.e., "the 50 most recent <table> rows for a <foreign key>".

> Here's a basic setup:

> -- created_at has very high cardinality
> create table foo(pk serial primary key, owner_fk integer, created_at timestamp);
> create index idx_foo_on_owner_and_created_at on foo(owner_fk, created_at);

> -- technically this data guarantees unique created_at values,
> -- but there's no reason it couldn't be modified to have a few
> -- random non-unique values to prove the point
> insert into foo(owner_fk, created_at)
>   select i % 100, now() - (i::text || ' minutes')::interval
>   from generate_series(1, 1000000) t(i);

> And here's the naive query to get the results we want:

> select *
> from foo
> where owner_fk = 23
> -- pk is only here to disambiguate/guarantee a stable sort
> -- in the rare case that there are collisions in the other
> -- sort field(s)
> order by created_at desc, pk desc
> limit 50;

If you're concerned about the performance of this case, why don't you make
an index that actually matches the query?

regression=# create index on foo (owner_fk, created_at, pk);     
CREATE INDEX
regression=# explain analyze select * from foo where owner_fk = 23 order by created_at desc, pk desc limit 50;
                                                                           QUERY PLAN                                                                           
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..110.92 rows=50 width=16) (actual time=0.151..0.280 rows=50 loops=1)
   ->  Index Only Scan Backward using foo_owner_fk_created_at_pk_idx on foo  (cost=0.42..20110.94 rows=9100 width=16) (actual time=0.146..0.255 rows=50 loops=1)
         Index Cond: (owner_fk = 23)
         Heap Fetches: 50
 Planning Time: 0.290 ms
 Execution Time: 0.361 ms
(6 rows)

There may be use-cases for Alexander's patch, but I don't find this
one to be terribly convincing.

                        regards, tom lane

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On 09/06/2018 08:04 PM, James Coleman wrote:
> I disagree because it's not ideal to basically have to append pk to
> every index in the system just to get the ability to tie-break when
> there's actually very low likelihood of ties anyway.
> 
> A similar use case is trying to batch through a result set, in which
> case you need a stable sort as well.
> 
> The benefit is retaining the generality of indexes (and saving space in
> them etc.) while still allowing using them for more specific sorts. Any
> time you paginate or batch this way you benefit from this, which in many
> applications applies to a very high percentage of indexes.
> 

I 100% agree with this.

I see incremental sort as a way to run queries with fewer indexes that
are less query-specific, while still benefiting from them. Which means
lower overhead when writing data, lower disk space usage, and so on.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] [PATCH] Incremental sort

From
Tomas Vondra
Date:
Hi Alexander,

On 06/01/2018 04:22 PM, Alexander Korotkov wrote:
> Hi, James!
> 
> On Thu, May 31, 2018 at 11:10 PM James Coleman <jtc331@gmail.com
> <mailto:jtc331@gmail.com>> wrote:
> 
>     I've attached an updated copy of the patch that applies cleanly to
>     current master.
> 
> 
> Thank you for rebasing this patch.  Next time sending a patch,
> please make sure you've bumped its version, if even you made no
> changes there besides to pure rebase. Otherwise, it would be hard to
> distinguish patch versions, because patch files have exactly same
> names.
> 
> I'd like to note, that I'm going to provide revised version of this 
> patch to the next commitfest. After some conversations at PGCon
> regarding this patch, I got more confident that providing separate
> paths for incremental sorts is right.  In the next revision of this
> patch, incremental sort paths would be provided in more cases.
> 

Do you plan to submit an updated patch version for the November CF? It
would be good to look at the costing model soon, otherwise it might end
up missing PG12, and that would be unfortunate.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Dmitry Dolgov
Date:
> On Mon, Mar 20, 2017 at 10:34 AM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>
> Please, find rebased patch in the attachment.

It's been a while since this patch was posted. There is already some amount of
feedback in this thread, and the patch itself unfortunately has some conflicts
with the current master. Alexander, do you have any plans for this feature?
For now I'll probably mark it as "Returned with feedback".


Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Shaun Thomas
Date:
Another ping on this Incremental Sort patch.

Alexander, you'd noted that you would try to get it into subsequent
Commit Fests with improvements you've been considering, but I don't
see it in anything but 2018-11. Have you abandoned this as a
maintainer? If so, it would be nice to know so someone else can pick
it up.

-- 
Shaun M Thomas - 2ndQuadrant
PostgreSQL Training, Services and Support
shaun.thomas@2ndquadrant.com | www.2ndQuadrant.com



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On 2018-06-01 14:22:26, Alexander Korotkov wrote:
> I'd like to note, that I'm going to provide revised version of this patch
> to the next commitfest.
> After some conversations at PGCon regarding this patch, I got more
> confident that providing separate paths for incremental sorts is right.
> In the next revision of this patch, incremental sort paths would be
> provided in more cases.

Alexander,

I'm currently rebasing the patch, and if I get some time I'm going to
look into working on it.

Would you be able to provide a bit more detail on the changes you were
hoping to make after conversations you'd had with others? I'm hoping
for any pointers/context you have from those conversations as to what
you felt was necessary to get this change committed.

Thanks,
James Coleman



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
I've rebased the patch on master and confirmed make check world passes.

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
Hi James,

On Fri, May 31, 2019 at 03:51:57PM -0400, James Coleman wrote:
>I've rebased the patch on master and confirmed make check world passes.

Thanks for the rebase! I think the patch is in pretty good shape - I'm
sure we'll find ways to make it more efficient etc. but IMO that's fine
and should not prevent getting it committed.

The planning/costing logic may need further discussion and improvements,
though. IIRC this was the main reason why the patch missed PG11, because
at that point it simply used incremental sort whenever the input was
already presorted by a pathkey prefix, but that may be slower than regular
sort in some cases (unexpectedly large groups, etc.).

I see the current patch partially improves this by creating both paths
(full and incremental sort). That's good, because it means the decision is
cost-based (as it should be). The question however is how accurate the
costing model is - and per the discussion in the thread, it may need some
improvements to handle skewed distributions better.

Currently, the costing logic (cost_incremental_sort) assumes all groups
have the same size, which is fine for uniform distributions. But for
skewed distributions, that may be an issue.

Consider for example a table with 1M rows, two columns, 100 groups in each
column, and index on the first one.

    CREATE table t (a INT, b INT);

    INSERT INTO t SELECT 100*random(), 100*random()
      FROM generate_series(1,1000000);

Now, let's do a simple limit query to find the first row:

    SELECT * FROM t ORDER BY a, b LIMIT 1;

In this case the current costing logic is fine - the groups are close to
average size, and we only need to sort the first group, i.e. 1% of data.

Now, let's say the first group is much larger:


    INSERT INTO t SELECT 0, 100*random()
      FROM generate_series(1,900000) s(i);

    INSERT INTO t SELECT 100*random(), 100*random()
      FROM generate_series(1,100000);

That is, the first group is now roughly 90% of the data, while the number
of groups (and hence the average group size) is unchanged - so the current
cost model is oblivious to the fact that we have to sort 90% of the data
before producing the first tuple.
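
To make the skew concrete (illustration only, and assuming the table was
loaded with just the two INSERTs above, i.e. ~1M rows in total, which is
what the 90% figure implies), the actual group sizes are easy to check:

    SELECT a, count(*) AS group_size
      FROM t
     GROUP BY a
     ORDER BY a
     LIMIT 3;

    -- the first group (a = 0) comes back with ~900k rows while the other
    -- groups hold only ~1k rows each, yet the per-group estimate used by
    -- cost_incremental_sort is still the ~10k-row average.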

But I think we can improve this by looking at the MCV lists (either
per-column or multi-column) to see if some groups are much larger, and
consider that when computing the costs.

In particular, I think we should estimate the size of the first group,
because that's important for startup cost - we need to process the whole
first group before producing the first tuple, and that matters for LIMIT
queries etc.

For example, let's say the first (already sorted) column has an MCV list.
Then we can see how large the first group (by value, not by frequency) is,
and use that instead of the average group size. E.g. in the above example
we'd know the first group is ~90%.

And we could do the same for multiple columns, either by looking at
multi-column MCV lists (if there is one), or by using the minimum from the
per-column MCV lists.
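
As a sketch of where that information could come from (nothing new needed,
this is just the stats ANALYZE already collects, shown here for the example
table, assuming it has been ANALYZEd):

    SELECT most_common_vals, most_common_freqs
      FROM pg_stats
     WHERE tablename = 't' AND attname = 'a';

    -- with the skewed data above, most_common_vals starts with 0 and
    -- most_common_freqs[1] is close to 0.9, so the group that sorts first
    -- is known to be ~90% of the table - exactly the number the startup
    -- cost would need, instead of the 1% average group size.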

Of course, these are only the groups that made it to the MCV list, and
there may be other (smaller) groups before this large one. For example
there could be a group with "-1" value and a single row.

For a moment I thought we could/should look at the histogram, because that
could tell us if there are groups "before" the first MCV one, but I don't
think we should do that, for two reasons. Firstly, rare values may not get
to the histogram anyway, which makes this rather unreliable and might
introduce sudden plan changes, because the cost would vary wildly
depending on whether we happened to sample the rare row or not. And
secondly, the rare row may be easily filtered out by a WHERE condition or
something, at which point we'll have to deal with the large group anyway.

So I think we should look at the MCV list, and use that information when
computing the startup/total cost. I think using the first/largest group to
compute the startup cost, and the average group size for total cost would
do the trick.

I don't think we can do much better than this during planning. There will
inevitably be cases where the costing model will push us to do the wrong
thing, in either direction. The question is how serious an issue that is, and
whether we could remedy that during execution somehow.

When we "incorrectly" pick full sort (when the incremental sort would be
faster), that's obviously sad but I think it's mostly fine because it's
not a regression.

The opposite direction (picking incremental sort when a full sort would be
faster) is probably more serious, because it's a regression between
releases.

I don't think we can fully fix that by refining the cost model. We have
two basic options:

1) Argue/show that this is not an issue in practice, because (a) such
cases are very rare, and/or (b) the regression is limited. In short, the
benefits of the patch outweigh the losses.

2) Provide some fallback at execution time. For example, we might watch
the size of the group, and if we run into an unexpectedly large one we
might abandon the incremental sort and switch to a "full sort" mode.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sun, Jun 2, 2019 at 5:18 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Thanks for the rebase! I think the patch is in pretty good shape - I'm
> sure we'll find ways to make it more efficient etc. but IMO that's fine
> and should not prevent getting it committed.

Thank you for the in-depth review!

> Currently, the costing logic (cost_incremental_sort) assumes all groups
> have the same size, which is fine for uniform distributions. But for
> skewed distributions, that may be an issue.
>
> Consider for example a table with 1M rows, two columns, 100 groups in each
> column, and index on the first one.
>
>     CREATE table t (a INT, b INT);
>
>     INSERT INTO t SELECT 100*random(), 100*random()
>       FROM generate_series(1,1000000);
>
> Now, let's do a simple limit query to find the first row:
>
>     SELECT * FROM t ORDER BY a, b LIMIT 1;
>
> In this case the current costing logic is fine - the groups are close to
> average size, and we only need to sort the first group, i.e. 1% of data.
>
> Now, let's say the first group is much larger:
>
>
>     INSERT INTO t SELECT 0, 100*random()
>       FROM generate_series(1,900000) s(i);
>
>     INSERT INTO t SELECT 100*random(), 100*random()
>       FROM generate_series(1,100000);
>
> That is, the first group is roughly 90% of data, but the number of groups
> is still the same. But we need to scan 90% of data. But the average group
> size is still the same, so the current cost model is oblivious to this.

Thinking out loud here: the current implementation doesn't guarantee
that sort groups always have the same prefix column values because
(from code comments) "Sorting many small groups with tuplesort is
inefficient"). While this seems like a reasonable optimization, I think
it's possible that this thinking steered us away from an optimization in
the opposite direction. Perhaps we should always track whether or not all
prefix tuples are the same (currently we only do that after reaching
DEFAULT_MIN_GROUP_SIZE tuples) and use that information to be able to
have tuplesort only care about the non-prefix columns (where currently
it has to sort on all pathkey columns even though for a large group
the prefix columns are guaranteed to be equal).
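
Stated in SQL terms just to illustrate the idea (using the example table
from earlier in the thread; this isn't how the patch is implemented, only
what the equivalence would let tuplesort skip):

    -- within a single prefix group the prefix column is constant, so these
    -- two orderings are identical for that group:
    SELECT * FROM t WHERE a = 0 ORDER BY a, b;   -- all pathkey columns
    SELECT * FROM t WHERE a = 0 ORDER BY b;      -- suffix-only sort keys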

Essentially I'm trying to think of ways that would get us to
comparable performance with regular sort in the case of large batch
sizes.

One other thing about the DEFAULT_MIN_GROUP_SIZE logic is that in the
case where you have a very small group and then a very large batch,
we'd lose the ability to optimize in the above way. That makes me
wonder if we shouldn't intentionally optimize for the possibility of
large batch sizes, since a little extra expense per group/tuple is
more likely to be a non-concern with small groups anyway when there
are large numbers of input tuples but a relatively small limit.

Thoughts?

> So I think we should look at the MCV list, and use that information when
> computing the startup/total cost. I think using the first/largest group to
> compute the startup cost, and the average group size for total cost would
> do the trick.

I think this sounds very reasonable.

> I don't think we can do much better than this during planning. There will
> inevitably be cases where the costing model will push us to do the wrong
> thing, in either direction. The question is how serious issue that is, and
> whether we could remedy that during execution somehow.
>
> When we "incorrectly" pick full sort (when the incremental sort would be
> faster), that's obviously sad but I think it's mostly fine because it's
> not a regression.
>
> The opposite direction (picking incremental sort, while full sort being
> faster) is probably more serious, because it's a regression between
> releases.
>
> I don't think we can fully fix that by refining the cost model. We have
> two basic options:
>
> 1) Argue/show that this is not an issue in practice, because (a) such
> cases are very rare, and/or (b) the regression is limited. In short, the
> benefits of the patch outweigh the losses.

My comments above go in this direction. If we can improve performance
in the worst case, I think it's plausible this concern becomes a
non-issue.

> 2) Provide some fallback at execution time. For example, we might watch
> the size of the group, and if we run into an unexpectedly large one we
> might abandon the incremental sort and switch to a "full sort" mode.

Are there good examples of our doing this in other types of nodes
(whether the fallback is an entirely different algorithm/node type)? I
like this idea in theory, but I also think it's likely it would add a
very significant amount of complexity. The other problem is knowing
where to draw the line: you end up creating these kinds of cliffs
where pulling one more tuple through the incremental sort would give
you your batch and result in not having to pull many more tuples in a
regular sort node, but the fallback logic kicks in anyway.

Unrelated to all of the above: if I read the patch properly it
intentionally excludes backwards scanning. I don't see any particular
reason why that ought to be the case, and it seems like an odd
limitation for the feature should it be merged. Should that be a
blocker to merging?

James Coleman



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Rafia Sabih
Date:
On Mon, 3 Jun 2019 at 15:39, James Coleman <jtc331@gmail.com> wrote:
>
> On Sun, Jun 2, 2019 at 5:18 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > Currently, the costing logic (cost_incremental_sort) assumes all groups
> > have the same size, which is fine for uniform distributions. But for
> > skewed distributions, that may be an issue.
> >
> > Consider for example a table with 1M rows, two columns, 100 groups in each
> > column, and index on the first one.
> >
> >     CREATE table t (a INT, b INT);
> >
> >     INSERT INTO t SELECT 100*random(), 100*random()
> >       FROM generate_series(1,1000000);
> >
> > Now, let's do a simple limit query to find the first row:
> >
> >     SELECT * FROM t ORDER BY a, b LIMIT 1;
> >
> > In this case the current costing logic is fine - the groups are close to
> > average size, and we only need to sort the first group, i.e. 1% of data.
> >
> > Now, let's say the first group is much larger:
> >
> >
> >     INSERT INTO t SELECT 0, 100*random()
> >       FROM generate_series(1,900000) s(i);
> >
> >     INSERT INTO t SELECT 100*random(), 100*random()
> >       FROM generate_series(1,100000);
> >
> > That is, the first group is roughly 90% of data, but the number of groups
> > is still the same. But we need to scan 90% of data. But the average group
> > size is still the same, so the current cost model is oblivious to this.
>
> Thinking out loud here: the current implementation doesn't guarantee
> that sort groups always have the same prefix column values because
> (from code comments) "Sorting many small groups with tuplesort is
> inefficient). While this seems a reasonable optimization, I think it's
> possible that thinking steered away from an optimization in the
> opposite direction. Perhaps we should always track whether or not all
> prefix tuples are the same (currently we only do that after reaching
> DEFAULT_MIN_GROUP_SIZE tuples) and use that information to be able to
> have tuplesort only care about the non-prefix columns (where currently
> it has to sort on all pathkey columns even though for a large group
> the prefix columns are guaranteed to be equal).
>
+1 for passing only the non-prefix columns to tuplesort.
> Essentially I'm trying to think of ways that would get us to
> comparable performance with regular sort in the case of large batch
> sizes.
>
> One other thing about the DEFAULT_MIN_GROUP_SIZE logic is that in the
> case where you have a very small group and then a very large batch,
> we'd lose the ability to optimize in the above way. That makes me
> wonder if we shouldn't intentionally optimize for the possibility of
> large batch sizes, since a little extra expense per group/tuple is
> more likely to be a non-concern with small groups anyway when there
> are large numbers of input tuples but a relatively small limit.
>
What about using the knowledge of the MCV list here: if we know the next
value is in the MCV list, then take the overhead of sorting this small
group alone, and then leverage the optimization for the larger group by
passing only the non-prefix columns.
> Thoughts?
>
> > So I think we should look at the MCV list, and use that information when
> > computing the startup/total cost. I think using the first/largest group to
> > compute the startup cost, and the average group size for total cost would
> > do the trick.
>
+1
> I think this sounds very reasonable.
>
> > I don't think we can do much better than this during planning. There will
> > inevitably be cases where the costing model will push us to do the wrong
> > thing, in either direction. The question is how serious issue that is, and
> > whether we could remedy that during execution somehow.
> >
> > When we "incorrectly" pick full sort (when the incremental sort would be
> > faster), that's obviously sad but I think it's mostly fine because it's
> > not a regression.
> >
> > The opposite direction (picking incremental sort, while full sort being
> > faster) is probably more serious, because it's a regression between
> > releases.
> >
> > I don't think we can fully fix that by refining the cost model. We have
> > two basic options:
> >
> > 1) Argue/show that this is not an issue in practice, because (a) such
> > cases are very rare, and/or (b) the regression is limited. In short, the
> > benefits of the patch outweigh the losses.
>
> My comments above go in this direction. If we can improve performance
> in the worst case, I think it's plausible this concern becomes a
> non-issue.
>
> > 2) Provide some fallback at execution time. For example, we might watch
> > the size of the group, and if we run into an unexpectedly large one we
> > might abandon the incremental sort and switch to a "full sort" mode.
>
> Are there good examples of our doing this in other types of nodes
> (whether the fallback is an entirely different algorithm/node type)? I
> like this idea in theory, but I also think it's likely it would add a
> very significant amount of complexity. The other problem is knowing
> where to draw the line: you end up creating these kinds of cliffs
> where pulling one more tuple through the incremental sort would give
> you your batch and result in not having to pull many more tuples in a
> regular sort node, but the fallback logic kicks in anyway.
>
What about having some simple mechanism for this, like if we encounter
a group with more tuples than the one estimated then simply switch
to normal sort for the remaining tuples, as the estimates do not
hold true anyway. At least this will not give us the issue of
incremental sort regressing too badly compared to the normal sort.
I mean having something like this: populate the tuplesortstate and
keep on counting the number of tuples in a group; if they are within
the budget call tuplesort_performsort, otherwise put all the tuples in
the tuplesort and then call tuplesort_performsort. We may have an
additional field in IncrementalSortState to save the estimated size of
each group. I am assuming that we use MCV lists to better approximate
the group sizes, as suggested above by Tomas.

> Unrelated to all of the above: if I read the patch properly it
> intentionally excludes backwards scanning. I don't see any particular
> reason why that ought to be the case, and it seems like an odd
> limitation for the feature should it be merged. Should that be a
> blocker to merging?

Regarding this, I came across this,
/*
  * Incremental sort can't be used with either EXEC_FLAG_REWIND,
  * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only current
  * bucket in tuplesortstate.
  */
I think that is quite understandable. How are you planning to support
backwards scan for this? In other words, when would incremental sort be
useful for a backward scan?

On a different note, I can't stop imagining this operator along the lines
of parallel-append, wherein multiple workers could sort the
different groups independently at the same time.

-- 
Regards,
Rafia Sabih



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Wed, Jun 5, 2019 at 12:14 PM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> > > 2) Provide some fallback at execution time. For example, we might watch
> > > the size of the group, and if we run into an unexpectedly large one we
> > > might abandon the incremental sort and switch to a "full sort" mode.
> >
> > Are there good examples of our doing this in other types of nodes
> > (whether the fallback is an entirely different algorithm/node type)? I
> > like this idea in theory, but I also think it's likely it would add a
> > very significant amount of complexity. The other problem is knowing
> > where to draw the line: you end up creating these kinds of cliffs
> > where pulling one more tuple through the incremental sort would give
> > you your batch and result in not having to pull many more tuples in a
> > regular sort node, but the fallback logic kicks in anyway.
> >
> What about having some simple mechanism for this, like if we encounter
> the group with more tuples than the one estimated then simply switch
> to normal sort for the remaining tuples, as the estimates does not
> hold true anyway. Atleast this will not give issues of having
> regressions of incremental sort being too bad than the normal sort.
> I mean having something like this, populate the tuplesortstate and
> keep on counting the number of tuples in a group, if they are within
> the budget call tuplesort_performsort, otherwise put all the tuples in
> the tuplesort and then call tuplesort_performsort. We may have an
> additional field in IncrementalSortState to save the estimated size of
> each group. I am assuming that we use MCV lists to approximate better
> the group sizes as suggested above by Tomas.

I think the first thing to do is get some concrete numbers on performance if we:

1. Only sort one group at a time.
2. Update the costing to prefer traditional sort unless we have very
high confidence we'll win with incremental sort.

It'd be nice not to have to add additional complexity if at all possible.

> > Unrelated to all of the above: if I read the patch properly it
> > intentionally excludes backwards scanning. I don't see any particular
> > reason why that ought to be the case, and it seems like an odd
> > limitation for the feature should it be merged. Should that be a
> > blocker to merging?
>
> Regarding this, I came across this,
> /*
>   * Incremental sort can't be used with either EXEC_FLAG_REWIND,
>   * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only current
>   * bucket in tuplesortstate.
>   */
> I think that is quite understandable. How are you planning to support
> backwards scan for this? In other words, when will incremental sort be
> useful for backward scan.

For some reason I was thinking we'd need it to support backwards scans
to be able to handle DESC sort on the index, but I've tested and
confirmed that already works. I suppose that's because the index scan
provides that ordering and the sort node doesn't need to reverse the
direction of what's provided to it. That's not particularly obvious to
someone newer to the codebase; I'm not sure if that's documented
anywhere.
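
For reference, the kind of query I mean is something like this (sketch
only, reusing the example table from earlier in the thread; the index name
is made up):

    CREATE INDEX IF NOT EXISTS t_a_idx ON t (a);

    -- ORDER BY ... DESC doesn't need the sort node to scan backwards: the
    -- index on (a) can be scanned backwards to provide a DESC, and the
    -- incremental sort only has to order b DESC within each prefix group.
    SELECT * FROM t ORDER BY a DESC, b DESC LIMIT 10;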

> On a different note, I can't stop imagining this operator on the lines
> similar to parallel-append, wherein multiple workers can sort the
> different groups independently at the same time.

That is an interesting idea. I suppose it'd be particularly valuable
if somehow there were a node that was generating each batch in parallel
already, though I'm not sure offhand what kind of query or
node would result in that. I also wonder if (assuming that weren't the
case) it would be much of an improvement since a single thread would
have to generate each batch anyway; I'm not sure if the overhead of
farming each batch out to a worker would actually be useful or if the
real blocker is the base scan.

At the very least it's an interesting idea.

---

I've been writing down notes here, and I realized that my test case
from far upthread is actually a useful setup to see how much overhead
is involved in sorting each batch individually, since it sets up data
with each batch only containing 1 tuple. That particular case is one
we could easily optimize anyway in the code and skip sorting
altogether -- might be a useful enhancement.

I hope to do some more testing and then report back with the results.

James Coleman



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:
>On Wed, Jun 5, 2019 at 12:14 PM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
>> > > 2) Provide some fallback at execution time. For example, we might watch
>> > > the size of the group, and if we run into an unexpectedly large one we
>> > > might abandon the incremental sort and switch to a "full sort" mode.
>> >
>> > Are there good examples of our doing this in other types of nodes
>> > (whether the fallback is an entirely different algorithm/node type)? I
>> > like this idea in theory, but I also think it's likely it would add a
>> > very significant amount of complexity. The other problem is knowing
>> > where to draw the line: you end up creating these kinds of cliffs
>> > where pulling one more tuple through the incremental sort would give
>> > you your batch and result in not having to pull many more tuples in a
>> > regular sort node, but the fallback logic kicks in anyway.
>> >
>> What about having some simple mechanism for this, like if we encounter
>> the group with more tuples than the one estimated then simply switch
>> to normal sort for the remaining tuples, as the estimates does not
>> hold true anyway. Atleast this will not give issues of having
>> regressions of incremental sort being too bad than the normal sort.
>> I mean having something like this, populate the tuplesortstate and
>> keep on counting the number of tuples in a group, if they are within
>> the budget call tuplesort_performsort, otherwise put all the tuples in
>> the tuplesort and then call tuplesort_performsort. We may have an
>> additional field in IncrementalSortState to save the estimated size of
>> each group. I am assuming that we use MCV lists to approximate better
>> the group sizes as suggested above by Tomas.
>
>I think the first thing to do is get some concrete numbers on performance if we:
>
>1. Only sort one group at a time.
>2. Update the costing to prefer traditional sort unless we have very
>high confidence we'll win with incremental sort.
>
>It'd be nice not to have to add additional complexity if at all possible.
>

+1 to that

>> > Unrelated to all of the above: if I read the patch properly it
>> > intentionally excludes backwards scanning. I don't see any particular
>> > reason why that ought to be the case, and it seems like an odd
>> > limitation for the feature should it be merged. Should that be a
>> > blocker to merging?
>>
>> Regarding this, I came across this,
>> /*
>>   * Incremental sort can't be used with either EXEC_FLAG_REWIND,
>>   * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only current
>>   * bucket in tuplesortstate.
>>   */
>> I think that is quite understandable. How are you planning to support
>> backwards scan for this? In other words, when will incremental sort be
>> useful for backward scan.
>
>For some reason I was thinking we'd need it to support backwards scans
>to be able to handle DESC sort on the index, but I've tested and
>confirmed that already works. I suppose that's because the index scan
>provides that ordering and the sort node doesn't need to reverse the
>direction of what's provided to it. That's not particularly obvious to
>someone newer to the codebase; I'm not sure if that's documented
>anywhere.
>

Yeah, backward scans are not about ASC/DESC, it's about being able to walk
back through the result. And we can't do that with incremental sorts
without materialization.
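
To spell out the distinction (illustration only), a backward scan is what
you need for something like a scrollable cursor, where already-returned
rows are fetched again in reverse:

    BEGIN;
    DECLARE c SCROLL CURSOR FOR SELECT * FROM t ORDER BY a, b;
    FETCH 10 FROM c;
    -- this is what EXEC_FLAG_BACKWARD is about: walking back over rows
    -- the plan has already produced, which an incremental sort can't do
    -- on its own because it only keeps the current group in its tuplesort.
    FETCH BACKWARD 5 FROM c;
    COMMIT;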

>> On a different note, I can't stop imagining this operator on the lines
>> similar to parallel-append, wherein multiple workers can sort the
>> different groups independently at the same time.
>
>That is an interesting idea. I suppose it'd be particularly valuable
>if somehow there a node that was generating each batch in parallel
>already, though I'm not sure at first though what kind of query or
>node would result in that. I also wonder if (assuming that weren't the
>case) it would be much of an improvement since a single thread would
>have to generate each batch anyway; I'm not sure if the overhead of
>farming each batch out to a worker would actually be useful or if the
>real blocker is the base scan.
>
>At the very least it's an interesting idea.
>

I kinda doubt that'd be very valuable. Or more precisely, we kinda already
have that capability because we can do things like this:

   -> Gather Merge
      -> Sort
         -> ... scan ...

so I imagine we'd just do an Incremental Sort here. Granted, it's not
distributed by prefix groups (I assume that's what you mean by batches
here), but I don't think that's a big problem.

>---
>
>I've been writing down notes here, and I realized that my test case
>from far upthread is actually a useful setup to see how much overhead
>is involved in sorting each batch individually, since it sets up data
>with each batch only containing 1 tuple. That particular case is one
>we could easily optimize anyway in the code and skip sorting
>altogether -- might be a useful enhancement.
>
>I hope to do some more testing and then report back with the results.
>
>James Coleman

OK.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Wed, Jun 05, 2019 at 06:14:14PM +0200, Rafia Sabih wrote:
>On Mon, 3 Jun 2019 at 15:39, James Coleman <jtc331@gmail.com> wrote:
>>
>> On Sun, Jun 2, 2019 at 5:18 PM Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>> > Currently, the costing logic (cost_incremental_sort) assumes all groups
>> > have the same size, which is fine for uniform distributions. But for
>> > skewed distributions, that may be an issue.
>> >
>> > Consider for example a table with 1M rows, two columns, 100 groups in each
>> > column, and index on the first one.
>> >
>> >     CREATE table t (a INT, b INT);
>> >
>> >     INSERT INTO t SELECT 100*random(), 100*random()
>> >       FROM generate_series(1,1000000);
>> >
>> > Now, let's do a simple limit query to find the first row:
>> >
>> >     SELECT * FROM t ORDER BY a, b LIMIT 1;
>> >
>> > In this case the current costing logic is fine - the groups are close to
>> > average size, and we only need to sort the first group, i.e. 1% of data.
>> >
>> > Now, let's say the first group is much larger:
>> >
>> >
>> >     INSERT INTO t SELECT 0, 100*random()
>> >       FROM generate_series(1,900000) s(i);
>> >
>> >     INSERT INTO t SELECT 100*random(), 100*random()
>> >       FROM generate_series(1,100000);
>> >
>> > That is, the first group is roughly 90% of data, but the number of groups
>> > is still the same. But we need to scan 90% of data. But the average group
>> > size is still the same, so the current cost model is oblivious to this.
>>
>> Thinking out loud here: the current implementation doesn't guarantee
>> that sort groups always have the same prefix column values because
>> (from code comments) "Sorting many small groups with tuplesort is
>> inefficient). While this seems a reasonable optimization, I think it's
>> possible that thinking steered away from an optimization in the
>> opposite direction. Perhaps we should always track whether or not all
>> prefix tuples are the same (currently we only do that after reaching
>> DEFAULT_MIN_GROUP_SIZE tuples) and use that information to be able to
>> have tuplesort only care about the non-prefix columns (where currently
>> it has to sort on all pathkey columns even though for a large group
>> the prefix columns are guaranteed to be equal).
>>
>+1 for passing only the non-prefix columns to tuplesort.
>> Essentially I'm trying to think of ways that would get us to
>> comparable performance with regular sort in the case of large batch
>> sizes.
>>
>> One other thing about the DEFAULT_MIN_GROUP_SIZE logic is that in the
>> case where you have a very small group and then a very large batch,
>> we'd lose the ability to optimize in the above way. That makes me
>> wonder if we shouldn't intentionally optimize for the possibility of
>> large batch sizes, since a little extra expense per group/tuple is
>> more likely to be a non-concern with small groups anyway when there
>> are large numbers of input tuples but a relatively small limit.
>>
>What about using the knowledge of MCV here, if we know the next value
>is in MCV list then take the overhead of sorting this small group
>alone and then leverage the optimization for the larger group, by
>passing only the non-prefix columns.

Not sure. It very much depends on how expensive the comparisons are for
that particular data type. If the comparisons are cheap, then I'm not sure
it matters very much whether the group is small or large. For expensive
comparison it may not be a win either, because we need to search the MCV
lists whenever the group changes.

I guess we'll need to make some benchmarks and see if it's a win or not.

>> Thoughts?
>>
>> > So I think we should look at the MCV list, and use that information when
>> > computing the startup/total cost. I think using the first/largest group to
>> > compute the startup cost, and the average group size for total cost would
>> > do the trick.
>>
>+1
>> I think this sounds very reasonable.
>>
>> > I don't think we can do much better than this during planning. There will
>> > inevitably be cases where the costing model will push us to do the wrong
>> > thing, in either direction. The question is how serious issue that is, and
>> > whether we could remedy that during execution somehow.
>> >
>> > When we "incorrectly" pick full sort (when the incremental sort would be
>> > faster), that's obviously sad but I think it's mostly fine because it's
>> > not a regression.
>> >
>> > The opposite direction (picking incremental sort, while full sort being
>> > faster) is probably more serious, because it's a regression between
>> > releases.
>> >
>> > I don't think we can fully fix that by refining the cost model. We have
>> > two basic options:
>> >
>> > 1) Argue/show that this is not an issue in practice, because (a) such
>> > cases are very rare, and/or (b) the regression is limited. In short, the
>> > benefits of the patch outweigh the losses.
>>
>> My comments above go in this direction. If we can improve performance
>> in the worst case, I think it's plausible this concern becomes a
>> non-issue.
>>
>> > 2) Provide some fallback at execution time. For example, we might watch
>> > the size of the group, and if we run into an unexpectedly large one we
>> > might abandon the incremental sort and switch to a "full sort" mode.
>>
>> Are there good examples of our doing this in other types of nodes
>> (whether the fallback is an entirely different algorithm/node type)? I
>> like this idea in theory, but I also think it's likely it would add a
>> very significant amount of complexity. The other problem is knowing
>> where to draw the line: you end up creating these kinds of cliffs
>> where pulling one more tuple through the incremental sort would give
>> you your batch and result in not having to pull many more tuples in a
>> regular sort node, but the fallback logic kicks in anyway.
>>

I don't think we have nodes where we'd switch to an entirely different
algorithm - say from hash-join to nested-loop join (as is often proposed
as a solution for excessive memory consumption). That's obviously a very
complex thing to implement.

But I don't think that's the type of fallback we'd need here - IMO it's
more similar to switching from in-memory to on-disk sort. Essentially,
we'd need to disable the extra logic (detecting the prefix grouping), and
just stash all the remaining tuples into tuplesort and do regular sort.

Of course, I haven't actually implemented this, so maybe it's trickier.

>What about having some simple mechanism for this, like if we encounter
>the group with more tuples than the one estimated then simply switch
>to normal sort for the remaining tuples, as the estimates does not
>hold true anyway. Atleast this will not give issues of having
>regressions of incremental sort being too bad than the normal sort.
>I mean having something like this, populate the tuplesortstate and
>keep on counting the number of tuples in a group, if they are within
>the budget call tuplesort_performsort, otherwise put all the tuples in
>the tuplesort and then call tuplesort_performsort. We may have an
>additional field in IncrementalSortState to save the estimated size of
>each group. I am assuming that we use MCV lists to approximate better
>the group sizes as suggested above by Tomas.
>

Maybe. I suggest we try to implement the simplest solution first, trying
to do as much as possible during planning, and then try to be smart at
execution time only if necessary.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sat, Apr 7, 2018 at 4:56 PM, Alexander Korotkov <
a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> I agree with that. For bounded sort, attached patch now selects minimal
> group
> size as Min(DEFAULT_MIN_GROUP_SIZE, bound). That should improve
> "LIMIT small_number" case.

As I was working on some benchmarking I noticed that incremental sort
never seemed to switch into the top-n heapsort mode, which meant for
very large groups it significantly underperformed a regular sort since
it would have to spill to disk every time. Perhaps this indicates some
missing tests also.

I tracked that down to a missing case for IncrementalSortState in
ExecSetTupleBound and have updated the patch to correct the issue
(and confirmed it now properly switches sort modes).

That also means the optimization of choosing the min group size based
on bounds (if available) wasn't previously working.

I also haven't seen incremental sort used in any parallel plans,
though there seems to be some code intended to support it. I haven't
really dug into it at all yet though, so I can't comment further.

I'm attaching an updated patch, and will reply separately with more
detailed comments on my current benchmarking work.

James Coleman

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:
>I think the first thing to do is get some concrete numbers on performance if we:
>
>1. Only sort one group at a time.
>2. Update the costing to prefer traditional sort unless we have very
>high confidence we'll win with incremental sort.
>
>It'd be nice not to have to add additional complexity if at all possible.

I've been focusing my efforts so far on seeing how much we can
eliminate performance penalties (relative to traditional sort). It
seems that if we can improve things enough there that we'd limit the
amount of adjustment needed to costing -- we'd still need to consider
cases where the lower startup cost results in picking significantly
different plans in a broad sense (presumably due to lower startup cost
and the ability to short circuit on a limit). But I'm hopeful then we
might be able to avoid having to consult MCV lists (and we wouldn't
have that available in all cases anyway)

As I see it the two most significant concerning cases right now are:
1. Very large batches (in particular where the batch is effectively
all of the matching rows such that we're really just doing a standard
sort).
2. Many very small batches.

(From reading the whole thread through again, there's a third case:
skewed group sizes, but I'm ignoring that for now because my intuition
is that if the sort can reasonably handle the above two cases *in
execution not just planning* the skew problem will become a
non-issue.)

For (1), I'd really like to be able to find a way to still benefit
from knowing we have prefix columns already sorted. Currently the
patch doesn't do that because of the optimization for small batches
precludes knowing that all prefixes in a batch are indeed equal. As
such I've temporarily removed that optimization in my testing to see
how much of a gain we get from not needing to sort prefix columns.
Intuitively it seems plausible that this would always beat a standard
sort, since a.) equality is sometimes much cheaper than ordering, and
b.) it reduces the likelihood of spilling to disk. In addition my fix
to make top-n heapsort improves things. The one thing I'm not sure
about here is the fact that I haven't seen parallel incremental sort
happen at all, and a parallel regular sort can sometimes beat single
process incremental sort (though at the cost of higher system
resources of course). As noted upthread though I haven't really
investigated the parallel component yet at all.

I did confirm a measurable speedup with suffix-only sorting (roughly
5%, relative to the current patch).

For (2), I'd love to have an optimization for the "batches are a
single tuple" case (indeed in my most common real-world use case,
that's virtually guaranteed). But that optimization is also
incompatible with the current minimum batch size optimization. There
was previously reference to the cost of copying the tuple slot being
the problem factor with small batches, and I've confirmed there is a
performance penalty for small batches if you handle every prefix as a
separate batch (relative to the current patch), but what I plan to
investigate next is whether or not that penalty is primarily due to
the bookkeeping of frequent copying of a pivot tuple or whether it's
due to running the sort algorithm itself so frequently. If it's just
the overhead of copying the pivot tuple, then I'm wondering if adding
functionality to tuplesort to allow peeking at the first inserted
tuple might be worth it.

Alternatively I've been thinking about ways to achieve a hybrid
approach: using single-batch suffix-only sort for large batches and
multi-batch full sort for very small batches. For example, I could
imagine maintaining two different tuplesorts (one for full sort and
one for suffix-only sort) and inserting tuples into the full sorter
until the minimum batch size, and then (like now) checking every tuple
to see when we've finished the batch. If we finish the batch by, say,
2 * minimum batch size, then we perform the full sort. If we don't
find a new batch by that point, we'd go back and check to see if the
tuples we've accumulated so far are actually all the same batch. If
so, we'd move all of them to the suffix-only sorter and continue
adding rows until we hit a new batch. If not, we'd perform the full
sort, but only return tuples out of the full sorter until we encounter
the current batch, and then we'd move the remainder into the
suffix-only sorter.

Using this kind of approach you can also imagine further optimizations
like checking the current tuple against the pivot tuple only every 10
processed tuples or so. That adds another level of complication,
obviously, but could be interesting.

Unrelated: There was some discussion upthread about whether or not
abbreviated keys could/should be supported. The comments
around tuplesort_set_bound in tuplesort.c claim that for bounded sort
abbreviated keys are not useful, and, in fact, the code there disables
abbreviated sort. Given that the single largest use case for this
patch is when we have a LIMIT, I think we can conclude not supporting
abbreviated keys is reasonable.

One other (meta) question: Do we care mostly about "real" queries on
synthetic data (and comparing them to master)? Or do we care more
about comparing solely the difference in sorting cost between the two
approaches? I think some of both is necessary, but I'm curious what
you all think.

James Coleman



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Simon Riggs
Date:
On Wed, 5 Jun 2019 at 17:14, Rafia Sabih <rafia.pghackers@gmail.com> wrote:
 
> Regarding this, I came across this,
> /*
>   * Incremental sort can't be used with either EXEC_FLAG_REWIND,
>   * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only current
>   * bucket in tuplesortstate.
>   */
> I think that is quite understandable. How are you planning to support
> backwards scan for this? In other words, when will incremental sort be
> useful for backward scan.

We stopped materializing the sort by default about 15 years ago because it wasn't a common use case and it was very expensive for large sorts.

It's no real problem if incremental sorts don't support backwards scans - we just won't use incremental in that case.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Solutions for the Enterprise

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Simon Riggs
Date:
On Mon, 24 Jun 2019 at 16:10, James Coleman <jtc331@gmail.com> wrote:
> On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:
> >I think the first thing to do is get some concrete numbers on performance if we:
> >
> >1. Only sort one group at a time.
> >2. Update the costing to prefer traditional sort unless we have very
> >high confidence we'll win with incremental sort.
> >
> >It'd be nice not to have to add additional complexity if at all possible.
>
> I've been focusing my efforts so far on seeing how much we can
> eliminate performance penalties (relative to traditional sort). It
> seems that if we can improve things enough there that we'd limit the
> amount of adjustment needed to costing -- we'd still need to consider
> cases where the lower startup cost results in picking significantly
> different plans in a broad sense (presumably due to lower startup cost
> and the ability to short circuit on a limit). But I'm hopeful then we
> might be able to avoid having to consult MCV lists (and we wouldn't
> have that available in all cases anyway)
>
> As I see it the two most significant concerning cases right now are:
> 1. Very large batches (in particular where the batch is effectively
> all of the matching rows such that we're really just doing a standard
> sort).
> 2. Many very small batches.

What is the specific use case for this? This sounds like quite a general case.

Do we know something about the nearly-sorted rows that could help us? Or could we introduce some information elsewhere that would help with the sort?

Could we, for example, pre-sort the rows block by block, or filter out the rows that are clearly out of order, so we can re-merge them later?

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Solutions for the Enterprise

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Mon, Jun 24, 2019 at 12:56 PM Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Mon, 24 Jun 2019 at 16:10, James Coleman <jtc331@gmail.com> wrote:
>>
>> On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:
>> >I think the first thing to do is get some concrete numbers on performance if we:
>> >
>> >1. Only sort one group at a time.
>> >2. Update the costing to prefer traditional sort unless we have very
>> >high confidence we'll win with incremental sort.
>> >
>> >It'd be nice not to have to add additional complexity if at all possible.
>>
>> I've been focusing my efforts so far on seeing how much we can
>> eliminate performance penalties (relative to traditional sort). It
>> seems that if we can improve things enough there that we'd limit the
>> amount of adjustment needed to costing -- we'd still need to consider
>> cases where the lower startup cost results in picking significantly
>> different plans in a broad sense (presumably due to lower startup cost
>> and the ability to short circuit on a limit). But I'm hopeful then we
>> might be able to avoid having to consult MCV lists (and we wouldn't
>> have that available in all cases anyway)
>>
>> As I see it the two most significant concerning cases right now are:
>> 1. Very large batches (in particular where the batch is effectively
>> all of the matching rows such that we're really just doing a standard
>> sort).
>> 2. Many very small batches.
>
>
> What is the specific use case for this? This sounds quite general case.

They are both general cases in some sense, but the concerns lie mostly
with what happens when they're unexpectedly encountered. For example,
if the expected row count or group size is off by a good bit and we
effectively have to perform a sort of all (or most) possible rows.

If we can get the performance to a point where that misestimated row
count or group size doesn't much matter, then ISTM including the patch
becomes a much more obvious total win.

> Do we know something about the nearly-sorted rows that could help us? Or
> could we introduce some information elsewhere that would help with the sort?
>
> Could we, for example, pre-sort the rows block by block, or filter out the
> rows that are clearly out of order, so we can re-merge them later?

I'm not sure what you mean by "block by block"?

James Coleman



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Simon Riggs
Date:
On Mon, 24 Jun 2019 at 18:01, James Coleman <jtc331@gmail.com> wrote:
> On Mon, Jun 24, 2019 at 12:56 PM Simon Riggs <simon@2ndquadrant.com> wrote:
>
> > What is the specific use case for this? This sounds quite general case.
>
> They are both general cases in some sense, but the concerns lie mostly
> with what happens when they're unexpectedly encountered. For example,
> if the expected row count or group size is off by a good bit and we
> effectively have to perform a sort of all (or most) possible rows.
>
> If we can get the performance to a point where that misestimated row
> count or group size doesn't much matter, then ISTM including the patch
> becomes a much more obvious total win.

I was trying to think of ways of using external information/circumstance to knowingly avoid negative use cases. i.e. don't treat sort as a black box, use its context.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Solutions for the Enterprise

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Mon, Jun 24, 2019 at 01:00:50PM -0400, James Coleman wrote:
>On Mon, Jun 24, 2019 at 12:56 PM Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> On Mon, 24 Jun 2019 at 16:10, James Coleman <jtc331@gmail.com> wrote:
>>>
>>> On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:
>>> >I think the first thing to do is get some concrete numbers on performance if we:
>>> >
>>> >1. Only sort one group at a time.
>>> >2. Update the costing to prefer traditional sort unless we have very
>>> >high confidence we'll win with incremental sort.
>>> >
>>> >It'd be nice not to have to add additional complexity if at all possible.
>>>
>>> I've been focusing my efforts so far on seeing how much we can
>>> eliminate performance penalties (relative to traditional sort). It
>>> seems that if we can improve things enough there that we'd limit the
>>> amount of adjustment needed to costing -- we'd still need to consider
>>> cases where the lower startup cost results in picking significantly
>>> different plans in a broad sense (presumably due to lower startup cost
>>> and the ability to short circuit on a limit). But I'm hopeful then we
>>> might be able to avoid having to consult MCV lists (and we wouldn't
>>> have that available in all cases anyway)
>>>
>>> As I see it the two most significant concerning cases right now are:
>>> 1. Very large batches (in particular where the batch is effectively
>>> all of the matching rows such that we're really just doing a standard
>>> sort).
>>> 2. Many very small batches.
>>
>>
>> What is the specific use case for this? This sounds quite general case.
>
>They are both general cases in some sense, but the concerns lie mostly
>with what happens when they're unexpectedly encountered. For example,
>if the expected row count or group size is off by a good bit and we
>effectively have to perform a sort of all (or most) possible rows.
>
>If we can get the performance to a point where that misestimated row
>count or group size doesn't much matter, then ISTM including the patch
>becomes a much more obvious total win.
>

Yes, that seems like a reasonable approach. Essentially, we're trying to
construct plausible worst case examples, and then minimize the overhead
compared to regular sort. If we get sufficiently close, then it's fine
to rely on somewhat shaky stats - like group size estimates.


>> Do we know something about the nearly-sorted rows that could help us?
>> Or could we introduce some information elsewhere that would help with
>> the sort?
>>
>> Could we for-example, pre-sort the rows block by block, or filter out
>> the rows that are clearly out of order, so we can re-merge them
>> later?
>
>I'm not sure what you mean by "block by block"?
>

I'm not sure what "block by block" means either.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Mon, Jun 24, 2019 at 07:05:24PM +0100, Simon Riggs wrote:
>On Mon, 24 Jun 2019 at 18:01, James Coleman <jtc331@gmail.com> wrote:
>
>> On Mon, Jun 24, 2019 at 12:56 PM Simon Riggs <simon@2ndquadrant.com>
>> wrote:
>>
>> > What is the specific use case for this? This sounds quite general case.
>>
>> They are both general cases in some sense, but the concerns lie mostly
>> with what happens when they're unexpectedly encountered. For example,
>> if the expected row count or group size is off by a good bit and we
>> effectively have to perform a sort of all (or most) possible rows.
>>
>> If we can get the performance to a point where that misestimated row
>> count or group size doesn't much matter, then ISTM including the patch
>> becomes a much more obvious total win.
>>
>
>I was trying to think of ways of using external information/circumstance to
>knowingly avoid negative use cases. i.e. don't treat sort as a black box,
>use its context.
>

Like what, for example? I'm not saying there's no such additional
information, but I can't think of anything at the moment.

One of the issues is that while the decision whether to use incremental
sort is done during planning, most of the additional (and reliable)
information about the data is only available at execution time. And it's
likely not available explicitly - we need to deduce it. Which I think is
exactly what the min group size heuristics is about, no?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Mon, Jun 24, 2019 at 4:16 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Mon, Jun 24, 2019 at 01:00:50PM -0400, James Coleman wrote:
> >On Mon, Jun 24, 2019 at 12:56 PM Simon Riggs <simon@2ndquadrant.com> wrote:
> >>
> >> On Mon, 24 Jun 2019 at 16:10, James Coleman <jtc331@gmail.com> wrote:
> >>>
> >>> On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:
...
> >>> As I see it the two most significant concerning cases right now are:
> >>> 1. Very large batches (in particular where the batch is effectively
> >>> all of the matching rows such that we're really just doing a standard
> >>> sort).
> >>> 2. Many very small batches.
> >>
> >>
> >> What is the specific use case for this? This sounds quite general case.
> >
> >They are both general cases in some sense, but the concerns lie mostly
> >with what happens when they're unexpectedly encountered. For example,
> >if the expected row count or group size is off by a good bit and we
> >effectively have to perform a sort of all (or most) possible rows.
> >
> >If we can get the performance to a point where that misestimated row
> >count or group size doesn't much matter, then ISTM including the patch
> >becomes a much more obvious total win.
> >
>
> Yes, that seems like a reasonable approach. Essentially, we're trying to
> construct plausible worst case examples, and then minimize the overhead
> compared to regular sort. If we get sufficiently close, then it's fine
> to rely on somewhat shaky stats - like group size estimates.

I have a bit of a mystery in my performance testing. I've been setting
up a table like so:

create table foo(pk serial primary key, owner_fk integer, created_at timestamp);
insert into foo(owner_fk, created_at)
select fk_t.i, now() - (time_t.i::text || ' minutes')::interval
from generate_series(1, 10000) time_t(i)
cross join generate_series(1, 1000) fk_t(i);
-- double up on one set to guarantee matching prefixes
insert into foo (owner_fk, created_at) select owner_fk, created_at
from foo where owner_fk = 23;
create index idx_foo_on_owner_and_created_at on foo(owner_fk, created_at);
analyze foo;

and then I have the following query:

select *
from foo
where owner_fk = 23
order by created_at desc, pk desc
limit 20000;

The idea here is to force a bit of a worst case for small groups: we
have 10,000 batches (i.e., equal prefix groups) of 2 tuples each and
then query with a limit matching the actual number of rows we know
will match the query -- so even though there's a limit we're forcing a
total sort (and also guaranteeing both plans have to touch the same
number of rows). Note: I know that batches of size one is actually the
worst case, but I chose batches of two because I've also been testing
a change that would skip the sort entirely for single tuple batches.
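
A quick sanity check of that batch structure, for anyone reproducing this
(illustration only):

SELECT count(*) AS batches, min(cnt) AS min_rows, max(cnt) AS max_rows
FROM (SELECT created_at, count(*) AS cnt
      FROM foo
      WHERE owner_fk = 23
      GROUP BY created_at) g;
-- expected: 10000 batches, each of exactly 2 rows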

On master (really the commit right before the current revision of the
patch), I get:
latency average = 14.271 ms
tps = 70.075243 (excluding connections establishing)

With the patch (and incremental sort enabled):
latency average = 11.975 ms
tps = 83.512090 (excluding connections establishing)

With the patch (but incremental sort disabled):
latency average = 11.884 ms
tps = 84.149834 (excluding connections establishing)

All of those are 60 seconds runs on pgbench with a single thread.

So we have a very substantial speedup with the patch *even if the new
feature isn't enabled*. I've confirmed the plan looks the same on
patched with incremental sort disabled and on master. The only changes
that would seem to really affect execution time would be the changes
to tuplesort.c, but looking through them I don't see anything I'd
expect to change things so dramatically.

Any thoughts on this?

James Coleman



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Mon, Jun 24, 2019 at 07:34:19PM -0400, James Coleman wrote:
>On Mon, Jun 24, 2019 at 4:16 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Mon, Jun 24, 2019 at 01:00:50PM -0400, James Coleman wrote:
>> >On Mon, Jun 24, 2019 at 12:56 PM Simon Riggs <simon@2ndquadrant.com> wrote:
>> >>
>> >> On Mon, 24 Jun 2019 at 16:10, James Coleman <jtc331@gmail.com> wrote:
>> >>>
>> >>> On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:
>...
>> >>> As I see it the two most significant concerning cases right now are:
>> >>> 1. Very large batches (in particular where the batch is effectively
>> >>> all of the matching rows such that we're really just doing a standard
>> >>> sort).
>> >>> 2. Many very small batches.
>> >>
>> >>
>> >> What is the specific use case for this? This sounds quite general case.
>> >
>> >They are both general cases in some sense, but the concerns lie mostly
>> >with what happens when they're unexpectedly encountered. For example,
>> >if the expected row count or group size is off by a good bit and we
>> >effectively have to perform a sort of all (or most) possible rows.
>> >
>> >If we can get the performance to a point where that misestimated row
>> >count or group size doesn't much matter, then ISTM including the patch
>> >becomes a much more obvious total win.
>> >
>>
>> Yes, that seems like a reasonable approach. Essentially, we're trying to
>> construct plausible worst case examples, and then minimize the overhead
>> compared to regular sort. If we get sufficiently close, then it's fine
>> to rely on somewhat shaky stats - like group size estimates.
>
>I have a bit of a mystery in my performance testing. I've been setting
>up a table like so:
>
>create table foo(pk serial primary key, owner_fk integer, created_at timestamp);
>insert into foo(owner_fk, created_at)
>select fk_t.i, now() - (time_t.i::text || ' minutes')::interval
>from generate_series(1, 10000) time_t(i)
>cross join generate_series(1, 1000) fk_t(i);
>-- double up on one set to guarantee matching prefixes
>insert into foo (owner_fk, created_at) select owner_fk, created_at
>from foo where owner_fk = 23;
>create index idx_foo_on_owner_and_created_at on foo(owner_fk, created_at);
>analyze foo;
>
>and then I have the following query:
>
>select *
>from foo
>where owner_fk = 23
>order by created_at desc, pk desc
>limit 20000;
>
>The idea here is to force a bit of a worst case for small groups: we
>have 10,000 batches (i.e., equal prefix groups) of 2 tuples each and
>then query with a limit matching the actual number of rows we know
>will match the query -- so even though there's a limit we're forcing a
>total sort (and also guaranteeing both plans have to touch the same
>number of rows). Note: I know that batches of size is actually the
>worst case, but I chose batches of two because I've also been testing
>a change that would skip the sort entirely for single tuple batches.
>
>On master (really the commit right before the current revision of the
>patch), I get:
>latency average = 14.271 ms
>tps = 70.075243 (excluding connections establishing)
>
>With the patch (and incremental sort enabled):
>latency average = 11.975 ms
>tps = 83.512090 (excluding connections establishing)
>
>With the patch (but incremental sort disabled):
>latency average = 11.884 ms
>tps = 84.149834 (excluding connections establishing)
>
>All of those are 60 seconds runs on pgbench with a single thread.
>
>So we have a very substantial speedup with patch *even if the new
>feature isn't enabled*. I've confirmed the plan looks the same on
>patched with incremental sort disabled and master. The only changes
>that would seem to really effect execution time would be the changes
>to tuplesort.c, but looking through them I don't see anything I'd
>expect to change things so dramatically.
>
>Any thoughts on this?
>

I can reproduce the same thing, so it's not just you. On my machine, I see
these tps numbers (average of 10 runs, 60 seconds each):

  master:        65.177
  patched (on):  80.368
  patched (off): 80.750

The numbers are very consistent (within 1 tps).

I've done a bit of CPU profiling, and on master I see this:

    13.84%  postgres  postgres            [.] comparetup_heap
     4.83%  postgres  postgres            [.] qsort_tuple
     3.87%  postgres  postgres            [.] pg_ltostr_zeropad
     3.55%  postgres  postgres            [.] pg_ltoa
     3.19%  postgres  postgres            [.] AllocSetAlloc
     2.68%  postgres  libc-2.28.so        [.] __GI___strlen_sse2
     2.38%  postgres  postgres            [.] LWLockRelease
     2.38%  postgres  postgres            [.] AppendSeconds.constprop.9
     2.22%  postgres  libc-2.28.so        [.] __memmove_sse2_unaligned_erms
     2.17%  postgres  postgres            [.] GetPrivateRefCountEntry
     2.03%  postgres  postgres            [.] j2date
     ...

while on patched versions I see this:

     4.60%  postgres  postgres            [.] pg_ltostr_zeropad
     4.51%  postgres  postgres            [.] pg_ltoa
     3.50%  postgres  postgres            [.] AllocSetAlloc
     3.34%  postgres  libc-2.28.so        [.] __GI___strlen_sse2
     2.99%  postgres  postgres            [.] LWLockRelease
     2.84%  postgres  postgres            [.] AppendSeconds.constprop.9
     2.65%  postgres  postgres            [.] GetPrivateRefCountEntry
     2.64%  postgres  postgres            [.] j2date
     2.60%  postgres  postgres            [.] printtup
     2.56%  postgres  postgres            [.] heap_hot_search_buffer
     ...
     1.35%  postgres  postgres            [.] comparetup_heap
     ...

So either we're calling comparetup_heap less often, or it's cheaper.

But it seems to be very dependent on the data set. If you do this:

    create table foo_2 as select * from foo order by random();
    alter table foo_2 add primary key (pk);
    create index idx_foo_2_on_owner_and_created_at on foo_2 (owner_fk, created_at);

and then run the test against this table, there's no difference.

So my guess is that this particular data set triggers slightly different
behavior in tuplesort, reducing the cost of comparetup_heap. The speedup
is quite significant (~20% on my system); the question is how widely
applicable it is.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Jun 25, 2019 at 12:02 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Mon, Jun 24, 2019 at 07:34:19PM -0400, James Coleman wrote:
> >On Mon, Jun 24, 2019 at 4:16 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> On Mon, Jun 24, 2019 at 01:00:50PM -0400, James Coleman wrote:
> >> >On Mon, Jun 24, 2019 at 12:56 PM Simon Riggs <simon@2ndquadrant.com> wrote:
> >> >>
> >> >> On Mon, 24 Jun 2019 at 16:10, James Coleman <jtc331@gmail.com> wrote:
> >> >>>
> >> >>> On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:
> >...
> >> >>> As I see it the two most significant concerning cases right now are:
> >> >>> 1. Very large batches (in particular where the batch is effectively
> >> >>> all of the matching rows such that we're really just doing a standard
> >> >>> sort).
> >> >>> 2. Many very small batches.
> >> >>
> >> >>
> >> >> What is the specific use case for this? This sounds quite general case.
> >> >
> >> >They are both general cases in some sense, but the concerns lie mostly
> >> >with what happens when they're unexpectedly encountered. For example,
> >> >if the expected row count or group size is off by a good bit and we
> >> >effectively have to perform a sort of all (or most) possible rows.
> >> >
> >> >If we can get the performance to a point where that misestimated row
> >> >count or group size doesn't much matter, then ISTM including the patch
> >> >becomes a much more obvious total win.
> >> >
> >>
> >> Yes, that seems like a reasonable approach. Essentially, we're trying to
> >> construct plausible worst case examples, and then minimize the overhead
> >> compared to regular sort. If we get sufficiently close, then it's fine
> >> to rely on somewhat shaky stats - like group size estimates.
> >
> >I have a bit of a mystery in my performance testing. I've been setting
> >up a table like so:
> >
> >create table foo(pk serial primary key, owner_fk integer, created_at timestamp);
> >insert into foo(owner_fk, created_at)
> >select fk_t.i, now() - (time_t.i::text || ' minutes')::interval
> >from generate_series(1, 10000) time_t(i)
> >cross join generate_series(1, 1000) fk_t(i);
> >-- double up on one set to guarantee matching prefixes
> >insert into foo (owner_fk, created_at) select owner_fk, created_at
> >from foo where owner_fk = 23;
> >create index idx_foo_on_owner_and_created_at on foo(owner_fk, created_at);
> >analyze foo;
> >
> >and then I have the following query:
> >
> >select *
> >from foo
> >where owner_fk = 23
> >order by created_at desc, pk desc
> >limit 20000;
> >
> >The idea here is to force a bit of a worst case for small groups: we
> >have 10,000 batches (i.e., equal prefix groups) of 2 tuples each and
> >then query with a limit matching the actual number of rows we know
> >will match the query -- so even though there's a limit we're forcing a
> >total sort (and also guaranteeing both plans have to touch the same
> >number of rows). Note: I know that batches of size is actually the
> >worst case, but I chose batches of two because I've also been testing
> >a change that would skip the sort entirely for single tuple batches.
> >
> >On master (really the commit right before the current revision of the
> >patch), I get:
> >latency average = 14.271 ms
> >tps = 70.075243 (excluding connections establishing)
> >
> >With the patch (and incremental sort enabled):
> >latency average = 11.975 ms
> >tps = 83.512090 (excluding connections establishing)
> >
> >With the patch (but incremental sort disabled):
> >latency average = 11.884 ms
> >tps = 84.149834 (excluding connections establishing)
> >
> >All of those are 60 seconds runs on pgbench with a single thread.
> >
> >So we have a very substantial speedup with patch *even if the new
> >feature isn't enabled*. I've confirmed the plan looks the same on
> >patched with incremental sort disabled and master. The only changes
> >that would seem to really effect execution time would be the changes
> >to tuplesort.c, but looking through them I don't see anything I'd
> >expect to change things so dramatically.
> >
> >Any thoughts on this?
> >
>
> I can reproduce the same thing, so it's not just you. On my machine, I see
> these tps numbers (average of 10 runs, 60 seconds each):
>
>   master:        65.177
>   patched (on):  80.368
>   patched (off): 80.750
>
> The numbers are very consistent (within 1 tps).
>
> I've done a bit of CPU profiling, and on master I see this:
>
>     13.84%  postgres  postgres            [.] comparetup_heap
>      4.83%  postgres  postgres            [.] qsort_tuple
>      3.87%  postgres  postgres            [.] pg_ltostr_zeropad
>      3.55%  postgres  postgres            [.] pg_ltoa
>      3.19%  postgres  postgres            [.] AllocSetAlloc
>      2.68%  postgres  libc-2.28.so        [.] __GI___strlen_sse2
>      2.38%  postgres  postgres            [.] LWLockRelease
>      2.38%  postgres  postgres            [.] AppendSeconds.constprop.9
>      2.22%  postgres  libc-2.28.so        [.] __memmove_sse2_unaligned_erms
>      2.17%  postgres  postgres            [.] GetPrivateRefCountEntry
>      2.03%  postgres  postgres            [.] j2date
>      ...
>
> while on patched versions I see this:
>
>      4.60%  postgres  postgres            [.] pg_ltostr_zeropad
>      4.51%  postgres  postgres            [.] pg_ltoa
>      3.50%  postgres  postgres            [.] AllocSetAlloc
>      3.34%  postgres  libc-2.28.so        [.] __GI___strlen_sse2
>      2.99%  postgres  postgres            [.] LWLockRelease
>      2.84%  postgres  postgres            [.] AppendSeconds.constprop.9
>      2.65%  postgres  postgres            [.] GetPrivateRefCountEntry
>      2.64%  postgres  postgres            [.] j2date
>      2.60%  postgres  postgres            [.] printtup
>      2.56%  postgres  postgres            [.] heap_hot_search_buffer
>      ...
>      1.35%  postgres  postgres            [.] comparetup_heap
>      ...
>
> So either we're calling comparetup_heap less often, or it's cheaper.
>
> But it seems to be very dependent on the data set. If you do this:
>
>     create table foo_2 as select * from foo order by random();
>     alter table foo_2 add primary key (pk);
>     create index idx_foo_2_on_owner_and_created_at on foo_2 (owner_fk, created_at);
>
> and then run the test against this table, there's no difference.
>
> So my guess is this particular data set triggers slightly different
> behavior in tuplesort, reducing the cost of comparetup_heap. The speedup
> is quite significant (~20% on my system), the question is how widely
> applicable can it be.

Thanks for confirming!

Given the patch contents I don't see any obvious reason why either
of those possibilities (calling comparetup_heap less often, or it
being cheaper) is likely. Is that something we should investigate
further? Or is it just a nice happy accident that we should ignore,
since it's dataset specific?

Anyway, when evaluating patch performance with oddities like this,
would you compare incremental sort on versus off on the same revision,
or patched versus master, when determining whether there's a
performance regression? I think I could make an argument either way.

James Coleman



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Peter Geoghegan
Date:
On Tue, Jun 25, 2019 at 9:53 AM James Coleman <jtc331@gmail.com> wrote:
> Given the patch contents I don't see any obviously reason why either
> of those possibilities (calling comparetup_heap less often, or it's
> cheaper) are likely. Is that something we should investigate further?
> Or is it just a nice happy accident that we should ignore since it's
> dataset specific?

Have you actually confirmed that comparetup_heap() is called less
often, by instrumenting the number of individual calls specifically?
If there has been a reduction in calls to comparetup_heap(), then
that's weird, and needs to be explained. If it turns out that it isn't
actually called less often, then I would suggest that the speedup
might be related to memory fragmentation, which often matters a lot
within tuplesort.c. (This is why external sort merging now uses big
buffers, and double buffering.)
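
A minimal, self-contained sketch of that instrumentation idea (counting
comparator invocations so two builds or two data sets can be compared
directly) could look like the following; it uses plain qsort() as a
stand-in, and none of it is taken from the patch or from tuplesort.c:

    #include <stdio.h>
    #include <stdlib.h>

    /* counts comparator invocations, analogous to instrumenting comparetup_heap() */
    static unsigned long ncompares = 0;

    static int
    compare_int(const void *pa, const void *pb)
    {
        int a = *(const int *) pa;
        int b = *(const int *) pb;

        ncompares++;
        return (a > b) - (a < b);
    }

    int
    main(void)
    {
        int n = 100000;
        int *v = malloc(n * sizeof(int));

        for (int i = 0; i < n; i++)
            v[i] = rand();

        qsort(v, n, sizeof(int), compare_int);
        printf("comparator calls: %lu\n", ncompares);

        free(v);
        return 0;
    }

Running the equivalent counter in both builds (or against both the
original foo and the shuffled foo_2) would show whether the number of
comparisons actually changes, or only their per-call cost.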

-- 
Peter Geoghegan



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Jun 25, 2019 at 1:13 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Jun 25, 2019 at 9:53 AM James Coleman <jtc331@gmail.com> wrote:
> > Given the patch contents I don't see any obviously reason why either
> > of those possibilities (calling comparetup_heap less often, or it's
> > cheaper) are likely. Is that something we should investigate further?
> > Or is it just a nice happy accident that we should ignore since it's
> > dataset specific?
>
> Have you actually confirmed that comparetup_heap() is called less
> often, by instrumenting the number of individual calls specifically?
> If there has been a reduction in calls to comparetup_heap(), then
> that's weird, and needs to be explained. If it turns out that it isn't
> actually called less often, then I would suggest that the speedup
> might be related to memory fragmentation, which often matters a lot
> within tuplesort.c. (This is why external sort merging now uses big
> buffers, and double buffering.)

No, I haven't confirmed that it's called less frequently, and I'd be
extremely surprised if it were, given that the diff doesn't suggest any
changes in that area at all.

If you think it's important enough to do so, I can instrument it to
confirm, but I mostly wanted to know if there were any other
plausible explanations, and I think you've provided one: there *are*
changes in the patch to memory contexts in tuplesort.c, so if memory
fragmentation is a real concern this patch could well show changes
in that regard.

James Coleman



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Peter Geoghegan
Date:
On Tue, Jun 25, 2019 at 11:03 AM James Coleman <jtc331@gmail.com> wrote:
> No, I haven't confirmed that it's called less frequently, and I'd be
> extremely surprised if it were given the diff doesn't suggest any
> changes to that at all.

I must have misunderstood, then. I thought that you were suggesting
that that might have happened.

> If you think it's important enough to do so, I can instrument it to
> confirm, but I was mostly wanting to know if there were any other
> plausible explanations, and I think you've provided one: there *are*
> changes in the patch to memory contexts in tuplesort.c, so if memory
> fragmentation is a real concern this patch could definitely notice
> changes in that regard.

Sounds like it's probably fragmentation. That's generally hard to measure.

-- 
Peter Geoghegan



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Tue, Jun 25, 2019 at 12:13:01PM -0700, Peter Geoghegan wrote:
>On Tue, Jun 25, 2019 at 11:03 AM James Coleman <jtc331@gmail.com> wrote:
>> No, I haven't confirmed that it's called less frequently, and I'd be
>> extremely surprised if it were given the diff doesn't suggest any
>> changes to that at all.
>
>I must have misunderstood, then. I thought that you were suggesting
>that that might have happened.
>
>> If you think it's important enough to do so, I can instrument it to
>> confirm, but I was mostly wanting to know if there were any other
>> plausible explanations, and I think you've provided one: there *are*
>> changes in the patch to memory contexts in tuplesort.c, so if memory
>> fragmentation is a real concern this patch could definitely notice
>> changes in that regard.
>
>Sounds like it's probably fragmentation. That's generally hard to measure.
>

I'm not sure I'm really convinced this explains the difference, because
the changes in tuplesort.c are actually fairly small - we do split the
tuplesort context into two, but the vast majority of the stuff is
allocated in one of the contexts (essentially just the tuplesort state
gets moved to a new context). I wouldn't expect this to have such a
strong impact on locality/fragmentation.

But maybe it does - in that case it seems it might be worthwhile to do it
separately, irrespective of the incremental sort patch. I wonder if
perf would show that as cache hits/misses, or something?

It shouldn't be that difficult to separate this change into a separate
patch, and benchmark it on its own, though.


FWIW while looking at the tuplesort.c changes, I've noticed some
inaccurate comments in tuplesort_free. Firstly, the top-level comment
says:

/*
 * tuplesort_free
 *
 *    Internal routine for freeing resources of tuplesort.
 */

without mentioning which resources it actually releases, so it kinda
suggests it releases everything. But that's not true - AFAICS it only
releases the per-sort resources. IMO this is a poor function name, and
people will easily keep resources longer than they think - we should
rename it to something like tuplesort_free_batch().

And then at the end tuplesort_free() does this:

    /*
     * Free the per-sort memory context, thereby releasing all working memory,
     * including the Tuplesortstate struct itself.
     */
    MemoryContextReset(state->sortcontext);

But that's clearly not true, because the tuplesortstate is allocated in
the maincontext, not sortcontext.

In general, the comments seem to be a bit confused about what 'sort'
means. Sometimes it means the whole sort operation, sometimes it means
one of the batches, etc.  And the fact that the per-batch context is
called sortcontext does not really improve the situation.
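
The distinction can be illustrated with a stand-alone analogy (plain C;
nothing here is taken from tuplesort.c): the state lives in one
allocation, playing the role of maincontext, while the per-batch working
memory lives elsewhere, playing the role of sortcontext, so resetting
the latter cannot possibly release the former:

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct SortStateDemo
    {
        int    *batch_buf;      /* per-batch working memory (the "sortcontext" analogue) */
        size_t  batch_len;
    } SortStateDemo;

    /* releases only the per-batch resources; the state itself survives */
    static void
    free_batch(SortStateDemo *state)
    {
        free(state->batch_buf);
        state->batch_buf = NULL;
        state->batch_len = 0;
    }

    int
    main(void)
    {
        /* the state is allocated separately (the "maincontext" analogue) */
        SortStateDemo *state = malloc(sizeof(SortStateDemo));

        state->batch_buf = malloc(1000 * sizeof(int));
        state->batch_len = 1000;

        free_batch(state);      /* per-batch reset */
        printf("state still valid after batch reset: %p\n", (void *) state);

        free(state);            /* only now is the state itself released */
        return 0;
    }

Which is exactly why a comment claiming that resetting the per-batch
memory also frees the state is misleading.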


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Jun 25, 2019 at 4:32 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Jun 25, 2019 at 12:13:01PM -0700, Peter Geoghegan wrote:
> >On Tue, Jun 25, 2019 at 11:03 AM James Coleman <jtc331@gmail.com> wrote:
> >> No, I haven't confirmed that it's called less frequently, and I'd be
> >> extremely surprised if it were given the diff doesn't suggest any
> >> changes to that at all.
> >
> >I must have misunderstood, then. I thought that you were suggesting
> >that that might have happened.
> >
> >> If you think it's important enough to do so, I can instrument it to
> >> confirm, but I was mostly wanting to know if there were any other
> >> plausible explanations, and I think you've provided one: there *are*
> >> changes in the patch to memory contexts in tuplesort.c, so if memory
> >> fragmentation is a real concern this patch could definitely notice
> >> changes in that regard.
> >
> >Sounds like it's probably fragmentation. That's generally hard to measure.
> >
>
> I'm not sure I'm really conviced this explains the difference, because
> the changes in tuplesort.c are actually fairly small - we do split the
> tuplesort context into two, but vast majority of the stuff is allocated
> in one of the contexts (essentially just the tuplesort state gets moved
> to a new context). I wouldn't expect this to have such strong impact on
> locality/fragmentation.

OTOH it is, as you noted, heavily dependent on the data... so it's hard
to say whether it's a real win or not.

> But maybe it does - in that case it seems it might be worthwile to do it
> separately, irrespectedly of the incremental sort patch. I wonder if
> perf would show that as cache hits/misses, or something?
>
> It shouldn't be that difficult to separate this change into a separate
> patch, and benchmark it on it's own, though.

I don't know enough about perf to say, but unless this ends up being a
sticking point for the patch I'll probably avoid it for now because
there are too many other things to worry about in the patch.

> FWIW while looking at the tuplesort.c changes, I've noticed some
> inaccurate comments in tuplesort_free. Firstly, the top-level comment
> says:
>
> /*
>  * tuplesort_free
>  *
>  *      Internal routine for freeing resources of tuplesort.
>  */
>
> without mentioning which resources it actually releases, so it kinda
> suggests it releases everything. But that's not true - AFAICS it only
> releases the per-sort resources. IMO this is a poor function name, and
> people will easily keep resources longer than they think - we should
> rename it to something like tuplesort_free_batch().
>
> And then at the end tuplesort_free() does this:
>
>     /*
>      * Free the per-sort memory context, thereby releasing all working memory,
>      * including the Tuplesortstate struct itself.
>      */
>     MemoryContextReset(state->sortcontext);
>
> But that's clearly not true, because the tuplesortstate is allocated in
> the maincontext, not sortcontext.
>
> In general, the comments seem to be a bit confused by what 'sort' means.
> Sometimes it means the whole sort operation, sometimes it means one of
> the batches, etc.  And the fact that the per-batch context is called
> sortcontext does not really improve the situation.

There are quite a few misleading or out-of-date comments in
nodeIncrementalSort.c as well. I'm currently working on the hybrid
approach I mentioned earlier, but once the patch proper looks like
we're coming close to addressing the performance concerns/costing I'll
do a pass through the comments to clean them up.

Unrelated: if you (or someone else you know who's more familiar with
the parallel code) could look at the patch at some point, I'd
appreciate it, because I have a suspicion it might never be operating
in parallel (either that or I don't know how to trigger it), but I'm
not really familiar with that stuff at all currently. :)

James Coleman



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Tue, Jun 25, 2019 at 04:53:40PM -0400, James Coleman wrote:
>
>Unrelated: if you or someone else you know that's more familiar with
>the parallel code, I'd be interested in their looking at the patch at
>some point, because I have a suspicion it might not be operating in
>parallel ever (either that or I don't know how to trigger it), but I'm
>not really familiar with that stuff at all currently. :)
>

That's an interesting question. I don't think plans like this would be
very interesting:

    Limit
      -> Incremental Sort
         -> Gather Merge
            -> Index Scan

because most of the extra cost would be paid in the leader anyway. So
I'm not all that surprised those paths are not generated (I might be
wrong and those plans would be interesting, though).

But I think something like this would be quite beneficial:

    Limit
     -> Gather Merge
        -> Incremental Sort
           -> Index Scan

So I've looked into that, and the reason seems fairly simple - when
generating the Gather Merge paths, we only look at paths that are in
partial_pathlist. See generate_gather_paths().

And we only have sequential + index paths in partial_pathlist, not
incremental sort paths.

IMHO we can do two things:

1) modify generate_gather_paths to also consider incremental sort for
each sorted path, similarly to what create_ordered_paths does

2) modify build_index_paths to also generate an incremental sort path
for each index path

IMHO (1) is the right choice here, because it automatically does the
trick for all other types of ordered paths, not just index scans. So,
something like the attached patch, which gives me plans like this:


                                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
     Limit  (cost=0.86..2.85 rows=100000 width=12) (actual time=3.726..233.249 rows=100000 loops=1)
       ->  Gather Merge  (cost=0.86..120.00 rows=5999991 width=12) (actual time=3.724..156.802 rows=100000 loops=1)
             Workers Planned: 2
             Workers Launched: 2
             ->  Incremental Sort  (cost=0.84..100.00 rows=2499996 width=12) (actual time=0.563..164.438 rows=33910 loops=3)
                   Sort Key: a, b
                   Presorted Key: a
                   Sort Method: quicksort  Memory: 27kB
                   Sort Groups: 389
                   Worker 0:  Sort Method: quicksort  Memory: 27kB  Groups: 1295
                   Worker 1:  Sort Method: quicksort  Memory: 27kB  Groups: 1241
                   ->  Parallel Index Scan using t_a_idx on t  (cost=0.43..250612.29 rows=2499996 width=12) (actual time=0.027..128.518 rows=33926 loops=3)
     Planning Time: 68559.695 ms
     Execution Time: 285.245 ms
    (14 rows)


This is not the whole story, though - there seems to be some costing
issue, because even with the parallel costs set to 0, I only get such
plans after I tweak the cost in the patch like this:

    subpath->total_cost = 100.0;
    path->path.total_cost = 120.0;

When I don't do that, the gather merge path gets a total cost of
1037485, and it gets beaten by this plan:

                                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
     Limit  (cost=1.05..152.75 rows=100000 width=12) (actual time=0.234..374.492 rows=100000 loops=1)
       ->  Incremental Sort  (cost=1.05..9103.09 rows=5999991 width=12) (actual time=0.232..316.210 rows=100000 loops=1)
             Sort Key: a, b
             Presorted Key: a
             Sort Method: quicksort  Memory: 27kB
             Sort Groups: 2863
             ->  Index Scan using t_a_idx on t  (cost=0.43..285612.24 rows=5999991 width=12) (actual time=0.063..240.248 rows=100003 loops=1)
     Planning Time: 53743.858 ms
     Execution Time: 403.379 ms
    (9 rows)

I suspect it's related to the fact that for the Gather Merge plan we
don't have the information about the number of rows, while for the
incremental sort we have it.

But clearly 9103.09 is not the total cost of producing all 6M rows from
the incremental sort (that would have to be higher than 285612, the
cost of the index scan). So it seems the true total cost of the
incremental sort is ~546000, because

    (100000 / 6000000) * 546000 = 9100

which is close to the incremental sort's displayed total cost
(9103.09). But then

    (100000.0 / 6000000) * 9103.09 = 152

which is roughly the Limit's total cost (152.75), so it seems we
actually apply the linear approximation twice. That seems
pretty bogus, IMO. And indeed, if I remove this part from
cost_incremental_sort:

    if (limit_tuples > 0 && limit_tuples < input_tuples)
    {
        output_tuples = limit_tuples;
        output_groups = floor(output_tuples / group_tuples) + 1;
    }

then it behaves kinda reasonable:

    explain  select * from t order by a, b limit 100000;

                                           QUERY PLAN
    -----------------------------------------------------------------------------------------
     Limit  (cost=1.05..9103.12 rows=100000 width=12)
       ->  Incremental Sort  (cost=1.05..546124.41 rows=5999991 width=12)
             Sort Key: a, b
             Presorted Key: a
             ->  Index Scan using t_a_idx on t  (cost=0.43..285612.24 rows=5999991 width=12)
    (5 rows)

    set parallel_tuple_cost = 0;
    set parallel_setup_cost = 0;

    explain  select * from t order by a, b limit 100000;

                                                   QUERY PLAN
    --------------------------------------------------------------------------------------------------------
     Limit  (cost=0.86..6775.63 rows=100000 width=12)
       ->  Gather Merge  (cost=0.86..406486.25 rows=5999991 width=12)
             Workers Planned: 2
             ->  Incremental Sort  (cost=0.84..343937.44 rows=2499996 width=12)
                   Sort Key: a, b
                   Presorted Key: a
                   ->  Parallel Index Scan using t_a_idx on t  (cost=0.43..250612.29 rows=2499996 width=12)
    (7 rows)

But I'm not going to claim those are complete fixes; it's the minimum I
needed to do to make this particular type of plan work.
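
For concreteness, the double application of the limit fraction can be
reproduced with a throwaway calculation (the figures are copied from the
plans above; this is just arithmetic, not planner code):

    #include <stdio.h>

    int
    main(void)
    {
        double sort_total = 546124.41;              /* incremental sort, full cost */
        double limit_frac = 100000.0 / 5999991.0;   /* limit / estimated input rows */

        /* ~9102: the incremental sort's displayed total cost */
        printf("scaled once:  %.2f\n", sort_total * limit_frac);

        /* ~152: the suspiciously low Limit cost in the broken plan */
        printf("scaled twice: %.2f\n", sort_total * limit_frac * limit_frac);
        return 0;
    }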


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Mon, Jun 24, 2019 at 11:10 AM James Coleman <jtc331@gmail.com> wrote:
> As I see it the two most significant concerning cases right now are:
> 1. Very large batches (in particular where the batch is effectively
> all of the matching rows such that we're really just doing a standard
> sort).
> 2. Many very small batches.
...
> Alternatively I've been thinking about ways to achieve a hybrid
> approach: using single-batch suffix-only sort for large batches and
> multi-batch full sort for very small batches. For example, I could
> imagine maintaining two different tuplesorts (one for full sort and
> one for suffix-only sort) and inserting tuples into the full sorter
> until the minimum batch size, and then (like now) checking every tuple
> to see when we've finished the batch. If we finish the batch by, say,
> 2 * minimum batch size, then we perform the full sort. If we don't
> find a new batch by that point, we'd go back and check to see if the
> tuples we've accumulated so far are actually all the same batch. If
> so, we'd move all of them to the suffix-only sorter and continue
> adding rows until we hit a new batch. If not, we'd perform the full
> sort, but only return tuples out of the full sorter until we encounter
> the current batch, and then we'd move the remainder into the
> suffix-only sorter.

Over the past week or two I've implemented this approach. The attached
patch retains the minimum group size logic already in the patch, but
after 32 tuples it only looks for a different prefix key group within
the next 32 tuples. If it finds a new prefix key group, then everything
happens exactly as before. However, if it doesn't find a new prefix key
group within that window, it assumes it's found a large run of tuples
in a single prefix key group. To guarantee that's the case, it
transfers tuples from the full-sort tuplesort into a presorted-prefix
tuplesort and (leaving out a few details here about how it handles the
possibility of multiple prefix key groups already in the sort)
continues to fill up that optimized sort with the large group. A rough
sketch of the mode-switch heuristic follows.
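
Roughly, the decision logic could be sketched like this (a self-contained
toy, not the patch's code; MIN_GROUP_SIZE stands in for the 32-tuple
threshold, and the input is a made-up stream of prefix values):

    #include <stdio.h>

    #define MIN_GROUP_SIZE 32
    #define NROWS 200

    int
    main(void)
    {
        int prefix[NROWS];
        int i = 0;

        /* toy input: one large prefix group (value 0) followed by many tiny ones */
        for (int j = 0; j < NROWS; j++)
            prefix[j] = (j < 100) ? 0 : j;

        while (i < NROWS)
        {
            int start = i;
            int limit = start + 2 * MIN_GROUP_SIZE;
            int found_new_group = 0;

            /* buffer MIN_GROUP_SIZE rows, then watch the next MIN_GROUP_SIZE for a prefix change */
            while (i < NROWS && i < limit)
            {
                if (i >= start + MIN_GROUP_SIZE && prefix[i] != prefix[start])
                {
                    found_new_group = 1;
                    break;
                }
                i++;
            }

            if (!found_new_group && i >= limit)
            {
                /*
                 * No new prefix group within the window: assume a large group.
                 * (In the real approach the already-buffered tuples would be
                 * transferred to the presorted-prefix tuplesort at this point.)
                 */
                while (i < NROWS && prefix[i] == prefix[start])
                    i++;
                printf("rows %d..%d: suffix-only sort (large group, prefix=%d)\n",
                       start, i - 1, prefix[start]);
            }
            else
                printf("rows %d..%d: full sort (small group(s))\n", start, i - 1);
        }
        return 0;
    }

The real implementation obviously operates on a stream of tuples and two
tuplesorts rather than an in-memory array, but the thresholds and the
switch point are the part being described here.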

This approach should allow us to, broadly speaking, have our cake and
eat it too with respect to small and large batches of tuples in prefix
key groups. There is of course the cost of switching modes, but this
is much like the cost that bound sort pays to switch into top-n heap
sorting mode; there's an inflection point, and you pay some extra cost
right at it, but it's worth it in the broad case.

Initial testing shows promising performance (akin to the original
patch for small groups, and similar to my variant that only ever sorted
single prefix key groups for large batches).

A couple of thoughts:
- It'd be nice if tuplesort allowed us to pull back out tuples in FIFO
manner without sorting. That'd lower the inflection point cost of
switching modes.
- I haven't adjusted costing yet for this change in approach; I wanted
to take a more holistic look at that after getting this working.
- I don't have strong feelings about the group size inflection points.
More perf testing would be useful, and also it's plausible we should
do more adjusting of those heuristics based on the size of the bound,
if we have one.
- The GUC enable_sort currently also disables incremental sort, which
makes sense, but it also means there's no good way to nudge a plan
using a full sort into a plan that uses incremental sort. That seems
less than ideal, but I'm not sure what the best solution would be.
- I did a lot of comment work in this patch, but comments for changes
to tuplesort.c and elsewhere still need to be cleaned up.

A few style notes:
- I know some of the variable declarations don't line up well; I need
to figure out how to get Vim to do what appears to be the standard
style in PG source.
- All code and comment lines are the right length, I think with the
exception of debug printf statements. I'm not sure if long strings are
an exception.
- Lining up arguments is another thing Vim isn't set up to do, though
maybe someone has some thoughts on a good approach.
I'm planning to look at the formatting utility when I get a chance,
but it'd be nice to have the editor handle the basic setup most of the
time.

Process questions:
- Do I need to explicitly move the patch somehow to the next CF?
- Since I've basically taken over patch ownership, should I move my
name from reviewer to author in the CF app? And can there be two
authors listed there?

James Coleman

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Jun 25, 2019 at 7:22 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Jun 25, 2019 at 04:53:40PM -0400, James Coleman wrote:
> >
> >Unrelated: if you or someone else you know that's more familiar with
> >the parallel code, I'd be interested in their looking at the patch at
> >some point, because I have a suspicion it might not be operating in
...
> So I've looked into that, and the reason seems fairly simple - when
> generating the Gather Merge paths, we only look at paths that are in
> partial_pathlist. See generate_gather_paths().
>
> And we only have sequential + index paths in partial_pathlist, not
> incremental sort paths.
>
> IMHO we can do two things:
>
> 1) modify generate_gather_paths to also consider incremental sort for
> each sorted path, similarly to what create_ordered_paths does
>
> 2) modify build_index_paths to also generate an incremental sort path
> for each index path
>
> IMHO (1) is the right choice here, because it automatically does the
> trick for all other types of ordered paths, not just index scans. So,
> something like the attached patch, which gives me plans like this:
...
> But I'm not going to claim those are total fixes, it's the minimum I
> needed to do to make this particular type of plan work.

Thanks for looking into this!

I intended to apply this to my most recent version of the patch (just
sent a few minutes ago), but when I applied it I noticed that the
partition_aggregate regression tests have several failures like this:

ERROR:  could not find pathkey item to sort

I haven't had time to look into the cause yet, so I decided to wait
until the next patch revision.

James Coleman



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Thu, Jul 04, 2019 at 09:29:49AM -0400, James Coleman wrote:
>On Tue, Jun 25, 2019 at 7:22 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Tue, Jun 25, 2019 at 04:53:40PM -0400, James Coleman wrote:
>> >
>> >Unrelated: if you or someone else you know that's more familiar with
>> >the parallel code, I'd be interested in their looking at the patch at
>> >some point, because I have a suspicion it might not be operating in
>...
>> So I've looked into that, and the reason seems fairly simple - when
>> generating the Gather Merge paths, we only look at paths that are in
>> partial_pathlist. See generate_gather_paths().
>>
>> And we only have sequential + index paths in partial_pathlist, not
>> incremental sort paths.
>>
>> IMHO we can do two things:
>>
>> 1) modify generate_gather_paths to also consider incremental sort for
>> each sorted path, similarly to what create_ordered_paths does
>>
>> 2) modify build_index_paths to also generate an incremental sort path
>> for each index path
>>
>> IMHO (1) is the right choice here, because it automatically does the
>> trick for all other types of ordered paths, not just index scans. So,
>> something like the attached patch, which gives me plans like this:
>...
>> But I'm not going to claim those are total fixes, it's the minimum I
>> needed to do to make this particular type of plan work.
>
>Thanks for looking into this!
>
>I intended to apply this to my most recent version of the patch (just
>sent a few minutes ago), but when I apply it I noticed that the
>partition_aggregate regression tests have several of these failures:
>
>ERROR:  could not find pathkey item to sort
>
>I haven't had time to look into the cause yet, so I decided to wait
>until the next patch revision.
>

FWIW I don't claim the patch I shared is complete and/or 100% correct.
It was more an illustration of the issue and the smallest patch to make
a particular query work. The test failures are a consequence of that.

I'll try looking into the failures over the next couple of days, but I
can't promise anything.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Thu, Jul 4, 2019 at 10:46 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Thu, Jul 04, 2019 at 09:29:49AM -0400, James Coleman wrote:
> >On Tue, Jun 25, 2019 at 7:22 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> On Tue, Jun 25, 2019 at 04:53:40PM -0400, James Coleman wrote:
> >> >
> >> >Unrelated: if you or someone else you know that's more familiar with
> >> >the parallel code, I'd be interested in their looking at the patch at
> >> >some point, because I have a suspicion it might not be operating in
> >...
> >> So I've looked into that, and the reason seems fairly simple - when
> >> generating the Gather Merge paths, we only look at paths that are in
> >> partial_pathlist. See generate_gather_paths().
> >>
> >> And we only have sequential + index paths in partial_pathlist, not
> >> incremental sort paths.
> >>
> >> IMHO we can do two things:
> >>
> >> 1) modify generate_gather_paths to also consider incremental sort for
> >> each sorted path, similarly to what create_ordered_paths does
> >>
> >> 2) modify build_index_paths to also generate an incremental sort path
> >> for each index path
> >>
> >> IMHO (1) is the right choice here, because it automatically does the
> >> trick for all other types of ordered paths, not just index scans. So,
> >> something like the attached patch, which gives me plans like this:
> >...
> >> But I'm not going to claim those are total fixes, it's the minimum I
> >> needed to do to make this particular type of plan work.
> >
> >Thanks for looking into this!
> >
> >I intended to apply this to my most recent version of the patch (just
> >sent a few minutes ago), but when I apply it I noticed that the
> >partition_aggregate regression tests have several of these failures:
> >
> >ERROR:  could not find pathkey item to sort
> >
> >I haven't had time to look into the cause yet, so I decided to wait
> >until the next patch revision.
> >
>
> FWIW I don't claim the patch I shared is complete and/or 100% correct.
> It was more an illustration of the issue and the smallest patch to make
> a particular query work. The test failures are a consequence of that.
>
> I'll try looking into the failures over the next couple of days, but I
> can't promise anything.

Yep, I understand; I just wanted to note that it was still an
outstanding item and give a quick update on why.

Anything you can look at is much appreciated.

James Coleman



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Thu, Jul 04, 2019 at 09:29:49AM -0400, James Coleman wrote:
>On Tue, Jun 25, 2019 at 7:22 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Tue, Jun 25, 2019 at 04:53:40PM -0400, James Coleman wrote:
>> >
>> >Unrelated: if you or someone else you know that's more familiar with
>> >the parallel code, I'd be interested in their looking at the patch at
>> >some point, because I have a suspicion it might not be operating in
>...
>> So I've looked into that, and the reason seems fairly simple - when
>> generating the Gather Merge paths, we only look at paths that are in
>> partial_pathlist. See generate_gather_paths().
>>
>> And we only have sequential + index paths in partial_pathlist, not
>> incremental sort paths.
>>
>> IMHO we can do two things:
>>
>> 1) modify generate_gather_paths to also consider incremental sort for
>> each sorted path, similarly to what create_ordered_paths does
>>
>> 2) modify build_index_paths to also generate an incremental sort path
>> for each index path
>>
>> IMHO (1) is the right choice here, because it automatically does the
>> trick for all other types of ordered paths, not just index scans. So,
>> something like the attached patch, which gives me plans like this:
>...
>> But I'm not going to claim those are total fixes, it's the minimum I
>> needed to do to make this particular type of plan work.
>
>Thanks for looking into this!
>
>I intended to apply this to my most recent version of the patch (just
>sent a few minutes ago), but when I apply it I noticed that the
>partition_aggregate regression tests have several of these failures:
>
>ERROR:  could not find pathkey item to sort
>
>I haven't had time to look into the cause yet, so I decided to wait
>until the next patch revision.
>

I wanted to investigate this today, but I can't reproduce it. How are
you building and running the regression tests?

Attached is a patch adding the incremental sort below gather merge, and
also tweaking the costing. But that's mostly for better planning
decisions; I don't get any pathkey errors even with the first patch.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sun, Jul 7, 2019 at 8:34 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Thu, Jul 04, 2019 at 09:29:49AM -0400, James Coleman wrote:
> >On Tue, Jun 25, 2019 at 7:22 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> On Tue, Jun 25, 2019 at 04:53:40PM -0400, James Coleman wrote:
> >> >
> >> >Unrelated: if you or someone else you know that's more familiar with
> >> >the parallel code, I'd be interested in their looking at the patch at
> >> >some point, because I have a suspicion it might not be operating in
> >...
> >> So I've looked into that, and the reason seems fairly simple - when
> >> generating the Gather Merge paths, we only look at paths that are in
> >> partial_pathlist. See generate_gather_paths().
> >>
> >> And we only have sequential + index paths in partial_pathlist, not
> >> incremental sort paths.
> >>
> >> IMHO we can do two things:
> >>
> >> 1) modify generate_gather_paths to also consider incremental sort for
> >> each sorted path, similarly to what create_ordered_paths does
> >>
> >> 2) modify build_index_paths to also generate an incremental sort path
> >> for each index path
> >>
> >> IMHO (1) is the right choice here, because it automatically does the
> >> trick for all other types of ordered paths, not just index scans. So,
> >> something like the attached patch, which gives me plans like this:
> >...
> >> But I'm not going to claim those are total fixes, it's the minimum I
> >> needed to do to make this particular type of plan work.
> >
> >Thanks for looking into this!
> >
> >I intended to apply this to my most recent version of the patch (just
> >sent a few minutes ago), but when I apply it I noticed that the
> >partition_aggregate regression tests have several of these failures:
> >
> >ERROR:  could not find pathkey item to sort
> >
> >I haven't had time to look into the cause yet, so I decided to wait
> >until the next patch revision.
> >
>
> I wanted to investigate this today, but I can't reprodure it. How are
> you building and running the regression tests?
>
> Attached is a patch adding the incremental sort below gather merge, and
> also tweaking the costing. But that's mostly for and better planning
> decisions, I don't get any pathkey errors even with the first patch.

On 12be7f7f997debe4e05e84b69c03ecf7051b1d79 (the last patch I sent,
which is based on top of 5683b34956b4e8da9dccadc2e3a53b86104ebb33), I
did this:

patch -p1 < ~/Downloads/parallel-incremental-sort.patch
<rebuild> (FWIW I configure with ./configure
--prefix=$HOME/postgresql-test --enable-cassert --enable-debug
--enable-depend CFLAGS="-ggdb -Og -g3 -fno-omit-frame-pointer
-DOPTIMIZER_DEBUG")
make check-world

And I get the attached regression failures.

James Coleman

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Sun, Jul 07, 2019 at 09:01:43AM -0400, James Coleman wrote:
>On Sun, Jul 7, 2019 at 8:34 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Thu, Jul 04, 2019 at 09:29:49AM -0400, James Coleman wrote:
>> >On Tue, Jun 25, 2019 at 7:22 PM Tomas Vondra
>> ><tomas.vondra@2ndquadrant.com> wrote:
>> >>
>> >> On Tue, Jun 25, 2019 at 04:53:40PM -0400, James Coleman wrote:
>> >> >
>> >> >Unrelated: if you or someone else you know that's more familiar with
>> >> >the parallel code, I'd be interested in their looking at the patch at
>> >> >some point, because I have a suspicion it might not be operating in
>> >...
>> >> So I've looked into that, and the reason seems fairly simple - when
>> >> generating the Gather Merge paths, we only look at paths that are in
>> >> partial_pathlist. See generate_gather_paths().
>> >>
>> >> And we only have sequential + index paths in partial_pathlist, not
>> >> incremental sort paths.
>> >>
>> >> IMHO we can do two things:
>> >>
>> >> 1) modify generate_gather_paths to also consider incremental sort for
>> >> each sorted path, similarly to what create_ordered_paths does
>> >>
>> >> 2) modify build_index_paths to also generate an incremental sort path
>> >> for each index path
>> >>
>> >> IMHO (1) is the right choice here, because it automatically does the
>> >> trick for all other types of ordered paths, not just index scans. So,
>> >> something like the attached patch, which gives me plans like this:
>> >...
>> >> But I'm not going to claim those are total fixes, it's the minimum I
>> >> needed to do to make this particular type of plan work.
>> >
>> >Thanks for looking into this!
>> >
>> >I intended to apply this to my most recent version of the patch (just
>> >sent a few minutes ago), but when I apply it I noticed that the
>> >partition_aggregate regression tests have several of these failures:
>> >
>> >ERROR:  could not find pathkey item to sort
>> >
>> >I haven't had time to look into the cause yet, so I decided to wait
>> >until the next patch revision.
>> >
>>
>> I wanted to investigate this today, but I can't reprodure it. How are
>> you building and running the regression tests?
>>
>> Attached is a patch adding the incremental sort below gather merge, and
>> also tweaking the costing. But that's mostly for and better planning
>> decisions, I don't get any pathkey errors even with the first patch.
>
>On 12be7f7f997debe4e05e84b69c03ecf7051b1d79 (the last patch I sent,
>which is based on top of 5683b34956b4e8da9dccadc2e3a53b86104ebb33), I
>did this:
>
>patch -p1 < ~/Downloads/parallel-incremental-sort.patch
><rebuild> (FWIW I configure with ./configure
>--prefix=$HOME/postgresql-test --enable-cassert --enable-debug
>--enable-depend CFLAGS="-ggdb -Og -g3 -fno-omit-frame-pointer
>-DOPTIMIZER_DEBUG")
>make check-world
>
>And I get the attached regression failures.
>

OK, thanks. Apparently it's the costing changes that make it go away;
if I try just the patch that tweaks generate_gather_paths() I see the
same failures. The failure happens during plan construction, so I think
the costing changes simply mean the path with incremental sort ends up
not being the cheapest one (for the problematic queries), but that's
just pure luck - it's definitely an issue that needs fixing.

That error message is triggered in two places in createplan.c, and after
changing them to Assert(false) I get a core dump with this backtrace:

#0  0x0000702b3328857f in raise () from /lib64/libc.so.6
#1  0x0000702b33272895 in abort () from /lib64/libc.so.6
#2  0x0000000000a59a9d in ExceptionalCondition (conditionName=0xc52e84 "!(0)", errorType=0xc51f96 "FailedAssertion", fileName=0xc51fe6 "createplan.c", lineNumber=5937) at assert.c:54
#3  0x00000000007d4ab5 in prepare_sort_from_pathkeys (lefttree=0x2bbbce0, pathkeys=0x2b7a130, relids=0x0, reqColIdx=0x0, adjust_tlist_in_place=false, p_numsortkeys=0x7ffe1abcfd6c, p_sortColIdx=0x7ffe1abcfd60, p_sortOperators=0x7ffe1abcfd58, p_collations=0x7ffe1abcfd50, p_nullsFirst=0x7ffe1abcfd48) at createplan.c:5937
#4  0x00000000007d4e7f in make_incrementalsort_from_pathkeys (lefttree=0x2bbbce0, pathkeys=0x2b7a130, relids=0x0, presortedCols=1) at createplan.c:6101
#5  0x00000000007cdd3f in create_incrementalsort_plan (root=0x2b787c0, best_path=0x2bb92b0, flags=1) at createplan.c:2019
#6  0x00000000007cb7ad in create_plan_recurse (root=0x2b787c0, best_path=0x2bb92b0, flags=1) at createplan.c:469
#7  0x00000000007cd778 in create_gather_merge_plan (root=0x2b787c0, best_path=0x2bb94a0) at createplan.c:1764
#8  0x00000000007cb8fb in create_plan_recurse (root=0x2b787c0, best_path=0x2bb94a0, flags=4) at createplan.c:516
#9  0x00000000007cdf10 in create_agg_plan (root=0x2b787c0, best_path=0x2bb9b28) at createplan.c:2115
#10 0x00000000007cb834 in create_plan_recurse (root=0x2b787c0, best_path=0x2bb9b28, flags=3) at createplan.c:484
#11 0x00000000007cdc16 in create_sort_plan (root=0x2b787c0, best_path=0x2bba1e8, flags=1) at createplan.c:1986
#12 0x00000000007cb78e in create_plan_recurse (root=0x2b787c0, best_path=0x2bba1e8, flags=1) at createplan.c:464
#13 0x00000000007cb4ae in create_plan (root=0x2b787c0, best_path=0x2bba1e8) at createplan.c:330
#14 0x00000000007db63c in standard_planner (parse=0x2bf5bc8, cursorOptions=256, boundParams=0x0) at planner.c:413
#15 0x00000000007db3b4 in planner (parse=0x2bf5bc8, cursorOptions=256, boundParams=0x0) at planner.c:275
#16 0x00000000008e404f in pg_plan_query (querytree=0x2bf5bc8, cursorOptions=256, boundParams=0x0) at postgres.c:878
#17 0x0000000000657afa in ExplainOneQuery (query=0x2bf5bc8, cursorOptions=256, into=0x0, es=0x2bf5fc0, queryString=0x2a74be8 "EXPLAIN (COSTS OFF)\nSELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3;", params=0x0, queryEnv=0x0) at explain.c:371
#18 0x0000000000657804 in ExplainQuery (pstate=0x2a95ff8, stmt=0x2a76628, queryString=0x2a74be8 "EXPLAIN (COSTS OFF)\nSELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3;", params=0x0, queryEnv=0x0, dest=0x2a95f60) at explain.c:259
#19 0x00000000008ec75e in standard_ProcessUtility (pstmt=0x2b8a050, queryString=0x2a74be8 "EXPLAIN (COSTS OFF)\nSELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x2a95f60, completionTag=0x7ffe1abd0430 "") at utility.c:675
#20 0x00000000008ebfbe in ProcessUtility (pstmt=0x2b8a050, queryString=0x2a74be8 "EXPLAIN (COSTS OFF)\nSELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x2a95f60, completionTag=0x7ffe1abd0430 "") at utility.c:360
#21 0x00000000008eafab in PortalRunUtility (portal=0x2adc558, pstmt=0x2b8a050, isTopLevel=true, setHoldSnapshot=true, dest=0x2a95f60, completionTag=0x7ffe1abd0430 "") at pquery.c:1175
#22 0x00000000008eacb2 in FillPortalStore (portal=0x2adc558, isTopLevel=true) at pquery.c:1035
#23 0x00000000008ea60e in PortalRun (portal=0x2adc558, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x2b8a148, altdest=0x2b8a148, completionTag=0x7ffe1abd0620 "") at pquery.c:765
#24 0x00000000008e45be in exec_simple_query (query_string=0x2a74be8 "EXPLAIN (COSTS OFF)\nSELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3;") at postgres.c:1215
#25 0x00000000008e8912 in PostgresMain (argc=1, argv=0x2aa07c0, dbname=0x2aa0530 "regression", username=0x2a707e8 "user") at postgres.c:4249
#26 0x000000000083ed83 in BackendRun (port=0x2a99c50) at postmaster.c:4431
#27 0x000000000083e551 in BackendStartup (port=0x2a99c50) at postmaster.c:4122
#28 0x000000000083a977 in ServerLoop () at postmaster.c:1704
#29 0x000000000083a223 in PostmasterMain (argc=8, argv=0x2a6e670) at postmaster.c:1377
#30 0x000000000075a823 in main (argc=8, argv=0x2a6e670) at main.c:228

I think it's pretty obvious what's happening - I'll resist the urge
to just write "The proof is obvious and is left to the reader as a
homework." because I always despised that during math lectures ;-)

We're running query like this:

  SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3

but we're trying to add the incremental sort *before* the aggregation,
because the optimizer also considers a group aggregate with sorted
input. And (a) is a prefix of (a, sum(b), count(*)), so we think we
can actually do this, but clearly that's nonsense, because we can't
possibly know the aggregates yet. Hence the error.

If this is the actual issue, we need to ensure we actually can evaluate
all the pathkeys. I don't know how to do that yet. I thought that maybe
we should modify pathkeys_common_contained_in() to set presorted_keys to
0 in this case.

But then I started wondering why we don't see this issue even for
regular (non-incremental-sort) paths built in create_ordered_paths().
How come we don't see these failures there? I've modified costing to
make all incremental sort paths very cheap, and still nothing.

So presumably there's a check elsewhere (either implicit or explicit),
because create_ordered_paths() uses pathkeys_common_contained_in() and
does not have the same issue.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> We're running query like this:
>
>   SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3
>
> but we're trying to add the incremental sort *before* the aggregation,
> because the optimizer also considers group aggregate with a sorted
> input. And (a) is a prefix of (a,sum(b),count(b)) so we think we
> actually can do this, but clearly that's nonsense, because we can't
> possibly know the aggregates yet. Hence the error.
>
> If this is the actual issue, we need to ensure we actually can evaluate
> all the pathkeys. I don't know how to do that yet. I thought that maybe
> we should modify pathkeys_common_contained_in() to set presorted_keys to
> 0 in this case.
>
> But then I started wondering why we don't see this issue even for
> regular (non-incremental-sort) paths built in create_ordered_paths().
> How come we don't see these failures there? I've modified costing to
> make all incremental sort paths very cheap, and still nothing.

I assume you mean you modified costing to make regular sort paths very cheap?

> So presumably there's a check elsewhere (either implicit or explicit),
> because create_ordered_paths() uses pathkeys_common_contained_in() and
> does not have the same issue.

Given this comment in create_ordered_paths():

  generate_gather_paths() will have already generated a simple Gather
  path for the best parallel path, if any, and the loop above will have
  considered sorting it.  Similarly, generate_gather_paths() will also
  have generated order-preserving Gather Merge plans which can be used
  without sorting if they happen to match the sort_pathkeys, and the loop
  above will have handled those as well.  However, there's one more
  possibility: it may make sense to sort the cheapest partial path
  according to the required output order and then use Gather Merge.

my understanding is that generate_gather_paths() only considers paths
that already happen to be sorted (not explicit sorts), so I'm
wondering if it would make more sense for the incremental sort path
creation for this case to live alongside the explicit ordered path
creation in create_ordered_paths() rather than in
generate_gather_paths().

If I'm understanding what you're saying properly, I think you'd have
expected create_ordered_paths() to be roughly similar in what it
considers as partial paths and so to have the same problem. I haven't
yet read enough of the code to understand whether my proposed change
actually has any impact on the issue we're discussing, but it seems to
me that it at least fits better with what the comments imply.

I'll try to look at it a bit more later also, but at the moment other
work calls.

James Coleman



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Mon, Jul 08, 2019 at 09:22:39AM -0400, James Coleman wrote:
>On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> We're running query like this:
>>
>>   SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3
>>
>> but we're trying to add the incremental sort *before* the aggregation,
>> because the optimizer also considers group aggregate with a sorted
>> input. And (a) is a prefix of (a,sum(b),count(b)) so we think we
>> actually can do this, but clearly that's nonsense, because we can't
>> possibly know the aggregates yet. Hence the error.
>>
>> If this is the actual issue, we need to ensure we actually can evaluate
>> all the pathkeys. I don't know how to do that yet. I thought that maybe
>> we should modify pathkeys_common_contained_in() to set presorted_keys to
>> 0 in this case.
>>
>> But then I started wondering why we don't see this issue even for
>> regular (non-incremental-sort) paths built in create_ordered_paths().
>> How come we don't see these failures there? I've modified costing to
>> make all incremental sort paths very cheap, and still nothing.
>
>I assume you mean you modified costing to make regular sort paths very cheap?
>

No, I mean costing of incremental sort paths, so that they end up being
the cheapest ones. If some other path is cheaper, we won't see the error
because it only happens when building plan from the cheapest path.

>> So presumably there's a check elsewhere (either implicit or explicit),
>> because create_ordered_paths() uses pathkeys_common_contained_in() and
>> does not have the same issue.
>
>Given this comment in create_ordered_paths():
>
>  generate_gather_paths() will have already generated a simple Gather
>  path for the best parallel path, if any, and the loop above will have
>  considered sorting it.  Similarly, generate_gather_paths() will also
>  have generated order-preserving Gather Merge plans which can be used
>  without sorting if they happen to match the sort_pathkeys, and the loop
>  above will have handled those as well.  However, there's one more
>  possibility: it may make sense to sort the cheapest partial path
>  according to the required output order and then use Gather Merge.
>
>my understanding is that generate_gather_paths() only considers paths
>that already happen to be sorted (not explicit sorts), so I'm
>wondering if it would make more sense for the incremental sort path
>creation for this case to live alongside the explicit ordered path
>creation in create_ordered_paths() rather than in
>generate_gather_paths().
>

How would that solve the issue? Also, we're building a gather path, so
I think generate_gather_paths() is the right place where to do it. And
we're not changing the semantics of generate_gather_paths() - the result
path should be sorted "correctly" with respect to sort_pathkeys.

>If I'm understanding what you're saying properly, I think you'd
>expected create_ordered_paths() to be roughly similar in what it
>considers as partial paths and so have the same problem, and I haven't
>yet read enough of the code to understand if my proposed change
>actually has any impact on the issue we're discussing, but it seems to
>me that it at least fits more with what the comments imply.
>

Roughly. AFAICS the problem is that we're trying to use pathkeys that are
only valid for the (aggregated) upper relation, not before it.

I have to admit I've never quite understood this pathkeys business. I
mean, I know what pathkeys are for, but clearly it's not valid to look at
root->sort_pathkeys at this place. Maybe there's some other field we
should be looking at instead, or maybe it's ensured by grouping_planner in
some implicit way ...

I.e. the question is - when you do a query like

    SELECT a, count(*) FROM t GROUP BY a ORDER BY a, count(*);

and cost the incremental sort extremely cheap, how come we don't end up
with the same issue?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Mon, Jul 8, 2019 at 9:59 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Mon, Jul 08, 2019 at 09:22:39AM -0400, James Coleman wrote:
> >On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >> We're running query like this:
> >>
> >>   SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3
> >>
> >> but we're trying to add the incremental sort *before* the aggregation,
> >> because the optimizer also considers group aggregate with a sorted
> >> input. And (a) is a prefix of (a,sum(b),count(b)) so we think we
> >> actually can do this, but clearly that's nonsense, because we can't
> >> possibly know the aggregates yet. Hence the error.
> >>
> >> If this is the actual issue, we need to ensure we actually can evaluate
> >> all the pathkeys. I don't know how to do that yet. I thought that maybe
> >> we should modify pathkeys_common_contained_in() to set presorted_keys to
> >> 0 in this case.
> >>
> >> But then I started wondering why we don't see this issue even for
> >> regular (non-incremental-sort) paths built in create_ordered_paths().
> >> How come we don't see these failures there? I've modified costing to
> >> make all incremental sort paths very cheap, and still nothing.
> >
> >I assume you mean you modified costing to make regular sort paths very cheap?
> >
>
> No, I mean costing of incremental sort paths, so that they end up being
> the cheapest ones. If some other path is cheaper, we won't see the error
> because it only happens when building plan from the cheapest path.

Ah, I misunderstood; I thought you were trying to figure out a way to cause
the same problem with a regular sort.

> >> So presumably there's a check elsewhere (either implicit or explicit),
> >> because create_ordered_paths() uses pathkeys_common_contained_in() and
> >> does not have the same issue.
> >
> >Given this comment in create_ordered_paths():
> >
> >  generate_gather_paths() will have already generated a simple Gather
> >  path for the best parallel path, if any, and the loop above will have
> >  considered sorting it.  Similarly, generate_gather_paths() will also
> >  have generated order-preserving Gather Merge plans which can be used
> >  without sorting if they happen to match the sort_pathkeys, and the loop
> >  above will have handled those as well.  However, there's one more
> >  possibility: it may make sense to sort the cheapest partial path
> >  according to the required output order and then use Gather Merge.
> >
> >my understanding is that generate_gather_paths() only considers paths
> >that already happen to be sorted (not explicit sorts), so I'm
> >wondering if it would make more sense for the incremental sort path
> >creation for this case to live alongside the explicit ordered path
> >creation in create_ordered_paths() rather than in
> >generate_gather_paths().
> >
>
> How would that solve the issue? Also, we're building a gather path, so
> I think generate_gather_paths() is the right place where to do it. And
> we're not changing the semantics of generate_gather_paths() - the result
> path should be sorted "correctly" with respect to sort_pathkeys.

Does that imply that the explicit sort in create_ordered_paths() is in
the wrong spot?

Or, to put it another way, do you think that both kinds of sorts
should be added in the same place? It seems confusing to me that
they'd be split between the two methods (unless I'm completely
misunderstanding how the two work).

I'm not saying it would solve the issue here; just noting that the
division of labor seemed odd to me at first read through.

James Coleman



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Mon, Jul 08, 2019 at 10:32:18AM -0400, James Coleman wrote:
>On Mon, Jul 8, 2019 at 9:59 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Mon, Jul 08, 2019 at 09:22:39AM -0400, James Coleman wrote:
>> >On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra
>> ><tomas.vondra@2ndquadrant.com> wrote:
>> >> We're running query like this:
>> >>
>> >>   SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3
>> >>
>> >> but we're trying to add the incremental sort *before* the aggregation,
>> >> because the optimizer also considers group aggregate with a sorted
>> >> input. And (a) is a prefix of (a,sum(b),count(b)) so we think we
>> >> actually can do this, but clearly that's nonsense, because we can't
>> >> possibly know the aggregates yet. Hence the error.
>> >>
>> >> If this is the actual issue, we need to ensure we actually can evaluate
>> >> all the pathkeys. I don't know how to do that yet. I thought that maybe
>> >> we should modify pathkeys_common_contained_in() to set presorted_keys to
>> >> 0 in this case.
>> >>
>> >> But then I started wondering why we don't see this issue even for
>> >> regular (non-incremental-sort) paths built in create_ordered_paths().
>> >> How come we don't see these failures there? I've modified costing to
>> >> make all incremental sort paths very cheap, and still nothing.
>> >
>> >I assume you mean you modified costing to make regular sort paths very cheap?
>> >
>>
>> No, I mean costing of incremental sort paths, so that they end up being
>> the cheapest ones. If some other path is cheaper, we won't see the error
>> because it only happens when building plan from the cheapest path.
>
>Ah, I misunderstood as you trying to figure out a way to try to cause
>the same problem with a regular sort.
>
>> >> So presumably there's a check elsewhere (either implicit or explicit),
>> >> because create_ordered_paths() uses pathkeys_common_contained_in() and
>> >> does not have the same issue.
>> >
>> >Given this comment in create_ordered_paths():
>> >
>> >  generate_gather_paths() will have already generated a simple Gather
>> >  path for the best parallel path, if any, and the loop above will have
>> >  considered sorting it.  Similarly, generate_gather_paths() will also
>> >  have generated order-preserving Gather Merge plans which can be used
>> >  without sorting if they happen to match the sort_pathkeys, and the loop
>> >  above will have handled those as well.  However, there's one more
>> >  possibility: it may make sense to sort the cheapest partial path
>> >  according to the required output order and then use Gather Merge.
>> >
>> >my understanding is that generate_gather_paths() only considers paths
>> >that already happen to be sorted (not explicit sorts), so I'm
>> >wondering if it would make more sense for the incremental sort path
>> >creation for this case to live alongside the explicit ordered path
>> >creation in create_ordered_paths() rather than in
>> >generate_gather_paths().
>> >
>>
>> How would that solve the issue? Also, we're building a gather path, so
>> I think generate_gather_paths() is the right place where to do it. And
>> we're not changing the semantics of generate_gather_paths() - the result
>> path should be sorted "correctly" with respect to sort_pathkeys.
>
>Does that imply what the explicit sort in create_ordered_paths() is in
>the wrong spot?
>

I think those are essentially the right places where to do this sort of
stuff. Maybe there's a better place, but I don't think those places are
somehow wrong.

>Or, to put it another way, do you think that both kinds of sorts
>should be added in the same place? It seems confusing to me that
>they'd be split between the two methods (unless I'm completely
>misunderstanding how the two work).
>

The paths built in those two places were very different in one aspect:

1) generate_gather_paths only ever looked at pathkeys for the subpath, it
never even looked at ordering expected by paths above it (or the query as
a whole). Plain Gather ignores pathkeys entirely, Gather Merge only aims
to maintain ordering of the different subpaths.

2) create_ordered_paths is meant to enforce ordering needed by upper
parts of the plan - either by using a properly sorted path, or adding an
explicit sort.


We want to extend (1) to also look at ordering expected by the upper parts
of the plan, and consider incremental sort if applicable. (2) already does
that, and it already has the correct pathkeys to enforce.

But looking at root->sort_pathkeys in (1) seems to be the wrong thing :-(

The thing is, we don't have just sort_pathkeys, there's distinct_pathkeys
and group_pathkeys too, so maybe we should be looking at those too?
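
For reference, those are all separate fields on PlannerInfo, so the set of
candidate orderings to look at might be something like this (purely
illustrative, not patch code):

  /* Purely illustrative: candidate "interesting" orderings, using
   * fields that already exist on PlannerInfo. */
  List   *candidates = NIL;

  if (root->group_pathkeys)
      candidates = lappend(candidates, root->group_pathkeys);
  if (root->distinct_pathkeys)
      candidates = lappend(candidates, root->distinct_pathkeys);
  if (root->sort_pathkeys)
      candidates = lappend(candidates, root->sort_pathkeys);
  /* root->query_pathkeys is already set to (roughly) whichever of
   * these applies first for the current query level */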

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Mon, Jul 8, 2019 at 10:58 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Mon, Jul 08, 2019 at 10:32:18AM -0400, James Coleman wrote:
> >On Mon, Jul 8, 2019 at 9:59 AM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> On Mon, Jul 08, 2019 at 09:22:39AM -0400, James Coleman wrote:
> >> >On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra
> >> ><tomas.vondra@2ndquadrant.com> wrote:
> >> >> We're running query like this:
> >> >>
> >> >>   SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3
> >> >>
> >> >> but we're trying to add the incremental sort *before* the aggregation,
> >> >> because the optimizer also considers group aggregate with a sorted
> >> >> input. And (a) is a prefix of (a,sum(b),count(b)) so we think we
> >> >> actually can do this, but clearly that's nonsense, because we can't
> >> >> possibly know the aggregates yet. Hence the error.
> >> >>
> >> >> If this is the actual issue, we need to ensure we actually can evaluate
> >> >> all the pathkeys. I don't know how to do that yet. I thought that maybe
> >> >> we should modify pathkeys_common_contained_in() to set presorted_keys to
> >> >> 0 in this case.
> >> >>
> >> >> But then I started wondering why we don't see this issue even for
> >> >> regular (non-incremental-sort) paths built in create_ordered_paths().
> >> >> How come we don't see these failures there? I've modified costing to
> >> >> make all incremental sort paths very cheap, and still nothing.
> >> >
> >> >I assume you mean you modified costing to make regular sort paths very cheap?
> >> >
> >>
> >> No, I mean costing of incremental sort paths, so that they end up being
> >> the cheapest ones. If some other path is cheaper, we won't see the error
> >> because it only happens when building plan from the cheapest path.
> >
> >Ah, I misunderstood as you trying to figure out a way to try to cause
> >the same problem with a regular sort.
> >
> >> >> So presumably there's a check elsewhere (either implicit or explicit),
> >> >> because create_ordered_paths() uses pathkeys_common_contained_in() and
> >> >> does not have the same issue.
> >> >
> >> >Given this comment in create_ordered_paths():
> >> >
> >> >  generate_gather_paths() will have already generated a simple Gather
> >> >  path for the best parallel path, if any, and the loop above will have
> >> >  considered sorting it.  Similarly, generate_gather_paths() will also
> >> >  have generated order-preserving Gather Merge plans which can be used
> >> >  without sorting if they happen to match the sort_pathkeys, and the loop
> >> >  above will have handled those as well.  However, there's one more
> >> >  possibility: it may make sense to sort the cheapest partial path
> >> >  according to the required output order and then use Gather Merge.
> >> >
> >> >my understanding is that generate_gather_paths() only considers paths
> >> >that already happen to be sorted (not explicit sorts), so I'm
> >> >wondering if it would make more sense for the incremental sort path
> >> >creation for this case to live alongside the explicit ordered path
> >> >creation in create_ordered_paths() rather than in
> >> >generate_gather_paths().
> >> >
> >>
> >> How would that solve the issue? Also, we're building a gather path, so
> >> I think generate_gather_paths() is the right place where to do it. And
> >> we're not changing the semantics of generate_gather_paths() - the result
> >> path should be sorted "correctly" with respect to sort_pathkeys.
> >
> >Does that imply what the explicit sort in create_ordered_paths() is in
> >the wrong spot?
> >
>
> I think those are essentially the right places where to do this sort of
> stuff. Maybe there's a better place, but I don't think those places are
> somehow wrong.
>
> >Or, to put it another way, do you think that both kinds of sorts
> >should be added in the same place? It seems confusing to me that
> >they'd be split between the two methods (unless I'm completely
> >misunderstanding how the two work).
> >
>
> The paths built in those two places were very different in one aspect:
>
> 1) generate_gather_paths only ever looked at pathkeys for the subpath, it
> never even looked at ordering expected by paths above it (or the query as
> a whole). Plain Gather ignores pathkeys entirely, Gather Merge only aims
> to maintain ordering of the different subpaths.
>
> 2) create_ordered_paths is meant to enforce ordering needed by upper
> parts of the plan - either by using a properly sorted path, or adding an
> explicit sort.
>
>
> We want to extend (1) to also look at ordering expected by the upper parts
> of the plan, and consider incremental sort if applicable. (2) already does
> that, and it already has the correct pathkeys to enforce.

I guess I'm still not following. If (2) is responsible (currently) for
adding an explicit sort, why wouldn't adding an incremental sort be an
example of that responsibility? The subpath that either a Sort or
IncrementalSort is being added on top of (to then feed into the
GatherMerge) is the same in both cases right?

Unless you're saying that the difference is that since we have a
partial ordering already for incremental sort then incremental sort
falls into the category of "maintaining" existing ordering of the
subpath?

> But looking at root->sort_pathkeys in (1) seems to be the wrong thing :-(
>
> The thing is, we don't have just sort_pathkeys, there's distinct_pathkeys
> and group_pathkeys too, so maybe we should be looking at those too?

I don't know enough yet to answer, but I'd like to look at (in the
debugger) the subpaths considered in each function to try to get a
better understanding of why we don't try to explicitly sort the aggs
(which we know we can't sort yet) but do for incremental sort. I
assume that means a subpath has to be present in one but not the other
since they both use the same pathkey checking function.

James Coleman



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Alexander Korotkov
Date:
On Mon, Jun 3, 2019 at 12:18 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> For a moment I thought we could/should look at the histogram, becase that
> could tell us if there are groups "before" the first MCV one, but I don't
> think we should do that, for two reasons. Firstly, rare values may not get
> to the histogram anyway, which makes this rather unreliable and might
> introduce sudden plan changes, because the cost would vary wildly
> depending on whether we happened to sample the rare row or not. And
> secondly, the rare row may be easily filtered out by a WHERE condition or
> something, at which point we'll have to deal with the large group anyway.

If the first MCV is in the middle of the first histogram bin, then it's
reasonable to think that it would fit into the first group.  But if the
first MCV is in the middle of the histogram, such an assumption would be
ridiculous.  Also, I'd like to note that with our
default_statistics_target == 100, non-MCV values are not so rare.  So,
I'm +1 for taking the histogram into account.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Alexander Korotkov
Date:
On Thu, Jul 4, 2019 at 4:25 PM James Coleman <jtc331@gmail.com> wrote:
> Process questions:
> - Do I need to explicitly move the patch somehow to the next CF?

We didn't manage to register it in the current (July) commitfest.  So,
please register it in the next (September) commitfest.

> - Since I've basically taken over patch ownership, should I move my
> name from reviewer to author in the CF app? And can there be two
> authors listed there?

Sure, you're a co-author of this patch.  Two or more authors can be
listed in the CF app; you can find a lot of examples on the list.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Mon, Jul 08, 2019 at 12:07:06PM -0400, James Coleman wrote:
>On Mon, Jul 8, 2019 at 10:58 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Mon, Jul 08, 2019 at 10:32:18AM -0400, James Coleman wrote:
>> >On Mon, Jul 8, 2019 at 9:59 AM Tomas Vondra
>> ><tomas.vondra@2ndquadrant.com> wrote:
>> >>
>> >> On Mon, Jul 08, 2019 at 09:22:39AM -0400, James Coleman wrote:
>> >> >On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra
>> >> ><tomas.vondra@2ndquadrant.com> wrote:
>> >> >> We're running query like this:
>> >> >>
>> >> >>   SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3
>> >> >>
>> >> >> but we're trying to add the incremental sort *before* the aggregation,
>> >> >> because the optimizer also considers group aggregate with a sorted
>> >> >> input. And (a) is a prefix of (a,sum(b),count(b)) so we think we
>> >> >> actually can do this, but clearly that's nonsense, because we can't
>> >> >> possibly know the aggregates yet. Hence the error.
>> >> >>
>> >> >> If this is the actual issue, we need to ensure we actually can evaluate
>> >> >> all the pathkeys. I don't know how to do that yet. I thought that maybe
>> >> >> we should modify pathkeys_common_contained_in() to set presorted_keys to
>> >> >> 0 in this case.
>> >> >>
>> >> >> But then I started wondering why we don't see this issue even for
>> >> >> regular (non-incremental-sort) paths built in create_ordered_paths().
>> >> >> How come we don't see these failures there? I've modified costing to
>> >> >> make all incremental sort paths very cheap, and still nothing.
>> >> >
>> >> >I assume you mean you modified costing to make regular sort paths very cheap?
>> >> >
>> >>
>> >> No, I mean costing of incremental sort paths, so that they end up being
>> >> the cheapest ones. If some other path is cheaper, we won't see the error
>> >> because it only happens when building plan from the cheapest path.
>> >
>> >Ah, I misunderstood as you trying to figure out a way to try to cause
>> >the same problem with a regular sort.
>> >
>> >> >> So presumably there's a check elsewhere (either implicit or explicit),
>> >> >> because create_ordered_paths() uses pathkeys_common_contained_in() and
>> >> >> does not have the same issue.
>> >> >
>> >> >Given this comment in create_ordered_paths():
>> >> >
>> >> >  generate_gather_paths() will have already generated a simple Gather
>> >> >  path for the best parallel path, if any, and the loop above will have
>> >> >  considered sorting it.  Similarly, generate_gather_paths() will also
>> >> >  have generated order-preserving Gather Merge plans which can be used
>> >> >  without sorting if they happen to match the sort_pathkeys, and the loop
>> >> >  above will have handled those as well.  However, there's one more
>> >> >  possibility: it may make sense to sort the cheapest partial path
>> >> >  according to the required output order and then use Gather Merge.
>> >> >
>> >> >my understanding is that generate_gather_paths() only considers paths
>> >> >that already happen to be sorted (not explicit sorts), so I'm
>> >> >wondering if it would make more sense for the incremental sort path
>> >> >creation for this case to live alongside the explicit ordered path
>> >> >creation in create_ordered_paths() rather than in
>> >> >generate_gather_paths().
>> >> >
>> >>
>> >> How would that solve the issue? Also, we're building a gather path, so
>> >> I think generate_gather_paths() is the right place where to do it. And
>> >> we're not changing the semantics of generate_gather_paths() - the result
>> >> path should be sorted "correctly" with respect to sort_pathkeys.
>> >
>> >Does that imply what the explicit sort in create_ordered_paths() is in
>> >the wrong spot?
>> >
>>
>> I think those are essentially the right places where to do this sort of
>> stuff. Maybe there's a better place, but I don't think those places are
>> somehow wrong.
>>
>> >Or, to put it another way, do you think that both kinds of sorts
>> >should be added in the same place? It seems confusing to me that
>> >they'd be split between the two methods (unless I'm completely
>> >misunderstanding how the two work).
>> >
>>
>> The paths built in those two places were very different in one aspect:
>>
>> 1) generate_gather_paths only ever looked at pathkeys for the subpath, it
>> never even looked at ordering expected by paths above it (or the query as
>> a whole). Plain Gather ignores pathkeys entirely, Gather Merge only aims
>> to maintain ordering of the different subpaths.
>>
>> 2) create_ordered_paths is meant to enforce ordering needed by upper
>> parts of the plan - either by using a properly sorted path, or adding an
>> explicit sort.
>>
>>
>> We want to extend (1) to also look at ordering expected by the upper parts
>> of the plan, and consider incremental sort if applicable. (2) already does
>> that, and it already has the correct pathkeys to enforce.
>
>I guess I'm still not following. If (2) is responsible (currently) for
>adding an explicit sort, why wouldn't adding an incremental sort be an
>example of that responsibility? The subpath that either a Sort or
>IncrementalSort is being added on top of (to then feed into the
>GatherMerge) is the same in both cases right?
>
>Unless you're saying that the difference is that since we have a
>partial ordering already for incremental sort then incremental sort
>falls into the category of "maintaining" existing ordering of the
>subpath?
>

Oh, I think I understand what you're saying. Essentially, we should not
be making generate_gather_paths responsible for adding the incremental
sort. Instead, we should be looking at places that are adding an explicit
sort (using create_sort_path) and also consider adding an incremental
sort there.

I definitely agree with the second half - we should look at all places
that create explicit sorts and make them also consider incremental
sorts. That makes sense.

But I'm not sure it'll address all cases - the problem is that those
places add the explicit sort because they need sorted input. Gather
Merge does not do that; it only preserves the existing ordering of paths.

So it's possible the path does not have an explicit sort on top, and
gather merge will not know to add it. And once we have the gather merge
in place, we can't push a sort "under" it.

In fact, we already have code dealing with this "issue" for a special
case - see gather_grouping_paths(). It generates plain gather merge
paths, but then also considers building one with an explicit sort. But it
only does that for grouping paths (when it's clear we need to be looking
at grouping_pathkeys), and there are other generate_gather_paths() callers
that don't have similar treatment.
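
For readers following along, the pattern in gather_grouping_paths() is
roughly this (simplified sketch; cost checks, the already-sorted shortcut
and row estimates omitted):

  /* Simplified sketch of the gather_grouping_paths() pattern: sort the
   * cheapest partial path explicitly to match group_pathkeys, then put
   * a Gather Merge on top of it. */
  Path   *cheapest_partial = (Path *) linitial(rel->partial_pathlist);
  Path   *sorted;

  sorted = (Path *) create_sort_path(root, rel, cheapest_partial,
                                     root->group_pathkeys, -1.0);

  add_path(rel, (Path *)
           create_gather_merge_path(root, rel, sorted, rel->reltarget,
                                    root->group_pathkeys, NULL, NULL));

The point being that the explicit sort is created by the caller, not by
generate_gather_paths() itself.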


>> But looking at root->sort_pathkeys in (1) seems to be the wrong thing :-(
>>
>> The thing is, we don't have just sort_pathkeys, there's distinct_pathkeys
>> and group_pathkeys too, so maybe we should be looking at those too?
>
>I don't know enough yet to answer, but I'd like to look at (in the
>debugger) the subpaths considered in each function to try to get a
>better understanding of why we don't try to explicitly sort the aggs
>(which we know we can't sort yet) but do for incremental sort. I
>assume that means a subpath has to be present in one but not the other
>since they both use the same pathkey checking function.
>

I've been wondering if we have some other code that needs to consider
multiple "candidate" interesting pathkeys (instead of just the one list
that's interesting in that place). Because then we could look at that
code and use it here ...

And guess what - postgres_fdw needs to do pretty much exactly that when
building paths for remote relations. AFAIK we can't easily request all
plans from the remote node and then look at their pathkeys (like we'd do
with a local node), so instead we deduce "interesting pathkeys" and then
request the best plans for those. And deducing "interesting" pathkeys is
pretty much what get_useful_pathkeys_for_relation() is about.

So I've copied this function (and two more, called from it), whacked it
a bit until it worked (a Shakespeare-writing chimp comes to mind), and
voila, it seems to be working. The errors you reported are gone, and the
plans seem to be reasonable.
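
To give an idea of what that borrowed logic boils down to, here is a
heavily simplified sketch (not the actual patch code; the function name is
made up for the sketch, and the real helper returns a list of such pathkey
lists and does a few more checks): keep only the leading pathkeys the
relation can actually evaluate, which is exactly what rules out sorting on
aggregates that don't exist yet.

  /* Heavily simplified sketch of the logic borrowed from postgres_fdw's
   * get_useful_pathkeys_for_relation(): truncate root->query_pathkeys to
   * the prefix this relation can evaluate.  find_em_expr_for_rel() is
   * the postgres_fdw helper doing that per-pathkey check. */
  static List *
  useful_pathkeys_for_rel(PlannerInfo *root, RelOptInfo *rel)
  {
      List       *useful = NIL;
      ListCell   *lc;

      foreach(lc, root->query_pathkeys)
      {
          PathKey    *pathkey = (PathKey *) lfirst(lc);

          /* stop at the first key we cannot compute here, e.g. sum(b) */
          if (find_em_expr_for_rel(pathkey->pk_eclass, rel) == NULL)
              break;

          useful = lappend(useful, pathkey);
      }

      return useful;
  }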

Attached is a sequence of 4 patches:


0001-fix-pathkey-processing-in-generate_gather_paths.patch
----------------------------------------------------------
This is the fixed version of my previous patch, with the stuff stolen
from postgres_fdw.


0002-fix-costing-in-cost_incremental_sort.patch
-----------------------------------------------
This is the costing fix, I mentioned before.


0003-fix-explain-in-parallel-mode.patch
---------------------------------------
Minor bug in explain, when incremental sort ends up being in the
parallel part of the plan (missing newline on per-worker line)


0004-rework-where-incremental-sort-paths-are-created.patch
----------------------------------------------------------
This undoes the generate_gather_paths() changes from 0001, and instead
modifies a bunch of places that call create_sort_path() to also consider
incremental sorts. There are a couple remaining, but those should not be
relevant to the queries I've been looking at.


Essentially, 0002 and 0003 are bugfixes. 0001 and 0004 are the two
different approaches to building incremental sort + gather merge.

Now, consider this example:

  create table t (a int, b int, c int);
  insert into t select mod(i,100),mod(i,100),i from generate_series(1,10000000) s(i);
  create index on t (a);
  analyze t;
  explain select a,b,sum(c) from t group by 1,2 order by 1,2,3 limit 1;

With the 0001+0002+0003 patches, I get a plan like this:

                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10375.39..10594.72 rows=1 width=16)
   ->  Incremental Sort  (cost=10375.39..2203675.71 rows=10000 width=16)
         Sort Key: a, b, (sum(c))
         Presorted Key: a, b
         ->  GroupAggregate  (cost=10156.07..2203225.71 rows=10000 width=16)
               Group Key: a, b
               ->  Gather Merge  (cost=10156.07..2128124.39 rows=10000175 width=12)
                     Workers Planned: 2
                     ->  Incremental Sort  (cost=9156.04..972856.05 rows=4166740 width=12)
                           Sort Key: a, b
                           Presorted Key: a
                           ->  Parallel Index Scan using t_a_idx on t  (cost=0.43..417690.30 rows=4166740 width=12)
(12 rows)


and with 0004, I get this:

                                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
 Limit  (cost=20443.84..20665.32 rows=1 width=16)
   ->  Incremental Sort  (cost=20443.84..2235250.05 rows=10000 width=16)
         Sort Key: a, b, (sum(c))
         Presorted Key: a, b
         ->  GroupAggregate  (cost=20222.37..2234800.05 rows=10000 width=16)
               Group Key: a, b
               ->  Incremental Sort  (cost=20222.37..2159698.74 rows=10000175 width=12)
                     Sort Key: a, b
                     Presorted Key: a
                     ->  Index Scan using t_a_idx on t  (cost=0.43..476024.65 rows=10000175 width=12)
(10 rows)

Notice that the cost of the second plan is almost double that of the
first one. That means 0004 does not even generate the first plan, i.e.
there are cases where we don't try to add the explicit sort before
passing the path to generate_gather_paths().

And I think I know why that is - while gather_grouping_paths() tries to
add an explicit sort below the gather merge, there are other places that
call generate_gather_paths() which don't do that. In this case it's
probably apply_scanjoin_target_to_paths(), which simply builds

   parallel (seq|index) scan + gather merge

and that's it. The problem is likely the same - the code does not know
which pathkeys are "interesting" at that point. We probably need to
teach the planner to do this.


FWIW tweaking all the create_sort_path() places to also consider adding
an incremental sort is a bit tedious and invasive, and it almost doubles
the amount of repetitive code. It's OK for an experiment like this, but
we should try handling this in a nicer way (moving it to a separate
function that does both, or something like that).
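
One possible shape for such a shared helper, just to sketch the idea (the
helper name is made up here; create_incremental_sort_path() refers to the
constructor added by this patch, and pathkeys_common_contained_in() is
assumed to take the required keys first, like pathkeys_contained_in() -
exact signatures may differ):

  /* Sketch of a shared helper: return the subpath itself if it is
   * already sorted, an incremental sort if a useful prefix is sorted,
   * or a full sort otherwise. */
  static Path *
  make_ordered_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
                    List *pathkeys, double limit_tuples)
  {
      bool        is_sorted;
      int         presorted_keys;

      is_sorted = pathkeys_common_contained_in(pathkeys, subpath->pathkeys,
                                               &presorted_keys);
      if (is_sorted)
          return subpath;

      if (presorted_keys > 0)
          return (Path *) create_incremental_sort_path(root, rel, subpath,
                                                       pathkeys,
                                                       presorted_keys,
                                                       limit_tuples);

      return (Path *) create_sort_path(root, rel, subpath, pathkeys,
                                       limit_tuples);
  }

Each create_sort_path() call site could then call something like this
instead, which would also centralize the decision about when an
incremental sort is worth building at all.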


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
Note: As I was writing this, I saw a new email come in from Tomas with
a new patch series and some similar observations. I'll look at that
patch series more, but I think it's likely far more complete, so I'll
probably end up going with that. I wanted to send this email anyway to
at least capture the debugging process for reference.

On Mon, Jul 8, 2019 at 12:07 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Mon, Jul 8, 2019 at 10:58 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > On Mon, Jul 08, 2019 at 10:32:18AM -0400, James Coleman wrote:
> > >On Mon, Jul 8, 2019 at 9:59 AM Tomas Vondra
> > ><tomas.vondra@2ndquadrant.com> wrote:
> > >>
> > >> On Mon, Jul 08, 2019 at 09:22:39AM -0400, James Coleman wrote:
> > >> >On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra
> > >> ><tomas.vondra@2ndquadrant.com> wrote:
> > >> >> We're running query like this:
> > >> >>
> > >> >>   SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3
> > >> >>
> > >> >> but we're trying to add the incremental sort *before* the aggregation,
> > >> >> because the optimizer also considers group aggregate with a sorted
> > >> >> input. And (a) is a prefix of (a,sum(b),count(b)) so we think we
> > >> >> actually can do this, but clearly that's nonsense, because we can't
> > >> >> possibly know the aggregates yet. Hence the error.
> > >> >>
> > >> >> If this is the actual issue, we need to ensure we actually can evaluate
> > >> >> all the pathkeys. I don't know how to do that yet. I thought that maybe
> > >> >> we should modify pathkeys_common_contained_in() to set presorted_keys to
> > >> >> 0 in this case.
> > >> >>
> > >> >> But then I started wondering why we don't see this issue even for
> > >> >> regular (non-incremental-sort) paths built in create_ordered_paths().
> > >> >> How come we don't see these failures there? I've modified costing to
> > >> >> make all incremental sort paths very cheap, and still nothing.
> > >> >
> > >> >I assume you mean you modified costing to make regular sort paths very cheap?
> > >> >
> > >>
> > >> No, I mean costing of incremental sort paths, so that they end up being
> > >> the cheapest ones. If some other path is cheaper, we won't see the error
> > >> because it only happens when building plan from the cheapest path.
> > >
> > >Ah, I misunderstood as you trying to figure out a way to try to cause
> > >the same problem with a regular sort.
> > >
> > >> >> So presumably there's a check elsewhere (either implicit or explicit),
> > >> >> because create_ordered_paths() uses pathkeys_common_contained_in() and
> > >> >> does not have the same issue.
> > >> >
> > >> >Given this comment in create_ordered_paths():
> > >> >
> > >> >  generate_gather_paths() will have already generated a simple Gather
> > >> >  path for the best parallel path, if any, and the loop above will have
> > >> >  considered sorting it.  Similarly, generate_gather_paths() will also
> > >> >  have generated order-preserving Gather Merge plans which can be used
> > >> >  without sorting if they happen to match the sort_pathkeys, and the loop
> > >> >  above will have handled those as well.  However, there's one more
> > >> >  possibility: it may make sense to sort the cheapest partial path
> > >> >  according to the required output order and then use Gather Merge.
> > >> >
> > >> >my understanding is that generate_gather_paths() only considers paths
> > >> >that already happen to be sorted (not explicit sorts), so I'm
> > >> >wondering if it would make more sense for the incremental sort path
> > >> >creation for this case to live alongside the explicit ordered path
> > >> >creation in create_ordered_paths() rather than in
> > >> >generate_gather_paths().
> > >> >
> > >>
> > >> How would that solve the issue? Also, we're building a gather path, so
> > >> I think generate_gather_paths() is the right place where to do it. And
> > >> we're not changing the semantics of generate_gather_paths() - the result
> > >> path should be sorted "correctly" with respect to sort_pathkeys.
> > >
> > >Does that imply what the explicit sort in create_ordered_paths() is in
> > >the wrong spot?
> > >
> >
> > I think those are essentially the right places where to do this sort of
> > stuff. Maybe there's a better place, but I don't think those places are
> > somehow wrong.
> >
> > >Or, to put it another way, do you think that both kinds of sorts
> > >should be added in the same place? It seems confusing to me that
> > >they'd be split between the two methods (unless I'm completely
> > >misunderstanding how the two work).
> > >
> >
> > The paths built in those two places were very different in one aspect:
> >
> > 1) generate_gather_paths only ever looked at pathkeys for the subpath, it
> > never even looked at ordering expected by paths above it (or the query as
> > a whole). Plain Gather ignores pathkeys entirely, Gather Merge only aims
> > to maintain ordering of the different subpaths.
> >
> > 2) create_ordered_paths is meant to enforce ordering needed by upper
> > parts of the plan - either by using a properly sorted path, or adding an
> > explicit sort.
> >
> >
> > We want to extend (1) to also look at ordering expected by the upper parts
> > of the plan, and consider incremental sort if applicable. (2) already does
> > that, and it already has the correct pathkeys to enforce.
>
> I guess I'm still not following. If (2) is responsible (currently) for
> adding an explicit sort, why wouldn't adding an incremental sort be an
> example of that responsibility? The subpath that either a Sort or
> IncrementalSort is being added on top of (to then feed into the
> GatherMerge) is the same in both cases right?
>
> Unless you're saying that the difference is that since we have a
> partial ordering already for incremental sort then incremental sort
> falls into the category of "maintaining" existing ordering of the
> subpath?

To try to understand this better I looked through all usages of
generate_gather_paths(). The usage in gather_grouping_paths() also
treats explicit sorting as its responsibility rather than the
responsibility of generate_gather_paths(). Since I consider
incremental sort to be just another form of explicit sorting, I think
it's reasonable to make it the responsibility of callers, given that
generate_gather_paths() currently seems to be explicitly only about
fully presorted paths.

Of course that might not be ideal as more callers have to handle it
specially, so maybe it's worth shifting the responsibility.

> > But looking at root->sort_pathkeys in (1) seems to be the wrong thing :-(
> >
> > The thing is, we don't have just sort_pathkeys, there's distinct_pathkeys
> > and group_pathkeys too, so maybe we should be looking at those too?
>
> I don't know enough yet to answer, but I'd like to look at (in the
> debugger) the subpaths considered in each function to try to get a
> better understanding of why we don't try to explicitly sort the aggs
> (which we know we can't sort yet) but do for incremental sort. I
> assume that means a subpath has to be present in one but not the other
> since they both use the same pathkey checking function.

I stepped through one of the failing test cases and found that in
create_ordered_paths() the if expression fails because
input_rel->partial_pathlist == NIL. So apparently that is the guard
against incorrectly adding as-yet unsortable paths below the
aggregate, and if I move the create_incremental_sort_path() call inside
this block the error goes away.
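
For concreteness, the shape of that change is roughly the following
(sketch only; the other conditions of the real if block, the row
estimates and the projection step are omitted, and target/limit_tuples
are the create_ordered_paths() parameters):

  /* Sketch: the parallel branch of create_ordered_paths() is guarded by
   * the partial path list, so generating the incremental sort inside
   * that block inherits the same protection against pathkeys we cannot
   * evaluate yet. */
  if (input_rel->partial_pathlist != NIL)
  {
      Path   *cheapest_partial = (Path *) linitial(input_rel->partial_pathlist);
      int     presorted_keys;
      bool    is_sorted;

      is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
                                               cheapest_partial->pathkeys,
                                               &presorted_keys);

      if (!is_sorted && presorted_keys > 0)
      {
          Path   *sorted = (Path *)
              create_incremental_sort_path(root, ordered_rel,
                                           cheapest_partial,
                                           root->sort_pathkeys,
                                           presorted_keys,
                                           limit_tuples);

          add_path(ordered_rel, (Path *)
                   create_gather_merge_path(root, ordered_rel, sorted,
                                            sorted->pathtarget,
                                            root->sort_pathkeys,
                                            NULL, NULL));
      }
  }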

Attached is a patch revision with that change.

James Coleman

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Tue, Jul 09, 2019 at 03:37:03AM +0200, Tomas Vondra wrote:
>
> ...
>
>Notice that cost of the second plan is almost double the first one. That
>means 0004 does not even generate the first plan, i.e. there are cases
>where we don't try to add the explicit sort before passing the path to
>generate_gather_paths().
>
>And I think I know why is that - while gather_grouping_paths() tries to
>add explicit sort below the gather merge, there are other places that
>call generate_gather_paths() that don't do that. In this case it's
>probably apply_scanjoin_target_to_paths() which simply builds
>
>  parallel (seq|index) scan + gather merge
>
>and that's it. The problem is likely the same - the code does not know
>which pathkeys are "interesting" at that point. We probably need to
>teach planner to do this.
>

I've looked into this again, and yes - that's the reason. I've added
generate_useful_gather_paths(), which is a wrapper on top of
generate_gather_paths(). It essentially does what the 0001 patch did
directly in generate_gather_paths(), so it's more like
create_grouping_paths().
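
Structurally, the wrapper looks roughly like this (sketch only; names and
details follow the 0001 patch loosely, the real code does more cost-based
filtering, and get_useful_pathkeys_for_relation() here is the helper
copied from postgres_fdw, returning a list of pathkey lists):

  /* Structural sketch of the new wrapper - not the patch itself. */
  void
  generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel,
                               bool override_rows)
  {
      List       *useful_pathkeys_list;
      ListCell   *lc;

      /* plain Gather and order-preserving Gather Merge, as before */
      generate_gather_paths(root, rel, override_rows);

      /* orderings worth producing below a Gather Merge */
      useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);

      foreach(lc, useful_pathkeys_list)
      {
          List       *pathkeys = (List *) lfirst(lc);
          ListCell   *lc2;

          foreach(lc2, rel->partial_pathlist)
          {
              Path       *subpath = (Path *) lfirst(lc2);
              int         presorted_keys;

              /* fully sorted subpaths were handled by the plain call */
              if (pathkeys_common_contained_in(pathkeys, subpath->pathkeys,
                                               &presorted_keys))
                  continue;

              if (presorted_keys > 0)
                  subpath = (Path *)
                      create_incremental_sort_path(root, rel, subpath,
                                                   pathkeys, presorted_keys,
                                                   -1.0);
              else
                  subpath = (Path *)
                      create_sort_path(root, rel, subpath, pathkeys, -1.0);

              add_path(rel, (Path *)
                       create_gather_merge_path(root, rel, subpath,
                                                rel->reltarget, pathkeys,
                                                NULL, NULL));
          }
      }
  }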

And that does the trick - we now generate the cheaper paths, and I don't
see any crashes in regression tests etc.

I still suspect we already have code somewhere doing similar checks on
whether pathkeys might be useful. I've looked into pathkeys.c and
elsewhere, but no luck.

Attached is a slightly modified patch series:

1) 0001 considers incremental sort paths in various places (adds the new
generate_useful_gather_paths and modifies places calling create_sort_path)

2) 0002 and 0003 are fixes I mentioned before

3) 0004 adds a new GUC force_incremental_sort that (when set to 'on')
tries to nudge the optimizer into using incremental sort by essentially
making it free (i.e. using startup/total costs of the subpath). I've found
this useful when trying to force incremental sorts into plans where it may
not be the best strategy.

I won't have time to hack on this over the next ~2 weeks, but I'll try to
respond to questions when possible.

>
>FWIW tweaking all the create_sort_path() places to also consider adding
>incremental sort is a bit tedious and invasive, and it almost doubles
>the amount of repetitive code. It's OK for experiment like this, but we
>should try handling this in a nicer way (move to a separate function
>that does both, or something like that).
>

This definitely needs more work. We need to refactor it in some way, e.g.
have a function that would consider both explicit sort (on the cheapest
path) and incremental sort (on all paths), and call it from all those
places. Haven't tried it, though.

There are also a couple more places where we do create_sort_path() and
don't consider incremental sort yet - window functions, distinct, etc.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Mon, Jul 08, 2019 at 12:07:06PM -0400, James Coleman wrote:
> >On Mon, Jul 8, 2019 at 10:58 AM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> On Mon, Jul 08, 2019 at 10:32:18AM -0400, James Coleman wrote:
> >> >On Mon, Jul 8, 2019 at 9:59 AM Tomas Vondra
> >> ><tomas.vondra@2ndquadrant.com> wrote:
> >> >>
> >> >> On Mon, Jul 08, 2019 at 09:22:39AM -0400, James Coleman wrote:
> >> >> >On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra
> >> >> ><tomas.vondra@2ndquadrant.com> wrote:
> >> >> >> We're running query like this:
> >> >> >>
> >> >> >>   SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3
> >> >> >>
> >> >> >> but we're trying to add the incremental sort *before* the aggregation,
> >> >> >> because the optimizer also considers group aggregate with a sorted
> >> >> >> input. And (a) is a prefix of (a,sum(b),count(b)) so we think we
> >> >> >> actually can do this, but clearly that's nonsense, because we can't
> >> >> >> possibly know the aggregates yet. Hence the error.
> >> >> >>
> >> >> >> If this is the actual issue, we need to ensure we actually can evaluate
> >> >> >> all the pathkeys. I don't know how to do that yet. I thought that maybe
> >> >> >> we should modify pathkeys_common_contained_in() to set presorted_keys to
> >> >> >> 0 in this case.
> >> >> >>
> >> >> >> But then I started wondering why we don't see this issue even for
> >> >> >> regular (non-incremental-sort) paths built in create_ordered_paths().
> >> >> >> How come we don't see these failures there? I've modified costing to
> >> >> >> make all incremental sort paths very cheap, and still nothing.
> >> >> >
> >> >> >I assume you mean you modified costing to make regular sort paths very cheap?
> >> >> >
> >> >>
> >> >> No, I mean costing of incremental sort paths, so that they end up being
> >> >> the cheapest ones. If some other path is cheaper, we won't see the error
> >> >> because it only happens when building plan from the cheapest path.
> >> >
> >> >Ah, I misunderstood as you trying to figure out a way to try to cause
> >> >the same problem with a regular sort.
> >> >
> >> >> >> So presumably there's a check elsewhere (either implicit or explicit),
> >> >> >> because create_ordered_paths() uses pathkeys_common_contained_in() and
> >> >> >> does not have the same issue.
> >> >> >
> >> >> >Given this comment in create_ordered_paths():
> >> >> >
> >> >> >  generate_gather_paths() will have already generated a simple Gather
> >> >> >  path for the best parallel path, if any, and the loop above will have
> >> >> >  considered sorting it.  Similarly, generate_gather_paths() will also
> >> >> >  have generated order-preserving Gather Merge plans which can be used
> >> >> >  without sorting if they happen to match the sort_pathkeys, and the loop
> >> >> >  above will have handled those as well.  However, there's one more
> >> >> >  possibility: it may make sense to sort the cheapest partial path
> >> >> >  according to the required output order and then use Gather Merge.
> >> >> >
> >> >> >my understanding is that generate_gather_paths() only considers paths
> >> >> >that already happen to be sorted (not explicit sorts), so I'm
> >> >> >wondering if it would make more sense for the incremental sort path
> >> >> >creation for this case to live alongside the explicit ordered path
> >> >> >creation in create_ordered_paths() rather than in
> >> >> >generate_gather_paths().
> >> >> >
> >> >>
> >> >> How would that solve the issue? Also, we're building a gather path, so
> >> >> I think generate_gather_paths() is the right place where to do it. And
> >> >> we're not changing the semantics of generate_gather_paths() - the result
> >> >> path should be sorted "correctly" with respect to sort_pathkeys.
> >> >
> >> >Does that imply what the explicit sort in create_ordered_paths() is in
> >> >the wrong spot?
> >> >
> >>
> >> I think those are essentially the right places where to do this sort of
> >> stuff. Maybe there's a better place, but I don't think those places are
> >> somehow wrong.
> >>
> >> >Or, to put it another way, do you think that both kinds of sorts
> >> >should be added in the same place? It seems confusing to me that
> >> >they'd be split between the two methods (unless I'm completely
> >> >misunderstanding how the two work).
> >> >
> >>
> >> The paths built in those two places were very different in one aspect:
> >>
> >> 1) generate_gather_paths only ever looked at pathkeys for the subpath, it
> >> never even looked at ordering expected by paths above it (or the query as
> >> a whole). Plain Gather ignores pathkeys entirely, Gather Merge only aims
> >> to maintain ordering of the different subpaths.
> >>
> >> 2) create_ordered_paths is meant to enforce ordering needed by upper
> >> parts of the plan - either by using a properly sorted path, or adding an
> >> explicit sort.
> >>
> >>
> >> We want to extend (1) to also look at ordering expected by the upper parts
> >> of the plan, and consider incremental sort if applicable. (2) already does
> >> that, and it already has the correct pathkeys to enforce.
> >
> >I guess I'm still not following. If (2) is responsible (currently) for
> >adding an explicit sort, why wouldn't adding an incremental sort be an
> >example of that responsibility? The subpath that either a Sort or
> >IncrementalSort is being added on top of (to then feed into the
> >GatherMerge) is the same in both cases right?
> >
> >Unless you're saying that the difference is that since we have a
> >partial ordering already for incremental sort then incremental sort
> >falls into the category of "maintaining" existing ordering of the
> >subpath?
> >
>
> Oh, I think I understand what you're saying. Essentially, we should not
> be making generate_gather_paths responsible for adding the incremental
> sort. Instead, we should be looking at places than are adding explicit
> sort (using create_sort_path) and also consider adding incremental sort.
>
> I definitely agree with the second half - we should look at all places
> that create explicit sorts and make them also consider incremental
> sorts. That makes sense.

Yep, exactly.

> But I'm not sure it'll address all cases - the problem is that those
> places add the explicit sort because they need sorted input. Gather
> Merge does not do that, it only preserves existing ordering of paths.
>
> So it's possible the path does not have an explicit sort on to, and
> gather merge will not know to add it. And once we have the gather merge
> in place, we can't push place "under" it.

That's the explanation I was missing, and it makes sense (to restate:
sometimes sorting is useful even when not required for the correctness
of the user-returned data).

> In fact, we already have code dealing with this "issue" for a special
> case - see gather_grouping_paths(). It generates plain gather merge
> paths, but then also considers building one with explicit sort. But it
> only does that for grouping paths (when it's clear we need to be looking
> at grouping_pathkeys), and there are generate_gather_paths() that don't
> have similar treatment.

I just find it humorous both of us were writing separate emails
mentioning that function at the same time.

> >> But looking at root->sort_pathkeys in (1) seems to be the wrong thing :-(
> >>
> >> The thing is, we don't have just sort_pathkeys, there's distinct_pathkeys
> >> and group_pathkeys too, so maybe we should be looking at those too?
> >
> >I don't know enough yet to answer, but I'd like to look at (in the
> >debugger) the subpaths considered in each function to try to get a
> >better understanding of why we don't try to explicitly sort the aggs
> >(which we know we can't sort yet) but do for incremental sort. I
> >assume that means a subpath has to be present in one but not the other
> >since they both use the same pathkey checking function.
> >
>
> I've been wondering if we have some other code that needs to consider
> interesting pathkeys "candidates" (instead of just getting the list
> interesting in that place). Because then we could look at that code and
> use it here ...
>
> And guess what - postgres_fdw needs to do pretty much exactly that, when
> building paths for remote relations. AFAIK we can't easily request all
> plans from the remote node and then look at their pathkeys (like we'd do
> with local node), so instead we deduce "interesting pathkeys" and then
> request best plans for those. And deducing "interesing" pathkeys is
> pretty much what get_useful_pathkeys_for_relation() is about.
>
> So I've copied this function (and two more, called from it), whacked it
> a bit until it removed (shakespeare-writing chimp comes to mind) and
> voila, it seems to be working. The errors you reported are gone, and the
> plans seems to be reasonable.
>
> Attached is a sequence of 4 patches:
>
>
> 0001-fix-pathkey-processing-in-generate_gather_paths.patch
> ----------------------------------------------------------
> This is the fixed version of my previous patch, with the stuff stolen
> from postgres_fdw.
>
>
> 0002-fix-costing-in-cost_incremental_sort.patch
> -----------------------------------------------
> This is the costing fix, I mentioned before.
>
>
> 0003-fix-explain-in-parallel-mode.patch
> ---------------------------------------
> Minor bug in explain, when incremental sort ends up being in the
> parallel part of the plan (missing newline on per-worker line)
>
>
> 0004-rework-where-incremental-sort-paths-are-created.patch
> ----------------------------------------------------------
> This undoes the generate_gather_paths() changes from 0001, and instead
> modifies a bunch of places that call create_sort_path() to also consider
> incremental sorts. There are a couple remaining, but those should not be
> relevant to the queries I've been looking at.
>
>
> Essentially, 0002 and 0003 are bugfixes. 0001 and 0004 are the two
> different aproaches to building incremental sort + gather merge.
>
> Now, consider this example:
>
>   create table t (a int, b int, c int);
>   insert into t select mod(i,100),mod(i,100),i from generate_series(1,10000000) s(i);
>   create index on t (a);
>   analyze t;
>   explain select a,b,sum(c) from t group by 1,2 order by 1,2,3 limit 1;
>
> With 0001+0002+0003 pathes, I get a plan like this:
>
>                                                      QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=10375.39..10594.72 rows=1 width=16)
>    ->  Incremental Sort  (cost=10375.39..2203675.71 rows=10000 width=16)
>          Sort Key: a, b, (sum(c))
>          Presorted Key: a, b
>          ->  GroupAggregate  (cost=10156.07..2203225.71 rows=10000 width=16)
>                Group Key: a, b
>                ->  Gather Merge  (cost=10156.07..2128124.39 rows=10000175 width=12)
>                      Workers Planned: 2
>                      ->  Incremental Sort  (cost=9156.04..972856.05 rows=4166740 width=12)
>                            Sort Key: a, b
>                            Presorted Key: a
>                            ->  Parallel Index Scan using t_a_idx on t  (cost=0.43..417690.30 rows=4166740 width=12)
> (12 rows)
>
>
> and with 0004, I get this:
>
>                                               QUERY PLAN
> ------------------------------------------------------------------------------------------------------
>  Limit  (cost=20443.84..20665.32 rows=1 width=16)
>    ->  Incremental Sort  (cost=20443.84..2235250.05 rows=10000 width=16)
>          Sort Key: a, b, (sum(c))
>          Presorted Key: a, b
>          ->  GroupAggregate  (cost=20222.37..2234800.05 rows=10000 width=16)
>                Group Key: a, b
>                ->  Incremental Sort  (cost=20222.37..2159698.74 rows=10000175 width=12)
>                      Sort Key: a, b
>                      Presorted Key: a
>                      ->  Index Scan using t_a_idx on t  (cost=0.43..476024.65 rows=10000175 width=12)
> (10 rows)
>
> Notice that cost of the second plan is almost double the first one. That
> means 0004 does not even generate the first plan, i.e. there are cases
> where we don't try to add the explicit sort before passing the path to
> generate_gather_paths().
>
> And I think I know why is that - while gather_grouping_paths() tries to
> add explicit sort below the gather merge, there are other places that
> call generate_gather_paths() that don't do that. In this case it's
> probably apply_scanjoin_target_to_paths() which simply builds
>
>    parallel (seq|index) scan + gather merge
>
> and that's it. The problem is likely the same - the code does not know
> which pathkeys are "interesting" at that point. We probably need to
> teach planner to do this.

I had also noticed that that was an obvious place where
generate_gather_paths() was used but an explicit sort wasn't also
added separately, which makes me think the division of labor is
probably currently wrong regardless of the incremental sort patch.

Do you agree? Should we try to fix that (likely with your new
"interesting paths" version of generate_gather_paths()) first as a
prefix patch to adding incremental sort?

> FWIW tweaking all the create_sort_path() places to also consider adding
> incremental sort is a bit tedious and invasive, and it almost doubles
> the amount of repetitive code. It's OK for experiment like this, but we
> should try handling this in a nicer way (move to a separate function
> that does both, or something like that).

Agreed.

On Tue, Jul 9, 2019 at 8:11 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Jul 09, 2019 at 03:37:03AM +0200, Tomas Vondra wrote:
> >
> > ...
> >
> >Notice that cost of the second plan is almost double the first one. That
> >means 0004 does not even generate the first plan, i.e. there are cases
> >where we don't try to add the explicit sort before passing the path to
> >generate_gather_paths().
> >
> >And I think I know why is that - while gather_grouping_paths() tries to
> >add explicit sort below the gather merge, there are other places that
> >call generate_gather_paths() that don't do that. In this case it's
> >probably apply_scanjoin_target_to_paths() which simply builds
> >
> >  parallel (seq|index) scan + gather merge
> >
> >and that's it. The problem is likely the same - the code does not know
> >which pathkeys are "interesting" at that point. We probably need to
> >teach planner to do this.
> >
>
> I've looked into this again, and yes - that's the reason. I've added
> generate_useful_gather_paths() which is a wrapper on top of
> generate_gather_paths(). It essentially does what 0001 patch did directly
> in generate_gather_paths() so it's more like create_grouping_paths().
>
> And that does the trick - we now generate the cheaper paths, and I don't
> see any crashes in regression tests etc.
>
> I still suspect we already have code doing similar checks whether pathkeys
> might be useful somewhere. I've looked into pathkeys.c and elsewhere but
> no luck.
>
> Attached is a slightly modified patch series:
>
> 1) 0001 considers incremental sort paths in various places (adds the new
> generate_useful_gather_paths and modifies places calling create_sort_path)

I need to spend some decent time digesting this patch, but the concept
sounds very useful.

> 2) 0002 and 0003 are fixes I mentioned before
>
> 3) 0004 adds a new GUC force_incremental_sort that (when set to 'on')
> tries to nudge the optimizer into using incremental sort by essentially
> making it free (i.e. using startup/total costs of the subpath). I've found
> this useful when trying to force incremental sorts into plans where it may
> not be the best strategy.

That will be super helpful. I do wonder if we need to expose (in the
production patch) a GUC of some kind to adjust incremental sort
costing so that users can try to tweak when it is preferred over
regular sort.

> I won't have time to hack on this over the next ~2 weeks, but I'll try to
> respond to questions when possible.

Understood; thanks so much for your help on this.

> >
> >FWIW tweaking all the create_sort_path() places to also consider adding
> >incremental sort is a bit tedious and invasive, and it almost doubles
> >the amount of repetitive code. It's OK for experiment like this, but we
> >should try handling this in a nicer way (move to a separate function
> >that does both, or something like that).
> >
>
> This definitely needs more work. We need to refactor it in some way, e.g.
> have a function that would consider both explicit sort (on the cheapest
> path) and incremental sort (on all paths), and call it from all those
> places. Haven't tried it, though.
>
> There's also a couple more places where we do create_sort_path() and don't
> consider incremental sort yet - window functions, distinct etc.

Yep, and likely want regression tests for all of these cases also.

James Coleman



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Mon, Jul 8, 2019 at 6:37 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
>
> On Thu, Jul 4, 2019 at 4:25 PM James Coleman <jtc331@gmail.com> wrote:
> > Process questions:
> > - Do I need to explicitly move the patch somehow to the next CF?
>
> We didn't manage to register it on current (July) commitfest.  So,
> please, register it on next (September) commitfest.

I've moved it to the September cf.

> > - Since I've basically taken over patch ownership, should I move my
> > name from reviewer to author in the CF app? And can there be two
> > authors listed there?
>
> Sure, you're co-author of this patch.  Two or more authors could be
> listed at CF app, you can find a lot of examples on the list.

Done.

James Coleman



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Tue, Jul 09, 2019 at 09:28:42AM -0400, James Coleman wrote:
>On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Mon, Jul 08, 2019 at 12:07:06PM -0400, James Coleman wrote:
>> > ...
>> >
>> >I guess I'm still not following. If (2) is responsible (currently) for
>> >adding an explicit sort, why wouldn't adding an incremental sort be an
>> >example of that responsibility? The subpath that either a Sort or
>> >IncrementalSort is being added on top of (to then feed into the
>> >GatherMerge) is the same in both cases right?
>> >
>> >Unless you're saying that the difference is that since we have a
>> >partial ordering already for incremental sort then incremental sort
>> >falls into the category of "maintaining" existing ordering of the
>> >subpath?
>> >
>>
>> Oh, I think I understand what you're saying. Essentially, we should not
>> be making generate_gather_paths responsible for adding the incremental
>> sort. Instead, we should be looking at places than are adding explicit
>> sort (using create_sort_path) and also consider adding incremental sort.
>>
>> I definitely agree with the second half - we should look at all places
>> that create explicit sorts and make them also consider incremental
>> sorts. That makes sense.
>
>Yep, exactly.
>

If I remember correctly, one of the previous patch versions (in the early
2018 commitfests) actually modified many of those places, but it did that
in a somewhat "naive" way. It simply used incremental sort whenever the
path was partially sorted, or something like that. So it did not use
costing properly. There was an attempt to fix that in the last commitfest
but the costing model was deemed to be a bit too rough and unreliable
(especially the ndistinct estimates can be quite weak), so the agreement 
was to try salvaging the patch for PG11 by only considering incremental
sort in "safest" places with greatest gains.

We've significantly improved the costing model since then, and the
implementation likely handles the corner cases much better. But that does
not mean we have to introduce the incremental sort to all those places at
once - it might be wise to split that into separate patches.

It's not just about picking the right plan - we've kinda ignored what impact
these extra paths might have on planner performance, so maybe we should look at
that too. And the impact might be different for each of those places.

I'll leave that up to you, but I certainly won't insist on doing it all in
one huge patch.

>> But I'm not sure it'll address all cases - the problem is that those
>> places add the explicit sort because they need sorted input. Gather
>> Merge does not do that, it only preserves existing ordering of paths.
>>
>> So it's possible the path does not have an explicit sort on top, and
>> gather merge will not know to add it. And once we have the gather merge
>> in place, we can't push the sort "under" it.
>
>That's the explanation I was missing; and it makes sense (to restate:
>sometimes sorting is useful even when not required for correctness of
>the user returned data).
>

Yes, although even when the sorting is required for correctness (because
the user specified ORDER BY) you can do it at different points in the
plan. We'd still produce correct results, but the sort might be done at
the very end without these changes.

For example we might end up with plans

   Incremental Sort (presorted: a, path keys: a,b)
    -> Gather Merge (path keys: a)
        -> Index Scan (path keys: a)

but with those changes we might push the incremental sort down into the
parallel part:

   Gather Merge (path keys: a,b)
    -> Incremental Sort (presorted: a, path keys: a,b)
        -> Index Scan (path keys: a)

which is likely better. Perhaps that's what you meant, though.
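
For illustration, a minimal query shape where those two plan forms compete
(reusing the table t and the index on t(a) from the example earlier in this
thread; the GUC setting is only there to make the parallel plan likely):

    set max_parallel_workers_per_gather = 2;
    explain (costs off)
    select * from t order by a, b;

Without the pushdown the incremental sort can only sit above the Gather
Merge; with it, the sort on (a, b) can run inside each worker and the Gather
Merge merely merges already-sorted streams.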


>> In fact, we already have code dealing with this "issue" for a special
>> case - see gather_grouping_paths(). It generates plain gather merge
>> paths, but then also considers building one with explicit sort. But it
>> only does that for grouping paths (when it's clear we need to be looking
>> at grouping_pathkeys), and there are generate_gather_paths() callers that
>> don't have similar treatment.
>
>I just find it humorous both of us were writing separate emails
>mentioning that function at the same time.
>

;-)

>> ...
>>
>> And I think I know why is that - while gather_grouping_paths() tries to
>> add explicit sort below the gather merge, there are other places that
>> call generate_gather_paths() that don't do that. In this case it's
>> probably apply_scanjoin_target_to_paths() which simply builds
>>
>>    parallel (seq|index) scan + gather merge
>>
>> and that's it. The problem is likely the same - the code does not know
>> which pathkeys are "interesting" at that point. We probably need to
>> teach planner to do this.
>
>I had also noticed that that was an obvious place where
>generate_gather_paths() was used but an explicit sort wasn't also
>added separately, which makes me think the division of labor is
>probably currently wrong regardless of the incremental sort patch.
>
>Do you agree? Should we try to fix that (likely with your new
>"interesting paths" version of generate_gather_paths()) first as a
>prefix patch to adding incremental sort?
>

I'm not sure what generate_useful_gather_paths() should be doing that it does
not? Or is it just the division of labor that you think is wrong? In any
case, feel free to whack it until you're happy with it.

>> ...
>>
>> Attached is a slightly modified patch series:
>>
>> 1) 0001 considers incremental sort paths in various places (adds the new
>> generate_useful_gather_paths and modifies places calling create_sort_path)
>
>I need to spend some decent time digesting this patch, but the concept
>sounds very useful.
>

OK.

>> 2) 0002 and 0003 are fixes I mentioned before
>>
>> 3) 0004 adds a new GUC force_incremental_sort that (when set to 'on')
>> tries to nudge the optimizer into using incremental sort by essentially
>> making it free (i.e. using startup/total costs of the subpath). I've found
>> this useful when trying to force incremental sorts into plans where it may
>> not be the best strategy.
>
>That will be super helpful. I do wonder if we need to expose (in the
>production patch) a GUC of some kind to adjust incremental sort
>costing so that users can try to tweak when it is preferred over
>regular sort.
>

This GUC is really meant primarily for development, to force choice of
incremental sort during regression tests (so as to use incremental sort in
as many plans as possible). I'd remove it from the final patch. I think
the general consensus on pgsql-hackers is that we should not introduce
GUCs unless absolutely necessary. But for development GUCs - sure.

FWIW I'm not sure it's a good idea to look at both enable_incremental_sort
and enable_sort in cost_incremental_sort(). Not only do we currently end up
with disable_cost twice when both GUCs are set to 'off', but it might also be
useful to be able to disable those two sort types independently.
For example you might set just enable_sort=off and we'd still generate
incremental sort paths.
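
Concretely, something like this hypothetical session (assuming the GUC names
used in this patch, and the table t from the example above):

    set enable_sort = off;                 -- penalize full Sort only
    set enable_incremental_sort = on;      -- incremental sort stays available
    explain select * from t order by a, b;

With the two GUCs decoupled, only plain Sort paths get disable_cost added,
while incremental sort paths keep their normal costing.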

>> ...
>>
>> This definitely needs more work. We need to refactor it in some way, e.g.
>> have a function that would consider both explicit sort (on the cheapest
>> path) and incremental sort (on all paths), and call it from all those
>> places. Haven't tried it, though.
>>
>> There's also a couple more places where we do create_sort_path() and don't
>> consider incremental sort yet - window functions, distinct etc.
>
>Yep, and likely want regression tests for all of these cases also.
>

Yes, that's definitely something a committable patch needs to include.

Another thing we should have is a collection of tests with data sets that
"break" the costing model in some way (skew, correlated columns,
non-uniform group sizes, ...). That's something not meant for commit,
because it'll probably require significant amounts of data, but we need it
to assess the quality of the planner/costing part. I know there are various
ad-hoc test cases in the thread history; it'd be good to consolidate those
into one place.
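
For instance, a tiny sketch of the kind of data set meant here (names and
sizes are made up, and a real benchmark would need much more data):

    -- correlated leading columns plus one huge group on the presorted prefix
    create table costing_skew (a int, b int, c int);
    insert into costing_skew
    select i / 100000, i / 100000, i from generate_series(1,1000000) s(i);
    insert into costing_skew
    select 0, mod(i,1000), i from generate_series(1,1000000) s(i);
    create index on costing_skew (a);
    analyze costing_skew;

Data like this makes the per-group (ndistinct) estimates for the presorted
prefix much less reliable than in the uniform examples used so far.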

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Jul 9, 2019 at 10:54 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Jul 09, 2019 at 09:28:42AM -0400, James Coleman wrote:
> >On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> On Mon, Jul 08, 2019 at 12:07:06PM -0400, James Coleman wrote:
> >> > ...
> >> >
> >> >I guess I'm still not following. If (2) is responsible (currently) for
> >> >adding an explicit sort, why wouldn't adding an incremental sort be an
> >> >example of that responsibility? The subpath that either a Sort or
> >> >IncrementalSort is being added on top of (to then feed into the
> >> >GatherMerge) is the same in both cases right?
> >> >
> >> >Unless you're saying that the difference is that since we have a
> >> >partial ordering already for incremental sort then incremental sort
> >> >falls into the category of "maintaining" existing ordering of the
> >> >subpath?
> >> >
> >>
> >> Oh, I think I understand what you're saying. Essentially, we should not
> >> be making generate_gather_paths responsible for adding the incremental
> >> sort. Instead, we should be looking at places than are adding explicit
> >> sort (using create_sort_path) and also consider adding incremental sort.
> >>
> >> I definitely agree with the second half - we should look at all places
> >> that create explicit sorts and make them also consider incremental
> >> sorts. That makes sense.
> >
> >Yep, exactly.
> >
>
> If I remember correctly, one of the previous patch versions (in the early
> 2018 commitfests) actually modified many of those places, but it did that
> in a somewhat "naive" way. It simply used incremental sort whenever the
> path was partially sorted, or something like that. So it did not use
> costing properly. There was an attempt to fix that in the last commitfest
> but the costing model was deemed to be a bit too rough and unreliable
> (especially the ndistinct estimates can be quite weak), so the agreement
> was to try salvaging the patch for PG11 by only considering incremental
> sort in "safest" places with greatest gains.
>
> We've significantly improved the costing model since then, and the
> implementation likely handles the corner cases much better. But that does
> not mean we have to introduce the incremental sort to all those places at
> once - it might be wise to split that into separate patches.

Yes, although we haven't added the MCV checking yet; that's on my
mental checklist, but it's another new area of the codebase for me to
understand, so I've been prioritizing other parts of the patch.

> It's not just about picking the right plan - we've kinda ignored what impact
> these extra paths might have on planner performance, so maybe we should look at
> that too. And the impact might be different for each of those places.
>
> I'll leave that up to you, but I certainly won't insist on doing it all in
> one huge patch.

I'm not opposed to handling some of them separately, but I would like
to at least hit the places where it's most likely (for example, with
LIMIT) to improve things. I suppose I'll have to look at all of the
usages of create_sort_path() and try to rank them in terms of
perceived likely value.

> >> But I'm not sure it'll address all cases - the problem is that those
> >> places add the explicit sort because they need sorted input. Gather
> >> Merge does not do that, it only preserves existing ordering of paths.
> >>
> >> So it's possible the path does not have an explicit sort on top, and
> >> gather merge will not know to add it. And once we have the gather merge
> >> in place, we can't push the sort "under" it.
> >
> >That's the explanation I was missing; and it makes sense (to restate:
> >sometimes sorting is useful even when not required for correctness of
> >the user returned data).
> >
>
> Yes, although even when the sorting is required for correctness (because
> the user specified ORDER BY) you can do it at different points in the
> plan. We'd still produce correct results, but the sort might be done at
> the very end without these changes.
>
> For example we might end up with plans
>
>    Incremental Sort (presorted: a, path keys: a,b)
>     -> Gather Merge (path keys: a)
>         -> Index Scan (path keys: a)
>
> but with those changes we might push the incremental sort down into the
> parallel part:
>
>    Gather Merge (path keys: a,b)
>     -> Incremental Sort (presorted: a, path keys: a,b)
>         -> Index Scan (path keys: a)
>
> which is likely better. Perhaps that's what you meant, though.

I was thinking of ordering being useful for grouping/aggregation or
merge joins; I didn't realize the above plan wasn't possible yet, so
that explanation is helpful.
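
As a (contrived) sketch of the aggregation case, assuming the same t(a, b, c)
table with an index on (a) from earlier in the thread:

    explain
    select a, b, count(*)
    from t
    group by a, b
    order by a, b
    limit 10;

Here a GroupAggregate fed by an incremental sort on (a, b) can start emitting
groups long before the whole table has been sorted, which is the kind of plan
the changes above are meant to enable.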

> >> And I think I know why is that - while gather_grouping_paths() tries to
> >> add explicit sort below the gather merge, there are other places that
> >> call generate_gather_paths() that don't do that. In this case it's
> >> probably apply_scanjoin_target_to_paths() which simply builds
> >>
> >>    parallel (seq|index) scan + gather merge
> >>
> >> and that's it. The problem is likely the same - the code does not know
> >> which pathkeys are "interesting" at that point. We probably need to
> >> teach planner to do this.
> >
> >I had also noticed that that was an obvious place where
> >generate_gather_paths() was used but an explicit sort wasn't also
> >added separately, which makes me think the division of labor is
> >probably currently wrong regardless of the incremental sort patch.
> >
> >Do you agree? Should we try to fix that (likely with your new
> >"interesting paths" version of generate_gather_paths()) first as a
> >prefix patch to adding incremental sort?
> >
>
> I'm not sure what the generate_useful_gather_paths() should do but does
> not? Or is it just the division of labor that you think is wrong? In any
> case, feel free to whack it until you're happy with it.

Oh, I didn't mean to imply generate_useful_gather_paths() was
deficient in some way; I was noting that I'd noticed that
apply_scanjoin_target_to_paths() didn't consider explicit sort +
gather merge on master, and that maybe that was an unintentional miss
and something that should be remedied.

And if it is a miss, then that would demonstrate in my mind that the
addition of generate_useful_gather_paths() would be a helpful refactor
as a standalone patch even without incremental sort. So "currently
wrong" above meant "on master" not "in the current version of the
patch."

> >> Attached is a slightly modified patch series:
> >>
> >> 1) 0001 considers incremental sort paths in various places (adds the new
> >> generate_useful_gather_paths and modifies places calling create_sort_path)
> >
> >I need to spend some decent time digesting this patch, but the concept
> >sounds very useful.
> >
>
> OK.

I've been reading this several times + stepping through with the
debugger to understand when this is useful, but I have a few
questions.

The first case considered in get_useful_pathkeys_for_relation (which
considers root->query_pathkeys) makes a lot of sense -- obviously if
we want sorted results then it's useful to consider both full and
incremental sort.

But I'm not sure I see yet a way to trigger the second case (which
uses get_useful_ecs_for_relation to build pathkeys potentially useful
for merge joins). In the FDW case we need to consider it since we want
to avoid local sort and so want to see if the foreign server might be
able to provide us useful presorted data, but in the local case I
don't think that's useful. From what I can tell merge join path
costing internally considers possible sorts of both the inner and
outer input paths, and merge join plan node creation is responsible
for building explicit sort nodes as necessary (i.e., there are no
explicit sort paths created for merge join paths.) That means that,
for example, a query like:

select * from
(select * from j2 order by j2.t, j2.a) j2
join (select * from j1 order by j1.t) j1
    on j1.t = j2.t and j1.a = j2.a;

doesn't consider incremental sort for the merge join path (I disabled
hash joins, nested loops, and full sorts testing that on an empty
table just to easily force a merge join plan). And unfortunately I
don't think there's an easy way to remedy that: from what I can tell
it'd be a pretty invasive patch requiring refactoring merge join
costing to consider both kinds of sorting (in the most simple
implementation that would mean considering up to 4x as many merge join
paths -- inner/outer sorted by: full/full, full/incremental,
incremental/full, and incremental/incremental). Given that's a
significant undertaking on its own, I think I'm going to avoid
addressing it as part of this patch.
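
(For reference, this is roughly how I was forcing the merge join plan; j1 and
j2 here are hypothetical empty stand-ins for the regression-test tables:

    create table j1 (t int, a int);
    create table j2 (t int, a int);
    set enable_hashjoin = off;
    set enable_nestloop = off;
    set enable_sort = off;
    explain
    select * from
    (select * from j2 order by j2.t, j2.a) j2
    join (select * from j1 order by j1.t) j1
        on j1.t = j2.t and j1.a = j2.a;

Even with those settings the merge join path never considers an incremental
sort on either input.)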

If it's true that the get_useful_ecs_for_relation part of that logic
isn't actually exercisable currently, then that would cut down
significantly on the amount of code that needs to be added for
consideration of valid gather merge paths. But if you can think of a
counterexample, please let me know.

> >> 2) 0002 and 0003 are fixes I mentioned before

I'm incorporating those with a bit of additional cleanup.

> >> 3) 0004 adds a new GUC force_incremental_sort that (when set to 'on')
> >> tries to nudge the optimizer into using incremental sort by essentially
> >> making it free (i.e. using startup/total costs of the subpath). I've found
> >> this useful when trying to force incremental sorts into plans where it may
> >> not be the best strategy.
> >
> >That will be super helpful. I do wonder if we need to expose (in the
> >production patch) a GUC of some kind to adjust incremental sort
> >costing so that users can try to tweak when it is preferred over
> >regular sort.
> >
>
> This GUC is really meant primarily for development, to force choice of
> incremental sort during regression tests (so as to use incremental sort in
> as many plans as possible). I'd remove it from the final patch. I think
> the general consensus on pgsql-hackers is that we should not introduce
> GUCs unless absolutely necessary. But for development GUCs - sure.
>
> FWIW I'm not sure it's a good idea to look at both enable_incremental_sort
> and enable_sort in cost_incremental_sort(). Not only end up with
> disable_cost twice when both GUCs are set to 'off' at the moment, but it
> might be useful to be able to disable those two sort types independently.
> For example you might set just enable_sort=off and we'd still generate
> incremental sort paths.

That would cover the usage case I was getting at. Having enable_sort
disable incremental sort also came in without much explanation [1]:

On Fri, Apr 6, 2018 at 11:40 PM, Alexander Kuzmenkov <
a(dot)kuzmenkov(at)postgrespro(dot)ru> wrote:
> Also some other changes from me:
> ...
> enable_sort should act as a cost-based soft disable for both incremental
> and normal sort.

I wasn't sure that fully made sense to me, but was assuming the idea
was to effectively not introduce a regression for anyone already
disabling sort to force a specific plan shape. That being said, any
new execution node/planning feature can cause a regression in such
"hinted" queries, so I'm not sure that's a good reason on its own. In
any case, incremental sort is different enough in performance
qualities that I think you'd want to re-evaluate possible plans in
queries where enable_sort=off is useful, so I'm going to make incremental
sort independent of enable_sort unless there's a strong objection
here.

Tangentially: I'd almost expect enable_incremental_sort to act as a
hard disable (and not even generate the paths) rather than a soft
cost-based disable, since while standard sort is the most basic
operation that needs to always be available as a last resort the same
is not true for incremental sort...

> Another thing we should have is a collection of tests with data sets that
> "break" the costing model in some way (skew, correlated columns,
> non-uniform group sizes, ...). That's something not meant for commit,
> because it'll probably require significant amounts of data, but we need it
> to asses the quality of the planner/costing part. I know there are various
> ad-hoc test cases in the thread history, it'd be good to consolidate that
> into once place.

Agreed.

I'm continuing to work on the planning side of this with the goal of
not needing to modify too many places to consider an incremental sort
path + considering which ones are most likely to be useful, but I
wanted to get my questions about get_useful_ecs_for_relation sent out
while I work on that.

If we end up wanting to limit the number of places we consider
incremental sort (whether for planning time or merely for size of the
initial patch), do you have any thoughts on what general areas we
should consider most valuable? Besides the obvious LIMIT case, another
area that might benefit is min/max, though I'm not sure yet whether
that would really end up meaning considering it all over the
place.

James Coleman

[1]: https://www.postgresql.org/message-id/CAPpHfdtKHETXhf062CPvkjpG1wnjQ7rv4uLhZgYQ6VZjwqDYpg%40mail.gmail.com



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Sun, Jul 14, 2019 at 02:38:40PM -0400, James Coleman wrote:
>On Tue, Jul 9, 2019 at 10:54 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Tue, Jul 09, 2019 at 09:28:42AM -0400, James Coleman wrote:
>> >On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra
>> ><tomas.vondra@2ndquadrant.com> wrote:
>> >>
>> >> On Mon, Jul 08, 2019 at 12:07:06PM -0400, James Coleman wrote:
>> >> > ...
>> >> >
>> >> >I guess I'm still not following. If (2) is responsible (currently) for
>> >> >adding an explicit sort, why wouldn't adding an incremental sort be an
>> >> >example of that responsibility? The subpath that either a Sort or
>> >> >IncrementalSort is being added on top of (to then feed into the
>> >> >GatherMerge) is the same in both cases right?
>> >> >
>> >> >Unless you're saying that the difference is that since we have a
>> >> >partial ordering already for incremental sort then incremental sort
>> >> >falls into the category of "maintaining" existing ordering of the
>> >> >subpath?
>> >> >
>> >>
>> >> Oh, I think I understand what you're saying. Essentially, we should not
>> >> be making generate_gather_paths responsible for adding the incremental
>> >> sort. Instead, we should be looking at places than are adding explicit
>> >> sort (using create_sort_path) and also consider adding incremental sort.
>> >>
>> >> I definitely agree with the second half - we should look at all places
>> >> that create explicit sorts and make them also consider incremental
>> >> sorts. That makes sense.
>> >
>> >Yep, exactly.
>> >
>>
>> If I remember correctly, one of the previous patch versions (in the early
>> 2018 commitfests) actually modified many of those places, but it did that
>> in a somewhat "naive" way. It simply used incremental sort whenever the
>> path was partially sorted, or something like that. So it did not use
>> costing properly. There was an attempt to fix that in the last commitfest
>> but the costing model was deemed to be a bit too rough and unreliable
>> (especially the ndistinct estimates can be quite weak), so the agreement
>> was to try salvaging the patch for PG11 by only considering incremental
>> sort in "safest" places with greatest gains.
>>
>> We've significantly improved the costing model since then, and the
>> implementation likely handles the corner cases much better. But that does
>> not mean we have to introduce the incremental sort to all those places at
>> once - it might be wise to split that into separate patches.
>
>Yes, although we haven't added the MCV checking yet; that's on my
>mental checklist, but another new area of the codebase for me to
>understand, so I've been prioritizing other parts of the patch.
>

Sure, no problem.

>> It's not just about picking the right plan - we've kinda ignored what impact
>> these extra paths might have on planner performance, so maybe we should look at
>> that too. And the impact might be different for each of those places.
>>
>> I'll leave that up to you, but I certainly won't insist on doing it all in
>> one huge patch.
>
>I'm not opposed to handling some of them separately, but I would like
>to at least hit the places where it's most likely (for example, with
>LIMIT) to improve things. I supposed I'll have to look at all of the
>usages of create_sort_path() and try to rank them in terms of
>perceived likely value.
>

Yep, makes sense.

>> >> But I'm not sure it'll address all cases - the problem is that those
>> >> places add the explicit sort because they need sorted input. Gather
>> >> Merge does not do that, it only preserves existing ordering of paths.
>> >>
>> >> So it's possible the path does not have an explicit sort on top, and
>> >> gather merge will not know to add it. And once we have the gather merge
>> >> in place, we can't push the sort "under" it.
>> >
>> >That's the explanation I was missing; and it makes sense (to restate:
>> >sometimes sorting is useful even when not required for correctness of
>> >the user returned data).
>> >
>>
>> Yes, although even when the sorting is required for correctness (because
>> the user specified ORDER BY) you can do it at different points in the
>> plan. We'd still produce correct results, but the sort might be done at
>> the very end without these changes.
>>
>> For example we might end up with plans
>>
>>    Incremental Sort (presorted: a, path keys: a,b)
>>     -> Gather Merge (path keys: a)
>>         -> Index Scan (path keys: a)
>>
>> but with those changes we might push the incremental sort down into the
>> parallel part:
>>
>>    Gather Merge (path keys: a,b)
>>     -> Incremental Sort (presorted: a, path keys: a,b)
>>         -> Index Scan (path keys: a)
>>
>> which is likely better. Perhaps that's what you meant, though.
>
>I was thinking of ordering being useful for grouping/aggregation or
>merge joins; I didn't realize the above plan wasn't possible yet, so
>that explanation is helpful.
>
>> >> And I think I know why is that - while gather_grouping_paths() tries to
>> >> add explicit sort below the gather merge, there are other places that
>> >> call generate_gather_paths() that don't do that. In this case it's
>> >> probably apply_scanjoin_target_to_paths() which simply builds
>> >>
>> >>    parallel (seq|index) scan + gather merge
>> >>
>> >> and that's it. The problem is likely the same - the code does not know
>> >> which pathkeys are "interesting" at that point. We probably need to
>> >> teach planner to do this.
>> >
>> >I had also noticed that that was an obvious place where
>> >generate_gather_paths() was used but an explicit sort wasn't also
>> >added separately, which makes me think the division of labor is
>> >probably currently wrong regardless of the incremental sort patch.
>> >
>> >Do you agree? Should we try to fix that (likely with your new
>> >"interesting paths" version of generate_gather_paths()) first as a
>> >prefix patch to adding incremental sort?
>> >
>>
>> I'm not sure what the generate_useful_gather_paths() should do but does
>> not? Or is it just the division of labor that you think is wrong? In any
>> case, feel free to whack it until you're happy with it.
>
>Oh, I didn't mean to imply generate_useful_gather_paths() was
>deficient in some way; I was noting that I'd noticed that
>apply_scanjoin_target_to_paths() didn't consider explicit sort +
>gather merge on master, and that maybe that was an unintentional miss
>and something that should be remedied.
>
>And if it is a miss, then that would demonstrate in my mind that the
>addition of generate_useful_gather_paths() would be a helpful refactor
>as a standalone patch even without incremental sort. So "currently
>wrong" above meant "on master" not "in the current version of the
>patch."
>

Ah, OK. I don't know if it's a miss or intentional omission - it might
be a mix of both, actually. I wouldn't be surprised if doing that was
considered and deemed not worth it, but if we can come up with a
plan where adding an explicit sort here helps ...

>> >> Attached is a slightly modified patch series:
>> >>
>> >> 1) 0001 considers incremental sort paths in various places (adds the new
>> >> generate_useful_gather_paths and modifies places calling create_sort_path)
>> >
>> >I need to spend some decent time digesting this patch, but the concept
>> >sounds very useful.
>> >
>>
>> OK.
>
>I've been reading this several times + stepping through with the
>debugger to understand when this is useful, but I have a few
>questions.
>
>The first case considered in get_useful_pathkeys_for_relation (which
>considers root->query_pathkeys) makes a lot of sense -- obviously if
>we want sorted results then it's useful to consider both full and
>incremental sort.
>
>But I'm not sure I see yet a way to trigger the second case (which
>uses get_useful_ecs_for_relation to build pathkeys potentially useful
>for merge joins). In the FDW case we need to consider it since we want
>to avoid local sort and so want to see if the foreign server might be
>able to provide us useful presorted data, but in the local case I
>don't think that's useful. From what I can tell merge join path
>costing internally considers possible sorts of both the inner and
>outer input paths, and merge join plan node creation is responsible
>for building explicit sort nodes as necessary (i.e., there are no
>explicit sort paths created for merge join paths.) That means that,
>for example, a query like:
>
>select * from
>(select * from j2 order by j2.t, j2.a) j2
>join (select * from j1 order by j1.t) j1
>    on j1.t = j2.t and j1.a = j2.a;
>
>don't consider incremental sort for the merge join path (I disabled
>hash joins, nested loops, and full sorts testing that on an empty
>table just to easily force a merge join plan). And unfortunately I
>don't think there's an easy way to remedy that: from what I can tell
>it'd be a pretty invasive patch requiring refactoring merge join
>costing to consider both kinds of sorting (in the most simple
>implementation that would mean considering up to 4x as many merge join
>paths -- inner/outer sorted by: full/full, full/incremental,
>incremental/full, and incremental/incremental). Given that's a
>significant undertaking on its own, I think I'm going to avoid
>addressing it as part of this patch.
>
>If it's true that the get_useful_ecs_for_relation part of that logic
>isn't actually exercisable currently, that that would cut down
>significantly on the amount of code that needs to be added for
>consideration of valid gather merge paths. But if you can think of a
>counterexample, please let me know.
>

It's quite possible parts of the code are not needed - I've pretty much
just copied the code and did minimal changes to get it working.

That being said, it's not clear to me why plans like this would not be
useful (or why would it require changes to merge join costing):

    Merge Join
      -> Gather Merge
          -> Incremental Sort
      -> Gather Merge
          -> Incremental Sort

But I have not checked the code, so maybe it would, in which case I
think it's OK to skip it in this patch (or at least in v1 of it).


>> >> 2) 0002 and 0003 are fixes I mentioned before
>
>I'm incorporating those with a bit of additional cleanup.
>
>> >> 3) 0004 adds a new GUC force_incremental_sort that (when set to 'on')
>> >> tries to nudge the optimizer into using incremental sort by essentially
>> >> making it free (i.e. using startup/total costs of the subpath). I've found
>> >> this useful when trying to force incremental sorts into plans where it may
>> >> not be the best strategy.
>> >
>> >That will be super helpful. I do wonder if we need to expose (in the
>> >production patch) a GUC of some kind to adjust incremental sort
>> >costing so that users can try to tweak when it is preferred over
>> >regular sort.
>> >
>>
>> This GUC is really meant primarily for development, to force choice of
>> incremental sort during regression tests (so as to use incremental sort in
>> as many plans as possible). I'd remove it from the final patch. I think
>> the general consensus on pgsql-hackers is that we should not introduce
>> GUCs unless absolutely necessary. But for development GUCs - sure.
>>
>> FWIW I'm not sure it's a good idea to look at both enable_incremental_sort
>> and enable_sort in cost_incremental_sort(). Not only end up with
>> disable_cost twice when both GUCs are set to 'off' at the moment, but it
>> might be useful to be able to disable those two sort types independently.
>> For example you might set just enable_sort=off and we'd still generate
>> incremental sort paths.
>
>That would cover the usage case I was getting at. Having enable_sort
>disable incremental sort also came in without much explanation [1]:
>
>On Fri, Apr 6, 2018 at 11:40 PM, Alexander Kuzmenkov <
>a(dot)kuzmenkov(at)postgrespro(dot)ru> wrote:
>> Also some other changes from me:
>> ...
>> enable_sort should act as a cost-based soft disable for both incremental
>> and normal sort.
>
>I wasn't sure that fully made sense to me, but was assuming the idea
>was to effectively not introduce a regression for anyone already
>disabling sort to force a specific plan shape. That being said, any
>new execution node/planning feature can cause a regression in such
>"hinted" queries, so I'm not sure that's a good reason on its own. In
>any case, incremental sort is different enough in performance
>qualities that I think you'd want to re-evaluate possible plans in
>queries where enable_sort=off is useful, so I'm going make incremental
>sort independent of enable_sort unless there's a strong objection
>here.
>

OK

>Tangentially: I'd almost expect enable_incremental_sort to act as a
>hard disable (and not even generate the paths) rather than a soft
>cost-based disable, since while standard sort is the most basic
>operation that needs to always be available as a last resort the same
>is not true for incremental sort...
>

Good point. It's somewhat similar to options like enable_parallel_hash
which also are "hard" switches (i.e. not cost-based penalties).

>> Another thing we should have is a collection of tests with data sets that
>> "break" the costing model in some way (skew, correlated columns,
>> non-uniform group sizes, ...). That's something not meant for commit,
>> because it'll probably require significant amounts of data, but we need it
>> to asses the quality of the planner/costing part. I know there are various
>> ad-hoc test cases in the thread history, it'd be good to consolidate that
>> into once place.
>
>Agreed.
>
>I'm continuing to work on the planning side of this with the goal of
>not needing to modify too many places to consider an incremental sort
>path + considering which ones are most likely to be useful, but I
>wanted to get my questions about get_useful_ecs_for_relation sent out
>while I work on that.
>
>If we end up wanting to limit the number of places we consider
>incremental sort (whether for planning time or merely for size of the
>initial patch, do you have any thoughts on what general areas we
>should consider most valuable? Besides the obvious LIMIT case aother
>area that might benefit was min/max, though I'm not sure yet at the
>moment if that would really end up meaning considering it all over the
>place.
>

OK, sounds like a plan!


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sun, Jul 14, 2019 at 10:16 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> >> >> Attached is a slightly modified patch series:
> >> >>
> >> >> 1) 0001 considers incremental sort paths in various places (adds the new
> >> >> generate_useful_gather_paths and modifies places calling create_sort_path)
> >> >
> >> >I need to spend some decent time digesting this patch, but the concept
> >> >sounds very useful.
> >> >
> >>
> >> OK.
> >
> >I've been reading this several times + stepping through with the
> >debugger to understand when this is useful, but I have a few
> >questions.
> >
> >The first case considered in get_useful_pathkeys_for_relation (which
> >considers root->query_pathkeys) makes a lot of sense -- obviously if
> >we want sorted results then it's useful to consider both full and
> >incremental sort.
> >
> >But I'm not sure I see yet a way to trigger the second case (which
> >uses get_useful_ecs_for_relation to build pathkeys potentially useful
> >for merge joins). In the FDW case we need to consider it since we want
> >to avoid local sort and so want to see if the foreign server might be
> >able to provide us useful presorted data, but in the local case I
> >don't think that's useful. From what I can tell merge join path
> >costing internally considers possible sorts of both the inner and
> >outer input paths, and merge join plan node creation is responsible
> >for building explicit sort nodes as necessary (i.e., there are no
> >explicit sort paths created for merge join paths.) That means that,
> >for example, a query like:
> >
> >select * from
> >(select * from j2 order by j2.t, j2.a) j2
> >join (select * from j1 order by j1.t) j1
> >    on j1.t = j2.t and j1.a = j2.a;
> >
> >don't consider incremental sort for the merge join path (I disabled
> >hash joins, nested loops, and full sorts testing that on an empty
> >table just to easily force a merge join plan). And unfortunately I
> >don't think there's an easy way to remedy that: from what I can tell
> >it'd be a pretty invasive patch requiring refactoring merge join
> >costing to consider both kinds of sorting (in the most simple
> >implementation that would mean considering up to 4x as many merge join
> >paths -- inner/outer sorted by: full/full, full/incremental,
> >incremental/full, and incremental/incremental). Given that's a
> >significant undertaking on its own, I think I'm going to avoid
> >addressing it as part of this patch.
> >
> >If it's true that the get_useful_ecs_for_relation part of that logic
> >isn't actually exercisable currently, that that would cut down
> >significantly on the amount of code that needs to be added for
> >consideration of valid gather merge paths. But if you can think of a
> >counterexample, please let me know.
> >
>
> It's quite possible parts of the code are not needed - I've pretty much
> just copied the code and did minimal changes to get it working.
>
> That being said, it's not clear to me why plans like this would not be
> useful (or why would it require changes to merge join costing):
>
>     Merge Join
>       -> Gather Merge
>           -> Incremental Sort
>       -> Gather Merge
>           -> Incremental Sort
>
> But I have not checked the code, so maybe it would, in which case I
> think it's OK to skip it in this patch (or at least in v1 of it).

Someone could correct me if I'm wrong, but I've noticed some comments
that seem to imply we avoid plans with multiple parallel gather
[merge]s; i.e., we try to put a single parallel node as high as
possible in the plan. I assume that's to avoid multiplying out the
number of workers we might consume. And in the sample query above that
kind of plan never got considered because there were no partial paths
to loop through (I'm not sure I fully understand why) when the new
code is called from under apply_scanjoin_target_to_paths().

Of course, we should consider non-parallel incremental sort inputs to
merge joins...but as I noted that's a lot of extra work...

> >> FWIW I'm not sure it's a good idea to look at both enable_incremental_sort
> >> and enable_sort in cost_incremental_sort(). Not only end up with
> >> disable_cost twice when both GUCs are set to 'off' at the moment, but it
> >> might be useful to be able to disable those two sort types independently.
> >> For example you might set just enable_sort=off and we'd still generate
> >> incremental sort paths.
> >
> >That would cover the usage case I was getting at. Having enable_sort
> >disable incremental sort also came in without much explanation [1]:
> >
> >On Fri, Apr 6, 2018 at 11:40 PM, Alexander Kuzmenkov <
> >a(dot)kuzmenkov(at)postgrespro(dot)ru> wrote:
> >> Also some other changes from me:
> >> ...
> >> enable_sort should act as a cost-based soft disable for both incremental
> >> and normal sort.
> >
> >I wasn't sure that fully made sense to me, but was assuming the idea
> >was to effectively not introduce a regression for anyone already
> >disabling sort to force a specific plan shape. That being said, any
> >new execution node/planning feature can cause a regression in such
> >"hinted" queries, so I'm not sure that's a good reason on its own. In
> >any case, incremental sort is different enough in performance
> >qualities that I think you'd want to re-evaluate possible plans in
> >queries where enable_sort=off is useful, so I'm going make incremental
> >sort independent of enable_sort unless there's a strong objection
> >here.
> >
>
> OK
>
> >Tangentially: I'd almost expect enable_incremental_sort to act as a
> >hard disable (and not even generate the paths) rather than a soft
> >cost-based disable, since while standard sort is the most basic
> >operation that needs to always be available as a last resort the same
> >is not true for incremental sort...
> >
>
> Good point. It's somewhat similar to options like enable_parallel_hash
> which also are "hard" switches (i.e. not cost-based penalties).

I assume the proper approach here then is to check the GUC before
building and adding the path?

> >If we end up wanting to limit the number of places we consider
> >incremental sort (whether for planning time or merely for size of the
> >initial patch, do you have any thoughts on what general areas we
> >should consider most valuable? Besides the obvious LIMIT case aother
> >area that might benefit was min/max, though I'm not sure yet at the
> >moment if that would really end up meaning considering it all over the
> >place.
> >
>
> OK, sounds like a plan!

Did you have any thoughts on the question about where this is likely
to be most valuable?

James Coleman



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Mon, Jul 15, 2019 at 09:25:32AM -0400, James Coleman wrote:
>On Sun, Jul 14, 2019 at 10:16 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> >> >> Attached is a slightly modified patch series:
>> >> >>
>> >> >> 1) 0001 considers incremental sort paths in various places (adds the new
>> >> >> generate_useful_gather_paths and modifies places calling create_sort_path)
>> >> >
>> >> >I need to spend some decent time digesting this patch, but the concept
>> >> >sounds very useful.
>> >> >
>> >>
>> >> OK.
>> >
>> >I've been reading this several times + stepping through with the
>> >debugger to understand when this is useful, but I have a few
>> >questions.
>> >
>> >The first case considered in get_useful_pathkeys_for_relation (which
>> >considers root->query_pathkeys) makes a lot of sense -- obviously if
>> >we want sorted results then it's useful to consider both full and
>> >incremental sort.
>> >
>> >But I'm not sure I see yet a way to trigger the second case (which
>> >uses get_useful_ecs_for_relation to build pathkeys potentially useful
>> >for merge joins). In the FDW case we need to consider it since we want
>> >to avoid local sort and so want to see if the foreign server might be
>> >able to provide us useful presorted data, but in the local case I
>> >don't think that's useful. From what I can tell merge join path
>> >costing internally considers possible sorts of both the inner and
>> >outer input paths, and merge join plan node creation is responsible
>> >for building explicit sort nodes as necessary (i.e., there are no
>> >explicit sort paths created for merge join paths.) That means that,
>> >for example, a query like:
>> >
>> >select * from
>> >(select * from j2 order by j2.t, j2.a) j2
>> >join (select * from j1 order by j1.t) j1
>> >    on j1.t = j2.t and j1.a = j2.a;
>> >
>> >don't consider incremental sort for the merge join path (I disabled
>> >hash joins, nested loops, and full sorts testing that on an empty
>> >table just to easily force a merge join plan). And unfortunately I
>> >don't think there's an easy way to remedy that: from what I can tell
>> >it'd be a pretty invasive patch requiring refactoring merge join
>> >costing to consider both kinds of sorting (in the most simple
>> >implementation that would mean considering up to 4x as many merge join
>> >paths -- inner/outer sorted by: full/full, full/incremental,
>> >incremental/full, and incremental/incremental). Given that's a
>> >significant undertaking on its own, I think I'm going to avoid
>> >addressing it as part of this patch.
>> >
>> >If it's true that the get_useful_ecs_for_relation part of that logic
>> >isn't actually exercisable currently, that that would cut down
>> >significantly on the amount of code that needs to be added for
>> >consideration of valid gather merge paths. But if you can think of a
>> >counterexample, please let me know.
>> >
>>
>> It's quite possible parts of the code are not needed - I've pretty much
>> just copied the code and did minimal changes to get it working.
>>
>> That being said, it's not clear to me why plans like this would not be
>> useful (or why would it require changes to merge join costing):
>>
>>     Merge Join
>>       -> Gather Merge
>>           -> Incremental Sort
>>       -> Gather Merge
>>           -> Incremental Sort
>>
>> But I have not checked the code, so maybe it would, in which case I
>> think it's OK to skip it in this patch (or at least in v1 of it).
>
>Someone could correct me if I'm wrong, but I've noticed some comments
>that seem to imply we avoid plans with multiple parallel gather
>[merge]s; i.e., we try to put a single parallel node as high as
>possible in the plan. I assume that's to avoid multiplying out the
>number of workers we might consume. And in the sample query above that
>kind of plan never got considered because there were no partial paths
>to loop through (I'm not sure I fully understand why) when the new
>code is called from under apply_scanjoin_target_to_paths().
>

I think you're probably right - in this case it'd be more efficient to just
do one Gather Merge on top of the merge join. And yes - we try to put the
Gather node as high as possible, to parallelize as large part of a query
as possible. Keeping the number of necessary workers low is one reason,
but it's also about the serial part in Amdahl's law.

>Of course, we should consider non-parallel incremental sort inputs to
>merge joins...but as I noted that's a lot of extra work...
>

OK, understood. I agree such plans would be useful, but if you think it'd
be a lot of extra work, then we can leave it out for now.

Although, looking at the mergejoin path construction code, it does not
seem very complex. Essentially, generate_mergejoin_paths() would need to
consider adding incremental sort on inner/outer path. It would need some
refactoring, but it's not clear to me why this would need changes to
costing? Essentially, we'd produce multiple mergejoin paths that would be
costed by the current code.

That being said, I'm perfectly fine with ignoring this for now. There are
more fundamental bits that we need to tackle first.

>> >> FWIW I'm not sure it's a good idea to look at both enable_incremental_sort
>> >> and enable_sort in cost_incremental_sort(). Not only end up with
>> >> disable_cost twice when both GUCs are set to 'off' at the moment, but it
>> >> might be useful to be able to disable those two sort types independently.
>> >> For example you might set just enable_sort=off and we'd still generate
>> >> incremental sort paths.
>> >
>> >That would cover the usage case I was getting at. Having enable_sort
>> >disable incremental sort also came in without much explanation [1]:
>> >
>> >On Fri, Apr 6, 2018 at 11:40 PM, Alexander Kuzmenkov <
>> >a(dot)kuzmenkov(at)postgrespro(dot)ru> wrote:
>> >> Also some other changes from me:
>> >> ...
>> >> enable_sort should act as a cost-based soft disable for both incremental
>> >> and normal sort.
>> >
>> >I wasn't sure that fully made sense to me, but was assuming the idea
>> >was to effectively not introduce a regression for anyone already
>> >disabling sort to force a specific plan shape. That being said, any
>> >new execution node/planning feature can cause a regression in such
>> >"hinted" queries, so I'm not sure that's a good reason on its own. In
>> >any case, incremental sort is different enough in performance
>> >qualities that I think you'd want to re-evaluate possible plans in
>> >queries where enable_sort=off is useful, so I'm going make incremental
>> >sort independent of enable_sort unless there's a strong objection
>> >here.
>> >
>>
>> OK
>>
>> >Tangentially: I'd almost expect enable_incremental_sort to act as a
>> >hard disable (and not even generate the paths) rather than a soft
>> >cost-based disable, since while standard sort is the most basic
>> >operation that needs to always be available as a last resort the same
>> >is not true for incremental sort...
>> >
>>
>> Good point. It's somewhat similar to options like enable_parallel_hash
>> which also are "hard" switches (i.e. not cost-based penalties).
>
>I assume the proper approach here then is to check the GUC before
>building and adding the path?
>

Yeah. The simplest way to do that might be setting the number of presorted
keys to 0 in pathkeys_common_contained_in(). That effectively says it
makes no sense to do incremental sort. It's a bit too deep, though - I'm not
sure it doesn't affect any other plans.

Or you might reference the GUC in every place that considers incremental
sort, but that's going to be a lot of places.
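
Either way, the user-visible difference should be easy to check with a
hypothetical session like this (assuming the GUC name used in this patch):

    set enable_incremental_sort = off;
    explain select * from t order by a, b;

With a cost-based (soft) disable the Incremental Sort node could in principle
still show up, just with disable_cost added; with a hard disable the path is
simply never generated, so the node cannot appear.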

>> >If we end up wanting to limit the number of places we consider
>> >incremental sort (whether for planning time or merely for size of the
>> >initial patch), do you have any thoughts on what general areas we
>> >should consider most valuable? Besides the obvious LIMIT case another
>> >area that might benefit was min/max, though I'm not sure yet at the
>> >moment if that would really end up meaning considering it all over the
>> >place.
>> >
>>
>> OK, sounds like a plan!
>
>Did you have any thoughts on the question about where this is likely
>to be most valuable?
>

Good question. I can certainly list some generic cases where I'd expect
incremental sort to be most beneficial:

1) (ORDER BY + LIMIT) - in this case the main advantage is low startup
cost (compared to explicit sort, which has to fetch/sort everything)

2) ORDER BY with on-disk sort - in this case the main advantage is ability
to sort data in small chunks that fit in memory, instead of flushing large
amounts of data into temporary files

Of course, (2) helps with any operation that can leverage the ordering,
e.g. aggregation and so on.
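
For instance, with the t(a,b,c) table and the index on (a) used elsewhere
in this thread, the two cases map to queries like these (illustrative
only):

    -- (1) low startup cost: only the first group(s) of "a" have to be
    -- sorted before we can start emitting rows
    explain select * from t order by a, b limit 10;

    -- (2) bounded memory: each group of "a" is sorted separately, so the
    -- sort may stay in work_mem instead of spilling one huge sort to disk
    explain select * from t order by a, b;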

But the question is how to map this to places in the source code. I don't
have a very good answer to that :-(

IMO the best thing we can do is collect some realistic queries and address
those first, and then eventually add incremental sort to other places.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Now, consider this example:
>
>   create table t (a int, b int, c int);
>   insert into t select mod(i,100),mod(i,100),i from generate_series(1,10000000) s(i);
>   create index on t (a);
>   analyze t;
>   explain select a,b,sum(c) from t group by 1,2 order by 1,2,3 limit 1;
>
> With 0001+0002+0003 pathes, I get a plan like this:
>
>                                                      QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=10375.39..10594.72 rows=1 width=16)
>    ->  Incremental Sort  (cost=10375.39..2203675.71 rows=10000 width=16)
>          Sort Key: a, b, (sum(c))
>          Presorted Key: a, b
>          ->  GroupAggregate  (cost=10156.07..2203225.71 rows=10000 width=16)
>                Group Key: a, b
>                ->  Gather Merge  (cost=10156.07..2128124.39 rows=10000175 width=12)
>                      Workers Planned: 2
>                      ->  Incremental Sort  (cost=9156.04..972856.05 rows=4166740 width=12)
>                            Sort Key: a, b
>                            Presorted Key: a
>                            ->  Parallel Index Scan using t_a_idx on t  (cost=0.43..417690.30 rows=4166740 width=12)
> (12 rows)
>
>
> and with 0004, I get this:
>
>                                               QUERY PLAN
> ------------------------------------------------------------------------------------------------------
>  Limit  (cost=20443.84..20665.32 rows=1 width=16)
>    ->  Incremental Sort  (cost=20443.84..2235250.05 rows=10000 width=16)
>          Sort Key: a, b, (sum(c))
>          Presorted Key: a, b
>          ->  GroupAggregate  (cost=20222.37..2234800.05 rows=10000 width=16)
>                Group Key: a, b
>                ->  Incremental Sort  (cost=20222.37..2159698.74 rows=10000175 width=12)
>                      Sort Key: a, b
>                      Presorted Key: a
>                      ->  Index Scan using t_a_idx on t  (cost=0.43..476024.65 rows=10000175 width=12)
> (10 rows)
>
> Notice that cost of the second plan is almost double the first one. That
> means 0004 does not even generate the first plan, i.e. there are cases
> where we don't try to add the explicit sort before passing the path to
> generate_gather_paths().
>
> And I think I know why is that - while gather_grouping_paths() tries to
> add explicit sort below the gather merge, there are other places that
> call generate_gather_paths() that don't do that. In this case it's
> probably apply_scanjoin_target_to_paths() which simply builds
>
>    parallel (seq|index) scan + gather merge
>
> and that's it. The problem is likely the same - the code does not know
> which pathkeys are "interesting" at that point. We probably need to
> teach planner to do this.

I've been working on figuring out sample queries for each of the
places we're looking at adding create_increment_sort() (starting with
the cases enabled by gather-merge nodes). The
generate_useful_gather_paths() call in
apply_scanjoin_target_to_paths() is required to generate the above
preferred plan.

But I found that if I set enable_sort=off (with or without the
_useful_ variant of generate_gather_paths()) I get a very similar plan
that's actually even lower cost:

                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10255.98..10355.77 rows=1 width=16)
   ->  Incremental Sort  (cost=10255.98..1008228.67 rows=10000 width=16)
         Sort Key: a, b, (sum(c))
         Presorted Key: a, b
         ->  Finalize GroupAggregate  (cost=10156.20..1007778.67 rows=10000 width=16)
               Group Key: a, b
               ->  Gather Merge  (cost=10156.20..1007528.67 rows=20000 width=16)
                     Workers Planned: 2
                     ->  Partial GroupAggregate  (cost=9156.18..1004220.15 rows=10000 width=16)
                           Group Key: a, b
                           ->  Incremental Sort  (cost=9156.18..972869.60 rows=4166740 width=12)
                                 Sort Key: a, b
                                 Presorted Key: a
                                 ->  Parallel Index Scan using t_a_idx on t  (cost=0.43..417703.85 rows=4166740 width=12)

Is that something we should consider a bug at this stage? It's also
not clear to me (costing aside) which plan is intuitively preferable.

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Fri, Jul 19, 2019 at 04:59:21PM -0400, James Coleman wrote:
>On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> Now, consider this example:
>>
>>   create table t (a int, b int, c int);
>>   insert into t select mod(i,100),mod(i,100),i from generate_series(1,10000000) s(i);
>>   create index on t (a);
>>   analyze t;
>>   explain select a,b,sum(c) from t group by 1,2 order by 1,2,3 limit 1;
>>
>> With 0001+0002+0003 pathes, I get a plan like this:
>>
>>                                                      QUERY PLAN
>>
--------------------------------------------------------------------------------------------------------------------
>>  Limit  (cost=10375.39..10594.72 rows=1 width=16)
>>    ->  Incremental Sort  (cost=10375.39..2203675.71 rows=10000 width=16)
>>          Sort Key: a, b, (sum(c))
>>          Presorted Key: a, b
>>          ->  GroupAggregate  (cost=10156.07..2203225.71 rows=10000 width=16)
>>                Group Key: a, b
>>                ->  Gather Merge  (cost=10156.07..2128124.39 rows=10000175 width=12)
>>                      Workers Planned: 2
>>                      ->  Incremental Sort  (cost=9156.04..972856.05 rows=4166740 width=12)
>>                            Sort Key: a, b
>>                            Presorted Key: a
>>                            ->  Parallel Index Scan using t_a_idx on t  (cost=0.43..417690.30 rows=4166740 width=12)
>> (12 rows)
>>
>>
>> and with 0004, I get this:
>>
>>                                               QUERY PLAN
>> ------------------------------------------------------------------------------------------------------
>>  Limit  (cost=20443.84..20665.32 rows=1 width=16)
>>    ->  Incremental Sort  (cost=20443.84..2235250.05 rows=10000 width=16)
>>          Sort Key: a, b, (sum(c))
>>          Presorted Key: a, b
>>          ->  GroupAggregate  (cost=20222.37..2234800.05 rows=10000 width=16)
>>                Group Key: a, b
>>                ->  Incremental Sort  (cost=20222.37..2159698.74 rows=10000175 width=12)
>>                      Sort Key: a, b
>>                      Presorted Key: a
>>                      ->  Index Scan using t_a_idx on t  (cost=0.43..476024.65 rows=10000175 width=12)
>> (10 rows)
>>
>> Notice that cost of the second plan is almost double the first one. That
>> means 0004 does not even generate the first plan, i.e. there are cases
>> where we don't try to add the explicit sort before passing the path to
>> generate_gather_paths().
>>
>> And I think I know why is that - while gather_grouping_paths() tries to
>> add explicit sort below the gather merge, there are other places that
>> call generate_gather_paths() that don't do that. In this case it's
>> probably apply_scanjoin_target_to_paths() which simply builds
>>
>>    parallel (seq|index) scan + gather merge
>>
>> and that's it. The problem is likely the same - the code does not know
>> which pathkeys are "interesting" at that point. We probably need to
>> teach planner to do this.
>
>I've been working on figuring out sample queries for each of the
>places we're looking at adding create_increment_sort() (starting with
>the cases enabled by gather-merge nodes). The
>generate_useful_gather_paths() call in
>apply_scanjoin_target_to_paths() is required to generate the above
>preferred plan.
>
>But I found that if I set enable_sort=off (with our without the
>_useful_ variant of generate_gather_paths()) I get a very similar plan
>that's actually lower cost yet:
>
>                                                        QUERY PLAN

>--------------------------------------------------------------------------------------------------------------------------
> Limit  (cost=10255.98..10355.77 rows=1 width=16)
>   ->  Incremental Sort  (cost=10255.98..1008228.67 rows=10000 width=16)
>         Sort Key: a, b, (sum(c))
>         Presorted Key: a, b
>         ->  Finalize GroupAggregate  (cost=10156.20..1007778.67
>rows=10000 width=16)
>               Group Key: a, b
>               ->  Gather Merge  (cost=10156.20..1007528.67 rows=20000 width=16)
>                     Workers Planned: 2
>                     ->  Partial GroupAggregate
>(cost=9156.18..1004220.15 rows=10000 width=16)
>                           Group Key: a, b
>                           ->  Incremental Sort
>(cost=9156.18..972869.60 rows=4166740 width=12)
>                                 Sort Key: a, b
>                                 Presorted Key: a
>                                 ->  Parallel Index Scan using t_a_idx
>on t  (cost=0.43..417703.85 rows=4166740 width=12)
>
>Is that something we should consider a bug at this stage? It's also
>not clear to mean (costing aside) which plan is intuitively
>preferable.
>

This seems like a thinko in add_partial_path() - it only looks at total
cost of the paths, and ignores the startup cost entirely. I've debugged
it a bit, and what's happening for the partially-grouped relation is
roughly this:

1) We add a partial path with startup/total costs 696263 / 738029

2) We attempt to add the "Partial GroupAggregate" path, but it loses the
fight because its total cost (1004207) is higher than the first path's,
which entirely misses that its startup cost is way lower.

3) We do, however, use the startup cost later when computing the LIMIT
cost (which is a linear interpolation between the startup and total cost),
and there we reject the first path too, because we happen to find something
cheaper - though still more expensive than what we'd have gotten from the
path rejected in (2).
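
For reference, the LIMIT costing mentioned in (3) is roughly a linear
interpolation between the startup and total cost of the input path,
something like

    limit_cost(k) ~= startup + (total - startup) * k / rows

so for the Incremental Sort path in the enable_sort=off plan quoted above
(10255.98..1008228.67, 10000 rows) a LIMIT 1 comes out at roughly
10255.98 + (1008228.67 - 10255.98) / 10000 = 10355.8, matching the Limit
node's total cost. In other words, a path with a much higher total cost
can easily win once the LIMIT is applied, as long as its startup cost is
low.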

Attached is a debug patch which makes this clear - it only prints info
about the first step of partial aggregation, because that's what matters in
this example. You should see something like this:

WARNING:  rel 0x2aa8e00 adding partial agg path 0x2aa9448 startup 696263.839993 total 738029.919993
WARNING:  rel 0x2aa8e00 path 0x2aa9448 adding new = 1
WARNING:  rel 0x2aa8e00 adding partial agg path 0x2aa9710 startup 9156.084995 total 1004207.854495
WARNING:  A: new path 0x2aa9710 rejected because of 0x2aa9448
WARNING:  rel 0x2aa8e00 path 0x2aa9710 adding new = 0

which essentially says "path rejected because of total cost" (and you
know it's the interesting partial aggregate from the second plan). And
if you disable sort, you get this:

WARNING:  rel 0x2aa8e00 adding partial agg path 0x2aa9448 startup 10000696263.839994 total 10000738029.919996
WARNING:  rel 0x2aa8e00 path 0x2aa9448 adding new = 1
WARNING:  rel 0x2aa8e00 adding partial agg path 0x2aa9710 startup 9156.084995 total 1004207.854495
WARNING:  rel 0x2aa8e00 path 0x2aa9710 adding new = 1
...

So in this case we decided the path is interesting, thanks to the
increased cost of sort.

The comment for add_partial_path says this:

 *  Neither do we need to consider startup costs: parallelism is only
 *  used for plans that will be run to completion.  Therefore, this
 *  routine is much simpler than add_path: it needs to consider only
 *  pathkeys and total cost.

I think this may be a thinko, as this plan demonstrates - but I'm not
sure about it. I wonder if this might be penalizing some other types of
plans (essentially anything with limit + gather).

Attached is a WIP patch fixing this by considering both startup and
total cost (by calling compare_path_costs_fuzzily).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sat, Jul 20, 2019 at 9:22 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Fri, Jul 19, 2019 at 04:59:21PM -0400, James Coleman wrote:
> >On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >> Now, consider this example:
> >>
> >>   create table t (a int, b int, c int);
> >>   insert into t select mod(i,100),mod(i,100),i from generate_series(1,10000000) s(i);
> >>   create index on t (a);
> >>   analyze t;
> >>   explain select a,b,sum(c) from t group by 1,2 order by 1,2,3 limit 1;
> >>
> >> With 0001+0002+0003 pathes, I get a plan like this:
> >>
> >>                                                      QUERY PLAN
> >>
--------------------------------------------------------------------------------------------------------------------
> >>  Limit  (cost=10375.39..10594.72 rows=1 width=16)
> >>    ->  Incremental Sort  (cost=10375.39..2203675.71 rows=10000 width=16)
> >>          Sort Key: a, b, (sum(c))
> >>          Presorted Key: a, b
> >>          ->  GroupAggregate  (cost=10156.07..2203225.71 rows=10000 width=16)
> >>                Group Key: a, b
> >>                ->  Gather Merge  (cost=10156.07..2128124.39 rows=10000175 width=12)
> >>                      Workers Planned: 2
> >>                      ->  Incremental Sort  (cost=9156.04..972856.05 rows=4166740 width=12)
> >>                            Sort Key: a, b
> >>                            Presorted Key: a
> >>                            ->  Parallel Index Scan using t_a_idx on t  (cost=0.43..417690.30 rows=4166740
width=12)
> >> (12 rows)
> >>
> >>
> >> and with 0004, I get this:
> >>
> >>                                               QUERY PLAN
> >> ------------------------------------------------------------------------------------------------------
> >>  Limit  (cost=20443.84..20665.32 rows=1 width=16)
> >>    ->  Incremental Sort  (cost=20443.84..2235250.05 rows=10000 width=16)
> >>          Sort Key: a, b, (sum(c))
> >>          Presorted Key: a, b
> >>          ->  GroupAggregate  (cost=20222.37..2234800.05 rows=10000 width=16)
> >>                Group Key: a, b
> >>                ->  Incremental Sort  (cost=20222.37..2159698.74 rows=10000175 width=12)
> >>                      Sort Key: a, b
> >>                      Presorted Key: a
> >>                      ->  Index Scan using t_a_idx on t  (cost=0.43..476024.65 rows=10000175 width=12)
> >> (10 rows)
> >>
> >> Notice that cost of the second plan is almost double the first one. That
> >> means 0004 does not even generate the first plan, i.e. there are cases
> >> where we don't try to add the explicit sort before passing the path to
> >> generate_gather_paths().
> >>
> >> And I think I know why is that - while gather_grouping_paths() tries to
> >> add explicit sort below the gather merge, there are other places that
> >> call generate_gather_paths() that don't do that. In this case it's
> >> probably apply_scanjoin_target_to_paths() which simply builds
> >>
> >>    parallel (seq|index) scan + gather merge
> >>
> >> and that's it. The problem is likely the same - the code does not know
> >> which pathkeys are "interesting" at that point. We probably need to
> >> teach planner to do this.
> >
> >I've been working on figuring out sample queries for each of the
> >places we're looking at adding create_increment_sort() (starting with
> >the cases enabled by gather-merge nodes). The
> >generate_useful_gather_paths() call in
> >apply_scanjoin_target_to_paths() is required to generate the above
> >preferred plan.

As I continue this, I've added a couple of test cases (notably for
generate_useful_gather_paths() in both standard_join_search() and
apply_scanjoin_target_to_paths()). Those, plus the current WIP state
of my hacking on your patch adding generate_useful_gather_paths(), are
attached as 0001-parallel-and-more-paths.patch.

My current line of investigation is whether we need to do anything in
the parallel portion of create_ordered_paths(). I noticed that the
first-pass patch adding generate_useful_gather_paths() modified that
section but wasn't actually adding any new gather-merge paths (just
bare incremental sort paths). That seems pretty clearly just a
prototype miss, so I modified the prototype to build gather-merge
paths instead (as a side note that change seems to fix an oddity I was
seeing where plans would include a parallel index scan node even
though they weren't parallel plans). While the resulting plan for
something like:

explain analyze select * from t where t.a in (1,2,3,4,5,6) order by
t.a, t.b limit 50;

changes cost (to be cheaper) ever so slightly with the gather-merge
addition to create_ordered_paths(), the plan itself is otherwise
identical (including row estimates):

Limit
  -> Gather Merge
       -> Incremental Sort
          -> Parallel Index Scan

(Note: I'm forcing parallel plans here with: set
max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0;
set parallel_tuple_cost=0; set parallel_setup_cost=0; set
min_parallel_index_scan_size=0;)

I can't seem to come up with a case where adding these gather-merge
paths in create_ordered_paths() isn't entirely duplicative of paths
already created by generate_useful_gather_paths() as called from
apply_scanjoin_target_to_paths() -- which I _think_ makes sense given
that both apply_scanjoin_target_to_paths() and create_ordered_paths()
are called by grouping_planner().

Can you think of a case I'm missing here that would make it valuable
to generate new parallel plans in create_ordered_paths()?

> >But I found that if I set enable_sort=off (with our without the
> >_useful_ variant of generate_gather_paths()) I get a very similar plan
> >that's actually lower cost yet:
> >
> >                                                        QUERY PLAN
>
>--------------------------------------------------------------------------------------------------------------------------
> > Limit  (cost=10255.98..10355.77 rows=1 width=16)
> >   ->  Incremental Sort  (cost=10255.98..1008228.67 rows=10000 width=16)
> >         Sort Key: a, b, (sum(c))
> >         Presorted Key: a, b
> >         ->  Finalize GroupAggregate  (cost=10156.20..1007778.67
> >rows=10000 width=16)
> >               Group Key: a, b
> >               ->  Gather Merge  (cost=10156.20..1007528.67 rows=20000 width=16)
> >                     Workers Planned: 2
> >                     ->  Partial GroupAggregate
> >(cost=9156.18..1004220.15 rows=10000 width=16)
> >                           Group Key: a, b
> >                           ->  Incremental Sort
> >(cost=9156.18..972869.60 rows=4166740 width=12)
> >                                 Sort Key: a, b
> >                                 Presorted Key: a
> >                                 ->  Parallel Index Scan using t_a_idx
> >on t  (cost=0.43..417703.85 rows=4166740 width=12)
> >
> >Is that something we should consider a bug at this stage? It's also
> >not clear to mean (costing aside) which plan is intuitively
> >preferable.
> >
>
> This seems like a thinko in add_partial_path() - it only looks at total
> cost of the paths, and ignores the startup cost entirely. I've debugged
> it a bit, and what's happening for the partially-grouped relation is
> roughly this:
>
> 1) We add a partial path with startup/total costs 696263 / 738029
>
> 2) We attempt to add the "Partial GroupAggregate" path, but it loses the
> fight because the total cost (1004207) is higher than the first path.
> Which entirely misses that the startup cost is way lower.
>
> 3) We however use the startup cost later when computing the LIMIT cost
> (because that's linear approximation between startup and total cost),
> and we reject the first path too, because we happen to find something
> cheaper (but more expensive than what we'd get the path rejected in 2).
>
> Attached is a debug patch which makes this clear - it only prints info
> about first step of partial aggregates, because that's what matters in
> this example. You should see something like this:
>
> WARNING:  rel 0x2aa8e00 adding partial agg path 0x2aa9448 startup 696263.839993 total 738029.919993
> WARNING:  rel 0x2aa8e00 path 0x2aa9448 adding new = 1
> WARNING:  rel 0x2aa8e00 adding partial agg path 0x2aa9710 startup 9156.084995 total 1004207.854495
> WARNING:  A: new path 0x2aa9710 rejected because of 0x2aa9448
> WARNING:  rel 0x2aa8e00 path 0x2aa9710 adding new = 0
>
> which essentially says "path rejected because of total cost" (and you
> know it's the interesting partial aggregate from the second plan). And
> if you disable sort, you get this:
>
> WARNING:  rel 0x2aa8e00 adding partial agg path 0x2aa9448 startup 10000696263.839994 total 10000738029.919996
> WARNING:  rel 0x2aa8e00 path 0x2aa9448 adding new = 1
> WARNING:  rel 0x2aa8e00 adding partial agg path 0x2aa9710 startup 9156.084995 total 1004207.854495
> WARNING:  rel 0x2aa8e00 path 0x2aa9710 adding new = 1
> ...
>
> So in this case we decided the path is interesting, thanks to the
> increased cost of sort.
>
> The comment for add_partial_path says this:
>
>  *  Neither do we need to consider startup costs: parallelism is only
>  *  used for plans that will be run to completion.  Therefore, this
>  *  routine is much simpler than add_path: it needs to consider only
>  *  pathkeys and total cost.
>
> I think this may be a thinko, as this plan demonstrates - but I'm not
> sure about it. I wonder if this might be penalizing some other types of
> plans (essentially anything with limit + gather).
>
> Attached is a WIP patch fixing this by considering both startup and
> total cost (by calling compare_path_costs_fuzzily).

It seems to me that this is likely a bug, and not just a change needed
for this patch. Do you think it's better addressed in a separate
thread? Or should we retain it as part of this patch for now (and possibly
break it out later)? On the other hand, it's entirely possible that someone
more familiar with parallel plan limitations could explain why the
above comment holds true. That makes me lean towards asking in a new
thread.

I've also attached a new base patch (incremental-sort-30.patch) which
includes some of the other obvious fixes (costing, etc.) that you'd
previously proposed.

James Coleman

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Sat, Jul 20, 2019 at 10:33:02AM -0400, James Coleman wrote:
>On Sat, Jul 20, 2019 at 9:22 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Fri, Jul 19, 2019 at 04:59:21PM -0400, James Coleman wrote:
>> >On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra
>> ><tomas.vondra@2ndquadrant.com> wrote:
>> >> Now, consider this example:
>> >>
>> >>   create table t (a int, b int, c int);
>> >>   insert into t select mod(i,100),mod(i,100),i from generate_series(1,10000000) s(i);
>> >>   create index on t (a);
>> >>   analyze t;
>> >>   explain select a,b,sum(c) from t group by 1,2 order by 1,2,3 limit 1;
>> >>
>> >> With 0001+0002+0003 pathes, I get a plan like this:
>> >>
>> >>                                                      QUERY PLAN
>> >>
--------------------------------------------------------------------------------------------------------------------
>> >>  Limit  (cost=10375.39..10594.72 rows=1 width=16)
>> >>    ->  Incremental Sort  (cost=10375.39..2203675.71 rows=10000 width=16)
>> >>          Sort Key: a, b, (sum(c))
>> >>          Presorted Key: a, b
>> >>          ->  GroupAggregate  (cost=10156.07..2203225.71 rows=10000 width=16)
>> >>                Group Key: a, b
>> >>                ->  Gather Merge  (cost=10156.07..2128124.39 rows=10000175 width=12)
>> >>                      Workers Planned: 2
>> >>                      ->  Incremental Sort  (cost=9156.04..972856.05 rows=4166740 width=12)
>> >>                            Sort Key: a, b
>> >>                            Presorted Key: a
>> >>                            ->  Parallel Index Scan using t_a_idx on t  (cost=0.43..417690.30 rows=4166740
width=12)
>> >> (12 rows)
>> >>
>> >>
>> >> and with 0004, I get this:
>> >>
>> >>                                               QUERY PLAN
>> >> ------------------------------------------------------------------------------------------------------
>> >>  Limit  (cost=20443.84..20665.32 rows=1 width=16)
>> >>    ->  Incremental Sort  (cost=20443.84..2235250.05 rows=10000 width=16)
>> >>          Sort Key: a, b, (sum(c))
>> >>          Presorted Key: a, b
>> >>          ->  GroupAggregate  (cost=20222.37..2234800.05 rows=10000 width=16)
>> >>                Group Key: a, b
>> >>                ->  Incremental Sort  (cost=20222.37..2159698.74 rows=10000175 width=12)
>> >>                      Sort Key: a, b
>> >>                      Presorted Key: a
>> >>                      ->  Index Scan using t_a_idx on t  (cost=0.43..476024.65 rows=10000175 width=12)
>> >> (10 rows)
>> >>
>> >> Notice that cost of the second plan is almost double the first one. That
>> >> means 0004 does not even generate the first plan, i.e. there are cases
>> >> where we don't try to add the explicit sort before passing the path to
>> >> generate_gather_paths().
>> >>
>> >> And I think I know why is that - while gather_grouping_paths() tries to
>> >> add explicit sort below the gather merge, there are other places that
>> >> call generate_gather_paths() that don't do that. In this case it's
>> >> probably apply_scanjoin_target_to_paths() which simply builds
>> >>
>> >>    parallel (seq|index) scan + gather merge
>> >>
>> >> and that's it. The problem is likely the same - the code does not know
>> >> which pathkeys are "interesting" at that point. We probably need to
>> >> teach planner to do this.
>> >
>> >I've been working on figuring out sample queries for each of the
>> >places we're looking at adding create_increment_sort() (starting with
>> >the cases enabled by gather-merge nodes). The
>> >generate_useful_gather_paths() call in
>> >apply_scanjoin_target_to_paths() is required to generate the above
>> >preferred plan.
>
>As I continue this, I've added a couple of test cases (notably for
>generate_useful_gather_paths() in both standard_join_search() and
>apply_scanjoin_target_to_paths()). Those, plus the current WIP state
>of my hacking on your patch adding generate_useful_gather_paths() is
>attached as 0001-parallel-and-more-paths.patch.
>
>My current line of investigation is whether we need to do anything in
>the parallel portion of create_ordered_paths(). I noticed that the
>first-pass patch adding generate_useful_gather_paths() modified that
>section but wasn't actually adding any new gather-merge paths (just
>bare incremental sort paths). That seems pretty clearly just a
>prototype miss, so I modified the prototype to build gather-merge
>paths instead (as a side note that change seems to fix an oddity I was
>seeing where plans would include a parallel index scan node even
>though they weren't parallel plans). While the resulting plan for
>something like:
>

Yes, that seems to be a bug. The block above it clearly has gather
merge nodes, so this one should too.

>explain analyze select * from t where t.a in (1,2,3,4,5,6) order by
>t.a, t.b limit 50;
>
>changes cost (to be cheaper) ever so slightly with the gather-merge
>addition to create_ordered_paths(), the plan itself is otherwise
>identical (including row estimates):
>
>Limit
>  -> Gather Merge
>       -> Incremental Sort
>          -> Parallel Index Scan
>
>(Note: I'm forcing parallel plans here with: set
>max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0;
>set parallel_tuple_cost=0; set parallel_setup_cost=0; set
>min_parallel_index_scan_size=0;)
>
>I can't seem to come up with a case where adding these gather-merge
>paths in create_ordered_paths() isn't entirely duplicative of paths
>already created by generate_useful_gather_paths() as called from
>apply_scanjoin_target_to_paths() -- which I _think_ makes sense given
>that both apply_scanjoin_target_to_paths() and create_ordered_paths()
>are called by grouping_planner().
>
>Can you think of a case I'm missing here that would make it valuable
>to generate new parallel plans in create_ordered_paths()?
>

Good question. Not sure. I think such a path would need to do something on
a relation that is neither a join nor a scan - in which case the path
wouldn't be created by apply_scanjoin_target_to_paths().

So for example a query like this:

  SELECT
      a, b, sum(expensive_function(c))
  FROM
      t
  GROUP BY a,b
  ORDER BY a,sum(...) LIMIT 10;

should be able to produce a plan like this:

  -> Limit
      -> Gather Merge
        -> Incremental Sort (pathkeys: a, sum)
          -> Group Aggregate
             a, b, sum(expensive_function(c))
            -> Index Scan (pathkeys: a, b)

or something like that, maybe. I haven't tried this, though. The other
question is whether such queries are useful in practice ...

>> ...
>>
>> I think this may be a thinko, as this plan demonstrates - but I'm not
>> sure about it. I wonder if this might be penalizing some other types of
>> plans (essentially anything with limit + gather).
>>
>> Attached is a WIP patch fixing this by considering both startup and
>> total cost (by calling compare_path_costs_fuzzily).
>
>It seems to me that this is likely a bug, and not just a changed
>needed for this. Do you think it's better addressed in a separate
>thread? Or retain it as part of this patch for now (and possibly break
>it out later)? On the other hand, it's entirely possible that someone
>more familiar with parallel plan limitations could explain why the
>above comment holds true. That makes me lean towards asking in a new
>thread.
>

Maybe. I think creating a separate thread would be useful, provided we
manage to demonstrate the issue without an incremental sort.

>I've also attached a new base patch (incremental-sort-30.patch) which
>includes some of the other obvious fixes (costing, etc.) that you'd
>previously proposed.
>

Thanks!

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
...
> >My current line of investigation is whether we need to do anything in
> >the parallel portion of create_ordered_paths(). I noticed that the
> >first-pass patch adding generate_useful_gather_paths() modified that
> >section but wasn't actually adding any new gather-merge paths (just
> >bare incremental sort paths). That seems pretty clearly just a
> >prototype miss, so I modified the prototype to build gather-merge
> >paths instead (as a side note that change seems to fix an oddity I was
> >seeing where plans would include a parallel index scan node even
> >though they weren't parallel plans). While the resulting plan for
> >something like:
> >
>
> Yes, that seems to be a bug. The block above it clealy has a gather
> merge nodes, so this one should too.
>
> >explain analyze select * from t where t.a in (1,2,3,4,5,6) order by
> >t.a, t.b limit 50;
> >
> >changes cost (to be cheaper) ever so slightly with the gather-merge
> >addition to create_ordered_paths(), the plan itself is otherwise
> >identical (including row estimates):
> >
> >Limit
> >  -> Gather Merge
> >       -> Incremental Sort
> >          -> Parallel Index Scan
> >
> >(Note: I'm forcing parallel plans here with: set
> >max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0;
> >set parallel_tuple_cost=0; set parallel_setup_cost=0; set
> >min_parallel_index_scan_size=0;)
> >
> >I can't seem to come up with a case where adding these gather-merge
> >paths in create_ordered_paths() isn't entirely duplicative of paths
> >already created by generate_useful_gather_paths() as called from
> >apply_scanjoin_target_to_paths() -- which I _think_ makes sense given
> >that both apply_scanjoin_target_to_paths() and create_ordered_paths()
> >are called by grouping_planner().
> >
> >Can you think of a case I'm missing here that would make it valuable
> >to generate new parallel plans in create_ordered_paths()?
> >
>
> Good question. Not sure. I think such path would need to do something on
> a relation that is neither a join nor a scan - in which case the path
> should not be created by apply_scanjoin_target_to_paths().
>
> So for example a query like this:
>
>   SELECT
>       a, b, sum(expensive_function(c))
>   FROM
>       t
>   GROUP BY a,b
>   ORDER BY a,sum(...) LIMIT 10;
>
> should be able to produce a plan like this:
>
>   -> Limit
>       -> Gather Merge
>         -> Incremental Sort (pathkeys: a, sum)
>           -> Group Aggregate
>              a, b, sum(expensive_function(c))
>             -> Index Scan (pathkeys: a, b)
>
> or something like that, maybe. I haven't tried this, though. The other
> question is whether such queries are useful in practice ...

Hmm, when I step through that example, input_rel->partial_pathlist
!= NIL is false, so we don't even attempt to consider any extra
parallel paths in create_ordered_paths(). Nonetheless we get a
parallel plan, but with a different shape:

 Limit
   ->  Incremental Sort
         Sort Key: a, (sum(expensive_function(c)))
         Presorted Key: a
         ->  Finalize GroupAggregate
               Group Key: a, b
               ->  Gather Merge
                     Workers Planned: 4
                     ->  Partial GroupAggregate
                           Group Key: a, b
                           ->  Sort
                                 Sort Key: a, b
                                 ->  Parallel Seq Scan on t

(or if I disable seqscan then the sort+seq scan above becomes inc sort
+ index scan)

To be honest, I don't think I understand how you would get a plan like
that anyway since the index here only has "a" and so can't provide
(pathkeys: a, b).

Perhaps there's something I'm still missing though.

James Coleman



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sat, Jul 20, 2019 at 12:12 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> ...
> > >My current line of investigation is whether we need to do anything in
> > >the parallel portion of create_ordered_paths(). I noticed that the
> > >first-pass patch adding generate_useful_gather_paths() modified that
> > >section but wasn't actually adding any new gather-merge paths (just
> > >bare incremental sort paths). That seems pretty clearly just a
> > >prototype miss, so I modified the prototype to build gather-merge
> > >paths instead (as a side note that change seems to fix an oddity I was
> > >seeing where plans would include a parallel index scan node even
> > >though they weren't parallel plans). While the resulting plan for
> > >something like:
> > >
> >
> > Yes, that seems to be a bug. The block above it clealy has a gather
> > merge nodes, so this one should too.
> >
> > >explain analyze select * from t where t.a in (1,2,3,4,5,6) order by
> > >t.a, t.b limit 50;
> > >
> > >changes cost (to be cheaper) ever so slightly with the gather-merge
> > >addition to create_ordered_paths(), the plan itself is otherwise
> > >identical (including row estimates):
> > >
> > >Limit
> > >  -> Gather Merge
> > >       -> Incremental Sort
> > >          -> Parallel Index Scan
> > >
> > >(Note: I'm forcing parallel plans here with: set
> > >max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0;
> > >set parallel_tuple_cost=0; set parallel_setup_cost=0; set
> > >min_parallel_index_scan_size=0;)
> > >
> > >I can't seem to come up with a case where adding these gather-merge
> > >paths in create_ordered_paths() isn't entirely duplicative of paths
> > >already created by generate_useful_gather_paths() as called from
> > >apply_scanjoin_target_to_paths() -- which I _think_ makes sense given
> > >that both apply_scanjoin_target_to_paths() and create_ordered_paths()
> > >are called by grouping_planner().
> > >
> > >Can you think of a case I'm missing here that would make it valuable
> > >to generate new parallel plans in create_ordered_paths()?
> > >
> >
> > Good question. Not sure. I think such path would need to do something on
> > a relation that is neither a join nor a scan - in which case the path
> > should not be created by apply_scanjoin_target_to_paths().
> >
> > So for example a query like this:
> >
> >   SELECT
> >       a, b, sum(expensive_function(c))
> >   FROM
> >       t
> >   GROUP BY a,b
> >   ORDER BY a,sum(...) LIMIT 10;
> >
> > should be able to produce a plan like this:
> >
> >   -> Limit
> >       -> Gather Merge
> >         -> Incremental Sort (pathkeys: a, sum)
> >           -> Group Aggregate
> >              a, b, sum(expensive_function(c))
> >             -> Index Scan (pathkeys: a, b)
> >
> > or something like that, maybe. I haven't tried this, though. The other
> > question is whether such queries are useful in practice ...
>
> Hmm, when I step through on that example input_rel->partial_pathlist
> != NIL is false, so we don't even attempt to consider any extra
> parallel paths in create_ordered_paths(). Nonetheless we get a
> parallel plan, but with a different shape:
>
>  Limit
>    ->  Incremental Sort
>          Sort Key: a, (sum(expensive_function(c)))
>          Presorted Key: a
>          ->  Finalize GroupAggregate
>                Group Key: a, b
>                ->  Gather Merge
>                      Workers Planned: 4
>                      ->  Partial GroupAggregate
>                            Group Key: a, b
>                            ->  Sort
>                                  Sort Key: a, b
>                                  ->  Parallel Seq Scan on t
>
> (or if I disable seqscan then the sort+seq scan above becomes inc sort
> + index scan)
>
> To be honest, I don't think I understand how you would get a plan like
> that anyway since the index here only has "a" and so can't provide
> (pathkeys: a, b).
>
> Perhaps there's something I'm still missing though.

Also, I just realized I don't think (?) we can order by the sum inside a
gather merge -- at least not without having another sort above the
parallel portion? Or is the group aggregate able to also provide
ordering on the final sum after aggregating the partial sums?

James Coleman



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Sat, Jul 20, 2019 at 12:21:01PM -0400, James Coleman wrote:
>On Sat, Jul 20, 2019 at 12:12 PM James Coleman <jtc331@gmail.com> wrote:
>>
>> On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>> ...
>> > >My current line of investigation is whether we need to do anything in
>> > >the parallel portion of create_ordered_paths(). I noticed that the
>> > >first-pass patch adding generate_useful_gather_paths() modified that
>> > >section but wasn't actually adding any new gather-merge paths (just
>> > >bare incremental sort paths). That seems pretty clearly just a
>> > >prototype miss, so I modified the prototype to build gather-merge
>> > >paths instead (as a side note that change seems to fix an oddity I was
>> > >seeing where plans would include a parallel index scan node even
>> > >though they weren't parallel plans). While the resulting plan for
>> > >something like:
>> > >
>> >
>> > Yes, that seems to be a bug. The block above it clealy has a gather
>> > merge nodes, so this one should too.
>> >
>> > >explain analyze select * from t where t.a in (1,2,3,4,5,6) order by
>> > >t.a, t.b limit 50;
>> > >
>> > >changes cost (to be cheaper) ever so slightly with the gather-merge
>> > >addition to create_ordered_paths(), the plan itself is otherwise
>> > >identical (including row estimates):
>> > >
>> > >Limit
>> > >  -> Gather Merge
>> > >       -> Incremental Sort
>> > >          -> Parallel Index Scan
>> > >
>> > >(Note: I'm forcing parallel plans here with: set
>> > >max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0;
>> > >set parallel_tuple_cost=0; set parallel_setup_cost=0; set
>> > >min_parallel_index_scan_size=0;)
>> > >
>> > >I can't seem to come up with a case where adding these gather-merge
>> > >paths in create_ordered_paths() isn't entirely duplicative of paths
>> > >already created by generate_useful_gather_paths() as called from
>> > >apply_scanjoin_target_to_paths() -- which I _think_ makes sense given
>> > >that both apply_scanjoin_target_to_paths() and create_ordered_paths()
>> > >are called by grouping_planner().
>> > >
>> > >Can you think of a case I'm missing here that would make it valuable
>> > >to generate new parallel plans in create_ordered_paths()?
>> > >
>> >
>> > Good question. Not sure. I think such path would need to do something on
>> > a relation that is neither a join nor a scan - in which case the path
>> > should not be created by apply_scanjoin_target_to_paths().
>> >
>> > So for example a query like this:
>> >
>> >   SELECT
>> >       a, b, sum(expensive_function(c))
>> >   FROM
>> >       t
>> >   GROUP BY a,b
>> >   ORDER BY a,sum(...) LIMIT 10;
>> >
>> > should be able to produce a plan like this:
>> >
>> >   -> Limit
>> >       -> Gather Merge
>> >         -> Incremental Sort (pathkeys: a, sum)
>> >           -> Group Aggregate
>> >              a, b, sum(expensive_function(c))
>> >             -> Index Scan (pathkeys: a, b)
>> >
>> > or something like that, maybe. I haven't tried this, though. The other
>> > question is whether such queries are useful in practice ...
>>
>> Hmm, when I step through on that example input_rel->partial_pathlist
>> != NIL is false, so we don't even attempt to consider any extra
>> parallel paths in create_ordered_paths(). Nonetheless we get a
>> parallel plan, but with a different shape:
>>
>>  Limit
>>    ->  Incremental Sort
>>          Sort Key: a, (sum(expensive_function(c)))
>>          Presorted Key: a
>>          ->  Finalize GroupAggregate
>>                Group Key: a, b
>>                ->  Gather Merge
>>                      Workers Planned: 4
>>                      ->  Partial GroupAggregate
>>                            Group Key: a, b
>>                            ->  Sort
>>                                  Sort Key: a, b
>>                                  ->  Parallel Seq Scan on t
>>
>> (or if I disable seqscan then the sort+seq scan above becomes inc sort
>> + index scan)
>>
>> To be honest, I don't think I understand how you would get a plan like
>> that anyway since the index here only has "a" and so can't provide
>> (pathkeys: a, b).
>>
>> Perhaps there's something I'm still missing though.

I wasn't strictly adhering to the example we used before, and I imagined
there would be an index on (a,b). Sorry if that wasn't clear.

>
>Also just realized I don't think (?) we can order by the sum inside a
>gather-merge -- at least not without having another sort above the
>parallel portion? Or is the group aggregate able to also provide
>ordering on the final sum after aggregating the partial sums?
>

Yes, you're right - an extra sort node would be necessary. But I don't
think that changes the idea behind that example. The question is whether
the extra sorts below the gather would be cheaper than doing one large sort
on top of it - I don't see why that wouldn't be the case, especially if we
only need a couple of rows from the beginning ...


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sat, Jul 20, 2019 at 1:00 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Sat, Jul 20, 2019 at 12:21:01PM -0400, James Coleman wrote:
> >On Sat, Jul 20, 2019 at 12:12 PM James Coleman <jtc331@gmail.com> wrote:
> >>
> >> On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
> >> <tomas.vondra@2ndquadrant.com> wrote:
> >> ...
> >> > >My current line of investigation is whether we need to do anything in
> >> > >the parallel portion of create_ordered_paths(). I noticed that the
> >> > >first-pass patch adding generate_useful_gather_paths() modified that
> >> > >section but wasn't actually adding any new gather-merge paths (just
> >> > >bare incremental sort paths). That seems pretty clearly just a
> >> > >prototype miss, so I modified the prototype to build gather-merge
> >> > >paths instead (as a side note that change seems to fix an oddity I was
> >> > >seeing where plans would include a parallel index scan node even
> >> > >though they weren't parallel plans). While the resulting plan for
> >> > >something like:
> >> > >
> >> >
> >> > Yes, that seems to be a bug. The block above it clealy has a gather
> >> > merge nodes, so this one should too.
> >> >
> >> > >explain analyze select * from t where t.a in (1,2,3,4,5,6) order by
> >> > >t.a, t.b limit 50;
> >> > >
> >> > >changes cost (to be cheaper) ever so slightly with the gather-merge
> >> > >addition to create_ordered_paths(), the plan itself is otherwise
> >> > >identical (including row estimates):
> >> > >
> >> > >Limit
> >> > >  -> Gather Merge
> >> > >       -> Incremental Sort
> >> > >          -> Parallel Index Scan
> >> > >
> >> > >(Note: I'm forcing parallel plans here with: set
> >> > >max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0;
> >> > >set parallel_tuple_cost=0; set parallel_setup_cost=0; set
> >> > >min_parallel_index_scan_size=0;)
> >> > >
> >> > >I can't seem to come up with a case where adding these gather-merge
> >> > >paths in create_ordered_paths() isn't entirely duplicative of paths
> >> > >already created by generate_useful_gather_paths() as called from
> >> > >apply_scanjoin_target_to_paths() -- which I _think_ makes sense given
> >> > >that both apply_scanjoin_target_to_paths() and create_ordered_paths()
> >> > >are called by grouping_planner().
> >> > >
> >> > >Can you think of a case I'm missing here that would make it valuable
> >> > >to generate new parallel plans in create_ordered_paths()?
> >> > >
> >> >
> >> > Good question. Not sure. I think such path would need to do something on
> >> > a relation that is neither a join nor a scan - in which case the path
> >> > should not be created by apply_scanjoin_target_to_paths().
> >> >
> >> > So for example a query like this:
> >> >
> >> >   SELECT
> >> >       a, b, sum(expensive_function(c))
> >> >   FROM
> >> >       t
> >> >   GROUP BY a,b
> >> >   ORDER BY a,sum(...) LIMIT 10;
> >> >
> >> > should be able to produce a plan like this:
> >> >
> >> >   -> Limit
> >> >       -> Gather Merge
> >> >         -> Incremental Sort (pathkeys: a, sum)
> >> >           -> Group Aggregate
> >> >              a, b, sum(expensive_function(c))
> >> >             -> Index Scan (pathkeys: a, b)
> >> >
> >> > or something like that, maybe. I haven't tried this, though. The other
> >> > question is whether such queries are useful in practice ...
> >>
> >> Hmm, when I step through on that example input_rel->partial_pathlist
> >> != NIL is false, so we don't even attempt to consider any extra
> >> parallel paths in create_ordered_paths(). Nonetheless we get a
> >> parallel plan, but with a different shape:
> >>
> >>  Limit
> >>    ->  Incremental Sort
> >>          Sort Key: a, (sum(expensive_function(c)))
> >>          Presorted Key: a
> >>          ->  Finalize GroupAggregate
> >>                Group Key: a, b
> >>                ->  Gather Merge
> >>                      Workers Planned: 4
> >>                      ->  Partial GroupAggregate
> >>                            Group Key: a, b
> >>                            ->  Sort
> >>                                  Sort Key: a, b
> >>                                  ->  Parallel Seq Scan on t
> >>
> >> (or if I disable seqscan then the sort+seq scan above becomes inc sort
> >> + index scan)
> >>
> >> To be honest, I don't think I understand how you would get a plan like
> >> that anyway since the index here only has "a" and so can't provide
> >> (pathkeys: a, b).
> >>
> >> Perhaps there's something I'm still missing though.
>
> I wasn't stricly adhering to the example we used before, and I imagined
> there would be an index on (a,b). Sorry if that wasn't clear.
>
> >
> >Also just realized I don't think (?) we can order by the sum inside a
> >gather-merge -- at least not without having another sort above the
> >parallel portion? Or is the group aggregate able to also provide
> >ordering on the final sum after aggregating the partial sums?
> >
>
> Yes, you're right - an extra sort node would be necessary. But I don't
> think that changes the idea behind that example. The question is whether
> the extra sorts below the gather would be cheaper than doing a large sort
> on top of it - but I don't see why wouldn't that be the case, and if we
> only need a couple of rows from the beginning ...

Ah, I see. Right now we get:

 Limit
   ->  Incremental Sort
         Sort Key: a, (sum(expensive_function(c)))
         Presorted Key: a
         ->  Finalize GroupAggregate
               Group Key: a, b
               ->  Gather Merge
                     Workers Planned: 2
                     ->  Partial GroupAggregate
                           Group Key: a, b
                           ->  Parallel Index Scan using t_a_b on t

even with the parallel additions to create_ordered_paths() -- that
addition doesn't actually add any new parallel paths because
input_rel->partial_pathlist != NIL is false (I'm not sure why yet), so
if we want (if I understand correctly) something more like:

I'm still struggling to understand though how another incremental sort
below the gather merge would actually be able to help us. For one, I'm
not sure it would be less expensive, but more importantly I'm not sure
how we could do that and maintain correctness. Wouldn't a per-worker
sum be useless for sorting, since it has no predictable relationship to
the ordering of the total sum?

Describing that got me thinking of similar cases where ordering of the
partial aggregate would (in theory) be a correct partial sort for the
total ordering, and it seems like min() and max() would be. So I ran
the same experiment with that instead of sum(), but, you guessed it,
input_rel->partial_pathlist != NIL is false again, so we don't add any
parallel paths in create_ordered_paths().

I'm leaning towards thinking that considering parallel incremental sorts
in create_ordered_paths() won't add value. But I also feel like this
whole project has me jumping into the deep end of the pool again (this
time the planner), so I'm still picking up a lot of pieces for how all
this fits together, and as such I don't have a great intuitive grasp
yet of how this particular part of the planning process maps to the
kinds of queries and plans we consider. All that to say: if you have
further thoughts, I'm happy to look into it, but right now I'm not
seeing anything.

James Coleman



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Sat, Jul 20, 2019 at 07:37:08PM -0400, James Coleman wrote:
>> ..
>>
>> Yes, you're right - an extra sort node would be necessary. But I don't
>> think that changes the idea behind that example. The question is whether
>> the extra sorts below the gather would be cheaper than doing a large sort
>> on top of it - but I don't see why wouldn't that be the case, and if we
>> only need a couple of rows from the beginning ...
>
>Ah, I see. Right now we get:
>
> Limit
>   ->  Incremental Sort
>         Sort Key: a, (sum(expensive_function(c)))
>         Presorted Key: a
>         ->  Finalize GroupAggregate
>               Group Key: a, b
>               ->  Gather Merge
>                     Workers Planned: 2
>                     ->  Partial GroupAggregate
>                           Group Key: a, b
>                           ->  Parallel Index Scan using t_a_b on t
>
>even with the parallel additions to create_ordered_paths() -- that
>addition doesn't actually add any new parallel paths because
>input_rel->partial_pathlist != NIL is false (I'm not sure why yet), so
>if we want (if I understand correctly) something more like:
>

Well, in this particular case it's fairly simple, I think. We only call
create_ordered_paths() from the grouping planner, on the upper
relation that represents the result of the GROUP BY. But that means it
only sees the finalized result (after the Finalize GroupAggregate), so
it can't see the partial aggregation or any other partial path. So in this
case partial_pathlist == NIL seems guaranteed.

So maybe we should not be looking at GROUP BY queries, which probably
can't hit this particular code path at all - we need a different type of
upper relation. For example UNION ALL should hit this code, I think.

So maybe try

   select * from t union all select * from t order by a, b limit 10;

and that will hit this condition with partial_pathlist != NIL.

I don't know if such queries can benefit from incremental sort, though.
There are other upper relations too:

  typedef enum UpperRelationKind
  {
      UPPERREL_SETOP,             /* result of UNION/INTERSECT/EXCEPT, if any */
      UPPERREL_PARTIAL_GROUP_AGG, /* result of partial grouping/aggregation, if
                                   * any */
      UPPERREL_GROUP_AGG,         /* result of grouping/aggregation, if any */
      UPPERREL_WINDOW,            /* result of window functions, if any */
      UPPERREL_DISTINCT,          /* result of "SELECT DISTINCT", if any */
      UPPERREL_ORDERED,           /* result of ORDER BY, if any */
      UPPERREL_FINAL              /* result of any remaining top-level actions */
      /* NB: UPPERREL_FINAL must be last enum entry; it's used to size arrays */
  } UpperRelationKind;


>I'm still struggling to understand though how another incremental sort
>below the gather-merge would actually be able to help us. For one I'm
>not sure it would be less expensive,

That's a good question. I think in general we agree that if we can get
the gather merge to sort the data the way the operation above the gather
merge needs it (that being the first operation that can't run in parallel
mode), that's probably beneficial.

So this pattern seems reasonable:

    -> something
       -> non-parallel operation
          -> gather merge
             -> incremental sort
                -> something

And it's likely faster, especially when the parts above this can
leverage the lower startup cost. Say, when there's an explicit LIMIT.

I think the question is where exactly we should add the incremental sort.
It's quite possible some of the places we modified are redundant, at
least for some queries. Not sure.

>but more importantly I'm not sure how we could do that and maintain
>correctness. Wouldn't a per-worker sum not actually be useful in
>sorting since it has no predictable impact on the ordering of the total
>sum?

Yes, you're right - that wouldn't be correct. The reason I've been
thinking about such examples is that distributed databases do such
things in some cases, and we might do that too with partitioning.

Consider a simple example

    create table t (a int, b int, c int) partition by hash (a);
    create table t0 partition of t for values with (modulus 4, remainder 0);
    create table t1 partition of t for values with (modulus 4, remainder 1);
    create table t2 partition of t for values with (modulus 4, remainder 2);
    create table t3 partition of t for values with (modulus 4, remainder 3);
    insert into t select mod(i,1000), i, i from generate_series(1,1000000) s(i);
    analyze t;
    select a, count(b) from t group by a order by a, count(b) limit 10;

In this case we might do a plan similar to what I proposed, assuming
each worker gets to execute on a different partition (because then we
know each worker will see distinct groups, thanks to the partitioning).

But AFAIK we don't do such optimizations yet, so it's probably just a
distraction.

>Describing that got me thinking of similar cases where ordering of the
>partial aggregate would (in theory) be a correct partial sort for the
>total ordering, and it seems like min() and max() would be. So I ran
>the same experiment with that instead of sum(), but, you guessed it,
>input_rel->partial_pathlist != NIL is false again, so we don't add any
>parallel paths in create_ordered_paths().
>

Right. That's because with aggregation, the grouping planner only sees
the total result, not the partial paths. We need a different upper rel
to exercise that code path.

>I'm leaning towards thinking considering parallel incremental sorts in
>create_ordered_paths() won't add value. But I also feel like this
>whole project has me jumping into the deep end of the pool again (this
>time the planner), so I'm still picking up a lot of pieces for how all
>this fits together, and as such I don't have a great intuitive grasp
>yet of how this particular part of the planning process maps to the
>kind of queries and plans we consider. All that to say: if you have
>further thoughts, I'm happy to look into it, but right now I'm not
>seeing anything.
>

Understood. FWIW I'm not particularly familiar with this code (or which
places are supposed to work together), so I definitely agree it may be
overwhelming. Especially when it's only a part of a larger patch.

I wonder if we're approaching this wrong. Maybe we should not reverse
engineer queries for the various places, but just start with a set of
queries that we want to optimize, and then identify which places in the
planner need to be modified.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Sun, Jul 21, 2019 at 01:34:22PM +0200, Tomas Vondra wrote:
>
> ...
>
>I wonder if we're approaching this wrong. Maybe we should not reverse
>engineer queries for the various places, but just start with a set of
>queries that we want to optimize, and then identify which places in the
>planner need to be modified.
>

I've decided to do a couple of experiments, trying to make up my mind
about which modified places matter to different queries. But instead of
trying to reverse engineer the queries, I've taken a different approach
- I've compiled a list of queries that I think are sensible and
relevant, and then planned them with incremental sort enabled in
different places.

I don't have any clear conclusions at this point - it does show that
some of the places don't change the plan for any of the queries,
although there may be some additional query where they'd make a
difference.

But I'm posting this mostly because it might be useful. I'd initially
planned to move the changes that add incremental sort paths into
separate patches, and then apply/skip different subsets of those
patches. But then I realized there's a better way to do this - I've
added a bunch of GUCs, one for each such place. That allows doing this
testing without having to rebuild repeatedly.

I'm not going to post the patch(es) with extra GUCs here, because it'd
just confuse the patch tester, but it's available here:

  https://github.com/tvondra/postgres/tree/incremental-sort-20190730

There are 10 GUCs, one for each place in the planner where incremental
sort paths are constructed. By default all of those are set to 'false',
so no incremental sort paths are built. If you do

  SET devel_create_ordered_paths = on;

it'll start creating non-parallel incremental sort paths in
create_ordered_paths. Then you may enable
devel_create_ordered_paths_parallel to also consider parallel paths,
etc.
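
For example, to check whether a particular place changes the plan for
some query, something like this should do (a sketch only - the table t
is just an example, and the devel_* GUCs exist only in the branch
above):

    begin;
    set local devel_create_ordered_paths = off;
    explain (costs off) select * from t order by a, b limit 10;
    set local devel_create_ordered_paths = on;
    explain (costs off) select * from t order by a, b limit 10;
    rollback;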

The list of queries (synthetic, but hopefully sufficiently realistic)
and a couple of scripts to collect the plans is in this repository:

  https://github.com/tvondra/incremental-sort-tests-2

There's also a spreadsheet with a summary of results, with a visual
representation of which GUCs affect which queries.



regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Rafia Sabih
Date:
On Tue, 30 Jul 2019 at 02:17, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Sun, Jul 21, 2019 at 01:34:22PM +0200, Tomas Vondra wrote:
>
> ...
>
>I wonder if we're approaching this wrong. Maybe we should not reverse
>engineer queries for the various places, but just start with a set of
>queries that we want to optimize, and then identify which places in the
>planner need to be modified.
>

I've decided to do a couple of experiments, trying to make my mind about
which modified places matter to diffrent queries. But instead of trying
to reverse engineer the queries, I've taken a different approach - I've
compiled a list of queries that I think are sensible and relevant, and
then planned them with incremental sort enabled in different places.

I don't have any clear conclusions at this point - it does show some of
the places don't change plan for any of the queries, although there may
be some additional query where it'd make a difference.

But I'm posting this mostly because it might be useful. I've initially
planned to move changes that add incremental sort paths to separate
patches, and then apply/skip different subsets of those patches. But
then I realized there's a better way to do this - I've added a bunch of
GUCs, one for each such place. This allows doing this testing without
having to rebuild repeatedly.

I'm not going to post the patch(es) with extra GUCs here, because it'd
just confuse the patch tester, but it's available here:

  https://github.com/tvondra/postgres/tree/incremental-sort-20190730

There are 10 GUCs, one for each place in planner where incremental sort
paths are constructed. By default all those are set to 'false' so no
incremental sort paths are built. If you do

  SET devel_create_ordered_paths = on;

it'll start creating the paths in non-parallel in create_ordered_paths.
Then you may enable devel_create_ordered_paths_parallel to also consider
parallel paths, etc.

The list of queries (synthetic, but hopefully sufficiently realistic)
and a couple of scripts to collect the plans is in this repository:

  https://github.com/tvondra/incremental-sort-tests-2

There's also a spreadsheet with a summary of results, with a visual
representation of which GUCs affect which queries.

Wow, that sounds like an elaborate experiment. But where is this spreadsheet you mentioned?

--
Regards,
Rafia Sabih

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Wed, Sep 04, 2019 at 11:37:48AM +0200, Rafia Sabih wrote:
>On Tue, 30 Jul 2019 at 02:17, Tomas Vondra <tomas.vondra@2ndquadrant.com>
>wrote:
>
>> On Sun, Jul 21, 2019 at 01:34:22PM +0200, Tomas Vondra wrote:
>> >
>> > ...
>> >
>> >I wonder if we're approaching this wrong. Maybe we should not reverse
>> >engineer queries for the various places, but just start with a set of
>> >queries that we want to optimize, and then identify which places in the
>> >planner need to be modified.
>> >
>>
>> I've decided to do a couple of experiments, trying to make my mind about
>> which modified places matter to diffrent queries. But instead of trying
>> to reverse engineer the queries, I've taken a different approach - I've
>> compiled a list of queries that I think are sensible and relevant, and
>> then planned them with incremental sort enabled in different places.
>>
>> I don't have any clear conclusions at this point - it does show some of
>> the places don't change plan for any of the queries, although there may
>> be some additional query where it'd make a difference.
>>
>> But I'm posting this mostly because it might be useful. I've initially
>> planned to move changes that add incremental sort paths to separate
>> patches, and then apply/skip different subsets of those patches. But
>> then I realized there's a better way to do this - I've added a bunch of
>> GUCs, one for each such place. This allows doing this testing without
>> having to rebuild repeatedly.
>>
>> I'm not going to post the patch(es) with extra GUCs here, because it'd
>> just confuse the patch tester, but it's available here:
>>
>>   https://github.com/tvondra/postgres/tree/incremental-sort-20190730
>>
>> There are 10 GUCs, one for each place in planner where incremental sort
>> paths are constructed. By default all those are set to 'false' so no
>> incremental sort paths are built. If you do
>>
>>   SET devel_create_ordered_paths = on;
>>
>> it'll start creating the paths in non-parallel in create_ordered_paths.
>> Then you may enable devel_create_ordered_paths_parallel to also consider
>> parallel paths, etc.
>>
>> The list of queries (synthetic, but hopefully sufficiently realistic)
>> and a couple of scripts to collect the plans is in this repository:
>>
>>   https://github.com/tvondra/incremental-sort-tests-2
>>
>> There's also a spreadsheet with a summary of results, with a visual
>> representation of which GUCs affect which queries.
>>
> Wow, that sounds like an elaborate experiment. But where is this
> spreadsheet you mentioned ?
>

It seems I forgot to push the commit containing the spreadsheet with
results. I'll fix that tomorrow.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Wed, Sep 04, 2019 at 09:17:10PM +0200, Tomas Vondra wrote:
>On Wed, Sep 04, 2019 at 11:37:48AM +0200, Rafia Sabih wrote:
>>On Tue, 30 Jul 2019 at 02:17, Tomas Vondra <tomas.vondra@2ndquadrant.com>
>>wrote:
>>
>>>On Sun, Jul 21, 2019 at 01:34:22PM +0200, Tomas Vondra wrote:
>>>>
>>>> ...
>>>>
>>>>I wonder if we're approaching this wrong. Maybe we should not reverse
>>>>engineer queries for the various places, but just start with a set of
>>>>queries that we want to optimize, and then identify which places in the
>>>>planner need to be modified.
>>>>
>>>
>>>I've decided to do a couple of experiments, trying to make my mind about
>>>which modified places matter to diffrent queries. But instead of trying
>>>to reverse engineer the queries, I've taken a different approach - I've
>>>compiled a list of queries that I think are sensible and relevant, and
>>>then planned them with incremental sort enabled in different places.
>>>
>>>I don't have any clear conclusions at this point - it does show some of
>>>the places don't change plan for any of the queries, although there may
>>>be some additional query where it'd make a difference.
>>>
>>>But I'm posting this mostly because it might be useful. I've initially
>>>planned to move changes that add incremental sort paths to separate
>>>patches, and then apply/skip different subsets of those patches. But
>>>then I realized there's a better way to do this - I've added a bunch of
>>>GUCs, one for each such place. This allows doing this testing without
>>>having to rebuild repeatedly.
>>>
>>>I'm not going to post the patch(es) with extra GUCs here, because it'd
>>>just confuse the patch tester, but it's available here:
>>>
>>>  https://github.com/tvondra/postgres/tree/incremental-sort-20190730
>>>
>>>There are 10 GUCs, one for each place in planner where incremental sort
>>>paths are constructed. By default all those are set to 'false' so no
>>>incremental sort paths are built. If you do
>>>
>>>  SET devel_create_ordered_paths = on;
>>>
>>>it'll start creating the paths in non-parallel in create_ordered_paths.
>>>Then you may enable devel_create_ordered_paths_parallel to also consider
>>>parallel paths, etc.
>>>
>>>The list of queries (synthetic, but hopefully sufficiently realistic)
>>>and a couple of scripts to collect the plans is in this repository:
>>>
>>>  https://github.com/tvondra/incremental-sort-tests-2
>>>
>>>There's also a spreadsheet with a summary of results, with a visual
>>>representation of which GUCs affect which queries.
>>>
>>Wow, that sounds like an elaborate experiment. But where is this
>>spreadsheet you mentioned ?
>>
>
>It seems I forgot to push the commit containing the spreadsheet with
>results. I'll fix that tomorrow.
>

OK, I've pushed the commit with the spreadsheet. The single sheet lists
the synthetic queries, and hashes of the plans with different flags
enabled (parallel query, force incremental sort, and the new developer
GUCs mentioned before). Only a single developer flag is set to true (or
none of them).

The columns at the end simply say whether the plan differs from the plan
generated by master (no patches). TRUE means "same as master" while
FALSE means "different plan".

The "patched" column means all developer GUCs disabled, so it's expected
to produce the same plan as master (and it is). And then there's one
column for each developer GUC. If the column is just TRUE it means the
GUC does not affect any of the synthetic queries. There are 4 of them:

- devel_add_paths_to_grouping_rel_parallel
- devel_create_partial_grouping_paths
- devel_gather_grouping_paths
- devel_standard_join_search

The places controlled by those GUCs are either useless, or the query
affected by them is not included in the list of queries.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Alvaro Herrera
Date:
On 2019-Jul-30, Tomas Vondra wrote:

> On Sun, Jul 21, 2019 at 01:34:22PM +0200, Tomas Vondra wrote:
> > 
> > I wonder if we're approaching this wrong. Maybe we should not reverse
> > engineer queries for the various places, but just start with a set of
> > queries that we want to optimize, and then identify which places in the
> > planner need to be modified.

[...]

> I've decided to do a couple of experiments, trying to make my mind about
> which modified places matter to diffrent queries. But instead of trying
> to reverse engineer the queries, I've taken a different approach - I've
> compiled a list of queries that I think are sensible and relevant, and
> then planned them with incremental sort enabled in different places.

[...]

> The list of queries (synthetic, but hopefully sufficiently realistic)
> and a couple of scripts to collect the plans is in this repository:
> 
>  https://github.com/tvondra/incremental-sort-tests-2
> 
> There's also a spreadsheet with a summary of results, with a visual
> representation of which GUCs affect which queries.

OK, so we have that now.  I suppose this spreadsheet now tells us which
places are useful and which aren't, at least for the queries that you've
tested.  Does that mean that we want to get the patch to consider adding
paths only in the places that your spreadsheet says are useful?  I'm not
sure what the next steps are for this patch.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
> OK, so we have that now.  I suppose this spreadsheet now tells us which
> places are useful and which aren't, at least for the queries that you've
> tested.  Dowe that mean that we want to get the patch to consider adding
> paths only the places that your spreadsheet says are useful?  I'm not
> sure what the next steps are for this patch.

I wanted to note here that I haven't abandoned this patch, but ended
up needing to use my extra time for working on a conference talk. That
talk is today, so I'm hoping to be able to catch up on this again
soon.

James Coleman



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Thu, Sep 12, 2019 at 12:49:29PM -0300, Alvaro Herrera wrote:
>On 2019-Jul-30, Tomas Vondra wrote:
>
>> On Sun, Jul 21, 2019 at 01:34:22PM +0200, Tomas Vondra wrote:
>> >
>> > I wonder if we're approaching this wrong. Maybe we should not reverse
>> > engineer queries for the various places, but just start with a set of
>> > queries that we want to optimize, and then identify which places in the
>> > planner need to be modified.
>
>[...]
>
>> I've decided to do a couple of experiments, trying to make my mind about
>> which modified places matter to diffrent queries. But instead of trying
>> to reverse engineer the queries, I've taken a different approach - I've
>> compiled a list of queries that I think are sensible and relevant, and
>> then planned them with incremental sort enabled in different places.
>
>[...]
>
>> The list of queries (synthetic, but hopefully sufficiently realistic)
>> and a couple of scripts to collect the plans is in this repository:
>>
>>  https://github.com/tvondra/incremental-sort-tests-2
>>
>> There's also a spreadsheet with a summary of results, with a visual
>> representation of which GUCs affect which queries.
>
>OK, so we have that now.  I suppose this spreadsheet now tells us which
>places are useful and which aren't, at least for the queries that you've
>tested.  Dowe that mean that we want to get the patch to consider adding
>paths only the places that your spreadsheet says are useful?  I'm not
>sure what the next steps are for this patch.
>

Yes. I think the spreadsheet can help us answer two things:

1) places actually affecting the plan (all but three do)

2) redundant places (there are some cases where two GUCs produce the
same plan in the end)

Of course, this does assume the query set makes sense and is somewhat
realistic, but I've tried to construct queries where that is true. We
may extend it over time, of course.

I think we've agreed to add incremental sort paths to the different
places in separate patches, to make review easier. So this may be a
useful way to decide which places to address first. I'd probably do it
in this order:

- create_ordered_paths
- create_ordered_paths (parallel part)
- add_paths_to_grouping_rel
- ... not sure ...

but that's just a proposal. It'd give us most of the benefits, I think,
and we could also focus on the rest of the patch.

Also, regarding the three GUCs that don't affect any of the queries, we
can't really add paths in those places, as we wouldn't be able to test
them. If we manage to construct a query that'd benefit from them, we can
revisit this.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Thu, Sep 12, 2019 at 11:54:06AM -0400, James Coleman wrote:
>> OK, so we have that now.  I suppose this spreadsheet now tells us which
>> places are useful and which aren't, at least for the queries that you've
>> tested.  Dowe that mean that we want to get the patch to consider adding
>> paths only the places that your spreadsheet says are useful?  I'm not
>> sure what the next steps are for this patch.
>
>I wanted to note here that I haven't abandoned this patch, but ended
>up needing to use my extra time for working on a conference talk. That
>talk is today, so I'm hoping to be able to catch up on this again
>soon.
>

Good! I'm certainly looking forward to a new patch version.

As discussed in the past, this patch is pretty sensitive (large, touches
planning, ...), so we should try to get most of it in not too late in
the cycle. For example, 2019-11 would be nice.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Fri, Sep 13, 2019 at 10:54 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Thu, Sep 12, 2019 at 11:54:06AM -0400, James Coleman wrote:
> >> OK, so we have that now.  I suppose this spreadsheet now tells us which
> >> places are useful and which aren't, at least for the queries that you've
> >> tested.  Dowe that mean that we want to get the patch to consider adding
> >> paths only the places that your spreadsheet says are useful?  I'm not
> >> sure what the next steps are for this patch.
> >
> >I wanted to note here that I haven't abandoned this patch, but ended
> >up needing to use my extra time for working on a conference talk. That
> >talk is today, so I'm hoping to be able to catch up on this again
> >soon.
> >
>
> Good! I'm certainly looking forward to a new patch version.
>
> As discussed in the past, this patch is pretty sensitive (large, touches
> planning, ...), so we should try getting most of it in not too late in
> the cycle. For example 2019-11 would be nice.

Completely agree; originally I'd hoped to have it in roughly finished
draft form to get serious review in the September CF...but...



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> >> ...
> >>
> >> I think this may be a thinko, as this plan demonstrates - but I'm not
> >> sure about it. I wonder if this might be penalizing some other types of
> >> plans (essentially anything with limit + gather).
> >>
> >> Attached is a WIP patch fixing this by considering both startup and
> >> total cost (by calling compare_path_costs_fuzzily).
> >
> >It seems to me that this is likely a bug, and not just a changed
> >needed for this. Do you think it's better addressed in a separate
> >thread? Or retain it as part of this patch for now (and possibly break
> >it out later)? On the other hand, it's entirely possible that someone
> >more familiar with parallel plan limitations could explain why the
> >above comment holds true. That makes me lean towards asking in a new
> >thread.
> >
>
> Maybe. I think creating a separate thread would be useful, provided we
> manage to demonstrate the issue without an incremental sort.

I did some more thinking about this, and I can't currently come up
with a way to reproduce this issue outside of this patch. It doesn't
seem reasonable to me to assume that there's anything inherent about
this patch that means it's the only way we can end up with a partial
path with a low startup cost we'd want to prefer.

Part of me wants to pull it over to a separate thread just to get
additional feedback, but I'm not sure how useful that is given we
don't currently have an example case outside of this patch.

One thing to note though: the current patch does not also modify
add_partial_path_precheck, which likewise does not take startup cost
into account, so we probably need to update that for completeness's
sake.

James Coleman



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Sun, Sep 15, 2019 at 09:33:33PM -0400, James Coleman wrote:
>On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> >> ...
>> >>
>> >> I think this may be a thinko, as this plan demonstrates - but I'm not
>> >> sure about it. I wonder if this might be penalizing some other types of
>> >> plans (essentially anything with limit + gather).
>> >>
>> >> Attached is a WIP patch fixing this by considering both startup and
>> >> total cost (by calling compare_path_costs_fuzzily).
>> >
>> >It seems to me that this is likely a bug, and not just a changed
>> >needed for this. Do you think it's better addressed in a separate
>> >thread? Or retain it as part of this patch for now (and possibly break
>> >it out later)? On the other hand, it's entirely possible that someone
>> >more familiar with parallel plan limitations could explain why the
>> >above comment holds true. That makes me lean towards asking in a new
>> >thread.
>> >
>>
>> Maybe. I think creating a separate thread would be useful, provided we
>> manage to demonstrate the issue without an incremental sort.
>
>I did some more thinking about this, and I can't currently come up
>with a way to reproduce this issue outside of this patch. It doesn't
>seem reasonable to me to assume that there's anything inherent about
>this patch that means it's the only way we can end up with a partial
>path with a low startup cost we'd want to prefer.
>
>Part of me wants to pull it over to a separate thread just to get
>additional feedback, but I'm not sure how useful that is given we
>don't currently have an example case outside of this patch.
>

Hmm, I see.

While I initially suggested starting a separate thread only if we had an
example not involving an incremental sort, that's probably not a hard
requirement. I think it's fine to start a thread briefly explaining the
issue, and pointing to the incremental sort thread for the actual
example.

>
>One thing to note though: the current patch does not also modify
>add_partial_path_precheck which also does not take into account
>startup cost, so we probably need to update that for completeness's
>sake.
>

Good point. It does indeed seem to make the same assumption about only
comparing total cost before calling add_path_precheck.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Mon, Sep 9, 2019 at 5:55 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> The "patched" column means all developer GUCs disabled, so it's expected
> to produce the same plan as master (and it is). And then there's one
> column for each developer GUC. If the column is just TRUE it means the
> GUC does not affect any of the synthetic queries. There are 4 of them:
>
> - devel_add_paths_to_grouping_rel_parallel
> - devel_create_partial_grouping_paths
> - devel_gather_grouping_paths
> - devel_standard_join_search
>
> The places controlled by those GUCs are either useless, or the query
> affected by them is not included in the list of queries.

I'd previously found (in my reverse engineering efforts) the query:

select *
from tenk1 t1
join tenk1 t2 on t1.hundred = t2.hundred
join tenk1 t3 on t1.hundred = t3.hundred
order by t1.hundred, t1.twenty
limit 50;

can change plans to use incremental sort when
generate_useful_gather_paths() is added to standard_join_search().
Specifically, we get a merge join between t1 and t3 as the top level
(besides limit) node where the driving side of the join is a gather
merge with incremental sort. This does rely on these GUCs being set in
the test harness:

set local max_parallel_workers_per_gather=4;
set local min_parallel_table_scan_size=0;
set local parallel_tuple_cost=0;
set local parallel_setup_cost=0;

So I think we can reduce the number of unused GUCs to 3.

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Mon, Sep 16, 2019 at 6:32 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Sun, Sep 15, 2019 at 09:33:33PM -0400, James Coleman wrote:
> >On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >> >> ...
> >> >>
> >> >> I think this may be a thinko, as this plan demonstrates - but I'm not
> >> >> sure about it. I wonder if this might be penalizing some other types of
> >> >> plans (essentially anything with limit + gather).
> >> >>
> >> >> Attached is a WIP patch fixing this by considering both startup and
> >> >> total cost (by calling compare_path_costs_fuzzily).
> >> >
> >> >It seems to me that this is likely a bug, and not just a changed
> >> >needed for this. Do you think it's better addressed in a separate
> >> >thread? Or retain it as part of this patch for now (and possibly break
> >> >it out later)? On the other hand, it's entirely possible that someone
> >> >more familiar with parallel plan limitations could explain why the
> >> >above comment holds true. That makes me lean towards asking in a new
> >> >thread.
> >> >
> >>
> >> Maybe. I think creating a separate thread would be useful, provided we
> >> manage to demonstrate the issue without an incremental sort.
> >
> >I did some more thinking about this, and I can't currently come up
> >with a way to reproduce this issue outside of this patch. It doesn't
> >seem reasonable to me to assume that there's anything inherent about
> >this patch that means it's the only way we can end up with a partial
> >path with a low startup cost we'd want to prefer.
> >
> >Part of me wants to pull it over to a separate thread just to get
> >additional feedback, but I'm not sure how useful that is given we
> >don't currently have an example case outside of this patch.
> >
>
> Hmm, I see.
>
> While I initially suggested to start a separate thread only if we have
> example not involving an incremental sort, that's probably not a hard
> requirement. I think it's fine to start a thead briefly explaining the
> issue, and pointing to incremental sort thread for actual example.
>
> >
> >One thing to note though: the current patch does not also modify
> >add_partial_path_precheck which also does not take into account
> >startup cost, so we probably need to update that for completeness's
> >sake.
> >
>
> Good point. It does indeed seem to make the same assumption about only
> comparing total cost before calling add_path_precheck.

I've started a new thread to discuss:
https://www.postgresql.org/message-id/CAAaqYe-5HmM4ih6FWp2RNV9rruunfrFrLhqFXF_nrrNCPy1Zhg%40mail.gmail.com

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Fri, Sep 13, 2019 at 10:51 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Thu, Sep 12, 2019 at 12:49:29PM -0300, Alvaro Herrera wrote:
> >On 2019-Jul-30, Tomas Vondra wrote:
> >> I've decided to do a couple of experiments, trying to make my mind about
> >> which modified places matter to diffrent queries. But instead of trying
> >> to reverse engineer the queries, I've taken a different approach - I've
> >> compiled a list of queries that I think are sensible and relevant, and
> >> then planned them with incremental sort enabled in different places.
> >
> >[...]
> >
> >> The list of queries (synthetic, but hopefully sufficiently realistic)
> >> and a couple of scripts to collect the plans is in this repository:
> >>
> >>  https://github.com/tvondra/incremental-sort-tests-2
> >>
> >> There's also a spreadsheet with a summary of results, with a visual
> >> representation of which GUCs affect which queries.
> >
> >OK, so we have that now.  I suppose this spreadsheet now tells us which
> >places are useful and which aren't, at least for the queries that you've
> >tested.  Dowe that mean that we want to get the patch to consider adding
> >paths only the places that your spreadsheet says are useful?  I'm not
> >sure what the next steps are for this patch.
> >
>
> Yes. I think the spreadsheet call help us with answering two things:
>
> 1) places actually affecting the plan (all but three do)
>
> 2) redundant places (there are some cases where two GUCs produce the
> same plan in the end)

To expand on this further, (1) should probably help us to be able to
write test cases.

Additionally, one big thing we still need that's somewhat external to
the patch is a good way to benchmark - a set of queries that we believe
are representative enough to serve as good benchmarks.

I'd really appreciate some input from you all on that particular
question; I feel like it's in some sense the biggest barrier to
getting the patch merged, but also the part where long experience in
the community/exposure to other use cases will probably be quite
valuable.

> Of course, this does assume the query set makes sense and is somewhat
> realistic, but I've tried to construct queries where that is true. We
> may extend it over time, of course.
>
> I think we've agreed to add incremental sort paths different places in
> separate patches, to make review easier. So this may be a useful way to
> decide which places to address first. I'd probably do it in this order:
>
> - create_ordered_paths
> - create_ordered_paths (parallel part)
> - add_paths_to_grouping_rel
> - ... not sure ...
>
> but that's just a proposal. It'd give us most of the benefits, I think,
> and we could also focus on the rest of the patch.

Certainly the first two seem like the most obviously necessary base
cases. Supporting GROUP BYs also seems like a pretty standard case, so
at first glance this looks like a reasonable course to me.

I'm going to start breaking up the patches in this thread into a
series in support of that. Since I've started a new thread with the
add_partial_path change, I'll include that patch here as part of this
series also. Do you think it's worth moving the tuplesort changes into
a standalone patch in the series also?

Attached is a rebased v31 now broken into the following:

- 001-consider-startup-cost-in-add-partial-path_v1.patch: From the
other thread (Tomas's patch unmodified)
- 002-incremental-sort_v31.patch: Updated base incremental sort patch

Besides rebasing, I've changed the enable_incrementalsort GUC to
prevent generating paths entirely rather than being cost-based, since
incremental sort is never absolutely necessary in the way regular sort
is.
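
In other words the behavior is now roughly this (a sketch only, on a
hypothetical table t(a,b,c) with an index on a):

    set enable_incrementalsort = off;
    -- no Incremental Sort node should show up at all now, instead of
    -- the path merely being costed out of consideration
    explain (costs off) select * from t order by a, b limit 10;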

I'm hoping to add 003 soon with the initial parallel parts, but I'm
about out of time right now and wanted to get something out, so
sending this without that.

Side question: for the patch tester, do I have to attach each part of
the series each time even if nothing's changed in several of them? And
does the vN number at the end need to stay the same for all of them?
My attachments to this email don't follow that... Also, since this
email changes the patch naming, do I need to do anything to clear out
the old ones? (I suppose if not, then that would imply an answer to the
first question also.)

James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Fri, Sep 27, 2019 at 08:31:30PM -0400, James Coleman wrote:
>On Fri, Sep 13, 2019 at 10:51 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Thu, Sep 12, 2019 at 12:49:29PM -0300, Alvaro Herrera wrote:
>> >On 2019-Jul-30, Tomas Vondra wrote:
>> >> I've decided to do a couple of experiments, trying to make my mind about
>> >> which modified places matter to diffrent queries. But instead of trying
>> >> to reverse engineer the queries, I've taken a different approach - I've
>> >> compiled a list of queries that I think are sensible and relevant, and
>> >> then planned them with incremental sort enabled in different places.
>> >
>> >[...]
>> >
>> >> The list of queries (synthetic, but hopefully sufficiently realistic)
>> >> and a couple of scripts to collect the plans is in this repository:
>> >>
>> >>  https://github.com/tvondra/incremental-sort-tests-2
>> >>
>> >> There's also a spreadsheet with a summary of results, with a visual
>> >> representation of which GUCs affect which queries.
>> >
>> >OK, so we have that now.  I suppose this spreadsheet now tells us which
>> >places are useful and which aren't, at least for the queries that you've
>> >tested.  Dowe that mean that we want to get the patch to consider adding
>> >paths only the places that your spreadsheet says are useful?  I'm not
>> >sure what the next steps are for this patch.
>> >
>>
>> Yes. I think the spreadsheet call help us with answering two things:
>>
>> 1) places actually affecting the plan (all but three do)
>>
>> 2) redundant places (there are some cases where two GUCs produce the
>> same plan in the end)
>
>To expand on this further, (1) should probably help us to be able to
>write test cases.
>
>Additionally, one big thing we still need that's somewhat external to
>the patch is a good way to benchmark/a set of queries that we believe
>are representative enough to be good benchmarks.
>
>I'd really appreciate some input from you all on that particular
>question; I feel like it's in some sense the biggest barrier to
>getting the patch merged, but also the part where long experience in
>the community/exposure to other use cases will probably be quite
>valuable.
>

Hmmm. I don't think anyone will hand us a set of representative queries,
so I think we have two options:

1) Generate synthetic queries covering a wide range of cases (both
where incremental sort is expected to help and where it isn't - see the
sketch after this list). I think the script I've used to determine which
places do matter can be the starting point.

2) Look at some established benchmarks and see if some of the queries
could benefit from incremental sort (possibly with some changes to the
indexes in the usual schema). I plan to look at TPC-H / TPC-DS, but I
wonder if some OLTP benchmarks would be relevant too.
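
To make the synthetic option a bit more concrete, these are the kinds
of shapes I have in mind (a sketch, on a hypothetical table t(a,b,c)
with an index on a, so the first ordering has a presorted prefix and
the second does not):

    select * from t order by a, b limit 10;    -- incremental sort may help
    select * from t order by b, c limit 10;    -- plain sort expected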

>> Of course, this does assume the query set makes sense and is somewhat
>> realistic, but I've tried to construct queries where that is true. We
>> may extend it over time, of course.
>>
>> I think we've agreed to add incremental sort paths different places in
>> separate patches, to make review easier. So this may be a useful way to
>> decide which places to address first. I'd probably do it in this order:
>>
>> - create_ordered_paths
>> - create_ordered_paths (parallel part)
>> - add_paths_to_grouping_rel
>> - ... not sure ...
>>
>> but that's just a proposal. It'd give us most of the benefits, I think,
>> and we could also focus on the rest of the patch.
>
>Certainly the first two seem like pretty obvious most necessary base
>cases. I think supporting group bys also seems like a pretty standard
>case, so at first glance I'd say this seems like a reasonable course
>to me.
>

OK.

>I'm going to start breaking up the patches in this thread into a
>series in support of that. Since I've started a new thread with the
>add_partial_path change, I'll include that patch here as part of this
>series also. Do you think it's worth moving the tuplesort changes into
>a standalone patch in the series also?
>

Probably. I'd do that at least for the review.

>Attached is a rebased v31 now broken into the following:
>
>- 001-consider-startup-cost-in-add-partial-path_v1.patch: From the
>other thread (Tomas's patch unmodified)
>- 002-incremental-sort_v31.patch: Updated base incremental sort patch
>
>Besides rebasing, I've changed the enable_incrementalsort GUC to
>prevent generating paths entirely rather than being cost-based, since
>incremental sort is never absolutely necessary in the way regular sort
>is.
>

OK, makes sense.

>I'm hoping to add 003 soon with the initial parallel parts, but I'm
>about out of time right now and wanted to get something out, so
>sending this without that.
>
>Side question: for the patch tester do I have to attach each part of
>the series each time even if nothing's changed in several of them? And
>does the vN number at the end need to stay the same for all of them?
>My attachments to this email don't follow that... Also, since this
>email changes patch naming, so I need to do anything to clear out the
>old ones? (I suppose if not, then that would imply an answer to the
>first question also.)
>

Please always send the whole patch series. Firstly, that's the only way
the cfbot can apply and test the patches (it can't collect patches
scattered across different messages in the thread). Secondly, it's
really annoying for the reviewers to try to pick out the matching bits.

Also, it's a good idea to always mark all parts with the same version
info, not v1 for one part and v31 for another. I'd simply do something
like

    git format-patch --suffix=-YYYYMMDD.patch master

or something like that.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Fri, Sep 27, 2019 at 01:50:30PM -0400, James Coleman wrote:
>On Mon, Sep 9, 2019 at 5:55 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> The "patched" column means all developer GUCs disabled, so it's expected
>> to produce the same plan as master (and it is). And then there's one
>> column for each developer GUC. If the column is just TRUE it means the
>> GUC does not affect any of the synthetic queries. There are 4 of them:
>>
>> - devel_add_paths_to_grouping_rel_parallel
>> - devel_create_partial_grouping_paths
>> - devel_gather_grouping_paths
>> - devel_standard_join_search
>>
>> The places controlled by those GUCs are either useless, or the query
>> affected by them is not included in the list of queries.
>
>I'd previously found (in my reverse engineering efforts) the query:
>
>select *
>from tenk1 t1
>join tenk1 t2 on t1.hundred = t2.hundred
>join tenk1 t3 on t1.hundred = t3.hundred
>order by t1.hundred, t1.twenty
>limit 50;
>
>can change plans to use incremental sort when
>generate_useful_gather_paths() is added to standard_join_search().
>Specifically, we get a merge join between t1 and t3 as the top level
>(besides limit) node where the driving side of the join is a gather
>merge with incremental sort. This does rely on these gucs set in the
>test harness:
>
>set local max_parallel_workers_per_gather=4;
>set local min_parallel_table_scan_size=0;
>set local parallel_tuple_cost=0;
>set local parallel_setup_cost=0;
>
>So I think we can reduce the number of unused gucs to 3.
>

OK. I'll try extending the set of synthetic queries in [1] to also do
something like this and generate similar plans.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Michael Paquier
Date:
On Sun, Sep 29, 2019 at 01:00:49AM +0200, Tomas Vondra wrote:
> OK. I'll try extending the set of synthetic queries in [1] to also do
> soemthing like this and generate similar plans.

Any progress on that?

Please note that the latest patch does not apply anymore, so a rebase
is needed.  I am switching the patch to waiting on author for now.
--
Michael

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Fri, Nov 29, 2019 at 03:01:46PM +0900, Michael Paquier wrote:
>On Sun, Sep 29, 2019 at 01:00:49AM +0200, Tomas Vondra wrote:
>> OK. I'll try extending the set of synthetic queries in [1] to also do
>> soemthing like this and generate similar plans.
>
>Any progress on that?
>
>Please note that the latest patch does not apply anymore, so a rebase
>is needed.  I am switching the patch as waiting on author for now.
>--

Ah, thanks for reminding me. I've added a couple more queries with two
joins (previously there were only queries with a single join; I hadn't
expected another join to make such a difference, but it seems I was
wrong).

So yes, there seem to be 6 different GUCs / places where considering
incremental sort makes a difference (the numbers say how many of the
4960 tested combinations were affected)

- create_ordered_paths_parallel (50)
- create_partial_grouping_paths_2 (228)
- standard_join_search (94)
- add_paths_to_grouping_rel (2148)
- set_rel_pathlist (156)
- apply_scanjoin_target_to_paths (286)

Clearly some of the places are more important than others, plus there
are some overlaps (two GUCs producing the same plan, etc.).

Plus there are four GUCs that did not affect any queries at all:

- create_partial_grouping_paths
- gather_grouping_paths
- create_ordered_paths
- add_paths_to_grouping_rel_parallel

Anyway, this might serve as a way to prioritize the effort. All the
test changes are in the original repo at

   https://github.com/tvondra/incremental-sort-tests-2
   
and I'm also attaching the rebased patches - the changes were pretty
minor, so hopefully that helps others. All the patches with the dev
GUCs are in

   https://github.com/tvondra/postgres/tree/incremental-sort-20191129


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
Hi,

This patch has been marked as WoA since the end of November, and there
have been no discussions or reviews since then :-( Based on off-list
discussion with James I don't think that's going to change in this CF,
so I'll move it to the next CF.

I plan to work on the planner part of this patch before 2020-03, with
the hope it can still make it into 13.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Jan 21, 2020 at 9:25 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> This patch has been marked as WoA since end of November, and there has
> been no discussion/reviews since then :-( Based on off-list discussion
> with James I don't think that's going to change in this CF, so I'll move
> it to the next CF.
>
> I plan to work on the planner part of this patch before 2020-03, with
> the hope it can still make it into 13.

In that off-list discussion I'd mentioned to Tomas that I would still
like to work on this; it's just that my other responsibilities at work
have left me little time to work on the most important remaining part
of this (the planner parts), since that requires a fair amount of focus
and time.

That being said, the patch also needs some more work on improving the
EXPLAIN ANALYZE output (perhaps min/max/mean or median of memory usage
and the number of groups in each sort mode), and I think it's far more
feasible that I can tackle that piecemeal before the next CF.

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Tue, Jan 21, 2020 at 09:37:01AM -0500, James Coleman wrote:
>On Tue, Jan 21, 2020 at 9:25 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> Hi,
>>
>> This patch has been marked as WoA since end of November, and there has
>> been no discussion/reviews since then :-( Based on off-list discussion
>> with James I don't think that's going to change in this CF, so I'll move
>> it to the next CF.
>>
>> I plan to work on the planner part of this patch before 2020-03, with
>> the hope it can still make it into 13.
>
>In that off-list discussion I'd mentioned to Tomas that I would still
>like to work on this, just my other responsibilities at work have left
>me little time to work on the most important remaining part of this
>(the planner parts) since that requires a fair amount of focus and
>time.
>
>That being said, the patch also needs some more work on improving
>EXPLAIN ANALYZE output (perhaps min/max/mean or median of
>memory usage number of groups in each sort mode), and I think it's far
>more feasible that I can tackle that piecemeal before the next CF.
>

Sure, sorry if that was not clear from my message - you're of course
more than welcome to continue working on this. My understanding is that
it won't happen by the end of this CF, hence the move to 2020-03.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Jan 21, 2020 at 9:58 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Jan 21, 2020 at 09:37:01AM -0500, James Coleman wrote:
> >On Tue, Jan 21, 2020 at 9:25 AM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> Hi,
> >>
> >> This patch has been marked as WoA since end of November, and there has
> >> been no discussion/reviews since then :-( Based on off-list discussion
> >> with James I don't think that's going to change in this CF, so I'll move
> >> it to the next CF.
> >>
> >> I plan to work on the planner part of this patch before 2020-03, with
> >> the hope it can still make it into 13.
> >
> >In that off-list discussion I'd mentioned to Tomas that I would still
> >like to work on this, just my other responsibilities at work have left
> >me little time to work on the most important remaining part of this
> >(the planner parts) since that requires a fair amount of focus and
> >time.
> >
> >That being said, the patch also needs some more work on improving
> >EXPLAIN ANALYZE output (perhaps min/max/mean or median of
> >memory usage number of groups in each sort mode), and I think it's far
> >more feasible that I can tackle that piecemeal before the next CF.
> >
>
> Sure, sorry if that was not clear from my message - you're of course
> more than welcome to continue working on this. My understanding is that
> won't happen by the end of this CF, hence the move to 2020-03.

Oh, yeah, I probably didn't word that reply well -- I just wanted to
add some additional detail to what you had already said.

Thanks for your work on this!

James



Re: [PATCH] Incremental sort

From
David Steele
Date:
James and Tomas,

On 1/21/20 10:03 AM, James Coleman wrote:
> On Tue, Jan 21, 2020 at 9:58 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Tue, Jan 21, 2020 at 09:37:01AM -0500, James Coleman wrote:
>>> On Tue, Jan 21, 2020 at 9:25 AM Tomas Vondra
>>> <tomas.vondra@2ndquadrant.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> This patch has been marked as WoA since end of November, and there has
>>>> been no discussion/reviews since then :-( Based on off-list discussion
>>>> with James I don't think that's going to change in this CF, so I'll move
>>>> it to the next CF.
>>>>
>>>> I plan to work on the planner part of this patch before 2020-03, with
>>>> the hope it can still make it into 13.
>>>
>>> In that off-list discussion I'd mentioned to Tomas that I would still
>>> like to work on this, just my other responsibilities at work have left
>>> me little time to work on the most important remaining part of this
>>> (the planner parts) since that requires a fair amount of focus and
>>> time.
>>>
>>> That being said, the patch also needs some more work on improving
>>> EXPLAIN ANALYZE output (perhaps min/max/mean or median of
>>> memory usage number of groups in each sort mode), and I think it's far
>>> more feasible that I can tackle that piecemeal before the next CF.
>>>
>>
>> Sure, sorry if that was not clear from my message - you're of course
>> more than welcome to continue working on this. My understanding is that
>> won't happen by the end of this CF, hence the move to 2020-03.
> 
> Oh, yeah, I probably didn't word that reply well -- I just wanted to
> add some additional detail to what you had already said.
> 
> Thanks for your work on this!

It doesn't look like there has been much movement on this patch for the
last few CFs.  Are one or both of you planning to work on this for v13?
Or should we mark it for v14 and/or move it to the next CF?

Regards,
-- 
-David
david@pgmasters.net



Re: [PATCH] Incremental sort

From
Tomas Vondra
Date:
On Tue, Mar 03, 2020 at 12:17:22PM -0500, David Steele wrote:
>James and Tomas,
>
>On 1/21/20 10:03 AM, James Coleman wrote:
>>On Tue, Jan 21, 2020 at 9:58 AM Tomas Vondra
>><tomas.vondra@2ndquadrant.com> wrote:
>>>
>>>On Tue, Jan 21, 2020 at 09:37:01AM -0500, James Coleman wrote:
>>>>On Tue, Jan 21, 2020 at 9:25 AM Tomas Vondra
>>>><tomas.vondra@2ndquadrant.com> wrote:
>>>>>
>>>>>Hi,
>>>>>
>>>>>This patch has been marked as WoA since end of November, and there has
>>>>>been no discussion/reviews since then :-( Based on off-list discussion
>>>>>with James I don't think that's going to change in this CF, so I'll move
>>>>>it to the next CF.
>>>>>
>>>>>I plan to work on the planner part of this patch before 2020-03, with
>>>>>the hope it can still make it into 13.
>>>>
>>>>In that off-list discussion I'd mentioned to Tomas that I would still
>>>>like to work on this, just my other responsibilities at work have left
>>>>me little time to work on the most important remaining part of this
>>>>(the planner parts) since that requires a fair amount of focus and
>>>>time.
>>>>
>>>>That being said, the patch also needs some more work on improving
>>>>EXPLAIN ANALYZE output (perhaps min/max/mean or median of
>>>>memory usage number of groups in each sort mode), and I think it's far
>>>>more feasible that I can tackle that piecemeal before the next CF.
>>>>
>>>
>>>Sure, sorry if that was not clear from my message - you're of course
>>>more than welcome to continue working on this. My understanding is that
>>>won't happen by the end of this CF, hence the move to 2020-03.
>>
>>Oh, yeah, I probably didn't word that reply well -- I just wanted to
>>add some additional detail to what you had already said.
>>
>>Thanks for your work on this!
>
>It doesn't look there has been much movement on this patch for the 
>last few CFs.  Are one or both of you planning to work on this for 
>v13? Or should we mark it for v14 and/or move it to the next CF.
>

I'm currently working on it; I plan to submit a new patch version
shortly - hopefully by the end of this week.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort

From
James Coleman
Date:
On Tue, Mar 3, 2020 at 1:43 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Tue, Mar 03, 2020 at 12:17:22PM -0500, David Steele wrote:
>James and Tomas,
>
>On 1/21/20 10:03 AM, James Coleman wrote:
>>On Tue, Jan 21, 2020 at 9:58 AM Tomas Vondra
>><tomas.vondra@2ndquadrant.com> wrote:
>>>
>>>On Tue, Jan 21, 2020 at 09:37:01AM -0500, James Coleman wrote:
>>>>On Tue, Jan 21, 2020 at 9:25 AM Tomas Vondra
>>>><tomas.vondra@2ndquadrant.com> wrote:
>>>>>
>>>>>Hi,
>>>>>
>>>>>This patch has been marked as WoA since end of November, and there has
>>>>>been no discussion/reviews since then :-( Based on off-list discussion
>>>>>with James I don't think that's going to change in this CF, so I'll move
>>>>>it to the next CF.
>>>>>
>>>>>I plan to work on the planner part of this patch before 2020-03, with
>>>>>the hope it can still make it into 13.
>>>>
>>>>In that off-list discussion I'd mentioned to Tomas that I would still
>>>>like to work on this, just my other responsibilities at work have left
>>>>me little time to work on the most important remaining part of this
>>>>(the planner parts) since that requires a fair amount of focus and
>>>>time.
>>>>
>>>>That being said, the patch also needs some more work on improving
>>>>EXPLAIN ANALYZE output (perhaps min/max/mean or median of
>>>>memory usage number of groups in each sort mode), and I think it's far
>>>>more feasible that I can tackle that piecemeal before the next CF.
>>>>
>>>
>>>Sure, sorry if that was not clear from my message - you're of course
>>>more than welcome to continue working on this. My understanding is that
>>>won't happen by the end of this CF, hence the move to 2020-03.
>>
>>Oh, yeah, I probably didn't word that reply well -- I just wanted to
>>add some additional detail to what you had already said.
>>
>>Thanks for your work on this!
>
>It doesn't look there has been much movement on this patch for the
>last few CFs.  Are one or both of you planning to work on this for
>v13? Or should we mark it for v14 and/or move it to the next CF.
>

I'm currently working on it, I plan to submit a new patch version
shortly - hopefully by the end of this week.

Tomas, thanks much for working on this.

I haven't had a lot of time to dedicate to this, but I do hope soon (late this week or next), in addition to the planner stuff I believe Tomas is working on, to push improvements to the EXPLAIN ANALYZE output.

James

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:


On Tue, Jan 21, 2020 at 9:37 AM James Coleman <jtc331@gmail.com> wrote:
That being said, the patch also needs some more work on improving
EXPLAIN ANALYZE output (perhaps min/max/mean or median of
memory usage number of groups in each sort mode), and I think it's far
more feasible that I can tackle that piecemeal before the next CF.

I'm looking at this now, and realized that at least for parallel plans the current patch tracks the tuplesort instrumentation whether or not an EXPLAIN ANALYZE is in process.

Is this fairly standard for executor nodes? Or is it expected to condition some of this tracking based on whether or not an ANALYZE is running?

I've found EXEC_FLAG_EXPLAIN_ONLY but no equivalent for ANALYZE. Similarly, the InstrumentOption bit flags on the executor state seem to indicate whether specific ANALYZE options should be enabled, but I haven't yet seen anything conditioned solely on whether an ANALYZE is in flight. Could someone point me in the right direction if this is expected?

Thanks,
James

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tom Lane
Date:
James Coleman <jtc331@gmail.com> writes:
> I'm looking at this now, and realized that at least for parallel plans the
> current patch tracks the tuplesort instrumentation whether or not an
> EXPLAIN ANALYZE is in process.

> Is this fairly standard for executor nodes? Or is it expected to condition
> some of this tracking based on whether or not an ANALYZE is running?

No, it's entirely not standard.  Maybe you could make an argument that
it's too cheap to bother making it conditional, but without a convincing
argument for that, it needs to be conditional.

            regards, tom lane



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Thu, Mar 5, 2020 at 5:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
James Coleman <jtc331@gmail.com> writes:
> I'm looking at this now, and realized that at least for parallel plans the
> current patch tracks the tuplesort instrumentation whether or not an
> EXPLAIN ANALYZE is in process.

> Is this fairly standard for executor nodes? Or is it expected to condition
> some of this tracking based on whether or not an ANALYZE is running?

No, it's entirely not standard.  Maybe you could make an argument that
it's too cheap to bother making it conditional, but without a convincing
argument for that, it needs to be conditional.

That's what I figured, but as I mentioned I'm having trouble figuring out how the fact that an ANALYZE is in flight is determined. I assume it's something that lives on the EState or similar, but I'm not seeing anything obvious.

Thanks
James

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tom Lane
Date:
James Coleman <jtc331@gmail.com> writes:
> That's what I figured, but as I mentioned I've having trouble figuring out
> how the fact that an analyze is in flight is determined. I assume it's
> something that lives of the EState or similar, but I'm not seeing anything
> obvious.

AFAIR, it's just whether or not the current planstate node has an
instrumentation struct attached.

            regards, tom lane



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:


On Thu, Mar 5, 2020 at 6:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
James Coleman <jtc331@gmail.com> writes:
> That's what I figured, but as I mentioned I've having trouble figuring out
> how the fact that an analyze is in flight is determined. I assume it's
> something that lives of the EState or similar, but I'm not seeing anything
> obvious.

AFAIR, it's just whether or not the current planstate node has an
instrumentation struct attached.

Oh, that's easy. Thanks for pointing that out. I'll be attaching a new patch soon incorporating that check.
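
Concretely, I'm picturing something along these lines in the incremental sort node (a sketch only -- the exact field names and where the stats get folded in are from my work in progress, not anything final):

    /*
     * Only collect the extra per-group sort stats when EXPLAIN ANALYZE has
     * attached an Instrumentation struct to this node's PlanState.
     */
    if (node->ss.ps.instrument != NULL)
    {
        TuplesortInstrumentation stats;

        tuplesort_get_stats(node->fullsort_state, &stats);
        /* ... fold stats into the node's full-sort group info ... */
    }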

James

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Jan 21, 2020 at 9:37 AM James Coleman <jtc331@gmail.com> wrote:
That being said, the patch also needs some more work on improving
EXPLAIN ANALYZE output (perhaps min/max/mean or median of
memory usage number of groups in each sort mode), and I think it's far
more feasible that I can tackle that piecemeal before the next CF.

James

I'm attaching a rebased patch revision plus a new commit that reworks the EXPLAIN output. I've left that commit separate for now so that it's easy to see the difference, and so that I don't unnecessarily cause merge conflicts while Tomas is working on things in parallel; the next patch revision (assuming the EXPLAIN change looks good) can fold it into the base patch.

Here's what I've changed:

- The stats necessary for ANALYZE are now only kept if the PlanState has a non-null instrument field (thanks to Tom for pointing out that that's the correct way to check whether ANALYZE is in flight). I did leave lines like `node->incsort_info.fullsortGroupInfo.groupCount++;` unguarded by that `if` since it seems like practically zero overhead (about the same cost as checking the condition), but if anyone disagrees, I'm happy to change it. Additionally, when ANALYZE is not in flight, those lines are technically operating on variables that haven't explicitly been initialized in the Init function; please tell me if that's actually an issue given that they are counters and we won't be using them in that case.
- A good bit of cleanup on how parallel workers are output (I believe there was some duplicative group opening and also inconsistent text output with other multi-worker explain nodes). I haven't had a chance to test this yet, though, so there could be bugs.
- I left an XXX in the patch to note a place where I wanted extra eyes. The original patch ignored workers if the tuplesort for that worker still had an in-progress status, but from what I can tell that doesn't make a lot of sense given that we re-use the same tuplesort multiple times. So a parallel worker (I think) could have returned from the first batch, but then still be in progress on the 2nd batch, and we wouldn't want to ignore that worker. As a replacement I'm now checking that at least one of the fullsort and prefixsort group count stats is greater than 0 (to imply we've sorted at least one batch).
- I also left a TODO wondering if we should break out the instrumentation into a separate function; it seems like a decent sized chunk of cleanly extractable code; I suppose that's always a bit of personal preference, so anyone who wants to weigh in gets a vote :)
- The previous implementation assumed the most recent tuplesort usage had the correct information for memory/disk usage and sort implementation, but again, since we re-use, that doesn't make a lot of sense. Instead I now output all sort methods used as well as maximum and average disk and memory usage.

Here's example output:
   ->  Incremental Sort
         Sort Key: a, b
         Presorted Key: a
         Full-sort Groups: 4 (Methods: quicksort) Memory: 26kB (avg), 26kB (max)
         ->  Index Scan using idx_t_a...

You'd have an additional "Presorted Groups: ..." line, paralleling "Full-sort Groups", if any presorted groups are present.
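
For reference, here's roughly the setup/query I've been using to eyeball that output (the names match the sample plan above, but it's just an example -- the planner may of course pick a different plan shape depending on costs):

    CREATE TABLE t (a int, b int);
    INSERT INTO t SELECT i / 10, i % 7 FROM generate_series(1, 100000) s(i);
    CREATE INDEX idx_t_a ON t (a);
    ANALYZE t;

    -- With a small LIMIT and the index providing order on (a), this should
    -- usually choose Index Scan -> Incremental Sort on (a, b).
    EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF)
    SELECT * FROM t ORDER BY a, b LIMIT 1000;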

I haven't yet run pgindent formatting, but I didn't want to modify the base patch given that other work on it is in flight.

James
Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sat, Mar 7, 2020 at 5:47 PM James Coleman <jtc331@gmail.com> wrote:
On Tue, Jan 21, 2020 at 9:37 AM James Coleman <jtc331@gmail.com> wrote:
That being said, the patch also needs some more work on improving
EXPLAIN ANALYZE output (perhaps min/max/mean or median of
memory usage number of groups in each sort mode), and I think it's far
more feasible that I can tackle that piecemeal before the next CF.

James

I'm attaching a rebased patch revision plus a new commit that reworks the EXPLAIN output. I've left that commit separate for now so that it's easy to see the difference, and so that I don't unnecessarily cause merge conflicts while Tomas is working on things in parallel; the next patch revision (assuming the EXPLAIN change looks good) can fold it into the base patch.

Here's what I've changed:

- The stats necessary for ANALYZE are now only kept if the PlanState has a non-null instrument field (thanks to Tom for pointing out that that's the correct way to check whether ANALYZE is in flight). I did leave lines like `node->incsort_info.fullsortGroupInfo.groupCount++;` unguarded by that `if` since it seems like practically zero overhead (about the same cost as checking the condition), but if anyone disagrees, I'm happy to change it. Additionally, when ANALYZE is not in flight, those lines are technically operating on variables that haven't explicitly been initialized in the Init function; please tell me if that's actually an issue given that they are counters and we won't be using them in that case.

And...I discovered that I need to do this anyway. Basically, the original patch stored per-worker instrumentation information on every tuple fetch, which is unnecessary, and in my hasty refactoring I'd replaced that spot with my code to better record stats. The original patch just looked at the last tuplesort state, as I'd mentioned previously, so it didn't have any special instrumentation in non-parallel workers other than incrementing group counters.

But obviously we don't want to record stats for every tuple; we want to record sort info every time we finalize a sort. So I've replaced the group counter increment lines with calls to a newly broken-out function that records stats for the appropriate fullsort/prefixsort group info.

I came across this while adding tests for EXPLAIN ANALYZE and saw a result with the reported average memory usage higher than the max--this happened because I was adding the memory used each time through the loop rather than once when finalizing the sort.
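
In rough outline, that broken-out helper does something like this (all of the type and field names here are from my work-in-progress branch, so treat them as illustrative rather than final):

    /*
     * Sketch: record tuplesort stats once per finalized sort group/batch,
     * accumulating totals and maximums so EXPLAIN can report avg and max.
     */
    static void
    instrumentSortedGroup(IncrementalSortGroupInfo *groupInfo,
                          Tuplesortstate *sortState)
    {
        TuplesortInstrumentation sort_instr;

        groupInfo->groupCount++;

        tuplesort_get_stats(sortState, &sort_instr);
        if (sort_instr.spaceType == SORT_SPACE_TYPE_DISK)
        {
            groupInfo->totalDiskSpaceUsed += sort_instr.spaceUsed;
            groupInfo->maxDiskSpaceUsed = Max(groupInfo->maxDiskSpaceUsed,
                                              sort_instr.spaceUsed);
        }
        else
        {
            groupInfo->totalMemorySpaceUsed += sort_instr.spaceUsed;
            groupInfo->maxMemorySpaceUsed = Max(groupInfo->maxMemorySpaceUsed,
                                                sort_instr.spaceUsed);
        }
    }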
 
- A good bit of cleanup on how parallel workers are output (I believe there was some duplicative group opening and also inconsistent text output with other multi-worker explain nodes). I haven't had a chance to test this yet, though, so there could be bugs.

Note: I still haven't had time to test parallel plans with the updated EXPLAIN, so there aren't tests for that either.

- I also left a TODO wondering if we should break out the instrumentation into a separate function; it seems like a decent sized chunk of cleanly extractable code; I suppose that's always a bit of personal preference, so anyone who wants to weigh in gets a vote :)

I ended up having to do this anyway, for reasons described above.

See new version attached (still with EXPLAIN changes as a separate patch file).

James
Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
Hi,

I've been focusing on the planner part of the patch, particularly on
picking which places need to consider incremental sort paths. As discussed
in the past, we might simply consider incremental sort everywhere we need
a sorted path, but that's likely overkill. So I used the set of queries
used for previous tests [1] and determined which places affect the most
plans (using a separate GUC for each place, per [2]).

The result is attached, on top of the incremental sort patch. Part 0004
tweaks all places to cover all plan changes in the test queries, and part
0005 tweaks a couple of additional places where I think we should consider
incremental sort (but I've been unable to construct a query for which
it'd make a difference).

I've also removed the various GUCs and all the places are now using the
same enable_incrementalsort GUC. Compared to [2] I've also modified the
code to consider full and incremental sort in the same loop, which does
allow reusing some of the work.

The next thing I'll work on is determining how expensive this can be in
extreme cases with many indexes etc.


Now, a couple comments about parts 0001 - 0003 of the patch ...

1) I see a bunch of failures in the regression test, due to minor
differences in the explain output. All the differences are about minor
changes in memory usage, like this:

-               "Sort Space Used": 30,                             +
+               "Sort Space Used": 29,                             +

I'm not sure if it happens on my machine only, but maybe the test is not
entirely stable.


2) I think this bit in ExecReScanIncrementalSort is wrong:

     node->sort_Done = false;
     tuplesort_end(node->fullsort_state);
     node->prefixsort_state = NULL;
     tuplesort_end(node->fullsort_state);
     node->prefixsort_state = NULL;
     node->bound_Done = 0;

Notice both places reset fullsort_state and set prefixsort_state to
NULL. Another thing is that I'm not sure it's fine to pass NULL to
tuplesort_end (my guess is tuplesort_free will fail when it gets NULL).
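
Presumably the intent is something more like this (just a sketch; I'm assuming the sort states may legitimately be NULL before the first batch has been sorted):

    node->sort_Done = false;
    if (node->fullsort_state != NULL)
    {
        tuplesort_end(node->fullsort_state);
        node->fullsort_state = NULL;
    }
    if (node->prefixsort_state != NULL)
    {
        tuplesort_end(node->prefixsort_state);
        node->prefixsort_state = NULL;
    }
    node->bound_Done = 0;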


3) Most of the execution plans look reasonable, except that some of the
plans look like this:


                          QUERY PLAN
   ---------------------------------------------------------
    Limit
      ->  GroupAggregate
            Group Key: t.a, t.b, t.c, t.d
            ->  Incremental Sort
                  Sort Key: t.a, t.b, t.c, t.d
                  Presorted Key: t.a, t.b, t.c
                  ->  Incremental Sort
                        Sort Key: t.a, t.b, t.c
                        Presorted Key: t.a, t.b
                        ->  Index Scan using t_a_b_idx on t
   (10 rows)

i.e. there are two incremental sorts on top of each other, with
different prefixes. But this is not a new issue - it happens with
queries like this:

   SELECT a, b, c, d, count(*) FROM (
     SELECT * FROM t ORDER BY a, b, c
   ) foo GROUP BY a, b, c, d limit 1000;

i.e. there's a subquery with a subset of pathkeys. Without incremental
sort the plan looks like this:

                    QUERY PLAN
   ---------------------------------------------
    Limit
      ->  GroupAggregate
            Group Key: t.a, t.b, t.c, t.d
            ->  Sort
                  Sort Key: t.a, t.b, t.c, t.d
                  ->  Sort
                        Sort Key: t.a, t.b, t.c
                        ->  Seq Scan on t
   (8 rows)

so essentially the same plan shape. What bugs me though is that there
seems to be some sort of memory leak, so that this query consumes
gigabytes of RAM before it gets killed by OOM. But the memory seems not
to be allocated in any memory context (at least MemoryContextStats doesn't
show anything like that), so I'm not sure what's going on.

Reproducing it is fairly simple:

   CREATE TABLE t (a bigint, b bigint, c bigint, d bigint);
   INSERT INTO t SELECT
     1000*random(), 1000*random(), 1000*random(), 1000*random()
   FROM generate_series(1,10000000) s(i);
   CREATE INDEX idx ON t(a,b);
   ANALYZE t;

   EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
   FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
   LIMIT 100;


[1] https://github.com/tvondra/incremental-sort-tests-2

[2] https://github.com/tvondra/postgres/tree/incremental-sort-20200309


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Alvaro Herrera
Date:
I gave this a very quick look; I don't claim to understand it or
anything, but I thought these trivial cleanups worthwhile.  The only
non-cosmetic thing is changing order of arguments to the SOn_printf()
calls in 0008; I think they are contrary to what the comment says.

I don't propose to commit 0003 of course, since it's not our policy;
that's just to allow running pgindent sanely, which gives you 0004
(though my local pgindent has an unrelated fix).  And after that you
notice the issue that 0005 fixes.

I did notice that show_incremental_sort_group_info() seems to be doing
things in a hard way, or something.  I got there because it throws this
warning:

/pgsql/source/master/src/backend/commands/explain.c: In function 'show_incremental_sort_group_info':
/pgsql/source/master/src/backend/commands/explain.c:2766:39: warning: passing argument 2 of 'lappend' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
     methodNames = lappend(methodNames, sortMethodName);
                                        ^~~~~~~~~~~~~~
In file included from /pgsql/source/master/src/include/access/xact.h:20,
                 from /pgsql/source/master/src/backend/commands/explain.c:16:
/pgsql/source/master/src/include/nodes/pg_list.h:509:14: note: expected 'void *' but argument is of type 'const char *'
 extern List *lappend(List *list, void *datum);
              ^~~~~~~
/pgsql/source/master/src/backend/commands/explain.c:2766:39: warning: passing 'const char *' to parameter of type 'void *' discards qualifiers [-Wincompatible-pointer-types-discards-qualifiers]
                        methodNames = lappend(methodNames, sortMethodName);
                                                           ^~~~~~~~~~~~~~
/pgsql/source/master/src/include/nodes/pg_list.h:509:40: note: passing argument to parameter 'datum' here
extern List *lappend(List *list, void *datum);
                                       ^
1 warning generated.

(Eh, it's funny that GCC reports two warnings about the same line, and
then says there's one warning.)

I suppose you could silence this by adding pstrdup(), and then use
list_free_deep (you have to put the sortMethodName declaration in the
inner scope for that, but seems fine).  Or maybe there's a clever way
around it.
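
For example, something like this (a sketch only, assuming sortMethodName comes from tuplesort_method_name(), which returns a const string):

    {
        const char *sortMethodName = tuplesort_method_name(sortMethod);

        /* pstrdup() so the list stores its own non-const copy */
        methodNames = lappend(methodNames, pstrdup(sortMethodName));
    }

    /* ... and later, once the list has been emitted ... */
    list_free_deep(methodNames);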

But I hesitate to send a patch for that because I think the whole
function is written to handle text and the other output formats completely
separately -- but looking for example at show_modifytable_info() it seems
you can do ExplainOpenGroup, ExplainPropertyText, ExplainPropertyList
etc in all explain output modes, and those routines will take care of
emitting the data in the correct format, without having the
show_incremental_sort_group_info function duplicate everything.

HTH.  I would really like to get this patch done for pg13.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Justin Pryzby
Date:
Thanks for working on this.  I have some minor comments.

In 0005:

+                /* Restore the input path (we might have addes Sort on top). */

=> added?  There's at least two more of the same typo.

+                /* also ignore already sorted paths */

=> You say that in a couple places, but I don't think "also" makes sense since
there's nothing preceding it ?

In 0004:

+             * end up resorting the entire data set.  So, unless we can push

=> re-sorting

+ * Unlike generate_gather_paths, this does not look just as pathkeys of the

=> look just AT ?

+            /* now we know is_sorted == false */

=> I would just spell that "Assert", as I think you already do elsewhere.

+                /* continue */

=> Please consider saying "fall through", since "continue" means exactly the
opposite.


+generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
...
+            /* finally, consider incremental sort */
...
+                /* Also consider incremental sort. */

=> I think it's more confusing than useful with two comments - one is adequate.

In 0002:

+ * If it's EXPLAIN ANALYZE, show tuplesort stats for a incremental sort node
...
+ * make_incrementalsort --- basic routine to build a IncrementalSort plan node

=> AN incremental

+ * Initial size of memtuples array.  We're trying to select this size so that
+ * array don't exceed ALLOCSET_SEPARATE_THRESHOLD and overhead of allocation
+ * be possible less.  However, we don't cosider array sizes less than 1024

Four typos (?)
that array DOESN'T
and THE overhead
CONSIDER
I'm not sure, but "be possible less" should maybe say "possibly be less" ?

+    bool        maxSpaceOnDisk;    /* true when maxSpace is value for on-disk

I suggest to call it IsMaxSpaceDisk

+    MemoryContext maincontext;    /* memory context for tuple sort metadata
+                       that persist across multiple batches */

persists

+ *    a new sort.  It allows evade recreation of tuple sort (and save resources)
+ *    when sorting multiple small batches.

allows to avoid?  Or allows avoiding?

+ *     When performing sorting by multiple keys input dataset could be already
+ *     presorted by some prefix of these keys.  We call them "presorted keys".

"already presorted" sounds redundant

+    int64        fullsort_group_count;    /* number of groups with equal presorted keys */
+    int64        prefixsort_group_count;    /* number of groups with equal presorted keys */

I guess these should have different comments

-- 
Justin



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Thu, Mar 12, 2020 at 5:53 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> I gave this a very quick look; I don't claim to understand it or
> anything, but I thought these trivial cleanups worthwhile.  The only
> non-cosmetic thing is changing order of arguments to the SOn_printf()
> calls in 0008; I think they are contrary to what the comment says.

Yes, I think you're correct (re: 0008).

They all look generally good to me, and are included in the attached
patch series.

> I don't propose to commit 0003 of course, since it's not our policy;
> that's just to allow running pgindent sanely, which gives you 0004
> (though my local pgindent has an unrelated fix).  And after that you
> notice the issue that 0005 fixes.

Is there a page on how you're supposed to run pgindent/when stuff like
this does get added/etc.? It's all a big mystery to me right now.

Also, I noticed some of the pgindent changes aren't for changes in
this patch series; I have that as a separate patch, but not attached
because I see that running pgindent locally generates a massive patch,
so I'm assuming we just ignore those for now?

> I did notice that show_incremental_sort_group_info() seems to be doing
> things in a hard way, or something.  I got there because it throws this
> warning:
>
> /pgsql/source/master/src/backend/commands/explain.c: In function 'show_incremental_sort_group_info':
> /pgsql/source/master/src/backend/commands/explain.c:2766:39: warning: passing argument 2 of 'lappend' discards
'const'qualifier from pointer target type [-Wdiscarded-qualifiers]
 
>     methodNames = lappend(methodNames, sortMethodName);
>                                        ^~~~~~~~~~~~~~
> In file included from /pgsql/source/master/src/include/access/xact.h:20,
>                  from /pgsql/source/master/src/backend/commands/explain.c:16:
> /pgsql/source/master/src/include/nodes/pg_list.h:509:14: note: expected 'void *' but argument is of type 'const char
*'
>  extern List *lappend(List *list, void *datum);
>               ^~~~~~~
> /pgsql/source/master/src/backend/commands/explain.c:2766:39: warning: passing 'const char *' to parameter of type
'void*' discards qualifiers [-Wincompatible-pointer-types-discards-qualifiers]
 
>                         methodNames = lappend(methodNames, sortMethodName);
>                                                            ^~~~~~~~~~~~~~
> /pgsql/source/master/src/include/nodes/pg_list.h:509:40: note: passing argument to parameter 'datum' here
> extern List *lappend(List *list, void *datum);
>                                        ^
> 1 warning generated.
>
> (Eh, it's funny that GCC reports two warnings about the same line, and
> then says there's one warning.)

I had seen this before I sent the patch, but then it seemed like it
disappeared, so I didn't come back to it; maybe I just missed it in my
buffer.

I do see it now, and moving the declarations into each relevant block
(rather than trying to share them) seems to fix it. I think that's
correct anyway, since before they were technically being assigned to
more than once which seems wrong for const.

I have this change locally and will include it in my next patch version.

> I suppose you could silence this by adding pstrdup(), and then use
> list_free_deep (you have to put the sortMethodName declaration in the
> inner scope for that, but seems fine).  Or maybe there's a clever way
> around it.
>
> But I hesitate to send a patch for that because I think the whole
> function is written by handling text and the other outputs completely
> separately -- but looking for example show_modifytable_info() it seems
> you can do ExplainOpenGroup, ExplainPropertyText, ExplainPropertyList
> etc in all explain output modes, and those routines will care about
> emitting the data in the correct format, without having the
> show_incremental_sort_group_info function duplicate everything.

I'm not sure how that would work: those functions (for
EXPLAIN_FORMAT_TEXT) all add newlines, and this code is intentionally
trying to avoid too many lines.

I'm open to suggestions though.

> HTH.  I would really like to get this patch done for pg13.

As would I!

James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
...
> Now, a couple comments about parts 0001 - 0003 of the patch ...
>
> 1) I see a bunch of failures in the regression test, due to minor
> differences in the explain output. All the differences are about minor
> changes in memory usage, like this:
>
> -               "Sort Space Used": 30,                             +
> +               "Sort Space Used": 29,                             +
>
> I'm not sure if it happens on my machine only, but maybe the test is not
> entirely stable.

make check passes on multiple machines for me; what arch/distro are you using?

Is there a better way to test these? I would prefer these code paths
have test coverage, but the standard SQL tests don't leave a good way
to handle stuff like this.

Is TAP the only alternative, and do you think it'd be worth considering?

> 2) I think this bit in ExecReScanIncrementalSort is wrong:
>
>      node->sort_Done = false;
>      tuplesort_end(node->fullsort_state);
>      node->prefixsort_state = NULL;
>      tuplesort_end(node->fullsort_state);
>      node->prefixsort_state = NULL;
>      node->bound_Done = 0;
>
> Notice both places reset fullsort_state and set prefixsort_state to
> NULL. Another thing is that I'm not sure it's fine to pass NULL to
> tuplesort_end (my guess is tuplesort_free will fail when it gets NULL).

Fixed.

James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tom Lane
Date:
James Coleman <jtc331@gmail.com> writes:
> On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> 1) I see a bunch of failures in the regression test, due to minor
>> differences in the explain output. All the differences are about minor
>> changes in memory usage, like this:
>>
>> -               "Sort Space Used": 30,                             +
>> +               "Sort Space Used": 29,                             +
>>
>> I'm not sure if it happens on my machine only, but maybe the test is not
>> entirely stable.

> make check passes on multiple machines for me; what arch/distro are you using?

I think there's exactly zero chance of such output being stable across
different platforms, particularly 32-vs-64-bit.  You'll need to either
drop that test or find some way to mask the variability.

> Is there a better way to test these? I would prefer these code paths
> have test coverage, but the standard SQL tests don't leave a good way
> to handle stuff like this.

In some places we use plpgsql code to filter the EXPLAIN output.
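
Something along these lines, just to illustrate the idea (the exact EXPLAIN options and regexes would be whatever the test needs to mask):

    -- Run EXPLAIN through a function and blank out the platform-dependent numbers.
    CREATE FUNCTION explain_filter(query text) RETURNS SETOF text
    LANGUAGE plpgsql AS
    $$
    DECLARE
        ln text;
    BEGIN
        FOR ln IN EXECUTE 'EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) ' || query
        LOOP
            ln := regexp_replace(ln, '\d+kB', 'NNkB', 'g');
            RETURN NEXT ln;
        END LOOP;
    END;
    $$;

    SELECT * FROM explain_filter('SELECT * FROM t ORDER BY a, b');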

            regards, tom lane



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Andres Freund
Date:
Hi,

On 2020-03-13 13:36:44 -0400, Tom Lane wrote:
> James Coleman <jtc331@gmail.com> writes:
> > On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
> > <tomas.vondra@2ndquadrant.com> wrote:
> >> 1) I see a bunch of failures in the regression test, due to minor
> >> differences in the explain output. All the differences are about minor
> >> changes in memory usage, like this:
> >> 
> >> -               "Sort Space Used": 30,                             +
> >> +               "Sort Space Used": 29,                             +
> >> 
> >> I'm not sure if it happens on my machine only, but maybe the test is not
> >> entirely stable.
> 
> > make check passes on multiple machines for me; what arch/distro are you using?
> 
> I think there's exactly zero chance of such output being stable across
> different platforms, particularly 32-vs-64-bit.  You'll need to either
> drop that test or find some way to mask the variability.

+1


> > Is there a better way to test these? I would prefer these code paths
> > have test coverage, but the standard SQL tests don't leave a good way
> > to handle stuff like this.
> 
> In some places we use plpgsql code to filter the EXPLAIN output.

I still think we should just go for a REPRODUCIBLE, TESTING, REGRESS or
similar EXPLAIN option, instead of playing whack-a-mole. Given the
amount of discussion, the reduced test coverage, the increased test
complexity, and the reduced quality of EXPLAIN output for humans, we are
well past the point where the cost of such an option would be worth it.

Greetings,

Andres Freund



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Thu, Mar 12, 2020 at 7:40 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> Thanks for working on this.  I have some minor comments.
>
> In 0005:
>
> +                               /* Restore the input path (we might have addes Sort on top). */
>
> => added?  There's at least two more of the same typo.

Fixed.

> +                               /* also ignore already sorted paths */
>
> => You say that in a couple places, but I don't think "also" makes sense since
> there's nothing preceding it ?

Updated.

> In 0004:
>
> +                        * end up resorting the entire data set.  So, unless we can push
>
> => re-sorting

Fixed in this patch; that also shows up in
contrib/postgres_fdw/postgres_fdw.c, but I'll leave that alone.

> + * Unlike generate_gather_paths, this does not look just as pathkeys of the
>
> => look just AT ?

Fixed.

> +                       /* now we know is_sorted == false */
>
> => I would just spell that "Assert", as I think you already do elsewhere.
>
> +                               /* continue */
>
> => Please consider saying "fall through", since "continue" means exactly the
> opposite.

Updated.

> +generate_useful_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
> ...
> +                       /* finally, consider incremental sort */
> ...
> +                               /* Also consider incremental sort. */
>
> => I think it's more confusing than useful with two comments - one is adequate.

Also fixed.

> In 0002:
>
> + * If it's EXPLAIN ANALYZE, show tuplesort stats for a incremental sort node
> ...
> + * make_incrementalsort --- basic routine to build a IncrementalSort plan node
>
> => AN incremental

Fixed.

> + * Initial size of memtuples array.  We're trying to select this size so that
> + * array don't exceed ALLOCSET_SEPARATE_THRESHOLD and overhead of allocation
> + * be possible less.  However, we don't cosider array sizes less than 1024
>
> Four typos (?)
> that array DOESN'T
> and THE overhead
> CONSIDER
> I'm not sure, but "be possible less" should maybe say "possibly be less" ?

Fixed.

> +       bool            maxSpaceOnDisk; /* true when maxSpace is value for on-disk
>
> I suggest to call it IsMaxSpaceDisk

Changed, though with a lowercase 'i' (let me know if using uppercase is
standard here).

> +       MemoryContext maincontext;      /* memory context for tuple sort metadata
> +                                          that persist across multiple batches */
>
> persists

Fixed.

> + *     a new sort.  It allows evade recreation of tuple sort (and save resources)
> + *     when sorting multiple small batches.
>
> allows to avoid?  Or allows avoiding?

Fixed.

> + *      When performing sorting by multiple keys input dataset could be already
> + *      presorted by some prefix of these keys.  We call them "presorted keys".
>
> "already presorted" sounds redundant

Reworded.

> +       int64           fullsort_group_count;   /* number of groups with equal presorted keys */
> +       int64           prefixsort_group_count; /* number of groups with equal presorted keys */
>
> I guess these should have different comments

The structure of that changed in my patch from a few days ago, I
believe, so there aren't two fields anymore. Are you reviewing the
current patch?

Thanks,
James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Fri, Mar 13, 2020 at 01:16:44PM -0400, James Coleman wrote:
>On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>...
>> Now, a couple comments about parts 0001 - 0003 of the patch ...
>>
>> 1) I see a bunch of failures in the regression test, due to minor
>> differences in the explain output. All the differences are about minor
>> changes in memory usage, like this:
>>
>> -               "Sort Space Used": 30,                             +
>> +               "Sort Space Used": 29,                             +
>>
>> I'm not sure if it happens on my machine only, but maybe the test is not
>> entirely stable.
>
>make check passes on multiple machines for me; what arch/distro are you using?
>

Nothing exotic - Fedora on x64. My guess is that things like enabling
asserts will make a difference too, because then we track additional
stuff for allocated chunks etc. So I agree with Tom that trying to keep
this stable is essentially a lost cause.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> 3) Most of the execution plans look reasonable, except that some of the
> plans look like this:
>
>
>                           QUERY PLAN
>    ---------------------------------------------------------
>     Limit
>       ->  GroupAggregate
>             Group Key: t.a, t.b, t.c, t.d
>             ->  Incremental Sort
>                   Sort Key: t.a, t.b, t.c, t.d
>                   Presorted Key: t.a, t.b, t.c
>                   ->  Incremental Sort
>                         Sort Key: t.a, t.b, t.c
>                         Presorted Key: t.a, t.b
>                         ->  Index Scan using t_a_b_idx on t
>    (10 rows)
>
> i.e. there are two incremental sorts on top of each other, with
> different prefixes. But this this is not a new issue - it happens with
> queries like this:
>
>    SELECT a, b, c, d, count(*) FROM (
>      SELECT * FROM t ORDER BY a, b, c
>    ) foo GROUP BY a, b, c, d limit 1000;
>
> i.e. there's a subquery with a subset of pathkeys. Without incremental
> sort the plan looks like this:
>
>                     QUERY PLAN
>    ---------------------------------------------
>     Limit
>       ->  GroupAggregate
>             Group Key: t.a, t.b, t.c, t.d
>             ->  Sort
>                   Sort Key: t.a, t.b, t.c, t.d
>                   ->  Sort
>                         Sort Key: t.a, t.b, t.c
>                         ->  Seq Scan on t
>    (8 rows)
>
> so essentially the same plan shape. What bugs me though is that there
> seems to be some sort of memory leak, so that this query consumes
> gigabytes os RAM before it gets killed by OOM. But the memory seems not
> to be allocated in any memory context (at least MemoryContextStats don't
> show anything like that), so I'm not sure what's going on.
>
> Reproducing it is fairly simple:
>
>    CREATE TABLE t (a bigint, b bigint, c bigint, d bigint);
>    INSERT INTO t SELECT
>      1000*random(), 1000*random(), 1000*random(), 1000*random()
>    FROM generate_series(1,10000000) s(i);
>    CREATE INDEX idx ON t(a,b);
>    ANALYZE t;
>
>    EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
>    FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
>    LIMIT 100;

While trying to reproduce this, instead of lots of memory usage I got
the attached assertion failure.

James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Alvaro Herrera
Date:
On 2020-Mar-13, James Coleman wrote:

> > I don't propose to commit 0003 of course, since it's not our policy;
> > that's just to allow running pgindent sanely, which gives you 0004
> > (though my local pgindent has an unrelated fix).  And after that you
> > notice the issue that 0005 fixes.
> 
> Is there a page on how you're supposed to run pgindent/when stuff like
> this does get added/etc.? It's all a big mystery to me right now.
> 
> Also, I noticed some of the pgindent changes aren't for changes in
> this patch series; I have that as a separate patch, but not attached
> because I see that running pgindent locally generates a massive patch,
> so I'm assuming we just ignore those for now?

Ah, I should have paid more attention to what I was attaching.  Yeah,
ideally you run pgindent and then only include the changes that are
relevant to your patch series.  We run pgindent across the whole tree
once every release, so about yearly.  Some commits go in that are not
indented correctly, and those bother everyone -- I'm guilty of this
myself more frequently than I'd like.

You can specify a filelist to pgindent, also.  What I do is super
low-tech: do a "git diff origin/master", copy the filelist, and then
^V^E to paste that list into a command line to run pgindent (editing to
remove the change histogram and irrelevant files).  I should automate
this ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> ... You can specify a filelist to pgindent, also.  What I do is super
> low-tech: do a "git diff origin/master", copy the filelist, and then
> ^V^E to paste that list into a command line to run pgindent (editing to
> remove the change histogram and irrelevant files).  I should automate
> this ...

Yeah.  I tend to keep copies of the files I'm specifically hacking on
in a separate work directory, and then I re-indent just that directory.
But that's far from ideal as well.  I wonder if it'd be worth teaching
pgindent to have some option to indent only files that are already
modified according to git?

            regards, tom lane



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Alvaro Herrera
Date:
On 2020-Mar-13, Tom Lane wrote:

> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > ... You can specify a filelist to pgindent, also.  What I do is super
> > low-tech: do a "git diff origin/master", copy the filelist, and then
> > ^V^E to paste that list into a command line to run pgindent (editing to
> > remove the change histogram and irrelevant files).  I should automate
> > this ...
> 
> Yeah.  I tend to keep copies of the files I'm specifically hacking on
> in a separate work directory, and then I re-indent just that directory.
> But that's far from ideal as well.  I wonder if it'd be worth teaching
> pgindent to have some option to indent only files that are already
> modified according to git?

A quick look at git-ls-files manpage suggests that this might work:

 src/tools/pgindent/pgindent $(git ls-files --modified -- *.[ch])

If it's that easy, maybe it's not worth messing with pgindent ...


Also, I wonder if it would be better to modify our policies so that we
update typedefs.list more frequently.  Some people include additions
with their commits, but it's far from SOP.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Also, I wonder if it would be better to modify our policies so that we
> update typedefs.list more frequently.  Some people include additions
> with their commits, but it's far from SOP.

Perhaps.  My own workflow includes pulling down a fresh typedefs.list
from the buildfarm (which is trivial to automate) and then adding any
typedefs invented by the patch I'm working on.  The latter part of it
makes it hard to see how the in-tree list would be very helpful; and
if we started expecting patches to include typedef updates, I'm afraid
we'd get lots of patch collisions in that file.

I don't have any big objection to updating the in-tree list more often,
but personally I wouldn't use it, unless we can find a better workflow.

            regards, tom lane



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Fri, Mar 13, 2020 at 2:23 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > 3) Most of the execution plans look reasonable, except that some of the
> > plans look like this:
> >
> >
> >                           QUERY PLAN
> >    ---------------------------------------------------------
> >     Limit
> >       ->  GroupAggregate
> >             Group Key: t.a, t.b, t.c, t.d
> >             ->  Incremental Sort
> >                   Sort Key: t.a, t.b, t.c, t.d
> >                   Presorted Key: t.a, t.b, t.c
> >                   ->  Incremental Sort
> >                         Sort Key: t.a, t.b, t.c
> >                         Presorted Key: t.a, t.b
> >                         ->  Index Scan using t_a_b_idx on t
> >    (10 rows)
> >
> > i.e. there are two incremental sorts on top of each other, with
> > different prefixes. But this this is not a new issue - it happens with
> > queries like this:
> >
> >    SELECT a, b, c, d, count(*) FROM (
> >      SELECT * FROM t ORDER BY a, b, c
> >    ) foo GROUP BY a, b, c, d limit 1000;
> >
> > i.e. there's a subquery with a subset of pathkeys. Without incremental
> > sort the plan looks like this:
> >
> >                     QUERY PLAN
> >    ---------------------------------------------
> >     Limit
> >       ->  GroupAggregate
> >             Group Key: t.a, t.b, t.c, t.d
> >             ->  Sort
> >                   Sort Key: t.a, t.b, t.c, t.d
> >                   ->  Sort
> >                         Sort Key: t.a, t.b, t.c
> >                         ->  Seq Scan on t
> >    (8 rows)
> >
> > so essentially the same plan shape. What bugs me though is that there
> > seems to be some sort of memory leak, so that this query consumes
> > gigabytes os RAM before it gets killed by OOM. But the memory seems not
> > to be allocated in any memory context (at least MemoryContextStats don't
> > show anything like that), so I'm not sure what's going on.
> >
> > Reproducing it is fairly simple:
> >
> >    CREATE TABLE t (a bigint, b bigint, c bigint, d bigint);
> >    INSERT INTO t SELECT
> >      1000*random(), 1000*random(), 1000*random(), 1000*random()
> >    FROM generate_series(1,10000000) s(i);
> >    CREATE INDEX idx ON t(a,b);
> >    ANALYZE t;
> >
> >    EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
> >    FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
> >    LIMIT 100;
>
> While trying to reproduce this, instead of lots of memory usage, I got
> the attached assertion failure instead.

And, without the EXPLAIN ANALYZE, I was able to get this one, which will
probably be a lot more helpful.

James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Fri, Mar 13, 2020 at 04:31:16PM -0400, James Coleman wrote:
>On Fri, Mar 13, 2020 at 2:23 PM James Coleman <jtc331@gmail.com> wrote:
>>
>> On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>> > 3) Most of the execution plans look reasonable, except that some of the
>> > plans look like this:
>> >
>> >
>> >                           QUERY PLAN
>> >    ---------------------------------------------------------
>> >     Limit
>> >       ->  GroupAggregate
>> >             Group Key: t.a, t.b, t.c, t.d
>> >             ->  Incremental Sort
>> >                   Sort Key: t.a, t.b, t.c, t.d
>> >                   Presorted Key: t.a, t.b, t.c
>> >                   ->  Incremental Sort
>> >                         Sort Key: t.a, t.b, t.c
>> >                         Presorted Key: t.a, t.b
>> >                         ->  Index Scan using t_a_b_idx on t
>> >    (10 rows)
>> >
>> > i.e. there are two incremental sorts on top of each other, with
>> > different prefixes. But this this is not a new issue - it happens with
>> > queries like this:
>> >
>> >    SELECT a, b, c, d, count(*) FROM (
>> >      SELECT * FROM t ORDER BY a, b, c
>> >    ) foo GROUP BY a, b, c, d limit 1000;
>> >
>> > i.e. there's a subquery with a subset of pathkeys. Without incremental
>> > sort the plan looks like this:
>> >
>> >                     QUERY PLAN
>> >    ---------------------------------------------
>> >     Limit
>> >       ->  GroupAggregate
>> >             Group Key: t.a, t.b, t.c, t.d
>> >             ->  Sort
>> >                   Sort Key: t.a, t.b, t.c, t.d
>> >                   ->  Sort
>> >                         Sort Key: t.a, t.b, t.c
>> >                         ->  Seq Scan on t
>> >    (8 rows)
>> >
>> > so essentially the same plan shape. What bugs me though is that there
>> > seems to be some sort of memory leak, so that this query consumes
>> > gigabytes os RAM before it gets killed by OOM. But the memory seems not
>> > to be allocated in any memory context (at least MemoryContextStats don't
>> > show anything like that), so I'm not sure what's going on.
>> >
>> > Reproducing it is fairly simple:
>> >
>> >    CREATE TABLE t (a bigint, b bigint, c bigint, d bigint);
>> >    INSERT INTO t SELECT
>> >      1000*random(), 1000*random(), 1000*random(), 1000*random()
>> >    FROM generate_series(1,10000000) s(i);
>> >    CREATE INDEX idx ON t(a,b);
>> >    ANALYZE t;
>> >
>> >    EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
>> >    FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
>> >    LIMIT 100;
>>
>> While trying to reproduce this, instead of lots of memory usage, I got
>> the attached assertion failure instead.
>
>And, without the EXPLAIN ANALYZE was able to get this one, which will
>probably be a lot more helpful.
>

Hmmm, I'll try reproducing it, but can you investigate the values in the
Assert? I mean, it fails on this:

   Assert(total_allocated == context->mem_allocated);

so can you get a core or attach to the process using gdb, and see what's
the expected / total value?

BTW, I might have copied the wrong query - can you try with a higher
value in the LIMIT clause? For example:

     EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
     FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
     LIMIT 1000000;

I think this might be the difference and why you don't see the memory leak.
Or maybe it was because of asserts? I'm not sure if I had enabled them
in the build ...


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Friday, March 13, 2020, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Fri, Mar 13, 2020 at 04:31:16PM -0400, James Coleman wrote:
On Fri, Mar 13, 2020 at 2:23 PM James Coleman <jtc331@gmail.com> wrote:

On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> 3) Most of the execution plans look reasonable, except that some of the
> plans look like this:
>
>
>                           QUERY PLAN
>    ---------------------------------------------------------
>     Limit
>       ->  GroupAggregate
>             Group Key: t.a, t.b, t.c, t.d
>             ->  Incremental Sort
>                   Sort Key: t.a, t.b, t.c, t.d
>                   Presorted Key: t.a, t.b, t.c
>                   ->  Incremental Sort
>                         Sort Key: t.a, t.b, t.c
>                         Presorted Key: t.a, t.b
>                         ->  Index Scan using t_a_b_idx on t
>    (10 rows)
>
> i.e. there are two incremental sorts on top of each other, with
> different prefixes. But this this is not a new issue - it happens with
> queries like this:
>
>    SELECT a, b, c, d, count(*) FROM (
>      SELECT * FROM t ORDER BY a, b, c
>    ) foo GROUP BY a, b, c, d limit 1000;
>
> i.e. there's a subquery with a subset of pathkeys. Without incremental
> sort the plan looks like this:
>
>                     QUERY PLAN
>    ---------------------------------------------
>     Limit
>       ->  GroupAggregate
>             Group Key: t.a, t.b, t.c, t.d
>             ->  Sort
>                   Sort Key: t.a, t.b, t.c, t.d
>                   ->  Sort
>                         Sort Key: t.a, t.b, t.c
>                         ->  Seq Scan on t
>    (8 rows)
>
> so essentially the same plan shape. What bugs me though is that there
> seems to be some sort of memory leak, so that this query consumes
> gigabytes os RAM before it gets killed by OOM. But the memory seems not
> to be allocated in any memory context (at least MemoryContextStats don't
> show anything like that), so I'm not sure what's going on.
>
> Reproducing it is fairly simple:
>
>    CREATE TABLE t (a bigint, b bigint, c bigint, d bigint);
>    INSERT INTO t SELECT
>      1000*random(), 1000*random(), 1000*random(), 1000*random()
>    FROM generate_series(1,10000000) s(i);
>    CREATE INDEX idx ON t(a,b);
>    ANALYZE t;
>
>    EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
>    FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
>    LIMIT 100;

While trying to reproduce this, instead of lots of memory usage, I got
the attached assertion failure instead.

And, without the EXPLAIN ANALYZE was able to get this one, which will
probably be a lot more helpful.


Hmmm, I'll try reproducing it, but can you investigate the values in the
Assert? I mean, it fails on this:

  Assert(total_allocated == context->mem_allocated);

so can you get a core or attach to the process using gdb, and see what's
the expected / total value?

BTW, I might have copied the wrong query - can you try with a higher
value in the LIMIT clause? For example:

    EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
    FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
    LIMIT 1000000;

I think this might be the differenc ewhy you don't see the memory leak.
Or maybe it was because of asserts? I'm not sure if I had enabled them
in the build ...

I’m not at my laptop right now, but I’ve started looking at it, but I haven’t figured it out yet. Going from memory, it had allocated 16384 but expected 8192 (I think I have the order of that right). 

It’s very consistently reproducible, thankfully, but doesn’t always happen on the first query; IIRC always the 2nd with LIMIT 100, and I could get it to happen with first at 96 and second at 97, but repeating 96 many times didn’t seem to trigger it. 

I’m hoping it’s the same root cause as the memory leak, but unsure. 

I’ll try a higher number when I get a chance. 

James

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Fri, Mar 13, 2020 at 8:23 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Friday, March 13, 2020, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Fri, Mar 13, 2020 at 04:31:16PM -0400, James Coleman wrote:
>>>
>>> On Fri, Mar 13, 2020 at 2:23 PM James Coleman <jtc331@gmail.com> wrote:
>>>>
>>>>
>>>> On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
>>>> <tomas.vondra@2ndquadrant.com> wrote:
>>>> > 3) Most of the execution plans look reasonable, except that some of the
>>>> > plans look like this:
>>>> >
>>>> >
>>>> >                           QUERY PLAN
>>>> >    ---------------------------------------------------------
>>>> >     Limit
>>>> >       ->  GroupAggregate
>>>> >             Group Key: t.a, t.b, t.c, t.d
>>>> >             ->  Incremental Sort
>>>> >                   Sort Key: t.a, t.b, t.c, t.d
>>>> >                   Presorted Key: t.a, t.b, t.c
>>>> >                   ->  Incremental Sort
>>>> >                         Sort Key: t.a, t.b, t.c
>>>> >                         Presorted Key: t.a, t.b
>>>> >                         ->  Index Scan using t_a_b_idx on t
>>>> >    (10 rows)
>>>> >
>>>> > i.e. there are two incremental sorts on top of each other, with
>>>> > different prefixes. But this this is not a new issue - it happens with
>>>> > queries like this:
>>>> >
>>>> >    SELECT a, b, c, d, count(*) FROM (
>>>> >      SELECT * FROM t ORDER BY a, b, c
>>>> >    ) foo GROUP BY a, b, c, d limit 1000;
>>>> >
>>>> > i.e. there's a subquery with a subset of pathkeys. Without incremental
>>>> > sort the plan looks like this:
>>>> >
>>>> >                     QUERY PLAN
>>>> >    ---------------------------------------------
>>>> >     Limit
>>>> >       ->  GroupAggregate
>>>> >             Group Key: t.a, t.b, t.c, t.d
>>>> >             ->  Sort
>>>> >                   Sort Key: t.a, t.b, t.c, t.d
>>>> >                   ->  Sort
>>>> >                         Sort Key: t.a, t.b, t.c
>>>> >                         ->  Seq Scan on t
>>>> >    (8 rows)
>>>> >
>>>> > so essentially the same plan shape. What bugs me though is that there
>>>> > seems to be some sort of memory leak, so that this query consumes
>>>> > gigabytes os RAM before it gets killed by OOM. But the memory seems not
>>>> > to be allocated in any memory context (at least MemoryContextStats don't
>>>> > show anything like that), so I'm not sure what's going on.
>>>> >
>>>> > Reproducing it is fairly simple:
>>>> >
>>>> >    CREATE TABLE t (a bigint, b bigint, c bigint, d bigint);
>>>> >    INSERT INTO t SELECT
>>>> >      1000*random(), 1000*random(), 1000*random(), 1000*random()
>>>> >    FROM generate_series(1,10000000) s(i);
>>>> >    CREATE INDEX idx ON t(a,b);
>>>> >    ANALYZE t;
>>>> >
>>>> >    EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
>>>> >    FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
>>>> >    LIMIT 100;
>>>>
>>>> While trying to reproduce this, instead of lots of memory usage, I got
>>>> the attached assertion failure instead.
>>>
>>>
>>> And, without the EXPLAIN ANALYZE was able to get this one, which will
>>> probably be a lot more helpful.
>>>
>>
>> Hmmm, I'll try reproducing it, but can you investigate the values in the
>> Assert? I mean, it fails on this:
>>
>>   Assert(total_allocated == context->mem_allocated);
>>
>> so can you get a core or attach to the process using gdb, and see what's
>> the expected / total value?

I've reproduced this on multiple machines (though all are Ubuntu or
Debian derivatives...I don't think that's likely to matter). A core
dump is ~150MB, so I've uploaded to Dropbox [1].

I didn't find an obvious first-level member of Tuplesortstate that was
covered by either of the two blocks in the AllocSet (both are 8KB in
size).

James

[1]: https://www.dropbox.com/s/jwndwp4634hzywk/aset_assertion_failure.core?dl=0



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sat, Mar 14, 2020 at 12:07 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Fri, Mar 13, 2020 at 8:23 PM James Coleman <jtc331@gmail.com> wrote:
> >
> > On Friday, March 13, 2020, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> On Fri, Mar 13, 2020 at 04:31:16PM -0400, James Coleman wrote:
> >>>
> >>> On Fri, Mar 13, 2020 at 2:23 PM James Coleman <jtc331@gmail.com> wrote:
> >>>>
> >>>>
> >>>> On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
> >>>> <tomas.vondra@2ndquadrant.com> wrote:
> >>>> > 3) Most of the execution plans look reasonable, except that some of the
> >>>> > plans look like this:
> >>>> >
> >>>> >
> >>>> >                           QUERY PLAN
> >>>> >    ---------------------------------------------------------
> >>>> >     Limit
> >>>> >       ->  GroupAggregate
> >>>> >             Group Key: t.a, t.b, t.c, t.d
> >>>> >             ->  Incremental Sort
> >>>> >                   Sort Key: t.a, t.b, t.c, t.d
> >>>> >                   Presorted Key: t.a, t.b, t.c
> >>>> >                   ->  Incremental Sort
> >>>> >                         Sort Key: t.a, t.b, t.c
> >>>> >                         Presorted Key: t.a, t.b
> >>>> >                         ->  Index Scan using t_a_b_idx on t
> >>>> >    (10 rows)
> >>>> >
> >>>> > i.e. there are two incremental sorts on top of each other, with
> >>>> > different prefixes. But this this is not a new issue - it happens with
> >>>> > queries like this:
> >>>> >
> >>>> >    SELECT a, b, c, d, count(*) FROM (
> >>>> >      SELECT * FROM t ORDER BY a, b, c
> >>>> >    ) foo GROUP BY a, b, c, d limit 1000;
> >>>> >
> >>>> > i.e. there's a subquery with a subset of pathkeys. Without incremental
> >>>> > sort the plan looks like this:
> >>>> >
> >>>> >                     QUERY PLAN
> >>>> >    ---------------------------------------------
> >>>> >     Limit
> >>>> >       ->  GroupAggregate
> >>>> >             Group Key: t.a, t.b, t.c, t.d
> >>>> >             ->  Sort
> >>>> >                   Sort Key: t.a, t.b, t.c, t.d
> >>>> >                   ->  Sort
> >>>> >                         Sort Key: t.a, t.b, t.c
> >>>> >                         ->  Seq Scan on t
> >>>> >    (8 rows)
> >>>> >
> >>>> > so essentially the same plan shape. What bugs me though is that there
> >>>> > seems to be some sort of memory leak, so that this query consumes
> >>>> > gigabytes os RAM before it gets killed by OOM. But the memory seems not
> >>>> > to be allocated in any memory context (at least MemoryContextStats don't
> >>>> > show anything like that), so I'm not sure what's going on.
> >>>> >
> >>>> > Reproducing it is fairly simple:
> >>>> >
> >>>> >    CREATE TABLE t (a bigint, b bigint, c bigint, d bigint);
> >>>> >    INSERT INTO t SELECT
> >>>> >      1000*random(), 1000*random(), 1000*random(), 1000*random()
> >>>> >    FROM generate_series(1,10000000) s(i);
> >>>> >    CREATE INDEX idx ON t(a,b);
> >>>> >    ANALYZE t;
> >>>> >
> >>>> >    EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
> >>>> >    FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
> >>>> >    LIMIT 100;
> >>>>
> >>>> While trying to reproduce this, instead of lots of memory usage, I got
> >>>> the attached assertion failure instead.
> >>>
> >>>
> >>> And, without the EXPLAIN ANALYZE was able to get this one, which will
> >>> probably be a lot more helpful.
> >>>
> >>
> >> Hmmm, I'll try reproducing it, but can you investigate the values in the
> >> Assert? I mean, it fails on this:
> >>
> >>   Assert(total_allocated == context->mem_allocated);
> >>
> >> so can you get a core or attach to the process using gdb, and see what's
> >> the expected / total value?
>
> I've reproduced this on multiple machines (though all are Ubuntu or
> Debian derivatives...I don't think that's likely to matter). A core
> dump is ~150MB, so I've uploaded to Dropbox [1].
>
> I didn't find an obvious first-level member of Tuplesortstate that was
> covered by either of the two blocks in the AllocSet (both are 8KB in
> size).
>
> James
>
> [1]: https://www.dropbox.com/s/jwndwp4634hzywk/aset_assertion_failure.core?dl=0

And...I think I might have found the issue (though I haven't proved it
100% yet or fixed it):

The incremental sort node calls `tuplesort_puttupleslot`, which
switches the memory context to `sortcontext`. It then calls
`puttuple_common`. `puttuple_common` may then call `grow_memtuples`,
which reallocs space for `sortstate->memtuples`, but `memtuples` is
allocated elsewhere, in the maincontext memory context.

Earlier in this debugging process I had noticed that `sortcontext` was
allocated in `maincontext`, which seemed conceptually odd if our goal
is to reuse the sort state. I also found a comment about cleaning up
the per-sort context that needed updating (it talks about freeing the
sort state itself), but the `memtuples` array was in fact also freed at
reset, so it seemed safe.

Given this issue though, I think I'm going to go ahead and rework
things so that the `memtuples` array lives within the `sortcontext`
instead.

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sat, Mar 14, 2020 at 12:24 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Sat, Mar 14, 2020 at 12:07 PM James Coleman <jtc331@gmail.com> wrote:
> >
> > On Fri, Mar 13, 2020 at 8:23 PM James Coleman <jtc331@gmail.com> wrote:
> > >
> > > On Friday, March 13, 2020, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > >>
> > >> On Fri, Mar 13, 2020 at 04:31:16PM -0400, James Coleman wrote:
> > >>>
> > >>> On Fri, Mar 13, 2020 at 2:23 PM James Coleman <jtc331@gmail.com> wrote:
> > >>>>
> > >>>>
> > >>>> On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
> > >>>> <tomas.vondra@2ndquadrant.com> wrote:
> > >>>> > 3) Most of the execution plans look reasonable, except that some of the
> > >>>> > plans look like this:
> > >>>> >
> > >>>> >
> > >>>> >                           QUERY PLAN
> > >>>> >    ---------------------------------------------------------
> > >>>> >     Limit
> > >>>> >       ->  GroupAggregate
> > >>>> >             Group Key: t.a, t.b, t.c, t.d
> > >>>> >             ->  Incremental Sort
> > >>>> >                   Sort Key: t.a, t.b, t.c, t.d
> > >>>> >                   Presorted Key: t.a, t.b, t.c
> > >>>> >                   ->  Incremental Sort
> > >>>> >                         Sort Key: t.a, t.b, t.c
> > >>>> >                         Presorted Key: t.a, t.b
> > >>>> >                         ->  Index Scan using t_a_b_idx on t
> > >>>> >    (10 rows)
> > >>>> >
> > >>>> > i.e. there are two incremental sorts on top of each other, with
> > >>>> > different prefixes. But this this is not a new issue - it happens with
> > >>>> > queries like this:
> > >>>> >
> > >>>> >    SELECT a, b, c, d, count(*) FROM (
> > >>>> >      SELECT * FROM t ORDER BY a, b, c
> > >>>> >    ) foo GROUP BY a, b, c, d limit 1000;
> > >>>> >
> > >>>> > i.e. there's a subquery with a subset of pathkeys. Without incremental
> > >>>> > sort the plan looks like this:
> > >>>> >
> > >>>> >                     QUERY PLAN
> > >>>> >    ---------------------------------------------
> > >>>> >     Limit
> > >>>> >       ->  GroupAggregate
> > >>>> >             Group Key: t.a, t.b, t.c, t.d
> > >>>> >             ->  Sort
> > >>>> >                   Sort Key: t.a, t.b, t.c, t.d
> > >>>> >                   ->  Sort
> > >>>> >                         Sort Key: t.a, t.b, t.c
> > >>>> >                         ->  Seq Scan on t
> > >>>> >    (8 rows)
> > >>>> >
> > >>>> > so essentially the same plan shape. What bugs me though is that there
> > >>>> > seems to be some sort of memory leak, so that this query consumes
> > >>>> > gigabytes os RAM before it gets killed by OOM. But the memory seems not
> > >>>> > to be allocated in any memory context (at least MemoryContextStats don't
> > >>>> > show anything like that), so I'm not sure what's going on.
> > >>>> >
> > >>>> > Reproducing it is fairly simple:
> > >>>> >
> > >>>> >    CREATE TABLE t (a bigint, b bigint, c bigint, d bigint);
> > >>>> >    INSERT INTO t SELECT
> > >>>> >      1000*random(), 1000*random(), 1000*random(), 1000*random()
> > >>>> >    FROM generate_series(1,10000000) s(i);
> > >>>> >    CREATE INDEX idx ON t(a,b);
> > >>>> >    ANALYZE t;
> > >>>> >
> > >>>> >    EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
> > >>>> >    FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
> > >>>> >    LIMIT 100;
> > >>>>
> > >>>> While trying to reproduce this, instead of lots of memory usage, I got
> > >>>> the attached assertion failure instead.
> > >>>
> > >>>
> > >>> And, without the EXPLAIN ANALYZE was able to get this one, which will
> > >>> probably be a lot more helpful.
> > >>>
> > >>
> > >> Hmmm, I'll try reproducing it, but can you investigate the values in the
> > >> Assert? I mean, it fails on this:
> > >>
> > >>   Assert(total_allocated == context->mem_allocated);
> > >>
> > >> so can you get a core or attach to the process using gdb, and see what's
> > >> the expected / total value?
> >
> > I've reproduced this on multiple machines (though all are Ubuntu or
> > Debian derivatives...I don't think that's likely to matter). A core
> > dump is ~150MB, so I've uploaded to Dropbox [1].
> >
> > I didn't find an obvious first-level member of Tuplesortstate that was
> > covered by either of the two blocks in the AllocSet (both are 8KB in
> > size).
> >
> > James
> >
> > [1]: https://www.dropbox.com/s/jwndwp4634hzywk/aset_assertion_failure.core?dl=0
>
> And...I think I might have found out the issue (though haven't proved
> it 100% yet or fixed it):
>
> The incremental sort node calls `tuplesort_puttupleslot`, which
> switches the memory context to `sortcontext`. It then calls
> `puttuple_common`. `puttuple_common` may then call `grow_memtuples`
> which reallocs space for `sortstate->memtuples`, but `memtuples` is
> elsewhere allocated in the memory context maincontext.
>
> I had earlier in this debugging process noticed that `sortcontext` was
> allocated in `maincontext`, which seemed conceptually odd if our goal
> is to reuse the sort state, and I also found a comment that needed to
> be changed relative to cleaning up the per-sort context (that talks
> about it freeing the sort state itself), but the `memtuples` array was
> in fact freed additionally at reset, so it seemed safe.
>
> Given this issue though, I think I'm going to go ahead and rework so
> that the `memtuples` array lies within the `sortcontext` instead.

Perhaps I spoke too soon: I didn't realize repalloc(_huge) didn't need
a memory context switch, so this likely isn't the issue.
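
For what it's worth, here's a minimal illustration of that property -- a
sketch I put together to convince myself, not code from the patch; the
context names just mirror the ones above:

#include "postgres.h"

static void
repalloc_stays_put(MemoryContext maincontext, MemoryContext sortcontext)
{
    char       *buf = MemoryContextAlloc(maincontext, 100);
    MemoryContext oldcontext = MemoryContextSwitchTo(sortcontext);

    /* repalloc keeps the chunk in the context it was first allocated in */
    buf = repalloc(buf, 200);

    MemoryContextSwitchTo(oldcontext);
    pfree(buf);
}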

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sat, Mar 14, 2020 at 12:36 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Sat, Mar 14, 2020 at 12:24 PM James Coleman <jtc331@gmail.com> wrote:
> >
> > On Sat, Mar 14, 2020 at 12:07 PM James Coleman <jtc331@gmail.com> wrote:
> > >
> > > On Fri, Mar 13, 2020 at 8:23 PM James Coleman <jtc331@gmail.com> wrote:
> > > >
> > > > On Friday, March 13, 2020, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > > >>
> > > >> On Fri, Mar 13, 2020 at 04:31:16PM -0400, James Coleman wrote:
> > > >>>
> > > >>> On Fri, Mar 13, 2020 at 2:23 PM James Coleman <jtc331@gmail.com> wrote:
> > > >>>>
> > > >>>>
> > > >>>> On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
> > > >>>> <tomas.vondra@2ndquadrant.com> wrote:
> > > >>>> > 3) Most of the execution plans look reasonable, except that some of the
> > > >>>> > plans look like this:
> > > >>>> >
> > > >>>> >
> > > >>>> >                           QUERY PLAN
> > > >>>> >    ---------------------------------------------------------
> > > >>>> >     Limit
> > > >>>> >       ->  GroupAggregate
> > > >>>> >             Group Key: t.a, t.b, t.c, t.d
> > > >>>> >             ->  Incremental Sort
> > > >>>> >                   Sort Key: t.a, t.b, t.c, t.d
> > > >>>> >                   Presorted Key: t.a, t.b, t.c
> > > >>>> >                   ->  Incremental Sort
> > > >>>> >                         Sort Key: t.a, t.b, t.c
> > > >>>> >                         Presorted Key: t.a, t.b
> > > >>>> >                         ->  Index Scan using t_a_b_idx on t
> > > >>>> >    (10 rows)
> > > >>>> >
> > > >>>> > i.e. there are two incremental sorts on top of each other, with
> > > >>>> > different prefixes. But this this is not a new issue - it happens with
> > > >>>> > queries like this:
> > > >>>> >
> > > >>>> >    SELECT a, b, c, d, count(*) FROM (
> > > >>>> >      SELECT * FROM t ORDER BY a, b, c
> > > >>>> >    ) foo GROUP BY a, b, c, d limit 1000;
> > > >>>> >
> > > >>>> > i.e. there's a subquery with a subset of pathkeys. Without incremental
> > > >>>> > sort the plan looks like this:
> > > >>>> >
> > > >>>> >                     QUERY PLAN
> > > >>>> >    ---------------------------------------------
> > > >>>> >     Limit
> > > >>>> >       ->  GroupAggregate
> > > >>>> >             Group Key: t.a, t.b, t.c, t.d
> > > >>>> >             ->  Sort
> > > >>>> >                   Sort Key: t.a, t.b, t.c, t.d
> > > >>>> >                   ->  Sort
> > > >>>> >                         Sort Key: t.a, t.b, t.c
> > > >>>> >                         ->  Seq Scan on t
> > > >>>> >    (8 rows)
> > > >>>> >
> > > >>>> > so essentially the same plan shape. What bugs me though is that there
> > > >>>> > seems to be some sort of memory leak, so that this query consumes
> > > >>>> > gigabytes os RAM before it gets killed by OOM. But the memory seems not
> > > >>>> > to be allocated in any memory context (at least MemoryContextStats don't
> > > >>>> > show anything like that), so I'm not sure what's going on.
> > > >>>> >
> > > >>>> > Reproducing it is fairly simple:
> > > >>>> >
> > > >>>> >    CREATE TABLE t (a bigint, b bigint, c bigint, d bigint);
> > > >>>> >    INSERT INTO t SELECT
> > > >>>> >      1000*random(), 1000*random(), 1000*random(), 1000*random()
> > > >>>> >    FROM generate_series(1,10000000) s(i);
> > > >>>> >    CREATE INDEX idx ON t(a,b);
> > > >>>> >    ANALYZE t;
> > > >>>> >
> > > >>>> >    EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
> > > >>>> >    FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
> > > >>>> >    LIMIT 100;
> > > >>>>
> > > >>>> While trying to reproduce this, instead of lots of memory usage, I got
> > > >>>> the attached assertion failure instead.
> > > >>>
> > > >>>
> > > >>> And, without the EXPLAIN ANALYZE was able to get this one, which will
> > > >>> probably be a lot more helpful.
> > > >>>
> > > >>
> > > >> Hmmm, I'll try reproducing it, but can you investigate the values in the
> > > >> Assert? I mean, it fails on this:
> > > >>
> > > >>   Assert(total_allocated == context->mem_allocated);
> > > >>
> > > >> so can you get a core or attach to the process using gdb, and see what's
> > > >> the expected / total value?
> > >
> > > I've reproduced this on multiple machines (though all are Ubuntu or
> > > Debian derivatives...I don't think that's likely to matter). A core
> > > dump is ~150MB, so I've uploaded to Dropbox [1].
> > >
> > > I didn't find an obvious first-level member of Tuplesortstate that was
> > > covered by either of the two blocks in the AllocSet (both are 8KB in
> > > size).
> > >
> > > James
> > >
> > > [1]: https://www.dropbox.com/s/jwndwp4634hzywk/aset_assertion_failure.core?dl=0
> >
> > And...I think I might have found out the issue (though haven't proved
> > it 100% yet or fixed it):
> >
> > The incremental sort node calls `tuplesort_puttupleslot`, which
> > switches the memory context to `sortcontext`. It then calls
> > `puttuple_common`. `puttuple_common` may then call `grow_memtuples`
> > which reallocs space for `sortstate->memtuples`, but `memtuples` is
> > elsewhere allocated in the memory context maincontext.
> >
> > I had earlier in this debugging process noticed that `sortcontext` was
> > allocated in `maincontext`, which seemed conceptually odd if our goal
> > is to reuse the sort state, and I also found a comment that needed to
> > be changed relative to cleaning up the per-sort context (that talks
> > about it freeing the sort state itself), but the `memtuples` array was
> > in fact freed additionally at reset, so it seemed safe.
> >
> > Given this issue though, I think I'm going to go ahead and rework so
> > that the `memtuples` array lies within the `sortcontext` instead.
>
> Perhaps I spoke too soon: I didn't realize repalloc(_huge) didn't need
> a memory context switch, so this likely isn't the issue.

It looks like the issue is actually in the `tuplecontext`, which is
currently a child context of `sortcontext`:

#3  0x0000558cd153b565 in AllocSetCheck
(context=context@entry=0x558cd28e0b70) at aset.c:1573
1573            Assert(total_allocated == context->mem_allocated);
(gdb) p total_allocated
$1 = 16384
(gdb) p context->mem_allocated
$2 = 8192
(gdb) p context->name
$3 = 0x558cd16c8ccd "Caller tuples"

I stuck in several more AllocSetCheck calls in aset.c and got the
attached backtrace.
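
For reference, the added checks were just more copies of the existing
debug call scattered through the aset.c routines, something like this
(sketch only):

#ifdef MEMORY_CONTEXT_CHECKING
    /* re-verify the per-block accounting for this context */
    AllocSetCheck(context);
#endif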

James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Sat, Mar 14, 2020 at 02:41:09PM -0400, James Coleman wrote:
>
>It looks like the issue is actually into the `tuplecontext`, which is
>currently a child context of `sortcontext`:
>
>#3  0x0000558cd153b565 in AllocSetCheck
>(context=context@entry=0x558cd28e0b70) at aset.c:1573
>1573            Assert(total_allocated == context->mem_allocated);
>(gdb) p total_allocated
>$1 = 16384
>(gdb) p context->mem_allocated
>$2 = 8192
>(gdb) p context->name
>$3 = 0x558cd16c8ccd "Caller tuples"
>
>I stuck in several more AllocSetCheck calls in aset.c and got the
>attached backtrace.
>

I think the problem is pretty simple - tuplesort_reset resets the
sortcontext. But that *deletes* the tuplecontext, so
state->tuplecontext gets stale. I haven't looked into the exact
details, but it clearly confuses the accounting.

The attached patch fixes the issue for me - I'm not claiming it's the
right fix, but it's the simplest thing I could think of. Maybe
tuplesort_reset should work differently, not sure.

And it seems to resolve the memory leak too - I suspect we'd freed the
context (so it was not part of the tree of contexts) but the struct was
still valid and we kept allocating memory in it - which made it
invisible to MemoryContextStats etc.
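
To illustrate the pattern (a minimal sketch using the generic memory
context API, not the actual tuplesort.c code):

#include "postgres.h"
#include "utils/memutils.h"

static MemoryContext sortcontext;
static MemoryContext tuplecontext;

static void
create_contexts(void)
{
    sortcontext = AllocSetContextCreate(CurrentMemoryContext,
                                        "sort context",
                                        ALLOCSET_DEFAULT_SIZES);
    tuplecontext = AllocSetContextCreate(sortcontext,
                                         "Caller tuples",
                                         ALLOCSET_DEFAULT_SIZES);
}

static void
reset_for_next_batch(void)
{
    /* resetting the parent deletes its children, including tuplecontext */
    MemoryContextReset(sortcontext);

    /*
     * tuplecontext is now a dangling pointer; it has to be recreated (or
     * the reset has to avoid deleting it) before the next batch uses it.
     */
    tuplecontext = AllocSetContextCreate(sortcontext,
                                         "Caller tuples",
                                         ALLOCSET_DEFAULT_SIZES);
}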

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Fri, Mar 13, 2020 at 1:06 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Thu, Mar 12, 2020 at 5:53 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> >
> > I gave this a very quick look; I don't claim to understand it or
> > anything, but I thought these trivial cleanups worthwhile.  The only
> > non-cosmetic thing is changing order of arguments to the SOn_printf()
> > calls in 0008; I think they are contrary to what the comment says.
>
> Yes, I think you're correct (re: 0008).
>
> They all look generally good to me, and are included in the attached
> patch series.

I just realized that something in this (unsure whether in Alvaro's
changes or in my applying of them) broke make check pretty badly: 3
test files fail, the run is also much slower, and the incremental sort
test returns a lot of obviously broken results.

I'll take a look tomorrow and hopefully get a fix (though I'll probably
reply on the more recent subthreads).

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sat, Mar 14, 2020 at 10:55 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Fri, Mar 13, 2020 at 1:06 PM James Coleman <jtc331@gmail.com> wrote:
> >
> > On Thu, Mar 12, 2020 at 5:53 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > >
> > > I gave this a very quick look; I don't claim to understand it or
> > > anything, but I thought these trivial cleanups worthwhile.  The only
> > > non-cosmetic thing is changing order of arguments to the SOn_printf()
> > > calls in 0008; I think they are contrary to what the comment says.
> >
> > Yes, I think you're correct (re: 0008).
> >
> > They all look generally good to me, and are included in the attached
> > patch series.
>
> I just realized something about this (unsure if in Alvaro's or in my
> applying that) broke make check pretty decently (3 test files broken,
> also much slower, and the incremental sort test returns a lot of
> obviously broken results).
>
> I'll take a look tomorrow and hopefully get a fix (probably will reply
> to the more recent subthread's though).

This took a bit of manually excluding changes until I got a red/green
set, because nothing in the patch set looked obviously incorrect. But
it turns out this change breaks things:

- if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
-                            false, slot, NULL) || node->finished)
+ if (node->finished ||
+     tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+                            false, slot, NULL))

I believe what's happening here is that we need tuplesort_gettupleslot
to set the slot to a NULL tuple (when there aren't any tuples left)
before we return the slot. But in that case it returns false, so the
node->finished check is there to ensure that the first time the method
nulls out the slot and returns false we still return the value in the
slot.

Since this isn't obvious, I'll add a comment. I think I'm also going
to rename node->finished to be more clear.
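
Roughly the kind of comment I have in mind (sketch only, reusing the
names from the diff above):

/*
 * tuplesort_gettupleslot() must be called before the node->finished
 * check: when there are no tuples left it sets the slot to a NULL
 * tuple and returns false, and the "|| node->finished" part is there
 * so that in that case we still return whatever is in the slot rather
 * than skipping the call entirely.
 */
if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
                           false, slot, NULL) || node->finished)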

In this debugging I also noticed that we don't set node->finished back
to false in rescan, which I assume is a bug, but I don't really
understand a whole lot about rescan. IIRC from some previous
discussions rescan exists for things like cursors, where you can move
back and forth over the result set. Assuming that's the case, do we
need explicit tests for cursors using incremental sort? Is there a
good strategy for how much to do there (since I don't want to
duplicate every non-cursor functional test)?

Thanks,
James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Fri, Mar 13, 2020 at 4:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Also, I wonder if it would be better to modify our policies so that we
> > update typedefs.list more frequently.  Some people include additions
> > with their commits, but it's far from SOP.
>
> Perhaps.  My own workflow includes pulling down a fresh typedefs.list
> from the buildfarm (which is trivial to automate) and then adding any
> typedefs invented by the patch I'm working on.  The latter part of it
> makes it hard to see how the in-tree list would be very helpful; and
> if we started expecting patches to include typedef updates, I'm afraid
> we'd get lots of patch collisions in that file.
>
> I don't have any big objection to updating the in-tree list more often,
> but personally I wouldn't use it, unless we can find a better workflow.

How does the buildfarm automate generating the typedefs list? Would it
be relatively easy to incorporate that into a tool that someone could
use locally with pgindent?

Thanks,
James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sat, Mar 14, 2020 at 3:58 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Sat, Mar 14, 2020 at 02:41:09PM -0400, James Coleman wrote:
> >
> >It looks like the issue is actually into the `tuplecontext`, which is
> >currently a child context of `sortcontext`:
> >
> >#3  0x0000558cd153b565 in AllocSetCheck
> >(context=context@entry=0x558cd28e0b70) at aset.c:1573
> >1573            Assert(total_allocated == context->mem_allocated);
> >(gdb) p total_allocated
> >$1 = 16384
> >(gdb) p context->mem_allocated
> >$2 = 8192
> >(gdb) p context->name
> >$3 = 0x558cd16c8ccd "Caller tuples"
> >
> >I stuck in several more AllocSetCheck calls in aset.c and got the
> >attached backtrace.
> >
>
> I think the problem is pretty simple - tuplesort_reset does call
> tuplesort_reset, which resets the sortcontext. But that *deletes* the
> tuplecontext, so the state->tuplecontext gets stale. I'd haven't looked
> into the exact details, but it clearly confuses the accouting.
>
> The attached patch fixes the issue for me - I'm not claiming it's the
> right fix, but it's the simplest thing I could think of. Maybe the
> tuplesort_rest should work differently, not sure.
>
> And it seems to resolve the memory leak too - I suspect we've freed the
> context (so it was not part of the tree of contexts) but the struct was
> still valid and we kept allocating memory in it - but it was invisible
> to MemoryContextDump etc.

Thanks for tracking that down!

I wondered at first if we should consider making the tuplecontext a
child context of the main context instead, but the allocations it
contains will never outlive the contents of the sortcontext, so I think
that is probably fine as is.

This issue, and the resulting fix, did however make me think that we
have some duplication here that's likely to lead to bugs now and/or
later. I've reworked things a bit so that both tuplesort_reset (added
for incremental sort) and the existing tuplesort_begin_common call out
to tuplesort_begin_batch to configure the initial per-batch state. This
way it should be more obvious that there are both cases to consider if
new initialization code is being added to tuplesort.c.
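
In outline, the structure now looks something like this (a simplified,
self-contained sketch -- the names and fields are illustrative, not the
actual tuplesort.c code):

#include <stdbool.h>
#include <stdlib.h>

/* illustrative stand-in for the real Tuplesortstate */
typedef struct SortState
{
    void      **memtuples;      /* tuple array for the current batch */
    int         memtupsize;     /* allocated length of memtuples */
    int         memtupcount;    /* number of tuples currently stored */
    bool        sorted;         /* has the current batch been sorted? */
} SortState;

/* per-batch initialization shared by "begin" and "reset" */
static void
sort_begin_batch(SortState *state)
{
    state->memtupcount = 0;
    state->sorted = false;

    /* (re)allocate the array only if it isn't at the expected size */
    if (state->memtuples == NULL || state->memtupsize != 1024)
    {
        free(state->memtuples);
        state->memtupsize = 1024;
        state->memtuples = malloc(state->memtupsize * sizeof(void *));
    }
}

static SortState *
sort_begin(void)
{
    SortState  *state = calloc(1, sizeof(SortState));

    /* one-time setup would go here ... */
    sort_begin_batch(state);    /* ... plus the shared per-batch setup */
    return state;
}

static void
sort_reset(SortState *state)
{
    /* per-batch teardown would go here ... */
    sort_begin_batch(state);    /* ... plus the same shared setup */
}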

Working on this refactor I noticed that we were only reallocing the
memtuples array if it was smaller than the initial size (not sure
that's even possible?), which means that if it had grown significantly
we were keeping that memory tied up for subsequent batches even when
not needed. I suppose one could argue that's helpful in the sense that
we don't need to keep increasing its size on each batch...but it seems
to me to be not very clean, so I've changed that.

I also noticed that we had the USEMEM macro used in tuplesort_reset
regardless of whether or not the memtuples array had been realloced,
which seems wrong (I don't think we were resetting the stats), so I
changed that too.

I'm still not sure if we should keep the memtuples array in
maincontext (where it is now) or sortcontext. The only argument I can
see for the former is that it allows us a minor optimization: if we
haven't grown the array, we can reuse it for multiple batches. On the
other hand, always resetting it is conceptually more clear. I'm
curious to hear your thoughts on this.

Over at [1] I'd noticed that the patch series versions since I
incorporated Alvaro's pgindent/general formatting patches had been
failing make check. I noted there that I'd found the problem and was
fixing it, so this version of the patch includes that fix. I still need
to figure out (from that sub-thread) what we may or may not need to
update and test with rescan.

James

[1]: https://www.postgresql.org/message-id/CAAaqYe9BsrW%3DDRBOd9yW0s2djofXTM9mRpO%3DLhHrCu4qdGgrVg%40mail.gmail.com

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
Hi,

I've looked at v38 but it seems it's a bit broken by some recent explain
changes (mostly missing types in declarations). Attached is v39 fixing
those issues, and including a bunch of fixes based on a review - most of
the changes are in comments, so I've kept them in separate "fix"
patches after each part.

In general I'm mostly happy with the current shape of the patch, and
unless there are some objections I'd like to get some of it committed
sometime next week.

I've done a fair amount of testing with various queries, and the plan
changes seem pretty sensible. I'm still not entirely sure whether to be
a bit conservative and only tweak the first patch adding incremental
sort to extra places, or commit both. 

The main thing I still have on my plate is assessing how much more
expensive the planning can get due to the increased number of paths we
generate/keep (from considering extra pathkeys). I haven't seen any
significant slowdowns, but I plan to look at some extreme cases (many
similar and applicable indexes etc.).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Andreas Karlsson
Date:
On 3/21/20 1:56 AM, Tomas Vondra wrote:
> I've looked at v38 but it seems it's a bit broken by some recent explain
> changes (mostly missing type in declarations). Attached is v39 fixing
> those issues, and including a bunch of fixes based on a review - most of
> the changes is in comments, so I've instead kept them in separate "fix"
> patches after each part.
> 
> In general I'm mostly happy with the current shape of the patch, and
> unless there are some objections I'd like to get some of it committed
> sometime next week.

Hi,

I haven't had any time to look at the patch yet but when compiling it 
and running the tests GCC gave me a warning. The tests passed btw.

gcc (Debian 8.3.0-6) 8.3.0

explain.c: In function ‘show_incremental_sort_group_info’:
explain.c:2772:39: warning: passing argument 2 of ‘lappend’ discards 
‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
     methodNames = lappend(methodNames, sortMethodName);
                                        ^~~~~~~~~~~~~~
In file included from ../../../src/include/access/xact.h:20,
                  from explain.c:16:
../../../src/include/nodes/pg_list.h:509:14: note: expected ‘void *’ but 
argument is of type ‘const char *’
  extern List *lappend(List *list, void *datum);
               ^~~~~~~
explain.c:2772:39: warning: passing 'const char *' to parameter of type 
'void *' discards qualifiers
       [-Wincompatible-pointer-types-discards-qualifiers]
                         methodNames = lappend(methodNames, sortMethodName);
                                                            ^~~~~~~~~~~~~~
../../../src/include/nodes/pg_list.h:509:40: note: passing argument to 
parameter 'datum' here
extern List *lappend(List *list, void *datum);
                                        ^
Andreas



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sun, Mar 22, 2020 at 6:02 PM Andreas Karlsson <andreas@proxel.se> wrote:
>
> On 3/21/20 1:56 AM, Tomas Vondra wrote:
> > I've looked at v38 but it seems it's a bit broken by some recent explain
> > changes (mostly missing type in declarations). Attached is v39 fixing
> > those issues, and including a bunch of fixes based on a review - most of
> > the changes is in comments, so I've instead kept them in separate "fix"
> > patches after each part.
> >
> > In general I'm mostly happy with the current shape of the patch, and
> > unless there are some objections I'd like to get some of it committed
> > sometime next week.
>
> Hi,
>
> I haven't had any time to look at the patch yet but when compiling it
> and running the tests GCC gave me a warning. The tests passed btw.
>
> gcc (Debian 8.3.0-6) 8.3.0
>
> explain.c: In function ‘show_incremental_sort_group_info’:
> explain.c:2772:39: warning: passing argument 2 of ‘lappend’ discards
> ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
>      methodNames = lappend(methodNames, sortMethodName);
>                                         ^~~~~~~~~~~~~~
> In file included from ../../../src/include/access/xact.h:20,
>                   from explain.c:16:
> ../../../src/include/nodes/pg_list.h:509:14: note: expected ‘void *’ but
> argument is of type ‘const char *’
>   extern List *lappend(List *list, void *datum);
>                ^~~~~~~
> explain.c:2772:39: warning: passing 'const char *' to parameter of type
> 'void *' discards qualifiers
>        [-Wincompatible-pointer-types-discards-qualifiers]
>                          methodNames = lappend(methodNames, sortMethodName);
>                                                             ^~~~~~~~~~~~~~
> ../../../src/include/nodes/pg_list.h:509:40: note: passing argument to
> parameter 'datum' here
> extern List *lappend(List *list, void *datum);

So if we naively get rid of the const on the variable declaration in
question, then we get this warning instead:

explain.c: In function ‘show_incremental_sort_group_info’:
explain.c:2770:27: warning: initialization discards ‘const’ qualifier
from pointer target type [-Wdiscarded-qualifiers]
    char *sortMethodName = tuplesort_method_name(methodCell->int_value);

So on the face of it we have a bit of a no-win situation. The function
tuplesort_method_name returns a const, but lappend wants a non-const.
I'm not sure what the project style preference is here: we could cast
the result to (char *) to drop the const qualifier, but that's frowned
upon in some places. Alternatively we could make a new non-const copy of
the string. Which is preferable in the postgres project style?

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Andreas Karlsson
Date:
On 3/23/20 1:33 AM, James Coleman wrote:
> So on the face of it we have a bit of a no-win situation. The function
> tuple_sort_method_name returns a const, but lappend wants a non-const.
> I'm not sure what the project style preference is here: we could cast
> the result as (char *) to drop the const qualifier, but that's frowned
> upon some places. Alternatively we could make a new non-const copy of
> string. Which is preferable in the postgres project style?

The PostgreSQL codebase has places where const is explicitly cast away
with the unconstify() macro, so unless you can find a better solution
that is probably an OK option.

Andreas




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sun, Mar 22, 2020 at 8:54 PM Andreas Karlsson <andreas@proxel.se> wrote:
>
> On 3/23/20 1:33 AM, James Coleman wrote:
> > So on the face of it we have a bit of a no-win situation. The function
> > tuple_sort_method_name returns a const, but lappend wants a non-const.
> > I'm not sure what the project style preference is here: we could cast
> > the result as (char *) to drop the const qualifier, but that's frowned
> > upon some places. Alternatively we could make a new non-const copy of
> > string. Which is preferable in the postgres project style?
>
> The PostgreSQL has places where const is explicitly casted away with the
> unconstify() macro, so unless you can find a better solution that is
> probably an ok option.

Thanks, that's exactly what I need!
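
For reference, the change I have in mind is roughly this (a sketch
against the warning above, not the final patch):

const char *sortMethodName = tuplesort_method_name(methodCell->int_value);

/* lappend() wants a non-const pointer; the string itself isn't modified */
methodNames = lappend(methodNames, unconstify(char *, sortMethodName));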

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Fri, Mar 20, 2020 at 8:56 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> I've looked at v38 but it seems it's a bit broken by some recent explain
> changes (mostly missing type in declarations). Attached is v39 fixing
> those issues, and including a bunch of fixes based on a review - most of
> the changes is in comments, so I've instead kept them in separate "fix"
> patches after each part.
>
> In general I'm mostly happy with the current shape of the patch, and
> unless there are some objections I'd like to get some of it committed
> sometime next week.
>
> I've done a fair amount of testing with various queries, and the plan
> changes seem pretty sensible. I'm still not entirely sure whether to be
> a bit conservative and only tweak the first patch adding incremental
> sort to extra places, or commit both.
>
> The main thing I still have on my plate is assessment of how much more
> expensive can the planning due to increased number of paths we
> generate/keep (due to considering extra pathkeys). I haven't seen any
> significant slowdowns, but I plan to look at some extreme cases (many
> similar and applicable indexes etc.).

I'm currently incorporating all of the fixes you proposed into the
main patch series, as well as doing a thorough read-through of the
current state of the patch. I'm hoping to reply tomorrow with:

- Fix patches of my own to clean up and add additional comments.
- Catalog all of the current open questions (XXX, etc.) in the patch
to more easily discuss them in the mailing list.

One question I have while I work on that: I've noticed some confusion
in the patch as to whether we should refer to the node below the
incremental sort node in the plan tree (i.e., the node we get tuples
from) as the inner node or the outer node. Intuitively I'd expect to
call it the inner node, but the original patch referred to it
frequently as the outer node. The outerPlanState/innerPlanState macro
comments don't offer a lot of clarification though they're "to avoid
confusion" about right/left inner/outer. I suppose if the
outerPlanState macro is working here the correct term should be outer?

Thanks,
James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Sun, Mar 22, 2020 at 10:05:50PM -0400, James Coleman wrote:
>On Sun, Mar 22, 2020 at 8:54 PM Andreas Karlsson <andreas@proxel.se> wrote:
>>
>> On 3/23/20 1:33 AM, James Coleman wrote:
>> > So on the face of it we have a bit of a no-win situation. The function
>> > tuple_sort_method_name returns a const, but lappend wants a non-const.
>> > I'm not sure what the project style preference is here: we could cast
>> > the result as (char *) to drop the const qualifier, but that's frowned
>> > upon some places. Alternatively we could make a new non-const copy of
>> > string. Which is preferable in the postgres project style?
>>
>> The PostgreSQL has places where const is explicitly casted away with the
>> unconstify() macro, so unless you can find a better solution that is
>> probably an ok option.
>
>Thanks, that's exactly what I need!
>

Yeah, sorry I forgot to mention/fix the warning in the last review round.

BTW I think the comment for pathkeys_useful_for_ordering() needs
updating: it still claims it's an all-or-nothing affair, but I think
that's no longer true.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Alvaro Herrera
Date:
On 2020-Mar-22, James Coleman wrote:

> One question I have while I work on that: I've noticed some confusion
> in the patch as to whether we should refer to the node below the
> incremental sort node in the plan tree (i.e., the node we get tuples
> from) as the inner node or the outer node. Intuitively I'd expect to
> call it the inner node, but the original patch referred to it
> frequently as the outer node. The outerPlanState/innerPlanState macro
> comments don't offer a lot of clarification though they're "to avoid
> confusion" about right/left inner/outer. I suppose if the
> outerPlanState macro is working here the correct term should be outer?

I think the inner/outer distinction comes from join nodes wanting to
distinguish which child drives the scan of the other.  If there's a
single child, there's no need to make such a distinction: it's just "the
child".  And if it's the only child, conventionally we use the first
one, which conventionally is (for us westerners) the one on the left.
This view is supported by the fact that outerPlanState() appears 113
times in the code whereas innerPlanState() appears only 27 times --
that is, all plan types that use only one child use the outer one.  They
could use either, as long as they do so consistently, I think.

Therefore the term should be "outer".  It's not "outer" to the parent
incremental sort; it's just the "outer" of its two possible children.

I think.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> ... all plan types that use only one child use the outer one.  They
> could use either, as long as it does that consistently, I think.

Yeah, exactly.  The outer/inner terminology is really only sensible
for join nodes, but there isn't a third child-plan pointer reserved
for single-child node types, so you gotta use one of those.  And
conventionally we use the "outer" one.

            regards, tom lane



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Mon, Mar 23, 2020 at 1:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > ... all plan types that use only one child use the outer one.  They
> > could use either, as long as it does that consistently, I think.
>
> Yeah, exactly.  The outer/inner terminology is really only sensible
> for join nodes, but there isn't a third child-plan pointer reserved
> for single-child node types, so you gotta use one of those.  And
> conventionally we use the "outer" one.
>
>                         regards, tom lane

Great, thanks for the explanation Alvaro and Tom; I'll fix that up in
my next patch series.

I idly wonder if a macro childPlanState() defined exactly the same as
outerPlanState() might _kinda_ make sense here, but I'm also content
to follow convention.
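
Something along these lines is all I mean (a sketch; it would simply
mirror how outerPlanState() is defined, i.e. the left child of the
PlanState):

/* hypothetical convenience alias for single-child nodes */
#define childPlanState(node)    (((PlanState *) (node))->lefttree)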

I might submit a small patch to the comment on those macros though to
expand on the explanation.

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sun, Mar 22, 2020 at 10:17 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Fri, Mar 20, 2020 at 8:56 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > Hi,
> >
> > I've looked at v38 but it seems it's a bit broken by some recent explain
> > changes (mostly missing type in declarations). Attached is v39 fixing
> > those issues, and including a bunch of fixes based on a review - most of
> > the changes is in comments, so I've instead kept them in separate "fix"
> > patches after each part.
> >
> > In general I'm mostly happy with the current shape of the patch, and
> > unless there are some objections I'd like to get some of it committed
> > sometime next week.
> >
> > I've done a fair amount of testing with various queries, and the plan
> > changes seem pretty sensible. I'm still not entirely sure whether to be
> > a bit conservative and only tweak the first patch adding incremental
> > sort to extra places, or commit both.
> >
> > The main thing I still have on my plate is assessment of how much more
> > expensive can the planning due to increased number of paths we
> > generate/keep (due to considering extra pathkeys). I haven't seen any
> > significant slowdowns, but I plan to look at some extreme cases (many
> > similar and applicable indexes etc.).
>
> I'm currently incorporating all of the fixes you proposed into the
> main patch series, as well as doing a thorough read-through of the
> current state of the patch. I'm hoping to reply tomorrow with:
>
> - Fix patches of my own to clean up and add additional comments.
> - Catalog all of the current open questions (XXX, etc.) in the patch
> to more easily discuss them in the mailing list.

Here's the above.

Current TODOs:

1. src/backend/optimizer/util/pathnode.c add_partial_path()
* XXX Perhaps we could do this only when incremental sort is enabled,
* and use the simpler version (comparing just total cost) otherwise?

I don't have a strong opinion here. It doesn't seem like a significant
difference in terms of cost?

2. Not marked in the patch, but in nodeIncrementalSort.c
ExecIncrementalSort() I wonder if perhaps we should move the algorithm
discussion comments up to the file header comment. On the other hand,
I suppose it could be valuable to leave the the file header comment
more high level about the mathematical properties of incremental sort
rather than discussing the details of the hybrid mode.

3. nodeIncrementalSort.c ExecIncrementalSort() in the main for loop:
 * TODO: do we need to check for interrupts inside these loops or
 * will the outer node handle that?

4. nodeIncrementalSort.c ExecReScanIncrementalSort: This whole chunk
is suspect. I've mentioned previously I don't have a great mental
model of how rescan works and its invariants (IIRC someone said it was
about moving around a result set in a cursor). Regardless I'm pretty
sure this code just doesn't work correctly. Additionally the sort_Done
variable is poorly named; it probably would make more sense to call it
something like "scanHasBegun". I'm waiting to change it though until
cleaning up this code more holistically.

5. planner.c create_ordered_paths:
* XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
* other pathkeys (grouping, ...) like generate_useful_gather_paths.

6. regress/expected/incremental_sort.out:
-- TODO if an analyze happens here the plans might change; should we
-- solve this by inserting extra rows or by adding a GUC that would somehow
-- force the type of plan we expect.

Maybe this isn't an actual issue (I have vague memory of auto-analyze
being disabled during these regression tests)?

7. Not listed as a comment in the patch, but I need to modify the
testing for analyze output to parse out the memory/disk stats so the
tests are stable.

8. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
* XXX At the moment this can only ever return a list with a single element,
* because it looks at query_pathkeys only. So we might return the pathkeys
* directly, but it seems plausible we'll want to consider other orderings
* in the future.

This might be something we just leave in as a comment?

9. In the same function as the above:
* Considering query_pathkeys is always worth it, because it might let us
* avoid a local sort.

That seems like a copy from the fdw code; I didn't remove it in the
attached patchset, because I think I have a diff on a branch somewhere
that standardizes some of this shared code between here and the
postgres_fdw code.

10. optimizer/path/allpaths.c generate_useful_gather_paths:
* XXX I wonder if we need to consider adding a projection here, as
* create_ordered_paths does.

11. In the same function as the above:
* XXX Can't we skip this (maybe only for the cheapest partial path)
* when the path is already sorted? Then it's likely duplicate with
* the path created by generate_gather_paths.

12. In the same function as the above:
* XXX This is not redundant with the gather merge path created in
* generate_gather_paths, because that merely preserves ordering of
* the cheapest partial path, while here we add an explicit sort to
* match the useful ordering.

13. planner.c create_ordered_paths:
* XXX This is probably duplicate with the paths we already generate
* in generate_useful_gather_paths in apply_scanjoin_target_to_paths.

Attached is v40, which primarily includes quite a bit of comment cleanup.

James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Alvaro Herrera
Date:
On 2020-Mar-23, James Coleman wrote:

> 4. nodeIncrementalSort.c ExecReScanIncrementalSort: This whole chunk
> is suspect. I've mentioned previously I don't have a great mental
> model of how rescan works and its invariants (IIRC someone said it was
> about moving around a result set in a cursor). Regardless I'm pretty
> sure this code just doesn't work correctly.

I don't think that's the whole of it.  My own vague understanding of
ReScan is that it's there to support running a node again, possibly with
different parameters.  For example if you have a join of an indexscan
on the outer side and an incremental sort on the inner side, and the
values from the index are used as parameters to the incremental sort,
then the incremental sort is going to receive ReScan calls for each of
the values that the index returns.  Sometimes the index could give you
the same values as before (because there's a dupe in the index), so you
can just return the same values from the incremental sort; but other
times it's going to return different values so you need to reset the
incremental sort to "start from scratch" using the new values as
parameters.

Now, if you have a cursor reading from the incremental sort and fetch
all tuples, then rewind completely and fetch all again, then that's
going to be a rescan as well.

I agree with you that the code doesn't seem to implement that.


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Mon, Mar 23, 2020 at 11:44 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
>
> On 2020-Mar-23, James Coleman wrote:
>
> > 4. nodeIncrementalSort.c ExecReScanIncrementalSort: This whole chunk
> > is suspect. I've mentioned previously I don't have a great mental
> > model of how rescan works and its invariants (IIRC someone said it was
> > about moving around a result set in a cursor). Regardless I'm pretty
> > sure this code just doesn't work correctly.
>
> I don't think that's the whole of it.  My own vague understanding of
> ReScan is that it's there to support running a node again, possibly with
> different parameters.  For example if you have a join of an indexscan
> on the outer side and an incremental sort on the inner side, and the
> values from the index are used as parameters to the incremental sort,
> then the incremental sort is going to receive ReScan calls for each of
> the values that the index returns.  Sometimes the index could give you
> the same values as before (because there's a dupe in the index), so you
> can just return the same values from the incremental sort; but other
> times it's going to return different values so you need to reset the
> incremental sort to "start from scratch" using the new values as
> parameters.
>
> Now, if you have a cursor reading from the incremental sort and fetch
> all tuples, then rewind completely and fetch all again, then that's
> going to be a rescan as well.
>
> I agree with you that the code doesn't seem to implement that.

I grepped the codebase for rescan, and noted this relevant info in
src/backend/executor/README:

* Rescan command to reset a node and make it generate its output sequence
over again.

* Parameters that can alter a node's results. After adjusting a parameter,
the rescan command must be applied to that node and all nodes above it.
There is a moderately intelligent scheme to avoid rescanning nodes
unnecessarily (for example, Sort does not rescan its input if no parameters
of the input have changed, since it can just reread its stored sorted data).

That jives pretty well with what you're saying.

The interesting thing with incremental sort, as the comments in the
patch already note, is that even if the params haven't changed, we
can't regenerate the same values again *unless* we know that we're
still in the same batch, or have only processed a single full batch
(and the tuples are still in the full sort state), or have transitioned
to prefix mode and have only transferred tuples from the full sort
state for a single prefix key group.

That's a pretty narrow range of applicability of not needing to
re-execute the entire node, at least based on my assumptions about
when rescanning will typically happen.

So, two followup questions:

1. Given the narrow applicability, might it make sense to just say
"we're only going to do a total reset and rescan and not try to
implement a smart 'don't rescan if we don't have to'"? (A rough sketch
of what I mean follows after these questions.)

2. What would be a typical or good way to test this? Should I
basically repeat many of the existing implementation tests but with a
cursor and verify that rescanning produces the same results? That's
probably the path I'm going to take if there are no objections. Of
course we would need even more testing if we wanted to have the "smart
rescan" functionality.

Thoughts?

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Tue, Mar 24, 2020 at 06:26:11PM -0400, James Coleman wrote:
>On Mon, Mar 23, 2020 at 11:44 PM Alvaro Herrera
><alvherre@2ndquadrant.com> wrote:
>>
>> On 2020-Mar-23, James Coleman wrote:
>>
>> > 4. nodeIncrementalSort.c ExecReScanIncrementalSort: This whole chunk
>> > is suspect. I've mentioned previously I don't have a great mental
>> > model of how rescan works and its invariants (IIRC someone said it was
>> > about moving around a result set in a cursor). Regardless I'm pretty
>> > sure this code just doesn't work correctly.
>>
>> I don't think that's the whole of it.  My own vague understanding of
>> ReScan is that it's there to support running a node again, possibly with
>> different parameters.  For example if you have a join of an indexscan
>> on the outer side and an incremental sort on the inner side, and the
>> values from the index are used as parameters to the incremental sort,
>> then the incremental sort is going to receive ReScan calls for each of
>> the values that the index returns.  Sometimes the index could give you
>> the same values as before (because there's a dupe in the index), so you
>> can just return the same values from the incremental sort; but other
>> times it's going to return different values so you need to reset the
>> incremental sort to "start from scratch" using the new values as
>> parameters.
>>
>> Now, if you have a cursor reading from the incremental sort and fetch
>> all tuples, then rewind completely and fetch all again, then that's
>> going to be a rescan as well.
>>
>> I agree with you that the code doesn't seem to implement that.
>
>I grepped the codebase for rescan, and noted this relevant info in
>src/backend/executor/README:
>
>* Rescan command to reset a node and make it generate its output sequence
>over again.
>
>* Parameters that can alter a node's results. After adjusting a parameter,
>the rescan command must be applied to that node and all nodes above it.
>There is a moderately intelligent scheme to avoid rescanning nodes
>unnecessarily (for example, Sort does not rescan its input if no parameters
>of the input have changed, since it can just reread its stored sorted data).
>
>That jives pretty well with what you're saying.
>
>The interesting thing with incremental sort, as the comments in the
>patch already note, is that even if the params haven't changed, we
>can't regenerate the same values again *unless* we know that we're
>still in the same batch, or, have only processed a single full batch
>(and the tuples are still in the full sort state) or we've
>transitioned to prefix mode and have only transferred tuples from the
>full sort state for a single prefix key group.
>
>That's a pretty narrow range of applicability of not needing to
>re-execute the entire node, at least based on my assumptions about
>when rescanning will typically happen.
>
>So, two followup questions:
>
>1. Given the narrow applicability, might it make sense to just say
>"we're only going to do a total reset and rescan and not try to
>implement a smart 'don't rescan if we don't have to'"?
>

I think that's a sensible approach.

>2. What would be a typical or good way to test this? Should I
>basically repeat many of the existing implementation tests but with a
>cursor and verify that rescanning produces the same results? That's
>probably the path I'm going to take if there are no objections. Of
>course we would need even more testing if we wanted to have the "smart
>rescan" functionality.
>

I haven't checked, but how are we testing it for the other nodes?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Mar 24, 2020 at 7:08 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Mar 24, 2020 at 06:26:11PM -0400, James Coleman wrote:
> >On Mon, Mar 23, 2020 at 11:44 PM Alvaro Herrera
> ><alvherre@2ndquadrant.com> wrote:
> >>
> >> On 2020-Mar-23, James Coleman wrote:
> >>
> >> > 4. nodeIncrementalSort.c ExecReScanIncrementalSort: This whole chunk
> >> > is suspect. I've mentioned previously I don't have a great mental
> >> > model of how rescan works and its invariants (IIRC someone said it was
> >> > about moving around a result set in a cursor). Regardless I'm pretty
> >> > sure this code just doesn't work correctly.
> >>
> >> I don't think that's the whole of it.  My own vague understanding of
> >> ReScan is that it's there to support running a node again, possibly with
> >> different parameters.  For example if you have a join of an indexscan
> >> on the outer side and an incremental sort on the inner side, and the
> >> values from the index are used as parameters to the incremental sort,
> >> then the incremental sort is going to receive ReScan calls for each of
> >> the values that the index returns.  Sometimes the index could give you
> >> the same values as before (because there's a dupe in the index), so you
> >> can just return the same values from the incremental sort; but other
> >> times it's going to return different values so you need to reset the
> >> incremental sort to "start from scratch" using the new values as
> >> parameters.
> >>
> >> Now, if you have a cursor reading from the incremental sort and fetch
> >> all tuples, then rewind completely and fetch all again, then that's
> >> going to be a rescan as well.
> >>
> >> I agree with you that the code doesn't seem to implement that.
> >
> >I grepped the codebase for rescan, and noted this relevant info in
> >src/backend/executor/README:
> >
> >* Rescan command to reset a node and make it generate its output sequence
> >over again.
> >
> >* Parameters that can alter a node's results. After adjusting a parameter,
> >the rescan command must be applied to that node and all nodes above it.
> >There is a moderately intelligent scheme to avoid rescanning nodes
> >unnecessarily (for example, Sort does not rescan its input if no parameters
> >of the input have changed, since it can just reread its stored sorted data).
> >
> >That jives pretty well with what you're saying.
> >
> >The interesting thing with incremental sort, as the comments in the
> >patch already note, is that even if the params haven't changed, we
> >can't regenerate the same values again *unless* we know that we're
> >still in the same batch, or, have only processed a single full batch
> >(and the tuples are still in the full sort state) or we've
> >transitioned to prefix mode and have only transferred tuples from the
> >full sort state for a single prefix key group.
> >
> >That's a pretty narrow range of applicability of not needing to
> >re-execute the entire node, at least based on my assumptions about
> >when rescanning will typically happen.
> >
> >So, two followup questions:
> >
> >1. Given the narrow applicability, might it make sense to just say
> >"we're only going to do a total reset and rescan and not try to
> >implement a smart 'don't rescan if we don't have to'"?
> >
>
> I think that's a sensible approach.
>
> >2. What would be a typical or good way to test this? Should I
> >basically repeat many of the existing implementation tests but with a
> >cursor and verify that rescanning produces the same results? That's
> >probably the path I'm going to take if there are no objections. Of
> >course we would need even more testing if we wanted to have the "smart
> >rescan" functionality.
> >
>
> I haven't checked, but how are we testing it for the other nodes?

I haven't checked yet, but figured I'd ask in case someone had ideas
off the top of their head.

While working on finding a test case to show rescan isn't implemented
properly yet, I came across a bug. At the top of
ExecInitIncrementalSort, we assert that eflags does not contain
EXEC_FLAG_REWIND. But the following query (with merge and hash joins
disabled) breaks that assertion:

select * from t join (select * from t order by a, b) s on s.a = t.a
where t.a in (1,2);
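
To be concrete, "with merge and hash joins disabled" here just means
running the query above after:

set enable_mergejoin = off;
set enable_hashjoin = off;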

The comments about this flag in src/include/executor/executor.h say:

* REWIND indicates that the plan node should try to efficiently support
* rescans without parameter changes. (Nodes must support ExecReScan calls
* in any case, but if this flag was not given, they are at liberty to do it
* through complete recalculation. Note that a parameter change forces a
* full recalculation in any case.)

Now we know that except in rare cases (as just discussed recently up
thread) we can't implement rescan efficiently.

So is this a planner bug (i.e., should we try not to generate
incremental sort plans that require efficient rewind)? Or can we just
remove that part of the assertion and know that we'll implement the
rescan, albeit inefficiently? We already explicitly declare that we
don't support backwards scanning, but I don't see a way to declare the
same for rewind.

I'm going to try the latter approach now to see if it at least solves
the immediate problem...

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Mar 24, 2020 at 8:23 PM James Coleman <jtc331@gmail.com> wrote:
> While working on finding a test case to show rescan isn't implemented
> properly yet, I came across a bug. At the top of
> ExecInitIncrementalSort, we assert that eflags does not contain
> EXEC_FLAG_REWIND. But the following query (with merge and hash joins
> disabled) breaks that assertion:
>
> select * from t join (select * from t order by a, b) s on s.a = t.a
> where t.a in (1,2);
>
> The comments about this flag in src/include/executor/executor.h say:
>
> * REWIND indicates that the plan node should try to efficiently support
> * rescans without parameter changes. (Nodes must support ExecReScan calls
> * in any case, but if this flag was not given, they are at liberty to do it
> * through complete recalculation. Note that a parameter change forces a
> * full recalculation in any case.)
>
> Now we know that except in rare cases (as just discussed recently up
> thread) we can't implement rescan efficiently.
>
> So is this a planner bug (i.e., should we try not to generate
> incremental sort plans that require efficient rewind)? Or can we just
> remove that part of the assertion and know that we'll implement the
> rescan, albeit inefficiently? We already explicitly declare that we
> don't support backwards scanning, but I don't see a way to declare the
> same for rewind.

Other nodes seem to get a materialization node placed above them to
support this case "better". Is that something we should be doing?

> I'm going to try the latter approach now to see if it at least solves
> the immediate problem...

So a couple of interesting results here. First, it does seem to fix
the assertion failure, and rescan is being used in this case (as I
expected).

The plans have a bit of a weird shape, at least in my mind. First, to
get the incremental sort on the right side of the join I had to:
set enable_mergejoin = off;
set enable_hashjoin = off;
and got this plan:

 Gather  (cost=1000.47..108541.96 rows=2 width=16)
   Workers Planned: 2
   ->  Nested Loop  (cost=0.47..107541.76 rows=1 width=16)
         Join Filter: (t.a = t_1.a)
         ->  Parallel Seq Scan on t  (cost=0.00..9633.33 rows=1 width=8)
               Filter: (a = ANY ('{1,2}'::integer[]))
         ->  Incremental Sort  (cost=0.47..75408.43 rows=1000000 width=8)
               Sort Key: t_1.a, t_1.b
               Presorted Key: t_1.a
               ->  Index Scan using idx_t_a on t t_1  (cost=0.42..30408.42 rows=1000000 width=8)

To get rid of the parallelism but keep the same basic plan shape I had
to further:
set enable_seqscan = off;
set enable_material = off;
and got this plan:

 Nested Loop  (cost=0.89..195829.74 rows=2 width=16)
   Join Filter: (t.a = t_1.a)
   ->  Index Scan using idx_t_a on t  (cost=0.42..12.88 rows=2 width=8)
         Index Cond: (a = ANY ('{1,2}'::integer[]))
   ->  Incremental Sort  (cost=0.47..75408.43 rows=1000000 width=8)
         Sort Key: t_1.a, t_1.b
         Presorted Key: t_1.a
         ->  Index Scan using idx_t_a on t t_1  (cost=0.42..30408.42 rows=1000000 width=8)

Two observations:
1. Ideally the planner would have realized that the join condition can
be safely pushed down into both index scans. I was surprised this
didn't happen, but...maybe that's just not supported?

2. Ideally the nested loop node would have the smarts to know that the
right child is ordered, and therefore it can stop pulling tuples from
it as soon as they stop matching the join condition for each iteration
in the loop. I'm less surprised this isn't supported; it seems like a
fairly advanced optimization (OTOH it is the kind of interesting
optimization incremental sort opens up in more cases).

I don't *think* either of these is an issue with the patch, but I wanted
to mention them in case they pique someone's curiosity or in case one
actually is a bug [in our patch or otherwise].

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Mar 24, 2020 at 8:58 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Tue, Mar 24, 2020 at 8:23 PM James Coleman <jtc331@gmail.com> wrote:
> > While working on finding a test case to show rescan isn't implemented
> > properly yet, I came across a bug. At the top of
> > ExecInitIncrementalSort, we assert that eflags does not contain
> > EXEC_FLAG_REWIND. But the following query (with merge and hash joins
> > disabled) breaks that assertion:
> >
> > select * from t join (select * from t order by a, b) s on s.a = t.a
> > where t.a in (1,2);
> >
> > The comments about this flag in src/include/executor/executor.h say:
> >
> > * REWIND indicates that the plan node should try to efficiently support
> > * rescans without parameter changes. (Nodes must support ExecReScan calls
> > * in any case, but if this flag was not given, they are at liberty to do it
> > * through complete recalculation. Note that a parameter change forces a
> > * full recalculation in any case.)
> >
> > Now we know that except in rare cases (as just discussed recently up
> > thread) we can't implement rescan efficiently.
> >
> > So is this a planner bug (i.e., should we try not to generate
> > incremental sort plans that require efficient rewind)? Or can we just
> > remove that part of the assertion and know that we'll implement the
> > rescan, albeit inefficiently? We already explicitly declare that we
> > don't support backwards scanning, but I don't see a way to declare the
> > same for rewind.
>
> Other nodes seem to get a materialization node placed above them to
> support this case "better". Is that something we should be doing?
>
> > I'm going to try the latter approach now to see if it at least solves
> > the immediate problem...
>
> So a couple of interesting results here. First, it does seem to fix
> the assertion failure, and rescan is being used in this case (as I
> expected).

I've fixed the rescan implementation and added a test. I think we
might actually be able to get away with one basic test (it just has to
exercise both full and prefix sort states). Note, this also allowed me
to get rid of the sort_Done incremental sort state attribute that I'd
wondered earlier if we'd actually need.
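
For reference, the shape of the rescan test I have in mind is roughly
this (illustrative only, not the exact test in the patch): force a
nested loop so the sorted subquery gets rescanned for each outer row,
with data that exercises both the full and prefix sort states:

begin;
set local enable_mergejoin = off;
set local enable_hashjoin = off;
explain (costs off)
select * from t join (select * from t order by a, b) s on s.a = t.a
where t.a in (1, 2);
select * from t join (select * from t order by a, b) s on s.a = t.a
where t.a in (1, 2);
rollback;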

In the attached patch series I've collapsed my previous fix commits,
but included the changes here as a separate patch in the series so you
can see the changes more easily.

James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
In a previous email I'd summarized remaining TODOs I'd found. Here's
an updated list with several resolved.

Resolved:

2. Not marked in the patch, but in nodeIncrementalSort.c
ExecIncrementalSort() I wonder if perhaps we should move the algorithm
discussion comments up to the file header comment. On the other hand,
I suppose it could be valuable to leave the file header comment
more high level about the mathematical properties of incremental sort
rather than discussing the details of the hybrid mode.

I've decided to do this, and the attached patch series includes the change.

3. nodeIncrementalSort.c ExecIncrementalSort() in the main for loop:
 * TODO: do we need to check for interrupts inside these loops or
 * will the outer node handle that?

It seems like what we have is sufficient, given that the nodes (and
sort) we rely on have their own calls. The one place where someone
might make an argument otherwise would be in the mode transition
function where we copy tuples from the full sort state to the
presorted sort state. If this is a problem, let me know, and I'll
change it, but I'm proceeding under the assumption for now that it's
not.

4. nodeIncrementalSort.c ExecReScanIncrementalSort: This whole chunk
is suspect. I've mentioned previously I don't have a great mental
model of how rescan works and its invariants (IIRC someone said it was
about moving around a result set in a cursor). Regardless I'm pretty
sure this code just doesn't work correctly. Additionally the sort_Done
variable is poorly named; it probably would make more sense to call it
something like "scanHasBegun". I'm waiting to change it though until
cleaning up this code more holistically.

Fixed, as described in previous email.

6. regress/expected/incremental_sort.out:
-- TODO if an analyze happens here the plans might change; should we
-- solve by inserting extra rows or by adding a GUC that would somehow
-- forcing the time of plan we expect.

I've decided this doesn't seem to be a real issue, so, comment removed.

7. Not listed as a comment in the patch, but I need to modify the
testing for analyze output to parse out the memory/disk stats so the
tests are stable.

Included in the attached patch series. I use plpgsql to munge out the
space kB numbers. I also discovered two bugs in the JSON output along
the way and fixed those (memory and disk need to be output separately;
disk was using the wrong "space type" enum). Finally I also use
plpgsql to check a few invariants (for now just that max space is
greater than or equal to the average). A sketch of the kind of wrapper
I mean is below, after this list.

8. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
* XXX At the moment this can only ever return a list with a single element,
* because it looks at query_pathkeys only. So we might return the pathkeys
* directly, but it seems plausible we'll want to consider other orderings
* in the future.

I think we just leave this in as a comment.

9. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
* Considering query_pathkeys is always worth it, because it might let us
* avoid a local sort.

That originally was a copy from the fdw code, but since the two
functions have diverged (is that concerning? it could be confusing, but
it isn't a compilation problem) I didn't move the function.

I did notice though that find_em_expr_for_rel() is wholesale copied
(and unchanged) from the fdw code, so I moved it to equivclass.c so
both places can share it.
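
To make item 7 above concrete, the plpgsql munging is along these lines
(a sketch only; the function name and details here are illustrative,
not the exact code in the patch):

create or replace function explain_analyze_stable(query text)
returns setof text language plpgsql as
$$
declare
    ln text;
begin
    for ln in execute 'explain (analyze, timing off, summary off) ' || query
    loop
        -- replace the volatile kB figures so the expected output is stable
        return next regexp_replace(ln, '\d+kB', 'NNkB', 'g');
    end loop;
end;
$$;

-- e.g.: select * from explain_analyze_stable('select * from t order by a, b');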


Still remaining:

1. src/backend/optimizer/util/pathnode.c add_partial_path()
* XXX Perhaps we could do this only when incremental sort is enabled,
* and use the simpler version (comparing just total cost) otherwise?

I don't have a strong opinion here. It doesn't seem like a significant
difference in terms of cost?

5. planner.c create_ordered_paths:
* XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
* other pathkeys (grouping, ...) like generate_useful_gather_paths.

10. optimizer/path/allpaths.c generate_useful_gather_paths:
* XXX I wonder if we need to consider adding a projection here, as
* create_ordered_paths does.

11. In the same function as the above:
* XXX Can't we skip this (maybe only for the cheapest partial path)
* when the path is already sorted? Then it's likely duplicate with
* the path created by generate_gather_paths.

12. In the same function as the above:
* XXX This is not redundant with the gather merge path created in
* generate_gather_paths, because that merely preserves ordering of
* the cheapest partial path, while here we add an explicit sort to
* get match the useful ordering.

13. planner.c create_ordered_paths:
* XXX This is probably duplicate with the paths we already generate
* in generate_useful_gather_paths in apply_scanjoin_target_to_paths.

Tomas, any chance you could take a look at the above XXX/questions? I
believe all of them that remain relate to the planner patches.

Thanks,
James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Fri, Mar 27, 2020 at 12:51:34PM -0400, James Coleman wrote:
>In a previous email I'd summarized remaining TODOs I'd found. Here's
>an updated listed with several resolved.
>
>Resolved:
>
>2. Not marked in the patch, but in nodeIncrementalSort.c
>ExecIncrementalSort() I wonder if perhaps we should move the algorithm
>discussion comments up to the file header comment. On the other hand,
>I suppose it could be valuable to leave the the file header comment
>more high level about the mathematical properties of incremental sort
>rather than discussing the details of the hybrid mode.
>
>I've decided to do this, and the attached patch series includes the change.
>

It's a bit tough to find the right balance of what to put into the header
comment and what should go to function comments, but this seems mostly
reasonable. I wouldn't use the double-tab indentation and the copyright
notices should stay at the top.

>3. nodeIncrementalSort.c ExecIncrementalSort() in the main for loop:
> * TODO: do we need to check for interrupts inside these loops or
> * will the outer node handle that?
>
>It seems like what we have is sufficient, given that the nodes (and
>sort) we rely on have their own calls. The one place where someone
>might make an argument otherwise would be in the mode transition
>function where we copy tuples from the full sort state to the
>presorted sort state. If this is a problem, let me know, and I'll
>change it, but I'm proceeding under the assumption for now that it's
>not.
>

I think what we have now is sufficient.

>4. nodeIncrementalSort.c ExecReScanIncrementalSort: This whole chunk
>is suspect. I've mentioned previously I don't have a great mental
>model of how rescan works and its invariants (IIRC someone said it was
>about moving around a result set in a cursor). Regardless I'm pretty
>sure this code just doesn't work correctly. Additionally the sort_Done
>variable is poorly named; it probably would make more sense to call it
>something like "scanHasBegun". I'm waiting to change it though until
>cleaning up this code more holistically.
>
>Fixed, as described in previous email.
>
>6. regress/expected/incremental_sort.out:
>-- TODO if an analyze happens here the plans might change; should we
>-- solve by inserting extra rows or by adding a GUC that would somehow
>-- forcing the time of plan we expect.
>
>I've decided this doesn't seem to be a real issue, so, comment removed.
>

OK

>7. Not listed as a comment in the patch, but I need to modify the
>testing for analyze output to parse out the memory/disk stats to the
>tests are stable.
>
>Included in the attached patch series. I use plpgsql to munge out the
>space kB numbers. I also discovered two bugs in the JSON output along
>the way and fixed those (memory and disk need to be output separate;
>disk was using the wrong "space type" enum). Finally I also use
>plpgsql to check a few invariants (for now just that max space is
>greater than or equal to the average.
>

OK

>8. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
>* XXX At the moment this can only ever return a list with a single element,
>* because it looks at query_pathkeys only. So we might return the pathkeys
>* directly, but it seems plausible we'll want to consider other orderings
>* in the future.
>
>I think we just leave this in as a comment.
>

Fine with me.

As a side note here, I'm wondering if this (determining useful pathkeys)
can be made a bit smarter by looking both at query_pathkeys and pathkeys
useful for merging, similarly to what truncate_useless_pathkeys() does.
But that can be seen as an improvement of what we do now.
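
To give a concrete example of the kind of case I mean (made-up tables,
with an index on t1.a only):

select * from t1 join t2 on t1.a = t2.a order by t1.b;

Here query_pathkeys only gives us (t1.b), but a path sorted on (t1.a)
is still interesting because it could feed a merge join; that's the
sort of pathkey truncate_useless_pathkeys() would keep.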

>9. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
>* Considering query_pathkeys is always worth it, because it might let us
>* avoid a local sort.
>
>That originally was a copy from the fdw code, but since the two
>functions have diverged (Is that concerning? I could be confusing, but
>isn't a compilation problem) I didn't move the function.
>

I think it's OK the two functions diverged, it's simply because the FDW
one needs to check other things too. But I might rework this once I look
closer at truncate_useless_pathkeys.

>I did notice though that find_em_expr_for_rel() is wholesale copied
>(and unchanged) from the fdw code, so I moved it to equivclass.c so
>both places can share it.
>

+1

>
>Still remaining:
>
>1. src/backend/optimizer/util/pathnode.c add_partial_path()
>* XXX Perhaps we could do this only when incremental sort is enabled,
>* and use the simpler version (comparing just total cost) otherwise?
>
>I don't have a strong opinion here. It doesn't seem like a significant
>difference in terms of cost?
>
>5. planner.c create_ordered_paths:
>* XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
>* other pathkeys (grouping, ...) like generate_useful_gather_paths.
>
>10. optimizer/path/allpaths.c generate_useful_gather_paths:
>* XXX I wonder if we need to consider adding a projection here, as
>* create_ordered_paths does.
>
>11. In the same function as the above:
>* XXX Can't we skip this (maybe only for the cheapest partial path)
>* when the path is already sorted? Then it's likely duplicate with
>* the path created by generate_gather_paths.
>
>12. In the same function as the above:
>* XXX This is not redundant with the gather merge path created in
>* generate_gather_paths, because that merely preserves ordering of
>* the cheapest partial path, while here we add an explicit sort to
>* get match the useful ordering.
>
>13. planner.c create_ordered_paths:
>* XXX This is probably duplicate with the paths we already generate
>* in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
>
>Tomas, any chance you could take a look at the above XXX/questions? I
>believe all of them that remain relate to the planner patches.
>

Yes, I'll take a look over the weekend.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Fri, Mar 27, 2020 at 9:19 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Fri, Mar 27, 2020 at 12:51:34PM -0400, James Coleman wrote:
> >In a previous email I'd summarized remaining TODOs I'd found. Here's
> >an updated listed with several resolved.
> >
> >Resolved:
> >
> >2. Not marked in the patch, but in nodeIncrementalSort.c
> >ExecIncrementalSort() I wonder if perhaps we should move the algorithm
> >discussion comments up to the file header comment. On the other hand,
> >I suppose it could be valuable to leave the the file header comment
> >more high level about the mathematical properties of incremental sort
> >rather than discussing the details of the hybrid mode.
> >
> >I've decided to do this, and the attached patch series includes the change.
> >
>
> It's a bit tough to find the right balance what to put into the header
> comment and what should go to function comments, but this seems mostly
> reasonable. I wouldn't use the double-tab indentation and the copyright
> notices should stay at the top.

Fixed. I also re-ran pgindent on the nodeIncrementalSort.c file.

> >3. nodeIncrementalSort.c ExecIncrementalSort() in the main for loop:
> > * TODO: do we need to check for interrupts inside these loops or
> > * will the outer node handle that?
> >
> >It seems like what we have is sufficient, given that the nodes (and
> >sort) we rely on have their own calls. The one place where someone
> >might make an argument otherwise would be in the mode transition
> >function where we copy tuples from the full sort state to the
> >presorted sort state. If this is a problem, let me know, and I'll
> >change it, but I'm proceeding under the assumption for now that it's
> >not.
> >
>
> I think what we have now is sufficient.
>
> >4. nodeIncrementalSort.c ExecReScanIncrementalSort: This whole chunk
> >is suspect. I've mentioned previously I don't have a great mental
> >model of how rescan works and its invariants (IIRC someone said it was
> >about moving around a result set in a cursor). Regardless I'm pretty
> >sure this code just doesn't work correctly. Additionally the sort_Done
> >variable is poorly named; it probably would make more sense to call it
> >something like "scanHasBegun". I'm waiting to change it though until
> >cleaning up this code more holistically.
> >
> >Fixed, as described in previous email.
> >
> >6. regress/expected/incremental_sort.out:
> >-- TODO if an analyze happens here the plans might change; should we
> >-- solve by inserting extra rows or by adding a GUC that would somehow
> >-- forcing the time of plan we expect.
> >
> >I've decided this doesn't seem to be a real issue, so, comment removed.
> >
>
> OK
>
> >7. Not listed as a comment in the patch, but I need to modify the
> >testing for analyze output to parse out the memory/disk stats to the
> >tests are stable.
> >
> >Included in the attached patch series. I use plpgsql to munge out the
> >space kB numbers. I also discovered two bugs in the JSON output along
> >the way and fixed those (memory and disk need to be output separate;
> >disk was using the wrong "space type" enum). Finally I also use
> >plpgsql to check a few invariants (for now just that max space is
> >greater than or equal to the average.
> >
>
> OK
>
> >8. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
> >* XXX At the moment this can only ever return a list with a single element,
> >* because it looks at query_pathkeys only. So we might return the pathkeys
> >* directly, but it seems plausible we'll want to consider other orderings
> >* in the future.
> >
> >I think we just leave this in as a comment.
> >
>
> Fine with me.
>
> As a side note here, I'm wondering if this (determining useful pathkeys)
> can be made a bit smarter by looking both at query_pathkeys and pathkeys
> useful for merging, similarly to what truncate_useless_pathkeys() does.
> But that can be seen as an improvement of what we do now.

Unless your comment below about looking at truncate_useless_pathkeys
implies you're aiming to get this in now, I wonder if we should just
expand the comment to reference pathkeys useful for merging as a
possible future extension.

> >9. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
> >* Considering query_pathkeys is always worth it, because it might let us
> >* avoid a local sort.
> >
> >That originally was a copy from the fdw code, but since the two
> >functions have diverged (Is that concerning? I could be confusing, but
> >isn't a compilation problem) I didn't move the function.
> >
>
> I think it's OK the two functions diverged, it's simply because the FDW
> one needs to check other things too. But I might rework this once I look
> closer at truncate_useless_pathkeys.

Agreed, for now at least. It's tempting to think they should always be
shared, but I'm not convinced (without a lot more digging) that this
represents structural rather than incidental duplication.

> >I did notice though that find_em_expr_for_rel() is wholesale copied
> >(and unchanged) from the fdw code, so I moved it to equivclass.c so
> >both places can share it.
> >
>
> +1
>
> >
> >Still remaining:
> >
> >1. src/backend/optimizer/util/pathnode.c add_partial_path()
> >* XXX Perhaps we could do this only when incremental sort is enabled,
> >* and use the simpler version (comparing just total cost) otherwise?
> >
> >I don't have a strong opinion here. It doesn't seem like a significant
> >difference in terms of cost?
> >
> >5. planner.c create_ordered_paths:
> >* XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
> >* other pathkeys (grouping, ...) like generate_useful_gather_paths.
> >
> >10. optimizer/path/allpaths.c generate_useful_gather_paths:
> >* XXX I wonder if we need to consider adding a projection here, as
> >* create_ordered_paths does.
> >
> >11. In the same function as the above:
> >* XXX Can't we skip this (maybe only for the cheapest partial path)
> >* when the path is already sorted? Then it's likely duplicate with
> >* the path created by generate_gather_paths.
> >
> >12. In the same function as the above:
> >* XXX This is not redundant with the gather merge path created in
> >* generate_gather_paths, because that merely preserves ordering of
> >* the cheapest partial path, while here we add an explicit sort to
> >* get match the useful ordering.
> >
> >13. planner.c create_ordered_paths:
> >* XXX This is probably duplicate with the paths we already generate
> >* in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
> >
> >Tomas, any chance you could take a look at the above XXX/questions? I
> >believe all of them that remain relate to the planner patches.
> >
>
> Yes, I'll take a look over the weekend.

Awesome, thanks.

I collapsed things down including the changes referenced in this
email, since they were all comment formatting changes.

James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Fri, Mar 27, 2020 at 09:36:55PM -0400, James Coleman wrote:
>On Fri, Mar 27, 2020 at 9:19 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Fri, Mar 27, 2020 at 12:51:34PM -0400, James Coleman wrote:
>> >In a previous email I'd summarized remaining TODOs I'd found. Here's
>> >an updated listed with several resolved.
>> >
>> >Resolved:
>> >
>> >2. Not marked in the patch, but in nodeIncrementalSort.c
>> >ExecIncrementalSort() I wonder if perhaps we should move the algorithm
>> >discussion comments up to the file header comment. On the other hand,
>> >I suppose it could be valuable to leave the the file header comment
>> >more high level about the mathematical properties of incremental sort
>> >rather than discussing the details of the hybrid mode.
>> >
>> >I've decided to do this, and the attached patch series includes the change.
>> >
>>
>> It's a bit tough to find the right balance what to put into the header
>> comment and what should go to function comments, but this seems mostly
>> reasonable. I wouldn't use the double-tab indentation and the copyright
>> notices should stay at the top.
>
>Fixed. I also re-ran pg_indent on the the nodeIncrementalSort.c file.
>
>> >3. nodeIncrementalSort.c ExecIncrementalSort() in the main for loop:
>> > * TODO: do we need to check for interrupts inside these loops or
>> > * will the outer node handle that?
>> >
>> >It seems like what we have is sufficient, given that the nodes (and
>> >sort) we rely on have their own calls. The one place where someone
>> >might make an argument otherwise would be in the mode transition
>> >function where we copy tuples from the full sort state to the
>> >presorted sort state. If this is a problem, let me know, and I'll
>> >change it, but I'm proceeding under the assumption for now that it's
>> >not.
>> >
>>
>> I think what we have now is sufficient.
>>
>> >4. nodeIncrementalSort.c ExecReScanIncrementalSort: This whole chunk
>> >is suspect. I've mentioned previously I don't have a great mental
>> >model of how rescan works and its invariants (IIRC someone said it was
>> >about moving around a result set in a cursor). Regardless I'm pretty
>> >sure this code just doesn't work correctly. Additionally the sort_Done
>> >variable is poorly named; it probably would make more sense to call it
>> >something like "scanHasBegun". I'm waiting to change it though until
>> >cleaning up this code more holistically.
>> >
>> >Fixed, as described in previous email.
>> >
>> >6. regress/expected/incremental_sort.out:
>> >-- TODO if an analyze happens here the plans might change; should we
>> >-- solve by inserting extra rows or by adding a GUC that would somehow
>> >-- forcing the time of plan we expect.
>> >
>> >I've decided this doesn't seem to be a real issue, so, comment removed.
>> >
>>
>> OK
>>
>> >7. Not listed as a comment in the patch, but I need to modify the
>> >testing for analyze output to parse out the memory/disk stats to the
>> >tests are stable.
>> >
>> >Included in the attached patch series. I use plpgsql to munge out the
>> >space kB numbers. I also discovered two bugs in the JSON output along
>> >the way and fixed those (memory and disk need to be output separate;
>> >disk was using the wrong "space type" enum). Finally I also use
>> >plpgsql to check a few invariants (for now just that max space is
>> >greater than or equal to the average.
>> >
>>
>> OK
>>
>> >8. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
>> >* XXX At the moment this can only ever return a list with a single element,
>> >* because it looks at query_pathkeys only. So we might return the pathkeys
>> >* directly, but it seems plausible we'll want to consider other orderings
>> >* in the future.
>> >
>> >I think we just leave this in as a comment.
>> >
>>
>> Fine with me.
>>
>> As a side note here, I'm wondering if this (determining useful pathkeys)
>> can be made a bit smarter by looking both at query_pathkeys and pathkeys
>> useful for merging, similarly to what truncate_useless_pathkeys() does.
>> But that can be seen as an improvement of what we do now.
>
>Unless your comment below about looking at truncate_useless_pathkeys
>is implying you're considering aiming to get this in now, I wonder if
>we should just expand the comment to reference pathkeys useful for
>merging as a possible future extension.
>

Maybe. I've been thinking about how to generate those pathkeys, but
it's a bit tricky, because pathkeys_useful_for_merging(), the thing
that backs truncate_useless_pathkeys(), merely takes pathkeys and
verifies whether they are useful for merging. But this code needs to do
the opposite: generate those pathkeys.

But let's say we know we have a join on columns (a,b,c). For each of
those PathKey values we know it's useful for merging, but should we
generate pathkeys (a,b,c), (b,c,a), ... or any other permutation? I
suppose we can look at pathkeys for existing paths of the relation to
prune the possibilities a bit, but what then?
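
Concretely, I mean something like this (made-up tables):

select * from t1 join t2
  on t1.a = t2.a and t1.b = t2.b and t1.c = t2.c;

Each of a, b and c is individually useful for merging, but that alone
doesn't tell us which of the orderings (a,b,c), (b,c,a), ... is worth
building a path for.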

BTW I wonder if we actually need the ec_has_volatile check in
get_useful_pathkeys_for_relation. The comment says we can't but is it
really true? pathkeys_useful_for_ordering doesn't do any such checks, so
I'd bet this is merely an unnecessary copy-paste from postgres_fdw.
Opinions?

>> >9. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
>> >* Considering query_pathkeys is always worth it, because it might let us
>> >* avoid a local sort.
>> >
>> >That originally was a copy from the fdw code, but since the two
>> >functions have diverged (Is that concerning? I could be confusing, but
>> >isn't a compilation problem) I didn't move the function.
>> >
>>
>> I think it's OK the two functions diverged, it's simply because the FDW
>> one needs to check other things too. But I might rework this once I look
>> closer at truncate_useless_pathkeys.
>
>Agreed, for now at least. It's tempting to think they should always be
>shared, but I'm not convinced (without a lot more digging) that this
>represents structural rather than incidental duplication.
>

The more I look at pathkeys_useful_for_ordering() the more I think the
get_useful_pathkeys_for_relation() function should look more like it
than the postgres_fdw one ...

>> >I did notice though that find_em_expr_for_rel() is wholesale copied
>> >(and unchanged) from the fdw code, so I moved it to equivclass.c so
>> >both places can share it.
>> >
>>
>> +1
>>

... which would also get rid of find_em_expr_for_rel().

>> >
>> >Still remaining:
>> >
>> >1. src/backend/optimizer/util/pathnode.c add_partial_path()
>> >* XXX Perhaps we could do this only when incremental sort is enabled,
>> >* and use the simpler version (comparing just total cost) otherwise?
>> >
>> >I don't have a strong opinion here. It doesn't seem like a significant
>> >difference in terms of cost?
>> >
>> >5. planner.c create_ordered_paths:
>> >* XXX This only looks at sort_pathkeys. I wonder if it needs to look at the
>> >* other pathkeys (grouping, ...) like generate_useful_gather_paths.
>> >
>> >10. optimizer/path/allpaths.c generate_useful_gather_paths:
>> >* XXX I wonder if we need to consider adding a projection here, as
>> >* create_ordered_paths does.
>> >
>> >11. In the same function as the above:
>> >* XXX Can't we skip this (maybe only for the cheapest partial path)
>> >* when the path is already sorted? Then it's likely duplicate with
>> >* the path created by generate_gather_paths.
>> >
>> >12. In the same function as the above:
>> >* XXX This is not redundant with the gather merge path created in
>> >* generate_gather_paths, because that merely preserves ordering of
>> >* the cheapest partial path, while here we add an explicit sort to
>> >* get match the useful ordering.
>> >
>> >13. planner.c create_ordered_paths:
>> >* XXX This is probably duplicate with the paths we already generate
>> >* in generate_useful_gather_paths in apply_scanjoin_target_to_paths.
>> >
>> >Tomas, any chance you could take a look at the above XXX/questions? I
>> >believe all of them that remain relate to the planner patches.
>> >
>>
>> Yes, I'll take a look over the weekend.
>
>Awesome, thanks.
>
>I collapsed things down including the changes referenced in this
>email, since they were all comment formatting changes.
>

Thanks.

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Fri, Mar 27, 2020 at 10:58 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> ...
> >> As a side note here, I'm wondering if this (determining useful pathkeys)
> >> can be made a bit smarter by looking both at query_pathkeys and pathkeys
> >> useful for merging, similarly to what truncate_useless_pathkeys() does.
> >> But that can be seen as an improvement of what we do now.
> >
> >Unless your comment below about looking at truncate_useless_pathkeys
> >is implying you're considering aiming to get this in now, I wonder if
> >we should just expand the comment to reference pathkeys useful for
> >merging as a possible future extension.
> >
>
> Maybe. I've been thinking about how to generate those path keys, but
> it's a bit tricky, because pathkeys_useful_for_merging() - the thing
> that backs truncate_useless_pathkeys() actually gets pathkeys and merely
> verifies if those are useful for merging. But this code needs to do the
> opposite - generate those pathkeys.
>
> But let's say we know we have a join on columns (a,b,c). For each of
> those PathKey values we know it's useful for merging, but should we
> generate pathkeys (a,b,c), (b,c,a), ... or any other permutation? I
> suppose we can look at pathkeys for existing paths of the relation to
> prune the possibilities a bit, but what then?

I'm not convinced it's worth it this time around, both because of the
late hour in the CF and because it seems likely to become pretty
complex quickly, and also far more likely to raise performance
questions in planning if there's not a lot of work done to keep it
limited.

> BTW I wonder if we actually need the ec_has_volatile check in
> get_useful_pathkeys_for_relation. The comment says we can't but is it
> really true? pathkeys_useful_for_ordering doesn't do any such checks, so
> I'd bet this is merely an unnecessary copy-paste from postgres_fdw.
> Opinions?

I hadn't really looked at that part in depth before, but after reading
it over, re-reading the definition of volatility, and running a few
basic queries, I agree.

For example: we already do allow volatile pathkeys in a simple query like:
-- t(a, b) with index on (a)
select * from t order by a, random();

It makes sense that you couldn't push down such a sort to a foreign
server, given there's no constraint that said function isn't operating
directly on the primary database (in the fdw case). But that obviously
wouldn't apply here.

> >> >9. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
> >> >* Considering query_pathkeys is always worth it, because it might let us
> >> >* avoid a local sort.
> >> >
> >> >That originally was a copy from the fdw code, but since the two
> >> >functions have diverged (Is that concerning? I could be confusing, but
> >> >isn't a compilation problem) I didn't move the function.
> >> >
> >>
> >> I think it's OK the two functions diverged, it's simply because the FDW
> >> one needs to check other things too. But I might rework this once I look
> >> closer at truncate_useless_pathkeys.
> >
> >Agreed, for now at least. It's tempting to think they should always be
> >shared, but I'm not convinced (without a lot more digging) that this
> >represents structural rather than incidental duplication.
> >
>
> The more I look at pathkeys_useful_for_ordering() the more I think the
> get_useful_pathkeys_for_relation() function should look more like it
> than the postgres_fdw one ...

If we go down that path (and indeed this is actually implied by
removing the volatile check too), doesn't that really just mean (at
least for now) that get_useful_pathkeys_for_relation goes away
entirely and we effectively use root->query_pathkeys directly? The
only thing you lose then is the check that each eclass has a member
in the rel. But that check probably wasn't quite right anyway: at
least for incremental sort (maybe not regular sort), I think we
probably don't necessarily care that the entire list has members in
the rel, but rather that some prefix does, right? The idea there would
be that we could create a gather merge here that supplies a partial
ordering (that prefix of query_pathkeys) and then above it the planner
might place another incremental sort (say above a join), or perhaps
even a join above that would actually be sufficient (since many joins
are capable of providing an ordering anyway).

I've attached (added prefix .txt to avoid the cfbot assuming this is a
full series) an idea in that direction to see if we're thinking along
the same route. You'll want to apply and view with `git diff -w`
probably.

The attached also adds a few XXX comments. In particular, I wonder if
we should verify that some prefix of the root->query_pathkeys is
actually made up of eclass members for the current rel, because
otherwise I think we can skip the loop on the subpaths entirely.

> >> >I did notice though that find_em_expr_for_rel() is wholesale copied
> >> >(and unchanged) from the fdw code, so I moved it to equivclass.c so
> >> >both places can share it.
> >> >
> >>
> >> +1
> >>
>
> ... which would also get rid of find_em_expr_for_rel().

... which, I think, would retain the need for find_em_expr_for_rel().

Note: The attached applied to the previous series compiles and runs
make check...but I haven't really tested it; it's meant more for "is
this the direction we want to go".

James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Sat, Mar 28, 2020 at 10:19:04AM -0400, James Coleman wrote:
>On Fri, Mar 27, 2020 at 10:58 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> ...
>> >> As a side note here, I'm wondering if this (determining useful pathkeys)
>> >> can be made a bit smarter by looking both at query_pathkeys and pathkeys
>> >> useful for merging, similarly to what truncate_useless_pathkeys() does.
>> >> But that can be seen as an improvement of what we do now.
>> >
>> >Unless your comment below about looking at truncate_useless_pathkeys
>> >is implying you're considering aiming to get this in now, I wonder if
>> >we should just expand the comment to reference pathkeys useful for
>> >merging as a possible future extension.
>> >
>>
>> Maybe. I've been thinking about how to generate those path keys, but
>> it's a bit tricky, because pathkeys_useful_for_merging() - the thing
>> that backs truncate_useless_pathkeys() actually gets pathkeys and merely
>> verifies if those are useful for merging. But this code needs to do the
>> opposite - generate those pathkeys.
>>
>> But let's say we know we have a join on columns (a,b,c). For each of
>> those PathKey values we know it's useful for merging, but should we
>> generate pathkeys (a,b,c), (b,c,a), ... or any other permutation? I
>> suppose we can look at pathkeys for existing paths of the relation to
>> prune the possibilities a bit, but what then?
>
>I'm not convinced it's worth it this time around. Both because of the
>late hour in the CF etc., but also because it seems likely to become
>pretty complex quickly, and also far more likely to raise performance
>questions in planning if there's not a lot of work done to keep it
>limited.
>

Agreed. There'll be time to add more optimizations in the future.

>> BTW I wonder if we actually need the ec_has_volatile check in
>> get_useful_pathkeys_for_relation. The comment says we can't but is it
>> really true? pathkeys_useful_for_ordering doesn't do any such checks, so
>> I'd bet this is merely an unnecessary copy-paste from postgres_fdw.
>> Opinions?
>
>I hadn't really looked at that part in depth before, but after reading
>it over, re-reading the definition of volatility, and running a few
>basic queries, I agree.
>
>For example: we already do allow volatile pathkeys in a simple query like:
>-- t(a, b) with index on (a)
>select * from t order by a, random();
>
>It makes sense that you couldn't push down such a sort to a foreign
>server, given there's no constraints said function isn't operating
>directly on the primary database (in the fdw case). But that obviously
>wouldn't apply here.
>

Thanks for confirming my reasoning.

>> >> >9. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
>> >> >* Considering query_pathkeys is always worth it, because it might let us
>> >> >* avoid a local sort.
>> >> >
>> >> >That originally was a copy from the fdw code, but since the two
>> >> >functions have diverged (Is that concerning? I could be confusing, but
>> >> >isn't a compilation problem) I didn't move the function.
>> >> >
>> >>
>> >> I think it's OK the two functions diverged, it's simply because the FDW
>> >> one needs to check other things too. But I might rework this once I look
>> >> closer at truncate_useless_pathkeys.
>> >
>> >Agreed, for now at least. It's tempting to think they should always be
>> >shared, but I'm not convinced (without a lot more digging) that this
>> >represents structural rather than incidental duplication.
>> >
>>
>> The more I look at pathkeys_useful_for_ordering() the more I think the
>> get_useful_pathkeys_for_relation() function should look more like it
>> than the postgres_fdw one ...
>
>If we go down that path (and indeed this is actually implied by
>removing the volatile check too), doesn't that really just mean (at
>least for now) that get_useful_pathkeys_for_relation goes away
>entirely and we effectively use root->query_pathkeys directly?

Yes, basically.

>The only thing your lose them is the check that each eclass has a
>member in the rel. But that check probably wasn't quite right anyway:
>at least for incremental sort (maybe not regular sort), I think we
>probably don't necessarily care that the entire list has members in the
>rel, but rather that some prefix does, right?

Probably, although I always forget how exactly this EC business works.
My reasoning is more "If pathkeys_useful_for_ordering does not need
that, why should we need it here?"

>The idea there would be that we could create a gather merge here that
>supplies a partial ordering (that prefix of query_pathkeys) and then
>above it the planner might place another incremental sort (say above a
>join), or perhaps even a join above that would actually be sufficient
>(since many joins are capable of providing an ordering anyway).
>

Not sure. Isn't the idea that we do the Incremental Sort below the
Gather Merge, because then it's actually done in parallel?

>I've attached (added prefix .txt to avoid the cfbot assuming this is a
>full series) an idea in that direction to see if we're thinking along
>the same route. You'll want to apply and view with `git diff -w`
>probably.
>
>The attached also adds a few XXX comments. In particular, I wonder if
>we should verify that some prefix of the root->query_pathkeys is
>actually made up of eclass members for the current rel, because
>otherwise I think we can skip the loop on the subpaths entirely.
>

Hmmm, that's an interesting possible optimization. I wonder if that
actually saves anything, though, because the number of paths in the loop
is likely fairly low.

>> >> >I did notice though that find_em_expr_for_rel() is wholesale
>> >> >copied (and unchanged) from the fdw code, so I moved it to
>> >> >equivclass.c so both places can share it.
>> >> >
>> >>
>> >> +1
>> >>
>>
>> ... which would also get rid of find_em_expr_for_rel().
>
>... which, I think, would retain the need for find_em_expr_for_rel().
>
>Note: The attached applied to the previous series compiles and runs
>make check...but I haven't really tested it; it's meant more for "is
>this the direction we want to go".
>

Thanks, I'll take a look.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sat, Mar 28, 2020 at 2:54 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> ...
> >> >> >9. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
> >> >> >* Considering query_pathkeys is always worth it, because it might let us
> >> >> >* avoid a local sort.
> >> >> >
> >> >> >That originally was a copy from the fdw code, but since the two
> >> >> >functions have diverged (Is that concerning? I could be confusing, but
> >> >> >isn't a compilation problem) I didn't move the function.
> >> >> >
> >> >>
> >> >> I think it's OK the two functions diverged, it's simply because the FDW
> >> >> one needs to check other things too. But I might rework this once I look
> >> >> closer at truncate_useless_pathkeys.
> >> >
> >> >Agreed, for now at least. It's tempting to think they should always be
> >> >shared, but I'm not convinced (without a lot more digging) that this
> >> >represents structural rather than incidental duplication.
> >> >
> >>
> >> The more I look at pathkeys_useful_for_ordering() the more I think the
> >> get_useful_pathkeys_for_relation() function should look more like it
> >> than the postgres_fdw one ...
> >
> >If we go down that path (and indeed this is actually implied by
> >removing the volatile check too), doesn't that really just mean (at
> >least for now) that get_useful_pathkeys_for_relation goes away
> >entirely and we effectively use root->query_pathkeys directly?
>
> Yes, basically.
>
> >The only thing your lose them is the check that each eclass has a
> >member in the rel. But that check probably wasn't quite right anyway:
> >at least for incremental sort (maybe not regular sort), I think we
> >probably don't necessarily care that the entire list has members in the
> >rel, but rather that some prefix does, right?
>
> Probably, although I always forget how exactly this EC business works.
> My reasoning is more "If pathkeys_useful_for_ordering does not need
> that, why should we need it here?"

I think it effectively does, since it's called by
truncate_useless_pathkeys with the pathkeys list a path provides and
root->query_pathkeys, and tries to find a prefix.

So if there wasn't a prefix of matching eclasses, we'd return 0 as the
number of matching prefix pathkeys, and thus a NIL list to the caller
of truncate_useless_pathkeys.

> >The idea there would be that we could create a gather merge here that
> >supplies a partial ordering (that prefix of query_pathkeys) and then
> >above it the planner might place another incremental sort (say above a
> >join), or perhaps even a join above that would actually be sufficient
> >(since many joins are capable of providing an ordering anyway).
> >
>
> Not sure. Isn't the idea that we do the Incremental Sort below the
> Gather Merge, because then it's actually done in parallel?

Yeah, I think as I was typing this my ideas got kinda mixed up a bit,
or at least came out confusingly. The incremental sort *would* need a
path that is ordered by a prefix of query_pathkeys and would provide
the sort on the full query_pathkeys.

But the incremental sort at this point would still be on a path with a
given rel, and that path's rel would need to contain all of the
eclasses for the pathkeys we want as the final ordering to be able to
sort on them, right? For example, suppose:
select * from t join s order by t.a, s.b -- index on t.a

We'd have a presorted path on t.a that matches a prefix of the
query_pathkeys, but we couldn't have the incremental sort on a path
whose rel only contained t, right? It'd have to be a rel that was the
result of the join, otherwise we don't yet have access to s.b.

Given that, I think we have two options in this code. Suppose query
pathkeys (a, b, c):
1. Build an incremental sort path on (a, b, c) if we have a path
ordered by (a) or (a, b) but only when (c) is already available.
2. Additionally (to generate the most paths possible): build an
incremental sort path on (a, b) if we have a path ordered by (a) but
(c) isn't yet available.

Either way I think we need the ability to know if all (or some subset)
of the query pathkeys are for eclasses we have access to in the
current rel.

> >I've attached (added prefix .txt to avoid the cfbot assuming this is a
> >full series) an idea in that direction to see if we're thinking along
> >the same route. You'll want to apply and view with `git diff -w`
> >probably.
> >
> >The attached also adds a few XXX comments. In particular, I wonder if
> >we should verify that some prefix of the root->query_pathkeys is
> >actually made up of eclass members for the current rel, because
> >otherwise I think we can skip the loop on the subpaths entirely.
> >
>
> Hmmm, that's an interesting possible optimization. I wonder if that
> actually saves anything, though, because the number of paths in the loop
> is likely fairly low.

If what I said above is correct (and please poke holes in it if
possible), then I think we have to know the matching eclass count
anyway, so we might as well include the optimization since it'd be a
simple int comparison.

> >> >> >I did notice though that find_em_expr_for_rel() is wholesale
> >> >> >copied (and unchanged) from the fdw code, so I moved it to
> >> >> >equivclass.c so both places can share it.
> >> >> >
> >> >>
> >> >> +1
> >> >>
> >>
> >> ... which would also get rid of find_em_expr_for_rel().
> >
> >... which, I think, would retain the need for find_em_expr_for_rel().
> >
> >Note: The attached applied to the previous series compiles and runs
> >make check...but I haven't really tested it; it's meant more for "is
> >this the direction we want to go".
> >
>
> Thanks, I'll take a look.

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Sat, Mar 28, 2020 at 10:19:04AM -0400, James Coleman wrote:
>On Fri, Mar 27, 2020 at 10:58 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> ...
>>
>> The more I look at pathkeys_useful_for_ordering() the more I think the
>> get_useful_pathkeys_for_relation() function should look more like it
>> than the postgres_fdw one ...
>
>If we go down that path (and indeed this is actually implied by
>removing the volatile check too), doesn't that really just mean (at
>least for now) that get_useful_pathkeys_for_relation goes away
>entirely and we effectively use root->query_pathkeys directly? The
>only thing your lose them is the check that each eclass has a member
>in the rel. But that check probably wasn't quite right anyway: at
>least for incremental sort (maybe not regular sort), I think we
>probably don't necessarily care that the entire list has members in
>the rel, but rather that some prefix does, right? The idea there would
>be that we could create a gather merge here that supplies a partial
>ordering (that prefix of query_pathkeys) and then above it the planner
>might place another incremental sort (say above a join), or perhaps
>even a join above that would actually be sufficient (since many joins
>are capable of providing an ordering anyway).
>
>I've attached (added prefix .txt to avoid the cfbot assuming this is a
>full series) an idea in that direction to see if we're thinking along
>the same route. You'll want to apply and view with `git diff -w`
>probably.
>

Hmmm, I'm not sure the patch is quite correct.

Firstly, I suggest we don't remove get_useful_pathkeys_for_relation
entirely, because that allows us to add more useful pathkeys in the
future (even if we don't consider pathkeys useful for merging now).
We could also do the EC optimization in the function, return NIL and
not loop over partial_pathlist at all. That's a cosmetic issue, though.

More importantly, I'm not sure this makes sense:

   /* consider incremental sort for interesting orderings */
   useful_pathkeys = truncate_useless_pathkeys(root, rel, subpath->pathkeys);

   ...

   is_sorted = pathkeys_common_contained_in(useful_pathkeys,
                                            subpath->pathkeys,
                                            &presorted_keys);

I mean, useful_pathkeys is bound to be a sublist of subpath->pathkeys,
right? So how could this ever return is_sorted=false?

The whole point is to end up with query_pathkeys (or whatever pathkeys
we deem useful), but this does not do that.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sat, Mar 28, 2020 at 5:30 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Sat, Mar 28, 2020 at 10:19:04AM -0400, James Coleman wrote:
> >On Fri, Mar 27, 2020 at 10:58 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> ...
> >>
> >> The more I look at pathkeys_useful_for_ordering() the more I think the
> >> get_useful_pathkeys_for_relation() function should look more like it
> >> than the postgres_fdw one ...
> >
> >If we go down that path (and indeed this is actually implied by
> >removing the volatile check too), doesn't that really just mean (at
> >least for now) that get_useful_pathkeys_for_relation goes away
> >entirely and we effectively use root->query_pathkeys directly? The
> >only thing your lose them is the check that each eclass has a member
> >in the rel. But that check probably wasn't quite right anyway: at
> >least for incremental sort (maybe not regular sort), I think we
> >probably don't necessarily care that the entire list has members in
> >the rel, but rather that some prefix does, right? The idea there would
> >be that we could create a gather merge here that supplies a partial
> >ordering (that prefix of query_pathkeys) and then above it the planner
> >might place another incremental sort (say above a join), or perhaps
> >even a join above that would actually be sufficient (since many joins
> >are capable of providing an ordering anyway).
> >
> >I've attached (added prefix .txt to avoid the cfbot assuming this is a
> >full series) an idea in that direction to see if we're thinking along
> >the same route. You'll want to apply and view with `git diff -w`
> >probably.
> >
>
> Hmmm, I'm not sure the patch is quite correct.
>
> Firstly, I suggest we don't remove get_useful_pathkeys_for_relation
> entirely, because that allows us to add more useful pathkeys in the
> future (even if we don't consider pathkeys useful for merging now).
> We could also do the EC optimization in the function, return NIL and
> not loop over partial_pathlist at all. That's cosmetic issue, though.
>
> More importantly, I'm not sure this makes sense:
>
>    /* consider incremental sort for interesting orderings */
>    useful_pathkeys = truncate_useless_pathkeys(root, rel, subpath->pathkeys);
>
>    ...
>
>    is_sorted = pathkeys_common_contained_in(useful_pathkeys,
>                                             subpath->pathkeys,
>                                             &presorted_keys);
>
> I mean, useful_pathkeys is bound to be a sublist of subpath->pathkeys,
> right? So how could this ever return is_sorted=false?
>
> The whole point is to end up with query_pathkeys (or whatever pathkeys
> we deem useful), but this does not do that.

Yes, that's obviously a thinko in my rush to get an idea out.

I think useful_pathkeys there would be essentially
root->query_pathkeys, or, more correctly, the prefix of query_pathkeys
that has eclasses shared with the current rel. Or, if we go with the
more restrictive approach (see the two approaches I mentioned
earlier), then either NIL or query_pathkeys (if they all match
eclasses in the current rel).

Given those requirements, I agree that keeping
get_useful_pathkeys_for_relation makes sense to wrap up all of that
behavior.

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
Hi,

Attached is my take on simplifying the useful pathkeys thing. It
keeps the function, but it truncates query_pathkeys to only members with
EC members in the relation. I think that's essentially the optimization
you've proposed.
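
The core of it is just a prefix truncation of root->query_pathkeys,
roughly like this (a sketch only, with a hypothetical helper name;
I'm assuming an eclass-membership check along the lines of
postgres_fdw's find_em_expr_for_rel):

    /* hypothetical sketch, not the patch text */
    static List *
    truncate_query_pathkeys_for_rel(PlannerInfo *root, RelOptInfo *rel)
    {
        List       *useful_pathkeys = NIL;
        ListCell   *lc;

        foreach(lc, root->query_pathkeys)
        {
            PathKey    *pathkey = (PathKey *) lfirst(lc);

            /* stop at the first key whose eclass has no member in rel */
            if (find_em_expr_for_rel(pathkey->pk_eclass, rel) == NULL)
                break;

            useful_pathkeys = lappend(useful_pathkeys, pathkey);
        }

        return useful_pathkeys;
    }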

I've also noticed an issue in explain output. EXPLAIN ANALYZE on a simple
query gives me this:

                                                                       QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------
  Gather Merge  (cost=66.30..816060.48 rows=8333226 width=24) (actual time=6.464..19091.006 rows=10000000 loops=1)
    Workers Planned: 2
    Workers Launched: 2
    ->  Incremental Sort  (cost=66.28..729188.13 rows=4166613 width=24) (actual time=1.836..13401.109 rows=3333333 loops=3)
          Sort Key: a, b, c
          Presorted Key: a, b
          Full-sort Groups: 4156 (Methods: quicksort) Memory: 30kB (avg), 30kB (max)
          Presorted Groups: 4137 (Methods: quicksort) Memory: 108kB (avg), 111kB (max)
          Full-sort Groups: 6888 (Methods: ) Memory: 30kB (avg), 30kB (max)
          Presorted Groups: 6849 (Methods: ) Memory: 121kB (avg), 131kB (max)
          Full-sort Groups: 6869 (Methods: ) Memory: 30kB (avg), 30kB (max)
          Presorted Groups: 6816 (Methods: ) Memory: 128kB (avg), 132kB (max)
          ->  Parallel Index Scan using t_a_b_idx on t  (cost=0.43..382353.69 rows=4166613 width=24) (actual time=0.033..9346.679 rows=3333333 loops=3)
  Planning Time: 0.133 ms
  Execution Time: 23998.669 ms
(15 rows)

while with incremental sort disabled it looks like this:

                                                              QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
  Gather Merge  (cost=734387.50..831676.35 rows=8333226 width=24) (actual time=5597.978..14967.246 rows=10000000 loops=1)
    Workers Planned: 2
    Workers Launched: 2
    ->  Sort  (cost=734387.47..744804.00 rows=4166613 width=24) (actual time=5584.616..7645.711 rows=3333333 loops=3)
          Sort Key: a, b, c
          Sort Method: external merge  Disk: 111216kB
          Worker 0:  Sort Method: external merge  Disk: 109552kB
          Worker 1:  Sort Method: external merge  Disk: 112112kB
          ->  Parallel Seq Scan on t  (cost=0.00..105361.13 rows=4166613 width=24) (actual time=0.011..1753.128 rows=3333333 loops=3)
  Planning Time: 0.048 ms
  Execution Time: 19682.582 ms
(11 rows)

So I think there's a couple of issues:

1) Missing worker identification (Worker #).

2) Missing method for workers (we have it for the leader, though).

3) I'm not sure why the label is "Methods" instead of "Sort Method", and
why it's in parentheses.

4) Not sure having two lines for each worker is a great idea.

5) I'd probably prefer having multiple labels for avg/max memory values,
instead of (avg) and (max) notes. Also, I think we use "peak" in this
context instead of "max".


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sat, Mar 28, 2020 at 6:59 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> Attached is my take on simplification of the useful pathkeyes thing. It
> keeps the function, but it truncates query_pathkeys to only members with
> EC members in the relation. I think that's essentially the optimization
> you've proposed.

Thanks. I've included that in the patch series in this email (as a
separate patch) with a few additional comments.

I've also noticed that the enabled_incrementalsort check in
generate_useful_gather_paths seemed broken, because it returned us out
of the function before creating either a plain gather merge (if
already sorted) or an explicit sort path. I've included a patch that
moves it to the if block that actually builds the incremental sort
path.
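
Roughly this shape (a sketch of the relocation, not the actual diff; the
GUC and function names here are assumptions based on the discussion, not
copied from the patch):

    if (!is_sorted)
    {
        /* only building the incremental sort path is gated by the GUC now */
        if (!enable_incrementalsort)
            continue;

        subpath = (Path *) create_incremental_sort_path(root, rel, subpath,
                                                        useful_pathkeys,
                                                        presorted_keys,
                                                        -1);
    }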

> I've also noticed an issue in explain output. EXPLAIN ANALYZE on a simple
> query gives me this:
>
>                                                                        QUERY PLAN
>
-------------------------------------------------------------------------------------------------------------------------------------------------------
>   Gather Merge  (cost=66.30..816060.48 rows=8333226 width=24) (actual time=6.464..19091.006 rows=10000000 loops=1)
>     Workers Planned: 2
>     Workers Launched: 2
>     ->  Incremental Sort  (cost=66.28..729188.13 rows=4166613 width=24) (actual time=1.836..13401.109 rows=3333333
loops=3)
>           Sort Key: a, b, c
>           Presorted Key: a, b
>           Full-sort Groups: 4156 (Methods: quicksort) Memory: 30kB (avg), 30kB (max)
>           Presorted Groups: 4137 (Methods: quicksort) Memory: 108kB (avg), 111kB (max)
>           Full-sort Groups: 6888 (Methods: ) Memory: 30kB (avg), 30kB (max)
>           Presorted Groups: 6849 (Methods: ) Memory: 121kB (avg), 131kB (max)
>           Full-sort Groups: 6869 (Methods: ) Memory: 30kB (avg), 30kB (max)
>           Presorted Groups: 6816 (Methods: ) Memory: 128kB (avg), 132kB (max)
>           ->  Parallel Index Scan using t_a_b_idx on t  (cost=0.43..382353.69 rows=4166613 width=24) (actual
time=0.033..9346.679rows=3333333 loops=3)
 
>   Planning Time: 0.133 ms
>   Execution Time: 23998.669 ms
> (15 rows)
>
> while with incremental sort disabled it looks like this:
>
>                                                               QUERY PLAN
>
-------------------------------------------------------------------------------------------------------------------------------------
>   Gather Merge  (cost=734387.50..831676.35 rows=8333226 width=24) (actual time=5597.978..14967.246 rows=10000000
loops=1)
>     Workers Planned: 2
>     Workers Launched: 2
>     ->  Sort  (cost=734387.47..744804.00 rows=4166613 width=24) (actual time=5584.616..7645.711 rows=3333333
loops=3)
>           Sort Key: a, b, c
>           Sort Method: external merge  Disk: 111216kB
>           Worker 0:  Sort Method: external merge  Disk: 109552kB
>           Worker 1:  Sort Method: external merge  Disk: 112112kB
>           ->  Parallel Seq Scan on t  (cost=0.00..105361.13 rows=4166613 width=24) (actual time=0.011..1753.128
rows=3333333loops=3)
 
>   Planning Time: 0.048 ms
>   Execution Time: 19682.582 ms
> (11 rows)
>
> So I think there's a couple of issues:
>
> 1) Missing worker identification (Worker #).

Fixed.

> 2) Missing method for workers (we have it for the leader, though).

Fixed. Since we can't have pointers in the parallel shared memory
space, we can't store the sort methods used in a list. To accomplish
the same goal, I've assigned the TuplesortMethod enum entries unique
bit positions, and store the methods used in a bitmask.
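
As a standalone illustration of the technique (a toy program, not the
patch code - it just mirrors the idea of giving each method its own bit
and OR-ing them into a plain integer that can live in shared memory):

    #include <stdio.h>

    typedef enum
    {
        SORT_TYPE_STILL_IN_PROGRESS = 0,
        SORT_TYPE_TOP_N_HEAPSORT = 2,
        SORT_TYPE_QUICKSORT = 4,
        SORT_TYPE_EXTERNAL_SORT = 8,
        SORT_TYPE_EXTERNAL_MERGE = 16
    } TuplesortMethod;

    int
    main(void)
    {
        unsigned int methods = 0;   /* stands in for the shared instrumentation field */

        /* each finished sort group ORs in the method it ended up using */
        methods |= SORT_TYPE_QUICKSORT;
        methods |= SORT_TYPE_TOP_N_HEAPSORT;
        methods |= SORT_TYPE_QUICKSORT;     /* a repeat is a no-op */

        /* EXPLAIN-style decoding: list every method seen at least once */
        printf("Sort Methods:");
        if (methods & SORT_TYPE_TOP_N_HEAPSORT)
            printf(" top-N heapsort");
        if (methods & SORT_TYPE_QUICKSORT)
            printf(" quicksort");
        if (methods & SORT_TYPE_EXTERNAL_SORT)
            printf(" external sort");
        if (methods & SORT_TYPE_EXTERNAL_MERGE)
            printf(" external merge");
        printf("\n");
        return 0;
    }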

> 3) I'm not sure why the lable is "Methods" instead of "Sort Method", and
> why it's in parenthesis.

I've removed the parentheses. It's labeled "Methods" since there can
be more than one (different batches could use different methods). I've
updated this to properly use singular/plural depending on the number
of methods used.

> 4) Not sure having two lines for each worker is a great idea.

I've left these in for now because the lines are already very long
(much, much longer than the worker lines in a standard sort node).
This is largely because we're trying to summarize many sort batches,
while standard sort nodes only have to give the exact stats from a
single batch.

See the example output later in the email.

> 5) I'd probably prefer having multiple labels for avg/max memory values,
> instead of (avg) and (max) notes. Also, I think we use "peak" in this
> context instead of "max".

Updated.

Here's the current output:

 Limit  (cost=1887419.20..1889547.68 rows=10000 width=8) (actual time=13218.403..13222.519 rows=10000 loops=1)
   ->  Gather Merge  (cost=1887419.20..19624748.03 rows=83333360 width=8) (actual time=13218.401..13229.750 rows=10000 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Incremental Sort  (cost=1886419.17..10005010.55 rows=41666680 width=8) (actual time=13208.004..13208.586 rows=4425 loops=3)
               Sort Key: a, b
               Presorted Key: a
               Full-sort Groups: 1 Sort Method: quicksort Memory: avg=28kB peak=28kB
               Presorted Groups: 1 Sort Method: top-N heapsort Memory: avg=1681kB peak=1681kB
               Worker 0:  Full-sort Groups: 1 Sort Method: quicksort Memory: avg=28kB peak=28kB
                 Presorted Groups: 1 Sort Method: top-N heapsort Memory: avg=1680kB peak=1680kB
               Worker 1:  Full-sort Groups: 1 Sort Method: quicksort Memory: avg=28kB peak=28kB
                 Presorted Groups: 1 Sort Method: top-N heapsort Memory: avg=1682kB peak=1682kB
               ->  Parallel Index Scan using index_s_a on s  (cost=0.57..4967182.06 rows=41666680 width=8) (actual time=0.455..11730.878 rows=6666668 loops=3)

James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Sat, Mar 28, 2020 at 10:47:49PM -0400, James Coleman wrote:
>On Sat, Mar 28, 2020 at 6:59 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> Hi,
>>
>> Attached is my take on simplification of the useful pathkeyes thing. It
>> keeps the function, but it truncates query_pathkeys to only members with
>> EC members in the relation. I think that's essentially the optimization
>> you've proposed.
>
>Thanks. I've included that in the patch series in this email (as a
>separate patch) with a few additional comments.
>

Thanks.

>I've also noticed that the enabled_incrementalsort check in
>generate_useful_gather_paths seemed broken, because it returned us out
>of the function before creating either a plain gather merge (if
>already sorted) or an explicit sort path. I've included a patch that
>moves it to the if block that actually builds the incremental sort
>path.
>

Hmmm, that's probably right.

The reason why the GUC check was right after generate_gather_paths is
that the intent was to disable all the useful-pathkeys business,
essentially reducing it back to plain generate_gather_paths.

But I think you're right that's wrong, because it might lead to strange
behavior when the GUC switches between plans without any incremental
sort nodes - setting it to 'true' might end up picking GM on top of
plain Sort, for example.

>>
>> ...
>>
>> 1) Missing worker identification (Worker #).
>
>Fixed.
>
>> 2) Missing method for workers (we have it for the leader, though).
>
>Fixed. Since we can't have pointers in the parallel shared memory
>space, we can't store the sort methods used in a list. To accomplish
>the same goal, I've assigned the TuplesortMethod enum entries uique
>bit positions, and store the methods used in a bitmask.
>

OK, makes sense.

>> 3) I'm not sure why the lable is "Methods" instead of "Sort Method", and
>> why it's in parenthesis.
>
>I've removed the parentheses. It's labeled "Methods" since there can
>be more than one (different batches could use different methods). I've
>updated this to properly use singular/plural depending on the number
>of methods used.
>

I'm a bit confused. How do I know which batch used which method? Is it
actually worth printing in explain analyze? Maybe only print it in the
verbose mode?

>> 4) Not sure having two lines for each worker is a great idea.
>
>I've left these in for now because the lines are already very long
>(much, much longer than the worker lines in a standard sort node).
>This is largely because we're trying to summarize many sort batches,
>while standard sort nodes only have to give the exact stats from a
>single batch.
>
>See the example output later in the email.
>

OK

>> 5) I'd probably prefer having multiple labels for avg/max memory values,
>> instead of (avg) and (max) notes. Also, I think we use "peak" in this
>> context instead of "max".
>
>Updated.
>

OK

>Here's the current output:
>
> Limit  (cost=1887419.20..1889547.68 rows=10000 width=8) (actual
>time=13218.403..13222.519 rows=10000 loo
>ps=1)
>   ->  Gather Merge  (cost=1887419.20..19624748.03 rows=83333360
>width=8) (actual time=13218.401..13229.7
>50 rows=10000 loops=1)
>         Workers Planned: 2
>         Workers Launched: 2
>         ->  Incremental Sort  (cost=1886419.17..10005010.55
>rows=41666680 width=8) (actual time=13208.00
>4..13208.586 rows=4425 loops=3)
>               Sort Key: a, b
>               Presorted Key: a
>               Full-sort Groups: 1 Sort Method: quicksort Memory:
>avg=28kB peak=28kB
>               Presorted Groups: 1 Sort Method: top-N heapsort Memory:
>avg=1681kB peak=1681kB
>               Worker 0:  Full-sort Groups: 1 Sort Method: quicksort
>Memory: avg=28kB peak=28kB
>                 Presorted Groups: 1 Sort Method: top-N heapsort
>Memory: avg=1680kB peak=1680kB
>               Worker 1:  Full-sort Groups: 1 Sort Method: quicksort
>Memory: avg=28kB peak=28kB
>                 Presorted Groups: 1 Sort Method: top-N heapsort
>Memory: avg=1682kB peak=1682kB
>               ->  Parallel Index Scan using index_s_a on s
>(cost=0.57..4967182.06 rows=41666680 width=8
>) (actual time=0.455..11730.878 rows=6666668 loops=3)
>
>James

Looks reasonable. Did you try it in other output formats - json/yaml?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sat, Mar 28, 2020 at 11:14 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Sat, Mar 28, 2020 at 10:47:49PM -0400, James Coleman wrote:
> >On Sat, Mar 28, 2020 at 6:59 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> Hi,
> >>
> >> Attached is my take on simplification of the useful pathkeyes thing. It
> >> keeps the function, but it truncates query_pathkeys to only members with
> >> EC members in the relation. I think that's essentially the optimization
> >> you've proposed.
> >
> >Thanks. I've included that in the patch series in this email (as a
> >separate patch) with a few additional comments.
> >
>
> Thanks.
>
> >I've also noticed that the enabled_incrementalsort check in
> >generate_useful_gather_paths seemed broken, because it returned us out
> >of the function before creating either a plain gather merge (if
> >already sorted) or an explicit sort path. I've included a patch that
> >moves it to the if block that actually builds the incremental sort
> >path.
> >
>
> Hmmm, that's probably right.
>
> The reason why the GUC check was right after generate_gather_paths is
> that the intent to disable all the useful-pathkeys business, essentially
> reducing it back to plain generate_gather_paths.
>
> But I think you're right that's wrong, because it might lead to strange
> behavior when the GUC switches between plans without any incremental
> sort nodes - setting it to 'true' might end up picking GM on top of
> plain Sort, for example.

Thanks.

> >>
> >> ...
> >>
> >> 1) Missing worker identification (Worker #).
> >
> >Fixed.
> >
> >> 2) Missing method for workers (we have it for the leader, though).
> >
> >Fixed. Since we can't have pointers in the parallel shared memory
> >space, we can't store the sort methods used in a list. To accomplish
> >the same goal, I've assigned the TuplesortMethod enum entries uique
> >bit positions, and store the methods used in a bitmask.
> >
>
> OK, makes sense.
>
> >> 3) I'm not sure why the lable is "Methods" instead of "Sort Method", and
> >> why it's in parenthesis.
> >
> >I've removed the parentheses. It's labeled "Methods" since there can
> >be more than one (different batches could use different methods). I've
> >updated this to properly use singular/plural depending on the number
> >of methods used.
> >
>
> I'm a bit confused. How do I know which batch used which method? Is it
> actually worth printing in explain analyze? Maybe only print it in the
> verbose mode?

The alternative is showing no sort method information at all, or only
showing it if all batches used the same method (which seems confusing
to me). It seems weird that we wouldn't try to find some rough
analogue to what a regular sort node outputs, so I've attempted to
summarize.

This is similar to the memory information: the average doesn't apply
to any one batch, and you don't know which one (or how many) hit the
peak memory usage either, but I think it's meaningful to know a
summary.

With the sort methods, I think it's useful to be able to, for example,
know if any of the groups happened to trigger the top-n heapsort
optimization, or not, and as a corollary, if all of them did or not.

> >> 4) Not sure having two lines for each worker is a great idea.
> >
> >I've left these in for now because the lines are already very long
> >(much, much longer than the worker lines in a standard sort node).
> >This is largely because we're trying to summarize many sort batches,
> >while standard sort nodes only have to give the exact stats from a
> >single batch.
> >
> >See the example output later in the email.
> >
>
> OK
>
> >> 5) I'd probably prefer having multiple labels for avg/max memory values,
> >> instead of (avg) and (max) notes. Also, I think we use "peak" in this
> >> context instead of "max".
> >
> >Updated.
> >
>
> OK
>
> >Here's the current output:
> >
> > Limit  (cost=1887419.20..1889547.68 rows=10000 width=8) (actual
> >time=13218.403..13222.519 rows=10000 loo
> >ps=1)
> >   ->  Gather Merge  (cost=1887419.20..19624748.03 rows=83333360
> >width=8) (actual time=13218.401..13229.7
> >50 rows=10000 loops=1)
> >         Workers Planned: 2
> >         Workers Launched: 2
> >         ->  Incremental Sort  (cost=1886419.17..10005010.55
> >rows=41666680 width=8) (actual time=13208.00
> >4..13208.586 rows=4425 loops=3)
> >               Sort Key: a, b
> >               Presorted Key: a
> >               Full-sort Groups: 1 Sort Method: quicksort Memory:
> >avg=28kB peak=28kB
> >               Presorted Groups: 1 Sort Method: top-N heapsort Memory:
> >avg=1681kB peak=1681kB
> >               Worker 0:  Full-sort Groups: 1 Sort Method: quicksort
> >Memory: avg=28kB peak=28kB
> >                 Presorted Groups: 1 Sort Method: top-N heapsort
> >Memory: avg=1680kB peak=1680kB
> >               Worker 1:  Full-sort Groups: 1 Sort Method: quicksort
> >Memory: avg=28kB peak=28kB
> >                 Presorted Groups: 1 Sort Method: top-N heapsort
> >Memory: avg=1682kB peak=1682kB
> >               ->  Parallel Index Scan using index_s_a on s
> >(cost=0.57..4967182.06 rows=41666680 width=8
> >) (actual time=0.455..11730.878 rows=6666668 loops=3)
>
> Looks reasonable. Did you try it in other output formats - json/yaml?

I did. JSON looks good also, which implies to me yaml would too (but I
didn't look at it).

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sat, Mar 28, 2020 at 11:23 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Sat, Mar 28, 2020 at 11:14 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > On Sat, Mar 28, 2020 at 10:47:49PM -0400, James Coleman wrote:
> > >On Sat, Mar 28, 2020 at 6:59 PM Tomas Vondra
> > ><tomas.vondra@2ndquadrant.com> wrote:
> > >>
> > >> Hi,
> > >>
> > >> Attached is my take on simplification of the useful pathkeyes thing. It
> > >> keeps the function, but it truncates query_pathkeys to only members with
> > >> EC members in the relation. I think that's essentially the optimization
> > >> you've proposed.
> > >
> > >Thanks. I've included that in the patch series in this email (as a
> > >separate patch) with a few additional comments.
> > >
> >
> > Thanks.
> >
> > >I've also noticed that the enabled_incrementalsort check in
> > >generate_useful_gather_paths seemed broken, because it returned us out
> > >of the function before creating either a plain gather merge (if
> > >already sorted) or an explicit sort path. I've included a patch that
> > >moves it to the if block that actually builds the incremental sort
> > >path.
> > >
> >
> > Hmmm, that's probably right.
> >
> > The reason why the GUC check was right after generate_gather_paths is
> > that the intent to disable all the useful-pathkeys business, essentially
> > reducing it back to plain generate_gather_paths.
> >
> > But I think you're right that's wrong, because it might lead to strange
> > behavior when the GUC switches between plans without any incremental
> > sort nodes - setting it to 'true' might end up picking GM on top of
> > plain Sort, for example.
>
> Thanks.
>
> > >>
> > >> ...
> > >>
> > >> 1) Missing worker identification (Worker #).
> > >
> > >Fixed.
> > >
> > >> 2) Missing method for workers (we have it for the leader, though).
> > >
> > >Fixed. Since we can't have pointers in the parallel shared memory
> > >space, we can't store the sort methods used in a list. To accomplish
> > >the same goal, I've assigned the TuplesortMethod enum entries uique
> > >bit positions, and store the methods used in a bitmask.
> > >
> >
> > OK, makes sense.
> >
> > >> 3) I'm not sure why the lable is "Methods" instead of "Sort Method", and
> > >> why it's in parenthesis.
> > >
> > >I've removed the parentheses. It's labeled "Methods" since there can
> > >be more than one (different batches could use different methods). I've
> > >updated this to properly use singular/plural depending on the number
> > >of methods used.
> > >
> >
> > I'm a bit confused. How do I know which batch used which method? Is it
> > actually worth printing in explain analyze? Maybe only print it in the
> > verbose mode?
>
> The alternative is showing no sort method information at all, or only
> showing it if all batches used the same method (which seems confusing
> to me). It seems weird that we wouldn't try to find some rough
> analogue to what a regular sort node outputs, so I've attempted to
> summarize.
>
> This is similar to the memory information: the average doesn't apply
> to any one batch, and you don't know which one (or how many) hit the
> peak memory usage either, but I think it's meaningful to know a
> summary.
>
> With the sort methods, I think it's useful to be able to, for example,
> know if any of the groups happened to trigger the top-n heapsort
> optimization, or not, and as a corollary, if all of them did or not.
>
> > >> 4) Not sure having two lines for each worker is a great idea.
> > >
> > >I've left these in for now because the lines are already very long
> > >(much, much longer than the worker lines in a standard sort node).
> > >This is largely because we're trying to summarize many sort batches,
> > >while standard sort nodes only have to give the exact stats from a
> > >single batch.
> > >
> > >See the example output later in the email.
> > >
> >
> > OK
> >
> > >> 5) I'd probably prefer having multiple labels for avg/max memory values,
> > >> instead of (avg) and (max) notes. Also, I think we use "peak" in this
> > >> context instead of "max".
> > >
> > >Updated.
> > >
> >
> > OK
> >
> > >Here's the current output:
> > >
> > > Limit  (cost=1887419.20..1889547.68 rows=10000 width=8) (actual
> > >time=13218.403..13222.519 rows=10000 loo
> > >ps=1)
> > >   ->  Gather Merge  (cost=1887419.20..19624748.03 rows=83333360
> > >width=8) (actual time=13218.401..13229.7
> > >50 rows=10000 loops=1)
> > >         Workers Planned: 2
> > >         Workers Launched: 2
> > >         ->  Incremental Sort  (cost=1886419.17..10005010.55
> > >rows=41666680 width=8) (actual time=13208.00
> > >4..13208.586 rows=4425 loops=3)
> > >               Sort Key: a, b
> > >               Presorted Key: a
> > >               Full-sort Groups: 1 Sort Method: quicksort Memory:
> > >avg=28kB peak=28kB
> > >               Presorted Groups: 1 Sort Method: top-N heapsort Memory:
> > >avg=1681kB peak=1681kB
> > >               Worker 0:  Full-sort Groups: 1 Sort Method: quicksort
> > >Memory: avg=28kB peak=28kB
> > >                 Presorted Groups: 1 Sort Method: top-N heapsort
> > >Memory: avg=1680kB peak=1680kB
> > >               Worker 1:  Full-sort Groups: 1 Sort Method: quicksort
> > >Memory: avg=28kB peak=28kB
> > >                 Presorted Groups: 1 Sort Method: top-N heapsort
> > >Memory: avg=1682kB peak=1682kB
> > >               ->  Parallel Index Scan using index_s_a on s
> > >(cost=0.57..4967182.06 rows=41666680 width=8
> > >) (actual time=0.455..11730.878 rows=6666668 loops=3)
> >
> > Looks reasonable. Did you try it in other output formats - json/yaml?
>
> I did. JSON looks good also, which implies to me yaml would to (but I
> didn't look at it).

After sleeping on it, I decided the XXX I'd left in my explain fixes
was too gross to keep around, so I've replaced it with a macro that
selects the proper shared or private memory group info struct. That way
we avoid the pointer-comparison hack to reverse engineer (in the
parallel worker case) which group info we were looking to use.

James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
Hi,

Attached is a slightly reorganized patch series. I've merged the fixes
into the appropriate patches, and I've also combined the two patches
adding incremental sort paths to additional places in the planner.

A couple more comments:


1) I think the GUC documentation in src/sgml/config.sgml is a bit too
detailed, compared to the other enable_* GUCs. I wonder if there's a
better place to move the details to. What about adding some examples
and explanation to perform.sgml?


2) Looking at the explain output, the verbose mode looks like this:

test=# explain (verbose, analyze) select a from t order by a, b, c;
                                                                           QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------
  Gather Merge  (cost=66.31..816072.71 rows=8333226 width=24) (actual time=4.787..20092.555 rows=10000000 loops=1)
    Output: a, b, c
    Workers Planned: 2
    Workers Launched: 2
    ->  Incremental Sort  (cost=66.28..729200.36 rows=4166613 width=24) (actual time=1.308..14021.575 rows=3333333 loops=3)
          Output: a, b, c
          Sort Key: t.a, t.b, t.c
          Presorted Key: t.a, t.b
          Full-sort Groups: 4169 Sort Method: quicksort Memory: avg=30kB peak=30kB
          Presorted Groups: 4144 Sort Method: quicksort Memory: avg=128kB peak=138kB
          Worker 0:  actual time=0.766..16122.368 rows=3841573 loops=1
  Full-sort Groups: 6871 Sort Method: quicksort Memory: avg=30kB peak=30kB
            Presorted Groups: 6823 Sort Method: quicksort Memory: avg=132kB peak=141kB
          Worker 1:  actual time=1.986..16189.831 rows=3845490 loops=1
  Full-sort Groups: 6874 Sort Method: quicksort Memory: avg=30kB peak=30kB
            Presorted Groups: 6847 Sort Method: quicksort Memory: avg=130kB peak=139kB
          ->  Parallel Index Scan using t_a_b_idx on public.t  (cost=0.43..382365.92 rows=4166613 width=24) (actual time=0.040..9808.449 rows=3333333 loops=3)
                Output: a, b, c
                Worker 0:  actual time=0.048..11275.178 rows=3841573 loops=1
                Worker 1:  actual time=0.041..11314.133 rows=3845490 loops=1
  Planning Time: 0.166 ms
  Execution Time: 25135.029 ms
(22 rows)

There seems to be missing indentation for the first line of worker info.

I'm still not quite convinced we should be printing two lines - I know
you mentioned the lines might be too long, but see how long the other
lines may get ...


3) I see the new nodes (plan state, ...) have "presortedCols" which does
not indicate it's a "number of". I think we usually prefix names of such
fields "n" or "num". What about "nPresortedCols"? (Nitpicking, I know.)


My TODO for this patch is this:

- review the costing (I think the estimates are OK, but I recall I
   haven't been entirely happy with how it's broken into functions.)

- review the tuplesort changes (the memory contexts etc.)

- do more testing of performance impact on planning


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sun, Mar 29, 2020 at 9:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> Attached is a slightly reorganized patch series. I've merged the fixes
> into the appropriate matches, and I've also combined the two patches
> adding incremental sort paths to additional places in planner.
>
> A couple more comments:
>
>
> 1) I think the GUC documentation in src/sgml/config.sgml is a bit too
> detailed, compared to the other enable_* GUCs. I wonder if there's a
> better place where to move the details. What about adding some examples
> and explanation to perform.sgml?

I'll take a look at that and include it in a patch series tomorrow.

> 2) Looking at the explain output, the verbose mode looks like this:
>
> test=# explain (verbose, analyze) select a from t order by a, b, c;
>                                                                            QUERY PLAN
>
--------------------------------------------------------------------------------------------------------------------------------------------------------------
>   Gather Merge  (cost=66.31..816072.71 rows=8333226 width=24) (actual time=4.787..20092.555 rows=10000000 loops=1)
>     Output: a, b, c
>     Workers Planned: 2
>     Workers Launched: 2
>     ->  Incremental Sort  (cost=66.28..729200.36 rows=4166613 width=24) (actual time=1.308..14021.575 rows=3333333
loops=3)
>           Output: a, b, c
>           Sort Key: t.a, t.b, t.c
>           Presorted Key: t.a, t.b
>           Full-sort Groups: 4169 Sort Method: quicksort Memory: avg=30kB peak=30kB
>           Presorted Groups: 4144 Sort Method: quicksort Memory: avg=128kB peak=138kB
>           Worker 0:  actual time=0.766..16122.368 rows=3841573 loops=1
>   Full-sort Groups: 6871 Sort Method: quicksort Memory: avg=30kB peak=30kB
>             Presorted Groups: 6823 Sort Method: quicksort Memory: avg=132kB peak=141kB
>           Worker 1:  actual time=1.986..16189.831 rows=3845490 loops=1
>   Full-sort Groups: 6874 Sort Method: quicksort Memory: avg=30kB peak=30kB
>             Presorted Groups: 6847 Sort Method: quicksort Memory: avg=130kB peak=139kB
>           ->  Parallel Index Scan using t_a_b_idx on public.t  (cost=0.43..382365.92 rows=4166613 width=24) (actual
time=0.040..9808.449rows=3333333 loops=3)
 
>                 Output: a, b, c
>                 Worker 0:  actual time=0.048..11275.178 rows=3841573 loops=1
>                 Worker 1:  actual time=0.041..11314.133 rows=3845490 loops=1
>   Planning Time: 0.166 ms
>   Execution Time: 25135.029 ms
> (22 rows)
>
> There seems to be missing indentation for the first line of worker info.

Working on that too.

> I'm still not quite convinced we should be printing two lines - I know
> you mentioned the lines might be too long, but see how long the other
> lines may get ...

All right, I give in :)

Do you think non-workers (both the leader and non-parallel plans)
should also move to one line?

> 3) I see the new nodes (plan state, ...) have "presortedCols" which does
> not indicate it's a "number of". I think we usually prefix names of such
> fields "n" or "num". What about "nPresortedCols"? (Nitpicking, I know.)

I can fix this too.

Also I noticed a few compiler warnings I'll fix up in tomorrow's reply.

> My TODO for this patch is this:
>
> - review the costing (I think the estimates are OK, but I recall I
>    haven't been entirely happy with how it's broken into functions.)
>
> - review the tuplesort changes (the memory contexts etc.)
>
> - do more testing of performance impact on planning

Sounds good.

Thanks,
James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Sun, Mar 29, 2020 at 10:16:53PM -0400, James Coleman wrote:
>On Sun, Mar 29, 2020 at 9:44 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> Hi,
>>
>> Attached is a slightly reorganized patch series. I've merged the fixes
>> into the appropriate matches, and I've also combined the two patches
>> adding incremental sort paths to additional places in planner.
>>
>> A couple more comments:
>>
>>
>> 1) I think the GUC documentation in src/sgml/config.sgml is a bit too
>> detailed, compared to the other enable_* GUCs. I wonder if there's a
>> better place where to move the details. What about adding some examples
>> and explanation to perform.sgml?
>
>I'll take a look at that and include in a patch series tomorrow.
>
>> 2) Looking at the explain output, the verbose mode looks like this:
>>
>> test=# explain (verbose, analyze) select a from t order by a, b, c;
>>                                                                            QUERY PLAN
>>
--------------------------------------------------------------------------------------------------------------------------------------------------------------
>>   Gather Merge  (cost=66.31..816072.71 rows=8333226 width=24) (actual time=4.787..20092.555 rows=10000000 loops=1)
>>     Output: a, b, c
>>     Workers Planned: 2
>>     Workers Launched: 2
>>     ->  Incremental Sort  (cost=66.28..729200.36 rows=4166613 width=24) (actual time=1.308..14021.575 rows=3333333
loops=3)
>>           Output: a, b, c
>>           Sort Key: t.a, t.b, t.c
>>           Presorted Key: t.a, t.b
>>           Full-sort Groups: 4169 Sort Method: quicksort Memory: avg=30kB peak=30kB
>>           Presorted Groups: 4144 Sort Method: quicksort Memory: avg=128kB peak=138kB
>>           Worker 0:  actual time=0.766..16122.368 rows=3841573 loops=1
>>   Full-sort Groups: 6871 Sort Method: quicksort Memory: avg=30kB peak=30kB
>>             Presorted Groups: 6823 Sort Method: quicksort Memory: avg=132kB peak=141kB
>>           Worker 1:  actual time=1.986..16189.831 rows=3845490 loops=1
>>   Full-sort Groups: 6874 Sort Method: quicksort Memory: avg=30kB peak=30kB
>>             Presorted Groups: 6847 Sort Method: quicksort Memory: avg=130kB peak=139kB
>>           ->  Parallel Index Scan using t_a_b_idx on public.t  (cost=0.43..382365.92 rows=4166613 width=24) (actual
time=0.040..9808.449rows=3333333 loops=3)
 
>>                 Output: a, b, c
>>                 Worker 0:  actual time=0.048..11275.178 rows=3841573 loops=1
>>                 Worker 1:  actual time=0.041..11314.133 rows=3845490 loops=1
>>   Planning Time: 0.166 ms
>>   Execution Time: 25135.029 ms
>> (22 rows)
>>
>> There seems to be missing indentation for the first line of worker info.
>
>Working on that too.
>
>> I'm still not quite convinced we should be printing two lines - I know
>> you mentioned the lines might be too long, but see how long the other
>> lines may get ...
>
>All right, I give in :)
>
>Do you think non-workers (both the leader and non-parallel plans)
>should also move to one line?
>

I think we should use the same formatting for both cases, so yes.

FWIW I forgot to mention I tweaked the INSTRUMENT_SORT_GROUP macro a
bit, by moving the if condition in it. That makes the calls easier.
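
Something along these lines, i.e. the instrumentation guard lives inside
the macro so each call site is a single line (identifiers here are
illustrative, not the exact ones from the patch):

    #define INSTRUMENT_SORT_GROUP(node, groupInfo, sortState) \
        do { \
            if ((node)->ss.ps.instrument != NULL) \
                instrumentSortedGroup((groupInfo), (sortState)); \
        } while (0)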

>> 3) I see the new nodes (plan state, ...) have "presortedCols" which does
>> not indicate it's a "number of". I think we usually prefix names of such
>> fields "n" or "num". What about "nPresortedCols"? (Nitpicking, I know.)
>
>I can fix this too.
>
>Also I noticed a few compiler warnings I'll fixup in tomorrow's reply.
>

OK

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Mon, Mar 30, 2020 at 8:24 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Sun, Mar 29, 2020 at 10:16:53PM -0400, James Coleman wrote:
> >On Sun, Mar 29, 2020 at 9:44 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> Hi,
> >>
> >> Attached is a slightly reorganized patch series. I've merged the fixes
> >> into the appropriate matches, and I've also combined the two patches
> >> adding incremental sort paths to additional places in planner.
> >>
> >> A couple more comments:
> >>
> >>
> >> 1) I think the GUC documentation in src/sgml/config.sgml is a bit too
> >> detailed, compared to the other enable_* GUCs. I wonder if there's a
> >> better place where to move the details. What about adding some examples
> >> and explanation to perform.sgml?
> >
> >I'll take a look at that and include in a patch series tomorrow.

Attached.

> >> 2) Looking at the explain output, the verbose mode looks like this:
> >>
> >> test=# explain (verbose, analyze) select a from t order by a, b, c;
> >>                                                                            QUERY PLAN
> >>
--------------------------------------------------------------------------------------------------------------------------------------------------------------
> >>   Gather Merge  (cost=66.31..816072.71 rows=8333226 width=24) (actual time=4.787..20092.555 rows=10000000
loops=1)
> >>     Output: a, b, c
> >>     Workers Planned: 2
> >>     Workers Launched: 2
> >>     ->  Incremental Sort  (cost=66.28..729200.36 rows=4166613 width=24) (actual time=1.308..14021.575 rows=3333333
loops=3)
> >>           Output: a, b, c
> >>           Sort Key: t.a, t.b, t.c
> >>           Presorted Key: t.a, t.b
> >>           Full-sort Groups: 4169 Sort Method: quicksort Memory: avg=30kB peak=30kB
> >>           Presorted Groups: 4144 Sort Method: quicksort Memory: avg=128kB peak=138kB
> >>           Worker 0:  actual time=0.766..16122.368 rows=3841573 loops=1
> >>   Full-sort Groups: 6871 Sort Method: quicksort Memory: avg=30kB peak=30kB
> >>             Presorted Groups: 6823 Sort Method: quicksort Memory: avg=132kB peak=141kB
> >>           Worker 1:  actual time=1.986..16189.831 rows=3845490 loops=1
> >>   Full-sort Groups: 6874 Sort Method: quicksort Memory: avg=30kB peak=30kB
> >>             Presorted Groups: 6847 Sort Method: quicksort Memory: avg=130kB peak=139kB
> >>           ->  Parallel Index Scan using t_a_b_idx on public.t  (cost=0.43..382365.92 rows=4166613 width=24)
(actualtime=0.040..9808.449 rows=3333333 loops=3)
 
> >>                 Output: a, b, c
> >>                 Worker 0:  actual time=0.048..11275.178 rows=3841573 loops=1
> >>                 Worker 1:  actual time=0.041..11314.133 rows=3845490 loops=1
> >>   Planning Time: 0.166 ms
> >>   Execution Time: 25135.029 ms
> >> (22 rows)
> >>
> >> There seems to be missing indentation for the first line of worker info.
> >
> >Working on that too.

See attached. I've folded the original "explain fixes" patch into
the main series, and the "explain fixes" patch in this series contains
only the changes for the above.

> >> I'm still not quite convinced we should be printing two lines - I know
> >> you mentioned the lines might be too long, but see how long the other
> >> lines may get ...
> >
> >All right, I give in :)
> >
> >Do you think non-workers (both the leader and non-parallel plans)
> >should also move to one line?
> >
>
> I think we should use the same formatting for both cases, so yes.
>
> FWIW I forgot to mention I tweaked the INSTRUMENT_SORT_GROUP macro a
> bit, by moving the if condition in it. That makes the calls easier.

Ah, that actually fixed some of the compile warnings. The other is
fixed in my explain fixes patch.

> >> 3) I see the new nodes (plan state, ...) have "presortedCols" which does
> >> not indicate it's a "number of". I think we usually prefix names of such
> >> fields "n" or "num". What about "nPresortedCols"? (Nitpicking, I know.)
> >
> >I can fix this too.

Changed everywhere we used this var name. I'm tempted to change to
nPresortedKeys, but a cursory glance suggests some cases might
actually be consistent with other var names that reference columns, so I'm
not sure if we want to go down that path (and change more than just
this).

> >
> >Also I noticed a few compiler warnings I'll fixup in tomorrow's reply.
> >
>
> OK

Mentioned above.

Thanks,
James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Mon, Mar 30, 2020 at 06:53:47PM -0400, James Coleman wrote:
>On Mon, Mar 30, 2020 at 8:24 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Sun, Mar 29, 2020 at 10:16:53PM -0400, James Coleman wrote:
>> >On Sun, Mar 29, 2020 at 9:44 PM Tomas Vondra
>> ><tomas.vondra@2ndquadrant.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> Attached is a slightly reorganized patch series. I've merged the fixes
>> >> into the appropriate matches, and I've also combined the two patches
>> >> adding incremental sort paths to additional places in planner.
>> >>
>> >> A couple more comments:
>> >>
>> >>
>> >> 1) I think the GUC documentation in src/sgml/config.sgml is a bit too
>> >> detailed, compared to the other enable_* GUCs. I wonder if there's a
>> >> better place where to move the details. What about adding some examples
>> >> and explanation to perform.sgml?
>> >
>> >I'll take a look at that and include in a patch series tomorrow.
>
>Attached.
>
>> >> 2) Looking at the explain output, the verbose mode looks like this:
>> >>
>> >> test=# explain (verbose, analyze) select a from t order by a, b, c;
>> >>                                                                            QUERY PLAN
>> >>
--------------------------------------------------------------------------------------------------------------------------------------------------------------
>> >>   Gather Merge  (cost=66.31..816072.71 rows=8333226 width=24) (actual time=4.787..20092.555 rows=10000000
loops=1)
>> >>     Output: a, b, c
>> >>     Workers Planned: 2
>> >>     Workers Launched: 2
>> >>     ->  Incremental Sort  (cost=66.28..729200.36 rows=4166613 width=24) (actual time=1.308..14021.575
rows=3333333loops=3)
 
>> >>           Output: a, b, c
>> >>           Sort Key: t.a, t.b, t.c
>> >>           Presorted Key: t.a, t.b
>> >>           Full-sort Groups: 4169 Sort Method: quicksort Memory: avg=30kB peak=30kB
>> >>           Presorted Groups: 4144 Sort Method: quicksort Memory: avg=128kB peak=138kB
>> >>           Worker 0:  actual time=0.766..16122.368 rows=3841573 loops=1
>> >>   Full-sort Groups: 6871 Sort Method: quicksort Memory: avg=30kB peak=30kB
>> >>             Presorted Groups: 6823 Sort Method: quicksort Memory: avg=132kB peak=141kB
>> >>           Worker 1:  actual time=1.986..16189.831 rows=3845490 loops=1
>> >>   Full-sort Groups: 6874 Sort Method: quicksort Memory: avg=30kB peak=30kB
>> >>             Presorted Groups: 6847 Sort Method: quicksort Memory: avg=130kB peak=139kB
>> >>           ->  Parallel Index Scan using t_a_b_idx on public.t  (cost=0.43..382365.92 rows=4166613 width=24)
(actualtime=0.040..9808.449 rows=3333333 loops=3)
 
>> >>                 Output: a, b, c
>> >>                 Worker 0:  actual time=0.048..11275.178 rows=3841573 loops=1
>> >>                 Worker 1:  actual time=0.041..11314.133 rows=3845490 loops=1
>> >>   Planning Time: 0.166 ms
>> >>   Execution Time: 25135.029 ms
>> >> (22 rows)
>> >>
>> >> There seems to be missing indentation for the first line of worker info.
>> >
>> >Working on that too.
>
>See attached. I've folded in the original "explain fixes" patch into
>the main series, and the "explain fixes" patch in this series contains
>only the changes for the above.
>

Thanks. I'll take a look at those changes tomorrow.

>> >> I'm still not quite convinced we should be printing two lines - I know
>> >> you mentioned the lines might be too long, but see how long the other
>> >> lines may get ...
>> >
>> >All right, I give in :)
>> >
>> >Do you think non-workers (both the leader and non-parallel plans)
>> >should also move to one line?
>> >
>>
>> I think we should use the same formatting for both cases, so yes.
>>
>> FWIW I forgot to mention I tweaked the INSTRUMENT_SORT_GROUP macro a
>> bit, by moving the if condition in it. That makes the calls easier.
>
>Ah, that actually fixed some of the compile warnings. The other is
>fixed in my explain fixes patch.
>
>> >> 3) I see the new nodes (plan state, ...) have "presortedCols" which does
>> >> not indicate it's a "number of". I think we usually prefix names of such
>> >> fields "n" or "num". What about "nPresortedCols"? (Nitpicking, I know.)
>> >
>> >I can fix this too.
>
>Changed everywhere we used this var name. I'm tempted to change to
>nPresortedKeys, but a cursory glance suggests some cases might
>actually be consistent with other var names reference columns, so I'm
>not sure if we want to go down that path (and change more than just
>this).
>

Not sure. We use "sort keys" and "path keys" for this, but I think
"columns" is good enough.


The main thing I've been working on today is benchmarking how this
affects planning. And I'm seeing a regression that worries me a bit,
unfortunately.

The test I'm doing is pretty simple - build a small table with a bunch
of columns:

   create table t (a int, b int, c int, d int, e int, f int, g int);

   insert into t select 100*random(), 100*random(), 100*random(),
     100*random(), 100*random(), 100*random(), 100*random()
   from generate_series(1,100000) s(i);

and then a number of indexes on subsets of up to 3 columns, as generated
using the attached build-indexes.py script. And then run a bunch of
explains (so no actual execution) sorting the data by at least 4 columns
(to trigger incremental sort paths), measuring timing of the script.

I did a bunch of runs on current master and v46 with incremental sort
disabled and enabled, and the results look like this:

     master       off        on
     --------------------------
     34.609    37.463    37.729

which means about 8-9% regression with incremental sort. Of course, this
is only for planning time, for execution the impact is going to be much
smaller. But it's still a bit annoying.

I've suspected this might be either due to the add_partial_path changes
or the patch adding incremental sort to additional places, so I tested
those parts individually and the answer is no - add_partial_path changes
have very small impact (~1%, which might be noise). The regression comes
mostly from the 0002 part that adds incremental sort. At least in this
particular test - different tests might behave differently, of course.

The annoying bit is that the overhead does not disappear after disabling
incremental sort. That suggests this is not merely due to considering
and creating a higher number of paths, but due to something that happens
before we even look at the GUC ...

I think I've found one such place - if you look at compare_pathkeys, it
has this check right before the forboth() loop:

     if (keys1 == keys2)
         return PATHKEYS_EQUAL;

But with incremental sort we don't call pathkeys_contained_in, we call
pathkeys_common_contained_in instead. And that does not call
compare_pathkeys and does not have the simple equality check. Adding
the following check seems to cut the overhead in half, which is nice:

     if (keys1 == keys2)
     {
         *n_common = list_length(keys1);
         return true;
     }

Not sure where the rest of the regression comes from yet.

Also, while looking at pathkeys_common_contained_in(), I've been a bit
puzzled why this is correct:

     return (key1 == NULL);

It's easy to not notice it's key1 and not keys1, so I suggest we add a
comment saying that 'key1 == NULL' means we've processed the whole keys1 list.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Alvaro Herrera
Date:
On 2020-Mar-30, James Coleman wrote:

> +/* ----------------
> + *     Instruementation information for IncrementalSort
> + * ----------------
> + */
> +typedef struct IncrementalSortGroupInfo
> +{
> +    int64        groupCount;
> +    long        maxDiskSpaceUsed;
> +    long        totalDiskSpaceUsed;
> +    long        maxMemorySpaceUsed;
> +    long        totalMemorySpaceUsed;
> +    Size        sortMethods; /* bitmask of TuplesortMethod */
> +} IncrementalSortGroupInfo;

There's a typo "Instruementation" in the comment, but I'm more surprised
that type Size is being used to store a bitmask.  It looks weird to me.
Wouldn't it be more reasonable to use bits32 or some such?  (I first
noticed this in the "sizeof(Size)" code that appears in the explain
code.)


OTOH, aesthetically it would seem to be better to define these values
using ones and increasing shifts (1 << 1 and so on), rather than powers
of two:

> + * TuplesortMethod is used in a bitmask in Increment Sort's shared memory
> + * instrumentation so needs to have each value be a separate bit.
>   */
>  typedef enum
>  {
>      SORT_TYPE_STILL_IN_PROGRESS = 0,
> -    SORT_TYPE_TOP_N_HEAPSORT,
> -    SORT_TYPE_QUICKSORT,
> -    SORT_TYPE_EXTERNAL_SORT,
> -    SORT_TYPE_EXTERNAL_MERGE
> +    SORT_TYPE_TOP_N_HEAPSORT = 2,
> +    SORT_TYPE_QUICKSORT = 4,
> +    SORT_TYPE_EXTERNAL_SORT = 8,
> +    SORT_TYPE_EXTERNAL_MERGE = 16
>  } TuplesortMethod;

I don't quite understand why you skipped "1".  (Also, is the use of zero
a wise choice?)
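
Concretely, the style I mean is something like this (just a sketch;
whether keeping the zero value is wise is exactly the question above):

    typedef enum
    {
        SORT_TYPE_STILL_IN_PROGRESS = 0,    /* "no method yet" - see above */
        SORT_TYPE_TOP_N_HEAPSORT = 1 << 0,
        SORT_TYPE_QUICKSORT = 1 << 1,
        SORT_TYPE_EXTERNAL_SORT = 1 << 2,
        SORT_TYPE_EXTERNAL_MERGE = 1 << 3
    } TuplesortMethod;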

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Mar 31, 2020 at 12:31 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
>
> On 2020-Mar-30, James Coleman wrote:
>
> > +/* ----------------
> > + *    Instruementation information for IncrementalSort
> > + * ----------------
> > + */
> > +typedef struct IncrementalSortGroupInfo
> > +{
> > +     int64           groupCount;
> > +     long            maxDiskSpaceUsed;
> > +     long            totalDiskSpaceUsed;
> > +     long            maxMemorySpaceUsed;
> > +     long            totalMemorySpaceUsed;
> > +     Size            sortMethods; /* bitmask of TuplesortMethod */
> > +} IncrementalSortGroupInfo;
>
> There's a typo "Instruementation" in the comment, but I'm more surprised
> that type Size is being used to store a bitmask.  It looks weird to me.
> Wouldn't it be more reasonable to use bits32 or some such?  (I first
> noticed this in the "sizeof(Size)" code that appears in the explain
> code.)

I just didn't know about bits32; I'll change.

> OTOH, aesthetically it would seem to be better to define these values
> using ones and increasing shifts (1 << 1 and so on), rather than powers
> of two:
>
> > + * TuplesortMethod is used in a bitmask in Increment Sort's shared memory
> > + * instrumentation so needs to have each value be a separate bit.
> >   */
> >  typedef enum
> >  {
> >       SORT_TYPE_STILL_IN_PROGRESS = 0,
> > -     SORT_TYPE_TOP_N_HEAPSORT,
> > -     SORT_TYPE_QUICKSORT,
> > -     SORT_TYPE_EXTERNAL_SORT,
> > -     SORT_TYPE_EXTERNAL_MERGE
> > +     SORT_TYPE_TOP_N_HEAPSORT = 2,
> > +     SORT_TYPE_QUICKSORT = 4,
> > +     SORT_TYPE_EXTERNAL_SORT = 8,
> > +     SORT_TYPE_EXTERNAL_MERGE = 16
> >  } TuplesortMethod;
>
> I don't quite understand why you skipped "1".  (Also, is the use of zero
> a wise choice?)

The assignment of 0 was already there, and there wasn't a comment to
indicate why. That ends up meaning we wouldn't display "still in
progress" as a type here, which is maybe desirable, but I'm honestly
not sure why it was that way originally. I'm curious if you have any
thoughts on it.

I knew some projects used increasing shifts, but I wasn't sure what
the preference was here. I'll correct.

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tom Lane
Date:
James Coleman <jtc331@gmail.com> writes:
> + * TuplesortMethod is used in a bitmask in Increment Sort's shared memory
> + * instrumentation so needs to have each value be a separate bit.

>> I don't quite understand why you skipped "1".  (Also, is the use of zero
>> a wise choice?)

> The assignment of 0 was already there, and there wasn't a comment to
> indicate why. That ends up meaning we wouldn't display "still in
> progress" as a type here, which is maybe desirable, but I'm honestly
> not sure why it was that way originally. I'm curious if you have any
> thoughts on it.

As things stood, the "= 0" was a no-op, since the first enum value
would've been that anyway.  But if you're converting this set of symbols
to bits that can be OR'd together, it seems pretty strange to use zero,
because that can't be distinguished from "absence of any entry".

Perhaps the semantics are such that that's actually sensible, but it's
far from a straightforward remapping of the old enum.

            regards, tom lane



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Mar 31, 2020 at 1:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> James Coleman <jtc331@gmail.com> writes:
> > + * TuplesortMethod is used in a bitmask in Increment Sort's shared memory
> > + * instrumentation so needs to have each value be a separate bit.
>
> >> I don't quite understand why you skipped "1".  (Also, is the use of zero
> >> a wise choice?)
>
> > The assignment of 0 was already there, and there wasn't a comment to
> > indicate why. That ends up meaning we wouldn't display "still in
> > progress" as a type here, which is maybe desirable, but I'm honestly
> > not sure why it was that way originally. I'm curious if you have any
> > thoughts on it.
>
> As things stood, the "= 0" was a no-op, since the first enum value
> would've been that anyway.  But if you're converting this set of symbols
> to bits that can be OR'd together, it seems pretty strange to use zero,
> because that can't be distinguished from "absence of any entry".
>
> Perhaps the semantics are such that that's actually sensible, but it's
> far from a straightforward remapping of the old enum.

Right, I didn't see the explicit "= 0" in other enums there, so it
made me wonder if it was intentional to designate that one had to be
0, but I guess without a comment that's a lot of inference.

The semantics seemed somewhat useful here in theory, but since I'm not
hearing a "yeah, that was intentional but not commented", I'm just
going to change it to what you'd naturally expect.

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tom Lane
Date:
James Coleman <jtc331@gmail.com> writes:
> On Tue, Mar 31, 2020 at 1:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Perhaps the semantics are such that that's actually sensible, but it's
>> far from a straightforward remapping of the old enum.

> Right, I didn't see the explicit "= 0" in other enums there, so it
> made me wonder if it was intentional to designate that one had to be
> 0, but I guess without a comment that's a lot of inference.

It's possible that somebody meant that as an indicator that the code
depends on palloc0() leaving the field with that value.  But if so,
you'd soon find that out ... and an actual comment would be better,
anyway.

            regards, tom lane



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Mon, Mar 30, 2020 at 9:14 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> The main thing I've been working on today is benchmarking how this
> affects planning. And I'm seeing a regression that worries me a bit,
> unfortunately.
>
> The test I'm doing is pretty simple - build a small table with a bunch
> of columns:
>
>    create table t (a int, b int, c int, d int, e int, f int, g int);
>
>    insert into t select 100*random(), 100*random(), 100*random(),
>      100*random(), 100*random(), 100*random(), 100*random()
>    from generate_series(1,100000) s(i);
>
> and then a number of indexes on subsets of up to 3 columns, as generated
> using the attached build-indexes.py script. And then run a bunch of
> explains (so no actual execution) sorting the data by at least 4 columns
> (to trigger incremental sort paths), measuring timing of the script.
>
> I did a bunch of runs on current master and v46 with incremental sort
> disabled and enabled, and the results look like this:
>
>      master       off        on
>      --------------------------
>      34.609    37.463    37.729
>
> which means about 8-9% regression with incremental sort. Of course, this
> is only for planning time, for execution the impact is going to be much
> smaller. But it's still a bit annoying.
>
> I've suspected this might be either due to the add_partial_path changes
> or the patch adding incremental sort to additional places, so I tested
> those parts individually and the answer is no - add_partial_path changes
> have very small impact (~1%, which might be noise). The regression comes
> mostly from the 0002 part that adds incremental sort. At least in this
> particular test - different tests might behave differently, of course.
>
> The annoying bit is that the overhead does not disappear after disabling
> incremental sort. That suggests this is not merely due to considering
> and creating higher number of paths, but due to something that happens
> before we even look at the GUC ...
>
> I think I've found one such place - if you look at compare_pathkeys, it
> has this check right before the forboth() loop:
>
>      if (keys1 == keys2)
>          return PATHKEYS_EQUAL;
>
> But with incremental sort we don't call pathkeys_contained_in, we call
> pathkeys_common_contained_in instead. And that does not call
> compare_pathkeys and does not have the simple equality check. Adding
> the following check seems to cut the overhead in half, which is nice:
>
>      if (keys1 == keys2)
>      {
>          *n_common = list_length(keys1);
>         return true;
>      }
>
> Not sure where the rest of the regression comes from yet.

I noticed in the other function we also optimize by checking if either
keys list is NIL, so I tried adding that, and it might have made a
minor difference, but it's hard to tell as it was under 1%.
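
To be concrete, the combined early exits I'm testing look roughly like
this (a sketch of the idea, not the exact code in the attached patches):

    /* Same pointer means identical pathkey lists: no need to walk them. */
    if (keys1 == keys2)
    {
        *n_common = list_length(keys1);
        return true;
    }

    /*
     * If either list is NIL there is no common prefix; an empty keys1 is
     * trivially contained in keys2, while a non-empty keys1 cannot be
     * contained in an empty keys2.
     */
    if (keys1 == NIL || keys2 == NIL)
    {
        *n_common = 0;
        return (keys1 == NIL);
    }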

I also ran perf with a slightly modified version of your test that
uses psql, and after the above changes I was seeing something like a
3.5% delta between master and this patch series. Nothing obvious in
the perf report, though.

This test is intended to be somewhat worst case, no? At what point do
we consider the trade-off worth it (given that it's not plausible to
have zero impact)?

> Also, while looking at pathkeys_common_contained_in(), I've been a bit
> puzzled why does is this correct:
>
>      return (key1 == NULL);
>
> It's easy to not notice it's key1 and not keys1, so I suggest we add a
> comment 'key1 == NULL' means we've processed whole keys1 list.

Done.

I've included fixes for Alvaro's comments in this patch series also.

James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Tue, Mar 31, 2020 at 02:23:15PM -0400, James Coleman wrote:
>On Mon, Mar 30, 2020 at 9:14 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> The main thing I've been working on today is benchmarking how this
>> affects planning. And I'm seeing a regression that worries me a bit,
>> unfortunately.
>>
>> The test I'm doing is pretty simple - build a small table with a bunch
>> of columns:
>>
>>    create table t (a int, b int, c int, d int, e int, f int, g int);
>>
>>    insert into t select 100*random(), 100*random(), 100*random(),
>>      100*random(), 100*random(), 100*random(), 100*random()
>>    from generate_series(1,100000) s(i);
>>
>> and then a number of indexes on subsets of up to 3 columns, as generated
>> using the attached build-indexes.py script. And then run a bunch of
>> explains (so no actual execution) sorting the data by at least 4 columns
>> (to trigger incremental sort paths), measuring timing of the script.
>>
>> I did a bunch of runs on current master and v46 with incremental sort
>> disabled and enabled, and the results look like this:
>>
>>      master       off        on
>>      --------------------------
>>      34.609    37.463    37.729
>>
>> which means about 8-9% regression with incremental sort. Of course, this
>> is only for planning time, for execution the impact is going to be much
>> smaller. But it's still a bit annoying.
>>
>> I've suspected this might be either due to the add_partial_path changes
>> or the patch adding incremental sort to additional places, so I tested
>> those parts individually and the answer is no - add_partial_path changes
>> have very small impact (~1%, which might be noise). The regression comes
>> mostly from the 0002 part that adds incremental sort. At least in this
>> particular test - different tests might behave differently, of course.
>>
>> The annoying bit is that the overhead does not disappear after disabling
>> incremental sort. That suggests this is not merely due to considering
>> and creating higher number of paths, but due to something that happens
>> before we even look at the GUC ...
>>
>> I think I've found one such place - if you look at compare_pathkeys, it
>> has this check right before the forboth() loop:
>>
>>      if (keys1 == keys2)
>>          return PATHKEYS_EQUAL;
>>
>> But with incremental sort we don't call pathkeys_contained_in, we call
>> pathkeys_common_contained_in instead. And that does not call
>> compare_pathkeys and does not have the simple equality check. Adding
>> the following check seems to cut the overhead in half, which is nice:
>>
>>      if (keys1 == keys2)
>>      {
>>          *n_common = list_length(keys1);
>>         return true;
>>      }
>>
>> Not sure where the rest of the regression comes from yet.
>
>I noticed in the other function we also optimize by checking if either
>keys list is NIL, so I tried adding that, and it might have made a
>minor difference, but it's hard to tell as it was under 1%.
>

Which other function? I don't see this optimization in compare_pathkeys,
which only checks key1/key2 after the forboth loop, but that's something
different.

I may be wrong, but my guess would be that this won't save much, because
the loop should terminate immediately. The whole point is not to loop
over possibly many pathkeys when we know it's exactly the same pathkeys
list (same pointer). When one of the lists is NIL the loop should
terminate right away ...

>I also ran perf with a slightly modified version of your test that
>uses psql, and after the above changes was seeing something like a
>3.5% delta between master and this patch series. Nothing obvious in
>the perf report though.
>

Yeah, I don't think there's anything obviously more expensive.

>This test is intended to be somewhat worst case, no? At what point do
>we consider the trade-off worth it (given that it's not plausible to
>have zero impact)?
>

Yes, more or less. It was designed to do exactly that - it merely
plans the query (no execution), with many applicable indexes etc. It's
definitely true that once the execution starts to take more time, the
overhead becomes negligible. The same goes for reducing the number of
indexes.

And of course, for queries that can benefit from incremental sort, the
speedup is likely way larger than this.

In general, I think it'd be naive to assume we can make the planner
smarter with no extra overhead spent on planning, and that we can never
accept patches adding even tiny overhead. With that approach we'd
probably end up with a trivial planner that generates just a single
query plan, because that's going to be the fastest planner. A realistic
approach needs to consider both the planning and execution phases, and
the benefits of this patch seem clear - if you have queries that do
benefit from it.

Let's try to minimize the overhead a bit more, and then we'll see.


>> Also, while looking at pathkeys_common_contained_in(), I've been a bit
>> puzzled why does is this correct:
>>
>>      return (key1 == NULL);
>>
>> It's easy to not notice it's key1 and not keys1, so I suggest we add a
>> comment 'key1 == NULL' means we've processed whole keys1 list.
>
>Done.
>
>I've included fixes for Alvaro's comments in this patch series also.
>

OK, thanks.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Mar 31, 2020 at 5:53 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Mar 31, 2020 at 02:23:15PM -0400, James Coleman wrote:
> >On Mon, Mar 30, 2020 at 9:14 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> The main thing I've been working on today is benchmarking how this
> >> affects planning. And I'm seeing a regression that worries me a bit,
> >> unfortunately.
> >>
> >> The test I'm doing is pretty simple - build a small table with a bunch
> >> of columns:
> >>
> >>    create table t (a int, b int, c int, d int, e int, f int, g int);
> >>
> >>    insert into t select 100*random(), 100*random(), 100*random(),
> >>      100*random(), 100*random(), 100*random(), 100*random()
> >>    from generate_series(1,100000) s(i);
> >>
> >> and then a number of indexes on subsets of up to 3 columns, as generated
> >> using the attached build-indexes.py script. And then run a bunch of
> >> explains (so no actual execution) sorting the data by at least 4 columns
> >> (to trigger incremental sort paths), measuring timing of the script.
> >>
> >> I did a bunch of runs on current master and v46 with incremental sort
> >> disabled and enabled, and the results look like this:
> >>
> >>      master       off        on
> >>      --------------------------
> >>      34.609    37.463    37.729
> >>
> >> which means about 8-9% regression with incremental sort. Of course, this
> >> is only for planning time, for execution the impact is going to be much
> >> smaller. But it's still a bit annoying.
> >>
> >> I've suspected this might be either due to the add_partial_path changes
> >> or the patch adding incremental sort to additional places, so I tested
> >> those parts individually and the answer is no - add_partial_path changes
> >> have very small impact (~1%, which might be noise). The regression comes
> >> mostly from the 0002 part that adds incremental sort. At least in this
> >> particular test - different tests might behave differently, of course.
> >>
> >> The annoying bit is that the overhead does not disappear after disabling
> >> incremental sort. That suggests this is not merely due to considering
> >> and creating higher number of paths, but due to something that happens
> >> before we even look at the GUC ...
> >>
> >> I think I've found one such place - if you look at compare_pathkeys, it
> >> has this check right before the forboth() loop:
> >>
> >>      if (keys1 == keys2)
> >>          return PATHKEYS_EQUAL;
> >>
> >> But with incremental sort we don't call pathkeys_contained_in, we call
> >> pathkeys_common_contained_in instead. And that does not call
> >> compare_pathkeys and does not have the simple equality check. Adding
> >> the following check seems to cut the overhead in half, which is nice:
> >>
> >>      if (keys1 == keys2)
> >>      {
> >>          *n_common = list_length(keys1);
> >>         return true;
> >>      }
> >>
> >> Not sure where the rest of the regression comes from yet.
> >
> >I noticed in the other function we also optimize by checking if either
> >keys list is NIL, so I tried adding that, and it might have made a
> >minor difference, but it's hard to tell as it was under 1%.
> >
>
> Which other function? I don't see this optimization in compare_pathkeys,
> that only checks key1/key2 after the forboth loop, but that's something
> different.

pathkeys_useful_for_ordering checks both inputs.

> I may be wrong, but my guess would be that this won't save much, because
> the loop should terminate immediately. The whole point is not to loop
> over possibly many pathkeys when we know it's exactly the same pathkeys
> list (same pointer). When one of the lists is NIL the loop should
> terminate right away ...
>
> >I also ran perf with a slightly modified version of your test that
> >uses psql, and after the above changes was seeing something like a
> >3.5% delta between master and this patch series. Nothing obvious in
> >the perf report though.
> >
>
> Yeah, I don't think there's anything obviously more expensive.
>
> >This test is intended to be somewhat worst case, no? At what point do
> >we consider the trade-off worth it (given that it's not plausible to
> >have zero impact)?
> >
>
> Yes, more or less. It was definitely designed to do that - it merely
> plans the query (no execution), with many applicable indexes etc. It's
> definitely true that once the exection starts to take more time, the
> overhead will get negligible. Same for reducing number of indexes.
>
> And of course, for queries that can benefit from incremental sort, the
> speedup is likely way larger than this.
>
> In general, I think it'd be naive that we can make planner smarter with
> no extra overhead spent on planning, and we can never accept patches
> adding even tiny overhead. With that approach we'd probably end up with
> a trivial planner that generates just a single query plan, because
> that's going to be the fastest planner. A realistic approach needs to
> consider both the planning and execution phase, and benefits of this
> patch seem to be clear - if you have queries that do benefit from it.
>
> Let's try to minimize the overhead a bit more, and then we'll see.

Do you have any thoughts already on an approach for this?

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> In general, I think it'd be naive that we can make planner smarter with
> no extra overhead spent on planning, and we can never accept patches
> adding even tiny overhead. With that approach we'd probably end up with
> a trivial planner that generates just a single query plan, because
> that's going to be the fastest planner. A realistic approach needs to
> consider both the planning and execution phase, and benefits of this
> patch seem to be clear - if you have queries that do benefit from it.

I think that's kind of attacking a straw man, though.  The thing that
people push back on, or should push back on IMO, is when a proposed
patch adds significant slowdown to queries that it has no or very little
hope of improving.  The trick is to do expensive stuff only when
there's a good chance of getting a better plan out of it.

            regards, tom lane



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Tue, Mar 31, 2020 at 06:00:00PM -0400, James Coleman wrote:
>On Tue, Mar 31, 2020 at 5:53 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Tue, Mar 31, 2020 at 02:23:15PM -0400, James Coleman wrote:
>> >On Mon, Mar 30, 2020 at 9:14 PM Tomas Vondra
>> ><tomas.vondra@2ndquadrant.com> wrote:
>> >>
>> >> The main thing I've been working on today is benchmarking how this
>> >> affects planning. And I'm seeing a regression that worries me a bit,
>> >> unfortunately.
>> >>
>> >> The test I'm doing is pretty simple - build a small table with a bunch
>> >> of columns:
>> >>
>> >>    create table t (a int, b int, c int, d int, e int, f int, g int);
>> >>
>> >>    insert into t select 100*random(), 100*random(), 100*random(),
>> >>      100*random(), 100*random(), 100*random(), 100*random()
>> >>    from generate_series(1,100000) s(i);
>> >>
>> >> and then a number of indexes on subsets of up to 3 columns, as generated
>> >> using the attached build-indexes.py script. And then run a bunch of
>> >> explains (so no actual execution) sorting the data by at least 4 columns
>> >> (to trigger incremental sort paths), measuring timing of the script.
>> >>
>> >> I did a bunch of runs on current master and v46 with incremental sort
>> >> disabled and enabled, and the results look like this:
>> >>
>> >>      master       off        on
>> >>      --------------------------
>> >>      34.609    37.463    37.729
>> >>
>> >> which means about 8-9% regression with incremental sort. Of course, this
>> >> is only for planning time, for execution the impact is going to be much
>> >> smaller. But it's still a bit annoying.
>> >>
>> >> I've suspected this might be either due to the add_partial_path changes
>> >> or the patch adding incremental sort to additional places, so I tested
>> >> those parts individually and the answer is no - add_partial_path changes
>> >> have very small impact (~1%, which might be noise). The regression comes
>> >> mostly from the 0002 part that adds incremental sort. At least in this
>> >> particular test - different tests might behave differently, of course.
>> >>
>> >> The annoying bit is that the overhead does not disappear after disabling
>> >> incremental sort. That suggests this is not merely due to considering
>> >> and creating higher number of paths, but due to something that happens
>> >> before we even look at the GUC ...
>> >>
>> >> I think I've found one such place - if you look at compare_pathkeys, it
>> >> has this check right before the forboth() loop:
>> >>
>> >>      if (keys1 == keys2)
>> >>          return PATHKEYS_EQUAL;
>> >>
>> >> But with incremental sort we don't call pathkeys_contained_in, we call
>> >> pathkeys_common_contained_in instead. And that does not call
>> >> compare_pathkeys and does not have the simple equality check. Adding
>> >> the following check seems to cut the overhead in half, which is nice:
>> >>
>> >>      if (keys1 == keys2)
>> >>      {
>> >>          *n_common = list_length(keys1);
>> >>         return true;
>> >>      }
>> >>
>> >> Not sure where the rest of the regression comes from yet.
>> >
>> >I noticed in the other function we also optimize by checking if either
>> >keys list is NIL, so I tried adding that, and it might have made a
>> >minor difference, but it's hard to tell as it was under 1%.
>> >
>>
>> Which other function? I don't see this optimization in compare_pathkeys,
>> that only checks key1/key2 after the forboth loop, but that's something
>> different.
>
>pathkeys_useful_for_ordering checks both inputs.
>
>> I may be wrong, but my guess would be that this won't save much, because
>> the loop should terminate immediately. The whole point is not to loop
>> over possibly many pathkeys when we know it's exactly the same pathkeys
>> list (same pointer). When one of the lists is NIL the loop should
>> terminate right away ...
>>
>> >I also ran perf with a slightly modified version of your test that
>> >uses psql, and after the above changes was seeing something like a
>> >3.5% delta between master and this patch series. Nothing obvious in
>> >the perf report though.
>> >
>>
>> Yeah, I don't think there's anything obviously more expensive.
>>
>> >This test is intended to be somewhat worst case, no? At what point do
>> >we consider the trade-off worth it (given that it's not plausible to
>> >have zero impact)?
>> >
>>
>> Yes, more or less. It was definitely designed to do that - it merely
>> plans the query (no execution), with many applicable indexes etc. It's
>> definitely true that once the exection starts to take more time, the
>> overhead will get negligible. Same for reducing number of indexes.
>>
>> And of course, for queries that can benefit from incremental sort, the
>> speedup is likely way larger than this.
>>
>> In general, I think it'd be naive that we can make planner smarter with
>> no extra overhead spent on planning, and we can never accept patches
>> adding even tiny overhead. With that approach we'd probably end up with
>> a trivial planner that generates just a single query plan, because
>> that's going to be the fastest planner. A realistic approach needs to
>> consider both the planning and execution phase, and benefits of this
>> patch seem to be clear - if you have queries that do benefit from it.
>>
>> Let's try to minimize the overhead a bit more, and then we'll see.
>
>Any thoughts you have already on an approach for this?
>

That very much depends on what may be causing the problem. I have two
hypotheses, at the moment.

Based on the profiles I've seen so far, there does not seem to be any
function that suddenly got slower. That probably implies we're simply
generating more paths than before, which means more allocations, more
add_path calls etc. It's not clear to me why this would happen even
with enable_incrementalsort=off, though.

Or maybe some of the structs got larger and need more cachelines? That
would affect performance even with the GUC set to off. But the perf stat
data also don't show anything particularly revealing. I'm using this:

   perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,\
                dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses,\
                LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches,\
                cycles,instructions,cache-references,cache-misses,\
                bus-cycles,raw_syscalls:sys_enter -p $PID

An example output (for master and patched branch) is attached, but I
don't see anything obviously worse (there is some variance, of course).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> In general, I think it'd be naive that we can make planner smarter with
>> no extra overhead spent on planning, and we can never accept patches
>> adding even tiny overhead. With that approach we'd probably end up with
>> a trivial planner that generates just a single query plan, because
>> that's going to be the fastest planner. A realistic approach needs to
>> consider both the planning and execution phase, and benefits of this
>> patch seem to be clear - if you have queries that do benefit from it.
>
>I think that's kind of attacking a straw man, though.  The thing that
>people push back on, or should push back on IMO, is when a proposed
>patch adds significant slowdown to queries that it has no or very little
>hope of improving.  The trick is to do expensive stuff only when
>there's a good chance of getting a better plan out of it.
>

Yeah, I agree with that. I think the main issue is that we don't really
know what the "expensive stuff" is in this case, so it's not really
clear how to be smarter :-(

One possibility is that it's just one of those regressions due to a
change in binary layout, but I'm not sure how to verify that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:
> >Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> >> In general, I think it'd be naive that we can make planner smarter with
> >> no extra overhead spent on planning, and we can never accept patches
> >> adding even tiny overhead. With that approach we'd probably end up with
> >> a trivial planner that generates just a single query plan, because
> >> that's going to be the fastest planner. A realistic approach needs to
> >> consider both the planning and execution phase, and benefits of this
> >> patch seem to be clear - if you have queries that do benefit from it.
> >
> >I think that's kind of attacking a straw man, though.  The thing that
> >people push back on, or should push back on IMO, is when a proposed
> >patch adds significant slowdown to queries that it has no or very little
> >hope of improving.  The trick is to do expensive stuff only when
> >there's a good chance of getting a better plan out of it.
> >
>
> Yeah, I agree with that. I think the main issue is that we don't really
> know what the "expensive stuff" is in this case, so it's not really
> clear how to be smarter :-(

To add to this: I agree that ideally you'd check cheaply to know
you're in a situation that might help, and then do more work. But here
the question is always going to be simply "would we benefit from an
ordering, and, if so, do we have it already partially sorted". It's
hard to imagine that reducing much conceptually, so we're left with
optimizations of that check.

> One possibility is that it's just one of those regressions due to change
> in binary layout, but I'm not sure know how to verify that.

If we are testing with a case that can't actually add more paths
(because the GUC is checked before building them), doesn't that
effectively leave one of these two options:
1. Binary layout/cache/other untraceable change, or
2. Changes due to refactored function calls.

There's not anything obvious in point (2) that would be a big cost,
but there are definitely changes there. I was surprised that just
eliminating the loop through the pathkeys on the query and the index
was enough to save us ~4%.

Tomas: Earlier you'd wondered whether we should try to shortcut the
changes in costing...I was skeptical of that originally, but maybe
it's worth looking into? I'm going to try backing that out and see
what the numbers look like.

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:
>On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:
>> >Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> >> In general, I think it'd be naive that we can make planner smarter with
>> >> no extra overhead spent on planning, and we can never accept patches
>> >> adding even tiny overhead. With that approach we'd probably end up with
>> >> a trivial planner that generates just a single query plan, because
>> >> that's going to be the fastest planner. A realistic approach needs to
>> >> consider both the planning and execution phase, and benefits of this
>> >> patch seem to be clear - if you have queries that do benefit from it.
>> >
>> >I think that's kind of attacking a straw man, though.  The thing that
>> >people push back on, or should push back on IMO, is when a proposed
>> >patch adds significant slowdown to queries that it has no or very little
>> >hope of improving.  The trick is to do expensive stuff only when
>> >there's a good chance of getting a better plan out of it.
>> >
>>
>> Yeah, I agree with that. I think the main issue is that we don't really
>> know what the "expensive stuff" is in this case, so it's not really
>> clear how to be smarter :-(
>
>To add to this: I agree that ideally you'd check cheaply to know
>you're in a situation that might help, and then do more work. But here
>the question is always going to be simply "would we benefit from an
>ordering, and, if so, do we have it already partially sorted". It's
>hard to imagine that reducing much conceptually, so we're left with
>optimizations of that check.
>

I think it depends on what exactly the expensive part is. For example,
if it's the construction of IncrementalSort paths, then maybe we could
try doing a quick/cheap check of whether the path can even be useful by
estimating the cost, and only then build the path.

That's what we do for join algorithms, for example - we first compute
initial_cost_nestloop and only when that seems cheap enough do we do
the more expensive stuff.
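
Something like the following shape is what I have in mind - purely a
sketch: initial_cost_incremental_sort() is hypothetical, "workspace"
stands in for a small struct of estimated costs (like JoinCostWorkspace
for joins), and the surrounding variables are abbreviated:

    /* cheap, approximate costing first (hypothetical function) */
    initial_cost_incremental_sort(root, &workspace, subpath,
                                  useful_pathkeys, presorted_keys);

    /* build the full path only if the rough costs could possibly win */
    if (add_path_precheck(rel, workspace.startup_cost,
                          workspace.total_cost, useful_pathkeys, NULL))
        add_path(rel, (Path *)
                 create_incremental_sort_path(root, rel, subpath,
                                              useful_pathkeys,
                                              presorted_keys, -1));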

But I'm not sure the path construction is the expensive part, as that
should be disabled by enable_incrementalsort=off. And yet the regression
does not seem to disappear, at least not entirely.


>> One possibility is that it's just one of those regressions due to change
>> in binary layout, but I'm not sure know how to verify that.
>
>If we are testing with a case that can't actually add more paths (due
>to it checking the guc before building them), doesn't that effectively
>leave one of these two options:
>
>1. Binary layout/cache/other untraceable change, or
>2. Changes due to refactored function calls.
>

Hmm, but in case of (1) the overhead should be there even with tests
that don't really have any additional paths to consider, right? I've
tried with such a test (single table with no indexes) and I don't quite
see any regression (maybe ~1%).

(2) might have an impact, but I don't see any immediate suspects. Did you
have some functions in mind?

BTW I see the patch adds pathkeys_common, but it's not called from
anywhere. It's probably a leftover from an earlier patch version.

>There's not anything obvious in point (2) that would be a big cost,
>but there are definitely changes there. I was surprised that just
>eliminating the loop through the pathkeys on the query and the index
>was enough to save us ~4%.
>
>Tomas: Earlier you'd wondered about if we should try to shortcut the
>changes in costing...I was skeptical of that originally, but maybe
>it's worth looking into? I'm going to try backing that out and see
>what the numbers look like.
>

I've described the idea of doing something like initial_cost_nestloop
above. But I'm a bit skeptical about it, considering that the GUC only
has a limited effect.


A somewhat related note is that the number of indexes has a pretty
significant impact on planning time, even on master. This is the timing
of the same explain script (similar to the one shown before) with
different numbers of indexes on master:

     0 indexes        7 indexes       49 indexes
     -------------------------------------------
         10.85            12.56            27.83

The way I look at incremental sort is that it allows using existing
indexes for queries that couldn't use them before and would otherwise
need a separate index. So incremental sort might easily reduce the
number of indexes needed, compensating for the overhead we're
discussing here. Of course, that may or may not be true.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:
> >On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:
> >> >Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> >> >> In general, I think it'd be naive that we can make planner smarter with
> >> >> no extra overhead spent on planning, and we can never accept patches
> >> >> adding even tiny overhead. With that approach we'd probably end up with
> >> >> a trivial planner that generates just a single query plan, because
> >> >> that's going to be the fastest planner. A realistic approach needs to
> >> >> consider both the planning and execution phase, and benefits of this
> >> >> patch seem to be clear - if you have queries that do benefit from it.
> >> >
> >> >I think that's kind of attacking a straw man, though.  The thing that
> >> >people push back on, or should push back on IMO, is when a proposed
> >> >patch adds significant slowdown to queries that it has no or very little
> >> >hope of improving.  The trick is to do expensive stuff only when
> >> >there's a good chance of getting a better plan out of it.
> >> >
> >>
> >> Yeah, I agree with that. I think the main issue is that we don't really
> >> know what the "expensive stuff" is in this case, so it's not really
> >> clear how to be smarter :-(
> >
> >To add to this: I agree that ideally you'd check cheaply to know
> >you're in a situation that might help, and then do more work. But here
> >the question is always going to be simply "would we benefit from an
> >ordering, and, if so, do we have it already partially sorted". It's
> >hard to imagine that reducing much conceptually, so we're left with
> >optimizations of that check.
> >
>
> I think it depends on what exactly is the expensive part. For example if
> it's the construction of IncrementalSort paths, then maybe we could try
> do a quick/check check if the path can even be useful by estimating the
> cost and only then building the path.
>
> That's what we do for join algorithms, for example - we first compute
> initial_cost_nestloop and only when that seems cheap enough we do the
> more expensive stuff.
>
> But I'm not sure the path construction is the expensive part, as it
> should be disabled by enable_incrementalsort=off. But the regression
> does not seem to disappear, at least not entirely.
>
>
> >> One possibility is that it's just one of those regressions due to change
> >> in binary layout, but I'm not sure know how to verify that.
> >
> >If we are testing with a case that can't actually add more paths (due
> >to it checking the guc before building them), doesn't that effectively
> >leave one of these two options:
> >
> >1. Binary layout/cache/other untraceable change, or
> >2. Changes due to refactored function calls.
> >
>
> Hmm, but in case of (1) the overhead should be there even with tests
> that don't really have any additional paths to consider, right? I've
> tried with such test (single table with no indexes) and I don't quite
> see any regression (maybe ~1%).

Not necessarily, if the cost is in sort costing or useful pathkeys
checking, right? We run that code even without incremental sort, but
it has changed from master.

> (2) might have impact, but I don't see any immediate supects. Did you
> have some functions in mind?

I guess this is where the lines blur: I didn't see anything obvious
either, but the changes to sort costing...should probably not have
real impact...but...

> BTW I see the patch adds pathkeys_common but it's not called from
> anywhere. It's probably leftover from an earlier patch version.
>
> >There's not anything obvious in point (2) that would be a big cost,
> >but there are definitely changes there. I was surprised that just
> >eliminating the loop through the pathkeys on the query and the index
> >was enough to save us ~4%.
> >
> >Tomas: Earlier you'd wondered about if we should try to shortcut the
> >changes in costing...I was skeptical of that originally, but maybe
> >it's worth looking into? I'm going to try backing that out and see
> >what the numbers look like.
> >

BTW, I did this test, and it looks like we can get back something
close to 1% by reverting that initial fix on partial path costing. But
at the very least we can't get rid of it in all cases. *Maybe* we
could condition it on incremental sort, but I'm not convinced that's
the only place it's needed as a fix.

> I've described the idea about something like initial_cost_nestloop and
> so on. But I'm a bit skeptical about it, considering that the GUC only
> has limited effect.
>
>
> A somewhat note is that the number of indexes has pretty significant
> impact on planning time, even on master. This is timing of the same
> explain script (similar to the one shown before) with different number
> of indexes on master:
>
>      0 indexes        7 indexes       49 indexes
>      -------------------------------------------
>          10.85            12.56            27.83
>
> The way I look at incremental sort is that it allows using indexes for
> queries that couldn't use it before, possibly requiring a separate
> index. So incremental sort might easily reduce the number of indexes
> needed, compensating for the overhead we're discussing here. Of course,
> that may or may not be true.

One small idea, though I'm not yet sure it helps us a whole lot: if
the query pathkeys list is only length 1, then we could skip the
additional path creation.

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Alvaro Herrera
Date:
On 2020-Mar-31, Tom Lane wrote:

> James Coleman <jtc331@gmail.com> writes:
> > On Tue, Mar 31, 2020 at 1:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Perhaps the semantics are such that that's actually sensible, but it's
> >> far from a straightforward remapping of the old enum.
> 
> > Right, I didn't see the explicit "= 0" in other enums there, so it
> > made me wonder if it was intentional to designate that one had to be
> > 0, but I guess without a comment that's a lot of inference.
> 
> It's possible that somebody meant that as an indicator that the code
> depends on palloc0() leaving the field with that value.  But if so,
> you'd soon find that out ... and an actual comment would be better,
> anyway.

git blame fingers this:

commit bf11e7ee2e3607bb67d25aec73aa53b2d7e9961b
Author:     Robert Haas <rhaas@postgresql.org>
AuthorDate: Tue Aug 29 13:22:49 2017 -0400
CommitDate: Tue Aug 29 13:26:33 2017 -0400

    Propagate sort instrumentation from workers back to leader.
    
    Up until now, when parallel query was used, no details about the
    sort method or space used by the workers were available; details
    were shown only for any sorting done by the leader.  Fix that.
    
    Commit 1177ab1dabf72bafee8f19d904cee3a299f25892 forced the test case
    added by commit 1f6d515a67ec98194c23a5db25660856c9aab944 to run
    without parallelism; now that we have this infrastructure, allow
    that again, with a little tweaking to make it pass with and without
    force_parallel_mode.
    
    Robert Haas and Tom Lane
    
    Discussion: http://postgr.es/m/CA+Tgmoa2VBZW6S8AAXfhpHczb=Rf6RqQ2br+zJvEgwJ0uoD_tQ@mail.gmail.com

I looked at the discussion thread.  That patch was first posted by
Robert at
https://postgr.es/m/CA+Tgmoa2VBZW6S8AAXfhpHczb=Rf6RqQ2br+zJvEgwJ0uoD_tQ@mail.gmail.com
without the "= 0" part; later Tom posted v2 here
https://postgr.es/m/11223.1503695532@sss.pgh.pa.us
containing the "= 0", but I see no actual discussion of that point.

I suppose it could also be important to spell out that it's 0 if the
value were used as an array index of some sort, but I don't see that
in the 2017 commit.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:
>On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:
>> >On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
>> ><tomas.vondra@2ndquadrant.com> wrote:
>> >>
>> >> On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:
>> >> >Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> >> >> In general, I think it'd be naive that we can make planner smarter with
>> >> >> no extra overhead spent on planning, and we can never accept patches
>> >> >> adding even tiny overhead. With that approach we'd probably end up with
>> >> >> a trivial planner that generates just a single query plan, because
>> >> >> that's going to be the fastest planner. A realistic approach needs to
>> >> >> consider both the planning and execution phase, and benefits of this
>> >> >> patch seem to be clear - if you have queries that do benefit from it.
>> >> >
>> >> >I think that's kind of attacking a straw man, though.  The thing that
>> >> >people push back on, or should push back on IMO, is when a proposed
>> >> >patch adds significant slowdown to queries that it has no or very little
>> >> >hope of improving.  The trick is to do expensive stuff only when
>> >> >there's a good chance of getting a better plan out of it.
>> >> >
>> >>
>> >> Yeah, I agree with that. I think the main issue is that we don't really
>> >> know what the "expensive stuff" is in this case, so it's not really
>> >> clear how to be smarter :-(
>> >
>> >To add to this: I agree that ideally you'd check cheaply to know
>> >you're in a situation that might help, and then do more work. But here
>> >the question is always going to be simply "would we benefit from an
>> >ordering, and, if so, do we have it already partially sorted". It's
>> >hard to imagine that reducing much conceptually, so we're left with
>> >optimizations of that check.
>> >
>>
>> I think it depends on what exactly is the expensive part. For example if
>> it's the construction of IncrementalSort paths, then maybe we could try
>> do a quick/check check if the path can even be useful by estimating the
>> cost and only then building the path.
>>
>> That's what we do for join algorithms, for example - we first compute
>> initial_cost_nestloop and only when that seems cheap enough we do the
>> more expensive stuff.
>>
>> But I'm not sure the path construction is the expensive part, as it
>> should be disabled by enable_incrementalsort=off. But the regression
>> does not seem to disappear, at least not entirely.
>>
>>
>> >> One possibility is that it's just one of those regressions due to change
>> >> in binary layout, but I'm not sure know how to verify that.
>> >
>> >If we are testing with a case that can't actually add more paths (due
>> >to it checking the guc before building them), doesn't that effectively
>> >leave one of these two options:
>> >
>> >1. Binary layout/cache/other untraceable change, or
>> >2. Changes due to refactored function calls.
>> >
>>
>> Hmm, but in case of (1) the overhead should be there even with tests
>> that don't really have any additional paths to consider, right? I've
>> tried with such test (single table with no indexes) and I don't quite
>> see any regression (maybe ~1%).
>
>Not necessarily, if the cost is in sort costing or useful pathkeys
>checking, right? We have run that code even without incremental sort,
>but it's changed from master.
>

Ah, I should have mentioned I've done most of the tests on just the
basic incremental sort patch (0001+0002), without the additional useful
paths. I initially tested the whole patch series, but after discovering
the regression I removed the last part (which I suspected might be the
root cause). But the regression is still there, so it's not that.

It might be in the reworked costing, yeah. But then I'd expect those
functions to show up in the perf profile.

>> (2) might have impact, but I don't see any immediate supects. Did you
>> have some functions in mind?
>
>I guess this is where the lines blur: I didn't see anything obvious
>either, but the changes to sort costing...should probably not have
>real impact...but...
>

:-(

>> BTW I see the patch adds pathkeys_common but it's not called from
>> anywhere. It's probably leftover from an earlier patch version.
>>
>> >There's not anything obvious in point (2) that would be a big cost,
>> >but there are definitely changes there. I was surprised that just
>> >eliminating the loop through the pathkeys on the query and the index
>> >was enough to save us ~4%.
>> >
>> >Tomas: Earlier you'd wondered about if we should try to shortcut the
>> >changes in costing...I was skeptical of that originally, but maybe
>> >it's worth looking into? I'm going to try backing that out and see
>> >what the numbers look like.
>> >
>
>BTW, I did this test, and it looks like we can get back something
>close to 1% by reverting that initial fix on partial path costing. But
>we can't get rid of it all the time, at the very least. *Maybe* we
>could condition it on incremental sort, but I'm not convinced that's
>the only place it's needed as a fix.
>

Sounds interesting. I actually measured how much the add_partial_path
change accounts for, and you're right, it was quite a bit. But I forgot
about that when investigating the rest.

I wonder how large the regression would be without the add_partial_path
change and with the fix in pathkeys_common_contained_in.

I'm not sure how much we want to make add_partial_path() dependent on
particular GUCs, but I guess if it gets rid of the regression, allows us
to commit incremental sort and we can reasonably justify that only
incremental sort needs those paths, it might be acceptable.
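
Just to make that concrete, I'm imagining something along these lines
inside add_partial_path() - a sketch only, not the code in the patch
(enable_incrementalsort is the GUC as spelled in this thread):

    /*
     * Sketch: when the new partial path is more expensive overall, let a
     * cheaper startup cost save it only if incremental sort is enabled,
     * since that's the case that benefits from low-startup inputs.
     */
    if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
    {
        if (!(enable_incrementalsort &&
              compare_path_costs(new_path, old_path, STARTUP_COST) < 0))
            accept_new = false;     /* superseded by the existing path */
    }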

>> I've described the idea about something like initial_cost_nestloop and
>> so on. But I'm a bit skeptical about it, considering that the GUC only
>> has limited effect.
>>
>>
>> A somewhat note is that the number of indexes has pretty significant
>> impact on planning time, even on master. This is timing of the same
>> explain script (similar to the one shown before) with different number
>> of indexes on master:
>>
>>      0 indexes        7 indexes       49 indexes
>>      -------------------------------------------
>>          10.85            12.56            27.83
>>
>> The way I look at incremental sort is that it allows using indexes for
>> queries that couldn't use it before, possibly requiring a separate
>> index. So incremental sort might easily reduce the number of indexes
>> needed, compensating for the overhead we're discussing here. Of course,
>> that may or may not be true.
>
>One small idea, but I'm not yet sure it helps us a whole lot: if the
>query pathkeys is only length 1, then we could skip the additional
>path creation.
>

I don't follow. Why would we create incremental sort in this case at
all? With single-element query_pathkeys the path is either unsorted or
fully sorted - there's no room for incremental sort. No?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Mar 31, 2020 at 8:38 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:
> >On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:
> >> >On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
> >> ><tomas.vondra@2ndquadrant.com> wrote:
> >> >>
> >> >> On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:
> >> >> >Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> >> >> >> In general, I think it'd be naive that we can make planner smarter with
> >> >> >> no extra overhead spent on planning, and we can never accept patches
> >> >> >> adding even tiny overhead. With that approach we'd probably end up with
> >> >> >> a trivial planner that generates just a single query plan, because
> >> >> >> that's going to be the fastest planner. A realistic approach needs to
> >> >> >> consider both the planning and execution phase, and benefits of this
> >> >> >> patch seem to be clear - if you have queries that do benefit from it.
> >> >> >
> >> >> >I think that's kind of attacking a straw man, though.  The thing that
> >> >> >people push back on, or should push back on IMO, is when a proposed
> >> >> >patch adds significant slowdown to queries that it has no or very little
> >> >> >hope of improving.  The trick is to do expensive stuff only when
> >> >> >there's a good chance of getting a better plan out of it.
> >> >> >
> >> >>
> >> >> Yeah, I agree with that. I think the main issue is that we don't really
> >> >> know what the "expensive stuff" is in this case, so it's not really
> >> >> clear how to be smarter :-(
> >> >
> >> >To add to this: I agree that ideally you'd check cheaply to know
> >> >you're in a situation that might help, and then do more work. But here
> >> >the question is always going to be simply "would we benefit from an
> >> >ordering, and, if so, do we have it already partially sorted". It's
> >> >hard to imagine that reducing much conceptually, so we're left with
> >> >optimizations of that check.
> >> >
> >>
> >> I think it depends on what exactly is the expensive part. For example if
> >> it's the construction of IncrementalSort paths, then maybe we could try
> >> do a quick/check check if the path can even be useful by estimating the
> >> cost and only then building the path.
> >>
> >> That's what we do for join algorithms, for example - we first compute
> >> initial_cost_nestloop and only when that seems cheap enough we do the
> >> more expensive stuff.
> >>
> >> But I'm not sure the path construction is the expensive part, as it
> >> should be disabled by enable_incrementalsort=off. But the regression
> >> does not seem to disappear, at least not entirely.
> >>
> >>
> >> >> One possibility is that it's just one of those regressions due to change
> >> >> in binary layout, but I'm not sure know how to verify that.
> >> >
> >> >If we are testing with a case that can't actually add more paths (due
> >> >to it checking the guc before building them), doesn't that effectively
> >> >leave one of these two options:
> >> >
> >> >1. Binary layout/cache/other untraceable change, or
> >> >2. Changes due to refactored function calls.
> >> >
> >>
> >> Hmm, but in case of (1) the overhead should be there even with tests
> >> that don't really have any additional paths to consider, right? I've
> >> tried with such test (single table with no indexes) and I don't quite
> >> see any regression (maybe ~1%).
> >
> >Not necessarily, if the cost is in sort costing or useful pathkeys
> >checking, right? We have run that code even without incremental sort,
> >but it's changed from master.
> >
>
> Ah, I should have mentioned I've done most of the tests on just the
> basic incremental sort patch (0001+0002), without the additional useful
> paths. I initially tested the whole patch series, but after discovering
> the regression I removed the last part (which I suspected might be the
> root cause). But the regression is still there, so it's not that.
>
> It might be in the reworked costing, yeah. But then I'd expect those
> function to show in the perf profile.

Right. I'm just grasping at straws on that.

> >> (2) might have impact, but I don't see any immediate supects. Did you
> >> have some functions in mind?
> >
> >I guess this is where the lines blur: I didn't see anything obvious
> >either, but the changes to sort costing...should probably not have
> >real impact...but...
> >
>
> :-(
>
> >> BTW I see the patch adds pathkeys_common but it's not called from
> >> anywhere. It's probably leftover from an earlier patch version.
> >>

BTW, I think I'm going to rename the pathkeys_common_contained_in
function to something like pathkeys_count_contained_in, unless you
have an objection to that. The name doesn't seem obvious at all to me.

> >> >There's not anything obvious in point (2) that would be a big cost,
> >> >but there are definitely changes there. I was surprised that just
> >> >eliminating the loop through the pathkeys on the query and the index
> >> >was enough to save us ~4%.
> >> >
> >> >Tomas: Earlier you'd wondered about if we should try to shortcut the
> >> >changes in costing...I was skeptical of that originally, but maybe
> >> >it's worth looking into? I'm going to try backing that out and see
> >> >what the numbers look like.
> >> >
> >
> >BTW, I did this test, and it looks like we can get back something
> >close to 1% by reverting that initial fix on partial path costing. But
> >we can't get rid of it all the time, at the very least. *Maybe* we
> >could condition it on incremental sort, but I'm not convinced that's
> >the only place it's needed as a fix.
> >
>
> Sounds interesting. I actually tried how much the add_partial_path
> change accounts for, and you're right it was quite a bit. But I forgot
> about that when investigating the rest.
>
> I wonder how large would the regression be without add_partial_path and
> with the fix in pathkeys_common_contained_in.
>
> I'm not sure how much we want to make add_partial_path() dependent on
> particular GUCs, but I guess if it gets rid of the regression, allows us
> to commit incremental sort and we can reasonably justify that only
> incremental sort needs those paths, it might be acceptable.

That's a good point.

> >> I've described the idea about something like initial_cost_nestloop and
> >> so on. But I'm a bit skeptical about it, considering that the GUC only
> >> has limited effect.
> >>
> >>
> >> A somewhat note is that the number of indexes has pretty significant
> >> impact on planning time, even on master. This is timing of the same
> >> explain script (similar to the one shown before) with different number
> >> of indexes on master:
> >>
> >>      0 indexes        7 indexes       49 indexes
> >>      -------------------------------------------
> >>          10.85            12.56            27.83
> >>
> >> The way I look at incremental sort is that it allows using indexes for
> >> queries that couldn't use it before, possibly requiring a separate
> >> index. So incremental sort might easily reduce the number of indexes
> >> needed, compensating for the overhead we're discussing here. Of course,
> >> that may or may not be true.
> >
> >One small idea, but I'm not yet sure it helps us a whole lot: if the
> >query pathkeys is only length 1, then we could skip the additional
> >path creation.
> >
>
> I don't follow. Why would we create incremental sort in this case at
> all? With single-element query_pathkeys the path is either unsorted or
> fully sorted - there's no room for incremental sort. No?

Well, we shouldn't; that's what I'm getting at. But I didn't see
anything in the code now that explicitly excludes that case when
deciding whether or not to create an incremental sort path, unless I'm
missing something obvious.
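
Something as simple as this guard is what I'd expect to see at the
point where we consider building those paths (a sketch; root is the
usual PlannerInfo and query_pathkeys its standard field):

    /*
     * With fewer than two query pathkeys an input path is either unsorted
     * or fully sorted, so incremental sort can never apply; bail out
     * before building any additional paths.
     */
    if (list_length(root->query_pathkeys) < 2)
        return;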

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Tue, Mar 31, 2020 at 08:42:47PM -0400, James Coleman wrote:
>On Tue, Mar 31, 2020 at 8:38 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:
>> >On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
>> ><tomas.vondra@2ndquadrant.com> wrote:
>> >>
>> >> On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:
>> >> >On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
>> >> ><tomas.vondra@2ndquadrant.com> wrote:
>> >> >>
>> >> >> On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:
>> >> >> >Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> >> >> >> In general, I think it'd be naive that we can make planner smarter with
>> >> >> >> no extra overhead spent on planning, and we can never accept patches
>> >> >> >> adding even tiny overhead. With that approach we'd probably end up with
>> >> >> >> a trivial planner that generates just a single query plan, because
>> >> >> >> that's going to be the fastest planner. A realistic approach needs to
>> >> >> >> consider both the planning and execution phase, and benefits of this
>> >> >> >> patch seem to be clear - if you have queries that do benefit from it.
>> >> >> >
>> >> >> >I think that's kind of attacking a straw man, though.  The thing that
>> >> >> >people push back on, or should push back on IMO, is when a proposed
>> >> >> >patch adds significant slowdown to queries that it has no or very little
>> >> >> >hope of improving.  The trick is to do expensive stuff only when
>> >> >> >there's a good chance of getting a better plan out of it.
>> >> >> >
>> >> >>
>> >> >> Yeah, I agree with that. I think the main issue is that we don't really
>> >> >> know what the "expensive stuff" is in this case, so it's not really
>> >> >> clear how to be smarter :-(
>> >> >
>> >> >To add to this: I agree that ideally you'd check cheaply to know
>> >> >you're in a situation that might help, and then do more work. But here
>> >> >the question is always going to be simply "would we benefit from an
>> >> >ordering, and, if so, do we have it already partially sorted". It's
>> >> >hard to imagine that reducing much conceptually, so we're left with
>> >> >optimizations of that check.
>> >> >
>> >>
>> >> I think it depends on what exactly is the expensive part. For example if
>> >> it's the construction of IncrementalSort paths, then maybe we could try
>> >> do a quick/check check if the path can even be useful by estimating the
>> >> cost and only then building the path.
>> >>
>> >> That's what we do for join algorithms, for example - we first compute
>> >> initial_cost_nestloop and only when that seems cheap enough we do the
>> >> more expensive stuff.
>> >>
>> >> But I'm not sure the path construction is the expensive part, as it
>> >> should be disabled by enable_incrementalsort=off. But the regression
>> >> does not seem to disappear, at least not entirely.
>> >>
>> >>
>> >> >> One possibility is that it's just one of those regressions due to change
>> >> >> in binary layout, but I'm not sure know how to verify that.
>> >> >
>> >> >If we are testing with a case that can't actually add more paths (due
>> >> >to it checking the guc before building them), doesn't that effectively
>> >> >leave one of these two options:
>> >> >
>> >> >1. Binary layout/cache/other untraceable change, or
>> >> >2. Changes due to refactored function calls.
>> >> >
>> >>
>> >> Hmm, but in case of (1) the overhead should be there even with tests
>> >> that don't really have any additional paths to consider, right? I've
>> >> tried with such test (single table with no indexes) and I don't quite
>> >> see any regression (maybe ~1%).
>> >
>> >Not necessarily, if the cost is in sort costing or useful pathkeys
>> >checking, right? We have run that code even without incremental sort,
>> >but it's changed from master.
>> >
>>
>> Ah, I should have mentioned I've done most of the tests on just the
>> basic incremental sort patch (0001+0002), without the additional useful
>> paths. I initially tested the whole patch series, but after discovering
>> the regression I removed the last part (which I suspected might be the
>> root cause). But the regression is still there, so it's not that.
>>
>> It might be in the reworked costing, yeah. But then I'd expect those
>> function to show in the perf profile.
>
>Right. I'm just grasping at straws on that.
>
>> >> (2) might have impact, but I don't see any immediate supects. Did you
>> >> have some functions in mind?
>> >
>> >I guess this is where the lines blur: I didn't see anything obvious
>> >either, but the changes to sort costing...should probably not have
>> >real impact...but...
>> >
>>
>> :-(
>>
>> >> BTW I see the patch adds pathkeys_common but it's not called from
>> >> anywhere. It's probably leftover from an earlier patch version.
>> >>
>
>BTW, I think I'm going to rename the pathkeys_common_contained_in
>function to something like pathkeys_count_contained_in, unless you
>have an objection to that. The name doesn't seem obvious at all to me.
>

WFM

>> >> >There's not anything obvious in point (2) that would be a big cost,
>> >> >but there are definitely changes there. I was surprised that just
>> >> >eliminating the loop through the pathkeys on the query and the index
>> >> >was enough to save us ~4%.
>> >> >
>> >> >Tomas: Earlier you'd wondered about if we should try to shortcut the
>> >> >changes in costing...I was skeptical of that originally, but maybe
>> >> >it's worth looking into? I'm going to try backing that out and see
>> >> >what the numbers look like.
>> >> >
>> >
>> >BTW, I did this test, and it looks like we can get back something
>> >close to 1% by reverting that initial fix on partial path costing. But
>> >we can't get rid of it all the time, at the very least. *Maybe* we
>> >could condition it on incremental sort, but I'm not convinced that's
>> >the only place it's needed as a fix.
>> >
>>
>> Sounds interesting. I actually tried how much the add_partial_path
>> change accounts for, and you're right it was quite a bit. But I forgot
>> about that when investigating the rest.
>>
>> I wonder how large would the regression be without add_partial_path and
>> with the fix in pathkeys_common_contained_in.
>>
>> I'm not sure how much we want to make add_partial_path() dependent on
>> particular GUCs, but I guess if it gets rid of the regression, allows us
>> to commit incremental sort and we can reasonably justify that only
>> incremental sort needs those paths, it might be acceptable.
>
>That's a good point.
>
>> >> I've described the idea about something like initial_cost_nestloop and
>> >> so on. But I'm a bit skeptical about it, considering that the GUC only
>> >> has limited effect.
>> >>
>> >>
>> >> A somewhat note is that the number of indexes has pretty significant
>> >> impact on planning time, even on master. This is timing of the same
>> >> explain script (similar to the one shown before) with different number
>> >> of indexes on master:
>> >>
>> >>      0 indexes        7 indexes       49 indexes
>> >>      -------------------------------------------
>> >>          10.85            12.56            27.83
>> >>
>> >> The way I look at incremental sort is that it allows using indexes for
>> >> queries that couldn't use it before, possibly requiring a separate
>> >> index. So incremental sort might easily reduce the number of indexes
>> >> needed, compensating for the overhead we're discussing here. Of course,
>> >> that may or may not be true.
>> >
>> >One small idea, but I'm not yet sure it helps us a whole lot: if the
>> >query pathkeys is only length 1, then we could skip the additional
>> >path creation.
>> >
>>
>> I don't follow. Why would we create incremental sort in this case at
>> all? With single-element query_pathkeys the path is either unsorted or
>> fully sorted - there's no room for incremental sort. No?
>
>Well, we shouldn't, that's what I'm getting. But I didn't see anything
>in the code now that explicitly excludes that case when decided
>whether or not to create an incremental sort path, unless I'm missing
>something obvious.

Well, my point is that create_ordered_paths() looks like this:

     is_sorted = pathkeys_common_contained_in(root->sort_pathkeys, ...);

     if (is_sorted)
     {
         ... old code
     }
     else
     {
         if (input_path == cheapest_input_path)
         {
             ... old code
         }

         /* With incremental sort disabled, don't build those paths. */
         if (!enable_incrementalsort)
             continue;

         /* Likewise, if the path can't be used for incremental sort. */
         if (!presorted_keys)
             continue;

         ... incremental sort path
     }

Now, with single-item sort_pathkeys, the input path can't be partially
sorted. It's either fully sorted - in which case it's handled by the
first branch - or it's not sorted at all, so presorted_keys == 0 and we
never get to the incremental sort path.

Or did you mean to use the optimization somewhere else?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Mar 31, 2020 at 9:59 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Mar 31, 2020 at 08:42:47PM -0400, James Coleman wrote:
> >On Tue, Mar 31, 2020 at 8:38 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:
> >> >On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
> >> ><tomas.vondra@2ndquadrant.com> wrote:
> >> >>
> >> >> On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:
> >> >> >On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
> >> >> ><tomas.vondra@2ndquadrant.com> wrote:
> >> >> >>
> >> >> >> On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:
> >> >> >> >Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> >> >> >> >> In general, I think it'd be naive that we can make planner smarter with
> >> >> >> >> no extra overhead spent on planning, and we can never accept patches
> >> >> >> >> adding even tiny overhead. With that approach we'd probably end up with
> >> >> >> >> a trivial planner that generates just a single query plan, because
> >> >> >> >> that's going to be the fastest planner. A realistic approach needs to
> >> >> >> >> consider both the planning and execution phase, and benefits of this
> >> >> >> >> patch seem to be clear - if you have queries that do benefit from it.
> >> >> >> >
> >> >> >> >I think that's kind of attacking a straw man, though.  The thing that
> >> >> >> >people push back on, or should push back on IMO, is when a proposed
> >> >> >> >patch adds significant slowdown to queries that it has no or very little
> >> >> >> >hope of improving.  The trick is to do expensive stuff only when
> >> >> >> >there's a good chance of getting a better plan out of it.
> >> >> >> >
> >> >> >>
> >> >> >> Yeah, I agree with that. I think the main issue is that we don't really
> >> >> >> know what the "expensive stuff" is in this case, so it's not really
> >> >> >> clear how to be smarter :-(
> >> >> >
> >> >> >To add to this: I agree that ideally you'd check cheaply to know
> >> >> >you're in a situation that might help, and then do more work. But here
> >> >> >the question is always going to be simply "would we benefit from an
> >> >> >ordering, and, if so, do we have it already partially sorted". It's
> >> >> >hard to imagine that reducing much conceptually, so we're left with
> >> >> >optimizations of that check.
> >> >> >
> >> >>
> >> >> I think it depends on what exactly is the expensive part. For example if
> >> >> it's the construction of IncrementalSort paths, then maybe we could try
> >> >> do a quick/check check if the path can even be useful by estimating the
> >> >> cost and only then building the path.
> >> >>
> >> >> That's what we do for join algorithms, for example - we first compute
> >> >> initial_cost_nestloop and only when that seems cheap enough we do the
> >> >> more expensive stuff.
> >> >>
> >> >> But I'm not sure the path construction is the expensive part, as it
> >> >> should be disabled by enable_incrementalsort=off. But the regression
> >> >> does not seem to disappear, at least not entirely.
> >> >>
> >> >>
> >> >> >> One possibility is that it's just one of those regressions due to change
> >> >> >> in binary layout, but I'm not sure know how to verify that.
> >> >> >
> >> >> >If we are testing with a case that can't actually add more paths (due
> >> >> >to it checking the guc before building them), doesn't that effectively
> >> >> >leave one of these two options:
> >> >> >
> >> >> >1. Binary layout/cache/other untraceable change, or
> >> >> >2. Changes due to refactored function calls.
> >> >> >
> >> >>
> >> >> Hmm, but in case of (1) the overhead should be there even with tests
> >> >> that don't really have any additional paths to consider, right? I've
> >> >> tried with such test (single table with no indexes) and I don't quite
> >> >> see any regression (maybe ~1%).
> >> >
> >> >Not necessarily, if the cost is in sort costing or useful pathkeys
> >> >checking, right? We have run that code even without incremental sort,
> >> >but it's changed from master.
> >> >
> >>
> >> Ah, I should have mentioned I've done most of the tests on just the
> >> basic incremental sort patch (0001+0002), without the additional useful
> >> paths. I initially tested the whole patch series, but after discovering
> >> the regression I removed the last part (which I suspected might be the
> >> root cause). But the regression is still there, so it's not that.
> >>
> >> It might be in the reworked costing, yeah. But then I'd expect those
> >> function to show in the perf profile.
> >
> >Right. I'm just grasping at straws on that.
> >
> >> >> (2) might have impact, but I don't see any immediate supects. Did you
> >> >> have some functions in mind?
> >> >
> >> >I guess this is where the lines blur: I didn't see anything obvious
> >> >either, but the changes to sort costing...should probably not have
> >> >real impact...but...
> >> >
> >>
> >> :-(
> >>
> >> >> BTW I see the patch adds pathkeys_common but it's not called from
> >> >> anywhere. It's probably leftover from an earlier patch version.
> >> >>
> >
> >BTW, I think I'm going to rename the pathkeys_common_contained_in
> >function to something like pathkeys_count_contained_in, unless you
> >have an objection to that. The name doesn't seem obvious at all to me.
> >
>
> WFM
>
> >> >> >There's not anything obvious in point (2) that would be a big cost,
> >> >> >but there are definitely changes there. I was surprised that just
> >> >> >eliminating the loop through the pathkeys on the query and the index
> >> >> >was enough to save us ~4%.
> >> >> >
> >> >> >Tomas: Earlier you'd wondered about if we should try to shortcut the
> >> >> >changes in costing...I was skeptical of that originally, but maybe
> >> >> >it's worth looking into? I'm going to try backing that out and see
> >> >> >what the numbers look like.
> >> >> >
> >> >
> >> >BTW, I did this test, and it looks like we can get back something
> >> >close to 1% by reverting that initial fix on partial path costing. But
> >> >we can't get rid of it all the time, at the very least. *Maybe* we
> >> >could condition it on incremental sort, but I'm not convinced that's
> >> >the only place it's needed as a fix.
> >> >
> >>
> >> Sounds interesting. I actually tried how much the add_partial_path
> >> change accounts for, and you're right it was quite a bit. But I forgot
> >> about that when investigating the rest.
> >>
> >> I wonder how large would the regression be without add_partial_path and
> >> with the fix in pathkeys_common_contained_in.
> >>
> >> I'm not sure how much we want to make add_partial_path() dependent on
> >> particular GUCs, but I guess if it gets rid of the regression, allows us
> >> to commit incremental sort and we can reasonably justify that only
> >> incremental sort needs those paths, it might be acceptable.
> >
> >That's a good point.
> >
> >> >> I've described the idea about something like initial_cost_nestloop and
> >> >> so on. But I'm a bit skeptical about it, considering that the GUC only
> >> >> has limited effect.
> >> >>
> >> >>
> >> >> A somewhat note is that the number of indexes has pretty significant
> >> >> impact on planning time, even on master. This is timing of the same
> >> >> explain script (similar to the one shown before) with different number
> >> >> of indexes on master:
> >> >>
> >> >>      0 indexes        7 indexes       49 indexes
> >> >>      -------------------------------------------
> >> >>          10.85            12.56            27.83
> >> >>
> >> >> The way I look at incremental sort is that it allows using indexes for
> >> >> queries that couldn't use it before, possibly requiring a separate
> >> >> index. So incremental sort might easily reduce the number of indexes
> >> >> needed, compensating for the overhead we're discussing here. Of course,
> >> >> that may or may not be true.
> >> >
> >> >One small idea, but I'm not yet sure it helps us a whole lot: if the
> >> >query pathkeys is only length 1, then we could skip the additional
> >> >path creation.
> >> >
> >>
> >> I don't follow. Why would we create incremental sort in this case at
> >> all? With single-element query_pathkeys the path is either unsorted or
> >> fully sorted - there's no room for incremental sort. No?
> >
> >Well, we shouldn't, that's what I'm getting. But I didn't see anything
> >in the code now that explicitly excludes that case when decided
> >whether or not to create an incremental sort path, unless I'm missing
> >something obvious.
>
> Well, my point is that create_ordered_paths() looks like this:
>
>      is_sorted = pathkeys_common_contained_in(root->sort_patkeys, ...);
>
>      if (is_sorted)
>      {
>          ... old code
>      }
>      else
>      {
>          if (input_path == cheapest_input_path)
>          {
>              ... old code
>          }
>
>          /* With incremental sort disabled, don't build those paths. */
>          if (!enable_incrementalsort)
>              continue;
>
>          /* Likewise, if the path can't be used for incremental sort. */
>          if (!presorted_keys)
>              continue;
>
>          ... incremental sort path
>      }
>
> Now, with single-item sort_pathkeys, the input path can't be partially
> sorted. It's either fully sorted - in which case it's handled by the
> first branch. Or it's not sorted at all, so presorted_keys==0 and we
> never get to the incremental path.
>
> Or did you mean to use the optimization somewhere else?

Hmm, yes, I didn't think through that properly. I'll have to look at
the other cases to confirm the same logic applies there.

One other thing: in the code above we create the regular sort path
inside of `if (input_path == cheapest_input_path)`, but incremental
sort is outside of that condition. I'm not sure I'm remembering why
that was, and it's not obvious to me reading it right now (though it's
getting late here, so maybe I'm just not thinking clearly). Do you
happen to remember why that is?

I've included the optimization on top of the add_partial_path fix, and I
now have numbers (for your test, slightly modified in how I execute it)
like:

branch: 0.8354718927735362
master: 0.8128127066707269

Which is a 2.7% regression (with enable_incrementalsort off).

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Tue, Mar 31, 2020 at 10:12:29PM -0400, James Coleman wrote:
>On Tue, Mar 31, 2020 at 9:59 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Tue, Mar 31, 2020 at 08:42:47PM -0400, James Coleman wrote:
>> >On Tue, Mar 31, 2020 at 8:38 PM Tomas Vondra
>> ><tomas.vondra@2ndquadrant.com> wrote:
>> >>
>> >> On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:
>> >> >On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
>> >> ><tomas.vondra@2ndquadrant.com> wrote:
>> >> >>
>> >> >> On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:
>> >> >> >On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
>> >> >> ><tomas.vondra@2ndquadrant.com> wrote:
>> >> >> >>
>> >> >> >> On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:
>> >> >> >> >Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> >> >> >> >> In general, I think it'd be naive that we can make planner smarter with
>> >> >> >> >> no extra overhead spent on planning, and we can never accept patches
>> >> >> >> >> adding even tiny overhead. With that approach we'd probably end up with
>> >> >> >> >> a trivial planner that generates just a single query plan, because
>> >> >> >> >> that's going to be the fastest planner. A realistic approach needs to
>> >> >> >> >> consider both the planning and execution phase, and benefits of this
>> >> >> >> >> patch seem to be clear - if you have queries that do benefit from it.
>> >> >> >> >
>> >> >> >> >I think that's kind of attacking a straw man, though.  The thing that
>> >> >> >> >people push back on, or should push back on IMO, is when a proposed
>> >> >> >> >patch adds significant slowdown to queries that it has no or very little
>> >> >> >> >hope of improving.  The trick is to do expensive stuff only when
>> >> >> >> >there's a good chance of getting a better plan out of it.
>> >> >> >> >
>> >> >> >>
>> >> >> >> Yeah, I agree with that. I think the main issue is that we don't really
>> >> >> >> know what the "expensive stuff" is in this case, so it's not really
>> >> >> >> clear how to be smarter :-(
>> >> >> >
>> >> >> >To add to this: I agree that ideally you'd check cheaply to know
>> >> >> >you're in a situation that might help, and then do more work. But here
>> >> >> >the question is always going to be simply "would we benefit from an
>> >> >> >ordering, and, if so, do we have it already partially sorted". It's
>> >> >> >hard to imagine that reducing much conceptually, so we're left with
>> >> >> >optimizations of that check.
>> >> >> >
>> >> >>
>> >> >> I think it depends on what exactly is the expensive part. For example if
>> >> >> it's the construction of IncrementalSort paths, then maybe we could try
>> >> >> do a quick/check check if the path can even be useful by estimating the
>> >> >> cost and only then building the path.
>> >> >>
>> >> >> That's what we do for join algorithms, for example - we first compute
>> >> >> initial_cost_nestloop and only when that seems cheap enough we do the
>> >> >> more expensive stuff.
>> >> >>
>> >> >> But I'm not sure the path construction is the expensive part, as it
>> >> >> should be disabled by enable_incrementalsort=off. But the regression
>> >> >> does not seem to disappear, at least not entirely.
>> >> >>
>> >> >>
>> >> >> >> One possibility is that it's just one of those regressions due to change
>> >> >> >> in binary layout, but I'm not sure know how to verify that.
>> >> >> >
>> >> >> >If we are testing with a case that can't actually add more paths (due
>> >> >> >to it checking the guc before building them), doesn't that effectively
>> >> >> >leave one of these two options:
>> >> >> >
>> >> >> >1. Binary layout/cache/other untraceable change, or
>> >> >> >2. Changes due to refactored function calls.
>> >> >> >
>> >> >>
>> >> >> Hmm, but in case of (1) the overhead should be there even with tests
>> >> >> that don't really have any additional paths to consider, right? I've
>> >> >> tried with such test (single table with no indexes) and I don't quite
>> >> >> see any regression (maybe ~1%).
>> >> >
>> >> >Not necessarily, if the cost is in sort costing or useful pathkeys
>> >> >checking, right? We have run that code even without incremental sort,
>> >> >but it's changed from master.
>> >> >
>> >>
>> >> Ah, I should have mentioned I've done most of the tests on just the
>> >> basic incremental sort patch (0001+0002), without the additional useful
>> >> paths. I initially tested the whole patch series, but after discovering
>> >> the regression I removed the last part (which I suspected might be the
>> >> root cause). But the regression is still there, so it's not that.
>> >>
>> >> It might be in the reworked costing, yeah. But then I'd expect those
>> >> function to show in the perf profile.
>> >
>> >Right. I'm just grasping at straws on that.
>> >
>> >> >> (2) might have impact, but I don't see any immediate supects. Did you
>> >> >> have some functions in mind?
>> >> >
>> >> >I guess this is where the lines blur: I didn't see anything obvious
>> >> >either, but the changes to sort costing...should probably not have
>> >> >real impact...but...
>> >> >
>> >>
>> >> :-(
>> >>
>> >> >> BTW I see the patch adds pathkeys_common but it's not called from
>> >> >> anywhere. It's probably leftover from an earlier patch version.
>> >> >>
>> >
>> >BTW, I think I'm going to rename the pathkeys_common_contained_in
>> >function to something like pathkeys_count_contained_in, unless you
>> >have an objection to that. The name doesn't seem obvious at all to me.
>> >
>>
>> WFM
>>
>> >> >> >There's not anything obvious in point (2) that would be a big cost,
>> >> >> >but there are definitely changes there. I was surprised that just
>> >> >> >eliminating the loop through the pathkeys on the query and the index
>> >> >> >was enough to save us ~4%.
>> >> >> >
>> >> >> >Tomas: Earlier you'd wondered about if we should try to shortcut the
>> >> >> >changes in costing...I was skeptical of that originally, but maybe
>> >> >> >it's worth looking into? I'm going to try backing that out and see
>> >> >> >what the numbers look like.
>> >> >> >
>> >> >
>> >> >BTW, I did this test, and it looks like we can get back something
>> >> >close to 1% by reverting that initial fix on partial path costing. But
>> >> >we can't get rid of it all the time, at the very least. *Maybe* we
>> >> >could condition it on incremental sort, but I'm not convinced that's
>> >> >the only place it's needed as a fix.
>> >> >
>> >>
>> >> Sounds interesting. I actually tried how much the add_partial_path
>> >> change accounts for, and you're right it was quite a bit. But I forgot
>> >> about that when investigating the rest.
>> >>
>> >> I wonder how large would the regression be without add_partial_path and
>> >> with the fix in pathkeys_common_contained_in.
>> >>
>> >> I'm not sure how much we want to make add_partial_path() dependent on
>> >> particular GUCs, but I guess if it gets rid of the regression, allows us
>> >> to commit incremental sort and we can reasonably justify that only
>> >> incremental sort needs those paths, it might be acceptable.
>> >
>> >That's a good point.
>> >
>> >> >> I've described the idea about something like initial_cost_nestloop and
>> >> >> so on. But I'm a bit skeptical about it, considering that the GUC only
>> >> >> has limited effect.
>> >> >>
>> >> >>
>> >> >> A somewhat note is that the number of indexes has pretty significant
>> >> >> impact on planning time, even on master. This is timing of the same
>> >> >> explain script (similar to the one shown before) with different number
>> >> >> of indexes on master:
>> >> >>
>> >> >>      0 indexes        7 indexes       49 indexes
>> >> >>      -------------------------------------------
>> >> >>          10.85            12.56            27.83
>> >> >>
>> >> >> The way I look at incremental sort is that it allows using indexes for
>> >> >> queries that couldn't use it before, possibly requiring a separate
>> >> >> index. So incremental sort might easily reduce the number of indexes
>> >> >> needed, compensating for the overhead we're discussing here. Of course,
>> >> >> that may or may not be true.
>> >> >
>> >> >One small idea, but I'm not yet sure it helps us a whole lot: if the
>> >> >query pathkeys is only length 1, then we could skip the additional
>> >> >path creation.
>> >> >
>> >>
>> >> I don't follow. Why would we create incremental sort in this case at
>> >> all? With single-element query_pathkeys the path is either unsorted or
>> >> fully sorted - there's no room for incremental sort. No?
>> >
>> >Well, we shouldn't, that's what I'm getting. But I didn't see anything
>> >in the code now that explicitly excludes that case when decided
>> >whether or not to create an incremental sort path, unless I'm missing
>> >something obvious.
>>
>> Well, my point is that create_ordered_paths() looks like this:
>>
>>      is_sorted = pathkeys_common_contained_in(root->sort_patkeys, ...);
>>
>>      if (is_sorted)
>>      {
>>          ... old code
>>      }
>>      else
>>      {
>>          if (input_path == cheapest_input_path)
>>          {
>>              ... old code
>>          }
>>
>>          /* With incremental sort disabled, don't build those paths. */
>>          if (!enable_incrementalsort)
>>              continue;
>>
>>          /* Likewise, if the path can't be used for incremental sort. */
>>          if (!presorted_keys)
>>              continue;
>>
>>          ... incremental sort path
>>      }
>>
>> Now, with single-item sort_pathkeys, the input path can't be partially
>> sorted. It's either fully sorted - in which case it's handled by the
>> first branch. Or it's not sorted at all, so presorted_keys==0 and we
>> never get to the incremental path.
>>
>> Or did you mean to use the optimization somewhere else?
>
>Hmm, yes, I didn't think through that properly. I'll have to look at
>the other cases to confirm the same logic applies there.
>
>One other thing:in the code above we create the regular sort path
>inside of `if (input_path == cheapest_input_path)`, but incremental
>sort is outside of that condition. I'm not sure I'm remembering why
>that was, and it's not obvious to me reading it right now (though it's
>getting late here, so maybe I'm just not thinking clearly). Do you
>happen to remember why that is?
>

It's because for the regular sort, the path is either already sorted or
it requires a full sort. But full sort only makes sense on the cheapest
path, because we assume the additional sort cost is independent of the
input cost, essentially

    cost(path + Sort) = cost(path) + cost(Sort)

and it's always

     cost(path) + cost(Sort) >= cost(cheapest path) + cost(Sort)

and by checking for the cheapest path we simply skip building all the paths
that we'd end up discarding anyway.

With incremental sort we can't do this: the cost of the incremental sort
depends on how well presorted the input path is.
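
For illustration, here is a tiny standalone toy model of that argument
(not PostgreSQL code; all names and numbers are made up). With a full
sort, the constant sort cost shifts every total by the same amount, so
only the cheapest input can win; with incremental sort, the cost shrinks
with the number of presorted keys, so a pricier but partially sorted
input can come out ahead:

    #include <stdio.h>

    typedef struct ToyPath
    {
        const char *name;
        double      input_cost;     /* cost of producing the input rows */
        int         presorted_keys; /* leading sort keys already satisfied */
    } ToyPath;

    #define FULL_SORT_COST 100.0    /* assumed independent of the input */
    #define NKEYS          3        /* sort keys the query asks for */

    /* toy incremental sort: cheaper the more leading keys are presorted */
    static double
    incremental_sort_cost(const ToyPath *p)
    {
        return FULL_SORT_COST * (NKEYS - p->presorted_keys) / NKEYS;
    }

    int
    main(void)
    {
        ToyPath     paths[] = {
            {"cheapest, unsorted", 50.0, 0},
            {"pricier, presorted on 2 of 3 keys", 80.0, 2},
        };

        for (int i = 0; i < 2; i++)
        {
            const ToyPath *p = &paths[i];

            /* full sort: input cost plus a constant, cheapest input always wins */
            printf("%-35s full sort = %6.1f", p->name,
                   p->input_cost + FULL_SORT_COST);

            /* incremental sort: depends on presortedness, so every path matters */
            if (p->presorted_keys > 0)
                printf("   incremental sort = %6.1f",
                       p->input_cost + incremental_sort_cost(p));
            printf("\n");
        }
        return 0;
    }

Here the presorted path ends up cheaper overall even though its input is
more expensive, which is why the incremental sort consideration can't be
restricted to the cheapest input path.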

>I've included the optimization on the add_partial_path fix and I now
>have numbers (for your test, slightly modified in how I execute it)
>like:
>
>branch: 0.8354718927735362
>master: 0.8128127066707269
>
>Which is a 2.7% regression (with enable_incrementalsort off).

Can you try a more realistic benchmark, one less focused on the planner
part? Something like a read-only pgbench with a fairly small data set
and a single client, or something like that?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Mar 31, 2020 at 10:44 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Mar 31, 2020 at 10:12:29PM -0400, James Coleman wrote:
> >On Tue, Mar 31, 2020 at 9:59 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> On Tue, Mar 31, 2020 at 08:42:47PM -0400, James Coleman wrote:
> >> >On Tue, Mar 31, 2020 at 8:38 PM Tomas Vondra
> >> ><tomas.vondra@2ndquadrant.com> wrote:
> >> >>
> >> >> On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:
> >> >> >On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
> >> >> ><tomas.vondra@2ndquadrant.com> wrote:
> >> >> >>
> >> >> >> On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:
> >> >> >> >On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
> >> >> >> ><tomas.vondra@2ndquadrant.com> wrote:
> >> >> >> >>
> >> >> >> >> On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:
> >> >> >> >> >Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> >> >> >> >> >> In general, I think it'd be naive that we can make planner smarter with
> >> >> >> >> >> no extra overhead spent on planning, and we can never accept patches
> >> >> >> >> >> adding even tiny overhead. With that approach we'd probably end up with
> >> >> >> >> >> a trivial planner that generates just a single query plan, because
> >> >> >> >> >> that's going to be the fastest planner. A realistic approach needs to
> >> >> >> >> >> consider both the planning and execution phase, and benefits of this
> >> >> >> >> >> patch seem to be clear - if you have queries that do benefit from it.
> >> >> >> >> >
> >> >> >> >> >I think that's kind of attacking a straw man, though.  The thing that
> >> >> >> >> >people push back on, or should push back on IMO, is when a proposed
> >> >> >> >> >patch adds significant slowdown to queries that it has no or very little
> >> >> >> >> >hope of improving.  The trick is to do expensive stuff only when
> >> >> >> >> >there's a good chance of getting a better plan out of it.
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >> Yeah, I agree with that. I think the main issue is that we don't really
> >> >> >> >> know what the "expensive stuff" is in this case, so it's not really
> >> >> >> >> clear how to be smarter :-(
> >> >> >> >
> >> >> >> >To add to this: I agree that ideally you'd check cheaply to know
> >> >> >> >you're in a situation that might help, and then do more work. But here
> >> >> >> >the question is always going to be simply "would we benefit from an
> >> >> >> >ordering, and, if so, do we have it already partially sorted". It's
> >> >> >> >hard to imagine that reducing much conceptually, so we're left with
> >> >> >> >optimizations of that check.
> >> >> >> >
> >> >> >>
> >> >> >> I think it depends on what exactly is the expensive part. For example if
> >> >> >> it's the construction of IncrementalSort paths, then maybe we could try
> >> >> >> do a quick/check check if the path can even be useful by estimating the
> >> >> >> cost and only then building the path.
> >> >> >>
> >> >> >> That's what we do for join algorithms, for example - we first compute
> >> >> >> initial_cost_nestloop and only when that seems cheap enough we do the
> >> >> >> more expensive stuff.
> >> >> >>
> >> >> >> But I'm not sure the path construction is the expensive part, as it
> >> >> >> should be disabled by enable_incrementalsort=off. But the regression
> >> >> >> does not seem to disappear, at least not entirely.
> >> >> >>
> >> >> >>
> >> >> >> >> One possibility is that it's just one of those regressions due to change
> >> >> >> >> in binary layout, but I'm not sure know how to verify that.
> >> >> >> >
> >> >> >> >If we are testing with a case that can't actually add more paths (due
> >> >> >> >to it checking the guc before building them), doesn't that effectively
> >> >> >> >leave one of these two options:
> >> >> >> >
> >> >> >> >1. Binary layout/cache/other untraceable change, or
> >> >> >> >2. Changes due to refactored function calls.
> >> >> >> >
> >> >> >>
> >> >> >> Hmm, but in case of (1) the overhead should be there even with tests
> >> >> >> that don't really have any additional paths to consider, right? I've
> >> >> >> tried with such test (single table with no indexes) and I don't quite
> >> >> >> see any regression (maybe ~1%).
> >> >> >
> >> >> >Not necessarily, if the cost is in sort costing or useful pathkeys
> >> >> >checking, right? We have run that code even without incremental sort,
> >> >> >but it's changed from master.
> >> >> >
> >> >>
> >> >> Ah, I should have mentioned I've done most of the tests on just the
> >> >> basic incremental sort patch (0001+0002), without the additional useful
> >> >> paths. I initially tested the whole patch series, but after discovering
> >> >> the regression I removed the last part (which I suspected might be the
> >> >> root cause). But the regression is still there, so it's not that.
> >> >>
> >> >> It might be in the reworked costing, yeah. But then I'd expect those
> >> >> function to show in the perf profile.
> >> >
> >> >Right. I'm just grasping at straws on that.
> >> >
> >> >> >> (2) might have impact, but I don't see any immediate supects. Did you
> >> >> >> have some functions in mind?
> >> >> >
> >> >> >I guess this is where the lines blur: I didn't see anything obvious
> >> >> >either, but the changes to sort costing...should probably not have
> >> >> >real impact...but...
> >> >> >
> >> >>
> >> >> :-(
> >> >>
> >> >> >> BTW I see the patch adds pathkeys_common but it's not called from
> >> >> >> anywhere. It's probably leftover from an earlier patch version.
> >> >> >>
> >> >
> >> >BTW, I think I'm going to rename the pathkeys_common_contained_in
> >> >function to something like pathkeys_count_contained_in, unless you
> >> >have an objection to that. The name doesn't seem obvious at all to me.
> >> >
> >>
> >> WFM
> >>
> >> >> >> >There's not anything obvious in point (2) that would be a big cost,
> >> >> >> >but there are definitely changes there. I was surprised that just
> >> >> >> >eliminating the loop through the pathkeys on the query and the index
> >> >> >> >was enough to save us ~4%.
> >> >> >> >
> >> >> >> >Tomas: Earlier you'd wondered about if we should try to shortcut the
> >> >> >> >changes in costing...I was skeptical of that originally, but maybe
> >> >> >> >it's worth looking into? I'm going to try backing that out and see
> >> >> >> >what the numbers look like.
> >> >> >> >
> >> >> >
> >> >> >BTW, I did this test, and it looks like we can get back something
> >> >> >close to 1% by reverting that initial fix on partial path costing. But
> >> >> >we can't get rid of it all the time, at the very least. *Maybe* we
> >> >> >could condition it on incremental sort, but I'm not convinced that's
> >> >> >the only place it's needed as a fix.
> >> >> >
> >> >>
> >> >> Sounds interesting. I actually tried how much the add_partial_path
> >> >> change accounts for, and you're right it was quite a bit. But I forgot
> >> >> about that when investigating the rest.
> >> >>
> >> >> I wonder how large would the regression be without add_partial_path and
> >> >> with the fix in pathkeys_common_contained_in.
> >> >>
> >> >> I'm not sure how much we want to make add_partial_path() dependent on
> >> >> particular GUCs, but I guess if it gets rid of the regression, allows us
> >> >> to commit incremental sort and we can reasonably justify that only
> >> >> incremental sort needs those paths, it might be acceptable.
> >> >
> >> >That's a good point.
> >> >
> >> >> >> I've described the idea about something like initial_cost_nestloop and
> >> >> >> so on. But I'm a bit skeptical about it, considering that the GUC only
> >> >> >> has limited effect.
> >> >> >>
> >> >> >>
> >> >> >> A somewhat note is that the number of indexes has pretty significant
> >> >> >> impact on planning time, even on master. This is timing of the same
> >> >> >> explain script (similar to the one shown before) with different number
> >> >> >> of indexes on master:
> >> >> >>
> >> >> >>      0 indexes        7 indexes       49 indexes
> >> >> >>      -------------------------------------------
> >> >> >>          10.85            12.56            27.83
> >> >> >>
> >> >> >> The way I look at incremental sort is that it allows using indexes for
> >> >> >> queries that couldn't use it before, possibly requiring a separate
> >> >> >> index. So incremental sort might easily reduce the number of indexes
> >> >> >> needed, compensating for the overhead we're discussing here. Of course,
> >> >> >> that may or may not be true.
> >> >> >
> >> >> >One small idea, but I'm not yet sure it helps us a whole lot: if the
> >> >> >query pathkeys is only length 1, then we could skip the additional
> >> >> >path creation.
> >> >> >
> >> >>
> >> >> I don't follow. Why would we create incremental sort in this case at
> >> >> all? With single-element query_pathkeys the path is either unsorted or
> >> >> fully sorted - there's no room for incremental sort. No?
> >> >
> >> >Well, we shouldn't, that's what I'm getting. But I didn't see anything
> >> >in the code now that explicitly excludes that case when decided
> >> >whether or not to create an incremental sort path, unless I'm missing
> >> >something obvious.
> >>
> >> Well, my point is that create_ordered_paths() looks like this:
> >>
> >>      is_sorted = pathkeys_common_contained_in(root->sort_patkeys, ...);
> >>
> >>      if (is_sorted)
> >>      {
> >>          ... old code
> >>      }
> >>      else
> >>      {
> >>          if (input_path == cheapest_input_path)
> >>          {
> >>              ... old code
> >>          }
> >>
> >>          /* With incremental sort disabled, don't build those paths. */
> >>          if (!enable_incrementalsort)
> >>              continue;
> >>
> >>          /* Likewise, if the path can't be used for incremental sort. */
> >>          if (!presorted_keys)
> >>              continue;
> >>
> >>          ... incremental sort path
> >>      }
> >>
> >> Now, with single-item sort_pathkeys, the input path can't be partially
> >> sorted. It's either fully sorted - in which case it's handled by the
> >> first branch. Or it's not sorted at all, so presorted_keys==0 and we
> >> never get to the incremental path.
> >>
> >> Or did you mean to use the optimization somewhere else?
> >
> >Hmm, yes, I didn't think through that properly. I'll have to look at
> >the other cases to confirm the same logic applies there.
> >
> >One other thing:in the code above we create the regular sort path
> >inside of `if (input_path == cheapest_input_path)`, but incremental
> >sort is outside of that condition. I'm not sure I'm remembering why
> >that was, and it's not obvious to me reading it right now (though it's
> >getting late here, so maybe I'm just not thinking clearly). Do you
> >happen to remember why that is?
> >
>
> It's because for the regular sort, the path is either already sorted or
> it requires a full sort. But full sort only makes sense on the cheapest
> path, because we assume the additional sort cost is independent of the
> input cost, essentially
>
>     cost(path + Sort) = cost(path) + cost(Sort)
>
> and it's always
>
>      cost(path) + cost(Sort) >= cost(cheapest path) + cost(Sort)
>
> and by checking for cheapest path we simply skip building all the paths
> that we'd end up discarding anyway.
>
> With incremental sort we can't do this, the cost of the incremental sort
> depends on how well presorted is the input path.
>
> >I've included the optimization on the add_partial_path fix and I now
> >have numbers (for your test, slightly modified in how I execute it)
> >like:
> >
> >branch: 0.8354718927735362
> >master: 0.8128127066707269
> >
> >Which is a 2.7% regression (with enable_incrementalsort off).
>
> Can you try a more realistic benchmark, not this focused on the planner
> part? Something like a read-only pgbench with a fairly small data set
> and a single client, or something like that?

A default select-only pgbench run for 60s got me to 99.93% of master's
speed on the branch.

I've attached my current updates (with the optimization in add_partial_path).

To add some weight to the "stuff beyond the patch's control" theory:
I'm pretty sure I've gotten repeated ~1% differences with the included
v49-0004-ignore-single-key-orderings.patch, even though that shouldn't
change anything, *both* because enable_incremental_sort is off *and*
because logically it shouldn't be needed (though I still haven't
confirmed that in all cases)... so that's interesting. I'm not suggesting
we include the patch, but I wanted you to at least see it.

I can look at some more pgbench stuff tomorrow, but for now I'm
signing off for the night.

James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Mar 31, 2020 at 11:07 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Tue, Mar 31, 2020 at 10:44 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > On Tue, Mar 31, 2020 at 10:12:29PM -0400, James Coleman wrote:
> > >On Tue, Mar 31, 2020 at 9:59 PM Tomas Vondra
> > ><tomas.vondra@2ndquadrant.com> wrote:
> > >>
> > >> On Tue, Mar 31, 2020 at 08:42:47PM -0400, James Coleman wrote:
> > >> >On Tue, Mar 31, 2020 at 8:38 PM Tomas Vondra
> > >> ><tomas.vondra@2ndquadrant.com> wrote:
> > >> >>
> > >> >> On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:
> > >> >> >On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
> > >> >> ><tomas.vondra@2ndquadrant.com> wrote:
...
> > >> >> >One small idea, but I'm not yet sure it helps us a whole lot: if the
> > >> >> >query pathkeys is only length 1, then we could skip the additional
> > >> >> >path creation.
> > >> >> >
> > >> >>
> > >> >> I don't follow. Why would we create incremental sort in this case at
> > >> >> all? With single-element query_pathkeys the path is either unsorted or
> > >> >> fully sorted - there's no room for incremental sort. No?
> > >> >
> > >> >Well, we shouldn't, that's what I'm getting. But I didn't see anything
> > >> >in the code now that explicitly excludes that case when decided
> > >> >whether or not to create an incremental sort path, unless I'm missing
> > >> >something obvious.
> > >>
> > >> Well, my point is that create_ordered_paths() looks like this:
> > >>
> > >>      is_sorted = pathkeys_common_contained_in(root->sort_patkeys, ...);
> > >>
> > >>      if (is_sorted)
> > >>      {
> > >>          ... old code
> > >>      }
> > >>      else
> > >>      {
> > >>          if (input_path == cheapest_input_path)
> > >>          {
> > >>              ... old code
> > >>          }
> > >>
> > >>          /* With incremental sort disabled, don't build those paths. */
> > >>          if (!enable_incrementalsort)
> > >>              continue;
> > >>
> > >>          /* Likewise, if the path can't be used for incremental sort. */
> > >>          if (!presorted_keys)
> > >>              continue;
> > >>
> > >>          ... incremental sort path
> > >>      }
> > >>
> > >> Now, with single-item sort_pathkeys, the input path can't be partially
> > >> sorted. It's either fully sorted - in which case it's handled by the
> > >> first branch. Or it's not sorted at all, so presorted_keys==0 and we
> > >> never get to the incremental path.
> > >>
> > >> Or did you mean to use the optimization somewhere else?
> > >
> > >Hmm, yes, I didn't think through that properly. I'll have to look at
> > >the other cases to confirm the same logic applies there.

I looked through this more carefully, and I did end up finding a few
places where we can skip iterating through a list of paths entirely
with this check, so I added it there. I also cleaned up some comments,
added comments and asserts to the other places where
list_length(pathkeys) should be guaranteed to be > 1, removed a few
asserts I found unnecessary, and merged duplicative
pathkeys_[count_]contained_in calls.
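
As a purely illustrative sketch of the kind of early exit I mean (not
the actual patch code; the function and variable names here are made
up), the shape is just "bail out before walking the path list when the
query orders by fewer than two keys":

    #include <stdio.h>

    /*
     * Toy model of the check: with fewer than two sort keys requested,
     * no input can be "partially sorted", so the whole loop over
     * candidate paths can be skipped up front.
     */
    static void
    consider_incremental_sort_paths(int num_query_pathkeys, int num_paths)
    {
        if (num_query_pathkeys < 2)
        {
            printf("skipping all %d candidate paths\n", num_paths);
            return;
        }

        for (int i = 0; i < num_paths; i++)
            printf("considering incremental sort on path %d\n", i);
    }

    int
    main(void)
    {
        consider_incremental_sort_paths(1, 5);  /* ORDER BY a       -> skip  */
        consider_incremental_sort_paths(3, 5);  /* ORDER BY a, b, c -> check */
        return 0;
    }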

> > >One other thing:in the code above we create the regular sort path
> > >inside of `if (input_path == cheapest_input_path)`, but incremental
> > >sort is outside of that condition. I'm not sure I'm remembering why
> > >that was, and it's not obvious to me reading it right now (though it's
> > >getting late here, so maybe I'm just not thinking clearly). Do you
> > >happen to remember why that is?
> > >
> >
> > It's because for the regular sort, the path is either already sorted or
> > it requires a full sort. But full sort only makes sense on the cheapest
> > path, because we assume the additional sort cost is independent of the
> > input cost, essentially
> >
> >     cost(path + Sort) = cost(path) + cost(Sort)
> >
> > and it's always
> >
> >      cost(path) + cost(Sort) >= cost(cheapest path) + cost(Sort)
> >
> > and by checking for cheapest path we simply skip building all the paths
> > that we'd end up discarding anyway.
> >
> > With incremental sort we can't do this, the cost of the incremental sort
> > depends on how well presorted is the input path.

Thanks for the explanation. I've added a comment to that effect.

James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Wed, Apr 01, 2020 at 09:05:27AM -0400, James Coleman wrote:
>On Tue, Mar 31, 2020 at 11:07 PM James Coleman <jtc331@gmail.com> wrote:
>>
>> On Tue, Mar 31, 2020 at 10:44 PM Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>> >
>> > On Tue, Mar 31, 2020 at 10:12:29PM -0400, James Coleman wrote:
>> > >On Tue, Mar 31, 2020 at 9:59 PM Tomas Vondra
>> > ><tomas.vondra@2ndquadrant.com> wrote:
>> > >>
>> > >> On Tue, Mar 31, 2020 at 08:42:47PM -0400, James Coleman wrote:
>> > >> >On Tue, Mar 31, 2020 at 8:38 PM Tomas Vondra
>> > >> ><tomas.vondra@2ndquadrant.com> wrote:
>> > >> >>
>> > >> >> On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:
>> > >> >> >On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
>> > >> >> ><tomas.vondra@2ndquadrant.com> wrote:
>...
>> > >> >> >One small idea, but I'm not yet sure it helps us a whole lot: if the
>> > >> >> >query pathkeys is only length 1, then we could skip the additional
>> > >> >> >path creation.
>> > >> >> >
>> > >> >>
>> > >> >> I don't follow. Why would we create incremental sort in this case at
>> > >> >> all? With single-element query_pathkeys the path is either unsorted or
>> > >> >> fully sorted - there's no room for incremental sort. No?
>> > >> >
>> > >> >Well, we shouldn't, that's what I'm getting. But I didn't see anything
>> > >> >in the code now that explicitly excludes that case when decided
>> > >> >whether or not to create an incremental sort path, unless I'm missing
>> > >> >something obvious.
>> > >>
>> > >> Well, my point is that create_ordered_paths() looks like this:
>> > >>
>> > >>      is_sorted = pathkeys_common_contained_in(root->sort_patkeys, ...);
>> > >>
>> > >>      if (is_sorted)
>> > >>      {
>> > >>          ... old code
>> > >>      }
>> > >>      else
>> > >>      {
>> > >>          if (input_path == cheapest_input_path)
>> > >>          {
>> > >>              ... old code
>> > >>          }
>> > >>
>> > >>          /* With incremental sort disabled, don't build those paths. */
>> > >>          if (!enable_incrementalsort)
>> > >>              continue;
>> > >>
>> > >>          /* Likewise, if the path can't be used for incremental sort. */
>> > >>          if (!presorted_keys)
>> > >>              continue;
>> > >>
>> > >>          ... incremental sort path
>> > >>      }
>> > >>
>> > >> Now, with single-item sort_pathkeys, the input path can't be partially
>> > >> sorted. It's either fully sorted - in which case it's handled by the
>> > >> first branch. Or it's not sorted at all, so presorted_keys==0 and we
>> > >> never get to the incremental path.
>> > >>
>> > >> Or did you mean to use the optimization somewhere else?
>> > >
>> > >Hmm, yes, I didn't think through that properly. I'll have to look at
>> > >the other cases to confirm the same logic applies there.
>
>I looked through this more carefully, and I did end up finding a few
>places where we can skip iterating through a list of paths entirely
>with this check, so I added it there. I also cleaned up some comments,
>added comments and asserts to the other places where
>list_length(pathkeys) should be guaranteed to be > 1, removed a few
>asserts I found unnecessary, and merged duplicative
>pathkeys_[count_]_contained_in calls.
>

OK

>> > >One other thing:in the code above we create the regular sort path
>> > >inside of `if (input_path == cheapest_input_path)`, but incremental
>> > >sort is outside of that condition. I'm not sure I'm remembering why
>> > >that was, and it's not obvious to me reading it right now (though it's
>> > >getting late here, so maybe I'm just not thinking clearly). Do you
>> > >happen to remember why that is?
>> > >
>> >
>> > It's because for the regular sort, the path is either already sorted or
>> > it requires a full sort. But full sort only makes sense on the cheapest
>> > path, because we assume the additional sort cost is independent of the
>> > input cost, essentially
>> >
>> >     cost(path + Sort) = cost(path) + cost(Sort)
>> >
>> > and it's always
>> >
>> >      cost(path) + cost(Sort) >= cost(cheapest path) + cost(Sort)
>> >
>> > and by checking for cheapest path we simply skip building all the paths
>> > that we'd end up discarding anyway.
>> >
>> > With incremental sort we can't do this, the cost of the incremental sort
>> > depends on how well presorted is the input path.
>
>Thanks for the explanation. I've added a comment to that effect.
>

Thanks.

I've realized the way get_useful_pathkeys_for_relation() is coded kinda
works against the fastpath we added when comparing pathkeys. That fast
path depends on comparing pointers to the list, but we've been building
new lists (and then returning those), which defeats the optimization.
Attached is a patch that returns the original list in most cases (and
only creates a copy when really necessary). This might also save a few
cycles on building the new list, of course.
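
To make the pointer-fastpath point concrete, here is a small standalone
toy (not the actual PostgreSQL list API; the types and names are made
up) showing why handing back the caller's own list keeps an "a == b"
shortcut effective, while returning an equal copy forces the slow
element-by-element comparison:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct ToyKeyList
    {
        int nkeys;
        int keys[8];
    } ToyKeyList;

    /* comparison with a pointer fast path, like the pathkeys shortcut */
    static int
    keylists_equal(const ToyKeyList *a, const ToyKeyList *b)
    {
        if (a == b)
            return 1;           /* fast path: same list object */
        if (a->nkeys != b->nkeys)
            return 0;
        return memcmp(a->keys, b->keys, sizeof(int) * a->nkeys) == 0;
    }

    /* return the caller's list unchanged unless we really must trim it */
    static const ToyKeyList *
    useful_keys(const ToyKeyList *query_keys, int max_useful)
    {
        ToyKeyList *copy;

        if (query_keys->nkeys <= max_useful)
            return query_keys;  /* no copy, so the fast path still fires */

        copy = malloc(sizeof(ToyKeyList));
        copy->nkeys = max_useful;
        memcpy(copy->keys, query_keys->keys, sizeof(int) * max_useful);
        return copy;            /* copy only when we really have to */
    }

    int
    main(void)
    {
        ToyKeyList  query = {2, {1, 2}};
        const ToyKeyList *useful = useful_keys(&query, 4);

        printf("same object: %s, equal: %s\n",
               useful == &query ? "yes" : "no",
               keylists_equal(useful, &query) ? "yes" : "no");
        return 0;
    }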

I've done a bunch of read-only pgbench tests with fairly small scales (1
and 10), first with the built-in read-only transaction and then with a
simple custom query doing an order-by. I did this both on the default
schema and with a bunch of extra indexes. The script I used to run this
is attached, along with a summary of results.

There are results for master and the v40 and v50 patches (the v50 also
includes the extra patch fixing get_useful_pathkeys_for_relation).

Overall, I'm happy with those results - v50 seems to be within 1% of
master, in both directions. This very much seems like noise.

I still want to do a bit more review of the costing / tuplesort changes,
which I plan to do tomorrow. If that goes well, I plan to start
committing this. So if you think this is not ready or needs more time
for review, please let me know. I'm not yet sure if I'll commit this as
a single change or in three separate commits.

James, can you review the proposed extra fix and merge the fixes into
the main patches?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Wed, Apr 1, 2020 at 5:42 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> ...
> I've realized the way get_useful_pathkeys_for_relation() is coded kinda
> works against the fastpath we added when comparing pathkeys. That
> depends on comparing pointers to the list, but we've been building new
> lists (and then returned those) which defeats the optimization. Attached
> is a patch that returns the original list in most cases (and only
> creates a copy when really necessary). This might also save a few cycles
> on bulding the new list, of course.
>
> I've done a bunch of read-only pgbench tests with fairly small scales (1
> and 10). First with the built-in read-only transaction, and also with a
> simple custom query doing an order-by. And I did this both on the
> default schema and with a bunch of extra indexes. The script I used to
> run this is attached, along with a summary of results.
>
> There are results for master and v40 and v50 patches (the v50 also
> includes the extra patch fixing get_useful_pathkeys_for_relation).
>
> Overall, I'm happy with those results - the v50 seems to be within 1% of
> master, in both directions. This very much seems like a noise.
>
> I still want to do a bit more review of the costing / tuplesort changes,
> which I plan to do tomorrow. If that goes well, I plan to start
> committing this. So please if you think this is not ready or wants more

I think we need to either implement this or remove the comment:
* XXX I wonder if we need to consider adding a projection here, as
* create_ordered_paths does.
in generate_useful_gather_paths().

In the same function we have the following code:
/*
 * When the partial path is already sorted, we can just add a gather
 * merge on top, and we're done - no point in adding explicit sort.
 *
 * XXX Can't we skip this (maybe only for the cheapest partial path)
 * when the path is already sorted? Then it's likely duplicate with
 * the path created by generate_gather_paths.
 */
if (is_sorted)
{
        path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
                                        subpath->pathkeys, NULL, rowsp);

        add_path(rel, &path->path);
        continue;
}

looking at the relevant loop in generate_gather_paths:
/*
 * For each useful ordering, we can consider an order-preserving Gather
 * Merge.
 */
foreach(lc, rel->partial_pathlist)
{
        Path       *subpath = (Path *) lfirst(lc);
        GatherMergePath *path;

        if (subpath->pathkeys == NIL)
                continue;

        rows = subpath->rows * subpath->parallel_workers;
        path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
                                        subpath->pathkeys, NULL, rowsp);
        add_path(rel, &path->path);
}

I believe we can eliminate the block entirely in
generate_useful_gather_paths(). Here's my reasoning: all paths for
which is_sorted is true must necessarily have pathkeys, and since we
already add a gather merge for every subpath with pathkeys, we've
already added gather merge paths for all of these.

I've included a patch to change this, but let me know if the reasoning
isn't sound.

We can also remove the XXX on this comment (in the same function):
* XXX This is not redundant with the gather merge path created in
* generate_gather_paths, because that merely preserves ordering of
* the cheapest partial path, while here we add an explicit sort to
* get match the useful ordering.

because of this code in generate_gather_paths():
cheapest_partial_path = linitial(rel->partial_pathlist);
rows =
        cheapest_partial_path->rows * cheapest_partial_path->parallel_workers;
simple_gather_path = (Path *)
        create_gather_path(root, rel, cheapest_partial_path, rel->reltarget,
                                           NULL, rowsp);
add_path(rel, simple_gather_path);

but we can clean up the comment a bit: fix the grammar issue in the
last line and fix the reference to gather merge path (it's a gather
path).

I've included that in the same patch.

I also noticed that in create_incremental_sort_path we have this:
/* XXX comparison_cost shouldn't be 0? */
but I guess that's part of what you're reviewing tomorrow.

> time for a review, let me know. I'm not yet sure if I'll commit this as
> a single change, or in three separate commits.

I don't love the idea of committing it as a single patch, but I think
at least the first two probably go together. Otherwise we'd be
introducing a "fix" with no proven impact that will slow down planning
(even if only in a small way), only to then condition it on a
GUC in the next commit.

But I think you could potentially make an argument for keeping the
additional paths separate...but it's not absolutely necessary IMO.

> James, can you review the proposed extra fix and merge the fixes into
> the main patches?

I've reviewed it, and it looks correct, so merged into the main series.

Summary:
The attached series includes a couple of XXX fixes or comment cleanup
as noted above. I believe there are two more XXXs that need to be
answered before we merge ("do we need to consider adding a projection"
and "what is the comparison cost for incremental sort").

James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Wed, Apr 01, 2020 at 10:09:20PM -0400, James Coleman wrote:
>On Wed, Apr 1, 2020 at 5:42 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> ...
>> I've realized the way get_useful_pathkeys_for_relation() is coded kinda
>> works against the fastpath we added when comparing pathkeys. That
>> depends on comparing pointers to the list, but we've been building new
>> lists (and then returned those) which defeats the optimization. Attached
>> is a patch that returns the original list in most cases (and only
>> creates a copy when really necessary). This might also save a few cycles
>> on bulding the new list, of course.
>>
>> I've done a bunch of read-only pgbench tests with fairly small scales (1
>> and 10). First with the built-in read-only transaction, and also with a
>> simple custom query doing an order-by. And I did this both on the
>> default schema and with a bunch of extra indexes. The script I used to
>> run this is attached, along with a summary of results.
>>
>> There are results for master and v40 and v50 patches (the v50 also
>> includes the extra patch fixing get_useful_pathkeys_for_relation).
>>
>> Overall, I'm happy with those results - the v50 seems to be within 1% of
>> master, in both directions. This very much seems like a noise.
>>
>> I still want to do a bit more review of the costing / tuplesort changes,
>> which I plan to do tomorrow. If that goes well, I plan to start
>> committing this. So please if you think this is not ready or wants more
>
>I think we need to either implement this or remove the comment:
>* XXX I wonder if we need to consider adding a projection here, as
>* create_ordered_paths does.
>in generate_useful_gather_paths().
>

Yeah. I think we don't need the projection here. My reasoning is that if
we don't need it in generate_gather_paths(), we don't need it here.

>In the same function we have the following code:
>/*
> * When the partial path is already sorted, we can just add a gather
> * merge on top, and we're done - no point in adding explicit sort.
> *
> * XXX Can't we skip this (maybe only for the cheapest partial path)
> * when the path is already sorted? Then it's likely duplicate with
> * the path created by generate_gather_paths.
> */
>if (is_sorted)
>{
>        path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
>
> subpath->pathkeys, NULL, rowsp);
>
>        add_path(rel, &path->path);
>        continue;
>}
>
>looking at the relevant loop in generate_gather_paths:
>/*
> * For each useful ordering, we can consider an order-preserving Gather
> * Merge.
> */
>foreach(lc, rel->partial_pathlist)
>{
>        Path       *subpath = (Path *) lfirst(lc);
>        GatherMergePath *path;
>
>        if (subpath->pathkeys == NIL)
>                continue;
>
>        rows = subpath->rows * subpath->parallel_workers;
>        path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
>
> subpath->pathkeys, NULL, rowsp);
>        add_path(rel, &path->path);
>}
>
>I believe we can eliminate the block entirely in
>generate_useful_gather_paths(). Here's my reasoning: all paths for
>which is_sorted is true must necessarily have pathkeys, and since we
>already add a gather merge for every subpath with pathkeys, we've
>already added gather merge paths for all of these.
>
>I've included a patch to change this, but let me know if the reasoning
>isn't sound.
>

Good catch! I think you're correct - we don't need to generate this
path, and we can just skip that partial path entirely.
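
To make that concrete, the shape I have in mind for that spot in
generate_useful_gather_paths() is roughly this (just a sketch of the idea,
not the exact patch hunk):

    /*
     * A partial path that already satisfies useful_pathkeys necessarily
     * has pathkeys, so generate_gather_paths() has already added the
     * order-preserving Gather Merge on top of it - nothing more to do
     * for this partial path here.
     */
    if (is_sorted)
        continue;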

>We can also remove the XXX on this comment (in the same function):
>* XXX This is not redundant with the gather merge path created in
>* generate_gather_paths, because that merely preserves ordering of
>* the cheapest partial path, while here we add an explicit sort to
>* get match the useful ordering.
>
>because of this code in generate_gather_paths():
>cheapest_partial_path = linitial(rel->partial_pathlist);
>rows =
>        cheapest_partial_path->rows * cheapest_partial_path->parallel_workers;
>simple_gather_path = (Path *)
>        create_gather_path(root, rel, cheapest_partial_path, rel->reltarget,
>                                           NULL, rowsp);
>add_path(rel, simple_gather_path);
>
>but we can cleanup the comment a bit: fix the grammar issue in the
>last line and fix the reference to gather merge path (it's a gather
>path).
>
>I've included that in the same patch.
>

OK, makes sense.

>I also noticed that in create_incremental_sort_path we have this:
>/* XXX comparison_cost shouldn't be 0? */
>but I guess that's part of what you're reviewing tomorrow.
>

Right, one of the bits.

>> time for a review, let me know. I'm not yet sure if I'll commit this as
>> a single change, or in three separate commits.
>
>I don't love the idea of committing it as a single patch, but at least
>the first two I think probably go together. Otherwise we're
>introducing a "fix" with no proven impact that will slow down planning
>(even if only in a small way) only to then condition it on a
>GUC in the next commit.
>
>But I think you could potentially make an argument for keeping the
>additional paths separate...but it's not absolutely necessary IMO.
>

OK. I've been actually wondering whether to move the add_partial_path
after the main patch, for exactly this reason.

>> James, can you review the proposed extra fix and merge the fixes into
>> the main patches?
>
>I've reviewed it, and it looks correct, so merged into the main series.
>
>Summary:
>The attached series includes a couple of XXX fixes or comment cleanup
>as noted above. I believe there are two more XXXs that need to be
>answered before we merge ("do we need to consider adding a projection"
>and "what is the comparison cost for incremental sort").
>

Thanks!


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Wed, Apr 1, 2020 at 10:47 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Wed, Apr 01, 2020 at 10:09:20PM -0400, James Coleman wrote:
> >On Wed, Apr 1, 2020 at 5:42 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >> ...
> >> I've realized the way get_useful_pathkeys_for_relation() is coded kinda
> >> works against the fastpath we added when comparing pathkeys. That
> >> depends on comparing pointers to the list, but we've been building new
> >> lists (and then returned those) which defeats the optimization. Attached
> >> is a patch that returns the original list in most cases (and only
> >> creates a copy when really necessary). This might also save a few cycles
> >> on bulding the new list, of course.
> >>
> >> I've done a bunch of read-only pgbench tests with fairly small scales (1
> >> and 10). First with the built-in read-only transaction, and also with a
> >> simple custom query doing an order-by. And I did this both on the
> >> default schema and with a bunch of extra indexes. The script I used to
> >> run this is attached, along with a summary of results.
> >>
> >> There are results for master and v40 and v50 patches (the v50 also
> >> includes the extra patch fixing get_useful_pathkeys_for_relation).
> >>
> >> Overall, I'm happy with those results - the v50 seems to be within 1% of
> >> master, in both directions. This very much seems like a noise.
> >>
> >> I still want to do a bit more review of the costing / tuplesort changes,
> >> which I plan to do tomorrow. If that goes well, I plan to start
> >> committing this. So please if you think this is not ready or wants more
> >
> >I think we need to either implement this or remove the comment:
> >* XXX I wonder if we need to consider adding a projection here, as
> >* create_ordered_paths does.
> >in generate_useful_gather_paths().
> >
>
> Yeah. I think we don't need the projection here. My reasoning is that if
> we don't need it in generate_gather_paths(), we don't need it here.

All right, then I'm removing the comment in the attached series.

> >In the same function we have the following code:
> >/*
> > * When the partial path is already sorted, we can just add a gather
> > * merge on top, and we're done - no point in adding explicit sort.
> > *
> > * XXX Can't we skip this (maybe only for the cheapest partial path)
> > * when the path is already sorted? Then it's likely duplicate with
> > * the path created by generate_gather_paths.
> > */
> >if (is_sorted)
> >{
> >        path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
> >
> > subpath->pathkeys, NULL, rowsp);
> >
> >        add_path(rel, &path->path);
> >        continue;
> >}
> >
> >looking at the relevant loop in generate_gather_paths:
> >/*
> > * For each useful ordering, we can consider an order-preserving Gather
> > * Merge.
> > */
> >foreach(lc, rel->partial_pathlist)
> >{
> >        Path       *subpath = (Path *) lfirst(lc);
> >        GatherMergePath *path;
> >
> >        if (subpath->pathkeys == NIL)
> >                continue;
> >
> >        rows = subpath->rows * subpath->parallel_workers;
> >        path = create_gather_merge_path(root, rel, subpath, rel->reltarget,
> >
> > subpath->pathkeys, NULL, rowsp);
> >        add_path(rel, &path->path);
> >}
> >
> >I believe we can eliminate the block entirely in
> >generate_useful_gather_paths(). Here's my reasoning: all paths for
> >which is_sorted is true must necessarily have pathkeys, and since we
> >already add a gather merge for every subpath with pathkeys, we've
> >already added gather merge paths for all of these.
> >
> >I've included a patch to change this, but let me know if the reasoning
> >isn't sound.
> >
>
> Good catch! I think you're correct - we don't need to generate this
> path, and we can just skip that partial path entirely.

The attached patch series merges the above fixes into the main patches.

> >We can also remove the XXX on this comment (in the same function):
> >* XXX This is not redundant with the gather merge path created in
> >* generate_gather_paths, because that merely preserves ordering of
> >* the cheapest partial path, while here we add an explicit sort to
> >* get match the useful ordering.
> >
> >because of this code in generate_gather_paths():
> >cheapest_partial_path = linitial(rel->partial_pathlist);
> >rows =
> >        cheapest_partial_path->rows * cheapest_partial_path->parallel_workers;
> >simple_gather_path = (Path *)
> >        create_gather_path(root, rel, cheapest_partial_path, rel->reltarget,
> >                                           NULL, rowsp);
> >add_path(rel, simple_gather_path);
> >
> >but we can cleanup the comment a bit: fix the grammar issue in the
> >last line and fix the reference to gather merge path (it's a gather
> >path).
> >
> >I've included that in the same patch.
> >
>
> OK, makes sense.
>
> >I also noticed that in create_incremental_sort_path we have this:
> >/* XXX comparison_cost shouldn't be 0? */
> >but I guess that's part of what you're reviewing tomorrow.
> >
>
> Right, one of the bits.
>
> >> time for a review, let me know. I'm not yet sure if I'll commit this as
> >> a single change, or in three separate commits.
> >
> >I don't love the idea of committing it as a single patch, but at least
> >the first two I think probably go together. Otherwise we're
> >introducing a "fix" with no proven impact that will slow down planning
> >(even if only in a small way) only to then condition it on a
> >GUC in the next commit.
> >
> >But I think you could potentially make an argument for keeping the
> >additional paths separate...but it's not absolutely necessary IMO.
> >
>
> OK. I've been actually wondering whether to move the add_partial_path
> after the main patch, for exactly this reason.
>
> >> James, can you review the proposed extra fix and merge the fixes into
> >> the main patches?
> >
> >I've reviewed it, and it looks correct, so merged into the main series.
> >
> >Summary:
> >The attached series includes a couple of XXX fixes or comment cleanup
> >as noted above. I believe there are two more XXXs that need to be
> >answered before we merge ("do we need to consider adding a projection"
> >and "what is the comparison cost for incremental sort").
> >
>
> Thanks!

One other thing: the only "real" XXX I see left is in create_ordered_paths():
* XXX This is probably duplicate with the paths we already generate
* in generate_useful_gather_paths in apply_scanjoin_target_to_paths.

It's not as big of a deal as the others (e.g., projection one could
have been a bug), and it's a lot harder for me to determine if it's
actually duplicative. So we could just leave it in, we could just
remove it, or, perhaps you have some thoughts on determining if it's
true or not.

Thanks!
James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
Hi,

Thanks, the v52 looks almost ready. I've been looking at the two or
three things I mentioned, and I have a couple of comments.


1) /* XXX comparison_cost shouldn't be 0? */

I'm not worried about this, because this is not really introduced by
this patch - create_sort_path has the same comment/issue, so I think
it's acceptable to do the same thing for incremental sort.


2) INITIAL_MEMTUPSIZE

tuplesort.c does this:

   #define INITIAL_MEMTUPSIZE Max(1024, \
       ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)

supposedly to keep the array size within ALLOCSET_SEPARATE_THRESHOLD.
But I think it fails to do that, for a couple of reasons.

Firstly, ALLOCSET_SEPARATE_THRESHOLD is 8192, and SortTuple is 21B at
minimum (without padding), so with 1024 elements it's guaranteed to be
at least 21kB - so exceeding the threshold. The maximum value is
something like 256.

Secondly, the second part of the formula is guaranteed to get us over
the threshold anyway, thanks to the +1. Let's say SortTuple is 30B. Then

   ALLOCSET_SEPARATE_THRESHOLD / 30 = 273

but we end up using 274, resulting in 8220B array. :-(

So I guess the formula should be more like

   Min(128, ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple))

or something like that.

FWIW I think the whole hypothesis that selecting the array size below
ALLOCSET_SEPARATE_THRESHOLD reduces overhead is dubious. AFAIC we
allocate this only once (or very few times), and if we need to grow the
array we'll hit the threshold anyway.

I'd just pick a reasonable constant - 128 or 256 seems reasonable, 1024
may be OK too.
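
Just to spell out the arithmetic (a sketch of the problem and of the kind
of definition I mean, using the 30-byte SortTuple example from above):

    /*
     * Current definition:
     *
     *    Max(1024, 8192 / 30 + 1) = Max(1024, 274) = 1024
     *       -> 1024 * 30 = 30720 bytes, far above the 8192-byte threshold
     *
     * and even the right-hand operand alone overshoots, thanks to the +1:
     *
     *    274 * 30 = 8220 bytes > 8192
     *
     * Something that actually stays below the threshold would look more
     * like the following (or simply a small constant such as 128 or 256):
     */
    #define INITIAL_MEMTUPSIZE \
        Min(128, ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple))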


3) add_partial_path

I'm a bit torn about using enable_incrementalsort in add_partial_path. I
know we've done that to eliminate the overhead, but it just seems weird.
I wonder if that really saves us anything or if it was just noise. I'll
do more testing before making a decision.

While looking at add_path, I see the comparisons there are made in the
opposite order - we first compare costs, and only then pathkeys. That
seems like a more efficient way, I guess (cheaper float comparison
first, pathkey comparison with a loop second). I wonder why it's not
done like that in add_partial_path too? Perhaps this will make it cheap
enough to remove the enable_incrementalsort check.
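
For illustration, the "new path is dominated" part of the loop could be
restructured roughly like this - a hypothetical sketch only, reusing the
keyscmp/accept_new locals the function already has, with the cheap cost
test hoisted in front of compare_pathkeys():

    if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
    {
        /*
         * New path is clearly more expensive, so pay for the pathkey
         * comparison loop only now, and keep the new path only if it
         * provides strictly better ordering.
         */
        keyscmp = compare_pathkeys(new_path->pathkeys, old_path->pathkeys);
        if (keyscmp == PATHKEYS_EQUAL || keyscmp == PATHKEYS_BETTER2)
            accept_new = false;
    }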


4) add_partial_path_precheck

While looking at add_partial_path, I realized add_partial_path_precheck
probably needs the same change - to consider startup cost too. I think
the expectation is that add_partial_path_precheck only rejects paths
that we know would then get rejected by add_partial_path.

But now the behavior is inconsistent, which might lead to surprising
behavior (I haven't seen such cases, but I haven't looked for them).
At the moment the add_partial_path_precheck is called only from
joinpath.c, maybe it's not an issue there.

If it's OK to keep it like this, it'd probably deserve a comment why
the difference is OK. And the comments also contain the same claim that
parallel plans only need to look at total cost.
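
In other words, the precheck's rejection test would need to look roughly
like this (again just a sketch, and it assumes the startup cost gets passed
in as an extra parameter):

    /*
     * Reject the candidate only if an existing partial path with
     * compatible pathkeys beats it on both startup and total cost;
     * otherwise the precheck could throw away a path that
     * add_partial_path() itself would have kept for its cheaper startup.
     */
    if (keyscmp != PATHKEYS_DIFFERENT &&
        total_cost > old_path->total_cost * STD_FUZZ_FACTOR &&
        startup_cost > old_path->startup_cost * STD_FUZZ_FACTOR &&
        keyscmp != PATHKEYS_BETTER1)
        return false;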


5) Overall, I think the costing is OK. I'm sure we'll find cases that
will need improvements, but that's fine. However, we now have 

- cost_tuplesort (used to be cost_sort)
- cost_full_sort
- cost_incremental_sort
- cost_sort

I find it a bit confusing that we have cost_sort and cost_full_sort. Why
don't we just keep using the dummy path in label_sort_with_costsize?
That seems to be the only external caller outside costsize.c. Then we
could either make cost_full_sort static or get rid of it entirely.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Thu, Apr 2, 2020 at 8:20 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> Thanks, the v52 looks almost ready. I've been looking at the two or
> three things I mentioned, and I have a couple of comments.
>
>
> 1) /* XXX comparison_cost shouldn't be 0? */
>
> I'm not worried about this, because this is not really introduced by
> this patch - create_sort_path has the same comment/issue, so I think
> it's acceptable to do the same thing for incremental sort.

Sounds good.

> 2) INITIAL_MEMTUPSIZE
>
> tuplesort.c does this:
>
>    #define INITIAL_MEMTUPSIZE Max(1024, \
>        ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
>
> supposedly to keep the array size within ALLOCSET_SEPARATE_THRESHOLD.
> But I think it fails to do that, for a couple of reasons.
>
> Firstly, ALLOCSET_SEPARATE_THRESHOLD is 8192, and SortTuple is 21B at
> minimum (without padding), so with 1024 elements it's guaranteed to be
> at least 21kB - so exceeding the threshold. The maximum value is
> something like 256.
>
> Secondly, the second part of the formula is guaranteed to get us over
> the threshold anyway, thanks to the +1. Let's say SortTuple is 30B. Then
>
>    ALLOCSET_SEPARATE_THRESHOLD / 30 = 273
>
> but we end up using 274, resulting in 8220B array. :-(
>
> So I guess the formula should be more like
>
>    Min(128, ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple))
>
> or something like that.
>
> FWIW I think the whole hypothesis that selecting the array size below
> ALLOCSET_SEPARATE_THRESHOLD reduces overhead is dubious. AFAIC we
> allocate this only once (or very few times), and if we need to grow the
> array we'll hit the threshold anyway.
>
> I'd just pick a reasonable constant - 128 or 256 seems reasonable, 1024
> may be OK too.

That was a part of the patch I haven't touched since I inherited it,
and I didn't feel like I knew enough about the Postgres memory
management to make a determination on whether the reasoning made
sense.

So I'm happy to use a constant as suggested.

> 3) add_partial_path
>
> I'm a bit torn about using enable_incrementalsort in add_partial_path. I
> know we've done that to eliminate the overhead, but it just seems weird.
> I wonder if that really saves us anything or if it was just noise. I'll
> do more testing before making a decision.
>
> While looking at add_path, I see the comparisons there are made in the
> opposite order - we first compare costs, and only then pathkeys. That
> seems like a more efficient way, I guess (cheaper float comparison
> first, pathkey comparison with a loop second). I wonder why it's not
> done like that in add_partial_path too? Perhaps this will make it cheap
> enough to remove the enable_incrementalsort check.

I would love to avoid checking enable_incrementalsort there. I think
it's pretty gross to do so. But if it's the only way to keep the patch
acceptable, then obviously I'd leave it in. So I'm hopeful we can
avoid needing it.

> 4) add_partial_path_precheck
>
> While looking at add_partial_path, I realized add_partial_path_precheck
> probably needs the same change - to consider startup cost too. I think
> the expectation is that add_partial_path_precheck only rejects paths
> that we know would then get rejected by add_partial_path.
>
> But now the behavior is inconsistent, which might lead to surprising
> behavior (I haven't seen such cases, but I haven't looked for them).
> At the moment the add_partial_path_precheck is called only from
> joinpath.c, maybe it's not an issue there.
>
> If it's OK to keep it like this, it'd probably deserve a comment why
> the difference is OK. And the comments also contain the same claim that
> parallel plans only need to look at total cost.

I remember some discussion about that much earlier in this thread (or
in the spin-off thread [1] re: that specific change that didn't get a
lot of action), and I'm pretty sure the conclusion was that we should
change both. But I guess we never got around to doing it.

> 5) Overall, I think the costing is OK. I'm sure we'll find cases that
> will need improvements, but that's fine. However, we now have
>
> - cost_tuplesort (used to be cost_sort)
> - cost_full_sort
> - cost_incremental_sort
> - cost_sort
>
> I find it a bit confusing that we have cost_sort and cost_full_sort. Why
> don't we just keep using the dummy path in label_sort_with_costsize?
> That seems to be the only external caller outside costsize.c. Then we
> could either make cost_full_sort static or get rid of it entirely.

This is another area of the patch I haven't really modified.

James

[1]: https://www.postgresql.org/message-id/flat/CAAaqYe-5HmM4ih6FWp2RNV9rruunfrFrLhqFXF_nrrNCPy1Zhg%40mail.gmail.com



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Thu, Apr 2, 2020 at 8:46 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Thu, Apr 2, 2020 at 8:20 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > Hi,
> >
> > Thanks, the v52 looks almost ready. I've been looking at the two or
> > three things I mentioned, and I have a couple of comments.
> >
> >
> > 1) /* XXX comparison_cost shouldn't be 0? */
> >
> > I'm not worried about this, because this is not really introduced by
> > this patch - create_sort_path has the same comment/issue, so I think
> > it's acceptable to do the same thing for incremental sort.
>
> Sounds good.
>
> > 2) INITIAL_MEMTUPSIZE
> >
> > tuplesort.c does this:
> >
> >    #define INITIAL_MEMTUPSIZE Max(1024, \
> >        ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
> >
> > supposedly to keep the array size within ALLOCSET_SEPARATE_THRESHOLD.
> > But I think it fails to do that, for a couple of reasons.
> >
> > Firstly, ALLOCSET_SEPARATE_THRESHOLD is 8192, and SortTuple is 21B at
> > minimum (without padding), so with 1024 elements it's guaranteed to be
> > at least 21kB - so exceeding the threshold. The maximum value is
> > something like 256.
> >
> > Secondly, the second part of the formula is guaranteed to get us over
> > the threshold anyway, thanks to the +1. Let's say SortTuple is 30B. Then
> >
> >    ALLOCSET_SEPARATE_THRESHOLD / 30 = 273
> >
> > but we end up using 274, resulting in 8220B array. :-(
> >
> > So I guess the formula should be more like
> >
> >    Min(128, ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple))
> >
> > or something like that.
> >
> > FWIW I think the whole hypothesis that selecting the array size below
> > ALLOCSET_SEPARATE_THRESHOLD reduces overhead is dubious. AFAIC we
> > allocate this only once (or very few times), and if we need to grow the
> > array we'll hit the threshold anyway.
> >
> > I'd just pick a reasonable constant - 128 or 256 seems reasonable, 1024
> > may be OK too.
>
> That was a part of the patch I haven't touched since I inherited it,
> and I didn't feel like I knew enough about the Postgres memory
> management to make a determination on whether the reasoning made
> sense.
>
> So I'm happy to use a constant as suggested.

I take that back. That code hasn't changed, it's just moved. Here's
the current code in tuplesort_begin_common on master:

/*
* Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
* see comments in grow_memtuples().
*/
state->memtupsize = Max(1024,
                        ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);

I'm not sure we ought to change that in this patch...

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Thu, Apr 2, 2020 at 8:46 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Thu, Apr 2, 2020 at 8:20 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > ...
> > 5) Overall, I think the costing is OK. I'm sure we'll find cases that
> > will need improvements, but that's fine. However, we now have
> >
> > - cost_tuplesort (used to be cost_sort)
> > - cost_full_sort
> > - cost_incremental_sort
> > - cost_sort
> >
> > I find it a bit confusing that we have cost_sort and cost_full_sort. Why
> > don't we just keep using the dummy path in label_sort_with_costsize?
> > That seems to be the only external caller outside costsize.c. Then we
> > could either make cost_full_sort static or get rid of it entirely.
>
> This is another area of the patch I haven't really modified.

See attached for a cleanup of this; it removed cost_fullsort so
label_sort_with_costsize is back to how it was.

I've directly merged this into the patch series; if you'd like to see
the diff I can send that along.

James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Thu, Apr 02, 2020 at 09:40:45PM -0400, James Coleman wrote:
>On Thu, Apr 2, 2020 at 8:46 PM James Coleman <jtc331@gmail.com> wrote:
>>
>> On Thu, Apr 2, 2020 at 8:20 PM Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>> > ...
>> > 5) Overall, I think the costing is OK. I'm sure we'll find cases that
>> > will need improvements, but that's fine. However, we now have
>> >
>> > - cost_tuplesort (used to be cost_sort)
>> > - cost_full_sort
>> > - cost_incremental_sort
>> > - cost_sort
>> >
>> > I find it a bit confusing that we have cost_sort and cost_full_sort. Why
>> > don't we just keep using the dummy path in label_sort_with_costsize?
>> > That seems to be the only external caller outside costsize.c. Then we
>> > could either make cost_full_sort static or get rid of it entirely.
>>
>> This is another area of the patch I haven't really modified.
>
>See attached for a cleanup of this; it removed cost_fullsort so
>label_sort_with_costsize is back to how it was.
>
>I've directly merged this into the patch series; if you'd like to see
>the diff I can send that along.
>

Thanks. Attached is v54 of the patch, with some minor changes. The main
two changes are in add_partial_path_precheck(), firstly to also consider
startup_cost, as discussed before. The second change (in 0003) is a bit
of an experiment to make add_partial_precheck() cheaper by only calling
compare_pathkeys after checking the costs first (which should be cheaper
than the function call). add_path_precheck already does it in that order
anyway.

One thing I noticed is that compare_path_costs_fuzzily and add_path_precheck both
check consider_startup/consider_param_startup to decide whether to look
at startup_cost. add_partial_path_precheck probably should do that too.


Right now I'm running a battery of benchmarks to see if/how this affects
planner performance. Initially the results were rather noisy, but after
pinning the processes to cores (using taskset) and fixing the CPU frequency
(using cpupower) it's much better. The intermediate results seem pretty
fine (the results are within 0.5% of master, in both directions).
I'll share the final results.

Overall, I think this is pretty close to committable, and I'm planning
to get it committed on Monday unless someone objects.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Sun, Apr 05, 2020 at 03:01:10PM +0200, Tomas Vondra wrote:
>On Thu, Apr 02, 2020 at 09:40:45PM -0400, James Coleman wrote:
>>On Thu, Apr 2, 2020 at 8:46 PM James Coleman <jtc331@gmail.com> wrote:
>>>
>>>On Thu, Apr 2, 2020 at 8:20 PM Tomas Vondra
>>><tomas.vondra@2ndquadrant.com> wrote:
>>>> ...
>>>> 5) Overall, I think the costing is OK. I'm sure we'll find cases that
>>>> will need improvements, but that's fine. However, we now have
>>>>
>>>> - cost_tuplesort (used to be cost_sort)
>>>> - cost_full_sort
>>>> - cost_incremental_sort
>>>> - cost_sort
>>>>
>>>> I find it a bit confusing that we have cost_sort and cost_full_sort. Why
>>>> don't we just keep using the dummy path in label_sort_with_costsize?
>>>> That seems to be the only external caller outside costsize.c. Then we
>>>> could either make cost_full_sort static or get rid of it entirely.
>>>
>>>This is another area of the patch I haven't really modified.
>>
>>See attached for a cleanup of this; it removed cost_fullsort so
>>label_sort_with_costsize is back to how it was.
>>
>>I've directly merged this into the patch series; if you'd like to see
>>the diff I can send that along.
>>
>
>Thanks. Attached is v54 of the patch, with some minor changes. The main
>two changes are in add_partial_path_precheck(), firstly to also consider
>startup_cost, as discussed before. The second change (in 0003) is a bit
>of an experiment to make add_partial_path_precheck() cheaper by only calling
>compare_pathkeys after checking the costs first (which should be cheaper
>than the function call). add_path_precheck already does it in that order
>anyway.
>

Oh, I forgot to mention a change in add_partial_path - I've removed the
reference/dependency on enable_incrementalsort. It seemed rather ugly,
and the results without it seem fine (I'm benchmarking only the case
with incremental sort enabled anyway). I also plan to look at the other
optimization we bolted on last week, i.e. checking length of pathkeys.
I'll see if that actually makes measurable difference.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
Hi,

I've pushed the first part of this patch series - I've reorganized it a
bit by moving the add_partial_path changes to the end. That way I've
been able to add regression test demonstrating impact of the change on
plans involving incremental sort nodes (which wouldn't be possible when
committing the add_partial_path first). I'll wait a bit before pushing
the two additional parts, so that if something fails we know which bit
caused it.

I've been running extensive benchmarks with the aim to detect any
regressions caused by this patch, particularly during planning. Attached
is the script I've used and spreadsheet with results. The numbers show
throughput with different queries (SELECT, EXPLAIN and joins) with the
patches committed one by one. There's quite a bit of noise, even though
the script pins processes to cores and restricts CPU frequency. Overall,
I don't see any obvious regression - the numbers are generally within
0.5% of the master, and the behavior is the same even with -M prepared
which should not be subject to any planner overhead. I do have results
from another machine (2-socket Xeon) but the results are much more
noisy, although the general conclusions are about the same.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:
>Hi,
>
>I've pushed the first part of this patch series - I've reorganized it a
>bit by moving the add_partial_path changes to the end. That way I've
>been able to add regression test demonstrating impact of the change on
>plans involving incremental sort nodes (which wouldn't be possible when
>committing the add_partial_path first). I'll wait a bit before pushing
>the two additional parts, so that if something fails we know which bit
>caused it.
>

Hmmm, I see the buildfarm is not happy about it - a couple of animals
failed, but some succeeded. The failure seems like a simple difference
in explain output, but it's not clear why would it happen (and I've
ran the tests many times but never seen this failure).

Investigating.

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> Hmmm, I see the buildfarm is not happy about it - a couple of animals
> failed, but some succeeded. The failure seems like a simple difference
> in explain output, but it's not clear why it would happen (and I've
> run the tests many times but never seen this failure).

Did you ever use force_parallel_mode = regress?

            regards, tom lane



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Mon, Apr 06, 2020 at 04:14:38PM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> Hmmm, I see the buildfarm is not happy about it - a couple of animals
>> failed, but some succeeded. The failure seems like a simple difference
>> in explain output, but it's not clear why it would happen (and I've
>> run the tests many times but never seen this failure).
>
>Did you ever use force_parallel_mode = regress?
>

Ah, not sure - probably not in this round of tests and there were some
changes in the explain code. Thanks for the hint.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Mon, Apr 06, 2020 at 04:14:38PM -0400, Tom Lane wrote:
>> Did you ever use force_parallel_mode = regress?

> Ah, not sure - probably not in this round of tests and there were some
> changes in the explain code. Thanks for the hint.

Locally, things pass without force_parallel_mode, but turning it on
produces failures that look similar to rhinoceros's (didn't examine
other BF members).  At a guess, looks like missing or incorrect logic
for propagating some state back from parallel workers.

            regards, tom lane



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Alvaro Herrera
Date:
On 2020-Apr-06, Tom Lane wrote:

> Locally, things pass without force_parallel_mode, but turning it on
> produces failures that look similar to rhinoceros's (didn't examine
> other BF members).

FWIW I looked at the eight failures there were about fifteen minutes ago
and they were all identical.  I can confirm that, in my laptop, the
tests work without that GUC, and fail in exactly that way with it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:
>On 2020-Apr-06, Tom Lane wrote:
>
>> Locally, things pass without force_parallel_mode, but turning it on
>> produces failures that look similar to rhinoceros's (didn't examine
>> other BF members).
>
>FWIW I looked at the eight failures there were about fifteen minutes ago
>and they were all identical.  I can confirm that, in my laptop, the
>tests work without that GUC, and fail in exactly that way with it.
>

Yes, there's a thinko in show_incremental_sort_info() and it returns too
soon. I'll push a fix in a minute.


thanks

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Mon, Apr 6, 2020 at 5:12 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:
> >On 2020-Apr-06, Tom Lane wrote:
> >
> >> Locally, things pass without force_parallel_mode, but turning it on
> >> produces failures that look similar to rhinoceros's (didn't examine
> >> other BF members).
> >
> >FWIW I looked at the eight failures there were about fifteen minutes ago
> >and they were all identical.  I can confirm that, in my laptop, the
> >tests work without that GUC, and fail in exactly that way with it.
> >
>
> Yes, there's a thinko in show_incremental_sort_info() and it returns too
> soon. I'll push a fix in a minute.

I'm stepping through this in a debugger; is what you're considering
that the for loop through the workers is off by one?

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Mon, Apr 6, 2020 at 5:20 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Mon, Apr 6, 2020 at 5:12 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:
> > >On 2020-Apr-06, Tom Lane wrote:
> > >
> > >> Locally, things pass without force_parallel_mode, but turning it on
> > >> produces failures that look similar to rhinoceros's (didn't examine
> > >> other BF members).
> > >
> > >FWIW I looked at the eight failures there were about fifteen minutes ago
> > >and they were all identical.  I can confirm that, in my laptop, the
> > >tests work without that GUC, and fail in exactly that way with it.
> > >
> >
> > Yes, there's a thinko in show_incremental_sort_info() and it returns too
> > soon. I'll push a fix in a minute.
>
> I'm stepping through this in a debugger; is what you're considering
> that the for loop through the workers is off by one?

Oh, nevermind, misread that.

Looks like if the leader doesn't participate, then we don't show
details for workers.

Tomas: Do you already have a patch? If not, I can work one up.

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Mon, Apr 6, 2020 at 5:22 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Mon, Apr 6, 2020 at 5:20 PM James Coleman <jtc331@gmail.com> wrote:
> >
> > On Mon, Apr 6, 2020 at 5:12 PM Tomas Vondra
> > <tomas.vondra@2ndquadrant.com> wrote:
> > >
> > > On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:
> > > >On 2020-Apr-06, Tom Lane wrote:
> > > >
> > > >> Locally, things pass without force_parallel_mode, but turning it on
> > > >> produces failures that look similar to rhinoceros's (didn't examine
> > > >> other BF members).
> > > >
> > > >FWIW I looked at the eight failures there were about fifteen minutes ago
> > > >and they were all identical.  I can confirm that, in my laptop, the
> > > >tests work without that GUC, and fail in exactly that way with it.
> > > >
> > >
> > > Yes, there's a thinko in show_incremental_sort_info() and it returns too
> > > soon. I'll push a fix in a minute.
> >
> > I'm stepping through this in a debugger; is what you're considering
> > that the for loop through the workers is off by one?
>
> Oh, nevermind, misread that.
>
> Looks like if the leader doesn't participate, then we don't show
> details for workers.
>
> Tomas: Do you already have a patch? If not, I can work one up.

Well, already have it, so I'll send it just in case.

James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Mon, Apr 06, 2020 at 11:12:32PM +0200, Tomas Vondra wrote:
>On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:
>>On 2020-Apr-06, Tom Lane wrote:
>>
>>>Locally, things pass without force_parallel_mode, but turning it on
>>>produces failures that look similar to rhinoceros's (didn't examine
>>>other BF members).
>>
>>FWIW I looked at the eight failures there were about fifteen minutes ago
>>and they were all identical.  I can confirm that, in my laptop, the
>>tests work without that GUC, and fail in exactly that way with it.
>>
>
>Yes, there's a thinko in show_incremental_sort_info() and it returns too
>soon. I'll push a fix in a minute.
>

OK, I've pushed a fix - this should make the buildfarm happy again.

It however seems to me a bit more needs to be done. The fix makes
show_incremental_sort_info closer to show_sort_info, but not entirely
because IncrementalSortState does not have sort_Done flag so it still
depends on (fullsortGroupInfo->groupCount > 0). I haven't noticed that
before, but not having that flag seems a bit weird to me.

It also seems possibly incorrect - we may end up with

   fullsortGroupInfo->groupCount == 0
   prefixsortGroupInfo->groupCount > 0

but we won't print anything.

James, any opinion on this? I'd say we should restore the sort_Done flag
and make it work as in plain Sort. Or some comment explaining why
depending on the counts is OK (assuming it is).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Mon, Apr 6, 2020 at 5:40 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Mon, Apr 06, 2020 at 11:12:32PM +0200, Tomas Vondra wrote:
> >On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:
> >>On 2020-Apr-06, Tom Lane wrote:
> >>
> >>>Locally, things pass without force_parallel_mode, but turning it on
> >>>produces failures that look similar to rhinoceros's (didn't examine
> >>>other BF members).
> >>
> >>FWIW I looked at the eight failures there were about fifteen minutes ago
> >>and they were all identical.  I can confirm that, in my laptop, the
> >>tests work without that GUC, and fail in exactly that way with it.
> >>
> >
> >Yes, there's a thinko in show_incremental_sort_info() and it returns too
> >soon. I'll push a fix in a minute.
> >
>
> OK, I've pushed a fix - this should make the buildfarm happy again.
>
> It however seems to me a bit more needs to be done. The fix makes
> show_incremental_sort_info closer to show_sort_info, but not entirely
> because IncrementalSortState does not have sort_Done flag so it still
> depends on (fullsortGroupInfo->groupCount > 0). I haven't noticed that
> before, but not having that flag seems a bit weird to me.
>
> It also seems possibly incorrect - we may end up with
>
>    fullsortGroupInfo->groupCount == 0
>    prefixsortGroupInfo->groupCount > 0
>
> but we won't print anything.

This shouldn't ever be possible, because the only way we get any
prefix groups at all is if we've already sorted a full sort group
during the mode transition.

> James, any opinion on this? I'd say we should restore the sort_Done flag
> and make it work as in plain Sort. Or some comment explaining why
> depending on the counts is OK (assuming it is).

There's previous email traffic on this thread about that (I can look
it up later this evening), but the short of it is that I believe that
relying on the group count is actually more correct than a sort_Done
flag in the case of incremental sort (in contrast to regular sort).

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> OK, I've pushed a fix - this should make the buildfarm happy again.

Well, it's *less* unhappy.  thorntail is showing that the number of
workers field is not stable; that will need to be masked.

            regards, tom lane



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Mon, Apr 06, 2020 at 05:47:48PM -0400, James Coleman wrote:
>On Mon, Apr 6, 2020 at 5:40 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Mon, Apr 06, 2020 at 11:12:32PM +0200, Tomas Vondra wrote:
>> >On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:
>> >>On 2020-Apr-06, Tom Lane wrote:
>> >>
>> >>>Locally, things pass without force_parallel_mode, but turning it on
>> >>>produces failures that look similar to rhinoceros's (didn't examine
>> >>>other BF members).
>> >>
>> >>FWIW I looked at the eight failures there were about fifteen minutes ago
>> >>and they were all identical.  I can confirm that, in my laptop, the
>> >>tests work without that GUC, and fail in exactly that way with it.
>> >>
>> >
>> >Yes, there's a thinko in show_incremental_sort_info() and it returns too
>> >soon. I'll push a fix in a minute.
>> >
>>
>> OK, I've pushed a fix - this should make the buildfarm happy again.
>>
>> It however seems to me a bit more needs to be done. The fix makes
>> show_incremental_sort_info closer to show_sort_info, but not entirely
>> because IncrementalSortState does not have sort_Done flag so it still
>> depends on (fullsortGroupInfo->groupCount > 0). I haven't noticed that
>> before, but not having that flag seems a bit weird to me.
>>
>> It also seems possibly incorrect - we may end up with
>>
>>    fullsortGroupInfo->groupCount == 0
>>    prefixsortGroupInfo->groupCount > 0
>>
>> but we won't print anything.
>
>This shouldn't ever be possible, because the only way we get any
>prefix groups at all is if we've already sorted a full sort group
>during the mode transition.
>
>> James, any opinion on this? I'd say we should restore the sort_Done flag
>> and make it work as in plain Sort. Or some comment explaining why
>> depending on the counts is OK (assuming it is).
>
>There's previous email traffic on this thread about that (I can look
>it up later this evening), but the short of it is that I believe that
>relying on the group count is actually more correct than a sort_Done
>flag in the case of incremental sort (in contrast to regular sort).
>

OK. Maybe we should add a comment to explain.c saying it's OK.
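
Something along these lines, perhaps (just a draft of the wording, based on
your explanation):

    /*
     * Incremental sort only switches to prefix-only mode after at least
     * one full sort group has been sorted, so a nonzero
     * fullsortGroupInfo->groupCount is enough to tell whether any sorting
     * happened at all - unlike plain Sort, we don't need a separate
     * sort_Done flag here.
     */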

I've pushed a fix for failures due to different planned workers (in the
test I added to show changes due to add_partial_path tweaks).

It seems we're not out of the woods yet, though. rhinoceros and
sidewinder failed with something like this:

                 Sort Method: quicksort  Memory: NNkB
+               Sort Method: unknown  Disk: NNkB

Would you mind investigating it?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Mon, Apr 06, 2020 at 05:51:43PM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> OK, I've pushed a fix - this should make the buildfarm happy again.
>
>Well, it's *less* unhappy.  thorntail is showing that the number of
>workers field is not stable; that will need to be masked.
>

Yeah, I've already pushed a fix for that. But there seems to be another
failure in the explain output. Looking.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Mon, Apr 06, 2020 at 05:51:43PM -0400, Tom Lane wrote:
>> Well, it's *less* unhappy.  thorntail is showing that the number of
>> workers field is not stable; that will need to be masked.

> Yeah, I've already pushed a fix for that. But there seems to be another
> failure in the explain output. Looking.

I'm kind of unimpressed with that fix, because it guarantees that if
there's any problem with more than 2 workers, this test will never
find it.  I think you should do what I said and arrange to replace
the number-of-workers output with "N".

            regards, tom lane



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Mon, Apr 06, 2020 at 06:34:04PM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Mon, Apr 06, 2020 at 05:51:43PM -0400, Tom Lane wrote:
>>> Well, it's *less* unhappy.  thorntail is showing that the number of
>>> workers field is not stable; that will need to be masked.
>
>> Yeah, I've already pushed a fix for that. But there seems to be another
>> failure in th explain output. Looking.
>
>I'm kind of unimpressed with that fix, because it guarantees that if
>there's any problem with more than 2 workers, this test will never
>find it.  I think you should do what I said and arrange to replace
>the number-of-workers output with "N".
>

I can do that, but isn't this how every other regression test does it?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Mon, Apr 6, 2020 at 6:13 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Mon, Apr 06, 2020 at 05:47:48PM -0400, James Coleman wrote:
> >On Mon, Apr 6, 2020 at 5:40 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> On Mon, Apr 06, 2020 at 11:12:32PM +0200, Tomas Vondra wrote:
> >> >On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:
> >> >>On 2020-Apr-06, Tom Lane wrote:
> >> >>
> >> >>>Locally, things pass without force_parallel_mode, but turning it on
> >> >>>produces failures that look similar to rhinoceros's (didn't examine
> >> >>>other BF members).
> >> >>
> >> >>FWIW I looked at the eight failures there were about fifteen minutes ago
> >> >>and they were all identical.  I can confirm that, in my laptop, the
> >> >>tests work without that GUC, and fail in exactly that way with it.
> >> >>
> >> >
> >> >Yes, there's a thinko in show_incremental_sort_info() and it returns too
> >> >soon. I'll push a fix in a minute.
> >> >
> >>
> >> OK, I've pushed a fix - this should make the buildfarm happy again.
> >>
> >> It however seems to me a bit more needs to be done. The fix makes
> >> show_incremental_sort_info closer to show_sort_info, but not entirely
> >> because IncrementalSortState does not have sort_Done flag so it still
> >> depends on (fullsortGroupInfo->groupCount > 0). I haven't noticed that
> >> before, but not having that flag seems a bit weird to me.
> >>
> >> It also seems possibly incorrect - we may end up with
> >>
> >>    fullsortGroupInfo->groupCount == 0
> >>    prefixsortGroupInfo->groupCount > 0
> >>
> >> but we won't print anything.
> >
> >This shouldn't ever be possible, because the only way we get any
> >prefix groups at all is if we've already sorted a full sort group
> >during the mode transition.
> >
> >> James, any opinion on this? I'd say we should restore the sort_Done flag
> >> and make it work as in plain Sort. Or some comment explaining why
> >> depending on the counts is OK (assuming it is).
> >
> >There's previous email traffic on this thread about that (I can look
> >it up later this evening), but the short of it is that I believe that
> >relying on the group count is actually more correct than a sort_Done
> >flag in the case of incremental sort (in contrast to regular sort).
> >
>
> OK. Maybe we should add a comment to explain.c saying it's OK.
>
> I've pushed a fix for failures due to different planned workers (in the
> test I added to show changes due to add_partial_path tweaks).
>
> It seems we're not out of the woods yet, though. rhinoceros and
> sidewinder failed with something like this:
>
>                  Sort Method: quicksort  Memory: NNkB
> +               Sort Method: unknown  Disk: NNkB
>
> Would you mind investigating it?

I assume that means those build farm members run with very low
work_mem? Is it an acceptable fix to adjust work_mem up a bit just for
these tests? Or is that bad practice and these are to expose issues
with changing into disk sort mode?

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Mon, Apr 6, 2020 at 7:09 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Mon, Apr 6, 2020 at 6:13 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > On Mon, Apr 06, 2020 at 05:47:48PM -0400, James Coleman wrote:
> > >On Mon, Apr 6, 2020 at 5:40 PM Tomas Vondra
> > ><tomas.vondra@2ndquadrant.com> wrote:
> > >>
> > >> On Mon, Apr 06, 2020 at 11:12:32PM +0200, Tomas Vondra wrote:
> > >> >On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:
> > >> >>On 2020-Apr-06, Tom Lane wrote:
> > >> >>
> > >> >>>Locally, things pass without force_parallel_mode, but turning it on
> > >> >>>produces failures that look similar to rhinoceros's (didn't examine
> > >> >>>other BF members).
> > >> >>
> > >> >>FWIW I looked at the eight failures there were about fifteen minutes ago
> > >> >>and they were all identical.  I can confirm that, in my laptop, the
> > >> >>tests work without that GUC, and fail in exactly that way with it.
> > >> >>
> > >> >
> > >> >Yes, there's a thinko in show_incremental_sort_info() and it returns too
> > >> >soon. I'll push a fix in a minute.
> > >> >
> > >>
> > >> OK, I've pushed a fix - this should make the buildfarm happy again.
> > >>
> > >> It however seems to me a bit more needs to be done. The fix makes
> > >> show_incremental_sort_info closer to show_sort_info, but not entirely
> > >> because IncrementalSortState does not have sort_Done flag so it still
> > >> depends on (fullsortGroupInfo->groupCount > 0). I haven't noticed that
> > >> before, but not having that flag seems a bit weird to me.
> > >>
> > >> It also seems possibly incorrect - we may end up with
> > >>
> > >>    fullsortGroupInfo->groupCount == 0
> > >>    prefixsortGroupInfo->groupCount > 0
> > >>
> > >> but we won't print anything.
> > >
> > >This shouldn't ever be possible, because the only way we get any
> > >prefix groups at all is if we've already sorted a full sort group
> > >during the mode transition.
> > >
> > >> James, any opinion on this? I'd say we should restore the sort_Done flag
> > >> and make it work as in plain Sort. Or some comment explaining why
> > >> depending on the counts is OK (assuming it is).
> > >
> > >There's previous email traffic on this thread about that (I can look
> > >it up later this evening), but the short of it is that I believe that
> > >relying on the group count is actually more correct than a sort_Done
> > >flag in the case of incremental sort (in contrast to regular sort).
> > >
> >
> > OK. Maybe we should add a comment to explain.c saying it's OK.
> >
> > I've pushed a fix for failures due to different planned workers (in the
> > test I added to show changes due to add_partial_path tweaks).
> >
> > It seems we're not out of the woods yet, though. rhinoceros and
> > sidewinder failed with something like this:
> >
> >                  Sort Method: quicksort  Memory: NNkB
> > +               Sort Method: unknown  Disk: NNkB
> >
> > Would you mind investigating it?
>
> I assume that means those build farm members run with very low
> work_mem? Is it an acceptable fix to adjust work_mem up a bit just for
> these tests? Or is that bad practice and these are to expose issues
> with changing into disk sort mode?

On rhinoceros I see:

================== pgsql.build/src/test/regress/regression.diffs
===================
diff -U3 /opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/src/test/regress/expected/subselect.out
/opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/src/test/regress/results/subselect.out
--- /opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/src/test/regress/expected/subselect.out
2020-03-14 10:37:49.156761104 -0700
+++ /opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/src/test/regress/results/subselect.out
2020-04-06 16:01:13.766798059 -0700
@@ -1328,8 +1328,9 @@
          ->  Sort (actual rows=3 loops=1)
                Sort Key: sq_limit.c1, sq_limit.pk
                Sort Method: top-N heapsort  Memory: xxx
+               Sort Method: unknown  Disk: 0kB
                ->  Seq Scan on sq_limit (actual rows=8 loops=1)
-(6 rows)
+(7 rows)

Same on sidewinder.

Given the 0kB I'm not sure this is *just* a work_mem thing, though
that's still something I'm curious to know about, and it's still part
of the "problem" here.

I'm investigating further.

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Mon, Apr 06, 2020 at 07:09:11PM -0400, James Coleman wrote:
>On Mon, Apr 6, 2020 at 6:13 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Mon, Apr 06, 2020 at 05:47:48PM -0400, James Coleman wrote:
>> >On Mon, Apr 6, 2020 at 5:40 PM Tomas Vondra
>> ><tomas.vondra@2ndquadrant.com> wrote:
>> >>
>> >> On Mon, Apr 06, 2020 at 11:12:32PM +0200, Tomas Vondra wrote:
>> >> >On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:
>> >> >>On 2020-Apr-06, Tom Lane wrote:
>> >> >>
>> >> >>>Locally, things pass without force_parallel_mode, but turning it on
>> >> >>>produces failures that look similar to rhinoceros's (didn't examine
>> >> >>>other BF members).
>> >> >>
>> >> >>FWIW I looked at the eight failures there were about fifteen minutes ago
>> >> >>and they were all identical.  I can confirm that, in my laptop, the
>> >> >>tests work without that GUC, and fail in exactly that way with it.
>> >> >>
>> >> >
>> >> >Yes, there's a thinko in show_incremental_sort_info() and it returns too
>> >> >soon. I'll push a fix in a minute.
>> >> >
>> >>
>> >> OK, I've pushed a fix - this should make the buildfarm happy again.
>> >>
>> >> It however seems to me a bit more needs to be done. The fix makes
>> >> show_incremental_sort_info closer to show_sort_info, but not entirely
>> >> because IncrementalSortState does not have sort_Done flag so it still
>> >> depends on (fullsortGroupInfo->groupCount > 0). I haven't noticed that
>> >> before, but not having that flag seems a bit weird to me.
>> >>
>> >> It also seems possibly incorrect - we may end up with
>> >>
>> >>    fullsortGroupInfo->groupCount == 0
>> >>    prefixsortGroupInfo->groupCount > 0
>> >>
>> >> but we won't print anything.
>> >
>> >This shouldn't ever be possible, because the only way we get any
>> >prefix groups at all is if we've already sorted a full sort group
>> >during the mode transition.
>> >
>> >> James, any opinion on this? I'd say we should restore the sort_Done flag
>> >> and make it work as in plain Sort. Or some comment explaining why
>> >> depending on the counts is OK (assuming it is).
>> >
>> >There's previous email traffic on this thread about that (I can look
>> >it up later this evening), but the short of it is that I believe that
>> >relying on the group count is actually more correct than a sort_Done
>> >flag in the case of incremental sort (in contrast to regular sort).
>> >
>>
>> OK. Maybe we should add a comment to explain.c saying it's OK.
>>
>> I've pushed a fix for failures due to different planned workers (in the
>> test I added to show changes due to add_partial_path tweaks).
>>
>> It seems we're not out of the woods yet, though. rhinoceros and
>> sidewinder failed with something like this:
>>
>>                  Sort Method: quicksort  Memory: NNkB
>> +               Sort Method: unknown  Disk: NNkB
>>
>> Would you mind investigating at it?
>
>I assume that means those build farm members run with very low
>work_mem? Is it an acceptable fix to adjust work_mem up a bit just for
>these tests? Or is that bad practice and these are to expose issues
>with changing into disk sort mode?
>

I don't think so - I don't see any work_mem changes in the config - see
the extra_config at the beginning of the page with details:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rhinoceros&dt=2020-04-06%2023%3A00%3A16

Moreover, this seems to be in regular Sort, not Incremental Sort, and it
very much looks like it gets confused into printing worker info as well,
because the only way for Sort to print two "Sort Method" lines seems to
be to enter both

   if (sortstate->sort_Done && sortstate->tuplesortstate != NULL)
   {
     ... print leader info ...
   }

   and

   if (sortstate->shared_info != NULL)
   {
     for (n = 0; n < sortstate->shared_info->num_workers; n++)
     {
       ... print worker info ...
     }
   }

or maybe there are two workers? It's strange ...


It doesn't seem to be particularly platform-specific, but I've been
unable to reproduce it so far. It seems to happen on older gcc versions, though.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> It doesn't seem to be particularly platform-specific, but I've been
> unable to reproduce it so far. It seems on older gcc versions, though.

It's looking kind of like an uninitialized-memory problem.  Note
the latest from spurfowl,

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=spurfowl&dt=2020-04-07%2000%3A15%3A05

which got through "make check" and then failed during pg_upgrade's
repetition of the test.  Similarly on rhinoceros.  So there's definitely
instability there even on one machine.

Perhaps something to do with unexpected cache flushes??

            regards, tom lane



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Mon, Apr 6, 2020 at 7:31 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Mon, Apr 06, 2020 at 07:09:11PM -0400, James Coleman wrote:
> >On Mon, Apr 6, 2020 at 6:13 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> On Mon, Apr 06, 2020 at 05:47:48PM -0400, James Coleman wrote:
> >> >On Mon, Apr 6, 2020 at 5:40 PM Tomas Vondra
> >> ><tomas.vondra@2ndquadrant.com> wrote:
> >> >>
> >> >> On Mon, Apr 06, 2020 at 11:12:32PM +0200, Tomas Vondra wrote:
> >> >> >On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:
> >> >> >>On 2020-Apr-06, Tom Lane wrote:
> >> >> >>
> >> >> >>>Locally, things pass without force_parallel_mode, but turning it on
> >> >> >>>produces failures that look similar to rhinoceros's (didn't examine
> >> >> >>>other BF members).
> >> >> >>
> >> >> >>FWIW I looked at the eight failures there were about fifteen minutes ago
> >> >> >>and they were all identical.  I can confirm that, in my laptop, the
> >> >> >>tests work without that GUC, and fail in exactly that way with it.
> >> >> >>
> >> >> >
> >> >> >Yes, there's a thinko in show_incremental_sort_info() and it returns too
> >> >> >soon. I'll push a fix in a minute.
> >> >> >
> >> >>
> >> >> OK, I've pushed a fix - this should make the buildfarm happy again.
> >> >>
> >> >> It however seems to me a bit more needs to be done. The fix makes
> >> >> show_incremental_sort_info closer to show_sort_info, but not entirely
> >> >> because IncrementalSortState does not have sort_Done flag so it still
> >> >> depends on (fullsortGroupInfo->groupCount > 0). I haven't noticed that
> >> >> before, but not having that flag seems a bit weird to me.
> >> >>
> >> >> It also seems possibly incorrect - we may end up with
> >> >>
> >> >>    fullsortGroupInfo->groupCount == 0
> >> >>    prefixsortGroupInfo->groupCount > 0
> >> >>
> >> >> but we won't print anything.
> >> >
> >> >This shouldn't ever be possible, because the only way we get any
> >> >prefix groups at all is if we've already sorted a full sort group
> >> >during the mode transition.
> >> >
> >> >> James, any opinion on this? I'd say we should restore the sort_Done flag
> >> >> and make it work as in plain Sort. Or some comment explaining why
> >> >> depending on the counts is OK (assuming it is).
> >> >
> >> >There's previous email traffic on this thread about that (I can look
> >> >it up later this evening), but the short of it is that I believe that
> >> >relying on the group count is actually more correct than a sort_Done
> >> >flag in the case of incremental sort (in contrast to regular sort).
> >> >
> >>
> >> OK. Maybe we should add a comment to explain.c saying it's OK.
> >>
> >> I've pushed a fix for failures due to different planned workers (in the
> >> test I added to show changes due to add_partial_path tweaks).
> >>
> >> It seems we're not out of the woods yet, though. rhinoceros and
> >> sidewinder failed with something like this:
> >>
> >>                  Sort Method: quicksort  Memory: NNkB
> >> +               Sort Method: unknown  Disk: NNkB
> >>
> >> Would you mind investigating at it?
> >
> >I assume that means those build farm members run with very low
> >work_mem? Is it an acceptable fix to adjust work_mem up a bit just for
> >these tests? Or is that bad practice and these are to expose issues
> >with changing into disk sort mode?
> >
>
> I don't think so - I don't see any work_mem changes in the config - see
> the extra_config at the beginning of the page with details:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rhinoceros&dt=2020-04-06%2023%3A00%3A16
>
> Moreover, this seems to be in regular Sort, not Incremental Sort and it
> very much seems like it gets confused to print a worker info because the
> only way for Sort to print two "Sort Method" lines seems to be to enter
> either both
>
>    if (sortstate->sort_Done && sortstate->tuplesortstate != NULL)
>    {
>      ... print leader info ...
>    }
>
>    and
>
>    if (sortstate->shared_info != NULL)
>    {
>      for (n = 0; n < sortstate->shared_info->num_workers; n++)
>      {
>        ... print worker info ...
>      }
>    }
>
> or maybe there are two workers? It's strange ...
>
>
> It doesn't seem to be particularly platform-specific, but I've been
> unable to reproduce it so far. It seems on older gcc versions, though.


I haven't been able to reproduce it, but I'm 99% confident this will fix it:

-            if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+            if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS
+                || sinstrument->sortMethod == NULL)
                 continue;       /* ignore any unfilled slots */

Earlier we'd had this discussion about why SORT_TYPE_STILL_IN_PROGRESS
was explicitly set to 0 in the enum declaration. Since there was no
comment, we changed that, but here I believe that show_sort_info was
relying on that as an indicator that a worker didn't actually do any
work (since the DSM for the sort node gets set to all zeros, this
would work).
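
For context, the pre-patch declaration that show_sort_info was implicitly
relying on looked roughly like this (recalled from tuplesort.h, so treat it
as a sketch rather than a quote):

    typedef enum
    {
        SORT_TYPE_STILL_IN_PROGRESS = 0,   /* matches a zero-filled DSM slot */
        SORT_TYPE_TOP_N_HEAPSORT,
        SORT_TYPE_QUICKSORT,
        SORT_TYPE_EXTERNAL_SORT,
        SORT_TYPE_EXTERNAL_MERGE
    } TuplesortMethod;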

I'm not sure if the SORT_TYPE_STILL_IN_PROGRESS case is actually still
needed, though.

I've attached both a fix for this issue and a comment for the
full/prefix sort group if blocks.

James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Mon, Apr 06, 2020 at 07:27:19PM -0400, James Coleman wrote:
>On Mon, Apr 6, 2020 at 7:09 PM James Coleman <jtc331@gmail.com> wrote:
>>
>> On Mon, Apr 6, 2020 at 6:13 PM Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>> >
>> > On Mon, Apr 06, 2020 at 05:47:48PM -0400, James Coleman wrote:
>> > >On Mon, Apr 6, 2020 at 5:40 PM Tomas Vondra
>> > ><tomas.vondra@2ndquadrant.com> wrote:
>> > >>
>> > >> On Mon, Apr 06, 2020 at 11:12:32PM +0200, Tomas Vondra wrote:
>> > >> >On Mon, Apr 06, 2020 at 04:54:38PM -0400, Alvaro Herrera wrote:
>> > >> >>On 2020-Apr-06, Tom Lane wrote:
>> > >> >>
>> > >> >>>Locally, things pass without force_parallel_mode, but turning it on
>> > >> >>>produces failures that look similar to rhinoceros's (didn't examine
>> > >> >>>other BF members).
>> > >> >>
>> > >> >>FWIW I looked at the eight failures there were about fifteen minutes ago
>> > >> >>and they were all identical.  I can confirm that, in my laptop, the
>> > >> >>tests work without that GUC, and fail in exactly that way with it.
>> > >> >>
>> > >> >
>> > >> >Yes, there's a thinko in show_incremental_sort_info() and it returns too
>> > >> >soon. I'll push a fix in a minute.
>> > >> >
>> > >>
>> > >> OK, I've pushed a fix - this should make the buildfarm happy again.
>> > >>
>> > >> It however seems to me a bit more needs to be done. The fix makes
>> > >> show_incremental_sort_info closer to show_sort_info, but not entirely
>> > >> because IncrementalSortState does not have sort_Done flag so it still
>> > >> depends on (fullsortGroupInfo->groupCount > 0). I haven't noticed that
>> > >> before, but not having that flag seems a bit weird to me.
>> > >>
>> > >> It also seems possibly incorrect - we may end up with
>> > >>
>> > >>    fullsortGroupInfo->groupCount == 0
>> > >>    prefixsortGroupInfo->groupCount > 0
>> > >>
>> > >> but we won't print anything.
>> > >
>> > >This shouldn't ever be possible, because the only way we get any
>> > >prefix groups at all is if we've already sorted a full sort group
>> > >during the mode transition.
>> > >
>> > >> James, any opinion on this? I'd say we should restore the sort_Done flag
>> > >> and make it work as in plain Sort. Or some comment explaining why
>> > >> depending on the counts is OK (assuming it is).
>> > >
>> > >There's previous email traffic on this thread about that (I can look
>> > >it up later this evening), but the short of it is that I believe that
>> > >relying on the group count is actually more correct than a sort_Done
>> > >flag in the case of incremental sort (in contrast to regular sort).
>> > >
>> >
>> > OK. Maybe we should add a comment to explain.c saying it's OK.
>> >
>> > I've pushed a fix for failures due to different planned workers (in the
>> > test I added to show changes due to add_partial_path tweaks).
>> >
>> > It seems we're not out of the woods yet, though. rhinoceros and
>> > sidewinder failed with something like this:
>> >
>> >                  Sort Method: quicksort  Memory: NNkB
>> > +               Sort Method: unknown  Disk: NNkB
>> >
>> > Would you mind investigating at it?
>>
>> I assume that means those build farm members run with very low
>> work_mem? Is it an acceptable fix to adjust work_mem up a bit just for
>> these tests? Or is that bad practice and these are to expose issues
>> with changing into disk sort mode?
>
>On rhinoceros I see:
>
>================== pgsql.build/src/test/regress/regression.diffs
>===================
>diff -U3 /opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/src/test/regress/expected/subselect.out
>/opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/src/test/regress/results/subselect.out
>--- /opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/src/test/regress/expected/subselect.out
>2020-03-14 10:37:49.156761104 -0700
>+++ /opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/src/test/regress/results/subselect.out
>2020-04-06 16:01:13.766798059 -0700
>@@ -1328,8 +1328,9 @@
>          ->  Sort (actual rows=3 loops=1)
>                Sort Key: sq_limit.c1, sq_limit.pk
>                Sort Method: top-N heapsort  Memory: xxx
>+               Sort Method: unknown  Disk: 0kB
>                ->  Seq Scan on sq_limit (actual rows=8 loops=1)
>-(6 rows)
>+(7 rows)
>
>Same on sidewinder.
>
>Given the 0kB I'm not sure this is *just* a work_mem thing, though
>that's still something I'm curious to know about, and it's still part
>of the "problem" here.
>
>I'm investigating further.
>

I don't see how this could be caused by a low work_mem value, really. What
I think is happening is that when executed with

     force_parallel_mode = regress

we run this as if in a parallel worker, but then it somehow gets
confused and reports either (leader + worker) stats or two workers' stats,
or something like that. I don't know why it's happening, though, or why
it would be triggered by the incremental sort patch ...

Actually, I just managed to trigger exactly this - the trick is that we
plan for certain number of workers, but then fail to start some. So for
example like this:

     create table t (a int, b int, c int);

     insert into t select mod(i,10), mod(i,10), i
       from generate_series(1,1000000) s(i);

     set force_parallel_mode = regress;

     set max_parallel_workers = 0;

     explain (costs off, analyze) select * from t order by a,b;

                           QUERY PLAN
---------------------------------------------------------------
  Sort (actual time=0.010..0.010 rows=0 loops=1)
    Sort Key: a, b
    Sort Method: quicksort  Memory: 25kB
    Sort Method: unknown  Disk: 0kB
    ->  Seq Scan on t (actual time=0.003..0.004 rows=0 loops=1)
  Planning Time: 0.040 ms
  Execution Time: 0.229 ms
(7 rows)

So we actually do have to print two lines, because without any workers
the leader ends up doing the work. But we don't know this is happening,
because the number of workers started is shown in the Gather node, and
force_parallel_mode=regress hides that.

The question is why these failures are correlated with incremental sort.
I don't think we've tweaked Sort at all, no?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Mon, Apr 06, 2020 at 08:42:13PM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> It doesn't seem to be particularly platform-specific, but I've been
>> unable to reproduce it so far. It seems on older gcc versions, though.
>
>It's looking kind of like an uninitialized-memory problem.  Note
>the latest from spurfowl,
>
>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=spurfowl&dt=2020-04-07%2000%3A15%3A05
>
>which got through "make check" and then failed during pg_upgrade's
>repetition of the test.  Similarly on rhinoceros.  So there's definitely
>instability there even on one machine.
>
>Perhaps something to do with unexpected cache flushes??
>

I don't know, I've tried running the tests on a number of machines
similar to those failing - Raspberry Pi, Fedora 31, ... - and it worked
everywhere, while the failures seem consistent.

I've been able to reproduce these failures (same symptoms) by making
sure the worker (implied by force_parallel_mode=regress) won't start.

   set max_parallel_workers = 0;
   set force_parallel_mode = regress;

triggers exactly those failures for me (at least during make check, I
haven't tried pg_upgrade tests etc.).

So my theory is that we fail to start parallel workers on those
machines. It's not clear to me why it would be limited to some machines,
or why it would be correlated with incremental sort. I don't think
those machines have a lower number of parallel workers, no?

But maybe incremental sort allowed parallel plans to be used for more
queries, and we simply run out of parallel workers that way?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> I don't know, I've tried running the tests on a number of machines,
> similar to those failing. Rapsberry Pi, Fedora 31, ... and it worked
> everywhere while the failures seem consistent.

On my machine, it reproduces about one time in six with
force_parallel_mode = regress.  It seems possible given your
results that reducing max_parallel_workers would make it more
likely, but I've not tried that.

What I'm seeing, after adding some debug printouts, is that sortMethod is
frequently zero when we reach the EXPLAIN output for a worker.  In many of
the tests this happens even though there is no visible failure, because
we've got a filter function hiding the output :-(

So I concur with James' conclusion that the existing code is relying on
sortMethod initializing to zeroes, and that we did the wrong thing by
trying to give SORT_TYPE_STILL_IN_PROGRESS a nonzero representation.
I do not like his patch though, particularly not the type pun with NULL.
I think the correct fix is to change the enum declaration.

I know it's darn late where you are, do you want me to change it?

            regards, tom lane



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Mon, Apr 6, 2020 at 9:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> > I don't know, I've tried running the tests on a number of machines,
> > similar to those failing. Rapsberry Pi, Fedora 31, ... and it worked
> > everywhere while the failures seem consistent.
>
> On my machine, it reproduces about one time in six with
> force_parallel_mode = regress.  It seems possible given your
> results that reducing max_parallel_workers would make it more
> likely, but I've not tried that.
>
> What I'm seeing, after adding some debug printouts, is that sortMethod is
> frequently zero when we reach the EXPLAIN output for a worker.  In many of
> the tests this happens even though there is no visible failure, because
> we've got a filter function hiding the output :-(
>
> So I concur with James' conclusion that the existing code is relying on
> sortMethod initializing to zeroes, and that we did the wrong thing by
> trying to give SORT_TYPE_STILL_IN_PROGRESS a nonzero representation.
> I do not like his patch though, particularly not the type pun with NULL.

Sentinel and NULL? I hadn't caught that at all.

> I think the correct fix is to change the enum declaration.

Hmm. I don't actually really like that, because it means the value
here isn't actually semantically correct. That is, the sort type is
not "in progress"; it's "we never started a sort at all". I don't
really love the conflating of those things that the old enum
declaration had (even if it'd had a helpful comment). It seems to me that
we should make "we don't have a type" and "we have a type" distinct.

We could add a new enum value SORT_TYPE_UNINITIALIZED or similar though.

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tom Lane
Date:
James Coleman <jtc331@gmail.com> writes:
> On Mon, Apr 6, 2020 at 9:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think the correct fix is to change the enum declaration.

> Hmm. I don't actually really like that, because it means the value
> here isn't actually semantically correct. That is, the sort type is
> not "in progress"; it's "we never started a sort at all".

Well, yeah, but that pre-dated this patch, and right now is no
time to improve it; we can debate such fine points at more leisure
once the buildfarm isn't broken.

Obviously the comment needs fixed...

            regards, tom lane



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Mon, Apr 6, 2020 at 10:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> James Coleman <jtc331@gmail.com> writes:
> > On Mon, Apr 6, 2020 at 9:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I think the correct fix is to change the enum declaration.
>
> > Hmm. I don't actually really like that, because it means the value
> > here isn't actually semantically correct. That is, the sort type is
> > not "in progress"; it's "we never started a sort at all".
>
> Well, yeah, but that pre-dated this patch, and right now is no
> time to improve it; we can debate such fine points at more leisure
> once the buildfarm isn't broken.

Fair enough. Unsure if Tomas is still online to comment and/or push,
but reverting SORT_TYPE_STILL_IN_PROGRESS back to 0 works for me as an
initial fix.

> Obviously the comment needs fixed...

The one in show_sort_info?

I can work on that (and the other proposed cleanup above) with Tomas
tomorrow or later.

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tom Lane
Date:
James Coleman <jtc331@gmail.com> writes:
> Fair enough. Unsure if Tomas is still online to comment and/or push,
> but reverting SORT_TYPE_STILL_IN_PROGRESS back to 0 works for me as an
> initial fix.

I'm guessing he went to bed, so I'll push a fix in a moment.
The patch has survived enough test cycles here now to make me
moderately confident that it fixes the issue.

            regards, tom lane



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Mon, Apr 06, 2020 at 10:19:41PM -0400, Tom Lane wrote:
>James Coleman <jtc331@gmail.com> writes:
>> Fair enough. Unsure if Tomas is still online to comment and/or push,
>> but reverting SORT_TYPE_STILL_IN_PROGRESS back to 0 works for me as an
>> initial fix.
>
>I'm guessing he went to bed, so I'll push a fix in a moment.
>The patch has survived enough test cycles here now to make me
>moderately confident that it fixes the issue.
>

Nope, how could I sleep with half of the buildfarm still red?

I came to the same conclusion (that the change in the TuplesortMethod
definition is the culprit) a while ago and was about to push a fix that
initialized it correctly in ExecSortInitializeDSM. But I agree reverting
it back to the old definition is probably better.

Thanks!

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> I came to the same conclusion (that the change in TuplesortMethod
> definiton is the culprit) a while ago and was about to push a fix that
> initialized it correctly in ExecSortInitializeDSM. But I agree reverting
> it back to the old definition is probably better.

Yeah, for the moment.  James would like to not have
SORT_TYPE_STILL_IN_PROGRESS be part of the enum at all, I think,
and I can see his point --- but then we need some out-of-band
representation of "worker not done", so I'm not sure there'll be
any net reduction of cruft.  Anyway that can be dealt with after
we have a stable buildfarm.

Note also that there's a separate comment-only patch in
<CAAaqYe9qzKbxCvSp3dfLkuS1v8KKnB7kW3z-hZ2jnAQaveSm8w@mail.gmail.com>
that shouldn't be forgotten about.

            regards, tom lane



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Mon, Apr 06, 2020 at 11:00:37PM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> I came to the same conclusion (that the change in TuplesortMethod
>> definiton is the culprit) a while ago and was about to push a fix that
>> initialized it correctly in ExecSortInitializeDSM. But I agree reverting
>> it back to the old definition is probably better.
>
>Yeah, for the moment.  James would like to not have
>SORT_TYPE_STILL_IN_PROGRESS be part of the enum at all, I think,
>and I can see his point --- but then we need some out-of-band
>representation of "worker not done", so I'm not sure there'll be
>any net reduction of cruft.  Anyway that can be dealt with after
>we have a stable buildfarm.
>

Agreed.

>Note also that there's a separate comment-only patch in
><CAAaqYe9qzKbxCvSp3dfLkuS1v8KKnB7kW3z-hZ2jnAQaveSm8w@mail.gmail.com>
>that shouldn't be forgotten about.
>

OK, I'll take care of that tomorrow. I have two smaller patches to
commit in the incremental sort patchset, so I'll add it to that.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Justin Pryzby
Date:
On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:
> I've pushed the fist part of this patch series - I've reorganized it a

I scanned through this again post-commit.  Find attached some suggestions.

Shouldn't non-text explain output always show both disk *and* mem, including
zeros ?

Should "Pre-sorted Groups:" be on a separate line ?
| Full-sort Groups: 1 Sort Method: quicksort Memory: avg=28kB peak=28kB Pre-sorted Groups: 1 Sort Method: quicksort Memory: avg=30kB peak=30kB

And, should it use two spaces before "Sort Method", "Memory" and "Pre-sorted
Groups"?  I think you should maybe do that instead of the "semicolon
separator".  I think "two spaces" makes sense, since the units are different,
similar to hash buckets and normal sort node.  

     "Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",
     appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",

Note, I made a similar comment regarding two spaces for explain(WAL) here:
https://www.postgresql.org/message-id/20200402054120.GC14618%40telsasoft.com

And Peter E seemed to dislike that, here:
https://www.postgresql.org/message-id/ef8c966f-e50a-c583-7b1e-85de6f4ca0d3%402ndquadrant.com

Also, you're showing:
    ExplainPropertyInteger("Maximum Sort Space Used", "kB",
                groupInfo->maxMemorySpaceUsed, es);

But in show_hash_info() and show_hashagg_info(), and in your own text output,
that's called "Peak":
        ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
    ExplainPropertyInteger("Peak Memory Usage", "kB",
                spacePeakKb, es);

-- 
Justin

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Apr 7, 2020 at 12:25 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:
> > I've pushed the fist part of this patch series - I've reorganized it a
>
> I scanned through this again post-commit.  Find attached some suggestions.
>
> Shouldn't non-text explain output always show both disk *and* mem, including
> zeros ?

Could you give more context on this? Is there a standard to follow?
Regular sort nodes only ever report one type, so there's not a good
parallel there.

> Should "Pre-sorted Groups:" be on a separate line ?
> | Full-sort Groups: 1 Sort Method: quicksort Memory: avg=28kB peak=28kB Pre-sorted Groups: 1 Sort Method: quicksort
Memory:avg=30kB peak=30kB
 

I'd originally had that, but Tomas wanted it to be more compact. It's
easy to adjust though if the consensus changes on that.

> And, should it use two spaces before "Sort Method", "Memory" and "Pre-sorted
> Groups"?  I think you should maybe do that instead of the "semicolon
> separator".  I think "two spaces" makes sense, since the units are different,
> similar to hash buckets and normal sort node.
>
>          "Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",
>          appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
>
> Note, I made a similar comment regarding two spaces for explain(WAL) here:
> https://www.postgresql.org/message-id/20200402054120.GC14618%40telsasoft.com
>
> And Peter E seemed to dislike that, here:
> https://www.postgresql.org/message-id/ef8c966f-e50a-c583-7b1e-85de6f4ca0d3%402ndquadrant.com

I read through that subthread, and the ending seemed to be Peter
wanting things to be unified. Was there a conclusion beyond that?

> Also, you're showing:
>         ExplainPropertyInteger("Maximum Sort Space Used", "kB",
>                                 groupInfo->maxMemorySpaceUsed, es);
>
> But in show_hash_info() and show_hashagg_info(), and in your own text output,
> that's called "Peak":
>         ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
>         ExplainPropertyInteger("Peak Memory Usage", "kB",
>                                 spacePeakKb, es);

Yes, that's a miss and should be fixed.

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:
>On Tue, Apr 7, 2020 at 12:25 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
>>
>> On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:
>> > I've pushed the fist part of this patch series - I've reorganized it a
>>
>> I scanned through this again post-commit.  Find attached some suggestions.
>>
>> Shouldn't non-text explain output always show both disk *and* mem, including
>> zeros ?
>
>Could you give more context on this? Is there a standard to follow?
>Regular sort nodes only ever report one type, so there's not a good
>parallel there.
>
>> Should "Pre-sorted Groups:" be on a separate line ?
>> | Full-sort Groups: 1 Sort Method: quicksort Memory: avg=28kB peak=28kB Pre-sorted Groups: 1 Sort Method: quicksort
Memory:avg=30kB peak=30kB
 
>
>I'd originally had that, but Tomas wanted it to be more compact. It's
>easy to adjust though if the consensus changes on that.
>

I'm OK with changing the format if there's a consensus. The current
format seemed better to me, but I'm not particularly attached to it.

>> And, should it use two spaces before "Sort Method", "Memory" and "Pre-sorted
>> Groups"?  I think you should maybe do that instead of the "semicolon
>> separator".  I think "two spaces" makes sense, since the units are different,
>> similar to hash buckets and normal sort node.
>>
>>          "Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",
>>          appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
>>
>> Note, I made a similar comment regarding two spaces for explain(WAL) here:
>> https://www.postgresql.org/message-id/20200402054120.GC14618%40telsasoft.com
>>
>> And Peter E seemed to dislike that, here:
>> https://www.postgresql.org/message-id/ef8c966f-e50a-c583-7b1e-85de6f4ca0d3%402ndquadrant.com
>
>I read through that subthread, and the ending seemed to be Peter
>wanting things to be unified. Was there a conclusion beyond that?
>

Yeah, I don't think there was a clear consensus :-(

>> Also, you're showing:
>>         ExplainPropertyInteger("Maximum Sort Space Used", "kB",
>>                                 groupInfo->maxMemorySpaceUsed, es);
>>
>> But in show_hash_info() and show_hashagg_info(), and in your own text output,
>> that's called "Peak":
>>         ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
>>         ExplainPropertyInteger("Peak Memory Usage", "kB",
>>                                 spacePeakKb, es);
>
>Yes, that's a miss and should be fixed.
>

Will fix.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Justin Pryzby
Date:
On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:
> On Tue, Apr 7, 2020 at 12:25 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:
> > > I've pushed the fist part of this patch series - I've reorganized it a
> >
> > I scanned through this again post-commit.  Find attached some suggestions.
> >
> > Shouldn't non-text explain output always show both disk *and* mem, including
> > zeros ?
> 
> Could you give more context on this? Is there a standard to follow?
> Regular sort nodes only ever report one type, so there's not a good
> parallel there.

Right, I'm not sure either, since it seems to be a new case.  Maybe Tomas has a
strong intuition.

See at least the commit messages here:
3ec20c7091e97a554e7447ac2b7f4ed795631395
7d91b604d9b5d6ec8c19c57a9ffd2f27129cdd94
8ebb69f85445177575684a0ba5cfedda8d840a91

Maybe this one suggests that it should *not* be present unconditionally, but
only when that sort type is used?
4b234fd8bf21cd6f5ff44f1f1c613bf40860998d

Another thought: is checking if bytes>0 really a good way to determine if a
sort type was used ?  It seems like checking a bit or a pointer would be
better.  I guess a size of 0 is unlikely, and it's ok at least in text mode.

        if (groupInfo->maxMemorySpaceUsed > 0)

On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:
> > And, should it use two spaces before "Sort Method", "Memory" and "Pre-sorted
...
> I read through that subthread, and the ending seemed to be Peter
> wanting things to be unified. Was there a conclusion beyond that?

This discussion is ongoing.  I think we should wait until that's settled before
addressing this more complex and even newer case.  We can add "explain, two
spaces and equals vs colon" to the "Open items" list if need be - I hope the
discussion will not delay the release.

-- 
Justin



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Mon, Apr 06, 2020 at 11:25:21PM -0500, Justin Pryzby wrote:
>On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:
>> I've pushed the fist part of this patch series - I've reorganized it a
>
>I scanned through this again post-commit.  Find attached some suggestions.
>

Thanks. The typo fixes seem clear, except for this bit:

   * If we've set up either of the sort states yet, we need to reset them.
   * We could end them and null out the pointers, but there's no reason to
   * repay the setup cost, and because ???? guard setting up pivot comparator
   * state similarly, doing so might actually cause a leak.

I can't figure out what ???? should be. James, do you recall what this
should be?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Apr 7, 2020 at 7:02 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Mon, Apr 06, 2020 at 11:25:21PM -0500, Justin Pryzby wrote:
> >On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:
> >> I've pushed the fist part of this patch series - I've reorganized it a
> >
> >I scanned through this again post-commit.  Find attached some suggestions.
> >
>
> Thanks. The typo fixes seem clear, except for this bit:
>
>    * If we've set up either of the sort states yet, we need to reset them.
>    * We could end them and null out the pointers, but there's no reason to
>    * repay the setup cost, and because ???? guard setting up pivot comparator
>    * state similarly, doing so might actually cause a leak.
>
> I can't figure out what ???? should be. James, do you recall what this
> should be?

Yep, it's ExecIncrementalSort. If you look for the block guarded by
`if (fullsort_state == NULL)` you'll see the call to
preparePresortedCols(), which sets up the pivot comparator state
referenced by this comment.

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Tue, Apr 07, 2020 at 07:50:26PM -0400, James Coleman wrote:
>On Tue, Apr 7, 2020 at 7:02 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Mon, Apr 06, 2020 at 11:25:21PM -0500, Justin Pryzby wrote:
>> >On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:
>> >> I've pushed the fist part of this patch series - I've reorganized it a
>> >
>> >I scanned through this again post-commit.  Find attached some suggestions.
>> >
>>
>> Thanks. The typo fixes seem clear, except for this bit:
>>
>>    * If we've set up either of the sort states yet, we need to reset them.
>>    * We could end them and null out the pointers, but there's no reason to
>>    * repay the setup cost, and because ???? guard setting up pivot comparator
>>    * state similarly, doing so might actually cause a leak.
>>
>> I can't figure out what ???? should be. James, do you recall what this
>> should be?
>
>Yep, it's ExecIncrementalSort. If you look for the block guarded by
>`if (fullsort_state == NULL)` you'll see the call to
>preparePresortedCols(), which sets up the pivot comparator state
>referenced by this comment.
>

OK, so it should be "... and because ExecIncrementalSort guard ..."?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Tue, Apr 7, 2020 at 7:58 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Apr 07, 2020 at 07:50:26PM -0400, James Coleman wrote:
> >On Tue, Apr 7, 2020 at 7:02 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> On Mon, Apr 06, 2020 at 11:25:21PM -0500, Justin Pryzby wrote:
> >> >On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:
> >> >> I've pushed the fist part of this patch series - I've reorganized it a
> >> >
> >> >I scanned through this again post-commit.  Find attached some suggestions.
> >> >
> >>
> >> Thanks. The typo fixes seem clear, except for this bit:
> >>
> >>    * If we've set up either of the sort states yet, we need to reset them.
> >>    * We could end them and null out the pointers, but there's no reason to
> >>    * repay the setup cost, and because ???? guard setting up pivot comparator
> >>    * state similarly, doing so might actually cause a leak.
> >>
> >> I can't figure out what ???? should be. James, do you recall what this
> >> should be?
> >
> >Yep, it's ExecIncrementalSort. If you look for the block guarded by
> >`if (fullsort_state == NULL)` you'll see the call to
> >preparePresortedCols(), which sets up the pivot comparator state
> >referenced by this comment.
> >
>
> OK, so it should be "... and because ExecIncrementalSort guard ..."?

Yes, "because ExecIncrementalSort guards presorted column functions by
checking to see if the full sort state has been initialized yet,
setting the sort states to null here might cause..." (that's more
specific IMO than my original "pivot comparator state...doing so").
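
Putting Justin's original text and that wording together, the comment would
read roughly like this (a sketch of the wording only, not the committed hunk):

    /*
     * If we've set up either of the sort states yet, we need to reset them.
     * We could end them and null out the pointers, but there's no reason to
     * repay the setup cost, and because ExecIncrementalSort guards presorted
     * column functions by checking to see if the full sort state has been
     * initialized yet, setting the sort states to null here might actually
     * cause a leak.
     */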

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
Hi,

I've pushed the second part of this patch series, adding incremental
sort to additional places in the planner. As explained in the commit
message (and somewhere in this thread) we've decided to only update some
of the places that require sorted input (and do create_sort). This might
be overly cautious, I expect we'll add it to more places in the future.


As for the remaining part tweaking add_partial_path to also consider
startup cost (a bit confusingly 0001 - 0003 in the v54 patchset), I've
decided not to push it now and leave it for v14. The add_partial_path
change is simple, but I came to the conclusion that the "precheck"
function should be modified to follow the same logic - it would be a bit
strange if add_partial_path_precheck considered only total cost and
add_partial_path considered both startup and total cost. It would not
matter for most places because the add_partial_path_precheck is only used
in join planning, but it's still strange.
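
To sketch the idea (this is a hypothetical helper for illustration, not the
real add_partial_path code), the dominance test being discussed amounts to
something like:

    /*
     * Hypothetical illustration only, not the real add_partial_path: the
     * tweak under discussion is to keep a new partial path that loses on
     * total cost but wins on startup cost, instead of comparing total
     * cost alone.
     */
    typedef double Cost;

    bool
    new_partial_path_is_interesting(Cost old_startup, Cost old_total,
                                    Cost new_startup, Cost new_total)
    {
        if (new_total < old_total)
            return true;    /* cheaper overall, clearly worth keeping */
        if (new_startup < old_startup)
            return true;    /* costs more in total but starts returning
                             * rows sooner, which can win under LIMIT */
        return false;       /* dominated on both startup and total cost */
    }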

I could have modified the add_partial_path_precheck too, but looking at
add_path_precheck we'd probably need to compute required_outer so that
we only compare startup_cost when really useful. Or we might simply
consider startup_cost every time and leave that up to add_partial_path,
but then that would be another difference in behavior.

That seems like way too much stuff to rework on the last day of the last
commitfest. It does mean we'll fail to generate the cheapest plan in some
cases (e.g. with LIMIT, there's an example in [1]) but that's a
pre-existing condition, not something introduced by incremental sort.

regards


[1] https://www.postgresql.org/message-id/20190720132244.3vgg2uynfpxh3me5%40development

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tom Lane
Date:
hyrax is not too happy with this test:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2020-04-07%2004%3A55%3A15

It's not too clear to me why CLOBBER_CACHE_ALWAYS would be breaking
EXPLAIN output, but it evidently is.

            regards, tom lane



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Tue, Apr 07, 2020 at 11:54:23PM -0400, Tom Lane wrote:
>hyrax is not too happy with this test:
>
>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2020-04-07%2004%3A55%3A15
>
>It's not too clear to me why CLOBBER_CACHE_ALWAYS would be breaking
>EXPLAIN output, but it evidently is.
>

Thanks, I'll investigate. It's not clear to me either what might be
causing this, but I guess something must have gone wrong in
estimation/planning.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Wed, Apr 08, 2020 at 12:51:05PM +0200, Tomas Vondra wrote:
>On Tue, Apr 07, 2020 at 11:54:23PM -0400, Tom Lane wrote:
>>hyrax is not too happy with this test:
>>
>>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2020-04-07%2004%3A55%3A15
>>
>>It's not too clear to me why CLOBBER_CACHE_ALWAYS would be breaking
>>EXPLAIN output, but it evidently is.
>>
>
>Thanks, I'll investigate. It's not clear to me either what might be
>causing this, but I guess something must have gone wrong in
>estimation/planning.
>

OK, I know what's going on - it's a rather embarrassing issue in the
regression test. There's no analyze on the test tables, so it uses
default estimates for the number of groups etc. But with clobber cache the
test runs long enough for autoanalyze to kick in and collect stats, so
we generate better estimates, which changes the plan.

I'll get this fixed - explicit analyze and tweaking the data a bit
should do the trick.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Wed, Apr 8, 2020 at 9:43 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Wed, Apr 08, 2020 at 12:51:05PM +0200, Tomas Vondra wrote:
> >On Tue, Apr 07, 2020 at 11:54:23PM -0400, Tom Lane wrote:
> >>hyrax is not too happy with this test:
> >>
> >>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2020-04-07%2004%3A55%3A15
> >>
> >>It's not too clear to me why CLOBBER_CACHE_ALWAYS would be breaking
> >>EXPLAIN output, but it evidently is.
> >>
> >
> >Thanks, I'll investigate. It's not clear to me either what might be
> >causing this, but I guess something must have gone wrong in
> >estimation/planning.
> >
>
> OK, I know what's going on - it's a rather embarassing issue in the
> regression test. There's no analyze on the test tables, so it uses
> default estimates for number of groups etc. But with clobber cache the
> test runs long enough for autoanalyze to kick in and collect stats, so
> we generate better estimates which changes the plan.
>
> I'll get this fixed - explicit analyze and tweaking the data a bit
> should do the trick.

Looking at the tests that failed, I think we should consider just adding:
set enable_sort = off;
because several of those tests have very specific amounts of data to
ensure we test the transition points around the different modes in the
incremental sort node.

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Wed, Apr 08, 2020 at 09:54:42AM -0400, James Coleman wrote:
>On Wed, Apr 8, 2020 at 9:43 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Wed, Apr 08, 2020 at 12:51:05PM +0200, Tomas Vondra wrote:
>> >On Tue, Apr 07, 2020 at 11:54:23PM -0400, Tom Lane wrote:
>> >>hyrax is not too happy with this test:
>> >>
>> >>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2020-04-07%2004%3A55%3A15
>> >>
>> >>It's not too clear to me why CLOBBER_CACHE_ALWAYS would be breaking
>> >>EXPLAIN output, but it evidently is.
>> >>
>> >
>> >Thanks, I'll investigate. It's not clear to me either what might be
>> >causing this, but I guess something must have gone wrong in
>> >estimation/planning.
>> >
>>
>> OK, I know what's going on - it's a rather embarassing issue in the
>> regression test. There's no analyze on the test tables, so it uses
>> default estimates for number of groups etc. But with clobber cache the
>> test runs long enough for autoanalyze to kick in and collect stats, so
>> we generate better estimates which changes the plan.
>>
>> I'll get this fixed - explicit analyze and tweaking the data a bit
>> should do the trick.
>
>Looking at the tests that failed, I think we should consider just adding:
>set enable_sort = off;
>because several of those tests have very specific amounts of data to
>ensure we test the transition points around the different modes in the
>incremental sort node.
>

Maybe, but I'd much rather tweak the data so that we test both the
costing and execution part.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Wed, Apr 08, 2020 at 04:08:39PM +0200, Tomas Vondra wrote:
>On Wed, Apr 08, 2020 at 09:54:42AM -0400, James Coleman wrote:
>>On Wed, Apr 8, 2020 at 9:43 AM Tomas Vondra
>><tomas.vondra@2ndquadrant.com> wrote:
>>>
>>>On Wed, Apr 08, 2020 at 12:51:05PM +0200, Tomas Vondra wrote:
>>>>On Tue, Apr 07, 2020 at 11:54:23PM -0400, Tom Lane wrote:
>>>>>hyrax is not too happy with this test:
>>>>>
>>>>>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2020-04-07%2004%3A55%3A15
>>>>>
>>>>>It's not too clear to me why CLOBBER_CACHE_ALWAYS would be breaking
>>>>>EXPLAIN output, but it evidently is.
>>>>>
>>>>
>>>>Thanks, I'll investigate. It's not clear to me either what might be
>>>>causing this, but I guess something must have gone wrong in
>>>>estimation/planning.
>>>>
>>>
>>>OK, I know what's going on - it's a rather embarassing issue in the
>>>regression test. There's no analyze on the test tables, so it uses
>>>default estimates for number of groups etc. But with clobber cache the
>>>test runs long enough for autoanalyze to kick in and collect stats, so
>>>we generate better estimates which changes the plan.
>>>
>>>I'll get this fixed - explicit analyze and tweaking the data a bit
>>>should do the trick.
>>
>>Looking at the tests that failed, I think we should consider just adding:
>>set enable_sort = off;
>>because several of those tests have very specific amounts of data to
>>ensure we test the transition points around the different modes in the
>>incremental sort node.
>>
>
>Maybe, but I'd much rather tweak the data so that we test both the
>costing and execution part.
>

I do think this does the trick by increasing the number of rows a bit
(from 100 to 1000) to make the Sort more expensive than Incremental
Sort, while still testing the transition points.

James, can you verify that's still true?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Wed, Apr 8, 2020 at 11:02 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Wed, Apr 08, 2020 at 04:08:39PM +0200, Tomas Vondra wrote:
> >On Wed, Apr 08, 2020 at 09:54:42AM -0400, James Coleman wrote:
> >>On Wed, Apr 8, 2020 at 9:43 AM Tomas Vondra
> >><tomas.vondra@2ndquadrant.com> wrote:
> >>>
> >>>On Wed, Apr 08, 2020 at 12:51:05PM +0200, Tomas Vondra wrote:
> >>>>On Tue, Apr 07, 2020 at 11:54:23PM -0400, Tom Lane wrote:
> >>>>>hyrax is not too happy with this test:
> >>>>>
> >>>>>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2020-04-07%2004%3A55%3A15
> >>>>>
> >>>>>It's not too clear to me why CLOBBER_CACHE_ALWAYS would be breaking
> >>>>>EXPLAIN output, but it evidently is.
> >>>>>
> >>>>
> >>>>Thanks, I'll investigate. It's not clear to me either what might be
> >>>>causing this, but I guess something must have gone wrong in
> >>>>estimation/planning.
> >>>>
> >>>
> >>>OK, I know what's going on - it's a rather embarassing issue in the
> >>>regression test. There's no analyze on the test tables, so it uses
> >>>default estimates for number of groups etc. But with clobber cache the
> >>>test runs long enough for autoanalyze to kick in and collect stats, so
> >>>we generate better estimates which changes the plan.
> >>>
> >>>I'll get this fixed - explicit analyze and tweaking the data a bit
> >>>should do the trick.
> >>
> >>Looking at the tests that failed, I think we should consider just adding:
> >>set enable_sort = off;
> >>because several of those tests have very specific amounts of data to
> >>ensure we test the transition points around the different modes in the
> >>incremental sort node.
> >>
> >
> >Maybe, but I'd much rather tweak the data so that we test both the
> >costing and execution part.
> >
>
> I do think this does the trick by increasing the number of rows a bit
> (from 100 to 1000) to make the Sort more expensive than Incremental
> Sort, while still testing the transition points.
>
> James, can you verify it that's still true?

Those changes all look good to me from a "testing correctness" POV.
Also I like that we now test multiple sort methods in the explain
output, like: "Sort Methods: top-N heapsort, quicksort".

I personally find the `i/100` notation harder to read than a case, but
that's just an opinion...

Should we change `analyze` to `analyze t` to avoid unnecessarily
re-analyzing all other tables in the regression db?

James



Re: [PATCH] Incremental sort

From
David Steele
Date:
On 4/8/20 11:13 AM, James Coleman wrote:
>>
>> James, can you verify it that's still true?

I marked this entry as committed in the 2020-03 CF but it's not clear to 
me if that's entirely true. I'll leave it up to you (all) to move it to 
the 2020-07 CF if there is remaining work (other than making the build 
farm happy).

Regards,
-- 
-David
david@pgmasters.net



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tom Lane
Date:
James Coleman <jtc331@gmail.com> writes:
> Should we change `analyze` to `analyze t` to avoid unnecessarily
> re-analyzing all other tables in the regression db?

Yes, a global analyze here is a remarkably horrid idea.

            regards, tom lane



Re: [PATCH] Incremental sort

From
James Coleman
Date:
On Wed, Apr 8, 2020 at 11:29 AM David Steele <david@pgmasters.net> wrote:
>
> On 4/8/20 11:13 AM, James Coleman wrote:
> >>
> >> James, can you verify it that's still true?
>
> I marked this entry as committed in the 2020-03 CF but it's not clear to
> me if that's entirely true. I'll leave it up to you (all) to move it to
> the 2020-07 CF if there is remaining work (other than making the build
> farm happy).

Thanks.

I think it's true enough. The vast majority is committed, and the
small amount that isn't we'll leave as future improvements and
separate threads.

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Wed, Apr 08, 2020 at 11:13:26AM -0400, James Coleman wrote:
>On Wed, Apr 8, 2020 at 11:02 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Wed, Apr 08, 2020 at 04:08:39PM +0200, Tomas Vondra wrote:
>> >On Wed, Apr 08, 2020 at 09:54:42AM -0400, James Coleman wrote:
>> >>On Wed, Apr 8, 2020 at 9:43 AM Tomas Vondra
>> >><tomas.vondra@2ndquadrant.com> wrote:
>> >>>
>> >>>On Wed, Apr 08, 2020 at 12:51:05PM +0200, Tomas Vondra wrote:
>> >>>>On Tue, Apr 07, 2020 at 11:54:23PM -0400, Tom Lane wrote:
>> >>>>>hyrax is not too happy with this test:
>> >>>>>
>> >>>>>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2020-04-07%2004%3A55%3A15
>> >>>>>
>> >>>>>It's not too clear to me why CLOBBER_CACHE_ALWAYS would be breaking
>> >>>>>EXPLAIN output, but it evidently is.
>> >>>>>
>> >>>>
>> >>>>Thanks, I'll investigate. It's not clear to me either what might be
>> >>>>causing this, but I guess something must have gone wrong in
>> >>>>estimation/planning.
>> >>>>
>> >>>
>> >>>OK, I know what's going on - it's a rather embarassing issue in the
>> >>>regression test. There's no analyze on the test tables, so it uses
>> >>>default estimates for number of groups etc. But with clobber cache the
>> >>>test runs long enough for autoanalyze to kick in and collect stats, so
>> >>>we generate better estimates which changes the plan.
>> >>>
>> >>>I'll get this fixed - explicit analyze and tweaking the data a bit
>> >>>should do the trick.
>> >>
>> >>Looking at the tests that failed, I think we should consider just adding:
>> >>set enable_sort = off;
>> >>because several of those tests have very specific amounts of data to
>> >>ensure we test the transition points around the different modes in the
>> >>incremental sort node.
>> >>
>> >
>> >Maybe, but I'd much rather tweak the data so that we test both the
>> >costing and execution part.
>> >
>>
>> I do think this does the trick by increasing the number of rows a bit
>> (from 100 to 1000) to make the Sort more expensive than Incremental
>> Sort, while still testing the transition points.
>>
>> James, can you verify it that's still true?
>
>Those changes all look good to me from a "testing correctness" POV.
>Also I like that we now test multiple sort methods in the explain
>output, like: "Sort Methods: top-N heapsort, quicksort".
>

OK, good. I'll push the fix.

>I personally find the `i/100` notation harder to read than a case, but
>that's just an opinion...
>

Yeah, but with 1000 rows we'd need a more complex CASE statement (I
don't think simply having two groups - small+large would work).

>Should we change `analyze` to `analyze t` to avoid unnecessarily
>re-analyzing all other tables in the regression db?
>

Ah, definitely. That was a mistake. Thanks for noticing.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort

From
Tomas Vondra
Date:
On Wed, Apr 08, 2020 at 11:42:12AM -0400, James Coleman wrote:
>On Wed, Apr 8, 2020 at 11:29 AM David Steele <david@pgmasters.net> wrote:
>>
>> On 4/8/20 11:13 AM, James Coleman wrote:
>> >>
>> >> James, can you verify it that's still true?
>>
>> I marked this entry as committed in the 2020-03 CF but it's not clear to
>> me if that's entirely true. I'll leave it up to you (all) to move it to
>> the 2020-07 CF if there is remaining work (other than making the build
>> farm happy).
>
>Thanks.
>
>I think it's true enough. The vast majority is committed, and the
>small amount that isn't we'll leave as future improvements and
>separate threads.
>

Right. The remaining bit is a generic issue not entirely specific to
incremental sort, so we better not hide it in this CF entry.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
One thing I just noticed and had a question about: in
preparePresortedCols (which sets up a function call context), do we
need to call pg_proc_aclcheck?

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Fri, Apr 10, 2020 at 10:12 AM James Coleman <jtc331@gmail.com> wrote:
>
> One thing I just noticed and had a question about: in
> preparePresortedCols (which sets up a function call context), do we
> need to call pg_proc_aclcheck?

Background: this came up because I noticed that pg_proc_aclcheck is
called in the scalar array op case in execExpr.c.
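For reference, that pattern looks roughly like this (a sketch from memory;
"eqfuncid" is just an illustrative name for the function OID being looked
up, not the actual variable in execExpr.c):

    AclResult   aclresult;

    /* ACL check + object access hook for a function referenced by the query */
    aclresult = pg_proc_aclcheck(eqfuncid, GetUserId(), ACL_EXECUTE);
    if (aclresult != ACLCHECK_OK)
        aclcheck_error(aclresult, OBJECT_FUNCTION, get_func_name(eqfuncid));

    InvokeFunctionExecuteHook(eqfuncid);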

However, grepping through the source code, I see several places where a
function (including an equality op for an ordering op, like the case
we have here) gets looked up without calling pg_proc_aclcheck, but
then other places where the acl check is invoked.

In addition, I haven't been able to discern a reason why sometimes
InvokeFunctionExecuteHook gets called with the function after lookup,
but not others.

So I'm not sure whether either of these needs to be added to the equality
op/function lookup code in nodeIncrementalSort's preparePresortedCols
or not.

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tom Lane
Date:
James Coleman <jtc331@gmail.com> writes:
> On Fri, Apr 10, 2020 at 10:12 AM James Coleman <jtc331@gmail.com> wrote:
>> One thing I just noticed and had a question about: in
>> preparePresortedCols (which sets up a function call context), do we
>> need to call pg_proc_aclcheck?

> Background: this came up because I noticed that pg_proc_aclcheck is
> called in the scalar array op case in execExpr.c.

> However grepping through the source code I see several places where a
> function (including an equality op for an ordering op, like the case
> we have here) gets looked up without calling pg_proc_aclcheck, but
> then other places where the acl check is invoked.

Rule of thumb is that we don't apply ACL checks to functions/ops
we get out of an opclass; adding a function to an opclass is tantamount
to giving public execute permission on it.  If the function/operator
reference came directly from the SQL query it must be checked.

> In addition, I haven't been able to discern a reason for why sometimes
> InvokeFunctionExecuteHook gets called with the function after lookup,
> but not others.

I would not stand here and say that that hook infrastructure is worth
anything at all.  Maybe the coverage is sufficient for some use-cases,
but who's to say?

            regards, tom lane



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Thu, Apr 16, 2020 at 1:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> James Coleman <jtc331@gmail.com> writes:
> > On Fri, Apr 10, 2020 at 10:12 AM James Coleman <jtc331@gmail.com> wrote:
> >> One thing I just noticed and had a question about: in
> >> preparePresortedCols (which sets up a function call context), do we
> >> need to call pg_proc_aclcheck?
>
> > Background: this came up because I noticed that pg_proc_aclcheck is
> > called in the scalar array op case in execExpr.c.
>
> > However grepping through the source code I see several places where a
> > function (including an equality op for an ordering op, like the case
> > we have here) gets looked up without calling pg_proc_aclcheck, but
> > then other places where the acl check is invoked.
>
> Rule of thumb is that we don't apply ACL checks to functions/ops
> we get out of an opclass; adding a function to an opclass is tantamount
> to giving public execute permission on it.  If the function/operator
> reference came directly from the SQL query it must be checked.

All right, in that case I believe we're OK here without modification.
We're looking up the equality op based on the ordering op the planner
has already selected for sorting the query, and I'm assuming that
looking that up via the op family is in the same category as "getting
out of an opclass" (since opclasses are part of an opfamily).

Thanks for the explanation.

> > In addition, I haven't been able to discern a reason for why sometimes
> > InvokeFunctionExecuteHook gets called with the function after lookup,
> > but not others.
>
> I would not stand here and say that that hook infrastructure is worth
> anything at all.  Maybe the coverage is sufficient for some use-cases,
> but who's to say?

Interesting. It does look to be particularly underused. Just grepping
for that hook invocation macro shows, for example, that it's not used
in nodeSort.c or tuplesort.c, so clearly it's not executed for the
functions we'd use in regular sort. Given that...I think we can
proceed without it here too.

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Justin Pryzby
Date:
On Tue, Apr 07, 2020 at 10:53:05AM -0500, Justin Pryzby wrote:
> On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:
> > > And, should it use two spaces before "Sort Method", "Memory" and "Pre-sorted
> ...
> > I read through that subthread, and the ending seemed to be Peter
> > wanting things to be unified. Was there a conclusion beyond that?
> 
> This discussion is ongoing.  I think let's wait until that's settled before
> addressing this more complex and even newer case.  We can add "explain, two
> spaces and equals vs colon" to the "Open items" list if need be - I hope the
> discussion will not delay the release.

The change proposed on the WAL thread is minimal, and makes the new explain(WAL)
output consistent with that of explain(BUFFERS).

That uses a different format from "Sort", which is what incremental sort should
follow.  (Hashjoin also uses Sort's format of two spaces and colons rather
than equals.)

So the attached 0001 makes explain output for incremental sort more consistent
with sort:

     - Two spaces;
     - colons rather than equals;
     - Don't use semicolon, which isn't in use anywhere else;

I tested with this:
template1=# DROP TABLE t; CREATE TABLE t(i int, j int); INSERT INTO t SELECT a-(a%100), a%1000 FROM generate_series(1,99999)a;CREATE INDEX ON t(i); VACUUM VERBOSE ANALYZE t;

template1=# explain analyze SELECT * FROM t a ORDER BY i,j;
...
   Full-sort Groups: 1000  Sort Method: quicksort  Average Memory: 28kB  Peak Memory: 28kB  Pre-sorted Groups: 1000  Sort Method: quicksort  Average Memory: 30kB  Peak Memory: 30kB
 

On Tue, Apr 07, 2020 at 05:34:15PM +0200, Tomas Vondra wrote:
> On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:
> > On Tue, Apr 7, 2020 at 12:25 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > Should "Pre-sorted Groups:" be on a separate line ?
> > > > | Full-sort Groups: 1 Sort Method: quicksort Memory: avg=28kB peak=28kB Pre-sorted Groups: 1 Sort Method: quicksort Memory: avg=30kB peak=30kB
> > 
> > I'd originally had that, but Tomas wanted it to be more compact. It's
> > easy to adjust though if the consensus changes on that.
> 
> I'm OK with changing the format if there's a consensus. The current
> format seemed better to me, but I'm not particularly attached to it.

I still think Pre-sorted Groups should be on a separate line, as in 0002.
In addition to looking better (to me) and being easier to read, another reason
is that these are essentially key=>value pairs, but the keys (Sort Method,
etc.) are repeated.

I also suggested renaming: s/Presorted/Pre-sorted/, but I didn't do that here.

Michael already patched most of the comment typos; the remainder I'm including
here as a "nearby patch".

-- 
Justin

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Sat, Apr 18, 2020 at 10:36 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Tue, Apr 07, 2020 at 10:53:05AM -0500, Justin Pryzby wrote:
> > On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:
> > > > And, should it use two spaces before "Sort Method", "Memory" and "Pre-sorted
> > ...
> > > I read through that subthread, and the ending seemed to be Peter
> > > wanting things to be unified. Was there a conclusion beyond that?
> >
> > This discussion is ongoing.  I think let's wait until that's settled before
> > addressing this more complex and even newer case.  We can add "explain, two
> > spaces and equals vs colon" to the "Open items" list if need be - I hope the
> > discussion will not delay the release.
>
> The change proposed on the WAL thread is minimal, and makes new explain(WAL)
> output consistent with the that of explain(BUFFERS).
>
> That uses a different format from "Sort", which is what incremental sort should
> follow.  (Hashjoin also uses the Sort's format of two-spaces and colons rather
> than equals).

I think it's not great that buffers/sort are different, but I agree
that we should follow sort.

> So the attached 0001 makes explain output for incremental sort more consistent
> with sort:
>
>      - Two spaces;
>      - colons rather than equals;
>      - Don't use semicolon, which isn't in use anywhere else;
>
> I tested with this:
> template1=# DROP TABLE t; CREATE TABLE t(i int, j int); INSERT INTO t SELECT a-(a%100), a%1000 FROM generate_series(1,99999)a;CREATE INDEX ON t(i); VACUUM VERBOSE ANALYZE t;
>
> template1=# explain analyze SELECT * FROM t a ORDER BY i,j;
> ...
>    Full-sort Groups: 1000  Sort Method: quicksort  Average Memory: 28kB  Peak Memory: 28kB  Pre-sorted Groups: 1000  Sort Method: quicksort  Average Memory: 30kB  Peak Memory: 30kB
>
> On Tue, Apr 07, 2020 at 05:34:15PM +0200, Tomas Vondra wrote:
> > On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:
> > > On Tue, Apr 7, 2020 at 12:25 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > > Should "Pre-sorted Groups:" be on a separate line ?
> > > > > | Full-sort Groups: 1 Sort Method: quicksort Memory: avg=28kB peak=28kB Pre-sorted Groups: 1 Sort Method: quicksort Memory: avg=30kB peak=30kB
> > >
> > > I'd originally had that, but Tomas wanted it to be more compact. It's
> > > easy to adjust though if the consensus changes on that.
> >
> > I'm OK with changing the format if there's a consensus. The current
> > format seemed better to me, but I'm not particularly attached to it.
>
> I still think Pre-sorted groups should be on a separate line, as in 0002.
> In addition to looking better (to me), and being easier to read, another reason
> is that there are essentially key=>values here, but the keys are repeated (Sort
> Method, etc).

I collapsed this into 0001 because I think that if we're going to do
away with the key=value style then we effectively have to do this
to avoid the repeated values being confusing (with key=value it kinda
worked, because that made it seem like the avg/peak were clearly a
subset of the Sort Groups info).

I also cleaned up the newline patch a bit in the process (we already
have a way to indent with a flag so don't need to do it directly).

> I also suggested to rename: s/Presorted/Pre-sorted/, but I didn't do that here.

I went ahead and did that too; we already use "Full-sort", so the
proposed change makes both parallel.

> Michael already patched most of the comment typos, the remainder I'm including
> here as a "nearby patch"..

Modified slightly.

James

Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Justin Pryzby
Date:
Checking if it's possible to address this open item before 13beta1.

https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items
consistency of explain output: two spaces, equals vs colons, semicolons (incremental sort) 

On Sun, Apr 19, 2020 at 09:46:55AM -0400, James Coleman wrote:
> On Sat, Apr 18, 2020 at 10:36 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> >
> > On Tue, Apr 07, 2020 at 10:53:05AM -0500, Justin Pryzby wrote:
> > > On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:
> > > > > And, should it use two spaces before "Sort Method", "Memory" and "Pre-sorted
> > > ...
> > > > I read through that subthread, and the ending seemed to be Peter
> > > > wanting things to be unified. Was there a conclusion beyond that?
> > >
> > > This discussion is ongoing.  I think let's wait until that's settled before
> > > addressing this more complex and even newer case.  We can add "explain, two
> > > spaces and equals vs colon" to the "Open items" list if need be - I hope the
> > > discussion will not delay the release.
> >
> > The change proposed on the WAL thread is minimal, and makes new explain(WAL)
> > output consistent with the that of explain(BUFFERS).
> >
> > That uses a different format from "Sort", which is what incremental sort should
> > follow.  (Hashjoin also uses the Sort's format of two-spaces and colons rather
> > than equals).
> 
> I think it's not great that buffers/sort are different, but I agree
> that we should follow sort.
> 
> > So the attached 0001 makes explain output for incremental sort more consistent
> > with sort:
> >
> >      - Two spaces;
> >      - colons rather than equals;
> >      - Don't use semicolon, which isn't in use anywhere else;
> >
> > I tested with this:
> > template1=# DROP TABLE t; CREATE TABLE t(i int, j int); INSERT INTO t SELECT a-(a%100), a%1000 FROM generate_series(1,99999)a;CREATE INDEX ON t(i); VACUUM VERBOSE ANALYZE t;
> >
> > template1=# explain analyze SELECT * FROM t a ORDER BY i,j;
> > ...
> >    Full-sort Groups: 1000  Sort Method: quicksort  Average Memory: 28kB  Peak Memory: 28kB  Pre-sorted Groups: 1000  Sort Method: quicksort  Average Memory: 30kB  Peak Memory: 30kB
> >
> > On Tue, Apr 07, 2020 at 05:34:15PM +0200, Tomas Vondra wrote:
> > > On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:
> > > > On Tue, Apr 7, 2020 at 12:25 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > > > Should "Pre-sorted Groups:" be on a separate line ?
> > > > > | Full-sort Groups: 1 Sort Method: quicksort Memory: avg=28kB peak=28kB Pre-sorted Groups: 1 Sort Method: quicksort Memory: avg=30kB peak=30kB
> > > >
> > > > I'd originally had that, but Tomas wanted it to be more compact. It's
> > > > easy to adjust though if the consensus changes on that.
> > >
> > > I'm OK with changing the format if there's a consensus. The current
> > > format seemed better to me, but I'm not particularly attached to it.
> >
> > I still think Pre-sorted groups should be on a separate line, as in 0002.
> > In addition to looking better (to me), and being easier to read, another reason
> > is that there are essentially key=>values here, but the keys are repeated (Sort
> > Method, etc).
> 
> I collapsed this into 0001 because I think that if we're going to do
> away with the key=value style then we effectively to have to do this
> to avoid the repeated values being confusing (with key=value it kinda
> worked, because that made it seem like the avg/peak were clearly a
> subset of the Sort Groups info).
> 
> I also cleaned up the newline patch a bit in the process (we already
> have a way to indent with a flag so don't need to do it directly).
> 
> > I also suggested to rename: s/Presorted/Pre-sorted/, but I didn't do that here.
> 
> I went ahead and did that too; we already use "Full-sort", so the
> proposed change makes both parallel.
> 
> > Michael already patched most of the comment typos, the remainder I'm including
> > here as a "nearby patch"..
> 
> Modified slightly.
> 
> James

> From becd60ba348558fa241db5cc2100a84b757cbdc5 Mon Sep 17 00:00:00 2001
> From: Justin Pryzby <pryzbyj@telsasoft.com>
> Date: Mon, 6 Apr 2020 17:37:31 -0500
> Subject: [PATCH v2 2/2] comment typos: Incremental Sort
> 
> commit d2d8a229bc58a2014dce1c7a4fcdb6c5ab9fb8da
> Author: Tomas Vondra <tomas.vondra@postgresql.org>
> 
> Previously reported here:
> https://www.postgresql.org/message-id/20200407042521.GH2228%40telsasoft.com
> ---
>  src/backend/commands/explain.c             |  4 ++--
>  src/backend/executor/nodeIncrementalSort.c | 10 ++++++----
>  src/backend/utils/sort/tuplesort.c         |  8 ++++----
>  3 files changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
> index 5f91c569a0..86c10458f4 100644
> --- a/src/backend/commands/explain.c
> +++ b/src/backend/commands/explain.c
> @@ -2865,7 +2865,7 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
>  }
>  
>  /*
> - * If it's EXPLAIN ANALYZE, show tuplesort stats for a incremental sort node
> + * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
>   */
>  static void
>  show_incremental_sort_info(IncrementalSortState *incrsortstate,
> @@ -2913,7 +2913,7 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
>              &incrsortstate->shared_info->sinfo[n];
>  
>              /*
> -             * If a worker hasn't process any sort groups at all, then exclude
> +             * If a worker hasn't processed any sort groups at all, then exclude
>               * it from output since it either didn't launch or didn't
>               * contribute anything meaningful.
>               */
> diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
> index 39ba11cdf7..05c60ec3e0 100644
> --- a/src/backend/executor/nodeIncrementalSort.c
> +++ b/src/backend/executor/nodeIncrementalSort.c
> @@ -987,8 +987,8 @@ ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
>  
>      /*
>       * Incremental sort can't be used with either EXEC_FLAG_REWIND,
> -     * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only one of many sort
> -     * batches in the current sort state.
> +     * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because the current sort state
> +     * contains only one sort batch rather than the full result set.
>       */
>      Assert((eflags & (EXEC_FLAG_BACKWARD |
>                        EXEC_FLAG_MARK)) == 0);
> @@ -1153,8 +1153,10 @@ ExecReScanIncrementalSort(IncrementalSortState *node)
>      /*
>       * If we've set up either of the sort states yet, we need to reset them.
>       * We could end them and null out the pointers, but there's no reason to
> -     * repay the setup cost, and because guard setting up pivot comparator
> -     * state similarly, doing so might actually cause a leak.
> +     * repay the setup cost, and because ExecIncrementalSort guards
> +     * presorted column functions by checking to see if the full sort state
> +     * has been initialized yet, setting the sort states to null here might
> +     * actually cause a leak.
>       */
>      if (node->fullsort_state != NULL)
>      {
> diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
> index de38c6c7e0..d59e3d5a8d 100644
> --- a/src/backend/utils/sort/tuplesort.c
> +++ b/src/backend/utils/sort/tuplesort.c
> @@ -1428,11 +1428,11 @@ tuplesort_updatemax(Tuplesortstate *state)
>      }
>  
>      /*
> -     * Sort evicts data to the disk when it didn't manage to fit those data to
> -     * the main memory.  This is why we assume space used on the disk to be
> +     * Sort evicts data to the disk when it wasn't able to fit that data into
> +     * main memory.  This is why we assume space used on the disk to be
>       * more important for tracking resource usage than space used in memory.
> -     * Note that amount of space occupied by some tuple set on the disk might
> -     * be less than amount of space occupied by the same tuple set in the
> +     * Note that the amount of space occupied by some tupleset on the disk might
> +     * be less than amount of space occupied by the same tupleset in
>       * memory due to more compact representation.
>       */
>      if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
> -- 
> 2.17.1
> 

> From 8e80be2b345c3940f76ffbd5e3c201a7ae855784 Mon Sep 17 00:00:00 2001
> From: Justin Pryzby <pryzbyj@telsasoft.com>
> Date: Wed, 15 Apr 2020 08:45:21 -0500
> Subject: [PATCH v2 1/2] Fix explain output for incr sort:
> 
>  - Two spaces
>  - colons rather than equals
>  - Don't use semicolon
>  - Put Pre-sorted groups on a separate line
>  - Rename Presorted to Pre-sorted (to match Full-sort)
> ---
>  src/backend/commands/explain.c                | 22 ++++++++-----------
>  .../regress/expected/incremental_sort.out     | 21 +++++++++---------
>  src/test/regress/sql/incremental_sort.sql     |  4 ++--
>  3 files changed, 22 insertions(+), 25 deletions(-)
> 
> diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
> index 7ae6131676..5f91c569a0 100644
> --- a/src/backend/commands/explain.c
> +++ b/src/backend/commands/explain.c
> @@ -2778,7 +2778,7 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
>      {
>          if (indent)
>              appendStringInfoSpaces(es->str, es->indent * 2);
> -        appendStringInfo(es->str, "%s Groups: " INT64_FORMAT " Sort Method", groupLabel,
> +        appendStringInfo(es->str, "%s Groups: " INT64_FORMAT "  Sort Method", groupLabel,
>                           groupInfo->groupCount);
>          /* plural/singular based on methodNames size */
>          if (list_length(methodNames) > 1)
> @@ -2798,24 +2798,20 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
>              const char *spaceTypeName;
>  
>              spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_MEMORY);
> -            appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
> +            appendStringInfo(es->str, "  Average %s: %ldkB  Peak %s: %ldkB",
>                               spaceTypeName, avgSpace,
> -                             groupInfo->maxMemorySpaceUsed);
> +                             spaceTypeName, groupInfo->maxMemorySpaceUsed);
>          }
>  
>          if (groupInfo->maxDiskSpaceUsed > 0)
>          {
>              long        avgSpace = groupInfo->totalDiskSpaceUsed / groupInfo->groupCount;
> -
>              const char *spaceTypeName;
>  
>              spaceTypeName = tuplesort_space_type_name(SORT_SPACE_TYPE_DISK);
> -            /* Add a semicolon separator only if memory stats were printed. */
> -            if (groupInfo->maxMemorySpaceUsed > 0)
> -                appendStringInfo(es->str, ";");
> -            appendStringInfo(es->str, " %s: avg=%ldkB peak=%ldkB",
> +            appendStringInfo(es->str, "  Average %s: %ldkB  Peak %s: %ldkB",
>                               spaceTypeName, avgSpace,
> -                             groupInfo->maxDiskSpaceUsed);
> +                             spaceTypeName, groupInfo->maxDiskSpaceUsed);
>          }
>      }
>      else
> @@ -2899,8 +2895,8 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
>          if (prefixsortGroupInfo->groupCount > 0)
>          {
>              if (es->format == EXPLAIN_FORMAT_TEXT)
> -                appendStringInfo(es->str, " ");
> -            show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
> +                appendStringInfo(es->str, "\n");
> +            show_incremental_sort_group_info(prefixsortGroupInfo, "Pre-sorted", true, es);
>          }
>          if (es->format == EXPLAIN_FORMAT_TEXT)
>              appendStringInfo(es->str, "\n");
> @@ -2943,8 +2939,8 @@ show_incremental_sort_info(IncrementalSortState *incrsortstate,
>              if (prefixsortGroupInfo->groupCount > 0)
>              {
>                  if (es->format == EXPLAIN_FORMAT_TEXT)
> -                    appendStringInfo(es->str, " ");
> -                show_incremental_sort_group_info(prefixsortGroupInfo, "Presorted", false, es);
> +                    appendStringInfo(es->str, "\n");
> +                show_incremental_sort_group_info(prefixsortGroupInfo, "Pre-sorted", true, es);
>              }
>              if (es->format == EXPLAIN_FORMAT_TEXT)
>                  appendStringInfo(es->str, "\n");
> diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
> index 238d89a206..2b40a26d82 100644
> --- a/src/test/regress/expected/incremental_sort.out
> +++ b/src/test/regress/expected/incremental_sort.out
> @@ -106,7 +106,7 @@ declare
>    space_key text;
>  begin
>    for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
> -    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
> +    for group_key in select unnest(array['Full-sort Groups', 'Pre-sorted Groups']::text[]) t loop
>        for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
>          node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
>          node := jsonb_set(node, array[group_key, space_key, 'Peak Sort Space Used'], '"NN"', false);
> @@ -128,7 +128,7 @@ declare
>    space_key text;
>  begin
>    for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
> -    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
> +    for group_key in select unnest(array['Full-sort Groups', 'Pre-sorted Groups']::text[]) t loop
>        group_stats := node->group_key;
>        for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
>        if (group_stats->space_key->'Peak Sort Space Used')::bigint < (group_stats->space_key->'Peak Sort Space Used')::bigint then
> @@ -533,13 +533,13 @@ select * from (select * from t order by a) s order by a, b limit 55;
>  
>  -- Test EXPLAIN ANALYZE with only a fullsort group.
>  select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 55');
> -                                 explain_analyze_without_memory                                 
> -------------------------------------------------------------------------------------------------
> +                                        explain_analyze_without_memory                                         
> +---------------------------------------------------------------------------------------------------------------
>   Limit (actual rows=55 loops=1)
>     ->  Incremental Sort (actual rows=55 loops=1)
>           Sort Key: t.a, t.b
>           Presorted Key: t.a
> -         Full-sort Groups: 2 Sort Methods: top-N heapsort, quicksort Memory: avg=NNkB peak=NNkB
> +         Full-sort Groups: 2  Sort Methods: top-N heapsort, quicksort  Average Memory: NNkB  Peak Memory: NNkB
>           ->  Sort (actual rows=101 loops=1)
>                 Sort Key: t.a
>                 Sort Method: quicksort  Memory: NNkB
> @@ -708,18 +708,19 @@ select * from t left join (select * from (select * from t order by a) v order by
>  rollback;
>  -- Test EXPLAIN ANALYZE with both fullsort and presorted groups.
>  select explain_analyze_without_memory('select * from (select * from t order by a) s order by a, b limit 70');
> -                                                                    explain_analyze_without_memory
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
> +                                         explain_analyze_without_memory                                         
> +----------------------------------------------------------------------------------------------------------------
>   Limit (actual rows=70 loops=1)
>     ->  Incremental Sort (actual rows=70 loops=1)
>           Sort Key: t.a, t.b
>           Presorted Key: t.a
> -         Full-sort Groups: 1 Sort Method: quicksort Memory: avg=NNkB peak=NNkB Presorted Groups: 5 Sort Methods: top-N heapsort, quicksort Memory: avg=NNkB peak=NNkB
> +         Full-sort Groups: 1  Sort Method: quicksort  Average Memory: NNkB  Peak Memory: NNkB
> +         Pre-sorted Groups: 5  Sort Methods: top-N heapsort, quicksort  Average Memory: NNkB  Peak Memory: NNkB
>           ->  Sort (actual rows=1000 loops=1)
>                 Sort Key: t.a
>                 Sort Method: quicksort  Memory: NNkB
>                 ->  Seq Scan on t (actual rows=1000 loops=1)
> -(9 rows)
> +(10 rows)
>  
>  select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from (select * from t order by a) s order by a, b limit 70'));
>                    jsonb_pretty                   
> @@ -747,7 +748,7 @@ select jsonb_pretty(explain_analyze_inc_sort_nodes_without_memory('select * from
>                   "Average Sort Space Used": "NN"+
>               }                                  +
>           },                                     +
> -         "Presorted Groups": {                  +
> +         "Pre-sorted Groups": {                 +
>               "Group Count": 5,                  +
>               "Sort Methods Used": [             +
>                   "top-N heapsort",              +
> diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
> index 2241cc9c02..6f70b36a81 100644
> --- a/src/test/regress/sql/incremental_sort.sql
> +++ b/src/test/regress/sql/incremental_sort.sql
> @@ -82,7 +82,7 @@ declare
>    space_key text;
>  begin
>    for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
> -    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
> +    for group_key in select unnest(array['Full-sort Groups', 'Pre-sorted Groups']::text[]) t loop
>        for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
>          node := jsonb_set(node, array[group_key, space_key, 'Average Sort Space Used'], '"NN"', false);
>          node := jsonb_set(node, array[group_key, space_key, 'Peak Sort Space Used'], '"NN"', false);
> @@ -105,7 +105,7 @@ declare
>    space_key text;
>  begin
>    for node in select * from jsonb_array_elements(explain_analyze_inc_sort_nodes(query)) t loop
> -    for group_key in select unnest(array['Full-sort Groups', 'Presorted Groups']::text[]) t loop
> +    for group_key in select unnest(array['Full-sort Groups', 'Pre-sorted Groups']::text[]) t loop
>        group_stats := node->group_key;
>        for space_key in select unnest(array['Sort Space Memory', 'Sort Space Disk']::text[]) t loop
>        if (group_stats->space_key->'Peak Sort Space Used')::bigint < (group_stats->space_key->'Peak Sort Space Used')::bigint then
> -- 
> 2.17.1
> 


-- 
Justin Pryzby
System Administrator
Telsasoft
+1-952-707-8581



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Sat, May 09, 2020 at 03:18:36PM -0500, Justin Pryzby wrote:
>Checking if it's possible to address this Opened Item before 13b1.
>
>https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items
>consistency of explain output: two spaces, equals vs colons, semicolons (incremental sort)
>

Yes. Now that the other items related to incremental sort are fixed,
this is next on my TODO.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Sat, May 09, 2020 at 10:50:12PM +0200, Tomas Vondra wrote:
>On Sat, May 09, 2020 at 03:18:36PM -0500, Justin Pryzby wrote:
>>Checking if it's possible to address this Opened Item before 13b1.
>>
>>https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items
>>consistency of explain output: two spaces, equals vs colons, semicolons (incremental sort)
>>
>
>Yes. Now that the other items related to incremental sort are fixed,
>this is next on my TODO.
>

OK, so aside from the typo/comment fixes, the proposed changes to the
explain format are:

  - two spaces to separate groups of related values
  - use colons rather than equals for fields
  - don't use semicolons
  - split full-groups and prefix-groups to separate lines
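
For reference, with all of the above applied the text format ends up looking
like this (taken from the updated regression output in the patch):

   Full-sort Groups: 1  Sort Method: quicksort  Average Memory: NNkB  Peak Memory: NNkB
   Pre-sorted Groups: 5  Sort Methods: top-N heapsort, quicksort  Average Memory: NNkB  Peak Memory: NNkB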

I'm generally OK with most of this - I'd probably keep the single-line
format, but I don't feel very strongly about that and if others think
using two lines is better ...

Barring objections I'll get this polished and pushed soon-ish (say,
early next week).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Peter Geoghegan
Date:
On Sat, May 9, 2020 at 3:19 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I'm generally OK with most of this - I'd probably keep the single-line
> format, but I don't feel very strongly about that and if others think
> using two lines is better ...
>
> Barring objections I'll get this polished and pushed soon-ish (say,
> early next week).

I see something about starting a new thread on the Open Items page.
Please CC me on this.

Speaking in my capacity as an RMT member: Glad to see that you plan to
close this item out next week. (I had planned on giving you a nudge
about this, but it looks like I don't really have to now.)

Thanks
-- 
Peter Geoghegan



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Sat, May 09, 2020 at 06:48:02PM -0700, Peter Geoghegan wrote:
>On Sat, May 9, 2020 at 3:19 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> I'm generally OK with most of this - I'd probably keep the single-line
>> format, but I don't feel very strongly about that and if others think
>> using two lines is better ...
>>
>> Barring objections I'll get this polished and pushed soon-ish (say,
>> early next week).
>
>I see something about starting a new thread on the Open Items page.
>Please CC me on this.
>
>Speaking in my capacity as an RMT member: Glad to see that you plan to
>close this item out next week. (I had planned on giving you a nudge
>about this, but it looks like I don't really have to now.)
>

Not sure about the new thread - the discussion continues on the
main incremental sort thread, and I don't see any proposal to start a new
thread there. IMO it'd be pointless at this point.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Tomas Vondra
Date:
On Sun, May 10, 2020 at 02:25:23PM +0200, Tomas Vondra wrote:
>On Sat, May 09, 2020 at 06:48:02PM -0700, Peter Geoghegan wrote:
>>On Sat, May 9, 2020 at 3:19 PM Tomas Vondra
>><tomas.vondra@2ndquadrant.com> wrote:
>>>I'm generally OK with most of this - I'd probably keep the single-line
>>>format, but I don't feel very strongly about that and if others think
>>>using two lines is better ...
>>>
>>>Barring objections I'll get this polished and pushed soon-ish (say,
>>>early next week).
>>
>>I see something about starting a new thread on the Open Items page.
>>Please CC me on this.
>>
>>Speaking in my capacity as an RMT member: Glad to see that you plan to
>>close this item out next week. (I had planned on giving you a nudge
>>about this, but it looks like I don't really have to now.)
>>
>
>Not sure about about the new thread - the discussion continues on the
>main incremental sort thread, I don't see any proposal to start a new
>thread there. IMO it'd be pointless at this point.
>

I've pushed both patches, fixing typos and explain format.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Peter Geoghegan
Date:
On Tue, May 12, 2020 at 11:18 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I've pushed both patches, fixing typos and explain format.

Thanks, Tomas.

-- 
Peter Geoghegan



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Justin Pryzby
Date:
On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:
> On Tue, Apr 7, 2020 at 12:25 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:
> > > I've pushed the fist part of this patch series - I've reorganized it a
> >
> > I scanned through this again post-commit.  Find attached some suggestions.
> >
> > Shouldn't non-text explain output always show both disk *and* mem, including
> > zeros ?
>
> Could you give more context on this? Is there a standard to follow?
> Regular sort nodes only ever report one type, so there's not a good
> parallel there.

The change I proposed was like:

@@ -2829,7 +2829,6 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,

                ExplainPropertyList("Sort Methods Used", methodNames, es);

-               if (groupInfo->maxMemorySpaceUsed > 0)
                {
                        long            avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
                        const char *spaceTypeName;
...
-               if (groupInfo->maxDiskSpaceUsed > 0)
                {
...

That is, to show *both* disk and memory space used in the non-text formats, even if zero.

I still think that's what's desirable.

If it's important to show *whether* a sort space was used, then I think there
should be a boolean, or an int 0/1.  But I don't think it's actually needed,
since someone parsing the explain output could just check
|if _dict['Peak Sort Space Used'] > 0: ...
the same as you're doing, without having to write some variation on:
|if 'Peak Sort Space Used' in _dict and _dict['Peak Sort Space Used'] > 0: ...
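
To be concrete, the non-text branch with those guards removed would look
roughly like this (a sketch only; the real function also wraps each space
type in its own group, which I've left out here):

    {
        long        avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;

        /* Now emitted unconditionally, even when the value is zero. */
        ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpace, es);
        ExplainPropertyInteger("Peak Sort Space Used", "kB",
                               groupInfo->maxMemorySpaceUsed, es);
    }
    /* ... and the same two calls for the disk counters ... */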

-- 
Justin



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
On Fri, Jun 19, 2020 at 12:04 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:
> > On Tue, Apr 7, 2020 at 12:25 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:
> > > > I've pushed the fist part of this patch series - I've reorganized it a
> > >
> > > I scanned through this again post-commit.  Find attached some suggestions.
> > >
> > > Shouldn't non-text explain output always show both disk *and* mem, including
> > > zeros ?
> >
> > Could you give more context on this? Is there a standard to follow?
> > Regular sort nodes only ever report one type, so there's not a good
> > parallel there.
>
> The change I proposed was like:
>
> @@ -2829,7 +2829,6 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo,
>
>                 ExplainPropertyList("Sort Methods Used", methodNames, es);
>
> -               if (groupInfo->maxMemorySpaceUsed > 0)
>                 {
>                         long            avgSpace = groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
>                         const char *spaceTypeName;
> ...
> -               if (groupInfo->maxDiskSpaceUsed > 0)
>                 {
> ...
>
> To show in non-text format *both* disk and memory space used, even if zero.
>
> I still think that's what's desirable.

I'm of the opposite opinion. I believe showing both unnecessarily is confusing.

> If it's important to show *whether* a sort space was used, then I think there
> should be a boolean, or an int 0/1.  But I don't think it's actually needed,
> since someone parsing the explain output could just check
> |if _dict['Peak Sort Space Used'] > 0: ...
> the same as you're doing, without having to write some variation on:
> |if 'Peak Sort Space Used' in _dict and _dict['Peak Sort Space Used'] > 0: ...

I think it's desirable for code to be explicit about the type having
been used rather than implicitly assuming it based on 0/non-zero
values.

James



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
James Coleman
Date:
It seems like the consensus over at another discussion on this topic
[1] is that we ought to go ahead and print the zeros [for machine-readable
output formats], even though that creates some odd scenarios, like disk
sorts printing 0 for memory even though that's not true.

The change has already been made and pushed for hash disk spilling, so
I think we ought to use Justin's patch here.

James

[1] https://www.postgresql.org/message-id/2276865.1593102811%40sss.pgh.pa.us



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
"Jonathan S. Katz"
Date:
On 7/2/20 11:47 AM, James Coleman wrote:
> It seems like the consensus over at another discussion on this topic
> [1] is that we ought to go ahead and print the zeros [for machine
> readable output formats], even though that creates some interesting
> scenarios like the fact that disk sorts will print 0 for memory even
> though that's not true.
>
> The change has already been made and pushed for hash disk spilling, so
> I think we ought to use Justin's patch here.

Do people agree with James's analysis? From the RMT perspective, we would
like to get this open item wrapped up for the next beta, given [1] is now
resolved.

Thanks!

Jonathan

[1]
https://www.postgresql.org/message-id/CAApHDvpBQx4Shmisjp7oKr%3DECX18KYKPB%3DKpdWYxMKQNvisgvQ%40mail.gmail.com


Attachment

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Peter Geoghegan
Date:
On Thu, Jul 9, 2020 at 12:06 PM Jonathan S. Katz <jkatz@postgresql.org> wrote:
> On 7/2/20 11:47 AM, James Coleman wrote:
> > It seems like the consensus over at another discussion on this topic
> > [1] is that we ought to go ahead and print the zeros [for machine
> > readable output formats], even though that creates some interesting
> > scenarios like the fact that disk sorts will print 0 for memory even
> > though that's not true.
> >
> > The change has already been made and pushed for hash disk spilling, so
> > I think we ought to use Justin's patch here.
>
> Do people agree with James analysis? From the RMT perspective, we would
> like to get this open item wrapped up for the next beta, given[1] is now
> resolved.

Tomas, Justin: Ping? Can we get an update on this?

Just for the record, David Rowley fixed the similar hashagg issue in
commit 40efbf8706cdd96e06bc4d1754272e46d9857875. I don't see any
reason for the delay here.

Thanks
-- 
Peter Geoghegan



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From
Peter Geoghegan
Date:
On Thu, Jul 2, 2020 at 8:47 AM James Coleman <jtc331@gmail.com> wrote:
> It seems like the consensus over at another discussion on this topic
> [1] is that we ought to go ahead and print the zeros [for machine
> readable output formats], even though that creates some interesting
> scenarios like the fact that disk sorts will print 0 for memory even
> though that's not true.

What about having it print -1 for memory in this case instead? That's
still not ideal, but machine readable EXPLAIN output ought to
consistently show the same information per node, even when the answer
is in some sense indeterminate. That seems to be the standard that
we've settled on.

It might be worth teaching the JSON format to show a JSON null or
something instead. Not sure about that, though.
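
If we did go with -1, the change would be tiny; assuming the value keeps
flowing through ExplainPropertyInteger, it'd be something like:

    /* Disk sort: the memory figure is indeterminate, report a sentinel. */
    ExplainPropertyInteger("Peak Sort Space Used", "kB", -1, es);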

-- 
Peter Geoghegan