Re: [HACKERS] [PATCH] Incremental sort - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: [HACKERS] [PATCH] Incremental sort
Date
Msg-id CAPpHfdtvx1oVJHCUe1x-R=t55Uz4uBi9i55BcoTJ7zZN_7Gxfw@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] [PATCH] Incremental sort  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] [PATCH] Incremental sort
Re: [HACKERS] [PATCH] Incremental sort
List pgsql-hackers
On Mon, May 8, 2017 at 6:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, May 5, 2017 at 11:13 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Incremental sort is faster in vast majority of cases.  It appears to be
> slower only when whose dataset is one sort group.  In this case incremental
> sort is useless, and it should be considered as misuse of incremental sort.
> Slowdown is related to the fact that we anyway have to do extra comparisons,
> unless we somehow push our comparison result into qsort itself and save some
> cpu cycles (but that would be unreasonable break of encapsulation).  Thus,
> in such cases regression seems to be inevitable anyway.  I think we could
> evade this regression during query planning.  If we see that there would be
> only few groups, we should choose plain sort instead of incremental sort.

I'm sorry that I don't have time to review this in detail right now,
but it sounds like you are doing good work to file down cases where
this might cause regressions, which is great.

Thank you for paying attention to this patch!
 
Regarding the point in
the paragraph above, I'd say that it's OK for the planner to be
responsible for picking between Sort and Incremental Sort in some way.
It is, after all, the planner's job to decide between different
strategies for executing the same query and, of course, sometimes it
will be wrong, but that's OK as long as it's not wrong too often (or
by too much, hopefully).

Right, I agree.
 
It may be a little difficult to get this
right, though, because I'm not sure that the information you need
actually exists (or is reliable).  For example, consider the case
where we need to sort 100m rows and there are 2 groups.  If 1 group
contains 1 row and the other group contains all of the rest, there is
really no point in an incremental sort.  On the other hand, if each
group contains 50m rows and we can get the data presorted by the
grouping column, there might be a lot of point to an incremental sort,
because two 50m-row sorts might be a lot cheaper than one 100m sort.
More generally, it's quite easy to imagine situations where the
individual groups can be quicksorted but sorting all of the rows
requires I/O, even when the number of groups isn't that big.  On the
other hand, the real sweet spot for this is probably the case where
the number of groups is very large, with many single-row groups or
many groups with just a few rows each, so if we can at least get this
to work in those cases that may be good enough.  On the third hand,
when costing aggregation, I think we often underestimate the number of
groups and there might well be similar problems here.

I agree with that.  I need to test this patch more carefully in the case when groups have different sizes.  It's likely I need to add yet another parameter to my testing script: groups count skew.

Patch rebased to current master is attached.  I'm going to improve my testing script and post new results. 

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [HACKERS] psql: new help related to variables are not too readable
Next
From: Thomas Munro
Date:
Subject: Re: [HACKERS] Parallel Hash take II