Thread: updated SORT/LIMIT patch

updated SORT/LIMIT patch

From
Gregory Stark
Date:
Updated patch against cvs update in case it makes applying easier.

One minor change:

. Added #include <limits.h> in tuplesort.h to pull in UINT_MAX
  (thanks to dpage for noticing this is necessary on OSX)



--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Attachment

Re: updated SORT/LIMIT patch

From
Bruce Momjian
Date:
Your patch has been added to the PostgreSQL unapplied patches list at:

    http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


Gregory Stark wrote:
>
> Updated patch against cvs update in case it makes applying easier.
>
> One minor change:
>
> . Added #include <limits.h> in tuplesort.h to pull in UINT_MAX
>   (thanks to dpage for noticing this is necessary on OSX)
>

[ Attachment, skipping... ]

>
>
> --
>   Gregory Stark
>   EnterpriseDB          http://www.enterprisedb.com
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: updated SORT/LIMIT patch

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> Updated patch against cvs update in case it makes applying easier.

Applied with revisions --- notably, I avoided adding any overhead to
HEAPCOMPARE() by the expedient of reversing the logical sort order
before heapify'ing.  We couldn't have done that before the NULLS_FIRST
patch went in, but now it's trivial to make the sort order reverse
fully.

Since you didn't include any documentation patch for the
optimize_bounded_sort GUC variable, I assumed it was meant only for
debugging and hid it behind #ifdef DEBUG_BOUNDED_SORT.

            regards, tom lane

Re: updated SORT/LIMIT patch

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> Updated patch against cvs update in case it makes applying easier.
>
> Applied with revisions --- notably, I avoided adding any overhead to
> HEAPCOMPARE() by the expedient of reversing the logical sort order
> before heapify'ing.  We couldn't have done that before the NULLS_FIRST
> patch went in, but now it's trivial to make the sort order reverse
> fully.

Hum. The major change I see is the bit related to rescans where you made it
resort if the bound had changed. But surely the only way the bound can change
is if it's a parameter, and if there is a parameter then surely the executor
must be doing more than just a plain rescan? The sort key could have changed
if it depends on the parameter.

What does the executor do differently in the case of a subplan with a
parameter that makes it re-execute the plan from scratch and not just do a
simple rescan?

> Since you didn't include any documentation patch for the
> optimize_bounded_sort GUC variable, I assumed it was meant only for
> debugging and hid it behind #ifdef DEBUG_BOUNDED_SORT.

Sure, I originally had it #ifdef'd on TRACE_SORT but took it out for reasons
that I don't recall.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: updated SORT/LIMIT patch

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> Hum. The major change I see is the bit related to rescans where you made it
> resort if the bound had changed. But surely the only way the bound can change
> is if it's a parameter, and if there is a parameter then surely the executor
> must be doing more than just a plain rescan?

The problem is that a parameter change in the LIMIT's expression would
not normally be propagated below the LIMIT.  In this case, since we're
allowing its effects to bubble down one more level of the tree, we need
to make sure that that level is recomputed too.

> What does the executor do differently in the case of a subplan with a
> parameter that makes it re-execute the plan from scratch and not just do a
> simple rescan?

Look at the chgParam signaling.  Since a Sort node itself has no
parameters, it historically has only had to re-sort if its input node
suffers a parameter change, which it checks in ExecReScanSort.  But now
the bound effectively acts like a parameter, and has to force a
recomputation.

            regards, tom lane

Re: updated SORT/LIMIT patch

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> What does the executor do differently in the case of a subplan with a
>> parameter that makes it re-execute the plan from scratch and not just do a
>> simple rescan?
>
> Look at the chgParam signaling.  Since a Sort node itself has no
> parameters, it historically has only had to re-sort if its input node
> suffers a parameter change, which it checks in ExecReScanSort.  But now
> the bound effectively acts like a parameter, and has to force a
> recomputation.

Hm, that all makes sense now. But then there's something mysterious going on
still as the regression test I tried to write for this actually does work:

edb=# select (select ROW(int_array_aggregate(empno::integer),min(sal),round(avg(sal)),max(sal)) as sal from (select *
fromemp order by sal offset 3*bucket limit 3) as x) from generate_series(0,(select count(*) from emp)/3) as bucket; 
LOG:  begin tuple sort: nkeys = 1, workMem = 1024, randomAccess = f
LOG:  switching to bounded heap sort at 7 tuples
LOG:  performsort starting: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  doing unheapify of 3 tuples
LOG:  performsort done: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  internal sort ended, 17 KB used: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  begin tuple sort: nkeys = 1, workMem = 1024, randomAccess = f
LOG:  switching to bounded heap sort at 13 tuples
LOG:  performsort starting: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  doing unheapify of 6 tuples
LOG:  performsort done: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  internal sort ended, 17 KB used: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  begin tuple sort: nkeys = 1, workMem = 1024, randomAccess = f
LOG:  performsort starting: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  doing qsort of 14 tuples
LOG:  performsort done: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  internal sort ended, 18 KB used: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  begin tuple sort: nkeys = 1, workMem = 1024, randomAccess = f
LOG:  performsort starting: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  doing qsort of 14 tuples
LOG:  performsort done: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  internal sort ended, 18 KB used: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  begin tuple sort: nkeys = 1, workMem = 1024, randomAccess = f
LOG:  performsort starting: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  doing qsort of 14 tuples
LOG:  performsort done: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  internal sort ended, 18 KB used: CPU 0.00s/0.00u sec elapsed 0.00 sec
                 ?column?
-------------------------------------------
 ("{7369,7900,7876}",800.00,950,1100.00)
 ("{7654,7521,7934}",1250.00,1267,1300.00)
 ("{7844,7499,7782}",1500.00,1850,2450.00)
 ("{7698,7566,7902}",2850.00,2942,3000.00)
 ("{7788,7839}",3000.00,4000,5000.00)
(5 rows)

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: updated SORT/LIMIT patch

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> Look at the chgParam signaling.  Since a Sort node itself has no
>> parameters, it historically has only had to re-sort if its input node
>> suffers a parameter change, which it checks in ExecReScanSort.  But now
>> the bound effectively acts like a parameter, and has to force a
>> recomputation.

> Hm, that all makes sense now. But then there's something mysterious going on
> still as the regression test I tried to write for this actually does work:

Yeah, because in this example nodeSort doesn't ask for randomAccess to
the sort result, and so ExecReScanSort is forced to repeat the sort
anyway.

[ greps a bit... ]  It looks like the only way that you could expose the
bug in the current state of the system would be if the sort/limit with
the outer parameter were the inside of a nestloop join in the subplan.
nodeNestloop would set EXEC_FLAG_REWIND, causing nodeSort to set
randomAccess, allowing ExecReScanSort to suppose that it could rewind
the sort.

            regards, tom lane

Re: updated SORT/LIMIT patch

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> [ greps a bit... ]  It looks like the only way that you could expose the
> bug in the current state of the system would be if the sort/limit with
> the outer parameter were the inside of a nestloop join in the subplan.
> nodeNestloop would set EXEC_FLAG_REWIND, causing nodeSort to set
> randomAccess, allowing ExecReScanSort to suppose that it could rewind
> the sort.

I finally managed to trigger this case and found that the checks don't
actually work:

postgres=# SELECT (SELECT n
                     FROM (VALUES (1)) AS x,
                          (SELECT n FROM generate_series(1,10) AS n
                            ORDER BY n LIMIT 1 OFFSET s-1) AS y) AS z
             FROM generate_series(1,10) AS s;
ERROR:  retrieved too many tuples in a bounded sort

What's going on is that nodeLimit.c only invokes recompute_limit when the first
tuple is actually generated. It has a comment saying "(We can't do this any
earlier, because parameters from upper nodes may not be set until now.)"
So the checks are still comparing the previous bound against the boundDone.

Attached is a small patch which fixes this case. It also makes the check
slightly more liberal -- we don't need to resort if the previous sort was
unbounded or the bound was greater than or equal to the new bound.

There is one bit I'm not too sure of. We may or may not end up requesting
tuples from our child node. If we do we have to ReScan it but by then we don't
have the exprCtx passed to the ReScan call. I just made it call ReScan always
even if we later decide we can just rewind the tuplesort, is that ok?

Also, I left a comment that it would be nice if we could peek at the
tuplesort's boundUsed and state to avoid resorting unnecessarily. Currently it
pretty much always resorts unless you construct a bizarre query like the above
to force the randomAccess flag to be true. Most of the time tuplesort is going
to sort in memory anyways even if random access isn't requested and resorting
is pointless.

I think it would be worthwhile adding a method to tuplesort to ask whether
random access is possible and how many tuples were actually kept. Then
nodeSort could ask it those values instead of just remembering what values
were requested.



--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Attachment

Re: updated SORT/LIMIT patch

From
Alvaro Herrera
Date:
Gregory Stark wrote:

> Attached is a small patch which fixes this case. It also makes the check
> slightly more liberal -- we don't need to resort if the previous sort was
> unbounded or the bound was greater than or equal to the new bound.

Huh, can you clarify this comment:

+        * XXX It would be nice to check tuplesortstate->boundUsed too but that
+        * seems like an abstraction violation. And for that matter to check
+        * the tuplesort to see if randomaccess is possible even if it wasn't
+        * requested so we don't resort input when the parameters haven't
+        * changed if it was sorted in memory.

I'm having serious trouble parsing it.

Thanks.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: updated SORT/LIMIT patch

From
Gregory Stark
Date:
"Alvaro Herrera" <alvherre@commandprompt.com> writes:

> Gregory Stark wrote:
>
>> Attached is a small patch which fixes this case. It also makes the check
>> slightly more liberal -- we don't need to resort if the previous sort was
>> unbounded or the bound was greater than or equal to the new bound.
>
> Huh, can you clarify this comment:
>
> +        * XXX It would be nice to check tuplesortstate->boundUsed too but that
> +        * seems like an abstraction violation. And for that matter to check
> +        * the tuplesort to see if randomaccess is possible even if it wasn't
> +        * requested so we don't resort input when the parameters haven't
> +        * changed if it was sorted in memory.
>
> I'm having serious trouble parsing it.

Well in the executor currently we check node->boundedDone and node->boundDone
to see whether the previous execution of this node had a bound. If it did and
we now don't or if it did but now our bound is larger then we have to
re-execute.

However the tuplesort may have actually done an unbounded sort either because
the bound (actually bound*2) may not have been reached or because it spilled
to disk and we did a disk sort. It would be nice to check that and not have to
reexecute unnecessarily.

But that means peeking into tuplesort's state. I started doing a patch
yesterday to do this by exporting a new method on tuplesort:

bool
tuplesort_random_access(Tuplesortstate *state, bool bounded, unsigned tuples_needed)
{
    switch (state->status)
    {
        case TSS_FINALMERGE:
            return false;
        case TSS_SORTEDONTAPE:
            return state->randomAccess;
        case TSS_SORTEDINMEM:
            return (!state->boundUsed || (bounded && state->bound > tuples_needed));
        default:
            return false;
    }
}

That solves the abstraction barrier issue but I'm not sure if it's quite
detailed enough. Really Sort only needs to rewind the tuplesort which it can
do even if state->randomAccess is false. We may need two separate functions,
one which tests if rewind is supported and another which tests if random
access is supported. Also, I haven't looked at how the final merge case is
handled. It might be possible to support rescan (but not random access) in
that state as well.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: updated SORT/LIMIT patch

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> [ greps a bit... ]  It looks like the only way that you could expose the
>> bug in the current state of the system would be if the sort/limit with
>> the outer parameter were the inside of a nestloop join in the subplan.

> I finally managed to trigger this case and found that the checks don't
> actually work:

Oh dear.

> Attached is a small patch which fixes this case.

This patch makes what was already a hack into a full-fledged crock (and
it's not just the self-doubting comments that make me distrust it).
I think we need to rip out this ad-hoc parameter change signaling code
and make it work through the regular chgParam mechanism.  Not sure about
details though.  There may be no clean solution short of folding
Sort and Limit into a single node type.

> I think it would be worthwhile adding a method to tuplesort to ask whether
> random access is possible and how many tuples were actually kept. Then
> nodeSort could ask it those values instead of just remembering what values
> were requested.

I disagree with this line of development, as it amounts to exposing
tuplesort implementation details as API.

            regards, tom lane

Re: updated SORT/LIMIT patch

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> This patch makes what was already a hack into a full-fledged crock (and
> it's not just the self-doubting comments that make me distrust it).
> I think we need to rip out this ad-hoc parameter change signaling code
> and make it work through the regular chgParam mechanism.  Not sure about
> details though.  There may be no clean solution short of folding
> Sort and Limit into a single node type.

Well I can't disagree, I always was concerned about the inter-node
communication part. If I have power on the train I might look at it then but
otherwise I'm offline until Monday.

>> I think it would be worthwhile adding a method to tuplesort to ask whether
>> random access is possible and how many tuples were actually kept. Then
>> nodeSort could ask it those values instead of just remembering what values
>> were requested.
>
> I disagree with this line of development, as it amounts to exposing
> tuplesort implementation details as API.

I'm not sure I agree. What's the difference if between using a boolean value
we pass to tuplesort requesting random access and using a boolean value we get
back from asking tuplesort?

The "tuples_needed" is a bit of a wart but then it's the same inevitable wart
as set_bound.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: updated SORT/LIMIT patch

From
Gregory Stark
Date:
"Gregory Stark" <stark@enterprisedb.com> writes:

> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>
>> This patch makes what was already a hack into a full-fledged crock (and
>> it's not just the self-doubting comments that make me distrust it).
>> I think we need to rip out this ad-hoc parameter change signaling code
>> and make it work through the regular chgParam mechanism.  Not sure about
>> details though.  There may be no clean solution short of folding
>> Sort and Limit into a single node type.
>
> Well I can't disagree, I always was concerned about the inter-node
> communication part. If I have power on the train I might look at it then but
> otherwise I'm offline until Monday.

I did in fact have power on the train and due the wonders of a local rsync'd
CVS repository I could even view cvs logs and diffs offline. It's interesting
how I've read all this code and comments several times and each time I get
more out of it.

It doesn't look like the timing of the ExecRescan is an issue at all. There
are plenty of nodes that Rescan their children much later than when they first
start up. Even Nested Loop does so.

I do still need to read more about parameters and how the parameter sets are
initially constructed. It would be nice to set it up so the sort node gets
signalled using the existing infrastructure.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: updated SORT/LIMIT patch

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> It doesn't look like the timing of the ExecRescan is an issue at all. There
> are plenty of nodes that Rescan their children much later than when they first
> start up. Even Nested Loop does so.

Right, but separating the child rescan from the tuplestore rescan bugs me.
Another problem is that the proposed patch does a child rescan call that
might (usually would) be unnecessary.

I am thinking that a cleaner fix is probably to make ExecRescanLimit do
the recompute_limits() bit immediately, so that the new limits are
available to the Sort node when it gets the rescan call.  The comment
about timing of recompute_limits() is referring to the fact that
parameters aren't set at ExecInitLimit() time, but I believe they are
(and should be) available at Rescan time.  Will give it a try anyway.

            regards, tom lane

Re: updated SORT/LIMIT patch

From
Tom Lane
Date:
I wrote:
> I am thinking that a cleaner fix is probably to make ExecRescanLimit do
> the recompute_limits() bit immediately, so that the new limits are
> available to the Sort node when it gets the rescan call.  The comment
> about timing of recompute_limits() is referring to the fact that
> parameters aren't set at ExecInitLimit() time, but I believe they are
> (and should be) available at Rescan time.  Will give it a try anyway.

Indeed, this way seems to work fine --- and in fact that's what we'd
have to do anyway if we were to merge the parameter-passing into
chgParam signaling.  I didn't try to do that, just committed a patch
to fix the immediate problem.

BTW, as for your earlier worries about useless re-sorts when
randomAccess wasn't requested: the design intention is that randomAccess
*will* be requested in any situation where repeat scans are likely.  So
there's no point in uglifying the tuplesort API to make an unexpected
rescan fast.  If you are seeing cases where a useless re-sort actually
happens, we might have some bugs in the EXEC_FLAG_REWIND signaling.

            regards, tom lane