Thread: LIMIT/SORT optimization

LIMIT/SORT optimization

From
Gregory Stark
Date:
Earlier I mentioned I had implemented the limited sort optimization. This
kicks in quite frequently on web applications that implement paging. Currently
Postgres has to sort the entire data set, often in a disk sort. If the page is
early in the result set it's a lot more efficient to just keep the top n
records in memory and do a single pass through the result set.

The patch is quite simple and is mostly localized to tuplesort.c, which
provides a new function to advise it of the maximum number of tuples
necessary. It uses the existing heapsort functionality, which makes it easy to
keep the top n records by peeking at the top element of the heap and removing
it if the new record would displace it.

In experimenting I found heap sort to be about half the speed of quicksort, so
I made it switch over to heap sort only if the input size reaches 2x the
specified maximum or if it can avoid spilling to disk by switching.
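Concretely, the switch-over rule is roughly the following (an illustrative
standalone sketch; the variable and function names are made up, not taken from
the patch):

    #include <stdbool.h>

    static bool
    should_switch_to_bounded_heap(long tuples_in_memory,   /* tuples read so far */
                                  long bound,              /* LIMIT, if known */
                                  bool about_to_spill,     /* exceeding work_mem? */
                                  bool bound_fits_in_memory)
    {
        if (bound <= 0)
            return false;               /* no bound known: ordinary sort */
        if (tuples_in_memory >= 2 * bound)
            return true;                /* heap overhead now pays for itself */
        if (about_to_spill && bound_fits_in_memory)
            return true;                /* discarding tuples avoids a disk sort */
        return false;
    }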

The two open issues (which are arguably the same issue) are how to get the
information down to the sort node and how to cost the plan. Currently it's a
bit of a hack: the Limit node peeks at its child and, if it's a sort, calls a
special function to provide the limit.
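For what it's worth, the shape of the hack is roughly this (a hypothetical
standalone sketch, not the actual executor code; every type and function name
here is a stand-in):

    #include <stdbool.h>

    typedef enum { NODE_SORT, NODE_OTHER } NodeTag;

    typedef struct SortStateStub { long bound; bool bounded; } SortStateStub;
    typedef struct PlanStateStub { NodeTag tag; SortStateStub *sort; } PlanStateStub;

    /* stand-in for the advisory function the patch adds to tuplesort.c */
    static void
    tuplesort_set_bound_stub(SortStateStub *state, long bound)
    {
        state->bounded = true;
        state->bound = bound;
    }

    /* called by the Limit node once it knows its offset and count */
    static void
    limit_advise_child(PlanStateStub *child, long offset, long count)
    {
        /* only a Sort directly underneath the Limit can use the bound;
         * the Limit node itself still applies OFFSET/LIMIT as before */
        if (child->tag == NODE_SORT && count >= 0)
            tuplesort_set_bound_stub(child->sort, offset + count);
    }

    int
    main(void)
    {
        SortStateStub sort = {0, false};
        PlanStateStub child = {NODE_SORT, &sort};

        limit_advise_child(&child, 0, 3);   /* e.g. ORDER BY ... LIMIT 3 */
        return sort.bounded ? 0 : 1;
    }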






--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Attachment

Re: LIMIT/SORT optimization

From
"Simon Riggs"
Date:
On Wed, 2007-02-07 at 10:49 +0000, Gregory Stark wrote:
> The two open issues (which are arguably the same issue) are how to get
> the information down to the sort node and how to cost the plan.
> Currently it's a bit of a hack: the Limit node peeks at its child and,
> if it's a sort, calls a special function to provide the limit.

We can't lose the LIMIT node altogether, in case we have a parameterised
limit or a limit expression, so it does need to be in the executor.

Exploiting knowledge about adjacent plan types is already used in the
executor. If this seemed like it might be a generic trick, then I'd say
implement a generic function, but I don't see that it is.

We still want to push LIMIT/Sort down through an APPEND, but this won't
help us here - we'd need to do that in the planner.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: LIMIT/SORT optimization

From
Gregory Stark
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:

> On Wed, 2007-02-07 at 10:49 +0000, Gregory Stark wrote:
> > The two open issues (which are arguably the same issue) are how to get
> > the information down to the sort node and how to cost the plan.
> > Currently it's a bit of a hack: the Limit node peeks at its child and,
> > if it's a sort, calls a special function to provide the limit.
>
> We can't lose the LIMIT node altogether, in case we have a parameterised
> limit or a limit expression, so it does need to be in the executor.

Right. The LIMIT node also implements offset and handles tricky border cases
such as cursors that move past the edges. It would be pointless to duplicate
that logic in tuplesort.c. The idea is to advise tuplesort.c when it can save
work by not sorting more tuples than necessary, not to duplicate the work of Limit.

> Exploiting knowledge about adjacent plan types is already used in the
> executor. If this seemed like it might be a generic trick, then I'd say
> implement a generic function, but I don't see that it is.
>
> We still want to push LIMIT/Sort down through an APPEND, but this won't
> help us here - we'd need to do that in the planner.

Hm, that's exactly the type of situation I was afraid of, where the
information needs to propagate farther than an immediate child node and with
more sophisticated rules. However, as you point out, that can be handled by
doing optimizations that modify the plan tree. That keeps the scope of the
optimization to a minimum: sort nodes directly under limit nodes. That's
probably a better approach.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: LIMIT/SORT optimization

From
Bruce Momjian
Date:
Is there a newer version of this patch?

---------------------------------------------------------------------------

Gregory Stark wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>
> > On Wed, 2007-02-07 at 10:49 +0000, Gregory Stark wrote:
> > > The two open issues (which are arguably the same issue) are how to get
> > > the information down to the sort node and how to cost the plan.
> > > Currently it's a bit of a hack: the Limit node peeks at its child and,
> > > if it's a sort, calls a special function to provide the limit.
> >
> > We can't lose the LIMIT node altogether, in case we have a parameterised
> > limit or a limit expression, so it does need to be in the executor.
>
> Right. The LIMIT node also implements offset and handles tricky border cases
> such as cursors that move past the edges. It would be pointless to duplicate
> that logic in tuplesort.c. The idea is to advise tuplesort.c when it can save
> work by not sorting more tuples than necessary, not to duplicate the work of Limit.
>
> > Exploiting knowledge about adjacent plan types is already used in the
> > executor. If this seemed like it might be a generic trick, then I'd say
> > implement a generic function, but I don't see that it is.
> >
> > We still want to push LIMIT/Sort down through an APPEND, but this won't
> > help us here - we'd need to do that in the planner.
>
> Hm, that's exactly the type of situation I was afraid of, where the
> information needs to propagate farther than an immediate child node and with
> more sophisticated rules. However, as you point out, that can be handled by
> doing optimizations that modify the plan tree. That keeps the scope of the
> optimization to a minimum: sort nodes directly under limit nodes. That's
> probably a better approach.
>
> --
>   Gregory Stark
>   EnterpriseDB          http://www.enterprisedb.com
>
>

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: LIMIT/SORT optimization

From
Gregory Stark
Date:
Bruce Momjian <bruce@momjian.us> writes:

> Is there a newer version of this patch?

As requested, I've cut an updated version of this patch against CVS HEAD:

 http://community.enterprisedb.com/sort-limit-v5.patch.gz

The open issues I see are:

 Adjusting the costing of the sort to take into account the optimization

 Whether the communication between the Limit node and the Sort node is kosher
 or whether something more abstract is needed.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: LIMIT/SORT optimization

From
Gregory Stark
Date:
"Gregory Stark" <stark@enterprisedb.com> writes:

> Bruce Momjian <bruce@momjian.us> writes:
>
>> Is there a newer version of this patch?
>
> As requested, I've cut an updated version of this patch against CVS HEAD:
>
>  http://community.enterprisedb.com/sort-limit-v5.patch.gz

Someone asked why I've been posting links rather than attachments. The only
reason was because when I posted patches in the past they were dropped by the
mailing list. I would say "refused" except I never received a bounce, the
messages just never appeared on list or in the archive.

I'll try attaching this patch again, which is relatively small compared to the
recursive query patches and packed varlena patches which disappeared into the
ether. Also, this one is gzipped whereas in the past I usually attached
patches uncompressed so people could read them without saving and
uncompressing them. Perhaps one of those differences is the source of the
problem?

Do people prefer receiving attachments or downloadable links?
Does the answer change if the patches are quite large?


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Attachment

Re: LIMIT/SORT optimization

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> Do people prefer receiving attachments or downloadable links?
> Does the answer change if the patches are quite large?

Links suck from an archival standpoint; but at the same time you need
to pay some attention to the size of your email.  I think the current
threshold for moderator approval is somewhere between 50K and 100K
(on patches; less on our other lists).  gzipping large patches is
encouraged --- if people's mail readers need help in viewing such
an attachment, that's not your problem.

            regards, tom lane

Re: LIMIT/SORT optimization

From
Magnus Hagander
Date:
Tom Lane wrote:
> Gregory Stark <stark@enterprisedb.com> writes:
>> Do people prefer receiving attachments or downloadable links?
>> Does the answer change if the patches are quite large?
>
> Links suck from an archival standpoint; but at the same time you need
> to pay some attention to the size of your email.  I think the current
> threshold for moderator approval is somewhere between 50K and 100K
> (on patches; less on our other lists).  gzipping large patches is
> encouraged --- if people's mail readers need help in viewing such
> an attachment, that's not your problem.

IIRC, when a patch gets held back, you get a notification. The problem
has been with mails that are silently dropped. AFAIK, that has happened
outside of mailman, somewhere else in the mail system. For example, we
used to drop anything that was a .tar (including .tar.gz), and last I
checked we still do that. Admittedly that check was some time ago, but
I've seen no statement that it has been fixed.

(Plain gzip usually worked fine, but a .tar.gz, used so I could include a
couple of new files, got silently dropped. For example, I tried sending my
vcbuild patches at least 4 times before I realized they were silently dropped.)

So I'd avoid anything other than plaintext or plaintext.gz.

//Magnus

Re: LIMIT/SORT optimization

From
"Simon Riggs"
Date:
On Wed, 2007-03-14 at 15:16 +0000, Gregory Stark wrote:

> Do people prefer receiving attachments or downloadable links?

Attachments are very clearly submissions to the project.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: LIMIT/SORT optimization

From
Bruce Momjian
Date:
Your patch has been added to the PostgreSQL unapplied patches list at:

    http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


Gregory Stark wrote:
>
> "Gregory Stark" <stark@enterprisedb.com> writes:
>
> > Bruce Momjian <bruce@momjian.us> writes:
> >
> >> Is there a newer version of this patch?
> >
> > As requested, I've cut an updated version of this patch against CVS HEAD:
> >
> >  http://community.enterprisedb.com/sort-limit-v5.patch.gz
>
> Someone asked why I've been posting links rather than attachments. The only
> reason was because when I posted patches in the past they were dropped by the
> mailing list. I would say "refused" except I never received a bounce, the
> messages just never appeared on list or in the archive.
>
> I'll try attaching this patch again, which is relatively small compared to the
> recursive query patches and packed varlena patches which disappeared into the
> ether. Also, this one is gzipped whereas in the past I usually attached
> patches uncompressed so people could read them without saving and
> uncompressing them. Perhaps one of those differences is the source of the
> problem?
>
> Do people prefer receiving attachments or downloadable links?
> Does the answer change if the patches are quite large?
>

[ Attachment, skipping... ]

>
> --
>   Gregory Stark
>   EnterpriseDB          http://www.enterprisedb.com
>

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: LIMIT/SORT optimization

From
Alvaro Herrera
Date:
Bruce Momjian wrote:
>
> Your patch has been added to the PostgreSQL unapplied patches list at:
>
>     http://momjian.postgresql.org/cgi-bin/pgpatches
>
> It will be applied as soon as one of the PostgreSQL committers reviews
> and approves it.

Did Greg push a version that didn't carry the "WIP" label?  As far as I
remember he was still asking how to make the Sort and Limit nodes
communicate.

> Gregory Stark wrote:
> >
> > "Gregory Stark" <stark@enterprisedb.com> writes:
> >
> > > Bruce Momjian <bruce@momjian.us> writes:
> > >
> > >> Is there a newer version of this patch?
> > >
> > > As requested, I've cut an updated version of this patch against CVS HEAD:
> > >
> > >  http://community.enterprisedb.com/sort-limit-v5.patch.gz


--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: LIMIT/SORT optimization

From
Bruce Momjian
Date:
Alvaro Herrera wrote:
> Bruce Momjian wrote:
> >
> > Your patch has been added to the PostgreSQL unapplied patches list at:
> >
> >     http://momjian.postgresql.org/cgi-bin/pgpatches
> >
> > It will be applied as soon as one of the PostgreSQL committers reviews
> > and approves it.
>
> Did Greg push a version that didn't carry the "WIP" label?  As far as I
> remember he was still asking how to make the Sort and Limit nodes
> communicate.

Good question.  I asked for a new version of this patch and the WIP was
only in the email subject line. Greg, is this ready for review?

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: LIMIT/SORT optimization

From
Gregory Stark
Date:
>> Did Greg push a version that didn't carry the "WIP" label?  As far as I
>> remember he was still asking how to make the Sort and Limit nodes
>> communicate.
>
> Good question.  I asked for a new version of this patch and the WIP was
> only in the email subject line. Greg, is this ready for review?
Well, my question was whether it was acceptable to do things this way or
whether there was a better way. If it's the right way then sure; if not, then
no. I guess that's what review is for.

In short, I'm not sure what constitutes "ready for review". Are you asking if
it's ready to apply? I don't know; that's why we have reviews. Or are you
asking if it's ready for someone to look at? What's the point of posting WIP
patches if you don't want someone to look at them?
--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: LIMIT/SORT optimization

From
Bruce Momjian
Date:
Gregory Stark wrote:
> >> Did Greg push a version that didn't carry the "WIP" label?  As far as I
> >> remember he was still asking how to make the Sort and Limit nodes
> >> communicate.
> >
> > Good question.  I asked for a new version of this patch and the WIP was
> > only in the email subject line. Greg, is this ready for review?
> Well my question was whether it was acceptable to do things this way or
> whether there was a better way. If it's the right way then sure, if not then
> no. I guess that's what review is for.
> In short I'm not sure what constitutes "ready for review". Are you asking if
> it's ready to apply? I don't know, that's why we have reviews. Or are you
> asking if it's ready for someone to look at? What's the point of posting WIP
> patches if you don't want someone to look at it?

I am asking if _you_ are done working on it.  Seems you are, so I will
add it to the patch queue.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: LIMIT/SORT optimization

From
Heikki Linnakangas
Date:
Some comments on the patch below.

Gregory Stark wrote:

> + /* tuplesort_set_bound - External API to set a bound on a tuplesort
> +  *
> +  * Must be called before inserting any tuples.
> +
> +  * Sets a maximum number of tuples the caller is interested in. The first
> +  * <bound> tuples are maintained using a simple insertion sort and returned
> +  * normally. Any tuples that lie after those in the sorted result set are
> +  * simply thrown out
> +  */

The "Must be called before inserting any tuples" is in contradiction
with the comment in the header file:

> + /* This can be called at any time before performsort to advise tuplesort that
> +  * only this many tuples are interesting. If that many tuples fit in memory and
> +  * we haven't already overflowed to disk then tuplesort will switch to a simple
> +  * insertion sort or heap sort and throw away the uninteresting tuples.
> +  */

The latter seems to be correct.


> ! /*
> !  * Convert the existing unordered list of sorttuples to a heap in either order.
> !  * This used to be inline but now there are three separate places we heap sort
> !  * (initializing the tapes, if we have a bounded output, and any time the user
> !  * says he doesn't want to use glibc's qsort).
> !  *
> !  * NOTE heapify passes false for checkIndex (and stores a constant tupindex
> !  * passed as a parameter) even though we use heaps for multi-run sources
> !  * because we only heapify when we're doing in-memory sorts or in inittapes
> !  * before there's any point in comparing tupindexes.
> !  */
> !
> ! static void
> ! tuplesort_heapify(Tuplesortstate *state, int tupindex, HeapOrder heaporder)
> ! {

The comment claims that we use heap sort when the user says he doesn't
want to use glibc's qsort. I recall that we always use our own qsort
implementation nowadays. And we never used the heap sort as a qsort
replacement, did we?

In performsort, you convert the in-memory heap to a sorted list in one
go. I wonder if it would be better to switch to a new TSS_ALLINHEAP
state that means "all tuples are now in the in-memory heap", and call
tuplesort_heap_siftup in gettuple. It probably doesn't make much
difference in most cases, but if there's another limit node in the plan
with a smaller limit or the client only fetches a few top rows with a
cursor you'd avoid unheapifying tuples that are just thrown away later.

There's a few blocks of code surrounded with "#if 0 - #endif". Are those
just leftovers that should be removed, or are things that still need to
finished and enabled?

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: LIMIT/SORT optimization

From
Gregory Stark
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

> Some comments on the patch below.

Thanks!

> Gregory Stark wrote:
>
>
>
> The comment claims that we use heap sort when the user says he doesn't want to
> use glibc's qsort. I recall that we always use our own qsort implementation
> nowadays. And we never used the heap sort as a qsort replacement, did we?

Thanks, I had a version that used heap sort instead of qsort but that was
before I discovered what you said. So I stripped that useless bit out.

> In performsort, you convert the in-memory heap to a sorted list in one go. I
> wonder if it would be better to switch to a new TSS_ALLINHEAP state that means
> "all tuples are now in the in-memory heap", and call tuplesort_heap_siftup in
> gettuple.

The problem is that the heap is backwards. The head of the heap is the
greatest, i.e., the last element we want to return. Hm, is there such a thing
as a two-way heap?
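To spell the problem out: the bounded sort keeps the top n in a max-heap, so
popping the root repeatedly hands back the kept tuples in descending order,
i.e. last-wanted-first. Here is a toy standalone sketch (not the tuplesort.c
code) of draining such a heap into ascending order by filling the array from
the back, which is roughly why performsort flattens the whole heap up front:

    #include <stdio.h>

    static void
    sift_down(int *heap, int size, int i)
    {
        for (;;)
        {
            int largest = i, l = 2 * i + 1, r = 2 * i + 2;

            if (l < size && heap[l] > heap[largest]) largest = l;
            if (r < size && heap[r] > heap[largest]) largest = r;
            if (largest == i) break;
            int tmp = heap[i]; heap[i] = heap[largest]; heap[largest] = tmp;
            i = largest;
        }
    }

    int
    main(void)
    {
        int heap[] = {7, 1, 3};     /* max-heap holding the three smallest values */
        int size = 3;

        /* repeatedly pop the root (the *largest* kept value) and park it at
         * the end of the array; the array ends up ascending: 1 3 7 */
        while (size > 1)
        {
            int tmp = heap[0]; heap[0] = heap[size - 1]; heap[size - 1] = tmp;
            size--;
            sift_down(heap, size, 0);
        }

        for (int i = 0; i < 3; i++)
            printf("%d ", heap[i]);
        printf("\n");
        return 0;
    }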

> There's a few blocks of code surrounded with "#if 0 - #endif". Are those just
> leftovers that should be removed, or are things that still need to finished and
> enabled?

Uhm, I don't remember, will go look, thanks.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: LIMIT/SORT optimization

From
Gregory Stark
Date:
Updated patch attached:

1) Removes #if 0 optimizations

2) Changes #if 0 to #if NOT_USED for code that's there for completeness and to
   keep the code self-documenting, but isn't needed by anything currently

3) Fixed cost model to represent bounded sorts




"Gregory Stark" <stark@enterprisedb.com> writes:

> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>
>> There's a few blocks of code surrounded with "#if 0 - #endif". Are those just
>> leftovers that should be removed, or are things that still need to finished and
>> enabled?
>
> Uhm, I don't remember, will go look, thanks.

Ok, they were leftover code from an optimization that I decided wasn't very
important to pursue. The code that was ifdef'd out detected when disk sorts
could abort a disk sort merge because it had already generated enough tuples
to satisfy the limit.

But I never wrote the code to actually abort the run and it looks a bit
tricky. In any case the disk sort use case is extremely narrow; you would need
something like "LIMIT 50000" or more to trigger it and it would have to be an
input table huge enough to cause multiple rounds of merges.


I think I've figured out how to adjust the cost model. It turns out that it
doesn't usually matter whether the cost model is correct, since any case where
the optimization kicks in is a case where you're reading a small portion of the
input, so it's a case where an index would be *much* better if available. So
the only times the optimization is used are when there's no index available.
Nonetheless it's nice to get the estimates right so that higher levels in the
plan get reasonable values.

I think I figured out how to do the cost model. At least the results are
reasonable. I'm not sure if I've done it the "right" way though.
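As a rough illustration of the kind of adjustment involved (standalone C, not
the costing code in the patch, and the formula is only an estimate): a full
sort of N tuples costs about N * log2(N) comparisons, while a bounded heap
sort keeping k tuples compares each incoming tuple against the heap root and
pays an O(log2 k) sift only when it displaces something, so something like
N * log2(2k) is a reasonable figure.

    #include <math.h>
    #include <stdio.h>

    static double
    full_sort_comparisons(double n)
    {
        return n * log2(n);
    }

    static double
    bounded_sort_comparisons(double n, double k)
    {
        if (k >= n)
            return full_sort_comparisons(n);
        /* one comparison per input tuple against the heap root, plus an
         * O(log2 k) sift for the tuples that actually enter the heap */
        return n * log2(2.0 * k);
    }

    int
    main(void)
    {
        double n = 1000000.0;

        printf("full sort:          %.0f comparisons\n", full_sort_comparisons(n));
        printf("bounded, k = 100:   %.0f comparisons\n", bounded_sort_comparisons(n, 100));
        printf("bounded, k = 10000: %.0f comparisons\n", bounded_sort_comparisons(n, 10000));
        return 0;
    }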


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Attachment

Re: LIMIT/SORT optimization

From
Bruce Momjian
Date:
I did some performance testing of the patch, and the results were good.
I did this:

    test=> CREATE TABLE test (x INTEGER);
    test=> INSERT INTO test SELECT * FROM generate_series(1, 1000000);
    test=> SET log_min_duration_statement = 0;
    test=> SELECT * FROM test ORDER BY x LIMIT 3;

and the results were, before the patch, for three runs:

  LOG:  duration: 1753.518 ms  statement: select * from test order by x limit 3;
  LOG:  duration: 1766.019 ms  statement: select * from test order by x limit 3;
  LOG:  duration: 1777.520 ms  statement: select * from test order by x limit 3;

and after the patch:

  LOG:  duration: 449.649 ms  statement: select * from test order by x limit 3;
  LOG:  duration: 443.450 ms  statement: select * from test order by x limit 3;
  LOG:  duration: 443.086 ms  statement: select * from test order by x limit 3;

---------------------------------------------------------------------------

Gregory Stark wrote:
>
> Updated patch attached:
>
> 1) Removes #if 0 optimizations
>
> 2) Changes #if 0 to #if NOT_USED for code that's there for completeness and to
>    keep the code self-documenting, but isn't needed by anything currently
>
> 3) Fixed cost model to represent bounded sorts
>
>

[ Attachment, skipping... ]

>
>
> "Gregory Stark" <stark@enterprisedb.com> writes:
>
> > "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> >
> >> There's a few blocks of code surrounded with "#if 0 - #endif". Are those just
> >> leftovers that should be removed, or are things that still need to finished and
> >> enabled?
> >
> > Uhm, I don't remember, will go look, thanks.
>
> Ok, they were leftover code from an optimization that I decided wasn't very
> important to pursue. The code that was ifdef'd out detected when disk sorts
> could abort a disk sort merge because it had already generated enough tuples
> to satisfy the limit.
>
> But I never wrote the code to actually abort the run and it looks a bit
> tricky. In any case the disk sort use case is extremely narrow; you would need
> something like "LIMIT 50000" or more to trigger it and it would have to be an
> input table huge enough to cause multiple rounds of merges.
>
>
> I think I've figured out how to adjust the cost model. It turns out that it
> doesn't usually matter whether the cost model is correct since any case where
> the optimization kicks in is a case you're reading a small portion of the
> input so it's a case where an index would be *much* better if available. So
> the only times the optimization is used is when there's no index available.
> Nonetheless it's nice to get the estimates right so that higher levels in the
> plan get reasonable values.
>
> I think I figured out how to do the cost model. At least the results are
> reasonable. I'm not sure if I've done it the "right" way though.
>
>
> --
>   Gregory Stark
>   EnterpriseDB          http://www.enterprisedb.com
>

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: LIMIT/SORT optimization

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> I did some performance testing of the patch, and the results were good.
> I did this:

>     test=> CREATE TABLE test (x INTEGER);
>     test=> INSERT INTO test SELECT * FROM generate_series(1, 1000000);
>     test=> SET log_min_duration_statement = 0;
>     test=> SELECT * FROM test ORDER BY x LIMIT 3;

LIMIT 3 seems an awfully favorable case; if the patch can only manage a
factor of 4 speedup there, what happens at limit 10, 20, 100?  Also,
you've tested only one sort size and only one (unspecified) value of
work_mem, and the usefulness of the patch would surely vary depending on
that.  In particular, what happens with a LIMIT large enough to overflow
work_mem?

Lastly, I suspect that sorting presorted input might be particularly
favorable for this patch.  Please try it with random data for comparison.

            regards, tom lane

Re: LIMIT/SORT optimization

From
"Simon Riggs"
Date:
On Sat, 2007-04-07 at 14:11 -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > I did some performance testing of the patch, and the results were good.
> > I did this:
>
> >     test=> CREATE TABLE test (x INTEGER);
> >     test=> INSERT INTO test SELECT * FROM generate_series(1, 1000000);
> >     test=> SET log_min_duration_statement = 0;
> >     test=> SELECT * FROM test ORDER BY x LIMIT 3;
>
> LIMIT 3 seems an awfully favorable case; if the patch can only manage a
> factor of 4 speedup there, what happens at limit 10, 20, 100?  Also,
> you've tested only one sort size and only one (unspecified) value of
> work_mem, and the usefulness of the patch would surely vary depending on
> that.  In particular, what happens with a LIMIT large enough to overflow
> work_mem?

Yeah, this is really designed to improve the case where we retrieve a
"screenful" of data, i.e. 25, 50 or 100 records, or at worst 10
screenfuls.

The code deliberately doesn't use an insertion sort for that reason,
since those sizes are beyond the cut-off where an insertion sort works best.
So it should be optimised for medium numbers of rows when no index is present.

The use case is important because we want to be able to populate data
for screens in a reasonably bounded time, not one that gets suddenly
worse should the number of possible matches exceed work_mem. [Think how
well Google reacts to varying numbers of candidate matches] Whatever
happens with LIMIT > work_mem doesn't fit the use case, so as long as it
is no slower than what we have now, that should be fine.

> Lastly, I suspect that sorting presorted input might be particularly
> favorable for this patch.  Please try it with random data for comparison.

Agreed.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: LIMIT/SORT optimization

From
Bruce Momjian
Date:
I reran the test using:

    test=> CREATE TABLE test (x INTEGER);
    test=> INSERT INTO test SELECT * FROM generate_series(1, 1000000);
    test=> SET log_min_duration_statement = 0;

and got on an unpatched system:

    1751.320 ms  select * from (select * from test order by x limit 3) as x limit 1;
    1725.092 ms  select * from (select * from test order by x limit 3) as x limit 1;
    1709.463 ms  select * from (select * from test order by x limit 3) as x limit 1;
    1702.917 ms  select * from (select * from test order by x limit 10) as x limit 1;
    1705.793 ms  select * from (select * from test order by x limit 10) as x limit 1;
    1704.046 ms  select * from (select * from test order by x limit 10) as x limit 1;
    1699.730 ms  select * from (select * from test order by x limit 100) as x limit 1;
    1712.628 ms  select * from (select * from test order by x limit 100) as x limit 1;
    1699.454 ms  select * from (select * from test order by x limit 100) as x limit 1;
    1720.207 ms  select * from (select * from test order by x limit 1000) as x limit 1;
    1725.519 ms  select * from (select * from test order by x limit 1000) as x limit 1;
    1728.933 ms  select * from (select * from test order by x limit 1000) as x limit 1;
    1699.609 ms  select * from (select * from test order by x limit 10000) as x limit 1;
    1698.386 ms  select * from (select * from test order by x limit 10000) as x limit 1;
    1698.985 ms  select * from (select * from test order by x limit 10000) as x limit 1;
    1700.740 ms  select * from (select * from test order by x limit 100000) as x limit 1;
    1700.989 ms  select * from (select * from test order by x limit 100000) as x limit 1;
    1695.771 ms  select * from (select * from test order by x limit 100000) as x limit 1;

which is expected because the sort work is constant.  With the patch I
see:

    433.892 ms  select * from (select * from test order by x limit 3) as x limit 1;
    496.016 ms  select * from (select * from test order by x limit 3) as x limit 1;
    434.604 ms  select * from (select * from test order by x limit 3) as x limit 1;
    433.265 ms  select * from (select * from test order by x limit 10) as x limit 1;
    432.058 ms  select * from (select * from test order by x limit 10) as x limit 1;
    431.329 ms  select * from (select * from test order by x limit 10) as x limit 1;
    429.722 ms  select * from (select * from test order by x limit 100) as x limit 1;
    434.754 ms  select * from (select * from test order by x limit 100) as x limit 1;
    429.758 ms  select * from (select * from test order by x limit 100) as x limit 1;
    432.060 ms  select * from (select * from test order by x limit 1000) as x limit 1;
    432.523 ms  select * from (select * from test order by x limit 1000) as x limit 1;
    433.917 ms  select * from (select * from test order by x limit 1000) as x limit 1;
    449.885 ms  select * from (select * from test order by x limit 10000) as x limit 1;
    450.182 ms  select * from (select * from test order by x limit 10000) as x limit 1;
    450.536 ms  select * from (select * from test order by x limit 10000) as x limit 1;
    1771.807 ms  select * from (select * from test order by x limit 100000) as x limit 1;
    1746.628 ms  select * from (select * from test order by x limit 100000) as x limit 1;
    1795.600 ms  select * from (select * from test order by x limit 100000) as x limit 1;

The patch is faster until we hit 100k or 10% of the table, at which
point it is the same speed.  What is interesting is 1M is also the same
speed:

    1756.401 ms  select * from (select * from test order by x limit 1000000) as x limit 1;
    1744.104 ms  select * from (select * from test order by x limit 1000000) as x limit 1;
    1734.198 ms  select * from (select * from test order by x limit 1000000) as x limit 1;

This is with the default work_mem of '1M'.  I used LIMIT 1 so the times
were not affected by the size of the data transfer to the client.


---------------------------------------------------------------------------

Bruce Momjian wrote:
>
> I did some performance testing of the patch, and the results were good.
> I did this:
>
>     test=> CREATE TABLE test (x INTEGER);
>     test=> INSERT INTO test SELECT * FROM generate_series(1, 1000000);
>     test=> SET log_min_duration_statement = 0;
>     test=> SELECT * FROM test ORDER BY x LIMIT 3;
>
> and the results were, before the patch, for three runs:
>
>   LOG:  duration: 1753.518 ms  statement: select * from test order by x limit 3;
>   LOG:  duration: 1766.019 ms  statement: select * from test order by x limit 3;
>   LOG:  duration: 1777.520 ms  statement: select * from test order by x limit 3;
>
> and after the patch:
>
>   LOG:  duration: 449.649 ms  statement: select * from test order by x limit 3;
>   LOG:  duration: 443.450 ms  statement: select * from test order by x limit 3;
>   LOG:  duration: 443.086 ms  statement: select * from test order by x limit 3;
>

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: LIMIT/SORT optimization

From
Bruce Momjian
Date:
Oh, sorry, forgot to do a random table test.  The test used:

    DROP TABLE test;
    CREATE TABLE test (x INTEGER);
    INSERT INTO test SELECT random()*1000000 FROM generate_series(1, 1000000);

As expected the unpatched version is consistent for all LIMIT values
(first query was slow due to load after INSERT):

    14567.074 ms select * from (select * from test order by x limit 3) as x limit 1;
    4031.029 ms select * from (select * from test order by x limit 3) as x limit 1;
    3612.417 ms select * from (select * from test order by x limit 3) as x limit 1;
    3505.966 ms select * from (select * from test order by x limit 10) as x limit 1;
    3707.830 ms select * from (select * from test order by x limit 10) as x limit 1;
    3619.410 ms select * from (select * from test order by x limit 10) as x limit 1;
    5548.770 ms select * from (select * from test order by x limit 100) as x limit 1;
    3839.660 ms select * from (select * from test order by x limit 100) as x limit 1;
    4098.445 ms select * from (select * from test order by x limit 100) as x limit 1;
    3677.659 ms select * from (select * from test order by x limit 1000) as x limit 1;
    3956.980 ms select * from (select * from test order by x limit 1000) as x limit 1;
    3824.934 ms select * from (select * from test order by x limit 1000) as x limit 1;
    4641.589 ms select * from (select * from test order by x limit 10000) as x limit 1;
    4057.902 ms select * from (select * from test order by x limit 10000) as x limit 1;
    4682.779 ms select * from (select * from test order by x limit 10000) as x limit 1;
    4032.351 ms select * from (select * from test order by x limit 100000) as x limit 1;
    4572.528 ms select * from (select * from test order by x limit 100000) as x limit 1;
    4985.500 ms select * from (select * from test order by x limit 100000) as x limit 1;
    4942.422 ms select * from (select * from test order by x limit 1000000) as x limit 1;
    4669.230 ms select * from (select * from test order by x limit 1000000) as x limit 1;
    4639.258 ms select * from (select * from test order by x limit 1000000) as x limit 1;

and with the patch:

    1731.234 ms select * from (select * from test order by x limit 3) as x limit 1;
    570.315 ms select * from (select * from test order by x limit 3) as x limit 1;
    430.119 ms select * from (select * from test order by x limit 3) as x limit 1;
    431.580 ms select * from (select * from test order by x limit 10) as x limit 1;
    431.253 ms select * from (select * from test order by x limit 10) as x limit 1;
    432.112 ms select * from (select * from test order by x limit 10) as x limit 1;
    433.536 ms select * from (select * from test order by x limit 100) as x limit 1;
    433.115 ms select * from (select * from test order by x limit 100) as x limit 1;
    432.478 ms select * from (select * from test order by x limit 100) as x limit 1;
    442.886 ms select * from (select * from test order by x limit 1000) as x limit 1;
    442.133 ms select * from (select * from test order by x limit 1000) as x limit 1;
    444.905 ms select * from (select * from test order by x limit 1000) as x limit 1;
    522.782 ms select * from (select * from test order by x limit 10000) as x limit 1;
    521.481 ms select * from (select * from test order by x limit 10000) as x limit 1;
    521.526 ms select * from (select * from test order by x limit 10000) as x limit 1;
    3317.216 ms select * from (select * from test order by x limit 100000) as x limit 1;
    3365.467 ms select * from (select * from test order by x limit 100000) as x limit 1;
    3355.447 ms select * from (select * from test order by x limit 100000) as x limit 1;
    3307.745 ms select * from (select * from test order by x limit 1000000) as x limit 1;
    3315.602 ms select * from (select * from test order by x limit 1000000) as x limit 1;
    3585.736 ms select * from (select * from test order by x limit 1000000) as x limit 1;

---------------------------------------------------------------------------

Bruce Momjian wrote:
>
> I reran the test using:
>
>     test=> CREATE TABLE test (x INTEGER);
>     test=> INSERT INTO test SELECT * FROM generate_series(1, 1000000);
>     test=> SET log_min_duration_statement = 0;
>
> and got on an unpatched system:
>
>     1751.320 ms  select * from (select * from test order by x limit 3) as x limit 1;
>     1725.092 ms  select * from (select * from test order by x limit 3) as x limit 1;
>     1709.463 ms  select * from (select * from test order by x limit 3) as x limit 1;
>     1702.917 ms  select * from (select * from test order by x limit 10) as x limit 1;
>     1705.793 ms  select * from (select * from test order by x limit 10) as x limit 1;
>     1704.046 ms  select * from (select * from test order by x limit 10) as x limit 1;
>     1699.730 ms  select * from (select * from test order by x limit 100) as x limit 1;
>     1712.628 ms  select * from (select * from test order by x limit 100) as x limit 1;
>     1699.454 ms  select * from (select * from test order by x limit 100) as x limit 1;
>     1720.207 ms  select * from (select * from test order by x limit 1000) as x limit 1;
>     1725.519 ms  select * from (select * from test order by x limit 1000) as x limit 1;
>     1728.933 ms  select * from (select * from test order by x limit 1000) as x limit 1;
>     1699.609 ms  select * from (select * from test order by x limit 10000) as x limit 1;
>     1698.386 ms  select * from (select * from test order by x limit 10000) as x limit 1;
>     1698.985 ms  select * from (select * from test order by x limit 10000) as x limit 1;
>     1700.740 ms  select * from (select * from test order by x limit 100000) as x limit 1;
>     1700.989 ms  select * from (select * from test order by x limit 100000) as x limit 1;
>     1695.771 ms  select * from (select * from test order by x limit 100000) as x limit 1;
>
> which is expected because the sort work is constant.  With the patch I
> see:
>
>     433.892 ms  select * from (select * from test order by x limit 3) as x limit 1;
>     496.016 ms  select * from (select * from test order by x limit 3) as x limit 1;
>     434.604 ms  select * from (select * from test order by x limit 3) as x limit 1;
>     433.265 ms  select * from (select * from test order by x limit 10) as x limit 1;
>     432.058 ms  select * from (select * from test order by x limit 10) as x limit 1;
>     431.329 ms  select * from (select * from test order by x limit 10) as x limit 1;
>     429.722 ms  select * from (select * from test order by x limit 100) as x limit 1;
>     434.754 ms  select * from (select * from test order by x limit 100) as x limit 1;
>     429.758 ms  select * from (select * from test order by x limit 100) as x limit 1;
>     432.060 ms  select * from (select * from test order by x limit 1000) as x limit 1;
>     432.523 ms  select * from (select * from test order by x limit 1000) as x limit 1;
>     433.917 ms  select * from (select * from test order by x limit 1000) as x limit 1;
>     449.885 ms  select * from (select * from test order by x limit 10000) as x limit 1;
>     450.182 ms  select * from (select * from test order by x limit 10000) as x limit 1;
>     450.536 ms  select * from (select * from test order by x limit 10000) as x limit 1;
>     1771.807 ms  select * from (select * from test order by x limit 100000) as x limit 1;
>     1746.628 ms  select * from (select * from test order by x limit 100000) as x limit 1;
>     1795.600 ms  select * from (select * from test order by x limit 100000) as x limit 1;
>
> The patch is faster until we hit 100k or 10% of the table, at which
> point it is the same speed.  What is interesting is 1M is also the same
> speed:
>
>     1756.401 ms  select * from (select * from test order by x limit 1000000) as x limit 1;
>     1744.104 ms  select * from (select * from test order by x limit 1000000) as x limit 1;
>     1734.198 ms  select * from (select * from test order by x limit 1000000) as x limit 1;
>
> This is with the default work_mem of '1M'.  I used LIMIT 1 so the times
> were not affected by the size of the data transfer to the client.
>

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: LIMIT/SORT optimization

From
"Simon Riggs"
Date:
On Sat, 2007-04-07 at 22:16 -0400, Bruce Momjian wrote:

> > The patch is faster until we hit 100k or 10% of the table, at which
> > point it is the same speed.  What is interesting is 1M is also the same
> > speed:

All tests good, AFAICS. Thanks Bruce.

Patch is operating as expected: the common case is considerably faster
and the uncommon case isn't any slower.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: LIMIT/SORT optimization

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Bruce Momjian <bruce@momjian.us> writes:
>> I did some performance testing of the patch, and the results were good.
>> I did this:
>
>>     test=> CREATE TABLE test (x INTEGER);
>>     test=> INSERT INTO test SELECT * FROM generate_series(1, 1000000);
>>     test=> SET log_min_duration_statement = 0;
>>     test=> SELECT * FROM test ORDER BY x LIMIT 3;
>
> LIMIT 3 seems an awfully favorable case; if the patch can only manage a
> factor of 4 speedup there, what happens at limit 10, 20, 100?

I did a whole bunch of benchmarking and spent a pile of time gathering numbers
in gnumeric to make some pretty graphs. Then gnumeric crashed :( It's days
like this that make me appreciate our own code review process, even if I
usually wish it were a bit looser.

Until I gather all those numbers together again I'll just give the
executive summary. I found that the ceiling of a factor-of-4 improvement was
mostly an artifact of the very narrow and cheap-to-sort keys. When more of the
time is spent pushing around large tuples, or in even moderately expensive
comparisons, the speedup is much greater.

When I tested with narrow tuples consisting of only a single integer I found
that the maximum speedup was, as Bruce found, about 4x. I think what's going
on is that the linear cost of just reading in all those tuples is large enough
to hide the improvement in sorting, especially since with such narrow tuples
the resulting data is just too small to overflow the filesystem cache, which
is where the big benefits come from.

When I tested with text strings ranging from 0-97 characters long in locale
en_GB.UTF-8 I found that the speedup kept going up as the input data set grew.

Sorting the top 1,000 strings from an input set of 1,000,000 strings went from
8m11s in stock Postgres to 12s using the bounded heapsort.

(Which was an even better result than I had prior to fully randomizing the
data. It probably just got packed better on disk in the source table.)

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com