Thread: Re: [COMMITTERS] pgsql: Teach tuplesort.c about "top N" sorting, in which only the first

Magnus Hagander <magnus@hagander.net> writes:
> On Fri, May 04, 2007 at 12:38:18PM -0400, Tom Lane wrote:
>> Magnus Hagander <magnus@hagander.net> writes:
>>> Could we show it in EXPLAIN ANALYZE somehow? I'm thinking it would be good
>>> to see at runtime (for example as a hint that if you put in a bit more
>>> work_mem it might get used)
>> 
>> I don't see that this is any more interesting than whether the sort
>> spilled to disk or not; which is something we don't show in EXPLAIN
>> ANALYZE either.  trace_sort is the agreed API for examining that.

> Now that you mention it, that'd be nice to have as well - the point being
> making it available without recompile.

Well, trace_sort is available by default, but...

>> It's not exactly easy to do, because (a) none of this information
>> is exposed outside tuplesort.c, and (b) the tuplesortstate object
>> is probably gone by the time EXPLAIN ANALYZE runs, anyway.

> Hmm. Ok. Don't know enough about those parts of the code to comment on
> that, but I'll certainly take your word for it :-)

I take back point (b) --- the state object is released at ExecutorEnd,
and EXPLAIN ANALYZE examines the tree before doing that, so if we added
some kind of reporting function to tuplesort.c's API it'd be doable
easily enough.

What do you think the output should look like?  The first thought that
comes to mind is to add "method=memory" (or disk or top-N) to the
"actual" annotation:

regression=# explain analyze select * from tenk1 order by fivethous limit 100;
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=840.19..840.44 rows=100 width=244) (actual time=140.511..141.604 rows=100 loops=1)
   ->  Sort  (cost=840.19..865.19 rows=10000 width=244) (actual time=140.492..140.880 rows=100 loops=1 method=top-N)
                                                                                                       ^^^^^^^^^^^^
         Sort Key: fivethous
         ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.074..51.849 rows=10000 loops=1)
 Total runtime: 143.089 ms
(5 rows)

Another possibility, which could be wedged into explain.c slightly more
easily, is to append "Method: top-N" or some such to the Sort Key line,
but I'm not sure that that would look nice.

Comments, ideas?
        regards, tom lane


> >> It's not exactly easy to do, because (a) none of this information
> >> is exposed outside tuplesort.c, and (b) the tuplesortstate object
> >> is probably gone by the time EXPLAIN ANALYZE runs, anyway.
> 
> > Hmm. Ok. Don't know enough about those parts of the code to comment on
> > that, but I'll certainly take your word for it :-)
> 
> I take back point (b) --- the state object is released at ExecutorEnd,
> and EXPLAIN ANALYZE examines the tree before doing that, so if we added
> some kind of reporting function to tuplesort.c's API it'd be doable
> easily enough.
> 
> What do you think the output should look like?  The first thought that
> comes to mind is to add "method=memory" (or disk or top-N) to the
> "actual" annotation:
> 
> regression=# explain analyze select * from tenk1 order by fivethous limit 100;
>                                                        QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=840.19..840.44 rows=100 width=244) (actual time=140.511..141.604 rows=100 loops=1)
>    ->  Sort  (cost=840.19..865.19 rows=10000 width=244) (actual time=140.492..140.880 rows=100 loops=1 method=top-N)
>                                                                                                        ^^^^^^^^^^^^
>          Sort Key: fivethous
>          ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.074..51.849 rows=10000 loops=1)
>  Total runtime: 143.089 ms
> (5 rows)

Looks pretty good to me. Easy to find and hard to misunderstand :)

/Magnus



On May 4, 2007, at 7:08 PM, Tom Lane wrote:
> What do you think the output should look like?  The first thought that
> comes to mind is to add "method=memory" (or disk or top-N) to the
> "actual" annotation:
>
> regression=# explain analyze select * from tenk1 order by fivethous limit 100;
>                                                        QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=840.19..840.44 rows=100 width=244) (actual time=140.511..141.604 rows=100 loops=1)
>    ->  Sort  (cost=840.19..865.19 rows=10000 width=244) (actual time=140.492..140.880 rows=100 loops=1 method=top-N)
>                                                                                                        ^^^^^^^^^^^^
>          Sort Key: fivethous
>          ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.074..51.849 rows=10000 loops=1)
>  Total runtime: 143.089 ms
> (5 rows)
>
> Another possibility, which could be wedged into explain.c slightly more
> easily, is to append "Method: top-N" or some such to the Sort Key line,
> but I'm not sure that that would look nice.

If the method is disk it would be nice to know how much spilled to  
disk. That would tell you if it would be worth increasing work_mem,  
and by how much.

On a related note, it would also be *really* nice if we kept stats on  
how many sorts or hashes had spilled to disk, perhaps along with how  
much had spilled. Right now the only way to monitor that in a  
production system is to set up a cron job to watch pgsql_tmp, which is  
far from elegant.

I know there's concern about how much we add to the stats file, but I  
don't think this needs to be on a per-relation basis; per-database  
should be fine.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)




Jim Nasby <decibel@decibel.org> writes:
> On a related note, it would also be *really* nice if we kept stats on  
> how many sorts or hashes had spilled to disk, perhaps along with how  
> much had spilled. Right now the only way to monitor that in a  
> production system is to setup a cron job to watch pgsql_tmp, which is  
> far from elegant.

No, you can turn on trace_sort and track it from watching the log.
If pgfouine hasn't got something for that already, I'd be surprised.
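
(A minimal sketch of trying that in a single session -- assuming a stock
build, where TRACE_SORT is compiled in by default:

    SET trace_sort = on;
    SET client_min_messages = log;   -- trace_sort messages come out at LOG level
    EXPLAIN ANALYZE SELECT * FROM tenk1 ORDER BY fivethous LIMIT 100;

The per-sort statistics then show up as LOG lines, in the server log and/or
at the client depending on log_min_messages and client_min_messages.)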
        regards, tom lane


> > What do you think the output should look like?  The first thought that
> > comes to mind is to add "method=memory" (or disk or top-N) to the
> > "actual" annotation:

Having the "disk" and "memory" would be really useful too.  

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco


Jim Nasby <decibel@decibel.org> writes:
> If the method is disk it would be nice to know how much spilled to  
> disk. That would tell you if it would be worth increasing work_mem,  
> and by how much.

Well, a more radical proposal is to add a whole 'nother line to the
output, which would give us room for several bits of info.  Perhaps
like this:
  ->  Sort  (cost=840.19..865.19 rows=10000 width=244) (actual time=151.769..152.157 rows=100 loops=1)
        Sort Key: fivethous
        Sort Method: top-N  Memory: 17KB
        ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=244) (actual

or
        Sort Method: disk  Memory: 1000KB  Disk: 18482KB

Not sure what other info might be useful.
        regards, tom lane


On 5/4/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>    ->  Sort  (cost=840.19..865.19 rows=10000 width=244) (actual time=140.492..140.880 rows=100 loops=1 method=top-N)
>                                                                                                        ^^^^^^^^^^^^
>          Sort Key: fivethous
> Another possibility, which could be wedged into explain.c slightly more
> easily, is to append "Method: top-N" or some such to the Sort Key line,
> but I'm not sure that that would look nice.

Is it possible to have something like Sort (disk|top-N|memory) instead
of Sort? I'm really not sure it's a good idea to break the (actual
time=0.074..51.849 rows=10000 loops=1) output we have for every node.
It's easier to read the output when it's consistent.
If not, appending it at the end of the Sort Key line is better IMHO.

--
Guillaume


On 5/4/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> No, you can turn on trace_sort and track it from watching the log.
> If pgfouine hasn't got something for that already, I'd be surprised.

Well, it hasn't. I never used trace_sort, so I didn't think of
implementing something to use it. I'll take a look at it for the next
version to see if I can do something useful.

If anyone has suggestions/needs on this point, they are welcome.

--
Guillaume


On 5/4/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>          Sort Method: disk  Memory: 1000KB  Disk: 18482KB

+1 for this one.

--
Guillaume


"Guillaume Smet" <guillaume.smet@gmail.com> writes:
> Is it possible to have something like Sort (disk|top-N|memory) instead
> of Sort?

That would be sane if the decision were fixed at plan time, but it isn't.
What do you think of the add-a-line approach?
        regards, tom lane


Guillaume Smet wrote:
> On 5/4/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>          Sort Method: disk  Memory: 1000KB  Disk: 18482KB
> 
> +1 for this one.

I like that one too ...


Stefan


Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> Guillaume Smet wrote:
>> On 5/4/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Sort Method: disk  Memory: 1000KB  Disk: 18482KB
>> 
>> +1 for this one.

> I like that one too ...

OK, in the event it looks like one of these four messages:

 "Sort Method:  top-N heapsort  Memory: %ldkB"
 "Sort Method:  quicksort  Memory: %ldkB"
 "Sort Method:  external sort  Disk: %ldkB"
 "Sort Method:  external merge  Disk: %ldkB"

where "external merge" implies that the final merge pass was done
on-the-fly instead of materializing the fully sorted data on disk.
I'm not wedded to these method descriptions if anyone has better phrases
in mind.
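
For illustration -- splicing one of those messages into the plan layout
proposed upthread, with the placeholder numbers reused from that example --
a top-N case would then read roughly as:

  ->  Sort  (cost=840.19..865.19 rows=10000 width=244) (actual time=151.769..152.157 rows=100 loops=1)
        Sort Key: fivethous
        Sort Method:  top-N heapsort  Memory: 17kB
        ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=244) (actual ...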

Also, I tried to make the disk cases print disk and memory usage both,
but was getting wacko numbers for memory usage.  I had forgotten that
the disk-sort path doesn't bother to track memory usage accurately once
it starts returning tuples to the caller (since at that point all the
decisions are made).  I'm not really excited about fixing that; it
would add per-tuple overhead for not much value.
        regards, tom lane


On May 4, 2007, at 7:49 PM, Tom Lane wrote:
> Jim Nasby <decibel@decibel.org> writes:
>> On a related note, it would also be *really* nice if we kept stats on
>> how many sorts or hashes had spilled to disk, perhaps along with how
>> much had spilled. Right now the only way to monitor that in a
>> production system is to setup a cron job to watch pgsql_tmp, which is
>> far from elegant.
>
> No, you can turn on trace_sort and track it from watching the log.
> If pgfouine hasn't got something for that already, I'd be surprised.

There are several problems with that. First, trace_sort isn't  
documented (or at least it's not in postgresql.conf), so most folks  
don't know it exists. Second, in order to see its output you have to  
drop log_min_messages to debug. That results in a huge log volume,  
especially on a production system.

Aside from that, log files are not a good way to monitor performance;  
they should be used for reporting on exception conditions. If the log  
were meant to be the means for monitoring performance, then why have  
the statistics system at all?

As for pgfouine, I've never been to a customer that knew what it was.  
But almost all of them have other monitoring tools such as Cricket,  
MRTG, and Nagios set up. Those that don't at least know they exist.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)




Jim Nasby <decibel@decibel.org> writes:
> There's several problems with that. First, trace_sort isn't  
> documented (or at least it's not in postgresql.conf), so most folks  
> don't know it exists. Second, in order to see it's output you have to  
> drop log_min_messages to debug. That results in a huge log volume,  
> especially on a production system.

Nonsense ... trace_sort output is at LOG level.
        regards, tom lane


On May 6, 2007, at 9:32 AM, Tom Lane wrote:
> Jim Nasby <decibel@decibel.org> writes:
>> There's several problems with that. First, trace_sort isn't
>> documented (or at least it's not in postgresql.conf), so most folks
>> don't know it exists. Second, in order to see it's output you have to
>> drop log_min_messages to debug. That results in a huge log volume,
>> especially on a production system.
>
> Nonsense ... trace_sort output is at LOG level.

I stand corrected. But my point still remains. It wouldn't be unusual  
for a website database to be running several sorts a second; that  
means 4 lines per sort, which is still a large amount of data.

If we really want to make the logfile the approved method for  
monitoring performance, then why do we have the stats infrastructure  
at all? It could all be replaced with logging output and pgfouine.

Why are we maintaining two separate sets of code for monitoring  
database performance?
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)




Jim Nasby wrote:

> If we really want to make the logfile the approved method for  
> monitoring performance, then why do we have the stats infrastructure  
> at all? It could all be replaced with logging output and pgfouine.

First we'd have to fix the usability problem of our redirect_stderr
stuff for pgfouine (multiline messages from different backends are
mixed, according to Guillaume).

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Alvaro Herrera wrote:
> Jim Nasby wrote:
> 
>> If we really want to make the logfile the approved method for  
>> monitoring performance, then why do we have the stats infrastructure  
>> at all? It could all be replaced with logging output and pgfouine.
> 
> First we'd have to fix the usability problem of our redirect_stderr
> stuff for pgfouine (multiline messages from different backends are
> mixed, according to Guillaume).

I'd like to sign on to the list of people saying that logging isn't the
best way to do performance measurement. Providing a way to get at the
counters in realtime for monitoring or graphing or whatever is what,
AFAIK, everybody else does, and for a good reason - it fits into
existing monitoring/management solutions. It may be that our current
stats system isn't the best way to do this, but if it isn't, that just
means we have to come up with a better way.

//Magnus


Magnus Hagander wrote:
> Alvaro Herrera wrote:
>> Jim Nasby wrote:
>>
>>> If we really want to make the logfile the approved method for  
>>> monitoring performance, then why do we have the stats infrastructure  
>>> at all? It could all be replaced with logging output and pgfouine.
>> First we'd have to fix the usability problem of our redirect_stderr
>> stuff for pgfouine (multiline messages from different backends are
>> mixed, according to Guillaume).
> 
> I'd like to sign on to the list of people saying that logging isn't the
> best way to do performance measuring. Providing a way to get at the
> counters in realtime for monitoring or graphing or whatever is what
> AFAIK everybody else do, and it's for a good reason - it fits in to
> existing monitoring/management solutions. It may be that our current
> stats system isn't the best way to do this, but if it isn't that just
> means we have to come up with a better way.


Speaking of which, it might be interesting to actually show these values
in the stats collector. I was thinking three cols for each database
(probably the best level?) that counts each of those three counters. If
you have a lot of sorts (percentage-wise) spilling to disk, it is often
something you want to investigate, so exposing it that way seems like a
good thing.
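
Roughly, it could end up being queried along these lines (a sketch only --
the placement in pg_stat_database and the column names sorts_mem /
sorts_disk / sorts_topn are entirely hypothetical at this point):

    -- hypothetical per-database sort counters, names invented for illustration
    SELECT datname, sorts_mem, sorts_disk, sorts_topn
      FROM pg_stat_database;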

Comments on that? Oh, and is it too late to sneak this into 8.3 by claiming
it's an extension of the new sort feature? ;-)

//Magnus


On May 8, 2007, at 2:24 PM, Magnus Hagander wrote:
> Speaking of which, it might be interesting to actually show these values
> in the stats collector. I was thinking three cols for each database
> (probably the best level?) that counts each of those three counters. If
> you have a lot of sorts (percentage-wise) spilling to disk, it is often
> something you want to investigate, so exposing it that way seems like a
> good thing.

What 3 columns? In-memory sorts, on-disk sorts, and on-disk size?
(Sum of how much spilled to disk).

I agree that per-database makes sense, though I'd settle for per-cluster.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)




On Wed, May 09, 2007 at 10:55:12AM -0500, Jim Nasby wrote:
> On May 8, 2007, at 2:24 PM, Magnus Hagander wrote:
> >Speaking of which, it might be interesting to actually show these values
> >in the stats collector. I was thinking three cols for each database
> >(probably the best level?) that counts each of those three counters. If
> >you have a lot of sorts (percentage-wise) spilling to disk, it is often
> >something you want to investigate, so exposing it that way seems like a
> >good thing.
> 
> What 3 columns? In-memory sorts, on-disk sorts, and on-disk size?  
> (Sum of how much spilled to disk).

I was thinking in-mem sorts, on-disk sorts, limited-by-LIMIT sorts (that
would be the new feature..)

//Magnus


"Magnus Hagander" <magnus@hagander.net> writes:

>> What 3 columns? In-memory sorts, on-disk sorts, and on-disk size?  
>> (Sum of how much spilled to disk).
>
> I was thinking in-mem sorts, on-disk sorts, limited-by-LIMIT sorts (that
> would be the new feature..)

Tom's code distinguished in-memory, top-N, on-disk with final merge postponed,
and on-disk with materialized result. Four categories. But I think the
distinction between the two types of in-memory and the two types of on-disk
sorts is only really useful when you're looking at an individual query. And
even then probably only useful to a Postgres hacker, not a DBA.

It seems like it would be more useful to just break it down into in-memory and
on-disk, but for each give the number of sorts, number of tuples, and space used.

What would be really handy is breaking this down by table -- probably that
would only be possible when the sort is directly sorting the output of a table
scan. I don't even know how easy it would be to get that information.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com



Gregory Stark wrote:
> "Magnus Hagander" <magnus@hagander.net> writes:
> 
>>> What 3 columns? In-memory sorts, on-disk sorts, and on-disk size?  
>>> (Sum of how much spilled to disk).
>> I was thinking in-mem sorts, on-disk sorts, limited-by-LIMIT sorts (that
>> would be the new feature..)
> 
> Tom's code distinguished in-memory, top-N, on-disk with final merge postponed,
> and on-disk with materialized result. Four categories. But I think the
> distinction between the two types of in-memory and the two types of on-disk
> sorts is only really useful when you're looking at an individual query. And
> even then probably only useful to a Postgres hacker, not a DBA.

Missed the two on-disk distinctions, yeah. But you're probably right
that on-disk vs in-memory is enough; the interesting thing is to get
an indication of when you hit disk, given what that does for performance.


> It seems like it would be more useful to just break it down into in-memory and
> on-disk but for each give number of sorts, number of tuples, and space used.
> 
> What would be really handy is breaking this down by table -- probably that
> would only be possible when the sort is sorting directly a table scan. I don't
> even know how easy it would be to get that information.

And how would you deal with a sort over the result of a join
or something like that - it makes things a lot more complicated ;)

And the original question remains: 8.3 or 8.4...

//Magnus


Magnus Hagander wrote:
> Gregory Stark wrote:
>> "Magnus Hagander" <magnus@hagander.net> writes:
>>
>>>> What 3 columns? In-memory sorts, on-disk sorts, and on-disk size?  
>>>> (Sum of how much spilled to disk).
>>> I was thinking in-mem sorts, on-disk sorts, limited-by-LIMIT sorts (that
>>> would be the new feature..)
>> Tom's code distinguished in-memory, top-N, on-disk with final merge postponed,
>> and on-disk with materialized result. Four categories. But I think the
>> distinction between the two types of in-memory and the two types of on-disk
>> sorts is only really useful when you're looking at an individual query. And
>> even then probably only useful to a Postgres hacker, not a DBA.
> 
> Missed the two on-disk distinctions, yeah. But you're probably right
> that on-disk vs in-memory is enough, the interesting thing is to get
> indications on when you hit disk given what it does for performance.

Keep in mind that when the sort "goes to disk", it actually just means 
that it used up work_mem and switched to a tape-based merge sort. In a 
typical configuration, there's plenty of RAM available to buffer the 
tapes, so the terms on-disk and in-memory sort are misleading. If I've 
understood the earlier discussion correctly, the quicksort -> tape sort 
transition is not even that interesting, because the tape sort is actually 
not that much slower than quicksort, as long as it fits in RAM.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


On May 9, 2007, at 11:01 AM, Magnus Hagander wrote:
> On Wed, May 09, 2007 at 10:55:12AM -0500, Jim Nasby wrote:
>> On May 8, 2007, at 2:24 PM, Magnus Hagander wrote:
>>> Speaking of which, it might be interesting to actually show these values
>>> in the stats collector. I was thinking three cols for each database
>>> (probably the best level?) that counts each of those three counters. If
>>> you have a lot of sorts (percentage-wise) spilling to disk, it is often
>>> something you want to investigate, so exposing it that way seems like a
>>> good thing.
>>
>> What 3 columns? In-memory sorts, on-disk sorts, and on-disk size?
>> (Sum of how much spilled to disk).
>
> I was thinking in-mem sorts, on-disk sorts, limited-by-LIMIT sorts (that
> would be the new feature..)

It would also be useful to know how much got spilled to disk. If it's  
a large amount per sort, then there's probably not much you could do,  
but if it's just a tad over available memory per-sort...
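
(In that case the obvious experiment, once you know which sort is spilling,
would be something along these lines -- the value is picked arbitrarily for
illustration:

    SET work_mem = '32MB';
    EXPLAIN ANALYZE SELECT * FROM tenk1 ORDER BY fivethous;

and then rechecking whether the Sort Method line still reports an external
sort.)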
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)