Thread: Merge algorithms for large numbers of "tapes"

Merge algorithms for large numbers of "tapes"

From
Tom Lane
Date:
BTW, I was just looking over Knuth's discussion of sorting again, and
realized that there is still something more that could be done within
the existing sort framework.  We currently use standard polyphase merge
(his Algorithm 5.4.2D), which IIRC I chose because it was simple and
for relatively small numbers of tapes T it was about as good as anything
else.  Knuth spends a great deal of energy on minimizing tape rewind
time which of course is of no interest to us, and I had supposed that
all of his more-complex algorithms were really only of interest if you
needed to consider rewind time.  However, now that we've changed the
code to prefer large numbers of tapes, it's not at all clear that
Algorithm D is still the right one to use.  In particular I'm looking at
cascade merge, Algorithm 5.4.3C, which appears to use significantly
fewer passes when T is large.  Do you want to try that?
        regards, tom lane


Re: Merge algorithms for large numbers of "tapes"

From
"Jonah H. Harris"
Date:
On 3/7/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
BTW, I was just looking over Knuth's discussion of sorting again, and
realized that there is still something more that could be done within
the existing sort framework.  We currently use standard polyphase merge
(his Algorithm 5.4.2D), which IIRC I chose because it was simple and
for relatively small numbers of tapes T it was about as good as anything
else.  Knuth spends a great deal of energy on minimizing tape rewind
time which of course is of no interest to us, and I had supposed that
all of his more-complex algorithms were really only of interest if you
needed to consider rewind time.  However, now that we've changed the
code to prefer large numbers of tapes, it's not at all clear that
Algorithm D is still the right one to use.  In particular I'm looking at
cascade merge, Algorithm 5.4.3C, which appears to use significantly
fewer passes when T is large.  Do you want to try that?

I haven't personally played with this algorithm but having spent the last 15 minutes reading it over, it does sound like an interesting idea for trial.  At first glance it didn't seem much better than polyphase for our case, but after reading the entire algorithm, discussion, and thinking it over for a couple minutes, I could see it as potentially better.

Guess we won't really know 'til it can be tested :)


--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324

Re: Merge algorithms for large numbers of "tapes"

From
"Dann Corbit"
Date:

Use a priority queue over the sorted sub-lists.  When the key extracted from the head of the lowest sub-list exceeds the key at the head of the next sub-list, adjust that sub-list's priority within the queue of sub-lists.

 

It uses a total of 2 read/write passes over the data, no matter how many subfiles you have.  It is dramatically faster than any other kind of external merge when you have lots of subfiles.

 

I posted some message to the list on this subject before, and gave a pointer to sample code that demonstrates the concept.

 

If you have one million sub-files, it still only takes a total of two read-write passes.
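
For illustration, here is a minimal sketch of that queue-driven merge in C.
It assumes each run is a temp file of binary ints and uses a binary min-heap
as the priority queue; the names and the int-only format are invented for
the example, not taken from PostgreSQL:

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int   head;   /* current smallest unread value of this run */
    FILE *fp;     /* the run ("tape") it came from */
} HeapItem;

static void sift_down(HeapItem *h, int n, int i)
{
    for (;;) {
        int smallest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < n && h[l].head < h[smallest].head) smallest = l;
        if (r < n && h[r].head < h[smallest].head) smallest = r;
        if (smallest == i) return;
        HeapItem tmp = h[i]; h[i] = h[smallest]; h[smallest] = tmp;
        i = smallest;
    }
}

/* Merge nruns sorted run files into 'out' in a single read pass. */
void merge_runs(FILE **runs, int nruns, FILE *out)
{
    HeapItem *heap = malloc(nruns * sizeof(HeapItem));
    int n = 0;

    /* Prime the heap with the first record of each run. */
    for (int i = 0; i < nruns; i++) {
        int v;
        if (fread(&v, sizeof(v), 1, runs[i]) == 1) {
            heap[n].head = v;
            heap[n].fp = runs[i];
            n++;
        }
    }
    for (int i = n / 2 - 1; i >= 0; i--)
        sift_down(heap, n, i);

    /* Repeatedly emit the global minimum and refill from its run. */
    while (n > 0) {
        fwrite(&heap[0].head, sizeof(int), 1, out);
        int v;
        if (fread(&v, sizeof(v), 1, heap[0].fp) == 1)
            heap[0].head = v;      /* same run supplies the next candidate */
        else
            heap[0] = heap[--n];   /* run exhausted: drop it from the heap */
        sift_down(heap, n, 0);
    }
    free(heap);
}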



Re: Merge algorithms for large numbers of "tapes"

From
"Luke Lonergan"
Date:
Tom,

> fewer passes when T is large.  Do you want to try that?

Two passes is the state-of-the-practice on external disk sorts.

If we're looking to replace the tape sort approach, I would hope for a two
pass approach, with the merge pass avoided in the case of unidirectional
access.

- Luke




Re: Merge algorithms for large numbers of "tapes"

From
Tom Lane
Date:
"Luke Lonergan" <llonergan@greenplum.com> writes:
> Two passes is the state-of-the-practice on external disk sorts.

There is no such thing as a fixed number of passes regardless of
available memory and size of the data.
        regards, tom lane


Re: Merge algorithms for large numbers of "tapes"

From
"Luke Lonergan"
Date:
<font face="Verdana, Helvetica, Arial"><span style="font-size:14.0px">Yes – all of the current best practice external
sortsuse two passes.  A first to produce the runs, which results in “S” number of “files”, then a single merge pass
acrossthe runs.  At most 1 pass across the S runs is required to implement the merge.<br /><br /> - Luke <br /><br
/><br/> On 3/7/06 8:03 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:<br /><br /></span></font><blockquote><font
face="Verdana,Helvetica, Arial"><span style="font-size:14.0px">"Luke Lonergan" <llonergan@greenplum.com>
writes:<br/> > Two passes is the state-of-the-practice on external disk sorts.<br /><br /> There is no such thing as
afixed number of passes regardless of<br /> available memory and size of the data.<br /><br />
                        regards,tom lane<br /><br /><br /></span></font></blockquote><font face="Verdana, Helvetica,
Arial"><spanstyle="font-size:14.0px"><br /></span></font> 

Re: Merge algorithms for large numbers of "tapes"

From
Greg Stark
Date:
"Jonah H. Harris" <jonah.harris@gmail.com> writes:

> On 3/7/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > However, now that we've changed the code to prefer large numbers of tapes,
> > it's not at all clear that Algorithm D is still the right one to use. In
> > particular I'm looking at cascade merge, Algorithm 5.4.3C, which appears
> > to use significantly fewer passes when T is large. Do you want to try
> > that?
> 
> Guess we won't really know 'til it can be tested :)

It would also be interesting to allow multiple temporary areas to be declared
and try to spread tape files across the temporary areas. Ideally keeping input
and output tapes on separate drives.

-- 
greg



Re: Merge algorithms for large numbers of "tapes"

From
"Luke Lonergan"
Date:
Tom,

On 3/7/06 8:03 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> "Luke Lonergan" <llonergan@greenplum.com> writes:
>> Two passes is the state-of-the-practice on external disk sorts.
> 
> There is no such thing as a fixed number of passes regardless of
> available memory and size of the data.

While technically correct, practice shows that two passes have been the norm
for the vast majority of cases since 1986: http://research.microsoft.com/~Gray/papers/TandemTR86.3_FastSort.doc

The square root of the sort set size is the memory requirement for a two-pass
sort: 10 MB for a 10 GB sort set, for instance.
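
A rough sketch of where the square-root figure comes from, assuming runs of
about M bytes and a fixed per-run read buffer of B bytes during the single
merge pass (the 10 KB buffer below is an illustrative assumption, not a
PostgreSQL constant): a one-pass merge needs (N/M)*B <= M, i.e. M >= sqrt(N*B).

#include <math.h>
#include <stdio.h>

int main(void)
{
    double N = 10e9;    /* sort set: 10 GB                      */
    double B = 10e3;    /* assumed per-run merge buffer: 10 KB  */
    /* All N/M run buffers must fit in memory M, so M >= sqrt(N*B). */
    printf("two-pass memory needed: ~%.0f MB\n", sqrt(N * B) / 1e6);
    return 0;
}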

Given the current resource management limitations we live with, you are
correct about multi-pass being necessary, but we should be aware that modern
commercial databases allocate enough memory to provide two-pass external
sorts.

- Luke




Re: Merge algorithms for large numbers of "tapes"

From
Simon Riggs
Date:
On Tue, 2006-03-07 at 18:14 -0500, Tom Lane wrote:
> BTW, I was just looking over Knuth's discussion of sorting again, and
> realized that there is still something more that could be done within
> the existing sort framework.  We currently use standard polyphase merge
> (his Algorithm 5.4.2D), which IIRC I chose because it was simple and
> for relatively small numbers of tapes T it was about as good as anything
> else.  Knuth spends a great deal of energy on minimizing tape rewind
> time which of course is of no interest to us, and I had supposed that
> all of his more-complex algorithms were really only of interest if you
> needed to consider rewind time.  However, now that we've changed the
> code to prefer large numbers of tapes, it's not at all clear that
> Algorithm D is still the right one to use.  In particular I'm looking at
> cascade merge, Algorithm 5.4.3C, which appears to use significantly
> fewer passes when T is large.  

Ah! Well spotted. Yeh, looks like it will improve performance a good
deal. So, yes, definitely a TODO item. 

> Do you want to try that?

The Cascade Merge rewrites the way logical tapes are selected and how
the runs are merged. It doesn't seem to do anything at all about the
run-forming, which would still use heapsort. So the only effect is when
we have more runs than "tapes"; the limits at which we would begin
noticing any benefit would be:

    work_mem = 1 GB       benefit at 8 TB
    work_mem = 256 MB     benefit at 0.5 TB
    work_mem = 8 MB       benefit at 256 MB
    work_mem = 1 MB       benefit at 12 MB (min 7 tapes)

(based upon runs on average twice the size of memory, and each logical tape
requiring 256KB memory, i.e. min(work_mem/4, 6) * work_mem * 2, which
for work_mem > 2 MB gives 0.5 * work_mem^2)

Which means we only see the benefit when we have for some reason been
unable to give the sort enough space, or have not set parameters correctly.
So, still a concern... but it makes me think about 2 other issues first:

1. Earlier we had some results that showed that the heapsorts got slower
when work_mem was higher and that concerns me most of all right now.
It's possible you'll have reduced that considerably with the
pull-out-the-first-attr patch. I'll look into some test results to show
that has gone away. We also have Nyberg et al telling us that as of 1994
they established that heapsort would always be slower than qsort, as a
result of CPU cache locality improvements. An improvement here would
affect all sorts > work_mem.

2. Improvement in the way we do overall memory allocation, so we would
not have the problem of undersetting work_mem that we currently
experience. If we solved this problem we would have faster sorts in
*all* cases, not just extremely large ones. Dynamically setting work_mem
higher when possible would be very useful. I've looked at this a few
times and have some suggestions, but perhaps it's worth asking for ideas
in this area?

Best Regards, Simon Riggs




Re: Merge algorithms for large numbers of "tapes"

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> 1. Earlier we had some results that showed that the heapsorts got slower
> when work_mem was higher and that concerns me most of all right now.

Fair enough, but that's completely independent of the merge algorithm.
(I don't think the Nyberg results necessarily apply to our situation
anyway, as we are not sorting arrays of integers, and hence the cache
effects are far weaker for us.  I don't mind trying alternate sort
algorithms, but I'm not going to believe an improvement in advance of
direct evidence in our own environment.)

> 2. Improvement in the way we do overall memory allocation, so we would
> not have the problem of undersetting work_mem that we currently
> experience. If we solved this problem we would have faster sorts in
> *all* cases, not just extremely large ones. Dynamically setting work_mem
> higher when possible would be very useful.

I think this would be extremely dangerous, as it would encourage
processes to take more than their fair share of available resources.
Also, to the extent that you believe the problem is insufficient L2
cache, it seems increasing work_mem to many times the size of L2 will
always be counterproductive.  (Certainly there is no value in increasing
work_mem until we are in a regime where it consistently improves
performance significantly, which it seems we aren't yet.)
        regards, tom lane


Re: Merge algorithms for large numbers of "tapes"

From
"Luke Lonergan"
Date:
Tom,

On 3/8/06 7:21 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Simon Riggs <simon@2ndquadrant.com> writes:
>> 1. Earlier we had some results that showed that the heapsorts got slower
>> when work_mem was higher and that concerns me most of all right now.
> 
> Fair enough, but that's completely independent of the merge algorithm.
> (I don't think the Nyberg results necessarily apply to our situation
> anyway, as we are not sorting arrays of integers, and hence the cache
> effects are far weaker for us.  I don't mind trying alternate sort

Even with the indirection, we should investigate alternative approaches that
others have demonstrated to be superior WRT L2 cache use.

A major commercial database currently performs external sorts of various
fields 4 times faster than Postgres, and commonly uses more than 256MB of
sort memory in one example case to do it.

> I think this would be extremely dangerous, as it would encourage
> processes to take more than their fair share of available resources.

I agree - in fact, we currently have no structured concept of "fair share of
available resources", nor a way to share them.

I think the answer to this should involve the use of statement queuing and
resource queues.

> Also, to the extent that you believe the problem is insufficient L2
> cache, it seems increasing work_mem to many times the size of L2 will
> always be counterproductive.  (Certainly there is no value in increasing
> work_mem until we are in a regime where it consistently improves
> performance significantly, which it seems we aren't yet.)

Not if you use cache blocking, the optimization that operates on one
L2-cache-sized block of memory at a time.
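
To make the cache-blocking idea concrete, here is a minimal sketch for an
array of ints, with 256 KB standing in for the L2 size (an assumption for
illustration, not anyone's production code): qsort each cache-sized block so
the comparison-heavy phase stays cache-resident, then merge the sorted blocks.

#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

void cache_blocked_sort(int *a, size_t n)
{
    size_t block = 256 * 1024 / sizeof(int);   /* ~one L2 cache worth */
    int *tmp = malloc(n * sizeof(int));

    /* Phase 1: sort each cache-sized block independently. */
    for (size_t i = 0; i < n; i += block) {
        size_t len = (n - i < block) ? n - i : block;
        qsort(a + i, len, sizeof(int), cmp_int);
    }

    /* Phase 2: bottom-up merge of the already-sorted blocks. */
    for (size_t width = block; width < n; width *= 2) {
        for (size_t i = 0; i < n; i += 2 * width) {
            size_t mid = (i + width < n) ? i + width : n;
            size_t end = (i + 2 * width < n) ? i + 2 * width : n;
            size_t l = i, r = mid, o = i;
            while (l < mid && r < end)
                tmp[o++] = (a[l] <= a[r]) ? a[l++] : a[r++];
            while (l < mid) tmp[o++] = a[l++];
            while (r < end) tmp[o++] = a[r++];
        }
        memcpy(a, tmp, n * sizeof(int));
    }
    free(tmp);
}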

- Luke 




Re: Merge algorithms for large numbers of "tapes"

From
"Jim C. Nasby"
Date:
On Wed, Mar 08, 2006 at 07:28:16AM -0800, Luke Lonergan wrote:
> > I think this would be extremely dangerous, as it would encourage
> > processes to take more than their fair share of available resources.
> 
> I agree - in fact, we currently have no structured concept of "fair share of
> available resources", nor a way to share them.

A concept it would be great to add at some point, both for memory and
IO. But that's another discussion entirely.

> I think the answer to this should involve the use of statement queuing and
> resource queues.

Something else to consider is reducing the amount of memory used when we
have to fail to a tape sort, because at that point we'll be
substantially slower. So, for example, allow in-memory sorts to use up
to 1GB, because it shouldn't take a long period of time to read that
data in, and the sort will then be extremely fast. That means that the
sort would be using that amount of memory for a short period of time. If
we do have to fail to disk, cut back to 128MB, because having 8x that
certainly won't make the sort run anywhere close to 8x faster. The trick
would be releasing memory from a sort that we thought would fit in memory
but didn't. It would also be good to start estimating which sorts should
fit in memory and which won't before we start (AFAIK the current code
assumes we'll fit in memory until it runs out).
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Merge algorithms for large numbers of "tapes"

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> If we do have to fail to disk, cut back to 128MB, because having 8x that
> certainly won't make the sort run anywhere close to 8x faster.

Not sure that follows.  In particular, the entire point of the recent
changes has been to extend the range in which we can use a single merge
pass --- that is, write the data once as N sorted runs, then merge them
in a single read pass.  As soon as you have to do an actual merge-back-
to-disk pass, your total I/O volume doubles, so there is definitely a
considerable gain if that can be avoided.  And a larger work_mem
translates directly to fewer/longer sorted runs.
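
A back-of-envelope illustration of that doubling, counting only temp-file
I/O and assuming the final merge streams its output; the 100 GB figure is
arbitrary:

#include <stdio.h>

int main(void)
{
    double N = 100e9;   /* 100 GB of sort data */

    /* Run formation writes N; an on-the-fly final merge reads N back.
     * Every intermediate merge-back-to-disk pass adds another read+write. */
    for (int extra = 0; extra <= 2; extra++)
        printf("%d intermediate merge pass(es): %.0f GB of temp I/O\n",
               extra, (2 * N + 2 * N * extra) / 1e9);
    return 0;
}

With zero intermediate passes that is 200 GB of temp I/O, with one it is
400 GB, and so on.
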
        regards, tom lane


Re: Merge algorithms for large numbers of "tapes"

From
"Jim C. Nasby"
Date:
On Wed, Mar 08, 2006 at 11:20:50AM -0500, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > If we do have to fail to disk, cut back to 128MB, because having 8x that
> > certainly won't make the sort run anywhere close to 8x faster.
> 
> Not sure that follows.  In particular, the entire point of the recent
> changes has been to extend the range in which we can use a single merge
> pass --- that is, write the data once as N sorted runs, then merge them
> in a single read pass.  As soon as you have to do an actual merge-back-
> to-disk pass, your total I/O volume doubles, so there is definitely a
> considerable gain if that can be avoided.  And a larger work_mem
> translates directly to fewer/longer sorted runs.

But do fewer/longer sorted runs translate into not merging back to disk?
I thought that was controlled by if we had to be able to rewind the
result set.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Merge algorithms for large numbers of "tapes"

From
Simon Riggs
Date:
On Wed, 2006-03-08 at 10:21 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > 1. Earlier we had some results that showed that the heapsorts got slower
> > when work_mem was higher and that concerns me most of all right now.
> 
> Fair enough, but that's completely independent of the merge algorithm.
> (I don't think the Nyberg results necessarily apply to our situation
> anyway, as we are not sorting arrays of integers, and hence the cache
> effects are far weaker for us.  I don't mind trying alternate sort
> algorithms, but I'm not going to believe an improvement in advance of
> direct evidence in our own environment.)

Of course, this would be prototyped first...and I agree about possible
variability of those results for us.

> > 2. Improvement in the way we do overall memory allocation, so we would
> > not have the problem of undersetting work_mem that we currently
> > experience. If we solved this problem we would have faster sorts in
> > *all* cases, not just extremely large ones. Dynamically setting work_mem
> > higher when possible would be very useful.
> 
> I think this would be extremely dangerous, as it would encourage
> processes to take more than their fair share of available resources.

Fair share is the objective. I was trying to describe the general case
so we could discuss a solution that would allow a dynamic approach
rather than the static one we have now.

Want to handle these cases: "How much to allocate, when..."
A. we have predicted number of users 
B. we have a busy system - more than predicted number of users
C. we have a quiet system - less than predicted number of users

In B/C we have to be careful that we don't under/overallocate resources
only to find the situation changes immediately afterwards.

In many cases the static allocation is actually essential since you may
be more interested in guaranteeing a conservative run time rather than
seeking to produce occasional/unpredictable bursts of speed. But in many
cases people want to have certain tasks go faster when its quiet and go
slower when its not.

> Also, to the extent that you believe the problem is insufficient L2
> cache, it seems increasing work_mem to many times the size of L2 will
> always be counterproductive.  

Sorry to confuse: (1) and (2) were completely separate, so no intended
interaction between L2 cache and memory.

> (Certainly there is no value in increasing
> work_mem until we are in a regime where it consistently improves
> performance significantly, which it seems we aren't yet.)

Very much agreed.

Best Regards, Simon Riggs



Re: Merge algorithms for large numbers of "tapes"

From
"Luke Lonergan"
Date:
Jim,

On 3/8/06 9:49 AM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:

> On Wed, Mar 08, 2006 at 11:20:50AM -0500, Tom Lane wrote:

>> Not sure that follows.  In particular, the entire point of the recent
>> changes has been to extend the range in which we can use a single merge
>> pass --- that is, write the data once as N sorted runs, then merge them
>> in a single read pass.  As soon as you have to do an actual merge-back-
>> to-disk pass, your total I/O volume doubles, so there is definitely a
>> considerable gain if that can be avoided.  And a larger work_mem
>> translates directly to fewer/longer sorted runs.
> 
> But do fewer/longer sorted runs translate into not merging back to disk?
> I thought that was controlled by if we had to be able to rewind the
> result set.

In the *tape* algorithm, there is an intermediate abstraction in the merging
called tapes (!) that are used to store intermediate merge results.  Simon's
work implemented more tapes, which asymptotically approaches a single merge
pass as the number of tapes approaches the number of runs.

The Replacement Selection algorithm generally will produce about 1/2 the
number of runs that a simpler partial sort algorithm would, and the more
memory it uses, the fewer runs there are, and with fewer runs, fewer tapes
are required to avoid more passes on the merge.
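
For reference, here is a sketch of replacement selection itself, the
run-forming step that yields the roughly 2x-memory runs on random input.
Ints read with fscanf stand in for tuples, and new_run() is a caller-supplied
helper invented for the example; this is an illustration of the algorithm,
not tuplesort.c:

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

typedef struct {
    int val;
    int run;                 /* run this element is destined for */
} Item;

static bool item_lt(Item a, Item b)
{
    /* Order by run first, then value, so "next run" items sink below. */
    return a.run < b.run || (a.run == b.run && a.val < b.val);
}

static void sift_down(Item *h, int n, int i)
{
    for (;;) {
        int s = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < n && item_lt(h[l], h[s])) s = l;
        if (r < n && item_lt(h[r], h[s])) s = r;
        if (s == i) return;
        Item t = h[i]; h[i] = h[s]; h[s] = t;
        i = s;
    }
}

/* Read ints from 'in'; write each run to the FILE* that new_run() returns. */
void replacement_selection(FILE *in, FILE *(*new_run)(int runno), int M)
{
    Item *heap = malloc(M * sizeof(Item));
    FILE *out = NULL;
    int n = 0, currun = 0, v;

    while (n < M && fscanf(in, "%d", &v) == 1)     /* fill the heap */
        heap[n++] = (Item){ v, 0 };
    for (int i = n / 2 - 1; i >= 0; i--)
        sift_down(heap, n, i);

    while (n > 0) {
        Item top = heap[0];
        if (out == NULL || top.run != currun) {    /* start a new run */
            currun = top.run;
            out = new_run(currun);
        }
        fprintf(out, "%d\n", top.val);

        if (fscanf(in, "%d", &v) == 1)
            /* Keep v in the current run if it can still be emitted in
             * order; otherwise defer it to the next run. */
            heap[0] = (Item){ v, (v >= top.val) ? currun : currun + 1 };
        else
            heap[0] = heap[--n];                   /* input exhausted */
        sift_down(heap, n, 0);
    }
    free(heap);
}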

This whole tape abstraction is something that I believe is unique to
Postgres among modern databases, and we have found that by removing it
entirely along with logtape.c, we remove 2000 lines of useless code that
only complicates our optimization problem.

- Luke 




Re: Merge algorithms for large numbers of "tapes"

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> But do fewer/longer sorted runs translate into not merging back to disk?
> I thought that was controlled by if we had to be able to rewind the
> result set.

A plain SELECT ... ORDER BY doesn't assume that anymore.  It is still
required for some cases such as the input to a merge join, but the
on-the-fly-final-merge code is going to be used a lot more in 8.2 than
it was before.
        regards, tom lane


Re: Merge algorithms for large numbers of "tapes"

From
"Dann Corbit"
Date:
I do not clearly understand the sorting code in PostgreSQL.  If I did
have a good grasp of it, I would take a go at improving it.

Here are some suggestions of things that I know work really, really
well:

#1.  Two pass merge (none of that silly poly-tape merge goo)

#2.  Load ONLY the keys that are to be sorted into memory.  Use a
pointer exchange sort, and do not move the physical rows of data at all.

I am pretty sure from this thread that PostgreSQL is not doing #1, and I
have no idea if it is doing #2.

A useful trick:
Since merge is mentioned, I should say something else about merge joins.
If you do not have room to load the sorted keys for bsearch, load every
kth key (where k is computed by sizeof merge_ram / sizeof key_data).
Then, once the "kth key bsearch" has found the block containing the item
you are looking for, bsearch that block.

Now, maybe PostgreSQL already uses tricks better than these.  I don't
know.  But if they prove helpful suggestions I will be glad of it.

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Tom Lane
> Sent: Wednesday, March 08, 2006 12:32 PM
> To: Jim C. Nasby
> Cc: Luke Lonergan; Simon Riggs; pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] Merge algorithms for large numbers of "tapes"
>
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > But do fewer/longer sorted runs translate into not merging back to disk?
> > I thought that was controlled by if we had to be able to rewind the
> > result set.
>
> A plain SELECT ... ORDER BY doesn't assume that anymore.  It is still
> required for some cases such as the input to a merge join, but the
> on-the-fly-final-merge code is going to be used a lot more in 8.2 than
> it was before.
>
>             regards, tom lane


Re: Merge algorithms for large numbers of "tapes"

From
"Luke Lonergan"
Date:
Dann,

On 3/8/06 12:39 PM, "Dann Corbit" <DCorbit@connx.com> wrote:

> Here are some suggestions of things that I know work really, really
> well:

Can you point to an example?  That might help move the discussion along.

The reason to interject about the tape goo in this discussion is that we
seem to be spending a lot of time optimizing around the tape goo without
tackling the overall structure of the external sort.  I think we'll just end
up having to replace all of this goo when we really get around to fixing the
problem.

Add to this that other commercial databases external sort in 1/4 the time or
better on the same hardware with the same CPU/memory resources using a
2-pass external sort.

> #1.  Two pass merge (none of that silly poly-tape merge goo)

Voice of reason here.  It's what the other database systems do.

> #2.  Load ONLY the keys that are to be sorted into memory.  Use a
> pointer exchange sort, and do not move the physical rows of data at all.

Sounds right.  Example of this in practice?

> I am pretty sure from this thread that PostgreSQL is not doing #1, and I
> have no idea if it is doing #2.

Yep.  Even Knuth says that the tape goo is only interesting from a
historical perspective and may not be relevant in an era of disk drives.

- Luke




Re: Merge algorithms for large numbers of "tapes"

From
"Dann Corbit"
Date:
> -----Original Message-----
> From: Luke Lonergan [mailto:llonergan@greenplum.com]
> Sent: Wednesday, March 08, 2006 1:52 PM
> To: Dann Corbit; Tom Lane; Jim C. Nasby
> Cc: Simon Riggs; pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] Merge algorithms for large numbers of "tapes"
>
> Dann,
>
> On 3/8/06 12:39 PM, "Dann Corbit" <DCorbit@connx.com> wrote:
>
> > Here are some suggestions of things that I know work really, really
> > well:
>
> Can you point to an example?  That might help move the discussion along.

I wrote all of the sorting and merging stuff for CONNX Solutions
http://www.connx.com

I have carefully benched all of this stuff and (at least for our system)
the ideas I propose work well.  Of course, every system is different and
the only way to know if it is an improvement is to try it in place.
> The reason to interject about the tape goo in this discussion is that we
> seem to be spending a lot of time optimizing around the tape goo without
> tackling the overall structure of the external sort.  I think we'll just end
> up having to replace all of this goo when we really get around to fixing the
> problem.

I suggest trying several alternatives and benching them with real world
queries and especially with the open database benchmark suite.

> Add to this that other commercial databases external sort in 1/4 the time or
> better on the same hardware with the same CPU/memory resources using a
> 2-pass external sort.

Our sort merge is so fast that I can join two tables on a column with no
index faster than on a database that has a unique clustered index on the
column.  Benchmarked against Oracle, SQL*Server, and several others.

If you check our ORDER BY on a large table with no index, you will see
that it is competitive with the best commercial systems.

If you are interested, you could get an eval of CONNX and try it
yourself (eval is free for some number of days, I don't remember what).
> > #1.  Two pass merge (none of that silly poly-tape merge goo)
>
> Voice of reason here.  It's what the other database systems do.
>
> > #2.  Load ONLY the keys that are to be sorted into memory.  Use a
> > pointer exchange sort, and do not move the physical rows of data at all.
>
> Sounds right.  Example of this in practice?

It is what we use here.  It is the only way to fly.  This is well known,
and if you read a few articles from the ACM, you will see that it has
been known for decades.
> > I am pretty sure from this thread that PostgreSQL is not doing #1, and I
> > have no idea if it is doing #2.
>
> > Yep.  Even Knuth says that the tape goo is only interesting from a
> > historical perspective and may not be relevant in an era of disk drives.
>
> - Luke
>



Re: Merge algorithms for large numbers of "tapes"

From
"Dann Corbit"
Date:
There are some articles here that are worth reading if you want to sort
fast:

http://research.microsoft.com/barc/SortBenchmark/



Re: Merge algorithms for large numbers of "tapes"

From
Tom Lane
Date:
"Dann Corbit" <DCorbit@connx.com> writes:
> Here are some suggestions of things that I know work really, really
> well:
> #1.  Two pass merge (none of that silly poly-tape merge goo)

This amounts to an assumption that you have infinite work_mem, in which
case you hardly need an external sort at all.  If your work_mem is in
fact finite, then at some point you need more than two passes.  I'm not
really interested in ripping out support for sort operations that are
much larger than work_mem.

> #2.  Load ONLY the keys that are to be sorted into memory.  Use a
> pointer exchange sort, and do not move the physical rows of data at all.

This suggestion isn't a whole lot better; in general the rows to be
sorted don't exist until we compute them, and so proposing that we
"don't load them until later" is pretty much irrelevant.  Also, in
a lot of common cases the keys to be sorted are the bulk of the data
anyway.
        regards, tom lane


Re: Merge algorithms for large numbers of "tapes"

From
"Dann Corbit"
Date:

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Wednesday, March 08, 2006 3:17 PM
> To: Dann Corbit
> Cc: Jim C. Nasby; Luke Lonergan; Simon Riggs; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Merge algorithms for large numbers of "tapes"
>
> "Dann Corbit" <DCorbit@connx.com> writes:
> > Here are some suggestions of things that I know work really, really
> > well:
> > #1.  Two pass merge (none of that silly poly-tape merge goo)
>
> This amounts to an assumption that you have infinite work_mem, in which
> case you hardly need an external sort at all.  If your work_mem is in
> fact finite, then at some point you need more than two passes.  I'm not
> really interested in ripping out support for sort operations that are
> much larger than work_mem.

No it does not.  I have explained this before.  You can have one million
files and merge them all into a final output with a single pass.  It
does not matter how big they are or how much memory you have.

The idea is very simple.  Each subfile has its top record inserted into
a priority queue of file handles (or whatever else you want to use --
temp tables, you name it). When you extract the smallest record from the
queue, the priority changes and that file handle gets moved to a new
place in the queue.  You keep pulling records from the queue until the
entire queue is empty.

The outline is like this:
1. Sort chunks
2. Write chunks
3. Insert top record of chunks into priority queue
4. Extract records from queue, writing them to final output
5. Repeat step 4 until queue is empty.
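
As a concrete illustration of steps 1 and 2, here is a sketch of the
run-forming half in C; ints and tmpfile() are simplifications, and the run
files it produces are what a queue-based merge (steps 3-5) would then consume:

#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Returns the number of runs written; run files are stored in runs[]. */
int form_runs(FILE *in, FILE **runs, int max_runs, size_t chunk_elems)
{
    int *buf = malloc(chunk_elems * sizeof(int));
    int nruns = 0;
    size_t n;

    while (nruns < max_runs &&
           (n = fread(buf, sizeof(int), chunk_elems, in)) > 0) {
        qsort(buf, n, sizeof(int), cmp_int);    /* step 1: sort the chunk  */
        FILE *run = tmpfile();                  /* step 2: write the chunk */
        fwrite(buf, sizeof(int), n, run);
        rewind(run);                            /* ready for the merge pass */
        runs[nruns++] = run;
    }
    free(buf);
    return nruns;
}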


> > #2.  Load ONLY the keys that are to be sorted into memory.  Use a
> > pointer exchange sort, and do not move the physical rows of data at all.
>
> This suggestion isn't a whole lot better; in general the rows to be
> sorted don't exist until we compute them, and so proposing that we
> "don't load them until later" is pretty much irrelevant.  Also, in
> a lot of common cases the keys to be sorted are the bulk of the data
> anyway.

This suggestion is in addition to suggestion 1.  They are not even
related except that both suggestions make the sort run a lot faster.

I think I did not explain it clearly enough.  Suppose that you have a
set of rows you need to sort.  Instead of loading the whole row into
memory, just load the columns (or parts of columns) that are being
sorted.  I hope that it is more clear now.
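
Put as code, the suggestion amounts to sorting an array of (key, row
reference) pairs rather than the rows themselves; here is a minimal sketch
with an int key and a file offset standing in for the row reference (all
names invented for the example):

#include <stdlib.h>

typedef struct {
    int  key;          /* the ORDER BY column value        */
    long row_offset;   /* where the full row lives on disk */
} KeyRef;

static int keyref_cmp(const void *a, const void *b)
{
    const KeyRef *x = a, *y = b;
    return (x->key > y->key) - (x->key < y->key);
}

/* Sort the references; the wide rows themselves never move. */
void sort_keys_only(KeyRef *refs, size_t n)
{
    qsort(refs, n, sizeof(KeyRef), keyref_cmp);
    /* The caller then walks refs[] in order, fetching each row by
     * row_offset only when it is actually needed. */
}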

>
>             regards, tom lane


Re: Merge algorithms for large numbers of "tapes"

From
Greg Stark
Date:
"Luke Lonergan" <llonergan@greenplum.com> writes:

> > I am pretty sure from this thread that PostgreSQL is not doing #1, and I
> > have no idea if it is doing #2.
> 
> Yep.  Even Knuth says that the tape goo is only interesting from a
> historical perspective and may not be relevant in an era of disk drives.

As the size of the data grows larger the behaviour of hard drives looks more
and more like tapes. The biggest factor controlling the speed of i/o
operations is how many seeks are required to complete them. Effectively
"rewinds" are still the problem it's just that the cost of rewinds becomes
constant regardless of how long the "tape" is.

That's one thing that gives me pause about the current approach of using more
tapes. It seems like ideally the user would create a temporary work space on
each spindle and the database would arrange to use no more than that number of
tapes. Then each merge operation would involve only sequential access for both
reads and writes.

-- 
greg



Re: Merge algorithms for large numbers of "tapes"

From
"Dann Corbit"
Date:
> -----Original Message-----
> From: gsstark@mit.edu [mailto:gsstark@mit.edu]
> Sent: Wednesday, March 08, 2006 3:56 PM
> To: Luke Lonergan
> Cc: Dann Corbit; Tom Lane; Jim C. Nasby; Simon Riggs; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Merge algorithms for large numbers of "tapes"
>
> "Luke Lonergan" <llonergan@greenplum.com> writes:
>
> > > I am pretty sure from this thread that PostgreSQL is not doing #1, and I
> > > have no idea if it is doing #2.
> >
> > Yep.  Even Knuth says that the tape goo is only interesting from a
> > historical perspective and may not be relevant in an era of disk drives.
>
> As the size of the data grows larger the behaviour of hard drives looks more
> and more like tapes. The biggest factor controlling the speed of i/o
> operations is how many seeks are required to complete them. Effectively
> "rewinds" are still the problem it's just that the cost of rewinds becomes
> constant regardless of how long the "tape" is.
>
> That's one thing that gives me pause about the current approach of using more
> tapes. It seems like ideally the user would create a temporary work space on
> each spindle and the database would arrange to use no more than that number of
> tapes. Then each merge operation would involve only sequential access for both
> reads and writes.

If the chief concern is the number of subfiles created, replacement
selection doubles the length of the subfiles while consuming no more
memory.
{The big-O of the algorithm sucks, though}

It is certainly worth testing several cases.

It is not a bad idea to enable more than one method of performing an
operation.

In the ideal case, you would have specific information about drives,
spindles, rates for seek, transfer, etc.

It all depends on how much effort you want to throw at it.


Re: Merge algorithms for large numbers of "tapes"

From
"Jim C. Nasby"
Date:
On Wed, Mar 08, 2006 at 10:49:16AM -0800, Luke Lonergan wrote:
> Jim,
> 
> On 3/8/06 9:49 AM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:
> 
> > On Wed, Mar 08, 2006 at 11:20:50AM -0500, Tom Lane wrote:
> 
> >> Not sure that follows.  In particular, the entire point of the recent
> >> changes has been to extend the range in which we can use a single merge
> >> pass --- that is, write the data once as N sorted runs, then merge them
> >> in a single read pass.  As soon as you have to do an actual merge-back-
> >> to-disk pass, your total I/O volume doubles, so there is definitely a
> >> considerable gain if that can be avoided.  And a larger work_mem
> >> translates directly to fewer/longer sorted runs.
> > 
> > But do fewer/longer sorted runs translate into not merging back to disk?
> > I thought that was controlled by if we had to be able to rewind the
> > result set.
> 
> In the *tape* algorithm, there is an intermediate abstraction in the merging
> called tapes (!) that are used to store intermediate merge results.  Simon's
> work implemented more tapes, which asymptotically approaches a single merge
> pass as the number of tapes approaches the number of runs.
> 
> The Replacement Selection algorithm generally will produce about 1/2 the
> number of runs that a simpler partial sort algorithm would, and the more
> memory it uses, the fewer runs there are, and with fewer runs, fewer tapes
> are required to avoid more passes on the merge.
> 
> This whole tape abstraction is something that I believe is unique to
> Postgres among modern databases, and we have found that by removing it
> entirely along with logtape.c, we remove 2000 lines of useless code that
> only complicates our optimization problem.

Oh, geez, I think I get it now. I was thinking that we did something
like sort a chunk, write it to disk, repeat until all data processed and
then just read from the stuff on disk in order, switching between files
as needed. But of course that would suck horribly if we were actually
using tapes. Like others have said, surely there's got to be a much
better way to go about things with more modern hardware. If there is,
then hopefully the possibility exists of returning memory back to "the
pool" if it's not going to be as useful as it would be to a sort that
would fit in-memory.

As an example, in my hypothetical algorithm that sorts one chunk at a
time and then bounces between chunks when reading the data back out, it
would probably be better to have fewer, larger chunks than many more
small ones. But the difference between 256M chunks and 1GB chunks
probably wouldn't be that big a difference, certainly not a 4x
improvement. So it makes sense to go with the smaller chunks if it means
that other sorts would be able to operate entirely in-memory. In an
ideal world, this allocation could even by dynamic, based on what else
was happening on the machine.

But I'll take any incremental improvement I can get right now. :) Just
having the ability to set a more aggressive work_mem without worrying
about causing a swap storm would be a huge improvement over the current
situation. Being able to cut back on memory use when we fall back to
disk would be icing on the cake. :)
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Merge algorithms for large numbers of "tapes"

From
"Jonah H. Harris"
Date:
An interesting read at

http://www.vldb.org/conf/1997/P376.PDF




--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324

Re: Merge algorithms for large numbers of "tapes"

From
"Jim C. Nasby"
Date:
On Wed, Mar 08, 2006 at 03:35:53PM -0800, Dann Corbit wrote:
> I think I did not explain it clearly enough.  Suppose that you have a
> set of rows you need to sort.  Instead of loading the whole row into
> memory, just load the columns (or parts of columns) that are being
> sorted.  I hope that it is more clear now.

The issue is that there is a non-trivial amount of overhead in going
back to disk to get the raw data, and then you have to parse that into a
valid in-memory tuple. A worst-case scenario is if you're sorting all
the data that you've been asked to retrieve, ie:

SELECT a, b, c ... ORDER BY b, a, c;

That case is almost guaranteed to take longer if you try and do it with
just pointers.

But there is the other case:

SELECT a, b, c, big_honking_text_field ... ORDER BY a, b, c;

In this example it's entirely possible that leaving the big_honking
field out of the actual sorting would be a big win. Especially if your
temporary space was on a different set of spindles.

Regarding your suggestion of testing different kinds of sorts, that's
certainly a good idea if it can be done without a huge amount of work
coding each one up. Ultimately, it might make the most sense to support
multiple sort algorithms (at least for now) and let the planner decide
which one to use. That would at least get us a lot more real-world data
than any other method would.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Merge algorithms for large numbers of "tapes"

From
"Dann Corbit"
Date:
> -----Original Message-----
> From: Jim C. Nasby [mailto:jnasby@pervasive.com]
> Sent: Wednesday, March 08, 2006 5:44 PM
> To: Dann Corbit
> Cc: Tom Lane; Luke Lonergan; Simon Riggs; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Merge algorithms for large numbers of "tapes"
>
> On Wed, Mar 08, 2006 at 03:35:53PM -0800, Dann Corbit wrote:
> > I think I did not explain it clearly enough.  Suppose that you have a
> > set of rows you need to sort.  Instead of loading the whole row into
> > memory, just load the columns (or parts of columns) that are being
> > sorted.  I hope that it is more clear now.
>
> The issue is that there is a non-trivial amount of overhead in going
> back to disk to get the raw data, and then you have to parse that into a
> valid in-memory tuple. A worst-case scenario is if you're sorting all
> the data that you've been asked to retrieve, ie:
>
> SELECT a, b, c ... ORDER BY b, a, c;
>
> That case is almost guaranteed to take longer if you try and do it with
> just pointers.
>
> But there is the other case:
>
> SELECT a, b, c, big_honking_text_field ... ORDER BY a, b, c;
>
> In this example it's entirely possible that leaving the big_honking
> field out of the actual sorting would be a big win. Especially if your
> temporary space was on a different set of spindles.
>
> Regarding your suggestion of testing different kinds of sorts, that's
> certainly a good idea if it can be done without a huge amount of work
> coding each one up. Ultimately, it might make the most sense to support
> multiple sort algorithms (at least for now) and let the planner decide
> which one to use. That would at least get us a lot more real-world data
> than any other method would.

I typically do it something like this:

MSD_Radix_Sort_Hunks()
{
    // We might have to bail for many reasons:
    // - early part of the key may be identical for all rows
    // - we may not have a binning algorithm for this data type
    // - we may also only partially sort with MSD Radix sort
    If (Set_Is_Too_Small_Or_Otherwise_Bail())
        Introspective_Sort_Hunks();
    Else
        MSD_Radix_Alg();        // Cookie cutter of data stream into sorted hunks
}

Introspective_Sort_Hunks()
{
    If (Set_Is_Too_Small_Or_Otherwise_Bail())
        Ford_Johnson_Variant();  // Near optimal sort of very small sets
    Else
        Introspective_Alg();     // Cookie cutter of data stream into sorted hunks
}

Queue_based_hunk_merge();

Now, you might have a merge that makes choices on entry similar to the
way that my sorts make choices on entry.

You will notice that my sorts decide internally on what algorithm to
perform.  Certainly, this is a simple approach that can generalize in
many ways.


> --
> Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
> Pervasive Software      http://pervasive.com    work: 512-231-6117
> vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Merge algorithms for large numbers of "tapes"

From
"Jim C. Nasby"
Date:
On Wed, Mar 08, 2006 at 06:55:59PM -0500, Greg Stark wrote:
> 
> "Luke Lonergan" <llonergan@greenplum.com> writes:
> 
> > > I am pretty sure from this thread that PostgreSQL is not doing #1, and I
> > > have no idea if it is doing #2.
> > 
> > Yep.  Even Knuth says that the tape goo is only interesting from a
> > historical perspective and may not be relevant in an era of disk drives.
> 
> As the size of the data grows larger the behaviour of hard drives looks more
> and more like tapes. The biggest factor controlling the speed of i/o
> operations is how many seeks are required to complete them. Effectively
> "rewinds" are still the problem it's just that the cost of rewinds becomes
> constant regardless of how long the "tape" is.

But it will take a whole lot of those rewinds to equal the amount of
time required by an additional pass through the data. I'll venture a
guess that as long as you've got enough memory to still read chunks back
in 8k blocks, it won't be possible for a multi-pass sort to
out-perform a one-pass sort. Especially if you also had the ability to
do pre-fetching (not something to fuss with now, but certainly a
possibility in the future).

In any case, what we really need is at least good models backed by good
drive performance data. And we really should have that anyway so that we
can improve upon our cost estimator functions. I'm betting that what
that will show us is that no single sort method is going to work best
for all cases. IE: I'd bet that if your data set is sufficiently larger
than available memory that you'll actually be better off with a
multi-pass approach over a single/two pass approach.

> That's one thing that gives me pause about the current approach of using more
> tapes. It seems like ideally the user would create a temporary work space on
> each spindle and the database would arrange to use no more than that number of
> tapes. Then each merge operation would involve only sequential access for both
> reads and writes.

For that to be of any use, wouldn't you need to use only as many tapes
as spindles/2? Otherwise you're still trying to read and write from the
same set of drives, which means you're probably doing a lot of seeking.
Or do the tape algorithms re-write data as they read it?
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Merge algorithms for large numbers of "tapes"

From
Andrew Dunstan
Date:
Dann Corbit wrote:

>I do not clearly understand the sorting code in PostgreSQL.  If I did
>have a good grasp of it, I would take a go at improving it.
>
>  
>

"Show me the code" (and the benchmarks).

Seriously. We see regular discussions on this and similar topics, but I 
haven't seen a patch that anyone has proven is an unequivocal 
improvement, that I can recall.

cheers

andrew




Re: Merge algorithms for large numbers of "tapes"

From
Greg Stark
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:

> On Wed, Mar 08, 2006 at 06:55:59PM -0500, Greg Stark wrote:
> > 
> > "Luke Lonergan" <llonergan@greenplum.com> writes:
> > 
> > > > I am pretty sure from this thread that PostgreSQL is not doing #1, and I
> > > > have no idea if it is doing #2.
> > > 
> > > Yep.  Even Knuth says that the tape goo is only interesting from a
> > > historical perspective and may not be relevant in an era of disk drives.
> > 
> > As the size of the data grows larger the behaviour of hard drives looks more
> > and more like tapes. The biggest factor controlling the speed of i/o
> > operations is how many seeks are required to complete them. Effectively
> > "rewinds" are still the problem it's just that the cost of rewinds becomes
> > constant regardless of how long the "tape" is.
> 
> But it will take a whole lot of those rewinds to equal the amount of
> time required by an additional pass through the data. I'll venture a
> guess that as long as you've got enough memory to still read chunks back
> in 8k blocks  that it won't be possible for a multi-pass sort to
> out-perform a one-pass sort. 

Well that's clearly a bit overoptimistic. If we believe the random page cost
of 4 then having more tapes than you have spindles would impose a penalty
equal to having four times as many passes. 

(And that's *with* the 8k block size. And with the kernel performing pre-fetch
already too.)

> For that to be of any use, wouldn't you need to use only as many tapes
> as spindles/2? Otherwise you're still trying to read and write from the
> same set of drives, which means you're probably doing a lot of seeking.
> Or do the tape algorithms re-write data as they read it?

Well, spindles-1. I was thinking as many tapes as you have spindles *in total*,
ie, including the output tape. You only have one output tape for each n-way
merge though.

-- 
greg



Re: Merge algorithms for large numbers of "tapes"

From
"Zeugswetter Andreas DCP SD"
Date:
> > This amounts to an assumption that you have infinite work_mem, in which
> > case you hardly need an external sort at all.  If your work_mem is in
> > fact finite, then at some point you need more than two passes.  I'm not
> > really interested in ripping out support for sort operations that are
> > much larger than work_mem.
>
> No it does not.  I have explained this before.  You can have
> one million files and merge them all into a final output with
> a single pass.  It does not matter how big they are or how
> much memory you have.

Huh? But if you have too many files, your disk access basically becomes
random access (since you have thousands of files per spindle).

From tests on AIX I have pretty much concluded that if you read 256k blocks
at a time, random access does not really hurt that much any more.

So, if you can hold 256k per file in memory, that should be sufficient.

Andreas


Re: Merge algorithms for large numbers of "tapes"

From
Hannu Krosing
Date:
On Wed, 2006-03-08 at 20:08, Jim C. Nasby wrote:

> But it will take a whole lot of those rewinds to equal the amount of
> time required by an additional pass through the data. 

I guess that missing a sector read also implies a "rewind", i.e. if you
don't process the data read from a "tape" fast enough, you will have to
wait a whole disc revolution (~== "seek time" on modern disks) before
you get the next chunk of data.

> I'll venture a
> guess that as long as you've got enough memory to still read chunks back
> in 8k blocks  that it won't be possible for a multi-pass sort to
> out-perform a one-pass sort. Especially if you also had the ability to
> do pre-fetching (not something to fuss with now, but certainly a
> possibility in the future).
>  
> In any case, what we really need is at least good models backed by good
> drive performance data.

And filesystem performance data, as Postgres uses the OS's native
filesystems.

--------------
Hannu



Re: Merge algorithms for large numbers of "tapes"

From
Florian Weimer
Date:
* Greg Stark:

> That's one thing that gives me pause about the current approach of
> using more tapes. It seems like ideally the user would create a
> temporary work space on each spindle and the database would arrange
> to use no more than that number of tapes. Then each merge operation
> would involve only sequential access for both reads and writes.

And you'd need to preallocate the files in some way or other, to avoid
file system fragmentation.


Re: Merge algorithms for large numbers of "tapes"

From
"Jim C. Nasby"
Date:
On Wed, Mar 08, 2006 at 10:20:08PM -0500, Greg Stark wrote:
> > For that to be of any use, wouldn't you need to use only as many tapes
> > as spindles/2? Otherwise you're still trying to read and write from the
> > same set of drives, which means you're probably doing a lot of seeking.
> > Or do the tape algorithms re-write data as they read it?
> 
> Well, spindles-1. I was thinking as many tapes as you have spindles *in total*,
> ie, including the output tape. You only have one output tape for each n-way
> merge though.

Well, the reality remains, though: most folks are unlikely to set up
enough dedicated temp areas for us to do one tape per disk, so it
would be really good to have a sort method that didn't rely on that.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Merge algorithms for large numbers of "tapes"

From
"Luke Lonergan"
Date:
Jim,

On 3/9/06 8:35 AM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:

> Well, the reality remains, though: most folks are unlikely to set up
> enough dedicated temp areas for us to do one tape per disk, so it
> would be really good to have a sort method that didn't rely on that.

Agreed - however, optimizing the run output and merge pass is straightforward
even without knowing the underlying I/O infrastructure.

Consider that a popular commercial database, running on a 6-disk RAID5 with
one filesystem, performs external sorting 4 times faster (1/4 of the time)
than Postgres by using a two-pass sort.  There is no special optimization of
the I/O path involved; it's simply a matter of using a modern external
sorting approach (no tapes).

Tom's point about finite memory is definitely important - it does take
roughly SQRT(sort set) of memory to perform the two-pass sort, but that is a
completely manageable amount of memory.  The problem we have now is that we
don't use a dynamic memory allocation mechanism to provide this amount of
RAM to the task.  That's why the tape algorithm is "safe": you can
guarantee an external sort result even with tiny memory.
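
As a rough illustration of where that square-root figure comes from (a
back-of-the-envelope sketch, not something taken from the cited paper): if
the first pass produces runs of roughly work_mem bytes each, a one-pass merge
needs one I/O-sized buffer per run, so with N bytes of data, run size M and a
per-run buffer of B bytes you need (N/M)*B <= M, i.e. M >= sqrt(N*B).  In C,
with an assumed 256 KB per-run read buffer:

    #include <math.h>
    #include <stdio.h>

    int main(void)                      /* compile with -lm */
    {
        double data = 7e9;              /* 7 GB sort set */
        double block = 256.0 * 1024;    /* assumed per-run read buffer */

        /* one-pass merge needs (data / M) * block <= M,
         * hence M >= sqrt(data * block) */
        printf("min work_mem ~ %.0f MB\n", sqrt(data * block) / 1e6);
        return 0;
    }

For a 7 GB sort set this comes out to roughly 43 MB, which is the sense in
which "sqrt of the sort set" worth of memory makes a single merge pass
possible.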

But I believe the right answer is to implement the modern sorting algorithm
and the memory allocation to support it.  Sorting is too important to most
operations to be so far behind - 400% slower is not acceptable, and I don't
think tweaking the current approach will get us there.

- Luke     




Re: Merge algorithms for large numbers of "tapes"

From
Tom Lane
Date:
"Luke Lonergan" <llonergan@greenplum.com> writes:
> Consider that a popular commercial database, running on a 6-disk RAID5 with
> one filesystem, performs external sorting 4 times faster (1/4 of the time)
> than Postgres using a two pass sort.  There is no special optimization of
> the I/O path involved, it's simply a matter of using a modern external
> sorting approach (no tapes).

I think this argumentation hinges on some irrational aversion to the
word "tape".  Given adequate work_mem, the CVS-tip behavior is exactly
what you propose already (at least for the cases where we don't need
random access to the sort result).  AFAICS the only result of removing
the support for multipass merge is that the code would fail, rather than
run slowly, if it didn't have adequate work_mem for a particular
problem.  Somehow I don't see that as an improvement.
        regards, tom lane


Re: Merge algorithms for large numbers of "tapes"

From
"Luke Lonergan"
Date:
Tom,

On 3/9/06 9:44 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> I think this argumentation hinges on some irrational aversion to the
> word "tape".  Given adequate work_mem, the CVS-tip behavior is exactly
> what you propose already (at least for the cases where we don't need
> random access to the sort result).

Nope.  There's the matter of this thing called logtape.c, in addition to the
use of the "tape" as a means of grouping runs.  In the current
implementation, runs are not tapes, and the tapes used in the implementation
are an abstraction that does little more than obscure the underlying
processing.

My objection to tapes is a rational one, and we have internally demonstrated
it: by eliminating logtape.c and large hunks of tape-algorithm-related
code, we get slightly faster performance with 2,000 fewer lines of code;
ergo, the code is not useful.  We did this in two days of work, and in the
process uncovered the fact that access was always set to RANDOM, the import
of which we've seen discussed here.

> AFAICS the only result of removing
> the support for multipass merge is that the code would fail, rather than
> run slowly, if it didn't have adequate work_mem for a particular
> problem.  Somehow I don't see that as an improvement.

I would only suggest that we replace the existing algorithm with one that
will work regardless of (reasonable) memory requirements.  Perhaps we can
agree that at least 1MB of RAM for external sorting will always be available
and proceed from there?

- Luke 




Re: Merge algorithms for large numbers of "tapes"

From
Tom Lane
Date:
"Luke Lonergan" <llonergan@greenplum.com> writes:
> I would only suggest that we replace the existing algorithm with one that
> will work regardless of (reasonable) memory requirements.  Perhaps we can
> agree that at least 1MB of RAM for external sorting will always be available
> and proceed from there?

If you can sort indefinitely large amounts of data with 1MB work_mem,
go for it.
        regards, tom lane


Re: Merge algorithms for large numbers of "tapes"

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> "Luke Lonergan" <llonergan@greenplum.com> writes:
> > I would only suggest that we replace the existing algorithm with one that
> > will work regardless of (reasonable) memory requirements.  Perhaps we can
> > agree that at least 1MB of RAM for external sorting will always be available
> > and proceed from there?
>
> If you can sort indefinitely large amounts of data with 1MB work_mem,
> go for it.

It seems you two are talking past each other and I'm at least slightly
confused.  So, I'd like to ask for a bit of clarification and perhaps
that will help everyone.

#1: I'm as much a fan of eliminating unnecessary code as anyone
#2: There have been claims of two-pass improving things 400%
#3: Supposedly two-pass requires on the order of sqrt(total) memory
#4: We have planner statistics to estimate size of total
#5: We have a work_mem limitation for a reason

So, if we get a huge performance increase, what's wrong with:
if [ sqrt(est(total)) <= work_mem ]; then
  two-pass-sort();
else
  tape-sort();
fi

?

If the performance isn't much different and tape-sort can do it with
less memory then I don't really see any point in removing it.

If the intent is to remove it and then ask for the default work_mem to
be increased, I doubt going about it this way would work very well. :)

Thanks,
    Stephen

Re: Merge algorithms for large numbers of "tapes"

From
"Dann Corbit"
Date:
> -----Original Message-----
> From: Stephen Frost [mailto:sfrost@snowman.net]
> Sent: Thursday, March 09, 2006 3:49 PM
> To: Tom Lane
> Cc: Luke Lonergan; Jim C. Nasby; Greg Stark; Dann Corbit; Simon Riggs;
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Merge algorithms for large numbers of "tapes"
>
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > "Luke Lonergan" <llonergan@greenplum.com> writes:
> > > I would only suggest that we replace the existing algorithm with one
> > > that will work regardless of (reasonable) memory requirements.  Perhaps
> > > we can agree that at least 1MB of RAM for external sorting will always
> > > be available and proceed from there?
> >
> > If you can sort indefinitely large amounts of data with 1MB work_mem,
> > go for it.
>
> It seems you two are talking past each other and I'm at least slightly
> confused.  So, I'd like to ask for a bit of clarification and perhaps
> that will help everyone.
>
> #1: I'm as much a fan of eliminating unnecessary code as anyone
> #2: There have been claims of two-pass improving things 400%
> #3: Supposedly two-pass requires on the order of sqrt(total) memory

Two pass does not require sqrt(total) memory.  This figure is clearly
wrong.

Two pass will create the count of subfiles proportional to:
Subfile_count = original_stream_size/sort_memory_buffer_size

The merge pass requires (sizeof record * subfile_count) memory.

Example:
You have a 7 gigabyte table to sort and you have 100 MB sort buffer.
The number of subfiles will be:
7000000000 / 100000000 = 70 files

Suppose that a record is 2K wide.

The merge pass requires 70*2k = 143,360 bytes of RAM.

Suppose that a record is 65535 bytes wide.

The merge pass requires 70*65535 = 4,587,450 bytes of RAM.
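
To make the shape of that merge concrete, here is a minimal, self-contained
sketch in plain C (illustrative only, not PostgreSQL code; the fixed record
width, the memcmp key comparison, and the command-line file list are
assumptions made for the example).  It keeps exactly one record per subfile
in memory, which is where the record_size * subfile_count figure comes from:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define RECORD_SIZE 2048    /* assumed fixed-width records */
    #define KEY_SIZE      16    /* assumed key prefix, compared with memcmp */

    typedef struct
    {
        FILE *fp;
        unsigned char rec[RECORD_SIZE];   /* the one buffered record */
        int exhausted;
    } Source;

    /* refill the buffered record for source s; mark it exhausted at EOF */
    static void fetch(Source *s)
    {
        if (fread(s->rec, RECORD_SIZE, 1, s->fp) != 1)
            s->exhausted = 1;
    }

    /* usage: merge subfile1 [subfile2 ...] outfile  (error handling elided) */
    int main(int argc, char **argv)
    {
        int nfiles = argc - 2;
        Source *src;
        FILE *out;
        int i;

        if (argc < 3)
            return 1;
        src = calloc(nfiles, sizeof(Source));
        out = fopen(argv[argc - 1], "wb");

        for (i = 0; i < nfiles; i++)
        {
            src[i].fp = fopen(argv[i + 1], "rb");
            src[i].exhausted = (src[i].fp == NULL);
            if (!src[i].exhausted)
                fetch(&src[i]);
        }

        for (;;)
        {
            int best = -1;

            /* linear scan stands in for the priority queue; a real merge
             * would use a binary heap for O(log nfiles) per output record */
            for (i = 0; i < nfiles; i++)
                if (!src[i].exhausted &&
                    (best < 0 ||
                     memcmp(src[i].rec, src[best].rec, KEY_SIZE) < 0))
                    best = i;

            if (best < 0)
                break;          /* all subfiles drained */

            fwrite(src[best].rec, RECORD_SIZE, 1, out);
            fetch(&src[best]);
        }

        fclose(out);
        return 0;
    }

Note that this says nothing about I/O efficiency: each fread pulls one record
at a time, so in practice you would put a sizable read buffer in front of
every subfile, which is where the per-subfile block-size discussion comes in.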

> #4: We have planner statistics to estimate size of total
> #5: We have a work_mem limitation for a reason
>
> So, if we get a huge performance increase, what's wrong with:
> if [ sqrt(est(total)) <= work_mem ]; then
>   two-pass-sort();
> else
>   tape-sort();
> fi
>
> ?
>
> If the performance isn't much different and tape-sort can do it with
> less memory then I don't really see any point in removing it.
>
> If the intent is to remove it and then ask for the default work_mem to
> be increased, I doubt going about it this way would work very well. :)
>
>     Thanks,
>
>         Stephen


Re: Merge algorithms for large numbers of "tapes"

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> So, if we get a huge performance increase, what's wrong with:
> if [ sqrt(est(total)) <= work_mem ]; then
>   two-pass-sort();
> else
>   tape-sort();
> fi
> ?

Possibly nothing.  However, from an algorithmic point of view the
CVS-tip code *is* two-pass-sort, given adequate work_mem and no
requirement for random access.  Further, the available profile data
doesn't show any indication that the logtape.c code is eating 3/4ths
of the time (at least not after we fixed the ltsReleaseBlock problem).
So I basically do not believe Luke's assertion that removing logtape.c
is going to produce a 4X speedup.  Maybe it's time to produce some code
that we can all test.
        regards, tom lane


Re: Merge algorithms for large numbers of "tapes"

From
"Luke Lonergan"
Date:
Stephen,

On 3/9/06 3:48 PM, "Stephen Frost" <sfrost@snowman.net> wrote:

> So, if we get a huge performance increase, what's wrong with:
> if [ sqrt(est(total)) <= work_mem ]; then
>   two-pass-sort();
> else
>   tape-sort();
> fi

I have something similar but less complex in mind.

One of the observed behaviors with the current approach is that increasing
work_mem actually slows external sorting down.  This is because the heapsort
embedded in the replacement selection algorithm in the tape sort is not L2
cache friendly.

The easiest, simplest algorithm to employ here would be to quicksort in
chunks of work_mem to produce the runs, output them in a simple manner to
heap files, then merge them in one pass, materializing if necessary for
random access.
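
A bare-bones sketch of the run-formation half of that idea (illustrative
only; fixed-width records, memcmp keys, and the runN file naming are
assumptions for the example, not a proposal for tuplesort.c):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define RECORD_SIZE 2048    /* assumed fixed-width records */
    #define KEY_SIZE      16    /* assumed key prefix, compared with memcmp */

    static int cmp_rec(const void *a, const void *b)
    {
        return memcmp(a, b, KEY_SIZE);
    }

    /* Read fixed-width records from 'in', quicksort them a work_mem-sized
     * chunk at a time, and write each chunk out as one sorted run file.
     * Returns the number of runs produced; they then get merged in one pass. */
    static int build_runs(FILE *in, size_t work_mem)
    {
        size_t max_recs = work_mem / RECORD_SIZE;
        unsigned char *buf = malloc(max_recs * RECORD_SIZE);
        int nruns = 0;
        size_t n;

        while ((n = fread(buf, RECORD_SIZE, max_recs, in)) > 0)
        {
            char name[32];
            FILE *run;

            qsort(buf, n, RECORD_SIZE, cmp_rec);   /* sort the chunk in memory */

            snprintf(name, sizeof(name), "run%d", nruns++);
            run = fopen(name, "wb");
            fwrite(buf, RECORD_SIZE, n, run);      /* one sequential write per run */
            fclose(run);
        }

        free(buf);
        return nruns;
    }

    int main(void)
    {
        /* pretend work_mem is 100 MB and the input arrives on stdin */
        printf("%d runs written\n", build_runs(stdin, 100 * 1024 * 1024));
        return 0;
    }

The point is that nothing special happens when the input exceeds work_mem -
the loop just keeps emitting runs - and the in-memory sort is a plain
quicksort rather than the heap used by replacement selection.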

Granted, there are seek optimizations necessary to make the merge pass
efficient, but these are obviously tractable in a simple manner, as evidenced
by others (Nyquist) and our own internal experiments.

The simplicity here is that the current approach switches from a
quicksort to the polyphase tape sort when work_mem is exceeded, which
involves a fairly complex chunk of code right now.  In this new approach,
when the sort set exceeds work_mem, we just write the run out and continue.

> If the intent is to remove it and then ask for the default work_mem to
> be increased- I doubt going about it this way would work very well. :)

Yep - the main question to address is whether work_mem is always sufficient
to buffer the merge results in one pass, or whether degenerating to a
multi-pass can be done gracefully if not.

Tim Kordas here plans to work on this sometime next week using code he's
already written, and I'd expect a pretty quick set of improvements through
this simplified approach.

- Luke




Re: Merge algorithms for large numbers of "tapes"

From
"Luke Lonergan"
Date:
Tom,

On 3/9/06 3:59 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Possibly nothing.  However, from an algorithmic point of view the
> CVS-tip code *is* two-pass-sort, given adequate work_mem and no
> requirement for random access.  Further, the available profile data
> doesn't show any indication that the logtape.c code is eating 3/4ths
> of the time (at least not after we fixed the ltsReleaseBlock problem).
> So I basically do not believe Luke's assertion that removing logtape.c
> is going to produce a 4X speedup.  Maybe it's time to produce some code
> that we can all test.

Let's be fair - I've never asserted that logtape.c is solely responsible for
the performance.

- Luke  




Re: Merge algorithms for large numbers of "tapes"

From
"Luke Lonergan"
Date:
Dann,

On 3/9/06 3:56 PM, "Dann Corbit" <DCorbit@connx.com> wrote:

> Two pass does not require sqrt(total) memory.  This figure is clearly
> wrong.

Clearly you haven't read the paper I posted previously in this thread,
written in 1986 by Jim Gray at Tandem.

- Luke 




Re: Merge algorithms for large numbers of "tapes"

From
"Zeugswetter Andreas DCP SD"
Date:
> Two pass will create the count of subfiles proportional to:
> Subfile_count = original_stream_size/sort_memory_buffer_size
>
> The merge pass requires (sizeof record * subfile_count) memory.

That is true from an algorithmic perspective. But to make the
merge efficient you would need to have enough RAM to cache a reasonably
large block per subfile. Else you would need to reread the same
page/block from one subfile multiple times.
(If you had one disk per subfile you could also rely on the disk's own
cache, but I think we can rule that out.)

> Example:
> You have a 7 gigabyte table to sort and you have 100 MB sort buffer.
> The number of subfiles will be:
> 7000000000 / 100000000 = 70 files

To be efficient you need (70 + 1) * max(record_size, 256k) = ~18 MB

Plus you need a structure per subfile that points to the current record
in the buffer.
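
For concreteness, the same arithmetic in a few lines of C (the 256k block
size and the extra output buffer are just the assumptions above):

    #include <stdio.h>

    /* rough memory needed for a one-pass merge with one read-ahead buffer
     * per input run plus one buffer for the output */
    static long long merge_pass_memory(long long data_bytes,
                                       long long sort_buffer,
                                       long long io_block,
                                       long long record_size)
    {
        long long runs = (data_bytes + sort_buffer - 1) / sort_buffer;
        long long per_run = io_block > record_size ? io_block : record_size;

        return (runs + 1) * per_run;
    }

    int main(void)
    {
        /* 7 GB of data, 100 MB sort buffer, 256 KB blocks, 2 KB records */
        printf("%lld bytes\n",
               merge_pass_memory(7000000000LL, 100000000LL, 262144, 2048));
        return 0;
    }

which prints 18612224, i.e. the ~18 MB figure above.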

Andreas


Re: Merge algorithms for large numbers of "tapes"

From
Martijn van Oosterhout
Date:
On Fri, Mar 10, 2006 at 09:57:28AM +0100, Zeugswetter Andreas DCP SD wrote:
>
> > Two pass will create the count of subfiles proportional to:
> > Subfile_count = original_stream_size/sort_memory_buffer_size
> >
> > The merge pass requires (sizeof record * subfile_count) memory.
>
> That is true from an algorithmic perspective. But to make the
> merge efficient you would need to have enough RAM to cache a reasonably
> large block per subfile. Else you would need to reread the same
> page/block from one subfile multiple times.
> (If you had one disk per subfile you could also rely on the disk's own
> cache, but I think we can rule that out.)

But what about the OS cache? Linux will read up to the next 128KB of a
file if it's contiguous on disk, which is likely with modern
filesystems. It's likely to be much "fairer" than any way we can come
up with to share memory.

Question is, do we want our algorithm to rely on that caching?
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Merge algorithms for large numbers of "tapes"

From
"Zeugswetter Andreas DCP SD"
Date:
> > > Two pass will create the count of subfiles proportional to:
> > > Subfile_count = original_stream_size/sort_memory_buffer_size
> > >
> > > The merge pass requires (sizeof record * subfile_count) memory.
> >
> > That is true from an algorithmic perspective. But to make the merge
> > efficient you would need to have enough RAM to cache a reasonably
> > large block per subfile. Else you would need to reread the same
> > page/block from one subfile multiple times.
> > (If you had one disk per subfile you could also rely on the disk's own
> > cache, but I think we can rule that out.)
>
> But what about the OS cache? Linux will read up to the next 128KB of a
> file if it's contiguous on disk, which is likely with modern
> filesystems. It's likely to be much "fairer" than any way we can come
> up with to share memory.

We were discussing how much RAM is needed, and not how much
the backend allocates itself. So if the backend needs to duplicate some
of the OS cache, that will only add to the memory requirement.
The most likely scenario is that the backend additionally holds one
page per subfile.

> Question is, do we want our algorithm to rely on that caching?

Currently we do, and I don't think that is so bad, actually.
The only optimization I would consider is adding a sequential-access
hint to the "tape file :-)" open.

Andreas