Thread: polyphase merge?

polyphase merge?

From
Don Marvick
Date:
Dear All,

I apologize if this has been discussed before. I have tried to search the mailing list and could not find an exact answer.

Currently, Postgres uses Knuth's Algorithm 5.4.2D to perform external merge sort. IMHO the algorithm is designed for tapes.

Why isn't the simple textbook merge pattern described in Knuth, page 365 (Section 5.4.9), used for disk-based systems? The same merge pattern is also described in the Ramakrishnan textbook, and it is optimal if seek time is not counted, which of course is not the real-world case.

Best regards,

Don

Re: polyphase merge?

From
Don Marvick
Date:
Dear All,

Since nobody replied, I will give it a try. I am going to implement the merge pattern described in Knuth, page 365 (Section 5.4.9). Essentially it is as follows:
- create initial runs using replacement selection (basically as in the current implementation)
- add enough dummy runs of size 0 so that the number of sorted runs minus one is divisible by k-1 (k is the merge fan-in)
- repeatedly merge the k smallest runs until only 1 run is left
(a toy sketch of this schedule appears at the end of this message)

I am new to PostgreSQL, hence any advice would be appreciated. Can anybody give me some advice on how it can be done?

1. How should a run be represented? Should I reuse the tape mechanism (e.g. one tape per run), or should I use a temp buffer?

2. How should memory be allocated? I assume I divide the memory equally among the k runs, so each run gets M/k memory and each read of a run is M/k bytes.

3. Prefetching. We can precompute the read sequence of run blocks for the entire merge process. Based on this, we know which blocks of which runs to prefetch at each point in time.

4. Is it possible to perform read/write I/O while the merge is being performed, so that we overlap I/O and computation?

5. Are there any other issues that need consideration?

Best regards,

Don
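P.S. For concreteness, here is a minimal sketch of the schedule above in plain C (illustrative only, not PostgreSQL code; it tracks only run sizes and total I/O cost, abstracting away run contents, tapes, and memory):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int cmp_long(const void *a, const void *b)
{
    long la = *(const long *) a;
    long lb = *(const long *) b;
    return (la > lb) - (la < lb);
}

/* Returns the total I/O cost: the sum of merged-run sizes over all steps. */
static long merge_schedule(long *runs, int n, int k)
{
    long cost = 0;

    /* Add dummy runs of size 0 until (n - 1) is divisible by (k - 1). */
    while ((n - 1) % (k - 1) != 0)
        runs[n++] = 0;

    while (n > 1)
    {
        long merged = 0;
        int i;

        /* Smallest runs first; a heap would avoid the repeated sort. */
        qsort(runs, n, sizeof(long), cmp_long);

        /* Merge the k smallest runs into one run. */
        for (i = 0; i < k; i++)
            merged += runs[i];
        cost += merged;

        /* Replace the k merged runs with the single result run. */
        runs[0] = merged;
        memmove(&runs[1], &runs[k], (n - k) * sizeof(long));
        n -= k - 1;
    }
    return cost;
}

int main(void)
{
    long runs[32] = {10, 40, 25, 5, 100, 30};   /* run sizes in blocks */

    /* With fan-in k = 3, one dummy run is added, so 7 runs collapse
     * 7 -> 5 -> 3 -> 1. */
    printf("total I/O: %ld\n", merge_schedule(runs, 6, 3));
    return 0;
}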

Re: polyphase merge?

From
Simon Riggs
Date:
On Wed, 2009-02-04 at 14:42 +0800, Don Marvick wrote:

> 5. Are there any other issues that need consideration?

Most attempts to improve sorting further have fallen to nothing because
of the lack of repeatable test results. In particular, coming up with
test cases *after* writing the code is a good way to get lost in
discussions and have your work passed over.

The best starting point is to collect a number of test cases, both
typical and extreme cases. Do some research to show that "typical" cases
really are that and be prepared for some expressions of reasonable
doubt. Publish that data so test results can be objectively verified and
then measure those cases on existing code, with various settings of
tunable parameters.

Then it will be a simple matter to prove your changes are effective in
target cases without damaging other cases. We would also want the
changes to work automatically without additional tunable parameters.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: polyphase merge?

From
Greg Stark
Date:
Is this basically the same as our current algorithm but without
multiplexing the tapes onto single files? I have been wondering
whether we multiplex the tapes any better than filesystems can lay out
separate files actually.

Or is there something else that's different here?

-- 
greg


Re: polyphase merge?

From
Simon Riggs
Date:
On Wed, 2009-02-04 at 10:55 +0000, Greg Stark wrote:
> Is this basically the same as our current algorithm but without
> multiplexing the tapes onto single files? I have been wondering
> whether we multiplex the tapes any better than filesystems can lay out
> separate files actually.

I don't think you'll be able to do that more efficiently than we already
do. Read the top of tuplesort.c

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: polyphase merge?

From
Tom Lane
Date:
Greg Stark <stark@enterprisedb.com> writes:
> Is this basically the same as our current algorithm but without
> multiplexing the tapes onto single files? I have been wondering
> whether we multiplex the tapes any better than filesystems can lay out
> separate files actually.

The reason for the multiplexing is so that space can get re-used
quickly.  If each tape were represented as a separate file, there would
be no way to release blocks as they're read; you could only give back
the whole file after reaching end of tape.  Which would at least double
the amount of disk space needed to sort X amount of data.  (It's
actually even worse, more like 4X, though the multiplier might depend on
the number of "tapes" --- I don't recall the details anymore.)

The penalty we pay is that in the later merge passes, the blocks
representing a single tape aren't very well ordered.
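To make the space-reuse point concrete, here is a toy sketch in plain C (invented names, nothing like the real logtape.c internals): all tapes allocate blocks from one shared pool, and a block goes back on a free list as soon as it has been read, so merge output immediately recycles the space vacated by merge input.

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    long   *freelist;       /* stack of recycled block numbers */
    int     nfree;
    long    next_block;     /* high-water mark: next brand-new block */
} BlockPool;

static long pool_alloc(BlockPool *p)
{
    if (p->nfree > 0)
        return p->freelist[--p->nfree];   /* reuse a released block */
    return p->next_block++;               /* otherwise extend the file */
}

static void pool_release(BlockPool *p, long blk)
{
    p->freelist[p->nfree++] = blk;        /* caller has finished reading blk */
}

int main(void)
{
    BlockPool pool = {malloc(1024 * sizeof(long)), 0, 0};
    int i;

    /* An input run occupies blocks 0..3. */
    for (i = 0; i < 4; i++)
        (void) pool_alloc(&pool);

    /* Read the input back while writing output: each block read is
     * released and immediately reused, so the file never grows. */
    for (i = 0; i < 4; i++)
    {
        pool_release(&pool, (long) i);
        printf("input block %d recycled as output block %ld\n",
               i, pool_alloc(&pool));
    }

    printf("file stayed at %ld blocks\n", pool.next_block);
    free(pool.freelist);
    return 0;
}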

It might be interesting to think about some compromise that wastes a
little more space in order to get better sequentiality of disk access.
It'd be easy to do if we were willing to accept a 2X space penalty,
but I'm not sure if that would fly or not.  It definitely *wasn't*
acceptable to the community a few years ago when the current code was
written.  Disks have gotten bigger since then, but so have the problems
people want to solve.
        regards, tom lane


Re: polyphase merge?

From
Tom Lane
Date:
Don Marvick <donmarvick@gmail.com> writes:
> Since nobody replied, I will give it a try. I am going to implement the
> merge pattern described in Knuth, page 365 (Section 5.4.9). Essentially
> it is as follows:
> - create initial runs using replacement selection (basically as in the
>   current implementation)
> - add enough dummy runs of size 0 so that the number of sorted runs
>   minus one is divisible by k-1 (k is the merge fan-in)
> - repeatedly merge the k smallest runs until only 1 run is left

How is this better than, or even different from, what is there now?
        regards, tom lane


Re: polyphase merge?

From
Gregory Stark
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:

> On Wed, 2009-02-04 at 10:55 +0000, Greg Stark wrote:
>> Is this basically the same as our current algorithm but without
>> multiplexing the tapes onto single files? I have been wondering
>> whether we multiplex the tapes any better than filesystems can lay out
>> separate files actually.
>
> I don't think you'll be able to do that more efficiently than we already
> do. Read the top of tuplesort.c

Huh?

The question I posed was whether we do it any better than filesystems do, not
whether we could do a better job. If filesystems can do as good a job then we
might as well create separate files for each tape and let the filesystem
decide where to allocate space. That would mean we could massively simplify
logtape.c and just create a separate file for every tape.

Possible reasons that filesystems could do better than us are that they have
access to more information about the actual storage layout on disk, that they
have more information about hardware characteristics, and that separate files
would give the filesystem cache a better idea of the access pattern, which
logtape.c may currently be obscuring.

On the other hand, possible reasons why filesystems would suck at this include
creating and deleting files being a slow or locking operation on many
filesystems, and dealing with directories containing large numbers of files
being poorly optimized on others.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!


Re: polyphase merge?

From
Don Marvick
Date:
Good point. I will make a note of this issue. Thanks for the advice :).

Regards,

Don
class="gmail_quote">OnWed, Feb 4, 2009 at 6:09 PM, Simon Riggs <span dir="ltr"><<a
href="mailto:simon@2ndquadrant.com">simon@2ndquadrant.com</a>></span>wrote:<br /><blockquote class="gmail_quote"
style="border-left:1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="Ih2E3d"><br
/>On Wed, 2009-02-04 at 14:42 +0800, Don Marvick wrote:<br /><br /> > 4. any other issue needs consideration?<br
/><br/></div>Most attempts to improve sorting further have fallen to nothing because<br /> of the lack of repeatable
testresults. In particular, coming up with<br /> test cases *after* writing the code is a good way to get lost in<br />
discussionsand have your work passed over.<br /><br /> The best starting point is to collect a number of test cases,
both<br/> typical and extreme cases. Do some research to show that "typical" cases<br /> really are that and be
preparedfor some expressions of reasonable<br /> doubt. Publish that data so test results can be objectively verified
and<br/> then measure those cases on existing code, with various settings of<br /> tunable parameters.<br /><br /> Then
itwill be a simple matter to prove your changes are effective in<br /> target cases without damaging other cases. We
wouldalso want the<br /> changes to work automatically without additional tunable parameters.<br /><font
color="#888888"><br/> --<br />  Simon Riggs           <a href="http://www.2ndQuadrant.com"
target="_blank">www.2ndQuadrant.com</a><br/>  PostgreSQL Training, Services and Support<br /><br
/></font></blockquote></div><br/> 

Re: polyphase merge?

From
Don Marvick
Date:
This is what I understand from the existing implementation:
- the merge fan-in F is fixed, and the number of tapes is F+1
- for each sorted run generated, a formula determines which tape it belongs to
- the sorted run is added to the end of that tape
- once all sorted runs have been added to tapes, there is always one empty tape, called the output tape
- dummy runs are added to each tape if necessary
- the input tapes are merged, with the result written to the output tape, until one of the input tapes is empty
- this process is repeated until 1 sorted run remains
(a toy simulation of this pattern appears at the end of this message)

I think the main difference is that at each step, the current implementation does not merge the smallest k runs; it merges from the beginning of each tape. For one merge pass this makes no difference, but from two passes onwards there are some differences. In some complex queries, where memory is eaten by other operators, more passes are needed and the differences become larger. This is the only improvement that can be achieved. Let me know if this does not make sense.

Regarding disk layout, I am open to suggestions on whether we should reuse the tape module or not.
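For concreteness, here is that toy simulation in plain C (illustrative only; it tracks run counts per tape and assumes dummy runs have already produced a perfect Fibonacci distribution, so F = 2 input tapes start with 13 and 8 runs):

#include <stdio.h>

int main(void)
{
    int tapes[3] = {13, 8, 0};  /* three tapes, output tape empty */
    int total = tapes[0] + tapes[1] + tapes[2];

    while (total > 1)
    {
        int i, out = -1, low = -1;

        /* The empty tape becomes the output tape. */
        for (i = 0; i < 3; i++)
            if (tapes[i] == 0)
                out = i;

        /* Merge one run from each input tape onto the output tape,
         * repeating until the input tape with fewest runs is empty. */
        for (i = 0; i < 3; i++)
            if (i != out && (low < 0 || tapes[i] < tapes[low]))
                low = i;

        int steps = tapes[low];
        for (i = 0; i < 3; i++)
            if (i != out)
                tapes[i] -= steps;
        tapes[out] += steps;

        total = tapes[0] + tapes[1] + tapes[2];
        printf("after pass: %d %d %d\n", tapes[0], tapes[1], tapes[2]);
    }
    return 0;
}

Note how each pass merges whichever runs happen to be at the front of the tapes, regardless of their sizes; the k-smallest pattern I proposed would instead pick runs by size at every step.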

Re: polyphase merge?

From
Don Marvick
Date:
Hmm, it's probably worth trying this. Has anybody tried this before? If not, I am interested in giving it a try.

Regards,

Don
dir="ltr"><<ahref="mailto:stark@enterprisedb.com">stark@enterprisedb.com</a>></span> wrote:<br /><blockquote
class="gmail_quote"style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left:
1ex;"><divclass="Ih2E3d">Simon Riggs <simon@2ndQuadrant.com> writes:<br /><br /> > On Wed, 2009-02-04 at 10:55
+0000,Greg Stark wrote:<br /> >> Is this basically the same as our current algorithm but without<br /> >>
multiplexingthe tapes onto single files? I have been wondering<br /> >> whether we multiplex the tapes any better
thanfilesystems can lay out<br /> >> separate files actually.<br /> ><br /> > I don't think you'll be able
todo that more efficiently than we already<br /> > do. Read the top of tuplesort.c<br /><br /></div>Huh?<br /><br />
Thequestion I posed was whether we do it any better than filesystems do, not<br /> whether we could do a better job. If
filesystemscan do as good a job then we<br /> might as well create separate files for each tape and let the
filesystem<br/> decide where to allocate space. That would mean we could massively simplify<br /> logtape.c and just
createa separate file for every tape.<br /><br /> Possible reasons that filesystems could do better than us are that
theyhave<br /> access to more information about actual storage layout on disk, that they have<br /> more information
abouthardware characteristics, and that it would give the<br /> filesystem cache a better idea of the access pattern
whichlogtape.c might be<br /> confusing the picture of currently.<br /><br /> On the other hand possible reasons why
filesystemswould suck at this include<br /> creating and deleting new files being a slow or locking operation on
many<br/> filesystems, and dealing with directories of large numbers of files being<br /> poorly optimized on
others.<br/><font color="#888888"><br /> --<br />  Gregory Stark<br />  EnterpriseDB          <a
href="http://www.enterprisedb.com"target="_blank">http://www.enterprisedb.com</a><br />  Ask me about EnterpriseDB's
RemoteDBAservices!<br /></font></blockquote></div><br />