Thread: polyphase merge?
Dear All,

I apologize if this has been discussed before. I have tried to search the mailing list and could not find an exact answer.

Currently, Postgres uses Knuth's Algorithm 5.4.2D to perform external merge sort. IMHO the algorithm is designed for tapes.

Why isn't the simple textbook merge pattern described in Knuth page 365 (Section 5.4.9) used for disk-based systems? The same merge pattern is also described in Ramakrishnan's textbook, and it is optimal if seek time is not counted, which of course is not the real-world case.

Best regards,

Don
Dear All,

Since nobody replied, I will give it a try. I am going to implement the merge pattern described in Knuth page 365 (Section 5.4.9); essentially it is as follows:
- create initial runs using replacement selection (basically as in the current implementation)
- add enough dummy runs of size 0 so that the number of sorted runs minus one is divisible by k-1 (k is the merge fan-in)
- repeatedly merge the k smallest runs until only 1 run is left

(A concrete sketch of this merge schedule is given below.)

I am new to PostgreSQL, hence any advice would be appreciated. Can anybody give me some advice on how it can be done?

1. How should a run be represented? Should I reuse the tape mechanism, e.g. 1 tape = 1 run? Or should I use a temp buffer?

2. How should memory be allocated? I assume I divide the memory equally among the k runs, hence each run gets M/k memory, and each read of a run is M/k bytes.

3. Prefetching. We can precompute the read sequence of the blocks of all runs for the entire merge process. Based on this, we know which blocks of which run to prefetch at each point in time.

4. Is it possible to perform read/write I/O while the merge is being performed, so that we overlap I/O and computation?

5. Any other issues that need consideration?

Best regards,

Don
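As a concrete (and deliberately simplified) illustration of the schedule sketched above: the following C program is not tuplesort.c code; the fan-in K, the example run sizes, and the function names are invented, and a real implementation would use a min-heap rather than re-sorting the run list at each step. It pads the run list with zero-size dummy runs until (n - 1) is divisible by (K - 1), then repeatedly merges the K smallest runs, charging the merged size as I/O volume:

#include <stdio.h>
#include <stdlib.h>

#define K 4                     /* merge fan-in (invented for the example) */

static int
cmp_long(const void *a, const void *b)
{
    long        la = *(const long *) a;
    long        lb = *(const long *) b;

    return (la > lb) - (la < lb);
}

/*
 * Given the sizes of the initial sorted runs, compute the total volume of
 * data read and written by repeatedly merging the K smallest runs.  Dummy
 * runs of size 0 are added so that (n - 1) % (K - 1) == 0, which guarantees
 * that every step is a full K-way merge.
 */
static long
merge_cost(const long *sizes, int n)
{
    long        total = 0;
    long       *runs = malloc((n + K) * sizeof(long));  /* room for dummies */
    int         i;

    for (i = 0; i < n; i++)
        runs[i] = sizes[i];
    while ((n - 1) % (K - 1) != 0)
        runs[n++] = 0;          /* dummy run of size 0 */

    while (n > 1)
    {
        long        merged = 0;

        /* sort ascending; the K smallest runs end up at the front */
        qsort(runs, n, sizeof(long), cmp_long);
        for (i = 0; i < K; i++)
            merged += runs[i];
        total += merged;        /* each merged tuple is read once, written once */

        runs[0] = merged;       /* the merge output replaces its K inputs */
        for (i = K; i < n; i++)
            runs[i - K + 1] = runs[i];
        n -= K - 1;
    }
    free(runs);
    return total;
}

int
main(void)
{
    long        sizes[] = {10, 20, 30, 40, 50, 60};

    printf("total merge volume: %ld\n",
           merge_cost(sizes, (int) (sizeof(sizes) / sizeof(sizes[0]))));
    return 0;
}

Because the zero-size dummies are always among the smallest runs, they are consumed in the very first merge step; the padding just ensures that every step, including the last, is a full K-way merge.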
On Wed, 2009-02-04 at 14:42 +0800, Don Marvick wrote:

> 5. Any other issues that need consideration?

Most attempts to improve sorting further have fallen to nothing because of the lack of repeatable test results. In particular, coming up with test cases *after* writing the code is a good way to get lost in discussions and have your work passed over.

The best starting point is to collect a number of test cases, both typical and extreme ones. Do some research to show that "typical" cases really are that, and be prepared for some expressions of reasonable doubt. Publish that data so test results can be objectively verified, and then measure those cases on existing code with various settings of the tunable parameters.

Then it will be a simple matter to prove your changes are effective in the target cases without damaging other cases. We would also want the changes to work automatically, without additional tunable parameters.

-- 
Simon Riggs    www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Is this basically the same as our current algorithm but without multiplexing the tapes onto single files? I have been wondering whether we multiplex the tapes any better than filesystems can lay out separate files, actually.

Or is there something else that's different here?

-- 
greg
On Wed, 2009-02-04 at 10:55 +0000, Greg Stark wrote:

> Is this basically the same as our current algorithm but without
> multiplexing the tapes onto single files? I have been wondering
> whether we multiplex the tapes any better than filesystems can lay
> out separate files, actually.

I don't think you'll be able to do that more efficiently than we already do. Read the top of tuplesort.c.

-- 
Simon Riggs    www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Greg Stark <stark@enterprisedb.com> writes:
> Is this basically the same as our current algorithm but without
> multiplexing the tapes onto single files? I have been wondering
> whether we multiplex the tapes any better than filesystems can lay
> out separate files, actually.

The reason for the multiplexing is so that space can get re-used quickly. If each tape were represented as a separate file, there would be no way to release blocks as they're read; you could only give back the whole file after reaching end of tape. Which would at least double the amount of disk space needed to sort X amount of data. (It's actually even worse, more like 4X, though the multiplier might depend on the number of "tapes" --- I don't recall the details anymore.)

The penalty we pay is that in the later merge passes, the blocks representing a single tape aren't very well ordered. It might be interesting to think about some compromise that wastes a little more space in order to get better sequentiality of disk access. It'd be easy to do if we were willing to accept a 2X space penalty, but I'm not sure if that would fly or not. It definitely *wasn't* acceptable to the community a few years ago when the current code was written. Disks have gotten bigger since then, but so have the problems people want to solve.

			regards, tom lane
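To make the space-reuse point concrete, here is a toy sketch (these are not logtape.c's actual structures or functions; all names are invented). All logical tapes allocate from one shared pool; a block that has been read from any tape goes back on a freelist and is handed to the next writer, so the high-water mark of the shared file stays close to the data size rather than a multiple of it:

#include <stdio.h>
#include <stdlib.h>

typedef struct BlockPool
{
    long        next_new;       /* next never-used block number */
    long       *freelist;       /* block numbers released and reusable */
    int         nfree;
    int         freesize;
} BlockPool;

/* hand out a recycled block if one exists, else extend the shared file */
static long
pool_alloc(BlockPool *p)
{
    if (p->nfree > 0)
        return p->freelist[--p->nfree];
    return p->next_new++;
}

/* a tape finished reading this block; any tape may now write into it */
static void
pool_release(BlockPool *p, long blk)
{
    if (p->nfree == p->freesize)
    {
        p->freesize = p->freesize > 0 ? p->freesize * 2 : 16;
        p->freelist = realloc(p->freelist, p->freesize * sizeof(long));
    }
    p->freelist[p->nfree++] = blk;
}

int
main(void)
{
    BlockPool   pool = {0, NULL, 0, 0};
    long        a = pool_alloc(&pool);      /* tape 0 writes block 0 */
    long        b = pool_alloc(&pool);      /* tape 1 writes block 1 */

    pool_release(&pool, a);                 /* merge consumed tape 0's block */
    printf("tape 2 writes into block %ld\n", pool_alloc(&pool));  /* block 0 again */
    printf("high-water mark: %ld blocks\n", pool.next_new);       /* still 2 */

    (void) b;
    free(pool.freelist);
    return 0;
}

With one OS file per tape there is no per-block release operation to call; the space comes back only when the whole file is removed, which is the 2X-or-worse penalty described above.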
Don Marvick <donmarvick@gmail.com> writes:
> Since nobody replied, I will give it a try. I am going to implement the
> merge pattern described in Knuth page 365 (Section 5.4.9); essentially
> it is as follows:
> - create initial runs using replacement selection (basically as in the
>   current implementation)
> - add enough dummy runs of size 0 so that the number of sorted runs
>   minus one is divisible by k-1 (k is the merge fan-in)
> - repeatedly merge the k smallest runs until only 1 run is left

How is this better than, or even different from, what is there now?

			regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Wed, 2009-02-04 at 10:55 +0000, Greg Stark wrote:
>> Is this basically the same as our current algorithm but without
>> multiplexing the tapes onto single files? I have been wondering
>> whether we multiplex the tapes any better than filesystems can lay
>> out separate files, actually.
>
> I don't think you'll be able to do that more efficiently than we
> already do. Read the top of tuplesort.c.

Huh?

The question I posed was whether we do it any better than filesystems do, not whether we could do a better job. If filesystems can do as good a job, then we might as well create separate files for each tape and let the filesystem decide where to allocate space. That would mean we could massively simplify logtape.c and just create a separate file for every tape.

Possible reasons that filesystems could do better than us are that they have access to more information about the actual storage layout on disk, that they have more information about hardware characteristics, and that it would give the filesystem cache a better idea of the access pattern, which logtape.c might currently be obscuring.

On the other hand, possible reasons why filesystems would suck at this include creating and deleting new files being a slow or locking operation on many filesystems, and dealing with directories of large numbers of files being poorly optimized on others.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!
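For reference, a hypothetical sketch of the one-file-per-tape layout being discussed (nothing here comes from logtape.c; tmpfile() is simply the easiest portable way to get anonymous temp files). The filesystem decides block placement, and the trade-off Tom described shows up in the last loop: fclose() returns a tape's space all at once, never block by block:

#include <stdio.h>

#define NTAPES 7                /* fan-in + 1, following the thread's convention */

int
main(void)
{
    FILE       *tape[NTAPES];
    char        buf[64];
    int         i;

    /* one anonymous temp file per logical tape, laid out by the filesystem */
    for (i = 0; i < NTAPES; i++)
    {
        tape[i] = tmpfile();
        if (tape[i] == NULL)
        {
            perror("tmpfile");
            return 1;
        }
    }

    /* write a run to tape 0, rewind, and read it back during the "merge" */
    fputs("run data...\n", tape[0]);
    rewind(tape[0]);
    if (fgets(buf, sizeof(buf), tape[0]) != NULL)
        printf("read back: %s", buf);

    /* closing a tape releases all of its space at once, not block by block */
    for (i = 0; i < NTAPES; i++)
        fclose(tape[i]);
    return 0;
}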
On Wed, Feb 4, 2009 at 6:09 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The best starting point is to collect a number of test cases, both
> typical and extreme ones. [...] Then it will be a simple matter to
> prove your changes are effective in the target cases without damaging
> other cases.

Good point. I will note this issue. Thanks for the advice :).

Regards,

Don
On Wed, Feb 4, 2009 at 11:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> How is this better than, or even different from, what is there now?

This is what I understand of the existing implementation:
- fix the merge fan-in F; the number of tapes is F+1
- for each sorted run generated, a formula determines which tape it belongs to
- the sorted run is appended to the end of that tape
- once all sorted runs have been distributed, there is always one empty tape, the output tape
- dummy runs are added to each tape if necessary
- the input tapes are merged, with the result written to the output tape, until one of the input tapes is empty
- this process repeats until 1 sorted run remains

I think the main difference is that at each step, the current implementation does not merge the smallest k runs; it merges whichever runs are at the front of each tape. For 1 merge pass this makes no difference, but from 2 passes onwards there are differences, and in complex queries where memory is eaten up by other operators, more passes are needed and the differences grow (see the worked example below). This is the only improvement that can be achieved. Let me know if this does not make sense.

Regarding the disk layout, I am open to suggestions on whether we should reuse the tape module or not.
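A toy two-way example (run sizes invented for illustration) makes the claimed difference concrete. With fan-in k = 2 and initial runs of 1, 1, 1, and 97 blocks, merging smallest-first costs (1+1) + (2+1) + (3+97) = 105 blocks of merge volume, whereas merging the 97-block run first costs (97+1) + (98+1) + (99+1) = 297 blocks. Polyphase selects runs by tape position rather than by size, so its cost can land anywhere between those extremes; always merging the k smallest runs is the Huffman-style schedule of Knuth 5.4.9, which minimizes total merge volume when seek time is ignored.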
On Wed, Feb 4, 2009 at 11:25 PM, Gregory Stark <stark@enterprisedb.com> wrote:
> If filesystems can do as good a job, then we might as well create
> separate files for each tape and let the filesystem decide where to
> allocate space. That would mean we could massively simplify logtape.c
> and just create a separate file for every tape. [...]

Hmm, it's probably worth trying this. Has anybody tried it before? If not, I am interested in giving it a try.

Regards,

Don