Thread: CUDA Sorting
Hello everyone, I'm implementing a CUDA based sorting on PostgreSQL, and I believe it can improve the ORDER BY statement performance by 4 to 10 times. I already have a generic CUDA sort that performs around 10 times faster than std qsort. I also managed to load CUDA into pgsql. Since I'm new to pgsql development, I replaced the code of pgsql qsort_arg to get familiar with the way postgres does the sort. The problem is that I can't use the qsort_arg_comparator comparator function on the GPU; I need to implement my own. I couldn't find out how to access the sorting key value data of the tuples in the Tuplesortstate or SortTuple structures. This part looks complicated because it seems the state holds the pointer for the scanner(?), but I didn't manage to access the values directly. Can anyone tell me how this works? Cheers, Vítor
On 19 September 2011 13:11, Vitor Reus <vitor.reus@gmail.com> wrote: > Hello everyone, > > I'm implementing a CUDA based sorting on PostgreSQL, and I believe it > can improve the ORDER BY statement performance by 4 to 10 times. I > already have a generic CUDA sort that performs around 10 times faster > than std qsort. I also managed to load CUDA into pgsql. > > Since I'm new to pgsql development, I replaced the code of pgsql > qsort_arg to get familiar with the way postgres does the sort. The problem > is that I can't use the qsort_arg_comparator comparator function on > the GPU; I need to implement my own. I couldn't find out how to access the > sorting key value data of the tuples in the Tuplesortstate or > SortTuple structures. This part looks complicated because it seems the > state holds the pointer for the scanner(?), but I didn't manage to > access the values directly. Can anyone tell me how this works? I can't help with explaining the inner workings of the sorting code, but just a note that CUDA is a proprietary framework from nVidia and confines its use to nVidia GPUs only. You'd probably be better off investing in the OpenCL standard, which is processor-agnostic. Work has already been done in this area by Tim Child with pgOpenCL, although it doesn't appear to be available yet. It might be worth engaging with him to see if there are commonalities in what you're both trying to achieve. -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935 EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 19 September 2011 14:32, Vitor Reus <vitor.reus@gmail.com> wrote: > 2011/9/19 Thom Brown <thom@linux.com>: >> On 19 September 2011 13:11, Vitor Reus <vitor.reus@gmail.com> wrote: >>> Hello everyone, >>> >>> I'm implementing a CUDA based sorting on PostgreSQL, and I believe it >>> can improve the ORDER BY statement performance by 4 to 10 times. I >>> already have a generic CUDA sort that performs around 10 times faster >>> than std qsort. I also managed to load CUDA into pgsql. >>> >>> Since I'm new to pgsql development, I replaced the code of pgsql >>> qsort_arg to get familiar with the way postgres does the sort. The problem >>> is that I can't use the qsort_arg_comparator comparator function on >>> the GPU; I need to implement my own. I couldn't find out how to access the >>> sorting key value data of the tuples in the Tuplesortstate or >>> SortTuple structures. This part looks complicated because it seems the >>> state holds the pointer for the scanner(?), but I didn't manage to >>> access the values directly. Can anyone tell me how this works? >> >> I can't help with explaining the inner workings of the sorting code, but >> just a note that CUDA is a proprietary framework from nVidia and >> confines its use to nVidia GPUs only. You'd probably be better off >> investing in the OpenCL standard, which is processor-agnostic. Work >> has already been done in this area by Tim Child with pgOpenCL, >> although it doesn't appear to be available yet. It might be worth >> engaging with him to see if there are commonalities in what you're >> both trying to achieve. >> >> -- >> Thom Brown >> Twitter: @darkixion >> IRC (freenode): dark_ixion >> Registered Linux user: #516935 >> >> EnterpriseDB UK: http://www.enterprisedb.com >> The Enterprise PostgreSQL Company >> > > Hi Thom Brown, > > thank you very much for your reply. > > I am aware that CUDA is a proprietary framework, but since the high > level CUDA API is easier than OpenCL, it will be faster to implement > and test. Also, CUDA can be translated to OpenCL in a straightforward > way, since the low level CUDA API generated code is really similar to > OpenCL. > > I'll try engaging with Tim Child, but it seems that his work is to > create GPU support for specific SQL, like procedural SQL statements > with CUDA extensions. Did I understand it right? My focus is to > "unlock" the GPU power without the user being aware of this. Please use Reply To All in your responses so the mailing list is included. Is your aim to have this committed into core PostgreSQL, or just for your own version? If it's the former, I don't anticipate any enthusiasm from the hacker community. But you're right, Tim Child's work is aimed at procedural acceleration rather than speeding up core functionality (from what I gather anyway). -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935 EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Sep 19, 2011 at 1:11 PM, Vitor Reus <vitor.reus@gmail.com> wrote: > Since I'm new to pgsql development, I replaced the code of pgsql > qsort_arg to get familiar with the way postgres does the sort. The problem > is that I can't use the qsort_arg_comparator comparator function on > the GPU; I need to implement my own. I couldn't find out how to access the > sorting key value data of the tuples in the Tuplesortstate or > SortTuple structures. This part looks complicated because it seems the > state holds the pointer for the scanner(?), but I didn't manage to > access the values directly. Can anyone tell me how this works? > This is something I've been curious about for a while. The biggest difficulty is that Postgres has a user-extensible type system and calls user-provided functions to do things like comparisons. Postgres only supports comparison sorts and does so by calling the user function for the data type being sorted. These user-defined functions are looked up earlier, in the query parsing and analysis phase, and stored in Tuplesortstate->scanKeys, which is an array of structures that hold information about the ordering required. In there there's a pointer to the function, a set of flags (such as NULLS FIRST/LAST), and the collation needed. I assume you're going to have to have tuplesort.c recognize if all the comparators are one of a small set of standard comparators that you can implement on the GPU, such as integer and floating point comparison. In which case you could call a specialized qsort which implements that comparator inlined instead of calling the standard function. That might actually be a useful optimization to do anyways since it may well be much faster even without the GPU. So that would probably be a good place to start. But the barrier to get over here might be relatively high. In order to tolerate that amount of duplicated code and special cases there would have to be benchmarks showing it's significantly faster and helps real-world user queries. It would also have to be pretty cleanly implemented so that it doesn't impose a lot of extra overhead every time this code needs to be changed -- for example when adding collations it would have been unfortunate to have to add it to half a dozen specializations of tuplesort (though frankly I don't think that would have made that much of a dent in the happiness of the people who worked on collations). All that said, my personal opinion is that this can be done cleanly and would be more than worth the benefit even without the GPU -- sorting integers and floating point numbers is a very common case, and Peter Geoghegan recently showed our qsort could be about twice as fast if it could inline the comparisons. With the GPU I'm curious to see how well it handles multiple processes contending for resources, it might be a flashy feature that gets lots of attention but might not really be very useful in practice. But it would be very interesting to see. -- greg
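To make that concrete, here is a minimal sketch (not from the thread) of a type-specialized sort whose comparator the compiler can inline; SimpleSortTuple and its pre-decoded key are illustrative stand-ins, not the real tuplesort.c structures:

#include <algorithm>
#include <cstdint>
#include <vector>

struct SimpleSortTuple {
    int32_t key;    // assumed: the leading sort key, already decoded
    void   *tuple;  // opaque pointer back to the full tuple
};

void sort_int32_specialized(std::vector<SimpleSortTuple> &tuples)
{
    // The lambda is inlined by the compiler, so each comparison costs a
    // couple of instructions rather than an indirect function call.
    std::sort(tuples.begin(), tuples.end(),
              [](const SimpleSortTuple &a, const SimpleSortTuple &b) {
                  return a.key < b.key;
              });
}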
On 09/19/2011 10:12 AM, Greg Stark wrote: > With the GPU I'm curious to see how well > it handles multiple processes contending for resources, it might be a > flashy feature that gets lots of attention but might not really be > very useful in practice. But it would be very interesting to see. > The main problem here is that the sort of hardware commonly used for production database servers doesn't have any serious enough GPU to support CUDA/OpenCL available. The very clear trend now is that all systems other than gaming ones ship with motherboard graphics chipsets more than powerful enough for any task but that. I just checked the 5 most popular configurations of server I see my customers deploy PostgreSQL onto (a mix of Dell and HP units), and you don't get a serious GPU from any of them. Intel's next generation Ivy Bridge chipset, expected for the spring of 2012, is going to add support for OpenCL to the built-in motherboard GPU. We may eventually see that trickle into the server hardware side of things too. I've never seen a PostgreSQL server capable of running CUDA, and I don't expect that to change. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
On 19 September 2011 15:36, Greg Smith <greg@2ndquadrant.com> wrote: > On 09/19/2011 10:12 AM, Greg Stark wrote: >> >> With the GPU I'm curious to see how well >> it handles multiple processes contending for resources, it might be a >> flashy feature that gets lots of attention but might not really be >> very useful in practice. But it would be very interesting to see. >> > > The main problem here is that the sort of hardware commonly used for > production database servers doesn't have any serious enough GPU to support > CUDA/OpenCL available. The very clear trend now is that all systems other > than gaming ones ship with motherboard graphics chipsets more than powerful > enough for any task but that. I just checked the 5 most popular > configurations of server I see my customers deploy PostgreSQL onto (a mix of > Dell and HP units), and you don't get a serious GPU from any of them. > > Intel's next generation Ivy Bridge chipset, expected for the spring of 2012, > is going to add support for OpenCL to the built-in motherboard GPU. We may > eventually see that trickle into the server hardware side of things too. > > I've never seen a PostgreSQL server capable of running CUDA, and I don't > expect that to change. But couldn't that also be seen as a chicken/egg situation? No-one buys GPUs for database servers because the database won't make use of it, but databases don't implement GPU functionality since database servers don't tend to have GPUs. It's more likely the latter of those two reasonings would have to be the first to budge. But nVidia does produce a non-graphics-oriented GPGPU line called Tesla dedicated to such processing. -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935 EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Sep 19, 2011 at 3:36 PM, Greg Smith <greg@2ndquadrant.com> wrote: > The main problem here is that the sort of hardware commonly used for > production database servers doesn't have any serious enough GPU to support > CUDA/OpenCL available Of course that could change if adding a GPU would help Postgres... I would expect it to help mostly for data warehouse batch query type systems, especially ones with very large i/o subsystems that can saturate the memory bus with sequential i/o. "Run your large batch queries twice as fast by adding a $400 part to your $40,000 server" might be a pretty compelling sales pitch :) That said, to help in the case I described you would have to implement the tapesort algorithm on the GPU as well. I expect someone has implemented heaps for CUDA/OpenCL already though. -- greg
On 19 September 2011 15:54, Greg Stark <stark@mit.edu> wrote: > On Mon, Sep 19, 2011 at 3:36 PM, Greg Smith <greg@2ndquadrant.com> wrote: >> The main problem here is that the sort of hardware commonly used for >> production database servers doesn't have any serious enough GPU to support >> CUDA/OpenCL available > > Of course that could change if adding a GPU would help Postgres... I > would expect it to help mostly for data warehouse batch query type > systems, especially ones with very large i/o subsystems that can > saturate the memory bus with sequential i/o. "Run your large batch > queries twice as fast by adding a $400 part to your $40,000 server" > might be a pretty compelling sales pitch :) > > That said, to help in the case I described you would have to implement > the tapesort algorithm on the GPU as well. I expect someone has > implemented heaps for CUDA/OpenCL already though. I seem to recall a paper on such a thing by Carnegie Mellon University. Can't remember where I saw it though. -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935 EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 19 September 2011 16:10, Thom Brown <thom@linux.com> wrote: > On 19 September 2011 15:54, Greg Stark <stark@mit.edu> wrote: >> On Mon, Sep 19, 2011 at 3:36 PM, Greg Smith <greg@2ndquadrant.com> wrote: >>> The main problem here is that the sort of hardware commonly used for >>> production database servers doesn't have any serious enough GPU to support >>> CUDA/OpenCL available >> >> Of course that could change if adding a GPU would help Postgres... I >> would expect it to help mostly for data warehouse batch query type >> systems, especially ones with very large i/o subsystems that can >> saturate the memory bus with sequential i/o. "Run your large batch >> queries twice as fast by adding a $400 part to your $40,000 server" >> might be a pretty compelling sales pitch :) >> >> That said, to help in the case I described you would have to implement >> the tapesort algorithm on the GPU as well. I expect someone has >> implemented heaps for CUDA/OpenCL already though. > > I seem to recall a paper on such a thing by Carnegie Mellon > University. Can't remember where I saw it though. Found it! http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/ngm/15-823/project/Final.pdf -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935 EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greg Stark <stark@mit.edu> writes: > That said, to help in the case I described you would have to implement > the tapesort algorithm on the GPU as well. I think the real problem would be that we are seldom sorting just the key values. If you have to push the tuples through the GPU too, your savings are going to go up in smoke pretty quickly ... FWIW, I tend to believe a variant of what Greg Stark said upthread: there would surely be some win from reducing the impedance mismatch for comparison functions. In concrete terms, there would be no reason to have tuplesort.c's myFunctionCall2Coll, and maybe not inlineApplySortFunction either, if the datatype-specific comparison functions had APIs that were closer to what sorting wants rather than following the general SQL-callable-function API. And those functions cost a *lot* more than a one-instruction comparison does. But it's very much more of a stretch to believe that inlining per se is going to do much for us, and even more of a stretch to believe that getting a separate processor involved is going to be a win. regards, tom lane
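Roughly, the mismatch Tom describes looks like this (a sketch with hypothetical stand-in types, not the real fmgr or tuplesort APIs): the generic path marshals both datums, null flags, and the collation into a call structure for every single comparison, while a sort-oriented comparator API could be a bare C call.

#include <cstdint>

typedef intptr_t Datum;   // stand-in for the real Datum

// Generic SQL-callable style: per-comparison marshalling overhead.
struct CallInfo {
    Datum    args[2];
    bool     argnull[2];
    uint32_t collation;   // collation rides along on every call
};

static int generic_int4cmp(CallInfo *fcinfo)
{
    int32_t a = (int32_t) fcinfo->args[0];
    int32_t b = (int32_t) fcinfo->args[1];
    return (a > b) - (a < b);
}

// Sort-oriented style: no marshalling, just the comparison itself.
static int direct_int4cmp(Datum a, Datum b)
{
    return ((int32_t) a > (int32_t) b) - ((int32_t) a < (int32_t) b);
}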
2011/9/19 Thom Brown <thom@linux.com> > Is your aim to have this committed into core PostgreSQL, or just for > your own version? If it's the former, I don't anticipate any > enthusiasm from the hacker community. This is a research thesis and I'm not confident enough to commit it to core just by myself. I will, however, release the source, and I believe it will open the way for future work to be committed to core PostgreSQL. 2011/9/19 Greg Stark <stark@mit.edu> > Of course that could change if adding a GPU would help Postgres... I > would expect it to help mostly for data warehouse batch query type > systems, especially ones with very large i/o subsystems that can > saturate the memory bus with sequential i/o. "Run your large batch > queries twice as fast by adding a $400 part to your $40,000 server" > might be a pretty compelling sales pitch :) My focus is also energy proportionality. If you add a GPU, you will increase the power consumption by about 2 times, but perhaps increase the efficiency much more. > That said, to help in the case I described you would have to implement > the tapesort algorithm on the GPU as well. I expect someone has > implemented heaps for CUDA/OpenCL already though. For now, I'm planning to implement just the in-memory sort, for simplicity and to see if it would give a real performance gain. 2011/9/19 Greg Stark <stark@mit.edu>: > In which case you could call a specialized qsort which > implements that comparator inlined instead of calling the standard > function. Actually I'm now trying to make a custom comparator for integers, but I haven't made great progress. If this works, I'll port it to the GPU and start working on the next comparators, such as float, then strings, in an incremental way. 2011/9/19 Thom Brown <thom@linux.com>: > Found it! http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/ngm/15-823/project/Final.pdf This is really great work, and I'm basing mine on it. But it's implemented using OpenGL (yes, not OpenCL), and therefore has a lot of limitations. I also tried to contact naju but didn't get any answer. Vítor Uwe Reus
On Mon, Sep 19, 2011 at 10:36 AM, Greg Smith <greg@2ndquadrant.com> wrote: > Intel's next generation Ivy Bridge chipset, expected for the spring of 2012, > is going to add support for OpenCL to the built-in motherboard GPU. We may > eventually see that trickle into the server hardware side of things too. Note that Amazon's EC2 offerings include a configuration with a pair of GPUs. Whether or not this continues has a certain "chicken and egg" aspect to it... - I'm glad that Amazon is selling such a configuration, as it does give folks the option of trying it out. - Presumably, it will only continue on their product list if customers do more than merely "trying it out." I think I'd be shocked if PostgreSQL offered much support for such a configuration in the next year; despite there being some work ongoing, drawing the functionality into core would require Core decisions that I'd be surprised to see so quickly. Unfortunately, that may be slow enough progress that PostgreSQL won't be contributing to the would-be success of the technology. If this kind of GPU usage fails to attract much interest, then it's probably a good thing that we're not committed to it. But if other uses lead to it taking off, then we'll doubtless get a lot of noise on lists about a year from now to the effect "Why don't you have this in core yet? Not 3773t enough!?!?" Having a bit of progress taking place now would probably be good timing, in case it *does* take off... -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
On Mon, Sep 19, 2011 at 7:11 AM, Vitor Reus <vitor.reus@gmail.com> wrote: > Hello everyone, > > I'm implementing a CUDA based sorting on PostgreSQL, and I believe it > can improve the ORDER BY statement performance by 4 to 10 times. I > already have a generic CUDA sort that performs around 10 times faster > than std qsort. I also managed to load CUDA into pgsql. NVIDIA cards are not as good as ATI cards. ATI cards are much faster with integer operations, and should be ideal for sorting transaction ids or similar numbers (unless you are going to sort prices stored as float, where ATI still beats NVIDIA, but not by that much). Another problem you have to deal with is PCI Express speed. Transfer is very slow compared to RAM. You will have to add more GPUs to match the performance, and this will increase solution cost. There was a sorting algorithm for 4 CPU cores that was beating sort on a GTX 285 (I don't have the link, sorry), so CPUs are not as bad at sorting as you might think. AMD is already working on embedding GPUs into the motherboard; if I am not mistaken there are already some of them on the market available for purchase. Anyone who uses a tiny embedded ATI for sorting problems with integers will outperform your NVIDIA based PCI-Express connected GPU with CUDA, because basically your algorithm will waste a lot of time transferring data to the GPU and getting it back. But if you use an embedded ATI GPU, you can also use SSE registers on each CPU core to add more performance to your algorithm. It is not going to be a very hardware-compatible solution, but if you want good speed/cost, this should be the best solution. I recommend doing some bandwidth benchmark tests before you start coding. Regards Nulik -- ================================== The power of zero is infinite
2011/9/19 Nulik Nol <nuliknol@gmail.com>: > On Mon, Sep 19, 2011 at 7:11 AM, Vitor Reus <vitor.reus@gmail.com> wrote: > I recommend doing some bandwidth benchmark tests before you start coding. I already did some benchmarks with GPU sorting (not in pgsql), and measured total sort times, copy bandwidth and energy usage, and got some exciting results: I got around 1GB/s bandwidth with a GeForce GT 430 on a MS-9803 MB. The power increase ratio was 2.75 times on a Core 2 Duo T8300 when adding the GT 430: http://tinyurl.com/6h7cgv2 The sorting time performance increases when you have more data, but on average it is 7.8 times faster than the CPU: http://tinyurl.com/6c95dc2
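For anyone who wants to reproduce that kind of measurement, here is a minimal sketch of a host-to-device bandwidth test using CUDA events (the 256MB buffer size and the use of pinned memory are arbitrary choices, and error checking is omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 256 << 20;   // 256MB test buffer
    void *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);    // pinned host memory copies faster
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time a single host->device copy with events on the default stream.
    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("host->device: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}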
* Thom Brown (thom@linux.com) wrote: > But nVidia does produce a non-graphics-oriented GPGPU line called > Tesla dedicated to such processing. Just as a side-note, I've got a couple of Teslas that aren't doing terribly much at the moment, and they're in a Linux 'server'-type box from Penguin Computing. I could certainly install PG on it and run some tests- if someone's written the code and provides the tests. I agree that it'd be interesting to do, but I share Lord Stark's feelings about the challenges and lack of potential gain- it's a very small set of queries that would benefit from this. You need to be working with enough data to make the cost of transferring it all over to the GPU worthwhile, just for starters. Thanks, Stephen
On 09/19/2011 10:53 AM, Thom Brown wrote: > But couldn't that also be seen as a chicken/egg situation? The chicken/egg problem here is a bit deeper than just "no one offers GPUs because no one wants them" on server systems. One of the reasons there aren't more GPUs in typical database server configurations is that you're already filling up some number of the full size slots, and correspondingly the bandwidth available to cards, with disk controllers. It doesn't help that many server class motherboards don't even have a x16 PCI-e slot on them, which is what most GPUs as delivered on regular consumer video cards are optimized for. > But nVidia does produce a non-graphics-oriented GPGPU line called > Tesla dedicated to such processing. > Tesla units start at around $1500 USD, which is a nice budget to spend on either more RAM (to allow higher work_mem), faster storage to store temporary files onto, or a faster CPU to chew through all sorts of tasks more quickly. The Tesla units are easy to justify if you have a serious GPU-oriented application. The good bang for the buck point with GPU sorting for PostgreSQL is probably going to be a $50-$100 video card instead. For example, the card Vitor is seeing good results on costs around $60. (That's also a system with fairly slow RAM, though; it will be interesting to see if the gain holds up on newer systems.) -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
On Sep 19, 2011, at 5:16 PM, Tom Lane wrote: > Greg Stark <stark@mit.edu> writes: >> That said, to help in the case I described you would have to implement >> the tapesort algorithm on the GPU as well. > > I think the real problem would be that we are seldom sorting just the > key values. If you have to push the tuples through the GPU too, your > savings are going to go up in smoke pretty quickly … > i would argue along a similar line. to make GPU code fast it has to be pretty much tailored to do exactly one thing - otherwise you have no chance to get anywhere close to card bandwidth. if you look at "two similar" GPU codes which seem to do the same thing you might easily see that one is 10 times faster than the other - for bloody reasons such as memory alignment, memory transaction size or whatever. this opens a bit of a problem: PostgreSQL sorting is so generic and so flexible that i would be really surprised if somebody could come up with a solution which really comes close to what the GPU can do. it would definitely be interesting to see a prototype, however. btw, there is a handful of interesting talks / lectures about GPU programming provided by the university of chicago (just cannot find the link atm). regards, hans -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt, Austria Web: http://www.postgresql-support.de
2011/9/19 Greg Smith <greg@2ndquadrant.com>: > On 09/19/2011 10:53 AM, Thom Brown wrote: >> >> But couldn't that also be seen as a chicken/egg situation? > > > The chicken/egg problem here is a bit deeper than just "no one offers GPUs > because no one wants them" on server systems. One of the reasons there > aren't more GPUs in typical database server configurations is that you're > already filling up some number of the full size slots, and correspondingly > the bandwidth available to cards, with disk controllers. It doesn't help > that many server class motherboards don't even have a x16 PCI-e slot on > them, which is what most GPUs as delivered on regular consumer video cards > are optimized for. > Sandy Bridge and Ivy Bridge Intel series are CPU/GPU. I don't know how using the GPU affects the CPU part but it might be interesting to explore... -- Cédric Villemain +33 (0)6 20 30 22 52 http://2ndQuadrant.fr/ PostgreSQL: Support 24x7 - Développement, Expertise et Formation
On Sep 19, 2011, at 19:46, Stephen Frost wrote: > I agree that it'd be interesting to do, but I share Lord Stark's > feelings about the challenges and lack of potential gain- it's a very > small set of queries that would benefit from this. You need to be > working with enough data to make the cost of transferring it all over to > the GPU worthwhile, just for starters. I wonder if anyone has ever tried to employ a GPU for more low-level tasks. Things like sorting or hashing are hard to move to the GPU in postgres because, in the general case, they involve essentially arbitrary user-defined functions. But couldn't, for example, the WAL CRC computation be moved to a GPU? Or, to get really crazy, even the search for the optimal join order (only for a large number of joins though, i.e. where we currently switch to a genetic algorithm)? best regards, Florian Pflug
> I already did some benchmarks with GPU sorting (not in pgsql), and > measured total sort times, copy bandwidth and energy usage, and got > some exciting results: Was that qsort implementation on the CPU cache-friendly and optimized for SSE? To make a fair comparison you have to take the best CPU implementation and compare it to the best GPU implementation. Because if not, you are comparing a full-throttle GPU vs a lazy CPU. Check this paper on how hash join was optimized 17x when SSE instructions were used: www.vldb.org/pvldb/2/vldb09-257.pdf Regards -- ================================== The power of zero is infinite
On Mon, 2011-09-19 at 15:12 +0100, Greg Stark wrote: > On Mon, Sep 19, 2011 at 1:11 PM, Vitor Reus <vitor.reus@gmail.com> wrote: > > Since I'm new to pgsql development, I replaced the code of pgsql > > qsort_arg to get familiar with the way postgres does the sort. The problem > > is that I can't use the qsort_arg_comparator comparator function on > > the GPU; I need to implement my own. I couldn't find out how to access the > > sorting key value data of the tuples in the Tuplesortstate or > > SortTuple structures. This part looks complicated because it seems the > > state holds the pointer for the scanner(?), but I didn't manage to > > access the values directly. Can anyone tell me how this works? .... > With the GPU I'm curious to see how well > it handles multiple processes contending for resources, it might be a > flashy feature that gets lots of attention but might not really be > very useful in practice. But it would be very interesting to see. There are cases where concurrency may not be that important, like some specialized OLAP loads where you have to sort, for example when finding a median in large data sets. -- ------- Hannu Krosing PostgreSQL Unlimited Scalability and Performance Consultant 2ndQuadrant Nordic PG Admin Book: http://www.2ndQuadrant.com/books/
On Mon, 2011-09-19 at 10:36 -0400, Greg Smith wrote: > On 09/19/2011 10:12 AM, Greg Stark wrote: > > With the GPU I'm curious to see how well > > it handles multiple processes contending for resources, it might be a > > flashy feature that gets lots of attention but might not really be > > very useful in practice. But it would be very interesting to see. > > > > The main problem here is that the sort of hardware commonly used for > production database servers doesn't have any serious enough GPU to > support CUDA/OpenCL available. The very clear trend now is that all > systems other than gaming ones ship with motherboard graphics chipsets > more than powerful enough for any task but that. I just checked the 5 > most popular configurations of server I see my customers deploy > PostgreSQL onto (a mix of Dell and HP units), and you don't get a > serious GPU from any of them. > > Intel's next generation Ivy Bridge chipset, expected for the spring of > 2012, is going to add support for OpenCL to the built-in motherboard > GPU. We may eventually see that trickle into the server hardware side > of things too. > > I've never seen a PostgreSQL server capable of running CUDA, and I don't > expect that to change. CUDA sorting could be beneficial on general server hardware if it can run well on multiple CPUs in parallel. GPUs being in essence parallel processors on fast shared memory, it may be that even on ordinary RAM and lots of CPUs some CUDA algorithms are a significant win. and then there is a non-graphics GPU available on EC2 Cluster GPU Quadruple Extra Large Instance 22 GB of memory 33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core “Nehalem” architecture) 2 x NVIDIA Tesla “Fermi” M2050 GPUs 1690 GB of instance storage 64-bit platform I/O Performance: Very High (10 Gigabit Ethernet) API name: cg1.4xlarge It costs $2.10 per hour, probably a lot less if you use the Spot Instances. > -- > Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD > PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us > >
Hey hackers,
I'm still having problems reading the values of the columns in tuplesort.c, in order to understand how to port this to CUDA.
Should I use the heap_getattr macro to read them?
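Something like this is what I have in mind (an untested sketch; the TupleDesc and attribute number would come from the Tuplesortstate, and the header locations vary between Postgres versions):

#include "postgres.h"
#include "access/htup.h"   /* heap_getattr, in Postgres of this era */

/* Fetch one attribute of a heap tuple the way the comparison
 * routines do; heap_getattr handles nulls and attribute offsets. */
static Datum
fetch_sort_key(HeapTuple tup, TupleDesc tupDesc, AttrNumber attno,
               bool *isnull)
{
    return heap_getattr(tup, attno, tupDesc, isnull);
}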
2011/9/24 Hannu Krosing <hannu@krosing.net>
On Mon, 2011-09-19 at 10:36 -0400, Greg Smith wrote:
> On 09/19/2011 10:12 AM, Greg Stark wrote:
> > With the GPU I'm curious to see how well
> > it handles multiple processes contending for resources, it might be a
> > flashy feature that gets lots of attention but might not really be
> > very useful in practice. But it would be very interesting to see.
> >
>
> The main problem here is that the sort of hardware commonly used for
> production database servers doesn't have any serious enough GPU to
> support CUDA/OpenCL available. The very clear trend now is that all
> systems other than gaming ones ship with motherboard graphics chipsets
> more than powerful enough for any task but that. I just checked the 5
> most popular configurations of server I see my customers deploy
> PostgreSQL onto (a mix of Dell and HP units), and you don't get a
> serious GPU from any of them.
>
> Intel's next generation Ivy Bridge chipset, expected for the spring of
> 2012, is going to add support for OpenCL to the built-in motherboard
> GPU. We may eventually see that trickle into the server hardware side
> of things too.
>
> I've never seen a PostgreSQL server capable of running CUDA, and I don't
> expect that to change.
CUDA sorting could be beneficial on general server hardware if it can
run well on multiple CPUs in parallel. GPUs being in essence parallel
processors on fast shared memory, it may be that even on ordinary RAM
and lots of CPUs some CUDA algorithms are a significant win.
and then there is a non-graphics GPU available on EC2
Cluster GPU Quadruple Extra Large Instance
22 GB of memory
33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core “Nehalem”
architecture)
2 x NVIDIA Tesla “Fermi” M2050 GPUs
1690 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
API name: cg1.4xlarge
It costs $2.10 per hour, probably a lot less if you use the Spot
Instances.
> --
> Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
> PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
>
>
On 19/09/2011 21:41, PostgreSQL - Hans-Jürgen Schönig wrote: > > On Sep 19, 2011, at 5:16 PM, Tom Lane wrote: > >> Greg Stark<stark@mit.edu> writes: >>> That said, to help in the case I described you would have to implement >>> the tapesort algorithm on the GPU as well. >> >> I think the real problem would be that we are seldom sorting just the >> key values. If you have to push the tuples through the GPU too, your >> savings are going to go up in smoke pretty quickly … >> > > > i would argue along a similar line. > to make GPU code fast it has to be pretty much tailored to do exactly one thing - otherwise you have no chance to get anywhere close to card bandwidth. > if you look at "two similar" GPU codes which seem to do the same thing you might easily see that one is 10 times faster than the other - for bloody reasons such as memory alignment, memory transaction size or whatever. > this opens a bit of a problem: PostgreSQL sorting is so generic and so flexible that i would be really surprised if somebody could come up with a solution which really comes close to what the GPU can do. > it would definitely be interesting to see a prototype, however. The Thrust Nvidia library provides the same sorting flexibility as postgres does. // generate 32M random numbers on the host thrust::host_vector<int> h_vec(32 << 20); thrust::generate(h_vec.begin(), h_vec.end(), rand); // transfer data to the device thrust::device_vector<int> d_vec = h_vec; // sort data on the device (846M keys per second on GeForce GTX 480) thrust::sort(d_vec.begin(), d_vec.end()); // transfer data back to host thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin()); As you can see, the type to be ordered is a template parameter, and thrust::sort also has a version that takes the comparator to use. So compared with pg_qsort, thrust::sort gives you the same flexibility. http://docs.thrust.googlecode.com/hg/group__sorting.html Regards Gaetano Mendola
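For instance (a sketch added for illustration, not from the original mail), the comparator-taking overload looks like this; "descending" plays the role a user-supplied qsort_arg comparator plays on the CPU:

#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Hypothetical user-supplied ordering: a strict weak ordering that
// returns true when a should sort before b, here simply descending.
struct descending
{
    __host__ __device__
    bool operator()(int a, int b) const { return a > b; }
};

void sort_desc(thrust::device_vector<int> &d_vec)
{
    // same call as above, plus the comparator
    thrust::sort(d_vec.begin(), d_vec.end(), descending());
}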
On 19/09/2011 16:36, Greg Smith wrote: > On 09/19/2011 10:12 AM, Greg Stark wrote: >> With the GPU I'm curious to see how well >> it handles multiple processes contending for resources, it might be a >> flashy feature that gets lots of attention but might not really be >> very useful in practice. But it would be very interesting to see. > > The main problem here is that the sort of hardware commonly used for > production database servers doesn't have any serious enough GPU to > support CUDA/OpenCL available. The very clear trend now is that all > systems other than gaming ones ship with motherboard graphics chipsets > more than powerful enough for any task but that. I just checked the 5 > most popular configurations of server I see my customers deploy > PostgreSQL onto (a mix of Dell and HP units), and you don't get a > serious GPU from any of them. > > Intel's next generation Ivy Bridge chipset, expected for the spring of > 2012, is going to add support for OpenCL to the built-in motherboard > GPU. We may eventually see that trickle into the server hardware side of > things too. The trend is to have servers capable of running CUDA by providing GPUs via external hardware (PCI Express interface with PCI Express switches); look for example at the PowerEdge C410x PCIe Expansion Chassis from DELL. I did some experiments timing the sort done with CUDA and the sort done with pg_qsort: CUDA pg_qsort 33 million integers: ~ 900 ms, ~ 6000 ms 1 million integers: ~ 21 ms, ~ 162 ms 100k integers: ~ 2 ms, ~ 13 ms The CUDA times already include the copy operations (host->device, device->host). As GPU I was using a C2050, and the CPU doing the pg_qsort was an Intel(R) Xeon(R) CPU X5650 @ 2.67GHz. Copy operations and kernel runs (the sort for instance) can run in parallel, so while you are sorting a batch of data, you can copy the next batch in parallel. As you can see the boost is not negligible. The next Nvidia hardware (Kepler family) is PCI Express 3 ready, so expect in the near future the "bottleneck" of the device->host->device copies to have less impact. I strongly believe there is room to provide modern database engines with a way to offload sorts to the GPU. > I've never seen a PostgreSQL server capable of running CUDA, and I > don't expect that to change. That sounds like: "I think there is a world market for maybe five computers." - IBM Chairman Thomas Watson, 1943 Regards Gaetano Mendola
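A sketch of that copy/compute overlap (added for illustration; stream-aware Thrust execution policies actually arrived in a later Thrust release, and the pinned host buffer and batch layout are assumptions):

#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>

// Sort nbatches batches of batch_len ints, copying one batch in while
// another sorts, by alternating between two streams and device buffers.
// h_data must be pinned (cudaMallocHost) for the async copies to overlap.
void sort_batches(int *h_data, int nbatches, int batch_len)
{
    cudaStream_t stream[2];
    int *d_buf[2];
    for (int s = 0; s < 2; s++) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc(&d_buf[s], batch_len * sizeof(int));
    }
    for (int i = 0; i < nbatches; i++) {
        int s = i % 2;
        int *h_batch = h_data + (size_t) i * batch_len;
        cudaMemcpyAsync(d_buf[s], h_batch, batch_len * sizeof(int),
                        cudaMemcpyHostToDevice, stream[s]);
        thrust::device_ptr<int> p(d_buf[s]);
        thrust::sort(thrust::cuda::par.on(stream[s]), p, p + batch_len);
        cudaMemcpyAsync(h_batch, d_buf[s], batch_len * sizeof(int),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < 2; s++) {
        cudaFree(d_buf[s]);
        cudaStreamDestroy(stream[s]);
    }
}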
I'm wondering if CUDA will win in geometry operations, for example, testing point <@ complex_polygon Oleg On Sun, 12 Feb 2012, Gaetano Mendola wrote: > On 19/09/2011 16:36, Greg Smith wrote: >> On 09/19/2011 10:12 AM, Greg Stark wrote: >>> With the GPU I'm curious to see how well >>> it handles multiple processes contending for resources, it might be a >>> flashy feature that gets lots of attention but might not really be >>> very useful in practice. But it would be very interesting to see. >> >> The main problem here is that the sort of hardware commonly used for >> production database servers doesn't have any serious enough GPU to >> support CUDA/OpenCL available. The very clear trend now is that all >> systems other than gaming ones ship with motherboard graphics chipsets >> more than powerful enough for any task but that. I just checked the 5 >> most popular configurations of server I see my customers deploy >> PostgreSQL onto (a mix of Dell and HP units), and you don't get a >> serious GPU from any of them. >> >> Intel's next generation Ivy Bridge chipset, expected for the spring of >> 2012, is going to add support for OpenCL to the built-in motherboard >> GPU. We may eventually see that trickle into the server hardware side of >> things too. > > > The trend is to have servers capable of running CUDA by providing GPUs via > external hardware (PCI Express interface with PCI Express switches); look for > example at the PowerEdge C410x PCIe Expansion Chassis from DELL. > > I did some experiments timing the sort done with CUDA and the sort done with > pg_qsort: > CUDA pg_qsort > 33 million integers: ~ 900 ms, ~ 6000 ms > 1 million integers: ~ 21 ms, ~ 162 ms > 100k integers: ~ 2 ms, ~ 13 ms > > The CUDA times already include the copy operations (host->device, device->host). > > As GPU I was using a C2050, and the CPU doing the pg_qsort was an Intel(R) > Xeon(R) CPU X5650 @ 2.67GHz. > > Copy operations and kernel runs (the sort for instance) can run in parallel, > so while you are sorting a batch of data, you can copy the next batch in > parallel. > > As you can see the boost is not negligible. > > The next Nvidia hardware (Kepler family) is PCI Express 3 ready, so expect in > the near future the "bottleneck" of the device->host->device copies to have > less impact. > > I strongly believe there is room to provide modern database engines with > a way to offload sorts to the GPU. > >> I've never seen a PostgreSQL server capable of running CUDA, and I >> don't expect that to change. > > That sounds like: > > "I think there is a world market for maybe five computers." > - IBM Chairman Thomas Watson, 1943 > > Regards > Gaetano Mendola > > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
On 12/02/2012 13:13, Oleg Bartunov wrote: > I'm wondering if CUDA will win in geometry operations, for example, > testing point <@ complex_polygon I'm not sure if the algorithm you mentioned can be implemented in terms of vector algebra, blas, etc. There are plenty of geometry operations implemented in CUDA out there; my field of CUDA application is not this one, so I'm not that deep into it. However I can point you to the official NVIDIA npp library that provides vector algebra algorithms, and some geometry algorithms as well. http://developer.download.nvidia.com/compute/DevZone/docs/html/CUDALibraries/doc/NPP_Library.pdf (take a look at around page 620). Regards Gaetano Mendola > Oleg > On Sun, 12 Feb 2012, Gaetano Mendola wrote: >> On 19/09/2011 16:36, Greg Smith wrote: >>> On 09/19/2011 10:12 AM, Greg Stark wrote: >>>> With the GPU I'm curious to see how well >>>> it handles multiple processes contending for resources, it might be a >>>> flashy feature that gets lots of attention but might not really be >>>> very useful in practice. But it would be very interesting to see. >>> >>> The main problem here is that the sort of hardware commonly used for >>> production database servers doesn't have any serious enough GPU to >>> support CUDA/OpenCL available. The very clear trend now is that all >>> systems other than gaming ones ship with motherboard graphics chipsets >>> more than powerful enough for any task but that. I just checked the 5 >>> most popular configurations of server I see my customers deploy >>> PostgreSQL onto (a mix of Dell and HP units), and you don't get a >>> serious GPU from any of them. >>> >>> Intel's next generation Ivy Bridge chipset, expected for the spring of >>> 2012, is going to add support for OpenCL to the built-in motherboard >>> GPU. We may eventually see that trickle into the server hardware side of >>> things too. >> >> >> The trend is to have servers capable of running CUDA by providing GPUs via >> external hardware (PCI Express interface with PCI Express switches), >> look for example at the PowerEdge C410x PCIe Expansion Chassis from DELL. >> >> I did some experiments timing the sort done with CUDA and the sort >> done with pg_qsort: >> CUDA pg_qsort >> 33 million integers: ~ 900 ms, ~ 6000 ms >> 1 million integers: ~ 21 ms, ~ 162 ms >> 100k integers: ~ 2 ms, ~ 13 ms >> >> The CUDA times already include the copy operations (host->device, >> device->host). >> >> As GPU I was using a C2050, and the CPU doing the pg_qsort was an >> Intel(R) Xeon(R) CPU X5650 @ 2.67GHz. >> >> Copy operations and kernel runs (the sort for instance) can run in >> parallel, so while you are sorting a batch of data, you can copy the >> next batch in parallel. >> >> As you can see the boost is not negligible. >> >> The next Nvidia hardware (Kepler family) is PCI Express 3 ready, so >> expect in the near future the "bottleneck" of the >> device->host->device copies to have less impact. >> >> I strongly believe there is room to provide modern database engines with >> a way to offload sorts to the GPU. >> >>> I've never seen a PostgreSQL server capable of running CUDA, and I >>> don't expect that to change. >> >> That sounds like: >> >> "I think there is a world market for maybe five computers."
>> - IBM Chairman Thomas Watson, 1943 >> >> Regards >> Gaetano Mendola >> >> >> > > Regards, > Oleg > _____________________________________________________________ > Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), > Sternberg Astronomical Institute, Moscow University, Russia > Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > phone: +007(495)939-16-83, +007(495)939-23-83 >
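As a sketch of what the point <@ complex_polygon test could look like on the GPU (added for illustration, not from the thread): one thread per point, each running the classic crossing-number (PNPOLY) test against the polygon.

#include <cuda_runtime.h>

// Each thread tests one point against the polygon by counting how many
// polygon edges a horizontal ray from the point crosses (odd = inside).
__global__ void point_in_polygon(const float2 *pts, int npts,
                                 const float2 *poly, int nverts,
                                 char *inside)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= npts)
        return;
    float2 p = pts[i];
    bool in = false;
    for (int j = 0, k = nverts - 1; j < nverts; k = j++) {
        if (((poly[j].y > p.y) != (poly[k].y > p.y)) &&
            (p.x < (poly[k].x - poly[j].x) * (p.y - poly[j].y) /
                   (poly[k].y - poly[j].y) + poly[j].x))
            in = !in;
    }
    inside[i] = in;
}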
On 02/11/2012 08:14 PM, Gaetano Mendola wrote: > The trend is to have servers capable of running CUDA by providing GPUs via > external hardware (PCI Express interface with PCI Express switches), > look for example at the PowerEdge C410x PCIe Expansion Chassis from DELL. The C410X adds 16 PCIe slots to a server, housed inside a separate 3U enclosure. That's a completely sensible purchase if your goal is to build a computing cluster, where a lot of work is handed off to a set of GPUs. I think that's even less likely to be a cost-effective option for a database server. Adding a single dedicated GPU installed in a server to accelerate sorting is something that might be justifiable, based on your benchmarks. This is a much more expensive option than that though. Details at http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for anyone who wants to see just how big this external box is. > I did some experiments timing the sort done with CUDA and the sort > done with pg_qsort: > CUDA pg_qsort > 33 million integers: ~ 900 ms, ~ 6000 ms > 1 million integers: ~ 21 ms, ~ 162 ms > 100k integers: ~ 2 ms, ~ 13 ms > The CUDA times already include the copy operations (host->device, > device->host). > As GPU I was using a C2050, and the CPU doing the pg_qsort was an > Intel(R) Xeon(R) CPU X5650 @ 2.67GHz. That's really interesting, and the X5650 is by no means a slow CPU. So this benchmark is providing a lot of CPU power yet still seeing over a 6X speedup in sort times. It sounds like the PCI Express bus has gotten fast enough that the time to hand data over and get it back again can easily be justified for medium to large sized sorts. It would be helpful to take this patch and confirm whether it scales when used in parallel. The easiest way to do that would be to use the pgbench "-f" feature, which allows running an arbitrary number of some query at once. Seeing whether this acceleration continued to hold as the number of clients increases is a useful data point. Is it possible for you to break down where the time is being spent? For example, how much of this time is consumed in the GPU itself, compared to time spent transferring data between CPU and GPU? I'm also curious where the bottleneck is at with this approach. If it's the speed of the PCI-E bus for smaller data sets, adding more GPUs may never be practical. If the bus can handle quite a few of these at once before it saturates, it might be possible to overload a single GPU. That seems like it would be really hard to reach for database sorting though; I can't really justify my gut feel for that being true, though. > > I've never seen a PostgreSQL server capable of running CUDA, and I > > don't expect that to change. > That sounds like: > "I think there is a world market for maybe five computers." > - IBM Chairman Thomas Watson, 1943 Yes, and "640K will be enough for everyone", ha ha. (Having said the 640K thing is flat out denied by Gates, BTW, and no one has come up with proof otherwise). I think you've made an interesting case for this sort of acceleration now being useful for systems doing what's typically considered a data warehouse task. I regularly see servers waiting for far more than 13M integers to sort. And I am seeing a clear trend toward providing more PCI-E slots in servers now. Dell's R810 is the most popular single server model my customers have deployed in the last year, and it has 5 X8 slots in it. It's rare all 5 of those are filled.
As long as a dedicated GPU works fine when dropped to X8 speeds, I know a fair number of systems where one of those could be added now. There's another data point in your favor I didn't notice before your last e-mail. Amazon has a "Cluster GPU Quadruple Extra Large" node type that runs with NVIDIA Tesla hardware. That means the installed base of people who could consider CUDA is higher than I expected. To demonstrate how much that costs, to provision a GPU enabled reserved instance from Amazon for one year costs $2410 at "Light Utilization", giving a system with 22GB of RAM and 1690GB of storage. (I find the reserved prices easier to compare with dedicated hardware than the hourly ones) That's halfway between the High-Memory Double Extra Large Instance (34GB RAM/850GB disk) at $1100 and the High-Memory Quadruple Extra Large Instance (64GB RAM/1690GB disk) at $2200. If someone could prove sorting was a bottleneck on their server, that isn't an unreasonable option to consider on a cloud-based database deployment. I still think that an approach based on OpenCL is more likely to be suitable for PostgreSQL, which was part of why I gave CUDA low odds here. The points in favor of OpenCL are: -Since you last posted, OpenCL compiling has switched to using LLVM as their standard compiler. Good PostgreSQL support for LLVM isn't far away. It looks to me like the compiler situation for CUDA requires their PathScale based compiler. I don't know enough about this area to say which compiling tool chain will end up being easier to deal with. -Intel is making GPU support standard for OpenCL, as I mentioned before. NVIDIA will be hard pressed to compete with Intel for GPU acceleration once more systems supporting that enter the market. -Easy availability of OpenCL on Mac OS X for development's sake. Lots of Postgres hackers with OS X systems, even though there aren't too many OS X database servers. The fact that Amazon provides a way to crack the chicken/egg hardware problem immediately helps a lot though; I don't even need a physical card here to test CUDA GPU acceleration on Linux now. With that data point, your benchmarks are good enough to say I'd be willing to help review a patch in this area here as part of the 9.3 development cycle. That may validate that GPU acceleration is useful, and then the next step would be considering how portable that will be to other GPU interfaces. I still expect CUDA will be looked back on as a dead end for GPU accelerated computing one day. Computing history is not filled with many single-vendor standards that competed successfully against Intel providing the same thing. AMD's x86-64 is the only example I can think of where Intel didn't win that sort of race, which happened (IMHO) only because Intel's Itanium failed to prioritize backwards compatibility highly enough. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
2012/2/13 Greg Smith <greg@2ndquadrant.com>: > On 02/11/2012 08:14 PM, Gaetano Mendola wrote: >> >> The trend is to have servers capable of running CUDA by providing GPUs via >> external hardware (PCI Express interface with PCI Express switches), look >> for example at the PowerEdge C410x PCIe Expansion Chassis from DELL. > > > The C410X adds 16 PCIe slots to a server, housed inside a separate 3U > enclosure. That's a completely sensible purchase if your goal is to build a > computing cluster, where a lot of work is handed off to a set of GPUs. I > think that's even less likely to be a cost-effective option for a database > server. Adding a single dedicated GPU installed in a server to accelerate > sorting is something that might be justifiable, based on your benchmarks. > This is a much more expensive option than that though. Details at > http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for anyone who wants > to see just how big this external box is. > > >> I did some experiments timing the sort done with CUDA and the sort done >> with pg_qsort: >> CUDA pg_qsort >> 33 million integers: ~ 900 ms, ~ 6000 ms >> 1 million integers: ~ 21 ms, ~ 162 ms >> 100k integers: ~ 2 ms, ~ 13 ms >> The CUDA times already include the copy operations (host->device, device->host). >> As GPU I was using a C2050, and the CPU doing the pg_qsort was an Intel(R) >> Xeon(R) CPU X5650 @ 2.67GHz. > > > That's really interesting, and the X5650 is by no means a slow CPU. So this > benchmark is providing a lot of CPU power yet still seeing over a 6X speedup > in sort times. It sounds like the PCI Express bus has gotten fast enough > that the time to hand data over and get it back again can easily be > justified for medium to large sized sorts. > > It would be helpful to take this patch and confirm whether it scales when > used in parallel. The easiest way to do that would be to use the pgbench "-f" > feature, which allows running an arbitrary number of some query at once. > Seeing whether this acceleration continued to hold as the number of clients > increases is a useful data point. > > Is it possible for you to break down where the time is being spent? For > example, how much of this time is consumed in the GPU itself, compared to > time spent transferring data between CPU and GPU? I'm also curious where > the bottleneck is at with this approach. If it's the speed of the PCI-E bus > for smaller data sets, adding more GPUs may never be practical. If the bus > can handle quite a few of these at once before it saturates, it might be > possible to overload a single GPU. That seems like it would be really hard > to reach for database sorting though; I can't really justify my gut > feel for that being true, though. > > >> > I've never seen a PostgreSQL server capable of running CUDA, and I >> > don't expect that to change. >> >> That sounds like: >> >> "I think there is a world market for maybe five computers." >> - IBM Chairman Thomas Watson, 1943 > > > Yes, and "640K will be enough for everyone", ha ha. (Having said the 640K > thing is flat out denied by Gates, BTW, and no one has come up with proof > otherwise). > > I think you've made an interesting case for this sort of acceleration now > being useful for systems doing what's typically considered a data warehouse > task. I regularly see servers waiting for far more than 13M integers to > sort. And I am seeing a clear trend toward providing more PCI-E slots in > servers now.
Dell's R810 is the most popular single server model my > customers have deployed in the last year, and it has 5 X8 slots in it. It's > rare all 5 of those are filled. As long as a dedicated GPU works fine when > dropped to X8 speeds, I know a fair number of systems where one of those > could be added now. > > There's another data point in your favor I didn't notice before your last > e-mail. Amazon has a "Cluster GPU Quadruple Extra Large" node type that > runs with NVIDIA Tesla hardware. That means the installed base of people > who could consider CUDA is higher than I expected. To demonstrate how much > that costs, to provision a GPU enabled reserved instance from Amazon for one > year costs $2410 at "Light Utilization", giving a system with 22GB of RAM > and 1690GB of storage. (I find the reserved prices easier to compare with > dedicated hardware than the hourly ones) That's halfway between the > High-Memory Double Extra Large Instance (34GB RAM/850GB disk) at $1100 and > the High-Memory Quadruple Extra Large Instance (64GB RAM/1690GB disk) at > $2200. If someone could prove sorting was a bottleneck on their server, > that isn't an unreasonable option to consider on a cloud-based database > deployment. > > I still think that an approach based on OpenCL is more likely to be suitable > for PostgreSQL, which was part of why I gave CUDA low odds here. The points > in favor of OpenCL are: > > -Since you last posted, OpenCL compiling has switched to using LLVM as their > standard compiler. Good PostgreSQL support for LLVM isn't far away. It > looks to me like the compiler situation for CUDA requires their PathScale > based compiler. I don't know enough about this area to say which compiling > tool chain will end up being easier to deal with. > > -Intel is making GPU support standard for OpenCL, as I mentioned before. > NVIDIA will be hard pressed to compete with Intel for GPU acceleration once > more systems supporting that enter the market. > > -Easy availability of OpenCL on Mac OS X for development's sake. Lots of > Postgres hackers with OS X systems, even though there aren't too many OS X > database servers. > > The fact that Amazon provides a way to crack the chicken/egg hardware > problem immediately helps a lot though; I don't even need a physical card > here to test CUDA GPU acceleration on Linux now. With that data point, your > benchmarks are good enough to say I'd be willing to help review a patch in > this area here as part of the 9.3 development cycle. That may validate that > GPU acceleration is useful, and then the next step would be considering how > portable that will be to other GPU interfaces. I still expect CUDA will be > looked back on as a dead end for GPU accelerated computing one day. > Computing history is not filled with many single-vendor standards that > competed successfully against Intel providing the same thing. AMD's x86-64 > is the only example I can think of where Intel didn't win that sort of race, > which happened (IMHO) only because Intel's Itanium failed to prioritize > backwards compatibility highly enough. > As a side note: my module (PG-Strom) also uses CUDA. I tried to implement it with OpenCL at the beginning of the project, but it didn't work well when multiple sessions use a GPU device concurrently: the second background process gets an error due to out-of-resources while another process has the GPU device open. I'm not clear whether it is a limitation of OpenCL, of Nvidia's driver, or a bug in my code.
Anyway, I switched to CUDA, instead of the investigation on binary drivers. :-( Thanks, -- KaiGai Kohei <kaigai@kaigai.gr.jp>
On Feb 13, 2012 11:39 a.m., "Kohei KaiGai" <kaigai@kaigai.gr.jp> wrote: > As a side note: my module (PG-Strom) also uses CUDA, although I tried to implement it with OpenCL at the beginning of the project; that didn't work well when multiple sessions used a GPU device concurrently. The second background process got an out-of-resources error while another process had a GPU device open. > > I'm not sure whether this is a limitation of OpenCL, of NVIDIA's driver, or a bug in my code. Anyway, I switched to CUDA instead of investigating the binary drivers further. :-( I have no experience with OpenCL, but with CUDA 4.1 you can certainly share the same device from multiple host threads, for example allocating memory in one host thread and using it in another thread. Maybe with OpenCL you were facing the very same limit.
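For what it's worth, a minimal sketch of the sharing behaviour described above, assuming the CUDA 4.x runtime API (which keeps one context per device per process, so all host threads see the same device memory). This is an illustration only, not code from PG-Strom or from any patch in this thread:

// Sketch: one host thread allocates device memory, another uses it.
// With the CUDA >= 4.0 runtime API, host threads in a process share a
// single context per device. Error handling omitted for brevity.
#include <cuda_runtime.h>
#include <pthread.h>
#include <stdio.h>

static int *d_buf;  /* device pointer shared between host threads */

static void *alloc_thread(void *unused) {
    cudaSetDevice(0);
    cudaMalloc((void **) &d_buf, 1024 * sizeof(int));  /* allocate in thread A */
    cudaMemset(d_buf, 0, 1024 * sizeof(int));
    return NULL;
}

static void *use_thread(void *unused) {
    int h[1024];
    cudaSetDevice(0);  /* same device, hence same shared context */
    cudaMemcpy(h, d_buf, sizeof(h), cudaMemcpyDeviceToHost);
    printf("first element: %d\n", h[0]);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, alloc_thread, NULL);
    pthread_join(a, NULL);  /* make sure the allocation is done first */
    pthread_create(&b, NULL, use_thread, NULL);
    pthread_join(b, NULL);
    return 0;
}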
I don't think we should be looking at either CUDA or OpenCL directly. We should be looking for a generic library that can target either and is well maintained and actively developed. Any GPU code we write ourselves would rapidly be overtaken by changes in the hardware and innovations in parallel algorithms. If we find a library that provides a sorting api and adapt our code to use it then we'll get the benefits of any new hardware feature as the library adds support for them.
On Feb 13, 2012 7:49 p.m., "Greg Stark" <stark@mit.edu> wrote: > I don't think we should be looking at either CUDA or OpenCL directly. > We should be looking for a generic library that can target either and > is well maintained and actively developed. Any GPU code we write > ourselves would rapidly be overtaken by changes in the hardware and > innovations in parallel algorithms. If we find a library that provides > a sorting api and adapt our code to use it then we'll get the benefits > of any new hardware feature as the library adds support for them. To sort integers I used NVIDIA's Thrust library.
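For reference, an integer sort with Thrust takes only a few lines. The following is a hedged sketch of what such a benchmark presumably looks like, not the actual code behind the numbers in this thread; on plain int keys, thrust::sort can take Thrust's GPU radix-sort path:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>
#include <vector>

int main(void) {
    std::vector<int> h(1 << 20);                  // 1M integers to sort
    for (size_t i = 0; i < h.size(); ++i)
        h[i] = rand();
    thrust::device_vector<int> d = h;             // host -> device copy
    thrust::sort(d.begin(), d.end());             // sort on the GPU
    thrust::copy(d.begin(), d.end(), h.begin());  // device -> host copy
    return 0;
}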
On 13/02/2012 08:26, Greg Smith wrote: > On 02/11/2012 08:14 PM, Gaetano Mendola wrote: >> The trend is to have servers capable of running CUDA providing GPU via >> external hardware (PCI Express interface with PCI Express switches), >> look for example at PowerEdge C410x PCIe Expansion Chassis from DELL. > > The C410X adds 16 PCIe slots to a server, housed inside a separate 3U > enclosure. That's a completely sensible purchase if your goal is to > build a computing cluster, where a lot of work is handed off to a set of > GPUs. I think that's even less likely to be a cost-effective option for > a database server. Adding a single dedicated GPU installed in a server > to accelerate sorting is something that might be justifiable, based on > your benchmarks. This is a much more expensive option than that though. > Details at http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for > anyone who wants to see just how big this external box is. > >> I did some experiments timing the sort done with CUDA and the sort >> done with pg_qsort: >> CUDA pg_qsort >> 33 million integers: ~ 900 ms, ~ 6000 ms >> 1 million integers: ~ 21 ms, ~ 162 ms >> 100k integers: ~ 2 ms, ~ 13 ms >> The CUDA time already includes the copy operations (host->device, >> device->host). >> As GPU I was using a C2050, and the CPU doing the pg_qsort was an >> Intel(R) Xeon(R) CPU X5650 @ 2.67GHz > > That's really interesting, and the X5650 is by no means a slow CPU. So > this benchmark is providing a lot of CPU power yet still seeing over a > 6X speedup in sort times. It sounds like the PCI Express bus has gotten > fast enough that the time to hand data over and get it back again can > easily be justified for medium to large sized sorts. > > It would be helpful to take this patch and confirm whether it scales > when used in parallel. The easiest way to do that would be to use the > pgbench "-f" feature, which allows running an arbitrary number of some > query at once. Seeing whether this acceleration continues to hold as the > number of clients increases is a useful data point. > > Is it possible for you to break down where the time is being spent? For > example, how much of this time is consumed in the GPU itself, compared > to time spent transferring data between CPU and GPU? I'm also curious > where the bottleneck is with this approach. If it's the speed of the > PCI-E bus for smaller data sets, adding more GPUs may never be > practical. If the bus can handle quite a few of these at once before it > saturates, it might be possible to overload a single GPU. That seems > like it would be really hard to reach for database sorting, though; I > can't really justify my gut feel for that being true.
There you go (times are in ms):

Size       H->D       SORT       D->H       TOTAL
64         0.209824   0.479392   0.013856   0.703072
128        0.098144   0.41744    0.01312    0.528704
256        0.096832   0.420352   0.013696   0.53088
512        0.097568   0.3952     0.014464   0.507232
1024       0.09872    0.396608   0.014624   0.509952
2048       0.101344   0.56224    0.016896   0.68048
4096       0.106176   0.562976   0.02016    0.689312
8192       0.116512   0.571264   0.02672    0.714496
16384      0.136096   0.587584   0.040192   0.763872
32768      0.179296   0.658112   0.066304   0.903712
65536      0.212352   0.84816    0.118016   1.178528
131072     0.317056   1.1465     0.22784    1.691396
262144     0.529376   1.82237    0.42512    2.776866
524288     0.724032   2.39834    0.64576    3.768132
1048576    1.11162    3.51978    1.12176    5.75316
2097152    1.95939    5.93434    2.06992    9.96365
4194304    3.76192    10.6011    4.10614    18.46916
8388608    7.16845    19.9245    7.93741    35.03036
16777216   13.8693    38.7413    15.4073    68.0179
33554432   27.3017    75.6418    30.6646    133.6081
67108864   54.2171    151.192    60.327     265.7361

Size       pg_sort
64         0.010000
128        0.010000
256        0.021000
512        0.128000
1024       0.092000
2048       0.196000
4096       0.415000
8192       0.883000
16384      1.881000
32768      3.960000
65536      8.432000
131072     17.951000
262144     37.140000
524288     78.320000
1048576    163.276000
2097152    339.118000
4194304    693.223000
8388608    1423.142000
16777216   2891.218000
33554432   5910.851000
67108864   11980.930000

As you can see, the CUDA times are lower than the timings I reported in my previous post because the server was doing something else in the meanwhile; I have repeated these benchmarks with the server completely unused. And this is the speedup, computed as pg_sort/CUDA:

Size       Speedup
64         0.0142232943
128        0.018914175
256        0.039556962
512        0.2070058671
1024       0.1804091365
2048       0.2880319774
4096       0.6078524674
8192       1.2372357578
16384      2.4637635625
32768      4.4106972133
65536      7.1742037525
131072     10.5090706139
262144     13.3719091955
524288     20.5834084369
1048576    28.2516043357
2097152    33.9618513296
4194304    37.5247168794
8388608    40.5135716561
16777216   42.4743633661
33554432   44.2394809896
67108864   45.1499777411

>> > I've never seen a PostgreSQL server capable of running CUDA, and I >> > don't expect that to change. >> >> That sounds like: >> >> "I think there is a world market for maybe five computers." >> - IBM Chairman Thomas Watson, 1943 > > Yes, and "640K will be enough for everyone", ha ha. (Having said the > 640K thing is flat-out denied by Gates, BTW, and no one has come up with > proof otherwise.) > > I think you've made an interesting case for this sort of acceleration > now being useful for systems doing what's typically considered a data > warehouse task. I regularly see servers waiting for far more than 13M > integers to sort. And I am seeing a clear trend toward providing more > PCI-E slots in servers now. Dell's R810 is the most popular single > server model my customers have deployed in the last year, and it has 5 > X8 slots in it. It's rare all 5 of those are filled. As long as a > dedicated GPU works fine when dropped to X8 speeds, I know a fair number > of systems where one of those could be added now. > > There's another data point in your favor I didn't notice before your > last e-mail. Amazon has a "Cluster GPU Quadruple Extra Large" node type > that runs with NVIDIA Tesla hardware. That means the installed base of > people who could consider CUDA is higher than I expected. To demonstrate > how much that costs, to provision a GPU enabled reserved instance from > Amazon for one year costs $2410 at "Light Utilization", giving a system > with 22GB of RAM and 1.69GB of storage.
(I find the reserved prices > easier to compare with dedicated hardware than the hourly ones.) That's > halfway between the High-Memory Double Extra Large Instance (34GB > RAM/850GB disk) at $1100 and the High-Memory Quadruple Extra Large > Instance (64GB RAM/1690GB disk) at $2200. If someone could prove sorting > was a bottleneck on their server, that isn't an unreasonable option to > consider on a cloud-based database deployment. > > I still think that an approach based on OpenCL is more likely to be > suitable for PostgreSQL, which was part of why I gave CUDA low odds > here. The points in favor of OpenCL are: > > -Since you last posted, OpenCL compiling has switched to using LLVM as > its standard compiler. Good PostgreSQL support for LLVM isn't far > away. It looks to me like the compiler situation for CUDA requires their > PathScale based compiler. I don't know enough about this area to say > which compiling tool chain will end up being easier to deal with. NVIDIA's compiler, nvcc, has switched to LLVM as well (as of CUDA 4.1). > -Intel is making GPU support standard for OpenCL, as I mentioned before. > NVIDIA will be hard pressed to compete with Intel for GPU acceleration > once more systems supporting that enter the market. > > -Easy availability of OpenCL on Mac OS X for development's sake. Lots of > Postgres hackers with OS X systems, even though there aren't too many OS > X database servers. > The fact that Amazon provides a way to crack the chicken/egg hardware > problem immediately helps a lot though; I don't even need a physical > card here to test CUDA GPU acceleration on Linux now. With that data > point, your benchmarks are good enough to say I'd be willing to help > review a patch in this area here as part of the 9.3 development cycle. > That may validate that GPU acceleration is useful, and then the next > step would be considering how portable that will be to other GPU > interfaces. I still expect CUDA will be looked back on as a dead end for > GPU accelerated computing one day. Computing history is not filled with > many single-vendor standards that competed successfully against Intel > providing the same thing. AMD's x86-64 is the only example I can think > of where Intel didn't win that sort of race, which happened (IMHO) only > because Intel's Itanium failed to prioritize backwards compatibility > highly enough. Given that NVIDIA's nvcc now uses LLVM, I think we will soon be able to compile "CUDA" programs for any target architecture supported by LLVM. Regards Gaetano Mendola
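For readers wondering how the H->D / SORT / D->H columns in the table above can be produced, one common approach is to record cudaEvent timers around each phase. The structure below is an assumption about the methodology, not the benchmark code that produced those numbers:

#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

static float ms(cudaEvent_t a, cudaEvent_t b) {
    float t = 0.0f;
    cudaEventElapsedTime(&t, a, b);  // elapsed GPU time between two events
    return t;
}

int main(void) {
    std::vector<int> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i) h[i] = rand();

    cudaEvent_t e[4];
    for (int i = 0; i < 4; ++i) cudaEventCreate(&e[i]);

    cudaEventRecord(e[0]);
    thrust::device_vector<int> d = h;                 // H->D
    cudaEventRecord(e[1]);
    thrust::sort(d.begin(), d.end());                 // SORT
    cudaEventRecord(e[2]);
    thrust::copy(d.begin(), d.end(), h.begin());      // D->H
    cudaEventRecord(e[3]);
    cudaEventSynchronize(e[3]);

    printf("H->D %.3f ms  SORT %.3f ms  D->H %.3f ms\n",
           ms(e[0], e[1]), ms(e[1], e[2]), ms(e[2], e[3]));
    return 0;
}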
On Mon, Feb 13, 2012 at 20:48, Greg Stark <stark@mit.edu> wrote: > I don't think we should be looking at either CUDA or OpenCL directly. > We should be looking for a generic library that can target either and > is well maintained and actively developed. I understand your point about using some external library for the primitives, but I don't see why it needs to support both CUDA and OpenCL. Libraries for GPU-accelerated primitives generally target OpenCL *or* CUDA, not both. As far as I understand (and someone correct me if I'm wrong), the difference between them is mostly the API and the fact that CUDA had a head start, and thus a larger developer community around it. (All the early adopters went to CUDA) But OpenCL already acts as an abstraction layer. CUDA is NVIDIA-specific, but OpenCL is supported by AMD, Intel as well as NVIDIA. It's pretty rare for servers to have separate graphics cards, but recent Intel and AMD CPUs already have a GPU included on die, which is another bonus for OpenCL. So I'd say, the way things are heading, it's only a matter of time before OpenCL takes over and there will be little reason to look back. Regards, Marti
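Marti's point that OpenCL itself already acts as the abstraction layer shows up directly in the host API: a single pair of calls enumerates whichever vendor runtimes (AMD, Intel, NVIDIA) happen to be installed. A minimal sketch:

#include <CL/cl.h>
#include <stdio.h>

int main(void) {
    cl_platform_id platforms[8];
    cl_uint n = 0;
    clGetPlatformIDs(8, platforms, &n);  /* one call sees every installed vendor runtime */
    for (cl_uint i = 0; i < n; i++) {
        char name[256];
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME,
                          sizeof(name), name, NULL);
        printf("platform %u: %s\n", (unsigned) i, name);
    }
    return 0;
}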
On 13/02/2012 19:48, Greg Stark wrote: > I don't think we should be looking at either CUDA or OpenCL directly. > We should be looking for a generic library that can target either and > is well maintained and actively developed. Any GPU code we write > ourselves would rapidly be overtaken by changes in the hardware and > innovations in parallel algorithms. If we find a library that provides > a sorting api and adapt our code to use it then we'll get the benefits > of any new hardware feature as the library adds support for them. > I think one option is to make the sort function pluggable with a shared library/dll. I see several benefits from this: - It could be in the interest of the hardware vendor to provide the most powerful sort implementation (I'm sure, for example, that the TBB sort implementation is faster than pg_sort). - It can permit people to "play" with it without being deeply involved in pg development and such. - It can relieve the postgres core group of the choice of the right language/tool/implementation to use. - Also, people not willing (or not able, for that matter) to upgrade the postgres engine could instead change the sort function upon a hardware upgrade. Of course, if this happens, the postgres engine has to make some sort of sanity check (that the function, for example, actually sorts) before it "trusts" the plugged sort. The engine could even have multiple sort implementations available and use the most efficient one (imagine some sorts acting better on a certain value range or on a certain element size). Regards Gaetano Mendola
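To make the idea concrete, here is a rough sketch of what such a pluggable sort could look like: dlopen() a vendor library, look up a qsort_arg-style entry point, and sanity-check its output before trusting it. The library name and the plug_sort symbol are hypothetical; no such hook exists in PostgreSQL today:

/* Build with -ldl on Linux. Hypothetical plug-in signature mirroring
 * qsort_arg; purely a sketch of the idea discussed above. */
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

typedef int (*cmp_fn)(const void *a, const void *b, void *arg);
typedef void (*sort_fn)(void *base, size_t nelems, size_t width,
                        cmp_fn cmp, void *arg);

static int int_cmp(const void *a, const void *b, void *arg) {
    int x = *(const int *) a, y = *(const int *) b;
    return (x > y) - (x < y);
}

int main(void) {
    void *so = dlopen("./libvendor_sort.so", RTLD_NOW);  /* hypothetical vendor .so */
    if (so == NULL) { fprintf(stderr, "%s\n", dlerror()); return 1; }
    sort_fn plug_sort = (sort_fn) dlsym(so, "plug_sort"); /* hypothetical symbol */
    if (plug_sort == NULL) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    int v[1000];
    for (int i = 0; i < 1000; i++) v[i] = rand();
    plug_sort(v, 1000, sizeof(int), int_cmp, NULL);

    /* Sanity check before trusting the plugged sort, as suggested above. */
    for (int i = 1; i < 1000; i++)
        if (v[i - 1] > v[i]) { fprintf(stderr, "plug_sort is broken\n"); return 1; }
    dlclose(so);
    return 0;
}

A real hook would of course need far more validation than a single sortedness check (stability, comparator semantics, error handling), which is part of what makes the API design non-trivial.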
On 15 February 2012 20:00, Gaetano Mendola <mendola@gmail.com> wrote: > On 13/02/2012 19:48, Greg Stark wrote: >> >> I don't think we should be looking at either CUDA or OpenCL directly. >> We should be looking for a generic library that can target either and >> is well maintained and actively developed. Any GPU code we write >> ourselves would rapidly be overtaken by changes in the hardware and >> innovations in parallel algorithms. If we find a library that provides >> a sorting api and adapt our code to use it then we'll get the benefits >> of any new hardware feature as the library adds support for them. >> > > I think one option is to make the sort function pluggable with a shared > library/dll. I see several benefits from this: > > - It could be in the interest of the hardware vendor to provide the most > powerful sort implementation (I'm sure, for example, that the TBB sort > implementation is faster than pg_sort). > > - It can permit people to "play" with it without being deeply involved in pg > development and such. Sorry, but I find it really hard to believe that the non-availability of pluggable sorting is what's holding people back here. Some vanguard needs to go and prove the idea by building a rough prototype before we can even really comment on what an API should look like. For example, I am given to understand that GPUs generally sort using radix sort - resolving the impedance mismatch that prevents someone from using a non-comparison based sort sure sounds like a lot of work for an entirely speculative reward. Someone who cannot understand tuplesort, which is not all that complicated, has no business trying to build GPU sorting into Postgres. I had a patch committed a few hours ago that almost included the capability of assigning an alternative sorting function, but only one with the exact same signature as my variant of qsort_arg. pg_qsort isn't used to sort tuples at all, by the way. Threading Building Blocks is not going to form the basis of any novel sorting implementation, because comparators in general are not thread safe, it isn't available on all the platforms we support, and because of how longjmp interacts with C++ stack unwinding, and so on. Now, you could introduce some kind of parallelism into sorting integers and floats, but that's an awful lot of work for a marginal reward. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On 15/02/2012 23:11, Peter Geoghegan wrote: > On 15 February 2012 20:00, Gaetano Mendola <mendola@gmail.com> wrote: >> On 13/02/2012 19:48, Greg Stark wrote: >>> >>> I don't think we should be looking at either CUDA or OpenCL directly. >>> We should be looking for a generic library that can target either and >>> is well maintained and actively developed. Any GPU code we write >>> ourselves would rapidly be overtaken by changes in the hardware and >>> innovations in parallel algorithms. If we find a library that provides >>> a sorting api and adapt our code to use it then we'll get the benefits >>> of any new hardware feature as the library adds support for them. >>> >> >> I think one option is to make the sort function pluggable with a shared >> library/dll. I see several benefits from this: >> >> - It could be in the interest of the hardware vendor to provide the most >> powerful sort implementation (I'm sure, for example, that the TBB sort >> implementation is faster than pg_sort). >> >> - It can permit people to "play" with it without being deeply involved in pg >> development and such. > > Sorry, but I find it really hard to believe that the non-availability > of pluggable sorting is what's holding people back here. Some vanguard > needs to go and prove the idea by building a rough prototype before we > can even really comment on what an API should look like. For example, > I am given to understand that GPUs generally sort using radix sort - > resolving the impedance mismatch that prevents someone from using a > non-comparison based sort sure sounds like a lot of work for an > entirely speculative reward. AFAIK the Thrust library uses radix sort if the keys you are sorting are POD data comparable with the "<" operator; otherwise it does a comparison-based sort using the operator provided. http://docs.thrust.googlecode.com/hg/modules.html I'm not saying that the non-availability of pluggable sort completely holds people back; I'm saying that it would simplify the process, now and in the future. Of course, that's my opinion. > Someone who cannot understand tuplesort, which is not all that > complicated, has no business trying to build GPU sorting into > Postgres. That sounds a bit harsh. I'm one of those, indeed; I haven't looked into the details, not having had enough time for it. At work we do GPU computing (not the sort type of stuff), and given that I'm a Postgres enthusiast I asked myself: "my server is able to sort around 500 million integers per second; if postgres were able to do that as well it would be very nice". What can I say? Sorry for my thoughts. > I had a patch committed a few hours ago that almost included the > capability of assigning an alternative sorting function, but only one > with the exact same signature as my variant of qsort_arg. pg_qsort > isn't used to sort tuples at all, by the way. Then I looked in the wrong direction. Thank you for pointing that out. > Threading Building Blocks is not going to form the basis of any novel > sorting implementation, because comparators in general are not thread > safe, it isn't available on all the platforms we support, and > because of how longjmp interacts with C++ stack unwinding, and so on. > Now, you could introduce some kind of parallelism into > sorting integers and floats, but that's an awful lot of work for a > marginal reward. TBB was just an example that came to my mind. What do you mean by "you could introduce some kind of parallelism"? As far as I know, any algorithm based on divide and conquer can be parallelized. Regards Gaetano Mendola
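Taking Gaetano's description of Thrust's dispatch at face value (an assumption about the library's internals, not verified here), the difference is visible at the call site: sorting plain integers with the default operator< can use the radix path, while a user-supplied comparator forces a comparison-based sort. A minimal sketch:

#include <thrust/device_vector.h>
#include <thrust/sort.h>

struct abs_less {
    __host__ __device__ bool operator()(int a, int b) const {
        int aa = a < 0 ? -a : a;
        int bb = b < 0 ? -b : b;
        return aa < bb;  // order by absolute value
    }
};

int main(void) {
    thrust::device_vector<int> d(4);
    d[0] = -7; d[1] = 3; d[2] = -1; d[3] = 5;
    thrust::sort(d.begin(), d.end());              // POD keys + operator<: radix path
    thrust::sort(d.begin(), d.end(), abs_less());  // custom comparator: comparison sort
    return 0;
}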
On 15 February 2012 22:54, Gaetano Mendola <mendola@gmail.com> wrote: > That sounds a bit harsh. I'm one of those, indeed; I haven't looked into the > details, not having had enough time for it. At work we do GPU computing (not > the sort type of stuff), and given that I'm a Postgres enthusiast I > asked myself: "my server is able to sort around 500 million integers per > second; if postgres were able to do that as well it would be very nice". > > What can I say? Sorry for my thoughts. I'm not trying to sound harsh. The only reason that my patch *nearly* had support for this is that the implementation we nearly went with would have only needed another couple of lines of code to support it. It very probably wouldn't have turned out to be useful for any novel sorting idea, and was really only intended to be used to support user-defined full sorting specialisations. That didn't end up making the cut. My point is that whatever is holding back the development of a useful prototype here, it definitely isn't the lack of an existing API. We don't know what such an API should look like, and just how invasive it needs to be. More importantly, it remains to be seen how useful this idea is in the real world - we don't have so much as a synthetic test case with a single client, as far as I'm aware. I'd encourage the OP to share his work on github or something along those lines. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
-----Original Message----- From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Gaetano Mendola Sent: Wednesday, February 15, 2012 2:54 PM To: Peter Geoghegan; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] CUDA Sorting On 15/02/2012 23:11, Peter Geoghegan wrote: > On 15 February 2012 20:00, Gaetano Mendola <mendola@gmail.com> wrote: >> On 13/02/2012 19:48, Greg Stark wrote: >>> >>> I don't think we should be looking at either CUDA or OpenCL directly. >>> We should be looking for a generic library that can target either >>> and is well maintained and actively developed. Any GPU code we write >>> ourselves would rapidly be overtaken by changes in the hardware and >>> innovations in parallel algorithms. If we find a library that >>> provides a sorting api and adapt our code to use it then we'll get >>> the benefits of any new hardware feature as the library adds support for them. >>> >> >> I think one option is to make the sort function pluggable with a >> shared library/dll. I see several benefits from this: >> >> - It could be in the interest of the hardware vendor to provide the >> most powerful sort implementation (I'm sure, for example, that the TBB sort >> implementation is faster than pg_sort). >> >> - It can permit people to "play" with it without being deeply >> involved in pg development and such. > > Sorry, but I find it really hard to believe that the non-availability > of pluggable sorting is what's holding people back here. Some vanguard > needs to go and prove the idea by building a rough prototype before we > can even really comment on what an API should look like. For example, > I am given to understand that GPUs generally sort using radix sort - > resolving the impedance mismatch that prevents someone from using a > non-comparison based sort sure sounds like a lot of work for an > entirely speculative reward. AFAIK the Thrust library uses radix sort if the keys you are sorting are POD data comparable with the "<" operator; otherwise it does a comparison-based sort using the operator provided. http://docs.thrust.googlecode.com/hg/modules.html I'm not saying that the non-availability of pluggable sort completely holds people back; I'm saying that it would simplify the process, now and in the future. Of course, that's my opinion. > Someone who cannot understand tuplesort, which is not all that > complicated, has no business trying to build GPU sorting into > Postgres. That sounds a bit harsh. I'm one of those, indeed; I haven't looked into the details, not having had enough time for it. At work we do GPU computing (not the sort type of stuff), and given that I'm a Postgres enthusiast I asked myself: "my server is able to sort around 500 million integers per second; if postgres were able to do that as well it would be very nice". What can I say? Sorry for my thoughts. > I had a patch committed a few hours ago that almost included the > capability of assigning an alternative sorting function, but only one > with the exact same signature as my variant of qsort_arg. pg_qsort > isn't used to sort tuples at all, by the way. Then I looked in the wrong direction. Thank you for pointing that out. > Threading Building Blocks is not going to form the basis of any novel > sorting implementation, because comparators in general are not thread > safe, it isn't available on all the platforms we support, and > because of how longjmp interacts with C++ stack unwinding, and so on. > Now, you could introduce some kind of parallelism into > sorting integers and floats, but that's an awful lot of work for a > marginal reward. TBB was just an example that came to my mind. What do you mean by "you could introduce some kind of parallelism"? As far as I know, any algorithm based on divide and conquer can be parallelized. >> Radix sorting can be used for any data type, if you create a callback that provides the most significant bits in "width" buckets. At any rate, I can't imagine why anyone would want to complain about sorting 40 times faster than before, considering the amount of time databases spend ordering data. I have a CUDA card in this machine (NVIDIA GeForce GTX 460) and I would not mind it a bit if my database "ORDER BY" clause suddenly started running ten times faster than before when I am dealing with a huge volume of data. There have been other experiments along these lines, such as: GPU-based Sorting in PostgreSQL, Naju Mancheril, School of Computer Science - Carnegie Mellon University www.cs.virginia.edu/~skadron/Papers/bakkum_sqlite_gpgpu10.pdf (This is for SQLite, but the grammar of SQLite is almost a pure subset of PostgreSQL, including things like vacuum...) http://wiki.postgresql.org/images/6/65/Pgopencl.pdf http://dl.acm.org/citation.cfm?id=1807207 http://www.scribd.com/doc/51484335/PostgreSQL-OpenCL-Procedural-Language-pgEast-March-2011 See also http://highscalability.com/scaling-postgresql-using-cuda <<
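As a sketch of the callback idea (the names and the single-pass scope are illustrative assumptions, not code from any of the projects cited above): one most-significant-digit pass histograms elements into 2^width buckets using the user-supplied key extractor, then scatters them into bucket order; a full radix sort would recurse into each bucket:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical key-extraction callback: return the most significant
 * "width" bits of an element, so it indexes one of 2^width buckets. */
typedef uint32_t (*msb_fn)(const void *elem, unsigned width);

static uint32_t uint32_msb(const void *elem, unsigned width) {
    uint32_t v = *(const uint32_t *) elem;
    return v >> (32 - width);
}

/* One MSD counting pass: histogram into 2^width buckets, take prefix
 * sums to get bucket offsets, then scatter in bucket order. */
static void radix_pass(const uint32_t *in, uint32_t *out, size_t n,
                       unsigned width, msb_fn key) {
    size_t buckets = (size_t) 1 << width;
    size_t *count = (size_t *) calloc(buckets + 1, sizeof(size_t));
    for (size_t i = 0; i < n; i++) count[key(&in[i], width) + 1]++;
    for (size_t b = 1; b <= buckets; b++) count[b] += count[b - 1];
    for (size_t i = 0; i < n; i++) out[count[key(&in[i], width)]++] = in[i];
    free(count);
}

int main(void) {
    uint32_t in[8] = { 0xF0000000u, 5, 0x80000001u, 42, 7, 0xC0000000u, 1, 99 };
    uint32_t out[8];
    radix_pass(in, out, 8, 2, uint32_msb);  /* 2 bits -> 4 buckets */
    for (int i = 0; i < 8; i++) printf("%08x\n", out[i]);
    return 0;
}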