Thread: CUDA Sorting

CUDA Sorting

From
Vitor Reus
Date:
Hello everyone,

I'm implementing CUDA-based sorting in PostgreSQL, and I believe it
can improve ORDER BY performance by 4 to 10 times. I already have a
generic CUDA sort that performs around 10 times faster than the
standard qsort. I also managed to load CUDA into pgsql.

Since I'm new to pgsql development, I replaced the code of pgsql's
qsort_arg to get familiar with the way Postgres does the sort. The
problem is that I can't use the qsort_arg_comparator comparator
function on the GPU, so I need to implement my own. I couldn't figure
out how to access the sort key values of the tuples in the
Tuplesortstate or SortTuple structures. This part looks complicated
because the state seems to hold a pointer to the scanner(?), but I
haven't managed to access the values directly. Can anyone tell me how
this works?

Cheers,
Vítor


Re: CUDA Sorting

From
Thom Brown
Date:
On 19 September 2011 13:11, Vitor Reus <vitor.reus@gmail.com> wrote:
> Hello everyone,
>
> I'm implementing a CUDA based sorting on PostgreSQL, and I believe it
> can improve the ORDER BY statement performance in 4 to 10 times. I
> already have a generic CUDA sort that performs around 10 times faster
> than std qsort. I also managed to load CUDA into pgsql.
>
> Since I'm new to pgsql development, I replaced the code of pgsql
> qsort_arg to get used with the way postgres does the sort. The problem
> is that I can't use the qsort_arg_comparator comparator function on
> GPU, I need to implement my own. I didn't find out how to access the
> sorting key value data of the tuples on the Tuplesortstate or
> SortTuple structures. This part looks complicated because it seems the
> state holds the pointer for the scanner(?), but I didn't managed to
> access the values directly. Can anyone tell me how this works?

I can't help with explaining the inner workings of sorting code, but
just a note that CUDA is a proprietary framework from nVidia and
confines its use to nVidia GPUs only.  You'd probably be better off
investing in the OpenCL standard which is processor-agnostic.  Work
has already been done in this area by Tim Child with pgOpenCL,
although doesn't appear to be available yet.  It might be worth
engaging with him to see if there are commonalities to what you're
both trying to achieve.

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: CUDA Sorting

From
Thom Brown
Date:
On 19 September 2011 14:32, Vitor Reus <vitor.reus@gmail.com> wrote:
> 2011/9/19 Thom Brown <thom@linux.com>:
>> On 19 September 2011 13:11, Vitor Reus <vitor.reus@gmail.com> wrote:
>>> Hello everyone,
>>>
>>> I'm implementing a CUDA based sorting on PostgreSQL, and I believe it
>>> can improve the ORDER BY statement performance in 4 to 10 times. I
>>> already have a generic CUDA sort that performs around 10 times faster
>>> than std qsort. I also managed to load CUDA into pgsql.
>>>
>>> Since I'm new to pgsql development, I replaced the code of pgsql
>>> qsort_arg to get used with the way postgres does the sort. The problem
>>> is that I can't use the qsort_arg_comparator comparator function on
>>> GPU, I need to implement my own. I didn't find out how to access the
>>> sorting key value data of the tuples on the Tuplesortstate or
>>> SortTuple structures. This part looks complicated because it seems the
>>> state holds the pointer for the scanner(?), but I didn't managed to
>>> access the values directly. Can anyone tell me how this works?
>>
>> I can't help with explaining the inner workings of sorting code, but
>> just a note that CUDA is a proprietary framework from nVidia and
>> confines its use to nVidia GPUs only.  You'd probably be better off
>> investing in the OpenCL standard which is processor-agnostic.  Work
>> has already been done in this area by Tim Child with pgOpenCL,
>> although doesn't appear to be available yet.  It might be worth
>> engaging with him to see if there are commonalities to what you're
>> both trying to achieve.
>>
>> --
>> Thom Brown
>> Twitter: @darkixion
>> IRC (freenode): dark_ixion
>> Registered Linux user: #516935
>>
>> EnterpriseDB UK: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>>
>
> Hi Thom Brown,
>
> thank you very much for your reply.
>
> I am aware that CUDA is a proprietary framework, but since the high
> level CUDA API is easier than OpenCL, it will be faster to implement
> and test. Also, CUDA can be translated to OpenCL in a straightforward
> way, since the low level CUDA API generated code is really similar to
> OpenCL.
>
> I'll try engaging with Tim Child, but it seems that his work is to
> create GPU support for specific SQL, like procedural SQL statements
> with CUDA extensions, did I understand it right? And my focus is to
> "unlock" the GPU power without the user being aware of this.

Please use Reply To All in your responses so the mailing list is included.

Is your aim to have this committed into core PostgreSQL, or just for
your own version?  If it's the former, I don't anticipate any
enthusiasm from the hacker community.

But you're right, Tim Child's work is aimed at procedural acceleration
rather than speeding up core functionality (from what I gather
anyway).

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: CUDA Sorting

From
Greg Stark
Date:
On Mon, Sep 19, 2011 at 1:11 PM, Vitor Reus <vitor.reus@gmail.com> wrote:
> Since I'm new to pgsql development, I replaced the code of pgsql
> qsort_arg to get used with the way postgres does the sort. The problem
> is that I can't use the qsort_arg_comparator comparator function on
> GPU, I need to implement my own. I didn't find out how to access the
> sorting key value data of the tuples on the Tuplesortstate or
> SortTuple structures. This part looks complicated because it seems the
> state holds the pointer for the scanner(?), but I didn't managed to
> access the values directly. Can anyone tell me how this works?
>


This is something I've been curious about for a while. The biggest
difficulty is that Postgres has a user-extensible type system and
calls user provided functions to do things like comparisons. Postgres
only supports comparison sorts and does so by calling the user
function for the data type being sorted.

These user-defined functions are looked up earlier, in the query
parsing and analysis phase, and stored in Tuplesortstate->scanKeys,
which is an array of structures that hold information about the
required ordering. In there there's a pointer to the function, a set
of flags (such as NULLS FIRST/LAST), and the collation needed.

I assume you're going to have to have tuplesort.c recognize when all
the comparators are from a small set of standard comparators that you
can implement on the GPU, such as integer and floating point
comparison. In that case you could call a specialized qsort that
inlines the comparator instead of calling the standard function
through a pointer. That might actually be a useful optimization to do
anyway, since it may well be much faster even without the GPU.  So
that would probably be a good place to start.
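
For illustration, a rough sketch of what such a specialized comparator
could look like (this is not actual tuplesort.c code; apart from the
existing SortTuple fields datum1/isnull1 and DatumGetInt32, the names
here are invented):

    /*
     * Hypothetical specialized comparator for a leading int4 sort key
     * stashed in SortTuple.datum1/isnull1.  NULLs and ties on later
     * keys are punted to a generic fallback routine, which here is
     * just a stand-in name for whatever the real code would call.
     */
    static inline int
    comparetup_int4_inline(const SortTuple *a, const SortTuple *b,
                           Tuplesortstate *state)
    {
        int32       av, bv;

        if (a->isnull1 || b->isnull1)
            return fallback_comparetup(a, b, state); /* NULLS FIRST/LAST */

        av = DatumGetInt32(a->datum1);
        bv = DatumGetInt32(b->datum1);

        if (av != bv)
            return (av < bv) ? -1 : 1;

        /* leading keys equal: let the generic code compare trailing keys */
        return fallback_comparetup(a, b, state);
    }

A qsort specialization instantiated with a comparator like that lets
the compiler inline the comparison instead of jumping through a
function pointer on every call.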

But the barrier to get over here might be relatively high. In order to
tolerate that amount of duplicated code and special cases there would
have to be benchmarks showing it's significantly faster and helps
real-world user queries. It would also have to be pretty cleanly
implemented so that it doesn't impose a lot of extra overhead every
time this code needs to be changed -- for example when adding
collations it would have been unfortunate to have to add it to half a
dozen specializations of tuplesort (though frankly I don't think that
would have made that much of a dent in the happiness of the people who
worked on collations).

All that said my personal opinion is that this can be done cleanly and
would be more than worth the benefit even without the GPU -- sorting
integers and floating point numbers is a very common case and Peter
Geoghegan recently showed our qsort could be about twice as fast if it
could inline the comparisons. With the GPU I'm curious to see how well
it handles multiple processes contending for resources, it might be a
flashy feature that gets lots of attention but might not really be
very useful in practice. But it would be very interesting to see.

-- 
greg


Re: CUDA Sorting

From
Greg Smith
Date:
On 09/19/2011 10:12 AM, Greg Stark wrote:
> With the GPU I'm curious to see how well
> it handles multiple processes contending for resources, it might be a
> flashy feature that gets lots of attention but might not really be
> very useful in practice. But it would be very interesting to see.
>    

The main problem here is that the sort of hardware commonly used for 
production database servers doesn't have any serious enough GPU to 
support CUDA/OpenCL available.  The very clear trend now is that all 
systems other than gaming ones ship with motherboard graphics chipsets 
more than powerful enough for any task but that.  I just checked the 5 
most popular configurations of server I see my customers deploy 
PostgreSQL onto (a mix of Dell and HP units), and you don't get a 
serious GPU from any of them.

Intel's next generation Ivy Bridge chipset, expected for the spring of 
2012, is going to add support for OpenCL to the built-in motherboard 
GPU.  We may eventually see that trickle into the server hardware side 
of things too.

I've never seen a PostgreSQL server capable of running CUDA, and I don't 
expect that to change.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



Re: CUDA Sorting

From
Thom Brown
Date:
On 19 September 2011 15:36, Greg Smith <greg@2ndquadrant.com> wrote:
> On 09/19/2011 10:12 AM, Greg Stark wrote:
>>
>> With the GPU I'm curious to see how well
>> it handles multiple processes contending for resources, it might be a
>> flashy feature that gets lots of attention but might not really be
>> very useful in practice. But it would be very interesting to see.
>>
>
> The main problem here is that the sort of hardware commonly used for
> production database servers doesn't have any serious enough GPU to support
> CUDA/OpenCL available.  The very clear trend now is that all systems other
> than gaming ones ship with motherboard graphics chipsets more than powerful
> enough for any task but that.  I just checked the 5 most popular
> configurations of server I see my customers deploy PostgreSQL onto (a mix of
> Dell and HP units), and you don't get a serious GPU from any of them.
>
> Intel's next generation Ivy Bridge chipset, expected for the spring of 2012,
> is going to add support for OpenCL to the built-in motherboard GPU.  We may
> eventually see that trickle into the server hardware side of things too.
>
> I've never seen a PostgreSQL server capable of running CUDA, and I don't
> expect that to change.

But couldn't that also be seen as a chicken/egg situation?  No-one
buys GPUs for database servers because the database won't make use of
it, but databases don't implement GPU functionality since database
servers don't tend to have GPUs.  It's more likely the latter of those
two reasonings would have to be the first to budge.

But nVidia does produce a non-graphics-oriented GPGPU line called
Tesla dedicated to such processing.

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: CUDA Sorting

From
Greg Stark
Date:
On Mon, Sep 19, 2011 at 3:36 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> The main problem here is that the sort of hardware commonly used for
> production database servers doesn't have any serious enough GPU to support
> CUDA/OpenCL available

Of course that could change if adding a GPU would help Postgres... I
would expect it to help mostly for data warehouse batch query type
systems, especially ones with very large i/o subsystems that can
saturate the memory bus with sequential i/o. "Run your large batch
queries twice as fast by adding a $400 part to your $40,000 server"
might be a pretty compelling sales pitch :)

That said, to help in the case I described you would have to implement
the tapesort algorithm on the GPU as well. I expect someone has
implemented heaps for CUDA/OpenCL already though.

-- 
greg


Re: CUDA Sorting

From
Thom Brown
Date:
On 19 September 2011 15:54, Greg Stark <stark@mit.edu> wrote:
> On Mon, Sep 19, 2011 at 3:36 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> The main problem here is that the sort of hardware commonly used for
>> production database servers doesn't have any serious enough GPU to support
>> CUDA/OpenCL available
>
> Of course that could change if adding a GPU would help Postgres... I
> would expect it to help mostly for data warehouse batch query type
> systems, especially ones with very large i/o subsystems that can
> saturate the memory bus with sequential i/o. "Run your large batch
> queries twice as fast by adding a $400 part to your $40,000 server"
> might be a pretty compelling sales pitch :)
>
> That said, to help in the case I described you would have to implement
> the tapesort algorithm on the GPU as well. I expect someone has
> implemented heaps for CUDA/OpenCL already though.

I seem to recall a paper on such a thing by Carnegie Mellon
University.  Can't remember where I saw it though.

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: CUDA Sorting

From
Thom Brown
Date:
On 19 September 2011 16:10, Thom Brown <thom@linux.com> wrote:
> On 19 September 2011 15:54, Greg Stark <stark@mit.edu> wrote:
>> On Mon, Sep 19, 2011 at 3:36 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>>> The main problem here is that the sort of hardware commonly used for
>>> production database servers doesn't have any serious enough GPU to support
>>> CUDA/OpenCL available
>>
>> Of course that could change if adding a GPU would help Postgres... I
>> would expect it to help mostly for data warehouse batch query type
>> systems, especially ones with very large i/o subsystems that can
>> saturate the memory bus with sequential i/o. "Run your large batch
>> queries twice as fast by adding a $400 part to your $40,000 server"
>> might be a pretty compelling sales pitch :)
>>
>> That said, to help in the case I described you would have to implement
>> the tapesort algorithm on the GPU as well. I expect someone has
>> implemented heaps for CUDA/OpenCL already though.
>
> I seem to recall a paper on such a thing by Carnegie Mellon
> University.  Can't remember where I saw it though.

Found it! http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/ngm/15-823/project/Final.pdf

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: CUDA Sorting

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> That said, to help in the case I described you would have to implement
> the tapesort algorithm on the GPU as well.

I think the real problem would be that we are seldom sorting just the
key values.  If you have to push the tuples through the GPU too, your
savings are going to go up in smoke pretty quickly ...

FWIW, I tend to believe a variant of what Greg Stark said upthread:
there would surely be some win from reducing the impedance mismatch for
comparison functions.  In concrete terms, there would be no reason to
have tuplesort.c's myFunctionCall2Coll, and maybe not
inlineApplySortFunction either, if the datatype-specific comparison
functions had APIs that were closer to what sorting wants rather than
following the general SQL-callable-function API.  And those functions
cost a *lot* more than a one-instruction comparison does.  But it's very
much more of a stretch to believe that inlining per se is going to do
much for us, and even more of a stretch to believe that getting a
separate processor involved is going to be a win.
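
For instance (just a sketch -- apart from Datum, int32, and
DatumGetInt32, the names below are invented for illustration), a
sort-oriented comparison API could be a bare C function that
tuplesort.c calls, or that a specialized qsort inlines, instead of
packing both Datums into fmgr call machinery on every comparison:

    /* hypothetical sort-specific comparator signature */
    typedef int (*sort_cmp_fn) (Datum a, Datum b, void *extra);

    /* what a datatype could register for sorting int4 */
    static int
    btint4_sortcmp(Datum a, Datum b, void *extra)
    {
        int32       av = DatumGetInt32(a);
        int32       bv = DatumGetInt32(b);

        return (av < bv) ? -1 : ((av > bv) ? 1 : 0);
    }
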
        regards, tom lane


Re: CUDA Sorting

From
Vitor Reus
Date:
2011/9/19 Thom Brown <thom@linux.com>
> Is your aim to have this committed into core PostgreSQL, or just for
> your own version?  If it's the former, I don't anticipate any
> enthusiasm from the hacker community.

This is research for a thesis, and I'm not confident enough to commit
it to core by myself. I will, however, release the source, and I
believe it will open the way for future work to be committed to core
PostgreSQL.


2011/9/19 Greg Stark <stark@mit.edu>
> Of course that could change if adding a GPU would help Postgres... I
> would expect it to help mostly for data warehouse batch query type
> systems, especially ones with very large i/o subsystems that can
> saturate the memory bus with sequential i/o. "Run your large batch
> queries twice as fast by adding a $400 part to your $40,000 server"
> might be a pretty compelling sales pitch :)

My focus is also energy proportionality. If you add a GPU, you will
increase the power consumption by about 2 times, but you could
perhaps increase the efficiency by much more.


> That said, to help in the case I described you would have to implement
> the tapesort algorithm on the GPU as well. I expect someone has
> implemented heaps for CUDA/OpenCL already though.

For now, I'm planning to implement just the in-memory sort, for
simplicity and to see if it would give a real performance gain.


2011/9/19 Greg Stark <stark@mit.edu>:
> In which case you could call a specialized qsort which
> implements that comparator inlined instead of calling the standard
> function.

Actually, I'm now trying to make a custom comparator for integers,
but I haven't made much progress. If this works, I'll port it to the
GPU and start working on the next comparators, such as floats and
then strings, in an incremental way.


2011/9/19 Thom Brown <thom@linux.com>:
> Found it! http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/ngm/15-823/project/Final.pdf
This is really great work, and I'm basing mine on it. But it's
implemented using OpenGL (yes, not OpenCL) and therefore has a lot of
limitations. I also tried to contact naju but didn't get an answer.

Vítor Uwe Reus


Re: CUDA Sorting

From
Christopher Browne
Date:
On Mon, Sep 19, 2011 at 10:36 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> Intel's next generation Ivy Bridge chipset, expected for the spring of 2012,
> is going to add support for OpenCL to the built-in motherboard GPU.  We may
> eventually see that trickle into the server hardware side of things too.

Note that Amazon's EC2 offerings include a configuration with a pair of GPUs.

Whether or not this continues has a certain "chicken and egg" aspect to it...

- I'm glad that Amazon is selling such a configuration, as it does
give folks the option of trying it out.

- Presumably, it will only continue on their product list if customers
do more than merely "trying it out."

I think I'd be shocked if PostgreSQL offered much support for such a
configuration in the next year; despite there being some work ongoing,
drawing the functionality into core would require Core decisions that
I'd be surprised to see so quickly.

Unfortunately, that may be slow enough progress that PostgreSQL won't
be contributing to the would-be success of the technology.

If this kind of GPU usage fails to attract much interest, then it's
probably a good thing that we're not committed to it.  But if other
uses lead to it taking off, then we'll doubtless get a lot of noise on
lists about a year from now to the effect "Why don't you have this in
core yet?  Not 3773t enough!?!?"

Having a bit of progress taking place now would probably be good
timing, in case it *does* take off...
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: CUDA Sorting

From
Nulik Nol
Date:
On Mon, Sep 19, 2011 at 7:11 AM, Vitor Reus <vitor.reus@gmail.com> wrote:
> Hello everyone,
>
> I'm implementing a CUDA based sorting on PostgreSQL, and I believe it
> can improve the ORDER BY statement performance in 4 to 10 times. I
> already have a generic CUDA sort that performs around 10 times faster
> than std qsort. I also managed to load CUDA into pgsql.
NVIDIA cards are not as good as ATI cards. ATI cards are much faster
with integer operations, and should be ideal for sorting transaction
ids or similar sorts of numbers (unless you are going to sort prices
stored as floats, where ATI still beats NVIDIA, but not by as much).
Another problem you have to deal with is PCI Express speed. Transfer
is very slow compared to RAM. You will have to add more GPUs to match
the performance, and this will increase the cost of the solution.
There was a sorting algorithm for 4 CPU cores that was beating sort
on a GTX 285 (I don't have the link, sorry), so CPUs are not as bad
at sorting as you might think.
AMD is already working on embedding GPUs into the motherboard; if I
am not mistaken, some of them are already on the market and available
for purchase.
Anyone who uses a tiny embedded ATI for integer sorting problems will
outperform your NVIDIA-based, PCI-Express-connected GPU with CUDA,
because your algorithm will basically waste a lot of time
transferring data to the GPU and getting it back.
But if you use an embedded ATI GPU, you can also use the SSE
registers on each CPU core to add more performance to your algorithm.
It is not going to be a very hardware-compatible solution, but if you
want good speed for the cost, this should be the best option.
I recommend doing some bandwidth benchmark tests before you start
coding.
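
For example, a minimal sketch of a host->device bandwidth test using
only standard CUDA runtime calls (the buffer size and names are
arbitrary):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        const size_t bytes = 256UL << 20;      /* 256 MB test buffer */
        void *h_buf, *d_buf;
        float ms = 0.0f;
        cudaEvent_t t0, t1;

        cudaMallocHost(&h_buf, bytes);         /* pinned host memory */
        cudaMalloc(&d_buf, bytes);
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);

        cudaEventRecord(t0);
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);

        cudaEventElapsedTime(&ms, t0, t1);     /* milliseconds */
        printf("host->device: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }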

Regards
Nulik



-- 
==================================
The power of zero is infinite


Re: CUDA Sorting

From
Vitor Reus
Date:
2011/9/19 Nulik Nol <nuliknol@gmail.com>:
> On Mon, Sep 19, 2011 at 7:11 AM, Vitor Reus <vitor.reus@gmail.com> wrote:
> I recommend doing some bandwidth benchmark test before you start coding.

I already did some benchmarks of GPU sorting (not in pgsql), and
measured total sort times, copy bandwidth and energy usage, and got
some exciting results:

I got around 1GB/s bandwidth with a GeForce GT 430 on an MS-9803
motherboard. The power increase ratio from adding the GT 430 to a
Core 2 Duo T8300 was 2.75 times: http://tinyurl.com/6h7cgv2
The sorting performance increases when you have more data, but on
average it is 7.8 times faster than the CPU: http://tinyurl.com/6c95dc2


Re: CUDA Sorting

From
Stephen Frost
Date:
* Thom Brown (thom@linux.com) wrote:
> But nVidia does produce a non-graphics-oriented GPGPU line called
> Tesla dedicated to such processing.

Just as a side-note, I've got a couple of Teslas that aren't doing
terribly much at the moment, and they're in a Linux 'server'-type box
from Penguin Computing.  I could certainly install PG on it and run
some tests -- if someone's written the code and provides the tests.

I agree that it'd be interesting to do, but I share Lord Stark's
feelings about the challenges and lack of potential gain -- it's a
very small set of queries that would benefit from this.  You need to
be working with enough data to make the cost of transferring it all
over to the GPU worthwhile, just for starters.

Thanks,

    Stephen

Re: CUDA Sorting

From
Greg Smith
Date:
On 09/19/2011 10:53 AM, Thom Brown wrote:
> But couldn't that also be seen as a chicken/egg situation?


The chicken/egg problem here is a bit deeper than just "no one offers 
GPUs because no one wants them" on server systems.  One of the reasons 
there aren't more GPUs in typical database server configurations is that 
you're already filling up some number of the full size slots, and 
correspondingly the bandwidth available to cards, with disk 
controllers.  It doesn't help that many server class motherboards don't 
even have a x16 PCI-e slot on them, which is what most GPUs as delivered 
on regular consumer video cards are optimized for.

> But nVidia does produce a non-graphics-oriented GPGPU line called
> Tesla dedicated to such processing.
>    

Tesla units start at around $1500 USD, which is a nice budget to spend
on either more RAM (to allow higher work_mem), faster storage to store
temporary files onto, or a faster CPU to chew through all sorts of tasks
more quickly.  The Tesla units are easy to justify if you have a serious
GPU-oriented application.  The good bang for the buck point for GPU
sorting in PostgreSQL is probably going to be a $50-$100 video card
instead.  For example, the card Vitor is seeing good results on costs
around $60.  (That's also a system with fairly slow RAM, though; it will
be interesting to see if the gain holds up on newer systems.)

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



Re: CUDA Sorting

From
PostgreSQL - Hans-Jürgen Schönig
Date:
On Sep 19, 2011, at 5:16 PM, Tom Lane wrote:

> Greg Stark <stark@mit.edu> writes:
>> That said, to help in the case I described you would have to implement
>> the tapesort algorithm on the GPU as well.
>
> I think the real problem would be that we are seldom sorting just the
> key values.  If you have to push the tuples through the GPU too, your
> savings are going to go up in smoke pretty quickly …
>


I would argue along a similar line.
To make GPU code fast it has to be pretty much tailored to do exactly
one thing - otherwise you have no chance to get anywhere close to card
bandwidth.
If you look at "two similar" GPU codes which seem to do the same thing
you might easily see that one is 10 times faster than the other - for
bloody reasons such as memory alignment, memory transaction size or
whatever.
This opens a bit of a problem: PostgreSQL sorting is so generic and so
flexible that I would be really surprised if somebody could come up
with a solution which really comes close to what the GPU can do.
It would definitely be interesting to see a prototype, however.

btw, there is a handful of interesting talks / lectures about GPU
programming provided by the University of Chicago (I just cannot find
the link atm).
regards,
    hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de



Re: CUDA Sorting

From
Cédric Villemain
Date:
2011/9/19 Greg Smith <greg@2ndquadrant.com>:
> On 09/19/2011 10:53 AM, Thom Brown wrote:
>>
>> But couldn't that also be seen as a chicken/egg situation?
>
>
> The chicken/egg problem here is a bit deeper than just "no one offers GPUs
> because no one wants them" on server systems.  One of the reasons there
> aren't more GPUs in typical database server configurations is that you're
> already filling up some number of the full size slots, and correspondingly
> the bandwidth available to cards, with disk controllers.  It doesn't help
> that many server class motherboards don't even have a x16 PCI-e slot on
> them, which is what most GPUs as delivered on regular consumer video cards
> are optimized for.
>

The Sandy Bridge and Ivy Bridge Intel series combine CPU and GPU. I
don't know how using the GPU affects the CPU part, but it might be
interesting to explore...


--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation


Re: CUDA Sorting

From
Florian Pflug
Date:
On Sep19, 2011, at 19:46 , Stephen Frost wrote:
> I agree that it'd be interesting to do, but I share Lord Stark's
> feelings about the challenges and lack of potential gain- it's a very
> small set of queries that would benefit from this.  You need to be
> working with enough data to make the cost of tranferring it all over to
> the GPU worthwhile, just for starters..

I wonder if anyone has ever tried to employ a GPU for more low-level
tasks. Things like sorting or hashing are hard to move to the
GPU in postgres because, in the general case, they involve essentially
arbitrary user-defined functions. But couldn't for example the WAL CRC
computation be moved to a GPU? Or, to get really crazy, even the search
for the optimal join order (only for a large number of joins though,
i.e. where we currently switch to a genetic algorithm)?

best regards,
Florian Pflug



Re: CUDA Sorting

From
Nulik Nol
Date:
>
> I already did some benchmarks with GPU sorting (not in pgsql), and
> measured total sort times, copy bandwidth and energy usage, and got
> some exciting results:
Was that CPU qsort implementation cache-friendly and optimized for SSE?
To make a fair comparison you have to take the best CPU implementation
and compare it to the best GPU implementation. Otherwise you are
comparing a full-throttle GPU against a lazy CPU.
Check this paper on how hash join was optimized 17x when SSE
instructions were used.
www.vldb.org/pvldb/2/vldb09-257.pdf

Regards


-- 
==================================
The power of zero is infinite


Re: CUDA Sorting

From
Hannu Krosing
Date:
On Mon, 2011-09-19 at 15:12 +0100, Greg Stark wrote:
> On Mon, Sep 19, 2011 at 1:11 PM, Vitor Reus <vitor.reus@gmail.com> wrote:
> > Since I'm new to pgsql development, I replaced the code of pgsql
> > qsort_arg to get used with the way postgres does the sort. The problem
> > is that I can't use the qsort_arg_comparator comparator function on
> > GPU, I need to implement my own. I didn't find out how to access the
> > sorting key value data of the tuples on the Tuplesortstate or
> > SortTuple structures. This part looks complicated because it seems the
> > state holds the pointer for the scanner(?), but I didn't managed to
> > access the values directly. Can anyone tell me how this works?

....

> With the GPU I'm curious to see how well
> it handles multiple processes contending for resources, it might be a
> flashy feature that gets lots of attention but might not really be
> very useful in practice. But it would be very interesting to see.

There are cases where concurrency may not be that important, like some
specialized OLAP loads where you have to sort -- for example, finding
a median in large data sets.


-- 
-------
Hannu Krosing
PostgreSQL Unlimited Scalability and Performance Consultant
2ndQuadrant Nordic
PG Admin Book: http://www.2ndQuadrant.com/books/



Re: CUDA Sorting

From
Hannu Krosing
Date:
On Mon, 2011-09-19 at 10:36 -0400, Greg Smith wrote:
> On 09/19/2011 10:12 AM, Greg Stark wrote:
> > With the GPU I'm curious to see how well
> > it handles multiple processes contending for resources, it might be a
> > flashy feature that gets lots of attention but might not really be
> > very useful in practice. But it would be very interesting to see.
> >
>
> The main problem here is that the sort of hardware commonly used for
> production database servers doesn't have any serious enough GPU to
> support CUDA/OpenCL available.  The very clear trend now is that all
> systems other than gaming ones ship with motherboard graphics chipsets
> more than powerful enough for any task but that.  I just checked the 5
> most popular configurations of server I see my customers deploy
> PostgreSQL onto (a mix of Dell and HP units), and you don't get a
> serious GPU from any of them.
>
> Intel's next generation Ivy Bridge chipset, expected for the spring of
> 2012, is going to add support for OpenCL to the built-in motherboard
> GPU.  We may eventually see that trickle into the server hardware side
> of things too.
>
> I've never seen a PostgreSQL server capable of running CUDA, and I don't
> expect that to change.

CUDA sorting could be beneficial on general server hardware if it can
run well on multiple CPUs in parallel. GPUs being in essence parallel
processors on fast shared memory, it may be that even on ordinary RAM
and lots of CPUs some CUDA algorithms are a significant win.

And then there is the non-graphics GPU available on EC2:

 Cluster GPU Quadruple Extra Large Instance
 22 GB of memory
 33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core “Nehalem”
      architecture)
 2 x NVIDIA Tesla “Fermi” M2050 GPUs
 1690 GB of instance storage
 64-bit platform
 I/O Performance: Very High (10 Gigabit Ethernet)
 API name: cg1.4xlarge

It costs $2.10 per hour, probably a lot less if you use the Spot
Instances.

> --
> Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
> PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
>
>




Re: CUDA Sorting

From
Vitor Reus
Date:
Hey hackers,

I'm still having problems reading the column values in tuplesort.c, which I need in order to understand how to port this to CUDA.

Should I use the heap_getattr macro to read them?
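
For reference, this is roughly what I'm trying, copying the pattern
comparetup_heap uses to pull out the first key column (just a sketch,
with the field names as I read them in tuplesort.c -- I'm not sure it's
the right approach):

    /* extract the leading sort key from one in-memory SortTuple */
    static Datum
    get_leading_key(Tuplesortstate *state, SortTuple *stup,
                    AttrNumber attnum, bool *isnull)
    {
        HeapTupleData htup;

        /* SortTuple.tuple holds a MinimalTuple for heap sorts */
        htup.t_len = ((MinimalTuple) stup->tuple)->t_len +
            MINIMAL_TUPLE_OFFSET;
        htup.t_data = (HeapTupleHeader)
            ((char *) stup->tuple - MINIMAL_TUPLE_OFFSET);

        return heap_getattr(&htup, attnum, state->tupDesc, isnull);
    }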

2011/9/24 Hannu Krosing <hannu@krosing.net>
On Mon, 2011-09-19 at 10:36 -0400, Greg Smith wrote:
> On 09/19/2011 10:12 AM, Greg Stark wrote:
> > With the GPU I'm curious to see how well
> > it handles multiple processes contending for resources, it might be a
> > flashy feature that gets lots of attention but might not really be
> > very useful in practice. But it would be very interesting to see.
> >
>
> The main problem here is that the sort of hardware commonly used for
> production database servers doesn't have any serious enough GPU to
> support CUDA/OpenCL available.  The very clear trend now is that all
> systems other than gaming ones ship with motherboard graphics chipsets
> more than powerful enough for any task but that.  I just checked the 5
> most popular configurations of server I see my customers deploy
> PostgreSQL onto (a mix of Dell and HP units), and you don't get a
> serious GPU from any of them.
>
> Intel's next generation Ivy Bridge chipset, expected for the spring of
> 2012, is going to add support for OpenCL to the built-in motherboard
> GPU.  We may eventually see that trickle into the server hardware side
> of things too.
>
> I've never seen a PostgreSQL server capable of running CUDA, and I don't
> expect that to change.

CUDA sorting could be beneficial on general server hardware if it can
run well on multiple cpus in parallel. GPU-s being in essence parallel
processors on fast shared memory, it may be that even on ordinary RAM
and lots of CPUs some CUDA algorithms are a significant win.

and then there is non-graphics GPU availabe on EC2

 Cluster GPU Quadruple Extra Large Instance

 22 GB of memory
 33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core “Nehalem”
      architecture)
 2 x NVIDIA Tesla “Fermi” M2050 GPUs
 1690 GB of instance storage
 64-bit platform
 I/O Performance: Very High (10 Gigabit Ethernet)
 API name: cg1.4xlarge

It costs $2.10 per hour, probably a lot less if you use the Spot
Instances.

> --
> Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
> PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
>
>




Re: CUDA Sorting

From
Gaetano Mendola
Date:
On 19/09/2011 21:41, PostgreSQL - Hans-Jürgen Schönig wrote:
>
> On Sep 19, 2011, at 5:16 PM, Tom Lane wrote:
>
>> Greg Stark<stark@mit.edu>  writes:
>>> That said, to help in the case I described you would have to implement
>>> the tapesort algorithm on the GPU as well.
>>
>> I think the real problem would be that we are seldom sorting just the
>> key values.  If you have to push the tuples through the GPU too, your
>> savings are going to go up in smoke pretty quickly …
>>
>
>
> I would argue along a similar line.
> To make GPU code fast it has to be pretty much tailored to do exactly
> one thing - otherwise you have no chance to get anywhere close to
> card bandwidth.
> If you look at "two similar" GPU codes which seem to do the same
> thing you might easily see that one is 10 times faster than the
> other - for bloody reasons such as memory alignment, memory
> transaction size or whatever.
> This opens a bit of a problem: PostgreSQL sorting is so generic and
> so flexible that I would be really surprised if somebody could come
> up with a solution which really comes close to what the GPU can do.
> It would definitely be interesting to see a prototype, however.

The Thrust Nvidia library provides the same sorting flexibility as
postgres does:

    // generate 32M random numbers on the host
    thrust::host_vector<int> h_vec(32 << 20);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer data to the device
    thrust::device_vector<int> d_vec = h_vec;

    // sort data on the device (846M keys per second on GeForce GTX 480)
    thrust::sort(d_vec.begin(), d_vec.end());

    // transfer data back to host
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());


As you can see, the type to be sorted is a template parameter, and
thrust::sort also has a version that takes the comparator to use.
So compared with pg_qsort, thrust::sort gives you the same
flexibility.

http://docs.thrust.googlecode.com/hg/group__sorting.html
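
For the tuplesort case you would want to carry the row positions along
with the keys, and with Thrust that is just sort_by_key. A sketch of
that (the names and the DESC comparator are only illustrative; this is
not wired into Postgres):

    #include <thrust/device_vector.h>
    #include <thrust/sequence.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>

    // comparator handed to the GPU, e.g. for an ORDER BY ... DESC
    struct desc_cmp
    {
        __host__ __device__
        bool operator()(int a, int b) const { return a > b; }
    };

    // sort n int keys on the device and return the permutation that
    // tells the caller how to reorder its array of tuples
    void gpu_sort_int_keys(const int *keys, int *order, size_t n)
    {
        thrust::device_vector<int> d_keys(keys, keys + n);  // host -> device
        thrust::device_vector<int> d_idx(n);
        thrust::sequence(d_idx.begin(), d_idx.end());       // 0, 1, ..., n-1

        // keys and indexes are permuted in lock step
        thrust::sort_by_key(d_keys.begin(), d_keys.end(),
                            d_idx.begin(), desc_cmp());

        thrust::copy(d_idx.begin(), d_idx.end(), order);    // device -> host
    }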

Regards
Gaetano Mendola












Re: CUDA Sorting

From
Gaetano Mendola
Date:
On 19/09/2011 16:36, Greg Smith wrote:
> On 09/19/2011 10:12 AM, Greg Stark wrote:
>> With the GPU I'm curious to see how well
>> it handles multiple processes contending for resources, it might be a
>> flashy feature that gets lots of attention but might not really be
>> very useful in practice. But it would be very interesting to see.
>
> The main problem here is that the sort of hardware commonly used for
> production database servers doesn't have any serious enough GPU to
> support CUDA/OpenCL available. The very clear trend now is that all
> systems other than gaming ones ship with motherboard graphics chipsets
> more than powerful enough for any task but that. I just checked the 5
> most popular configurations of server I see my customers deploy
> PostgreSQL onto (a mix of Dell and HP units), and you don't get a
> serious GPU from any of them.
>
> Intel's next generation Ivy Bridge chipset, expected for the spring of
> 2012, is going to add support for OpenCL to the built-in motherboard
> GPU. We may eventually see that trickle into the server hardware side of
> things too.


The trend is to have servers capable of running CUDA by providing
GPUs via external hardware (a PCI Express interface with PCI Express
switches); look for example at the PowerEdge C410x PCIe Expansion
Chassis from DELL.

I did some experiments timing the sort done with CUDA and the sort
done with pg_qsort:

                      CUDA       pg_qsort
33 million integers:  ~ 900 ms   ~ 6000 ms
1 million integers:   ~  21 ms   ~  162 ms
100k integers:        ~   2 ms   ~   13 ms

The CUDA times already include the copy operations (host->device,
device->host).

As the GPU I was using a C2050, and the CPU doing the pg_qsort was an
Intel(R) Xeon(R) CPU X5650 @ 2.67GHz.

Copy operations and kernel runs (the sort for instance) can run in 
parallel, so while you are sorting a batch of data, you can copy the 
next batch in parallel.

As you can see the boost is not negligible.

The next Nvidia hardware (Kepler family) is PCI Express 3 ready, so
expect the "bottleneck" of the device->host->device copies to have
less impact in the near future.

I strongly believe there is room to give modern database engines a
way to offload sorts to the GPU.
> I've never seen a PostgreSQL server capable of running CUDA, and I
> don't expect that to change.

That sounds like:

"I think there is a world market for maybe five computers."
- IBM Chairman Thomas Watson, 1943

Regards
Gaetano Mendola



Re: CUDA Sorting

From
Oleg Bartunov
Date:
I'm wondering if CUDA would win for geometry operations, for example
testing point <@ complex_polygon.

Oleg
On Sun, 12 Feb 2012, Gaetano Mendola wrote:

> On 19/09/2011 16:36, Greg Smith wrote:
>> On 09/19/2011 10:12 AM, Greg Stark wrote:
>>> With the GPU I'm curious to see how well
>>> it handles multiple processes contending for resources, it might be a
>>> flashy feature that gets lots of attention but might not really be
>>> very useful in practice. But it would be very interesting to see.
>> 
>> The main problem here is that the sort of hardware commonly used for
>> production database servers doesn't have any serious enough GPU to
>> support CUDA/OpenCL available. The very clear trend now is that all
>> systems other than gaming ones ship with motherboard graphics chipsets
>> more than powerful enough for any task but that. I just checked the 5
>> most popular configurations of server I see my customers deploy
>> PostgreSQL onto (a mix of Dell and HP units), and you don't get a
>> serious GPU from any of them.
>> 
>> Intel's next generation Ivy Bridge chipset, expected for the spring of
>> 2012, is going to add support for OpenCL to the built-in motherboard
>> GPU. We may eventually see that trickle into the server hardware side of
>> things too.
>
>
> The trend is to have server capable of running CUDA providing GPU via 
> external hardware (PCI Express interface with PCI Express switches), look for 
> example at PowerEdge C410x PCIe Expansion Chassis from DELL.
>
> I did some experimenst timing the sort done with CUDA and the sort done with 
> pg_qsort:
>                       CUDA      pg_qsort
> 33Milion integers:   ~ 900 ms,  ~ 6000 ms
> 1Milion integers:    ~  21 ms,  ~  162 ms
> 100k integers:       ~   2 ms,  ~   13 ms
>
> CUDA time has already in the copy operations (host->device, device->host).
>
> As GPU I was using a C2050, and the CPU doing the pg_qsort was a Intel(R) 
> Xeon(R) CPU X5650  @ 2.67GHz
>
> Copy operations and kernel runs (the sort for instance) can run in parallel, 
> so while you are sorting a batch of data, you can copy the next batch in 
> parallel.
>
> As you can see the boost is not negligible.
>
> Next Nvidia hardware (Keplero family) is PCI Express 3 ready, so expect in 
> the near future the "bottle neck" of the device->host->device copies to have 
> less impact.
>
> I strongly believe there is space to provide modern database engine of
> a way to offload sorts to GPU.
>
>> I've never seen a PostgreSQL server capable of running CUDA, and I
>> don't expect that to change.
>
> That sounds like:
>
> "I think there is a world market for maybe five computers."
> - IBM Chairman Thomas Watson, 1943
>
> Regards
> Gaetano Mendola
>
>
>
    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: CUDA Sorting

From
Gaetano Mendola
Date:
On 12/02/2012 13:13, Oleg Bartunov wrote:
> I'm wondering if CUDA will win in geomentry operations, for example,
> tesing point <@ complex_polygon


I'm not sure if the algorithm you mentioned can be implemented in terms
of vector algebra, blas, etc.

There are plenty of geometry operations implemented in CUDA out
there; my field of CUDA application is not this one, so I'm not that
deep into it.

However, I can point you to the official NVIDIA NPP library, which
provides vector algebra algorithms, and some geometry algorithms as
well.

http://developer.download.nvidia.com/compute/DevZone/docs/html/CUDALibraries/doc/NPP_Library.pdf

(take a look at around page 620).

Regards
Gaetano Mendola


> Oleg
> On Sun, 12 Feb 2012, Gaetano Mendola wrote:
>
>> On 19/09/2011 16:36, Greg Smith wrote:
>>> On 09/19/2011 10:12 AM, Greg Stark wrote:
>>>> With the GPU I'm curious to see how well
>>>> it handles multiple processes contending for resources, it might be a
>>>> flashy feature that gets lots of attention but might not really be
>>>> very useful in practice. But it would be very interesting to see.
>>>
>>> The main problem here is that the sort of hardware commonly used for
>>> production database servers doesn't have any serious enough GPU to
>>> support CUDA/OpenCL available. The very clear trend now is that all
>>> systems other than gaming ones ship with motherboard graphics chipsets
>>> more than powerful enough for any task but that. I just checked the 5
>>> most popular configurations of server I see my customers deploy
>>> PostgreSQL onto (a mix of Dell and HP units), and you don't get a
>>> serious GPU from any of them.
>>>
>>> Intel's next generation Ivy Bridge chipset, expected for the spring of
>>> 2012, is going to add support for OpenCL to the built-in motherboard
>>> GPU. We may eventually see that trickle into the server hardware side of
>>> things too.
>>
>>
>> The trend is to have server capable of running CUDA providing GPU via
>> external hardware (PCI Express interface with PCI Express switches),
>> look for example at PowerEdge C410x PCIe Expansion Chassis from DELL.
>>
>> I did some experimenst timing the sort done with CUDA and the sort
>> done with pg_qsort:
>> CUDA pg_qsort
>> 33Milion integers: ~ 900 ms, ~ 6000 ms
>> 1Milion integers: ~ 21 ms, ~ 162 ms
>> 100k integers: ~ 2 ms, ~ 13 ms
>>
>> CUDA time has already in the copy operations (host->device,
>> device->host).
>>
>> As GPU I was using a C2050, and the CPU doing the pg_qsort was a
>> Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>>
>> Copy operations and kernel runs (the sort for instance) can run in
>> parallel, so while you are sorting a batch of data, you can copy the
>> next batch in parallel.
>>
>> As you can see the boost is not negligible.
>>
>> Next Nvidia hardware (Keplero family) is PCI Express 3 ready, so
>> expect in the near future the "bottle neck" of the
>> device->host->device copies to have less impact.
>>
>> I strongly believe there is space to provide modern database engine of
>> a way to offload sorts to GPU.
>>
>>> I've never seen a PostgreSQL server capable of running CUDA, and I
>>> don't expect that to change.
>>
>> That sounds like:
>>
>> "I think there is a world market for maybe five computers."
>> - IBM Chairman Thomas Watson, 1943
>>
>> Regards
>> Gaetano Mendola
>>
>>
>>
>
> Regards,
> Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83
>



Re: CUDA Sorting

From
Greg Smith
Date:
On 02/11/2012 08:14 PM, Gaetano Mendola wrote:
> The trend is to have server capable of running CUDA providing GPU via 
> external hardware (PCI Express interface with PCI Express switches), 
> look for example at PowerEdge C410x PCIe Expansion Chassis from DELL.

The C410X adds 16 PCIe slots to a server, housed inside a separate 3U 
enclosure.  That's a completely sensible purchase if your goal is to 
build a computing cluster, where a lot of work is handed off to a set of 
GPUs.  I think that's even less likely to be a cost-effective option for 
a database server.  Adding a single dedicated GPU installed in a server 
to accelerate sorting is something that might be justifiable, based on 
your benchmarks.  This is a much more expensive option than that 
though.  Details at 
http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for anyone who 
wants to see just how big this external box is.

> I did some experimenst timing the sort done with CUDA and the sort 
> done with pg_qsort:
>                        CUDA      pg_qsort
> 33Milion integers:   ~ 900 ms,  ~ 6000 ms
> 1Milion integers:    ~  21 ms,  ~  162 ms
> 100k integers:       ~   2 ms,  ~   13 ms
> CUDA time has already in the copy operations (host->device, 
> device->host).
> As GPU I was using a C2050, and the CPU doing the pg_qsort was a 
> Intel(R) Xeon(R) CPU X5650  @ 2.67GHz

That's really interesting, and the X5650 is by no means a slow CPU.  So 
this benchmark is providing a lot of CPU power yet still seeing over a 
6X speedup in sort times.  It sounds like the PCI Express bus has gotten 
fast enough that the time to hand data over and get it back again can 
easily be justified for medium to large sized sorts.

It would be helpful to take this patch and confirm whether it scales
when used in parallel.  The easiest way to do that would be to use the
pgbench "-f" feature, which allows running an arbitrary number of some
query at once.  Seeing whether this acceleration continues to hold as
the number of clients increases is a useful data point.

Is it possible for you to break down where the time is being spent?  For 
example, how much of this time is consumed in the GPU itself, compared 
to time spent transferring data between CPU and GPU?  I'm also curious 
where the bottleneck is at with this approach.  If it's the speed of the 
PCI-E bus for smaller data sets, adding more GPUs may never be 
practical.  If the bus can handle quite a few of these at once before it 
saturates, it might be possible to overload a single GPU.  That seems
like it would be really hard to reach for database sorting, though; I
can't really justify my gut feeling that that's true.
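
For example, wrapping the copies and the sort in cudaEvent timers
would break that down; a sketch only (reusing the h_vec from the
Thrust snippet upthread, nothing here is tested):

    cudaEvent_t start, after_copy_in, after_sort, after_copy_out;
    float ms_in, ms_sort, ms_out;

    cudaEventCreate(&start);
    cudaEventCreate(&after_copy_in);
    cudaEventCreate(&after_sort);
    cudaEventCreate(&after_copy_out);

    cudaEventRecord(start);
    thrust::device_vector<int> d_vec = h_vec;          // host -> device
    cudaEventRecord(after_copy_in);

    thrust::sort(d_vec.begin(), d_vec.end());          // the sort itself
    cudaEventRecord(after_sort);

    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());  // device -> host
    cudaEventRecord(after_copy_out);
    cudaEventSynchronize(after_copy_out);

    cudaEventElapsedTime(&ms_in,   start,         after_copy_in);
    cudaEventElapsedTime(&ms_sort, after_copy_in, after_sort);
    cudaEventElapsedTime(&ms_out,  after_sort,    after_copy_out);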

> > I've never seen a PostgreSQL server capable of running CUDA, and I
> > don't expect that to change.
>
> That sounds like:
>
> "I think there is a world market for maybe five computers."
> - IBM Chairman Thomas Watson, 1943

Yes, and "640K will be enough for everyone", ha ha.  (Having said the 
640K thing is flat out denied by Gates, BTW, and no one has come up with 
proof otherwise).

I think you've made an interesting case for this sort of acceleration 
now being useful for systems doing what's typically considered a data 
warehouse task.  I regularly see servers waiting for far more than 13M 
integers to sort.  And I am seeing a clear trend toward providing more 
PCI-E slots in servers now.  Dell's R810 is the most popular single 
server model my customers have deployed in the last year, and it has 5 
X8 slots in it.  It's rare all 5 of those are filled.  As long as a 
dedicated GPU works fine when dropped to X8 speeds, I know a fair number 
of systems where one of those could be added now.

There's another data point in your favor I didn't notice before your 
last e-mail.  Amazon has a "Cluster GPU Quadruple Extra Large" node type 
that runs with NVIDIA Tesla hardware.  That means the installed base of 
people who could consider CUDA is higher than I expected.  To 
demonstrate how much that costs, to provision a GPU enabled reserved 
instance from Amazon for one year costs $2410 at "Light Utilization", 
giving a system with 22GB of RAM and 1690GB of storage.  (I find the
reserved prices easier to compare with dedicated hardware than the 
hourly ones)  That's halfway between the High-Memory Double Extra Large 
Instance (34GB RAM/850GB disk) at $1100 and the High-Memory Quadruple 
Extra Large Instance (64GB RAM/1690GB disk) at $2200.  If someone could 
prove sorting was a bottleneck on their server, that isn't an 
unreasonable option to consider on a cloud-based database deployment.

I still think that an approach based on OpenCL is more likely to be 
suitable for PostgreSQL, which was part of why I gave CUDA low odds 
here.  The points in favor of OpenCL are:

-Since you last posted, OpenCL compiling has switched to using LLVM as 
their standard compiler.  Good PostgreSQL support for LLVM isn't far 
away.  It looks to me like the compiler situation for CUDA requires 
their PathScale based compiler.  I don't know enough about this area to 
say which compiling tool chain will end up being easier to deal with.

-Intel is making GPU support standard for OpenCL, as I mentioned 
before.  NVIDIA will be hard pressed to compete with Intel for GPU 
acceleration once more systems supporting that enter the market.

-Easy availability of OpenCL on Mac OS X for development sake.  Lots of 
Postgres hackers with OS X systems, even though there aren't too many OS 
X database servers.

The fact that Amazon provides a way to crack the chicken/egg hardware 
problem immediately helps a lot though, I don't even need a physical 
card here to test CUDA GPU acceleration on Linux now.  With that data 
point, your benchmarks are good enough to say I'd be willing to help 
review a patch in this area here as part of the 9.3 development cycle.  
That may validate that GPU acceleration is useful, and then the next 
step would be considering how portable that will be to other GPU 
interfaces.  I still expect CUDA will be looked back on as a dead end 
for GPU accelerated computing one day.  Computing history is not filled 
with many single-vendor standards that competed successfully against
Intel providing the same thing.  AMD's x86-64 is the only example I can 
think of where Intel didn't win that sort of race, which happened (IMHO) 
only because Intel's Itanium failed to prioritize backwards 
compatibility highly enough.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: CUDA Sorting

From
Kohei KaiGai
Date:
2012/2/13 Greg Smith <greg@2ndquadrant.com>:
> On 02/11/2012 08:14 PM, Gaetano Mendola wrote:
>>
>> The trend is to have server capable of running CUDA providing GPU via
>> external hardware (PCI Express interface with PCI Express switches), look
>> for example at PowerEdge C410x PCIe Expansion Chassis from DELL.
>
>
> The C410X adds 16 PCIe slots to a server, housed inside a separate 3U
> enclosure.  That's a completely sensible purchase if your goal is to build a
> computing cluster, where a lot of work is handed off to a set of GPUs.  I
> think that's even less likely to be a cost-effective option for a database
> server.  Adding a single dedicated GPU installed in a server to accelerate
> sorting is something that might be justifiable, based on your benchmarks.
>  This is a much more expensive option than that though.  Details at
> http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for anyone who wants
> to see just how big this external box is.
>
>
>> I did some experimenst timing the sort done with CUDA and the sort done
>> with pg_qsort:
>>                       CUDA      pg_qsort
>> 33Milion integers:   ~ 900 ms,  ~ 6000 ms
>> 1Milion integers:    ~  21 ms,  ~  162 ms
>> 100k integers:       ~   2 ms,  ~   13 ms
>> CUDA time has already in the copy operations (host->device, device->host).
>> As GPU I was using a C2050, and the CPU doing the pg_qsort was a Intel(R)
>> Xeon(R) CPU X5650  @ 2.67GHz
>
>
> That's really interesting, and the X5650 is by no means a slow CPU.  So this
> benchmark is providing a lot of CPU power yet still seeing over a 6X speedup
> in sort times.  It sounds like the PCI Express bus has gotten fast enough
> that the time to hand data over and get it back again can easily be
> justified for medium to large sized sorts.
>
> It would be helpful to take this patch and confirm whether it scales when
> using in parallel.  Easiest way to do that would be to use the pgbench "-f"
> feature, which allows running an arbitrary number of some query at once.
>  Seeing whether this acceleration continued to hold as the number of clients
> increases is a useful data point.
>
> Is it possible for you to break down where the time is being spent?  For
> example, how much of this time is consumed in the GPU itself, compared to
> time spent transferring data between CPU and GPU?  I'm also curious where
> the bottleneck is at with this approach.  If it's the speed of the PCI-E bus
> for smaller data sets, adding more GPUs may never be practical.  If the bus
> can handle quite a few of these at once before it saturates, it might be
> possible to overload a single GPU.  That seems like it would be really hard
> to reach for database sorting though; I can't really defend justify my gut
> feel for that being true though.
>
>
>> > I've never seen a PostgreSQL server capable of running CUDA, and I
>> > don't expect that to change.
>>
>> That sounds like:
>>
>> "I think there is a world market for maybe five computers."
>> - IBM Chairman Thomas Watson, 1943
>
>
> Yes, and "640K will be enough for everyone", ha ha.  (Having said the 640K
> thing is flat out denied by Gates, BTW, and no one has come up with proof
> otherwise).
>
> I think you've made an interesting case for this sort of acceleration now
> being useful for systems doing what's typically considered a data warehouse
> task.  I regularly see servers waiting for far more than 13M integers to
> sort.  And I am seeing a clear trend toward providing more PCI-E slots in
> servers now.  Dell's R810 is the most popular single server model my
> customers have deployed in the last year, and it has 5 X8 slots in it.  It's
> rare all 5 of those are filled.  As long as a dedicated GPU works fine when
> dropped to X8 speeds, I know a fair number of systems where one of those
> could be added now.
>
> There's another data point in your favor I didn't notice before your last
> e-mail.  Amazon has a "Cluster GPU Quadruple Extra Large" node type that
> runs with NVIDIA Tesla hardware.  That means the installed base of people
> who could consider CUDA is higher than I expected.  To demonstrate how much
> that costs, to provision a GPU enabled reserved instance from Amazon for one
> year costs $2410 at "Light Utilization", giving a system with 22GB of RAM
> and 1.69GB of storage.  (I find the reserved prices easier to compare with
> dedicated hardware than the hourly ones)  That's halfway between the
> High-Memory Double Extra Large Instance (34GB RAM/850GB disk) at $1100 and
> the High-Memory Quadruple Extra Large Instance (64GB RAM/1690GB disk) at
> $2200.  If someone could prove sorting was a bottleneck on their server,
> that isn't an unreasonable option to consider on a cloud-based database
> deployment.
>
> I still think that an approach based on OpenCL is more likely to be suitable
> for PostgreSQL, which was part of why I gave CUDA low odds here.  The points
> in favor of OpenCL are:
>
> -Since you last posted, OpenCL compiling has switched to using LLVM as their
> standard compiler.  Good PostgreSQL support for LLVM isn't far away.  It
> looks to me like the compiler situation for CUDA requires their PathScale
> based compiler.  I don't know enough about this area to say which compiling
> tool chain will end up being easier to deal with.
>
> -Intel is making GPU support standard for OpenCL, as I mentioned before.
>  NVIDIA will be hard pressed to compete with Intel for GPU acceleration once
> more systems supporting that enter the market.
>
> -Easy availability of OpenCL on Mac OS X for development sake.  Lots of
> Postgres hackers with OS X systems, even though there aren't too many OS X
> database servers.
>
> The fact that Amazon provides a way to crack the chicken/egg hardware
> problem immediately helps a lot though, I don't even need a physical card
> here to test CUDA GPU acceleration on Linux now.  With that data point, your
> benchmarks are good enough to say I'd be willing to help review a patch in
> this area here as part of the 9.3 development cycle.  That may validate that
> GPU acceleration is useful, and then the next step would be considering how
> portable that will be to other GPU interfaces.  I still expect CUDA will be
> looked back on as a dead end for GPU accelerated computing one day.
>  Computing history is not filled with many single-vendor standards who
> competed successfully against Intel providing the same thing.  AMD's x86-64
> is the only example I can think of where Intel didn't win that sort of race,
> which happened (IMHO) only because Intel's Itanium failed to prioritize
> backwards compatibility highly enough.
>
As a side note: my module (PG-Strom) also uses CUDA, although I tried to
implement it with OpenCL at the beginning of the project; OpenCL didn't work
well when multiple sessions use a GPU device concurrently.
The second background process gets an out-of-resources error while another
process has the GPU device open.

I'm not sure whether this is a limitation of OpenCL, of the NVIDIA driver, or
a bug in my code. Anyway, I switched to CUDA instead of investigating the
binary drivers. :-(

Thanks,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>


Re: CUDA Sorting

From
Gaetano Mendola
Date:
On Feb 13, 2012 11:39 a.m., "Kohei KaiGai" <kaigai@kaigai.gr.jp> wrote:
> As a side note: my module (PG-Strom) also uses CUDA, although I tried to
> implement it with OpenCL at the beginning of the project; OpenCL didn't work
> well when multiple sessions use a GPU device concurrently.
> The second background process gets an out-of-resources error while another
> process has the GPU device open.
>
> I'm not sure whether this is a limitation of OpenCL, of the NVIDIA driver, or
> a bug in my code. Anyway, I switched to CUDA instead of investigating the
> binary drivers. :-(

I have no experience with OpenCL, but for sure with CUDA 4.1 you can share the
same device from multiple host threads, for example allocating memory in one
host thread and using it in another. Maybe with OpenCL you were facing the very
same limit.
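
For what it's worth, a minimal sketch of the pattern I mean, assuming the
CUDA 4.x runtime API where all host threads of a process share one context per
device; the names, sizes, and use of Thrust here are just for illustration:

/* Buffer allocated in one host thread, sorted and freed in another. */
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/sort.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NKEYS (1 << 20)

static int *d_keys = NULL;       /* device buffer shared between host threads */
static int  h_keys[NKEYS];

static void *allocator_thread(void *arg)
{
    (void) arg;
    cudaSetDevice(0);                               /* device 0, shared context */
    cudaMalloc((void **) &d_keys, NKEYS * sizeof(int));
    cudaMemcpy(d_keys, h_keys, NKEYS * sizeof(int), cudaMemcpyHostToDevice);
    return NULL;
}

static void *sorter_thread(void *arg)
{
    (void) arg;
    cudaSetDevice(0);                               /* same device, same context */
    thrust::device_ptr<int> p(d_keys);
    thrust::sort(p, p + NKEYS);                     /* memory allocated by the other thread */
    cudaMemcpy(h_keys, d_keys, NKEYS * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_keys);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    int i;

    for (i = 0; i < NKEYS; i++)
        h_keys[i] = rand();

    pthread_create(&a, NULL, allocator_thread, NULL);
    pthread_join(a, NULL);                          /* allocation is finished ...        */
    pthread_create(&b, NULL, sorter_thread, NULL);  /* ... before another thread uses it */
    pthread_join(b, NULL);

    printf("first=%d last=%d\n", h_keys[0], h_keys[NKEYS - 1]);
    return 0;
}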

Re: CUDA Sorting

From
Greg Stark
Date:
I don't think we should be looking at either CUDA or OpenCL directly.
We should be looking for a generic library that can target either and
is well maintained and actively developed. Any GPU code we write
ourselves would rapidly be overtaken by changes in the hardware and
innovations in parallel algorithms. If we find a library that provides
a sorting api and adapt our code to use it then we'll get the benefits
of any new hardware feature as the library adds support for them.


Re: CUDA Sorting

From
Gaetano Mendola
Date:
On Feb 13, 2012 7:49 p.m., "Greg Stark" <stark@mit.edu> wrote:
> I don't think we should be looking at either CUDA or OpenCL directly.
> We should be looking for a generic library that can target either and
> is well maintained and actively developed. Any GPU code we write
> ourselves would rapidly be overtaken by changes in the hardware and
> innovations in parallel algorithms. If we find a library that provides
> a sorting api and adapt our code to use it then we'll get the benefits
> of any new hardware feature as the library adds support for them.

To sort integers I used the NVIDIA Thrust library.
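
For anyone curious, a minimal sketch of a Thrust-based integer sort, with the
host->device and device->host copies shown explicitly; this is only an
illustration with arbitrary sizes, not the exact code behind the numbers above:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main(void)
{
    const size_t n = 1 << 20;                    /* the benchmark above went up to 32M */
    thrust::host_vector<int> h(n);

    for (size_t i = 0; i < n; i++)
        h[i] = rand();

    thrust::device_vector<int> d = h;            /* host -> device copy */
    thrust::sort(d.begin(), d.end());            /* the sort itself runs on the GPU */
    thrust::copy(d.begin(), d.end(), h.begin()); /* device -> host copy */
    return 0;
}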

Re: CUDA Sorting

From
Gaetano Mendola
Date:
On 13/02/2012 08:26, Greg Smith wrote:
> On 02/11/2012 08:14 PM, Gaetano Mendola wrote:
>> The trend is to have server capable of running CUDA providing GPU via
>> external hardware (PCI Express interface with PCI Express switches),
>> look for example at PowerEdge C410x PCIe Expansion Chassis from DELL.
>
> The C410X adds 16 PCIe slots to a server, housed inside a separate 3U
> enclosure. That's a completely sensible purchase if your goal is to
> build a computing cluster, where a lot of work is handed off to a set of
> GPUs. I think that's even less likely to be a cost-effective option for
> a database server. Adding a single dedicated GPU installed in a server
> to accelerate sorting is something that might be justifiable, based on
> your benchmarks. This is a much more expensive option than that though.
> Details at http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for
> anyone who wants to see just how big this external box is.
>
>> I did some experiments timing the sort done with CUDA and the sort
>> done with pg_qsort:
>>                        CUDA      pg_qsort
>> 33 million integers:  ~ 900 ms,  ~ 6000 ms
>> 1 million integers:   ~  21 ms,  ~  162 ms
>> 100k integers:        ~   2 ms,  ~   13 ms
>> CUDA time already includes the copy operations (host->device,
>> device->host).
>> As GPU I was using a C2050, and the CPU doing the pg_qsort was an
>> Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>
> That's really interesting, and the X5650 is by no means a slow CPU. So
> this benchmark is providing a lot of CPU power yet still seeing over a
> 6X speedup in sort times. It sounds like the PCI Express bus has gotten
> fast enough that the time to hand data over and get it back again can
> easily be justified for medium to large sized sorts.
>
> It would be helpful to take this patch and confirm whether it scales
> when using in parallel. Easiest way to do that would be to use the
> pgbench "-f" feature, which allows running an arbitrary number of some
> query at once. Seeing whether this acceleration continued to hold as the
> number of clients increases is a useful data point.
>
> Is it possible for you to break down where the time is being spent? For
> example, how much of this time is consumed in the GPU itself, compared
> to time spent transferring data between CPU and GPU? I'm also curious
> where the bottleneck is at with this approach. If it's the speed of the
> PCI-E bus for smaller data sets, adding more GPUs may never be
> practical. If the bus can handle quite a few of these at once before it
> saturates, it might be possible to overload a single GPU. That seems
> like it would be really hard to reach for database sorting, though; I
> can't really justify my gut feel for that being true.

There you go (times are in ms):

Size       H->D     SORT     D->H     TOTAL
64     0.209824 0.479392 0.013856 0.703072
128     0.098144 0.41744  0.01312  0.528704
256     0.096832 0.420352 0.013696 0.53088
512     0.097568 0.3952   0.014464 0.507232
1024     0.09872  0.396608 0.014624 0.509952
2048     0.101344 0.56224  0.016896 0.68048
4096     0.106176 0.562976 0.02016  0.689312
8192     0.116512 0.571264 0.02672  0.714496
16384     0.136096 0.587584 0.040192 0.763872
32768     0.179296 0.658112 0.066304 0.903712
65536     0.212352 0.84816  0.118016 1.178528
131072     0.317056 1.1465   0.22784  1.691396
262144     0.529376 1.82237  0.42512  2.776866
524288     0.724032 2.39834  0.64576  3.768132
1048576     1.11162  3.51978  1.12176  5.75316
2097152     1.95939  5.93434  2.06992  9.96365
4194304     3.76192  10.6011  4.10614  18.46916
8388608     7.16845  19.9245  7.93741  35.03036
16777216 13.8693  38.7413  15.4073  68.0179
33554432 27.3017  75.6418  30.6646  133.6081
67108864 54.2171  151.192  60.327   265.7361

pg_sort

64           0.010000
128          0.010000
256          0.021000
512          0.128000
1024         0.092000
2048         0.196000
4096         0.415000
8192         0.883000
16384        1.881000
32768        3.960000
65536        8.432000
131072      17.951000
262144      37.140000
524288      78.320000
1048576    163.276000
2097152    339.118000
4194304    693.223000
8388608   1423.142000
16777216  2891.218000
33554432  5910.851000
67108864 11980.930000

As you can notice, the CUDA times are lower than the timings I reported
in my previous post; back then the server was doing something else in the
meanwhile, so I have repeated these benchmarks with the server completely
unused.

And this is the speedup, computed as pg_sort time / CUDA time:

64     0.0142232943
128     0.018914175
256     0.039556962
512     0.2070058671
1024     0.1804091365
2048     0.2880319774
4096     0.6078524674
8192     1.2372357578
16384     2.4637635625
32768     4.4106972133
65536     7.1742037525
131072     10.5090706139
262144     13.3719091955
524288     20.5834084369
1048576     28.2516043357
2097152     33.9618513296
4194304     37.5247168794
8388608     40.5135716561
16777216 42.4743633661
33554432 44.2394809896
67108864 45.1499777411
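
A breakdown like the H->D / SORT / D->H columns above can be captured with
CUDA events; here is a rough sketch of such a harness, illustrative only and
not necessarily how these numbers were produced:

#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/sort.h>
#include <stdio.h>
#include <stdlib.h>

static float ms_between(cudaEvent_t a, cudaEvent_t b)
{
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, a, b);
    return ms;
}

int main(void)
{
    const size_t n = 1 << 20;
    int *h = (int *) malloc(n * sizeof(int));
    int *d = NULL;
    cudaEvent_t t0, t1, t2, t3;

    for (size_t i = 0; i < n; i++)
        h[i] = rand();
    cudaMalloc((void **) &d, n * sizeof(int));
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    cudaEventRecord(t0);
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);   /* H->D */
    cudaEventRecord(t1);
    thrust::device_ptr<int> p(d);
    thrust::sort(p, p + n);                                      /* SORT */
    cudaEventRecord(t2);
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);   /* D->H */
    cudaEventRecord(t3);
    cudaEventSynchronize(t3);

    printf("H->D %.3f ms  SORT %.3f ms  D->H %.3f ms\n",
           ms_between(t0, t1), ms_between(t1, t2), ms_between(t2, t3));

    cudaFree(d);
    free(h);
    return 0;
}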


>> > I've never seen a PostgreSQL server capable of running CUDA, and I
>> > don't expect that to change.
>>
>> That sounds like:
>>
>> "I think there is a world market for maybe five computers."
>> - IBM Chairman Thomas Watson, 1943
>
> Yes, and "640K will be enough for everyone", ha ha. (Having said the
> 640K thing is flat out denied by Gates, BTW, and no one has come up with
> proof otherwise).
>
> I think you've made an interesting case for this sort of acceleration
> now being useful for systems doing what's typically considered a data
> warehouse task. I regularly see servers waiting for far more than 13M
> integers to sort. And I am seeing a clear trend toward providing more
> PCI-E slots in servers now. Dell's R810 is the most popular single
> server model my customers have deployed in the last year, and it has 5
> X8 slots in it. It's rare all 5 of those are filled. As long as a
> dedicated GPU works fine when dropped to X8 speeds, I know a fair number
> of systems where one of those could be added now.
>
> There's another data point in your favor I didn't notice before your
> last e-mail. Amazon has a "Cluster GPU Quadruple Extra Large" node type
> that runs with NVIDIA Tesla hardware. That means the installed base of
> people who could consider CUDA is higher than I expected. To demonstrate
> how much that costs, to provision a GPU enabled reserved instance from
> Amazon for one year costs $2410 at "Light Utilization", giving a system
> with 22GB of RAM and 1.69GB of storage. (I find the reserved prices
> easier to compare with dedicated hardware than the hourly ones) That's
> halfway between the High-Memory Double Extra Large Instance (34GB
> RAM/850GB disk) at $1100 and the High-Memory Quadruple Extra Large
> Instance (64GB RAM/1690GB disk) at $2200. If someone could prove sorting
> was a bottleneck on their server, that isn't an unreasonable option to
> consider on a cloud-based database deployment.
>
> I still think that an approach based on OpenCL is more likely to be
> suitable for PostgreSQL, which was part of why I gave CUDA low odds
> here. The points in favor of OpenCL are:
>
> -Since you last posted, OpenCL compiling has switched to using LLVM as
> their standard compiler. Good PostgreSQL support for LLVM isn't far
> away. It looks to me like the compiler situation for CUDA requires their
> PathScale based compiler. I don't know enough about this area to say
> which compiling tool chain will end up being easier to deal with.

The NVIDIA compiler, nvcc, has switched to LLVM as well (CUDA 4.1).

> -Intel is making GPU support standard for OpenCL, as I mentioned before.
> NVIDIA will be hard pressed to compete with Intel for GPU acceleration
> once more systems supporting that enter the market.
>
> -Easy availability of OpenCL on Mac OS X for development sake. Lots of
> Postgres hackers with OS X systems, even though there aren't too many OS
> X database servers.
> The fact that Amazon provides a way to crack the chicken/egg hardware
> problem immediately helps a lot though, I don't even need a physical
> card here to test CUDA GPU acceleration on Linux now. With that data
> point, your benchmarks are good enough to say I'd be willing to help
> review a patch in this area here as part of the 9.3 development cycle.
> That may validate that GPU acceleration is useful, and then the next
> step would be considering how portable that will be to other GPU
> interfaces. I still expect CUDA will be looked back on as a dead end for
> GPU accelerated computing one day. Computing history is not filled with
> many single-vendor standards who competed successfully against Intel
> providing the same thing. AMD's x86-64 is the only example I can think
> of where Intel didn't win that sort of race, which happened (IMHO) only
> because Intel's Itanium failed to prioritize backwards compatibility
> highly enough.

I think that, since NVIDIA's nvcc now uses LLVM, we will soon be able to
compile "CUDA" programs for any target architecture supported by LLVM.

Regards
Gaetano Mendola




Re: CUDA Sorting

From
Marti Raudsepp
Date:
On Mon, Feb 13, 2012 at 20:48, Greg Stark <stark@mit.edu> wrote:
> I don't think we should be looking at either CUDA or OpenCL directly.
> We should be looking for a generic library that can target either and
> is well maintained and actively developed.

I understand your point about using some external library for the
primitives, but I don't see why it needs to support both CUDA and
OpenCL. Libraries for GPU-accelerated primitives generally target
OpenCL *or* CUDA, not both.

As far as I understand (and someone correct me if I'm wrong), the
difference between them is mostly the API and the fact that CUDA had a
head start, and thus a larger developer community around it. (All the
early adopters went to CUDA)

But OpenCL already acts as an abstraction layer. CUDA is
NVIDIA-specific, but OpenCL is supported by AMD, Intel as well as
NVIDIA. It's pretty rare for servers to have separate graphics cards,
but recent Intel and AMD CPUs already have a GPU included on die,
which is another bonus for OpenCL.

So I'd say, the way things are heading, it's only a matter of time
before OpenCL takes over and there will be little reason to look back.

Regards,
Marti


Re: CUDA Sorting

From
Gaetano Mendola
Date:
On 13/02/2012 19:48, Greg Stark wrote:
> I don't think we should be looking at either CUDA or OpenCL directly.
> We should be looking for a generic library that can target either and
> is well maintained and actively developed. Any GPU code we write
> ourselves would rapidly be overtaken by changes in the hardware and
> innovations in parallel algorithms. If we find a library that provides
> a sorting api and adapt our code to use it then we'll get the benefits
> of any new hardware feature as the library adds support for them.
>

I think one option is to make the sort function pluggable via a shared
library/DLL. I see several benefits from this:
 - It could be in the interest of a hardware vendor to provide the
most powerful sort implementation (I'm sure, for example, that the TBB
sort implementation is faster than pg_sort).
 - It would let people "play" with it without being deeply involved
in pg development.
 - It would relieve the postgres core group of choosing the right
language/tool/implementation to use.
 - It would also let people who are unwilling (or unable, for that matter)
to upgrade the postgres engine swap in a different sort function after a
hardware upgrade.


Of course, if this happens the postgres engine has to perform some kind of
sanity check (for example, that the function actually sorts) before
"trusting" the plugged-in sort.
The engine could even have multiple sort implementations available and
use the most proficient one (imagine some sorts performing better on a
certain value range or element size).
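
Purely to illustrate the idea, here is a rough sketch of what such a hook
could look like; none of these names exist in PostgreSQL today, the plugin
path and symbol are made up, and the signature is only loosely modelled on
qsort_arg:

/* Hypothetical sketch only: a sort hook loaded from a shared library and
 * sanity-checked before it is trusted. */
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

typedef int  (*sort_cmp_fn)  (const void *a, const void *b, void *arg);
typedef void (*sort_hook_fn) (void *base, size_t nelems, size_t elsize,
                              sort_cmp_fn cmp, void *arg);

static sort_hook_fn plugged_sort = NULL;

static int cmp_int(const void *a, const void *b, void *arg)
{
    int x = *(const int *) a;
    int y = *(const int *) b;

    (void) arg;
    return (x > y) - (x < y);
}

/* Minimal sanity check: the plugin must at least sort a small known array. */
static int plugin_looks_sane(void)
{
    int probe[5] = { 4, 1, 3, 0, 2 };
    int i;

    plugged_sort(probe, 5, sizeof(int), cmp_int, NULL);
    for (i = 0; i < 4; i++)
        if (probe[i] > probe[i + 1])
            return 0;
    return 1;
}

static void load_sort_plugin(const char *path)
{
    void *handle = dlopen(path, RTLD_NOW);

    if (handle != NULL)
        plugged_sort = (sort_hook_fn) dlsym(handle, "plugin_sort");
    if (plugged_sort == NULL || !plugin_looks_sane())
    {
        plugged_sort = NULL;
        fprintf(stderr, "sort plugin rejected, falling back to the built-in sort\n");
    }
}

int main(void)
{
    load_sort_plugin("./sort_plugin.so");    /* hypothetical plugin path */
    return 0;
}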


Regards
Gaetano Mendola



Re: CUDA Sorting

From
Peter Geoghegan
Date:
On 15 February 2012 20:00, Gaetano Mendola <mendola@gmail.com> wrote:
> On 13/02/2012 19:48, Greg Stark wrote:
>>
>> I don't think we should be looking at either CUDA or OpenCL directly.
>> We should be looking for a generic library that can target either and
>> is well maintained and actively developed. Any GPU code we write
>> ourselves would rapidly be overtaken by changes in the hardware and
>> innovations in parallel algorithms. If we find a library that provides
>> a sorting api and adapt our code to use it then we'll get the benefits
>> of any new hardware feature as the library adds support for them.
>>
>
> I think one option is to make the sort function pluggable with a shared
> library/dll. I see several benefits from this:
>
>  - It could be in the interest of the hardware vendor to provide the most
> powerful sort implementation (I'm sure for example that TBB sort
> implementation is faster that pg_sort)
>
>  - It can permit people to "play" with it without being deep involved in pg
> development and stuffs.

Sorry, but I find it really hard to believe that the non-availability
of pluggable sorting is what's holding people back here. Some vanguard
needs to go and prove the idea by building a rough prototype before we
can even really comment on what an API should look like. For example,
I am given to understand that GPUs generally sort using radix sort -
resolving the impedance mismatch that prevents someone from using a
non-comparison based sort sure sounds like a lot of work for an
entirely speculative reward.

Someone who cannot understand tuplesort, which is not all that
complicated, has no business trying to build GPU sorting into
Postgres.

I had a patch committed a few hours ago that almost included the
capability of assigning an alternative sorting function, but only one
with the exact same signature as my variant of qsort_arg. pg_qsort
isn't used to sort tuples at all, by the way.

Threading Building Blocks is not going to form the basis of any novel
sorting implementation, because comparators in general are not thread
safe, and it isn't available on all the platforms we support, and
because of how longjmp interacts with C++ stack unwinding and so on
and so on. Now, you could introduce some kind of parallelism into
sorting integers and floats, but that's an awful lot of work for a
marginal reward.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: CUDA Sorting

From
Gaetano Mendola
Date:
On 15/02/2012 23:11, Peter Geoghegan wrote:
> On 15 February 2012 20:00, Gaetano Mendola<mendola@gmail.com>  wrote:
>> On 13/02/2012 19:48, Greg Stark wrote:
>>>
>>> I don't think we should be looking at either CUDA or OpenCL directly.
>>> We should be looking for a generic library that can target either and
>>> is well maintained and actively developed. Any GPU code we write
>>> ourselves would rapidly be overtaken by changes in the hardware and
>>> innovations in parallel algorithms. If we find a library that provides
>>> a sorting api and adapt our code to use it then we'll get the benefits
>>> of any new hardware feature as the library adds support for them.
>>>
>>
>> I think one option is to make the sort function pluggable with a shared
>> library/dll. I see several benefits from this:
>>
>>   - It could be in the interest of the hardware vendor to provide the most
>> powerful sort implementation (I'm sure for example that TBB sort
>> implementation is faster that pg_sort)
>>
>>   - It can permit people to "play" with it without being deep involved in pg
>> development and stuffs.
>
> Sorry, but I find it really hard to believe that the non-availability
> of pluggable sorting is what's holding people back here. Some vanguard
> needs to go and prove the idea by building a rough prototype before we
> can even really comment on what an API should look like. For example,
> I am given to understand that GPUs generally sort using radix sort -
> resolving the impedance mismatch that prevents someone from using a
> non-comparison based sort sure sounds like a lot of work for an
> entirely speculative reward.

AFAIK the Thrust library uses radix sort if the keys you are sorting are
POD data comparable with the "<" operator; otherwise it does a
comparison-based sort using the operator provided.

http://docs.thrust.googlecode.com/hg/modules.html
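
For illustration, the two paths look roughly like this; a sketch, not taken
from any patch, and the comparator is an arbitrary example:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <cstdlib>

/* A user-supplied comparator forces Thrust onto its comparison-based path. */
struct abs_less
{
    __host__ __device__ bool operator()(int a, int b) const
    {
        int aa = a < 0 ? -a : a;
        int bb = b < 0 ? -b : b;
        return aa < bb;
    }
};

int main(void)
{
    thrust::device_vector<int> keys(1 << 20);

    for (size_t i = 0; i < keys.size(); i++)
        keys[i] = rand() - (RAND_MAX / 2);

    thrust::sort(keys.begin(), keys.end());              /* default '<' on POD keys   */
    thrust::sort(keys.begin(), keys.end(), abs_less());  /* custom comparator version */
    return 0;
}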

I'm not saying that the non-availability of a pluggable sort completely
holds people back; I'm saying that it would simplify the process, now
and in the future. Of course, that's my opinion.

> Someone who cannot understand tuplesort, which is not all that
> complicated, has no business trying to build GPU sorting into
> Postgres.

That sounds a bit harsh. I'm one of those, indeed; I haven't looked into the
details, not having enough time for it. At work we do GPU computing (not
the sort type of stuff), and given that I'm a Postgres enthusiast I asked
myself: "my server is able to sort around 500 million integers per second;
if Postgres were able to do that as well it would be very nice".

What can I say? Sorry for my thoughts.

> I had a patch committed a few hours ago that almost included the
> capability of assigning an alternative sorting function, but only one
> with the exact same signature as my variant of qsort_arg. pg_qsort
> isn't used to sort tuples at all, by the way.

Then I did look in the wrong direction. Thank you for pointing that out.

> Threading building blocks is not going to form the basis of any novel
> sorting implementation, because comparators in general are not thread
> safe, and it isn't available on all the platforms we support, and
> because of how longjmp interacts with C++ stack unwinding and so on
> and so on. Now, you could introduce some kind of parallelism into
> sorting integers and floats, but that's an awful lot of work for a
> marginal reward.

TBB was just an example that came to mind.
What do you mean by "you could introduce some kind of parallelism"?
As far as I know, any algorithm based on divide and conquer can be
parallelized.

Regards
Gaetano Mendola




Re: CUDA Sorting

From
Peter Geoghegan
Date:
On 15 February 2012 22:54, Gaetano Mendola <mendola@gmail.com> wrote:
> That sounds a bit harsh. I'm one of those indeed, I haven't look in the
> details not having enough time for it. At work we do GPU computing (not
> the sort type stuff) and given the fact I'm a Postgres enthusiast I
> asked my self: "my server is able to sort around 500 milions integer per
> seconds, if postgres was able to do that as well it would be very nice".
>
> What I have to say? Sorry for my thoughts.

I'm not trying to sound harsh.

The only reason that my patch *nearly* had support for this was
because the implementation that we nearly went with would have only
needed another couple of lines of code to support it. It very probably
wouldn't have turned out to have been useful for any novel sorting
idea, and was really only intended to be used to support user-defined
full sorting specialisations. That didn't end up making the cut.

My point is that whatever is holding back the development of a useful
prototype here, it definitely isn't the lack of an existing API. We
don't know what such an API should look like, and just how invasive it
needs to be. More importantly, it remains to be seen how useful this
idea is in the real world - we don't have so much as a synthetic test
case with a single client, as far as I'm aware.

I'd encourage the OP to share his work on github or something along those lines.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: CUDA Sorting

From
Dann Corbit
Date:
-----Original Message-----
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Gaetano Mendola
Sent: Wednesday, February 15, 2012 2:54 PM
To: Peter Geoghegan; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] CUDA Sorting

On 15/02/2012 23:11, Peter Geoghegan wrote:
> On 15 February 2012 20:00, Gaetano Mendola<mendola@gmail.com>  wrote:
>> On 13/02/2012 19:48, Greg Stark wrote:
>>>
>>> I don't think we should be looking at either CUDA or OpenCL directly.
>>> We should be looking for a generic library that can target either
>>> and is well maintained and actively developed. Any GPU code we write
>>> ourselves would rapidly be overtaken by changes in the hardware and
>>> innovations in parallel algorithms. If we find a library that
>>> provides a sorting api and adapt our code to use it then we'll get
>>> the benefits of any new hardware feature as the library adds support for them.
>>>
>>
>> I think one option is to make the sort function pluggable with a
>> shared library/dll. I see several benefits from this:
>>
>>   - It could be in the interest of the hardware vendor to provide the
>> most powerful sort implementation (I'm sure for example that TBB sort
>> implementation is faster that pg_sort)
>>
>>   - It can permit people to "play" with it without being deep
>> involved in pg development and stuffs.
>
> Sorry, but I find it really hard to believe that the non-availability
> of pluggable sorting is what's holding people back here. Some vanguard
> needs to go and prove the idea by building a rough prototype before we
> can even really comment on what an API should look like. For example,
> I am given to understand that GPUs generally sort using radix sort -
> resolving the impedance mismatch that prevents someone from using a
> non-comparison based sort sure sounds like a lot of work for an
> entirely speculative reward.

AFAIK thrust library uses the radix sort if the keys you are sorting are POD data comparable with a "<" operator
otherwise it does the comparison based sort using the operator provided.

http://docs.thrust.googlecode.com/hg/modules.html

I'm not saying that the non-availability of pluggable sort completely holds people back, I'm saying that it will
simplify the process now and in the future, of course that's my opinion.

> Someone who cannot understand tuplesort, which is not all that
> complicated, has no business trying to build GPU sorting into
> Postgres.

That sounds a bit harsh. I'm one of those indeed, I haven't look in the details not having enough time for it. At work
we do GPU computing (not the sort type stuff) and given the fact I'm a Postgres enthusiast I asked my self: "my server
is able to sort around 500 milions integer per seconds, if postgres was able to do that as well it would be very nice".

What I have to say? Sorry for my thoughts.

> I had a patch committed a few hours ago that almost included the
> capability of assigning an alternative sorting function, but only one
> with the exact same signature as my variant of qsort_arg. pg_qsort
> isn't used to sort tuples at all, by the way.

Then I did look in the wrong direction. Thank you for point that out.

> Threading building blocks is not going to form the basis of any novel
> sorting implementation, because comparators in general are not thread
> safe, and it isn't available on all the platforms we support, and
> because of how longjmp interacts with C++ stack unwinding and so on
> and so on. Now, you could introduce some kind of parallelism into
> sorting integers and floats, but that's an awful lot of work for a
> marginal reward.

The TBB was just example that did come in my mind.
What do you mean with you could introduce some kind of parallelism?
As far as I know any algorithm using the divide and conquer can be parallelized.
>>
Radix sorting can be used for any data type, if you create a callback that provides the most significant bits in
"width" buckets.  At any rate, I can't imagine why anyone would want to complain about sorting 40 times faster than
before, considering the amount of time databases spend ordering data.
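
To make that concrete, here is a sketch of the kind of callback being
described; the interface and names are hypothetical, purely for illustration:

/* A radix sort asks the callback for the next 'width' most significant bits
 * of an element's key, one pass at a time, so any type can be bucketed
 * without a comparison function. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

typedef uint32_t (*radix_digit_fn) (const void *elem, unsigned pass, unsigned width);

/* Example callback for elements whose first field is a signed 64-bit key.
 * Assumes (pass + 1) * width <= 64. */
static uint32_t int64_msb_digit(const void *elem, unsigned pass, unsigned width)
{
    uint64_t key;
    uint32_t mask = (uint32_t) ((UINT64_C(1) << width) - 1);
    unsigned shift = 64 - (pass + 1) * width;

    memcpy(&key, elem, sizeof(key));
    key ^= UINT64_C(1) << 63;         /* flip the sign bit so negatives bucket first */
    return (uint32_t) ((key >> shift) & mask);
}

int main(void)
{
    int64_t sample = -42;

    printf("first 8-bit digit = %u\n", int64_msb_digit(&sample, 0, 8));
    return 0;
}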

I have a CUDA card in this machine (NVIDIA GeForce GTX 460) and I would not mind it a bit if my database "ORDER BY"
clause suddenly started running ten times faster than before when I am dealing with a huge volume of data.

There have been other experiments along these lines such as:
GPU-based Sorting in PostgreSQL, Naju Mancheril, School of Computer Science - Carnegie Mellon University
www.cs.virginia.edu/~skadron/Papers/bakkum_sqlite_gpgpu10.pdf (This is for SQLite, but the grammar of SQLite is almost
a pure subset of PostgreSQL, including things like vacuum...)
http://wiki.postgresql.org/images/6/65/Pgopencl.pdf
http://dl.acm.org/citation.cfm?id=1807207
http://www.scribd.com/doc/51484335/PostgreSQL-OpenCL-Procedural-Language-pgEast-March-2011

See also
http://highscalability.com/scaling-postgresql-using-cuda


<<