Re: CUDA Sorting - Mailing list pgsql-hackers
From | Kohei KaiGai
---|---
Subject | Re: CUDA Sorting
Date |
Msg-id | CADyhKSUZpOV4Tj1D4Y3gamK-nH8wCpripdUDbuwRz7=0U3n_uw@mail.gmail.com
In response to | Re: CUDA Sorting (Greg Smith <greg@2ndQuadrant.com>)
Responses | Re: CUDA Sorting
List | pgsql-hackers
2012/2/13 Greg Smith <greg@2ndquadrant.com>:

> On 02/11/2012 08:14 PM, Gaetano Mendola wrote:
>>
>> The trend is to have servers capable of running CUDA providing GPUs via external hardware (PCI Express interface with PCI Express switches); look for example at the PowerEdge C410x PCIe Expansion Chassis from DELL.
>
> The C410X adds 16 PCIe slots to a server, housed inside a separate 3U enclosure. That's a completely sensible purchase if your goal is to build a computing cluster, where a lot of work is handed off to a set of GPUs. I think that's even less likely to be a cost-effective option for a database server. Adding a single dedicated GPU to a server to accelerate sorting is something that might be justifiable, based on your benchmarks. This is a much more expensive option than that though. Details at http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for anyone who wants to see just how big this external box is.
>
>> I did some experiments timing the sort done with CUDA and the sort done with pg_qsort:
>>
>>                         CUDA       pg_qsort
>> 33 million integers:    ~900 ms    ~6000 ms
>> 1 million integers:     ~21 ms     ~162 ms
>> 100k integers:          ~2 ms      ~13 ms
>>
>> The CUDA time already includes the copy operations (host->device, device->host). As GPU I was using a C2050, and the CPU doing the pg_qsort was an Intel(R) Xeon(R) CPU X5650 @ 2.67GHz.
>
> That's really interesting, and the X5650 is by no means a slow CPU. So this benchmark is providing a lot of CPU power yet still seeing over a 6X speedup in sort times. It sounds like the PCI Express bus has gotten fast enough that the time to hand data over and get it back again can easily be justified for medium to large sized sorts.
>
> It would be helpful to take this patch and confirm whether it scales when used in parallel. The easiest way to do that would be to use the pgbench "-f" feature, which allows running an arbitrary number of some query at once. Seeing whether this acceleration continues to hold as the number of clients increases is a useful data point.
>
> Is it possible for you to break down where the time is being spent? For example, how much of this time is consumed in the GPU itself, compared to time spent transferring data between CPU and GPU? I'm also curious where the bottleneck is with this approach. If it's the speed of the PCI-E bus for smaller data sets, adding more GPUs may never be practical. If the bus can handle quite a few of these at once before it saturates, it might be possible to overload a single GPU. That seems like it would be really hard to reach for database sorting, though; I can't really justify my gut feel for that being true.
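For what it's worth, that kind of breakdown can be taken with CUDA events around each phase. Below is a minimal sketch, not Gaetano's benchmark code; the use of thrust::sort and the 33-million-element array size are assumptions for illustration only.

```
// Minimal sketch (assumed approach, not the benchmark under discussion):
// time the host->device copy, the on-device sort, and the device->host
// copy separately using CUDA events on the default stream.
#include <cuda_runtime.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdio>
#include <cstdlib>

static float elapsed_ms(cudaEvent_t a, cudaEvent_t b)
{
    float ms = 0.0f;
    cudaEventSynchronize(b);            /* wait until b has completed */
    cudaEventElapsedTime(&ms, a, b);
    return ms;
}

int main(void)
{
    const size_t N = 33UL * 1000 * 1000;        /* ~33 million integers */
    thrust::host_vector<int> h_data(N);
    for (size_t i = 0; i < N; i++)
        h_data[i] = rand();

    cudaEvent_t ev[4];
    for (int i = 0; i < 4; i++)
        cudaEventCreate(&ev[i]);

    cudaEventRecord(ev[0], 0);
    thrust::device_vector<int> d_data = h_data;                 /* host -> device */
    cudaEventRecord(ev[1], 0);
    thrust::sort(d_data.begin(), d_data.end());                 /* sort on the GPU */
    cudaEventRecord(ev[2], 0);
    thrust::copy(d_data.begin(), d_data.end(), h_data.begin()); /* device -> host */
    cudaEventRecord(ev[3], 0);

    printf("copy in : %.1f ms\n", elapsed_ms(ev[0], ev[1]));
    printf("sort    : %.1f ms\n", elapsed_ms(ev[1], ev[2]));
    printf("copy out: %.1f ms\n", elapsed_ms(ev[2], ev[3]));

    for (int i = 0; i < 4; i++)
        cudaEventDestroy(ev[i]);
    return 0;
}
```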
>> > I've never seen a PostgreSQL server capable of running CUDA, and I don't expect that to change.
>>
>> That sounds like:
>>
>> "I think there is a world market for maybe five computers."
>> - IBM Chairman Thomas Watson, 1943
>
> Yes, and "640K will be enough for everyone", ha ha. (Having said the 640K thing is flat out denied by Gates, BTW, and no one has come up with proof otherwise.)
>
> I think you've made an interesting case for this sort of acceleration now being useful for systems doing what's typically considered a data warehouse task. I regularly see servers waiting for far more than 13M integers to sort. And I am seeing a clear trend toward providing more PCI-E slots in servers now. Dell's R810 is the most popular single server model my customers have deployed in the last year, and it has 5 X8 slots in it. It's rare that all 5 of those are filled. As long as a dedicated GPU works fine when dropped to X8 speeds, I know a fair number of systems where one of those could be added now.
>
> There's another data point in your favor I didn't notice before your last e-mail. Amazon has a "Cluster GPU Quadruple Extra Large" node type that runs with NVIDIA Tesla hardware. That means the installed base of people who could consider CUDA is higher than I expected. To demonstrate how much that costs: provisioning a GPU-enabled reserved instance from Amazon for one year costs $2410 at "Light Utilization", giving a system with 22GB of RAM and 1690GB of storage. (I find the reserved prices easier to compare with dedicated hardware than the hourly ones.) That's halfway between the High-Memory Double Extra Large Instance (34GB RAM/850GB disk) at $1100 and the High-Memory Quadruple Extra Large Instance (64GB RAM/1690GB disk) at $2200. If someone could prove sorting was a bottleneck on their server, that isn't an unreasonable option to consider on a cloud-based database deployment.
>
> I still think that an approach based on OpenCL is more likely to be suitable for PostgreSQL, which was part of why I gave CUDA low odds here. The points in favor of OpenCL are:
>
> - Since you last posted, OpenCL compiling has switched to using LLVM as its standard compiler. Good PostgreSQL support for LLVM isn't far away. It looks to me like the compiler situation for CUDA requires their PathScale-based compiler. I don't know enough about this area to say which compiling tool chain will end up being easier to deal with.
>
> - Intel is making GPU support standard for OpenCL, as I mentioned before. NVIDIA will be hard pressed to compete with Intel for GPU acceleration once more systems supporting that enter the market.
>
> - Easy availability of OpenCL on Mac OS X for development's sake. Lots of Postgres hackers have OS X systems, even though there aren't too many OS X database servers.
>
> The fact that Amazon provides a way to crack the chicken/egg hardware problem immediately helps a lot, though; I don't even need a physical card here to test CUDA GPU acceleration on Linux now. With that data point, your benchmarks are good enough to say I'd be willing to help review a patch in this area as part of the 9.3 development cycle. That may validate that GPU acceleration is useful, and then the next step would be considering how portable that will be to other GPU interfaces. I still expect CUDA will be looked back on as a dead end for GPU-accelerated computing one day. Computing history is not filled with many single-vendor standards that competed successfully against Intel providing the same thing. AMD's x86-64 is the only example I can think of where Intel didn't win that sort of race, which happened (IMHO) only because Intel's Itanium failed to prioritize backwards compatibility highly enough.

As a side note, my module (PG-Strom) also uses CUDA. I tried to implement it with OpenCL at the beginning of the project, but that didn't work well when multiple sessions use a GPU device concurrently: the second background process gets an out-of-resources error while another process has the GPU device open. I'm not sure whether that is a limitation of OpenCL, of NVIDIA's driver, or a bug in my code.
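To illustrate the failure mode, here is a minimal sketch (not PG-Strom code; try_gpu_sort_init() is a hypothetical name) of how a backend process might probe whether it can create its own CUDA context before offloading a sort, and fall back to the CPU otherwise.

```
/* Minimal sketch, not PG-Strom code: check whether this process can get a
 * CUDA context of its own before handing a sort to the GPU.
 * try_gpu_sort_init() is a hypothetical name used only for illustration. */
#include <cuda_runtime.h>
#include <stdio.h>

static int
try_gpu_sort_init(int dev)
{
    cudaDeviceProp prop;
    cudaError_t rc;

    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
        return 0;

    /* In PROHIBITED compute mode no process may create a context; in the
     * EXCLUSIVE modes only one process at a time may hold one. */
    if (prop.computeMode == cudaComputeModeProhibited)
        return 0;

    if (cudaSetDevice(dev) != cudaSuccess)
        return 0;

    /* cudaFree(0) forces lazy context creation; this is typically where a
     * second, concurrent process sees an out-of-resources or
     * devices-unavailable error. */
    rc = cudaFree(0);
    if (rc != cudaSuccess)
    {
        fprintf(stderr, "GPU unavailable (%s), falling back to CPU sort\n",
                cudaGetErrorString(rc));
        return 0;
    }
    return 1;   /* safe to offload the sort to this device */
}

int main(void)
{
    if (try_gpu_sort_init(0))
        printf("GPU sorting enabled on device 0\n");
    else
        printf("using pg_qsort on the CPU instead\n");
    return 0;
}
```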
Anyway, I switched to CUDA instead of investigating the binary drivers further. :-(

Thanks,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>