Re: CUDA Sorting - Mailing list pgsql-hackers
From | Gaetano Mendola
Subject | Re: CUDA Sorting
Date |
Msg-id | 4F3B05AF.6080903@gmail.com
In response to | Re: CUDA Sorting (Greg Smith <greg@2ndQuadrant.com>)
List | pgsql-hackers
On 13/02/2012 08:26, Greg Smith wrote:
> On 02/11/2012 08:14 PM, Gaetano Mendola wrote:
>> The trend is to have servers capable of running CUDA providing GPUs via external hardware (a PCI Express interface with PCI Express switches); look for example at the PowerEdge C410x PCIe Expansion Chassis from DELL.
>
> The C410X adds 16 PCIe slots to a server, housed inside a separate 3U enclosure. That's a completely sensible purchase if your goal is to build a computing cluster, where a lot of work is handed off to a set of GPUs. I think that's even less likely to be a cost-effective option for a database server. Adding a single dedicated GPU installed in a server to accelerate sorting is something that might be justifiable, based on your benchmarks. This is a much more expensive option than that though. Details at http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for anyone who wants to see just how big this external box is.
>
>> I did some experiments timing the sort done with CUDA and the sort done with pg_qsort:
>>
>>                        CUDA       pg_qsort
>> 33 million integers:   ~ 900 ms   ~ 6000 ms
>> 1 million integers:    ~ 21 ms    ~ 162 ms
>> 100k integers:         ~ 2 ms     ~ 13 ms
>>
>> The CUDA time already includes the copy operations (host->device, device->host).
>> As the GPU I was using a C2050, and the CPU doing the pg_qsort was an Intel(R) Xeon(R) CPU X5650 @ 2.67GHz.
>
> That's really interesting, and the X5650 is by no means a slow CPU. So this benchmark is providing a lot of CPU power yet still seeing over a 6X speedup in sort times. It sounds like the PCI Express bus has gotten fast enough that the time to hand data over and get it back again can easily be justified for medium to large sized sorts.
>
> It would be helpful to take this patch and confirm whether it scales when used in parallel. The easiest way to do that would be to use the pgbench "-f" feature, which allows running an arbitrary number of some query at once. Seeing whether this acceleration continues to hold as the number of clients increases is a useful data point.
>
> Is it possible for you to break down where the time is being spent? For example, how much of this time is consumed in the GPU itself, compared to time spent transferring data between CPU and GPU? I'm also curious where the bottleneck is with this approach. If it's the speed of the PCI-E bus for smaller data sets, adding more GPUs may never be practical. If the bus can handle quite a few of these at once before it saturates, it might be possible to overload a single GPU. That seems like it would be really hard to reach for database sorting though; I can't really justify my gut feel for that being true, though.
There you go (times are in ms):

Size         H->D         SORT         D->H         TOTAL
64           0.209824     0.479392     0.013856     0.703072
128          0.098144     0.41744      0.01312      0.528704
256          0.096832     0.420352     0.013696     0.53088
512          0.097568     0.3952       0.014464     0.507232
1024         0.09872      0.396608     0.014624     0.509952
2048         0.101344     0.56224      0.016896     0.68048
4096         0.106176     0.562976     0.02016      0.689312
8192         0.116512     0.571264     0.02672      0.714496
16384        0.136096     0.587584     0.040192     0.763872
32768        0.179296     0.658112     0.066304     0.903712
65536        0.212352     0.84816      0.118016     1.178528
131072       0.317056     1.1465       0.22784      1.691396
262144       0.529376     1.82237      0.42512      2.776866
524288       0.724032     2.39834      0.64576      3.768132
1048576      1.11162      3.51978      1.12176      5.75316
2097152      1.95939      5.93434      2.06992      9.96365
4194304      3.76192      10.6011      4.10614      18.46916
8388608      7.16845      19.9245      7.93741      35.03036
16777216     13.8693      38.7413      15.4073      68.0179
33554432     27.3017      75.6418      30.6646      133.6081
67108864     54.2171      151.192      60.327       265.7361

pg_sort:

Size         TOTAL
64           0.010000
128          0.010000
256          0.021000
512          0.128000
1024         0.092000
2048         0.196000
4096         0.415000
8192         0.883000
16384        1.881000
32768        3.960000
65536        8.432000
131072       17.951000
262144       37.140000
524288       78.320000
1048576      163.276000
2097152      339.118000
4194304      693.223000
8388608      1423.142000
16777216     2891.218000
33554432     5910.851000
67108864     11980.930000

As you can see, the CUDA times are lower than the timings I reported in my previous post; the server was doing something else at the time, so I have repeated these benchmarks with the server completely idle.

And this is the speedup, computed as pg_sort/CUDA:

Size         pg_sort/CUDA
64           0.0142232943
128          0.018914175
256          0.039556962
512          0.2070058671
1024         0.1804091365
2048         0.2880319774
4096         0.6078524674
8192         1.2372357578
16384        2.4637635625
32768        4.4106972133
65536        7.1742037525
131072       10.5090706139
262144       13.3719091955
524288       20.5834084369
1048576      28.2516043357
2097152      33.9618513296
4194304      37.5247168794
8388608      40.5135716561
16777216     42.4743633661
33554432     44.2394809896
67108864     45.1499777411
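For reference, here is a minimal sketch of how a per-phase breakdown like the one above can be collected: CUDA events bracket the host->device copy, the on-device sort, and the device->host copy. The use of Thrust's sort, the array size, and the names are only illustrative assumptions, not the actual patch code.

/* timing_sketch.cu -- illustrative sketch only, not the code used for the
 * numbers above.  CUDA events bracket the host->device copy, the on-device
 * sort (Thrust's sort here) and the device->host copy, so each phase can be
 * reported separately.  Error checking is omitted for brevity. */
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/sort.h>

static float ms_between(cudaEvent_t a, cudaEvent_t b)
{
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, a, b);
    return ms;
}

int main(void)
{
    const size_t n = 1 << 20;            /* 1M integers, one of the rows above */
    std::vector<int> host(n);
    for (size_t i = 0; i < n; ++i)
        host[i] = rand();

    cudaEvent_t start, after_h2d, after_sort, after_d2h;
    cudaEventCreate(&start);
    cudaEventCreate(&after_h2d);
    cudaEventCreate(&after_sort);
    cudaEventCreate(&after_d2h);

    int *dev;
    cudaMalloc(&dev, n * sizeof(int));

    cudaEventRecord(start);
    cudaMemcpy(dev, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);   /* H->D */
    cudaEventRecord(after_h2d);

    /* Thrust dispatches to a GPU radix sort for ints; its temporary device
     * allocations are counted in the SORT time here. */
    thrust::sort(thrust::device_ptr<int>(dev), thrust::device_ptr<int>(dev + n));
    cudaEventRecord(after_sort);

    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);   /* D->H */
    cudaEventRecord(after_d2h);
    cudaEventSynchronize(after_d2h);

    printf("H->D %.6f  SORT %.6f  D->H %.6f  TOTAL %.6f (ms)\n",
           ms_between(start, after_h2d),
           ms_between(after_h2d, after_sort),
           ms_between(after_sort, after_d2h),
           ms_between(start, after_d2h));

    cudaFree(dev);
    return 0;
}

Compiled with nvcc (e.g. nvcc -O2 timing_sketch.cu), it prints one line in the same shape as a row of the table above, which makes it easy to see whether the PCI-E transfers or the sort itself dominate at a given size.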
>> > I've never seen a PostgreSQL server capable of running CUDA, and I don't expect that to change.
>>
>> That sounds like:
>>
>> "I think there is a world market for maybe five computers."
>> - IBM Chairman Thomas Watson, 1943
>
> Yes, and "640K will be enough for everyone", ha ha. (Having said the 640K thing is flat-out denied by Gates, BTW, and no one has come up with proof otherwise.)
>
> I think you've made an interesting case for this sort of acceleration now being useful for systems doing what's typically considered a data warehouse task. I regularly see servers waiting for far more than 13M integers to sort. And I am seeing a clear trend toward providing more PCI-E slots in servers now. Dell's R810 is the most popular single server model my customers have deployed in the last year, and it has 5 X8 slots in it. It's rare all 5 of those are filled. As long as a dedicated GPU works fine when dropped to X8 speeds, I know a fair number of systems where one of those could be added now.
>
> There's another data point in your favor I didn't notice before your last e-mail. Amazon has a "Cluster GPU Quadruple Extra Large" node type that runs with NVIDIA Tesla hardware. That means the installed base of people who could consider CUDA is higher than I expected. To demonstrate how much that costs: provisioning a GPU-enabled reserved instance from Amazon for one year costs $2410 at "Light Utilization", giving a system with 22GB of RAM and 1690GB of storage. (I find the reserved prices easier to compare with dedicated hardware than the hourly ones.) That's halfway between the High-Memory Double Extra Large Instance (34GB RAM/850GB disk) at $1100 and the High-Memory Quadruple Extra Large Instance (64GB RAM/1690GB disk) at $2200. If someone could prove sorting was a bottleneck on their server, that isn't an unreasonable option to consider on a cloud-based database deployment.
>
> I still think that an approach based on OpenCL is more likely to be suitable for PostgreSQL, which was part of why I gave CUDA low odds here. The points in favor of OpenCL are:
>
> - Since you last posted, OpenCL compiling has switched to using LLVM as its standard compiler. Good PostgreSQL support for LLVM isn't far away. It looks to me like the compiler situation for CUDA requires their PathScale-based compiler. I don't know enough about this area to say which compiling tool chain will end up being easier to deal with.

NVIDIA's compiler, nvcc, has switched to LLVM as well (as of CUDA 4.1).

> - Intel is making GPU support standard for OpenCL, as I mentioned before. NVIDIA will be hard pressed to compete with Intel for GPU acceleration once more systems supporting that enter the market.
>
> - Easy availability of OpenCL on Mac OS X for development's sake. Lots of Postgres hackers have OS X systems, even though there aren't too many OS X database servers.
>
> The fact that Amazon provides a way to crack the chicken/egg hardware problem immediately helps a lot though; I don't even need a physical card here to test CUDA GPU acceleration on Linux now. With that data point, your benchmarks are good enough to say I'd be willing to help review a patch in this area as part of the 9.3 development cycle. That may validate that GPU acceleration is useful, and then the next step would be considering how portable that will be to other GPU interfaces. I still expect CUDA will be looked back on as a dead end for GPU-accelerated computing one day. Computing history is not filled with many single-vendor standards that competed successfully against Intel providing the same thing. AMD's x86-64 is the only example I can think of where Intel didn't win that sort of race, which happened (IMHO) only because Intel's Itanium failed to prioritize backwards compatibility highly enough.

Given that NVIDIA's nvcc now uses LLVM, I think we will soon be able to compile "CUDA" programs for any target architecture supported by LLVM.

Regards,
Gaetano Mendola