Re: CUDA Sorting - Mailing list pgsql-hackers

From Gaetano Mendola
Subject Re: CUDA Sorting
Msg-id 4F3B05AF.6080903@gmail.com
In response to Re: CUDA Sorting  (Greg Smith <greg@2ndQuadrant.com>)
List pgsql-hackers
On 13/02/2012 08:26, Greg Smith wrote:
> On 02/11/2012 08:14 PM, Gaetano Mendola wrote:
>> The trend is toward servers capable of running CUDA, with GPUs provided
>> via external hardware (PCI Express interface with PCI Express switches);
>> look for example at the PowerEdge C410x PCIe Expansion Chassis from DELL.
>
> The C410X adds 16 PCIe slots to a server, housed inside a separate 3U
> enclosure. That's a completely sensible purchase if your goal is to
> build a computing cluster, where a lot of work is handed off to a set of
> GPUs. I think that's even less likely to be a cost-effective option for
> a database server. Adding a single dedicated GPU installed in a server
> to accelerate sorting is something that might be justifiable, based on
> your benchmarks. This is a much more expensive option than that though.
> Details at http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for
> anyone who wants to see just how big this external box is.
>
>> I did some experiments timing the sort done with CUDA and the sort
>> done with pg_qsort:
>>                       CUDA       pg_qsort
>> 33 million integers:  ~ 900 ms   ~ 6000 ms
>> 1 million integers:   ~ 21 ms    ~ 162 ms
>> 100k integers:        ~ 2 ms     ~ 13 ms
>> The CUDA time already includes the copy operations (host->device,
>> device->host).
>> As GPU I was using a C2050, and the CPU doing the pg_qsort was an
>> Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>
> That's really interesting, and the X5650 is by no means a slow CPU. So
> this benchmark is providing a lot of CPU power yet still seeing over a
> 6X speedup in sort times. It sounds like the PCI Express bus has gotten
> fast enough that the time to hand data over and get it back again can
> easily be justified for medium to large sized sorts.
>
> It would be helpful to take this patch and confirm whether it scales
> when used in parallel. Easiest way to do that would be to use the
> pgbench "-f" feature, which allows running an arbitrary number of some
> query at once. Seeing whether this acceleration continued to hold as the
> number of clients increases is a useful data point.
>
> Is it possible for you to break down where the time is being spent? For
> example, how much of this time is consumed in the GPU itself, compared
> to time spent transferring data between CPU and GPU? I'm also curious
> where the bottleneck is at with this approach. If it's the speed of the
> PCI-E bus for smaller data sets, adding more GPUs may never be
> practical. If the bus can handle quite a few of these at once before it
> saturates, it might be possible to overload a single GPU. That seems
> like it would be really hard to reach for database sorting though; I
> can't really justify my gut feel for that being true though.

There you go (times are in ms):

Size      H->D      SORT      D->H      TOTAL
64        0.209824  0.479392  0.013856  0.703072
128       0.098144  0.41744   0.01312   0.528704
256       0.096832  0.420352  0.013696  0.53088
512       0.097568  0.3952    0.014464  0.507232
1024      0.09872   0.396608  0.014624  0.509952
2048      0.101344  0.56224   0.016896  0.68048
4096      0.106176  0.562976  0.02016   0.689312
8192      0.116512  0.571264  0.02672   0.714496
16384     0.136096  0.587584  0.040192  0.763872
32768     0.179296  0.658112  0.066304  0.903712
65536     0.212352  0.84816   0.118016  1.178528
131072    0.317056  1.1465    0.22784   1.691396
262144    0.529376  1.82237   0.42512   2.776866
524288    0.724032  2.39834   0.64576   3.768132
1048576   1.11162   3.51978   1.12176   5.75316
2097152   1.95939   5.93434   2.06992   9.96365
4194304   3.76192   10.6011   4.10614   18.46916
8388608   7.16845   19.9245   7.93741   35.03036
16777216  13.8693   38.7413   15.4073   68.0179
33554432  27.3017   75.6418   30.6646   133.6081
67108864  54.2171   151.192   60.327    265.7361
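
To read the breakdown above as a share of the total, here is a minimal
Python sketch. It uses three rows copied from the table. The 4-byte
integer size used for the bandwidth estimate is an assumption of this
sketch and is not stated with the measurements; the function names are
made up for illustration.

```python
# Rows copied from the CUDA timing table: size -> (H->D, SORT, D->H), in ms.
ROWS = {
    1024: (0.09872, 0.396608, 0.014624),
    1048576: (1.11162, 3.51978, 1.12176),
    67108864: (54.2171, 151.192, 60.327),
}

BYTES_PER_INT = 4  # assumption: 32-bit integers; not stated in the measurements


def transfer_share(h2d, sort, d2h):
    """Fraction of the total time spent copying data over the bus."""
    return (h2d + d2h) / (h2d + sort + d2h)


def h2d_bandwidth_gb_s(size, h2d_ms):
    """Effective host-to-device bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return size * BYTES_PER_INT / (h2d_ms / 1000.0) / 1e9


for size, (h2d, sort, d2h) in ROWS.items():
    print(f"{size:>9}  transfer share {transfer_share(h2d, sort, d2h):.0%}  "
          f"H->D {h2d_bandwidth_gb_s(size, h2d):.2f} GB/s")
```

On these three rows the two copies come to roughly 22%, 39% and 43% of
the total, so the sort step itself stays the larger part at each of
these sizes.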

pg_sort (times are in ms):

Size             TIME
64           0.010000
128          0.010000
256          0.021000
512          0.128000
1024         0.092000
2048         0.196000
4096         0.415000
8192         0.883000
16384        1.881000
32768        3.960000
65536        8.432000
131072      17.951000
262144      37.140000
524288      78.320000
1048576    163.276000
2097152    339.118000
4194304    693.223000
8388608   1423.142000
16777216  2891.218000
33554432  5910.851000
67108864 11980.930000

As you may notice, the CUDA times are lower than the ones I reported in
my previous post. The server was doing something else at the time, so I
have repeated the benchmarks with the server completely idle.

And this is the speedup, computed as pg_sort time / CUDA total time:

Size      pg_sort/CUDA
64        0.0142232943
128       0.018914175
256       0.039556962
512       0.2070058671
1024      0.1804091365
2048      0.2880319774
4096      0.6078524674
8192      1.2372357578
16384     2.4637635625
32768     4.4106972133
65536     7.1742037525
131072    10.5090706139
262144    13.3719091955
524288    20.5834084369
1048576   28.2516043357
2097152   33.9618513296
4194304   37.5247168794
8388608   40.5135716561
16777216  42.4743633661
33554432  44.2394809896
67108864  45.1499777411


>> > I've never seen a PostgreSQL server capable of running CUDA, and I
>> > don't expect that to change.
>>
>> That sounds like:
>>
>> "I think there is a world market for maybe five computers."
>> - IBM Chairman Thomas Watson, 1943
>
> Yes, and "640K will be enough for everyone", ha ha. (Having said the
> 640K thing is flat out denied by Gates, BTW, and no one has come up with
> proof otherwise).
>
> I think you've made an interesting case for this sort of acceleration
> now being useful for systems doing what's typically considered a data
> warehouse task. I regularly see servers waiting for far more than 13M
> integers to sort. And I am seeing a clear trend toward providing more
> PCI-E slots in servers now. Dell's R810 is the most popular single
> server model my customers have deployed in the last year, and it has 5
> X8 slots in it. It's rare all 5 of those are filled. As long as a
> dedicated GPU works fine when dropped to X8 speeds, I know a fair number
> of systems where one of those could be added now.
>
> There's another data point in your favor I didn't notice before your
> last e-mail. Amazon has a "Cluster GPU Quadruple Extra Large" node type
> that runs with NVIDIA Tesla hardware. That means the installed base of
> people who could consider CUDA is higher than I expected. To demonstrate
> how much that costs, to provision a GPU enabled reserved instance from
> Amazon for one year costs $2410 at "Light Utilization", giving a system
> with 22GB of RAM and 1690GB of storage. (I find the reserved prices
> easier to compare with dedicated hardware than the hourly ones) That's
> halfway between the High-Memory Double Extra Large Instance (34GB
> RAM/850GB disk) at $1100 and the High-Memory Quadruple Extra Large
> Instance (64GB RAM/1690GB disk) at $2200. If someone could prove sorting
> was a bottleneck on their server, that isn't an unreasonable option to
> consider on a cloud-based database deployment.
>
> I still think that an approach based on OpenCL is more likely to be
> suitable for PostgreSQL, which was part of why I gave CUDA low odds
> here. The points in favor of OpenCL are:
>
> -Since you last posted, OpenCL compiling has switched to using LLVM as
> their standard compiler. Good PostgreSQL support for LLVM isn't far
> away. It looks to me like the compiler situation for CUDA requires their
> PathScale based compiler. I don't know enough about this area to say
> which compiling tool chain will end up being easier to deal with.

NVIDIA's compiler, nvcc, has switched to LLVM as well (as of CUDA 4.1).

> -Intel is making GPU support standard for OpenCL, as I mentioned before.
> NVIDIA will be hard pressed to compete with Intel for GPU acceleration
> once more systems supporting that enter the market.
>
> -Easy availability of OpenCL on Mac OS X for development sake. Lots of
> Postgres hackers with OS X systems, even though there aren't too many OS
> X database servers.
>
> The fact that Amazon provides a way to crack the chicken/egg hardware
> problem immediately helps a lot though, I don't even need a physical
> card here to test CUDA GPU acceleration on Linux now. With that data
> point, your benchmarks are good enough to say I'd be willing to help
> review a patch in this area here as part of the 9.3 development cycle.
> That may validate that GPU acceleration is useful, and then the next
> step would be considering how portable that will be to other GPU
> interfaces. I still expect CUDA will be looked back on as a dead end for
> GPU accelerated computing one day. Computing history is not filled with
> many single-vendor standards that competed successfully against Intel
> providing the same thing. AMD's x86-64 is the only example I can think
> of where Intel didn't win that sort of race, which happened (IMHO) only
> because Intel's Itanium failed to prioritize backwards compatibility
> highly enough.

I think that, since NVIDIA's nvcc now uses LLVM, we will soon be able to
compile "CUDA" programs for any target architecture supported by LLVM.

Regards
Gaetano Mendola



