Re: CUDA Sorting - Mailing list pgsql-hackers

From Kohei KaiGai
Subject Re: CUDA Sorting
Date
Msg-id CADyhKSUZpOV4Tj1D4Y3gamK-nH8wCpripdUDbuwRz7=0U3n_uw@mail.gmail.com
In response to Re: CUDA Sorting  (Greg Smith <greg@2ndQuadrant.com>)
Responses Re: CUDA Sorting
List pgsql-hackers
2012/2/13 Greg Smith <greg@2ndquadrant.com>:
> On 02/11/2012 08:14 PM, Gaetano Mendola wrote:
>>
>> The trend is to have server capable of running CUDA providing GPU via
>> external hardware (PCI Express interface with PCI Express switches), look
>> for example at PowerEdge C410x PCIe Expansion Chassis from DELL.
>
>
> The C410X adds 16 PCIe slots to a server, housed inside a separate 3U
> enclosure.  That's a completely sensible purchase if your goal is to build a
> computing cluster, where a lot of work is handed off to a set of GPUs.  I
> think that's even less likely to be a cost-effective option for a database
> server.  Adding a single dedicated GPU installed in a server to accelerate
> sorting is something that might be justifiable, based on your benchmarks.
>  This is a much more expensive option than that though.  Details at
> http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for anyone who wants
> to see just how big this external box is.
>
>
>> I did some experiments timing the sort done with CUDA and the sort done
>> with pg_qsort:
>>                        CUDA       pg_qsort
>> 33 Million integers:  ~ 900 ms,  ~ 6000 ms
>> 1 Million integers:   ~  21 ms,  ~  162 ms
>> 100k integers:        ~   2 ms,  ~   13 ms
>> The CUDA time already includes the copy operations (host->device, device->host).
>> The GPU was a C2050, and the CPU doing the pg_qsort was an Intel(R)
>> Xeon(R) CPU X5650 @ 2.67GHz.
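
(For illustration only: below is a minimal sketch of what such a measurement
could look like using the Thrust library that ships with CUDA. The array size,
the use of thrust::sort, and the event-based timing are my own assumptions,
not the code that produced the numbers above; a CPU baseline could be taken by
timing qsort or std::sort over the same host array.)

/* Sketch (assumptions, not the benchmark code quoted above): sort N random
 * ints with Thrust on the GPU and time the full round trip, including the
 * host->device and device->host copies. */
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>

int main(void)
{
    const size_t N = 33000000;                  /* ~33 million integers */
    std::vector<int> host(N);
    for (size_t i = 0; i < N; i++)
        host[i] = rand();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    thrust::device_vector<int> dev(host.begin(), host.end());  /* host->device */
    thrust::sort(dev.begin(), dev.end());                      /* sort on the GPU */
    thrust::copy(dev.begin(), dev.end(), host.begin());        /* device->host */
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU sort including transfers: %.1f ms\n", ms);
    return 0;
}
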
>
>
> That's really interesting, and the X5650 is by no means a slow CPU.  So this
> benchmark is providing a lot of CPU power yet still seeing over a 6X speedup
> in sort times.  It sounds like the PCI Express bus has gotten fast enough
> that the time to hand data over and get it back again can easily be
> justified for medium to large sized sorts.
>
> It would be helpful to take this patch and confirm whether it scales when
> used in parallel.  The easiest way to do that would be to use the pgbench "-f"
> feature, which allows running an arbitrary query with any number of clients
> at once.  Seeing whether this acceleration continues to hold as the number of
> clients increases would be a useful data point.
>
> Is it possible for you to break down where the time is being spent?  For
> example, how much of this time is consumed in the GPU itself, compared to
> time spent transferring data between CPU and GPU?  I'm also curious where
> the bottleneck is with this approach.  If it's the speed of the PCI-E bus
> for smaller data sets, adding more GPUs may never be practical.  If the bus
> can handle quite a few of these at once before it saturates, it might be
> possible to overload a single GPU.  That seems like it would be really hard
> to reach for database sorting though; I can't really justify my gut feel for
> that being true.
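
(On breaking the time down: one straightforward way is to put CUDA events
around each phase, or to run the test under the CUDA profiler. A rough sketch
follows, again under my own assumptions about sizes and with Thrust standing
in for whatever sort the patch actually uses. If the two copy lines dominate
for small N, the PCI-E transfer is the bottleneck there; if the sort line
dominates, the GPU itself is.)

/* Sketch: time the host->device copy, the GPU sort, and the device->host
 * copy separately with CUDA events (hypothetical sizes, not the patch). */
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>

static float elapsed_ms(cudaEvent_t a, cudaEvent_t b)
{
    float ms = 0.0f;
    cudaEventSynchronize(b);
    cudaEventElapsedTime(&ms, a, b);
    return ms;
}

int main(void)
{
    const size_t N = 1000000;                   /* 1 million integers */
    std::vector<int> host(N);
    for (size_t i = 0; i < N; i++)
        host[i] = rand();

    cudaEvent_t ev[4];
    for (int i = 0; i < 4; i++)
        cudaEventCreate(&ev[i]);

    thrust::device_vector<int> dev(N);          /* allocated outside the timed phases */

    cudaEventRecord(ev[0]);
    thrust::copy(host.begin(), host.end(), dev.begin());    /* host->device */
    cudaEventRecord(ev[1]);
    thrust::sort(dev.begin(), dev.end());                    /* GPU sort */
    cudaEventRecord(ev[2]);
    thrust::copy(dev.begin(), dev.end(), host.begin());      /* device->host */
    cudaEventRecord(ev[3]);

    printf("copy to GPU   : %.2f ms\n", elapsed_ms(ev[0], ev[1]));
    printf("sort on GPU   : %.2f ms\n", elapsed_ms(ev[1], ev[2]));
    printf("copy from GPU : %.2f ms\n", elapsed_ms(ev[2], ev[3]));
    return 0;
}
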
>
>
>> > I've never seen a PostgreSQL server capable of running CUDA, and I
>> > don't expect that to change.
>>
>> That sounds like:
>>
>> "I think there is a world market for maybe five computers."
>> - IBM Chairman Thomas Watson, 1943
>
>
> Yes, and "640K will be enough for everyone", ha ha.  (Gates flat out denies
> having said the 640K thing, BTW, and no one has come up with proof
> otherwise.)
>
> I think you've made an interesting case for this sort of acceleration now
> being useful for systems doing what's typically considered a data warehouse
> task.  I regularly see servers waiting for far more than 13M integers to
> sort.  And I am seeing a clear trend toward providing more PCI-E slots in
> servers now.  Dell's R810 is the most popular single server model my
> customers have deployed in the last year, and it has 5 X8 slots in it.  It's
> rare all 5 of those are filled.  As long as a dedicated GPU works fine when
> dropped to X8 speeds, I know a fair number of systems where one of those
> could be added now.
>
> There's another data point in your favor I didn't notice before your last
> e-mail.  Amazon has a "Cluster GPU Quadruple Extra Large" node type that
> runs with NVIDIA Tesla hardware.  That means the installed base of people
> who could consider CUDA is higher than I expected.  To demonstrate how much
> that costs, to provision a GPU enabled reserved instance from Amazon for one
> year costs $2410 at "Light Utilization", giving a system with 22GB of RAM
> and 1690GB of storage.  (I find the reserved prices easier to compare with
> dedicated hardware than the hourly ones.)  That's halfway between the
> High-Memory Double Extra Large Instance (34GB RAM/850GB disk) at $1100 and
> the High-Memory Quadruple Extra Large Instance (64GB RAM/1690GB disk) at
> $2200.  If someone could prove sorting was a bottleneck on their server,
> that isn't an unreasonable option to consider on a cloud-based database
> deployment.
>
> I still think that an approach based on OpenCL is more likely to be suitable
> for PostgreSQL, which was part of why I gave CUDA low odds here.  The points
> in favor of OpenCL are:
>
> -Since you last posted, OpenCL compiling has switched to using LLVM as their
> standard compiler.  Good PostgreSQL support for LLVM isn't far away.  It
> looks to me like the compiler situation for CUDA requires their PathScale
> based compiler.  I don't know enough about this area to say which compiling
> tool chain will end up being easier to deal with.
>
> -Intel is making GPU support standard for OpenCL, as I mentioned before.
>  NVIDIA will be hard pressed to compete with Intel for GPU acceleration once
> more systems supporting that enter the market.
>
> -Easy availability of OpenCL on Mac OS X for development sake.  Lots of
> Postgres hackers with OS X systems, even though there aren't too many OS X
> database servers.
>
> The fact that Amazon provides a way to crack the chicken/egg hardware
> problem immediately helps a lot though; I don't even need a physical card
> here to test CUDA GPU acceleration on Linux now.  With that data point, your
> benchmarks are good enough to say I'd be willing to help review a patch in
> this area here as part of the 9.3 development cycle.  That may validate that
> GPU acceleration is useful, and then the next step would be considering how
> portable that will be to other GPU interfaces.  I still expect CUDA will be
> looked back on as a dead end for GPU accelerated computing one day.
>  Computing history is not filled with many single-vendor standards that
> competed successfully against Intel providing the same thing.  AMD's x86-64
> is the only example I can think of where Intel didn't win that sort of race,
> which happened (IMHO) only because Intel's Itanium failed to prioritize
> backwards compatibility highly enough.
>
As a side note: my module (PG-Strom) also uses CUDA, although I tried to
implement it with OpenCL at the beginning of the project. That didn't work
well when multiple sessions used a GPU device concurrently: the second
background process got an out-of-resources error while another process had
the GPU device open.

I'm not sure whether that is a limitation of OpenCL, of NVIDIA's driver, or a
bug in my own code. Anyway, I switched to CUDA instead of investigating the
binary drivers. :-(

Thanks,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>

