Re: CUDA Sorting - Mailing list pgsql-hackers

From Greg Smith
Subject Re: CUDA Sorting
Msg-id 4F38BB39.2040708@2ndQuadrant.com
In response to Re: CUDA Sorting  (Gaetano Mendola <mendola@gmail.com>)
List pgsql-hackers
On 02/11/2012 08:14 PM, Gaetano Mendola wrote:
> The trend is to have server capable of running CUDA providing GPU via 
> external hardware (PCI Express interface with PCI Express switches), 
> look for example at PowerEdge C410x PCIe Expansion Chassis from DELL.

The C410X adds 16 PCIe slots to a server, housed inside a separate 3U 
enclosure.  That's a completely sensible purchase if your goal is to 
build a computing cluster, where a lot of work is handed off to a set of 
GPUs.  I think that's even less likely to be a cost-effective option for 
a database server.  Adding a single dedicated GPU installed in a server 
to accelerate sorting is something that might be justifiable, based on 
your benchmarks.  This is a much more expensive option than that 
though.  Details at 
http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for anyone who 
wants to see just how big this external box is.

> I did some experiments timing the sort done with CUDA and the sort 
> done with pg_qsort:
>                        CUDA      pg_qsort
> 33 million integers: ~ 900 ms,  ~ 6000 ms
> 1 million integers:  ~  21 ms,  ~  162 ms
> 100k integers:       ~   2 ms,  ~   13 ms
> CUDA time already includes the copy operations (host->device, 
> device->host).
> As GPU I was using a C2050, and the CPU doing the pg_qsort was an 
> Intel(R) Xeon(R) CPU X5650  @ 2.67GHz

That's really interesting, and the X5650 is by no means a slow CPU.  So 
this benchmark is providing a lot of CPU power yet still seeing over a 
6X speedup in sort times.  It sounds like the PCI Express bus has gotten 
fast enough that the time to hand data over and get it back again can 
easily be justified for medium to large sized sorts.
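
For reference, the ratios behind that "over 6X" figure can be checked directly from the quoted timings.  This is only arithmetic on the numbers posted in the thread; the variable and function names are mine.

```python
# Timings quoted in the thread, in milliseconds: (integers, CUDA, pg_qsort).
timings_ms = [
    (33_000_000, 900, 6000),
    (1_000_000, 21, 162),
    (100_000, 2, 13),
]


def speedup(cuda_ms, cpu_ms):
    """How many times faster the CUDA sort was than pg_qsort."""
    return cpu_ms / cuda_ms


speedups = {rows: speedup(cuda, cpu) for rows, cuda, cpu in timings_ms}

for rows, ratio in speedups.items():
    print(f"{rows:>11,d} integers: {ratio:.1f}x")
```

The timings are approximate ("~"), so the smallest case in particular should be read as "roughly 6X" rather than a precise figure.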

It would be helpful to take this patch and confirm whether it scales 
when used in parallel.  The easiest way to do that would be to use the 
pgbench "-f" feature, which runs a custom query script and lets you run 
an arbitrary number of copies of it at once.  Seeing whether this 
acceleration continues to hold as the number of clients increases would 
be a useful data point.
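
A rough sketch of what such a test run might look like, written as a small Python helper that only builds the pgbench command lines and does not execute them.  The table "sort_test", its column "v", the file name "sort.sql", the database name, and the client counts are all placeholders I made up; the pgbench options used (-n, -f, -c, -j, -T) are the standard ones.

```python
import shlex

# Hypothetical table and column names; substitute a table large enough that
# the ORDER BY actually spends its time sorting.
SORT_SCRIPT = "SELECT v FROM sort_test ORDER BY v;\n"


def pgbench_command(script_path, clients, seconds=60, dbname="postgres"):
    """Build a pgbench invocation running script_path from `clients` sessions."""
    return [
        "pgbench",
        "-n",                # skip vacuuming the standard pgbench tables
        "-f", script_path,   # custom script instead of the built-in workload
        "-c", str(clients),  # number of concurrent client sessions
        "-j", str(clients),  # one pgbench worker thread per client
        "-T", str(seconds),  # run for a fixed duration, in seconds
        dbname,
    ]


def scaling_plan(script_path, client_counts=(1, 2, 4, 8, 16)):
    """One command line per client count, to be run one after another."""
    return [shlex.join(pgbench_command(script_path, c)) for c in client_counts]


if __name__ == "__main__":
    print(SORT_SCRIPT, end="")  # contents to save as sort.sql
    for line in scaling_plan("sort.sql"):
        print(line)
```

Comparing the transactions per second pgbench reports at each client count, with and without the patch, would show whether the speedup holds up under concurrency.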

Is it possible for you to break down where the time is being spent?  For 
example, how much of this time is consumed in the GPU itself, compared 
to time spent transferring data between CPU and GPU?  I'm also curious 
where the bottleneck is with this approach.  If it's the speed of the 
PCI-E bus for smaller data sets, adding more GPUs may never be 
practical.  If the bus can handle quite a few of these at once before it 
saturates, it might be possible to overload a single GPU.  That seems 
like it would be really hard to reach for database sorting, but I 
can't really justify my gut feel for that being true.
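
As a back-of-envelope illustration of the transfer question, here is a small estimate of how much of the reported time the host->device and device->host copies could account for.  The bandwidth figures are assumptions I picked for illustration (roughly 6 GB/s effective for an x16 slot and half that for x8), not measurements, and 4-byte integers are assumed as well.  Only a real breakdown from the patch can answer the question.

```python
BYTES_PER_INT = 4  # ASSUMPTION: 32-bit integers

# ASSUMPTION: effective host<->device bandwidth in bytes per second.
# These are illustrative guesses, not measured values.
ASSUMED_BANDWIDTH = {"x16": 6e9, "x8": 3e9}


def round_trip_ms(n_ints, bandwidth):
    """Estimated time to copy the data to the GPU and back, in milliseconds."""
    payload_bytes = n_ints * BYTES_PER_INT
    return 2 * payload_bytes / bandwidth * 1000.0


def transfer_share(n_ints, total_ms, bandwidth):
    """Fraction of the reported total time the two copies would take up."""
    return round_trip_ms(n_ints, bandwidth) / total_ms


if __name__ == "__main__":
    # 33 million integers sorted in ~900 ms, per the quoted benchmark.
    for slot, bw in ASSUMED_BANDWIDTH.items():
        ms = round_trip_ms(33_000_000, bw)
        share = transfer_share(33_000_000, 900, bw)
        print(f"{slot}: ~{ms:.0f} ms in copies, {share:.1%} of the total")
```

If the real numbers look anything like this, the bus is a small part of the total for the large case, but that conclusion depends entirely on the assumed bandwidth.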

> > I've never seen a PostgreSQL server capable of running CUDA, and I
> > don't expect that to change.
>
> That sounds like:
>
> "I think there is a world market for maybe five computers."
> - IBM Chairman Thomas Watson, 1943

Yes, and "640K will be enough for everyone", ha ha.  (Gates flatly 
denies ever having said the 640K thing, BTW, and no one has come up with 
proof otherwise.)

I think you've made an interesting case for this sort of acceleration 
now being useful for systems doing what's typically considered a data 
warehouse task.  I regularly see servers waiting for far more than 13M 
integers to sort.  And I am seeing a clear trend toward providing more 
PCI-E slots in servers now.  Dell's R810 is the most popular single 
server model my customers have deployed in the last year, and it has 5 
X8 slots in it.  It's rare all 5 of those are filled.  As long as a 
dedicated GPU works fine when dropped to X8 speeds, I know a fair number 
of systems where one of those could be added now.

There's another data point in your favor I didn't notice before your 
last e-mail.  Amazon has a "Cluster GPU Quadruple Extra Large" node type 
that runs with NVIDIA Tesla hardware.  That means the installed base of 
people who could consider CUDA is higher than I expected.  To show 
what that costs: provisioning a GPU-enabled reserved instance from 
Amazon for one year costs $2410 at "Light Utilization", giving a system 
with 22GB of RAM and 1690GB of storage.  (I find the reserved prices 
easier to compare with dedicated hardware than the hourly ones.)  
That's halfway between the High-Memory Double Extra Large 
Instance (34GB RAM/850GB disk) at $1100 and the High-Memory Quadruple 
Extra Large Instance (64GB RAM/1690GB disk) at $2200.  If someone could 
prove sorting was a bottleneck on their server, that isn't an 
unreasonable option to consider on a cloud-based database deployment.

I still think that an approach based on OpenCL is more likely to be 
suitable for PostgreSQL, which was part of why I gave CUDA low odds 
here.  The points in favor of OpenCL are:

-Since you last posted, OpenCL implementations have switched to using 
LLVM as their standard compiler.  Good PostgreSQL support for LLVM isn't 
far away.  It looks to me like the compiler situation for CUDA requires 
NVIDIA's PathScale-based compiler.  I don't know enough about this area 
to say which compiler tool chain will end up being easier to deal with.

-Intel is making GPU support standard for OpenCL, as I mentioned 
before.  NVIDIA will be hard pressed to compete with Intel for GPU 
acceleration once more systems supporting that enter the market.

-Easy availability of OpenCL on Mac OS X for development's sake.  Lots 
of Postgres hackers have OS X systems, even though there aren't too many 
OS X database servers.

The fact that Amazon provides a way to crack the chicken/egg hardware 
problem immediately helps a lot, though; I don't even need a physical 
card here to test CUDA GPU acceleration on Linux now.  With that data 
point, your benchmarks are good enough to say I'd be willing to help 
review a patch in this area as part of the 9.3 development cycle.  
That may validate that GPU acceleration is useful, and then the next 
step would be considering how portable that will be to other GPU 
interfaces.  I still expect CUDA will be looked back on as a dead end 
for GPU accelerated computing one day.  Computing history is not filled 
with many single-vendor standards that competed successfully against 
Intel providing the same thing.  AMD's x86-64 is the only example I can 
think of where Intel didn't win that sort of race, which happened (IMHO) 
only because Intel's Itanium failed to prioritize backwards 
compatibility highly enough.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


