Re: CUDA Sorting - Mailing list pgsql-hackers

From Gaetano Mendola
Subject Re: CUDA Sorting
Date
Msg-id CAJycT5p1SKcb7A4A+U=-PyPZ4nYTrxZeK1xwK8eO4J_8cTYVwg@mail.gmail.com
In response to Re: CUDA Sorting  (Kohei KaiGai <kaigai@kaigai.gr.jp>)
Responses Re: CUDA Sorting
List pgsql-hackers
On Feb 13, 2012 11:39 a.m., "Kohei KaiGai" <kaigai@kaigai.gr.jp> wrote:
>
> 2012/2/13 Greg Smith <greg@2ndquadrant.com>:
> > On 02/11/2012 08:14 PM, Gaetano Mendola wrote:
> >>
> >> The trend is to have servers capable of running CUDA provide the GPU via
> >> external hardware (a PCI Express interface with PCI Express switches); look
> >> for example at the PowerEdge C410x PCIe Expansion Chassis from DELL.
> >
> > The C410X adds 16 PCIe slots to a server, housed inside a separate 3U
> > enclosure.  That's a completely sensible purchase if your goal is to build a
> > computing cluster, where a lot of work is handed off to a set of GPUs.  I
> > think that's even less likely to be a cost-effective option for a database
> > server.  Adding a single dedicated GPU installed in a server to accelerate
> > sorting is something that might be justifiable, based on your benchmarks.
> > This is a much more expensive option than that though.  Details at
> > http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for anyone who wants
> > to see just how big this external box is.
> >
> >> I did some experiments timing the sort done with CUDA and the sort done
> >> with pg_qsort:
> >>
> >>                       CUDA      pg_qsort
> >> 33 million integers:  ~ 900 ms  ~ 6000 ms
> >> 1 million integers:   ~  21 ms  ~  162 ms
> >> 100k integers:        ~   2 ms  ~   13 ms
> >>
> >> The CUDA times already include the copy operations (host->device,
> >> device->host).
> >> As GPU I was using a C2050, and the CPU doing the pg_qsort was an Intel(R)
> >> Xeon(R) CPU X5650 @ 2.67GHz.
> >
> > That's really interesting, and the X5650 is by no means a slow CPU.  So this
> > benchmark is providing a lot of CPU power yet still seeing over a 6X speedup
> > in sort times.  It sounds like the PCI Express bus has gotten fast enough
> > that the time to hand data over and get it back again can easily be
> > justified for medium to large sized sorts.
> >
> > It would be helpful to take this patch and confirm whether it scales when
> > used in parallel.  The easiest way to do that would be to use the pgbench "-f"
> > feature, which allows running an arbitrary number of some query at once.
> > Seeing whether this acceleration continues to hold as the number of clients
> > increases is a useful data point.
> >
> > Is it possible for you to break down where the time is being spent?  For
> > example, how much of this time is consumed in the GPU itself, compared to
> > time spent transferring data between CPU and GPU?  I'm also curious where
> > the bottleneck is with this approach.  If it's the speed of the PCI-E bus
> > for smaller data sets, adding more GPUs may never be practical.  If the bus
> > can handle quite a few of these at once before it saturates, it might be
> > possible to overload a single GPU.  That seems like it would be really hard
> > to reach for database sorting, though; I can't really justify my gut
> > feel for that being true.
> >
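[Greg's question about splitting transfer time from on-GPU sort time can be answered by putting CUDA events around each phase. The sketch below is illustrative only, assuming Thrust (which ships with the CUDA toolkit) rather than the actual benchmark code; it times the host->device copy, the device sort, and the device->host copy separately.]

```cuda
// Hypothetical sketch, not the benchmark code: time each phase of a
// GPU integer sort (H->D copy, device sort, D->H copy) with CUDA events.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdio>
#include <cstdlib>

static float ms_between(cudaEvent_t a, cudaEvent_t b)
{
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, a, b);
    return ms;
}

int main(void)
{
    const size_t n = 33000000;                   // 33 million integers
    thrust::host_vector<int> h(n);
    for (size_t i = 0; i < n; i++)
        h[i] = rand();

    cudaEvent_t ev[4];
    for (int i = 0; i < 4; i++)
        cudaEventCreate(&ev[i]);

    cudaEventRecord(ev[0]);
    thrust::device_vector<int> d = h;            // host -> device copy
    cudaEventRecord(ev[1]);
    thrust::sort(d.begin(), d.end());            // sort on the GPU
    cudaEventRecord(ev[2]);
    thrust::copy(d.begin(), d.end(), h.begin()); // device -> host copy
    cudaEventRecord(ev[3]);
    cudaEventSynchronize(ev[3]);

    printf("H->D %.1f ms, sort %.1f ms, D->H %.1f ms\n",
           ms_between(ev[0], ev[1]),
           ms_between(ev[1], ev[2]),
           ms_between(ev[2], ev[3]));
    return 0;
}
```

[If the two copy phases dominate for small inputs, the PCI-E bus rather than the GPU itself is the bottleneck Greg is asking about.]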
> >>> I've never seen a PostgreSQL server capable of running CUDA, and I
> >>> don't expect that to change.
> >>
> >> That sounds like:
> >>
> >> "I think there is a world market for maybe five computers."
> >> - IBM Chairman Thomas Watson, 1943
> >
> > Yes, and "640K will be enough for everyone", ha ha.  (Having said the 640K
> > thing is flat out denied by Gates, BTW, and no one has come up with proof
> > otherwise.)
> >
> > I think you've made an interesting case for this sort of acceleration now
> > being useful for systems doing what's typically considered a data warehouse
> > task.  I regularly see servers waiting for far more than 13M integers to
> > sort.  And I am seeing a clear trend toward providing more PCI-E slots in
> > servers now.  Dell's R810 is the most popular single server model my
> > customers have deployed in the last year, and it has 5 X8 slots in it.  It's
> > rare all 5 of those are filled.  As long as a dedicated GPU works fine when
> > dropped to X8 speeds, I know a fair number of systems where one of those
> > could be added now.
> >
> > There's another data point in your favor I didn't notice before your last
> > e-mail.  Amazon has a "Cluster GPU Quadruple Extra Large" node type that
> > runs with NVIDIA Tesla hardware.  That means the installed base of people
> > who could consider CUDA is higher than I expected.  To demonstrate how much
> > that costs, to provision a GPU enabled reserved instance from Amazon for one
> > year costs $2410 at "Light Utilization", giving a system with 22GB of RAM
> > and 1690GB of storage.  (I find the reserved prices easier to compare with
> > dedicated hardware than the hourly ones.)  That's halfway between the
> > High-Memory Double Extra Large Instance (34GB RAM/850GB disk) at $1100 and
> > the High-Memory Quadruple Extra Large Instance (64GB RAM/1690GB disk) at
> > $2200.  If someone could prove sorting was a bottleneck on their server,
> > that isn't an unreasonable option to consider on a cloud-based database
> > deployment.
> >
> > I still think that an approach based on OpenCL is more likely to be suitable
> > for PostgreSQL, which was part of why I gave CUDA low odds here.  The points
> > in favor of OpenCL are:
> >
> > - Since you last posted, OpenCL compiling has switched to using LLVM as its
> > standard compiler.  Good PostgreSQL support for LLVM isn't far away.  It
> > looks to me like the compiler situation for CUDA requires their PathScale
> > based compiler.  I don't know enough about this area to say which compiling
> > tool chain will end up being easier to deal with.
> >
> > - Intel is making GPU support standard for OpenCL, as I mentioned before.
> > NVIDIA will be hard pressed to compete with Intel for GPU acceleration once
> > more systems supporting that enter the market.
> >
> > - Easy availability of OpenCL on Mac OS X for development's sake.  Lots of
> > Postgres hackers have OS X systems, even though there aren't too many OS X
> > database servers.
> >
> > The fact that Amazon provides a way to crack the chicken/egg hardware
> > problem immediately helps a lot though; I don't even need a physical card
> > here to test CUDA GPU acceleration on Linux now.  With that data point, your
> > benchmarks are good enough to say I'd be willing to help review a patch in
> > this area as part of the 9.3 development cycle.  That may validate that
> > GPU acceleration is useful, and then the next step would be considering how
> > portable that will be to other GPU interfaces.  I still expect CUDA will be
> > looked back on as a dead end for GPU accelerated computing one day.
> > Computing history is not filled with many single-vendor standards that
> > competed successfully against Intel providing the same thing.  AMD's x86-64
> > is the only example I can think of where Intel didn't win that sort of race,
> > which happened (IMHO) only because Intel's Itanium failed to prioritize
> > backwards compatibility highly enough.
>
> As a side note, my module (PG-Strom) also uses CUDA.  I tried to
> implement it with OpenCL at the beginning of the project, but that didn't work
> well when multiple sessions used a GPU device concurrently:
> the second background process got an out-of-resources error while
> another process had the GPU device open.
>
> I'm not clear whether that is a limitation of OpenCL, of NVIDIA's driver, or a
> bug in my code.  Anyway, I switched to CUDA instead of investigating the
> binary drivers. :-(
>
> Thanks,
> --
> KaiGai Kohei <kaigai@kaigai.gr.jp>

I have no experience with OpenCL, but with CUDA 4.1 you can certainly share the same device from multiple host threads, for example allocating memory in one host thread and using it in another. Maybe with OpenCL you were facing that very same limit.
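[Since CUDA 4.0 the runtime API keeps one context per device per process, so all host threads see the same allocations. A minimal hypothetical sketch of what is meant here, assuming CUDA >= 4.0 and a C++11 compiler for std::thread: memory allocated in the main thread is used from a second host thread.]

```cuda
// Hypothetical sketch: allocate device memory in one host thread and
// use it from another.  Requires CUDA >= 4.0 (shared per-process context).
#include <cuda_runtime.h>
#include <cstdio>
#include <thread>

__global__ void fill(int *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p[i] = i;
}

int main(void)
{
    const int n = 1 << 20;
    int *d = NULL;
    cudaMalloc(&d, n * sizeof(int));          // allocated in the main thread

    std::thread worker([&]() {
        // The same device pointer is valid in this thread, because the
        // runtime shares one context per device across host threads.
        fill<<<(n + 255) / 256, 256>>>(d, n);
        int last = 0;
        cudaMemcpy(&last, d + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
        printf("last element: %d\n", last);   // should print n - 1
    });
    worker.join();
    cudaFree(d);
    return 0;
}
```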
