Thread: Queries runs slow on GPU with PG-Strom

Queries runs slow on GPU with PG-Strom

From
YANG
Date:
Hello,

I've run some tests on pg_strom following the wiki, but queries seem to
run slower on the GPU than on the CPU. Can someone shed some light on what's
wrong with my settings? My setup is a Quadro K620 + CUDA 7.0 (for Ubuntu 14.10) +
Ubuntu 15.04. The results were:

with pg_strom
=============

explain SELECT count(*) FROM t0 WHERE sqrt((x-25.6)^2 + (y-12.8)^2) < 10;
                                             QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=190993.70..190993.71 rows=1 width=0) (actual time=18792.236..18792.236 rows=1 loops=1)
   ->  Custom Scan (GpuPreAgg)  (cost=7933.07..184161.18 rows=86 width=108) (actual time=4249.656..18792.074 rows=77 loops=1)
         Bulkload: On (density: 100.00%)
         Reduction: NoGroup
         Device Filter: (sqrt((((x - '25.6'::double precision) ^ '2'::double precision) + ((y - '12.8'::double precision) ^ '2'::double precision))) < '10'::double precision)
         ->  Custom Scan (BulkScan) on t0  (cost=6933.07..182660.32 rows=10000060 width=0) (actual time=139.399..18499.246 rows=10000000 loops=1)
 Planning time: 0.262 ms
 Execution time: 19268.650 ms
(8 rows)



explain analyze SELECT cat, AVG(x) FROM t0 NATURAL JOIN t1 GROUP BY cat;
                                             QUERY PLAN
--------------------------------------------------------------------------------
 HashAggregate  (cost=298541.48..298541.81 rows=26 width=12) (actual time=11311.568..11311.572 rows=26 loops=1)
   Group Key: t0.cat
   ->  Custom Scan (GpuPreAgg)  (cost=5178.82..250302.07 rows=1088 width=52) (actual time=3304.727..11310.021 rows=2307 loops=1)
         Bulkload: On (density: 100.00%)
         Reduction: Local + Global
         ->  Custom Scan (GpuJoin)  (cost=4178.82..248541.18 rows=10000060 width=12) (actual time=923.417..2661.113 rows=10000000 loops=1)
               Bulkload: On (density: 100.00%)
               Depth 1: Logic: GpuHashJoin, HashKeys: (aid), JoinQual: (aid = aid), nrows_ratio: 1.00000000
               ->  Custom Scan (BulkScan) on t0  (cost=0.00..242858.60 rows=10000060 width=16) (actual time=6.980..871.431 rows=10000000 loops=1)
               ->  Seq Scan on t1  (cost=0.00..734.00 rows=40000 width=4) (actual time=0.204..7.309 rows=40000 loops=1)
 Planning time: 47.834 ms
 Execution time: 11355.103 ms
(12 rows)


without pg_strom
================

test=# explain analyze SELECT count(*) FROM t0 WHERE sqrt((x-25.6)^2 + (y-12.8)^2) < 10;
                                             QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=426193.03..426193.04 rows=1 width=0) (actual time=3880.379..3880.379 rows=1 loops=1)
   ->  Seq Scan on t0  (cost=0.00..417859.65 rows=3333353 width=0) (actual time=0.075..3859.200 rows=314063 loops=1)
         Filter: (sqrt((((x - '25.6'::double precision) ^ '2'::double precision) + ((y - '12.8'::double precision) ^ '2'::double precision))) < '10'::double precision)
         Rows Removed by Filter: 9685937
 Planning time: 0.411 ms
 Execution time: 3880.445 ms
(6 rows)

t=# explain analyze SELECT cat, AVG(x) FROM t0 NATURAL JOIN t1 GROUP BY cat;
                                             QUERY PLAN
--------------------------------------------------------------------------------
 HashAggregate  (cost=431593.73..431594.05 rows=26 width=12) (actual time=4960.810..4960.812 rows=26 loops=1)
   Group Key: t0.cat
   ->  Hash Join  (cost=1234.00..381593.43 rows=10000060 width=12) (actual time=20.859..3367.510 rows=10000000 loops=1)
         Hash Cond: (t0.aid = t1.aid)
         ->  Seq Scan on t0  (cost=0.00..242858.60 rows=10000060 width=16) (actual time=0.021..895.908 rows=10000000 loops=1)
         ->  Hash  (cost=734.00..734.00 rows=40000 width=4) (actual time=20.567..20.567 rows=40000 loops=1)
               Buckets: 65536  Batches: 1  Memory Usage: 1919kB
               ->  Seq Scan on t1  (cost=0.00..734.00 rows=40000 width=4) (actual time=0.017..11.013 rows=40000 loops=1)
 Planning time: 0.567 ms
 Execution time: 4961.029 ms
(10 rows)
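
To put the two sets of results side by side, the slowdown can be computed
directly from the "Execution time" lines reported in the plans. This is only a
minimal Python sketch: the numbers are copied from the output in this message,
while the dictionary keys and the helper name `slowdown` are illustrative
choices, not anything provided by pg_strom.

```python
# Execution times in milliseconds, copied from the EXPLAIN output in this message.
timings_ms = {
    "count(*) with distance filter": {"pg_strom": 19268.650, "cpu": 3880.445},
    "AVG(x) join + GROUP BY cat": {"pg_strom": 11355.103, "cpu": 4961.029},
}


def slowdown(pg_strom_ms, cpu_ms):
    """Return how many times slower the pg_strom run was than the CPU-only run."""
    return pg_strom_ms / cpu_ms


for name, t in timings_ms.items():
    ratio = slowdown(t["pg_strom"], t["cpu"])
    print(f"{name}: {ratio:.2f}x slower with pg_strom")
```

On these figures the filter query is roughly 5x slower and the join query
roughly 2.3x slower with pg_strom enabled.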



Here are the details of how I installed pg_strom:

1. download postgresql 9.5alpha1 and compile it with

    ,----
    | ./configure --prefix=/export/pg-9.5 --enable-debug --enable-cassert
    | make -j8 all
    | make install
    `----

2. install cuda-7.0 (ubuntu 14.10 package from nvidia website)

3. download and compile pg_strom with pg_config in /export/pg-9.5/bin

    ,----
    | make
    | make install
    `----

4. create a db with --no-local

    ,----
    | initdb --no-local 9.5
    `----

5. change postgresql.conf

    ,----
    | shared_buffers=1GB
    | shared_preload_libraries='pg_strom.so'
    | logging_collector = on
    | log_filename='postgresql-%d.log'
    | pg_strom.enabled=on
    `----

6. start postgres

    ,----
    | pg_ctl -D 9.5 start
    `----

   and got the following outputs

    ,----
    | LOG:  CUDA Runtime version: 7.0.0
    | LOG:  NVIDIA driver version: 346.59
    | LOG:  GPU0 Quadro K620 (384 CUDA cores, 1124MHz), L2 2048KB, RAM 2047MB (128bits, 900KHz), capability 5.0
    | LOG:  NVRTC - CUDA Runtime Compilation vertion 7.0
    | LOG:  redirecting log output to logging collector process
    | HINT:  Future log output will appear in directory "pg_log".
    `----

7. import testdb

    ,----
    | createdb test
    | psql test < ~/devel/pg_strom/test/testdb.sql
    | psql test -c 'create extension pg_strom'
    `----
 



Re: Queries runs slow on GPU with PG-Strom

From
Josh Berkus
Date:
On 07/22/2015 08:16 AM, YANG wrote:
> 
> Hello,
> 
> I've run some tests on pg_strom following the wiki, but queries seem to
> run slower on the GPU than on the CPU. Can someone shed some light on what's
> wrong with my settings? My setup is a Quadro K620 + CUDA 7.0 (for Ubuntu 14.10) +
> Ubuntu 15.04. The results were:

I believe that pgStrom has its own mailing list.  KaiGai?


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Queries runs slow on GPU with PG-Strom

From
Kouhei Kaigai
Date:
Hi Yang,

> I've run some tests on pg_strom following the wiki, but queries seem to
> run slower on the GPU than on the CPU. Can someone shed some light on what's
> wrong with my settings? My setup is a Quadro K620 + CUDA 7.0 (for Ubuntu 14.10) +
> Ubuntu 15.04. The results were:
> :
>         ,----
>         | LOG:  CUDA Runtime version: 7.0.0
>         | LOG:  NVIDIA driver version: 346.59
>         | LOG:  GPU0 Quadro K620 (384 CUDA cores, 1124MHz), L2 2048KB, RAM 2047MB
> (128bits, 900KHz), capability 5.0
>         | LOG:  NVRTC - CUDA Runtime Compilation vertion 7.0
>         | LOG:  redirecting log output to logging collector process
>         | HINT:  Future log output will appear in directory "pg_log".
>         `----
>
It looks to me like your GPU has poor memory access capability, so
the preprocessing step of aggregation (which makes heavy use of atomic
operations on global memory) consumes the majority of the processing time.
Please try the query with: SET pg_strom.enable_gpupreagg = off;
GpuJoin uses fewer atomic operations, so it has an advantage.
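
For the join query, the pg_strom plan posted at the start of the thread gives a
rough way to check where the time goes: subtract each node's children from the
node's own "actual time" end value. The Python sketch below does that with the
numbers from that plan. It is only an approximation, since it assumes the nodes
do not overlap in time, and the helper name `exclusive_ms` is illustrative.

```python
# "actual time" end values (ms) from the pg_strom plan of the
# SELECT cat, AVG(x) ... GROUP BY cat query posted in this thread.
# Each entry: node name -> (total time of the node, names of its children).
plan = {
    "HashAggregate": (11311.572, ["GpuPreAgg"]),
    "GpuPreAgg": (11310.021, ["GpuJoin"]),
    "GpuJoin": (2661.113, ["BulkScan t0", "Seq Scan t1"]),
    "BulkScan t0": (871.431, []),
    "Seq Scan t1": (7.309, []),
}


def exclusive_ms(plan, node):
    """Time attributed to a node itself: its total minus its children's totals."""
    total, children = plan[node]
    return total - sum(plan[child][0] for child in children)


root_total = plan["HashAggregate"][0]
for node in plan:
    own = exclusive_ms(plan, node)
    print(f"{node:14s} {own:9.3f} ms  ({own / root_total:5.1%})")
```

Under that assumption, GpuPreAgg accounts for about 8.6 s of the 11.3 s total,
while GpuJoin itself accounts for under 2 s.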

The two major advantages of a GPU are its massive number of cores and
its higher memory bandwidth than a CPU, so fundamentally I'd recommend
using a better GPU board...
According to NVIDIA, the K620 uses DDR3 DRAM, so there is no advantage
in memory access speed. How about a GTX750Ti (like the K620, it needs no
external power) or AWS's g2.2xlarge instance type?
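
The memory-bandwidth point can be made concrete with a back-of-the-envelope
estimate: peak bandwidth is roughly the bus width in bytes times the transfer
rate. The Python sketch below rests on assumptions that are not confirmed in
this thread: that the "900KHz" in the startup log is really a 900 MHz memory
clock doubled for DDR3, and that the GTX 750 Ti has a 128-bit GDDR5 interface
at an effective 5.4 GT/s. Treat the output as an estimate, not a measurement.

```python
def peak_bandwidth_gb_s(bus_width_bits, transfers_per_sec):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return bus_width_bits / 8 * transfers_per_sec / 1e9


# Quadro K620: bus width (128 bits) is taken from the startup log.
# ASSUMPTION: the logged "900KHz" is a 900 MHz clock, doubled for DDR3.
k620 = peak_bandwidth_gb_s(128, 2 * 900e6)

# GTX 750 Ti: ASSUMPTION, not from this thread: 128-bit GDDR5 at an
# effective 5.4 GT/s.
gtx750ti = peak_bandwidth_gb_s(128, 5.4e9)

print(f"K620       ~{k620:.1f} GB/s")
print(f"GTX 750 Ti ~{gtx750ti:.1f} GB/s ({gtx750ti / k620:.1f}x)")
```

With these inputs the K620 comes out at about 28.8 GB/s, and the GTX 750 Ti at
about three times that.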

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>



Re: Queries runs slow on GPU with PG-Strom

From
Kouhei Kaigai
Date:
> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Josh Berkus
> Sent: Thursday, July 23, 2015 2:49 AM
> To: YANG; pgsql-hackers@postgresql.org; KaiGai Kohei
> Subject: Re: [HACKERS] Queries runs slow on GPU with PG-Strom
> 
> On 07/22/2015 08:16 AM, YANG wrote:
> >
> > Hello,
> >
> > I've run some tests on pg_strom following the wiki, but queries seem to
> > run slower on the GPU than on the CPU. Can someone shed some light on what's
> > wrong with my settings? My setup is a Quadro K620 + CUDA 7.0 (for Ubuntu 14.10) +
> > Ubuntu 15.04. The results were:
> 
> I believe that pgStrom has its own mailing list.  KaiGai?
>
Sorry, I replied to this earlier.

We don't have our own mailing list, but the issue tracker on GitHub is
the proper way to report problems to the developers.
 https://github.com/pg-strom/devel/issues

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>