Thread: PG-Strom - A GPU optimized asynchronous executor module

PG-Strom - A GPU optimized asynchronous executor module

From
Kohei KaiGai
Date:
Hi,

I have implemented an FDW module that is designed to use GPU devices
to evaluate the qualifiers of sequential scans on foreign tables
managed by this module.

It is named PG-Strom, and the following wiki page gives a brief
overview of the module.
   http://wiki.postgresql.org/wiki/PGStrom

In our measurements, it runs about 10 times faster on sequential
scans with complex qualifiers, although of course this depends
heavily on the type of workload.

Example)
A query counts the number of records whose (x,y) is located within a
particular range.
A regular table 'rtbl' and a foreign table 'ftbl' have the same
contents, with 10 million records each.

postgres=# SELECT count(*) FROM rtbl WHERE sqrt((x-25.6)^2 + (y-12.8)^2) < 51.2;
 count
-------
 43134
(1 row)

Time: 10537.069 ms

postgres=# SELECT count(*) FROM ftbl WHERE sqrt((x-25.6)^2 + (y-12.8)^2) < 51.2;
 count
-------
 43134
(1 row)

Time: 744.252 ms

(*) See the "How to use" section of the wiki page to reproduce my test case.

This looks like quite a good result to me. However, I suspect that the
sequential scan on the regular table was not tuned appropriately.
Could you give me some hints for tuning sequential scans on large tables?
The only tuning I did for this test case was to raise shared_buffers to
1024MB, which is enough to hold the example tables entirely in memory.
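
As a quick check on the "about x10" figure, the two timings above can be compared directly. The following is only arithmetic on the numbers reported by \timing, written as a small Python sketch; it is not part of PG-Strom, and the variable names are made up for illustration.

```python
# Elapsed times reported by \timing in the example above (milliseconds).
regular_ms = 10537.069   # count(*) on the regular table 'rtbl'
foreign_ms = 744.252     # count(*) on the PG-Strom foreign table 'ftbl'
n_rows = 10_000_000      # both tables hold 10 million records

# Speedup of the foreign-table scan relative to the regular sequential scan.
speedup = regular_ms / foreign_ms

# Average cost per record in microseconds (1 ms = 1000 us).
regular_us_per_row = regular_ms * 1000 / n_rows
foreign_us_per_row = foreign_ms * 1000 / n_rows

print(f"speedup: {speedup:.1f}x")
print(f"rtbl: {regular_us_per_row:.3f} us/row, ftbl: {foreign_us_per_row:.3f} us/row")
```

For this single query the ratio comes out at roughly 14, so "about x10" is a slightly conservative summary of this one measurement.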

Thanks,
-- 
KaiGai Kohei <kaigai@kaigai.gr.jp>


Re: PG-Strom - A GPU optimized asynchronous executor module

From
Robert Haas
Date:
On Sun, Jan 22, 2012 at 10:48 AM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
> I tried to implement a fdw module that is designed to utilize GPU
> devices to execute
> qualifiers of sequential-scan on foreign tables managed by this module.
>
> It was named PG-Strom, and the following wikipage gives a brief
> overview of this module.
>    http://wiki.postgresql.org/wiki/PGStrom
>
> In our measurement, it achieves about x10 times faster on
> sequential-scan with complex-
> qualifiers, of course, it quite depends on type of workloads.

That's pretty neat.  In terms of tuning the non-GPU based
implementation, have you done any profiling?  Sometimes that leads to
an "oh, woops" moment.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: PG-Strom - A GPU optimized asynchronous executor module

From
Kohei KaiGai
Date:
2012/1/23 Robert Haas <robertmhaas@gmail.com>:
> On Sun, Jan 22, 2012 at 10:48 AM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
>> I tried to implement a fdw module that is designed to utilize GPU
>> devices to execute
>> qualifiers of sequential-scan on foreign tables managed by this module.
>>
>> It was named PG-Strom, and the following wikipage gives a brief
>> overview of this module.
>>    http://wiki.postgresql.org/wiki/PGStrom
>>
>> In our measurement, it achieves about x10 times faster on
>> sequential-scan with complex-
>> qualifiers, of course, it quite depends on type of workloads.
>
> That's pretty neat.  In terms of tuning the non-GPU based
> implementation, have you done any profiling?  Sometimes that leads to
> an "oh, woops" moment.
>
Not yet, apart from \timing.

What options are available for seeing how much of the workload each
component accounts for within a particular query?
I tried googling some keywords, but found nothing useful.


As an aside, I also tried modifying is_device_executable_qual() to always
return false, which disables push-down of qualifiers.
In this case, 2100ms of the 7679ms was consumed within this module, so
I guess the remaining 5500ms or so was mostly consumed by ExecQual(),
although that is just an estimate...

postgres=# SET pg_strom.exec_profile = on;
SET
Time: 1.075 ms
postgres=# SELECT count(*) FROM ftbl WHERE sqrt((x-25.6)^2 + (y-12.8)^2) < 10;
INFO:  PG-Strom Exec Profile on "ftbl"
INFO:  Total PG-Strom consumed time: 2100.898 ms
INFO:  Time to JIT Compile GPU code: 0.000 ms
INFO:  Time to initialize devices:   0.000 ms
INFO:  Time to Load column-stores:   7.013 ms
INFO:  Time to Scan column-stores:   1219.746 ms
INFO:  Time to Fetch virtual tuples: 874.095 ms
INFO:  Time of GPU Synchronization:  0.000 ms
INFO:  Time of Async memcpy:         0.000 ms
INFO:  Time of Async kernel exec:    0.000 ms
 count
-------
  3159
(1 row)

Time: 7679.342 ms
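
The breakdown behind that estimate can be worked out from the profile output above. The following Python sketch only re-does the arithmetic on the reported figures; the dictionary keys are informal labels chosen for illustration, and attributing the remainder to ExecQual() is the guess stated in this message, not something the numbers prove.

```python
# Figures copied from the pg_strom.exec_profile output above (milliseconds).
total_query_ms = 7679.342    # psql \timing for the whole query
strom_total_ms = 2100.898    # "Total PG-Strom consumed time"
strom_parts_ms = {
    "load column-stores": 7.013,
    "scan column-stores": 1219.746,
    "fetch virtual tuples": 874.095,
}

# Sum of the itemized PG-Strom phases (the GPU-related phases were all 0.000 ms).
itemized_ms = sum(strom_parts_ms.values())

# Time not attributed to PG-Strom at all.
outside_ms = total_query_ms - strom_total_ms

for name, ms in strom_parts_ms.items():
    share = 100 * ms / total_query_ms
    print(f"{name:22s} {ms:9.3f} ms  {share:5.1f}%")
share = 100 * outside_ms / total_query_ms
print(f"{'outside PG-Strom':22s} {outside_ms:9.3f} ms  {share:5.1f}%")
```

The part outside PG-Strom is a little under 5600ms, in line with the rough 5500ms figure above. A profiler would be needed to confirm where that time actually goes.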


Thanks,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>


Re: PG-Strom - A GPU optimized asynchronous executor module

From
Robert Haas
Date:
On Mon, Jan 23, 2012 at 1:38 AM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
> What options are available to see rate of workloads of components
> within a particular query?

I usually use oprofile, though I'm given to understand it's been
superseded by a new tool called perf.  I haven't had a chance to
experiment with perf yet, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: PG-Strom - A GPU optimized asynchronous executor module

From
Simon Riggs
Date:
On Sun, Jan 22, 2012 at 3:48 PM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:

> I tried to implement a fdw module that is designed to utilize GPU
> devices to execute
> qualifiers of sequential-scan on foreign tables managed by this module.
>
> It was named PG-Strom, and the following wikipage gives a brief
> overview of this module.
>    http://wiki.postgresql.org/wiki/PGStrom
>
> In our measurement, it achieves about x10 times faster on
> sequential-scan with complex-
> qualifiers, of course, it quite depends on type of workloads.

Very cool. Someone's been busy.

I see you've introduced 3 new features here at the same time:
* GPU access
* column store
* compiled WHERE clauses

It would be useful to see if we can determine which of those gives the
most benefit and whether other directions emerge.

Also, the query you mention is probably the best performing query you
can come up with. It looks like a GIS query, yet isn't. Would it be
possible to run tests on the TPC-H suite and do a full comparison of
strengths/weaknesses so we can understand the breadth of applicability
of the techniques?

This is a very interesting line of discussion, but please can we hold
off further posts about it until after the CF is over?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PG-Strom - A GPU optimized asynchronous executor module

From
Kohei KaiGai
Date:
2012/1/23 Simon Riggs <simon@2ndquadrant.com>:
> On Sun, Jan 22, 2012 at 3:48 PM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
>
>> I tried to implement a fdw module that is designed to utilize GPU
>> devices to execute
>> qualifiers of sequential-scan on foreign tables managed by this module.
>>
>> It was named PG-Strom, and the following wikipage gives a brief
>> overview of this module.
>>    http://wiki.postgresql.org/wiki/PGStrom
>>
>> In our measurement, it achieves about x10 times faster on
>> sequential-scan with complex-
>> qualifiers, of course, it quite depends on type of workloads.
>
> Very cool. Someone's been busy.
>
> I see you've introduced 3 new features here at same time
> * GPU access
> * column store
> * compiled WHERE clauses
>
> It would be useful to see if we can determine which of those gives the
> most benefit and whether other directions emerge.
>
> Also, the query you mention is probably the best performing query you
> can come up with. It looks like a GIS query, yet isn't. Would it be
> possible to run tests on the TPC-H suite and do a full comparison of
> strengths/weaknesses so we can understand the breadth of applicability
> of the techniques.
>
TPC-H is expensive to run, so DBT-2 would be a good alternative.

> This is a very interesting line of discussion, but please can we hold
> off further posts about it until after the CF is over?
>
Yep, I agree.
We should handle the existing patches first, then new features for v9.3.

I'll get back to reviewing pgsql_fdw.

Thanks,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>


Re: PG-Strom - A GPU optimized asynchronous executor module

From
Simon Riggs
Date:
On Mon, Jan 23, 2012 at 2:49 PM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:

>> Also, the query you mention is probably the best performing query you
>> can come up with. It looks like a GIS query, yet isn't. Would it be
>> possible to run tests on the TPC-H suite and do a full comparison of
>> strengths/weaknesses so we can understand the breadth of applicability
>> of the techniques.
>>
> DBT-2 is a good alternative, even though TPC-H is expensive to run.

DBT-2 is an OLTP test, not a DSS/DW test.

I'm not interested in the full TPC-H test, just a query by query
comparison of how well this stacks up. If there are other tests that
are also balanced/representative, I'd like to see those also. Just so
we can see the benefit envelope.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services