Thread: is parallel union all possible over dblink?

is parallel union all possible over dblink?

From

Svetlin Manavski

Date:

29 June 2011, 13:18:02

Hi all,
I am running PostgreSQL 9.0 on a number of nodes in an application level cluster (there is different data on different machines). Currently a PL/pgSQL function generates automatically aggregation queries like the following:

(select * from appqosfe.F_total_utilization(1306918800000000000::INT8, NULL, 60000000000::INT8, NULL))
UNION ALL
(SELECT * from dblink('remote1','select * from appqosfe.F_total_utilization(1306918800000000000::INT8, NULL, 60000000000::INT8, NULL)') as T1(detectroid numeric, timegroup numeric, numbytes numeric, numpackets numeric))
order by timegroup asc

The above example supposes that only 2 nodes are active (one local and one remote). Here I can clearly see that the remote sub-query starts only when the local one is completed so the total time grows linearly with the number of nodes.

Question: Is there a way to get the same result from within a PL/pgSQL function but running all the sub-queries in parallel? In case it is not directly available, which one would be the simplest way to implement it in my application? (I am very keen to avoid the obvious solution of an additional multi-threaded layer which would do it out of the RDBMS)

Thank you,
Svetlin Manavski

Re: is parallel union all possible over dblink?

From

"Marinos Yannikos"

Date:

29 June 2011, 15:37:37

On Wed, 29 Jun 2011 13:55:58 +0200, Svetlin Manavski
<svetlin.manavski@gmail.com> wrote:

> Question: Is there a way to get the same result from within a PL/pgSQL
> function but running all the sub-queries in parallel? In case it is not
> directly available, which one would be the simplest way to implement it
> in
> my application? (I am very keen to avoid the obvious solution of an
> additional multi-threaded layer which would do it out of the RDBMS)

Have you tried dblink_send_query() + dblink_get_results() yet?

http://www.postgresql.org/docs/current/static/contrib-dblink-send-query.html

You'd have to do something like this to your queries [untested]:

select dblink_send_query('remote1','select * from
appqosfe.F_total_utilization(1306918800000000000::INT8, NULL,
60000000000::INT8, NULL)');
(select * from appqosfe.F_total_utilization(1306918800000000000::INT8,
NULL,
60000000000::INT8, NULL))
UNION ALL
(SELECT * from dblink_get_result('remote1') as T1(detectroid numeric,
timegroup numeric,
numbytes numeric, numpackets numeric))
order by timegroup asc;

i.e. start your remote query/-ies asynchronously, then collect the results
in the UNION query. At least in theory it should work...

Regards,
  Marinos

Re: is parallel union all possible over dblink?

From

Greg Spiegelberg

Date:

29 June 2011, 16:14:47

On Wed, Jun 29, 2011 at 12:37 PM, Marinos Yannikos <mjy@geizhals.at> wrote:

On Wed, 29 Jun 2011 13:55:58 +0200, Svetlin Manavski <svetlin.manavski@gmail.com> wrote:

Question: Is there a way to get the same result from within a PL/pgSQL
function but running all the sub-queries in parallel? In case it is not
directly available, which one would be the simplest way to implement it in
my application? (I am very keen to avoid the obvious solution of an
additional multi-threaded layer which would do it out of the RDBMS)

Have you tried dblink_send_query() + dblink_get_results() yet?

http://www.postgresql.org/docs/current/static/contrib-dblink-send-query.html

You'd have to do something like this to your queries [untested]:

select dblink_send_query('remote1','select * from

appqosfe.F_total_utilization(1306918800000000000::INT8, NULL,
60000000000::INT8, NULL)');

(select * from appqosfe.F_total_utilization(1306918800000000000::INT8, NULL,
60000000000::INT8, NULL))
UNION ALL
(SELECT * from dblink_get_result('remote1') as T1(detectroid numeric, timegroup numeric,
numbytes numeric, numpackets numeric))
order by timegroup asc;

i.e. start your remote query/-ies asynchronously, then collect the results in the UNION query. At least in theory it should work...

This does work however you'll need to add a little more to it to ensure your UNION succeeds. In pseudo...

connection #1:

CREATE TABLE target_1 ...

BEGIN;

LOCK TABLE target_1 IN ACCESS EXCLUSIVE MODE;

INSERT INTO target_1 SELECT ...

COMMIT;

connection #2:

CREATE TABLE target_2 ...

BEGIN;

LOCK TABLE target_2 IN ACCESS EXCLUSIVE MODE;

INSERT INTO target_2 SELECT ...

COMMIT;

connection #3:

SELECT * FROM target_1 UNION SELECT * FROM target_2;

Connections 1 and 2 can be done in simultaneously and after both have reached the LOCK statement then the SELECT on connection 3 can be executed. Same fundamentals if all three connections are to different databases and connection 3 uses dblink to pull the data.

Another alternative is to use GridSQL. I haven't used it myself but seen it in action on a large install with 4 backend databases. Pretty slick.

Greg

Re: is parallel union all possible over dblink?

From

Shaun Thomas

Date:

29 June 2011, 16:38:33

On 06/29/2011 02:14 PM, Greg Spiegelberg wrote:

> Another alternative is to use GridSQL. I haven't used it myself but
> seen it in action on a large install with 4 backend databases. Pretty
> slick.

We actually demoed this as a proof of concept a while back. Even just
having two instances on the same machine resulted in linear improvements
in execution speed thanks to parallel query execution.

Setting it up is something of a PITA, though, and the metadata database
is completely arbitrary. You basically must use the GridSQL intermediate
layer if you ever want to see your data again. I wouldn't use it for
anything but a reporting database that can be reconstructed if necessary.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@peak6.com

______________________________________________

See  http://www.peak6.com/email_disclaimer.php
for terms and conditions related to this email

Re: is parallel union all possible over dblink?

From

Greg Spiegelberg

Date:

30 June 2011, 09:37:32

On Thu, Jun 30, 2011 at 3:02 AM, Svetlin Manavski <svetlin.manavski@gmail.com> wrote:

I am now a bit puzzled after the initial satisfaction by Marinos' reply.

1. what do you mean exactly by "to ensure your UNION succeeds". The dblink docs do not mention anything about issues using directly the suggested dblink_send_query() + dblink_get_results(). What problems should I expect in using them as suggested by Marinos?

Admittedly, I hadn't used those specific dblink functions and imagined dblink_get_result() failing if the query on the connection wasn't finished. It appears now that after some experimentation that it's perfectly happy hanging until the query is finished executing.

2. If I understand correctly your method, it is not applicable from inside a stored procedure, is it? I need to keep all the business logic within PostgreSQL and provide just a clean interface to a simple GUI layer

Then dblink is your answer. My suggestion applies if you were implementing a solution in the application.

3. Unfortunately GridSQL and Pgpool-II do not seem mature and stable products to be used in commercial software. Neither one provides clear documentation. GridSQL has been discontinued and it is not clear what kind of future it will have. I have not tried GridSQL but I did try Pgpool-II. It is disappointing that it may stop working correctly even just because of the way you write the query (e.g. using uppercase in a field or using named field in group by, ecc.). Even worse, when it does not recognize something in the parallel query, it just provides incorrect result (from only the local DB) rather than raising an exception. So I guess Pgpool-II in its current state is good only for very simple applications, which are not supposed to be reliable at all.

I don't think GridSQL is discontinued. Appears though EnterpriseDB has open sourced it and moved to http://sourceforge.net/projects/gridsql/. Not incredibly active but some as recent as last month.

Sorry for the confusion.

Greg

Re: is parallel union all possible over dblink?

From

Svetlin Manavski

Date:

01 July 2011, 10:54:37

I am now a bit puzzled after the initial satisfaction by Marinos' reply.

1. what do you mean exactly by "to ensure your UNION succeeds". The dblink docs do not mention anything about issues using directly the suggested dblink_send_query() + dblink_get_results(). What problems should I expect in using them as suggested by Marinos?

2. If I understand correctly your method, it is not applicable from inside a stored procedure, is it? I need to keep all the business logic within PostgreSQL and provide just a clean interface to a simple GUI layer

3. Unfortunately GridSQL and Pgpool-II do not seem mature and stable products to be used in commercial software. Neither one provides clear documentation. GridSQL has been discontinued and it is not clear what kind of future it will have. I have not tried GridSQL but I did try Pgpool-II. It is disappointing that it may stop working correctly even just because of the way you write the query (e.g. using uppercase in a field or using named field in group by, ecc.). Even worse, when it does not recognize something in the parallel query, it just provides incorrect result (from only the local DB) rather than raising an exception. So I guess Pgpool-II in its current state is good only for very simple applications, which are not supposed to be reliable at all.

Thank you,
Svetlin Manavski

On Wed, Jun 29, 2011 at 8:14 PM, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:

This does work however you'll need to add a little more to it to ensure your UNION succeeds. In pseudo...

connection #1:
CREATE TABLE target_1 ...
BEGIN;
LOCK TABLE target_1 IN ACCESS EXCLUSIVE MODE;
INSERT INTO target_1 SELECT ...
COMMIT;

connection #2:
CREATE TABLE target_2 ...
BEGIN;
LOCK TABLE target_2 IN ACCESS EXCLUSIVE MODE;
INSERT INTO target_2 SELECT ...
COMMIT;

connection #3:
SELECT * FROM target_1 UNION SELECT * FROM target_2;

Connections 1 and 2 can be done in simultaneously and after both have reached the LOCK statement then the SELECT on connection 3 can be executed. Same fundamentals if all three connections are to different databases and connection 3 uses dblink to pull the data.

Another alternative is to use GridSQL. I haven't used it myself but seen it in action on a large install with 4 backend databases. Pretty slick.

Greg