Thread: is parallel union all possible over dblink?
Hi all,
I am running PostgreSQL 9.0 on a number of nodes in an application level cluster (there is different data on different machines). Currently a PL/pgSQL function generates automatically aggregation queries like the following:
(select * from appqosfe.F_total_utilization(1306918800000000000::INT8, NULL, 60000000000::INT8, NULL))
UNION ALL
(SELECT * from dblink('remote1','select * from appqosfe.F_total_utilization(1306918800000000000::INT8, NULL, 60000000000::INT8, NULL)') as T1(detectroid numeric, timegroup numeric, numbytes numeric, numpackets numeric))
order by timegroup asc
The above example supposes that only 2 nodes are active (one local and one remote). Here I can clearly see that the remote sub-query starts only when the local one is completed so the total time grows linearly with the number of nodes.
Question: Is there a way to get the same result from within a PL/pgSQL function but running all the sub-queries in parallel? In case it is not directly available, which one would be the simplest way to implement it in my application? (I am very keen to avoid the obvious solution of an additional multi-threaded layer which would do it out of the RDBMS)
Thank you,
Svetlin Manavski
I am running PostgreSQL 9.0 on a number of nodes in an application level cluster (there is different data on different machines). Currently a PL/pgSQL function generates automatically aggregation queries like the following:
(select * from appqosfe.F_total_utilization(1306918800000000000::INT8, NULL, 60000000000::INT8, NULL))
UNION ALL
(SELECT * from dblink('remote1','select * from appqosfe.F_total_utilization(1306918800000000000::INT8, NULL, 60000000000::INT8, NULL)') as T1(detectroid numeric, timegroup numeric, numbytes numeric, numpackets numeric))
order by timegroup asc
The above example supposes that only 2 nodes are active (one local and one remote). Here I can clearly see that the remote sub-query starts only when the local one is completed so the total time grows linearly with the number of nodes.
Question: Is there a way to get the same result from within a PL/pgSQL function but running all the sub-queries in parallel? In case it is not directly available, which one would be the simplest way to implement it in my application? (I am very keen to avoid the obvious solution of an additional multi-threaded layer which would do it out of the RDBMS)
Thank you,
Svetlin Manavski
On Wed, 29 Jun 2011 13:55:58 +0200, Svetlin Manavski <svetlin.manavski@gmail.com> wrote: > Question: Is there a way to get the same result from within a PL/pgSQL > function but running all the sub-queries in parallel? In case it is not > directly available, which one would be the simplest way to implement it > in > my application? (I am very keen to avoid the obvious solution of an > additional multi-threaded layer which would do it out of the RDBMS) Have you tried dblink_send_query() + dblink_get_results() yet? http://www.postgresql.org/docs/current/static/contrib-dblink-send-query.html You'd have to do something like this to your queries [untested]: select dblink_send_query('remote1','select * from appqosfe.F_total_utilization(1306918800000000000::INT8, NULL, 60000000000::INT8, NULL)'); (select * from appqosfe.F_total_utilization(1306918800000000000::INT8, NULL, 60000000000::INT8, NULL)) UNION ALL (SELECT * from dblink_get_result('remote1') as T1(detectroid numeric, timegroup numeric, numbytes numeric, numpackets numeric)) order by timegroup asc; i.e. start your remote query/-ies asynchronously, then collect the results in the UNION query. At least in theory it should work... Regards, Marinos
On Wed, Jun 29, 2011 at 12:37 PM, Marinos Yannikos <mjy@geizhals.at> wrote:
On Wed, 29 Jun 2011 13:55:58 +0200, Svetlin Manavski <svetlin.manavski@gmail.com> wrote:Have you tried dblink_send_query() + dblink_get_results() yet?Question: Is there a way to get the same result from within a PL/pgSQL
function but running all the sub-queries in parallel? In case it is not
directly available, which one would be the simplest way to implement it in
my application? (I am very keen to avoid the obvious solution of an
additional multi-threaded layer which would do it out of the RDBMS)
http://www.postgresql.org/docs/current/static/contrib-dblink-send-query.html
You'd have to do something like this to your queries [untested]:
select dblink_send_query('remote1','select * from60000000000::INT8, NULL)');
appqosfe.F_total_utilization(1306918800000000000::INT8, NULL,(SELECT * from dblink_get_result('remote1') as T1(detectroid numeric, timegroup numeric,
(select * from appqosfe.F_total_utilization(1306918800000000000::INT8, NULL,
60000000000::INT8, NULL))
UNION ALL
numbytes numeric, numpackets numeric))
order by timegroup asc;
i.e. start your remote query/-ies asynchronously, then collect the results in the UNION query. At least in theory it should work...
This does work however you'll need to add a little more to it to ensure your UNION succeeds. In pseudo...
connection #1:
CREATE TABLE target_1 ...
BEGIN;
LOCK TABLE target_1 IN ACCESS EXCLUSIVE MODE;
INSERT INTO target_1 SELECT ...
COMMIT;
connection #2:
CREATE TABLE target_2 ...
BEGIN;
LOCK TABLE target_2 IN ACCESS EXCLUSIVE MODE;
INSERT INTO target_2 SELECT ...
COMMIT;
connection #3:
SELECT * FROM target_1 UNION SELECT * FROM target_2;
Connections 1 and 2 can be done in simultaneously and after both have reached the LOCK statement then the SELECT on connection 3 can be executed. Same fundamentals if all three connections are to different databases and connection 3 uses dblink to pull the data.
Another alternative is to use GridSQL. I haven't used it myself but seen it in action on a large install with 4 backend databases. Pretty slick.
Greg
On 06/29/2011 02:14 PM, Greg Spiegelberg wrote: > Another alternative is to use GridSQL. I haven't used it myself but > seen it in action on a large install with 4 backend databases. Pretty > slick. We actually demoed this as a proof of concept a while back. Even just having two instances on the same machine resulted in linear improvements in execution speed thanks to parallel query execution. Setting it up is something of a PITA, though, and the metadata database is completely arbitrary. You basically must use the GridSQL intermediate layer if you ever want to see your data again. I wouldn't use it for anything but a reporting database that can be reconstructed if necessary. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604 312-676-8870 sthomas@peak6.com ______________________________________________ See http://www.peak6.com/email_disclaimer.php for terms and conditions related to this email
On Thu, Jun 30, 2011 at 3:02 AM, Svetlin Manavski <svetlin.manavski@gmail.com> wrote:
I am now a bit puzzled after the initial satisfaction by Marinos' reply.
1. what do you mean exactly by "to ensure your UNION succeeds". The dblink docs do not mention anything about issues using directly the suggested dblink_send_query() + dblink_get_results(). What problems should I expect in using them as suggested by Marinos?
Admittedly, I hadn't used those specific dblink functions and imagined dblink_get_result() failing if the query on the connection wasn't finished. It appears now that after some experimentation that it's perfectly happy hanging until the query is finished executing.
2. If I understand correctly your method, it is not applicable from inside a stored procedure, is it? I need to keep all the business logic within PostgreSQL and provide just a clean interface to a simple GUI layer
Then dblink is your answer. My suggestion applies if you were implementing a solution in the application.
3. Unfortunately GridSQL and Pgpool-II do not seem mature and stable products to be used in commercial software. Neither one provides clear documentation. GridSQL has been discontinued and it is not clear what kind of future it will have. I have not tried GridSQL but I did try Pgpool-II. It is disappointing that it may stop working correctly even just because of the way you write the query (e.g. using uppercase in a field or using named field in group by, ecc.). Even worse, when it does not recognize something in the parallel query, it just provides incorrect result (from only the local DB) rather than raising an exception. So I guess Pgpool-II in its current state is good only for very simple applications, which are not supposed to be reliable at all.
I don't think GridSQL is discontinued. Appears though EnterpriseDB has open sourced it and moved to http://sourceforge.net/projects/gridsql/. Not incredibly active but some as recent as last month.
Sorry for the confusion.
Greg
I am now a bit puzzled after the initial satisfaction by Marinos' reply.
1. what do you mean exactly by "to ensure your UNION succeeds". The dblink docs do not mention anything about issues using directly the suggested dblink_send_query() + dblink_get_results(). What problems should I expect in using them as suggested by Marinos?
2. If I understand correctly your method, it is not applicable from inside a stored procedure, is it? I need to keep all the business logic within PostgreSQL and provide just a clean interface to a simple GUI layer
3. Unfortunately GridSQL and Pgpool-II do not seem mature and stable products to be used in commercial software. Neither one provides clear documentation. GridSQL has been discontinued and it is not clear what kind of future it will have. I have not tried GridSQL but I did try Pgpool-II. It is disappointing that it may stop working correctly even just because of the way you write the query (e.g. using uppercase in a field or using named field in group by, ecc.). Even worse, when it does not recognize something in the parallel query, it just provides incorrect result (from only the local DB) rather than raising an exception. So I guess Pgpool-II in its current state is good only for very simple applications, which are not supposed to be reliable at all.
Thank you,
Svetlin Manavski
1. what do you mean exactly by "to ensure your UNION succeeds". The dblink docs do not mention anything about issues using directly the suggested dblink_send_query() + dblink_get_results(). What problems should I expect in using them as suggested by Marinos?
2. If I understand correctly your method, it is not applicable from inside a stored procedure, is it? I need to keep all the business logic within PostgreSQL and provide just a clean interface to a simple GUI layer
3. Unfortunately GridSQL and Pgpool-II do not seem mature and stable products to be used in commercial software. Neither one provides clear documentation. GridSQL has been discontinued and it is not clear what kind of future it will have. I have not tried GridSQL but I did try Pgpool-II. It is disappointing that it may stop working correctly even just because of the way you write the query (e.g. using uppercase in a field or using named field in group by, ecc.). Even worse, when it does not recognize something in the parallel query, it just provides incorrect result (from only the local DB) rather than raising an exception. So I guess Pgpool-II in its current state is good only for very simple applications, which are not supposed to be reliable at all.
Thank you,
Svetlin Manavski
On Wed, Jun 29, 2011 at 8:14 PM, Greg Spiegelberg <gspiegelberg@gmail.com> wrote:
This does work however you'll need to add a little more to it to ensure your UNION succeeds. In pseudo...connection #1:CREATE TABLE target_1 ...BEGIN;LOCK TABLE target_1 IN ACCESS EXCLUSIVE MODE;INSERT INTO target_1 SELECT ...COMMIT;connection #2:CREATE TABLE target_2 ...BEGIN;LOCK TABLE target_2 IN ACCESS EXCLUSIVE MODE;INSERT INTO target_2 SELECT ...COMMIT;connection #3:SELECT * FROM target_1 UNION SELECT * FROM target_2;Connections 1 and 2 can be done in simultaneously and after both have reached the LOCK statement then the SELECT on connection 3 can be executed. Same fundamentals if all three connections are to different databases and connection 3 uses dblink to pull the data.Another alternative is to use GridSQL. I haven't used it myself but seen it in action on a large install with 4 backend databases. Pretty slick.Greg