Re: PgPool parallel query performance rules of thumb - Mailing list pgsql-performance

From Mason Sharp
Subject Re: PgPool parallel query performance rules of thumb
Date
Msg-id 51494DB187D98F4C88DBEBF1F5F6D423050F09CE@edb06.mail01.enterprisedb.com
Whole thread Raw
In response to PgPool parallel query performance rules of thumb  (John Beaver <john.e.beaver@gmail.com>)
List pgsql-performance
Hi John,

It has been a while since I played around with PgPool-II. In the tests
that I did, it did help with load balancing. For parallel query, it
helped for simple queries, such as when querying a single table. If that
is your typical use case, you may benefit. For other queries, it was not
as effective. For example:

SELECT t1.col1, t1.col2
FROM t1 inner join t2 on t1.col1 = t2.col1
WHERE t2.col3 > 1000
ORDER BY t1.col1

Assume that the optimizer decided to process t2 first. It would apply
the where predicate t2.col3 > 1000 in parallel across all the nodes,
which is a good thing, and pull in those results. But, for t1, it will
query all of the nodes, then pull in all of the rows (just t1.col1 and
t1.col2 though) into a single node and perform the join and sort there
as well. You are not getting much parallelism on that step, particularly
noticeable if it is a large table.

So, there is some benefit, but it is limited. Also, again, it has been a
while since I ran this. It may have since improved (I apologize if this
is inaccurate), and I do like the other features of PgPool and what SRA
has done.

In contrast, GridSQL would parallelize this better. (Full disclosure: I
work on the free and open source GridSQL project.) It would likely
process t2 first, like pgpool. However, it would send the intermediate
results to the other nodes in the cluster. If it turns out that t1.col1
was also the column on which a distribution hash was based for t1, it
would ship those intermediate rows to only those nodes that it needs to
for joining. Then, on this second step, all of these joins would happen
in parallel, with ORDER BY applied. Back at the coordinator, since an
ORDER BY is present, GridSQL would do a merge-sort from the results of
the other nodes and return them to the client.

I hope that helps. On pgfoundry.org there are forums within the pgpool
project where they can probably better answer your questions. If you
have any questions about GridSQL, please feel free to post in the forums
at enterprisedb.com or email me directly.

Regards,

Mason


> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-
> owner@postgresql.org] On Behalf Of John Beaver
> Sent: Wednesday, June 04, 2008 11:59 AM
> To: Pgsql-Performance
> Subject: [PERFORM] PgPool parallel query performance rules of thumb
>
> Hi,
>     I'm trying to make use of a cluster of 40 nodes that my group has,
> and I'm curious if anyone has experience with PgPool's parallel query
> mode. Under what circumstances could I expect the most benefit from
> query parallelization as implemented by PgPool?
>
> --
> Sent via pgsql-performance mailing list
(pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance

pgsql-performance by date:

Previous
From: Volkan YAZICI
Date:
Subject: Optimizing AGE()
Next
From: "Scott Marlowe"
Date:
Subject: Re: Checkpoint tuning on 8.2.4