Re: PgPool parallel query performance rules of thumb - Mailing list pgsql-performance
From | Mason Sharp |
---|---|
Subject | Re: PgPool parallel query performance rules of thumb |
Date | |
Msg-id | 51494DB187D98F4C88DBEBF1F5F6D423050F09CE@edb06.mail01.enterprisedb.com Whole thread Raw |
In response to | PgPool parallel query performance rules of thumb (John Beaver <john.e.beaver@gmail.com>) |
List | pgsql-performance |
Hi John, It has been a while since I played around with PgPool-II. In the tests that I did, it did help with load balancing. For parallel query, it helped for simple queries, such as when querying a single table. If that is your typical use case, you may benefit. For other queries, it was not as effective. For example: SELECT t1.col1, t1.col2 FROM t1 inner join t2 on t1.col1 = t2.col1 WHERE t2.col3 > 1000 ORDER BY t1.col1 Assume that the optimizer decided to process t2 first. It would apply the where predicate t2.col3 > 1000 in parallel across all the nodes, which is a good thing, and pull in those results. But, for t1, it will query all of the nodes, then pull in all of the rows (just t1.col1 and t1.col2 though) into a single node and perform the join and sort there as well. You are not getting much parallelism on that step, particularly noticeable if it is a large table. So, there is some benefit, but it is limited. Also, again, it has been a while since I ran this. It may have since improved (I apologize if this is inaccurate), and I do like the other features of PgPool and what SRA has done. In contrast, GridSQL would parallelize this better. (Full disclosure: I work on the free and open source GridSQL project.) It would likely process t2 first, like pgpool. However, it would send the intermediate results to the other nodes in the cluster. If it turns out that t1.col1 was also the column on which a distribution hash was based for t1, it would ship those intermediate rows to only those nodes that it needs to for joining. Then, on this second step, all of these joins would happen in parallel, with ORDER BY applied. Back at the coordinator, since an ORDER BY is present, GridSQL would do a merge-sort from the results of the other nodes and return them to the client. I hope that helps. On pgfoundry.org there are forums within the pgpool project where they can probably better answer your questions. If you have any questions about GridSQL, please feel free to post in the forums at enterprisedb.com or email me directly. Regards, Mason > -----Original Message----- > From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance- > owner@postgresql.org] On Behalf Of John Beaver > Sent: Wednesday, June 04, 2008 11:59 AM > To: Pgsql-Performance > Subject: [PERFORM] PgPool parallel query performance rules of thumb > > Hi, > I'm trying to make use of a cluster of 40 nodes that my group has, > and I'm curious if anyone has experience with PgPool's parallel query > mode. Under what circumstances could I expect the most benefit from > query parallelization as implemented by PgPool? > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance
pgsql-performance by date: