Re: Are there any options to parallelize queries? - Mailing list pgsql-general

From Craig Ringer
Subject Re: Are there any options to parallelize queries?
Date
Msg-id 503450F6.7010603@ringerc.id.au
In response to Are there any options to parallelize queries?  (Seref Arikan <serefarikan@kurumsalteknoloji.com>)
Responses Re: Are there any options to parallelize queries?  (Seref Arikan <serefarikan@kurumsalteknoloji.com>)
List pgsql-general
On 08/21/2012 04:45 PM, Seref Arikan wrote:

> Parallel software frameworks such as Erlang's OTP or Scala's Akka do
> help a lot, but it would be a lot better if I could feed those
> frameworks with data faster. So, what options do I have to execute
> queries in parallel, assuming a transactional system running on
> postgresql?

AFAIK native support for parallelisation of query execution is currently
almost non-existent in Pg. You generally have to break your queries up
into smaller queries that do part of the work, run them in parallel, and
integrate the results together client-side.
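
As a rough illustration of that approach, here is a minimal Python sketch
that splits a key range into chunks, runs one partial aggregate per chunk
in a thread pool, and combines the partial results client-side. Everything
specific in it is an assumption made for the example: the `orders` table,
its `id` and `amount` columns, and the `run_query(sql, params)` callable,
which is a hypothetical helper you would write yourself to open its own
connection, execute the statement, and return the single value. Each
worker needs its own connection, because one backend only runs one query
at a time.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(lo, hi, parts):
    """Split the half-open key range [lo, hi) into contiguous chunks."""
    step = max(1, -(-(hi - lo) // parts))  # ceiling division, at least 1
    return [(start, min(start + step, hi)) for start in range(lo, hi, step)]


def parallel_sum(run_query, lo, hi, parts=4):
    """Sum a column by running one partial query per chunk in parallel.

    run_query is a hypothetical caller-supplied function taking
    (sql, params) and returning one number; it should use its own
    database connection on every call.
    """
    # Hypothetical table and column names, for illustration only.
    sql = ("SELECT coalesce(sum(amount), 0) FROM orders "
           "WHERE id >= %s AND id < %s")
    chunks = split_range(lo, hi, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = pool.map(lambda chunk: run_query(sql, chunk), chunks)
        # Integrate the partial results together client-side.
        return sum(partials)
```

Keep in mind that each partial query runs in its own transaction, so on a
busy transactional system the chunks may not all see the same snapshot of
the data.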

There are some tools that can help with this. For example, I think
PgPool-II has some parallelisation features, though I've never used
them. Discussion I've seen on this list suggests that many people handle
it in their code directly.

Note that Pg is *very* good at concurrently running many queries, with
features like synchronized scans. The whole DB is written around fast
concurrent execution of queries, and it'll happily use every CPU and I/O
resource you have. However, individual queries cannot use multiple CPUs
or I/O "threads"; you need many queries running in parallel to use the
hardware's resources fully.


As far as I know the only native in-query parallelisation Pg offers is
via effective_io_concurrency, and currently that only affects bitmap
heap scans:

     http://archives.postgresql.org/pgsql-general/2009-10/msg00671.php

... not seqscans or other access methods.

Each query is executed by a single process running a single thread, so
there's no CPU parallelism except where the compiler can introduce some
behind the scenes - which isn't much. I/O isn't parallelised either: not
across invocations of nested loops, not by splitting seqscans up into
chunks, and so on.

There are some upsides to this limitation, though:

- The Pg code is easier to understand, maintain, and fix

- It's easier to add features

- It's easier to get right, so it's less buggy and more
   reliable.


As the world goes more and more parallel, Pg is likely to follow at some
point, but it's going to be a mammoth job. I don't see anyone
volunteering the many months of their free time required, there's nobody
being funded to work on it, and I don't see any of the commercial Pg
forks that've added parallel features trying to merge their work back
into mainline.

If you have a commercial need, perhaps you can find time to fund work on
something that'd help you out, like honouring effective_io_concurrency
for sequential scans?

--
Craig Ringer

