Thread: Parallel Query should be a top priority

Parallel Query should be a top priority

From
Mike Mascari
Date:
PostgreSQL has made substantial progress over the years and is
approaching enterprise-quality feature sets. However, one of the major
stopping points for enterprise deployment is lack of parallel query
support. DB2, Oracle, even SQL Server Enterprise Edition all have
parallel query support. A recent Dr. Dobb's Journal article documented how
Moore's Law is no longer applicable - processor speeds are increasing at
a decreasing rate. Large boxes, such as Sun Enterprise 25000s, have
large quantities of UltraSPARC III and IV processors, whose SPECint
ratings are inferior to those of less-scalable Opterons (as an example).
In these configurations, Oracle performs exceedingly well, because it has
multiple paths for parallelization of a single query:

http://www.pafumi.net/parallel_query.htm

Without parallel query, the *only* way to decrease the execution time of
a single query whose data has been fully cached is to buy the
latest-and-greatest which is increasing in speed at decreasing rates,
rather than scaling up the number of processors in a single box. A speed
barrier to PostgreSQL's ability to execute a single query is fast
approaching.

I love PostgreSQL, and with tablespaces, PITR, nested transactions, and
more PLs than one knows what to do with, it's my favorite database from a
usability standpoint.  But in terms of performance, the one missing
piece to the performance puzzle is parallel query.

"Consider parallel processing a single query" should be moved out from
under Miscellaneous on the TODO list and re-categorized as the formerly
existent URGENT feature...

Mike Mascari


Re: Parallel Query should be a top priority

From
"Qingqing Zhou"
Date:
"Mike Mascari" <mascarm@mascari.com> writes
> "Consider parallel processing a single query" should be moved out from
> under Miscellaneous on the TODO list and re-categorized as the formerly
> existent URGENT feature...
>

Yes, inter-/intra-operation parallelism in PQO could be an obvious winner in
some situations. For example, on a fairly idle multi-processor machine,
running a complex query in parallel could yield a several-fold speedup.

However, in other cases, such as on a busy machine, the inter-query
parallel mechanism is already good enough, especially for a system like PG
that uses MVCC concurrency control. I remember someone was working on a
parallel scan idea for PG; I believe that is more realistic and beneficial
than implementing the full set of PQO.

Regards,
Qingqing





Re: Parallel Query should be a top priority

From
Bruno Wolff III
Date:
On Sun, Mar 27, 2005 at 23:58:35 -0500,
  Mike Mascari <mascarm@mascari.com> wrote:
>
> Without parallel query, the *only* way to decrease the execution time of
> a single query whose data has been fully cached is to buy the
> latest-and-greatest which is increasing in speed at decreasing rates,
> rather than scaling up the number of processors in a single box. A speed
> barrier to PostgreSQL's ability to execute a single query is fast
> approaching.

I think that is a bit extreme. For some queries you will be able to
parallelize across multiple back ends and realize some speedup.
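
As a rough illustration of parallelizing across multiple back ends from the
client side, the sketch below splits an id range into pieces, computes a
partial aggregate for each piece in its own worker, and combines the results.
Everything in it is an assumption made for illustration: the in-memory TABLE
stands in for a real table, and partial_aggregate, split_range and
parallel_avg are hypothetical helper names, not part of PostgreSQL or any
driver. In a real setup each worker would open its own connection and send
the range-restricted query shown in the docstring.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a table of (id, value) rows. In a real setup each worker
# would hold its own database connection (its own back end) instead of
# reading this shared list. Hypothetical data, for illustration only.
TABLE = [(i, i % 7) for i in range(1, 1001)]


def partial_aggregate(id_range):
    """Hypothetical per-back-end query, roughly:
    SELECT sum(value), count(*) FROM t WHERE id >= lo AND id < hi
    """
    lo, hi = id_range
    values = [v for (i, v) in TABLE if lo <= i < hi]
    return sum(values), len(values)


def split_range(lo, hi, parts):
    """Split [lo, hi) into at most `parts` contiguous, non-overlapping ranges."""
    if hi <= lo:
        return []
    step = -(-(hi - lo) // parts)  # ceiling division
    return [(s, min(s + step, hi)) for s in range(lo, hi, step)]


def parallel_avg(lo, hi, workers=4):
    """Run one partial aggregate per worker and combine them into an average."""
    ranges = split_range(lo, hi, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_aggregate, ranges))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


if __name__ == "__main__":
    print(parallel_avg(1, 1001))
```

This only works this simply for aggregates that decompose into partial
results (sum, count, min, max). Each connection would also run in its own
transaction, so the pieces may not all see the same snapshot of the data.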

I would think that this argument could also apply to cases where
the data is on several sets of disks and you wanted to read from
all of them at once rather than serially.

> I love PostgreSQL, and with tablespaces, PITR, nested transactions, and
> more PLs than one knows what to do with, it's my favorite database from a
> usability standpoint.  But in terms of performance, the one missing
> piece to the performance puzzle is parallel query.
>
> "Consider parallel processing a single query" should be moved out from
> under Miscellaneous on the TODO list and re-categorized as the formerly
> existent URGENT feature...

I think there are other things that could be done to improve optimization
that will benefit more people than parallelized queries. Those are really only
useful where the database is being used by a handful (fewer than
the number of processors and/or disk channels) of concurrent users who
are running long-running queries and waiting for the results.

Re: Parallel Query should be a top priority

From
Oleg Bartunov
Date:
Interestingly, Stonebraker talked about parallel query processing
in his interview:
http://searchenterpriselinux.techtarget.com/qna/0,289202,sid39_gci1025832,00.html

Putting aside Larry Ellison, would you say, anything should have been done differently?
Stonebraker: We made a couple of significant mistakes. The one I most would
like to have back was Informix made a nice run in the early 1990s selling
parallel query processing and they were really fast and routinely beat Oracle
in performance bakeoffs. Informix horizontally partitioned databases and
spread them out over different processors and used multiple processors on a
single query very efficiently. That was technology that Ingres started
developing in 1987 and then Ingres decided to cancel that initiative,
so that's one I'd like to have back. Another initiative that failed was that
Ingres put a fair amount of money into writing a distributed database system
and there just wasn't much of a market for distributed databases.
I would have killed that one and kept alive the parallel query processing
effort. Ultimately Informix got squashed by Oracle anyway.
It's not clear this would have made a whole lot of difference in the outcome.


On Mon, 28 Mar 2005, Bruno Wolff III wrote:

> On Sun, Mar 27, 2005 at 23:58:35 -0500,
>  Mike Mascari <mascarm@mascari.com> wrote:
>>
>> Without parallel query, the *only* way to decrease the execution time of
>> a single query whose data has been fully cached is to buy the
>> latest-and-greatest which is increasing in speed at decreasing rates,
>> rather than scaling up the number of processors in a single box. A speed
>> barrier to PostgreSQL's ability to execute a single query is fast
>> approaching.
>
> I think that is a bit extreme. For some queries you will be able to
> parallelize across multiple back ends and realize some speedup.
>
> I would think that this argument could also apply to cases where
> the data is on several sets of disks and you wanted to read from
> all of them at once rather than serially.
>
>> I love PostgreSQL, and with tablespaces, PITR, nested transactions, and
>> more PLs than one knows what to do with, it's my favorite database from a
>> usability standpoint.  But in terms of performance, the one missing
>> piece to the performance puzzle is parallel query.
>>
>> "Consider parallel processing a single query" should be moved out from
>> under Miscellaneous on the TODO list and re-categorized as the formerly
>> existent URGENT feature...
>
> I think there are other things that could be done to improve optimization
> that will benefit more people than parallelized queries. Those are really only
> useful where the database is being used by a handful (fewer than
> the number of processors and/or disk channels) of concurrent users who
> are running long-running queries and waiting for the results.

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83