Re: why postgresql over other RDBMS - Mailing list pgsql-general

From Stefan Kaltenbrunner
Subject Re: why postgresql over other RDBMS
Date
Msg-id 46580CF6.1000909@kaltenbrunner.cc
Whole thread Raw
In response to Re: why postgresql over other RDBMS  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: why postgresql over other RDBMS
List pgsql-general
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Erik Jones wrote:
>>> And, to finish up, is there any reason that pg_restore couldn't
>>> already work with separate processes working in parallel?
>
>> The problem is that the ordering of objects in the dump is the only
>> thing that makes the dump consistent with regards to the dependencies of
>> objects.  So pg_restore cannot make any assumptions of parallelisability
>> of the restoration process of objects in the dump.
>
> That's true at the level of DDL operations, but AFAIK we could
> parallelize table-loading and index-creation steps pretty effectively
> --- and that's where all the time goes.

yes loading the data and creating the index is the most time consuming
part of a large dump and reload cycle.

>
> A more interesting question is what sort of hardware you need for that
> actually to be a win, though.  Loading a few tables in parallel sounds
> like an ideal recipe for oversaturating your disk bandwidth...

you don't actually need that much of disk bandwidth both COPY and CREATE
INDEX are CPU bottlenecked on modern boxes and reasonable disk
subsystems - spreading their work over multiple cores/processes can give
big benefits.
For example I have managed to load ~2B rows (5 integer columns - no
indexes) at a rate of about 320000 rows/s on a modern(but already 1,5
years old) 4 core 2.6Ghz Opteron box (with 12 disks and BBWC iirc).
Using 4 concurrent processes to load the data resulted in about 930000
rows/s loaded (hitting the disk-io limit at that rate).
So having the ability to parallelize those operations at both the dump
and the restore level would be a huge win.
A manual experiment I did a while back with doing that by hand (ie.
splitting the dump manually and feeding it in parallel with a
concurrency of 2)  on a copy of a production database brought down the
restore time from 3h+ to a bit less than 2 hours.


Stefan

pgsql-general by date:

Previous
From: Ron Johnson
Date:
Subject: Re: why postgresql over other RDBMS
Next
From: Gerhard Wiesinger
Date:
Subject: Re: Tools for dumping pg_xlog, pg_clog, etc?