Re: parallel pg_restore - Mailing list pgsql-hackers

From: Andrew Dunstan
Subject: Re: parallel pg_restore
Msg-id: 48DA6153.7040003@dunslane.net
In response to: Re: parallel pg_restore (Dimitri Fontaine <dfontaine@hi-media.com>)
Responses: Re: parallel pg_restore (Dimitri Fontaine <dfontaine@hi-media.com>)
List: pgsql-hackers

Dimitri Fontaine wrote:
> Hi,
>
> On Tuesday 23 September 2008, Andrew Dunstan wrote:
>   
>> In any case, my agenda goes something like this:
>>
>>     * get it working with a basic selection algorithm on Unix (nearly
>>       done - keep your eyes open for a patch soon)
>>     * start testing
>>     * get it working on Windows
>>     * improve the selection algorithm
>>     * harden code
>>     
>
> I'm not sure whether your work will feature single-table restore splitting, 
> but if it does, you could consider having a look at what I've done in 
> pgloader. The parallel loading work there was requested by Simon Riggs and 
> Greg Smith, and it lets you test two different parallel algorithms.
> The aim was to have a "simple" testbed allowing PostgreSQL hackers to choose 
> what to implement in pg_restore, so I still hope it'll prove useful someday :)

No. What I'm proposing performs exactly the same set of steps as a 
single-threaded pg_restore, just in parallel. The individual steps won't 
be broken up.
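
Concretely, the intended usage is something like the sketch below (the
-j/--jobs spelling and the database names are illustrative, not necessarily
what the patch will use):

# Dump to a custom-format archive, then restore it with several workers,
# each running ordinary restore steps (COPY, CREATE INDEX, ...) in parallel.
pg_dump -Fc -f mydb.dump mydb
pg_restore -j 4 -d mydb_restored mydb.dump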

Quite apart from anything else, parallel data loading of individual 
tables would defeat clustering, and it would make it impossible to avoid 
WAL-logging the load (something I have made provision for).
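
For illustration, the server behaviour that provision relies on looks roughly
like this (the table name and file path are made up, and it assumes WAL
archiving is off, so a COPY into a table created or truncated in the same
transaction need not be WAL-logged):

psql -d mydb <<'SQL'
BEGIN;
TRUNCATE big_table;                        -- or CREATE TABLE in this transaction
COPY big_table FROM '/tmp/big_table.dat';  -- server-side file, needs superuser
COMMIT;
SQL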

The fact that custom archives are compressed by default would in fact 
make parallel loading of individual tables' data difficult with the 
present format. We'd have to do something like expanding it on the 
client (which might not even have enough disk space) and then splitting 
it before loading it to the server. That's pretty yucky. Alternatively, 
each loader thread would have to decompress the data member from the 
start and throw data away until it reached the point it wanted to 
restore from. Also pretty yucky.
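
To make that second option concrete - purely as an analogy with made-up file
names, since the archive's data members are zlib streams rather than
standalone gzip files - a compressed stream offers no random access, so a
worker wanting, say, the second million rows still pays for decompressing the
first million:

# No seeking in a compressed stream: decompress everything and discard the
# part before the chunk this worker is responsible for.
zcat big_table.dat.gz | sed -n '1000001,2000000p' \
    | psql -d mydb -c "COPY big_table FROM stdin"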

Far better would be to provide for multiple data members in the archive 
and teach pg_dump to split large tables as it writes the archive. Then 
pg_restore would need comparatively little adjustment.

Also, of course, you can split tables yourself by partitioning them. 
That would buy you parallel data load with what I am doing now, with no 
extra work.
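
For example, with the inheritance-style partitioning we have today (table and
column names invented for the sketch), each child table gets its own data
entry in the archive and can therefore be loaded by a separate worker:

psql -d mydb <<'SQL'
-- Split one big table into two children; pg_dump emits a separate data
-- entry per child, so a parallel restore can load them concurrently.
CREATE TABLE measurements (id bigint, logdate date, reading numeric);
CREATE TABLE measurements_2008h1 (
    CHECK (logdate >= '2008-01-01' AND logdate < '2008-07-01')
) INHERITS (measurements);
CREATE TABLE measurements_2008h2 (
    CHECK (logdate >= '2008-07-01' AND logdate < '2009-01-01')
) INHERITS (measurements);
SQL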

In any case, data loading is very far from being the only problem. One 
of my clients has long-running restores where the data load takes only 
about 20% of the time - the rest is index creation and the like. No 
amount of table splitting will make a huge difference to them, but 
parallel processing will. On the other hand, if your problem is loading 
one huge table, this won't help you much. However, that is not a pattern 
I see often - most of my clients seem to have several large tables plus a 
boatload of indexes. They will benefit a lot.
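
As a back-of-the-envelope illustration (assuming, hypothetically, four workers
and that the non-COPY 80% parallelizes cleanly), even if the data load stayed
completely serial the overall runtime drops to well under half:

# 0.20 (serial data load) + 0.80 / 4 (parallelized index builds, etc.)
echo "scale=2; 0.20 + 0.80 / 4" | bc    # prints .40 of the original time, ~2.5x faster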

cheers

andrew

