Re: pg_dump additional options for performance - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: pg_dump additional options for performance
Date
Msg-id 1204049601.4252.399.camel@ebony.site
In response to Re: pg_dump additional options for performance  (Greg Smith <gsmith@gregsmith.com>)
List pgsql-hackers
On Tue, 2008-02-26 at 12:27 -0500, Greg Smith wrote:
> On Tue, 26 Feb 2008, Simon Riggs wrote:
> 
> > Splitting up the dump is the enabler for splitting up the load.
> 
> While the pg_dump split train seems to be leaving the station, I feel 
> compelled to point out that focus does nothing to help people who are 
> bulk-loading data that came from somewhere else.  If my data is already in 
> PostgreSQL, and I'm doing a dump/load, I can usually split the data easily 
> enough with existing tools to handle that right now via COPY (SELECT...) 
> TO.  Some tools within pg_dump would be nice, but I don't need them that 
> much.  It's gigantic files that came from some other DB I don't even have 
> access to that I struggle with loading efficiently.

Don't write them as gigantic files... :-)
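If you really do get handed one gigantic file, even something as crude
as split(1) will carve it into independently loadable chunks. A sketch
only; the filename is made up and it assumes no quoted newlines in the
data:

    # carve one huge export into 1M-line pieces: chunk_aa, chunk_ab, ...
    split -l 1000000 huge_export.csv chunk_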

> The work Dimitri is doing is wandering in that direction and that may be 
> enough.  I note that something that addresses loading big files regardless 
> of source could also work on PostgreSQL dumps, while a pg_dump focused 
> effort helps nothing but that specific workflow.  I wonder if doing too 
> much work on the pg_dump path is the best use of someone's time when the 
> more general case will need to be addressed one day anyway.

I take your points. I've spent as much time discussing this as coding
it, which is OK because coding the wrong thing is also a waste of time.

If we have individual load files then we will be able to use a parallel
load utility on them, whether they come from pg_dump or otherwise.
Neither of the current pg_dump output formats gives us that on its own:
--data-only gets the data out, but there's no way to dump a sensible
schema to go with it, which is what this thread is about.

Dimitri is working on a parallel loader that allows error handling. I
don't think it's worth duplicating that effort by attempting to make COPY
work in parallel. Even if we did, the single client could easily become
a bottleneck. So an external utility that spawns multiple processes and
runs multiple concurrent COPY statements sounds quite good to me.
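In its crudest form that utility is only a few lines of shell (a sketch
only, with no error handling, which is precisely what Dimitri's loader
adds):

    # one psql per data file, each running its own COPY, all concurrent
    for f in /dump/*.dat; do
        t=$(basename "$f" .dat)
        psql -d mydb -c "\copy $t FROM '$f'" &
    done
    wait   # block until every load finishes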

He's asked for help, if anybody can assist.

-- 
 Simon Riggs
 2ndQuadrant  http://www.2ndQuadrant.com


