Re: pg_dump additional options for performance - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: pg_dump additional options for performance
Date
Msg-id 1204024976.4252.241.camel@ebony.site
In response to Re: pg_dump additional options for performance  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: pg_dump additional options for performance  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Tue, 2008-02-19 at 14:18 -0800, Jeff Davis wrote:
> On Wed, 2008-02-06 at 15:13 +0000, Simon Riggs wrote:
> > The -s option creates the tables, as well as the constraints and
> > indexes. These objects need to be dropped prior to loading if we are to
> > follow the performance recommendations in the docs. But the only way to
> > do that is to manually edit the script to produce a cut-down script.
> > 
> > So it would be good if we could dump objects in 3 groups
> > 1. all commands required to re-create table
> > 2. data
> > 3. all commands required to complete table after data load
> > 
> > My proposal is to provide two additional modes:
> > --schema-pre-load corresponding to (1) above
> > --schema-post-load corresponding to (3) above
> 
> Another thought:
> 
> We could also break step #3 into two steps: those objects required for
> correctness (e.g. unique indexes, other constraints, etc); and those
> objects that are merely for performance (i.e. non-constraining indexes).
> Let's call these steps #3A and #3B.
> 
> After reading the thread here:
> 
> http://archives.postgresql.org/pgsql-performance/2008-02/msg00211.php
> 
> it's clear that building indexes can take a long time, and they aren't
> strictly required for correct database operation. Often, it's perfectly
> reasonable to operate the database without a few of the indexes, so long
> as they don't imply a constraint.
> 
> Step #3B could be done with "CONCURRENTLY" to allow uninterrupted
> operation.
> 
> We could even allow pg_restore to run step #3B in multiple sessions,
> building different indexes concurrently to make use of multiple
> processor cores.
> 
> This is just a thought. I can see that it's getting fairly complex, and
> it may be better to just leave these things up to the DBA.
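
To make #3B concrete: a non-constraining index could be rebuilt after
the data load without blocking normal operation. A minimal sketch
(table and index names invented here):

  CREATE INDEX CONCURRENTLY orders_customer_idx ON orders (customer_id);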

My thinking is to do one of the following:

* keep it as simple as possible, allowing the DBA to manually improve
performance

* express dependency information in the pg_dump output, so that some
level of parallelism can take advantage of that information
automatically

Doing only the second seems like a lot of work, and it would prevent a
DBA from being able to code things a particular way for their
installation. It seems possible to get good benefit from the first
approach now and still add the second part later.
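
As a sketch of the first approach: with the proposed switches (they do
not exist in pg_dump today; database names invented), a restore could
be split like this:

  pg_dump --schema-pre-load  mydb > pre.sql    # bare table definitions
  pg_dump --data-only        mydb > data.sql   # COPY data only
  pg_dump --schema-post-load mydb > post.sql   # indexes, constraints

  psql -d newdb -f pre.sql
  psql -d newdb -f data.sql
  psql -d newdb -f post.sql    # DBA can edit/split this file as needed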

If we can work out some notation or grammar to put into the script
output, then we might imagine a concurrent psql that spawns multiple
sessions and executes the script in parallel. With the correct grammar,
all we would need is an additional psql option, --parallel=N, and psql
would work its way through the script, keeping N sessions busy during
the reload.
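
The notation is still to be invented; purely for illustration, the
post-load script might carry markers like these (the "--@ group" syntax
and the --parallel option are both hypothetical):

  --@ group 1 : independent, may run concurrently
  CREATE INDEX i_t_a ON t (a);
  --@ group 1
  CREATE INDEX i_t_b ON t (b);
  --@ group 2 : waits for group 1 to finish
  ALTER TABLE t ADD PRIMARY KEY (id);

A parallel-aware psql would then be invoked as, say:

  psql --parallel=4 -d newdb -f post.sql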

-- 
 Simon Riggs
 2ndQuadrant  http://www.2ndQuadrant.com


