Re: pg_dump additional options for performance - Mailing list pgsql-patches

From daveg
Subject Re: pg_dump additional options for performance
Date
Msg-id 20080727013351.GB30335@sonic.net
Whole thread Raw
In response to Re: pg_dump additional options for performance  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-patches
On Sat, Jul 26, 2008 at 01:56:14PM -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > I want to dump tables separately for performance reasons. There are
> > documented tests showing 100% gains using this method. There is no gain
> > adding this to pg_restore. There is a gain to be had - parallelising
> > index creation, but this patch doesn't provide parallelisation.
>
> Right, but the parallelization is going to happen sometime, and it is
> going to happen in the context of pg_restore.  So I think it's pretty
> silly to argue that no one will ever want this feature to work in
> pg_restore.
>
> To extend the example I just gave to Stephen, I think a fairly probable
> scenario is where you only need to tweak some "before" object
> definitions, and then you could do
>
> pg_restore --schema-before-data whole.dump >before.sql
> edit before.sql
> psql -f before.sql target_db
> pg_restore --data-only --schema-after-data -d target_db whole.dump
>
> which (given a parallelizing pg_restore) would do all the time-consuming
> steps in a fully parallelized fashion.

A few thoughts about pg_restore performance:

To take advantage of non-logged copy, the create and load should be in
the same transaction.

To take advantage of file and buffer cache, it would be be good to do
indexes immediately after table data. Many tables will be small enough
to fit in cache and this will avoid re-reading them for index builds. This
effect becomes stronger with more indexes on one table. There may also be
some filesytem placement benefit to building the indexes for a table
immediately after loading the data.

The buffer fan file cache advantage also applies to constraint creation,
but this is complicated by the need for indexes and data in the referenced
tables.

It seems that a high performance restore will want to proced in a different
order than the current sort order or that proposed by the before/data/after
patch.

 - The simplest unit of work for parallelism may be the table and its
   "decorations", eg indexes and relational constraints.

 - Sort tables by foreign key dependency so that referenced tables are
   loaded before referencing tables.

 - Do table creation and data load together in one transaction to use
   non-logged copy. Index builds, and constraint creation should follow
   immediately, either as part of the same transaction, or possibly
   parallelized themselves.

Table creation, data load, index builds, and constraint creation could
be packaged up as the unit of work to be done in a subprocess which either
completes or fails as a unit. The worker process would be called with
connection info, a file pointer to the data, and the DDL for the table.
pg_restore would keep a work queue of tables to be restored in FK dependency
order and also do the other schema operations such as functions and types.

-dg

--
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.

pgsql-patches by date:

Previous
From: "Joshua D. Drake"
Date:
Subject: Re: pg_dump additional options for performance
Next
From: "Joshua D. Drake"
Date:
Subject: Re: pg_dump additional options for performance