Thread: Any work on better parallelization of pg_dump?

Any work on better parallelization of pg_dump?

From: hubert depesz lubaczewski
Date: Mon, 29 Aug 2016 13:38:03 +0200
Hi,
we have a rather uncommon case - a DB with ~ 50GB of data, but this is
spread across ~ 80000 tables.

Running pg_dump -Fd -jxx dumps in parallel, but only the data, and MOST of
the time is spent on queries that run sequentially and, as far as I can
tell, fetch the schema of tables and sequence values.
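
For reference, the invocation looks roughly like this (the job count and
the paths are just examples):

pg_dump -Fd -j 8 -f /backups/mydb.dir mydb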

This happens on Pg 9.5. Are there any plans to make getting the schema
faster for such cases? Either by parallelization, or at least by getting
the schema for all tables "at once" and having pg_dump "sort it out",
instead of getting the schema for each table separately?

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/


Re: Any work on better parallelization of pg_dump?

From: Jehan-Guillaume de Rorthais
Date:
On Mon, 29 Aug 2016 13:38:03 +0200
hubert depesz lubaczewski <depesz@depesz.com> wrote:

> Hi,
> we have a rather uncommon case - a DB with ~ 50GB of data, but this is
> spread across ~ 80000 tables.
>
> Running pg_dump -Fd -jxx dumps in parallel, but only the data, and MOST of
> the time is spent on queries that run sequentially and, as far as I can
> tell, fetch the schema of tables and sequence values.
>
> This happens on Pg 9.5. Are there any plans to make getting the schema
> faster for such cases? Either by parallelization, or at least by getting
> the schema for all tables "at once" and having pg_dump "sort it out",
> instead of getting the schema for each table separately?

Another issue I found in the current implementation is how pg_restore deals
with PKs. As pg_restore takes an exclusive lock on the table, the PK is
created alone, before the index creation step.

Splitting the PK creation into a unique index creation followed by the
constraint creation might save a lot of time, as other indexes could be
built while the PK's index is being built.
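
For instance, a rough sketch (the table, column, and index names are made
up for the example):

-- instead of a single statement that holds the exclusive lock for the
-- whole index build:
--   alter table foo add constraint foo_pkey primary key (id);
-- build the unique index first, while other indexes are being built in
-- other sessions:
create unique index foo_pkey_idx on foo (id);
-- then promote it to a primary key, which only needs the lock briefly:
alter table foo add constraint foo_pkey primary key using index foo_pkey_idx;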

Regards,


Re: Any work on better parallelization of pg_dump?

From: Alvaro Herrera
Date: Mon, 29 Aug 2016 13:13:17 -0300
Jehan-Guillaume de Rorthais wrote:
> On Mon, 29 Aug 2016 13:38:03 +0200
> hubert depesz lubaczewski <depesz@depesz.com> wrote:
>
> > Hi,
> > we have a rather uncommon case - a DB with ~ 50GB of data, but this is
> > spread across ~ 80000 tables.
> >
> > Running pg_dump -Fd -jxx dumps in parallel, but only the data, and MOST of
> > the time is spent on queries that run sequentially and, as far as I can
> > tell, fetch the schema of tables and sequence values.
> >
> > This happens on Pg 9.5. Are there any plans to make getting the schema
> > faster for such cases? Either by parallelization, or at least by getting
> > the schema for all tables "at once" and having pg_dump "sort it out",
> > instead of getting the schema for each table separately?

Depesz: I suggest you start coding ASAP.

> Another issue I found in the current implementation is how pg_restore deals
> with PKs. As pg_restore takes an exclusive lock on the table, the PK is
> created alone, before the index creation step.
>
> Splitting the PK creation into a unique index creation followed by the
> constraint creation might save a lot of time, as other indexes could be
> built while the PK's index is being built.

Yeah.  I recall there being some stupid limitation in ALTER TABLE .. ADD
CONSTRAINT USING INDEX to create a primary key from a previously
existing unique index, which would be very good to fix (I don't recall
what it was, but it was something infuriatingly silly).  I suggest you
start coding that ASAP.

(Two new contributors to pg_dump!  Yay!)

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Any work on better parallelization of pg_dump?

From: hubert depesz lubaczewski
Date:
On Mon, Aug 29, 2016 at 01:13:17PM -0300, Alvaro Herrera wrote:
> > > This happens on Pg 9.5. Are there any plans to make getting the schema
> > > faster for such cases? Either by parallelization, or at least by getting
> > > the schema for all tables "at once" and having pg_dump "sort it out",
> > > instead of getting the schema for each table separately?
> Depesz: I suggest you start coding ASAP.

If only I knew C :(

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/


Re: Any work on better parallelization of pg_dump?

From: Jehan-Guillaume de Rorthais
Date:
On Mon, 29 Aug 2016 13:13:17 -0300
Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> Jehan-Guillaume de Rorthais wrote:
> > On Mon, 29 Aug 2016 13:38:03 +0200
> > hubert depesz lubaczewski <depesz@depesz.com> wrote:
> >
> > > Hi,
> > > we have a rather uncommon case - a DB with ~ 50GB of data, but this is
> > > spread across ~ 80000 tables.
> > >
> > > Running pg_dump -Fd -jxx dumps in parallel, but only the data, and MOST of
> > > the time is spent on queries that run sequentially and, as far as I can
> > > tell, fetch the schema of tables and sequence values.
> > >
> > > This happens on Pg 9.5. Are there any plans to make getting the schema
> > > faster for such cases? Either by parallelization, or at least by getting
> > > the schema for all tables "at once" and having pg_dump "sort it out",
> > > instead of getting the schema for each table separately?
>
> Depesz: I suggest you start coding ASAP.
>
> > Another issue I found in the current implementation is how pg_restore deals
> > with PKs. As pg_restore takes an exclusive lock on the table, the PK is
> > created alone, before the index creation step.
> >
> > Splitting the PK creation into a unique index creation followed by the
> > constraint creation might save a lot of time, as other indexes could be
> > built while the PK's index is being built.
>
> Yeah.  I recall there being some stupid limitation in ALTER TABLE .. ADD
> CONSTRAINT USING INDEX to create a primary key from a previously
> existing unique index, which would be very good to fix (I don't recall
> what it was, but it was something infuriatingly silly).

Could you elaborate? I already had to implement some custom scripts to
restore some tables using this method. The scripts used psql and
"xargs -P" to restore the indexes and the PK outside of pg_restore.

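Something along these lines (a simplified sketch; the file names and the
database name are made up):

# one CREATE INDEX statement per file, eight builds at a time
ls idx_*.sql | xargs -n 1 -P 8 psql -d mydb -f
# then build the PK's unique index and promote it to the constraint
psql -d mydb -f pk.sql
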
> I suggest you start coding that ASAP.

I did start, but with no success so far, as the code in pg_dump/pg_restore is
quite obscure at first look. The few attempts I made to catch the PK creation
statement and split it into two distinct statements failed to be processed in
parallel, IIRC. I probably dropped the patch in the meantime, though.


Re: Any work on better parallelization of pg_dump?

From: Alvaro Herrera
Date:
Jehan-Guillaume de Rorthais wrote:

> > Yeah.  I recall there being some stupid limitation in ALTER TABLE .. ADD
> > CONSTRAINT USING INDEX to create a primary key from a previously
> > existing unique index, which would be very good to fix (I don't recall
> > what it was, but it was something infuriatingly silly).
>
> Could you elaborate? I already had to implement some custom scripts to
> restore some tables using this method. The scripts used psql and
> "xargs -P" to restore the indexes and the PK outside of pg_restore.

Ahh, never mind.  What I remembered was that if you have a UNIQUE
constraint, you cannot update its index to be a primary key, and you
cannot remove the associated constraint without dropping the index; you
have to make a new unique index instead.  If you have a plain UNIQUE
index, it works fine.  In other words,

-- this fails: the unique index is already associated with a constraint
create table ioguix (a int unique);
alter table ioguix add primary key using index ioguix_a_key;

-- this works
create unique index ioguix_2_idx on ioguix (a);
alter table ioguix add primary key using index ioguix_2_idx;

> > I suggest you start coding that ASAP.
>
> I did start, but with no success so far, as the code in pg_dump/pg_restore is
> quite obscure at first look. The few attempts I made to catch the PK creation
> statement and split it into two distinct statements failed to be processed in
> parallel, IIRC. I probably dropped the patch in the meantime, though.

pg_dump is *really* obscure, and its parallel processing even more so.
That just says that we need more people to try and figure it out!

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services