Thread: Any work on better parallelization of pg_dump?
Hi,
we have a rather uncommon case - a DB with ~ 50GB of data, but spread
across ~ 80000 tables.

Running pg_dump -Fd -jxx dumps the data in parallel, but only the data;
MOST of the time is spent on queries that run sequentially and that, as
far as I can tell, fetch the schema of tables and sequence values.

This happens on Pg 9.5. Are there any plans to make getting the schema
faster for such cases? Either by parallelization, or at least by getting
the schema for all tables "at once" and having pg_dump "sort it out",
instead of getting the schema for each table separately?

Best regards,

depesz

-- 
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/
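As a rough illustration of the "at once" idea for the sequence part (this is not what pg_dump actually does; the DO block below is just a sketch, and the RAISE NOTICE stands in for whatever a client would do with the values): on 9.5 every sequence's current value lives in its own one-row relation, so there is no single catalog query that returns all of them, but the per-sequence reads can at least be pushed into one server-side round trip, e.g.

DO $$
DECLARE
    seq regclass;
    val bigint;
BEGIN
    -- relkind = 'S' selects sequences; one EXECUTE per sequence, but all
    -- of it happens inside a single statement sent by the client
    FOR seq IN SELECT oid::regclass FROM pg_class WHERE relkind = 'S' LOOP
        EXECUTE format('SELECT last_value FROM %s', seq) INTO val;
        RAISE NOTICE '%: %', seq, val;
    END LOOP;
END $$;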
On Mon, 29 Aug 2016 13:38:03 +0200
hubert depesz lubaczewski <depesz@depesz.com> wrote:

> Hi,
> we have a rather uncommon case - a DB with ~ 50GB of data, but spread
> across ~ 80000 tables.
>
> Running pg_dump -Fd -jxx dumps the data in parallel, but only the data;
> MOST of the time is spent on queries that run sequentially and that, as
> far as I can tell, fetch the schema of tables and sequence values.
>
> This happens on Pg 9.5. Are there any plans to make getting the schema
> faster for such cases? Either by parallelization, or at least by getting
> the schema for all tables "at once" and having pg_dump "sort it out",
> instead of getting the schema for each table separately?

Another issue I found in the current implementation is how pg_restore
deals with the PK. As it takes an exclusive lock on the table, it is
executed alone, before index creation.

Splitting the PK into a unique index creation followed by the constraint
creation might save a lot of time, as other indexes could then be built
while the PK's index is being built.

Regards,
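For concreteness, a minimal sketch of the split being suggested here (table, column, and constraint names are placeholders, not from the thread, and it assumes the column is already NOT NULL):

-- build the PK's index as an ordinary unique index; this runs under a
-- SHARE lock, so other CREATE INDEX jobs can proceed in parallel sessions
CREATE UNIQUE INDEX t_id_uidx ON t (id);

-- then adopt that index as the primary key; this takes the exclusive
-- lock, but only for the catalog change (the index is renamed to the
-- constraint name), since the expensive build has already happened
ALTER TABLE t ADD CONSTRAINT t_pkey PRIMARY KEY USING INDEX t_id_uidx;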
Jehan-Guillaume de Rorthais wrote:
> On Mon, 29 Aug 2016 13:38:03 +0200
> hubert depesz lubaczewski <depesz@depesz.com> wrote:
>
> > Hi,
> > we have a rather uncommon case - a DB with ~ 50GB of data, but spread
> > across ~ 80000 tables.
> >
> > Running pg_dump -Fd -jxx dumps the data in parallel, but only the data;
> > MOST of the time is spent on queries that run sequentially and that, as
> > far as I can tell, fetch the schema of tables and sequence values.
> >
> > This happens on Pg 9.5. Are there any plans to make getting the schema
> > faster for such cases? Either by parallelization, or at least by getting
> > the schema for all tables "at once" and having pg_dump "sort it out",
> > instead of getting the schema for each table separately?

Depesz: I suggest you start coding ASAP.

> Another issue I found in the current implementation is how pg_restore
> deals with the PK. As it takes an exclusive lock on the table, it is
> executed alone, before index creation.
>
> Splitting the PK into a unique index creation followed by the constraint
> creation might save a lot of time, as other indexes could then be built
> while the PK's index is being built.

Yeah.  I recall there being some stupid limitation in ALTER TABLE .. ADD
CONSTRAINT USING INDEX to create a primary key from a previously
existing unique index, which would be very good to fix (I don't recall
what it was, but it was something infuriatingly silly).

I suggest you start coding that ASAP.

(Two new contributors to pg_dump!  Yay!)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Aug 29, 2016 at 01:13:17PM -0300, Alvaro Herrera wrote:
> > > This happens on Pg 9.5. Are there any plans to make getting the schema
> > > faster for such cases? Either by parallelization, or at least by getting
> > > the schema for all tables "at once" and having pg_dump "sort it out",
> > > instead of getting the schema for each table separately?
> Depesz: I suggest you start coding ASAP.

If only I knew C :(

Best regards,

depesz

-- 
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/
On Mon, 29 Aug 2016 13:13:17 -0300
Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> Jehan-Guillaume de Rorthais wrote:
> > On Mon, 29 Aug 2016 13:38:03 +0200
> > hubert depesz lubaczewski <depesz@depesz.com> wrote:
> >
> > > Hi,
> > > we have a rather uncommon case - a DB with ~ 50GB of data, but spread
> > > across ~ 80000 tables.
> > >
> > > Running pg_dump -Fd -jxx dumps the data in parallel, but only the
> > > data; MOST of the time is spent on queries that run sequentially and
> > > that, as far as I can tell, fetch the schema of tables and sequence
> > > values.
> > >
> > > This happens on Pg 9.5. Are there any plans to make getting the schema
> > > faster for such cases? Either by parallelization, or at least by
> > > getting the schema for all tables "at once" and having pg_dump "sort
> > > it out", instead of getting the schema for each table separately?
>
> Depesz: I suggest you start coding ASAP.
>
> > Another issue I found in the current implementation is how pg_restore
> > deals with the PK. As it takes an exclusive lock on the table, it is
> > executed alone, before index creation.
> >
> > Splitting the PK into a unique index creation followed by the constraint
> > creation might save a lot of time, as other indexes could then be built
> > while the PK's index is being built.
>
> Yeah.  I recall there being some stupid limitation in ALTER TABLE .. ADD
> CONSTRAINT USING INDEX to create a primary key from a previously
> existing unique index, which would be very good to fix (I don't recall
> what it was, but it was something infuriatingly silly).

Could you elaborate? I already had to implement some custom scripts to
restore some tables using this method. The scripts used psql and
"xargs -P" to restore the indexes and the PK outside of pg_restore.

> I suggest you start coding that ASAP.

I did start, but with no success so far, as the code in pg_dump/pg_restore
is quite obscure at first look. The few attempts I made to catch the PK
creation statement and split it into two distinct statements failed to be
processed in parallel, IIRC. I probably dropped the patch in the meantime,
though.
Jehan-Guillaume de Rorthais wrote:
> > Yeah.  I recall there being some stupid limitation in ALTER TABLE .. ADD
> > CONSTRAINT USING INDEX to create a primary key from a previously
> > existing unique index, which would be very good to fix (I don't recall
> > what it was, but it was something infuriatingly silly).
>
> Could you elaborate? I already had to implement some custom scripts to
> restore some tables using this method. The scripts used psql and
> "xargs -P" to restore the indexes and the PK outside of pg_restore.

Ahh, nevermind.  What I remembered was that if you have a UNIQUE
constraint, you cannot promote its index into a primary key, and you
cannot remove the associated constraint without dropping the index; you
have to build a new unique index instead.  If you have a plain UNIQUE
index, it works fine.  In other words,

-- this fails: the unique index is already associated with a constraint
create table ioguix (a int unique);
alter table ioguix add primary key using index ioguix_a_key;

-- this works
create unique index ioguix_2_idx on ioguix (a);
alter table ioguix add primary key using index ioguix_2_idx;

> > I suggest you start coding that ASAP.
>
> I did start, but with no success so far, as the code in pg_dump/pg_restore
> is quite obscure at first look. The few attempts I made to catch the PK
> creation statement and split it into two distinct statements failed to be
> processed in parallel, IIRC. I probably dropped the patch in the meantime,
> though.

pg_dump is *really* obscure, and its parallel processing even more so.
That just says that we need more people to try and figure it out!

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services