Thread: pg_dump with 1100 schemas being a bit slow
Hello,

After several sessions searching the web for information, I am asking for the help of people with a bit more knowledge of the internals of pg_dump to solve a performance problem. I am running PostgreSQL 8.3.8 for both the server and pg_dump.

The context is a hosting farm for a web application (http://www.indefero.net) where each installation gets its own schema. Each table in a schema is a "real" table, not a view onto the same table in the public schema with a "WHERE schema='currentschema'" clause. This setup allows me to run the web application nearly unmodified between the downloadable version and the "farm" version (schemas rock!). Now you have the background.

Each night I am running:

pg_dump --blobs --schema=%s --no-acl -U postgres indefero | gzip > /path/to/backups/%s/%s-%s.sql.gz

once for each installation, so 1100 times. The substitution strings provide the timestamp and the right schema.

My problem is that the dump time has increased steadily with the number of schemas (now about 20s, up from about 12s with 850 schemas) and pg_dump now balloons to 120MB of memory usage while running. Meanwhile my average schema size is only a bit more than 1MB, a schema just after installation being 850kB.

Max size: 2.8MB
Min size: 0.85MB
Avg: 1.0MB
Total size: ~1GB

To get the sizes I run the following query and sum the "size" column:

SELECT relname, pg_total_relation_size(CAST(relname AS TEXT)) AS size
FROM pg_class AS pgc, pg_namespace AS pgn
WHERE pg_table_is_visible(pgc.oid) IS TRUE
  AND relkind = 'r'
  AND pgc.relnamespace = pgn.oid
  AND pgn.nspname NOT IN ('information_schema', 'pg_catalog');

I think that pg_dump, when looking at the objects to dump, scans the complete database even when it is limited to a given schema, in one of these calls:
http://doxygen.postgresql.org/pg__dump_8c-source.html#l00717

Is there an option along the lines of "I know what I am doing, do not look outside of the schema" that could help in my case? Plain SELECTs are snappy, vmstat shows no swapping and the complete system is running very well with a load average below 1.

Your help is very welcome,
loïc

--
Loïc d'Anterroches - Céondo Ltd - http://www.ceondo.com
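For context, the nightly job is essentially a loop of the form below. This is only a sketch: the schema-listing query and the backup paths are assumptions, while the pg_dump invocation is the one quoted above.

#!/bin/sh
# Dump every application schema separately, one pg_dump per schema.
# The schema list query and the paths are assumptions for illustration.
STAMP=$(date +%Y%m%d)
SCHEMAS=$(psql -U postgres -At -c "SELECT nspname FROM pg_namespace WHERE nspname NOT IN ('information_schema') AND nspname NOT LIKE 'pg_%'" indefero)
for SCHEMA in $SCHEMAS; do
    pg_dump --blobs --schema="$SCHEMA" --no-acl -U postgres indefero \
        | gzip > "/path/to/backups/$SCHEMA/$SCHEMA-$STAMP.sql.gz"
done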
"Loic d'Anterroches" <diaeresis@gmail.com> writes: > Each night I am running: > pg_dump --blobs --schema=%s --no-acl -U postgres indefero | gzip > > /path/to/backups/%s/%s-%s.sql.gz > this for each installation, so 1100 times. Substitution strings are to > timestamp and get the right schema. This seems like a pretty dumb way to go at it. Why don't you just do one -Fc dump for the whole database? If you ever actually need to restore a single schema, there's a pg_restore switch for that. > I think that pg_dump, when looking at the objects to dump, also it is > limited to a given schema, is scanning the complete database in one > those calls: Yes, it has to examine all database objects in order to trace dependencies properly. > Is there an option: "I know what I am doing, do not look outside of > the schema" available which can help in my case? No. regards, tom lane
On Wed, Oct 7, 2009 at 4:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Loic d'Anterroches" <diaeresis@gmail.com> writes:
>> Each night I am running:
>> pg_dump --blobs --schema=%s --no-acl -U postgres indefero | gzip > /path/to/backups/%s/%s-%s.sql.gz
>> once for each installation, so 1100 times. The substitution strings
>> provide the timestamp and the right schema.
>
> This seems like a pretty dumb way to go at it. Why don't you just do
> one -Fc dump for the whole database? If you ever actually need to
> restore a single schema, there's a pg_restore switch for that.

Thank you for your fast answer.

This is how I started to do the work, but I ran into issues because the number of tables to lock at the same time kept growing, so I had to push the max_locks_per_transaction setting up each time. The added benefit of doing a per-schema dump is that I provide it to the users directly; that way they have a full export of their data.

I cannot keep increasing max_locks_per_transaction with the growing number of schemas, can I? What is the problem with setting it to a high value, outside of the memory overhead per connection as far as I understood the docs?

>> I think that pg_dump, when looking at the objects to dump, scans the
>> complete database even when it is limited to a given schema, in one of
>> these calls:
>
> Yes, it has to examine all database objects in order to trace
> dependencies properly.
>
>> Is there an option along the lines of "I know what I am doing, do not
>> look outside of the schema" that could help in my case?
>
> No.

So it looks like I may need to go a different way. I can set up a WAL-based backup server and dump the content of each schema in an application-specific way (JSON export). This way I keep a ready-to-use backup with the slave (at the moment a restore is just a series of imports, one per schema) and the JSON dump keeps my customers happy.

If you know a better solution, I would be pleased to be guided in the right direction.

loïc

--
Loïc d'Anterroches - Céondo Ltd - http://www.ceondo.com
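For reference, checking and raising this setting looks roughly like the sketch below. The value shown is only an assumption; changing it requires a server restart, and the shared lock table holds roughly max_locks_per_transaction * (max_connections + max_prepared_transactions) entries, so the per-slot memory cost is small.

-- current value (the 8.3 default is 64)
SHOW max_locks_per_transaction;

# postgresql.conf -- requires a restart to take effect
max_locks_per_transaction = 256    # assumed value; size it to the number of objects locked per dump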
In response to "Loic d'Anterroches" <diaeresis@gmail.com>:

> On Wed, Oct 7, 2009 at 4:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > "Loic d'Anterroches" <diaeresis@gmail.com> writes:
> >> Each night I am running:
> >> pg_dump --blobs --schema=%s --no-acl -U postgres indefero | gzip > /path/to/backups/%s/%s-%s.sql.gz
> >> once for each installation, so 1100 times. The substitution strings
> >> provide the timestamp and the right schema.

Have you tested the speed without the gzip?

We found that compressing the dump takes considerably longer than pg_dump
does, but pg_dump can't release its locks until gzip has completely
processed all of the data, because of the pipe. By doing the pg_dump in
a different step than the compression, we were able to eliminate our
table locking issues, i.e.:

pg_dump --blobs --schema=%s --no-acl -U postgres indefero > /path/to/backups/%s/%s-%s.sql && gzip /path/to/backups/%s/%s-%s.sql

Of course, you'll need enough disk space to store the uncompressed dump
while gzip works.

--
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/
Loic,
> settings up each time. The added benefit of doing a per-schema dump is
> that I provide it to the users directly; that way they have a full
> export of their data.
you should try the timing with
pg_dump --format=c --file=completedatabase.dmp indefero
and then generating the separate schemas in an extra step like
pg_restore --schema=%s --file=outputfilename.sql completedatabase.dmp
I found that even with maximum compression
pg_dump --format=c --compress=9
the pg_dump compression was quicker than dump + gzip/bzip2/7z compression afterwards.
And after the dumpfile is created, pg_restore will leave your database alone.
(Make sure to put completedatabase.dmp on a separate filesystem.) You can even try to run more than one pg_restore --file in parallel.
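A sketch of how these two steps might be scripted together; the schema-listing query, the output paths and the degree of parallelism are assumptions, and -P requires GNU xargs.

# 1. a single custom-format dump of the whole database
pg_dump --format=c --blobs --no-acl -U postgres --file=completedatabase.dmp indefero

# 2. turn it into one plain SQL file per schema, a few schemas at a time
psql -U postgres -At -c "SELECT nspname FROM pg_namespace WHERE nspname NOT IN ('information_schema') AND nspname NOT LIKE 'pg_%'" indefero \
  | xargs -P 4 -I{} pg_restore --schema={} --file=/path/to/backups/{}/{}.sql completedatabase.dmp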
Best wishes,
Harald
--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
no fx, no carrier pigeon
-
%s is too gigantic of an industry to bend to the whims of reality
On Wed, Oct 7, 2009 at 5:54 PM, Bill Moran <wmoran@potentialtech.com> wrote:
> In response to "Loic d'Anterroches" <diaeresis@gmail.com>:
>
>> On Wed, Oct 7, 2009 at 4:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > "Loic d'Anterroches" <diaeresis@gmail.com> writes:
>> >> Each night I am running:
>> >> pg_dump --blobs --schema=%s --no-acl -U postgres indefero | gzip > /path/to/backups/%s/%s-%s.sql.gz
>> >> once for each installation, so 1100 times. The substitution strings
>> >> provide the timestamp and the right schema.
>
> Have you tested the speed without the gzip?

This is the first thing I tried, but I was not able to get any significant improvement that way. The data to gzip is very small per schema, so compression is not the bottleneck.

> We found that compressing the dump takes considerably longer than pg_dump
> does, but pg_dump can't release its locks until gzip has completely
> processed all of the data, because of the pipe.

Good tip, I will keep that in mind for the future!

Thanks,
loïc

--
Loïc d'Anterroches - Céondo Ltd - http://www.ceondo.com
Harald,

>> settings up each time. The added benefit of doing a per-schema dump is
>> that I provide it to the users directly; that way they have a full
>> export of their data.
>
> you should try the timing with
>
> pg_dump --format=c --file=completedatabase.dmp indefero
>
> and then generating the separate schemas in an extra step like
>
> pg_restore --schema=%s --file=outputfilename.sql completedatabase.dmp
>
> I found that even with maximum compression
>
> pg_dump --format=c --compress=9
>
> the pg_dump compression was quicker than dump + gzip/bzip2/7z compression
> afterwards.
>
> And after the dumpfile is created, pg_restore will leave your database
> alone.
> (Make sure to put completedatabase.dmp on a separate filesystem.) You can
> even try to run more than one pg_restore --file in parallel.

Yummy! The speed of a full dump and the benefits of the per-schema dump for the users. I will try this tonight when the load is low and keep you informed of the results.

Thanks a lot for all the good ideas and pointers!

loïc

--
Loïc d'Anterroches - Céondo Ltd - http://www.ceondo.com
On Wed, 2009-10-07 at 12:51 +0200, Loic d'Anterroches wrote:
> Hello,
>
> My problem is that the dump time has increased steadily with the number
> of schemas (now about 20s, up from about 12s with 850 schemas) and
> pg_dump now balloons to 120MB of memory usage while running.

And it will continue to. The number of locks that need to be acquired
will keep increasing the time it takes to back up the database as you
add schemas and objects. This applies whether you are running a single
dump per schema or a global dump with -Fc.

I agree with the other participants in this thread that it makes more
sense for you to use -Fc, but your speed isn't going to change all that
much overall.

Joshua D. Drake

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
If the world pushes look it in the eye and GRR. Then push back harder. - Salamander
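To see this growth in action, something like the following rough sketch can be run from another session while a dump is in progress; the backend doing the dump will be the one holding an ever larger pile of AccessShareLocks as schemas are added.

-- lock counts per backend, highest first; run while pg_dump is active
SELECT pid, count(*) AS locks_held
FROM pg_locks
GROUP BY pid
ORDER BY locks_held DESC;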
Hi Joshua,

On Wed, Oct 7, 2009 at 6:29 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> On Wed, 2009-10-07 at 12:51 +0200, Loic d'Anterroches wrote:
>> Hello,
>>
>> My problem is that the dump time has increased steadily with the number
>> of schemas (now about 20s, up from about 12s with 850 schemas) and
>> pg_dump now balloons to 120MB of memory usage while running.
>
> And it will continue to. The number of locks that need to be acquired
> will keep increasing the time it takes to back up the database as you
> add schemas and objects. This applies whether you are running a single
> dump per schema or a global dump with -Fc.
>
> I agree with the other participants in this thread that it makes more
> sense for you to use -Fc, but your speed isn't going to change all that
> much overall.

If the speed of a full dump versus a series of schema dumps is not going to be dramatically different, it is time to change to a more efficient long-term way to do the job. Anyway, the benefits of having a WAL-powered slave are interesting, as I can use it as a failover in another datacenter in case of problems on the main DB server. It will also add no load on the main server, which is a good thing.

Thanks to all of you for the detailed answers, I feel good using PostgreSQL for all my production deployments. Being able to get such good answers in a very short amount of time, when nothing is available on the net, is really nice.

loïc

--
Loïc d'Anterroches - Céondo Ltd - http://www.ceondo.com
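For reference, a minimal sketch of the WAL-shipping setup this implies on 8.3; the archive location is an assumption, and pg_standby comes from contrib.

# postgresql.conf on the primary (8.3)
archive_mode = on
archive_command = 'cp %p /path/to/wal_archive/%f'   # assumed archive location

# recovery.conf on the warm standby, created from a base backup of the primary
restore_command = 'pg_standby /path/to/wal_archive %f %p %r'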