Thread: pg_dump with 1100 schemas being a bit slow

pg_dump with 1100 schemas being a bit slow

From: "Loic d'Anterroches"

Hello,

After a series of web searches for information, I am asking for the
help of people with a bit more knowledge of the internals of pg_dump
to solve a performance problem I have. I am running PostgreSQL 8.3.8
for both the server and pg_dump.

The context is a hosting farm for a web application
(http://www.indefero.net) where each installation gets its own schema.
Each table in the schema is a "real" table, not a view of the same
table in the public schema with a "WHERE schema='currentschema'" clause.
This setup allows me to easily run the web application nearly
unmodified between the downloadable version and the "farm" version
(schemas rock!). Now you have the background.

Each night I am running:
pg_dump --blobs --schema=%s --no-acl -U postgres indefero | gzip >
/path/to/backups/%s/%s-%s.sql.gz
this for each installation, so 1100 times. The substitution strings add
the timestamp and select the right schema.
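
For illustration, the nightly job amounts to something like the sketch
below (the schema-listing query, the loop and the timestamp format are
illustrative placeholders, not the actual script):

for schema in $(psql -U postgres -At -d indefero -c \
        "SELECT nspname FROM pg_namespace
          WHERE nspname NOT IN ('pg_catalog', 'information_schema')
            AND nspname NOT LIKE 'pg_toast%'
            AND nspname NOT LIKE 'pg_temp%'"); do
    # one dump per installation, compressed on the fly
    pg_dump --blobs --schema="$schema" --no-acl -U postgres indefero \
        | gzip > "/path/to/backups/$schema/$schema-$(date +%Y%m%d).sql.gz"
done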

My problem is that the dump time has increased steadily with the number of
schemas (now about 20s, up from about 12s with 850 schemas), and pg_dump is
now ballooning to 120MB of memory usage when running the dump.

The thing is that my average schema size is a bit more than 1MB, with
a schema just after installation being 850kB.
Max size: 2.8MB
Min size: 0.85MB
Avg: 1.0MB
Total size: ~1GB

To get the sizes I run the following query and sum the "size" column:

SELECT relname, pg_total_relation_size(CAST(relname AS TEXT)) AS size
  FROM pg_class AS pgc, pg_namespace AS pgn
 WHERE pg_table_is_visible(pgc.oid) IS TRUE
   AND relkind = 'r'
   AND pgc.relnamespace = pgn.oid
   AND pgn.nspname NOT IN ('information_schema', 'pg_catalog');
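
Equivalently, the per-schema totals can be computed in one pass with
something like this (a sketch along the same lines, not the exact
query I use):

psql -U postgres -At -d indefero -c "
    SELECT pgn.nspname,
           pg_size_pretty(sum(pg_total_relation_size(pgc.oid))::bigint) AS size
      FROM pg_class AS pgc
      JOIN pg_namespace AS pgn ON pgn.oid = pgc.relnamespace
     WHERE pgc.relkind = 'r'
       AND pgn.nspname NOT IN ('information_schema', 'pg_catalog')
     GROUP BY pgn.nspname
     ORDER BY pgn.nspname"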


I think that pg_dump, when looking at the objects to dump, even if it is
limited to a given schema, is scanning the complete database in one of
those calls:
http://doxygen.postgresql.org/pg__dump_8c-source.html#l00717

Is there an option: "I know what I am doing, do not look outside of
the schema" available which can help in my case? Because running
SELECT is snappy, vmstats shows no swapping and the complete system is
basically running very well with a load average below 1.

So, your help is very welcome,
loïc

--
Loïc d'Anterroches - Céondo Ltd - http://www.ceondo.com

Re: pg_dump with 1100 schemas being a bit slow

From: Tom Lane

"Loic d'Anterroches" <diaeresis@gmail.com> writes:
> Each night I am running:
> pg_dump --blobs --schema=%s --no-acl -U postgres indefero | gzip >
> /path/to/backups/%s/%s-%s.sql.gz
> this for each installation, so 1100 times. The substitution strings add
> the timestamp and select the right schema.

This seems like a pretty dumb way to go at it.  Why don't you just do
one -Fc dump for the whole database?  If you ever actually need to
restore a single schema, there's a pg_restore switch for that.
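
For example, roughly (a sketch; the dump file name and the example
schema are placeholders):

pg_dump -Fc --blobs --no-acl -U postgres -f indefero.dump indefero
# restore a single schema into a database ...
pg_restore --schema=some_schema -U postgres -d indefero indefero.dump
# ... or just extract it as a plain SQL script
pg_restore --schema=some_schema -f some_schema.sql indefero.dump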

> I think that pg_dump, when looking at the objects to dump, even if it is
> limited to a given schema, is scanning the complete database in one of
> those calls:

Yes, it has to examine all database objects in order to trace
dependencies properly.

> Is there an option: "I know what I am doing, do not look outside of
> the schema" available which can help in my case?

No.

            regards, tom lane

Re: pg_dump with 1100 schemas being a bit slow

From: "Loic d'Anterroches"

On Wed, Oct 7, 2009 at 4:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Loic d'Anterroches" <diaeresis@gmail.com> writes:
>> Each night I am running:
>> pg_dump --blobs --schema=%s --no-acl -U postgres indefero | gzip >
>> /path/to/backups/%s/%s-%s.sql.gz
>> this for each installation, so 1100 times. The substitution strings add
>> the timestamp and select the right schema.
>
> This seems like a pretty dumb way to go at it.  Why don't you just do
> one -Fc dump for the whole database?  If you ever actually need to
> restore a single schema, there's a pg_restore switch for that.

Thank you for your fast answer.

This is the way I started to do the work, but then I ran into issues
because the number of tables that need to be locked at the same time
kept increasing, so I had to push the max_locks_per_transaction
setting up each time. The added benefit of doing a per-schema dump is
that I provide it to the users directly; that way they have a full
export of their data. I cannot keep increasing max_locks_per_transaction
forever as the number of schemas grows, can I?
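
Concretely, each bump looks roughly like this (a sketch; the value and
the data directory are placeholders, and the setting only takes effect
after a server restart):

psql -U postgres -c "SHOW max_locks_per_transaction;"   # default is 64
# in postgresql.conf: as far as I understand, the shared lock table holds
# about max_locks_per_transaction * (max_connections + max_prepared_transactions)
# entries, so a higher value mostly costs a bit more shared memory
#   max_locks_per_transaction = 256
pg_ctl -D /path/to/data restart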

What is the problem if I set this to a high value (apart from the
memory overhead per connection, as far as I understood the docs)?

>> I think that pg_dump, when looking at the objects to dump, even if it is
>> limited to a given schema, is scanning the complete database in one of
>> those calls:
>
> Yes, it has to examine all database objects in order to trace
> dependencies properly.
>
>> Is there an option: "I know what I am doing, do not look outside of
>> the schema" available which can help in my case?
>
> No.

So it looks like I may need to go a different way. I can set up a
WAL-based backup server and dump the content of each schema in an
application-specific way (JSON export). This way I keep a ready-to-use
backup with the slave (at the moment a restore is just a series of
imports, one per schema), and the JSON dump keeps my customers happy.
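
Roughly what I have in mind for the standby (hostnames and paths below
are placeholders):

on the primary, in postgresql.conf:
    archive_mode = on
    archive_command = 'rsync -a %p standby:/var/lib/pgsql/wal_archive/%f'
on the standby, in recovery.conf (pg_standby is in contrib):
    restore_command = 'pg_standby /var/lib/pgsql/wal_archive %f %p'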

If you know a better solution, I would be pleased to be guided in the
right direction.
loïc

--
Loïc d'Anterroches - Céondo Ltd - http://www.ceondo.com

Re: pg_dump with 1100 schemas being a bit slow

From: Bill Moran

In response to "Loic d'Anterroches" <diaeresis@gmail.com>:

> On Wed, Oct 7, 2009 at 4:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > "Loic d'Anterroches" <diaeresis@gmail.com> writes:
> >> Each night I am running:
> >> pg_dump --blobs --schema=%s --no-acl -U postgres indefero | gzip >
> >> /path/to/backups/%s/%s-%s.sql.gz
> >> this for each installation, so 1100 times. The substitution strings add
> >> the timestamp and select the right schema.

Have you tested the speed without the gzip?

We found that compressing the dump takes considerably longer than pg_dump
does, but pg_dump can't release its locks until gzip has completely
processed all of the data, because of the pipe.

By doing the pg_dump in a different step than the compression, we were
able to eliminate our table locking issues, i.e.:

pg_dump --blobs --schema=%s --no-acl -U postgres indefero > /path/to/backups/%s/%s-%s.sql && \
    gzip /path/to/backups/%s/%s-%s.sql

Of course, you'll need enough disk space to store the uncompressed
dump while gzip works.

--
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/

Re: pg_dump with 1100 schemas being a bit slow

From: "Massa, Harald Armin"

Loic,

>settings up each time. The added benefit of doing a per schema dump is
>that I provide it to the users directly, that way they have a full
>export of their data.

you should try the timing with

pg_dump --format=c --file=completedatabase.dmp indefero

and then generating the separate schemas in an extra step like

pg_restore --schema=%s --file=outputfilename.sql completedatabase.dmp

I found that even with maximum compression

pg_dump --format=c --compress=9

the pg_dump compression was quicker than dump + gzip/bzip/7z compression afterwards.

And after the dumpfile is created, pg_restore will leave your database alone.
(make sure to put completedatabase.dmp on a separate filesystem). You can even try to run more than one pg_restore --file in parallel.
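
Put together, the nightly job could look roughly like this (a sketch;
the schema listing, the paths and the unbounded parallelism are only
illustrative):

pg_dump --format=c --compress=9 --blobs --no-acl -U postgres \
    --file=/backups/completedatabase.dmp indefero
for schema in $(psql -U postgres -At -d indefero -c \
        "SELECT nspname FROM pg_namespace
          WHERE nspname NOT IN ('pg_catalog', 'information_schema')
            AND nspname NOT LIKE 'pg_toast%'
            AND nspname NOT LIKE 'pg_temp%'"); do
    # in a real script you would cap the number of parallel jobs
    pg_restore --schema="$schema" \
        --file="/backups/$schema.sql" /backups/completedatabase.dmp &
done
wait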

Best wishes,

Harald

--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
no fx, no carrier pigeon
-
%s is too gigantic of an industry to bend to the whims of reality

Re: pg_dump with 1100 schemas being a bit slow

From: "Loic d'Anterroches"

On Wed, Oct 7, 2009 at 5:54 PM, Bill Moran <wmoran@potentialtech.com> wrote:
> In response to "Loic d'Anterroches" <diaeresis@gmail.com>:
>
>> On Wed, Oct 7, 2009 at 4:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > "Loic d'Anterroches" <diaeresis@gmail.com> writes:
>> >> Each night I am running:
>> >> pg_dump --blobs --schema=%s --no-acl -U postgres indefero | gzip >
>> >> /path/to/backups/%s/%s-%s.sql.gz
>> >> this for each installation, so 1100 times. The substitution strings add
>> >> the timestamp and select the right schema.
>
> Have you tested the speed without the gzip?

This was the first thing I did, but in that case I was not able to get
any significant improvement. The data to gzip is very small per
schema, so this is not the bottleneck.

> We found that compressing the dump takes considerably longer than pg_dump
> does, but pg_dump can't release its locks until gzip has completely
> processed all of the data, because of the pipe.

Good tip, I keep that in mind for the future!

Thanks,
loïc

--
Loïc d'Anterroches - Céondo Ltd - http://www.ceondo.com

Re: pg_dump with 1100 schemas being a bit slow

From: "Loic d'Anterroches"

Harald,

>>settings up each time. The added benefit of doing a per schema dump is
>>that I provide it to the users directly, that way they have a full
>>export of their data.
>
> you should try the timing with
>
> pg_dump --format=c --file=completedatabase.dmp indefero
>
> and then generating the separate schemas in an extra step like
>
> pg_restore --schema=%s --file=outputfilename.sql completedatabase.dmp
>
> I found that even with maximum compression
>
> pg_dump --format=c --compress=9
>
> the pg_dump compression was quicker than dump + gzip/bzip/7z compression
> afterwards.
>
> And after the dumpfile is created, pg_restore will leave your database
> alone.
> (make sure to put completedatabase.dmp on a separate filesystem). You can
> even try to run more than one pg_restore --file in parallel.

Yummy! The speed of a full dump and the benefits of the per schema
dump for the users. I will try this one tonight when the load is low.
I will keep you informed of the results.

Thanks a lot for all the good ideas, pointers!
loïc

--
Loïc d'Anterroches - Céondo Ltd - http://www.ceondo.com

Re: pg_dump with 1100 schemas being a bit slow

From: "Joshua D. Drake"

On Wed, 2009-10-07 at 12:51 +0200, Loic d'Anterroches wrote:
> Hello,

> My problem is that the dump time has increased steadily with the number of
> schemas (now about 20s, up from about 12s with 850 schemas), and pg_dump is
> now ballooning to 120MB of memory usage when running the dump.
>

And it will continue to. The number of locks that need to be
acquired will steadily increase the amount of time it takes to
back up the database as you add schemas and objects. This applies
whether you are running a single dump per schema or a global dump
with -Fc.

I agree with the other participants in this thread that it makes more
sense for you to use -Fc but your speed isn't going to change all that
much overall.

Joshua D. Drake

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
If the world pushes look it in the eye and GRR. Then push back harder. - Salamander

Re: pg_dump with 1100 schemas being a bit slow

From: "Loic d'Anterroches"

Hi Joshua,

On Wed, Oct 7, 2009 at 6:29 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> On Wed, 2009-10-07 at 12:51 +0200, Loic d'Anterroches wrote:
>> Hello,
>
>> My problem is that the dump time has increased steadily with the number of
>> schemas (now about 20s, up from about 12s with 850 schemas), and pg_dump is
>> now ballooning to 120MB of memory usage when running the dump.
>>
>
> And it will continue to. The number of locks that need to be
> acquired will steadily increase the amount of time it takes to
> back up the database as you add schemas and objects. This applies
> whether you are running a single dump per schema or a global dump
> with -Fc.
>
> I agree with the other participants in this thread that it makes more
> sense for you to use -Fc but your speed isn't going to change all that
> much overall.

If the speed of a full dump versus a series of schema dumps is not
going to be dramatically different, it is time to change to a more
efficient long-term way to do the job. Anyway, the benefits of having a
WAL-powered slave are interesting, as I can use it as a failover in
another datacenter in case of problems on the main DB server. It will
also add no load on the main server, which is a good thing.

Thanks to all of you for the detailed answers; I feel good using
PostgreSQL for all my production deployments. Being able to get such
good answers in a very short amount of time, when nothing is available
on the net, is really nice.

loïc

--
Loïc d'Anterroches - Céondo Ltd - http://www.ceondo.com
