Thread: a back up question

a back up question

From
Martin Mueller
Date:

Are there rules of thumb for deciding when you can dump a whole database and when you’d be better off dumping groups of tables? I have a database that has around 100 tables, some of them quite large, and right now the data directory is well over 100GB. My hunch is that I should divide and conquer, but I don’t have a clear sense of what counts as “too big” these days. Nor do I have a clear sense of whether the constraints have to do with overall size, the number of tables, or machine memory (my machine has 32GB of memory).

 

Is 10GB a good practical limit to keep in mind?

 

 

Re: a back up question

From
"David G. Johnston"
Date:
On Tue, Dec 5, 2017 at 2:52 PM, Martin Mueller <martinmueller@northwestern.edu> wrote:

Are there rules of thumb for deciding when you can dump a whole database and when you’d be better off dumping groups of tables? I have a database that has around 100 tables, some of them quite large, and right now the data directory is well over 100GB. My hunch is that I should divide and conquer, but I don’t have a clear sense of what counts as “too big” these days. Nor do I have a clear sense of whether the constraints have to do with overall size, the number of tables, or machine memory (my machine has 32GB of memory).

 

Is 10GB a good practical limit to keep in mind?



I'd say the rule of thumb is: if you have to "divide and conquer", you should use a non-pg_dump-based backup solution. "Too big" is usually measured in units of time, not memory.

Any ability to partition your backups into discrete chunks is going to be very specific to your personal setup.  Restoring such a monster without constraint violations is something I'd be VERY worried about.
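
If you want to put a number on the time side, something like the following is usually enough; the database name and output path here are made up:

# time a straight whole-database dump (names and paths are hypothetical)
time pg_dump -Fc -f /backups/mydb.dump mydb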

David J.

Re: a back up question

From
Carl Karsten
Date:
Nothing wrong with lots of tables and data.

Don't impose any constraints on your problem that you don't need to.

Like, what are you backing up to? $400 for a 1TB SSD, or $80 for a 2TB USB 3 spinning disk?

If you are backing up while the db is being updated, you need to make sure updates are queued until the backup is done. Don't mess with that process. Personally, I would assume the db is always being updated and plan for that.




On Tue, Dec 5, 2017 at 3:52 PM, Martin Mueller <martinmueller@northwestern.edu> wrote:

Are there rules of thumb for deciding when you can dump a whole database and when you’d be better off dumping groups of tables? I have a database that has around 100 tables, some of them quite large, and right now the data directory is well over 100GB. My hunch is that I should divide and conquer, but I don’t have a clear sense of what counts as “too big” these days. Nor do I have a clear sense of whether the constraints have to do with overall size, the number of tables, or machine memory (my machine has 32GB of memory).

 

Is 10GB a good practical limit to keep in mind?

 

 




--
Carl K

Re: a back up question

From
Martin Mueller
Date:

Time is not really a problem for me, if we are talking about hours rather than days. On a roughly comparable machine I’ve made backups of databases less than 10 GB, and it was a matter of minutes. But I know that there are scale problems. Sometimes programs just hang if the data go beyond a certain size. Is that likely in Postgres if you go from ~10 GB to ~100 GB? There isn’t any interdependence among my tables beyond queries I construct on the fly, because I use the database in a single-user environment.

 

From: "David G. Johnston" <david.g.johnston@gmail.com>
Date: Tuesday, December 5, 2017 at 3:59 PM
To: Martin Mueller <martinmueller@northwestern.edu>
Cc: "pgsql-general@lists.postgresql.org" <pgsql-general@lists.postgresql.org>
Subject: Re: a back up question

 

On Tue, Dec 5, 2017 at 2:52 PM, Martin Mueller <martinmueller@northwestern.edu> wrote:

Are there rules of thumb for deciding when you can dump a whole database and when you’d be better off dumping groups of tables? I have a database that has around 100 tables, some of them quite large, and right now the data directory is well over 100GB. My hunch is that I should divide and conquer, but I don’t have a clear sense of what counts as “too big” these days. Nor do I have a clear sense of whether the constraints have to do with overall size, the number of tables, or machine memory (my machine has 32GB of memory).

 

Is 10GB a good practical limit to keep in mind?

 

 

I'd say the rule of thumb is: if you have to "divide and conquer", you should use a non-pg_dump-based backup solution. "Too big" is usually measured in units of time, not memory.

 

Any ability to partition your backups into discrete chunks is going to be very specific to your personal setup.  Restoring such a monster without constraint violations is something I'd be VERY worried about.

 

David J.

 

Re: a back up question

From
Alvaro Herrera
Date:
Carl Karsten wrote:
> Nothing wrong with lots of tables and data.
> 
> Don't impose any constraints on your problem you don't need to.
> 
> Like what are you backing up to?    $400 for a 1T ssd or $80 fo a 2T usb3
> spinny disk.
> 
> If you are backing up while the db is being updated, you need to make sure
> updates are queued until the backup is done.  don't mess with that
> process.   personally I would assume the db is always being updated and
> expect that.

A backup generated by pg_dump never includes writes that are in flight
while the backup is being taken.  That would make the backup absolutely
worthless!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: a back up question

From
"David G. Johnston"
Date:
On Tue, Dec 5, 2017 at 3:09 PM, Martin Mueller <martinmueller@northwestern.edu> wrote:

Time is not really a problem for me, if we are talking about hours rather than days. On a roughly comparable machine I’ve made backups of databases less than 10 GB, and it was a matter of minutes. But I know that there are scale problems. Sometimes programs just hang if the data go beyond a certain size. Is that likely in Postgres if you go from ~10 GB to ~100 GB? There isn’t any interdependence among my tables beyond queries I construct on the fly, because I use the database in a single-user environment.



The convention on these lists is to inline and/or bottom-post; please avoid top-posting.

That you are using a relational database system to house tables without any interdependence (relationships) between them is an interesting proposition. Being in a "single-user environment" would, in most cases, have no impact on this...

PostgreSQL itself, bugs notwithstanding, won't "hang" no matter how much data is being processed. It does, however, take out locks so that the entire dump represents the exact same snapshot for all dumped objects. Those locks can impact queries. In particular, using "TRUNCATE" becomes pretty much impossible while a dump backup is in progress (I get bitten by this; I tend to truncate unlogged tables quite a bit in my usage of PostgreSQL). Normal updates and selects usually work without problems, though any transaction started after the backup begins will not be part of the output, even if it commits long before the backup finishes.

I suspect that typically you will end up annoyed at how long the backup takes well before any program/system issues become apparent.  Data is streamed to the output file handle so active memory usage and database size are not really correlated.
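
If you want to see the locking behaviour for yourself, here is a rough sketch (database and table names are made up):

# terminal 1: start a dump; pg_dump holds ACCESS SHARE locks on the tables it dumps
pg_dump -Fc -f mydb.dump mydb

# terminal 2: while the dump runs, a TRUNCATE just waits, because it needs
# an ACCESS EXCLUSIVE lock that conflicts with the dump's ACCESS SHARE locks
psql -d mydb -c 'TRUNCATE my_unlogged_table'

# terminal 3: watch the ungranted lock request
psql -d mydb -c "SELECT pid, mode, granted FROM pg_locks WHERE relation = 'my_unlogged_table'::regclass"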

David J.

Re: a back up question

From
Carl Karsten
Date:


On Tue, Dec 5, 2017 at 4:15 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Carl Karsten wrote:
> Nothing wrong with lots of tables and data.
>
> Don't impose any constraints on your problem you don't need to.
>
> Like what are you backing up to?    $400 for a 1T ssd or $80 fo a 2T usb3
> spinny disk.
>
> If you are backing up while the db is being updated, you need to make sure
> updates are queued until the backup is done.  don't mess with that
> process.   personally I would assume the db is always being updated and
> expect that.

A backup generated by pg_dump never includes writes that are in flight
while the backup is being taken.  That would make the backup absolutely
worthless!

Hmm, I kinda glossed over my point:
if you come up with your own process to chop the backup up into little pieces, you risk letting writes in between the pieces, and then, yeah, it's worthless.



--
Carl K

Re: a back up question

From
John R Pierce
Date:
On 12/5/2017 2:09 PM, Martin Mueller wrote:
Time is not really a problem for me, if we are talking about hours rather than days. On a roughly comparable machine I’ve made backups of databases less than 10 GB, and it was a matter of minutes. But I know that there are scale problems. Sometimes programs just hang if the data go beyond a certain size. Is that likely in Postgres if you go from ~10 GB to ~100 GB? There isn’t any interdependence among my tables beyond queries I construct on the fly, because I use the database in a single-user environment.

Another factor is restore time. Restores have to create indexes, and creating indexes on multi-million-row tables can take a while. (Hint: be sure to set maintenance_work_mem to 1GB before doing this!)
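
One way to do that just for the restore, without editing postgresql.conf, is to pass it through the connection options; the dump file name and job count here are made up, and -j assumes a custom- or directory-format dump:

# every restore connection starts with maintenance_work_mem bumped to 1GB
PGOPTIONS='-c maintenance_work_mem=1GB' pg_restore -d mydb -j 4 mydb.dump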



-- 
john r pierce, recycling bits in santa cruz

Re: a back up question

From
Alvaro Herrera
Date:
Carl Karsten wrote:
> On Tue, Dec 5, 2017 at 4:15 PM, Alvaro Herrera <alvherre@alvh.no-ip.org>
> wrote:

> > A backup generated by pg_dump never includes writes that are in flight
> > while the backup is being taken.  That would make the backup absolutely
> > worthless!
> 
> Hmm, i kinda glossed over my point:
> if you come up with your own process to chop up the backup into little
> pieces, you risk letting writes in, and then yeah, worthless.

Ah, sure.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: a back up question

From
Karsten Hilbert
Date:
On Tue, Dec 05, 2017 at 09:52:28PM +0000, Martin Mueller wrote:

> Are there rules for thumb for deciding when you can dump a
> whole database and when you’d be better off dumping groups of
> tables?

It seems to me we'd have to define the objective of "dumping" first?

Regards,
Karsten
-- 
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


Re: a back up question

From
Martin Mueller
Date:

On 12/6/17, 4:39 AM, "karsten.hilbert@gmx.net" <karsten.hilbert@gmx.net> wrote:

    On Tue, Dec 05, 2017 at 09:52:28PM +0000, Martin Mueller wrote:
    
    > Are there rules for thumb for deciding when you can dump a
    > whole database and when you’d be better off dumping groups of
    > tables?
    
    It seems to me we'd have to define the objective of "dumping" first ?
    
    Regards,
    Karsten
    -- 
    GPG key ID E4071346 @ eu.pool.sks-keyservers.net
    E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346
    

The objective is to create a backup from which I can restore any or all tables in the event of a crash. In my case, I use Postgres for my own scholarly purposes. Publications of whatever kind are not directly made public via the database. I am my only customer, and a service interruption, while a nuisance to me, does not create a crisis for others. I don’t want to lose my work, but a service interruption of a day or a week is no big deal.
 
    


Re: a back up question

From
Karsten Hilbert
Date:
On Wed, Dec 06, 2017 at 12:52:53PM +0000, Martin Mueller wrote:

>> Are there rules for thumb for deciding when you can dump a
>> whole database and when you’d be better off dumping groups of
>> tables?

>> It seems to me we'd have to define the objective of "dumping" first ?

> The objective is to create a  backup from which I can
> restore any or all tables in the event of a crash.

I see.

"Any or all" speaks in recommendation of non-plain output
formats _if_ using pg_dump.

> In my case,  I use Postgres for my own scholarly purposes.
> Publications of whatever kind are not directly made public
> via the database. I am my only customer, and a service
> interruption, while a nuisance to me, does not create a
> crisis for others. I don’t want to lose my work, but a
> service interruption of a day or a week is no big deal. 

In that case I would stick to pg_dump, perhaps with directory
format and then tarred and compressed, until you notice
actual problems (unbearable slowdown of the machine during
backup, running out of disk space).
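
A rough sketch of that (database name and paths are invented):

# directory-format dump (one compressed file per table), then a single tarball for copying off-site
pg_dump -Fd -f /backups/mydb_dir mydb
tar czf /backups/mydb_$(date +%F).tar.gz -C /backups mydb_dir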

My 2 cents,
Karsten
-- 
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


Re: a back up question

From
Stephen Frost
Date:
John, all,

* John R Pierce (pierce@hogranch.com) wrote:
> On 12/5/2017 2:09 PM, Martin Mueller wrote:
> >Time is not really a problem for me, if we talk about hours rather
> >than days.  On a roughly comparable machine I’ve made backups of
> >databases less than 10 GB, and it was a matter of minutes.  But I
> >know that there are scale problems. Sometimes programs just hang
> >if the data are beyond some size.  Is that likely in Postgres if
> >you go from ~ 10 GB to ~100 GB?  There isn’t any interdependence
> >among my tables beyond  queries I construct on the fly, because I
> >use the database in a single user environment
>
> another factor is restore time.    restores have to create
> indexes.   creating indexes on multi-million-row tables can take
> awhile.  (hint, be sure to set maintenance_work_mem to 1GB before
> doing this!)

I'm sure you're aware of this, John, but for others following along, just
to be clear: indexes have to be recreated when restoring from a
*logical* (e.g. pg_dump-based) backup.  Indexes don't have to be
recreated for *physical* (e.g. file-based) backups.

Neither pg_dump nor the various physical-backup utilities should hang or
have issues with larger data sets.
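
If you're curious what a logical restore will have to rebuild, you can list the archive's table of contents (the dump file name here is made up):

# the INDEX entries in the TOC are what get re-created, and take time, during a restore
pg_restore -l mydb.dump | grep INDEX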

Thanks!

Stephen

Re: a back up question

From
Vick Khera
Date:
On Wed, Dec 6, 2017 at 7:52 AM, Martin Mueller <martinmueller@northwestern.edu> wrote:

The objective is to create a  backup from which I can restore any or all tables in the event of a crash. In my case,  I use Postgres for my own scholarly purposes. Publications of whatever kind are not directly made public via the database. I am my only customer, and a service interruption, while a nuisance to me, does not create a crisis for others. I don’t want to lose my work, but a service interruption of a day or a week is no big deal.

I'd stick with pg_dump for sure. Two main choices, depending on how big your database is and how fast your disks are: 1) "c" format, a single flat compressed file from which you can restore; 2) "d" format, which you would then need to tar and compress for easy tracking and off-site copying. The only real advantage of "d" format is that you can parallelize the dumps if you have enough spare I/O bandwidth.
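
Roughly, with made-up names, paths, and job count:

# "c": one compressed archive file
pg_dump -Fc -f /backups/mydb.dump mydb

# "d": a directory with one file per table; -j runs that many dump jobs in parallel
pg_dump -Fd -j 4 -f /backups/mydb_dir mydb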

For my backups on a production database serving thousands of customers per day (mostly in the US) on a web app, I just did a "c" format pg_dump nightly around 3am US Eastern time. It was our low time, and the impact on the database server was not significant since it had more RAM than the size of the database on disk (256GB RAM vs 100GB disk size including indexes). The backups ran on a different machine, which connected to the DB server over the LAN, wrote the dump to its own local disk, and then copied it to an off-site server. Before I had such beefy hardware, I would do the dump from a replica kept up to date with the Slony1 software. The pg_dump backups were for disaster recovery and customer error recovery, so I kept about two weeks' worth of them.

Since you have no other consumers of your data, just use a simple "c" format dump however often you like, then copy those off-site. Easy peasy.
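
As a sketch, the whole routine can be a single cron entry (times, names, and hosts are made up; note that % has to be escaped in a crontab):

# nightly 3am "c" format dump, then copy it off-site
0 3 * * * pg_dump -Fc -f /backups/mydb_$(date +\%F).dump mydb && rsync -a /backups/ backuphost:/backups/mydb/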

Re: a back up question

From
Magnus Hagander
Date:
On Wed, Dec 6, 2017 at 9:52 PM, Martin Mueller <martinmueller@northwestern.edu> wrote:


On 12/6/17, 4:39 AM, "karsten.hilbert@gmx.net" <karsten.hilbert@gmx.net> wrote:

    On Tue, Dec 05, 2017 at 09:52:28PM +0000, Martin Mueller wrote:

    > Are there rules for thumb for deciding when you can dump a
    > whole database and when you’d be better off dumping groups of
    > tables?

    It seems to me we'd have to define the objective of "dumping" first ?

    Regards,
    Karsten
    --
    GPG key ID E4071346 @ eu.pool.sks-keyservers.net
    E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


The objective is to create a  backup from which I can restore any or all tables in the event of a crash. In my case,  I use Postgres for my own scholarly purposes. Publications of whatever kind are not directly made public via the database. I am my only customer, and a service interruption, while a nuisance to me, does not create a crisis for others. I don’t want to lose my work, but a service interruption of a day or a week is no big deal.

If you reach the point where you have to consider splitting up the dumps for performance reasons, then you have really reached the point where pg_dump just isn't good enough for backups anymore. There are good uses for pg_dump even on such large databases, but backups aren't one of them.

You should then instead use pg_basebackup or, if you need even more functionality and performance than it provides, look at external tools like pgbackrest or pgbarman. These tools don't need to be any more complicated to use than pg_dump, but they will give you a much better backup.
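
A minimal pg_basebackup run looks something like this (the target path is made up; pgbackrest and barman need a bit of one-time configuration, so I'll leave them out here):

# physical base backup as compressed tar files, with the WAL needed for consistency streamed alongside
pg_basebackup -D /backups/base_$(date +%F) -Ft -z -X stream -P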
 
--