Thread: Differential backup

Differential backup

From: Simon Riggs
Thinking about allowing a backup to tell which files have changed in the
database since last backup. This would allow an external utility to copy
away only changed files.

Now there's a few ways of doing this and many will say this is already
possible using file access times.

An explicit mechanism where Postgres could authoritatively say which
files have changed would make many feel safer, especially when other
databases also do this.

We keep track of which files require fsync(), so we could also keep
track of changed files using that same information.
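
As a very rough sketch of the consumer side (everything here is
hypothetical; it assumes the server can emit a plain list of changed
relation file paths, which is exactly the mechanism being proposed), the
external utility could be little more than:

    # Copy only the files the server reports as changed since the last
    # backup. "changed_files.txt" is a hypothetical output of the proposed
    # mechanism: one path per line, relative to the data directory.
    import os
    import shutil
    import sys

    def copy_changed(data_dir, backup_dir, changed_list):
        with open(changed_list) as f:
            for rel_path in (line.strip() for line in f):
                if not rel_path:
                    continue
                src = os.path.join(data_dir, rel_path)
                dst = os.path.join(backup_dir, rel_path)
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.copy2(src, dst)

    if __name__ == "__main__":
        copy_changed(sys.argv[1], sys.argv[2], sys.argv[3])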

Is this route worthwhile? Or in some way unacceptable?

-- Simon Riggs           www.2ndQuadrant.com



Re: Differential backup

From: Alvaro Herrera
Simon Riggs wrote:
> 
> Thinking about allowing a backup to tell which files have changed in the
> database since last backup. This would allow an external utility to copy
> away only changed files.
> 
> Now there's a few ways of doing this and many will say this is already
> possible using file access times.
> 
> An explicit mechanism where Postgres could authoritatively say which
> files have changed would make many feel safer, especially when other
> databases also do this.
> 
> We keep track of which files require fsync(), so we could also keep
> track of changed files using that same information.

Why file level?  Seems a bit too coarse (particularly if you have large
file support enabled).  Maybe we could keep block-level last change info
in a separate fork.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Differential backup

From: Simon Riggs
On Tue, 2010-04-27 at 09:50 -0400, Alvaro Herrera wrote:
> Simon Riggs wrote:
> > 
> > Thinking about allowing a backup to tell which files have changed in the
> > database since last backup. This would allow an external utility to copy
> > away only changed files.
> > 
> > Now there's a few ways of doing this and many will say this is already
> > possible using file access times.
> > 
> > An explicit mechanism where Postgres could authoritatively say which
> > files have changed would make many feel safer, especially when other
> > databases also do this.
> > 
> > We keep track of which files require fsync(), so we could also keep
> > track of changed files using that same information.
> 
> Why file level?  Seems a bit too coarse (particularly if you have large
> file support enabled).  Maybe we could keep block-level last change info
> in a separate fork.

Block-level information is mostly available already by using the LSN;
you just need to scan the file. So block-level tracking doesn't seem
useful enough to justify the extra overhead.

File-level would be sufficient for most purposes. If you wanted to go
finer grained, you could then scan just the files that have changed.
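
For illustration, the scan itself is nearly trivial. A sketch, assuming
the default 8k block size and a little-endian pd_lsn at the start of the
page header (both assumptions, not guarantees):

    # Report whether a relation file contains any page newer than a given
    # LSN by reading pd_lsn from each page header.
    import struct

    BLCKSZ = 8192  # assumes the default block size

    def file_changed_since(path, cutoff_lsn):
        with open(path, "rb") as f:
            while True:
                page = f.read(BLCKSZ)
                if len(page) < BLCKSZ:
                    return False        # no newer page found
                # pd_lsn is stored as two uint32s (xlogid, xrecoff);
                # little-endian is assumed here, the server uses native order
                xlogid, xrecoff = struct.unpack_from("<II", page, 0)
                if ((xlogid << 32) | xrecoff) > cutoff_lsn:
                    return True

Run per segment file with the LSN recorded at the previous backup, that
gives a changed-file list without any new server-side bookkeeping.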

-- Simon Riggs           www.2ndQuadrant.com



Re: Differential backup

From: "Kevin Grittner"
Simon Riggs <simon@2ndQuadrant.com> wrote:
> Thinking about allowing a backup to tell which files have changed
> in the database since last backup. This would allow an external
> utility to copy away only changed files.
> 
> Now there's a few ways of doing this and many will say this is
> already possible using file access times.
Who would say otherwise?  Under what circumstances would PostgreSQL
modify a file without changing the "last modified" timestamp or the
file size?  If you're concerned about the converse, with daemon-
based rsync you can copy just the modified portions of a file on
which the directory information has changed.  Or is this targeting
platforms which don't have rsync?
> An explicit mechanism where Postgres could authoritatively say
> which files have changed would make many feel safer, especially
> when other databases also do this.
Why?  I must be missing something, because my feeling is that if you
can't trust your OS to cover something like this, how can you trust
any application *running* under that OS to do it?
> Is this route worthwhile?
I'm not seeing it, but I could be missing something.  Can you
describe a use case where this would be beneficial?
-Kevin


Re: Differential backup

From: Simon Riggs
On Tue, 2010-04-27 at 08:59 -0500, Kevin Grittner wrote:
> > An explicit mechanism where Postgres could authoritatively say
> > which files have changed would make many feel safer, especially
> > when other databases also do this.
>  
> Why?  I must be missing something, because my feeling is that if you
> can't trust your OS to cover something like this, how can you trust
> any application *running* under that OS to do it?

Good questions. I'm exploring a perceived need. 

I don't think people want this because they think the OS is flaky. It's
more about trusting all of the configurations of all of the filesystems
in use. An explicit mechanism would be more verifiably accurate. It
might just be about control and blame.
-- Simon Riggs           www.2ndQuadrant.com



Re: Differential backup

From: Florian Pflug
On Apr 27, 2010, at 15:50 , Alvaro Herrera wrote:
> Simon Riggs wrote:
>> Thinking about allowing a backup to tell which files have changed in the
>> database since last backup. This would allow an external utility to copy
>> away only changed files.
>>
>> Now there's a few ways of doing this and many will say this is already
>> possible using file access times.
>>
>> An explicit mechanism where Postgres could authoritatively say which
>> files have changed would make many feel safer, especially when other
>> databases also do this.
>>
>> We keep track of which files require fsync(), so we could also keep
>> track of changed files using that same information.
>
> Why file level?  Seems a bit too coarse (particularly if you have large
> file support enabled).  Maybe we could keep block-level last change info
> in a separate fork.

Hm, but most backup solutions work per-file and not per-block, so
file-level tracking probably has more use-cases than block-level
tracking.

In any case, it seems that this information could easily be extracted
from the WAL. The archive_command could call a simple tool that parses
the WAL and tracks the latest LSN per database file or page or whatever
granularity is required. This, together with the backup label of the last
backup, should be enough to compute the list of changed files, I think.
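
Something like this for the bookkeeping half (only a sketch; the WAL
parsing itself is hand-waved, and the parser is assumed to hand us
(file path, LSN) pairs per record):

    # Keep the highest LSN seen per relation file, updated once per
    # archived WAL segment, then report files whose latest LSN is newer
    # than the start LSN of the previous base backup.
    import json

    def update_change_map(map_path, records):
        """records: iterable of (file_path, lsn) pairs from one WAL segment."""
        try:
            with open(map_path) as f:
                latest = json.load(f)
        except FileNotFoundError:
            latest = {}
        for path, lsn in records:
            if lsn > latest.get(path, 0):
                latest[path] = lsn
        with open(map_path, "w") as f:
            json.dump(latest, f)

    def changed_since(map_path, backup_start_lsn):
        with open(map_path) as f:
            latest = json.load(f)
        return sorted(p for p, lsn in latest.items() if lsn > backup_start_lsn)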

best regards,
Florian Pflug



Re: Differential backup

From: Florian Pflug
On Apr 27, 2010, at 16:08 , Simon Riggs wrote:
> On Tue, 2010-04-27 at 08:59 -0500, Kevin Grittner wrote:
>> Why?  I must be missing something, because my feeling is that if you
>> can't trust your OS to cover something like this, how can you trust
>> any application *running* under that OS to do it?
>
> Good questions. I'm exploring a perceived need.
>
> I don't think people want this because they think the OS is flaky. It's
> more about trusting all of the configurations of all of the filesystems
> in use. An explicit mechanism would be more verifiably accurate. It
> might just be about control and blame.

I believe one reason people (including me) don't have 100% faith in file
modification times is non-monotone system clocks. I've seen more than one
system where a cron job running ntpdate every night was used as a poor
man's replacement for ntpd...

So the real advantage of rolling our own solution is the ability to use
LSNs instead of timestamps, I'd say.

best regards,
Florian Pflug



Re: Differential backup

From: Michael Tharp
On 04/27/2010 09:59 AM, Kevin Grittner wrote:
> Under what circumstances would PostgreSQL
> modify a file without changing the "last modified" timestamp or the
> file size?

Do all OSes have sub-second precision mtimes? Because otherwise I could
see a scenario such as this:

* File is modified
* Backup inspects and copies the file in the same second
* File is modified again in the same second, so the mtime doesn't change
* Backup is run again some time later and sees that the mtime has not 
changed

Even with microsecond precision this kind of scenario makes me squidgy, 
especially if some OSes decide that skipping frequent mtime updates is 
OK. Florian's point about clock changes is also very relevant. Since 
Postgres has the capability to give a better answer about what is in the 
file, it would be best to use that.

-- m. tharp


Re: Differential backup

From: Merlin Moncure
On Tue, Apr 27, 2010 at 10:32 AM, Michael Tharp
<gxti@partiallystapled.com> wrote:
> On 04/27/2010 09:59 AM, Kevin Grittner wrote:
>>
>> Under what circumstances would PostgreSQL
>> modify a file without changing the "last modified" timestamp or the
>> file size?
>
> Do all OSes have sub-second precision mtimes? Because otherwise I could see
> a scenario such as this:
>
> * File is modified
> * Backup inspects and copies the file in the same second
> * File is modified again in the same second, so the mtime doesn't change
> * Backup is run again some time later and sees that the mtime has not
> changed
>
> Even with microsecond precision this kind of scenario makes me squidgy,
> especially if some OSes decide that skipping frequent mtime updates is OK.
> Florian's point about clock changes is also very relevant. Since Postgres
> has the capability to give a better answer about what is in the file, it
> would be best to use that.

Why not just force all files to be checked regardless of mtime?  The
proposal only seems a win to me if a fair percentage of the larger
files don't change, which strikes me as a relatively low level case to
optimize for.  Maybe I'm missing the objective, but it looks like the
payoff is to avoid scanning large files for checksums.  If I was even
infinitesimally insecure about rsync missing files because of
clock/filesystem issues, I'd simply force it.

One cool thing about making postgres 'aware' of last backup time is
that you could warn the user in various places that the database is
not being properly backed up (pg_dump would have to monitor
last_backup_time as well then).  Good stuff, but I bet most people who
aren't backing up the database also aren't checking the log :-).

The block level case seems pretty much covered by the hot standby feature.

merlin


Re: Differential backup

From: "Kevin Grittner"
Merlin Moncure <mmoncure@gmail.com> wrote:
> The proposal only seems a win to me if a fair percentage of the
> larger files don't change, which strikes me as a relatively low
> level case to optimize for.
That's certainly a situation we face, with a relatively slow WAN in
the middle.
http://archives.postgresql.org/pgsql-admin/2009-07/msg00071.php
I don't know how rare or common that is.
-Kevin


Re: Differential backup

From: Csaba Nagy
Hi all,

On Tue, 2010-04-27 at 11:07 -0400, Merlin Moncure wrote:
> The block level case seems pretty much covered by the hot standby feature.

One use case we would have is to dump only the changes from the last
backup of a single table. This table takes 30% of the DB disk space, it
is in the order of ~400GB, and it's only inserted, never updated, then
after ~1 year the old entries are archived. There's ~10M new entries
daily in this table. If the backup would be smart enough to only read
the changed blocks (in this case only for newly inserted records), it
would be a fairly big win...

Cheers,
Csaba.




Re: Differential backup

From: Merlin Moncure
On Tue, Apr 27, 2010 at 11:13 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Merlin Moncure <mmoncure@gmail.com> wrote:
>
>> The proposal only seems a win to me if a fair percentage of the
>> larger files don't change, which strikes me as a relatively low
>> level case to optimize for.
>
> That's certainly a situation we face, with a relatively slow WAN in
> the middle.
>
> http://archives.postgresql.org/pgsql-admin/2009-07/msg00071.php
>
> I don't know how rare or common that is.

hm...interesting read.  pretty clever.  Your archiving requirements are high.

With the new stuff (HS/SR) taken into consideration, would you have
done your DR the same way if you had to do it all over again?

Part of my concern here is that manual filesystem level backups are
going to become an increasingly arcane method of doing things as the
HS/SR train starts leaving the station.

hm, it would be pretty neat to see some of the things you do pushed
into logical (pg_dump) style backups...with some enhancements so that
it can skip tables that haven't changed and are already present in a
previously supplied dump.  This is more complicated but maybe more useful
for a broader audience?

Side question: is it impractical to backup via pg_dump a hot standby
because of query conflict issues?

merlin


Re: Differential backup

From: "Kevin Grittner"
Merlin Moncure <mmoncure@gmail.com> wrote:
> Your archiving requirements are high.
They are set by a Steering Committee composed of the Director of
State Courts and various District Court Administrators, Judges,
Clerks of Court, and Registers in Probate who rely on this data and
*really* want to be safe.  I just work here.  ;-)
> With the new stuff (HS/SR) taken into consideration, would you
> have done your DR the same way if you had to do it all over again?
When SR is available, if I can maintain the flow of WAL files while
doing so, I would feed our "warm standby" farm with SR connections. 
Otherwise I'd do the same.  It's pretty much mandated that we keep
those copies.  It'd be ideal if SR could reconstruct the WAL file
segments on the receiving end, to avoid sending the data twice.
Dare I dream?  :-)
-Kevin


Re: Differential backup

From: Robert Haas
On Tue, Apr 27, 2010 at 10:08 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, 2010-04-27 at 08:59 -0500, Kevin Grittner wrote:
>> > An explicit mechanism where Postgres could authoritatively say
>> > which files have changed would make many feel safer, especially
>> > when other databases also do this.
>>
>> Why?  I must be missing something, because my feeling is that if you
>> can't trust your OS to cover something like this, how can you trust
>> any application *running* under that OS to do it?
>
> Good questions. I'm exploring a perceived need.
>
> I don't think people want this because they think the OS is flaky. It's
> more about trusting all of the configurations of all of the filesystems
> in use. An explicit mechanism would be more verifiably accurate. It
> might just be about control and blame.

What I think would be cool, though it's not what you proposed, is an
integrated base backup feature.  Say your SR slave gets too far behind
and can't catch up for some reason (the system administrator
accidentally nuked the archive, or you were living on the edge and not
keeping one).  It would be neat to have a way, either manually or
maybe even automatically, to tell the slave, hey, go make a new base
backup.  And it would connect to the master and do pg_start_backup()
and stream down the whole database contents and do pg_stop_backup().
Of course you can do all of this with scripts, but ISTM an integrated
capability would be much easier to administer and might offer some
interesting opportunities for compression.
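
To make the script version concrete, it is roughly this (a sketch; the
connection string, paths and error handling are placeholders, and over a
WAN you would want rsync rather than a plain copy):

    # pg_start_backup(), copy the data directory, pg_stop_backup().
    import shutil
    import psycopg2

    def base_backup(dsn, data_dir, target_dir, label):
        conn = psycopg2.connect(dsn)
        conn.autocommit = True
        cur = conn.cursor()
        cur.execute("SELECT pg_start_backup(%s)", (label,))
        try:
            # Naive copy; skip WAL and the pid file.
            shutil.copytree(data_dir, target_dir,
                            ignore=shutil.ignore_patterns("pg_xlog",
                                                          "postmaster.pid"))
        finally:
            cur.execute("SELECT pg_stop_backup()")
            conn.close()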

With respect to what you actually proposed, like Kevin, I'm not sure
what it's good for.  It might make sense if we know what the use case
is but the value certainly isn't obvious.

...Robert


Re: Differential backup

From: Craig Ringer
Simon Riggs wrote:

> Is this route worthwhile? Or in some way unacceptable?

From an admin perspective, I think block-level differentials would be a
lot more useful, especially if index storage could be safely excluded.

IMO Pg really could use an "index bad or missing, rebuild on postmaster
start" flag so that indexes could simply be omitted from backups and
would be automatically REINDEXed on startup. That'd be *great* for
pg_start_backup() / pg_stop_backup() filesystem level backups,
especially if indexes were configured to live in another tablespace.



Another avenue possibly worth investigating may be using the in-heap
mvcc information to do SQL-level differential backups of individual
tables or of the whole database. think:
   pg_dump --incremental --last-backup-id '10296:10296:'

where "--last-backup-id" is the output of "select
txid_current_snapshot()" from the last backup, and could possibly be
fished out of a header in the previous dump.

This would be *incredibly* handy for people who have one database in a
cluster that's more important than another and needs long-term history
storage, but for whom PITR is a PITA because it's whole-cluster-or-nothing.

This is trivial to do for individual append-only tables. I was trying to
figure out how to handle updates/deletes but quickly found myself
completely stumped.

I'd be surprised if this hasn't been looked at and put in the
"impossible" or "too hard" pile, but thought it was worth mentioning on
the off chance.


-- 
Craig Ringer

Tech-related writing: http://soapyfrogs.blogspot.com/


Re: Differential backup

From: Hannu Krosing
On Tue, 2010-04-27 at 14:30 +0100, Simon Riggs wrote:
> Thinking about allowing a backup to tell which files have changed in the
> database since last backup. This would allow an external utility to copy
> away only changed files.
> 
> Now there's a few ways of doing this and many will say this is already
> possible using file access times.
> 
> An explicit mechanism where Postgres could authoritatively say which
> files have changed would make many feel safer, especially when other
> databases also do this.
> 
> We keep track of which files require fsync(), so we could also keep
> track of changed files using that same information.

Would it make sense to split this in two: one for DML/"logical
changes" (insert, update, delete, truncate) and another for physical,
"non-functional", file-level changes (vacuum, setting hint bits, ...)?

BTW, is the stats collection reliable enough for this, or is it still
possible to lose some changes if we did this together with updating info
for pg_stat_user_tables/pg_statio_user_tables?

-- 
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability   Services, Consulting and Training




Re: Differential backup

From: Hannu Krosing
On Tue, 2010-04-27 at 08:59 -0500, Kevin Grittner wrote:
> Simon Riggs <simon@2ndQuadrant.com> wrote:
>  
> > Thinking about allowing a backup to tell which files have changed
> > in the database since last backup. This would allow an external
> > utility to copy away only changed files.
> > 
> > Now there's a few ways of doing this and many will say this is
> > already possible using file access times.
>  
> Who would say otherwise?  Under what circumstances would PostgreSQL
> modify a file without changing the "last modified" timestamp or the
> file size?  If you're concerned about the converse, with daemon-
> based rsync you can copy just the modified portions of a file on
> which the directory information has changed.  Or is this targeting
> platforms which don't have rsync?

I see the main value when doing pg_dump based backups: being able to
know whether the table was modified by DML (insert/update/delete/truncate)
or by something "invisible" like vacuum or setting hint bits.

Currently the only way to keep this info is by having triggers for all
DML on all tables.

> > An explicit mechanism where Postgres could authoritatively say
> > which files have changed would make many feel safer, especially
> > when other databases also do this.
>  
> Why?  I must be missing something, because my feeling is that if you
> can't trust your OS to cover something like this, how can you trust
> any application *running* under that OS to do it?
>  
> > Is this route worthwhile?
>  
> I'm not seeing it, but I could be missing something.  Can you
> describe a use case where this would be beneficial?
>  
> -Kevin
> 


-- 
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability   Services, Consulting and Training




Re: Differential backup

From: Hannu Krosing
On Tue, 2010-04-27 at 10:32 -0400, Michael Tharp wrote:
> On 04/27/2010 09:59 AM, Kevin Grittner wrote:
> > Under what circumstances would PostgreSQL
> > modify a file without changing the "last modified" timestamp or the
> > file size?
> 
> Do all OSes have sub-second precision mtimes? Because otherwise I could 
> see a scenario such as this:
> 
> * File is modified
> * Backup inspects and copies the file in the same second
> * File is modified again in the same second, so the mtime doesn't change
> * Backup is run again some time later and sees that the mtime has not 
> changed
> 
> Even with microsecond precision this kind of scenario makes me squidgy, 
> especially if some OSes decide that skipping frequent mtime updates is 
> OK. 

To be on the safe side you need to record the latest table data change
time _after_ the backup anyway; it is easy to wait a few seconds to be
sure.
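
For example, the same caution on the selection side might look like this
(just a sketch; the margin is arbitrary):

    # Conservative mtime-based selection: re-copy anything modified at or
    # after the previous backup's start time minus a small safety margin,
    # so same-second modifications are never skipped.
    import os

    MARGIN_SECONDS = 5

    def files_to_copy(data_dir, prev_backup_start_time):
        cutoff = prev_backup_start_time - MARGIN_SECONDS
        changed = []
        for root, _dirs, names in os.walk(data_dir):
            for name in names:
                path = os.path.join(root, name)
                if os.path.getmtime(path) >= cutoff:
                    changed.append(path)
        return changed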

> Florian's point about clock changes is also very relevant. Since 
> Postgres has the capability to give a better answer about what is in the 
> file, it would be best to use that.
> 
> -- m. tharp
> 


-- 
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability   Services, Consulting and Training




Re: Differential backup

From: "Kevin Grittner"
Hannu Krosing <hannu@2ndquadrant.com> wrote:
> I see the main value when doing pg_dump based backups
Ah, now that makes more sense.
-Kevin


Re: Differential backup

From: Hannu Krosing
On Tue, 2010-04-27 at 12:14 -0400, Merlin Moncure wrote:
> On Tue, Apr 27, 2010 at 11:13 AM, Kevin Grittner
> <Kevin.Grittner@wicourts.gov> wrote:
> > Merlin Moncure <mmoncure@gmail.com> wrote:
> >
> >> The proposal only seems a win to me if a fair percentage of the
> >> larger files don't change, which strikes me as a relatively low
> >> level case to optimize for.
> >
> > That's certainly a situation we face, with a relatively slow WAN in
> > the middle.
> >
> > http://archives.postgresql.org/pgsql-admin/2009-07/msg00071.php
> >
> > I don't know how rare or common that is.
> 
> hm...interesting read.  pretty clever.  Your archiving requirements are high.
> 
> With the new stuff (HS/SR) taken into consideration, would you have
> done your DR the same way if you had to do it all over again?
> 
> Part of my concern here is that manual filesystem level backups are
> going to become an increasingly arcane method of doing things as the
> HS/SR train starts leaving the station.

Actually, HS/SR speaks _for_ adding explicit change dates to files, as
the mod times on the slave side will be different, and you may still want
to know when the table really was last modified.

> 
> hm, it would be pretty neat to see some of the things you do pushed
> into logical (pg_dump) style backups...with some enhancements so that
> it can skip tables that haven't changed and are already present in a
> previously supplied dump.  This is more complicated but maybe more useful for a
> broader audience?

Yes, I see the main value of this for pg_dump backups, as physical
files already have this in the form of file ctime/mtime/atime.

> 
> Side question: is it impractical to backup via pg_dump a hot standby
> because of query conflict issues?
> 
> merlin
> 


-- 
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability   Services, Consulting and Training




Re: Differential backup

From: Hannu Krosing
On Tue, 2010-04-27 at 17:28 +0200, Csaba Nagy wrote:
> Hi all,
> 
> On Tue, 2010-04-27 at 11:07 -0400, Merlin Moncure wrote:
> > The block level case seems pretty much covered by the hot standby feature.
> 
> One use case we would have is to dump only the changes from the last
> backup of a single table. This table takes 30% of the DB disk space, it
> is in the order of ~400GB, and it's only inserted, never updated, then
> after ~1 year the old entries are archived. There's ~10M new entries
> daily in this table. If the backup would be smart enough to only read
> the changed blocks (in this case only for newly inserted records), it
> would be a fairly big win...

The standard trick for this kind of table is to partition it by
insertion date - this way you get two benefits:

1) you already know which table to back up (the latest, and maybe the one
before that if you have just switched to a new one)

2) archiving will be fast (copy the full latest table away and then
truncate it) instead of slow (copy the "old enough" records out, then
delete the same records, both of which are quite slow, and you also need
to vacuum after that, which is also slow on large tables)

This would actually be a good sample case for tracking "latest dml",
except that in this particular corner case you can arrange for this
yourself.

-- 
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability   Services, Consulting and Training




Re: Differential backup

From: "Kevin Grittner"
Hannu Krosing <hannu@2ndquadrant.com> wrote:
> On Tue, 2010-04-27 at 17:28 +0200, Csaba Nagy wrote:
>> One use case we would have is to dump only the changes from the
>> last backup of a single table. This table takes 30% of the DB
>> disk space, it is in the order of ~400GB, and it's only inserted,
>> never updated, then after ~1 year the old entries are archived.
>> There's ~10M new entries daily in this table. If the backup would
>> be smart enough to only read the changed blocks (in this case
>> only for newly inserted records), it would be a fairly big win...
That is covered pretty effectively in PITR-style backups with the
hard link and rsync approach cited earlier in the thread.  Those 1GB
table segment files which haven't changed aren't read or written,
and only those portions of the other files which have actually
changed are sent over the wire (although the entire disk file is
written on the receiving end).
> The standard trick for this kind of table is having this table
> partitioned by insertion date
That doesn't always work.  In our situation the supreme court sets
records retention rules which can be quite complex, but usually key
on *final disposition* of a case rather than insertion date; that
is, the earliest date on which the data related to a case is
*allowed* to be deleted isn't known until weeks or years after
insertion.  Additionally, it is the elected clerk of court in each
county who determines when and if data for that county will be
purged once it has reached the minimum retention threshold set by
supreme court rules.
That's not to say that partitioning couldn't help with some backup
strategies; just that it doesn't solve all "insert-only" (with
eventual purge) use cases.  One of the nicest things about
PostgreSQL is the availability of several easy and viable backup
strategies, so that you can tailor one to fit your environment.
-Kevin


Re: Differential backup

From: Greg Stark
On Tue, Apr 27, 2010 at 8:15 PM, Craig Ringer
<craig@postnewspapers.com.au> wrote:
>
> Another avenue possibly worth investigating may be using the in-heap
> mvcc information to do SQL-level differential backups of individual
> tables or of the whole database. think:
>

You can't use the mvcc information to do incremental backups because
transactions don't necessarily commit or take snapshots in sequential
order. A row can be inserted by transaction id 500 and then later
updated by transaction id 400 and then later deleted by transaction id
300.

I think what you need to use is the LSN on the page. Normally when you
take a hot backup you note the LSN at the start and end of the backup
and know that you need to replay that range of logs to have a
consistent restore. An incremental backup would be the same except it
would only back up any blocks that have an LSN > the start of the last
backup. Then you could restore the previous backup, apply this
incremental to bring the restore up to the beginning of this backup,
then apply the logs for the range of this backup.
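
In sketch form (same assumptions as the earlier page-scan example: 8k
blocks and a little-endian pd_lsn at the start of the page header):

    # Write an incremental file containing only the blocks whose page LSN
    # is newer than the previous backup's start LSN, stored as
    # (block number, block data) pairs.
    import struct

    BLCKSZ = 8192

    def dump_changed_blocks(rel_path, out_path, prev_backup_start_lsn):
        with open(rel_path, "rb") as src, open(out_path, "wb") as out:
            blkno = 0
            while True:
                page = src.read(BLCKSZ)
                if len(page) < BLCKSZ:
                    break
                xlogid, xrecoff = struct.unpack_from("<II", page, 0)
                if ((xlogid << 32) | xrecoff) > prev_backup_start_lsn:
                    out.write(struct.pack("<I", blkno))
                    out.write(page)
                blkno += 1

Restoring would mean overwriting those block offsets in the previous
backup's copy of the file before replaying the WAL range as usual.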



-- 
greg